
Remove sync-asset-manager-from-master cron job from asset slaves #7016

Conversation

@floehopper (Contributor) commented Jan 4, 2018

This cron job runs every 10 mins on each of the asset slaves, copying Asset Manager (not Whitehall) asset files from the asset master using rsync via an NFS mount of the latter's /mnt/uploads directory.

As far as I can tell from this commit, the cron job was added to solve a problem with environment syncing. However, environment syncing of Asset Manager assets is now taken care of by syncing of S3 buckets in the assets.sh job rather than by rsync-ing of directories in the attachments.sh job.

Since this change to Asset Manager, the files for new assets are deleted from the filesystem once they have been virus scanned and uploaded to S3, because Asset Manager now serves them from S3 via Nginx. The Asset Manager app should therefore no longer be permanently adding asset files to the filesystem, and there's no need for the sync provided by the sync-asset-manager-from-master cron job.

Furthermore, we're about to delete the files for Asset Manager assets which have been uploaded to S3, i.e. the vast majority of them. We plan to use this Asset Manager Rake task to delete the files via the Carrierwave uploader mounted on Asset#file. This will delete the underlying file from the uploads directory under the Rails root directory, which is sym-linked to /data/uploads/asset-manager; the latter is where the asset-master /mnt/uploads directory is mounted using NFS. If we were to leave the sync-asset-manager-from-master cron job in place, its rsync command would have to handle the sudden deletion of tens of thousands of files, which might put undue load on the asset slaves and/or their NFS mounts.
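
Roughly, the deletion works along these lines (a sketch only; the namespace, task name and the way assets are scoped below are placeholders, see the linked Rake task for the real thing):

```ruby
# Sketch only: the real task is the one linked above in lib/tasks/govuk_assets.rake.
# The namespace, task name and the Asset scope are placeholders for illustration.
namespace :govuk_assets do
  desc 'Delete underlying NFS files for assets whose files are already in S3'
  task delete_files_for_assets_uploaded_to_s3: :environment do
    Asset.all.each do |asset|
      path = asset.file.path
      next if path.nil? || !File.exist?(path)

      # Carrierwave's Uploader#remove! deletes the stored file, i.e. the copy
      # under the Rails uploads directory (sym-linked to /data/uploads/asset-manager).
      asset.file.remove!
    end
  end
end
```

Note that `remove!` only touches the file on disk; the Asset record and the copy already uploaded to S3 are left alone, which is what we want here.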

Since Asset Manager is no longer permanently adding asset files to the filesystem, we can safely remove this cron job and avoid the problem mentioned above. Another advantage of doing this is that we will effectively retain a backup of the Asset Manager asset files on the asset slaves in case deleting the files causes an unforeseen problem. We can delete the files from the slaves once we're happy that everything is working OK.

We could probably also remove the /data/master-uploads NFS mount in this commit, but I think it's safer to apply that change separately later.

Note that we'll be able to remove the cron job definition entirely once the changes in this commit have been applied in all environments.

Note also that the asset_master_and_slave_disk_space_similar_from_xxx Icinga alert will be triggered if the disk space used by the asset master and an asset slave differs by more than 384MB (warning) or 512MB (critical).

Since this check compares the entire disk usage, not just the Asset Manager portion, if we disable the sync-asset-manager-from-master cron job and then delete all the Asset Manager asset files (60GB+) from the asset master, the check will immediately go critical and stay critical: there will be ~60GB more disk usage on the asset slaves than on the master until we later manually delete the Asset Manager asset files from the slaves.
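
For concreteness, the comparison works along these lines (illustrative sketch only; the real check is a Nagios/Icinga plugin configured elsewhere in govuk-puppet, so the names below are made up):

```ruby
# Illustrative only: the real asset_master_and_slave_disk_space_similar_from_xxx
# check is configured in govuk-puppet; names and structure here are assumptions.
WARNING_BYTES  = 384 * 1024 * 1024 # 384MB
CRITICAL_BYTES = 512 * 1024 * 1024 # 512MB

def disk_space_similarity_status(master_used_bytes, slave_used_bytes)
  difference = (master_used_bytes - slave_used_bytes).abs
  return :critical if difference > CRITICAL_BYTES
  return :warning  if difference > WARNING_BYTES
  :ok
end

# Deleting ~60GB of Asset Manager files from the master while the slaves keep
# their copies leaves a difference far above the 512MB critical threshold:
disk_space_similarity_status(100 * 2**30, 160 * 2**30) # => :critical
```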

As suggested by @danielroseman, we can warn 2nd Line that this will happen and they can acknowledge the alert with an expire time.

Another advantage of this approach is that, until we delete the asset files, we don't need to disable the asset-manager-s3 nightly Duplicity backup to S3 which runs on asset-slave-1 in production (see #6768), or exclude the Asset Manager asset files from the push_attachments_to_s3 nightly cron job which runs on asset-slave-2 in production.

This cron job runs every 10 mins on each of the asset slaves copying
Asset Manager (not Whitehall) asset files from the asset master using
rsync via an NFS mount of the latter's /mnt/uploads directory.

As far as I can tell from this commit [1] this cron job was added to
solve a problem with environment syncing. However, environment syncing
of Asset Manager assets is now taken care of by syncing of S3 buckets by
the assets.sh job [2] rather than by rsync-ing of directories by the
attachments.sh job [3].

Since this change [4] to Asset Manager the files for new assets are
deleted from the filesystem once they have been virus scanned and
uploaded to S3. Thus the Asset Manager app should no longer be
permanently adding asset files to the filesystem and there's no need to
have this sync in place.

We're about to delete the files for Asset Manager assets which have
been uploaded to S3 i.e. the vast majority of them. We plan to use this
Asset Manager Rake task [5] to delete the files via the Carrierwave
uploader mounted on Asset#file. This will delete the underlying file
from the uploads directory under the Rails root directory which is
sym-linked to /data/uploads/asset-manager. The latter is where the
asset-master /mnt/uploads directory is mounted using NFS. If we were to
leave the sync-asset-manager-from-master cron job in place, its rsync
command would have to handle the sudden deletion of tens of thousands of
files, which might put undue load on the asset slaves and/or their NFS
mounts.

Since Asset Manager is no longer permanently adding asset files to the
filesystem, we can safely remove this cron job and avoid the problem
mentioned above. Another advantage of doing this is that we will
effectively retain a backup of the Asset Manager asset files on the
asset slaves in case deleting them causes an unforeseen problem. We can
delete the files from the slaves once we're happy that everything is
working OK.

We could probably also remove the /data/master-uploads NFS mount in this
commit, but I think it's safer to apply that change separately later.

Note that we'll be able to remove the cron job definition entirely once
the changes in this commit have been applied in all environments.

[1]:
d9cf0ee
[2]:
https://github.com/alphagov/env-sync-and-backup/blob/169cb5846567f12030568c842e1ffe7630ca07fe/jobs/assets.sh
[3]:
https://github.com/alphagov/env-sync-and-backup/blob/169cb5846567f12030568c842e1ffe7630ca07fe/jobs/attachments.sh
[4]: alphagov/asset-manager#373
[5]:
https://github.com/alphagov/asset-manager/blob/d803db930614a6063c0fc16730f6ba3eaf08e6d9/lib/tasks/govuk_assets.rake#L5
@floehopper (Contributor, Author)

@surminus, @danielroseman, @boffbowsh: I hope you don't mind me asking you to review this PR, but I'd welcome your thoughts on the "TODO" section at the bottom. Please feel free to recommend someone else to review it if there's someone who is better placed.

@danielroseman (Contributor)

You can acknowledge an alert in Icinga either permanently or with an expire time, so it might well make sense to go with option 1 as long as it isn't going to be too long until the files are manually deleted.

@floehopper (Contributor, Author)

@danielroseman Thanks! That makes sense. I envisage it being a matter of an hour or two, I think, so option 1 sounds good to me. I'll update the PR description accordingly.

@floehopper changed the title from "[WIP] Remove sync-asset-manager-from-master cron job from asset slaves" to "Remove sync-asset-manager-from-master cron job from asset slaves" on Jan 4, 2018
@floehopper (Contributor, Author)

After discussing this with @chrisroos, we've decided that we already have enough backups/copies of the Asset Manager assets, so we're not going to disable this asset master/slave syncing until after we've deleted all the Asset Manager asset files from NFS (on the asset master) and any syncing to the slaves has completed. This has the advantage that the slaves will always be a true representation of the master, which should make things less confusing. It should also mean that any triggering of the asset_master_and_slave_disk_space_similar_from_xxx Icinga alert will be more short-lived.

@floehopper (Contributor, Author)

All Asset Manager asset files were deleted from production yesterday afternoon - see alphagov/asset-manager#296 for details.

Since there are no longer any Asset Manager asset files on asset-master, there's no point in sync-ing this directory any more, so I'm happy to merge this PR.

@floehopper (Contributor, Author)

Note that since the Asset Manager app is still creating new empty directories, there is a danger the Icinga alert might get triggered, because the asset master/slave disk usage comparison might reach the threshold due to the space taken up by empty directories. I plan to address this shortly as part of alphagov/asset-manager#385.

@floehopper merged commit fc10f08 into master on Jan 9, 2018
@floehopper deleted the remove-sync-asset-manager-from-master-cron-job-from-asset-slaves branch on January 9, 2018 at 11:52
floehopper added a commit to alphagov/asset-manager that referenced this pull request Jan 9, 2018
Since #373 we've been deleting an asset's underlying file from the
filesystem once the file has been uploaded to S3. However, this was
leaving behind a bunch of empty directories. This change means that the
app now removes all empty parent directories when the file is removed.

The immediate motivation behind this change is that once
alphagov/govuk-puppet#7016 is applied in production, the Asset Manager
Carrierwave directory will no longer be sync-ed from the Asset Master to
the Asset Slaves and there will therefore be a danger that the
asset_master_and_slave_disk_space_similar_from_xxx Icinga alert [1]
might eventually be triggered by the non-zero disk space taken up by the
empty directories.

The implementation makes use of `FileUtils.rmdir` with the `parents`
option, which is equivalent to the `rmdir -p` Unix command and
recursively removes empty directories from the supplied path upwards.

Note that I think this means that in some circumstances top-level
directories (e.g. `tmp/test_uploads/assets` in the test environment)
might be removed, but I'm pretty confident Carrierwave creates
directories using the equivalent of `mkdir -p` and so any missing
intermediate directories will be created.

[1]:
https://github.com/alphagov/govuk-puppet/blob/19837b4b351d97d84570bd8a7d01ef3420fbef94/modules/govuk/manifests/node/s_asset_slave.pp#L50-L63
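
As a rough illustration of the technique described in the commit message (the method name and the point at which it gets called are assumptions, not the actual asset-manager code):

```ruby
require 'fileutils'

# Sketch of the technique only; the method name and where it is called from
# are assumptions rather than the actual asset-manager implementation.
def remove_file_and_empty_parent_directories(uploader)
  directory = File.dirname(uploader.path)
  uploader.remove! # Carrierwave deletes the underlying file
  # Equivalent to `rmdir -p`: removes the now-empty directory and then each
  # empty parent in turn, stopping at the first non-empty ancestor.
  FileUtils.rmdir(directory, parents: true)
rescue Errno::ENOTEMPTY, Errno::ENOENT
  # A sibling file still exists or the directory is already gone; nothing to do.
end
```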
floehopper added a commit that referenced this pull request Feb 5, 2018
This cron job was marked as 'absent' in this PR [1], which has been
applied in all environments, so it's now safe to remove the
definition.

[1]: #7016
floehopper added a commit that referenced this pull request Feb 5, 2018
This was made redundant in this PR [1] when the only remaining use of
the NFS mount was removed. The comment above this `mount` resource
definition is encouraging in this respect.

Asset Manager assets are no longer permanently saved to the NFS mount
which is why the cron job copying them to the slaves was removed in [1].

Whitehall assets are "pushed" to the slaves by the
`process-incoming-files`, `process-draft-incoming-files` &
`copy-attachments-to-slaves` cron jobs using `rsync` and thus the NFS
mount is not needed on the asset slaves.

As far as I can tell, neither the Publisher (`/data/uploads/publisher`)
nor Support API assets (`/data/uploads/support-api`) require the NFS
mount on the slaves.

[1]: #7016