Remove sync-asset-manager-from-master cron job from asset slaves #7016
Conversation
This cron job runs every 10 mins on each of the asset slaves, copying Asset Manager (not Whitehall) asset files from the asset master using rsync via an NFS mount of the latter's /mnt/uploads directory.

As far as I can tell from this commit [1], this cron job was added to solve a problem with environment syncing. However, environment syncing of Asset Manager assets is now taken care of by syncing of S3 buckets in the assets.sh job [2] rather than by rsync-ing of directories in the attachments.sh job [3].

Since this change [4] to Asset Manager, the files for new assets are deleted from the filesystem once they have been virus scanned and uploaded to S3. Thus the Asset Manager app should no longer be permanently adding asset files to the filesystem and there's no need to have this sync in place.

We're about to delete the files for Asset Manager assets which have been uploaded to S3, i.e. the vast majority of them. We plan to use this Asset Manager Rake task [5] to delete the files via the Carrierwave uploader mounted on Asset#file. This will delete the underlying file from the uploads directory under the Rails root directory, which is sym-linked to /data/uploads/asset-manager. The latter is where the asset-master /mnt/uploads directory is mounted using NFS. If we were to leave the sync-asset-manager-from-master cron job in place, its rsync command would have to handle the sudden deletion of tens of thousands of files, which might put undue load on the asset slaves and/or their NFS mounts.

Since Asset Manager is no longer permanently adding asset files to the filesystem, we can safely remove this cron job and avoid the problem mentioned above. Another advantage of doing this is that we will effectively retain a backup of the Asset Manager asset files on the asset slaves in case deleting them causes an unforeseen problem. We can delete the files from the slaves once we're happy that everything is working OK.

We could probably also remove the /data/master-uploads NFS mount in this commit, but I think it's safer to apply that change separately later.

Note that we'll be able to remove the cron job definition entirely once the changes in this commit have been applied in all environments.

[1]: d9cf0ee
[2]: https://github.com/alphagov/env-sync-and-backup/blob/169cb5846567f12030568c842e1ffe7630ca07fe/jobs/assets.sh
[3]: https://github.com/alphagov/env-sync-and-backup/blob/169cb5846567f12030568c842e1ffe7630ca07fe/jobs/attachments.sh
[4]: alphagov/asset-manager#373
[5]: https://github.com/alphagov/asset-manager/blob/d803db930614a6063c0fc16730f6ba3eaf08e6d9/lib/tasks/govuk_assets.rake#L5
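For context, the kind of command such a cron job runs is sketched below. The exact flags, user and destination path live in the Puppet manifest; the source path comes from the description above, but everything else here is an assumption.

```sh
# Rough sketch (not the actual manifest) of what the sync-asset-manager-from-master
# cron job does every 10 minutes on an asset slave: mirror Asset Manager files from
# the NFS mount of the asset master's /mnt/uploads into the slave's own copy.
# The destination path and flags are guesses for illustration.
rsync --archive --delete \
  /data/master-uploads/asset-manager/ \
  /mnt/uploads/asset-manager/
```

If the job mirrors deletions (e.g. via something like `--delete`), that is exactly why the bulk removal described above would be expensive: each slave's rsync run would have to walk and remove tens of thousands of files in one pass.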
@surminus, @danielroseman, @boffbowsh: I hope you don't mind me asking you to review this PR, but I'd welcome your thoughts on the "TODO" section at the bottom. Please feel free to recommend someone else to review it if there's someone who is better placed.

You can acknowledge an alert in Icinga either permanently or with an expire time, so it might well make sense to go with option 1 as long as it isn't going to be too long until the files are manually deleted.

@danielroseman Thanks! That makes sense. I envisage it being a matter of an hour or two, I think, so option 1 sounds good to me. I'll update the PR description accordingly.
After discussing this with @chrisroos, we've decided that we already have enough backups/copies of the Asset Manager assets, and so we're not going to disable this asset master/slave syncing until after we've deleted all the Asset Manager asset files from NFS (on the asset master) and any syncing to the slaves has completed. This has the advantage that the slaves will always be a true representation of the master, which should make things less confusing. It should also mean that the triggering of the asset_master_and_slave_disk_space_similar_from_xxx Icinga alert is avoided.
All Asset Manager asset files were deleted from production yesterday afternoon - see alphagov/asset-manager#296 for details. Since there are no longer any Asset Manager asset files on asset-master, there's no point in sync-ing this directory any more, so I'm happy to merge this PR.
Note that since the Asset Manager app is still creating new empty directories, there is a danger the Icinga alert might get triggered, because the asset master/slave disk usage comparison might reach the threshold due to the space taken up by empty directories. I plan to address this shortly as part of alphagov/asset-manager#385. |
Since #373 we've been deleting an asset's underlying file from the filesystem once the file has been uploaded to S3. However, this was leaving behind a bunch of empty directories. This change means that the app now removes all empty parent directories when the file is removed.

The immediate motivation behind this change is that once alphagov/govuk-puppet#7016 is applied in production, the Asset Manager Carrierwave directory will no longer be sync-ed from the Asset Master to the Asset Slaves, and there will therefore be a danger that the asset_master_and_slave_disk_space_similar_from_xxx Icinga alert [1] might eventually be triggered by the non-zero disk space taken up by the empty directories.

The implementation makes use of `FileUtils.rmdir` with the `parents` option, which is equivalent to the `rmdir -p` unix command and which recursively removes empty directories from the supplied path upwards. Note that I think this means that in some circumstances top-level directories (e.g. `tmp/test_uploads/assets` in the test environment) might be removed, but I'm pretty confident Carrierwave creates directories using the equivalent of `mkdir -p` and so any missing intermediate directories will be created.

[1]: https://github.com/alphagov/govuk-puppet/blob/19837b4b351d97d84570bd8a7d01ef3420fbef94/modules/govuk/manifests/node/s_asset_slave.pp#L50-L63
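As a rough shell analogue of the behaviour described above (the nested directory names are invented for illustration):

```sh
# Carrierwave creates the store directory with the equivalent of `mkdir -p`,
# so any missing intermediate directories are created at upload time.
mkdir -p tmp/test_uploads/assets/some/nested/asset-dir

# Once the asset's file has been deleted the leaf directory is empty, and
# `rmdir -p` (the equivalent of FileUtils.rmdir with the parents option)
# removes it and then each now-empty ancestor in turn, stopping with an
# error at the first directory that still contains something.
rmdir -p tmp/test_uploads/assets/some/nested/asset-dir
```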
This cron job was marked as 'absent' in this PR [1] which has been applied in all environments and so it's now safe to remove the definition. [1]: #7016
This was made redundant in this PR [1] when the only remaining use of the NFS mount was removed. The comment above this `mount` resource definition is encouraging in this respect.

Asset Manager assets are no longer permanently saved to the NFS mount, which is why the cron job copying them to the slaves was removed in [1]. Whitehall assets are "pushed" to the slaves by the `process-incoming-files`, `process-draft-incoming-files` & `copy-attachments-to-slaves` cron jobs using `rsync`, and thus the NFS mount is not needed on the asset slaves. As far as I can tell, neither Publisher assets (`/data/uploads/publisher`) nor Support API assets (`/data/uploads/support-api`) require the NFS mount on the slaves.

[1]: #7016
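For illustration only, the slave-side NFS mount being removed here amounts to something like the following; the asset master hostname and mount options are assumptions, not taken from the manifest.

```sh
# Hypothetical fstab-style entry of the kind the Puppet `mount` resource
# manages on each asset slave (hostname and options invented for illustration):
#   asset-master.example:/mnt/uploads  /data/master-uploads  nfs  defaults  0  0
#
# With the sync-asset-manager-from-master cron job gone, nothing on the slave
# reads from /data/master-uploads any more, so the mount can be unmounted and
# its fstab entry dropped:
sudo umount /data/master-uploads
```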
This cron job runs every 10 mins on each of the asset slaves, copying Asset Manager (not Whitehall) asset files from the asset master using rsync via an NFS mount of the latter's `/mnt/uploads` directory.

As far as I can tell from this commit, this cron job was added to solve a problem with environment syncing. However, environment syncing of Asset Manager assets is now taken care of by syncing of S3 buckets in the `assets.sh` job rather than by rsync-ing of directories in the `attachments.sh` job.

Since this change to Asset Manager, the files for new assets are deleted from the filesystem once they have been virus scanned and uploaded to S3, because Asset Manager now serves them from S3 via Nginx. Thus the Asset Manager app should no longer be permanently adding asset files to the filesystem and there's no need for the sync provided by the `sync-asset-manager-from-master` cron job.

Furthermore, we're about to delete the files for Asset Manager assets which have been uploaded to S3, i.e. the vast majority of them. We plan to use this Asset Manager Rake task to delete the files via the Carrierwave uploader mounted on `Asset#file`. This will delete the underlying file from the uploads directory under the Rails root directory, which is sym-linked to `/data/uploads/asset-manager`. The latter is where the asset-master `/mnt/uploads` directory is mounted using NFS. If we were to leave the `sync-asset-manager-from-master` cron job in place, its `rsync` command would have to handle the sudden deletion of tens of thousands of files, which might put undue load on the asset slaves and/or their NFS mounts.

Since Asset Manager is no longer permanently adding asset files to the filesystem, we can safely remove this cron job and avoid the problem mentioned above. Another advantage of doing this is that we will effectively retain a backup of the Asset Manager asset files on the asset slaves in case deleting the files causes an unforeseen problem. We can delete the files from the slaves once we're happy that everything is working OK.

We could probably also remove the `/data/master-uploads` NFS mount in this commit, but I think it's safer to apply that change separately later.

Note that we'll be able to remove the cron job definition entirely once the changes in this commit have been applied in all environments.

Note also that the `asset_master_and_slave_disk_space_similar_from_xxx` Icinga alert will be triggered if the disk space used by the asset master vs an asset slave differs by more than 384MB (warning) / 512MB (critical). Since this check compares the entire disk usage, not just the Asset Manager bit, if we disable the `sync-asset-manager-from-master` cron job and then delete all the Asset Manager asset files (60GB+) from the asset master, then this check will immediately go critical and stay critical. This is because there will be ~60GB more disk usage on the asset slaves compared to the master until we later manually delete the Asset Manager asset files from the slaves. As suggested by @danielroseman, we can warn 2nd Line that this will happen and they can acknowledge the alert with an expire time.

Another advantage of this approach is that we don't need to disable the `asset-manager-s3` nightly Duplicity backup to S3 which runs on `asset-slave-1` in production (see #6768), or exclude the Asset Manager asset files from the `push_attachments_to_s3` nightly cron job which runs on `asset-slave-2` in production, until we delete the asset files.
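To make the threshold arithmetic above concrete, here is a rough sketch of the sort of comparison the `asset_master_and_slave_disk_space_similar_from_xxx` check performs. The real check is a Nagios/Icinga plugin configured in govuk-puppet, so the paths and the use of `df` here are assumptions; only the 384MB/512MB thresholds come from the description.

```sh
#!/bin/sh
# Illustrative only: compare uploads disk usage on an asset slave with the
# asset master and map the difference onto the 384MB warning / 512MB critical
# thresholds mentioned above. Paths and the way usage is gathered are guesses.

slave_mb=$(df -m /mnt/uploads | awk 'NR==2 {print $3}')
master_mb=$(df -m /data/master-uploads | awk 'NR==2 {print $3}')

diff_mb=$((slave_mb - master_mb))
[ "$diff_mb" -lt 0 ] && diff_mb=$((0 - diff_mb))

if [ "$diff_mb" -ge 512 ]; then
  echo "CRITICAL: master and slave disk usage differ by ${diff_mb}MB"
  exit 2
elif [ "$diff_mb" -ge 384 ]; then
  echo "WARNING: master and slave disk usage differ by ${diff_mb}MB"
  exit 1
else
  echo "OK: master and slave disk usage differ by ${diff_mb}MB"
  exit 0
fi
```

Deleting 60GB+ of Asset Manager files from the master while the sync is disabled pushes that difference far past the critical threshold, which is why 2nd Line would need to acknowledge the alert with an expire time until the slaves' copies are deleted too.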