🎃 An engineering spooky story 🎃 – Medium Engineering

Wait, we can’t recover???

You may have already heard the saying “You don’t have a backup until you’ve tested that you can restore it.” As it turns out, that saying is true, but it’s not enough…

Not so long ago

Recently, I screwed up big time with a small pull request.

At Medium, we use Jenkins quite a lot, and by quite a lot, I mean way too much… Among those “creative” use cases, we use it to trigger the ETL pipeline jobs for our data scientists and some of our analytics tasks.
Because of the nature of those tasks, the hosts running them tend to run low on disk space frequently. To avoid the bad situation where we would completely fill up the root partition, I created a pull request to move the Jenkins tmp directory to a new location at /media/ebs0/tmp, which happens to be on a separate EBS volume.

What happened

You might be familiar with EBS volumes, which are extra virtual disk partitions that AWS lets you attach and mount onto your systems. We often use them for stateful services such as our build systems. The cool thing about EBS volumes is that you can create snapshots of them, which makes them a good solution for backups (more on that later). To make it explicit, we like to mount those volumes in the /media/ directory and give them the explicit names ebs0, ebs1, etc. At Medium, we take advantage of this and create hourly snapshots of the ebs0 volumes on important systems.

In addition to moving the tmp directory to that new location on the ebs0 volume, I also meant to make tmpwatch clean up that new tmp directory. If you are not familiar with the tmpwatch command, know that this is the magic command that cleans up your /tmp directories. It is very straightforward: it recursively walks all the directories provided on the command line and removes files that haven’t been accessed for a given amount of time. Most Linux distributions use it in a cron job to periodically free up space in /tmp. That is what we use as well, and the second change in the pull request added a new path to the tmpwatch script.
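For reference, a typical setup is a daily cron entry along these lines (the threshold and path here are illustrative, not our actual configuration); tmpwatch interprets the bare number as hours of inactivity:

```shell
# Hypothetical /etc/cron.daily entry: remove anything under /tmp
# that has not been accessed in the last 10 days (240 hours).
/usr/sbin/tmpwatch 240 /tmp
```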

The first mistake

Sadly, that’s where I screwed up. Instead of cleaning up the /media/ebs0/tmp directory, I asked tmpwatch to look at the entire /media/ebs0 volume.
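To make the consequence concrete, here is a small sketch that reproduces the failure mode using find as a stand-in for tmpwatch (which does essentially the same recursive walk-and-unlink); all file names here are made up for the demonstration:

```shell
#!/bin/sh
# Simulate sweeping the volume root instead of its tmp/ subdirectory.
set -eu

root=$(mktemp -d)          # stands in for /media/ebs0
mkdir -p "$root/tmp"
echo scratch  > "$root/tmp/build.log"   # genuinely temporary file
echo precious > "$root/dataset.csv"     # long-lived data on the volume

# Backdate both files' access and modification times past the threshold.
touch -a -m -t 202001010000 "$root/tmp/build.log" "$root/dataset.csv"

# The buggy configuration: sweep the whole volume root, not just tmp/.
# Every file not accessed in the last 3 days is unlinked -- including
# the "precious" dataset that was never meant to be temporary.
find "$root" -type f -atime +3 -delete
```

Had the sweep been pointed at "$root/tmp" instead, only build.log would have been removed.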


Because tmpwatch is configured to delete files not accessed for a few days, the issue manifested itself after about 3 days.

After realizing the issue, and once past the usual pit in the stomach that comes with such sad discoveries, we started to breathe again, knowing that those volumes were backed up.

Like any good Ops practitioners, we run occasional disaster-recovery exercises, which include restoring our backups. We felt we should be able to recover from this, but that’s when we discovered a second issue.

The second mistake

As it turns out, we were only keeping 40 backups of those volumes, and since we snapshot hourly, that meant we kept less than two days’ worth of backups. Whenever we tested restoring our backups, we always took the first available backup but never thought to check how far back we could restore those volumes.

The issue was about three days old, but the oldest backup we had was less than two days old, and in it all the files that had not been accessed for a while were already gone. Basically, none of the backups we had could help us recover.


Lesson learned

Long story short… it took 3 engineers about 2.5 days to recover from that mistake, and we now make sure to keep one month of backups.

Bottom line: I’m adding the age of the oldest backup to my disaster-recovery checklist, and so should you.
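A minimal sketch of such a check, assuming GNU date. In practice the oldest-snapshot timestamp would come from your backup system (for EBS, e.g. the minimum StartTime across a volume’s snapshots); here both timestamps and the required window are hard-coded hypothetical values:

```shell
#!/bin/sh
# Fail loudly when the oldest retained backup is younger than the
# recovery window we actually need.
set -eu

oldest_start="2020-10-28T00:00:00Z"   # hypothetical oldest snapshot
now="2020-10-31T12:00:00Z"            # hypothetical current time
required_days=7                       # how far back we must be able to go

oldest_s=$(date -u -d "$oldest_start" +%s)
now_s=$(date -u -d "$now" +%s)
age_days=$(( (now_s - oldest_s) / 86400 ))

echo "oldest backup is ${age_days} days old"
if [ "$age_days" -lt "$required_days" ]; then
  echo "WARNING: retention window shorter than ${required_days} days" >&2
fi
```

With the values above, a three-day-old incident would already be unrecoverable well before the check passes.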
