Having tweeted, “When was the last time you practised the recovery of each of your production databases?”, a few days before, I was presented with a very real DR situation for my website.
On Wednesday my website went down while I was at my office Christmas lunch. I restarted it from my phone and it came back up, but I got a message saying one of the RAID1 disks was dead. I do nightly file system and database backups, but I took another backup anyway because I couldn’t remember what had changed during the day. In the evening I contacted the people hosting my dedicated server and asked them to swap out the dead disk. Several hours and several phone calls later it became obvious the data had “magically” disappeared from the second disk in the pair. Of course, it was nothing to do with the hosting company… I think the way I would describe them is a useless bunch of ****s.
I ended up initiating a re-image of the server. Their website says this should take about an hour. After a bit over 2 hours I gave up waiting and went to bed…
I woke up the next morning and started the process of rebuilding the server. Fortunately, I have a DR plan containing all the stuff I need to do to rebuild the server. The plan had one minor flaw: it included some references to my own articles. 🙂 Luckily, these were available via Google Cache. I’ve now copied any relevant information into my DR plan, so next time this happens I will be even more prepared. 🙂
Having a full set of file system and database backups meant I was confident I would get it sorted. The biggest problem was upload speed. Home broadband is great for download, but pretty terrible for upload, so my initial plan to just push all the backups to the server then start recovering it had to be revised. Instead I cherry-picked the most recent file system backup and pushed that first. Once on the server, the backup unzipped in a few seconds and the main part of the website was back. I then started to push the most recent DB backups and loaded those. I’ve altered the DR plan so next time I can easily do this, prior to a blanket “rsync” of all the backups back to the server. That should speed up the recovery somewhat.
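That prioritised push can be sketched as a small shell script. Everything here is hypothetical (the paths, the file naming, and a local directory standing in for the server so the sketch runs end-to-end); it just shows the order: newest file system backup first, then the database dumps, with the blanket rsync left until the site is already up.

```shell
#!/bin/sh
# Sketch of the prioritised recovery order. All paths and file names are
# made up for illustration; RESTORE_DIR stands in for the remote server
# so the sketch runs locally.
set -eu

BACKUP_DIR=/tmp/dr-demo/backups    # local copies of the nightly backups
RESTORE_DIR=/tmp/dr-demo/server    # stand-in for the rebuilt server

# --- demo fixture so the sketch is self-contained ------------------------
mkdir -p "$BACKUP_DIR" "$RESTORE_DIR" /tmp/dr-demo/site
echo "hello" > /tmp/dr-demo/site/index.html
tar -czf "$BACKUP_DIR/website-20131212.tar.gz" -C /tmp/dr-demo/site .
echo "-- dump --" > "$BACKUP_DIR/db-20131212.sql"
# -------------------------------------------------------------------------

# 1) Push and unzip only the most recent file system backup first,
#    so the main part of the website is back quickly.
latest_fs=$(ls -t "$BACKUP_DIR"/website-*.tar.gz | head -n 1)
tar -xzf "$latest_fs" -C "$RESTORE_DIR"

# 2) Then the most recent DB backups (the actual load would be done with
#    mysql/impdp or similar, not shown here).
latest_db=$(ls -t "$BACKUP_DIR"/db-*.sql | head -n 1)
echo "would load: $latest_db"

# 3) Finally, a blanket rsync of everything else can run once the site is
#    up, e.g.:  rsync -az "$BACKUP_DIR"/ user@server:/u01/backups/
```

The point of the ordering is simply that the few files the site actually needs to serve pages go over the slow home upload link first.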
Lessons people need to learn:
- Having regular backups is good. (No shit Sherlock!) If you are on this blog you are probably an Oracle person, so this should not come as a surprise. I just wonder how many people don’t bother with them for their blog etc. In my case I self-host, so it is all down to me.
- You’ve got to have a DR plan. It’s amazing how many bits of software you install over time. It’s also surprising how many odd little commands and config entries you put in over time. Unless you are tracking these manually, or through some admin tool, you are going to have a nightmare getting it all back. I didn’t remember half of the things I did during the setup of this stuff, but luckily I wrote it all down.
- You’ve got to practise your DR plan. You don’t know if it works unless you try it. I moved my stuff onto this server a few months ago, so I was pretty confident the plan was viable. I do occasionally build a copy of my website on a VM to make sure the backups are working OK. Obviously, this was a pretty good “test”. 🙂
- Remember to back up your backup scripts. My backups were pretty good, but I did forget to include the backup scripts in the backup, so I had to reinvent those. It wasn’t a big deal, but it was a silly mistake.
- If you are looking for quick recovery of your blog/website, you probably don’t want to rely on backups sitting on your home servers/storage as the broadband upload speed (at least in the UK) is pretty terrible. Maybe investing in a cloud backup solution would be a better idea.
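A nightly backup script along these lines covers the last two lessons in one go: include the scripts directory in the archive it creates, and push the result offsite. Every path and name below is hypothetical, and the offsite push is shown only as a comment; it’s a sketch of the idea, not my actual script.

```shell
#!/bin/sh
# Hypothetical nightly backup sketch. All paths below are made up for
# illustration; the key detail is that the scripts directory itself goes
# into the archive, so the backup scripts are never lost with the server.
set -eu

SITE=/tmp/demo/site         # website document root
SCRIPTS=/tmp/demo/scripts   # where the backup scripts live
DEST=/tmp/demo/backups      # local backup target
STAMP=$(date +%Y%m%d)

# demo fixture so the sketch runs standalone
mkdir -p "$SITE" "$SCRIPTS" "$DEST"
echo "site content" > "$SITE/index.html"
echo "# nightly_backup.sh" > "$SCRIPTS/nightly_backup.sh"

# Archive the site AND the scripts directory in one tarball.
tar -czf "$DEST/website-$STAMP.tar.gz" -C /tmp/demo site scripts

# A database dump (mysqldump, expdp, etc.) would be added here, then the
# whole of $DEST pushed offsite so recovery isn't limited by home upload
# speed, e.g.:
#   rsync -az "$DEST"/ user@cloud-host:/backups/
echo "backup written to $DEST/website-$STAMP.tar.gz"
```

Keeping a copy of $DEST somewhere with decent bandwidth back to the server is what turns the slow home-upload recovery into a fast one.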
I think everything is sorted now. If not, I’ll pick it up as I go along and add the fix into the plan. 🙂
Sorry for any inconvenience. 🙂