Ensure your backups are working
Posted on January 17th, 2009 in development | 5 Comments »
You may not know, but I manage another server.
Earlier this week that server got really slow for some reason, so I rebooted it. I waited and waited, but it didn’t start up again. After some investigation, it turned out the hard disk had crashed.
I knew I had backups, so I thought I was safe. I asked server support to put the backup online. Unfortunately for me, it appears the last backup had not succeeded! Yes, disaster struck twice, and at what a time…
I asked them to try and recover the data from the failed HD. Luckily they were able to get the database, which is the most important thing on the server. Without the database, I would have lost all member info, forum data, etc. The site would have been worthless.
Unfortunately, they couldn’t recover anything from the /home partition, which means a ton of user-uploaded images were lost. Well, not all of them: the backup HD did have an older backup available, so I was able to save about 70% of the images. Still, the loss is enormous.
I’m now thinking about sending the HD to some sort of professional data recovery service, something like DriveSavers. Pricing for this kind of thing is steep though, ranging from $1,000 to $2,000. What a lucrative business that is.
Could I have prevented this from happening?
Yes, I probably could have. I should have read my Logwatch reports from cPanel. The last Logwatch report before the crash did indeed state there were some HD issues, but I hadn’t read it.
Good news is, those kinds of errors are almost always listed at the top of the logwatch email. Here’s what it said in my case:
WARNING: Kernel Errors Present
I/O error: dev 08:08, sect...: 35Time(s)
Additional sense indicates Unrecovered read error - auto reallocat...: 35Time(s)
Current sd08:08: sense key Medium Error...: 35Time(s)
EXT3-fs error (device sd(8,8))...: 53Time(s)
ata3: error=0x40 { Uncorrect...: 1Time(s)
ata3: status=0x51 { DriveReady SeekComplete Error }...: 194Time(s)
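Since I clearly can’t be trusted to read these emails myself, here’s a minimal sketch of how a script could flag them for me. It assumes the Logwatch report is available as plain text; the pattern list and the function name find_disk_warnings are my own, taken from the excerpt above, and are not part of Logwatch or cPanel.

```python
import re

# Hypothetical patterns, copied from the report excerpt above.
# Extend this list for your own hardware and filesystem.
DISK_ERROR_PATTERNS = [
    r"Kernel Errors Present",
    r"I/O error",
    r"Medium Error",
    r"Unrecovered read error",
    r"EXT3-fs error",
]


def find_disk_warnings(report_text):
    """Return the report lines that match any of the disk-error patterns."""
    combined = re.compile("|".join(DISK_ERROR_PATTERNS), re.IGNORECASE)
    return [
        line.strip()
        for line in report_text.splitlines()
        if combined.search(line)
    ]
```

If the returned list isn’t empty, the report deserves a proper read, and the script could forward it with a louder subject line.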
But why did the backup fail? After some investigation, it appears cPanel tried to gzip all the files, and there simply wasn’t enough room left on the backup HD.
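A check like the following, run before the backup, would probably have caught this. It’s only a rough sketch: the function names are made up for illustration, and comparing against the uncompressed size of the source directory is a deliberately pessimistic assumption (images barely shrink under gzip anyway).

```python
import os
import shutil


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                if not os.path.islink(full_path):
                    total += os.path.getsize(full_path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def has_room_for_backup(backup_dir, required_bytes):
    """Return True if the filesystem holding backup_dir has enough free space."""
    free_bytes = shutil.disk_usage(backup_dir).free
    return free_bytes >= required_bytes
```

Calling has_room_for_backup with the backup mount point and directory_size of /home as arguments, and raising an alarm when it returns False, is the idea. The actual paths depend on your setup.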
As the site grew, so did the number of images, and the backup HD had become too small. I hadn’t a clue anything was wrong because each day I happily received the email titled “Backup complete” from cPanel. I went back to the last backup log email from before the crash, and yes, line 1022 said:
gzip: stdout: No space left on device
One line out of 1,500 to inform me the backup had failed. GREAT.
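Nobody is going to read 1,500 lines every day, so the sensible thing is to let a script look for that one line. Here’s a minimal sketch, assuming the backup log is available as plain text. Only the first marker comes from my actual log; the others are common system error messages I added as guesses, and the function name is hypothetical.

```python
# "No space left on device" is the message from my log.
# The remaining markers are assumptions; adjust them to your own backup tool.
FAILURE_MARKERS = (
    "No space left on device",
    "Permission denied",
    "Input/output error",
    "Read-only file system",
)


def find_backup_failures(log_text):
    """Return (line_number, line) pairs for log lines containing a failure marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(marker.lower() in lowered for marker in FAILURE_MARKERS):
            hits.append((number, line.strip()))
    return hits
```

A cheerful “Backup complete” subject line means nothing if this function returns anything at all.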
Are you safe?
Do you read your Logwatch reports? Your backup logs? Logwatch reports aren’t only useful for spotting HD problems; they also list possible hacking attempts, so they shouldn’t be ignored. Of course, I’ll have to follow my own advice from now on.
5 Responses
Sorry to hear about all that. With all there is to do on the web nowadays, backups usually get the back-burner. Now I am rethinking that.
Sorry to hear that happened. I’m constantly harping about backups. I guess I’m not harping about it enough. Now I’ll add log watching to it.
Thanks for sharing your lesson with us all.
Sorry to hear that about you. Thanks for sharing though. I got to check my logs too now.
I am safe, and I have backups stored in several places: online, my own HDD, flash, etc. Without backups you’re screwed. Remember there are 2 kinds of servers – those that have crashed and those that will crash.
Cheers,
The Moneyac
It sucks when disaster strikes and you don’t have a working backup with you… It also happened to me about 6 months ago, and ever since then I try to make one at least once per week.