You may not know, but I manage another server.
Earlier this week that server got really slow for some reason, so I rebooted it. I waited and waited, but it didn’t come back up. After some investigation, it turned out the hard disk had crashed.
I knew I had backups, so I thought I was safe. I asked server support to put the backup online. Unfortunately for me, it turned out the last backup had not succeeded! Yes, disaster struck twice, and at what a time…
I asked them to try to recover the data from the failed HD. Luckily they were able to get the database, which is the most important thing on the server. Without the database, I would have lost all member info, forum data, etc. The site would have been worthless.
Unfortunately, they couldn’t recover anything from the /home partition. This means a ton of user-uploaded images were lost. Well, not all of them: the backup HD did have an older backup available, so I was able to save about 70% of the images. Still, the loss is enormous.
I’m now thinking about sending the HD to some sort of professional data recovery service, something like DriveSavers. Pricing for this kind of thing is steep though, ranging from $1,000 to $2,000. What a lucrative business that is.
Could I have prevented this from happening?
Yes, I probably could have. I should have read my logwatch reports from cPanel. The last logwatch report before the crash did indeed state there were some HD issues, but I hadn’t read it.
Good news is, those kinds of errors are almost always listed at the top of the logwatch email. Here’s what it said in my case:
WARNING: Kernel Errors Present
I/O error: dev 08:08, sect...: 35Time(s)
Additional sense indicates Unrecovered read error - auto reallocat...: 35Time(s)
Current sd08:08: sense key Medium Error...: 35Time(s)
EXT3-fs error (device sd(8,8))...: 53Time(s)
ata3: error=0x40 { Uncorrect...: 1Time(s)
ata3: status=0x51 { DriveReady SeekComplete Error }...: 194Time(s)
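Since I clearly can't be trusted to read every report, a small script could do the skimming for me. Here's a rough Python sketch of the idea. The pattern list is simply lifted from the excerpt above, and the function name `find_disk_warnings` is my own invention, so treat it as a starting point rather than a complete list of what logwatch can report:

```python
import re

# Substrings taken from the report excerpt above. Other systems may word
# their kernel errors differently, so this list is an assumption.
DISK_ERROR_PATTERNS = [
    r"Kernel Errors Present",
    r"I/O error",
    r"Unrecovered read error",
    r"Medium Error",
    r"EXT3-fs error",
]

# One combined, case-insensitive regex for all patterns.
_DISK_ERROR_RE = re.compile("|".join(DISK_ERROR_PATTERNS), re.IGNORECASE)


def find_disk_warnings(report_text):
    """Return the lines of a report that match any disk-error pattern."""
    return [
        line.strip()
        for line in report_text.splitlines()
        if _DISK_ERROR_RE.search(line)
    ]
```

Feed it the body of the logwatch email, and if the returned list isn't empty, it's time to actually open the report.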
But why did the backup fail? After some investigation, it turned out cPanel had tried to gzip all the files, and there simply wasn’t enough room left on the backup HD.
As the site grew, so grew the number of images. The backup HD had gotten too small. I hadn’t a clue anything was wrong because each day I happily received the email titled “Backup complete” from cPanel. I investigated the last backup log email from before the crash, and yes, on line 1022 this was stated:
gzip: stdout: No space left on device
One line out of 1,500 to inform me the backup had failed. GREAT.
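Nobody is going to read 1,500 lines a day, so this is another job for a script. Below is a minimal Python sketch under a few assumptions: "No space left on device" is the message from my own log, the other markers are guesses at what else a failed run might print, and the names `find_backup_failures` and `free_fraction` are made up for this example. The second function uses the standard library's `shutil.disk_usage` to see how full the backup disk is before things go wrong:

```python
import shutil

# The first marker is the message from my backup log. The others are
# assumptions about what a failed run might also print.
FAILURE_MARKERS = [
    "No space left on device",
    "Permission denied",
    "Input/output error",
]


def find_backup_failures(log_text):
    """Return (line_number, line) pairs for log lines with a failure marker."""
    failures = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(marker.lower() in lowered for marker in FAILURE_MARKERS):
            failures.append((number, line.strip()))
    return failures


def free_fraction(path):
    """Return the fraction of free space on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

With something like this, a “Backup complete” email only counts if `find_backup_failures` comes back empty, and a low `free_fraction` for the backup disk is a hint to buy a bigger one before the next backup fails.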
Are you safe?
Do you read your logwatch reports? Your backup logs? Logwatch reports aren’t only useful for spotting HD problems; they also list possible hacking attempts, so they shouldn’t be ignored. Of course, I’ll have to follow my own advice from now on.