Near server disaster averted by computer experts

By aaron.axvig, Thu, 04/24/2008 - 03:00

Well, I wouldn't say we are experts, but we definitely did have a near disaster.

It started 3 days ago when I installed an NVIDIA driver and program to monitor the RAID 5 running on our server's M2N-E motherboard.  It reported to me that the array was in a degraded state, but did not allow for repairing it from within Windows, requiring some BIOS-level maintenance.  So I planned to go investigate the situation the next day.

When I got to the server and hooked up a monitor and keyboard, the video output was garbled.  We could tell it was going into BIOS but it was unreadable.  That's what we get for using a 10-year-old PCI video card, I guess.  So I planned to come back in a day or two with a different video card.

Of course it then decided to drop another drive from the array.  This took down our website because it was stored on the array.  By website, I mean Default Website, which has the misfortune of being linked to Outlook Web Access for Exchange 2007.  With the array gone, the web.config no longer existed, so OWA would no longer run, and reportedly the only way to fix that is to do a complete re-install of IIS and Exchange 2007.

Needless to say, we really wanted to get that RAID going again.  We rebooted between the RAID BIOS and Windows many times trying to rebuild it, but it would never show up in Windows.  After 10 fruitless repetitions of this we were about ready to call it a loss, wipe the array, and start over.

But then I thought of how the monitoring driver and utility I installed had been a relatively new version, and maybe it had incompatibilities with the older BIOS.  Updating the BIOS was worth a shot.

We downloaded the 1305 BIOS from the Asus website, which is still as horrible as it has ever been.  It wouldn't even load from the server, so we had to use another computer, network the downloaded files onto the server, and from there put them on the floppy.

After flashing the board from within BIOS we were greeted with a lockup immediately after the splash screen.  I spent half an hour unplugging things one by one trying to root out the problem, but it didn't help.  I tried unplugging all the cables and plugging them all back in (it sounds weird, but I have seen it fix many problems).  Visions of RMAing the board and going server-less for 2 weeks were running through my head, and I wasn't happy.  Finally I tried removing the battery for a minute, which somehow un-froze the BIOS.

From there it was a quick boot into Windows and then some big smiles as we saw that the array was back, and rebuilding.  No big re-installs this time.

But we weren't without issues.  Immediate attempts to copy files off of the drives were met by network timeouts--it seemed someone had forgotten to plug the network cable back in. :)

And this morning I saw that all the e-mails I received overnight reported being delivered on January 1st, 2008.  The clock needed to be set.  Also, OWA was still not running.  Some Googling of the error messages revealed that a few stopped services were probably the cause.  Starting them did get OWA running.  But outbound e-mails were not sending.  I figured this was still due to the time issue, and all the Exchange processes would need to be restarted, so I just rebooted the server.

20 minutes later the server was still not responding (I was at a remote location).  I quickly realized that it was probably halted at the BIOS screen because no keyboard was connected and the default of the new BIOS would be to halt on all errors.  Consider that lesson learned.
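
For anyone rebooting a box they can't physically see, here is a minimal sketch of the kind of check that would have told me sooner that the server never came back.  It is not something we actually ran; it just polls a TCP port until a connection succeeds or a deadline passes.  The function name `wait_for_port` and its default timings are my own invention for illustration.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: the name and default timings are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening again.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the server is not back yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling something like `wait_for_port("mail.example.com", 25, timeout_s=1200)` (the hostname is a placeholder) right after the reboot would return False after 20 minutes instead of leaving me guessing.  It only reports that the machine is down, of course; it does nothing about a BIOS that halts waiting for a keyboard.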

And now the happy ending.  Everything is running as it used to, no data was lost, e-mail is working, and I don't have to (get to?) spend my weekend re-doing a server!