wwillson · Posted April 1, 2012

I'll make a very long 36 hours short. We had one of those extremely rare and painful events happen on the server: our main volume is a RAID 1 array with two drives, and both drives crashed at the same time. Can you say ouch?

We did have the BITOG server back up in about 12 hours, which involved me driving 300 miles to get the backup server out of my office and into the datacenter, which is about 3 miles from my house. The problem is I was in my hometown in Iowa, hence the 300-mile drive.

Sorry for the downtime.

Wayne
brueggma · Posted April 1, 2012

Interesting, a double disk failure? Was it actually a double disk failure, or did you not have hardware monitoring set up? Do you mind giving us computer nerds more specs on the hardware and/or OS, if you feel comfortable giving it out? Netcraft shows Debian?
wwillson · Posted April 1, 2012 (Author)

arcconf showed no issues with the drives or the array as recently as a week ago. I was out of state for the last week until Friday at about 9:00 PM, which is when I took the backup server to the DC and got BITOG back up. The RAID controller (Adaptec 5805) never failed either drive until things were so far gone that there was no recovery. The Seagate utility couldn't even see one of the drives and found a ton of media errors on the other. It's really disappointing that an expensive RAID controller didn't fail the first drive that started having problems; that would have gotten lots of attention quickly, and all we would have done is hot-swap the failing drive for a new one and watch the array rebuild automatically. Instead, the controller, for some stupid reason, just kept both drives up until they reached the point of no return.

We run Debian on AMD64 with dual quad-core Xeons and 16 GB RAM. That may seem like overkill, but trust me, it isn't. We're now pushing 700,000 unique visitors/month between both sites, BITOG of course being the vast majority. We also use nginx to serve the static content and Apache to serve the dynamic content. Nginx is simply amazing at scaling. The server was starting to stress with just Apache doing all the work; now, with nginx serving all the static content, we are normally at about 10% usage across all 8 cores. It ran up to 50% across all 8 cores with just Apache.

Wayne
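Since arcconf looked clean right up until both drives were gone, a periodic cron check that alerts on anything other than a healthy status is one way to catch this sooner. This is only a sketch: the controller number (1), the "Status of logical device" line, and the "Optimal" wording are assumptions that should be checked against what `arcconf GETCONFIG 1 LD` actually prints on your firmware.

```shell
#!/bin/sh
# Hypothetical cron-driven health check for an Adaptec array via arcconf.

check_raid_status() {
    # Print OK when the logical-device status line reports Optimal,
    # ALERT for anything else (Degraded, Failed, Rebuilding, ...).
    case "$1" in
        *Optimal*) echo "OK" ;;
        *)         echo "ALERT" ;;
    esac
}

# From cron, something like (mail address hypothetical):
#   line=$(arcconf GETCONFIG 1 LD | grep -i 'Status of logical device')
#   [ "$(check_raid_status "$line")" = "OK" ] || \
#       echo "$line" | mail -s "RAID alert" admin@example.com

check_raid_status "Status of logical device : Optimal"
```

Run daily, this would at least surface a degraded array within 24 hours instead of a week, though it still depends on the controller admitting there is a problem.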
brueggma · Posted April 1, 2012

Wayne, I hate to give recommendations without knowing the full details, but if you question the hardware (RAID) controller, you could just throw in a third drive and "dd" the array's contents to it on a weekly basis via cron. That way you can just boot from that second LUN/virtual drive via remote console if this were to happen again. We used to do this at an old place I worked, and it saved our bacon once; the total outage was the cost of a reboot. I don't know if you're using e2label references in fstab, but you may have issues with double entries when using this method and may need to reference the full device paths (/dev/volgrp01/root, for example), not "LABEL=/". Or, if you want to throw $$ at the problem, you could purchase more hardware and set up VCS for clustering.
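The weekly-dd idea above can be sketched as follows. To keep it safe to run, this demo clones small file-backed images rather than real disks; the production device names (/dev/sda as the array, /dev/sdc as the spare) are hypothetical and must be verified with `lsblk` before use, since dd to the wrong target destroys data.

```shell
#!/bin/sh
# Demo of the "dd to a third drive" approach, using file-backed images.

dd if=/dev/urandom of=source.img bs=1M count=4 2>/dev/null  # stand-in for the array

# conv=noerror,sync carries on past unreadable blocks instead of aborting,
# which matters when the source drive is already developing media errors:
dd if=source.img of=target.img bs=1M conv=noerror,sync 2>/dev/null

cmp -s source.img target.img && echo "clone verified"

# A crontab entry for the real thing might look like (devices hypothetical):
#   0 3 * * 0 dd if=/dev/sda of=/dev/sdc bs=4M conv=noerror,sync
```

One caveat: cloning a mounted, live filesystem this way gives a crash-consistent copy at best, so an fsck on first boot from the clone should be expected.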
Pablo · Posted April 1, 2012

Wow - thanks for the hard work. Do you need any cash for gas?
wwillson · Posted April 2, 2012 (Author)

> Wow - thanks for the hard work. Do you need any cash for gas?

No, but I would take some sleep. :-)

Wayne