Practically Shooting

Server downtime


wwillson


I'll make a very long 36 hours short.

We had one of those extremely rare and painful events happen on the server. Our main volume is a RAID1 array with two drives. Both drives crashed at the same time - can you say ouch? We did have the BITOG server back up in about 12 hours, which involved me driving 300 miles to get the backup server out of my office and into the datacenter, which is about 3 miles from my house. The problem is I was in my hometown in Iowa, hence the 300-mile drive.

Sorry for the downtime.

Wayne


Interesting, double disk failure? Was it actually a double disk failure, or did you not have hardware monitoring set up?

Do you mind giving us computer nerds more specs on the hardware and/or O/S if you feel comfortable giving it out?

Netcraft shows Debian?


arcconf showed no issues with the drives or the array as recently as a week ago. I was out of state for the last week until Friday at about 9:00 PM, which is when I took the backup server to the DC and got BITOG back up. The RAID controller (Adaptec 5805) never did fail either drive until things were so far gone that there was no recovery. The Seagate utility couldn't even see one of the drives and found a ton of media errors on the other. It's really disappointing that an expensive RAID controller didn't fail the first drive that started having problems; that would have gotten lots of attention quickly. All we would have done is hot-swap the failing drive out for a new one and watch it auto-rebuild. Instead, the controller, for some stupid reason, just kept both drives up until they reached the point of no return.
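One way to catch media errors before the controller reacts is to watch the drives' own SMART counters. The following is only a minimal sketch, assuming the drives report the ATA-style attribute table printed by `smartctl -A` and that the OS can reach them at all behind the RAID controller (it may need extra smartctl options, or may not be possible). The device path passed to `read_smart` and the choice of watched attributes are assumptions, not something from the setup described above.

```python
import subprocess

# Attributes that commonly point at failing media, named as smartctl -A prints them.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes in smartctl -A output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, so skip this row
    return found


def drive_warnings(text):
    """List the watched attributes whose raw count is above zero."""
    return sorted(name for name, raw in parse_smart_attributes(text).items() if raw > 0)


def read_smart(device):
    """Run smartctl -A on a device (hypothetical path) and return its text output."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return result.stdout
```

Run from cron and mailed to an admin whenever `drive_warnings(read_smart(device))` is non-empty, a check like this would flag a drive with growing media errors even while the array still reports it as healthy.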

We run Debian on AMD64 with dual quad-core Xeons and 16 GB RAM. That may seem like overkill, but trust me, it isn't. We're now pushing 700,000 unique visitors/mo between both sites, BITOG, of course, being the vast majority. We also use nginx to serve the static content and Apache to serve the dynamic content. Nginx is simply amazing at scaling. The server was starting to stress with just Apache doing all the work; now, with nginx serving all the static content, we are normally at about 10% usage across all 8 cores. It did run up to 50% across all 8 cores with just Apache.

Wayne


Wayne,

I hate to give recommendations without knowing the full details, but if you question the hardware (RAID) controller, you could just throw in a 3rd drive and "dd" the array's contents to it on a weekly basis via cron. This way you can just boot from that 2nd LUN/virtual drive via remote console if this were to happen again. We used to do this at an old place I worked at, and it saved our bacon once. The total outage was the cost of a reboot.
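The weekly clone described above could be sketched roughly like this. It is a minimal illustration, not a tested backup tool: the device paths, the 4M block size, and the script location in the cron line are all hypothetical, and copying a mounted, busy filesystem with dd can produce an inconsistent image, so the real job would want to run at a quiet time or from a snapshot.

```python
import subprocess


def build_dd_command(source, target, block_size="4M"):
    """Build the dd argument list that copies source onto target block for block."""
    if source == target:
        raise ValueError("source and target must be different devices")
    return ["dd", f"if={source}", f"of={target}", f"bs={block_size}"]


def clone_disk(source, target, dry_run=True):
    """Return the dd command line; actually run it only when dry_run is False."""
    cmd = build_dd_command(source, target)
    if not dry_run:
        # check=True raises CalledProcessError if dd exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Hypothetical devices: /dev/sda is the RAID volume, /dev/sdb the third drive.
    print(clone_disk("/dev/sda", "/dev/sdb", dry_run=True))
```

A crontab entry such as `0 3 * * 0 /usr/local/bin/clone_disk.py` (hypothetical path) would run it at 03:00 every Sunday. The dry-run default is deliberate, since pointing dd at the wrong target overwrites it without asking.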

I don't know if you're using e2label references in fstab, but you may have issues with double entries when using this method, since the clone carries the same labels as the original. You may need to reference the full device paths (/dev/volgrp01/root, for example) rather than "LABEL=/".
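To see which fstab entries would be affected, a small check along these lines could help. It is a sketch under the assumption that a block-for-block copy duplicates filesystem UUIDs as well as labels, so it flags both `LABEL=` and `UUID=` entries; the function name is made up for illustration.

```python
def ambiguous_fstab_entries(fstab_text):
    """Return (spec, mount_point) pairs that identify a filesystem by label or UUID."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a well-formed fstab row
        spec, mount_point = fields[0], fields[1]
        # After a dd clone, two devices carry the same label/UUID,
        # so these specs no longer pick out a single filesystem.
        if spec.startswith(("LABEL=", "UUID=")):
            entries.append((spec, mount_point))
    return entries


if __name__ == "__main__":
    with open("/etc/fstab") as handle:
        for spec, mount_point in ambiguous_fstab_entries(handle.read()):
            print(f"{mount_point} is mounted via {spec}; consider a full device path")
```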

Or, if you want to throw $$ at the problem, you could purchase more hardware and set up VCS for clustering. :)
