Nov 10 2008

IT Horror Stories

Published by at 6:56 am under General,Humor

Congratulations to Jason, the winner of the free pass to CSI.  Here’s his story about how a minor change to a script almost caused a major disaster.  I have my own war story about scripts I’ll share later this week.  Here’s a hint:  Always make sure you’re in the proper directory when running your scripts.

This happened when I was first learning to admin UNIX boxes. Another
SysAdmin and I were working on a shell script to lowercase the file
names of 30-40 million image files. They were on an NFS mount that was
used by several servers. These images were part of detail listings of a
relatively busy web site and we were right in the middle of the day.

Now that the background of the mess is fully explained, the story
gets going. We went through several revisions, testing each one against
a directory on a desktop system. Nothing destructive happened during
testing and we were getting fairly comfortable with the “safety” of the
script.

We finally thought we had a working script, so we moved it to the
prod server. Then we noticed a “minor” change that needed to be made on
it. We made the change, then decided that since this was such a small
tweak we could run it on the live NFS mount without any further
testing. Fire in the hole!

The script took off and we watched it run. All was well. Then my
phone rang from the NOC. A panicked operator was on the phone saying,
“Hey what’s happening with listing images from xyz.com? They are all
coming up as 404s!” I killed the script while thinking something like
“oh crap, oh crap, oh crap!” Sure enough the script had wiped out about
50% of the images. Amazing how fast a shell script can delete when it
goes haywire.

We pointed the web servers to a backup copy of the images, then
started to recover to the production mount. The backup was a couple
days old, so our image processing guys had to re-upload the missing
work. I was lucky that the online backup was there. I had taken it for
reasons unrelated to this event. The next day I got to explain to the
CIO what had happened.

The moral of the story: back up first, and test your script until
it is golden before going live. Then test it again and again and again.
Make sure you are doing it at the proper time, then go to production. We
didn’t have change control, so now I’d also add: get all the approvals.
Cover your butt.
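
For what it's worth, here is a rough sketch of what a more cautious version of that kind of job could look like. It is not the script from the story (which isn't shown, and the story doesn't say what the bug was). It is written in Python rather than shell, and the function names `plan_lowercase_renames` and `lowercase_filenames` are made up for illustration. The two assumptions baked in: the job defaults to a dry run that only reports what it would do, and it refuses to start if lowercasing a name would land on top of a file that already exists, since a rename onto an existing name can silently replace it.

```python
import os


def plan_lowercase_renames(root):
    """Walk root and work out the renames without touching any file.

    Returns (renames, conflicts), each a list of (source, target) paths.
    A conflict is a rename whose lowercase target already exists in the
    same directory, or is already claimed by another planned rename.
    """
    renames, conflicts = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        existing = set(filenames)   # names present right now
        claimed = set()             # lowercase names already planned
        for name in sorted(filenames):
            lower = name.lower()
            if lower == name:
                continue            # already lowercase, nothing to do
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, lower)
            if lower in existing or lower in claimed:
                conflicts.append((src, dst))
            else:
                claimed.add(lower)
                renames.append((src, dst))
    return renames, conflicts


def lowercase_filenames(root, dry_run=True):
    """Lowercase file names under root; only files, not directories.

    With dry_run=True (the default) nothing is renamed and the planned
    renames are just returned for review.
    """
    renames, conflicts = plan_lowercase_renames(root)
    if conflicts:
        # Stop before changing anything rather than overwrite a file.
        raise RuntimeError(f"{len(conflicts)} rename(s) would overwrite files")
    if not dry_run:
        for src, dst in renames:
            os.rename(src, dst)
    return renames
```

The idea would be to call `lowercase_filenames(path)` first, read through the list it returns, and only then run it again with `dry_run=False`, against a backup copy before the live mount.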

It was a good lesson. I’ve never done anything like that again in the last 7 years.
