System errors explained

September 19, 2010   by Serge Knystautas

Nights like tonight make me shake my head.  I used to be this globally sought consultant who was flown in to stop Hotels.com from crashing or rescue an online commerce system that was handling millions of dollars a day.

But sometimes in the trenches, you make stupid mistakes and it isn't so much fun.

We made several big mistakes tonight that caused a lot of frustration for our customers.  I'm really sorry about that and wanted to explain what happened.

As it's a Saturday, we were watching the system to make sure everything was running fine.  Everything was fine, for most of the day.  In the evening, it seemed that today was going to be another record day, so we turned on a few more servers for good measure.

That's where the mistakes started.  First we added these servers in a way that required a 1-minute outage, which we know better than to do during the day.  Then we put the wrong version of our software on the new servers.  This was a complete breakdown in process and in the basic blocking and tackling you'd expect us to be doing right by this point.

The real problem, though, started around 8:30pm.  A bad command was sent to the main database, the one that all the admins use.  This command slowed down what the admins were seeing, but much worse than that, it caused most attempts to add or update a page in the system to fail.  And what should have taken a matter of minutes to figure out took almost two hours, because we were still fixing the mistakes we made when we added the servers wrong.

By 10:15pm, we had killed the bad command and everything began to work correctly again.  The last of the errors from admins came in at 10:31pm as the system worked through the backlog of changes.

The silver lining of this was that fan traffic was unaffected, and we crushed our old fan traffic records.  Today we served 4.7 million pages to almost 500,000 fans.

Look, this was completely unacceptable, and this system and company were built for 10 times the traffic we are handling.  Today we had over 200 admins concurrently signed in updating their sites, and I'll have a beer to celebrate when we reach 1,000, and another one at 5,000.  We're only getting started here at PrestoSports, and we're going to do what we have to do to make sure this doesn't happen again.
