Thursday, March 15, 2012

Planning for Failure

Planning for failure seems to offend us (operations people) at our very core. "But the system has redundancy built-in!" we cry. "What are the odds that our three-nines SLA service provider will go down?"

The odds are apparently pretty good. We "know" that 99.9% availability is "good," but how much actual downtime is that? Do the math: 0.1% of a year's 8,760 hours comes to 8.76 hours of agonizing, nail-biting suspense per year, which agrees with Wikipedia's availability table. Eight hours is an entire workday. How do you plan for an entire workday's worth of downtime, even when it's spread out over the course of a year?
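
If you want to sanity-check that figure against your own provider's SLA, the arithmetic fits in a few lines of Python:

    HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

    # Downtime allowed per year at a given availability level.
    for availability in (0.99, 0.999, 0.9999):
        allowed_downtime = (1 - availability) * HOURS_PER_YEAR
        print("%.2f%% uptime -> %.2f hours of downtime per year"
              % (availability * 100, allowed_downtime))

Three nines comes out to the 8.76 hours above; four nines would be a far more palatable 53 minutes or so.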

I was up a bit later than usual when I noticed that my XMPP client simply would not connect to our corporate chat server. After testing a few more services, and requesting a traceroute from a geographically distant associate, it became abundantly clear that our gateway connection had failed and the building's network was entirely unreachable from the outside.
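
In case it's useful, "testing a few more services" amounted to something like the sketch below; the hostnames and ports are placeholders, not our actual infrastructure:

    import socket

    # Probe a few services that sit behind the same gateway. If one
    # fails, suspect that service; if they all fail at once, suspect
    # the circuit. These hostnames are hypothetical stand-ins.
    SERVICES = [
        ("chat.example.com", 5222),   # XMPP
        ("mail.example.com", 25),     # SMTP
        ("www.example.com", 443),     # HTTPS
    ]

    for host, port in SERVICES:
        try:
            socket.create_connection((host, port), timeout=5).close()
            print("OK      %s:%d" % (host, port))
        except OSError as exc:
            print("FAILED  %s:%d (%s)" % (host, port, exc))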

I contacted our service provider's support line and immediately hit a wall: "Without your circuit ID, we simply cannot proceed to testing your circuit." All of our account details for just such an occasion, circuit ID included, are neatly stored away in our documentation wiki, where they belong - inside the building with no Internet access.

The routing issue resolved itself in about an hour, but it underscored a key weakness: What happens when the in-band access to the resources we take for granted is simply not there? Exotic options like a mobile broadband failover are enticing, but they're not useful in a hell-breaks-loose, meteor-hits-the-building scenario. I have since mirrored the essential content (in flat files, no less) to a more reliable off-site location. Lesson learned: Uptime is no replacement for an actual fallback plan.
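
The mirror itself doesn't need to be fancy. Something along these lines, run periodically from cron, covers the basics; the wiki URLs and paths here are made up for illustration:

    import os
    import urllib.request

    # Critical pages to keep as flat files off-site. The URLs are
    # hypothetical; point them at whatever raw export your wiki offers.
    CRITICAL_PAGES = {
        "circuit-ids.txt": "https://wiki.example.com/ops/circuit-ids?action=raw",
        "support-contacts.txt": "https://wiki.example.com/ops/support-contacts?action=raw",
    }
    MIRROR_DIR = os.path.expanduser("~/offsite-mirror")

    os.makedirs(MIRROR_DIR, exist_ok=True)
    for filename, url in CRITICAL_PAGES.items():
        with urllib.request.urlopen(url, timeout=30) as response:
            data = response.read()
        with open(os.path.join(MIRROR_DIR, filename), "wb") as fh:
            fh.write(data)
        print("mirrored %s" % filename)

The flat files are deliberate: in a real emergency you want something readable from any borrowed machine or phone, not a wiki that needs its own server to render.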