The Internet is a giant game of Jenga
2007 was a bad year for “100% uptime”, and it looks like 2008 is off to a rocky start. Anyone offering web services knows Rackspace for their “fanatical support” and “100% uptime” promises. They’re well known for being the cream of the crop in managed server vendors (thanks partially to their ubiquitous marketing).
Unfortunately “100% uptime” doesn’t mean what you think. Rackspace suffered a (short) DNS outage in September 2006. Last fall they suffered two outages due to power issues — the most colorful one was caused when a medically-incapacitated truck driver drove into a power transformer outside their data center. Today, 37signals suffered a two-hour Rackspace hardware outage in the middle of the workday. To their credit, Rackspace has been open and proactive about the issues and their responses.
And to be fair, **** happens. There is no such thing as “100% uptime” for Internet services, at least not the way you and I would interpret that promise.
In ages past, telephone network engineers talked about uptime in terms of “four nines” or “five nines”. “Four nines” means 99.99% uptime: at most about 52.6 minutes of downtime a year. “Five nines” means 99.999% uptime: roughly 5 minutes and 15 seconds of annual downtime. Ma Bell and her progeny aimed for core service levels in that range. Today’s network engineers aim for similar targets.
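For anyone who wants to check the arithmetic, here is a minimal sketch of how an uptime percentage turns into a yearly downtime budget. The function name is mine, and it assumes a 365-day year, so treat the numbers as approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime allowed by an uptime percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, percent in [("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_minutes_per_year(percent)
        whole_minutes = int(minutes)
        seconds = (minutes - whole_minutes) * 60
        print(f"{label} ({percent}%): {whole_minutes} min {seconds:.0f} s per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why every additional nine is so much harder to deliver than the last.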
Unfortunately the Internet is a heck of a lot more complicated, and a lot less mature, than Ma Bell’s telephone monopoly. Individual parts of the Internet are capable of four- or five-nines availability, but today’s Internet is layered with dramatically more complexity, variety, and failure points than ever. Data centers are designed to handle power failure, but cooling systems are now an issue. You can deploy redundant clustered servers, but your load balancer may fail. Five backup generators plus UPSes (uninterruptible power supplies) will certainly protect you from utility power outages, unless 40% of the generators fail, or the UPS fails, or cooling fails, or your file servers go wonky, or one of your core routers degrades, or all of the above happens repeatedly within three weeks. And all of this is on the service provider’s end; Internet services don’t take responsibility for the “last mile”. End customers still have to deal with Windows crashes, web browser bugs, and flaky DSL connections.
Today’s Internet is kind of like a giant game of Jenga. It’s a wonder that it works as well as it does, given the amazing number of building blocks that have to work together. Each piece of our Jenga tower is, by itself, more reliable than ever… but increasing complexity means that 100% end-user availability is frustratingly hard to guarantee.
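The Jenga effect can be put in rough numbers. The sketch below assumes a request only succeeds if every layer in the stack is up, and that layers fail independently of one another; both are simplifications, and the ten-layer stack is a made-up illustration rather than a measurement of any real service.

```python
from math import prod


def serial_availability(layer_availabilities):
    """Availability of a chain in which every layer must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(layer_availabilities)


if __name__ == "__main__":
    # Hypothetical stack: ten layers (power, cooling, routers, load
    # balancer, servers, storage, DNS, ...), each at four nines.
    layers = [0.9999] * 10
    combined = serial_availability(layers)
    print(f"end-to-end availability: {combined:.4%}")
```

Ten four-nines blocks stacked this way land at roughly three nines end to end, which is the gap between a reliable piece and a reliable tower.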
Fortunately there are lots of smart people (including here at SilverDock) working really hard to raise the standard of reliability for Internet services. For our part, we focus on failover capability between four data centers on two continents (and more to come). We’ve also designed our application around a loosely-coupled architecture. And we are working with high-quality CDNs to further improve availability, scalability, and response times.
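Why does failover across several sites help? Here is a back-of-the-envelope sketch, not a description of our actual setup. It assumes sites fail independently and that failover itself always works, which real outages (see above) show is optimistic. The 99% per-site figure is purely illustrative.

```python
from math import prod


def redundant_availability(site_availabilities):
    """Availability when any single site being up is enough.

    Assumes independent failures and perfect failover: the service is
    down only if every site is down at the same time.
    """
    return 1 - prod(1 - a for a in site_availabilities)


if __name__ == "__main__":
    # Hypothetical: four sites, each only 99% available on its own.
    sites = [0.99] * 4
    print(f"combined availability: {redundant_availability(sites):.8f}")
```

Correlated failures and imperfect failover pull the real figure well below this ideal, so the result is best read as an upper bound.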
In the bigger scheme of things, there’s a trial-and-error process of evolution under way. As each layer of the Internet gets better at interoperating, holding up its part of the Jenga tower, we can expect reliability to continue to increase. With a little luck, in 2018 we’ll see an Internet that’s even more reliable than Ma Bell’s old landlines.