Hemiposterical: Reliability through the expectation of failure

2013-10-29

A nice presentation by Pat Helland from Salesforce (and before that Amazon Web Services) on how they built a very reliable service out of second-rate hardware:

"The ideal design approach is 'web scale and I want to build it out of shit'."
Salesforce's Keystone system takes data from Oracle and then layers it on top of a set of cheap infrastructure running on commodity servers.

Intuitively this may seem crazy. If you want (and are willing to pay for) high reliability, don't you want the most reliable hardware possible?

If you want a somewhat-reliable service then sure, this may make sense at some price and reliability points. You certainly don't want hard drives which fail every 30 days or memory that laces your data with parity errors like pomegranate seeds in a salad. The problems come when you start to demand more reliability, say four nines (99.99% uptime, about 50 minutes of downtime per year), while scaling to support tens if not hundreds of concurrent users across the globe. Your system may consist of several different components, from your user-facing web server via a business rules system to a globally-replicating database. When one of your hard drives locks up, or the PC it's on catches fire, you need to be several steps ahead:

you already know that hard drives are prone to failure, so you're monitoring read/write error rates and speeds, and as soon as they fall outside acceptable levels you stop using that PC;
because you can lose a hard drive at any time, you're writing the same data on two or three hard drives in different PCs at once;
because the first time you know a drive is dead may be when you are reading from it, your client software knows to back off and look for data on an alternate drive if it can't access the local one;
because your PCs are in a data centre, and data centres are vulnerable to power outages / network cable breaks / cooling failures / regular maintenance, you have two or three data centres and an easy way to route traffic away from the one that's down.
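To make the first three points concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `Drive` is a simulated hard drive, `ReplicatedStore` is a hypothetical client library, and the 5% error-rate threshold is an arbitrary assumption. It is not how Keystone or any real storage system works, just the shape of the idea: write several copies, stop using drives that misbehave, and fall back to another replica on read.

```python
class DiskError(Exception):
    """Raised by a (simulated) drive that cannot serve a request."""


class Drive:
    """Hypothetical stand-in for a hard drive in one PC."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.healthy = True   # flip to False to simulate a failure
        self.errors = 0
        self.requests = 0

    def _check(self, action):
        self.requests += 1
        if not self.healthy:
            self.errors += 1
            raise DiskError(f"{self.name}: {action} failed")

    def write(self, key, value):
        self._check("write")
        self.blocks[key] = value

    def read(self, key):
        self._check("read")
        return self.blocks[key]

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


class ReplicatedStore:
    """Writes every value to several drives and reads from whichever works."""

    def __init__(self, drives, max_error_rate=0.05):
        self.drives = list(drives)
        self.max_error_rate = max_error_rate  # assumed threshold

    def in_service(self):
        # Monitoring: stop using drives whose error rate is too high.
        return [d for d in self.drives if d.error_rate() <= self.max_error_rate]

    def write(self, key, value, min_copies=2):
        copies = 0
        for drive in self.in_service():
            try:
                drive.write(key, value)
                copies += 1
            except DiskError:
                continue  # this copy failed, keep going with the others
        if copies < min_copies:
            raise DiskError(f"only {copies} copies written, wanted {min_copies}")
        return copies

    def read(self, key):
        # Back off to the next replica if a drive fails or lacks the key.
        for drive in self.drives:
            try:
                return drive.read(key)
            except (DiskError, KeyError):
                continue
        raise DiskError(f"no replica could serve {key!r}")


drives = [Drive("pc1"), Drive("pc2"), Drive("pc3")]
store = ReplicatedStore(drives)
store.write("order-42", "paid")
drives[0].healthy = False          # pc1 catches fire
print(store.read("order-42"))      # served from pc2 instead
```

Routing between data centres follows the same pattern one level up, with whole sites in place of drives.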

You get the picture. Trust No One, and certainly No Hardware. At every stage of your request flow, expect the worst.
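A bit of arithmetic shows why this pays off. The sketch below assumes that replicas fail independently and that the service is up as long as at least one replica is up; real failures are often correlated (same rack, same power feed, same bad software push), so treat the numbers as an optimistic bound rather than a promise. The function names are my own.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability):
    """Downtime budget implied by an availability figure such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single, replicas):
    """Availability of `replicas` copies when any one of them is enough.

    Assumes failures are independent, which is rarely quite true.
    """
    return 1.0 - (1.0 - single) ** replicas


# Four nines leaves under an hour of downtime per year.
print(round(downtime_minutes_per_year(0.9999), 1))   # 52.6

# Three mediocre 99% machines beat one very good 99.99% machine,
# provided their failures really are independent.
print(combined_availability(0.99, 3) > 0.9999)       # True
```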

This extends to software too, by the way. Suppose you have a business rules service that lots of different clients use. You don't have any reason to trust the clients, so make sure you are resilient:

rate-limit connections from each client or location so that if you get an unexpected volume of requests from one direction then you start rejecting the new ones, protecting all your other clients;
load-test your service so that you know the maximum number of concurrent clients it can support, and reject new connections from anywhere once you're over that limit;
evaluate how long a client connection should take at maximum, and time out and close clients going over that limit to prevent them clogging up your system;
for all the limits you set, have an automated alert that fires at (say) 80% of the limit so you know you're getting into hot water, and have a single monitoring page that shows you all the key stats plotted against your known maximums;
make it easy to push a change that rejects all traffic matching certain characteristics (client, location, type of query) to stop something like a Query of Death from killing all your backends.
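Here is a rough sketch of how the first, second and fourth points might fit together. The class names, the token-bucket approach to rate limiting, and the specific numbers are all my own assumptions for illustration; a production service would also need locking, per-request timeouts and real alerting rather than a list of strings.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second with bursts of up to `burst`."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.now = now
        self.tokens = float(burst)
        self.updated = now()

    def allow(self):
        current = self.now()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = current - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = current
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AdmissionController:
    """Per-client rate limit plus a global cap on concurrent requests."""

    def __init__(self, max_concurrent, per_client_rate, per_client_burst,
                 alert_fraction=0.8, now=time.monotonic):
        self.max_concurrent = max_concurrent    # found by load testing
        self.per_client_rate = per_client_rate
        self.per_client_burst = per_client_burst
        self.alert_fraction = alert_fraction    # alert at (say) 80%
        self.now = now
        self.active = 0
        self.buckets = {}
        self.alerts = []                        # stand-in for a real pager

    def try_admit(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.per_client_rate,
                                 self.per_client_burst, self.now)
            self.buckets[client_id] = bucket
        if not bucket.allow():
            return False   # this client is over its own limit
        if self.active >= self.max_concurrent:
            return False   # the whole service is full
        self.active += 1
        if self.active >= self.alert_fraction * self.max_concurrent:
            self.alerts.append(
                f"concurrency at {self.active}/{self.max_concurrent}")
        return True

    def release(self):
        """Call when a request finishes, successfully or not."""
        self.active = max(0, self.active - 1)
```

The point of keeping the per-client bucket separate from the global cap is that one noisy client runs out of tokens long before it can eat the capacity everyone else depends on.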

Isolate, contain, be resilient, recover quickly. Expect the unexpected, and have a plan to deal with it that is practically automatic.
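The "push a change that rejects matching traffic" idea can be as simple as a list of rules checked before any real work happens. This is a toy sketch with invented names (`Request`, `RejectRule`, `TrafficFilter`); the three fields mirror the characteristics mentioned above, and how rules get distributed to your servers is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    client: str
    location: str
    query_type: str


@dataclass(frozen=True)
class RejectRule:
    """Matches a request when every field that is set agrees with it."""
    client: Optional[str] = None
    location: Optional[str] = None
    query_type: Optional[str] = None

    def matches(self, request):
        pairs = (
            (self.client, request.client),
            (self.location, request.location),
            (self.query_type, request.query_type),
        )
        return all(wanted is None or wanted == actual
                   for wanted, actual in pairs)


class TrafficFilter:
    def __init__(self):
        self.rules = []

    def push_rule(self, rule):
        # An all-wildcard rule would reject everything, so refuse it.
        if rule == RejectRule():
            raise ValueError("rule must set at least one field")
        self.rules.append(rule)

    def should_reject(self, request):
        return any(rule.matches(request) for rule in self.rules)


traffic = TrafficFilter()
traffic.push_rule(RejectRule(client="batch-job", query_type="full-scan"))
print(traffic.should_reject(Request("batch-job", "eu", "full-scan")))  # True
print(traffic.should_reject(Request("web", "eu", "full-scan")))        # False
```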

Helland wants us to build our software to fail:

...because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.
"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says.

He's spot on, and it's a lesson that the implementors of certain large-scale IT systems recently delivered to the world would do well to learn.
