This morning, Internet hosting company CloudFlare dropped off the Net for an hour; their 23 data centers spread across 14 countries and 4 continents progressively vanished from the Internet. This was a near-total outage for them. CloudFlare have served over 1 trillion page views since they started, so they're not exactly amateurs at this business. How, then, could they have such a massive simultaneous outage?
To CloudFlare's immense credit, they posted an excellent post-mortem on their company blog:
The cause of the outage was a system-wide failure of our edge routers.
[...] We saw a DDoS attack [distributed denial-of-service; lots of computers across the world acting together to attack a target] being launched against one of our customers.
We have an internal tool that profiles attacks and outputs signatures that our automated systems as well as our ops team can use to stop attacks.
One of our ops team members took the output from the profiler and added a rule based on its output to drop packets that were between 99,971 and 99,985 bytes long.
[...] Flowspec [router configuring system] accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.
If you're into networking to any degree, go read the whole post-mortem; it's exemplary - and it should make you a little concerned if you have Juniper routers in your shop. If you're not into networking, and wonder why this is so interesting and why it matters to the Internet as a whole, I shall attempt to pitch an explanation.
A bit of background first. The "edge routers" referred to above are devices that connect the CloudFlare data centers (buildings full of computers that run the CloudFlare websites) to the rest of the Internet. Edge routers function like postal sorting centres; every packet (addressed envelope) that comes to them will have its address checked, and the routers will determine whether the address is a local computer in the data center, or some other computer in the Internet. If the latter, the edge router has a list of other routers that handle different addresses; with our postal analogy, it's like realising that NW postcodes get handled by the north-west London regional sorting centre, so all envelopes with NW postcodes will get forwarded to that centre for further routing. The edge routers also advertise to their Internet neighbours which addresses they can handle; in this case, the addresses of machines in the CloudFlare data centers. That information propagates out through the Internet so that when you want to go to an address owned by CloudFlare, your ISP will send your packets to the edge router of the nearest appropriate CloudFlare data center.
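For the more code-minded, the sorting-centre lookup can be sketched in a few lines. This is only an illustration: the forwarding table, the next-hop names and the `next_hop` function are all made up for this post, and the addresses are reserved documentation ranges rather than anything CloudFlare owns. Real routers pick the most specific matching prefix (longest-prefix match), which is what the sketch does.

```python
import ipaddress

# Hypothetical forwarding table: address prefix -> where to send the packet next.
# All prefixes are documentation ranges, chosen purely for illustration.
FORWARDING_TABLE = {
    ipaddress.ip_network("198.51.100.0/24"): "local data center",
    ipaddress.ip_network("203.0.113.0/24"): "neighbour router A",
    ipaddress.ip_network("0.0.0.0/0"): "upstream ISP",  # default route
}


def next_hop(destination: str) -> str:
    """Return the next hop for a destination using longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    # Every prefix that contains the address is a candidate...
    matches = [net for net in FORWARDING_TABLE if addr in net]
    # ...and the most specific one (longest prefix) wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]


if __name__ == "__main__":
    for dest in ("198.51.100.7", "203.0.113.9", "192.0.2.1"):
        print(dest, "->", next_hop(dest))
```

The default route plays the part of "anything I don't recognise goes to the big regional sorting centre", so every address always has somewhere to go.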
Every data center will have at least two edge routers connecting it to the Internet; it may also have other routers which connect it directly to other CloudFlare data centers, but we'll ignore those for now. The reason it has at least two routers is for redundancy - if one router has a software or electronic failure, the other can keep things running until the first one is repaired. But if they are both the same model of router, and both have the same configuration, this only gives you very limited protection: a bug or bad rule that crashes one will crash the other as well.
The outage ran roughly as follows:
- Unnamed bad people mount a distributed denial of service attack against a CloudFlare customer.
- CloudFlare spots the attack and runs its details through a program to work out how to block it.
- The analysis produces a very weird rule that blocking packets between 99,971 and 99,985 bytes long should stop the attack - this cannot possibly be correct as packets on the CloudFlare network are no bigger than 4500 bytes.
- A CloudFlare ops member sends that rule out to all the CloudFlare edge routers so that they will start ignoring the attack.
- The rule causes all CloudFlare routers to use up all their memory and crash, repeatedly.
- CloudFlare ops detect that they are disconnected from the Internet, and presumably their customer support hotline starts ringing off the hook.
- CloudFlare ops can't reprogram the routers via the network because they're continually crashing, so they have to contact each data center to get someone to visit each router and physically restart it to wipe out the bad configuration.
- The routers restart, come back online, and get reprogrammed with a known good configuration that does not include the pathological rule.
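The weirdness of the rule in the third step is easy to check mechanically. Here is a minimal sketch of such a check, assuming the 4500-byte ceiling mentioned above; the `rule_can_match` function and the constant are my own invention, not part of Flowspec or of any CloudFlare tool.

```python
# Assumed ceiling on packet size, taken from the figure quoted in this post.
MAX_PACKET_BYTES = 4500


def rule_can_match(min_len: int, max_len: int,
                   max_packet: int = MAX_PACKET_BYTES) -> bool:
    """Return True if a packet of plausible size could fall in [min_len, max_len]."""
    if min_len > max_len:
        return False  # empty range, nothing can match
    # The rule's range must overlap the range of sizes a packet can have.
    return min_len <= max_packet and max_len >= 1


if __name__ == "__main__":
    print(rule_can_match(99_971, 99_985))  # the rule from the outage
    print(rule_can_match(1_000, 1_500))    # a plausible rule
```

A rule that fails this check can never block anything, so refusing to push it costs nothing and flags that the profiler has produced nonsense.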
If I were CloudFlare, I'd be making the following changes to my processes:
- Add a new edge router to each data center that is not a Juniper router;
- Perform some sanity checking and independent review on the DDoS traffic profiler so that if it spits out rules which could have no actual effect then they get spotted and stopped;
- Use a canarying process where new non-critical rules first get pushed out to low-traffic data centers and left to bake for 30-60 minutes, then rolled out to other data centers in a set (and carefully thought-out) sequence.
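The canarying idea in the last point can be sketched as follows. Everything here is hypothetical: the site names, the traffic figures, and the `push`, `healthy` and `bake` callables stand in for whatever tooling a real operator has. The point is only the shape of the process: lowest-traffic sites first, wait, check, and stop at the first sign of trouble.

```python
from typing import Callable, Dict, List


def rollout_stages(traffic: Dict[str, int], canary_count: int = 2,
                   batch_size: int = 3) -> List[List[str]]:
    """Order sites by traffic (lowest first) and split them into stages."""
    ordered = sorted(traffic, key=lambda site: traffic[site])
    stages = [ordered[:canary_count]]  # the canaries go first
    rest = ordered[canary_count:]
    for i in range(0, len(rest), batch_size):
        stages.append(rest[i:i + batch_size])
    return [stage for stage in stages if stage]


def staged_rollout(stages: List[List[str]],
                   push: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   bake: Callable[[], None]) -> List[str]:
    """Push stage by stage; halt after the first stage with an unhealthy site.

    Returns the list of sites that received the rule.
    """
    done: List[str] = []
    for stage in stages:
        for site in stage:
            push(site)
            done.append(site)
        bake()  # e.g. wait 30-60 minutes in a real deployment
        if not all(healthy(site) for site in stage):
            break  # stop here; later stages never see the rule
    return done


if __name__ == "__main__":
    traffic = {"a": 50, "b": 5, "c": 20, "d": 1, "e": 90}  # made-up numbers
    stages = rollout_stages(traffic, canary_count=2, batch_size=2)
    reached = staged_rollout(stages, push=lambda s: None,
                             healthy=lambda s: s != "b", bake=lambda: None)
    print(stages, reached)
```

With a scheme like this, a rule that crashes routers takes out a couple of quiet data centers rather than all of them at once.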
This is one of the aspects of the Internet's reliability that continues to worry me. It includes some very large, complex distributed systems owned by a range of companies (Microsoft, Google, CloudFlare, Facebook etc.) but within those companies there is a natural tendency to standardise on a single vendor and small range of devices to perform key functions like edge routing. The Internet as a whole is very diverse in technologies and software, which is why it is so robust, but we are going to keep seeing these large entities suffering large if not global outages as long as they value economy of scale in purchasing and maintenance over true system diversity. Worse, if multiple companies standardise on the same hardware, you get problems like the Juniper BGP routing vulnerability that nailed Blackberry maker RIM and a number of ISPs.
Fun fact: the last time that Google went down worldwide was 7th May 2005; a bad Domain Name Service configuration left google.com unfindable by the rest of the Internet for 10-20 minutes. Facebook's last major outage was also DNS-related and took it out for about 25 minutes on 10th December 2012.