The new Time covers in depth the work of the team who fixed Healthcare.gov. It's a fantastic read, with good access to the small but extremely competent team who drove the fix - go absorb the whole thing.
The data coming out of the story confirms a lot of what I suspected about what was wrong and how it needed to be fixed. Here is the breakdown, before and after the hit team arrived:
Before
- By October 17 the President was seriously contemplating scrapping the site and starting over.
- Before this intervention, the existing site's teams weren't actually improving it at all except by chance; the site was in a death spiral.
- No one in CMS (or above) was actually checking whether the site would work before launch.
- The engineers (not companies) who built the site actually wanted to fix it, but their bosses weren't able to give them the direction to do it.
- There was no dashboard (a single view) showing the overall health of the site.
- The key problem the site had was being opened up to everyone at once rather than growing steadily in usage.
- The site wasn't caching the data it needed in any sensible way, maximising the cost of each user's action; just introducing a simple cache improved the site's capacity by a factor of 4.
During the Tuesday hearing, Tavenner rejected the allegation that the CMS mishandled the health-care project, adding that the agency has successfully managed other big initiatives. She said the site and its components underwent continuous testing but erred in underestimating the crush of people who would try to get onto the site in its early days. "In retrospect, we could have done more about load testing," she said.
As the Time article shows, this was anything but the truth about what was actually wrong.
After
- There wasn't any real government coordination of the rescue - it was managed by the team itself, with general direction but not specific guidance from the White House CTO (Todd Park).
- The rescue squad was a scratch team who hadn't worked together before but was completely aligned in that they really wanted to make the site work, and had the technical chops to know how to make this happen if it was possible.
- Fixing the website was never an insurmountable technical problem: as Dickerson noted, "It's just a website. We're not going to the moon." It was just that no-one who knew how to fix it had been in a position to do so.
- The actual fixes were complete in about 6 weeks.
- One of the most important factors in speeding up the fixes was to avoid allocating blame for mistakes at all.
- Managers should, in general, shut up during technical discussions: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it."
- The team refused to commit to artificial deadlines: they would fix it as fast as they could but would not make promises about when the fixes would be done, refusing to play the predictions game.
- Having simple metrics (like error rate, concurrent users on the site) gave the team a good proxy for how they were doing.
- Targeted hardware upgrades made a dramatic difference to capacity - the team had measured the bottlenecks and knew what they needed to upgrade and in what order.
- Not all problems were fixed: the back-end communications to insurance companies still weren't working, but that was less visible so lower priority.
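The simple metrics mentioned above (error rate, concurrent users) are cheap to track. The following is a minimal sketch of such a health summary; the `SiteHealth` class and the sample numbers are hypothetical and not taken from the article or from the team's actual dashboard.

```python
from dataclasses import dataclass


@dataclass
class SiteHealth:
    """Running counters for a one-line view of site health."""

    requests: int = 0
    errors: int = 0
    concurrent_users: int = 0

    def record_request(self, ok: bool) -> None:
        # Count every request, and separately count the failed ones.
        self.requests += 1
        if not ok:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero before any traffic has arrived.
        return self.errors / self.requests if self.requests else 0.0

    def summary(self) -> str:
        return f"error rate {self.error_rate:.1%}, {self.concurrent_users} concurrent users"


health = SiteHealth(concurrent_users=250)
for i in range(100):
    health.record_request(ok=(i >= 6))  # made-up sample: the first 6 requests fail

print(health.summary())  # error rate 6.0%, 250 concurrent users
```

Two numbers like these won't diagnose anything on their own, but they do tell a team at a glance whether a change made things better or worse.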
The overall payoff for these six weeks of work was astonishing; on Monday 23rd December the traffic surged in anticipation of a sign-up deadline:
"We'd been experiencing extraordinary traffic in December, but this was a whole new level of extraordinary ... By 9 o'clock traffic was the same as the peak traffic we'd seen in the middle of a busy December day. Then from 9 to 11, the traffic astoundingly doubled. If you looked at the graphs, it looked like a rocket ship." Traffic rose to 65,000 simultaneous users, then to 83,000, the day's high point. The result: 129,000 enrollments on Dec. 23, about five times as many in a single day as what the site had handled in all of October.Despite this tremendous fix, however, President Obama didn't visit the team to thank them. Perhaps the political fallout from the Healthcare.gov farce was too painful for him.
The best quote that every single government on the planet should read:
[...] one lesson of the fall and rise of HealthCare.gov has to be that the practice of awarding high-tech, high-stakes contracts to companies whose primary skill seems to be getting those contracts rather than delivering on them has to change. "It was only when they were desperate that they turned to us," says Dickerson. "I have no history in government contracting and no future in it ... I don't wear a suit and tie ... They have no use for someone who looks and dresses like me. Maybe this will be a lesson for them. Maybe that will change."
The team who pulled President Obama's chestnuts out of the fire didn't even think they were going to be paid for their work initially; it looks like they did eventually get some money, but nowhere near even standard contracting rates. And yet, money wasn't the motivator for them - they deeply wanted to make Healthcare.gov work. As a result they did an extraordinary job and more or less saved the site from oblivion. This matches my experience from government IT developments: it's reasonable to assume that the government don't care about whether the project works at all, because if they did then they'd run it completely differently. Though if I were President I'd be firing Marilyn Tavenner, cashing in her retirement package and using it to pay bonuses to the team who'd saved my ass.
If you have a terribly important problem to solve, the most reliable way to solve it is to find competent people who will solve it for free because they want it to work. Of course, it's usually quite hard to find these people - and if you can't find them at all, maybe your problem shouldn't be solved in the first place.