Since the US government has made a pig's ear, dog's breakfast and sundry other animal preparations of its
health care exchange HealthCare.Gov, I thought I'd exercise some 20/20 hindsight and explain how it should
(or at least could) have been done in a way that would not cost hundreds of millions of dollars
and would not lead to egg all over the face of Very Important People. I don't feel guilty exercising
hindsight, since the architects of this appalling mess didn't seem to worry about exercising any foresight.
A brief summary of the problem first. You want to provide a web-based solution to allow American citizens
to comparison-shop health insurance plans. You are working with a number of insurers who will provide you
with a small set of plans they offer and the rules to determine what premium and deductible they will sell
the plan at depending on purchaser stats (age, family status, residential area etc.). You'll provide a daily
or maybe even hourly feed to insurers with the data on the purchasers who have agreed to sign up for their plans.
You're not quite sure how many states will use you as their health care exchange rather than building their own, but it sounds
like it could be many tens of states including the big ones (California, Texas). We expect site use to
have definite peaks over the year, usually in October/November/early December as people sign up in preparation
for the new insurance year on Jan 1st. You want it to be accessible to anyone with a web browser that is
not completely Stone Age, so specify IE7 or better and don't rely on any JavaScript that doesn't work
in IE7, Firefox, Safari, Chrome and Opera. You don't work too hard to support mobile browsers for now, but Safari for iPad and iPhone 4 onwards should be checked.
Now we crunch the numbers. We expect to be offering this to tens of millions of Americans eventually,
maybe up to 100M people in this incarnation. We also know that there is very keen interest in this
system, and so many other people could be browsing the site or comparison-shopping with their existing
insurance plans even if they don't intend to buy. Let's say that we could expect a total of 50M individual
people visiting the site in its first full week of operation. The average number of hits per individual:
let's say, 20. We assume 12 hours of usage per day given that it spans America (and ignore Hawaii).
1bn hits per week divided by 302400 seconds yields an average hit rate of about 3300 hits per second.
You can expect peaks of twice that, and spikes of maybe five times that during e.g. news broadcasts about
the system. So you have to handle spikes of somewhere around 15000 hits per second. That's quite a lot, so let's think
about managing it.
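For anyone who wants to play with the assumptions, here is the same back-of-envelope arithmetic as a small Python sketch. The inputs are the guesses from the paragraph above, not measurements, and the function and constant names are mine. Note that five times the average comes out nearer 16,500 than 15,000; I'm rounding loosely.

```python
# Back-of-envelope load estimate. Every input is a guess taken from the text.
VISITORS_FIRST_WEEK = 50_000_000   # individual visitors in the first full week
HITS_PER_VISITOR = 20              # average hits per individual
ACTIVE_HOURS_PER_DAY = 12          # usage window spanning the continental US
DAYS = 7


def active_seconds(hours_per_day, days):
    """Seconds during which we assume the traffic actually arrives."""
    return hours_per_day * 3600 * days


def average_hits_per_second(visitors, hits_per_visitor, hours_per_day, days):
    total_hits = visitors * hits_per_visitor
    return total_hits / active_seconds(hours_per_day, days)


avg = average_hits_per_second(
    VISITORS_FIRST_WEEK, HITS_PER_VISITOR, ACTIVE_HOURS_PER_DAY, DAYS
)
peak = 2 * avg    # "peaks of twice that"
spike = 5 * avg   # "spikes of maybe five times that"

print(f"average: {avg:,.0f} hits/s")
print(f"peak:    {peak:,.0f} hits/s")
print(f"spike:   {spike:,.0f} hits/s")
```

Changing any one input (say, 40 hits per visitor instead of 20) scales everything linearly, which is a good reason to measure real numbers as early as possible.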
The first thing I think here is "I don't want to be worrying about hardware scaling issues that other people have already solved." I'm already thinking
about running most of this, at least the user-facing portion, on hosted services like Amazon's EC2 or Google's App Engine.
Maybe even Microsoft's Azure, if you particularly enjoy pain. All three of these behemoths have
staggering numbers of computers. You pay for the computers you use; they let you
keep requesting capacity and they keep giving it to you. This is ideal for our model of very variable query rates.
If we need about one CPU and 1GB of RAM to handle three queries per second of traffic, you'll want to provision about 5000 CPUs (say, 2500 machines) during your first week to handle the spikes, but maybe no more than 500 CPUs during much of the rest of the year.
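The provisioning sum is simple enough to sketch too. The three-queries-per-CPU figure is the guess from the text, two CPUs per machine is implied by "5000 CPUs (say, 2500 machines)", and the function names are made up for illustration.

```python
import math

QUERIES_PER_CPU_PER_SECOND = 3   # guess from the text: one CPU + 1GB RAM per 3 queries/s
CPUS_PER_MACHINE = 2             # implied by "5000 CPUs (say, 2500 machines)"


def cpus_needed(hits_per_second, per_cpu=QUERIES_PER_CPU_PER_SECOND):
    """CPUs required to serve the given hit rate, rounded up."""
    return math.ceil(hits_per_second / per_cpu)


def machines_needed(cpus, cpus_per_machine=CPUS_PER_MACHINE):
    """Machines required to host that many CPUs, rounded up."""
    return math.ceil(cpus / cpus_per_machine)


launch_cpus = cpus_needed(15_000)   # first-week spike rate from the text
print(launch_cpus, "CPUs on", machines_needed(launch_cpus), "machines")

# Working backwards: 500 CPUs for the quiet months corresponds to this hit rate.
print(500 * QUERIES_PER_CPU_PER_SECOND, "hits/s in the quiet months")
```

The point of renting capacity is that the gap between those two numbers costs you nothing for most of the year.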
The next thought I have is "comparison shopping is hard and expensive, let's restrict it to users who we know are eligible". I'd make account creation very simple: sign up with your name, address and email address plus a simple password. Once you've signed up, your account is put in a "pending" state. We then mail a letter to the address you gave, a) confirming the sign-up but masking out some of your email address and b) providing you with a numeric code. You make your account active and able to see plans by logging in and entering your numeric code. If you forget your password in the interim, we send you a recovery link. This is all well-trodden practice. The upshot is that we know - at least, at a reasonable level of assurance - that every user with an active account a) is within our covered area and b) is not just a casual browser.
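A minimal sketch of that pending-to-active flow, in Python. Everything here is illustrative: the Account fields, the eight-digit code length and the masking rule are my assumptions, and password handling (which should use a proper password-hashing library) is left out entirely.

```python
import hmac
import secrets
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    address: str
    email: str
    state: str = "pending"       # "pending" until the mailed code is entered
    activation_code: str = ""


def mask_email(email):
    """Show only the first character of the local part, e.g. j***@example.com."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 1) + "@" + domain


def create_account(name, address, email):
    # Eight random digits; the length is an arbitrary choice for this sketch.
    code = f"{secrets.randbelow(10**8):08d}"
    return Account(name, address, email, "pending", code)


def letter_text(account):
    """Body of the postal letter: confirms sign-up and carries the code."""
    return (
        f"Dear {account.name},\n"
        f"An account was created for {mask_email(account.email)}.\n"
        f"Your activation code is {account.activation_code}.\n"
    )


def activate(account, entered_code):
    """Move a pending account to active if the code matches."""
    matches = hmac.compare_digest(
        account.activation_code.encode(), entered_code.encode()
    )
    if account.state == "pending" and matches:
        account.state = "active"
    return account.state == "active"


acct = create_account("Jane Doe", "1 Main St, Cheyenne, WY", "jane@example.com")
print(letter_text(acct))
print(activate(acct, acct.activation_code))
```

The constant-time comparison is probably overkill for a code that arrives by post, but it costs nothing. A real system would also rate-limit code attempts per account.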
As a result, we can design the main frontend to be very light-weight - simple, cacheable images and JavaScript, user-friendly. This reduces the load on our servers and hence makes it cheaper to serve. We can then establish a second part of the site to handle logged-in users and do the hard comparison work.
This site will check for a logged-in cookie on any new request, and immediately bounce users missing cookies to a login page.
Successful login will create a cookie with nonce, user ID and login time signed by our site's private key with (say) a 12 hour expiry. We make missing-cookie users as cheap as possible to redirect. Invalid (forged or expired) cookies can be handled as required, since they occur at much lower rates.
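Here is one way the cookie could look, sketched in Python with only the standard library. Note the swap: I say "signed by our site's private key" above, but this sketch uses an HMAC with a shared secret instead of an asymmetric signature, because that is what the standard library offers. The names (SECRET_KEY, make_cookie, check_cookie) and the cookie layout are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

# Hypothetical key for the sketch; in practice it would come from secure
# configuration shared by all frontend servers.
SECRET_KEY = secrets.token_bytes(32)
MAX_AGE_SECONDS = 12 * 3600   # the "(say) 12 hour expiry"


def make_cookie(user_id, now=None, key=SECRET_KEY):
    """Build 'base64(payload).signature' carrying nonce, user ID and login time."""
    now = int(time.time()) if now is None else int(now)
    payload = {"nonce": secrets.token_hex(8), "uid": user_id, "login": now}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig


def check_cookie(cookie, now=None, key=SECRET_KEY):
    """Return the user ID for a valid, unexpired cookie, else None."""
    if not cookie:
        return None   # cheapest path: no cookie, bounce to the login page
    now = int(time.time()) if now is None else int(now)
    body, sep, sig = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return None   # forged or corrupted
    data = json.loads(base64.urlsafe_b64decode(body))
    if now - data["login"] > MAX_AGE_SECONDS:
        return None   # expired
    return data["uid"]


cookie = make_cookie("user-123")
print(check_cookie(cookie))
print(check_cookie(""))
```

The missing-cookie check happens before any cryptography, which is what keeps the casual-visitor redirect cheap. The payload is only decoded after the signature checks out, so forged cookies never reach the JSON parser.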
There's not much you can do about the business rules evaluation to determine plan costs: it's going to be expensive in computation. I'd personally be instrumenting the heck out of this code to spot any quick wins in reducing computation effort.
But we've already filtered out the looky-loos to improve the "quality" (likelihood of actually wanting to buy insurance) of users looking at the plans, which helps. Checking the feeds to insurers is also important; put your best testing, integration and QA people on this, since you're dealing with a bunch of foreign systems that will not work as you expect and you need to be seriously defensive.
Now we think about launch. We realise that our website and backends are going to have bugs, and the most likely place
for these bugs is in the rules evaluation and feeds to insurers. As such, we want to detect and nail these bugs before
they cause widespread problems. What I'd do is, at least 1 month in advance of our planned country-wide launch,
launch this site for one of the smaller states - say, Wyoming or Vermont, which have populations around 500K -
and announce that we will apply a one-off credit of $100 per individual or $200 per family to users from this state purchasing
insurance. Ballpark guess: these credits will cost around $10M which is incredibly cheap for a live test. We provision the crap out of our system and wait for the flood of applications, expect things to break, and measure our actual load and resources consumed. We are careful about user account creation - we warn users to expect their account creation letters within 10 days, and deliberately stagger sending them so we have a gradual trickle of users onto the site. We have a natural limit of users on the site due to our address validation. Obviously, we find bugs - we fix them as best we can, and ensure we have a solid suite of regression testing that will catch the bugs if they re-occur in future. The rule is "demonstrate, make a test that fails, fix, ensure the test passes."
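The letter staggering could be as dumb as a round-robin over mailing days. A sketch in Python, assuming a seven-day mailing window (to leave postal delivery time inside the promised 10 days); the window length, the start date and the account IDs are all made up for illustration.

```python
from datetime import date, timedelta


def stagger_letters(account_ids, first_mailing_day, mailing_days=7):
    """Spread pending accounts evenly across consecutive mailing days."""
    schedule = {}
    for i, account_id in enumerate(account_ids):
        day = first_mailing_day + timedelta(days=i % mailing_days)
        schedule.setdefault(day, []).append(account_id)
    return schedule


pending = [f"acct-{n}" for n in range(25)]        # hypothetical pending accounts
plan = stagger_letters(pending, date(2013, 9, 1))  # hypothetical start date
for day in sorted(plan):
    print(day.isoformat(), len(plan[day]), "letters")
```

Because nobody can see plans until their letter arrives, the mailing rate is effectively a throttle on how fast new users reach the expensive part of the site.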
Once we're happy that we've found all the bugs we can, we open it to another, larger, state and repeat, though this time not offering the credit. We onboard more and more states, each time waiting for the initial surge of users to subside before opening to the next one. The current state-by-state invitation list is prominent on the home page of our site. Our rule of thumb is that we never invite more users than we already have (as a proportion of state population), so we can do no more than approximately double our traffic each time.
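The "never more than double" rule is easy to turn into a rollout planner. This is a greedy sketch in Python: at each step it invites the largest state whose population does not exceed the total already invited. The state names and populations are placeholders, not real figures, and the greedy choice is my assumption rather than the only sensible ordering.

```python
def plan_rollout(populations, first):
    """Order states so each newcomer is no bigger than everyone invited so far.

    Returns (order, blocked), where blocked lists states that are still too
    large to invite under the rule.
    """
    order = [first]
    invited = populations[first]
    remaining = {s: p for s, p in populations.items() if s != first}
    while remaining:
        eligible = [s for s, p in remaining.items() if p <= invited]
        if not eligible:
            break
        nxt = max(eligible, key=lambda s: remaining[s])
        order.append(nxt)
        invited += remaining.pop(nxt)
    return order, sorted(remaining)


# Placeholder populations for illustration only.
populations = {
    "A": 500_000,
    "B": 400_000,
    "C": 900_000,
    "D": 1_500_000,
    "E": 3_000_000,
    "F": 20_000_000,
}
order, blocked = plan_rollout(populations, first="A")
print("invite in order:", order)
print("too big for now:", blocked)
```

A very large state can end up blocked even after everyone else is on board; the rule as stated doesn't say what to do then, so that decision is left open here.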
This is not a "big bang" launch approach. This is because I don't want to create a large crater with the launch.
For the benefit of anyone trying to do something like this, feel free to redistribute and share, even for commercial
use.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Update: also very worth reading Luke Chung's take on this application, which comes from a slightly different perspective but comes up with many similar conclusions on the design, and also makes the excellent usability point:
The primary mistake the designers of the system made was assuming that people would visit the web site, step through the process, see their subsidy, review the options, and select "buy" a policy. That is NOT how the buying process works. It's not the way people use Amazon.com, a bank mortgage site, or other insurance pricing sites for life, auto or homeowner policies. People want to know their options and prices before making a purchase decision, often want to discuss it with others, and take days to be comfortable making a decision. Especially when the deadline is months away. What's the rush?