2013-10-31

Belated realization of what works

I've previously blogged about the contrast between the technically sophisticated Obama re-election campaign and the dog's breakfast that is Healthcare.gov. Go take a quick look to refresh your memories.

Now it turns out that at least one member of the team being "tech-surged" onto HealthCare.gov worked on the highly successful tech of Obama's re-election campaign:

One of two surge team members named by the agency was Michael Dickerson, which [sic, who taught CNN subs grammar?] CMS said was on leave from Google.
"He has expertise in diving into any layer of the tech stack ... in order to deliver some of the world's most reliable online services," CMS spokeswoman Julie Bataille said.
Dickerson is a site reliability engineer at Google and worked on some of the key performance-critical systems for the Obama team, as per his CV:
Designed and implemented, with Chris Jones and Yair Ghitza, the 2012 realtime election day monitoring and modeling (based on "Gordon" or vanpollwatcher.com).
Also: Wrote a tool for computing walkability of potential contacts, used by several states to prioritize GOTV contacts. Helped create the algorithm for targeting national TV cable ads to party preference and behavior, and wrote the tool that was used to do it. Prepared disaster recovery for all of OFA's mysql databases before Hurricane Sandy. Conducted various scalability and reliability assessments for many teams in OFA Tech and Analytics.
Finally the federal government is getting smart about how to fix the healthcare.gov problems - find people who a) have an interest in seeing this effort not fall on its arse, b) have the technical chops to know about the issues involved in a near-realtime distributed DB-backed system, and c) are willing and able to kick ass, then hand them a stick with a nail in the end and give them an open-ended mandate to pull the HHS chestnuts out of the fire.

Too late? Maybe. The government has committed to having things working by the end of the month. Without knowing specifics, and assuming a virtually unlimited budget, I think they are finally getting the right kind of people in to sort out their problems. The question is how many reputations and careers of the incumbent project managers and developers they are willing to sacrifice. I suspect at this point the answer is "all".

The curse of experts

Megan McArdle, who has been all over the HealthCare.gov and Affordable Care Act rollout like a rash, has a superb piece at Bloomberg on why the implications of the ACA came as a surprise to most people:

"We all knew" that preventive care doesn't save money, electronic medical records don’t save money, reducing uncompensated care saves very little money, and "reining in the abusive practices" of insurance companies was likely to raise premiums, [my italics] not lower them, because those "abuses" mostly consist of refusing to cover very sick people. But that information did not get communicated very well to the public.
This is, profoundly, what dooms any number of projects. For instance, any software engineer or technical manager worth their salt will implicitly believe that a) testing a system with something like real traffic is the only way to detect and mitigate launch problems, and b) if you're only planning on testing one week before a hard deadline then You're Going To Have A Bad Time. Yet, if the project is being managed elsewhere and the project managers are not really asking the engineers for their opinions, just handing down features and deadlines, then the facts that "all the experts know" never get presented to the project manager in a way they actually understand.

This reminded me of the testimony of CMS head Marilyn Tavenner about the awesome project fuck-up that was the HealthCare.gov launch and her part in it as the official directly responsible for its launch:

During the Tuesday hearing, Tavenner rejected the allegation that the CMS mishandled the health-care project, adding that the agency has successfully managed other big initiatives. She said the site and its components underwent continuous testing but erred in underestimating the crush of people who would try to get onto the site in its early days.
"In retrospect, we could have done more about load testing," she said.
You see what I mean? All the experts "know" that load testing a site that's going to be heavily used is not optional and not to be left to the last moment.

Reassuringly, Tavenner did demonstrate some skills in her area of competence: blame-shifting.

Under questioning, Tavenner pointed the finger at CGI Federal, saying the company sometimes missed deadlines. "We've had some issues with timing of delivery," she said.
I'm sure that's right. I'm equally sure that it's the project manager's job to anticipate, plan for and adjust schedules to handle late (or even early) deliveries - and CMS was the project manager. You'll note from her bio that Tavenner is a life-long health administrator - I'd bet her early career as a nurse lasted just long enough to get her into admin - and has as much business leading a complicated software development project as I do running an emergency room. Less, probably, because at least I know that air goes in and out, blood goes round and round, and any variation on this is a bad thing.

Ironically, the Chief Technology Officer of Health and Human Services (HHS being the parent department of the CMS), whose bio indicates reasonable technical chops, wasn't actually much involved in the project:

...an employee of Amazon Web Services Inc (AWS) emailed two HHS officials on October 7 saying, "I hear there are some challenges with Healthcare.gov. Is there anything we can do to help?"
HHS' Chief Technology Officer Bryan Sivak replied to Amazon by email on October 8: "I wish there was. I actually wish there was something I could do to help. [my emphasis]"
The Chief Information Officer, by contrast, is an ex-IBM marketeer and strategizer, and is putting his strategizing skills to good use making clear his distance from the smoking wreck of the project:
HHS' Chief Information Officer Frank Baitman replied to Amazon on October 7, "Thanks for the offer! Unfortunately, as you know, I haven't been involved with Healthcare.gov. I'm still trying to figure out how I can help, and may very well reach out for assistance should the opportunity present itself."
Nice one, Frank. Of course, Sivak is the one who comes across as actually human.

It looks like Tavenner's CMS wanted all the glory and kudos from the HealthCare.gov launch, but instead has become the focus of the frustrations and hate of millions of Americans. The lessons here: be careful what you wish for, and if you want to know what the "experts know" then you really need to ask them.

2013-10-30

For some needs, the government comes through

There's a lot of anger in America currently about the general incompetence of the federal government, but it's encouraging to see that at least one government agency is actually good at what it's paid to do:

The National Security Agency has secretly broken into the main communications links that connect Yahoo and Google data centers around the world, according to documents obtained from former NSA contractor Edward Snowden and interviews with knowledgeable officials.
Privacy concerns aside, you've got to admire the NSA for actually conducting some good modern communications interception. Someone probably deserves a substantial bonus; he won't get it, of course, because he's on a government payroll - he'll no doubt defect to the private sector eventually, or maybe the SVR will make him the proverbial un-refusable offer.

It would be fascinating to know whether the NSA is just tapping links external to the USA (presumably including links with no more than one node in the USA) or has general access to intra-USA traffic. It's also interesting to speculate on the connection between this eavesdropping and Google's move back in September to encrypt the traffic that the NSA seems to have been intercepting. Yahoo still seems to be open, based on a rather inadequate denial from their PR:

At Yahoo, a spokeswoman said: "We have strict controls in place to protect the security of our data centers, and we have not given access to our data centers to the NSA or to any other government agency."
and one has to wonder about Facebook, Apple, Amazon etc.

So congratulations, citizens of the USA - you have a productive and competent government agency! Perhaps you should have put the NSA in charge of healthcare...

2013-10-29

Reliability through the expectation of failure

A nice presentation by Pat Helland from Salesforce (and before that Amazon Web Services) on how they build a very reliable service: they build it out of second-rate hardware:

"The ideal design approach is 'web scale and I want to build it out of shit'."
Salesforce's Keystone system takes data from Oracle and then layers it on top of a set of cheap infrastructure running on commodity servers
Intuitively this may seem crazy. If you want (and are willing to pay for) high reliability, don't you want the most reliable hardware possible?

If you want a somewhat-reliable service then sure, this may make sense at some price and reliability points. You certainly don't want hard drives which fail every 30 days or memory that laces your data with parity errors like pomegranate seeds in a salad. The problems come when you start to demand more reliability - say, four nines (99.99% uptime, about 50 minutes of downtime per year) - and to scale to support tens if not hundreds of thousands of concurrent users across the globe. Your system may consist of several different components, from your user-facing web server via a business rules system to a globally-replicating database. When one of your hard drives locks up, or the PC it's on catches fire, you need to be several steps ahead:

  1. you already know that hard drives are prone to failure, so you're monitoring read/write error rates and speeds and as soon as they cross below an acceptable level you stop using that PC;
  2. because you can lose a hard drive at any time, you're writing the same data on two or three hard drives in different PCs at once;
  3. because the first time you know a drive is dead may be when you are reading from it, your client software knows to back off and look for data on an alternate drive if it can't access the local one;
  4. because your PCs are in a data centre, and data centres are vulnerable to power outages, network cable breaks, cooling failures and regular maintenance, you have two or three data centres and an easy way to route traffic away from the one that's down.
You get the picture. Trust No One, and certainly No Hardware. At every stage of your request flow, expect the worst.
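
To make "expect the worst" concrete, here's a minimal sketch in Python of the read path described above - the replica names and the simulated failure are hypothetical stand-ins, not any particular storage API:

    import random

    # Hypothetical replica locations: the same record is written to several
    # drives in several data centres, so no single failure loses the data.
    REPLICAS = ["dc1-host3", "dc2-host7", "dc3-host1"]

    class ReplicaUnavailable(Exception):
        pass

    def read_from(replica, key):
        # Stand-in for a real storage read; it fails at random to simulate
        # dead drives, burning PCs and unreachable data centres.
        if random.random() < 0.3:
            raise ReplicaUnavailable(replica)
        return "value-of-%s@%s" % (key, replica)

    def resilient_read(key):
        # Try each copy in turn; the caller never sees a single-replica failure.
        last_error = None
        for replica in REPLICAS:
            try:
                return read_from(replica, key)
            except ReplicaUnavailable as err:
                last_error = err  # log it and move on to the next copy
        raise RuntimeError("all replicas failed for %s" % key)

    print(resilient_read("user:42"))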

This extends to software too, by the way. Suppose you have a business rules service that lots of different clients use. You don't have any reason to trust the clients, so make sure you are resilient:

  1. rate-limit connections from each client or location so that if you get an unexpected volume of requests from one direction then you start rejecting the new ones, protecting all your other clients;
  2. load-test your service so that you know the maximum number of concurrent clients it can support, and reject new connections from anywhere once you're over that limit;
  3. evaluate how long a client connection should take at maximum, and time out and close clients going over that limit to prevent them clogging up your system;
  4. for all the limits you set, have an automated alert that fires at (say) 80% of the limit so you know you're getting into hot water, and have single monitoring page that shows you all the key stats plotted against your known maximums;
  5. make it easy to push a change that rejects all traffic matching certain characteristics (client, location, type of query) to stop something like a Query of Death from killing all your backends.
Isolate, contain, be resilient, recover quickly. Expect the unexpected, and have a plan to deal with it that is practically automatic.
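
A minimal sketch of that kind of admission control, assuming a single-process service and made-up limit values rather than anything from a real system:

    import time
    from collections import defaultdict

    MAX_CONCURRENT = 100        # assumed limit found by load testing
    PER_CLIENT_PER_SEC = 5      # assumed per-client rate limit
    ALERT_FRACTION = 0.8        # warn when we reach 80% of a limit

    in_flight = 0
    recent = defaultdict(list)  # client id -> timestamps of requests in the last second

    def admit(client_id):
        """Return True if a new request should be served, False if rejected."""
        global in_flight
        now = time.time()

        # Per-client rate limit: one noisy client can't starve the others.
        recent[client_id] = [t for t in recent[client_id] if now - t < 1.0]
        if len(recent[client_id]) >= PER_CLIENT_PER_SEC:
            return False

        # Global concurrency limit, with an early warning before we hit it.
        if in_flight >= MAX_CONCURRENT:
            return False
        if in_flight >= ALERT_FRACTION * MAX_CONCURRENT:
            print("ALERT: nearing concurrency limit")  # stand-in for real alerting

        recent[client_id].append(now)
        in_flight += 1
        return True

    def finish():
        # Call when a request completes (or is timed out and closed).
        global in_flight
        in_flight -= 1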

Helland wants us to build our software to fail:

...because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.
"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says.
He's spot on, and it's a lesson that the implementors of certain large-scale IT systems recently delivered to the world would do well to learn.

2013-10-27

Government tech vs Valley tech

The ongoing slow-motion disaster of the HealthCare.gov exchanges has provided vast amounts of entertainment for software engineers, and not a little of "if only they'd used this (product/process/language/company) they'd have been fine." There is much talk of a tech "surge" to get highly-skilled engineers who actually know what they're doing to help with fixing the site, but that runs into problems as Jessica Myers points out in Politico:

"The skill that is needed most for someone to come in is the knowledge of how the system works," said Eric Ries, a Silicon Valley startup founder and creator of the popular "lean startup" philosophy. "Even if you got Google up to speed on the crazy architecture that makes no sense, [...] it's like if you have a predigital clock and you want to hire a hotshot. You need someone who knows how an antique clock works."
It's a well-known maxim - indeed, known in the trade as Brooks's Law - that adding more manpower to a late project makes it later. HealthCare.gov is no exception. You'll spend ages getting your new guys up to speed on the system, architecture and tools in use - and that education process has to be conducted by the best people you already have, taking them away from their current troubleshooting. That's not to say that it's necessarily the wrong choice at this time, but it's certainly not going to bring the project in early.

One of the strategic problems faced by the developers was the very nature of government IT:

Government IT comprises a network of systems that have developed over the past half-century, said Mike Hettinger, the Software & Information Industry Association's director of public sector innovation. In some cases, thousands of homegrown networks feed into one payroll or financial system. Whereas a scrappy Silicon Valley startup could wipe out a project that doesn't work, a much larger government agency doesn't have that luxury.
This is not a problem peculiar to government IT - payroll systems in particular in private companies are notorious legacy systems that quickly become too complex and full of undocumented behavior to replace without large amounts of pain. However, in private industry there's usually a point at which the cost of supporting and working around the legacy system becomes annoying enough that people are willing to put up with the temporary pain of replacement. Sometimes all it takes is someone hired from outside to come in, set their sights on replacing the legacy system as their first big project in the firm, and it will happen - the original system developer has probably moved on to another firm by now, and so no-one cares about it. Maybe the new finance director is fed up of paying IBM squillions of dollars a year to keep the system running. Whatever, the presence of a legacy system is unstable - very few people have a vested interest in keeping it around.

Government IT, by contrast, can grow a whole ecosystem around this one legacy system, in charge of its care and feeding, providing manual work-arounds for activities the system doesn't support or automates poorly. A government departmental budget is there for spending, so a system that is awkward to use is actually more likely to get budget because the manager can demonstrate a need: "we are up to 50 man-days of work a month to issue invoices, and our two full-time accounting assistants can't cope." The empire grows, and more people have a vested interest - their jobs, in some cases - in maintaining the status quo. As such, government departments are a near-ideal environment for these systems to flourish, rather than withering in the metaphorical dog-poop corner of the departmental garden as they should. The only business environments which can provide a similar level of support are very large firms (IBM, Microsoft, big banks etc.) where a growing budget and headcount is a mark of success to be funded, not failure to be squashed.

The reason that Silicon Valley startups and successful established businesses wipe out projects that don't work well, as opposed to keeping them around to work around their idiosyncratic ways, is because they realise that sooner or later they will be forced to wipe them out anyway - eventually the system will grind to a halt, or everyone who knows how to fix it will have left, or the hardware it depends on will fail with no supplier remaining to provide the necessary parts, or a new regulation will be passed forcing the system to behave in a new way which it cannot possibly do, or the client traffic will grow past the system's performance limit... you get the picture. If you have a mad dog in your garden, you don't wait until it's bitten one of the children - it's a mad dog, everyone knows it's mad and that bitten children are inevitable, which is why you pull your Mossberg 535 from the gun cabinet and let the dog have it.

Back to how this whole mess got started:

"At the end of the day, Washington and how we procure technology for the federal government is just different," Hettinger said.
Yes, it certainly is. One wonders why anyone would think this "different" to be synonymous with "better", when "insane" seems a better fit. Unless, of course, producing a working system is a very secondary consideration to the people in procurement and the Washington-friendly contractors (IBM, Oracle and friends).

2013-10-21

HHS doesn't understand the problem so won't produce a solution

I apologise for turning this into the HealthCare.Gov train-wreck site, but it's such a material-rich environment that I can't help myself.

Today the US government Health and Human Services department issued a statement on what they are doing to fix HealthCare.Gov:

To ensure that we make swift progress, and that the consumer experience continues to improve, our team has called in additional help to solve some of the more complex technical issues we are encountering.
Our team is bringing in some of the best and brightest from both inside and outside government to scrub in with the team and help improve HealthCare.gov.
Interesting. I wonder in particular who from inside government is going to lend their expertise to this disaster-in-motion of software mis-engineering?
We are also defining new test processes to prevent new issues from cropping up as we improve the overall service and deploying fixes to the site during off-peak hours on a regular basis.
I really hope that this is a PR writer misunderstanding what she was told. You can't generally prevent new issues from cropping up from your code changes, because you don't know what those issues might be. You can, however, make a good stab at preventing old issues by setting up regression tests, running cases based on past errors to verify that the errors do not re-occur. Perhaps that's not forward-looking enough for HHS, but the sad fact is that crystal balls have very limited utility in software engineering. You're far better off improving your existing monitoring and logging so that at least you can identify and characterise errors that are occurring now.
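
For the avoidance of doubt, a regression test in this sense is nothing exotic; here's a minimal sketch in which the premium function and the bug number are hypothetical:

    import unittest

    def monthly_premium(age, smoker):
        # Hypothetical stand-in for the real rules engine.
        base = 200 + 5 * max(age - 21, 0)
        return base * (1.5 if smoker else 1.0)

    class RegressionTests(unittest.TestCase):
        def test_bug_1234_smoker_surcharge_applied_once(self):
            # Past error: the smoker surcharge was applied twice. This test
            # pins the corrected behaviour so the bug cannot quietly return.
            self.assertEqual(monthly_premium(40, smoker=True), 442.5)

    if __name__ == "__main__":
        unittest.main()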

I liked Republican Senator John McCain's suggestion for how to fix things:

"Send Air Force One out to Silicon Valley, load it up with some smart people, bring them back to Washington, and fix this problem. It's ridiculous. And everybody knows that."
The irony is that this is more or less what the Obama campaign did for the 2012 election campaign and it worked spectacularly well. If they'd done something similar for HealthCare.Gov, recruiting interested and motivated tech people from Silicon Valley (notoriously Democrat-heavy) to design and oversee the healthcare exchange, then quite possibly it would not have gone horrendously wrong. The problem now is that they are stuck with their existing design and implementation, and any redesign would necessarily trash most of their existing code and tests and require months of work to produce anything.

I'm reminded of the tourist in Ireland who asks a local how to get to Kilkenny, and the local responds "Ah well, if I wanted to get to Kilkenny, I wouldn't start from here."

2013-10-20

Federal IT project comparisons

Stewart Baker at the esteemed Volokh Conspiracy argues that not all big Federal IT projects are disasters:

... it isn't impossible, even with stiff political opposition, to manage big public-facing federal IT projects successfully. I can think of three fairly complex IT projects that my old department delivered despite substantial public/Congressional opposition in the second half of George W. Bush's administration. They weren't quite as hard as the healthcare problem, but they were pretty hard and the time pressure was often just as great.
He quotes three examples:
  1. ESTA: international visa waiver, serving 20M foreign customers per year and serving results to US border ports;
  2. E-verify: US employers checking entitlement to work, about 0.5M transactions per year;
  3. US-VISIT: electronic fingerprint checks at US borders, about 45M queries per year.

ESTA is a pretty good comparison to the health exchange: the user creates something like an account, uploads their identity information for offline consideration and conducts a financial transaction (paying for the visa). 20 million visitors per year sounds like a lot, but it's spread fairly evenly across the day, week and year as the traffic source is world-wide. You're actually looking at an average of well under 1 user per second, and there are only a couple of pages on the site so average queries per second is in single figures. You could serve this with about 6 reasonably-specced PCs in three physically separate locations so that you always have at least two locations active and at least one PC in each location active even allowing for planned and unplanned outages. This is a couple of orders of magnitude less than the health exchange traffic - it's not a bad system to evaluate in preparing for implementation of the health exchange, but you can't expect to just translate across the systems and code. The unofficial rule of thumb is that if you design a system for traffic level X, it should (if well designed) scale fine to 10X traffic, but by the time you approach 100X you need a completely different system. Serving results to border checks is at a similar scale - most visitors with an ESTA visit the US about once per year, so you expect about 20M border checks per year and so around 1 query per second.
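
The arithmetic behind those figures is deliberately trivial; a back-of-the-envelope sketch, with the pages-per-visit figure being my own assumption:

    SECONDS_PER_YEAR = 365 * 24 * 3600

    esta_visitors_per_year = 20_000_000
    pages_per_visit = 3  # assumption: an ESTA application touches a handful of pages

    avg_users_per_sec = esta_visitors_per_year / SECONDS_PER_YEAR
    avg_queries_per_sec = avg_users_per_sec * pages_per_visit

    print("%.2f users/sec" % avg_users_per_sec)      # ~0.63: well under 1 per second
    print("%.2f queries/sec" % avg_queries_per_sec)  # single figures, as claimed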

E-verify can be dismissed immediately as not comparable: it's an extremely lightweight check and has very low traffic levels.

US-VISIT is more interesting: although it's only a couple of queries per second, fingerprint matching is well known to be computationally intensive. Fortunately it's very easy to scale. You "shard" the fingerprint database by easily identified characteristics, breaking it into (possibly overlapping) subgroups; say, everyone with a clockwise whorl on their right thumb and anticlockwise spiral on their left index finger goes into subgroup 1. That way, your frontend receiving a fingerprint set can identify an appropriate subgroup and query one of a pool of machines which has all fingerprint sets matching that characteristic. You have a few machines in each pool in three separate sites, as above.
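
A sketch of that sharding scheme - the feature extraction and pool names are invented for illustration - shows why it scales: each query only has to hit the machines holding its subgroup:

    # Hypothetical sharding of a fingerprint database by coarse print features.
    # Each pool holds only the records sharing those features, so a lookup
    # searches a small fraction of the full database.
    SHARD_POOLS = {
        ("whorl", "arch"): ["pool1-a", "pool1-b"],
        ("whorl", "loop"): ["pool2-a", "pool2-b"],
        ("loop", "arch"):  ["pool3-a", "pool3-b"],
    }

    def coarse_features(fingerprint_set):
        # Stand-in for real feature extraction (e.g. pattern class of two digits).
        return (fingerprint_set["right_thumb"], fingerprint_set["left_index"])

    def pool_for(fingerprint_set):
        key = coarse_features(fingerprint_set)
        # If the features are ambiguous, fall back to searching every pool.
        default = [m for pool in SHARD_POOLS.values() for m in pool]
        return SHARD_POOLS.get(key, default)

    query = {"right_thumb": "whorl", "left_index": "loop"}
    print(pool_for(query))  # only the pool2 machines run the expensive match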

These are interesting applications, and I agree that they are reasonable examples of federal IT projects that work. But they are relatively simple to design and build, and they did not have the huge publicity and politically imposed deadlines that the health exchanges have. If any lesson comes from these projects, it's that well defined scopes, low traffic levels and relaxed performance requirements seem to be key to keep federal IT projects under control.

2013-10-19

How to build and launch a federal health care exchange

Since the US government has made a pig's ear, dog's breakfast and sundry other animal preparations of its health care exchange HealthCare.Gov, I thought I'd exercise some 20/20 hindsight and explain how it should (or at least could) have been done in a way that would not cost hundreds of millions of dollars and would not lead to egg all over the face of Very Important People. I don't feel guilty exercising hindsight, since the architects of this appalling mess didn't seem to worry about exercising any foresight.

A brief summary of the problem first. You want to provide a web-based solution to allow American citizens to comparison-shop health insurance plans. You are working with a number of insurers who will provide you with a small set of plans they offer and the rules to determine what premium and deductible they will sell the plan at depending on purchaser stats (age, family status, residential area etc.) You'll provide a daily or maybe even hourly feed to insurers with the data on the purchasers who have agreed to sign up for their plans. You're not quite sure how many states will use you as their health care exchange rather than building your own, but it sounds like it could be many tens of states including the big ones (California, Texas). We expect site use to have definite peaks over the year, usually in October/November/early December as people sign up in preparation for the new insurance year on Jan 1st. You want it to be accessible to anyone with a web browser that is not completely Stone Age, so specify IE7 or better and don't rely on any JavaScript that doesn't work in IE7, Firefox, Safari, Chrome and Opera. You don't work too hard to support mobile browsers for now, but Safari for iPad and iPhone 4 onwards should be checked.

Now we crunch the numbers. We expect to be offering this to tens of millions of Americans eventually, maybe up to 100M people in this incarnation. We also know that there is very keen interest in this system, and so many other people could be browsing the site or comparison-shopping with their existing insurance plans even if they don't intend to buy. Let's say that we could expect a total of 50M individual people visiting the site in its first full week of operation. The average number of hits per individual: let's say, 20. We assume 12 hours of usage per day given that it spans America (and ignore Hawaii). 1bn hits per week divided by 302400 seconds yields an average hit rate of about 3300 hits per second. You can expect peaks of twice that, and spikes of maybe five times that during e.g. news broadcasts about the system. So you have to handle a peak of 15000 hits per second. That's quite a lot, so let's think about managing it.
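
Showing the working, with the peak and spike multipliers being the assumptions stated above:

    visitors_in_week_one = 50_000_000
    hits_per_visitor = 20
    serving_seconds = 12 * 3600 * 7     # 12 usable hours a day for a week

    avg_hits_per_sec = visitors_in_week_one * hits_per_visitor / serving_seconds
    daily_peak = 2 * avg_hits_per_sec   # assumed: twice the average
    news_spike = 5 * avg_hits_per_sec   # assumed: five times the average

    print(round(avg_hits_per_sec))              # ~3300
    print(round(daily_peak), round(news_spike)) # ~6600 and ~16500, call it 15000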

The first thing I think here is "I don't want to be worrying about hardware scaling issues that other people have already solved." I'm already thinking about running most of this, at least the user-facing portion, on hosted services like Amazon's EC2 or Google's App Engine. Maybe even Microsoft's Azure, if you particularly enjoy pain. All three of these behemoths have a staggering number of computers. You pay for the computers you use; they let you keep requesting capacity and they keep giving it to you. This is ideal for our model of very variable query rates. If we need about one CPU and 1GB of RAM to handle three queries per second of traffic, we'll want to provision about 5000 CPUs (say, 2500 machines) during our first week to handle the spikes, but maybe no more than 500 CPUs during much of the rest of the year.
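
Continuing the back-of-the-envelope sums - the per-CPU figure is the assumption above, which you would of course verify by load testing:

    qps_per_cpu = 3             # assumed: 1 CPU + 1GB RAM handles 3 queries/sec
    launch_peak_qps = 15_000
    off_season_qps = 1_500      # assumption: roughly a tenth of launch traffic

    launch_cpus = launch_peak_qps // qps_per_cpu      # 5000 CPUs, say 2500 machines
    off_season_cpus = off_season_qps // qps_per_cpu   # ~500 CPUs

    print(launch_cpus, off_season_cpus)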

The next thought I have is "comparison shopping is hard and expensive, let's restrict it to users who we know are eligible". I'd make account creation very simple; sign up with your name, address and email address plus a simple password. Once you've signed up, your account is put in a "pending" state. We then mail you a letter a) confirming the sign-up but masking out some of your email address and b) providing you with a numeric code. You make your account active and able to see plans by logging in and entering your numeric code. If you forget your password in the interim, we send you a recovery link. This is all well-trodden practice. The upshot is that we know - at least, at a reasonable level of assurance - that every user with an active account is a) within our covered area and b) not just a casual browser.
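
A sketch of that account lifecycle - the state names, the storage and the mailing step are all hypothetical placeholders:

    import secrets

    accounts = {}  # email -> account record; stands in for the real database

    def sign_up(name, address, email, password_hash):
        code = "%06d" % secrets.randbelow(10**6)  # numeric code sent by post
        accounts[email] = {
            "name": name, "address": address, "password_hash": password_hash,
            "state": "pending", "activation_code": code,
        }
        # Stand-in for printing and posting the confirmation letter.
        print("mail letter to %s with code %s" % (address, code))

    def activate(email, code):
        account = accounts.get(email)
        if account and account["state"] == "pending" and account["activation_code"] == code:
            account["state"] = "active"  # only active accounts can browse plans
            return True
        return False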

As a result, we can design the main frontend to be very light-weight - simple, cacheable images and JavaScript, user-friendly. This reduces the load on our servers and hence makes it cheaper to serve. We can then establish a second part of the site to handle logged-in users and do the hard comparison work. This site will check for a logged-in cookie on any new request, and immediately bounce users missing cookies to a login page. Successful login will create a cookie with nonce, user ID and login time signed by our site's private key with (say) a 12 hour expiry. We make missing-cookie users as cheap as possible to redirect. Invalid (forged or expired) cookies can be handled as required, since they occur at much lower rates.
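
One way to implement that cookie is sketched below. I've used an HMAC with a server-side secret rather than true public-key signing, and the field layout is purely illustrative:

    import hashlib, hmac, os, time

    SIGNING_KEY = os.urandom(32)   # server-side secret
    COOKIE_LIFETIME = 12 * 3600    # 12 hours

    def make_cookie(user_id):
        nonce = os.urandom(8).hex()
        payload = "%s:%s:%d" % (nonce, user_id, int(time.time()))
        sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
        return payload + ":" + sig

    def check_cookie(cookie):
        """Return the user id for a valid, unexpired cookie, else None."""
        try:
            nonce, user_id, login_time, sig = cookie.split(":")
        except ValueError:
            return None                        # malformed: bounce to login page
        payload = "%s:%s:%s" % (nonce, user_id, login_time)
        expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return None                        # forged cookie
        if time.time() - int(login_time) > COOKIE_LIFETIME:
            return None                        # expired: force a fresh login
        return user_id

    cookie = make_cookie("user-42")
    print(check_cookie(cookie))  # "user-42"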

There's not much you can do about the business rules evaluation to determine plan costs: it's going to be expensive in computation. I'd personally be instrumenting the heck out of this code to spot any quick wins in reducing computation effort. But we've already filtered out the looky-loos to improve the "quality" (likelihood of actually wanting to buy insurance) of users looking at the plans, which helps. Checking the feeds to insurers is also important; put your best testing, integration and QA people on this, since you're dealing with a bunch of foreign systems that will not work as you expect and you need to be seriously defensive.

Now we think about launch. We realise that our website and backends are going to have bugs, and the most likely place for these bugs is in the rules evaluation and feeds to insurers. As such, we want to detect and nail these bugs before they cause widespread problems. What I'd do is, at least 1 month in advance of our planned country-wide launch, launch this site for one of the smaller states - say, Wyoming or Vermont which have populations around 500K - and announce that we will apply a one-off credit of $100 per individual or $200 per family to users from this state purchasing insurance. Ballpark guess: these credits will cost around $10M which is incredibly cheap for a live test. We provision the crap out of our system and wait for the flood of applications, expect things to break, and measure our actual load and resources consumed. We are careful about user account creation - we warn users to expect their account creation letters within 10 days, and deliberately stagger sending them so we have a gradual trickle of users onto the site. We have a natural limit of users on the site due to our address validation. Obviously, we find bugs - we fix them as best we can, and ensure we have a solid suite of regression testing that will catch the bugs if they re-occur in future. The rule is "demonstrate, make a test that fails, fix, ensure the test passes."

Once we're happy that we've found all the bugs we can, we open it to another, larger, state and repeat, though this time not offering the credit. We onboard more and more states, each time waiting for the initial surge of users to subside before opening to the next one. The current state-by-state invitation list is prominent on the home page of our site. Our rule of thumb is that we never invite more users than we already have (as a proportion of state population), so we can do no more than approximately double our traffic each time.

This is not a "big bang" launch approach. This is because I don't want to create a large crater with the launch.

For the benefit of anyone trying to do something like this, feel free to redistribute and share, even for commercial use.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Update: also well worth reading is Luke Chung's take on this application, which comes from a slightly different perspective but comes up with many similar conclusions on the design, and also makes the excellent usability point:

The primary mistake the designers of the system made was assuming that people would visit the web site, step through the process, see their subsidy, review the options, and select "buy" a policy. That is NOT how the buying process works. It's not the way people use Amazon.com, a bank mortgage site, or other insurance pricing sites for life, auto or homeowner policies. People want to know their options and prices before making a purchase decision, often want to discuss it with others, and take days to be comfortable making a decision. Especially when the deadline is months away. What's the rush?

2013-10-18

Project management: harder than one might think

One of the most startling revelations in the continuing slow-motion carnage of the US federal health exchanges is that the government's Center for Medicare and Medicaid Services (CMS) decided to manage the whole affair themselves:

The people I spoke with did all confirm the importance of one other detail in the Times story: that CMS did not hire a general contractor to manage the exchange project but handled that overall technical management task itself. None of the people I spoke with wanted to get into how this decision was made or at what level, but all of them agreed that it was a very bad idea and was at the core of the disaster they have so far experienced.
This is, I believe, the inevitable result of government agencies (UK and USA specifically, but I'm sure other countries are equally guilty) hiring "generalists", who tend to have liberal arts degrees. Because the subject of these degrees (English, Geography, History, PPE etc.) is unlikely to be directly useful in their owner's regular government work, the story told is that the general communication, analysis and critical thinking skills absorbed are what make that graduate more valuable in the workplace than (say) someone with pre-university qualifications.

This analysis more or less works for government work which involves reporting and planning, and even for some low-level management. Unfortunately, it fails comprehensively when hard technical issues come up. I still remember the expression on the face of a 25-year-old Civil Service fast stream grad (Oxford, PPE) as my grizzled engineering boss tried to explain to her the main engineering issues of the project she was allegedly managing. Picture a dog being taught Greek and you won't be far off. She was so far out of her depth that James Cameron could have been exploring below her. To be fair, you'd get a similar effect by putting an engineering grad in charge of a biochemistry research project, or a chemistry grad in charge of a micro-lending organisation - but at least they'd both be numerate enough to spot errors in the finances.

I note that anyone who proposed that the UK Border Agency head honchos oversee and project-manage the construction of a major bridge or power system would rightly be excoriated in public. "How the hell could they even know where to start? What do they know about compressive strength / high voltage transmission?" Why, then, do we assume that IT projects are any easier for non-experts to manage? I suspect the answer lies in a combination of the infinite malleability of software, and the superficial familiarity that most people have with using web interfaces (and even tweaking HTML themselves). After all, it's just words and funny characters, how hard could it be?

Allow me to link to my favourite XKCD cartoon ever:

Back to the exchanges: there's about as much reason to believe that the CMS has expertise in project management as there is to believe that I'm capable of designing a line of clothes to rival the products of Chanel and DvF. The fact that I can draw something that might be recognisable as a dress (if you squint a little) has absolutely no relevance to being able to design something that millions of people would want to wear - and that can be made for a reasonable sum of money while being resilient to the huge range of stresses and strains imposed on clothing by its wearers. What appalls me is that, given the quote above, no-one stopped the CMS from taking on the project management role despite the fact that everyone seemed to know that it was a terrible idea. Either this was a dastardly covert Tea Party guerilla plot to sabotage the exchanges, or there was a serious break-down in communication. Health and Human Services secretary Kathleen Sebelius is ultimately on the hook for the failure of the health exchanges; did she just not care that they were doomed to fail, or was there someone in the upper chain of reporting who knew what happens to the bearer of bad news and hence decided that discretion was preferable to being unceremoniously fired?

Sebelius, incidentally, is the first daughter of a governor to be elected governor in American history. She has a liberal arts BA and a master's in Public Administration. The CMS Chief Operating Officer is Michelle Snyder who holds advanced degrees in Clinical Psychology and Legal Studies and Administration. She has been a manager in the HHS budget office and had assignments with the Office of Management and Budget, Congress, the Social Security Administration, and as a management consultant in the private sector.

I'm sure liberal arts majors and management consultants have an important role to play in modern society. That role does not, apparently, include being in charge of a major IT project. Not only are they incompetent to run it, it seems that they are incompetent to appoint someone competent to run it. Personally, I'd have started with Richard Granger, ex-head of the UK NHS Connecting for Health program that pissed £10-15 billion down the drain for no result. Yes, his track record is beyond abysmal - on the other hand, a) he now knows first-hand all the mistakes you shouldn't make and b) when you announce his appointment the expectations on your project will plunge so low that even delivering a badly-working underperforming system will impress people.

2013-10-17

Drop dead dates

I had the educational privilege, a few years ago, to watch a team in my workplace try to roll out a new business system to replace an existing system which had worked well for a while but grown gnarled, unmaintainable and no longer scaled to likely future demands. Well aware of the Second System Effect they made the new system feature-for-feature compatible, and even had a good stab at bug-for-bug. However, it was a complex problem and they spent many months spinning up a prototype system.

Eventually their manager decided that they needed to run something in production, so they picked a slice of the traffic on the existing business system that was representative but not critical, and set a target deadline of a week hence to launch it. The developers were privately rather twitchy about the prospect, but recognised the pressure that their manager was under and were willing to give it a shot. Come switchover day the new system was enabled - and promptly fell on its face. The developers found the underlying bugs, fixed them and restarted. It ran a little longer this time, but within a few hours fell over again. They fixed that problem, but within 12 hours it became clear that performance was steadily degrading with time...

The developers had a miserable time during the subsequent week. I got in pretty early as a rule, but the dev team was always in (and slurping coffee) by the time I arrived, and never left before I headed home. The bugs posted in their area steadily accumulated, the system repeatedly fell down and was restarted with fixes. The team were living on the ragged edge, trying to keep the system up at the same time as triaging the bugs, adding tests and monitoring to detect the bugs, and trying to measure and improve the performance. This was analogous to changing the wheels on Sebastian Vettel's F1 car mid-lap - one hiccup and either you lose a limb or the car embeds itself in a track barrier. It became clear that the team's testing system had huge gaps, and their monitoring system couldn't generally detect failures happening - you could more or less infer what had caused the failure by checking the logs, but someone had to mail the team saying "hey, this job didn't work" for the team to look at the logs in question.

After a fortnight of this, with the team having pulled an average of 80-90 hour weeks, their manager sensibly realised that this approach was not sustainable. He announced the switch back from the new system to the old system effective next day, and immediately shaped expectations by announcing that they would not be switching back to the new system before three months had passed. The team breathed a sigh of relief, took a few days off, and re-scheduled themselves.

Once the system was pulled offline, the developers made reasonably rapid progress. They'd accumulated a host of bug reports, both in functionality and performance, and (more importantly) had identified crucial gaps in testing and monitoring. For each functional and performance bug, they first verified that they could reproduce it in their testing system - which was where they spent the bulk of their development time for several weeks after turndown - and that the monitoring would detect the condition and alert them appropriately. They triaged the bug reports, worked their way through them in priority order, built load tests that replicated the system load from normal operation and added metrics and monitoring on system latency. The time spent running in production had provided a wealth of logs and load information which gave them a yardstick against which they could measure performance.

After a few months they felt ready to try again, so they spun up the fixed system and loaded in the current data. This went much more smoothly. There were still occasional crashes, but their monitoring alerted them almost instantly so they could stop the system, spend time precisely characterising the problem, fix it, test the fix, deploy the fix and restart. The average time between crashes got longer and longer, the impact of failures got smaller and smaller, and after 6 months or so the system achieved its stated goal of greater scale and performance than its predecessor. However, all this was only possible because of the decision to roll back its initial roll-out.

I was reminded of this today when I saw that informed insiders were estimating the US federal healthcare exchanges as "only 70% complete" and needing "2 weeks to 2 months more work" to be ready. Since there are several tens of millions of potential users who need to register before January 1st, this looks to be a precarious situation. It's doubly precarious when you realise that "70% complete" in a software project is code for "I have no idea when we're going to be done." My personal rule of thumb is that "90% complete" means that you take the number of weeks spent in development so far, and expect the same again until the system is working with the specified reliability.

Megan McArdle, whose coverage of the health care exchanges has been consistently superb, makes a compelling case that Obamacare needs to set a deadline date for a working system, and delay the whole project a year if it's not met:

...given that they didn't even announce that they were taking the system down for more fixes this weekend, I'm also guessing that it's pretty bad. Bad enough that it's time to start talking about a drop-dead date: At what point do we admit that the system just isn't working well enough, roll it back and delay the whole thing for a year?
She's right. If the system is this screwed up at this point, with an unmoveable deadline of January 1st to enroll a large number of people, any sane project manager would move heaven and earth to defer the rollout. In the next 6-9 months they could address all the problems that the first roll-out has revealed, taking the time to test both functionality and performance against the traffic levels that they now know. There's no practical compulsion to run the exchanges now - the American healthcare system has been screwed up for several decades, the population is used to it, waiting another year won't make a great difference to most voters.

Chance of this happening? Essentially zero. The Democrats have nailed their colours to the mast of the good ship Affordable Care Act, and it's going out this year if it kills them. If they hold it over until next year then the full pain of the ACA's premium hikes will hit just before the mid-term elections, and they will get pummelled. They're hoping that if they launch now then the populace will be acclimatised to the costs by next November. As such, launching this year is a politically non-negotiable constraint. Politics, hard deadlines and under-performing software - a better recipe for Schadenfreude I can't imagine.

2013-10-10

American habits the UK should adopt - jailing politicians

And I'm not talking about a few months in chokey for falsifying tens of thousands of pounds of expenses, or eight months for perjury. Ex-mayor of Detroit Kwame Kilpatrick is looking at twenty-eight years in the slammer for running a criminal enterprise through the mayor's office:

Kilpatrick used his power as mayor … to steer an astounding amount of business to Ferguson. There was a pattern of threats and pressure from the pair.
This wasn't to protect minority contracts. In fact, they ran some of them out of work.
He was larger than life. He lived the high life. He hosted lavish parties. He accepted cash tributes. He loaded the city payroll with family and friends.
He had an affair with his chief of staff, lied about it, and went to jail for perjury.
Note: he's already done time for perjury. The criminal enterprise sentence is on top of this...

I'd personally add a year to the sentence for membership of Mayors Against Illegal Guns, which is rampant posturing if I've ever seen it... Still, if we had decade-long jail sentences for criminal financial malfeasance in a public office, I wonder if it would put a brake on trough-wallowing politicos? Or do they inevitably believe "it can't possibly happen to me"?

2013-10-08

Caveat emptor

The Chinese are sternly warning the Americans not to default on their debt:

Mr Zhu said that China and the US are "inseparable". Beijing is a huge investor in US Treasury bonds.
"The executive branch of the US government has to take decisive and credible steps to avoid a default on its Treasury bonds," he said.
Google found me the major foreign holders of US debt as of July 2013:
  1. China: $1.3 trillion
  2. Japan: $1.2 trillion
  3. Caribbean banking centers: $300 billion
  4. Oil exporters: $260 billion
  5. Brazil: $260 billion
I'm reminded of the maxim: "Borrow $1000 and the bank owns you; borrow $1 million and you own the bank." China's GDP is about $8 trillion, so the US debt that it owns is about 16% of its GDP. Japan's GDP is about $6 trillion, so the US debt that it owns is 20% of its GDP. Is China seriously concerned that the US might default on its debt? If Japan is similarly concerned, it seems to be keeping very quiet.

I expect that the problem arises from the Chinese banks relentlessly trying to get out of yuan before the Chinese economic bubble starts to pop. There are huge flows of money out of China to buy dollar-denominated assets; million-dollar houses all over Silicon Valley are being bought up for cash by Chinese buyers. As a data point, friends of mine who just put a $800K townhouse on the market in the South Bay were almost immediately given a cash offer by a Chinese couple wanting to buy a house for their daughter to live in when she goes to college in late 2014. If the US were to even threaten default, the dollar would drop significantly in value - in the past three months alone, the pound has risen from $1.50 to $1.60 due to the concern about the US political situation. If Chinese banks have leveraged investments in dollar-denominated assets, the shockwaves from even a technical US default could land them in very hot water.

2013-10-06

Glenn Greenwald - weasel

Watching Glenn Greenwald being interviewed on BBC Newsnight by Kirsty Wark it struck me that he's remarkably blasé about US and UK secrets leaking out to foreign intelligence services. Up to now I've given him the benefit of the doubt that he thought he was doing the right thing, but this interview made it painfully clear what an arrogant little weasel Greenwald actually is.

Wark did a pretty good job pressing him on his motivations and the implications of the leaked data, not to mention the safety of the remaining encrypted data. Greenwald asserted that he and the Guardian had protected the data with "extremely advanced methods of encryption" and he is completely sure that the data is secure. Well, that's fortunate. No danger of anyone having surreptitiously planted a keylogger in either software or hardware on the relevant Guardian computers? No danger of one of the Guardian journalists with access having been compromised by a domestic or foreign security service? Greenwald seems remarkably sure about things he can't practically know about. Perhaps he just doesn't give a crap.

Wark was curious (as am I) about Greenwald's recent contacts with Snowden and Snowden's current welfare. Greenwald claimed that Edward Snowden has protected the data with "extreme levels of encryption", proof against cracking by the NSA and the "lesser Russian intelligence agencies". Russia being a country where math prodigies are ten a penny, I fear Greenwald may be underestimating their cryptography-fu. Asserting that Snowden didn't spend his life fighting surveillance just to go to Russia and help them surveil, Greenwald stated that the evidence we know makes it "ludicrous" to believe that the Russians or Chinese had access to Snowden's data.

Hmm. Glenn, I suggest you Google rubber hose cryptanalysis. If I were the Russian FSB, given that they have effectively complete access to and control over Snowden, I'd be extremely tempted to "lean" on him until he gave up the keys that decrypted his stash of data. Heck, why wouldn't they? They'd be practically negligent not to do so. Nor are they likely to shout from the rooftops if they have done so; they're far more likely to exploit the data quietly and effectively while conveniently being able to blame Greenwald and co. for any leaks.

I invite you to contrast this with Greenwald's note that the UK Government "very thuggishly ran roughshod over press freedoms, running criminal investigations and detaining my partner." Detaining David Miranda for nine hours was not necessarily a good plan by the UK, but he was a foreign national and was not a journalist as far as I (and the Guardian) am aware. So Greenwald's reference to press freedom is a little disingenuous. As far as "running roughshod" goes, Greenwald can only pray that he doesn't end up in the hands of the FSB... as Guardian journalist Luke Harding could tell him:

Luke Harding, the Moscow correspondent for The Guardian from 2007 to 2011 and a fierce critic of Russia, alleges that the FSB subjected him to continual psychological harassment, with the aim of either coercing him into practicing self-censorship in his reporting, or driving him to leave the country entirely. He says that the FSB used techniques known as Zersetzung (literally "corrosion" or "undermining") which were perfected by the East German Stasi.

The Russian affairs expert Streetwise Professor has been following the Snowden saga with a critical eye for a while now, believing that he's being made to dance to Putin's tune. Most recently he noted that we have no recent statements known to come from Snowden; even his most recent statement to the UN was read out on his behalf, there's no proof that the statement came from Snowden himself, and indeed the text suggests Greenwald and other Snowden "colleagues" had a hand in it. If the Russians are treating Snowden well, why isn't he making regular appearances on TV or YouTube?

It must be nice to be as arrogantly cocksure as Greenwald. I bet Snowden for one would be happy to change places with him right now.

2013-10-05

Some brief questions for the architects of healthcare.gov

There are various theories bouncing around about why the HealthCare.gov website is having such problems with capacity:

One possible cause of the problems is that hitting "apply" on HealthCare.gov causes 92 separate files, plug-ins and other mammoth swarms of data to stream between the user's computer and the servers powering the government website, said Matthew Hancock, an independent expert in website design. He was able to track the files being requested through a feature in the Firefox browser.
That's not very promising. You'd somewhat hope that one of the pivot actions of a website was as lightweight as possible from the client side; there are going to be enough problems with the server persisting everything into its database. The page from which you apply should have as much of the JavaScript as possible already loaded (and appropriately marked as cacheable so it's not re-loaded when the page changes). Loading everything in the time-critical period when you're trying to lock a table (and I bet they're not row-level locking) in the database to write the user application is asking for trouble.
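
"Appropriately marked as cacheable" means something like the following sketch, using only Python's standard library; the path and header values are illustrative, not what HealthCare.gov actually serves:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class StaticHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path.startswith("/static/"):
                body = b"/* bundled, versioned JavaScript would go here */"
                self.send_response(200)
                self.send_header("Content-Type", "application/javascript")
                # Long-lived caching: the browser (and any CDN in front) keeps
                # the file, so clicking "apply" doesn't re-fetch dozens of assets.
                self.send_header("Cache-Control", "public, max-age=86400")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8000), StaticHandler).serve_forever()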

Anyway, that's just speculation. But here are the questions for the HealthCare.gov folks:

  1. How many concurrent users did you determine you could support on the system with the resources at launch?
  2. What were the system bottlenecks that imposed that limit?
  3. What were your contingency plans to add resources to temporarily raise system capacity as required at launch?
  4. What was your estimate of visitor traffic to the site on launch week, based on the anticipated heavy country-wide publicity?
  5. What performance degradation did you anticipate when 2x and 10x the maximum users attempted to use the site?
  6. What plan did you have to divert users by geo-located IP and / or run the system in a degraded mode in the event of overload, to focus the available performance on people who might actually want to buy healthcare?
  7. If you hadn't done the performance and load testing to determine the above numbers and evolve the above plans before launch, would you care to explain why not and what you thought you were being paid to do?
In short; did you determine in advance when and how the system was going to fall over with the traffic you were going to get and how to get it back on its feet, and if you didn't then why not?

If the Senate is looking for questions to ask the HealthCare.gov architects when the inevitable reprisals come around, they could do worse than start with these.

2013-10-04

Sneak peek: Facebook's Q3 results

I'm a moderately active and engaged Facebook user (to my eternal shame), so in the run-up to the Facebook Q3 results later this month I thought I'd see what ads Facebook are offering me.

On desktop: zip. None. Nada. Niente. Wala. Nichts. Plenty of recommended pages, but nothing that looks like it would give anyone even a sniff of revenue. It took several reloads before I triggered a single sponsored link: "Online savings account" from GE Capital Bank. Not even remotely relevant. Note that now and again I click particularly irrelevant ads to hide them with appropriate feedback (irrelevant / duplicate / against my views etc) but this month FB seems to have run out of anything to show me.

On mobile: I scrolled down several days in my stream and there was nothing. No ads at all. So, this vaunted Facebook mobile advertising...

Facebook is currently priced at $51.07 a share, up from sub-$30 3 months ago. What. The Fuck? I wait with keen interest to see what they show this quarter in mobile revenue. Either they're making a heck of a lot from selling social graph information to clients for advertising elsewhere, or they're about to get their pants pulled down and rogered with the traditional curare-tipped iron fencepost.

Dancing around the Great Firewall of China

It seems a little unfair to give Apple heat over its China policies, given how much employment it creates in China, but apparently Apple have censored a Chinese firewall-avoiding app:

Chinese web users have criticised Apple after the company pulled an iPhone app which enabled users to bypass firewalls and access restricted internet sites. The developers of the free app, OpenDoor, reportedly wrote to Apple protesting against the move. [...] Apple asks iPhone app developers to ensure that their apps "comply with all legal requirements in any location where they are made available to users".
Aha. But the problem here is that China does not acknowledge the existence of the Great Firewall of China (GFW); any mention of it in a blog post or other social media is enough to get that posting censored. China certainly has strong legal requirements - real people must be identifiable behind any Internet identity on a China-hosted service, and foreign firms must "partner" with a local firm for Internet "compliance" - and it freely blocks traffic leaving China (via the GFW) that could retrieve user-generated content on sensitive topics. But from a legal perspective the GFW itself cannot be the subject of a legal violation, because the GFW does not officially exist - and you can't say that it does, since the GFW will censor your traffic if you try to do so across the border. Is your head hurting yet?

This, by the way, is perfectly pragmatic behaviour from Apple. They like being able to do business in China, so it's not enough to satisfy the letter of the law - they want to keep the Chinese government happy. As such, dropping GFW-circumventing apps from the App Store makes perfect business sense. It is, however, particularly weasel-like for them to hide behind "legal requirements", or avoid the topic altogether. If they want to play ball with the Chinese government for commercial reasons - and it's their fiduciary duty to improve their commercial prospects - why can't they just say so? (Yes, this is a rhetorical question.)

The OpenDoor app developers purport to be bemused:

"It is unclear to us how a simple browser app could include illegal contents, since it's the user's own choosing of what websites to view," the email says.
"Using the same definition, wouldn't all browser apps, including Apple's own Safari and Google's Chrome, include illegal contents?"
Yes, they could, in theory. But browsers use well-known protocols: HTTP, which is clear text and which the GFW can scan for illegal content like "T1ANANM3N []"; and HTTPS, which is secure but can be blocked either by destination IP or just universally. OpenDoor probably (I haven't looked) does something sneaky to make its traffic look like regular HTTP with innocuous content. The GFW could, with some work, drop OpenDoor traffic based on its characteristics and/or destinations, but it would always be playing catch-up. Instead, Apple "voluntarily" (we don't know whether any Chinese government pressure was formally applied) drops it from the App Store in China. Everyone's happy! No-one gets any distressing news about human rights abuses in China, and gatherings of subversives are prevented.
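To make the cleartext-scanning point concrete: keyword filtering over plain HTTP is conceptually no more than the toy check below. The real GFW is far more sophisticated (DNS poisoning, IP blocking, TCP resets), but the asymmetry between cleartext and encrypted traffic is the same.

```python
# Toy illustration of keyword filtering over cleartext HTTP payloads.
# The keyword list is purely illustrative.
BLOCKED_KEYWORDS = [b"tiananmen", b"falun"]

def should_drop(http_payload: bytes) -> bool:
    """Return True if a cleartext HTTP request/response should be dropped."""
    lowered = http_payload.lower()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

# An HTTPS payload is opaque, so a censor can only act on metadata
# (destination IP) or block the protocol wholesale - which is why
# circumvention tools try to make their traffic look like something else.
print(should_drop(b"GET /wiki/Tiananmen_Square HTTP/1.1\r\nHost: example.org"))  # True
print(should_drop(b"GET /weather/today HTTP/1.1\r\nHost: example.org"))          # False
```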

Apple are bending over to help the Chinese government, and that's perfectly acceptable in a capitalist society - let's just be clear that it's voluntary, and in search of profits.

2013-10-01

Try again later for healthcare

The much-anticipated switch-on of the Affordable Care Act ("Obamacare") health exchanges happened today, and as anticipated, it was not exactly a smooth ride:

And those going to the federal exchange site, which is handling enrollment for 36 states that didn't fully establish their own exchanges are greeted with this message: "Health Insurance Marketplace: Please wait. We have a lot of visitors on our site right now and we're working to make your experience here better. Please wait here until we send you to the login page. Thanks for your patience!"
California also had quite the surge of users:
The agency that runs the [California State] exchange, Covered California, said it received 1 million hits on the website during the first 90 minutes after the exchange opened. By 3 p.m., the site had received 5 million hits and the two service centers had received 17,000 phone calls.
There are 5400 seconds in 90 minutes, so that's about 200 hits per second. A respectable number, to be sure; many of those will be people just looking at the (static, small, hand-written, easily cached) front page, which is cheap to serve. It would be more interesting to see a rate of user sign-ups and quote serving, with the fraction of errors served in each case, but I guess that's too much to hope for. (For those interested, Covered California is using Microsoft IIS/7.5 and ASP for at least its front page.) I was amused to see the range of languages covered: Spanish, Tagalog, Vietnamese and Chinese as you'd expect, but no Indian languages I recognised. Arabic, Farsi, Russian and Armenian but not French, Portuguese or German. Korean, Hmong and Khmer but not Thai. It seems fairly clear which immigrants learn the language and which don't...
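For anyone checking the arithmetic:

```python
# Back-of-envelope rate for Covered California's first 90 minutes.
hits = 1_000_000
seconds = 90 * 60          # 5400 seconds in 90 minutes
print(hits / seconds)      # ~185 hits/second - call it 200
# The 5 million hits "by 3 p.m." can't be turned into a rate without
# knowing exactly when the exchange opened, so I haven't tried.
```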

The "please try again later" screen is actually entirely the right approach. The current traffic spike is a short-lived phenomenon - the widespread publicity has led a tidal wave of real users to the exchange websites in a short time window, along with plenty of other "browsers" who are interested in seeing what kind of plan quotes are available. The right thing to do is to tell users to try again later. Real exchange users aren't going to go elsewhere for their plans, and they're not really time-limited - they will try again later in the evening, or tomorrow, or next weekend. The browsers will be put off and either not come back at all (preferred) or come back much later (manageable). In the meantime the exchange operators will have discovered the bottlenecks in their system, and can make tactical fixes for quick wins and start planning for the more involved capacity fixes that may take weeks or months to implement.

Based on the above I'd say that the ACA exchange rollout has gone roughly as well as could be expected. As the user load flattens out over the next few days and weeks the exchanges should work better, and downtime will be less frequent and shorter.

Information Week has a nice dissection from Oregon of why operating a health exchange is hard:

Presenting a product -- an insurance policy -- isn't the hard part. The hard part is figuring out which federal and state programs and tax credits a person or family is eligible for. Getting that part right takes creating an extremely complex rules engine.
About 1,700 individual rules affect eligibility for health insurance subsidies in Oregon.
For each user you have to gather their circumstances - family size, age, income, location - which is fairly straightforward. You then have to write this into a distributed database, handling de-duplication where required. Then you hand the data off to the rules-matching engine - probably the computational workhorse and the main memory and disk hog of the system - and wait for it to come back with an offer. There will be separate links to the insurance providers, which will receive feeds of what they are prepared to bid for customers and publish notifications of which bids they made, won and lost.
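To be clear about the shape of the thing, here is a toy sketch of a rules engine - two invented rules with made-up thresholds standing in for Oregon's 1,700:

```python
# Shape of an eligibility rules engine: each rule is a predicate over the
# applicant's circumstances plus the programme it unlocks. The rules and
# thresholds below are invented placeholders, not real ACA rules.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Applicant:
    family_size: int
    age: int
    annual_income: float
    state: str

@dataclass
class Rule:
    name: str
    applies: Callable[[Applicant], bool]

RULES: List[Rule] = [
    Rule("hypothetical_premium_subsidy",
         lambda a: a.annual_income < 25_000 * a.family_size),
    Rule("hypothetical_state_program",
         lambda a: a.state == "OR" and a.age < 26),
]

def eligible_programs(applicant: Applicant) -> List[str]:
    """Run every rule against the applicant and collect the matches."""
    return [rule.name for rule in RULES if rule.applies(applicant)]

print(eligible_programs(Applicant(family_size=3, age=24,
                                  annual_income=40_000, state="OR")))
```

The hard part isn't the evaluation loop; it's getting 1,700 interacting rules specified correctly and keeping them correct as the regulations change.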

The real technical challenge for the exchanges will come in the next few months, and it won't be the performance/load issue which has been making headlines today. There will be the inevitable security breaches, usually by insiders trying to sell on personal data. There will be discrepancies between the insurance bid record held by the exchanges and that recorded at the insurers' backends - and hot disputes about which is correct. There will be IT failures on both sides, leading to users getting too-high or stupidly-low offers for insurance, and to insurers being pressured to honour too-good-to-be-true offers which they (or the exchange on their behalf) made and which a wave of consumers snapped up.
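Catching those discrepancies is, at heart, a reconciliation job: periodically diff the exchange's record of each application against the insurer's and flag anything that doesn't match. A toy sketch, with an invented record layout:

```python
# Minimal reconciliation sketch: compare the exchange's view of each bid
# with the insurer's, keyed by application ID. Record layout is invented.
def reconcile(exchange_bids: dict, insurer_bids: dict) -> list:
    """Return (application_id, exchange_record, insurer_record) for every mismatch."""
    discrepancies = []
    for app_id in exchange_bids.keys() | insurer_bids.keys():
        ours, theirs = exchange_bids.get(app_id), insurer_bids.get(app_id)
        if ours != theirs:                # missing on one side also counts
            discrepancies.append((app_id, ours, theirs))
    return discrepancies

print(reconcile(
    {"A123": {"plan": "silver", "premium": 310.0}},
    {"A123": {"plan": "silver", "premium": 290.0},   # premiums disagree
     "B456": {"plan": "bronze", "premium": 180.0}},  # exchange has no record
))
```

The code is trivial; the dispute about whose record wins is not.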

(Update: those security breaches seem to be coming earlier than I thought:

One exchange staffer’s simple mistake gave insurance broker Jim Koester access to an Excel document of Social Security numbers, names, addresses and other personal data for whole a list [sic] of insurance agents.
Oopsie.)

The policy problems for the exchanges and insurers will start to become more visible next year. Insurers aren't allowed to discriminate on the basis of pre-existing conditions; as such, the first wave of insurance buyers will be heavily seeded with sick or disabled people with no current insurance. The insurers will be hoping to pull in as many healthy people as possible (thanks to the "individual mandate" making people buy ACA-compliant insurance or face a fine). However, that fine is relatively small for the first year (the greater of $95 per adult plus $47.50 per child, or 1 percent of family income), giving healthy adults a good reason to avoid the exchanges for at least a year. The fine will gradually rise, and it will be interesting to see what level of fine is needed to push most people onto the exchange. In the interim, the insurers are going to be stuck with a wave of very expensive people requiring treatment. There is a natural throttle, though - the number of doctors and hospitals willing to accept those insurers' reimbursement rates is going to fall steadily, so the insurers will have some limit on treatment rates. But then there are going to be widespread complaints that people with insurance from the exchanges can't get timely treatment - a complaint that the insurers can only relieve by raising their reimbursement rates, which isn't going to happen.
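Using only the figures quoted above, and ignoring details I haven't gone into (the per-family cap and the income filing threshold also apply), the first-year fine works out roughly like this:

```python
# First-year individual-mandate fine, using only the figures quoted above;
# the real rule also has a per-family cap and an income filing threshold,
# which are ignored in this rough sketch.
def penalty_2014(adults: int, children: int, family_income: float) -> float:
    flat = 95.0 * adults + 47.50 * children
    return max(flat, 0.01 * family_income)

# A healthy single adult on $40,000 pays about $400 - cheap next to a year of premiums.
print(penalty_2014(adults=1, children=0, family_income=40_000))
```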

Hmm, healthcare rationed by medical provider availability and payer-authorised treatments. This is starting to sound awfully familiar... Think of the NHS but without its effective monopoly on health care purchases to keep down prices.

The $1bn question is whether the insurers are willing to stay on the exchange and absorb the costs of the treatments of their disproportionately sick initial customers, in anticipation of future revenue from young, healthy adults compelled to buy overpriced (for them) and needlessly comprehensive "insurance" - in effect, an inefficient payment plan for all medical costs. This is going to hinge on what they think the federal government is going to give them in subsidies (tax breaks for poorer payers) and how much purchasing leverage they have on the existing health system. Since bureaucracy is one of the biggest problems in the American health care system, and Washington politics is notoriously dysfunctional and lobby-driven, I don't expect this to work out too well.

I should add that there will be beneficiaries of this system - if you are poor and sick, but have enough money to cover your year's deductible, then it's a potentially life-saving bonanza. I fear, though, that help for these people - who undoubtedly need help - is going to come at a high price for everyone else.