2013-09-27

Off the shelf software - not a panacea

I read with interest a discussion at Mr. Worstall's place about the forthcoming IT disaster that will be the "Obamacare" (Affordable Care Act) insurance exchanges. Commenter Steve Crook made a point that I thought deserved further attention:

Part of the problem is that software development is still using basic tools and hand crafting everything. Things have improved a lot in the last decade or so, but we're still a long way from the 'engineering' part of software engineering.
You'll know things have changed when software can be assembled from a catalog of standard parts and has an MTBF.

I'm 95%+ behind his first two sentences, but the last one deserves more scrutiny, rampant speculation and blatantly biased opinion. What better medium than a blog post to do so?

Software engineering is hard, which is why most programmers don't bother with it. We see the results all around us in ubiquitous IT failures. For the (relatively) few cases where failures really do matter in a reputational, safety and/or financial sense, software engineering really does come into its own. Let's examine those cases to see why software engineering matters and what traps lie in wait for users of off-the-shelf components ("COTS" - commercial off-the-shelf systems). We'll take the Affordable Care Act (ACA) per-state insurance exchanges as an example.

For those unacquainted with the ACA, one of its key aims is to make affordable insurance plans available to the masses. Many people will obtain their health insurance via a scheme with their employer, but this is only available to full-time employees, and such plans are subject to strict criteria on minimum coverage - which is why many US employers are cutting employees to under 30 hours a week in order to make them part-time and avoid the expense of these plans, but I digress. People over 65 or so are covered by the existing Medicare system, and a subset of poor people are covered by Medicaid. Let's assume that for whatever reason there's a large rump of families and single people under retirement age who need coverage; how do they obtain it? ACA requires that each state have an insurance "exchange" on which various insurers offer ACA-compliant plans, and the uninsured are required to have coverage or pay a fine.

People unfamiliar with the USA - and I include perhaps 90% of the UK population in that group, the American penetration of TV and cinema notwithstanding - fail to comprehend the importance of the state (not "State") in America. While US states are not homogeneous, there are well-known stereotypes of states which are accurate enough for generalisations. Florida is full of retirees. Everyone in Texas has a gun, even the florists (other than descendants of Quakers). Minnesota is a mix of Scandinavians and hardcore Islamists. Massachusetts is full of liberals. North Dakota is isolated, snow-blown, and populated by flint-eyed people who would have no problem feeding you into a woodchipper. And so on. The state takes precedence in the US Constitution ("The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.") and different states have very different laws on employment, welfare etc. A per-state approach in the provision of healthcare exchanges therefore makes a certain amount of sense.

That said, states are still quite large. In many cases they can be thought of as entire countries with populations in the tens of millions. Each state is therefore going to have to implement an insurance exchange which can handle millions of unique users - with reasonably strong authentication requirements, since letting user X have access to the medical records of user Y is a no-no. Since users are disproportionately likely to be poor and poorly-educated, this poses its own challenges. The exchanges will receive traffic at a steady rate (people moving in and out of state, changing work and insurance status) combined with sudden spikes (annual or semi-annual application deadlines). They need to operate at high traffic rates with the relatively small number of insurance providers. You don't want this exchange to be down for any extended period, since when people are looking for healthcare insurance it's often at compressed timescales and under Government mandate.

Why wouldn't you use off-the-shelf software for such a system? Well, in some cases you would. Because of the uptime requirements, this system will need to be distributed - different instances in different physical locations such that a power/network failure (happens all the time) or maintenance period won't take out your entire site, so you'll likely use off-the-shelf replicated database solutions. Because of the authentication and security requirements you'll use off-the-shelf open source crypto libraries like OpenSSL. Because you want your individual hardware platforms to be reliable you'll use an off-the-shelf commercially supported Unix like Red Hat Enterprise Linux. So far, so good - these are all really generic services used by thousands if not millions of customers. They might break, but there's strong commercial pressure for them to a) be really careful about testing updates and b) update whenever they find a critical bug.
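To put a rough number on why you distribute the system across locations, here is a minimal sketch in Python. It assumes that sites fail independently of each other (optimistic in practice, since shared software bugs and bad config pushes take out every site at once), and the 99% per-site availability is a made-up figure for illustration, not a measurement of anything.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes failures are independent, which real deployments only approximate.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only if every site is down at the same time.
    return 1.0 - (1.0 - per_site) ** sites


HOURS_PER_YEAR = 24 * 365

# Hypothetical per-site availability of 99%, chosen purely for illustration.
for n in (1, 2, 3):
    availability = combined_availability(0.99, n)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{n} site(s): availability {availability:.6f}, "
          f"about {downtime_hours:.2f} hours down per year")
```

The arithmetic flatters you: the gain from extra sites only materialises if the replicated database and the failover logic themselves behave, which is exactly the part you have to test.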

The problem comes when you move to a higher level of functionality. The rule of thumb is that the more users of the software while it is being actively maintained and developed, the more reliable the software over time. Software which is badly maintained undergoes a brutally Darwinian process where the bug reports from irate users steadily increase to the point where the remaining developers eventually give up their 96 hour weeks and slink away to other contracts, leaving a fetid mess of software. Open source software, by contrast, can always be fixed by someone sufficiently motivated. Not all users are sufficiently motivated, but for some use cases you can find enough interest for the software to be iterated into moderately robust usability. The problem with the ACA exchanges is that they are a unique application - no-one else is trying to manage a government-supervised health insurance exchange - and they are limited to 50 clients (the US states) which vary from 38M people to 600,000 people in size - nearly two orders of magnitude. What works well for Wyoming and Vermont won't be suited for California and Texas, and every state has different population profiles and healthcare laws.
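As a quick sanity check on the "nearly two orders of magnitude" figure, the calculation can be sketched in a few lines of Python. The two population numbers are the approximate ones quoted above; nothing else is assumed.

```python
import math

# Approximate populations quoted in the post.
largest_state = 38_000_000
smallest_state = 600_000

# How many times bigger the largest client is than the smallest.
ratio = largest_state / smallest_state

# Orders of magnitude = base-10 logarithm of that ratio.
orders_of_magnitude = math.log10(ratio)

print(f"largest / smallest = {ratio:.1f}x")
print(f"orders of magnitude = {orders_of_magnitude:.2f}")
```

A system sized comfortably for the small end has to cope with roughly 60 times the users at the large end, which is rarely a matter of just buying a bigger box.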

Worse, since each state will be trying to build its own exchange, you'll likely end up with 3-5 large firms supplying the 50 states with various "tailored" solutions based on their own custom models. Each state will have to cope with exchanging some data with other states, as people move across state boundaries; that's 49 moving targets for the exchange backends to cope with. The ACA restrictions will change year after year, so the exchanges will need to be flexible. Within any one state the exchanges will need to exchange data frequently and securely with the insurance providers within that state, which is going to be the real headache. Finally, in an effort to "prove" the efficiency of the exchanges, more and more reports will be required to be run on exchange activity and members, loading the exchange backends with queries and stressing the access protection mechanisms. This is before we consider the problem of malicious attempts to compromise the exchange front-ends or knock them over with denial-of-service attacks, and the risk of compromised exchange maintainers dumping data out to sell.
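The "49 moving targets" point is simple combinatorics, sketched below in Python. The only inputs are the 50 states and the 3-5 vendors guessed at above; the assumption that each pair of vendors needs its own integration is mine, and the real number depends on how much the "tailoring" diverges per state.

```python
from math import comb

STATES = 50

# Every state has to talk to every other state's backend.
peers_per_state = STATES - 1

# Distinct state-to-state pairings, counting A<->B only once.
pairwise_links = comb(STATES, 2)


def cross_vendor_pairings(vendors: int) -> int:
    """Distinct pairs of different vendors whose systems must interoperate."""
    return comb(vendors, 2)


print(f"peers per state: {peers_per_state}")
print(f"state-to-state pairings: {pairwise_links}")
for vendors in (3, 4, 5):
    print(f"{vendors} vendors: {cross_vendor_pairings(vendors)} cross-vendor pairings")
```

If the vendors shipped genuinely standard systems, the integration problem would collapse from over a thousand pairings to a handful. The more each state's version is customised, the closer you drift back to the larger number.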

In isolation, you can probably find software solutions to each of these problems. The problem will be in glueing together these solutions into a coherent, working and maintainable system. For instance, if you spend 2 months incorporating version X of a data querying system and then the manufacturer releases version Y, what do you have to do to ensure that version Y does not introduce insecurities, incompatibilities or performance decreases into your exchange? How do you try out version Y safely? If it doesn't work out, how do you roll it back - bearing in mind that you may have had to reformat your data to be compatible with version Y? All these are system-level problems that your exchange operators need to solve. How do you know that your data storage system actually scales in practice to the number of concurrent users that you will have? Unless another state-level organisation is already using it, it's likely that you have no idea. Your state will be the guinea pig. It's likely that you'll hit any number of bottlenecks in the software and some will be expensive and time-consuming to remove.
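One common way to approach the "try version Y safely" question is to trial the upgrade against a copy of the data and only promote it if a set of checks passes. The Python below is a toy sketch of that idea, not a description of any real exchange: `Deployment`, `try_upgrade`, the migration function and the checks are all hypothetical names invented for illustration.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Records = Dict[str, str]


@dataclass
class Deployment:
    version: str
    data: Records


def try_upgrade(
    current: Deployment,
    new_version: str,
    migrate: Callable[[Records], Records],
    checks: Dict[str, Callable[[Deployment], bool]],
) -> Tuple[Deployment, List[str]]:
    """Trial `new_version` on a copy of the data.

    Returns the deployment that should be live afterwards, plus the names
    of any failed checks. The live data is never modified, so rolling back
    means simply carrying on with `current`.
    """
    # Reformat a deep copy, never the live records.
    candidate = Deployment(new_version, migrate(copy.deepcopy(current.data)))
    failures = [name for name, check in checks.items() if not check(candidate)]
    if failures:
        return current, failures
    return candidate, []


# Hypothetical usage: version Y stores plan names in upper case.
live = Deployment("X", {"user-1": "plan-a"})
migrate_to_y = lambda records: {user: plan.upper() for user, plan in records.items()}
checks = {
    "no records lost": lambda d: len(d.data) == len(live.data),
    "known user still present": lambda d: "user-1" in d.data,
}
result, failures = try_upgrade(live, "Y", migrate_to_y, checks)
print(result.version, failures)
```

The hard part in real life is everything this sketch waves away: copying terabytes of data, writing checks that catch performance and security regressions, and handling writes that arrive while the trial is running.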

The biggest danger is the deadline. There's nothing more prone to cause panic in a software development project than an externally-imposed deadline for operation. Software completion times are famously hard to estimate, and so 2 months before the deadline you will probably have no idea if you can hit it. Even if conditions are favourable, it takes an exceptionally hard-headed and technically able project manager to triage appropriately and ensure that developers only work on the aspects of the system and its environment that are crucial to operation. Worse, a government-mandated government-funded project with a government-imposed deadline practically requires the state to throw money at the delivery of the system - this attracts the kind of developers who bill by the hour, anticipating a lucrative few months as they labour away as part of a cast of thousands trying to get the system out of the door. There's no alternative to paying for a new system, so the cost will go through the roof. If you're lucky, the system may approximately work some time after the deadline, but there's definitely no guarantee of this.

Conclusion: the ACA exchanges are going to be one more example of government IT projects that run horribly over-budget and deliver (at best) a barely-working unmaintainable system. It's great news for IT contractors and for large project-managing firms like EDS, Lockheed Martin etc., but the taxpayers are really going to get it in the shorts.

4 comments:

  1. I agree with much of what you say here. I was really referring to the building blocks used to make complex software systems rather than buying COTS.

    Also the tool sets we use to build software. They're a lot better than they were. But...

    In the NHS IT project there were many disparate systems in operation, in different hospitals and surgeries, all running on different hardware and different networks. Everyone wanted to keep what they were familiar with. Any engineer should have looked at the whole thing and sucked air through their teeth and said it's going to take a long time to sort this out. Even if you throw money and people at it. My guess would be that deadlines were set long before anyone knew what work was involved.

    Many years ago I was leading a small group of programmers. There was one guy who was always slower than the rest of the group. But I could rely on his estimates and the quality of his work. I constantly had to defend him against criticism from management because of his 'poor' productivity. Finally, I took entries from the support database to show them error reports on his software were few and far between. It didn't help much.

    Management also need 'fixing'.

    I could talk about this until the cows come home and, probably, until they go out again.

  2. Thanks Steve; I agree that engineers don't do nearly enough sucking-air-through-teeth. I suspect it's from a misguided attempt to appear "professional" and "positive" but there are *so* many cases when the correct response is to find the person responsible, drag him down to the nearest river and waterboard him to within an inch of his life. Professional be damned.

    There are certainly some problems with an engineering culture that relentlessly focuses on actual data on productivity and error rates, but the alternative is so, so much worse...

    What I find incredible on my frequent re-readings of Brooks' The Mythical Man-Month is how completely he nailed the nature of a software development program nearly 40 years ago and we still make the SAME BLOODY MISTAKES.

  3. Believe me, I tried the sucking air through the teeth method and it didn't do me any good.

    I think the image of software development has been ruined by the idea of the hacker. There's this general impression that, given a sufficiently motivated person (they're all talented by default) and a large enough number of keyboards, brilliant software can be conjured from nothing in next to no time. I blame Hollywood and the MSM, neither of which seems to have much interest in reality.

    I'm reminded of an earlier post of yours where you discussed the penetrating power of an assault rifle bullet and how this differed from public impressions gained from films.

  4. Steve, you should definitely watch the movie "The Internship". And you should ensure you have a large amount of alcohol to hand.

