Hemiposterical: Of FADECs and Failures

I've been in the software game a while, but it's quite telling that I remain mildly astonished when any program runs through to completion without raising any errors. Note that errors are distinct from crashes; it is nearly always possible to write a program which is crash-free, but error-free is a little trickier. See for instance this snippet of Python:

#!/usr/bin/python
from errorprone_code import main_program
from time import sleep

complete = False
while not complete:
    try:
        main_program()
        complete = True
    except Exception as err:
        # Swallow the error, complain, and go round again
        print("Strewth! %s" % err)
        sleep(1)

which should be crash-free, but clearly does nothing to make main_program() itself any freer from errors.

The ongoing furore about the Chinook helicopter crash into the Mull of Kintyre in 1994 is primarily focused on the FADEC (full-authority digital engine controller) and whether it is reasonably possible that a FADEC failure could have induced the crash, or at least contributed to it. The best write-up I've found so far on the topic is from the House of Lords inquiry in 2002. I'm wary of any inquiry conducted by the Air Force itself (the original Board of Inquiry by two Air Marshals, for instance) due to the incentives to cover up procurement or operational screw-ups. I'm equally wary of any study by outside "experts" commissioned by politicians as they are incentivised to produce the result that the commissioning politicians would like. The Lords seem to be the least amenable to influence, and are generally diligent and relatively impartial.

The essential problem with the FADEC code that Boeing wrote for the Chinook HC2, and that Boscombe Down disliked so much, was that it was unverifiable. EDS-Scicon reviewed the code and found "486 anomalies" after checking only the first 18% of it. The problem here is that we don't know what those 'anomalies' were. I've done any amount of code review under a wide range of analysis criteria, and 'anomaly' can mean practically anything. It can mean an uninitialised variable being used (bad, definitely needs fixing), an unreachable code path (generally safe but needs explaining), an inconsistency between comments and code (potentially dangerous if the code is incorrect, just annoying if the comment is incorrect) or just a violation of coding guidelines (e.g. a variable name in StudlyCaps instead of underscore_separated style). Boscombe Down's main concern was that the code was structured in such a way that it was not amenable to any useful form of analysis. In other words, they couldn't tell with any degree of certainty where it might be incorrect or unsafe.
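To make those four categories concrete, here's a toy Python sketch with one of each. Everything in it (the function names, the RPM thresholds, the trim value) is invented for illustration and bears no relation to the real FADEC source; the point is only how different in severity four things all labelled "anomaly" can be.

```python
# Hypothetical illustration only: all names and numbers here are made up.

def fuel_demand(throttle_pct, engine_rpm):
    """Return a fuel flow demand clamped to the range 0.0 to 1.0."""
    # Anomaly 1: 'trim' is only assigned on one branch, so below 5000 rpm
    # it is used uninitialised (Python raises UnboundLocalError).
    if engine_rpm >= 5000:
        trim = 0.02
    demand = throttle_pct / 100.0 + trim
    return min(max(demand, 0.0), 1.0)


def overspeed_action(engine_rpm):
    if engine_rpm > 12000:
        return "cut fuel"
    elif engine_rpm > 15000:
        # Anomaly 2: unreachable path; anything over 15000 rpm
        # has already been caught by the test above.
        return "shut down"
    return "none"


def idle_rpm():
    # Anomaly 3: comment and code disagree. Idle speed is 6000 rpm.
    return 6500


def read_throttle(raw_counts):
    # Anomaly 4: style violation only; the guideline wants throttle_pct.
    ThrottlePct = raw_counts * 100.0 / 1023.0
    return ThrottlePct
```

Only the first of these can actually bring the program down, and only in one operating range; the second and fourth are harmless as written, and the third is harmless or dangerous depending on whether it's the comment or the code that is wrong. A raw count of "four anomalies" tells you none of that.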

There is a very large gap between "unverifiable" and "incorrect". Tony Hoare's quote from his Turing Award lecture comes to mind:

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature."

Unverifiable code in a safety-critical system is clearly bad. That doesn't mean that it's actually wrong, nor that it caused the crash. You certainly wouldn't want to let an aircraft with unverifiable engine code into service, but Boscombe Down was overruled by MoD (no doubt a conversation along the lines of "we've already bought the damn things, we'd look pretty stupid if we didn't let them fly"). There did appear to be real problems with the FADEC, including uncommanded engine run-ups experienced on the Chinook HC2, which doesn't surprise me in the least. But as long as the Chinooks flew in regular flight regimes, with standard power settings, they'd be running through the best-tested parts of the FADEC code which would therefore be the least prone to error. There's nothing in the crash which indicates any abnormal engine operation, commanded or uncommanded.

(For the record, here's what I believe. I do not believe that the FADEC failed in any significant way around the time of the crash. I think the crash was a classic controlled flight into terrain, in very bad visibility. I think that the two pilots, both flight lieutenants who were flying more than their recommended hours, were pressured into making the flight in circumstances where they might otherwise have delayed and waited for better flying conditions. We will never know exactly what happened in that cockpit, but there are plenty of people in Boeing, Textron and MoD Procurement, and among the RAF's senior officers, who contributed to this crash in some way. Blaming the pilots alone is deeply unfair and smacks of some pretty disgusting expediency by the MoD and RAF.)

Producing code which is effectively free from errors is possible but very expensive. That expense may be justified, if failure would be even more expensive. More likely is that the occasional error would be acceptable as long as it is handled safely (e.g. an engine controller hitting an error condition re-initialises itself, thereby refusing operator commands for a few seconds, and logs that an error has occurred). Even more likely is that the developers hack something together that mostly works, test it as much as they can to remove the more obvious bugs, stick in exception handlers to manage the unexpected, and then charge the client for "functional upgrades" when they report operational errors or strange behaviour after the system has been accepted. But if you want a system that could possibly be made reasonably free of errors, it needs to be a design that is amenable to analysis. That is where Boeing / Textron failed in the FADEC design, and accepting a software system with such a design is where MoD Procurement and the RAF failed.
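As a rough sketch of the "handled safely" middle option, here's what a controller that re-initialises on error might look like in Python. This is a toy under stated assumptions, not a description of how any real FADEC behaves: the EngineController class, the three-second lock-out, the choice to hold the last accepted power setting, and the 0.0 to 1.0 power range are all made up for the example.

```python
import logging
import time

log = logging.getLogger("controller")


class EngineController:
    """Toy controller: on an internal error it logs the fault, holds the
    last accepted power setting, and refuses commands while it resets."""

    REINIT_SECONDS = 3.0  # assumed lock-out period, purely illustrative

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the lock-out can be tested
        self._ready_at = 0.0     # time at which commands are accepted again
        self.power = 0.0         # last accepted power demand
        self.error_count = 0

    def _apply(self, power):
        # Stand-in for the real control law; rejects out-of-range demands.
        if not 0.0 <= power <= 1.0:
            raise ValueError("power demand %r out of range" % power)
        self.power = power

    def command(self, power):
        """Return True if the command was accepted, False if refused."""
        if self._clock() < self._ready_at:
            return False  # still re-initialising: refuse operator input
        try:
            self._apply(power)
            return True
        except Exception as err:
            # Record the fault and lock out further commands for a while.
            # self.power is left untouched, i.e. the last good setting holds.
            self.error_count += 1
            log.error("controller fault, re-initialising: %s", err)
            self._ready_at = self._clock() + self.REINIT_SECONDS
            return False
```

The important property is that the failure is visible (logged and counted) and bounded (a fixed lock-out, a known held state), rather than silently retried forever as in the loop at the top of this post.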

2011-07-13 Update: as expected, Lord Philip has overturned the verdict of gross negligence, saying in effect that there's sufficient doubt about the circumstances of the accident that the standard of proof for negligence can't be met. Sir William Wratten (who was Commander British Forces during Gulf War #1) and Sir John Day from the original RAF inquiry should feel suitably chastened, but I expect they won't.

2011-07-12
