By now, you've presumably seen how Public Health England screwed up spectacularly in their testing-to-identification pipeline, such that they dropped thousands of cases - because they hit an internal row limit in Excel.
Oops.
Still, how could anyone have predicted that Public Health England - founded in 2013 with responsibility for public health in England - could have screwed up so badly? Well, anyone with any experience of government IT in the past... 40 years, let's say. Or anyone who observed that the single most important job of a public health agency is to prepare for pandemics, which roll around every 10 years or so - remember SARS in 2003? H1N1 in 2009? And that is a duty which, as their 2020 performance illustrates, PHE could not have failed at more badly if they'd put their best minds to it.
Simply, there's no incentive for them to be any good at what they do.
It's tempting to simply roll out the PHE leadership and have them hanged from the nearest lamp post - or at least, claw back all the payments they received as a result of being associated with Public Health England. For reference, the latest version of their leadership page shows this list as:
- Duncan Selbie
- Prof Dr Julia Goodfellow
- Sir Derek Myers
- George Griffin
- Sian Griffiths
- Paul Cosford
- Yvonne Doyle
- Richard Gleave
- Donald Shepherd
- Rashmi Shukla
When you set up a data processing pipeline like this, your working assumptions should be that:
- The data you ingest is often crap in accuracy, completeness and even syntax;
- At every stage of processing, you're going to lose some of it;
- Your computations are probably incorrect in several infrequent but crucial circumstances; and
- When you spit out your end result, the system you send it to will frequently be partially down, and so will drop or reject some or all of the (hopefully) valid data you're sending to it.
The insight you need is to accept that your pipeline is going to be decrepit and leaky, and will contaminate your data. That's OK, as long as you know when it's happening and approximately how bad it is.
Let's look at the original problem. From the BBC article:
The issue was caused by the way the agency brought together logs produced by commercial firms paid to analyse swab tests of the public, to discover who has the virus. They filed their results in the form of text-based lists - known as CSV files - without issue.

We want a good estimate, for each supplying firm, of whether all the records have been received. Therefore we supplement the list of records with some of our own - records with characteristics which we expect to survive through processing. Assuming each record is a list of numerical values (say, number of virus particles per mL - IDK, I'm not a biologist), a simple way to do this is to make one or more fields in our artificial records have values that are 100x higher or lower than practically feasible. Then for a list of N records, you add one artificial record at the start, one at the end and one in the middle, so you ship N+3 records to central processing. For extra style, change the invalidity characteristic of each of these records - so e.g. you know that an excessively high viral load signals the start of a records list, and an excessively low load signals the end.
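As a minimal sketch of that canary injection - assuming each record is a CSV row with a viral_load field, and with thresholds and field names invented purely for illustration - it might look like this:

```python
import csv
import io

# Invented thresholds: values this far outside anything biologically plausible
# mark a row as a canary rather than a real test result.
IMPLAUSIBLY_HIGH_LOAD = 1e12   # signals the start of a batch
IMPLAUSIBLY_LOW_LOAD = 1e-12   # signals the end of a batch

def add_canaries(rows):
    """Return the N real rows plus 3 artificial ones: start, middle and end."""
    start = {"sample_id": "CANARY-START", "viral_load": IMPLAUSIBLY_HIGH_LOAD}
    middle = {"sample_id": "CANARY-MID", "viral_load": 2 * IMPLAUSIBLY_HIGH_LOAD}
    end = {"sample_id": "CANARY-END", "viral_load": IMPLAUSIBLY_LOW_LOAD}
    half = len(rows) // 2
    return [start] + rows[:half] + [middle] + rows[half:] + [end]

def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sample_id", "viral_load"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

if __name__ == "__main__":
    real = [{"sample_id": f"S{i:04d}", "viral_load": 1000.0 + i} for i in range(10)]
    print(to_csv(add_canaries(real)))   # ships N+3 rows to central processing
```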
The next stage:
PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team, as well as other government computer dashboards.

First check: this is not a lot of data. Really, it isn't. Every record represents the test of a human, there's a very finite testing capacity (humans per day), and the amount of core data produced per record should easily fit in 1KB - 100 or more double-precision floating point numbers. It's not like they're uploading e.g. digital images of mammograms.
So the first step, if you're competent, is for Firm A to read back the data from PHE:
- Firm A has records R1 ... R10. It computes a checksum for each record - a number which is a "summary" of the record, rather like feeding the record through a sausage machine and taking a picture of the sausage it produces.
- Firm A stores checksums C1, C2, ..., C10 corresponding to each record.
- Firm A sends records R1, R2, ..., R10 to PHE, tagged with origin 'Firm A' and date '2020-10-06'
- Firm A asks PHE to send it checksums of all records tagged 'Firm A', '2020-10-06'
- PHE reads its internal records, identifies 10 records, sends checksums D1, D2, ... D10
- Firm A checks that the number of checksums matches, and that each checksum is the same: if there's a discrepancy, it loudly flags this to a human.
If PHE wants to be really cunning then one time in 50 it will deliberately omit a checksum in its response, or change one bit of a checksum, and expect the firm to flag an error. If no error is raised, we know that Firm A isn't doing read-backs properly.
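Here's a rough sketch of Firm A's side of that read-back, assuming SHA-256 over a canonical serialisation of each record (the hashing scheme, field layout and message wording are my own assumptions, not anything PHE specified):

```python
import hashlib

def record_checksum(record):
    """The 'sausage machine': hash a canonical serialisation of one record."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_readback(sent_records, returned_checksums,
                    origin="Firm A", date="2020-10-06"):
    """Compare our own checksums against the ones PHE claims to hold."""
    ours = [record_checksum(r) for r in sent_records]
    problems = []
    if len(returned_checksums) != len(ours):
        problems.append(f"count mismatch: sent {len(ours)}, "
                        f"PHE reports {len(returned_checksums)}")
    missing = set(ours) - set(returned_checksums)
    unexpected = set(returned_checksums) - set(ours)
    if missing:
        problems.append(f"{len(missing)} of our records are not present at PHE")
    if unexpected:
        problems.append(f"PHE holds {len(unexpected)} checksums we never sent")
    for problem in problems:
        # In real life this should page a human, loudly, not just print.
        print(f"READ-BACK FAILURE [{origin}, {date}]: {problem}")
    return not problems
```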
Now, PHE wants to aggregate its records. It has (say) 40 firms supplying data to it. So it does processing over all the records and for each record produces a result: one of "Y" (positive test), "N" (negative test), "E" (record invalid), or "I" (record implausible). Because of our fake record injection, if the 40 firms send 1,000 records in total (artificial records included), we should expect zero "E" results, exactly 120 "I" results (three canaries per firm), and the "Y" and "N" results should total 880. If we calculate anything different, the system should complain loudly, and we send a human to figure out what went wrong.
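A sketch of that sanity check over the aggregated results, using the counts from the example above (the result codes come from the text; the function and constants are invented for illustration):

```python
from collections import Counter

EXPECTED_FIRMS = 40
CANARIES_PER_FIRM = 3   # one each at the start, middle and end of every firm's batch

def check_aggregate(results, total_received):
    """results is one 'Y'/'N'/'E'/'I' code per record received from the firms."""
    counts = Counter(results)
    expected_canaries = EXPECTED_FIRMS * CANARIES_PER_FIRM
    complaints = []
    if len(results) != total_received:
        complaints.append(f"only {len(results)} of {total_received} records processed")
    if counts["E"] != 0:
        complaints.append(f"{counts['E']} invalid records (expected 0)")
    if counts["I"] != expected_canaries:
        complaints.append(f"{counts['I']} implausible records "
                          f"(expected {expected_canaries} canaries)")
    if counts["Y"] + counts["N"] != total_received - expected_canaries:
        complaints.append("Y+N does not equal records received minus canaries")
    if complaints:
        # Complain loudly; a human then goes and figures out what went wrong.
        raise RuntimeError("; ".join(complaints))

# With 1,000 records from 40 firms we expect 120 'I' results and Y+N == 880.
```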
The system isn't perfect - the aggregation function might accidentally skip 1 in 100 results, for instance, and through bad luck it might not skip an erroneous record. But it's still a good start.
I just pulled this process out of my posterior, and I guarantee it's more robust than what PHE had in place. So why are we paying £12 billion or more for a Test+Trace system that isn't even as good as one a compsci grad would put in place in return for free home gigabit Ethernet, with an incentive scheme based around Xena tapes and Hot Pockets?
Nobody really cared if the system worked well. They just wanted to get it out of the door. No-one - at least, at the higher levels of project management - was going to be held accountable for even a failure such as this. "Lessons will be learned" platitudes will be trotted out, the company will find one or two individuals at the lower level and fire them for negligence, but any project manager not actually asleep on the job would have known this was coming. And they know it will happen again, and again, as long as the organisation implementing systems like this has no direct incentive for it to work. Indeed, the client (UK Government) probably didn't even define what "work" actually meant in terms of effective processing - and how they would measure it.
More decades ago than I care to admit, I was working as a programmer at a local authority. A colleague was given an input validation program to write for the local libraries system. It came with an astonishingly lengthy list of validation criteria.
After a not inconsiderable time programming all the validation (he was sooo proud) and doing his testing, he was given a week's worth of libraries input data to play with, along with the meaningful error messages libraries had supplied. About 10k input records.
Nothing made it through validation and the error report was large enough to be seen from space. There was a discussion with libraries, who decided they could do without some tests. Nope, still nothing. What was rejected on one test failed the next. There then followed a negotiating period and, eventually, they got 90% of their records through validation. Of course, by that time there was virtually no validation.
I don't know about my colleague, but I learned three lessons:
1. Input data cannot be trusted. Ever.
2. Management rarely understand how their businesses really work.
3. There's a strong temptation to fix the symptoms and not the problem.
Steve - I don't think I could agree more, except I might substitute "irresistible" for "strong" in point 3.