
2022-12-26

The Twitter Whistleblower report - how bad was Twitter, really?

Prompted by a post by everyone's favourite Portugal-based squirrel-torturing blogger, Tim Worstall, I thought I'd dive into the practical implications of all the (frankly, horrendous) technical, security and privacy problems identified at Twitter before Elon Musk rocked up as owner and CEO.

Usual disclaimer: I'm going by the reports. Reality might be different. I cite where I can.

For background: both USA and European authorities take a dim view of corporate access to, and usage of, individual user data. Remember the European "ePrivacy Directive"? Also known as the "'f+ck these annoying cookie pop-ups' law"... Governments in both Europe and the USA are keenly interested in companies tracking individual users' activities, though my personal opinion is that they're just jealous; they'd like to do it too, but they're just not competent. Anyway, a company doing individual tracking at large scale for profit - Twitter, Google, YouTube, Meta, Amazon - attracts their attention, and their laws.

Security

Let's talk about security - and, more importantly, access to secure data. A fundamental principle of security is "least privilege": everyone should have the smallest set of access privileges needed to do their job. You could argue that 5000+ people in Twitter "need" to be able to change things in production at some point to do their jobs, but they certainly don't "need" to have always-on, cross-production access. Not least because someone experimentally running a command they found in an internal playbook could easily break a large chunk of the service. But don't rely on me - ask their job candidates:

Twitter's practice was a huge red flag for job candidates, who universally expressed disbelief. One Vice President of Information Technology [his current role, not the target role] considered withdrawing his application on the (accurate) rationale that Twitter's lack of basic engineering hygiene in their arrangement presaged major headaches.
Hire that guy.

Certainly, every company is far from perfect in this area, but those with regulators are continually seeking to narrow the number of people with access, and the scope of access those people have. Twitter pre-Musk clearly did not give a crap about the count and scope of access. One can only imagine why; were they, for instance, relying on a large base of pre-approved employees to intercept and downgrade/block opinions outside the mainstream? How would we tell if this were not the case? Can Twitter show that they were engaged in a systematic reduction of number and scope of access to production? If not, who will be held to account?

Auditing

Control is one thing - but if a human performs an action in the production environment (change or query), that action should at least be logged, so a future audit can see what happened. This is not a high bar, but it was apparently too high for pre-2022 Twitter:

There was no logging of who went into the production environment or what they did.
FFS
To make clear the implications: in general, there was no way of finding out who queried (for their own purposes) or changed (deleted posts, down-rated users, etc) the production environment at any particular time. "Why did [event] happen?" "Beats the hell out of me, someone probably changed something." "Who? When?" "No idea."

This is particularly interesting because Twitter's Chief Information Security Officer - who resigned post-Musk - was also their former head of privacy engineering, and before that, apparently, global lead of privacy technology at Google. One could only imagine what that implies.

Control

There is also a wide range of engineering issues. Data integrity (not losing user-entered data) was obviously a critical issue, but Twitter had been aware for a while that they teetered on the edge of a catastrophic production data loss:

even a temporary but overlapping outage of a small number of datacenters would likely [my italics] result in the service going offline for weeks, months, or permanently.
This is not quite as bad as it first seems. After a year or so in operation, companies have a fairly good idea what happens with a datacenter outage - because they're more frequent than you imagine. Say Henry the intern accidentally leans against the Big Red Button on the datacenter floor, cutting power to everything. Or you do a generator test, only to discover that a family of endangered hawks have made their nest in the generator housing for Floor 2... So you get used to (relatively) small-scale interruptions.

If you want to run a global service, though, you need to be able to tolerate single site outages as routine, and multiple site outages (which turn out to be inevitable) have to be managed within the general bounds of your service's promised availability - and latency, and data availability. Even if all your physical locations are very separate, there will inevitably be common cause failures - not least, when you're pushing binary or config changes to them. So, don't wait for these events to sneak up on you - rather, anticipate them.

This means that you have to plan for, and practice, these events. If you're not doing so, then a) it will be obvious to anyone asking questions in this area, and b) when things inevitably do run off the rails, there will be bits of burning infrastructure scattered everywhere, around the highly-paid morons who are busy writing memos to cover their asses: "how could we have foreseen this particular event? Clearly, it wasn't our fault, but pay us 20% extra and we might catch or mitigate the next such event."

Go looking for those people. Fire them, and throw them into a den of hungry pigs.

Leaving the doors open

By far the most horrific aspect, however, was the general relaxed attitude about government agencies - and heaven only knows what other NGOs, cabals, and individuals - having under-the-table access to Twitter's data. Just the tolerance of user-installed spyware on privileged devices would be enough for any sane security engineer to be tearing out their hair, but actually letting in individuals known to be employed by foreign - and even domestic - governments for the purposes of obtaining intelligence information, and potentially affecting the flow of information to their and other countries... one is lost for words.

At some stage, Twitter had to either grow up, or close down. Under Dorsey's crew, the latter was inevitable - and likely not far away. It's still too early to tell if Musk can get them to option 1, but there's still hope.

2020-10-07

NHS Track+Trace - what went wrong

By now, you've presumably seen how Public Health England screwed up spectacularly in their testing-to-identification pipeline, such that they dropped thousands of cases - because they hit an internal row limit in Excel.

Oops.

Still, how could anyone have predicted that Public Health England - which was founded in 2013 with responsibility for public health in England - could have screwed up so badly? Well, anyone with any experience of government IT in the past... 40 years, let's say. Or anyone who observed that the single most important job of a public health agency is to prepare for pandemics, which roll around every 10 years or so - remember SARS in 2003? H1N1? And that duty, as their 2020 performance illustrates, is one that PHE could not have failed at any more badly if they'd put their best minds to it.

Simply, there's no incentive for them to be any good at what they do.

It's tempting to simply roll out the PHE leadership and have them hanged from the nearest lamp post - or at least, claw back all the payments they received as a result of being associated with Public Health England. For reference, the latest page shows this list as:

  • Duncan Selbie
  • Prof Dr Julia Goodfellow
  • Sir Derek Myers
  • George Griffin
  • Sian Griffiths
  • Paul Cosford
  • Yvonne Doyle
  • Richard Gleave
  • Donald Shepherd
  • Rashmi Shukla
However, this misses the point; there's plenty more where they came from. Many of these people are actually smart, or at least cunning. None of them actively wanted tens of thousands of people in the UK to die, or the UK's coronavirus response to become an absolute laughing-stock. Yet, here we are.

When you set up a data processing pipeline like this, your working assumptions should be that:

  1. The data you ingest is often crap in accuracy, completeness and even syntax;
  2. At every stage of processing, you're going to lose some of it;
  3. Your computations are probably incorrect in several infrequent but crucial circumstances; and
  4. When you spit out your end result, the system you send it to will frequently be partially down, and so will drop or reject some or all of the (hopefully) valid data you're sending to it.
Given all these risks, one is tempted to give up managing data pipelines for a living and change to an easier mode of life such as a career civil servant in the Department for Education where nothing you do will have the slightest effect, yet you'll still get pay and pension. Still, there's a way forward for intrepid souls.

The insight you need is to accept that your pipeline is going to be decrepit and leaky, and will contaminate your data. That's OK, as long as you know when it's happening and approximately how bad it is.

Let's look at the original problem. From the BBC article:

The issue was caused by the way the agency brought together logs produced by commercial firms paid to analyse swab tests of the public, to discover who has the virus. They filed their results in the form of text-based lists - known as CSV files - without issue.
We want a good estimate, for each agency, of whether all the records have been received. Therefore we supplement the list of records with some of our own, which have characteristics we expect to survive processing. Assuming each record is a list of numerical values (say, number of virus particles per mL - IDK, I'm not a biologist), a simple way to do this is to make one or more fields in our artificial records take values 100x higher or lower than practically feasible. Then, for a list of N records, you add one artificial record at the start, one at the end and one in the middle, so you ship N+3 records to central processing. For extra style, change the invalidity characteristic of each of these records - so, for example, an excessively high viral load signals the start of a records list, and an excessively low load signals the end.
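A sketch of that injection step in Go - the record layout and the "viral load" field are my own illustrative assumptions, not PHE's actual CSV format:

package main

import "fmt"

type result struct {
	sampleID  string
	viralLoad float64 // plausible real values are nowhere near the sentinel values below
}

// addSentinels wraps a day's results with three artificial records whose
// viral loads are wildly implausible: far too high at the start, negative in
// the middle, far too low at the end.
func addSentinels(records []result) []result {
	out := []result{{"SENTINEL-START", 1e9}}
	half := len(records) / 2
	out = append(out, records[:half]...)
	out = append(out, result{"SENTINEL-MIDDLE", -1})
	out = append(out, records[half:]...)
	out = append(out, result{"SENTINEL-END", 1e-6})
	return out
}

func main() {
	day := []result{{"S001", 3200}, {"S002", 150}, {"S003", 47000}}
	for _, r := range addSentinels(day) {
		fmt.Printf("%s,%g\n", r.sampleID, r.viralLoad)
	}
}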

The next stage:

PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team, as well as other government computer dashboards.
First check: this is not a lot of data. Really, it isn't. Every record represents the test of a human, there's a very finite testing capacity (humans per day), and the amount of core data produced should easily fit in 1KB - 100 or more double-precision floating point numbers. It's not like they're uploading e.g. digital images of mammograms.

So the first step, if you're competent, is for Firm A to read-back the data from PHE:

  • Firm A has records R1 ... R10. It computes a checksum for each record - a number which is a "summary" of the record, rather like feeding the record through a sausage machine and taking a picture of the sausage it produces.
  • Firm A stores checksums C1, C2, ..., C10 corresponding to each record.
  • Firm A sends records R1, R2, ..., R10 to PHE, tagged with origin 'Firm A' and date '2020-10-06'
  • Firm A asks PHE to send it checksums of all records tagged 'Firm A', '2020-10-06'
  • PHE reads its internal records, identifies 10 records, sends checksums D1, D2, ... D10
  • Firm A checks that the number of checksums matches and that each checksum is the same: if there's a discrepancy, it loudly flags this to a human.
This at least assures Firm A that its data has been received, is complete, and is safely stored.
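In code, the read-back check is small. Here's a Go sketch - the checksum function and the "received from PHE" list are stand-ins for whatever the real hash and transport would be:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// checksum produces the "picture of the sausage": a short, stable summary of a record.
func checksum(record string) string {
	sum := sha256.Sum256([]byte(record))
	return hex.EncodeToString(sum[:8])
}

// verifyReadback compares the checksums Firm A stored locally with the ones
// PHE says it holds for the same batch; any discrepancy gets flagged to a human.
func verifyReadback(local, remote []string) error {
	if len(local) != len(remote) {
		return fmt.Errorf("count mismatch: sent %d records, PHE holds %d", len(local), len(remote))
	}
	for i := range local {
		if local[i] != remote[i] {
			return fmt.Errorf("checksum mismatch at record %d", i+1)
		}
	}
	return nil
}

func main() {
	records := []string{"R1,...", "R2,...", "R3,..."}
	var sent []string
	for _, r := range records {
		sent = append(sent, checksum(r))
	}
	received := append([]string(nil), sent...) // in reality, fetched back from PHE
	if err := verifyReadback(sent, received); err != nil {
		fmt.Println("ALERT - wake a human:", err)
	} else {
		fmt.Println("batch verified:", len(sent), "records present and intact")
	}
}

Any stable hash will do; the point is that both sides compute it independently from the record bytes they actually hold.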

If PHE wants to be really cunning then one time in 50 it will deliberately omit a checksum in its response, or change one bit of a checksum, and expect the firm to flag an error. If no error is raised, we know that Firm A isn't doing read-backs properly.

Now, PHE wants to aggregate its records. It has (say) 40 firms supplying data to it. So it does processing over all the records and for each record produces a result: one of "Y" (positive test), "N" (negative test), "E" (record invalid), "I" (record implausible). Because of our fake record injection, if 40 firms send 1000 records in total, we should expect zero "E" results, 120 "I" results, and the total of "Y" and "N" results should equal 880. If we calculate anything different, the system should complain loudly, and we send a human to figure out what went wrong.
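That sanity check is a few lines once the result counts are in hand. A Go sketch, using the worked numbers above (function and field names are mine, not PHE's):

package main

import "fmt"

// checkAggregation applies the check described above: `firms` suppliers each
// inject 3 implausible sentinel records into a batch of `total` records, so we
// expect zero "E", firms*3 "I", and the remainder split between "Y" and "N".
func checkAggregation(counts map[string]int, firms, total int) error {
	sentinels := firms * 3
	switch {
	case counts["E"] != 0:
		return fmt.Errorf("%d invalid records - someone is sending malformed data", counts["E"])
	case counts["I"] != sentinels:
		return fmt.Errorf("expected %d sentinel records, saw %d - records lost in transit", sentinels, counts["I"])
	case counts["Y"]+counts["N"] != total-sentinels:
		return fmt.Errorf("Y+N is %d, expected %d - real results have gone missing", counts["Y"]+counts["N"], total-sentinels)
	}
	return nil
}

func main() {
	counts := map[string]int{"Y": 310, "N": 570, "E": 0, "I": 120} // 40 firms, 1000 records
	if err := checkAggregation(counts, 40, 1000); err != nil {
		fmt.Println("ALERT - send a human:", err)
	} else {
		fmt.Println("batch passes the sentinel sanity check")
	}
}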

The system isn't perfect - the aggregation function might accidentally skip 1 in 100 results, for instance, and through bad luck the records it skips might not include any of our artificial ones, so no alarm is raised. But it's still a good start.

I just pulled this process out of my posterior, and I guarantee it's more robust than what PHE had in place. So why are we paying the Test+Trace system £12 billion or more to implement a system that isn't even as good as a compsci grad would put in place in return for free home gigabit Ethernet, with an incentive scheme based around Xena tapes and Hot Pockets?

Nobody really cared if the system worked well. They just wanted to get it out of the door. No-one - at least, at the higher levels of project management - was going to be held accountable for even a failure such as this. "Lessons will be learned" platitudes will be trotted out, the company will find one or two individuals at the lower level and fire them for negligence, but any project manager not actually asleep on the job would have known this was coming. And they know it will happen again, and again, as long as the organisation implementing systems like this has no direct incentive for it to work. Indeed, the client (UK Government) probably didn't even define what "work" actually meant in terms of effective processing - and how they would measure it.

2020-05-12

Testing for determinism

Apropos of nothing[1], here's a view on testing a complicated system for deterministic behaviour. The late, great John Conway proposed the rules for "Game of Life", an environment on an arbitrary-sized "chess board" where each square could be either alive or dead, and potentially change at every "tick" of a clock according to the following rules.

  1. Any live cell with two or three live neighbours survives.
  2. Any dead cell with three live neighbours becomes a live cell.
  3. All other live cells die in the next generation. Similarly, all other dead cells stay dead.
You'd think that this would be a very boring game, given such simple rules - but it in fact generates some very interesting behaviour. You find eternally iterating structures ("oscillators"), evolving structures that travel steadily across the board ("spaceships"), and even "glider guns" that fire a repeated sequence of spaceships.
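For concreteness, here's a bare-bones sketch of those three rules in Go - my own throwaway board representation (a 2D bool slice, with cells off the edge treated as dead), not the life package used in the test code below:

package main

import "fmt"

// step applies Conway's three rules to produce the next generation.
func step(board [][]bool) [][]bool {
	rows, cols := len(board), len(board[0])
	next := make([][]bool, rows)
	for r := range next {
		next[r] = make([]bool, cols)
		for c := range next[r] {
			n := liveNeighbours(board, r, c)
			switch {
			case board[r][c] && (n == 2 || n == 3):
				next[r][c] = true // rule 1: live cell with two or three live neighbours survives
			case !board[r][c] && n == 3:
				next[r][c] = true // rule 2: dead cell with exactly three live neighbours becomes live
			}
			// rule 3: everything else stays dead (the zero value)
		}
	}
	return next
}

// liveNeighbours counts the live cells adjacent to (r,c), treating off-board cells as dead.
func liveNeighbours(board [][]bool, r, c int) int {
	count := 0
	for dr := -1; dr <= 1; dr++ {
		for dc := -1; dc <= 1; dc++ {
			if dr == 0 && dc == 0 {
				continue
			}
			nr, nc := r+dr, c+dc
			if nr >= 0 && nr < len(board) && nc >= 0 && nc < len(board[0]) && board[nr][nc] {
				count++
			}
		}
	}
	return count
}

func main() {
	// A "blinker" - three live cells in a row - oscillates with period 2.
	b := [][]bool{
		{false, false, false},
		{true, true, true},
		{false, false, false},
	}
	b = step(b)
	fmt.Println(b[0][1], b[1][1], b[2][1]) // true true true: the row has become a column
}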

Building a simulation of Conway's Game of Life is something of a rite of passage for programmers - doing it in a coding language new to the programmer generally shows that they have figured out the language enough to do interesting things. But how do they know that they have got it right? This is where "unit testing" comes into play.

Unit testing is a practice where you take one function F in your code, figure out what it should be doing, and write a test function that repeatedly calls F with specific inputs, and checks in each case that the output is what's expected. Simple, no? If F computes multiplication, you check that F(4,5)=20, F(0,10)=0, F(45,1)=45 etc.

Here's a unit test script. It's written in Go, for nerds, [2] but should be understandable based on function names to most people with some exposure to programming. First, you need to check the function that you've written to see whether two Life boards are equivalent, so you create empty 4x4, 4x5, 5x4 boards and see if your comparison function thinks they're the same.
(In Go, read "!" as "not", and "//" marks a comment which the computer will ignore but programmers can, and should, read)

  b1 := life.NewBoard(4,4)
  b2 := life.NewBoard(4,4)
  // These should be equivalent
  if ! life.AreEqual(b1,b2) {
     t.Error("blank 4x4 boards aren't the same")
  }
  b3 := life.NewBoard(5,4)
  b4 := life.NewBoard(4,5)
  // These are different sizes, so should not equal b1
  if life.AreEqual(b1,b3) || life.AreEqual(b1,b4) {
    t.Error("different size boards are the same")
  }
That's easy, but you also need to check that adding a live cell to a board makes it materially different:
  // Add in a block to b1 and compare with b2
  life.AddBlock(0,0,b1)
  if life.AreEqual(b1,b2) {
    t.Error("one board has a block, blank board is equivalent")
  }
  // Add the same block to b2 in same place, they should be equal
  life.AddBlock(0,0,b2)
  if ! life.AreEqual(b1,b2) {
    t.Error("2 boards, same block, unequal")
  }
This is helpful, but we still don't know whether that "block" (live cell) was added in the right place. What if a new block is always added at (2,3) rather than the coordinates specified? Our test above would still pass. How do we check for this failure case?

One of the spaceships in Life, termed a glider, exists in a 3x3 grid and moves (in this case) one row down and one column across every 4 generations. Because we understand this fundamental but fairly complex behaviour, we can build a more complicated test. Set up a 5x5 board, create a glider, and see if

  1. the board is different from its start state at time T+1;
  2. the board does not return to its start state at time T+2 through T+19; and
  3. the board does return to its start state at time T+20.
Code to do this:
  b5 := life.NewBoard(5,5)
  life.AddGlider(0, 0, b5, life.DownRight)
  b6 := life.CopyBoard(b5)
  if ! life.AreEqual(b5,b6) {
    t.Error("Copied boards aren't the same")
  }
  // A glider takes 4 cycles to move 1 block down and 1 block across.
  // On a 5x5 board, it will take 5 x 4 cycles to completely cycle
  for i := 0; i < 19; i++ {
    life.Cycle(b5)
    if life.AreEqual(b5,b6) {
      t.Errorf("Glider cycle %d has looped, should not", i)
    }
  }
  life.Cycle(b5)
  if ! life.AreEqual(b5,b6) {
    t.Error("Glider on 5x5 board did not cycle with period 20")
  }
Now, even if you assume AreEqual(), NewBoard(), CopyBoard() work fine, you could certainly construct functions AddGlider(), Cycle() which pass this test. However, you'd have to try pretty hard to get them right enough to pass, but still wrong. This is the essence of unit testing - you make it progressively harder, though not impossible, for a function to do the wrong thing. One plausible failure scenario is an incorrect adjacent-cells locator in Cycle(), such that the glider goes up-and-across rather than down-and-across. To catch that, you could add some code to turn on a critical cell at (say) time 8 - a cell that would already be live if the glider is moving as expected (so turning it on changes nothing), but dead if it is moving the wrong way.

Clearly, for unit testing to work, you want a unit tester who is at least as ingenious (and motivated) as the coder. In most cases, the coder is the unit tester, so "soft" unit tests are unfortunately common - still, at least they're a basis to argue that the code meets some kind of spec. And if the client isn't happy with the tests, they're free to add their own.

Why am I so mad at Neil Ferguson? He's free to make whatever epidemiological assumptions that he wants, but he usurped the "authority" of computer modelling to assert that his model should be trusted, without actually undertaking the necessary and fundamental computer science practices - not least, unit testing.

[1] Lies: Neil Ferguson, take note
[2] Object-oriented model avoided for clarity to readers

2020-05-10

Harmeet Dhillon picked a winner

I enjoyed reading a Gizmodo article today. (This is not a common occurrence). The article itself was a mostly-triumphant comment on James "neurotic women" Damore closing his lawsuit against The Google:

Damore proceeded to sue Google for discrimination in January 2018. Per Bloomberg, three other men who worked for or applied for jobs at Alphabet, Google’s parent company, also signed on to Damore's lawsuit. In the lawsuit, Damore's lawyers argued that he and others "were ostracized, belittled, and punished for their heterodox political views, and for the added sin of their birth circumstances of being Caucasians and/or males."
I read the internal blog posts in the initial complaint, and to be honest it looked pretty problematic for Google. So why close the lawsuit now?

Aha! A clue in the Bloomberg article on the suit's conclusion:

A lawyer for the men, Harmeet Dhillon, said they're prohibited as part of their agreement with Google from saying anything beyond what's in Thursday’s court filing. Google declined to comment.
It's pretty clear, isn't it? Google settled. They looked at what would plausibly come out of discovery, and - even if they were pretty confident in a Silicon Valley jury taking the socially woke side of the case - didn't like how a court case would play out in public. This is a guess on my part, to be clear, but a fairly confident guess. How much would a company pay for positive nationwide publicity? You can treble that for them to avoid negative nationwide publicity.

Damore probably got fairly close to a sensible loss-of-earnings amount. Harmeet Dhillon, his lawyer, probably got 30%-40% of that; maybe on the lower end, because the publicity was worth beaucoup $$ to her.

When your ess-jay-double-yuh's
Cost you many dollars,
That's Damore!

When their memes and blog post
Enrich lawyers the most
That's Damore!

Called it - EU Big Data has no Value

Back in 2014 I said re the EU Big Data project:

The EU is about to hand a couple billion euros to favoured European companies and university research departments, and it's going to get nine tenths of squat all out of it. Mark my words, and check back in 2020 to see what this project has produced to benefit anyone other than its participants.
Well, this is 2020 - what happened?

Here's the bigdatavalue.eu blog - the most recent article?

THE SECOND GOLDEN AGE OF DUTCH SCIENCE, 1850-1914
Posted on February 18, 2019 by admin
Previous post:
PAKISTAN AND AFGHANISTAN: OPPORTUNITIES AND CHALLENGES IN THE WAKE OF THE CURRENT CRISIS
Posted on September 4, 2018 by admin
I guess nothing actually happened then. But hey, it's only €500M....

Per the original article:

The project, which will start work on 1 January 2015, will examine climate information, satellite imagery, digital pictures and videos, transaction records and GPS signals. It will also look at data privacy issues, access rights to data and databases, intellectual property rights and legal aspects of new technical developments such as who holds the rights to automatically generated data.
"Look at", but apparently didn't actually "do" anything...

It's a human tragedy that the UK won't be involved post-Brexit in innovative projects such as this.

2019-09-27

The pace of PACER

Permit me a brief, hilarious diversion into the world of US government corporate IT. PACER is a US federal online system - "Public Access to Court Electronic Records" - which lets people and companies access transcribed records from the US courts. One of their judges has been testifying to the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet, and in the process revealed interesting - and horrifying - numbers.

TL;DR -

  1. it costs at least 4x what it reasonably should; but
  2. any cost savings will be eaten up by increased lawyer usage; nevertheless,
  3. rampant capitalism might be at least a partial improvement; so
  4. the government could upload the PACER docs to the cloud, employ a team of 5-10 to manage the service in the cloud, and save beaucoup $$.
Of course, I could be wrong on point 2, but I bet I'm not.

Background

PACER operates with all the ruthless efficiency we have come to expect from the federal government.[1] It's not free: anyone can register for it, but usage requires a payment instrument (credit card), although charges are waived if you use less than $15 per quarter. The basis of charging is:

All registered agencies or individuals are charged a user fee of $0.10 per page. This charge applies to the number of pages that results from any search, including a search that yields no matches (one page for no matches). You will be billed quarterly.
You would think that, at worst, it would be cost-neutral. One page of black-and-white text at reasonably high resolution is a bit less than 1MB, and (for an ISP) that costs less than 1c to serve over the network. That leaves over 9c per page to cover the machines and people required to store and serve the data - and profit!

Apparently not...

The PACER claims

It was at this point in the article that I fell off my chair:

Fleissig said preliminary figures show that court filing fees would go up by about $750 per case to “produce revenue equal to the judiciary’s average annual collections under the current public access framework.” That could, for example, drive up the current district court civil filing fee from $350 to $1,100, she said.
What the actual expletive? This implies that:
  1. the average filing requests 7500 pages of PACER documents - and that the lawyers aren't caching pages to reduce client costs (hollow laughter); or
  2. the average filing requests 25 PACER searches; or
  3. the average client is somewhere on the continuum between these points.
It seems ridiculously expensive. One can only conclude, reluctantly, that lawyers are not trying to drive down costs for their clients; I know, it's very hard to credit. [2]

And this assumes that 10c/page and $30/search is the actual cost to PACER - let us dig into this.

The operational costs

Apparently PACER costs the government $100M/year to operate:

“Our case management and public access systems can never be free because they require over $100 million per year just to operate,” [Judge Audrey] Fleissig said [in testimony for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet]. “That money must come from somewhere.”
Judge Fleissig is correct in the broad sense - but hang on, $100M in costs to run this thing? How much traffic does it get?

The serving costs

Let's look at the serving requirements:

PACER, which processed more than 500 million requests for case information last fiscal year
Gosh, that's a lot. What's that per second? 3600 seconds/hour x 24 hours/day x 365 days/year is 32 million seconds/year, so Judge Fleissig is talking about... 16 queries per second. Assume that's one query per page. That's laughably small.

Assume that peak traffic is 10x that, and you can serve comfortably 4 x 1MB pages per second on a 100Mbit network connection from a single machine; that's 40 machines with associated hardware, say amortized cost of $2,000/year per machine - implies order of $100K/year on hardware, to ensure a great user experience 24 hours per day 365 days per year. Compared to $100M/year budget, that's noise. And you can save 50% just by halving the number of machines and rejecting excess traffic at peak times.
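If you want to check that arithmetic, it fits in a dozen lines of Go; every input below is one of my assumptions from the paragraph above, not a real PACER figure:

package main

import "fmt"

func main() {
	const (
		requestsPerYear = 500e6  // "more than 500 million requests" last fiscal year
		peakFactor      = 10.0   // assume peak traffic is 10x the average
		pagesPerMachine = 4.0    // 4 x 1MB pages/second on a 100Mbit link
		costPerMachine  = 2000.0 // assumed amortised $/year per machine
	)
	secondsPerYear := 3600.0 * 24 * 365
	avgQPS := requestsPerYear / secondsPerYear
	peakQPS := avgQPS * peakFactor
	machines := peakQPS / pagesPerMachine
	fmt.Printf("average %.0f qps, peak %.0f qps -> ~%.0f machines, ~$%.0fK/year of hardware\n",
		avgQPS, peakQPS, machines, machines*costPerMachine/1000)
}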

The ingestion and storage costs

Perhaps the case ingestion is intrinsically expensive, with PACER having to handle non-standard formats? Nope:

The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A. This change will improve the archiving and preservation of case-related documents.
So PACER ingests PDFs from courts - plus, I assume, some metadata - and serves PDFs to users.

How much data does PACER ingest and hold? This is a great Fermi question; here's a good worked example of an answer, with some data.

There's a useful Ars Technica article on Aaron Swartz that gives us data on the document corpus as of 2013:

PACER has more than 500 million documents
Assume it's doubled as of 2019, that's 1 billion documents. Assume 1MB/page, 10 pages/doc, that's 10^9 docs x 10 MB per doc = 10^10 MB = 1x10^4 TB. That's 1000 x 10TB hard drives. Assume $300/drive, and drives last 3 years, and you need twice the number of drives to give redundancy, that's $200 per 10TB per year in storage costs, or $200K for 10,000 TB. Still, noise compared to $100M/year budget. But the operational costs of managing that storage can be high - which is why Cloud services like Amazon Web Services, Azure and Google Cloud have done a lot of work to offer managed services in this area.
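The same back-of-the-envelope in Go, using purely the assumptions above (doubled corpus, 10 pages of 1MB per document, $300 drives lasting three years, everything stored twice):

package main

import "fmt"

func main() {
	const (
		docs       = 1e9   // assume the 2013 corpus has doubled
		mbPerDoc   = 10.0  // 10 pages x 1MB/page
		driveTB    = 10.0  // capacity per drive
		drivePrice = 300.0 // dollars per drive
		driveYears = 3.0   // assumed useful life
		redundancy = 2.0   // keep two copies of everything
	)
	totalTB := docs * mbPerDoc / 1e6
	drives := totalTB / driveTB * redundancy
	costPerYear := drives * drivePrice / driveYears
	fmt.Printf("%.0f TB corpus -> %.0f drives -> ~$%.0fK/year in raw storage\n",
		totalTB, drives, costPerYear/1000)
}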

Amazon, for instance, charges $0.023 per GB per month for storage (on one price model) - for our 10,000 TB corpus, that's 10,000,000 GB x $0.023, or about $230K/month, $2.8M/year. Still a small slice of the $100M/year budget.

Incidentally Aaron Swartz agrees with the general thrust of my article:

Yet PACER fee collections appear to have dramatically outstripped the cost of running the PACER system. PACER users paid about $120 million in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures show running PACER only costs around $20 million.
A rise in costs of 5x in six years? That's roughly doubling every two and a half years. As noted above, it seems unlikely to be due to serving costs - even though volumes have risen, serving and storage costs have got cheaper. Bet it's down to personnel costs. I'd love to see the accounts break-down. How many people are they employing, and what are those people doing?

The indexing costs - or lack thereof

Indexing words and then searching a large corpus of text is notoriously expensive - that's what my 10c per electronic page is paying for, right? Apparently not:

There is a fee for retrieving and distributing case information for you: $30 for the search, plus $0.10 per page per document delivered electronically, up to 5 documents (30 page cap applies).
It appears that PACER is primarily constructed to deliver responses to "show me the records of case XXXYYY" or "show me all cases from court ZZZ", not "show me all cases that mention 'Britney Spears'." That's a perfectly valid decision but makes it rather hard to justify the operating costs.

Security considerations

Oh, please. These docs are open to anyone who has an account. The only thing PACER should be worried about is someone in Bangalore or Shanghai scraping the corpus, or the top N% of cases, and serving that content for much less cost. Indeed, that's why they got upset at Aaron Swartz. Honestly, though, the bulk of their users - law firms - are very price-insensitive. Indeed, they quite possibly charge their clients 125% or more of their PACER costs, so if PACER doubled costs overnight they'd celebrate.

I hope I'm wrong. I'm afraid I'm not.

Public serving alternatives

I don't know how much Bing costs to operate, but I'd bet a) that its document corpus is bigger than PACER's, b) that its operating costs are comparable, c) that its indexing is better than PACER's, d) that its search is better than PACER's, e) that its page-serving latency is better than PACER's... you get the picture.

Really though, if I were looking for a system to replace this, I'd build off an off-the-shelf solution to translate inbound PDFs to indexed text - something like OpenText - and run a small serving stack on top. That reduces the regular serving cost, since pages are a few KB of text rather than 1MB of PDF, and lets me get rid of all the current people costs associated with the customized search and indexing work on the current corpus.

PACER is a terrible use of government money

Undoubtedly it's not the worst[3], but I'd love for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet to drag Jeff Bezos in to testify and ask him to quote a ballpark number for serving PACER off Amazon Web Services, with guaranteed 100% profit margin.

Bet it's less than 1/4 of the current $100M/year.

[1] Yes, irony
[2] Why does New Jersey have the most toxic waste dumps and California the most lawyers? New Jersey got first choice. [Thanks Mr Worstall!]
[3] Which is terribly depressing.

2018-11-02

Unionism in Silicon Valley - called it

Back in January I made the following prediction:

What do I think? Twitter, Facebook and Google offices in the USA are going to be hit with unionization efforts in the next 12 months, initially as a trial in the most favorable locations but if they succeed then this will be ramped up quickly nationwide. This will be framed as a push to align the companies to approved socially just policies - which their boards mostly favor already - but will be used to leapfrog the activist employees into union-endorsed and -funded positions of influence.

Sure enough, a bunch of Google staff walked out of work today, nominally to protest at ex-Android head Andy Rubin getting a cool $90M in severance after being accused of dubious behaviour with someone in a hotel room, which he denies:

Rubin said in a two-part tweet: “The New York Times story contains numerous inaccuracies about my employment at Google and wild exaggerations about my compensation. Specifically, I never coerced a woman to have sex in a hotel room. These false allegations are part of a smear campaign to disparage me during a divorce and custody battle. Also, I am deeply troubled that anonymous Google executives are commenting about my personnel file and misrepresenting the facts.”
For the record, Rubin sounds a bit sleazy even if you apply a high degree of scepticism to the exact circumstances of the event.

Let's look at the "official" walkout Twitter account, and wonder who's actually driving this organisation:

For posterity, the "demands" are:
  1. An end to Forced Arbitration in cases of harassment and discrimination for all current and future employees.
  2. A commitment to end pay and opportunity inequity.
  3. A publicly disclosed sexual harassment transparency report.
  4. A clear, uniform, globally inclusive process for reporting sexual misconduct safely and anonymously.
  5. Elevate the Chief Diversity Officer to answer directly to the CEO and make recommendations directly to the Board of Directors. Appoint an Employee Rep to the Board.
Points 1-4 seem pretty reasonable - but what does point 5 have to do with the rest of the list? And who would this "Employee Rep" be - a unionisation activist, perchance? $10 says I'm right. This is a classic tactic: take a reasonable area of complaint and use it as a Trojan Horse to sneak in the early stages of unionisation to the company.

Google allegedly employs very smart people. If only they exercised their critical faculties half as well as their intellects, they might be asking uncomfortable questions of the protest organisers about where point 5 came from and who the organisers have in mind to take on "employee rep" duties. I guarantee you that it's not Rob Pike or Jeff Dean.

2018-09-06

Victimhood poker - the implementation

Back in 2006, blogger Marlinschamps proposed the rules for the game of victimhood poker. In a spare couple of hours last weekend, I decided to code this up so that we had an implementation of it. Beloved readers, here is that implementation. It's in Python; I show it in chunks, but it should all go in a single file called e.g. victimhood.py.

First we define the cards in the deck, their points, and their class:

#!/usr/bin/python
# This code is in the public domain. Copy and use as you see fit.
# Original author: http://hemiposterical.blogspot.com/, credit 
# would be nice but is not required.
import random
deck = {
 # Key: (points,class)
 'Black':           (14, 'skin'),
 'Native-American': (13, 'ethnicity'),
 'Muslim':          (12, 'religion'),
 'Hispanic':        (11, 'ethnicity'),
 'Transgender':     (10, 'gender'),
 'Gay':              (9, 'none'),
 'Female':           (8, 'gender'),
 'Oriental':         (7, 'ethnicity'),
 'Handicapped':      (6, 'none'),
 'Satanist':         (6, 'religion'),
 'Furry':            (5, 'none'),
 'Non-Christian':    (4, 'religion'),
 'East-Indian':      (3, 'ethnicity'),
 'Hindu':            (3, 'religion'),
 'Destitute':        (2, 'economic'),
 'White':            (0, 'skin'),
 'Straight':         (0, 'gender'),
 'Christian':        (0, 'religion'),
 'Bourgeois':        (0, 'economic'),
}
# Categories in the order you'd describe someone
category_list = [
 'economic','none','skin','religion','ethnicity','gender',
]
categories = set(category_list)
In addition, a couple of helper functions to make it easier to ask questions about a specific card:
def cardscore(card):
 """ How much does this card score? """
 (s, unused_cls) = deck[card]
 return s

def cardclass(card):
 """ What class does this card represent? """
 (unused_s, cls) = deck[card]
 return cls
Now we define what a "hand" is, with a bunch of functions to make it easier to merge other cards into a hand and compute the best score and hand from these cards:
class Hand(object):
 """ A hand is a list of cards with some associated scoring functions """
 def __init__(self, start_cards=None):
  if start_cards is None:
   self.cards = []
  else:
   self.cards = start_cards[:]

 def add(self, card):
  self.cards.append(card)
  
 def bestscore(self):
  (score, bestcards) = self.besthand()
  return score

 def bestcards(self):
  (score, bestcards) = self.besthand()
  return bestcards

 def besthand(self):
  """ What's the highest possible score for this hand?
  Limitations: one card per class, no more than 5
  cards in total
  Return (score, best_hand)
  """
  score_by_class = { }
  card_by_class = { }
  for card in self.cards:
    try:
      s = cardscore(card)
      card_class = cardclass(card) 
    except KeyError, err:
      raise KeyError("Invalid card name '%s'" % card)
    if card_class not in score_by_class:
      score_by_class[card_class] = s
    if s >= score_by_class[card_class]:
      score_by_class[card_class] = s
      card_by_class[card_class] = card
  # We now have the best scoring card in each
  # class. But we can only use the best 5.
  cards = card_by_class.values()
  cards.sort(key=cardscore, reverse=True)  # best-scoring first, so we keep the top 5
  if len(cards) > 5:
    cards = cards[0:5]
  tot = 0
  for card in cards:
    tot += cardscore(card)
  best_hand = Hand(cards)
  return (tot, best_hand)

 def merge(self, hand):
  """ Merge this hand and another to return a new one """
  ans = self.copy()
  for c in hand.cards:
   ans.add(c)
  return ans

 def copy(self):
  return Hand(self.cards)
 
 def __str__(self):
  return ', '.join(['%s (%d)' % (c, cardscore(c)) for c in self.cards])

 def card_in_class(self,class_name):
  """Returns a card in the given class, if the hand has one"""
  for card in self.cards:
   (s,c) = deck[card] 
   if c == class_name:
    return card
  # No match
  return None

 def description(self):
   card_order = [self.card_in_class(c) for c in category_list]
   card_order = filter(lambda x: x is not None, card_order)
   return ' '.join(card_order)
Now we can define a game with a number of players, and specify how many copies of the deck we want to use for the game:
class Game(object):
 def __init__(self, player_count, deck_multiple=2):
   self.player_count = player_count
   self.deck_multiple = deck_multiple
   self.player_hands = { }
   for i in range(1,1+player_count):
     self.player_hands[i] = Hand()
   self.shuffle_deck()
   self.community = Hand()

 def shuffle_deck(self):
   self.deck = []
   for i in range(self.deck_multiple):
    self.deck.extend(deck.keys())
   random.shuffle(self.deck)

 def deal(self, cards_per_player):
   for p in range(1,1+self.player_count):
     for c in range(cards_per_player): 
       card = self.deck.pop()  # might run out
       self.player_hands[p].add(card)

 def deal_community(self, community_cards):
   self.community = Hand()
   for c in range(community_cards):
    card = self.deck.pop()
    self.community.add(card)

 def get_community(self):
  return self.community

 def best_hand(self, player_num):
   h = self.player_hands[player_num]
   # Expand the hand with any community cards
   h2 = h.merge(self.community)
   return h2.besthand()
Finally, we have some code to demonstrate the game being played. We give 5 cards each to 4 players, and have 3 community cards which they can use. We display each player's best hand and score, and announce the winner:
if __name__ == '__main__':
 player_count=4
 g = Game(player_count=player_count, deck_multiple=2)
 # Everyone gets 5 cards
 g.deal(5)
 # There are 3 community cards
 g.deal_community(3)
 print "Community cards: %s\n" % g.get_community()
 winner = None
 win_score = 0
 for p in range(1,1+player_count):
  (score, hand) = g.best_hand(p)
  print "Player %d scores %d with %s" % (p, score, hand)
  print "  which is a %s" % hand.description()
  if score > win_score:
    winner = p
    win_score = score
 print "\nPlayer %d wins!" % winner

Don't judge my Python, y'all; it's quick and dirty Python 2.7. If I wanted a code review, I'd have set this up in GitHub.

So what does this look like when it runs? Here are a few games played out:


Community cards: Christian (0), Native-American (13), Gay (9)

Player 1 scores 40 with Non-Christian (4), Gay (9), Native-American (13), Black (14)
 which is a Gay Black Non-Christian Native-American
Player 2 scores 22 with Christian (0), Bourgeois (0), Gay (9), Native-American (13)
 which is a Bourgeois Gay Christian Native-American
Player 3 scores 30 with Destitute (2), Satanist (6), Gay (9), Native-American (13)
 which is a Destitute Gay Satanist Native-American
Player 4 scores 42 with Female (8), Gay (9), Muslim (12), Native-American (13)
 which is a Gay Muslim Native-American Female

Player 4 wins!


Community cards: Non-Christian (4), Bourgeois (0), Furry (5)

Player 1 scores 24 with Straight (0), Destitute (2), Non-Christian (4), Furry (5), Native-American (13)
 which is a Destitute Furry Non-Christian Native-American Straight
Player 2 scores 26 with Bourgeois (0), East-Indian (3), Non-Christian (4), Furry (5), Black (14)
 which is a Bourgeois Furry Black Non-Christian East-Indian
Player 3 scores 30 with Bourgeois (0), Non-Christian (4), Furry (5), Oriental (7), Black (14)
 which is a Bourgeois Furry Black Non-Christian Oriental
Player 4 scores 33 with Destitute (2), Handicapped (6), Muslim (12), Native-American (13)
 which is a Destitute Handicapped Muslim Native-American

Player 4 wins!


Community cards: Transgender (10), Muslim (12), Oriental (7)

Player 1 scores 53 with Handicapped (6), Transgender (10), Hispanic (11), Muslim (12), Black (14)
 which is a Handicapped Black Muslim Hispanic Transgender
Player 2 scores 33 with Bourgeois (0), White (0), Transgender (10), Hispanic (11), Muslim (12)
 which is a Bourgeois White Muslim Hispanic Transgender
Player 3 scores 40 with Furry (5), Transgender (10), Muslim (12), Native-American (13)
 which is a Furry Muslim Native-American Transgender
Player 4 scores 37 with Destitute (2), Handicapped (6), Oriental (7), Transgender (10), Muslim (12)
 which is a Destitute Handicapped Muslim Oriental Transgender

Player 1 wins!

What does this prove? Nothing really, it was kinda fun to write, but I don't see any earthshaking philosophical insights beyond the fact that it's a rather silly game. But then, that's true for its real life analogue as well.

Programming challenge: build a function to instantiate a Hand() from a string e.g. "black east-indian handicapped female" and use this to calculate the canonical score. Bonus points if you can handle missing hyphens.

2018-07-08

How to kill Trusteer's Rapport stone dead

If you, like me, have had to wrangle with a slow and balky family member's Mac, you may have found the root cause of the slowness to be Rapport. This is an IBM-branded piece of "security" software, and has all the user friendliness and attention to performance and detail that we expect from Big Blue - to wit, f-all.

I therefore followed the comprehensive instructions on uninstalling Rapport which were fairly easy to step through and complete. Only problem - it didn't work. The rapportd daemon was still running, new programs were still very slow to start, and there was no apparent way forward.

Not dissuaded, I figured out how to drive a stake through its heart. Here's how.

Rapport start-up

Rapport installs a configuration in OS X launchd which ensures its daemon (rapportd) is started up for every user. The files in /Library/LaunchAgents and /Library/LaunchDaemons are easy to remove, but the original files are in /System/Library/LaunchAgents and /System/Library/LaunchDaemons, and you need to kill those to stop Rapport.

However, System Integrity Protection (SIP) on OS X El Capitan and later prevents you from deleting files under /System - even as root.

Given that, the following instructions will disable SIP on your Mac, remove the Rapport files, and re-enable SIP. You should be left with a Mac that is no longer burdened by Rapport.

Check whether Rapport is running

From a Terminal window, type
ps -eaf | grep -i rapport
If you see one or more lines mentioning rapportd then you have Rapport running and you should keep going; if not, your problems lie elsewhere.

Disable SIP

Reboot your machine, and hold down COMMAND+R as the machine restarts. This brings you into Recovery mode. From the menu bar, choose Utilities → Terminal to open up a Terminal window. Then type
csrutil disable
exit

Now reboot and hold down COMMAND+S as the machine restarts to enter single-user mode (a black background and white text).

Find and delete the Rapport files

You'll need to make your disk writeable, so enter the two commands (which should be suggested in the text displayed when you enter single user mode):
/sbin/fsck -fy
/sbin/mount -uw /

Now
cd /System/Library/LaunchAgents
and look for the Rapport files:
ls *apport*
You can then remove them:
rm com.apple.RapportUI*
rm com.apple.rapport*

Then
cd ../LaunchDaemons
and look for the Rapport files there:
ls *apport*
You can then remove them too:
rm com.apple.rapportd*

Restore SIP

Rapport should now be dead, but you should re-enable SIP. Reboot and hold down COMMAND+R to go back to Recovery mode. From the menu bar, choose Utilities → Terminal to open up a Terminal window. Then type
csrutil enable
exit

Reboot, and you should be done. Open a Terminal window, type
ps -eaf | grep -i rapport
and verify that rapportd no longer appears.

2017-08-16

Since we can't challenge diversity policy, how to prevent mistakes?

The James Damore affair at Google has made it very clear that discussion of companies' diversity policy is completely off the table. When I say "discussion" here, I mean "anything other than adulation". I've seen plenty of the latter in the past week. The recent 'letter from Larry Page' in The Economist was a classic example. It was in desperate need of someone tagging it with a number of [citation needed] starting from paragraph 4:

You’re wrong. Your memo was a great example of what’s called “motivated reasoning” — seeking out only the information that supports what you already believe. It was derogatory to women in our industry and elsewhere [CN]. Despite your stated support for diversity and fairness, it demonstrated profound prejudice[CN]. Your chain of reasoning had so many missing links[CN] that it hardly mattered what you based your argument on. We try to hire people who are willing to follow where the facts lead, whatever their preconceptions [CN]. In your case we clearly got it wrong.

Let's accept, for the sake of argument, that random company employees questioning diversity policy is off the table. This is not an obviously unreasonable constraint, given the firestorm from Damore's manifesto. Then here's a question for Silicon Valley diversity (and leadership) types: since we've removed the possibility of employee criticism from your diversity policy, what is your alternative mechanism for de-risking it?

In all other aspects of engineering, we allow - nay, encourage - ideas and implementations to be tested by disinterested parties. As an example, the software engineering design review pits the software design lead against senior engineers from other development and operational teams who have no vested interest in the new software launching, but a very definite interest in the software not being a scaling or operational disaster. They will challenge the design lead with "what if..." and "how have you determined capacity for metric X..." questions, and expect robust answers backed by data. If the design lead's answers fall short, the new software will not progress to implementation without the reviewer concerns being addressed.

Testing is often an adversarial relationship: the testing team tries to figure out ways that new software might break, and craft tests to exploit those avenues. When the test reveals shortcomings in the software, the developer is not expected to say "well, that probably won't happen, we shouldn't worry about it" and blow off the test. Instead they either discuss the requirements with the tester and amend the test if appropriate, or fix their code to handle the test condition.

Netflix's Chaos Monkey subjects a software service to adverse operational conditions. The software designer might assert that the service is "robust" but if Chaos Monkey creates a reasonably foreseeable environment problem (e.g. killing 10% of backend tasks) and the service starts to throw errors at 60% of its queries, it's not Chaos Monkey which is viewed as the problem.

Even checking-in code - an activity as integral to an engineer's day as operating the coffee machine - is adversarial. For any code that hits production, the developer will have to make the code pass a barrage of pre-existing functional and syntax checks, and then still be subject to review by a human who is generally the owner of that section of code. That human expects new check-ins to improve the operational and syntactic quality of the codebase, and will challenge a check-in that falls short. If the contributing engineer asserts something like "you don't appreciate the beauty of the data structure" in reply, they're unlikely to get check-in approval.

Given all this, why should diversity plans and implementations - as a critical component of a software company - be immune to challenge? If we have decided that engineer-authored manifestos are not an appropriate way to critically analyse a company's diversity system then what is the appropriate way?

Please note that there's a good reason why the testing and development teams are different, why representatives from completely different teams are mandatory attendees of design reviews, and why the reviewer of new code should in general not be someone who reports to the person checking in the code. The diversity team - or their policy implementors - should not be the sole responders to challenges about the efficacy of their own systems.

2017-05-12

Downsides of an IT monolith (NHS edition)

I have been watching, with no little schadenfreude (trans. "damage joy"), today's outage of many NHS services as a result of a ransomware attack.

This could happen to anyone, n'est-ce pas? The various NHS trusts affected were just unlucky. They have many, many users (admin staff in each GP's surgery; nurses, auxiliaries and doctors rushing to enter data before dashing off to the next patient). So why is it unsurprising that this is happening now?

The NHS is an organisational monolith. It makes monolithic policy announcements. As a result of those policies, Windows XP became the canonical choice for NHS PCs. It is still the canonical choice for NHS PCs. Windows XP launched to the public in late 2001. Microsoft ended support for Windows XP in April 2014. Honestly, I have to give Microsoft kudos for this (oh, that hurts) because they kept XP supported way beyond any reasonable timeframe. But all good things come to an end, and security updates are no longer built for XP. The NHS paid Microsoft for an extra year of security patches but decided not to extend that option beyond 2015, presumably because no-one could come up with a convincing value proposition for it. Oops.

The consequences of this were inevitable, and today we saw them. A huge userbase of Internet-connected PCs no longer receiving security updates is going to get hit by something - they were a bit unlucky that it was ransomware, which is harder to recover from than a straight service-DoS, but this was entirely foreseeable.

Luckily the NHS mandates that all critical operational data be backed up to central storage services, and that its sites conduct regular data-restore exercises. Doesn't it? Bueller?

I don't want to blame the central NHS IT security folks here - I'm sure they do as good a job as possible in an impossible-to-manage environment, and that the central patient data is fairly secure. However, if you predicate effective operations for most of the NHS on data stored on regular PCs then you really want to be sure that they are secure. Windows XP has been end-of-support for three gold-durned years at this point, and progress in getting NHS services off it has been negligible. You just know that budget for this migration got repurposed for something else more time-sensitive "temporarily".

This is a great example of organisational inertia, in fact maybe a canonical one. It was going to be really hard to argue for a massively expensive and disruptive change, moving all NHS desktops to a less-archaic OS - Windows 10 seems like a reasonable candidate, but would still probably require a large proportion of desktops and laptops to be replaced. As long as nothing was on fire, there would be a huge pushback on any such change with very few people actively pushing for it to happen. So nothing would happen - until now...

Please check back in 2027, when the NHS will have been on Windows 10 for eight years - two years past its end-of-life - and the same thing will be happening again.

2016-11-24

Expensive integer overflows, part N+1

Now that the European Space Agency has published its preliminary report into what happened with the Schiaparelli lander, it confirms what many had suspected:

As Schiaparelli descended under its parachute, its radar Doppler altimeter functioned correctly and the measurements were included in the guidance, navigation and control system. However, saturation – maximum measurement – of the Inertial Measurement Unit (IMU) had occurred shortly after the parachute deployment. The IMU measures the rotation rates of the vehicle. Its output was generally as predicted except for this event, which persisted for about one second – longer than would be expected. [My italics]
This is a classic software mistake - of which more later - where a stored value becomes too large for its storage slot. The lander was spinning faster than its programmers had estimated, and the measured rotation speed exceeded the maximum value which the control software was designed to store and process.
When merged into the navigation system, the erroneous information generated an estimated altitude that was negative – that is, below ground level.
The stream of estimated altitude readings would have looked something like "4.0km... 3.9km... 3.8km... -200km". Since the most recent value was below the "cut off parachute, you're about to land" altitude, the lander obligingly cut off its parachute, gave a brief fire of the braking thrusters, and completed the rest of its descent under Mars' gravitational acceleration of 3.7m/s^2. That's a lot weaker than Earth's, but 3.7km of freefall gave the lander plenty of time to accelerate; a back-of-the-envelope calculation (v^2 = 2as) suggests an impact speed of around 165 m/s, less whatever drag takes off.

Well, there goes $250M down the drain. How did the excessive rotation speed cause all this to happen?

When dealing with signed integers, if - for instance - you are using 16 bits to store a value then the classic two's-complement representation can store values between -32768 and +32767 in those bits. If you add 1 to the stored value 32767 then the effect is that the stored value "wraps around" to -32768; sometimes this is what you actually want to happen, but most of the time it isn't. As a result, everyone writing software knows about integer overflow, and is supposed to take account of it while writing code. Some programming languages (e.g. C, Java, Go) require you to manually check that this won't happen; code for this might look like:

/* Will not work if b is negative.
 * INT16_MAX comes from <stdint.h>. */
if (INT16_MAX - b >= a) {
   /* a + b will fit */
   result = a + b;
} else {
   /* a + b will overflow; return the biggest
    * positive value we can
    */
   result = INT16_MAX;
}
Other languages (e.g. Ada) allow you to trap this in a run-time exception, such as Constraint_Error. When this exception arises, you know you've hit an overflow and can have some additional logic to handle it appropriately. The key point is that you need to consider that this situation may arise, and plan to detect it and handle it appropriately. Simply hoping that the situation won't arise is not enough.

This is why the "longer than would be expected" line in the ESA report particularly annoys me - the software authors shouldn't have been "expecting" anything; they should have had an actual plan for handling sensor values outside the expected range. They could have capped the value at its expected maximum, they could have rejected that sensor's reading and fallen back to a less accurate calculation without it, they could have bounded the calculation's result based on the last known good altitude and velocity - there are many options. But they should have done something.
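Not flight code, obviously, but here's a minimal sketch in Python (with a made-up limit) of the "cap it and flag it" option, so that the code downstream at least knows the reading is suspect:

# A minimal sketch of defensive sensor handling: saturate an out-of-range
# IMU rotation rate at the sensor's limit and flag it as untrustworthy.
# The limit below is an illustrative value, not Schiaparelli's.
IMU_MAX_RATE = 100.0  # deg/s

def condition_imu_rate(raw_rate):
    """Return (rate, is_valid): clamp to the sensor's range, flag saturation."""
    if raw_rate > IMU_MAX_RATE:
        return IMU_MAX_RATE, False   # saturated high
    if raw_rate < -IMU_MAX_RATE:
        return -IMU_MAX_RATE, False  # saturated low
    return raw_rate, True

rate, ok = condition_imu_rate(153.2)
if not ok:
    # e.g. hold the last known-good attitude estimate, widen the error
    # bounds, or drop this sensor from the altitude calculation entirely
    pass
print(rate)  # 100.0 - clamped, rather than silently wrapped or blindly trusted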

Reading the technical specs of the Schiaparelli Mars Lander, the interesting bit is the Guidance, Navigation and Control system (GNC). There are several instruments used to collect navigational data: inertial navigation systems, accelerometers and a radar altimeter. The signals from these instruments are collected, processed through analogue-to-digital conversion and then sent to the spacecraft. The spec proudly announces:

Overall, EDM's GNC system achieves an altitude error of under 0.7 meters
Apparently, the altitude error margin is a teeny bit larger than that if you don't process the data robustly.

What's particularly tragic is that arithmetic overflow has been well established as a failure mode for ESA space flight for more than 20 years. The canonical example is the Ariane 5 failure of 4th June 1996, where ESA's new Ariane 5 rocket went out of control shortly after launch and had to be destroyed, sending $500M of rocket and payload up in smoke. The root cause was an overflow while converting a 64 bit floating point number to a 16 bit integer. In that case, the software authors had explicitly identified the risk of overflow in 7 places in the code, but for some reason only added error handling code for 4 of them. One of the remaining cases was triggered, and "foom!"

It's always easy in hindsight to criticise a software design after an accident, but in the case of Schiaparelli it seems reasonable to have expected a certain amount of foresight from the developers.

ESA's David Parker notes "...we will have learned much from Schiaparelli that will directly contribute to the second ExoMars mission being developed with our international partners for launch in 2020." I hope that's true, because they don't seem to have learned very much from Ariane 5.

2016-02-20

Analysing the blue-red hat problem in the face of user error

Everyone knows computers are getting smarter - unless they're being programmed by a major corporation for a government contract - but there has recently been another leap in the level of smart. DeepMind (now part of Google) has built an AI that has successfully deduced the optimal solution to the hat problem:

100 prisoners stand in line, one in front of the other. Each wears either a red hat or a blue hat. Every prisoner can see the hats of the people in front – but not their own hat, or the hats worn by anyone behind. Starting at the back of the line, a prison guard asks each prisoner the colour of their hat. If they answer correctly, they will be pardoned [and if not, executed]. Before lining up, the prisoners confer on a strategy to help them. What should they do?
Tricky, n'est-ce pas?

The obvious part first: the first prisoner to answer, whom we'll designate number 1, has no information about his hat colour. Assuming blue and red hats are assigned with equal probability, he can answer either "red" or "blue" with a 50% chance of success and 50% chance of getting executed; he has no better strategy for self-survival. What about the other prisoners?

Applying a little information theory: the system has 100 binary bits of state - 100 people, each with 1 bit of state recording whether their hat is blue or not. Nobody can see the hat of the first person to answer, so nothing anyone says can carry information about it; the best any strategy can hope for is to reliably determine 99 of the 100 hat values. How can we get close to this?

If everyone just guesses their own hat colour randomly, or everyone says "blue", or everyone says "red", then on average 50% of people survive. How to do better? We need to communicate information to people further down the line about their hat colour.

Let's get the first 50 people in line to tell the next 50 people in line about their hat colour. Person 1 announces the hat colour of person 51, person 2 of person 52 and so on. So the last 50 people are guaranteed to survive because they have been told their hat colour. The first 50 people each have a 50-50 chance of survival because the colour they "guess" has no necessary relation to the colour of their hat. On average 25 of them survive, giving an average survival of 75% of people.

The DeepMind algorithm relies on an insight based on the concept of parity: a 0/1 value encapsulating critical state, in this case the number of blue hats seen and guessed, modulo 2. The first user counts the number of blue hats seen and says "blue" if that number is even, and "red" if odd. He still has a 50-50 chance of survival because he has no information about his hat. The second user counts the number of blue hats he can see. If even, and the first person said "blue", then he and the first person both saw the same number of blue hats - so his own hat must be red. If even, and the first person said "red", his hat must be blue because it changed the number of blue hats seen between the first person and him. Similar reasoning on the odd case means that he can announce his hat colour with full confidence.

What about person 3? He has to listen to person 1 and person 2, and observe the hat colours in front of him, to deduce whether his hat is blue; his strategy, which works for all others after him too, is to add the parity values (1 for blue, 0 for red) of the answers heard and the hats seen, take the total modulo 2, and if 0 then announce "blue", if 1 then announce "red". Follow this down the line, and persons 2 through 100 are guaranteed survival while person 1 has a 50-50 chance, for an average 99.5% survival rate.

Of course, this is a fairly complicated algorithm. What if someone mis-counts - what effect does it have? We don't want a fragile algorithm where one person's error can mess up everyone else's calculations, as in "Chinese whispers". Luckily, a bit of thought (confirmed by experiment) shows us that both the future-casting and parity approaches are resilient to individual error. For future-casting, if one of the first 50 people makes an error then it makes no difference to their own chance of survival, but their correspondent in the second half of the line is doomed. If one of the second 50 people makes an error then they are doomed unless their correspondent also made a mistake - unlikely, a 10% chance. So with a 10% error rate each second-half person survives with probability 0.9 x 0.9 + 0.1 x 0.1 = 0.82, the first half still averages 25 survivors, and the expected total is 25 + 50 x 0.82 = 66 people.

Surprisingly, the parity approach is also robust. It turns out that if user N makes a mistake then they doom themselves, and also doom user N+1 who relies on user N's calculation. But because both user N and N+1 make erroneous guesses, this brings the parity value back in line for user N+2, whose guess will be correct (absent any other errors). So the approximate number of survivors given a 10% error rate is 99.5 - 10*2 = 79.5%
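As a quick sanity check on those two estimates before running the simulation below:

# Rough analytic estimates of expected survivors at a 10% error rate,
# for comparison with the simulation results below.
p_err = 0.1

# "Warn the second half": the first 50 survive at 50% regardless of errors;
# a second-half person survives if both they and their first-half
# correspondent guess error-free, or if both err (the two flips cancel).
future_survivors = 50 * 0.5 + 50 * ((1 - p_err) ** 2 + p_err ** 2)

# Parity: ~99.5 survivors with no errors; each error dooms the person who
# made it plus the next person in line (ignoring overlapping errors).
parity_survivors = 99.5 - 2 * (100 * p_err)

print(future_survivors)  # 66.0
print(parity_survivors)  # 79.5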

Here's Python code to test the various algorithms: save it as "hats.py" and run it (e.g. "chmod 0755 hats.py ; ./hats.py" on OS X or Linux). It runs 10 trials of 100 people and reports the average number of survivors, based on a 10% error rate in hat wearers following their strategy. Default strategy is the parity algorithm.

#!/usr/bin/python

import random

person_count = 100
half_person_count = person_count / 2
# Hat choices
hat_choices = ['r','b']
hat_opposite = {'b':'r', 'r':'b'}
# 10% error rate in guesses
error_rate = 0.1

def guess_constant(heard_guesses, seen_hats):
    return 'b'

def guess_random(heard_guesses, seen_hats):
    return random.choice(hat_choices)

def guess_future(heard_guesses, seen_hats):
    """ First half of list calls out hat of correspondent in second half of list """
    full_list = heard_guesses + ['x'] + seen_hats
    my_index = len(heard_guesses)
    if my_index < half_person_count:
        # Call out the hat of the person in the second half of the list, hope same as mine
        return full_list[my_index+half_person_count]
    else:
        # Remember what was called out by my corresponding person in first half of list
        return heard_guesses[my_index - half_person_count]

def guess_parity(heard_guesses, seen_hats):
    """ Measure heard and seen parity of blue hats, call out blue for even, red for odd."""
    heard_blue_count = len([g for g in heard_guesses if g == 'b'])
    seen_blue_count = len([s for s in seen_hats if s == 'b'])
    if (heard_blue_count + seen_blue_count) % 2 == 0:
        return 'b'
    else:
        return 'r'

def run_test(guess_fun):
    hat_list = [ random.choice(hat_choices) for i in range(0, person_count) ]
    print "Actual: " + "".join(hat_list)
    answer_list = []
    score_list = []
    error_list = []
    correct = 0
    for i in range(0, person_count):
        guess = guess_fun(answer_list, hat_list[i+1:])
        if random.random() < error_rate:
            guess = hat_opposite[guess]
            error_list.append('X')
        else:
            error_list.append('-')
        answer_list.append(guess)
        if guess == hat_list[i]:
            correct += 1
            score_list.append('-')
        else:
            score_list.append('X')
    print "Called: " + "".join(answer_list)
    print "Score:  " + "".join(score_list)
    print "Errors: " + "".join(error_list)
    print "%d correct" % correct
    return correct

if __name__ == "__main__":
    trial_count = 10
    correct_total = 0
    for i in range(0, trial_count):
        print "\nTrial %d" % (i+1)
        correct_total += run_test(guess_parity)
    print "\nAverage correct: %d" % (correct_total / trial_count)
You can change the "guess_parity" value in the run_test() invocation on the penultimate line to "guess_future" for the "warn the second half" strategy, or "guess_random" for the random choice.

This is a lousy problem for use in software engineering job interviews, by the way. It's a famous problem, so candidates who have heard it are at a major advantage over those who haven't. It relies on a key and non-obvious insight. A candidate who hasn't encountered the problem before and solves it gives a very strong "hire" signal, but a candidate who fails to find the optimal solution could still be a valid hire. The least worst way to assess candidates based on this problem is whether they can write code to evaluate these algorithms, once the algorithms are described to them.

2016-01-20

Putting Twitter's loss in perspective

Today, Twitter (NYSE symbol TWTR) lost 7% of its value to close at $16.69/share, at a market cap of $11.4bn. That's a loss of approximately $800m of market value.

To put that in perspective, that's 8M $100 bills. The NYSE (New York Stock Exchange) is open from 9:30am to 4pm; 6.5 hours, or 23,400 seconds. A well-tuned toilet flush cycle is 35 seconds, so you could get in 668 back-to-back flushes during NYSE opening hours. Therefore you'd have to flush 12,000 $100 bills each time in order to match TWTR's loss. At 150 bills/stack that's 80 stacks, and I can't see you getting more than 1 stack per flush in a single toilet, so I would characterise today's loss as a rate of 80 NYSE-toilets.

I hesitate to ascribe all this loss to Twitter's de-verification of arch-gay-conservative @Nero on 9th January when Twitter was $20, but its share price has descended in more or less a straight line since then. Today the NYSE actually went very slightly up but Twitter still plummeted.

It certainly wasn't helped by 6 hours of partial unavailability of Twitter today, but I suspect that was merely the straw that broke the camel's back.

2015-06-21

The spectacular kind of hardware failure

Gentle reader, I have attempted several times to pen my thoughts on the epic hack of the US Office of Personnel Management that compromised the security information of pretty much everyone who works for the US government, but I keep losing my vision and hearing a ringing in my ears when I try to do so. So I turn to a lesser-known and differently-awesome fail: the US visa system.

Since a computer failure on the 26th of May - over three weeks ago - the US embassies and consulates worldwide have been basically unable to issue new visas except in very limited circumstances. You haven't heard much about this because it hasn't really affected most US citizens, but believe me it's still a big issue. It seems that they're not expecting the system to be working again until next week at the earliest. Estimates of impacted users are on the order of 200,000-500,000; many people are stuck overseas, unable to return to the USA until their visa renewal is processed.

What happened? The US Department of State has a FAQ but it is fairly bland, just referring to "technical problems with our visa systems" and noting "this is a hardware failure, and we are working to restore system functions".

So a hardware failure took out nearly the entire system for a month. The most common cause of this kind of failure is a large storage system - either a mechanical failure that prevents access to all the data you wrote on the disks, or a software error that deleted or overwrote most of the data on there. This, of course, is why we have backups - once you discover the problem, you replace the drive (if broken) and then restore your backed up data from the last known good state. You might then have to apply patches on top to cover data that was written after the backup, but the first step should get you 90%+ of the way there. Of course, this assumes that you have backups and that you are regularly doing test restores to confirm that what you're backing up is still usable.

The alternative is the failure of a single relatively large machine. If you're running something comparable to the largest databases in the world, you're going to be using fairly custom hardware. If it goes "foom", e.g. because its motherboard melts, you're completely stuck until an engineer can come over with the replacement part and fix it. If the part is not replaceable, you're going to have to buy an entirely new machine - and move the old one out, and install the new one, and test it, and hook it up to the existing storage, and run qualification checks... But even that should still be on the order of a week.

A clue comes from a report of the State Department:

"More than 100 engineers from the government and the private sector [my emphasis] are working around the clock on the problem, said John Kirby, State Department spokesman, at a briefing on Wednesday.
You can't use 100 engineers to replace a piece of hardware. They simply won't fit in your server room. This smells for all the world like a mechanical or software failure affecting a storage system where the data has actually been lost. My money is on backups that weren't actually backing up data, or backing it up in a form that needed substantial manual intervention to restore, e.g. a corrupted database index file which would need every single piece of data to be reindexed. Since they've roped in private sector engineers, they're likely from whoever supplied the hardware in question: Oracle or IBM, at a guess.

The US Visa Office issues around 10 million non-immigrant visas per year, which are fairly simple, and about 500,000 immigrant visas per year which are a lot more involved with photos, other biometrics, large forms and legal papers. Say one of the latter takes up 100MB (a hi-res photo is about 5MB) and one of the former takes up 5MB; then that's a total of about 100TB per year. That's a lot of data to process, particularly if you have to build a verification system from scratch.
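The arithmetic, for anyone who wants to fiddle with the assumptions:

# Rough annual data volume for the visa system, using the guesses above.
MB = 10 ** 6
TB = 10 ** 12

nonimmigrant = 10 * 1000 * 1000 * 5 * MB   # ~10M simple applications at ~5MB each
immigrant = 500 * 1000 * 100 * MB          # ~500K involved applications at ~100MB each

print((nonimmigrant + immigrant) / TB)     # 100 TB per year, give or take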

I'd love to see a report on this from the Government Accountability Office when the dust settles, but fear that the private sector company concerned will put pressure on to keep the report locked up tight "for reasons of commercial confidentiality and government security". My arse.

2015-04-02

Active attack on an American website by China Unicom

I wondered what the next step in the ongoing war between Western content and Chinese censorship might be. Now we have our answer.

"Git" is a source code repository system which allows programmers around the world to collaborate on writing code: you can get a copy of a software project's source code onto your machine, play around with it to make changes, then send those changes back to Git for others to pick up. Github is a public website (for want of a more pedantic term) which provides a repository for all sorts of software and similar projects. The projects don't actually have to be source code: anything which looks like plain text would be fine. You could use Github to collaborate on writing a book, for instance, as long as you used mostly text for the chapters and not e.g. Microsoft Word's binary format that makes it hard for changes to be applied in sequence.

Two projects hosted on Github are "greatfire" and "cn-nytimes", which are, respectively, a mirror for the Greatfire.org website focused on the Great Firewall of China, and a Chinese translation of New York Times stories. These are, obviously, not something to which the Chinese government wants its citizenry to have unfettered access. However, Github hosts many other non-controversial software projects, and is actually very useful to many software developers in China. What to do?

Last week a massive Distributed Denial of Service (DDoS) attack hit Github:

The attack began around 2AM UTC on Thursday, March 26, and involves a wide combination of attack vectors. These include every vector we've seen in previous attacks as well as some sophisticated new techniques that use the web browsers of unsuspecting, uninvolved people to flood github.com with high levels of traffic. Based on reports we've received, we believe the intent of this attack is to convince us to remove a specific class of content. [my italics]
Blocking Github at the Great Firewall - which is very easy to do - was presumably regarded as undesirable because of its impact on Chinese software businesses. So an attractive alternative was to present the Github team with a clear message that until they discontinued hosting these projects they would continue to be overwhelmed with traffic.

If this attack were just a regular DDoS by compromised PCs around the world it would be relatively trivial to stop: just block the Internet addresses (IPs) of the compromised PCs until traffic returns to normal levels. But this attack is much more clever. It intercepts legitimate requests from worldwide web browsers for a particular file hosted on China's Baidu search engine, and substitutes a response containing code that commands repeated requests for pages from the two controversial projects on Github. There's a good analysis from Netresec:

In short, this is how this Man-on-the-Side attack is carried out:
1. An innocent user is browsing the internet from outside China.
2. One website the user visits loads a JavaScript from a server in China, for example the Baidu Analytics script that often is used by web admins to track visitor statistics (much like Google Analytics).
3. The web browser's request for the Baidu JavaScript is detected by the Chinese passive infrastructure as it enters China.
4. A fake response is sent out from within China instead of the actual Baidu Analytics script. This fake response is a malicious JavaScript that tells the user's browser to continuously reload two specific pages on GitHub.com.

The interesting question is: where is this fake response happening? We're fairly sure that it's not at Baidu themselves, for reasons you can read in the above links. Now Errata Security has done a nice bit of analysis that points the finger at the Great Firewall implementation in ISP China Unicom:

By looking at the IP addresses in the traceroute, we can conclusively prove that the man-in-the-middle device is located on the backbone of China Unicom, a major service provider in China.
That existing Great Firewall implementors have added this new attack functionality fits with Occam's Razor. It's technically possible for China Unicom infrastructure to have been compromised by patriotically-minded independent hackers in China, but given the alternative that China Unicom have been leant on by the Chinese government to make this change, I know what I'd bet my money on.

This is also a major shift in Great Firewall operations: this is the first major case I'm aware of that has them focused on inbound traffic from non-Chinese citizens.

Github look like they've effectively blocked the attack, after a mad few days of scrambling, and kudos to them. Now we have to decide what the appropriate response is. It seems that any non-encrypted query to a China-hosted website would be potential fair game for this kind of attack. Even encrypted (https) requests could be compromised, but that would be a huge red arrow showing that the company owning the original destination (Baidu in this case) had been compromised by the attacker: this would make it 90%+ probable that the attacker had State-level influence.

If this kind of attack persists, any USA- or Europe-focused marketing effort by Chinese-hosted companies is going to be thoroughly torpedoed by the reasonable expectation that web traffic is going to be hijacked for government purposes. I wonder whether the Chinese government has just cut off its economic nose to spite its political face.

2014-12-16

The 2038 problem

I was inspired - perhaps that's not quite the right word - by this article on the Year 2038 bug in the Daily Mail:

Will computers be wiped out on 19 January 2038? Outdated PC systems will not be able to cope with time and date, experts warn

Psy's Gangnam Style was recently viewed so many times on YouTube that the site had to upgrade the way figures are shown on the site.
  1. The site 'broke' because it runs on a 32-bit system, which uses four-bytes
  2. These systems can only handle a finite number of binary digits
  3. A four-byte format assumes time began on 1 January, 1970, at 12:00:00
  4. At 03:14:07 UTC on Tuesday, 19 January 2038, the maximum number of seconds that a 32-bit system can handle will have passed since this date
  5. This will cause computers to run negative numbers, and dates [sic]
  6. Anomaly could cause software to crash and computers to be wiped out
I've numbered the points for ease of reference. Let's explain to author Victoria Woollaston (Deputy Science and Technology editor) where she went wrong. The starting axiom is that you can represent 4,294,967,296 distinct numbers with 32 binary digits of information.

1. YouTube didn't (as far as I can see) "break".

Here's the original YouTube post on the event on Dec 1st:

We never thought a video would be watched in numbers greater than a 32-bit integer (=2,147,483,647 views), but that was before we met PSY. "Gangnam Style" has been viewed so many times we had to upgrade to a 64-bit integer (9,223,372,036,854,775,808)!
When they say "integer" they mean it in the correct mathematical sense: a whole number which may be negative, 0 or positive. Although 32 bits can represent 4bn+ numbers as noted above, if you need to represent negative numbers as well as positive then you need to reserve one of those bits to represent that information (all readers about to comment about two's complement representation can save themselves the effort, the difference isn't material.) That leaves you just over 2bn positive and 2bn negative numbers. It's a little bit surprising that they chose to use integers rather than unsigned (natural) numbers as negative view counts don't make sense but hey, whatever.
Presumably they saw Gangnam Style reach 2 billion views and decided to pre-emptively upgrade their views field from a signed 32 bit to a signed 64 bit integer. This is likely not a trivial change: if you're using a regular database, you'd do it via a schema change that requires reprocessing the entire table, and I'd guess that YouTube's is quite big. Still, the new field seemed to be in place by the time the view count hit the signed 32 bit integer limit.
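For reference, the actual limits being discussed:

# The integer limits in question (a quick check in Python).
print(2 ** 31 - 1)  # 2147483647 - largest signed 32-bit value, where the old view counter would wrap
print(2 ** 32)      # 4294967296 - distinct values representable in 32 bits
print(2 ** 63)      # 9223372036854775808 - the figure YouTube quoted (the largest signed 64-bit value is one less)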

2. All systems can only handle a finite number of binary digits.

For fuck's sake. We don't have infinite storage anywhere in the world. The problem is that the finite number of binary digits (32) in 4-byte representation is too small. 8 byte representation has twice the number of binary digits (64, which is still finite) and so can represent many more numbers.

3. The number of bytes has no relationship to the information it represents.

Unix computers (Linux, BSD, OS X etc.) represent time as seconds since the epoch. The epoch is defined as 00:00:00 Coordinated Universal Time (UTC - for most purposes, the same as GMT), Thursday, 1 January 1970. The Unix standard was to count those seconds in a 32 bit signed integer. Now it's clear that 03:14:08 UTC on 19 January 2038 will see that number of seconds exceed what can be stored in a 32 bit signed integer, and the counter will wrap around to a negative number. What happens then is anyone's guess and very application dependent, but it's probably not good.
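You can check the date of doom for yourself:

# When does a signed 32-bit count of seconds-since-epoch run out?
from datetime import datetime

print(datetime.utcfromtimestamp(2 ** 31 - 1))  # 2038-01-19 03:14:07 - one second later, a 32-bit time_t wraps negative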
There is a move towards 64-bit computing in the Unix world, which will include migration of these time representations to 64 bit. Because this move is happening now, we have 23 years to complete it before we reach our Armageddon date. I don't expect there to be many 32 bit systems left operating by then - their memory will be rotted, their disk drives stuck. Only emulated systems will still be working, and everyone who runs those knows about the 2038 problem.

4. Basically correct, if grammatically poor

5. Who taught you English, headline writer?

As noted above, what will actually happen on the date in question is heavily dependent on how each program using the information behaves. The most likely result is a crash of some form, but you might see corruption of data before that happens. It won't be good. Luckily it's easy to test programs by just advancing the clock forwards and seeing what happens when the time ticks over. Don't try this on a live system, however.

6. Software crash, sure. Computer being "wiped out"? Unlikely

I can see certain circumstances where a negative date could cause a hard drive to be wiped, but I'd expect it to be more common for hard drives to be filled up - if a janitor process is cleaning up old files, it'll look for files with modification time below a certain value (say, all files older than 5 minutes ago). Files created before the positive-to-negative date point won't be cleaned up by janitors running after that point. So we leave those stale files lying around, but files created after that will still be eligible for clean-up - they have a negative time which is less than the janitor's negative measurement point.

I'm sure there will be date-related breakage well before we reach 2038 - if a bank system manages 10-year bonds, it will start handling maturity dates beyond January 2038 in 2028, so that's when the breakage will show up. But hey, companies are already selling 50 year bonds, so some bank systems have had to deal with this problem already.

Thank goodness that I can rely on the Daily Mail journalists' expertise in all the articles that I don't actually know anything about.

2014-10-22

State-endorsed web browsers turn out to be bad news

Making the headlines in the tech world this week has been evidence of someone trying to man-in-the-middle Chinese iCloud users:

Unlike the recent attack on Google, this attack is nationwide and coincides with the launch today in China of the newest iPhone. While the attacks on Google and Yahoo enabled the authorities to snoop on what information Chinese were accessing on those two platforms, the Apple attack is different. If users ignored the security warning and clicked through to the Apple site and entered their username and password, this information has now been compromised by the Chinese authorities. Many Apple customers use iCloud to store their personal information, including iMessages, photos and contacts. This may also somehow be related again to images and videos of the Hong Kong protests being shared on the mainland.
MITM attacks are not a new phenomenon in China but this one is widespread, and clearly needs substantial resources and access to be effective. As such, it would require at least government complicity to organise and implement.

Of course, modern browsers are designed to avoid exactly this problem. This is why the Western world devotes so much effort to implementing and preserving the integrity of the "certificate chain" in SSL - you know you're connecting to your bank because the site presents a certificate for your bank's domain, that certificate is signed by a certificate authority, and your browser already knows what the certificate authority's signature looks like. But it seems that in China a lot of people use the Qihoo 360 web browser. It claims to provide anti-virus and malware protection, but for the past 18 months questions have been asked about its SSL implementation:

If your browser is either 360 Safe Browser or Internet Explorer 6, which together make up for about half of all browsers used in China, all you need to do is to click continue once. You will see no subsequent warnings. 360's so-called "Safe Browser" even shows a green check suggesting that the website is safe, once you’ve approved the initial warning message.

I should note, for the sake of clarity, that both the 2013 and the current MITM reports come from greatfire.org, whose owners leave little doubt that they have concerns about the current regime in China. A proper assessment of Qihoo's 360 browser would require it to be downloaded on a sacrificial PC and used to check out websites with known problems in their SSL certificates (e.g. self-signed, out of date, being MITM'd). For extra points you'd download it from a Chinese IP. I don't have the time or spare machine to test this thoroughly, but if anyone does then I'd be interested in the results.
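For the curious, here's roughly the sort of check such an assessment would automate, sketched in Python against the well-known badssl.com test endpoints (assuming those are still up); any client that validates certificate chains properly should reject all of them:

# Sketch: a client with proper certificate validation should reject these
# deliberately-broken test hosts (badssl.com is a public TLS test service).
import socket
import ssl

BAD_HOSTS = ["expired.badssl.com", "self-signed.badssl.com", "wrong.host.badssl.com"]

context = ssl.create_default_context()  # uses the system's trusted CA roots

for host in BAD_HOSTS:
    try:
        with socket.create_connection((host, 443), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                print("%s: accepted - this client is NOT doing its job" % host)
    except (ssl.SSLError, ssl.CertificateError) as err:
        print("%s: rejected (%s) - as it should be" % (host, err.__class__.__name__))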

Anyway, if the browser compromise checks out then I'm really not surprised at this development. In fact I'm surprised it hasn't happened earlier, and I wonder if there have been parallel efforts at compromising IE/Firefox/Opera/Chrome downloads in China. It would take substantial resources to modify a browser installer so that it applies a binary patch adding an extra fake certificate authority to the downloaded browser (letting, say, the Chinese government pretend to be Apple), and more resources to keep up with browser releases so that the patch could be rebuilt shortly after each new version ships - but it's at least conceivable. If you have lots of users of a browser developed by a firm within China, though, compromising that browser and its users is almost as good and much, much easier.

2014-10-13

Corporate welfare from Steelie Neelie and the EU

I used to be the starry-eyed person who thought that governments pouring money into a new concept for "research" was a good thing. That didn't last long. Now I read The Reg on the EU's plan to chuck 2.5 billion euros at "Big Data" "research" and wonder why, in an age of austerity, the EU thinks that pissing away the equivalent of the entire annual defence budget of Austria is a good idea.

First, a primer for anyone unfamiliar with "Big Data". It's a horrendously vague term, as you'd expect. The EU defines the term thus:

Big data is often defined as any data set that cannot be handled using today’s widely available mainstream solutions, techniques, and technologies.
Ah, "mainstream". What does this actually mean? It's a reasonable lower bound to start with what's feasible on a local area network. If you have a data set with low hundreds of terabytes of storage, you can store and process this on some tens of regular PCs; if you go up to about 1PB (petabyte == 1024 terabytes, 1 terabyte is the storage of a regular PC hard drive) then you're starting to go beyond what you can store and process locally, and need to think about someone else hosting your storage and compute facility.

Here's an example. Suppose you have a collection of overhead imagery of the United Kingdom, in the infra-red spectrum, sampled at 1m resolution. Given that the UK land area is just under 250 thousand square kilometers, if you represent this in an image with 256 levels of intensity (1 byte per pixel) you'll need 250,000 x (1000 x 1000) = 250,000,000,000 pixels, or 250 gigabytes of storage. This will comfortably fit on a single hard drive. If you reduce this to 10cm resolution - so that at maximum resolution your laptop screen of 1200 pixel width will show 120m of land - then you're looking at 25 TB of data, so you'll need a network of tens of PCs to store and process it. If, instead of a single infra-red channel, you have 40 channels of different electromagnetic frequencies, from low infra-red up to ultra violet, you're at 1PB and need Big Data to solve the problem of processing the data.

Another example, more privacy-concerning: if you have 1KB of data about each of the 7bn people in the world (say, their daily physical location over 1 year inferred from their mobile phone logs), you'll have 7 terabytes of information. If you have 120 KB of data (say, their physical location every 10 minutes) then this is around 1PB and approaches the Big Data limits.
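The same back-of-the-envelope sums in Python, for anyone who wants to tweak the assumptions:

# Rough sizing for the two examples above.
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15

# UK overhead imagery at 1 byte per pixel
uk_area_km2 = 250 * 1000
pixels_1m = uk_area_km2 * 1000 * 1000       # pixels at 1m resolution
print(pixels_1m / GB)                       # ~250 GB
print(pixels_1m * 100 / TB)                 # ~25 TB at 10cm resolution
print(pixels_1m * 100 * 40 / PB)            # ~1 PB with 40 spectral channels

# Location history for 7bn people
print(7e9 * 1 * KB / TB)                    # ~7 TB at 1KB per person
print(7e9 * 120 * KB / PB)                  # ~0.84 PB at 120KB per person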

Here's the press release:

Mastering big data could mean:
  • up to 30% of the global data market for European suppliers;
  • 100,000 new data-related jobs in Europe by 2020;
  • 10% lower energy consumption, better health-care outcomes and more productive industrial machinery.
My arse, but let's look at each claim in turn.
  • How is this project going to make it more likely for European suppliers to take over more of the market? Won't all the results of the research be public? How, then, will a European company be better placed to take advantage of them than a US company? Unless one or more US-based international company has promised to attribute a good chunk of its future Big Data work to its European operations as an informal quid-pro-quo for funding from this pot.
  • As Tim Worstall is fond of saying, jobs are a cost not a benefit. These need to be new jobs that are a prerequisite for larger Big Data economic gains to be realised, not busywork to meet artificial Big Data goals.
  • [citation needed], to quote Wikipedia. I'll believe it when I see it measured by someone without a financial interest in the Big Data project.

The EU even has a website devoted to the topic: Big Data Value. Some idea of the boondoggle level of this project can be gleaned from the stated commitment:

... to build a data-driven economy across Europe, mastering the generation of value from Big Data and creating a significant competitive advantage for European industry, boosting economic growth and jobs. The BDV PPP will commence in 2015[,] start with first projects in 2016 and will run until 2020. Covering the multidimensional character of Big Data, the PPP activities will address technology and applications development, business model discovery, ecosystem validation, skills profiling, regulatory and IPR environment and social aspects.
So how will we know if these 2.5bn Euros have been well spent? Um. Well. Ah. There are no deliverables specified, no ways that we can check back in 2020 to see if the project was successful. We can't even check in 2017 whether we're making the required progress, other than verifying that the budget is being spent at the appropriate velocity - and believe me, it will be.

The fundamental problem with widespread adoption of Big Data is that you need to accumulate the data before you can start to process it. It's surprisingly hard to do this - there really isn't that much new data generated in most fields and you can do an awful lot if you have reasonably-specced PCs on a high-speed LAN. Give each PC a few TB in storage, stripe your data over PCs for redundancy (not vulnerable to failure of a single drive or PC) and speed, and you're good to go. Even if you have a huge pile of storage, if you don't have the corresponding processing power then you're screwed and you'll have to figure out a way of copying all the data into Amazon/Google/Azure to allow them to process it.

Images and video are probably the ripest field for Big Data, but you still can't avoid the storage/processing problem. If you already have the data in a cloud storage provider like Amazon/Google/Azure, they likely already have the processing capacity for your needs; if you don't, where are all the CPUs you need for your processing? It's likely that the major limitation on processing Big Data in most companies is the appropriate reduction of the data to a relatively small secondary data set (e.g. processing raw images into vectors via edge detection) before sending it somewhere for processing.

The EU is about to hand a couple of billion euros to favoured European companies and university research departments, and it's going to get nine tenths of squat all out of it. Mark my words, and check back in 2020 to see what this project has produced to benefit anyone other than its participants.