I used to be the starry-eyed person who thought that governments pouring money into a new concept for "research" was a good thing. That didn't last long. Now I read The Reg on the EU's plan to chuck 2.5 billion euros at "Big Data" "research" and wonder why, in an age of austerity, the EU thinks that pissing away the entire annual defence budget of Austria is a good idea.
First, a primer for anyone unfamiliar with "Big Data". It's a horrendously vague term, as you'd expect. The EU defines the term thus:
Big data is often defined as any data set that cannot be handled using today’s widely available mainstream solutions, techniques, and technologies.
Ah, "mainstream". What does this actually mean? It's a reasonable lower bound to start with what's feasible on a local area network. If you have a data set with low hundreds of terabytes of storage, you can store and process this on some tens of regular PCs; if you go up to about 1PB (petabyte == 1024 terabytes, 1 terabyte is the storage of a regular PC hard drive) then you're starting to go beyond what you can store and process locally, and need to think about someone else hosting your storage and compute facility.
Here's an example. Suppose you have a collection of overhead imagery of the United Kingdom, in the infra-red spectrum, sampled at 1m resolution. Given that the UK land area is just under 250 thousand square kilometres, if you represent this in an image with 256 levels of intensity (1 byte per pixel) you'll need 250,000 x (1000 x 1000) = 250 000 000 000 pixels or 250 gigabytes of storage. This will comfortably fit on a single hard drive. If you sharpen this to 10cm resolution - so that at maximum resolution your laptop screen of 1200 pixel width will show 120m of land - then you have 100 times as many pixels and you're looking at 25 TB of data, so you'll need a network of tens of PCs to store and process it. If, instead of a single infra-red channel, you have 40 channels of different electromagnetic frequencies, from low infra-red up to ultra violet, you're at 1PB and need Big Data to solve the problem of processing the data.
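If you want to check the sums, here is a minimal Python sketch of the imagery arithmetic. The function names (`imagery_bytes`, `human`) are mine, purely for illustration; the sketch assumes raw, uncompressed storage, a land area rounded up to 250,000 square kilometres, and decimal units (1 TB = 10**12 bytes) rather than the 1024-based ones.

```python
# Raw (uncompressed) storage for overhead imagery, using the figures in the text.
# Assumptions: area rounded up to 250,000 km^2, decimal units (1 TB = 10**12 bytes).
UK_AREA_KM2 = 250_000    # "just under 250 thousand" square kilometres
BYTES_PER_SAMPLE = 1     # 256 intensity levels per channel
CM_PER_KM = 100_000


def imagery_bytes(resolution_cm: int, channels: int = 1) -> int:
    """Bytes needed to hold one image of the whole area at the given resolution."""
    pixels_per_km_side = CM_PER_KM // resolution_cm
    pixels = UK_AREA_KM2 * pixels_per_km_side ** 2
    return pixels * channels * BYTES_PER_SAMPLE


def human(n_bytes: int) -> str:
    """Format a byte count in decimal GB/TB/PB."""
    for unit, size in (("PB", 10**15), ("TB", 10**12), ("GB", 10**9)):
        if n_bytes >= size:
            return f"{n_bytes / size:g} {unit}"
    return f"{n_bytes} bytes"


if __name__ == "__main__":
    print(human(imagery_bytes(100)))              # 1m, one channel   -> 250 GB
    print(human(imagery_bytes(10)))               # 10cm, one channel -> 25 TB
    print(human(imagery_bytes(10, channels=40)))  # 10cm, 40 channels -> 1 PB
```

Each tenfold improvement in resolution costs a hundredfold in storage, which is why the jump from 1m to 10cm matters far more than adding a few channels.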
Another example, more privacy-concerning: if you have 1KB of data about each of the 7bn people in the world (say, their daily physical location over 1 year inferred from their mobile phone logs), you'll have 7 terabytes of information. If you have 120 KB of data (say, their physical location every 10 minutes) then this is around 840 TB, which is close to 1PB and approaches the Big Data limits.
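The same back-of-envelope check for the per-person example, again as a rough Python sketch. The name `population_bytes` is hypothetical, and the sketch assumes decimal units (1 KB = 1000 bytes) and a flat 7bn population.

```python
# Total storage for a fixed amount of data per person on the planet.
# Assumptions: 7bn people, decimal units (1 KB = 1000 bytes, 1 TB = 10**12 bytes).
WORLD_POPULATION = 7_000_000_000
BYTES_PER_KB = 1000
BYTES_PER_TB = 10**12


def population_bytes(kb_per_person: int) -> int:
    """Bytes needed to hold kb_per_person kilobytes for everyone."""
    return WORLD_POPULATION * kb_per_person * BYTES_PER_KB


if __name__ == "__main__":
    print(population_bytes(1) / BYTES_PER_TB)    # 7.0 TB
    print(population_bytes(120) / BYTES_PER_TB)  # 840.0 TB, a bit under 1PB
```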
Mastering big data could mean:
- up to 30% of the global data market for European suppliers;
- 100,000 new data-related jobs in Europe by 2020;
- 10% lower energy consumption, better health-care outcomes and more productive industrial machinery.
My arse, but let's look at each claim in turn.
- How is this project going to make it more likely for European suppliers to take over more of the market? Won't all the results of the research be public? How, then, will a European company be better placed to take advantage of them than a US company? Unless, that is, one or more US-based international companies have promised to attribute a good chunk of their future Big Data work to their European operations as an informal quid-pro-quo for funding from this pot.
- As Tim Worstall is fond of saying, jobs are a cost not a benefit. These need to be new jobs that are a prerequisite for larger Big Data economic gains to be realized, not busywork to meet artificial Big Data goals.
- [citation needed], to quote Wikipedia. I'll believe it when I see it measured by someone without a financial interest in the Big Data project.
The EU even has a website devoted to the topic: Big Data Value. Some idea of the boondoggle level of this project can be gleaned from the stated commitment:
... to build a data-driven economy across Europe, mastering the generation of value from Big Data and creating a significant competitive advantage for European industry, boosting economic growth and jobs. The BDV PPP will commence in 2015[,] start with first projects in 2016 and will run until 2020. Covering the multidimensional character of Big Data, the PPP activities will address technology and applications development, business model discovery, ecosystem validation, skills profiling, regulatory and IPR environment and social aspects.
So how will we know if these 2.5bn Euros have been well spent? Um. Well. Ah. There are no deliverables specified, no ways that we can check back in 2020 to see if the project was successful. We can't even check in 2017 whether we're making the required progress, other than verifying that the budget is being spent at the appropriate velocity - and believe me, it will be.
The fundamental problem with widespread adoption of Big Data is that you need to accumulate the data before you can start to process it. It's surprisingly hard to do this - there really isn't that much new data generated in most fields and you can do an awful lot if you have reasonably-specced PCs on a high-speed LAN. Give each PC a few TB in storage, stripe your data over PCs for redundancy (not vulnerable to failure of a single drive or PC) and speed, and you're good to go. Even if you have a huge pile of storage, if you don't have the corresponding processing power then you're screwed and you'll have to figure out a way of copying all the data into Amazon/Google/Azure to allow them to process it.
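To put rough numbers on the "PCs on a LAN" approach, here is a small Python sketch. The name `pcs_needed` and the defaults of 4 TB per PC and 2 copies of every stripe are my own assumptions for illustration, not figures from any real deployment; it only counts storage and ignores processing power entirely.

```python
# How many PCs are needed to hold a data set striped with redundancy?
# Assumptions (hypothetical): 4 TB of usable disk per PC, 2 copies of every stripe.
def pcs_needed(dataset_tb: int, tb_per_pc: int = 4, copies: int = 2) -> int:
    """Number of PCs needed to store `copies` full copies of the data set."""
    total_tb = dataset_tb * copies
    return -(-total_tb // tb_per_pc)  # ceiling division without floats


if __name__ == "__main__":
    print(pcs_needed(25))    # 25 TB data set  -> 13 PCs
    print(pcs_needed(1000))  # ~1 PB data set  -> 500 PCs
```

Thirteen PCs fit in a cupboard; five hundred is where hosting it yourself stops being a sensible hobby.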
Images and video are probably the ripest field for Big Data, but you still can't avoid the storage/processing problem. If you already have the data in a cloud storage provider like Amazon/Google/Azure, they likely already have the processing models for your data needs; if you don't, where are all the CPUs you need for your processing? It's likely that the major limitation on processing Big Data in most companies is appropriate reduction of the data to a relatively small secondary data set (e.g. processing raw images into vectors via edge detection) before sending it somewhere for processing.
The EU is about to hand a couple billion euros to favoured European companies and university research departments, and it's going to get nine tenths of squat all out of it. Mark my words, and check back in 2020 to see what this project has produced to benefit anyone other than its participants.