There's a tremendous amount of hot air being talked about the alleged US
Government access to personal data from the major internet companies via the supposed
PRISM system. I'm not entirely sure who to believe, though I'm defaulting to "no-one";
there's no reason to trust any Government denials, nor any better reason to put faith in the
technical understanding of journalists. So let's look at how PRISM might actually work
from the limited point of view of snooping on Facebook.
The size of the problem
Facebook has somewhere around 1bn users, but they're not all active - indeed, they vary
greatly in levels of activity. So let's say there are 250M distinct FB users per day,
and they spend an average of 10 minutes per day on it with 2 activities (read a post or
instant message, view a photo, update status, make or delete a friend) per minute. That's
5bn activities per day, or 60,000 per second, that you want to record. How do you find out what they are?
Bear in mind that your key requirement is to be able to know who is talking to
and associating with whom, and what they are saying.
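(Here's that arithmetic as a sanity check - all the input figures are, of course, my own guesses.)

# Back-of-envelope: Facebook activity rate (all inputs are guesses)
daily_users = 250_000_000              # distinct users active per day
minutes_per_user = 10                  # average time on FB per day
activities_per_minute = 2              # reads, posts, photo views, etc.

activities_per_day = daily_users * minutes_per_user * activities_per_minute
print(f"{activities_per_day:,} activities/day")              # 5,000,000,000
print(f"{activities_per_day / 86_400:,.0f} activities/sec")  # ~57,870; call it 60,000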
Snooping at the ISP
The easiest place to start surveillance is at the user's domestic Internet Service Provider (ISP).
This is where most USA-based people will connect to the Net. The user will have a public IP (internet address)
which is the point where their traffic enters and exits the Net proper, and the ISP will - or should - normally know which user, physical location and bank account tie to that IP. This knowledge will be looser for
entities like public wi-fi networks, but they should still have physical location info e.g. the Starbucks
on the corner of 5th and Maple.
Regular (HTTP) internet traffic consists of packets - consider them as postcards - with "from" and "to" Internet
addresses, plus some text content. The packets are very small, so you have to be able to aggregate a lot of them
in order to build up e.g. the entire contents of an email; however they have index numbers so you know what
order they are supposed to be in. The "from" address will be the user's public IP, and for our purposes we
know what "to" addresses belong to Facebook, so we can require the ISP to just capture those packets for
our use. Assuming that we monitor all 250M people in this way, and that each "activity" is about 4KB in
size (ignoring photos and voice chat) that's an average stream of 240MB/sec, nearly 2Gbits/sec that the
Government has to collect from the various ISPs and process in real time. In practice you need to double
that bandwidth because usage isn't flat throughout the day - there will be a definite diurnal cycle and
you need to have capacity for the daily peak.
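In code, the bandwidth sums (again, 4KB per activity is my guess):

# Bandwidth to capture ~60,000 activities/sec at ~4KB each (guesses as above)
rate = 60_000 * 4 * 1024                      # bytes/sec across all ISPs
print(f"{rate / 1e6:.0f} MB/sec")             # ~246 MB/sec average
print(f"{rate * 8 / 1e9:.1f} Gbit/sec")       # ~2.0 Gbit/sec average
print(f"{rate * 8 * 2 / 1e9:.0f} Gbit/sec at the daily peak (doubled)")
print(f"{rate * 86_400 / 1e12:.1f} TB/day")   # ~21 TB/day to store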
This is a substantial processing
challenge but it's not impossible - the Government just has to write its own mini-Facebook back-end that
records user activity, without the need to handle photos and videos, and allows them to associate Facebook IDs with real
people IDs. Then they can run their own queries over that data store. They'll be writing over 20TB/day to that
store, so they'll need quite a few hard drives (more when you consider redundancy) but hey, it's the government,
they've got the spare $ somewhere.
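What would that mini-Facebook back-end store? A minimal sketch of one record, with field names entirely of my own invention:

# Hypothetical record in the government's activity store (field names invented)
from dataclasses import dataclass

@dataclass
class Activity:
    timestamp: float       # when the packets were captured
    source_ip: str         # user's public IP, as seen at the ISP
    subscriber: str        # the ISP's name/address/account tied to that IP
    facebook_id: str       # the FB account doing the activity
    kind: str              # "message", "status", "photo_view", "friend_add", ...
    counterparty: str      # the other FB account involved, if any
    content: bytes         # reassembled plain text, ~4KB on average

# The interesting queries then run over this store, e.g. all counterparties
# for a given facebook_id, joined back to the subscriber records.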
Of course, someone has to actually pay for the hardware and bandwidth to filter, store and forward the
traffic from the ISPs - more government cash - and someone's going to have to monitor and maintain it.
This has to happen in nearly all USA ISPs, without any word of it getting out. I'm sure this is
completely realistic.
Problem! We're primarily interested in "bloody foreigners", and not just the ones
based in the USA. If two people outside the USA are communicating, even if it's via
a USA-based Facebook data center, we won't even know it's happening. How can we
improve this situation?
Snooping at Facebook's edge
Here we take advantage of the fact that even foreign users end up talking to a FB
data center, and many of them are in the USA. (Presumably whatever we work out here
could also be done by friendly governments like Eire or the UK for data centers abroad.)
Instead of monitoring at the USA users' ingress points, you look at where they
egress into Facebook's network. This gives you far fewer places to monitor, though obviously much more traffic per spot: fewer installations of hardware, but at a much higher grade. You also have fewer places for news of the additional
hardware installation and operation to leak from.
The IP packets still have source addresses so you know
where they came into the Internet (more or less). You'll need additional collection of data from US
ISPs tying IPs to locations and people, where feasible, and you won't have this
quality of source information, but you can probably manage.
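That tying-together is conceptually just a longest-prefix lookup over address blocks the ISPs have handed over; an illustrative sketch (all addresses and names made up):

# Illustrative: map a packet's source IP to ISP subscriber data
import ipaddress

blocks = {
    ipaddress.ip_network("203.0.113.0/24"): ("ExampleNet", "Starbucks, 5th & Maple"),
    ipaddress.ip_network("198.51.100.0/22"): ("BigCableCo", "residential, Ohio"),
}

def locate(source_ip):
    addr = ipaddress.ip_address(source_ip)
    matches = [net for net in blocks if addr in net]
    # the most specific (longest prefix) block wins
    return blocks[max(matches, key=lambda n: n.prefixlen)] if matches else None

print(locate("203.0.113.7"))   # ('ExampleNet', 'Starbucks, 5th & Maple')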
So far we've seen that just for Facebook you're looking at quite a substantial volume of
traffic, and we've ignored all photos and videos, but you can probably infer quite a lot
from this data and it doesn't seem to be an insurmountable volume. So far PRISM seems to
be not technically infeasible. But there's a wrinkle...
Encryption
So far we've been blithely assuming that we can read the plain text of what the user is sending to
and receiving from Facebook - the URLs, the posted text - without any problems. Indeed, HTTP - the
system by which web browsers communicate with web servers - makes it easy to read this information.
An HTTP conversation happens in plain text and looks something like this:
From the browser: asking for the page "index.html" on host "www.example.com":
GET /index.html HTTP/1.1
Host: www.example.com
From the server:
HTTP/1.1 200 OK
Date: Mon, 27 Feb 2012 20:31:00 GMT
Server: Apache/2.3.4.5 (Unix) (Red-Hat/Linux)
Last-Modified: Sun, 26 Feb 2012 01:10:25 GMT
Etag: "2e70e-7d6-5f1c883b"
Content-Type: text/html; charset=UTF-8
Content-Length: 88
Connection: close
<html>
<head>
<title>An Example Page</title>
</head>
<body>
Hello World
</body>
</html>
The first block is information about the server and what it's returning; the second block is the HTML page itself.
If you've got access to the stream of data between a user and a website, you can very easily
work out what they're doing. You could even change the data, e.g. modifying every instance of
the word "Guardian" to "Grauniad" in the stream back to the user, so that the user browsing the eponymous
website gets very confused.
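That rewrite really is as trivial as it sounds; the inner loop of such a meddling relay might look like this (socket plumbing omitted):

# Toy relay loop: mangle plain-text HTTP responses on their way back to the user.
# "upstream" and "client" are already-connected TCP sockets; a real tool would
# also handle the target word straddling a 4KB chunk boundary.
def relay_and_mangle(upstream, client):
    while True:
        chunk = upstream.recv(4096)        # bytes from the real web server
        if not chunk:
            break
        # same word length, so the Content-Length header still matches
        client.sendall(chunk.replace(b"Guardian", b"Grauniad"))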
Luckily, some clever chaps were aware of this vulnerability of HTTP and came up with a
modification: HTTP Secure (HTTPS).
This is widely used, and is
the default for new Facebook users. The difference it makes for our purposes is that all an external
observer can see in plain text is a conversation between the browser and the Facebook server negotiating
a "shared secret" - a string that both of them know but that no other observer can know. Once this is agreed,
they encrypt the rest of their conversation using that shared secret. The observer can't see what URLs are
being requested, or what data is returned. All they know is that IP 203.0.113.2 is communicating with
Facebook, and that (judging by the encryption negotiation) they're using Internet Explorer 9. That's not
a lot of use to an eavesdropper.
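You can watch this from the client side with a few lines of Python - everything after the handshake crosses the wire as ciphertext (example.com standing in for Facebook):

# After the TLS handshake, an on-path observer sees only ciphertext
import socket, ssl

ctx = ssl.create_default_context()
with socket.create_connection(("example.com", 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname="example.com") as tls:
        print(tls.version(), tls.cipher())   # the negotiation the observer can see
        # this request is encrypted on the wire; a sniffer can't read the URL
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(tls.recv(200)[:80])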
There are a number of approaches to compromising HTTPS sessions, but they're generally rather CPU
intensive, target specific web applications, and are progressively being prevented by upgrades
to the secure protocols. Here's a little light
reading on some examples for the curious. Generally, the only approach that really scales
is a man-in-the-middle attack. This is where an eavesdropper intercepts the user's packets to
Facebook and pretends to be Facebook itself; in turn, the eavesdropper connects to Facebook
pretending to be the user and relays the user's requests and Facebook's responses.
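Structurally the interceptor is nothing more than a relay running two TLS sessions, one to each side. A heavily simplified skeleton, assuming you somehow hold a certificate the victim's browser will accept (which, as we'll see, is the entire difficulty):

# Skeleton of an HTTPS man-in-the-middle relay. "fake_facebook.pem" is a
# hypothetical forged cert/key the victim's browser would accept - obtaining
# that is precisely the hard part discussed below.
import socket, ssl, threading

def pump(src, dst):
    # copy bytes one way until the connection closes; plaintext is visible
    # here and could be recorded or filtered
    while data := src.recv(4096):
        dst.sendall(data)

server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain("fake_facebook.pem")
client_ctx = ssl.create_default_context()

listener = socket.create_server(("0.0.0.0", 443))
while True:
    victim, _ = listener.accept()
    victim_tls = server_ctx.wrap_socket(victim, server_side=True)
    real = client_ctx.wrap_socket(socket.create_connection(("www.facebook.com", 443)),
                                  server_hostname="www.facebook.com")
    threading.Thread(target=pump, args=(victim_tls, real), daemon=True).start()
    threading.Thread(target=pump, args=(real, victim_tls), daemon=True).start()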
The way that HTTPS/SSL defeats this is via Certificate Authorities, a small number of trusted firms across the world who provide the data to verify that, when you connect to a server claiming to be Facebook's, the electronic signature you receive back from that server really does belong to Facebook. The ins and
outs of how this works are complex, but the net effect is that it's really rather hard for even
a Government to pretend to be Facebook, and requires a substantial compromise of either Facebook's
secret SSL keys (so it can sign the connection just like Facebook does) or a certificate authority
(so it can claim that its fake signature really is Facebook's). Even these approaches are not
foolproof, and have to be cracked for each company and updated whenever each company changes
its signature. Worse for the would-be snooper, this can be detected by browsers; for instance, modern browsers know what the
real certificates should be for major websites and can warn you if someone is
trying to impersonate Facebook even if a compromised certificate authority claims that they're kosher.
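That check boils down to a comparison no certificate authority can override: hash the certificate the server presents and compare it against a value baked into the browser. In outline:

# The essence of certificate pinning: check the presented certificate against
# a hash shipped with the client, regardless of which CA vouches for it.
import hashlib, socket, ssl

EXPECTED_PIN = "..."   # SHA-256 of Facebook's real certificate, baked into the browser

ctx = ssl.create_default_context()
with ctx.wrap_socket(socket.create_connection(("www.facebook.com", 443)),
                     server_hostname="www.facebook.com") as tls:
    der = tls.getpeercert(binary_form=True)        # certificate as raw DER bytes
    if hashlib.sha256(der).hexdigest() != EXPECTED_PIN:
        print("Certificate doesn't match the pin - someone may be in the middle")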
There's also the not insignificant issue that such an interception approach has to be at least as reliable
as the servers the user connects to, and must not introduce any detectable latency into the connection
despite having to relay all the traffic both ways and filter out the text it's interested in.
The killer, though, is that you have to inspect all traffic to Facebook. Unlike plain text
traffic, where you can easily see that packets pertain to photos or videos and ignore them, you
can't tell this for HTTPS until you've intercepted the conversation and started to man-in-the-middle
the connection. You've got to continue relaying the photos or video data, even though you're not interested in it, because if you drop the connection the browser will notice and so will the user. This massively magnifies the
problem - you need as much processing capacity as Facebook itself has at its front ends.
Insider access
Google, Facebook et al have strenuously and specifically denied giving PRISM-like access to user data. Let's take them at their
word. Assuming they're not co-operating, how would you get the access you'd like to user data without them knowing?
The most effective approach, as noted above, is to have an insider compromise their SSL secret keys.
That lets you man-in-the-middle all HTTPS traffic. Unfortunately you have a very small set of insiders
who have that access - and, by definition, those insiders will be as trustworthy and hard-to-compromise
as possible.
The talk swanning around about "free access to data on Facebook's servers" is rubbish. There is
no way any substantial routine access to user data is going unnoticed. Facebook will be
monitoring read traffic, bandwidth usage, CPU and memory load for all its critical servers. If
there's unexplained traffic in any volume, it's going to show up in dozens of monitoring consoles
scattered all over the firm. So many people would have to be in on the snooping that word of it
would inevitably leak.
Conclusion
It's just about feasible for a government to snoop on the plain-text non-photo non-video traffic for Facebook, and the
best place to do it is probably where traffic exits the Internet going to Facebook's network.
You're looking at a very serious amount of hardware to snoop and store the information, but it's
tractable with the budget available from a major government. When it comes to routine snooping on
encrypted (HTTPS) traffic though, forget it. It would require a major systematic compromise of
closely-held secret keys, a very high performance software infrastructure operating at very
high reliability, and - the killer - would have to be able to deal with as much traffic as the
Facebook front ends themselves do. By extension, the same is true for Google, Yahoo, Microsoft
etc. The Government is going to require inconveniently large amounts of hardware placed
inconveniently close to the major Facebook, Google and Microsoft data centers.
I should add that the
alleged $20M/year cost of PRISM would cover the capital costs of about 15,000 servers written off over
3 years (say, $4000 per server since you have to cover associated network, power and cooling infrastructure).
That's really not a lot. If you have 5 TB of storage per server, that's 75,000 TB over 3 years; the above
requirements just for Facebook basics would be about 21,000 TB over that time, and you'd have to at least
double that for redundancy. This doesn't even approach all the other personnel and software development costs.
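The sums, for the sceptical:

# What the alleged $20M/year buys over a 3-year write-off (costs assumed)
budget = 20_000_000 * 3               # dollars over three years
servers = budget // 4_000             # $4,000/server incl. network, power, cooling
print(f"{servers:,} servers")         # 15,000

print(f"{servers * 5:,} TB available")    # 75,000 TB at 5 TB per server
print(f"{20 * 365 * 3 * 2:,} TB needed")  # ~20 TB/day for 3 years, doubled: 43,800
# ~43,800 TB for Facebook basics alone - before Google, Yahoo, Microsoft...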
Conclusion: the scope of PRISM has almost certainly been massively exaggerated. Journalists have been
taken for a ride.