
2019-09-27

The pace of PACER

Permit me a brief, hilarious diversion into the world of US Government corporate IT. PACER is a US federal online system - "Public Access to Court Electronic Records" - which lets people and companies access transcribed records from the US courts. One of the federal judiciary's judges has been testifying to the House Judiciary Committee's Subcommittee on Courts, IP, and the internet, and in the process revealed some interesting - and horrifying - numbers.

TL;DR -

  1. it costs at least 4x what it reasonably should; but
  2. any cost savings will be eaten up by increased lawyer usage; nevertheless,
  3. rampant capitalism might be at least a partial improvement; so
  4. the government could upload the PACER docs to the cloud, employ a team of 5-10 to manage the service in the cloud, and save beaucoup $$.
Of course, I could be wrong on point 2, but I bet I'm not.

Background

PACER operates with all the ruthless efficiency we have come to expect from the federal government.[1] It's not free: anyone can register for it, and usage requires a payment instrument (a credit card), but charges are waived if you use less than $15 per quarter. The basis of charging is:

All registered agencies or individuals are charged a user fee of $0.10 per page. This charge applies to the number of pages that results from any search, including a search that yields no matches (one page for no matches). You will be billed quarterly.
You would think that, at worst, it would be cost-neutral. One page of black+white text at reasonably high resolution is a bit less than 1MB, and (for an ISP) that costs less than 1c to serve on the network. That leaves up to 9c per page to cover the machines and people required to store and serve the data - and profit!

Apparently not...

The PACER claims

It was at this point in the article that I fell off my chair:

Fleissig said preliminary figures show that court filing fees would go up by about $750 per case to “produce revenue equal to the judiciary’s average annual collections under the current public access framework.” That could, for example, drive up the current district court civil filing fee from $350 to $1,100, she said.
What the actual expletive? This implies that:
  1. the average filing requests 7500 pages of PACER documents - and that the lawyers aren't caching pages to reduce client costs (hollow laughter); or
  2. the average filing requests 25 PACER searches; or
  3. the average client is somewhere on the continuum between these points.
It seems ridiculously expensive. One can only conclude, reluctantly, that lawyers are not trying to drive down costs for their clients; I know, it's very hard to credit. [2]
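As a quick sanity check on those implied numbers, here's the arithmetic (the $750 increase, $0.10/page and $30/search figures come from the quotes above; the rest is just division):

# What a $750-per-case fee increase implies at PACER's published prices.
fee_increase_per_case = 750.00    # dollars, from Fleissig's testimony
price_per_page = 0.10             # dollars per page
price_per_search = 30.00          # dollars per search

print(fee_increase_per_case / price_per_page)     # 7500 pages per filing
print(fee_increase_per_case / price_per_search)   # 25 searches per filing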

And this assumes that 10c/page and $30/search is the actual cost to PACER - let us dig into this.

The operational costs

Apparently PACER costs the government $100M/year to operate:

“Our case management and public access systems can never be free because they require over $100 million per year just to operate,” [Judge Audrey] Fleissig said [in testimony for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet]. “That money must come from somewhere.”
Judge Fleissig is correct in the broad sense - but hang on, $100M in costs to run this thing? How much traffic does it get?

The serving costs

Let's look at the serving requirements:

PACER, which processed more than 500 million requests for case information last fiscal year
Gosh, that's a lot. What's that per second? 3600 seconds/hour x 24 hours/day x 365 days/year is 32 million seconds/year, so Judge Fleissig is talking about... 16 queries per second. Assume that's one query per page. That's laughably small.
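For the record, here's that arithmetic as a tiny sketch (the 500 million requests figure is from the testimony; the rest is unit conversion):

# Average query rate implied by PACER's stated annual traffic.
requests_per_year = 500_000_000       # from Judge Fleissig's testimony
seconds_per_year = 3600 * 24 * 365    # ~31.5 million seconds

print(f"{requests_per_year / seconds_per_year:.1f} queries/second on average")   # ~15.9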

Assume that peak traffic is 10x that, and you can serve comfortably 4 x 1MB pages per second on a 100Mbit network connection from a single machine; that's 40 machines with associated hardware, say amortized cost of $2,000/year per machine - implies order of $100K/year on hardware, to ensure a great user experience 24 hours per day 365 days per year. Compared to $100M/year budget, that's noise. And you can save 50% just by halving the number of machines and rejecting excess traffic at peak times.
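Here's that estimate as a sketch (the 10x peak factor, the 4-pages-per-second-per-machine throughput and the $2,000/year amortized cost are the assumptions stated in the paragraph above, not real PACER figures):

# Serving-fleet estimate using the assumptions above.
average_qps = 16                    # from the traffic figure above
peak_factor = 10                    # assume peak is 10x the average
pages_per_machine_per_sec = 4       # 4 x 1MB pages/s is ~32Mbit/s, comfortable on 100Mbit
cost_per_machine_per_year = 2_000   # amortized hardware cost, dollars

machines = average_qps * peak_factor / pages_per_machine_per_sec
print(f"{machines:.0f} machines, ~${machines * cost_per_machine_per_year:,.0f}/year")
# 40 machines, ~$80,000/year - order of $100K, noise against a $100M budget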

The ingestion and storage costs

Perhaps the case ingestion is intrinsically expensive, with PACER having to handle non-standard formats? Nope:

The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A. This change will improve the archiving and preservation of case-related documents.
So PACER ingests PDFs from courts - plus, I assume, some metadata - and serves PDFs to users.

How much data does PACER ingest and hold? This is a great Fermi question; here's a worked example of an answer, with some data.

There's a useful Ars Technica article on Aaron Swartz that gives us data on the document corpus as of 2013:

PACER has more than 500 million documents
Assume it's doubled as of 2019, that's 1 billion documents. Assume 1MB/page, 10 pages/doc, that's 10^9 docs x 10 MB per doc = 10^10 MB = 1x10^4 TB. That's 1000 x 10TB hard drives. Assume $300/drive, and drives last 3 years, and you need twice the number of drives to give redundancy, that's $200 per 10TB per year in storage costs, or $200K for 10,000 TB. Still, noise compared to $100M/year budget. But the operational costs of managing that storage can be high - which is why Cloud services like Amazon Web Services, Azure and Google Cloud have done a lot of work to offer managed services in this area.

Amazon, for instance, charges $0.023 per GB per month for storage (on one price model) - for 10^9 x 10MB docs, that's 10,000,000 GB x $0.023 or $230K/month, $2.76M/year. Still less than 3% of the $100M/year budget.
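Putting the self-managed and cloud numbers into one back-of-the-envelope sketch (all the inputs are the assumptions stated above, not PACER's actual figures):

# Fermi estimate of PACER's raw storage cost.
docs = 1_000_000_000                      # assume the corpus has doubled since 2013
total_tb = docs * 10 / 1_000_000          # 10MB/doc -> ~10,000 TB

# Option 1: self-managed 10TB drives, 2x redundancy, 3-year lifetime, $300 each
drives = total_tb / 10 * 2
self_managed_per_year = drives * 300 / 3                  # ~$200K/year

# Option 2: cloud object storage at roughly $0.023/GB/month
cloud_per_year = total_tb * 1_000 * 0.023 * 12            # ~$2.76M/year

print(f"{total_tb:,.0f} TB: drives ~${self_managed_per_year:,.0f}/yr, "
      f"cloud ~${cloud_per_year:,.0f}/yr")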

Incidentally Aaron Swartz agrees with the general thrust of my article:

Yet PACER fee collections appear to have dramatically outstripped the cost of running the PACER system. PACER users paid about $120 million in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures show running PACER only costs around $20 million.
A rise in costs of 5x in 6 years? That's roughly doubling every two and a half years. As noted above, it seems unlikely to be due to serving costs - even though volumes have risen, serving and storage have become cheaper. My bet is that it's down to personnel costs. I'd love to see the accounts breakdown. How many people are they employing, and what are those people doing?

The indexing costs - or lack thereof

Indexing words and then searching a large corpus of text is notoriously expensive - that's what my 10c per electronic page is paying for, right? Apparently not:

There is a fee for retrieving and distributing case information for you: $30 for the search, plus $0.10 per page per document delivered electronically, up to 5 documents (30 page cap applies).
It appears that PACER is primarily constructed to deliver responses to "show me the records of case XXXYYY" or "show me all cases from court ZZZ", not "show me all cases that mention 'Britney Spears'." That's a perfectly valid decision but makes it rather hard to justify the operating costs.

Security considerations

Oh, please. These docs are open to anyone who has an account. The only thing PACER should be worried about is someone in Bangalore or Shanghai scraping the corpus, or the top N% of cases, and serving that content for much less cost. Indeed, that's why they got upset at Aaron Swartz. Honestly, though, the bulk of their users - law firms - are very price-insensitive. Indeed, they quite possibly charge their clients 125% or more of their PACER costs, so if PACER doubled costs overnight they'd celebrate.

I hope I'm wrong. I'm afraid I'm not.

Public serving alternatives

I don't know how much Bing costs to operate, but I'd bet a) that its document corpus is bigger than PACER's, b) that its operating costs are comparable, c) that its indexing is better than PACER's, d) that its search is better than PACER's, e) that its page serving latency is better than PACER's... you get the picture.

Really though, if I were looking for a system to replace this, I'd build off an off-the-shelf solution to translate inbound PDFs to indexed text - something like OpenText - and run a small serving stack on top. That reduces the regular serving cost, since pages are a few KB of text rather than 1MB of PDF, and lets me get rid of all the current people costs associated with the customized search and indexing work on the current corpus.

PACER is a terrible use of government money

Undoubtedly it's not the worst[3], but I'd love for the House Judiciary Committee’s Subcommittee on Courts, IP, and the internet to drag Jeff Bezos in to testify and ask him to quote a ballpark number for serving PACER off Amazon Web Services, with guaranteed 100% profit margin.

Bet it's less than 1/4 of the current $100M/year.

[1] Yes, irony
[2] Why does New Jersey have the most toxic waste dumps and California the most lawyers? New Jersey got first choice. [Thanks Mr Worstall!]
[3] Which is terribly depressing.

2016-11-24

Expensive integer overflows, part N+1

Now that the European Space Agency has published its preliminary report into what happened with the Schiaparelli lander, it confirms what many had suspected:

As Schiaparelli descended under its parachute, its radar Doppler altimeter functioned correctly and the measurements were included in the guidance, navigation and control system. However, saturation – maximum measurement – of the Inertial Measurement Unit (IMU) had occurred shortly after the parachute deployment. The IMU measures the rotation rates of the vehicle. Its output was generally as predicted except for this event, which persisted for about one second – longer than would be expected. [My italics]
This is a classic software mistake - of which more later - where a stored value becomes too large for its storage slot. The lander was spinning faster than its programmers had estimated, and the measured rotation speed exceeded the maximum value which the control software was designed to store and process.
When merged into the navigation system, the erroneous information generated an estimated altitude that was negative – that is, below ground level.
The stream of estimated altitude readings would have looked something like "4.0km... 3.9km... 3.8km... -200km". Since the most recent value was below the "cut off parachute, you're about to land" altitude, the lander obligingly cut off its parachute, briefly fired its braking thrusters, and completed the rest of its descent under Mars' gravitational acceleration of 3.8m/s^2. That's a lot weaker than Earth's, but 3.7km of freefall gave the lander plenty of time to accelerate; a back-of-the-envelope calculation (v^2 = 2as) suggests an impact speed of around 167 m/s, less the effects of drag.
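The back-of-the-envelope number works out like this (a quick sketch using the figures quoted above; the real descent, with drag, is more complicated):

from math import sqrt

# Freefall impact speed from v^2 = 2*a*s, ignoring drag.
mars_gravity = 3.8      # m/s^2, as above
drop_height = 3_700     # metres of freefall after the parachute was cut

print(sqrt(2 * mars_gravity * drop_height))   # ~167.7 m/s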

Well, there goes $250M down the drain. How did the excessive rotation speed cause all this to happen?

When dealing with signed integers, if - for instance - you are using 16 bits to store a value then the classic two's-complement representation can store values between -32768 and +32767 in those bits. If you add 1 to the stored value 32767 then the effect is that the stored value "wraps around" to -32768; sometimes this is what you actually want to happen, but most of the time it isn't. As a result, everyone writing software knows about integer overflow, and is supposed to take account of it while writing code. Some programming languages (e.g. C, Java, Go) require you to manually check that this won't happen; code for this might look like:

/* Will not work if b is negative; INT16_MAX comes from <stdint.h> */
if (INT16_MAX - b >= a) {
   /* a + b will fit */
   result = a + b;
} else {
   /* a + b would overflow, so return the biggest
    * positive value we can
    */
   result = INT16_MAX;
}
Other languages (e.g. Ada) allow you to trap this in a run-time exception, such as Constraint_Error. When this exception arises, you know you've hit an overflow and can have some additional logic to handle it appropriately. The key point is that you need to consider that this situation may arise, and plan to detect it and handle it appropriately. Simply hoping that the situation won't arise is not enough.

This is why the "longer than would be expected" line in the ESA report particularly annoys me - the software authors shouldn't have been "expecting" anything, they should have had an actual plan to handle out-of-expected-value sensors. They could have capped the value at its expected max, they could have rejected the use of that particular sensor and used a less accurate calculation omitting that sensor's value, they could have bounded the calculation's result based on the last known good altitude and velocity - there are many options. But they should have done something.

Reading the technical specs of the Schiaparelli Mars Lander, the interesting bit is the Guidance, Navigation and Control system (GNC). There are several instruments used to collect navigational data: inertial navigation systems, accelerometers and a radar altimeter. The signals from these instruments are collected, processed through analogue-to-digital conversion and then sent to the spacecraft. The spec proudly announces:

Overall, EDM's GNC system achieves an altitude error of under 0.7 meters
Apparently, the altitude error margin is a teeny bit larger than that if you don't process the data robustly.

What's particularly tragic is that arithmetic overflow has been well established as a failure mode for ESA space flight for more than 20 years. The canonical example is the Ariane 5 failure of 4th June 1996 where ESA's new Ariane 5 rocket went out of control shortly after launch and had to be destroyed, sending $500M of rocket and payload up in smoke. The root cause was an overflow while converting a 64 bit floating point number to a 16 bit integer. In that case, the software authors had actually explicitly identified the risk of overflow in 7 places of the code, but for some reason only added error handling code for 4 of them. One of the remaining cases was triggered, and "foom!"

It's always easy in hindsight to criticise a software design after an accident, but in the case of Schiaparelli it seems reasonable to have expected a certain amount of foresight from the developers.

ESA's David Parker notes "...we will have learned much from Schiaparelli that will directly contribute to the second ExoMars mission being developed with our international partners for launch in 2020." I hope that's true, because they don't seem to have learned very much from Ariane 5.

2016-10-23

DDoS and the Tragedy of the Commons of the Internet of Things

On Friday there was a massive Distributed Denial of Service attack on DynDNS, who provide Domain Name services to a number of major companies including Twitter, Spotify and SoundCloud, effectively knocking those sites offline for a significant fraction of the global population. Brian Krebs provides a useful summary of the attack; he is unusually well versed in these matters because his website "Krebs on Security" was taken offline on 20th September after a massive Internet-of-Things-sourced DDoS against it. It seems that Krebs' ongoing coverage and analysis of DDoS with a focus on the Internet of Things (IoT) - "smart" Internet connected home devices such as babycams and security monitors - raised the ire of those using the IoT for their nefarious purposes. It proved necessary to stick Krebs' blog behind Google's Project Shield which protects major targets of information suppression behind something resembling +5 enchanted DDoS armour.

Where did this threat to the Internet come from? Should we be worried? What can we do? And why is this whole situation a Tragedy of the Commons?

Primer on DNS

Let's look at Friday's outage first. Dyn DNS is a DNS hosting company. They provide an easy way for companies who want a worldwide web presence to distribute information about the addresses of their servers - in pre-Internet terms, they're like a business phone directory. Your company Cat Grooming Inc., which has bought the domain name catgrooming.com, has set up its web servers on Internet addresses 1.2.3.4 and 1.2.3.5, and its mail server on 1.2.4.1. Somehow, when someone types "catgrooming.com" in their internet browser, they need that translating to the right numerical Internet address. For that translation, their browser consults the local Domain Name Service (DNS) server, which might be from their local ISP, or a public one like Google's Public DNS (8.8.4.4 and 8.8.8.8).

So if Cat Grooming wants to change the Internet address of their webservers, they either have to tell every single DNS server of the new address (impractical), or run a special service that every DNS server consults to discover up to date information for the hostnames. Running a dedicated service is expensive, so many companies use a third party to run this dedicated service. Dyn DNS is one such company: you tell them whenever you make an address change, and they update their records, and your domain's information says that Dyn DNS does its address resolution.

To check whether a hostname on the web uses DynDNS, you can use the "dig" command which should work from the Linux, MacOS or FreeBSD command line:

$ dig +short -t NS twitter.com
ns3.p34.dynect.net.
ns2.p34.dynect.net.
ns1.p34.dynect.net.
ns4.p34.dynect.net.
This shows that twitter.com is using Dyn DNS because it has dynect.net hostnames as its name servers.

Your browser doesn't query Dyn DNS for every twitter.com URL you type. Each result you get back from DNS comes with a "time to live" (TTL) which specifies for how many seconds the answer is valid. If your twitter.com query came back as 199.59.150.7 with a TTL of 3600 then your browser would use that address for the next hour without bothering to check Dyn DNS. Only after 1 hour (3600 seconds) would it re-check Dyn DNS for an update.

Attack mechanism

The Internet of Things includes devices such as "babycams" which enable neurotic parents to keep an eye on their child's activities from elsewhere in the house, or even from the restaurant to which they have sneaked out for a couple of hours of eating that does not involve thrown or barfed food. The easiest way to make these devices accessible from the public Internet is to give them their own Internet address, so you can enter that address on a mobile phone or whatever and connect to the device. Of course, the device will challenge any new connection attempt for a username and password; however, many devices have extremely stupid default passwords and most users won't bother to change them.

Over the past decade, Internet criminals have become very good at scanning large swathes of the Internet to find devices with certain characteristics - unpatched Windows 2000 machines, webcams, SQL servers etc. That lets them find candidate IoT devices on which they can focus automated break-in attempts. If you can get past the password protection for these devices, you can generally make them do anything you want. The typical approach is to add code that makes them periodically query a central command-and-control server for instructions; those instructions might be "hit this service with queries randomly selected from this list, at a rate of one query every 1-2 seconds, for the next 4 hours."

The real problem with this kind of attack is that it's very hard to fix. You have to change each individual device to block out the attackers - there's generally no way to force a reset of passwords to all devices from a given manufacturer. The manufacturer has no real incentive to do this since it has the customer's money already and isn't obviously legally liable for the behavior. The owner has no real incentive to do this because this device compromise doesn't normally materially affect the device operation. You can try to sell the benefits of a password fix - "random strangers on the internet can see your baby!" but even then the technical steps to fix a password may be too tedious or poorly explained for the owner to action. ISPs might be able to detect compromised devices by their network traffic patterns and notify their owners, but if they chase them to fix the devices too aggressively then they might piss off the owners enough to move to a different ISP.

Why don't ISPs pre-emptively fix devices if they find compromised devices on their network? Generally, because they have no safe harbour for this remedial work - they could be prosecuted for illegal access to devices. They might survive in court after spending lots of money on lawyers, but why take the risk?

Effects of the attack

Dyn DNS was effectively knocked off the Internet for many hours. Any website using Dyn DNS for their name servers saw incoming traffic drop off as users' cached addresses from DNS expired and their browsers insisted on getting an up-to-date address - which was not available, because the Dyn DNS servers were melting.

Basic remediation for sites in this situation is to increase the Time-to-Live setting on their DNS records. If Cat Grooming Inc's previous setting was 3600 seconds, then after 1 hour of the Dyn DNS servers being down their traffic would be nearly zero. If their TTL was 86400 seconds (1 day) then a 12 hour attack would only block about half their traffic - not great, but bearable. A TTL of 1 week would mean that a 12 hour attack would be no more than an annoyance. Unfortunately, if the attack downs Dyn DNS before site owners can update their TTL this doesn't really help.
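A rough way to see the TTL effect is below - a minimal sketch assuming cached entries expire uniformly across the TTL window, which is a simplification but matches the intuition above:

# Fraction of users whose cached DNS answer expires during an outage,
# assuming expiry times are spread evenly across the TTL window.
def fraction_affected(ttl_seconds, outage_seconds):
    return min(outage_seconds / ttl_seconds, 1.0)

outage = 12 * 3600                       # a 12-hour attack on the DNS provider
for ttl in (3600, 86400, 7 * 86400):     # 1 hour, 1 day, 1 week
    print(ttl, f"{fraction_affected(ttl, outage):.0%}")
# 3600 -> 100%, 86400 -> 50%, 604800 -> 7%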

Also, the bigger a site is, the more frequently it needs to update DNS information. Twitter will serve different Internet addresses for twitter.com to users in different countries, trying to point users to the closest Twitter server to them. You don't want a user in Paris pointed to a Twitter server in San Francisco if there is one available in Amsterdam, 500 milliseconds closer to them. And when you have many different servers, every day some of them are going offline for maintenance or coming online as new servers, so you need to update DNS to stop users going to the former and start sending them to the latter.

Therefore the bigger your site, the shorter your DNS TTL is likely to be, and the more vulnerable you are to this attack. If you're a small site with infrequent DNS updates, and your DNS TTL is short, then make it longer right the hell now.

Alternative designs

The alternative to this exposed address approach is to have a central service which all the baby monitors from a given manufacturer connect to, e.g. the hostname cams.babycamsRus.com; users then connect to that service as well and the service does the switching to connect Mr. and Mrs. Smith to the babycam chez Smith. This prevents the devices from being found by Internet scans - they don't have their own Internet address, and don't accept outside connections. If you can crack the BabyCams-R-Us servers then you could completely control a huge chunk of IoT devices, but their sysadmins will be specifically looking out for these attacks and it's a much more tricky proposition - it's also easy to remediate once discovered.

Why doesn't every manufacturer do this, if it's more secure? Simply, it's more expensive. You have to set up this central service, capable of servicing all your sold devices at once, and keep it running and secure for many years. In a keenly price-competitive environment, many manufacturers will say "screw this" and go for the cheaper alternative. They have no economic reason not to, no-one is (yet) prosecuting them for selling insecure devices, and customers still prefer cheap over secure.

IPv6 will make things worse

One brake on this run-away cheap-webcams-as-DoS-tool is the shortage of Internet addresses. When the Internet addressing scheme (Internet Protocol version 4, or IPv4 for short) was devised, it was defined as four numbers between 0 and 255, conventionally separated by dots e.g. 1.2.3.4. This gives you just under 4.3 billion possible addresses. Back in 2006 large chunks of this address space were free. This is no longer the case - we are, in essence, out of IPv4 addresses, and there's an active trade in them from companies which are no longer using much of their allocated space. Still, getting large blocks of contiguous addresses is challenging. Even a /24 (shorthand for 256 contiguous IPv4 addresses) is expensive to obtain. Father of the Internet Vint Cerf recently apologised for the (relatively) small number of IPv4 addresses - they thought 4.3 billion addresses would be enough for the "experiment" that IPv4 was. The experiment turned into the Internet. Oops.

This shortage means that the current model where webcams and other IoT devices have their own public Internet address is unsustainable: the cost of that address will become prohibitive, and customers will need something that sits behind the single home Internet address given to them by their ISP. You can have many devices behind one address via a mechanism called Network Address Translation (NAT), where the router connecting your home to the Internet lets each of your devices start connections to the Internet and allocates each one a "port" which is passed to the website it connects to. When the website server responds, it sends the web page back to your router along with the port number, so the router knows which of your home devices the web page should be sent to.

The centralized service described above is (currently) the only practical solution in this case of one IP for many devices. More and more devices on the Internet will be hidden from black-hat hacker access in this way.

Unfortunately (for this problem) we are currently transitioning to use the next generation of Internet addressing - IPv6. This uses 128 bits, which is a staggering number: 340 with an additional 36 zeroes after it. Typically your ISP would give you a "/64" for your home devices to use for their public Internet addresses - a mere 18,000,000,000,000,000,000 (18 quintillion) addresses. Since there are 18 quintillion /64s in the IPv6 address space, we're unlikely to run out of them for a while even if every person on earth is given a fresh one every day and there's no re-use.

IPv6 use is not yet mainstream, but more and more first world ISPs are giving customers IPv6 access if they want it. Give it a couple of years and I suspect high-end IoT devices will be explicitly targeted at home IPv6 setups.

Summary: we're screwed

IPv4 pressures may temporarily push IoT manufacturers to move away from publicly addressable IoT devices, but as IPv6 becomes more widely used the commercial pressures may once more become too strong to resist and the IoT devices will be publicly discoverable and crackable once more. Absent a serious improvement in secure, reliable and easy dynamic updates to these devices, the IoT botnet is here to stay for a while.

2015-06-21

The spectacular kind of hardware failure

Gentle reader, I have attempted several times to pen my thoughts on the epic hack of the US Office of Personnel Management that compromised the security information of pretty much everyone who works for the US government, but I keep losing my vision and hearing a ringing in my ears when I try to do so. So I turn to a lesser-known and differently-awesome fail: the US visa system.

Since a computer failure on the 26th of May - over three weeks ago - the US embassies and consulates worldwide have been basically unable to issue new visas except in very limited circumstances. You haven't heard much about this because it hasn't really affected most US citizens, but believe me it's still a big issue. It seems that they're not expecting the system to be working again until next week at the earliest. Estimates of impacted users are on the order of 200,000-500,000; many people are stuck overseas, unable to return to the USA until their visa renewal is processed.

What happened? The US Department of State has a FAQ but it is fairly bland, just referring to "technical problems with our visa systems" and noting "this is a hardware failure, and we are working to restore system functions".

So a hardware failure took out nearly the entire system for a month. The most common cause of this kind of failure is a large storage system - either a mechanical failure that prevents access to all the data you wrote on the disks, or a software error that deleted or overwrote most of the data on there. This, of course, is why we have backups - once you discover the problem, you replace the drive (if broken) and then restore your backed up data from the last known good state. You might then have to apply patches on top to cover data that was written after the backup, but the first step should get you 90%+ of the way there. Of course, this assumes that you have backups and that you are regularly doing test restores to confirm that what you're backing up is still usable.

The alternative failure is of a relatively large machine. If you're running something comparable to the largest databases in the world you're going to be using relatively custom hardware. If it goes "foom", e.g. because its motherboard melts, you're completely stuck until an engineer can come over with the replacement part and fix it. If the part is not replaceable, you're going to have to buy an entirely new machine - and move the old one out, and install the new one, and test it, and hook it up to the existing storage, and run qualification checks... But this should still be on the order of 1 week.

A clue comes from a report of the State Department:

"More than 100 engineers from the government and the private sector [my emphasis] are working around the clock on the problem, said John Kirby, State Department spokesman, at a briefing on Wednesday.
You can't use 100 engineers to replace a piece of hardware. They simply won't fit in your server room. This smells for all the world like a mechanical or software failure affecting a storage system where the data has actually been lost. My money is on backups that weren't actually backing up data, or backing it up in a form that needed substantial manual intervention to restore, e.g. a corrupted database index file which would need every single piece of data to be reindexed. Since they've roped in private sector engineers, they're likely from whoever supplied the hardware in question: Oracle or IBM, at a guess.

The US Visa Office issues around 10 million non-immigrant visas per year, which are fairly simple, and about 500,000 immigrant visas per year which are a lot more involved with photos, other biometrics, large forms and legal papers. Say one of the latter takes up 100MB (a hi-res photo is about 5MB) and one of the former takes up 5MB; then that's a total of about 100TB per year. That's a lot of data to process, particularly if you have to build a verification system from scratch.
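As a sketch of that estimate (the per-record sizes are the guesses in the paragraph above, not official figures):

# Rough annual data volume for the visa system.
non_immigrant_visas = 10_000_000     # per year, ~5MB each (simple records)
immigrant_visas = 500_000            # per year, ~100MB each (biometrics, forms, papers)

total_tb = (non_immigrant_visas * 5 + immigrant_visas * 100) / 1_000_000
print(f"~{total_tb:.0f} TB per year")    # ~100 TB/year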

I'd love to see a report on this from the Government Accountability Office when the dust settles, but fear that the private sector company concerned will put pressure on to keep the report locked up tight "for reasons of commercial confidentiality and government security". My arse.

2014-11-04

Mazzucato and her State-behind-the-iPhone claims

This caught my eye in the Twitter feed of Mariana "everything comes from the State" Mazzucato:

The box claiming that "microprocessor" came from DARPA didn't sound right to me, so I did some digging.

Sure enough, DARPA appears to have had squat all to do with the development of the first microprocessors:

Three projects delivered a microprocessor at about the same time: Garrett AiResearch's Central Air Data Computer (CADC), Texas Instruments (TI) TMS 1000 (1971 September), and Intel's 4004 (1971 November).
I don't know about the CADC, but Tim Jackson's excellent book "Inside Intel" is very clear that the 4004 was a joint Intel-Busicom innovation - DARPA wasn't anywhere to be seen - and TI's TMS 1000 was similarly an internal evolutionary development targeted at a range of industry products.

Looking at a preview of Mazzucato's book via Amazon, it seems that her claims about state money being behind the microprocessor are because the US government funded the SEMATECH semiconductor technology consortium with $100 million per year. Note that SEMATECH was founded in 1986 by which point we already had the early 68000 microprocessors, and the first ARM designs (from the UK!) appeared in 1985. Both of these were recognisable predecessors of the various CPUs that have appeared in the iPhone - indeed up to the late iPhone 4 models they used an ARM design.

I'm now curious about the other boxes in that diagram. The NAVSTAR/GPS and HTML/HTTP claims seem right to me, but I wonder about DARPA's association with "DRAM cache" - I'd expect that to come from Intel and friends - and "Signal compression" (Army Research Office) is so mind-meltingly vague a topic that you could claim nearly anyone is associated with it - the Motion Picture Experts Group who oversee the MPEG standards have hundreds of commercial and academic members. If Mazzucato's premise is that "without state support these developments would never have happened" then it's laughably refutable.

At this point I'm very tempted to order Mazzucato's book The Entrepreneurial State for the sole purpose of finding out just how misleading it is on this subject that I happen to know about, and thus get a measure of how reliable it is for the other parts I know less about.

Update: it seems that associating the DoE (US Department of Energy) with the lithium-ion battery is also something of a stretch. The first commercial lithium-ion battery was released by Sony and Asahi Kasei in Japan. The academic work leading up to it started with an Exxon-funded researcher in the early 70s. The only DoE link I can find is on their Vehicle Technologies Office: Batteries page and states:

This research builds upon decades of work that the Department of Energy has conducted in batteries and energy storage. Research supported by the Vehicle Technologies Office led to today's modern nickel metal hydride batteries, which nearly all first generation hybrid electric vehicles used. Similarly, the Office's research also helped develop the lithium-ion battery technology used in the Chevrolet Volt, the first commercially available plug-in hybrid electric vehicle.
That's a pretty loose connection. I suspect, since they specifically quote the Volt, that the DoE provided money to Chevrolet for research into the development of batteries for their cars, but the connection between the Volt and the iPhone battery is... tenuous.

For fuck's sake, Mariana. You could have had a reasonably good point by illustrating the parts of the iPhone that were fairly definitively state-funded in origin, but you had to go the whole hog and make wild, spurious and refutable claims just to bolster the argument, relying on most reviewers not challenging you because of your political viewpoint and on most readers not knowing better. That's pretty despicable.

2014-02-27

Fixing Healthcare.gov - the inside story

The new issue of Time covers in depth the work of the team who fixed Healthcare.gov. It's a fantastic read, with good access to the small but extremely competent team who drove the fix - go absorb the whole thing.

The data coming out of the story confirms a lot of what I suspected about what was wrong and how it needed to be fixed. Breaking down by before-and-after the hit team arrived:

Before

  1. By October 17 the President was seriously contemplating scrapping the site and starting over.
  2. Before this intervention, the existing site's teams weren't actually improving it at all except by chance; the site was in a death spiral.
  3. No one in CMS (or above) was actually checking whether the site would work before launch.
  4. The engineers (not companies) who built the site actually wanted to fix it, but their bosses weren't able to give them the direction to do it.
  5. There was no dashboard (a single view) showing the overall health of the site.
  6. The key problem the site had was being opened up to everyone at once rather than growing steadily in usage.
  7. The site wasn't caching the data it needed in any sensible way, maximising the cost of each user's action; just introducing a simple cache improved the site's capacity by a factor of 4.
I refer the reader in particular to my blogpost The Curse of Experts where CMS head Marilyn Tavenner was trying to dodge blame.
During the Tuesday hearing, Tavenner rejected the allegation that the CMS mishandled the health-care project, adding that the agency has successfully managed other big initiatives. She said the site and its components underwent continuous testing but erred in underestimating the crush of people who would try to get onto the site in its early days. "In retrospect, we could have done more about load testing," she said.
As the Time article shows, this was anything but the truth about what was actually wrong.

After

  1. There wasn't any real government coordination of the rescue - it was managed by the team itself, with general direction but not specific guidance from the White House CTO (Todd Park)
  2. The rescue squad was a scratch team who hadn't worked together before but was completely aligned in that they really wanted to make the site work, and had the technical chops to know how to make this happen if it was possible.
  3. Fixing the website was never an insurmountable technical problem: as Dickerson noted "It's just a website. We're not going to the moon." It was just that no-one who knew how to fix it had been in a position to fix it.
  4. The actual fixes were complete in about 6 weeks.
  5. One of the most important parts in improving the speed of fixing was to avoid completely the allocation of blame for mistakes.
  6. Managers should, in general, shut up during technical discussions: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it."
  7. The team refused to commit to artificial deadlines: they would fix it as fast as they could but would not make promises about when the fixes would be done, refusing to play the predictions game.
  8. Having simple metrics (like error rate, concurrent users on the site) gave the team a good proxy for how they were doing.
  9. Targeted hardware upgrades made a dramatic difference to capacity - the team had measured the bottlenecks and knew what they needed to upgrade and in what order.
  10. Not all problems were fixed: the back-end communications to insurance companies still weren't working, but that was less visible so lower priority.

The overall payoff for these six weeks of work was astonishing; on Monday 23rd December the traffic surged in anticipation of a sign-up deadline:

"We'd been experiencing extraordinary traffic in December, but this was a whole new level of extraordinary ... By 9 o'clock traffic was the same as the peak traffic we'd seen in the middle of a busy December day. Then from 9 to 11, the traffic astoundingly doubled. If you looked at the graphs, it looked like a rocket ship." Traffic rose to 65,000 simultaneous users, then to 83,000, the day's high point. The result: 129,000 enrollments on Dec. 23, about five times as many in a single day as what the site had handled in all of October.
Despite this tremendous fix, however, President Obama didn't visit the team to thank them. Perhaps the political fallout from the Healthcare.gov farce was too painful for him.

The best quote that every single government on the planet should read:

[...] one lesson of the fall and rise of HealthCare.gov has to be that the practice of awarding high-tech, high-stakes contracts to companies whose primary skill seems to be getting those contracts rather than delivering on them has to change. "It was only when they were desperate that they turned to us," says Dickerson. "I have no history in government contracting and no future in it ... I don't wear a suit and tie ... They have no use for someone who looks and dresses like me. Maybe this will be a lesson for them. Maybe that will change."
The team who pulled President Obama's chestnuts out of the fire didn't even think they were going to be paid for their work initially; it looks like they did eventually get some money, but nowhere near even standard contracting rates. And yet, money wasn't the motivator for them - they deeply wanted to make Healthcare.gov work. As a result they did an extraordinary job and more or less saved the site from oblivion. This matches my experience from government IT developments: it's reasonable to assume that the government don't care about whether the project works at all, because if they did then they'd run it completely differently. Though if I were President I'd be firing Marilyn Tavenner, cashing in her retirement package and using it to pay bonuses to the team who'd saved my ass.

If you have a terribly important problem to solve, the most reliable way to solve it is to find competent people who will solve it for free because they want it to work. Of course, it's usually quite hard to find these people - and if you can't find them at all, maybe your problem shouldn't be solved in the first place.

2014-02-04

Hard core computing from the last century

A spot of tech nostalgia for us, with Google's hirsute chief engineer, Urs Hölzle, discussing his first day in Google's "data center" 15 years ago:

[...] a megabit cost $1200/month and we had to buy two, an amount we didn't actually reach until the summer of 1999. (At the time, 1 Mbps was roughly equivalent to a million queries per day.)
- You'll see a second line for bandwidth, that was a special deal for crawl bandwidth. Larry had convinced the sales person that they should give it to us for "cheap" because it's all incoming traffic, which didn't require any extra bandwidth for them because Exodus traffic was primarily outbound.
What's interesting here is that the primary criterion for billing was space - square footage taken up on the colocation site's floor. Network was an additional cost as noted above, but Exodus didn't bill its residents for power - the 3 x 20A required for all the servers was a scrawled note on the invoice. Nowadays, power is one of the most fundamental requirements of a data center and you don't pour the first bit of concrete before you've got your megawattage lined up. Apple goes as far as sticking its own solar power generation around its North Carolina data center. We've come a long way in fifteen years.

You wouldn't be able to get away with a server rack like Google's 1999 design nowadays - just look at the way they cram the hardware into every available space. I've seen one of these racks on display, and you can barely see any daylight through it from front to back. The fire safety inspector would have kittens.

In the comments, Todd Reed calculates that if you tried to run today's YouTube while paying those data rates, you'd be forking over just under $3bn per month...

This just makes the point that the computing world of 15 years ago really was a different generation from today. Google was anticipating that a few megabits per second would be more than enough to keep crawling the entire web and keep up with the addition of content. Let's look at the most content-dense medium of the modern web - Tweets. In 2013 Twitter averaged 5700 Tweets per second. At 160 characters plus maybe 40 characters of timestamp and attribution that's 200 x 5700 = 1,140,000 characters per second or about 9 Mbits per second (Mbps). It would have cost Google nearly $11,000 per month just to keep up with Twitter's tweets. Nowadays you can get 20Mbps on your home Internet connection for $75 per month (business class) which should cope comfortably with two Twitters - until they started allowing you to attach images...
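Here's the arithmetic as a quick sketch (the tweet size and the 1999 bandwidth price are the assumptions from the text above):

# What Twitter's 2013 tweet volume would have cost at 1999 bandwidth prices.
tweets_per_second = 5_700
bytes_per_tweet = 200                # 160 chars of text plus ~40 of metadata
price_per_mbit_month = 1_200         # dollars per Mbps per month, Exodus circa 1999

mbps = tweets_per_second * bytes_per_tweet * 8 / 1_000_000
print(f"~{mbps:.1f} Mbps, ~${mbps * price_per_mbit_month:,.0f}/month")   # ~9.1 Mbps, ~$10,944/month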

2013-10-29

Reliability through the expectation of failure

A nice presentation by Pat Helland from Salesforce (and before that Amazon Web Services) on how they built a very reliable service: they build it out of second-rate hardware:

"The ideal design approach is 'web scale and I want to build it out of shit'."
Salesforce's Keystone system takes data from Oracle and then layers it on top of a set of cheap infrastructure running on commodity servers
Intuitively this may seem crazy. If you want (and are willing to pay for) high reliability, don't you want the most reliable hardware possible?

If you want a somewhat-reliable service then sure, this may make sense at some price and reliability points. You certainly don't want hard drives which fail every 30 days or memory that laces your data with parity errors like pomegranate seeds in a salad. The problems come when you start to demand more reliability - say, four nines (99.99% uptime, about 50 minutes of downtime per year) - and scaling to support tens if not hundreds of thousands of concurrent users across the globe. Your system may consist of several different components, from your user-facing web server via a business rules system to a globally-replicating database. When one of your hard drives locks up, or the PC it's on catches fire, you need to be several steps ahead:

  1. you already know that hard drives are prone to failure, so you're monitoring read/write error rates and speeds and as soon as they cross below an acceptable level you stop using that PC;
  2. because you can lose a hard drive at any time, you're writing the same data on two or three hard drives in different PCs at once;
  3. because the first time you know a drive is dead may be when you are reading from it, your client software knows to back off and look for data on an alternate drive if it can't access the local one;
  4. because your PCs are in a data centre, and data centres are vulnerable to power outages, broken network cables, cooling failures and regular maintenance, you have two or three data centres and an easy way to route traffic away from the one that's down.
You get the picture. Trust No One, and certainly No Hardware. At every stage of your request flow, expect the worst.

This extends to software too, by the way. Suppose you have a business rules service that lots of different clients use. You don't have any reason to trust the clients, so make sure you are resilient:

  1. rate-limit connections from each client or location so that if you get an unexpected volume of requests from one direction then you start rejecting the new ones, protecting all your other clients;
  2. load-test your service so that you know the maximum number of concurrent clients it can support, and reject new connections from anywhere once you're over that limit;
  3. evaluate how long a client connection should take at maximum, and time out and close clients going over that limit to prevent them clogging up your system;
  4. for all the limits you set, have an automated alert that fires at (say) 80% of the limit so you know you're getting into hot water, and have a single monitoring page that shows you all the key stats plotted against your known maximums;
  5. make it easy to push a change that rejects all traffic matching certain characteristics (client, location, type of query) to stop something like a Query of Death from killing all your backends.
Isolate, contain, be resilient, recover quickly. Expect the unexpected, and have a plan to deal with it that is practically automatic.
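As a concrete illustration of the first point in that list (per-client rate limiting), here's a minimal token-bucket sketch; the class name and the limits are mine, purely for illustration:

import time

class TokenBucket:
    """Allow up to `rate` requests/second per client, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should reject the request (e.g. HTTP 429)

# One bucket per client: a flood from one client gets rejected
# without starving everyone else.
buckets = {}
def handle_request(client_id):
    bucket = buckets.setdefault(client_id, TokenBucket(rate=50, burst=100))
    return bucket.allow()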

Helland wants us to build our software to fail:

...because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.
"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says.
He's spot on, and it's a lesson that the implementors of certain large-scale IT systems recently delivered to the world would do well to learn.

2013-10-19

How to build and launch a federal health care exchange

Since the US government has made a pig's ear, dog's breakfast and sundry other animal preparations of its health care exchange HealthCare.Gov, I thought I'd exercise some 20/20 hindsight and explain how it should (or at least could) have been done in a way that would not cost hundreds of millions of dollars and would not lead to egg all over the face of Very Important People. I don't feel guilty exercising hindsight, since the architects of this appalling mess didn't seem to worry about exercising any foresight.

A brief summary of the problem first. You want to provide a web-based solution to allow American citizens to comparison-shop health insurance plans. You are working with a number of insurers who will provide you with a small set of plans they offer and the rules to determine what premium and deductible they will sell the plan at depending on purchaser stats (age, family status, residential area etc.) You'll provide a daily or maybe even hourly feed to insurers with the data on the purchasers who have agreed to sign up for their plans. You're not quite sure how many states will use you as their health care exchange rather than building your own, but it sounds like it could be many tens of states including the big ones (California, Texas). We expect site use to have definite peaks over the year, usually in October/November/early December as people sign up in preparation for the new insurance year on Jan 1st. You want it to be accessible to anyone with a web browser that is not completely Stone Age, so specify IE7 or better and don't rely on any JavaScript that doesn't work in IE7, Firefox, Safari, Chrome and Opera. You don't work too hard to support mobile browsers for now, but Safari for iPad and iPhone 4 onwards should be checked.

Now we crunch the numbers. We expect to be offering this to tens of millions of Americans eventually, maybe up to 100M people in this incarnation. We also know that there is very keen interest in this system, and so many other people could be browsing the site or comparison-shopping with their existing insurance plans even if they don't intend to buy. Let's say that we could expect a total of 50M individual people visiting the site in its first full week of operation. The average number of hits per individual: let's say, 20. We assume 12 hours of usage per day given that it spans America (and ignore Hawaii). 1bn hits per week divided by 302400 seconds yields an average hit rate of about 3300 hits per second. You can expect peaks of twice that, and spikes of maybe five times that during e.g. news broadcasts about the system. So you have to handle a peak of 15000 hits per second. That's quite a lot, so let's think about managing it.
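Here's that load estimate as a sketch (every input is an assumption from the paragraph above):

# Back-of-envelope load estimate for the exchange's first full week.
visitors = 50_000_000              # individuals visiting in the first week
hits_per_visitor = 20
usable_seconds = 12 * 3600 * 7     # 12 usable hours/day across US time zones

average = visitors * hits_per_visitor / usable_seconds
print(f"average ~{average:,.0f} hits/s, peak ~{2 * average:,.0f}, "
      f"spikes ~{5 * average:,.0f}")    # ~3,300 / ~6,600 / ~16,500 (call it 15,000)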

The first thing I think here is "I don't want to be worrying about hardware scaling issues that other people have already solved." I'm already thinking about running most of this, at least the user-facing portion, on hosted services like Amazon's EC2 or Google's App Engine. Maybe even Microsoft's Azure, if you particularly enjoy pain. All three of these behemoths have a staggering number of computers. You pay for the computers you use; they let you keep requesting capacity and they keep giving it to you. This is ideal for our model of very variable query rates. If we need about one CPU and 1GB of RAM to handle three queries per second of traffic, you'll want to provision about 5000 CPUs (say, 2500 machines) during your first week to handle the spikes, but maybe no more than 500 CPUs during much of the rest of the year.

The next thought I have is "comparison shopping is hard and expensive, let's restrict it to users whom we know are eligible". I'd make account creation very simple; sign up with your name, address and email address plus a simple password. Once you've signed up, your account is put in a "pending" state. We then mail you a letter a) confirming the sign-up but masking out some of your email address and b) providing you with a numeric code. You make your account active and able to see plans by logging in and entering your numeric code. If you forget your password in the interim, we send you a recovery link. This is all well-trodden practice. The upshot is that we know - at least, at a reasonable level of assurance - that every user with an active account is a) within our covered area and b) is not just a casual browser.

As a result, we can design the main frontend to be very light-weight - simple, cacheable images and JavaScript, user-friendly. This reduces the load on our servers and hence makes it cheaper to serve. We can then establish a second part of the site to handle logged-in users and do the hard comparison work. This site will check for a logged-in cookie on any new request, and immediately bounce users missing cookies to a login page. Successful login will create a cookie with nonce, user ID and login time signed by our site's private key with (say) a 12 hour expiry. We make missing-cookie users as cheap as possible to redirect. Invalid (forged or expired) cookies can be handled as required, since they occur at much lower rates.
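A minimal sketch of that kind of login cookie, using an HMAC with a shared secret rather than a full private-key signature (the field layout and names here are illustrative, not a spec):

import hashlib, hmac, os, time

SECRET_KEY = os.urandom(32)      # in practice, a key shared across the frontend fleet
COOKIE_LIFETIME = 12 * 3600      # 12-hour expiry, as above

def make_cookie(user_id):
    # payload = user ID | login time | random nonce
    payload = f"{user_id}|{int(time.time())}|{os.urandom(8).hex()}"
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"

def check_cookie(cookie):
    payload, _, sig = cookie.rpartition("|")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not payload or not hmac.compare_digest(sig, expected):
        return False             # malformed, forged or corrupted cookie
    login_time = int(payload.split("|")[1])    # signature is valid, so the format is ours
    return time.time() - login_time < COOKIE_LIFETIME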

There's not much you can do about the business rules evaluation to determine plan costs: it's going to be expensive in computation. I'd personally be instrumenting the heck out of this code to spot any quick wins in reducing computation effort. But we've already filtered out the looky-loos to improve the "quality" (likelihood of actually wanting to buy insurance) of users looking at the plans, which helps. Checking the feeds to insurers is also important; put your best testing, integration and QA people on this, since you're dealing with a bunch of foreign systems that will not work as you expect and you need to be seriously defensive.

Now we think about launch. We realise that our website and backends are going to have bugs, and the most likely place for these bugs is in the rules evaluation and feeds to insurers. As such, we want to detect and nail these bugs before they cause widespread problems. What I'd do is, at least 1 month in advance of our planned country-wide launch, launch this site for one of the smaller states - say, Wyoming or Vermont which have populations around 500K - and announce that we will apply a one-off credit of $100 per individual or $200 per family to users from this state purchasing insurance. Ballpark guess: these credits will cost around $10M which is incredibly cheap for a live test. We provision the crap out of our system and wait for the flood of applications, expect things to break, and measure our actual load and resources consumed. We are careful about user account creation - we warn users to expect their account creation letters within 10 days, and deliberately stagger sending them so we have a gradual trickle of users onto the site. We have a natural limit of users on the site due to our address validation. Obviously, we find bugs - we fix them as best we can, and ensure we have a solid suite of regression testing that will catch the bugs if they re-occur in future. The rule is "demonstrate, make a test that fails, fix, ensure the test passes."

Once we're happy that we've found all the bugs we can, we open it to another, larger, state and repeat, though this time not offering the credit. We onboard more and more states, each time waiting for the initial surge of users to subside before opening to the next one. The current state-by-state invitation list is prominent on the home page of our site. Our rule of thumb is that we never invite more users than we already have (as a proportion of state population), so we can do no more than approximately double our traffic each time.

This is not a "big bang" launch approach. This is because I don't want to create a large crater with the launch.

For the benefit of anyone trying to do something like this, feel free to redistribute and share, even for commercial use.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Update: also very worth reading Luke Chung's take on this application, which comes from a slightly different perspective but comes up with many similar conclusions on the design, and also makes the excellent usability point:

The primary mistake the designers of the system made was assuming that people would visit the web site, step through the process, see their subsidy, review the options, and select "buy" a policy. That is NOT how the buying process works. It's not the way people use Amazon.com, a bank mortgage site, or other insurance pricing sites for life, auto or homeowner policies. People want to know their options and prices before making a purchase decision, often want to discuss it with others, and take days to be comfortable making a decision. Especially when the deadline is months away. What's the rush?

2013-08-17

Uptimes and apocalypses

Riley: Buffy. When I saw you stop the world from, you know, ending, I just assumed that was a big week for you. It turns out I suddenly find myself needing to know the plural of apocalypse.
"A New Man", Buffy The Vampire Slayer, S4 E12
Amused by the apocalyptic tone of the Daily Mail's coverage of the 5-minute Google outage on Friday - just before midnight BST, which explains why no-one in the UK except hardcore nerds noticed - I thought I'd give a brief explanation of the concept of "uptime" for an Internet service.

Marketeers <spit> describe expected system uptime in "nines" - the fraction of time that the system is expected to be available. A "two nines" system is available 99% of the time. This sounds pretty good, until you realise that it means the system can be down for about 14 minutes every day. If Google, Facebook or the BBC News website were down for a quarter of an hour every day, there would be trouble. So this is a pretty low bar.

For "Three nines" (99.9%) you start to move into downtime measured in minutes per week - there are just over 10,000 minutes in a week, so if you allow 1 in 1000 of those to be down, you're looking at 10 minutes per week. This is pretty tight - the rule of thumb says that even if you have someone at the end of a pager 24/7 and great system monitoring that alerts you whenever something goes wrong, it will still take your guy 10-15 minutes to react to the alert, log in, look to see what's wrong - and that's before he works out how to fix it. So your failures need to occur less frequently than weekly.

When you get to "Four nines" (99.99%) you're looking at either a seriously expensive system or a seriously simple system. During a whole year, you're allowed fifty minutes of downtime, which by the maths above indicates no more than two incidents in that year - and, realistically, probably only one. At this level you start to be more reliable than most Internet Service Providers, so it starts to get hard to measure your uptime as your traffic is fluctuating all the time due to Internet outages of your users - if your traffic drops, is it due to something you've done or is it due to something external (e.g. a natural disaster like Hurricane Sandy?) Network connectivity and utility power supply are probably not this reliable, so you have to have serious redundancy and geographic distribution of your systems. I've personally run a distributed business system that nudged four nines of availability, with an under-resourced support team and it was a cast iron bastard - any time anything glitched, you had someone from Bangalore calling you at home around 1am. Not fun.

"Five Nines" (99.999%) is the Holy Grail of marketeers, but in practice it seems to be unachievable for a complex system. You have only 5 minutes per year of downtime allowed, which normally equates to one incident every 3-4 years at max. Either your system is extremely simple, or it's massively expensive to run. Normally the cost of that extra 45 minutes of uptime a year is prohibitive - easily double that of four nines in many cases, sometimes much more - and most reasonable people settle for four nines or, in practice, less than that.

Given that, let's examine the DM's assertion that "Experts said the outage had cost the company about £330,000 and that the event was unheard of." Google had about $50bn revenue last year so divide that by 366 (leap year) to get about $140M/day average, $5.7M/hour. A 5 minute outage is 1/12th of that, $474K or £303K at today's rates, so the number sounds about right. But "unheard of"? May 7 2005 was another outage, this time for around 15 minutes. Google, Twitter, Yahoo, Facebook, Bing, iTunes etc. go down for some areas of the planet fairly frequently - see DownRightNow which is currently showing me service disruptions for Yahoo Mail and Twitter. Gmail was down for a whole bunch of people for 18 minutes back in December. It's part of normal life.
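For the sceptical, the back-of-envelope calculation looks like this, under the crude assumption that lost revenue scales linearly with downtime (it doesn't quite, but it's close enough for a tabloid fact-check):

```python
annual_revenue = 50e9                      # USD, roughly Google's revenue that year
per_hour = annual_revenue / 366 / 24       # 2012 was a leap year
outage_cost = per_hour * 5 / 60            # five-minute outage
print(f"${per_hour/1e6:.1f}M per hour; ${outage_cost/1e3:.0f}K per five minutes")
# -> about $5.7M/hour and ~$470K per five minutes, i.e. ballpark £300K.
```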

Global networks go down all the time. Google going down for a few minutes is not the end of the world. It's happened before and will almost certainly happen again. The Daily Mail needs to find some better-quality experts - but then, I guess those quotes wouldn't be as quotable. I'm not surprised Google drops off the planet for 5 minutes - I'm surprised it doesn't happen more often, and I'm astonished they get it back online in 5 minutes. I also feel sorry for anyone setting up their Internet connection at home during that outage window, who tried connecting to www.google.com to verify their connection and found it failing: "I can't reach Google - my Internet must be bust, it certainly can't be Google that's unavailable..."

Update: (2013-08-19)
And now Amazon goes down worldwide for 30 minutes. I rest my case.

2013-05-13

Innovate the French way!

Innovation in France apparently consists of taxing successful products in order to subsidise the industries they're replacing:

Mr Lescure believes that a 4pc tax on the sale of smartphones and tablets, namely Apple's iPhone and iPad and Google Android products, could boost government revenues as consumers are spending more money on hardware than on content.
Well, yes; it could boost government revenues. Taxing everyone aged 16-25 €12 would also boost government revenue, would probably raise a similar amount, and might even be cheaper to implement. It's an interesting approach to making content generate revenue: charge for the device, and hand the money to whichever content generator has the best political connections. I'm not sure it's quite the best approach for the consumer, however.

France is, as a sovereign nation, entirely within its rights to tax whatever it likes and give money to whomever it likes. It is not the first nation to try to protect old media industries, nor will it be the last. I would however be interested to know what lobbying has been going on to point "businessman" Pierre Lescure [warning, contains French] at this particular approach. His Wikipedia biography seems to label him as a professional TV journalist and theatrical director, but I'm sure that had nothing to do with his selection...

2013-04-29

Reflections on uptime

A couple of conversations this week have made me realise how "uptime", and its unloved stepchild "downtime", are misunderstood in today's world of the always-on Internet. I thought I'd blog a little about this and see where it goes. First, the case of fanfiction.net.

This conversation was with a buddy in NYC who is an avid fan-fiction reader. She (and fan-fiction readers are disproportionately "she") was complaining about fanfiction.net being down for an hour or two, during which time she was going cold turkey, deprived of new chapters from her favourite authors. When quizzed further, she admitted that the site was down some time between 1am and 3am NYC time, so perhaps she actually got more sleep than she would have otherwise... Anyway, her complaint was "why is fanfiction.net always going down?". So here's my attempt at an answer.

fanfiction.net is hosted by Tiggee, and so costs real money to run. The site has some ads - Google's AdChoices - but they seem to be tastefully done and not in-your-face. Neither reading nor uploading fanfic costs anything, so ad income has to cover the entire cost of running the site - Tiggee's hosting fees plus the time and trouble of the site maintainers. Downtime means a loss of ad revenue as well as readership, so the owners will want to minimise downtime without spending too much money doing so. Assuming 5TB of storage and 30TB/month of bandwidth, a sample hosting company like Cloud Media will charge you about $3700/month. Let's assume then that it costs $4000/month to host the site as it stands, and that ads bring in a steady $8000/month in revenue (about $11/hour).

My friend reports an informal estimate of fanfiction.net being down maybe 4% of the time she checks (ignoring rapid re-checking in the 15 minutes after she sees that it has gone down). This would be costing them $320/month in lost ads. This isn't worth getting out of bed for.
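Putting those numbers in one place - a trivial sketch, where the dollar figures are the guesses above rather than anyone's real accounts:

```python
hosting_per_month = 4_000                       # USD, assumed
ad_revenue_per_month = 8_000                    # USD, assumed
hours_per_month = 30 * 24

ad_revenue_per_hour = ad_revenue_per_month / hours_per_month
downtime_fraction = 0.04                        # my friend's informal estimate
lost_ads = ad_revenue_per_month * downtime_fraction
print(f"~${ad_revenue_per_hour:.0f}/hour in ads; ~${lost_ads:.0f}/month lost to downtime")
# -> roughly $11/hour and $320/month
```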

Not all hours of downtime are equal, however. Web browsing follows a roughly diurnal curve. Working in GMT (and noting that summer time skews these figures, since not every nation observes daylight saving), the trough is around GMT midnight, when people are just coming in to work in Japan. If you have some control over your downtime (e.g. for system upgrades) you can schedule it in the 5am - 8am GMT window, when virtually none of your likely audience in English-speaking countries is awake except for die-hard nerds. The opportunity cost of scheduled downtime is probably halved, or better, so that $320/month loss drops further.

Is it even worth trying to stop unscheduled downtime like this? Your hosting company is already taking care of what outages are in their control (bad hardware, network misconfiguration etc.) and mistakes on their part will result in a refund of your hosting fees for outage time outside their Service Level Agreement. You can probably assume that they'll give your hosted machines something like 99.9% of uptime, which is a little bit less than 1 hour of downtime per month. All you have to worry about is misconfiguration or performance problems of the software you run on their hosted machines, which most often happens after you - the site owner - have made a change. There are occasions when your site falls over spontaneously (e.g. because you've run out of storage space) but they are few and far between. Setting up an alerting system which pages you if your service goes down outside your normal working hours would require a substantial technical and financial investment, and probably wreck your sleep patterns.

The usual way out of this corner is delegation, but here the information economy prices work against you; even a spotty part-time sysadmin who has no idea what he's doing would cost you $2500/month to have on-call. Unless you can pool him with a number of other sites to amortise his cost and increase his utilisation, there's no point parting with your cash. This is why people tell you that adding each "nine" of reliability (going from e.g. 90% to 99% or 99% to 99.9% uptime) increases your costs exponentially - you need new layers of people and systems to a) prevent downtime occurring and b) react extremely quickly when it happens. For situations when your downtime losses are close to your uptime income and costs, it's far better to accept the downtime, fix it during your normal working hours, and be extremely conservative in operating your site within its allocated storage and bandwidth limits. Only make changes when you have to, and ensure you're around and watching closely for several hours after the upgrade.
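The delegation arithmetic is just as brutal, with the same caveats (guessed figures, and it generously assumes the hire actually prevents every outage):

```python
sysadmin_per_month = 2_500      # USD, the spotty part-time on-call rate above
recoverable_per_site = 320      # USD/month of ads currently lost to downtime

sites_needed = sysadmin_per_month / recoverable_per_site
print(f"~{sites_needed:.0f} similar sites needed just to break even")
# -> about 8 sites before the on-call cover even pays for itself
```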

The conclusion? The reason most "free" websites have somewhat unreliable uptime (between one and two "nines", i.e. 90-99% uptime) is that it's simply not economically worthwhile for them to pay the additional costs to be down less often or for shorter periods. You get what you pay for. Of course, there are situations when downtime costs are not close to operating costs - I'll be addressing that in my next blog.

2013-03-20

A tale of two unlocks

Bypassing phone lock screens seems to be the story of the day: first, access to the phone book and photos of an up-to-date iPhone:

By locking the device and enabling the Voice Control feature, it is possible to circumvent the lock screen by ejecting the SIM card from its tray at the moment the device starts dialing.
From here, the phone application remains open, allowing access to recent call logs, contacts, and voicemail (if it isn't protected by a separate PIN code). But also from here, photos and video can also be accessed by creating a new contact. When a new contact is created, it opens up access to the photos application — including Camera Roll and Photo Stream.
Note that the iOS version tested (6.1.3) is the release which fixes the previous unlock screen exploit. One wonders how many more of these exploits are going to come around.

This bug is limited in frequency but severe in impact. Although all modern iOS devices appear to be vulnerable, the exploit does not (in general) give a thief much to work with. He apparently can't make calls or send texts with the device, which are the two potentially most expensive acts. Where it does bite is in situations where the address book or photos are themselves valuable - generally, when the thief knows the iPhone's owner, or knows the owner is a friend of someone whose address, phone number or photos he wishes to steal. Imagine, for instance, if someone got access to Pippa Middleton's iPhone and used this exploit to read contact information and photos of her family and friends.

But let's not just pile on Apple - Samsung is similarly vulnerable:

From the lock screen, an attacker can enter a fake emergency number to call which momentarily bypasses the lock screen, as before. But if these steps are repeated, the attacker has enough time to go into the Google Play application store and voice search for "no locking" apps, which then disables the lock screen altogether.
From there, the device is left wide open.
The interesting point here is that the vulnerability doesn't appear to be present on "stock" (Google-released) Android 4.1.2 phones - it appears to be peculiar to Samsung devices. That implies to me that in Samsung's effort to pile on customisations to differentiate themselves from J. Random Other Android device provider, they may have sacrificed something in quality and security testing. Unlike Apple, however, I suspect Samsung don't particularly care. They will certainly care about this flaw (since it makes Samsung's leading-edge phones even more attractive to tea leaves who wish to burn up their victims' phone bills), but I don't see them slowing down their development velocity. That's their primary differentiator over Apple - new features and innovation - and there's no way they're going to trade it for slightly improved security. Only if the flaws being discovered have substantial negative impact on the average user (phone crashing all the time, corruption of storage, inability to view videos of cats) will they matter enough to Samsung to change its development direction.

2013-01-23

Social media - HP doing it right

Prolific and stylish blogger Anna Raccoon was lamenting her experience of trying to get an HP printer to talk to her Mac, a task comparable to deciphering Linear A:

I've spoken to 17 different technical gurus, and a few extra at Apple. Every one of them paid the minimum rate for whichever country they were in, everyone of them believing they are doing a decent days work, and every one of them utterly useless.
Let me say that this is not 1000 miles from my own experience trying to get an HP laser printer to talk to my Apple hardware - it's about 50% reliable at best when connecting over wireless, and 100% over USB but only if it's connected before a reboot. So you should think seriously about whether wrestling with HP drivers is really what you want to do, though the printer hardware itself seems pretty good and has held up well.

But what's this? Not a day later, HP contact Anna out of the blue:

I’ve just had a charming gentleman, Keith Schneider from 'Executive Customer Relations' (sounds good anyway!), on the phone from sunny California, who assures me that they will produce a French HP Laserjet expert with perfect English who will phone me at home within 24 hours and sort the problem...
Wow. Given that California is 9 hours behind France, so Mr. Schneider can only have got to work about 5 hours earlier even if he's an early riser, that's not bad going. Keith Schneider, if you're reading this, you should give a bonus to the social media trawler who spotted this blog post, realised its importance and escalated it to you. It's raised HP several notches in my eyes, and I assume in many others' too.

Now if you could do something about the semi-trained baboons who write your drivers, I'd be even happier.

2012-10-05

Goldman Sachs as Facebook?

I know they helped with the IPO, but it seems that Goldman Sachs thinks it has a lot to learn from Facebook on the technology side:

"Finance is a very technology-dependent business," says Don Duet, a global co-chief operating officer in Goldman’s technology division. "We have a substantial infrastructure footprint, and over the past four or five years, we've been moving into a scale-out-type model that's very similar to the big web firms.
The first question which comes to mind: who is Don Duet and what has he done?
Don joined Goldman Sachs in 1988 as an associate in the Technology Department within Fixed Income, Currency and Commodities in New York. He transitioned to a number of roles within Technology [...] He was named managing director in 2000 and partner in 2006.
Don earned a BS in Computer Science and Mathematics from Marist College in 1988.
Huh? Marist College? It looks like a small college in upstate New York without much in the way of a grad school. So Don joined GS straight from college, learned enough to be dangerous, then played the politics game to rise through the ranks of GS's technology division. This should be entertaining. I hazard a guess that Duet doesn't know enough to know what he doesn't know.

What is Don Duet planning for GS?

Duet says it hopes to have these servers up and running within six months. "We want to get the point where we have machines that inherit some of the properties of the original Facebook designs, but actually work in more classic data centers."
[...]
In another echo of the big web players, Goldman has also made the move to "containerized" data centers. About eight years ago, Google started piecing its data centers together using shipping containers packed with servers and other gear.
Oh, Lordy. So Duet thinks that the computing problems Facebook, Google and Goldman Sachs face are similar enough to adopt the same hardware strategy? I don't even know where to start. And for state-of-the-art GS hardware he's taking the approach that Google took 8 years ago? WTF? GS is rumoured to have about 40,000 servers. Compare the rumoured 30,000 Facebook servers in 2009, which (at an annual doubling) would be ~250K servers today; similarly, Google at around 1 million servers in 2009, even allowing a slower growth rate, should be somewhere over 2 million servers today.

The other point I find significant is that in the whole interview Duet never mentions measuring how well GS is using the computers it already has. If their average utilization were 40%, and they could double it by introducing an internal market in computing resources among the GS divisions, that would be double the capacity for free. Maybe efficiency isn't what Duet sees himself as being paid for.

As a GS buddy of mine acerbically notes: "I'm sure, whatever happens, it will be declared a success [by Don Duet]."

2012-09-07

Nokia is toast, and here's why

After its torrid year (share price of $2.64, down 60% in 12 months and down over 90% from its $40 high in 2007), Nokia are rolling out their first reasonable stabs at a Windows Phone model (the Lumia series), and the world has responded with a wide yawn. As Andrew Orlowski in The Reg opines:

A Windows Phone is somehow seen as a risky choice, despite (or perhaps even because of) its radical design. The ODMs (original device manufacturers) haven't yet shown us a Windows Phone with a killer feature. WP has sold on the basis of the People Hub, and perhaps this has been oversold a little.
Thing is, Nokia has always made pretty decent hardware. It tended to lead the field on cameras in the early 2000s, and the Lumias follow in that tradition. Apple's iPhone cameras, to be honest, have been pretty "blah" - it's been the OS and the applications built around them that have made the iPhone such a phenomenally successful camera phone. The Android phones - tied in well with Google Plus, which offers better photo quality than Facebook, and sporting better cameras - are more of a benchmark for the Lumia. It's a shame that Nokia's staggeringly inept PR department completely messed up the Lumia launch by giving the press a scandal to latch on to, rather than letting them talk about what looks like quite a good image stabilisation system.

Nokia's problem is the Finnish problem - "Not Invented Here" syndrome was invented in Finland. See, for instance, Linus Torvalds and Linux; the difference was that Linus was smart enough to realise when the product became too big for just him, and smart enough not to reinvent everything about an operating system. Nokia, by contrast, wrote its own phone OS (a diabolical mish-mash of C with a plethora of #defines to build a semi-OO message-passing system that caused the phone to lock up and reboot if any app received any message for which it didn't have a registered handler) and firmly resisted any alternatives. I think they only accepted Symbian when it a) became clear that Nokia OS was disappearing up its own backside and b) no-one else was really using Symbian any more. Adopting Windows Phone must have been an extremely bitter pill for them to swallow, and I'm sure their good software guys left in droves.

What Nokia should have done, once iPhone and Android appeared on the scene, was build some serious attempts at mass-market Android phones with Nokia hardware. Throw away most of the Nokia software, start re-pointing their medium-term software dev effort at Android system applications, and complement what they were good at (low-to-medium-end phone hardware) with a decent next-generation phone OS. Instead, they fell like a brick and had to be bailed out in effect by Microsoft, becoming MS's Great White Hope for Windows Phone. No prizes for guessing who's pulling the strings in Nokia's strategic direction.

Eventually they're going to end up a glorified phone hardware manufacturer, just uploading the latest Windows Phone images into their devices. Oh, how the mighty are fallen.

2012-06-28

RIM - all over bar the shouting

I'm going to go out on a limb here and hazard a guess that RIM is dead, along with Blackberry. Delaying Blackberry OS 10 until 2013, thereby missing the Christmas rush? They must be slipping seriously on internal delivery dates to be announcing the delay six months in advance.

RIM took a near-dominant position in business email and handed it away for free to anyone that wanted it (by the look of things, for now, Apple and iOS). They have forced their loyal retail consumers to go elsewhere for anything approaching an up-to-date smartphone; I've played with a Gingerbread full-keyboard device from Verizon and it's quite the match for anything RIM currently sells, never mind what Ice Cream Sandwich and Jelly Bean bring to the party. What is RIM going to announce in six months' time that will trump what Apple and Google are shipping now?

For RIM, it's all over bar the shouting. A steady decline in market share until they are bought at an insulting discount for what few marketable patents they have. If I were a good engineer in RIM, I'd be reaching out to my buddies in Apple and Google right about now.

2012-04-25

The printer won't scan...

...it's out of ink.

James Lileks (if you're not a regular reader of his Bleats, what's wrong with you?) produces a fantastic Bleat on the subject of his Kodak printer's mutinous refusal to scan:

The only reason the scanner wouldn’t work was because the people who designed it, under orders from management, entered some code that bricked the machine unless you bought more ink.
We all know this. We all know that printers are cheap things designed to sell ink. What surprises me is why printer companies willingly and intentionally make devices they know will make people hate their brands. It’s suicidal.
Read the whole thing.

If we are ever truly going to get to the paperless home or paperless office, I fear much of the progress will be directly attributable to this kind of bloody-minded petty penny-grabbing short-term blind stupidity of the printer companies. When they finally go bankrupt and sink into the mud it will be under the weight of millions of customers dancing on their graves.