Hemiposterical: August 2013

2013-08-29

Chemical horror

Today UK MPs rejected military action against the Syrian government despite the recent nerve agent attack killing hundreds of civilians. While I think that this was the right decision in the current circumstances, I've heard a lot of objections to military action along the lines of "civilians get killed in larger numbers by bombs, gunfire, napalm and starvation - what's so different about chemical weapons?" Playing Devil's Advocate, here's my answer.

Chemical warfare generally involves saturating a target area with a chemical agent for one of two reasons:

area denial; making it impossible or at least unattractive for opposing forces to occupy a strategically or tactically advantageous position; or
short-term offense; aiming to kill or debilitate concentrated opposing forces in an attractive location.

Modern chemical weapons such as the nerve agent sarin (one of the possible culprits in the recent attack, and one of the weapons likely used at Halabja) are devastatingly effective against any unprotected people in the immediate area; even gas masks aren't sufficient, since nerve agents can be absorbed through the skin. Full personal protective equipment is required to protect against a nerve agent attack. Modern armies can afford this for their troops - although the equipment makes them less effective and more susceptible to heat-induced illness - but there's no practical way to adequately protect a civilian population. Nerve agent attacks are therefore very much a one-shot weapon from a military standpoint: advantageous if they can take opposing military forces by surprise, but with very little sustained effect if the forces are warned and well-equipped.

If an attacker has free use of chemical weapons, and little fear of reprisal, their best tactic is therefore to saturate target areas with nerve agents. As well as nailing any poorly-prepared opposition soldiers, this will force the properly-equipped military to button up in protective gear and be less efficient in communication, manoeuvre and combat. It's not a decisive win, but certainly an advantageous tactic. The side effect of this saturation, however, will be devastating to any civilian position both in the location and downwind. Lacking effective protective gear, casualties in any populated area will be horrendous - and completely irrelevant in military terms.

The international opprobrium attached to the use of chemical weapons therefore exists (unusually) for a good reason. Chemical weapons are normally not a terribly efficient military method, but they cause disproportionately devastating civilian casualties; the opprobrium aims to make using them far more costly for offensive forces than using conventional weapons. The cost of conventional artillery and airstrikes is roughly proportional to the number of weapons used, which incentivises the attacker to target military forces as accurately as possible. Chemical weapons carry no such constraint, so there is a natural incentive to saturate the target area (maximising the loss of effectiveness of the opposition), which has the side effect of civilian annihilation.

Back to the original question: why is killing a civilian with chemical weapons worse than killing them with a bullet? It's because unfettered chemical weapons use will devastate a civilian population in short order. We have forgotten this because the threat of nuclear retaliation has kept chemical weapon usage to relatively low volumes in the past 95 years. If we want to see many more thousands of chemical weapons deaths, we should treat chemical weapon usage like artillery usage.

Real Internet novices

A rite of passage in a geek's life used to be explaining "the Internet" (in practice, the Web) to parents and grandparents. Nowadays there's enough of a critical mass of email and Web usage that even computer-free grandparents find their buddies talking about emails and photos from their grandchildren at college / abroad, and possibly even exchanging Facebook tips. The youngest generation now grows up with interactive touch-controlled LCD displays (e.g. tablets) and the concept that you can communicate, navigate and send photos from nearly anywhere is taken as read.

There is one group of middle-aged Western people who really have no idea what the Internet is: lifer Federal prisoners, since Federal prison regulations forbid Internet use. If you take the year 1998 as the point where the Web and email really went mainstream, anyone who's been in Federal prison for over 15 years as of today has likely had no real experience of the Web:

A handful of San Quentin prisoners took a class about the internet and startups. Through books, presentations, and printouts, they gained a theoretical understanding of the web, but not a practical one. While they still haven’t been online, this basic understanding made it easier for them to articulate just how far off their first impressions were.

We forget just how transformative and mind-blowing the Internet really is, because we have grown up with it - and it has grown up with us. The first real search engine, Archie, actually predated the Web; it was used to locate files on FTP sites. It didn't really index content as we understand it today; rather, it was like an efficient directory of hundreds of FTP sites, updated roughly monthly. You had to have a good idea of the likely filename that your required content lived in; once you had that, you could see which FTP servers had that file and where it lived. Because it was a monthly index, there was the perpetual frustration of going to the listed site and finding out that they had moved files around or cleaned them up.
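For the curious, here's a minimal Python sketch of an Archie-style lookup, assuming the index is nothing more than a periodic snapshot of FTP directory listings. The hostnames, paths and helper names (`build_index`, `search`) are invented for illustration; this is not how Archie was actually implemented, just the shape of the idea: you search on filenames, never on contents.

```python
from collections import defaultdict

# Hypothetical snapshot of FTP listings as (host, full path); all names invented.
LISTINGS = [
    ("ftp.example.edu", "/pub/gnu/emacs-18.59.tar.Z"),
    ("ftp.example.org", "/mirrors/gnu/emacs-18.59.tar.Z"),
    ("ftp.example.net", "/pub/games/nethack.tar.Z"),
]


def build_index(listings):
    """Map each bare filename to every (host, directory) holding it at snapshot time."""
    index = defaultdict(list)
    for host, path in listings:
        directory, _, filename = path.rpartition("/")
        index[filename].append((host, directory))
    return index


def search(index, fragment):
    """Archie-style lookup: match on the filename only, never on file contents."""
    return {name: sites for name, sites in index.items() if fragment in name}


index = build_index(LISTINGS)
for name, sites in search(index, "emacs").items():
    for host, directory in sites:
        print(f"{host}:{directory}/{name}")
```

Because the index is built from a snapshot, a file moved or deleted after the snapshot still shows up at its old location, which is exactly the frustration described above.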

Consider what Internet search can do now, efficiently locating and ranking web pages relevant to your search topic, automatically using synonyms and spelling correction to work out what you actually mean rather than what you're asking for. The number of web pages with meaningful content is staggering; and yet search engines can automatically spot and index updates in days if not hours. We can search images as well as text - sometimes even automatically recognising basic content information of an image without reference to the text around it. The physical location of a website in the world is almost irrelevant: South African sites are indexed side-by-side with Korean, Welsh, Canadian and Tuvaluan sites.
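To make the contrast with Archie concrete, here's a toy inverted index in Python with crude spelling correction (via the standard library's `difflib.get_close_matches`) and a hand-written synonym table. The pages, synonyms and function names are all hypothetical, and real search engines rank with vastly more signals; this only shows the basic mechanism of mapping words to pages and expanding the query before lookup.

```python
import difflib
from collections import defaultdict

# Toy corpus and synonym table; both invented for illustration.
PAGES = {
    "page1": "is it safe to eat flour with weevils",
    "page2": "why do cats dislike dogs",
    "page3": "baking bread with wholemeal flour",
}
SYNONYMS = {"hate": {"dislike"}, "bugs": {"weevils", "insects"}}


def build_inverted_index(pages):
    """Map each word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def expand(term, vocabulary):
    """Fix an obvious misspelling against the indexed vocabulary, then add synonyms."""
    if term not in vocabulary:
        close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.75)
        if close:
            term = close[0]
    return {term} | SYNONYMS.get(term, set())


def ranked_search(index, query):
    """Rank pages by how many query terms (after expansion) they match."""
    vocabulary = list(index)
    scores = defaultdict(int)
    for term in query.lower().split():
        matched = set()
        for variant in expand(term, vocabulary):
            matched |= index.get(variant, set())
        for page_id in matched:
            scores[page_id] += 1
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


index = build_inverted_index(PAGES)
print(ranked_search(index, "flowr with bugs"))
```

Here "flowr" is corrected to "flour" and "bugs" is widened to "weevils", so the flour-with-weevils page ranks first even though it contains neither query word as typed.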

In terms of how it has changed human behaviour, "wondering" is usually a very short activity. The answers to "I wonder if it's safe to eat flour with bugs in it," "Why do cats hate dogs?" and indeed a staggering range of things that people wonder are now a few keystrokes and a fraction of a second away. Of course, it has meant that people seeking information have had to learn to apply critical judgement of the quality of the answers. "Go not to the Internet for answers, for it will tell you 'yes', 'no' and 'you suck'." On balance, I think that the increased scepticism of the accuracy of any printed or electronic report, and the consequent death of trust in the traditional news media, is perhaps no bad thing.

I agree with the article's author that we need to do something about Internet access for lifer prisoners. Banning it completely is like banning books completely; you don't want to give them completely free rein, and preventing inappropriate communication is certainly an issue, but if you plan to ever let a prisoner go free again then equipping him to use the Internet is the very least we need to do if we hope to reintegrate him into society.

I did like prisoner Jorge Heredia's comment though; I'm not sure he realises how accurate his original concept was:

I was completely in the blind about the purpose [of Internet sites]. I thought they were just sites for people to socialize and spend their idle time.

2013-08-28

What's the point of bombing Syria?

I'm getting increasingly concerned at the US, UK and French war drums being beaten in anticipation of bombing Syria to punish them for nerve-gassing civilians. Don't get me wrong, I think that the use of chemical weapons at all is deserving of severe punishment, and if we can identify the military unit responsible for killing hundreds if not thousands of civilians with nerve agents then it should be bombed so hard that "they'll have to pipe in the sunlight"; a good few 2000lb LGBs to break open their shelters, napalm or white phos to burn up the munitions, and a healthy sprinkling of cluster weapons to finish off anyone left standing. At the very least, it will cause other chemical weapons unit commanders to think very hard about whether using their munitions really is in their personal best interest.

However, it seems that the Syrian government didn't know this attack was coming, and indeed were rather horrified when it did, according to US communications intercepts:

...in the hours after a horrific chemical attack east of Damascus, an official at the Syrian Ministry of Defense exchanged panicked phone calls with leader of a chemical weapons unit, demanding answers for a nerve agent strike that killed more than 1,000 people," the report said.

Now I don't know how much credence to give to this report, but it certainly fits well with observed behaviour; the US and UK now seem very confident that the Syrian government forces were responsible, yet there's no obvious gain - and much potential loss - for the Syrian government itself. Assad and company haven't survived this long by being pointlessly brutal. Even arch-tyrant Saddam Hussein of Iraq had method to his brutality: it was focused on keeping him in power, with him personally shooting senior ministers and officers if he even suspected a threat to him. Bashar al-Assad similarly is a survivor and not one to give Western forces a casus belli to open direct hostilities against him.

Bombing random airfields does nothing to deter the regime from using chemical weapons; quite the opposite. It certainly hurts their war effort, but the attack wasn't really triggered by government decisions in the first place. The Syrian government can certainly try to get stricter control of their weapons, but this action is punishing the government (however revolting) for actions outside their control. I'm not sure it really sends the right message. I'm also not convinced that the current opposition forces are necessarily the kind of people we want to back.

Update:
Matthew Inman at The Oatmeal pretty much nails my discomfort.

2013-08-23

Regulation is not the answer to IT failure

Widespread gloom and despondency today as the NASDAQ exchange shut down for three hours due to a "glitch":

"I would not want to speculate other than to say this is huge. Everything is halted in the market," said Sal Arnuk at Themis Trading in Chatham, New Jersey. Options trading was also halted, the exchange said.

Guess what? This is what happens if you have a SPOF (single point of failure). If you can only trade your Apple or Facebook shares on the NASDAQ, then NASDAQ is a single point of failure for your systems. If you don't want a single point of failure, you have to make it easy to trade on multiple systems.

The WSJ has a few more details:

Nasdaq said it plans to work with other exchanges to investigate Thursday's outage, which centered on a problem with the data feed supplying U.S. markets with trade information, and supports "any necessary steps to enhance the platform."
Nasdaq officials internally pointed to a "connectivity" problem with rival NYSE Arca, according to people familiar with the matter, that led to price quotes not being reported.

Nice muddying of the waters there, NASDAQ. "Work with other exchanges", forsooth. If the problem affected feeds in general, and not just an isolated feed to one exchange, then the problem was at NASDAQ's end. In theory you could get improved robustness by reporting from each exchange to NASDAQ on problems with the feed, but in practice each exchange's clients would notice quickly enough that something was up. If the problem were at Arca's end, it seems odd that NASDAQ would suspend operations. Rumours have it that Arca somehow "locked" an order, causing the NASDAQ side to freeze. Unfortunately, that doesn't let NASDAQ off the hook. If you are designing a client-server system then you should plan for clients to do arbitrary and crazy things, especially if you don't control the client code. Your server should tolerate badly-behaved clients, or at least alert you to them and give you the option to ignore that client until they have sorted themselves out. Letting a single client freeze the whole system with no work-around for three hours in the middle of the trading day - when all your techs are in the office - is terrible design.
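A minimal sketch of the "tolerate, alert, and let me ignore them" behaviour, in Python. Everything here is hypothetical (the `Dispatcher` class, the message format, the error threshold) and it is certainly not NASDAQ's design; it only catches clients that send malformed messages, whereas a real system would also need timeouts for clients that hang.

```python
from collections import defaultdict


class Dispatcher:
    """Hypothetical order gateway that isolates misbehaving clients.

    Assumes `handler` raises ValueError on a malformed message. After
    `max_errors` bad messages a client is quarantined (its messages are
    dropped) and an alert is raised, until an operator calls `release`.
    """

    def __init__(self, handler, max_errors=3):
        self.handler = handler
        self.max_errors = max_errors
        self.errors = defaultdict(int)
        self.quarantined = set()
        self.alerts = []

    def submit(self, client_id, message):
        if client_id in self.quarantined:
            return "rejected"  # drop this client; everyone else keeps trading
        try:
            self.handler(message)
        except ValueError as exc:
            self.errors[client_id] += 1
            if self.errors[client_id] >= self.max_errors:
                self.quarantined.add(client_id)
                self.alerts.append(f"quarantined {client_id}: {exc}")
            return "error"
        return "ok"

    def release(self, client_id):
        """Operator override once the client has sorted itself out."""
        self.quarantined.discard(client_id)
        self.errors[client_id] = 0


def handle_order(message):
    """Hypothetical validation: reject orders with a non-positive quantity."""
    if message.get("qty", 0) <= 0:
        raise ValueError("non-positive quantity")


gateway = Dispatcher(handle_order, max_errors=2)
```

The point of the design is that a misbehaving client costs you one client's worth of traffic, not the whole market.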

For reference, since the NASDAQ market opening hours are 6.5 hours per day 5 days a week they are open for about 1690 hours per year. Therefore this downtime was 0.18% of the year, bringing them below "three nines" of reliability. If they keep from having an outage next year they should go back to three nines - but it gives you some perspective on the limits of reliability even for a firm where time is, quite literally, money.
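The arithmetic, as a quick Python check. Like the estimate above, it ignores exchange holidays, which would shrink the trading year a little and make the outage look slightly worse.

```python
TRADING_HOURS_PER_DAY = 6.5
TRADING_DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52        # ignores exchange holidays, like the estimate above
OUTAGE_HOURS = 3

hours_open = TRADING_HOURS_PER_DAY * TRADING_DAYS_PER_WEEK * WEEKS_PER_YEAR
downtime_fraction = OUTAGE_HOURS / hours_open
three_nines_budget_hours = 0.001 * hours_open

# The same single outage averaged over two trading years instead of one.
two_year_fraction = OUTAGE_HOURS / (2 * hours_open)

print(f"open {hours_open:.0f} h/year; downtime {downtime_fraction:.2%} "
      f"(three-nines budget: {three_nines_budget_hours:.2f} h); "
      f"over two years: {two_year_fraction:.2%}")
```

A single three-hour outage blows the 1.69-hour annual budget for three nines; averaged over two years with no further outages, it drops back under 0.1%.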

What's the answer to this outage? Regulation!

Currently, exchanges can voluntarily choose to have their backup plans reviewed by the SEC, which then audits their technological systems. One potential rule, known as Regulation SCI, would require major exchanges to submit to the audits. That regulation is pending comment.
Lauer said the shutdown should be a wake-up call to regulators to monitor exchanges, which he said have not kept up with the speed of current technology. "We have an overly complex system, and it's complex to the point of dysfunction," he said.

FFS. Lauer may be right about the complexity - actually, I'm pretty sure he is right - but his solution blows chunks. Adding regulators very seldom fixes technological problems. For a start, if you have a complex and broken software system, you're not going to be able to evolve it into a reliable system. You're going to have to develop (carefully, with due respect for the Second System Effect) a new system in parallel with the old one, and very slowly and carefully migrate traffic from old to new while detecting and fixing the inevitable bugs and scaling issues. A regulator might be able to make you initiate that process, but sure as little green apples it won't make that process any more reliable.

Why not? Let's be brutally honest. What really good software engineer / technical project manager would work for a regulator, employed on a government-standard seniority-based competence-hostile salary scheme, battling with much more highly paid software engineers to make them try to do the right thing? Even if their employer (here, the SEC) has anti-poaching arrangements with the major banks, forbidding them from poaching an SEC tech who works or has recently worked on their compliance, there's nothing to stop Fred from Bank of America informally recommending to his ex-colleague Jim at Goldman Sachs that they hire Sheila from the SEC, who's been doing compliance testing on BofA and showing unusual technical competence. In return, Jim could tell Fred to look at hiring Sophie from the SEC, who's been working with Goldman Sachs, but to avoid Hermann at all costs as he's a talentless box-ticking drone.

NASDAQ clearly has software architecture problems, but regulator intervention is not going to fix them. Only commercial competition is going to help. If another firm is willing to set up a small exchange for (say) the top 50 NASDAQ-listed firms, and persuades some major banks to act as market makers, it will slightly increase liquidity in those firms and (more importantly) provide redundancy in the event that NASDAQ trading fails. They may have to plan and provision for NASDAQ failure, handling several times their normal traffic until NASDAQ get back on their feet, but that's feasible. A beneficial side-effect will be that NASDAQ will realise that downtime will no longer just delay trades, but will actually move trades to their competitor and lose them money. If that's not an incentive to improve, I don't know what is.
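A rough way to see why a second venue helps: if the venues fail independently, trading is only impossible when all of them are down at once, so the probabilities of being down multiply. The rival's 99% figure below is a made-up assumption, and independence is generous: failures with a shared cause, such as the common data feed that featured in this outage, are correlated, so the real combined figure would be worse.

```python
def combined_availability(*availabilities):
    """Chance that at least one venue is up, assuming venues fail independently."""
    p_all_down = 1.0
    for availability in availabilities:
        p_all_down *= 1 - availability
    return 1 - p_all_down


nasdaq = 0.998      # roughly the figure implied by the 0.18% downtime above
rival = 0.99        # hypothetical small competitor managing only two nines
print(f"{combined_availability(nasdaq, rival):.5f}")
```

On paper, two mediocre venues together comfortably beat four nines.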

2013-08-21

Terrorist mis-management

Pity poor Ayman Al-Zawahiri. Not only is the USA reading his communications, and the Egyptians arresting his brother, he's forced to deal with problems such as those profligate bastards in Al-Qaeda Yemen splurging his hard-earned cash on a new fax machine:

When he took over al Qaeda in 2011, senior U.S. intelligence officials were already pointing out his penchant for micro-management. (In one instance in the 1990s, he reached out to operatives in Yemen to castigate them for buying a new fax machine when their old one was working just fine.)

We never see this aspect of terrorist life portrayed on "24", but I feel they missed a trick here. Think of the missed potential for a gripping scene where Habib Marwan escapes from his hideout with CTU in hot pursuit, but is weighed down by several safeboxes full of receipts and outstanding invoices from his minions which he can't leave behind in case he leaves an expense claim unpaid - or, worse, lets through a claim from Navi Araz for carpet cleaning without a receipt.

Jacob N. Shapiro's analysis makes amusing reading. He points out how terrorism "in the large" increasingly resembles a medium-size business. Perhaps at the top level the leaders aren't worried about filing accounts with the Inland Revenue - unless they have some country's intelligence service providing funding, in which case the audits might be just as bad - but they have to struggle to manage very limited funds to give the best bang per ~~buck~~riyal, demanding accountability from staff who have practically been selected for being unstable and rejecting authority and order.

Perhaps al-Zawahiri should take inspiration from Hank Scorpio; providing good healthcare and an attractive pension plan for your minions does seem to go a long way towards smooth running of your terrorist organisation, and showing personal concern for their domestic problems does a lot for loyalty. Al-Zawahiri doesn't really seem to be a people person; he should rope in someone more personable to take over the day-to-day people management. Alan Sugar, maybe. "Hamas al-Masri: you're fired!" <boom>

I do wonder about this assertion though:

Terrorist managers are also obliged to place a premium on bureaucratic control, because they lack other channels to discipline the ranks. When Walmart managers want to deal with an unruly employee or a supplier who is defaulting on a contract, they can turn to formal legal procedures. Terrorists have no such option.

Were I to run an international terrorist organisation, I'd put a premium on remote management of problems. Specifically, I'd be selecting my heads-of-region based on their ability to deal effectively with problems. Since they are (by definition) not bound by legal considerations, I'd expect the "unruly employee" problem to raise its head only once, and that very briefly. After that they can rely on word of mouth to propagate the result of giving grief to the organisation's management layer. If Mr. al-Zawahiri is reading this, I'd be willing to consult on people management re-engineering for a very reasonable fee; just email me your address, with the email titled "al-Zawahiri's permanent address" so it won't get trapped in my spam filters.

2013-08-20

The investment bank working week

It seems that being an intern at the London office of Bank of America Merrill Lynch can be seriously bad for your health; a very hard-working intern passed away a week before completing his internship, and there are strong suggestions that overwork may have been a significant factor in his death. Huffington Post has the sad details:

The intern, 21-year old Moritz Erhardt, was described by Bank of America spokesman John McIvor as "an outstanding student."
[...]
According to a story sent out on the financial terminal service Bloomberg, Erhardt was found dead in the East London student housing where he had been staying, Claredale House.

Now the natural inference is that "he worked super-hard as an intern" == "he was killed by overwork" but there's a large gap between the available facts and that conclusion. Still, apparently well-informed juniors on Wall Street Oasis have noted the banks making the connection:

Can confirm that HR from one of JPM / MS / GS have sent an e-mail round to line managers of IBD interns saying something along the lines of 'don't work your interns too hard for the final weeks'

Erhardt was reported as having pulled three all-nighters in a row - the amount of coffee / Red Bull / Monster energy drinks / other medications needed to keep conscious and approximately functioning for that long without sleep can't be good for you.

So why do banks work interns so hard - and why do interns play along? The latter half of that question is easiest to answer; internship is an audition for employment at the bank, with all the other interns providing constant competition, and often the most straightforward way for an intern to feel that they are competitive is in terms of the hours they've worked - an objective numeric score, in an environment where it is otherwise very hard for you to measure how well you are doing. The banks in turn are seeing how the interns hold up under stress. If they are to be hired, they will be in a stressful environment for years. To gauge their long-term stress management, ratcheting up the stress levels for the relatively short period of their internship is viewed as a reasonable proxy. Overworking interns is a win-win - at least, from one party's perspective.

We can turn to Alex's boss Rupert for the financial community's perspective on interns and new hires: "I tend to worry about employing such people of independently wealthy means... One ends up questioning their commitment to the job. If they don't need the money, will they be prepared to put in the back-breaking hours necessary for a trainee?" The ability and inclination of an intern to work long hours are a signal that they are able and willing to play the game of working in a modern banking environment, doing the grunt work for their bosses. The immediate practical value of intern work to a company is usually marginal, if positive at all; the net contribution to the company is probably negative when you factor in the mentoring, reviewing and assisting time of the full-time employees.

The game doesn't actually change much once you're in full employment at the bank - as an analyst, then an associate, then a VP jockeying towards making the leap to Managing Director and serious remuneration. If you're working a 40-hour week at the bank then you're either employed based on your father being a major customer of the bank, or you're on your way out. A 50-hour week is the minimum acceptable, and even then your commitment to your job can be questioned. Evaluating your commitment in terms of hours "worked" is one of the easiest and most visible ways for your bosses and co-workers to provide feedback in your quarterly / annual reviews, the result of which will be vast amounts of cash and shares (if you're good), admonition followed by dismissal (if you're bad), and just enough remuneration to keep you grinding away at the coal face (for the 70%-80% of people in between).

There's a certain logic in the bank pushing this model. Bank employees aren't paid by the hour; thus, if you employ a quant at a nominal 40 hours a week and expected 50 hours a week, and persuade them to work an actual 60 hours a week with an implicit threat of firing or degraded bonus, you've increased their productivity by 20% - for free! Of course, after a few years they may get fed up with you and head off to a different bank, but except for the most senior levels of a bank everyone plans on a year-to-year basis. The bank will fire the low performers just after bonuses are announced (when they get nothing) to maximise the work from them. The top performers who get head-hunted away will quit either the day after the cash bonus clears in their bank account or the day after their outstanding share units vest. Loyalty? "If you want loyalty, buy a dog." (Michael Lewis, Liar's Poker).

At every level of management, the easiest way to show that you're an effective manager / driver of your team is to demonstrate their dedication to the firm. "My guys worked 70 hour weeks for months on end delivering system X only 2 weeks late!" There is little to no gain in making it easier to deliver system X by negotiation of features and timescales with the client teams; rather, a balls-to-the-wall death march is expected. "My guys worked 45 hour weeks to deliver a slightly lower-feature system X on time" is far less likely to gain a manager plaudits and hence a larger bonus.

In the end, the long-hours culture will take a toll. Very few people can put up with a high-stress long-hours environment for years at a time. If the immediate aim (a certain level of promotion) starts to fall out of reach, the best option is either to get head-hunted away or, in a more challenging economy, to coast and provoke the firm to fire one with a payoff. The usual deal is about one month of salary per year worked, incentivising the burned-out lifers to quit before the young bloods want to. My personal rule of thumb, backed by Ed Yourdon's experiences in "Death March", is that your maximal short-term productivity peaks at 70-80 hours a week. Beyond that, net productivity drops even after a week or two as the long hours and fatigue fog your brain. Longer-term, the 6-12 month sustainable limit (if taking vacation as it accrues) is around 60 hours per week. Interestingly, personal circumstances don't appear to matter much. Married employees have more distractions and potential stress at home, but then have a partner who can take some of the domestic load - the bank employee can return home to food on the table, clean bedsheets and an ironed shirt / business suit.

When I ask my buddies in banking why they're still there, the reasons are usually financial. Waiting for a chunk of shares to vest, trying to gain a promotion so they have better bargaining power to move firms, accumulating cash to reach an early retirement. Very few of them really enjoy what they do, and they hate the long hours - they are under no illusions as to what it does for them - but they realise it's a game between consenting adults and currently they're willing to play it. Note the lack of bitterness when Lehman Brothers went under and employees left the same day with their belongings in a cardboard box; they knew that this was always a possibility and it was factored into their planning as they worked. They don't feel as if anyone screwed them over, though not a few of them were frustrated at Dick Fuld's ineptness in the final days.

Interns, in turn, know what a banking internship involves. It's going to include long hours, a certain amount of abuse from the junior banking employees, stress, impossible tasks, and hard socialising in the few hours that are work-free. That's OK because it only runs for a limited time. But if you take it too seriously, lose perspective and feel that the only acceptable way to fail to achieve a task is to spend three straight days attempting it, you're not going to do yourself any favours. It's only a game. For the final word, I can do no better than yaob227 on wallstreetoasis:

It is "OKAY" to say fuck off when your life depends on it.

2013-08-19

Nuclear math

The reports from eschatological finance site Zero Hedge on the Fukushima nuclear disaster's aftermath make instructive reading. For instance, today's report that tens of trillions of Becquerels have spilled into the Pacific:

... and moments ago reality struck again, when the Nikkei newspaper reported that readings of tritium in seawater taken from the bay near the crippled Fukushima nuclear plant has shown 4700 becquerels per liter.

Tens of trillions of Becquerels! Nearly 5000 per liter of seawater! The end is nigh.

Anyone know what a Becquerel is? One Bq is defined as the activity of a quantity of radioactive material in which one nucleus decays per second:

For example, natural potassium (⁴⁰K) in a typical human body produces 4,000 disintegrations per second, 4 kBq of activity.

Heaven forfend! Kilo for kilo, the most radioactive measured water near Fukushima is 70 times as radioactive as normal people!
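The "70 times" only works kilo for kilo. A quick Python check, assuming a body mass of about 60 kg and treating a litre of seawater as roughly a kilogram (both assumptions, not figures from the report); a heavier person makes the ratio larger, a lighter one smaller.

```python
TRITIUM_BQ_PER_LITRE = 4700   # Nikkei seawater reading quoted above
BODY_K40_BQ = 4000            # potassium-40 activity of a typical body, quoted above
BODY_MASS_KG = 60             # assumption: a lightish adult
SEAWATER_KG_PER_LITRE = 1.0   # assumption: ignores seawater's ~2.5% extra density

body_bq_per_kg = BODY_K40_BQ / BODY_MASS_KG
water_bq_per_kg = TRITIUM_BQ_PER_LITRE / SEAWATER_KG_PER_LITRE
ratio = water_bq_per_kg / body_bq_per_kg

print(f"body {body_bq_per_kg:.0f} Bq/kg vs seawater {water_bq_per_kg:.0f} Bq/kg: ~{ratio:.0f}x")
```

This compares raw decay counts only; it says nothing about the energy of each decay, and hence nothing about dose.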

Who's the original source for this? Why, it's the Russian-government-funded TV station Russia Today.

Russia Today also reports:

The level of radiation at the site [of a highly contaminated leak] was estimated at 100 millisieverts per hour, while the safe level of radiation is 1-13 millisieverts per year, according to ITAR-TASS news agency.

Well, 4 mSv per year is the natural background dose so one can only imagine where ITAR-TASS is getting its data. US radiation workers are allowed 50mSv per year. The contamination level described would likely be fatal if sustained over a day or two, but then it's still 1/100th the level of radioactivity of some trans-uranic waste from US reactors. I wonder why RT wouldn't put this into the proper context?
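Putting the quoted 100 mSv per hour against the other figures in this post, in Python. This treats the reported rate as a whole-body dose rate, which (as the update below notes) it isn't; it's only here to show the scale of the numbers being thrown around.

```python
DOSE_RATE_MSV_PER_HOUR = 100    # RT / ITAR-TASS figure quoted above
US_WORKER_LIMIT_MSV = 50        # US annual occupational limit, per the text
BACKGROUND_MSV_PER_YEAR = 4     # natural background figure used in the text

hours_to_worker_limit = US_WORKER_LIMIT_MSV / DOSE_RATE_MSV_PER_HOUR
years_of_background_per_hour = DOSE_RATE_MSV_PER_HOUR / BACKGROUND_MSV_PER_YEAR
# Cumulative dose in sieverts after one and two days of continuous exposure.
dose_sv = {days: DOSE_RATE_MSV_PER_HOUR * 24 * days / 1000 for days in (1, 2)}

print(f"annual worker limit reached in {hours_to_worker_limit * 60:.0f} minutes")
print(f"one hour = {years_of_background_per_hour:.0f} years of background")
print(f"sustained dose: {dose_sv[1]:.1f} Sv after one day, {dose_sv[2]:.1f} Sv after two")
```

Several sieverts over a day or two is where the "likely fatal" remark comes from.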

Now why might the Russian government, drawing substantial economic and political power from its oil and gas exports, want to put a downer on nuclear power? More interestingly, why might Zero Hedge want to support the aims of the Russian government?

Update: (2013-08-21)
Lewis Page at El Reg points out that the 100mSv/hr is from beta particles, so you'd have to splash around in or drink the water for it to have any effect. Gamma is 1.5mSv/hr, which means a nuclear worker can work around the water for four 8-hour shifts before approaching their annual radiation limit.
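The update's arithmetic, checked in Python, using the 1.5 mSv/hr gamma figure and the 50 mSv US annual limit quoted above.

```python
GAMMA_MSV_PER_HOUR = 1.5      # gamma dose rate from the El Reg update
SHIFT_HOURS = 8
ANNUAL_LIMIT_MSV = 50         # US annual occupational limit quoted above

hours_to_limit = ANNUAL_LIMIT_MSV / GAMMA_MSV_PER_HOUR
full_shifts = int(hours_to_limit // SHIFT_HOURS)
dose_after_shifts = full_shifts * SHIFT_HOURS * GAMMA_MSV_PER_HOUR

print(f"limit after {hours_to_limit:.1f} h: {full_shifts} full shifts, {dose_after_shifts:.0f} mSv")
```

Four full shifts come to 48 mSv, just under the limit; a fifth would go over it.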

2013-08-17

Uptimes and apocalypses

Riley: Buffy. When I saw you stop the world from, you know, ending, I just assumed that was a big week for you. It turns out I suddenly find myself needing to know the plural of apocalypse.
"A New Man", Buffy The Vampire Slayer, S4 E12

Amused by the apocalyptic tone of the Daily Mail's coverage of the 5-minute Google outage on Friday - just before midnight BST, which explains why no-one in the UK except hardcore nerds noticed - I thought I'd do a brief explanation of the concept of "uptime" for an Internet service.

Marketeers <spit> describe expected system uptime in "nines" - the fraction of time that the system is expected to be available. A "two nines" system is available 99% of the time. This sounds pretty good, until you realise that every day the system can be down for about 14 minutes. If Google, Facebook or the BBC News website were down for quarter of an hour every day, there would be trouble. So this is a pretty low bar.

For "Three nines" (99.9%) you start to move into downtime measured in minutes per week - there are just over 10,000 minutes in a week, so if you allow 1 in 1000 of those to be down, you're looking at 10 minutes per week. This is pretty tight - the rule of thumb says that even if you have someone at the end of a pager 24/7 and great system monitoring that alerts you whenever something goes wrong, it will still take your guy 10-15 minutes to react to the alert, log in, look to see what's wrong - and that's before he works out how to fix it. So your failures need to occur less frequently than weekly.

When you get to "Four nines" (99.99%) you're looking at either a seriously expensive system or a seriously simple system. During a whole year, you're allowed about fifty minutes of downtime, which by the maths above indicates no more than two incidents in that year - and, realistically, probably only one. At this level you start to be more reliable than most Internet Service Providers, so it starts to get hard to measure your uptime: your traffic is fluctuating all the time due to Internet outages affecting your users, so if your traffic drops, is it due to something you've done or to something external (e.g. a natural disaster like Hurricane Sandy)? Network connectivity and utility power supply are probably not this reliable, so you have to have serious redundancy and geographic distribution of your systems. I've personally run a distributed business system that nudged four nines of availability, with an under-resourced support team, and it was a cast-iron bastard - any time anything glitched, you had someone from Bangalore calling you at home around 1am. Not fun.

"Five Nines" (99.999%) is the Holy Grail of marketeers, but in practice it seems to be unachievable for a complex system. You have only 5 minutes per year of downtime allowed, which normally equates to one incident every 3-4 years at max. Either your system is extremely simple, or it's massively expensive to run. Normally the cost of that extra 45 minutes of uptime a year is prohibitive - easily double that of four nines in many cases, sometimes much more - and most reasonable people settle for four nines or, in practice, less than that.

Given that, let's examine the Daily Mail's assertion that "Experts said the outage had cost the company about £330,000 and that the event was unheard of." Google had about $50bn revenue last year, so divide that by 366 (2012 was a leap year) to get about $137M/day on average, or $5.7M/hour. A 5-minute outage is 1/12th of that: $474K, or £303K at today's rates, so the number sounds about right. But "unheard of"? On May 7 2005 there was another outage, that time lasting around 15 minutes. Google, Twitter, Yahoo, Facebook, Bing, iTunes etc. go down for some areas of the planet fairly frequently - see DownRightNow, which is currently showing me service disruptions for Yahoo Mail and Twitter. Gmail was down for a whole bunch of people for 18 minutes back in December. It's part of normal life.
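The back-of-envelope cost estimate above can be reproduced as a short Python sketch. It uses only the revenue and leap-year figures from the paragraph; the exchange rate of 1.565 dollars to the pound is an assumption standing in for "today's rates", and spreading revenue evenly around the clock is a simplification.

```python
# Back-of-envelope revenue lost during a short outage, using the figures above.

ANNUAL_REVENUE_USD = 50e9  # "about $50bn revenue last year"
DAYS_IN_YEAR = 366         # 2012 was a leap year
USD_PER_GBP = 1.565        # assumed exchange rate, not taken from the original post


def outage_cost_usd(outage_minutes: float) -> float:
    """Average revenue earned over an outage of the given length.

    Assumes revenue accrues evenly around the clock, which real traffic does not."""
    per_minute = ANNUAL_REVENUE_USD / (DAYS_IN_YEAR * 24 * 60)
    return per_minute * outage_minutes


cost = outage_cost_usd(5)
print(f"5-minute outage: ${cost:,.0f} (about £{cost / USD_PER_GBP:,.0f})")
```

Since real traffic peaks and troughs through the day, the true figure depends on when the outage hit; the flat average is only a sanity check on the quoted £330,000.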

Global networks go down all the time. Google going down for a few minutes is not the end of the world. It's happened before and will almost certainly happen again. The Daily Mail needs to find some better-quality experts - but then, I guess their quotes aren't as quotable. I'm not surprised Google drops off the planet for 5 minutes - I'm surprised it doesn't happen more often, and I'm astonished they get it back online in 5 minutes. I also feel sorry for anyone setting up their home Internet connection during that outage window who tried connecting to www.google.com to verify it, and failed. "I can't reach Google - my Internet must be bust, it certainly can't be Google that's unavailable..."

Update: (2013-08-19)
And now Amazon goes down worldwide for 30 minutes. I rest my case.

2013-08-14

Don't play the blame game

An article I dug up by the guys behind the "Etsy" marketplace website struck a chord with me. In it they explain why blaming engineers for making mistakes is perhaps the worst thing you could do. They express it in terms of a vicious cycle, where the key steps are:

2. Engineer is punished, shamed, blamed, or retrained.
3. Reduced trust between engineers on the ground (the "sharp end") and management (the "blunt end") looking for someone to scapegoat.
4. Engineers become silent on details about actions/situations/observations, resulting in "Cover-Your-Ass" engineering

CYA engineering leads, as night follows day, to a workplace where no-one wants to take responsibility for doing anything. This works (for some definition of "works") well in a government environment, but for a business where innovation and change are key to survival it's a rapidly fatal affliction. All well and good, but what's the alternative? Should you just let engineers make mistakes willy-nilly with no consequence? If they could do that, you could even hire PPE graduates for engineering jobs and save yourself the social dysfunctions of real engineers.

The solution adopted by Etsy (and very few other engineering organisations I've encountered) is a culture of blameless post-mortems. Whenever something goes wrong with significant impact - website goes down, loss of sales, software running amok - then, once the immediate incident has been dealt with, everyone will expect a post-mortem to be written. In somewhere like an investment bank this would traditionally be written by a manager who will avoid technical detail and seek to blame anyone and everyone but his own team; this is not helpful. Instead, a good post-mortem culture will require the engineer closest to the incident to write up the post-mortem, and ideally post it for circulation and discussion within a few days. The post-mortem should detail what went wrong and what actually happened - ideally incorporating a timeline, relevant fragments of IM and email discussions, and pointers to logs and graphs - how it was resolved, and most importantly the actions that need to be taken to prevent this kind of problem happening again.

It's fine in the post-mortem to identify people who made incorrect decisions, since indeed it's expected that in a stressful, time-pressured and unusual situation engineers and others will make bad calls. What isn't acceptable in the blameless post-mortem is to stop the analysis there: "Fred decided to repush the old version of the binary, which ended up breaking all customers, not just the 1-2 originally affected." Instead we ask ourselves: why did Fred do this? Did he have bad information about the problem? In that case the system monitoring may need improvement. Was he following an out-of-date playbook instruction? In that case someone needs to bring the playbook up to date. Was he too inexperienced to realise the consequences of what he was doing? In that case, perhaps there should be a minimum level of experience for on-call engineers in charge of an incident. What you can't say is "Fred did this because he's an idiot and should be fired."
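As a minimal sketch of the structure described above, a post-mortem record can insist that every finding carries a "why" and an action item, so that analysis which stops at "Fred did it" gets flagged. The class and field names are hypothetical, not Etsy's actual template, and the example incident is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One decision or event examined in the post-mortem."""
    what_happened: str
    why: str = ""          # systemic cause: bad information, stale playbook, inexperience...
    action_item: str = ""  # the change that stops it happening again


@dataclass
class PostMortem:
    title: str
    author: str  # ideally the engineer closest to the incident
    timeline: list[str] = field(default_factory=list)  # events, IM/email excerpts, links to logs/graphs
    resolution: str = ""
    findings: list[Finding] = field(default_factory=list)

    def unfinished_findings(self) -> list[str]:
        """Findings where the analysis stopped at 'who did it' without a 'why' and an action."""
        return [f.what_happened for f in self.findings if not (f.why and f.action_item)]


# Hypothetical example: the first finding stops at blame, the second is complete.
pm = PostMortem(
    title="Rollback broke all customers",
    author="Fred",
    findings=[
        Finding("Fred repushed the old version of the binary"),
        Finding(
            "Dashboard showed only 1-2 customers affected",
            why="monitoring did not cover the full customer base",
            action_item="extend monitoring to every customer",
        ),
    ],
)
print(pm.unfinished_findings())  # ['Fred repushed the old version of the binary']
```

The check is deliberately about completeness rather than fault: a finding naming Fred is fine, as long as it goes on to say why he did it and what changes as a result.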

jallspaw from Etsy explains why this approach works:

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error.

I've seen spirited discussion over post-mortems, but crucially they are not about blame - or when the discussion starts to veer off in that direction, a senior engineer steps in to put it back on track. Good engineers hate to see the same mistake happen again and again. A post-mortem lets the team understand what went wrong, without facts being concealed through blame-avoidance, and puts the team in a good position to make a fix. And if the fix is ineffective, and another outage occurs, there's the original post-mortem to include in the analysis: "why did we make the wrong diagnosis? What did we miss in our analysis of the right fix?"

Now if engineers can do this, and make it work - and it seems to work well for Etsy - is there any reason we can't incorporate this into government? When politicians, policy advisors or other policy makers screw up, how about we get them to describe what went wrong, why, and what can be done differently in future to prevent the same thing happening again? Of course, that would require a politician to admit being wrong in the first place, so I'm not holding my breath.

2013-08-12

The business of tipping

There has been a marked contrast between restaurant service in Britain and restaurant service in the USA - and by "the USA" I mean "pretty much any randomly-selected location in the entire United States of America" - for at least as long as I've lived, and I suspect a lot longer than that. The general impression you get in a UK restaurant is that you're intruding on the waiting staff's quality time; to be fair, this has become less apparent recently as East European staff have proliferated in London and other major cities. I hope that the irony of British people in a restaurant hoping that their server comes from Tallinn, Riga or Wroclaw instead of Bristol or Derby is not lost on Nick "Chubby" Griffin. American restaurant and diner waiting staff, in general, can't do enough for you. This is clearly due to the tipping culture in America, where good service is rewarded with good tips and bad service punished with derisory tips. (Giving no tip could mean that you're a complete asshole, or just a furriner who doesn't know what's expected.) Simple enough, yes?

Even in that seminal California film "Reservoir Dogs", tipping is intimately dissected by the participants:

Mr. White: You don't have any idea what you're talking about. These people bust their ass. This is a hard job.
Mr. Pink: So is working at McDonald's, but you don't see anyone tip them, do you? Why not, they're serving you food. But no, society says don't tip these guys over here, but tip these guys over here. It's bullshit!
Mr. White: Waitressing is the number one occupation for female non-college graduates in this country. It's the one job basically any woman can get, and make a living on. The reason is because of tips.
Mr. Pink: Fuck all that! I'm very sorry the government taxes their tips, that's fucked up. That ain't my fault. It would seem to me that waitresses are one of the many groups the government fucks in the ass on a regular basis. Look, if you show me a piece of paper that says the government shouldn't do that, I'll sign it, put it to a vote, I'll vote for it, but what I won't do is play ball. And this non-college bullshit you're givin' me, I got two words for that: learn to fuckin' type, 'cause if you're expecting me to help out with the rent you're in for a big fuckin' surprise.

Screwing over waitresses (I chose that word deliberately, and you should read about tips, sex and power at some point) is not done by anyone who regards themselves as a civilised human being. My median tip when eating out in the USA is 20%. Average is probably the same. Outliers are 25% (for really good service) and 5% (extremely rare, for British-quality service); if I'm anywhere near the latter, I'll be asking to speak to the manager in most cases. So everything's working as expected - or so I thought...

Back in 2008, San Diego restaurant owner Jay Porter decided to try an experiment: eliminating tipping in his restaurant and replacing it with a fixed (18%) service charge. So how did this work out? Conventional wisdom would suggest that the lack of incentive for better service would lead to a general lowering and flattening of the quality of service curve. After all, if removing tips worked well, wouldn't every restaurant do it?

The results make fascinating reading. Jay Porter's series of blog posts on the results is compelling, and a rare case of actual data supporting an argument. What I found particularly interesting were the worked examples. I'd forgotten - because I've never waited tables - that the waiting staff are only part of the restaurant team. The problem is that, if you pay all your staff minimum wage (which is all the average restaurateur can afford), the waiting staff benefit strongly from tips but the kitchen staff do not, despite contributing just as much to the dining experience. Hence the approach in some states (e.g. NY) of paying the waiting staff much less than minimum wage, expecting the wage to be made up with tips, and redistributing that money to the kitchen staff; the waiting staff still do pretty well, but the curve is at least flatter. In other states (e.g. CA) you're not allowed to pay waiting staff less than minimum wage, so there is a customary "tip out" where the waiting staff share some of their tips with kitchen staff, but it's still ad hoc and prone to abuse.

Porter's approach was to charge a fixed service charge and forbid tips. The service charge gets distributed reasonably equally around waiting and kitchen staff. Now I (and most economists) would have expected this to reduce overall quality of service, but in fact Porter claims this doesn't happen - good waiting staff don't pay much attention to tips throughout the service period, since they're too busy actually serving. It also removes an interesting perverse incentive on a server to maximise his or her tips at the expense of overall restaurant income; read the blog for the full details.
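To make the flattening effect concrete, here is a hypothetical comparison of how the same night's service money reaches staff under tips with a tip-out versus a pooled 18% service charge. Splitting the pool by hours worked is an assumption standing in for "reasonably equally", and every name and figure below is invented for illustration, not taken from Porter's data.

```python
# Hypothetical comparison of how a night's service money reaches staff.
# Names, sales, tips, hours and the 20% tip-out are all invented for illustration.


def pooled_service_charge(sales: float, hours: dict[str, float], rate: float = 0.18) -> dict[str, float]:
    """Split a fixed service charge across all staff in proportion to hours worked."""
    pool = sales * rate
    total_hours = sum(hours.values())
    return {name: pool * h / total_hours for name, h in hours.items()}


def tips_with_tip_out(tips: dict[str, float], kitchen: list[str], tip_out: float = 0.2) -> dict[str, float]:
    """Servers keep their own tips minus a tip-out, which the kitchen shares equally."""
    shares = {server: t * (1 - tip_out) for server, t in tips.items()}
    kitchen_pot = sum(tips.values()) * tip_out
    for cook in kitchen:
        shares[cook] = kitchen_pot / len(kitchen)
    return shares


staff_hours = {"server_a": 6.0, "server_b": 6.0, "cook_a": 6.0, "cook_b": 6.0}
pooled = pooled_service_charge(2000.0, staff_hours)
tipped = tips_with_tip_out({"server_a": 220.0, "server_b": 140.0}, ["cook_a", "cook_b"])

for name in staff_hours:
    print(f"{name}: pooled ${pooled[name]:.2f}, tipped ${tipped[name]:.2f}")
```

Both schemes hand out the same $360 in this example; the difference is only in who gets it, with the pooled charge giving the kitchen as much as the floor.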

Equally interesting is the case where substituting a service charge for tips doesn't work - tips in a bar:

In a crowded bar, bartenders are expected to just say the price of a drink order to a guest — we wouldn’t present physical checks. And it was during the presentation of the physical checks that we could best explain the service charge/no-tipping concept. The check also had the policy explained on it, so guests had a pretty good chance of understanding what was going on.
[...]
Given that the line item service charge seemed like a lost cause, we switched to building the service cost into our pricing. This is known as service compris, and a lot of people advocate for it, but it wasn’t a success for us. With service compris, an $8.50 cocktail became a $10 cocktail on the menu, and that was a huge psychological leap for our market

Note that the factors which made tip elimination painful in a bar were to some extent due to it being so unusual in a bar. One wonders how it might work out if a state or city banned tips altogether in bars in favour of a service charge.

I'm still not convinced that Porter's service charge approach would work as proposed across the huge range of restaurants and diners in the USA, especially in dubious establishments where one or two really good waiting staff keep the business afloat. Still, he makes a compelling case that the conventional wisdom about tipping is not completely supported by the available data.

2013-08-07

Amber crying wolf

Some parts of the USA have implemented an interesting system based on the ubiquity of mobile phones: "Amber Alert". This is something that freaks you out the first time you're watching TV and an alert triggers - the screen goes black, loud white noise erupts, and you see text scrolling across warning you of an escaped convict / missing child / rabid mountain lion in the area. The idea is that the local authorities can alert people in the area to emergency conditions that affect them directly. Since December 2012 this functionality has been added to most mobile phones, so as well as broadcasting the alert on TV they can make phones emit a loud beeping and show the message as a pop-up.

Yesterday they triggered this alert across the entire state as the result of an alleged child abduction following the alleged murder of a mother and child in Southern California. All very laudable: except that California is big. Really big. And now they've extended the alert to Washington and Oregon. You're talking about alerting 50M people "just in case". The suspect left Boulevard, CA (near the Mexican border) four days ago; he could be anywhere. The alert gave only a car make and color, with a California licence plate. It was no use at all in finding the suspect, especially in California and Oregon, where cars of that model and color bearing California licence plates are ten a penny. Later news announcements had photos of the suspect and children, which were much more likely to trigger a response in someone who had seen them recently. The Amber Alert was, in essence, annoying tens of millions of people for no benefit. Many people interviewed commented that they'd turned off the Amber Alert feature on their phone as a result, and others were aware that this would happen:

Lappin is worried that the annoying sound and seemingly random message -- the alert had no background on the kidnapping or the missing children -- will discourage people from using the notification system. "It should be for imminent danger that we should all be aware of," he said. "That's what I expect to hear when there's an earthquake, or something where I need to take action."

I'm dubious about the utility of an earthquake alert, as xkcd notes, but yes - alerts are for when you need to take action, not "just FYI".

This is a classic example of the "receiver pays" hazard that has given us email spam. Even with the best intentions, the senders of these messages are not considering the trade-off between the tiny marginal benefit accrued by alerting tens of millions of people with an unhelpful message, and the penalty of millions of pissed-off people turning off their alerts so that when an actually useful message is sent out they will miss it. Nice one.

"But please, won't someone think of the children!" I am thinking of the children - all the future kidnapped children whose alerts will be missed because of an over-eager child protection officer this time around. Perhaps that's not a hip concern, but in the long run it will save a lot more kids.

2013-08-06

OpSec - we've heard of it

FFS. Doesn't anyone practice operational security any more? It seems that some blabbermouth in the White House or State Department leaked the information that the US was intercepting some communications from Al Qaeda's "Mr Mojo" Ayman al-Zawahiri. CNN apparently had the information over the weekend:

[US and Yemeni officials] grew increasingly alarmed after intercepting a message within the past several days said to be from al-Zawahiri, who is believed to be in Pakistan. The message was sent to Nasir al-Wuhayshi, the leader of al Qaeda in the Arabian Peninsula, the terror group's Yemeni affiliate.

CNN, in a rare act of responsibility, held back from reporting this information. Other media organisations were not so restrained.

Two possibilities:

The US is not actually reading these communications. Since al-Zawahiri and al-Wuhayshi know what they talked about, they will know if the reports are erroneous and they won't care. All this achieves is to make the US look like it's actually doing something about the terror threat. I hope, hope this is true.
The US is reading these communications, after expenditure of much blood, sweat and dollars. This effort is now wasted as al-Zawahiri and co. will change their procedures - and doubtless brutally kill anyone suspected of involvement in the interception. I fear that this is actually the case.

Neither possibility reflects particularly well on the current administration. If it doesn't hunt down the leaker and throw them into Fort Leavenworth or Allenwood for a good decade or so, we'll know what their priorities are.

Marcela Trust accounts for 2012

Distracted by other blogging, I was negligent in failing to notice that the Marcela Trust had posted their 2012 accounts back in March. You can read the results of my previous investigations of the Marcela Trust and friends at your leisure.

Interesting points from the accounts:

Natasha Malby resigned as a trustee on 6th October 2011. I wonder why?
They're still with Spofforth's as accountants, and still HQ'd at 14 Buckingham Street. I actually dropped by that address a couple of months back; the doorbell panel indicates that as well as the IPPR (who presumably take up a floor or two of the building) there are a veritable host of small organisations at that address, e.g. the rapacious capitalists at investment firm Dawn Capital.
They donated £100K to the anti-salt campaigners at CASH, which donation seems to have been passed straight through the Marcela Trust from OMC Investments Ltd.
They donated £170K to Fauna & Flora International, registered charity no. 1011102, to "fund specific Community initiatives in the impoverished Zarand area of Western Transylvania and a Rumanian[sic] post graduate student at Cambridge University"
That Romanian student appears to be Lenke Barint who graduated in the MPhil 2011-2012 class.
The Marcela Trust carried forward £65M in its funds balance essentially unchanged.
Current assets dropped from £11.8M in 2011 to £4.5M in 2012; tangible assets went up £20M to £77M; and the amount owed to creditors went up from £200K to £13M to balance.
Why did tangible assets jump by £20M? Because they spent £24M to acquire new tangible assets (freehold investment property) and raised £12M in loans - this sounds like they're speculating in property like the Camelia Botnar Foundation has been doing. I wonder how trustee Mrs. Dawn Pamela Rose is involved in these acquisitions?
Wages and salaries jumped from £570K to £854K for the same 4 people as in 2011. Since Natasha Malby resigned early in the year, her salary was negligible. Dawn Pamela Rose's salary was about the same. Brian Arthur Groves charged the Trust £89K for his services as a director. I wonder who else made up the remainder of the salaries?
Dawn Pamela Rose's company QHH Limited appears in the accounts as a subsidiary with £1.1M turnover and 45% of that as gross profit - which was more than wiped out by £622K of "administrative expenses and exceptional items", for a net loss of £69K for the year. I would really, really like to know what QHH Limited is doing with that money.
Brian Arthur Groves was made a £200K loan, secured against his equity, with interest at 1% above BoE base rate; this amount is now down to £150K. Note that I've not rounded these numbers - according to the accounts they are exact.
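The claim that the rise in creditors balances the asset movements can be checked with a few lines of Python. The figures are the rounded £M values from the list above, with 2011 tangible assets inferred as £77M less the £20M rise; treating the funds balance as unchanged follows the carried-forward figure, and other balance sheet lines are ignored.

```python
# Check that the year-on-year balance sheet movements quoted above roughly cancel.
# Figures in £M, rounded as in the summary; 2011 tangible assets inferred as 77 - 20.

current_assets_change = 4.5 - 11.8    # fell by about £7.3M
tangible_assets_change = 77.0 - 57.0  # rose by £20M
creditors_change = 13.0 - 0.2         # rose by about £12.8M

# If the funds balance is essentially unchanged, this should be close to zero.
net_asset_change = current_assets_change + tangible_assets_change - creditors_change
print(f"Net assets moved by £{net_asset_change:+.1f}M")
```

The residual of about £0.1M is within the rounding of the quoted figures, so the new borrowing and creditors do account for the property purchases.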

Fauna & Flora International had £18.2M income and £17.3M spending in 2012, and has lots of eminent people as Vice Presidents. Of that income, about £4.3M came from trusts and foundations, so the Marcela Trust's £170K looks like an arbitrary donation, but I'd be fascinated to know what connection Lenke Barint has to the Marcela Trust trustees.

What does all the above tell me? The Marcela Trust is marking time, being used as a conduit for donations to charities by OMC Investments, and using its substantial assets to speculate in the property market. It seems that being an employee or director of the Trust continues to be a well-paid gig, especially in relation to its level of charitable activity. I would love to know what QHH Limited is doing on behalf of the Marcela Trust - more, I'd like to get a job as admin staff for them as it seems to be quite the sinecure.

2013-08-03

Atlas Shrugged - part 2

Nearly a year after watching Atlas Shrugged part 1 I finally got around to watching part 2 - "The Strike". I knew going into it that the cast would change completely, so was expecting a certain amount of dissonance. In the event it wasn't noticeable, perhaps because it had been so long since I saw part 1. So how did part 2 stack up?

The cast changes from part 1 were a mixed bag:

Paul McCrane as Wesley Mouch: an improvement, a much better weasel.
Samantha Mathis made a better Dagny: older and more world-weary. Next to Mathis, the drawbacks of Taylor Schilling's youth in the role were more obvious, despite Schilling being very easy on the eye.
Esai Morales: very smooth as Francisco d'Anconia. He was used to narrate Randian opinions in a couple of scenes but did so without being too hammy.
The nerd in me applauded Robert Picardo as Dr. Robert Stadler, the head of the SSI.
The main disappointment was Jason Beghe as Hank Rearden: Grant Bowler was far better in part 1, and Beghe's Rearden was more smirking and annoying.
I was in two minds about Patrick Fabian as James Taggart; he was too young in my mind for the role, but he did get the playboy attitude right.

Where I thought the film succeeded, more so than part 1, was in making subtle nods to popular culture today so that it could be seen as a warning of an (unlikely) very near future. This was exemplified by including a Fox News deconstruction of the Fair Share Law, with Sean Hannity facing off against Juan Williams and Bob Beckel. I nearly laughed my socks off. It was no doubt a good marketing gag - after all, I'd imagine Fox viewers are over-represented in the likely film audience. Indeed, Bob's brother Graham played Ellis Wyatt in part 1, which may have persuaded Bob to come on board for part 2. Also, plaudits are due to whoever came up with the "We are the 99.98%" protest signs waved at Dagny - genius. The use of protestors and their signs in a couple of the scenes to indicate changing public attitudes was clever.

The film itself felt a little rushed, which is not surprising considering how much they had to cram into 111 minutes, less a couple of minutes of the flight sequence duplicated at the start and end. I will be fascinated to see what they do in part 3 about the 80+ page radio monologue - maybe they'll stick it on a separate DVD... It was not quite stand-alone as a film, but made a stab at it. You needed some context from part 1 or from the book to make sense of the situation and the relationships between Mouch, the Taggarts and the rest of the industrialists, but the film makers did a creditable job of minimising this. Where it fell down was in explaining Hank Rearden's attitude - why he was refusing to roll over to the government, and why he then changed his mind so quickly when he was threatened with exposure of his affair with Dagny. The film makers could have done a better job of re-establishing his obsession with his firm, but I suppose they were relying on part 1 to have done that.

I can recommend Atlas Shrugged part 2, on balance. You don't need to read the book for it to make sense, as long as you've seen part 1. It was engaging and cleverly pitched. I'm not sure what Rand would have made of it, but I think the tie-ins to modern society would have amused her.