2014-02-10

Aligned incentives to improve uptime and usability

Something I've been chewing over for the last couple of weeks is why we see such a disparity between applications that work really well, scale well and are extremely reliable (e.g. Facebook), and others which remain a complete disaster despite the huge amounts of money spent developing and supporting them (e.g. Healthcare.gov). I'm going to propose Hopper's Law of Operational Sanity:

Axiom: Your codebase can only be significantly improved for users when your developers feel users' pain.
Corollary: If you're outsourcing support for your application, you don't really care about making it better.
A bold claim; let's try to justify it.

Let's assume that the application in question is used heavily by the public, and that it includes a reasonable feedback mechanism for problems (e.g. a forum, an FAQ page, or static support pages with a feedback form for more detailed problem reports). We're doing all the monitoring basics: we log success/failure rates, have trend reports around these stats, have a QA/test team, do reasonably frequent releases to production, and have an operations team responsible for monitoring the system and reacting to problems. I claim that this is nowhere near enough for a system that users actually want to use. Why not?

The problem is the disconnect between the interests of the engineers developing and testing the application, the interests of the operations team, and the interests of the people using it. Developers are paid based on the features they launch and on the visibility of the bugs they fix. If the operations team is getting alerts every hour for a condition that isn't really important (or, at least, can't be fixed), the development team is unlikely to care - they will just tell the ops team to ignore the alerts. That's fine, but you're desensitising the ops team to alerts and filling their mailboxes with noise. When an actual problem happens and an alert is sent, it'll take the ops team longer to notice; they may even have set up a mailbox rule to file these alerts, and not notice the real one for days.

Similarly, if there are occasional system overload problems the ops team will have to scramble to fix them, generating lots of bustling activity. The developers don't care, because the ops team can deal with them. The ops team are paid for their activity, so don't mind the occasional panic. No-one has an incentive to add code that detects this condition arising and makes it easier to handle (e.g. by measuring system capacity and prophylactically alerting if the system goes over 80% of that level, or adding the option to switch the system into a lower-load mode by temporarily blocking expensive functions), despite the fact that this would be of great long-term benefit to the company in reducing failures and their associated cost.
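To show how little code that kind of check actually needs, here's a rough sketch in Python. The metric names (current_request_rate, measured_capacity), the thresholds and the load-shedding hook are made-up placeholders standing in for whatever your monitoring pipeline and feature-flag machinery actually provide:

    # A minimal capacity guard, as described above. The metric sources and the
    # load-shedding hook are hypothetical placeholders; in a real system they
    # would come from your metrics pipeline and feature-flag system.

    CAPACITY_ALERT_THRESHOLD = 0.80   # alert once load exceeds 80% of capacity
    CAPACITY_SHED_THRESHOLD = 0.95    # block expensive functions near saturation


    def check_capacity(current_request_rate: float, measured_capacity: float) -> str:
        """Return 'ok', 'alert' or 'shed' depending on how close we are to capacity."""
        if measured_capacity <= 0:
            raise ValueError("measured_capacity must be positive")
        utilisation = current_request_rate / measured_capacity
        if utilisation >= CAPACITY_SHED_THRESHOLD:
            return "shed"    # temporarily block expensive functions
        if utilisation >= CAPACITY_ALERT_THRESHOLD:
            return "alert"   # page someone before the system falls over
        return "ok"


    if __name__ == "__main__":
        # Example: 850 requests/sec against a measured ceiling of 1,000 requests/sec.
        print(check_capacity(850.0, 1000.0))   # -> 'alert'

The point isn't the particular numbers; it's that the check costs a few minutes to write once someone actually has a reason to write it.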

The gap between the developers and the users is even more straightforward. The developers and the marketing team can come up with all sorts of cool ideas and features for the product, spend months developing and launching them, and still be left with the product being slammed by unsatisfied users. How come, after all this effort developing and testing the product? Well, it could simply be that, according to the old saw, the dogs don't like it. If the developers are relying on feedback from the launched product to know whether the users are happy, they've already lost; the development feedback loop will take way too long. Instead they need to know about problems before they launch the feature. How to do that? Dogfood it!

"Dogfooding" is the process by which employees at a company use a pre-release version of the application for their daily work; ideally, nearly all the developers of the app would dogfood. This is very different from system and QA testing because this work is not done with test data and test users; instead, a real person is interacting with the app and trying to make it do what they want. The only difference between this person and a regular user is that the tester has (or should have) an easy way of raising a specific bug on the developers if they run into something that gets in their way.

Scaling is certainly a user-facing problem, and you can't easily test scaling with a dogfood app in isolation; however, if you can measure its resource consumption in isolation then you can get a delta of resource usage between dogfood releases, which should at least highlight whether a new release has a CPU/memory/network-devouring bug. This is not a foolproof method, but it's certainly better than the standard approach of running load tests; the only problems load tests will normally detect are gross changes in performance for whatever standardised actions you're feeding into them.
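As a rough illustration of what that release-to-release comparison might look like - the resource names and figures here are invented, standing in for whatever your monitoring actually records for the dogfood instance of each release:

    # Compare per-release resource usage for the dogfood instance and flag
    # anything that grew by more than a chosen tolerance. The figures below
    # are invented examples.

    def resource_deltas(previous: dict, current: dict, tolerance: float = 0.10) -> dict:
        """Return the fractional change per resource, flagging growth above `tolerance`."""
        report = {}
        for resource, old_value in previous.items():
            new_value = current.get(resource)
            if new_value is None or old_value == 0:
                continue
            change = (new_value - old_value) / old_value
            report[resource] = {"change": round(change, 3),
                                "regression": change > tolerance}
        return report


    if __name__ == "__main__":
        release_41 = {"cpu_seconds": 1200.0, "rss_mb": 850.0, "net_mb": 300.0}
        release_42 = {"cpu_seconds": 1900.0, "rss_mb": 870.0, "net_mb": 305.0}
        print(resource_deltas(release_41, release_42))
        # cpu_seconds grew by ~58%, which would flag the new release for a closer look.

It won't catch everything - dogfood load is nothing like production load - but a release that suddenly eats half again as much CPU for the same work is worth a look before it ships.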

If you want your system to work well, the people developing it need to feel the immediate pain of their errors. Make them responsible for monitoring their part of the system for at least three months after its launch, and only allow it to be handed over to the operations team when it has demonstrated a suitably low level of alerting and a suitably high proportion of actual error conditions generating alerts. Make the developers use a pre-release version of the system for their daily work: if it's a webmail system then use it for their mail; if it's a bug database then keep the bug database's own bugs in a pre-release copy of itself. You're going to take a slight hit on productivity, but you'll be surprised how quickly and smoothly you can fail over from a bad version to a known good version after the first few such problems - and if you can do it for developers, you can do it for your production version.
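If you want that handover criterion to be measurable rather than a matter of opinion, a check along these lines would do; the thresholds and the thirteen-week window are arbitrary examples, not recommendations:

    # A sketch of a measurable handover check: the system must have been watched
    # long enough, must alert rarely, and real error conditions must reliably
    # have generated alerts. All thresholds here are arbitrary examples.

    from dataclasses import dataclass


    @dataclass
    class Incident:
        generated_alert: bool   # did an alert fire for this real error condition?


    def ready_for_handover(alerts_per_week: float,
                           incidents: list,
                           weeks_observed: int,
                           max_alerts_per_week: float = 5.0,
                           min_alerted_fraction: float = 0.9) -> bool:
        """True when the system is quiet enough and real problems reliably alert."""
        if weeks_observed < 13:   # roughly the three months suggested above
            return False
        if incidents:
            alerted = sum(i.generated_alert for i in incidents) / len(incidents)
            if alerted < min_alerted_fraction:
                return False
        return alerts_per_week <= max_alerts_per_week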

Incidentally, a facet of this law was in evidence in the recent revelations about Edward Snowden's access to NSA data:

Intelligence officials have claimed that Snowden was able to do all this [automatically web-crawling the NSA pages] largely because the Oahu NSA facilities had not gotten the software purchased to prevent insider threats in the wake of WikiLeaks. "He was either very lucky or very strategic" to get the positions he held in Hawaii, one official told the Times. But it's also entirely possible that his activities would have gone unchecked in any case, simply because of his system administrator status.
I suspect it's even simpler than that. No-one in a position to detect and stop Snowden was sufficiently interested in securing the network. They'd certainly implement any system they were asked to, and investigate any alert that went off - that's what they were paid for, after all. But no-one was really engaged in pro-actively hunting down and preventing security threats. This is a hard thing to set up, admittedly - how do you pay someone for a security breach failing to happen because of their actions? - but if you can't figure out how to do it then Snowden-like security breaches will keep happening.

All the stable-door slamming currently going on will simply make it harder for people to do their jobs, because the formerly loose data access that they benefited from will be taken away in the name of security. The pendulum has swung in the other direction: no-one will benefit from highlighting a security change with a good security/access trade-off, because everyone is in arse-covering mode and no-one will countenance loosening security for any reason.

Incentives matter; this is as true in software as it is anywhere else. Now we need to start acting like they matter.
