2012-06-24

This must be some meaning of the word "glitch" of which I was not previously aware

Let's recap what we know about the NatWest/RBS banking outage. Looking at The Register's assessment on Friday, we see:

The screw-up has been pinned down to a flaw with payment-processing software, and primarily means that bank balances don't register inbound payments.
This problem hit on the 18th of June (a Monday), and so presumably was the result of a Sunday code roll-out. Since this hit so many customers, it's clearly not an intermittent problem; I'd be fascinated to learn who in the RBS/NatWest IT management OK'd a general deployment of this code version. I'd also be interested to learn why it couldn't be immediately rolled back once the problem was apparent - perhaps there was a data schema change involved, so they did the schema upgrade over the weekend downtime and then enabled it on Monday, but for some reason didn't have a verified reversion plan in place. Oopsie.
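
To be concrete about what I mean by a "verified reversion plan": something like the following round-trip check, run against a copy of the production data before the weekend. This is a toy sketch in Python against a throwaway SQLite database, with table and column names I've invented for illustration; I have no idea what RBS's actual stack or schema looks like.

    # Toy sketch only: what a "verified reversion plan" for a schema change
    # might look like. Python plus a throwaway SQLite database; the table and
    # column names are invented, and this is not RBS's actual system.
    import sqlite3

    def upgrade(conn):
        # The weekend schema change, applied ahead of the Monday code push.
        conn.execute("ALTER TABLE inbound_payments ADD COLUMN cleared_at TEXT")
        conn.commit()

    def downgrade(conn):
        # The reversal - the bit you prove works *before* Monday morning.
        # SQLite can't simply drop a column here, so rebuild the table.
        conn.executescript("""
            CREATE TABLE inbound_payments_old AS
                SELECT id, account_id, amount, received_at FROM inbound_payments;
            DROP TABLE inbound_payments;
            ALTER TABLE inbound_payments_old RENAME TO inbound_payments;
        """)
        conn.commit()

    def verify_roundtrip():
        # Run against a copy of production data: upgrade then downgrade
        # must leave the payments table exactly as it started.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE inbound_payments "
                     "(id INTEGER, account_id INTEGER, amount INTEGER, received_at TEXT)")
        conn.execute("INSERT INTO inbound_payments VALUES (1, 42, 1000, '2012-06-15')")
        before = conn.execute("SELECT * FROM inbound_payments").fetchall()
        upgrade(conn)
        downgrade(conn)
        after = conn.execute("SELECT * FROM inbound_payments").fetchall()
        assert before == after, "rollback is not safe - do not deploy"

    if __name__ == "__main__":
        verify_roundtrip()
        print("upgrade/downgrade round-trip verified")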

I particularly liked:

Customers were further infuriated after NatWest charged them for ringing its emergency helpline by initially providing an 0845 number instead of a free 0800 number, although the bank later said it would reimburse the cost of the calls.
where NatWest compounded an already massive technical failure with an entirely avoidable PR failure. Good job, guys! I hope someone in PR is going to get a P45 for that particular decision.

It's easy to be wise after the fact, but releasing new software versions to a small set of customers to identify problems just like this is not exactly an unknown approach in software engineering. Nor is having a verified rollback procedure to revert the change when disaster has become apparent. Since it took until the weekend to deploy a fix, my bet is that it's not actually possible to deploy a software upgrade when RBS's systems are in operation, and that the broken payments system is tied to enough other currently-functional systems that downing it for replacement would have made things even worse.
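
For the avoidance of doubt, here's the sort of thing I mean by releasing to a small set of customers first - a toy canary-routing sketch in Python. The function names, the hashing trick and the 1% figure are all mine for illustration; nothing here is a claim about how RBS's payment processing is actually structured.

    # Toy sketch of a canary rollout: send a small, stable slice of accounts
    # through the new payment-processing path and watch it before going wider.
    # Everything here (names, thresholds, pipelines) is invented.
    import hashlib

    CANARY_PERCENT = 1  # start tiny; widen only once the error rate looks sane

    def in_canary(account_id, percent=CANARY_PERCENT):
        # Deterministic: the same account always lands in the same group,
        # so any problem is confined to a known, small set of customers.
        digest = hashlib.sha256(str(account_id).encode()).hexdigest()
        return int(digest, 16) % 100 < percent

    def process_inbound_payment(account_id, payment):
        if in_canary(account_id):
            new_payment_pipeline(payment)   # the freshly deployed code
        else:
            old_payment_pipeline(payment)   # last week's known-good code

    def new_payment_pipeline(payment):
        pass  # placeholder - the new code under trial

    def old_payment_pipeline(payment):
        pass  # placeholder - the incumbent code path

Had the 18th of June hit 1% of accounts instead of all of them, this would have been a quiet internal incident rather than a week of national news.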

I'm really looking forward to a technical post-mortem here, though I suspect the RBS lawyers and senior IT management will do their damnedest to prevent it becoming public due to "proprietary technology concerns" or similar weasel words. If I were writing it, though, it would go something like this:

  1. We were told to deploy (new feature X) by June 4th;
  2. Development and testing took longer than anticipated, and on June 13th we were instructed to deploy it by June 18th or face the consequences;
  3. Our software passed all the relevant tests (which were inadequate, because good testing is the first thing that a development team skimps on - see the sketch after these lists for the kind of end-to-end check that was evidently missing);
  4. We had a rollback plan (but did not verify it, because we never use them);
  5. We deployed the software in our standard release plan (upgrade the data schema over the weekend, verify the data integrity, then push the upgrade on Sunday night);
  6. Early on Monday morning our pagers and phones started ringing off the hook;
  7. We attempted a rollback and discovered (unanticipated issue Y) which meant the rollback would not work;
  8. We entered a holding pattern and put together a bugfix that could not be deployed until Saturday since downing the system would have made things even worse;
  9. There followed five days of customer support hell;
  10. We downed the systems on Saturday, applied the bugfix and tested extensively, with everything now appearing fine;
  11. We opened for business on Monday and everything worked just fine.
Follow-up actions for IT management:
  1. Find a suitable scapegoat middle-manager and fire them;
  2. Announce "lessons will be learned";
  3. Appoint a new Director of Business IT Risk who happens to be a mate of one of the RBS board;
  4. Splurge some money on whizzy IT release software and associated consultancy that doesn't make a blind bit of difference;
  5. Move their own personal bank account to a UK bank with some grasp of how to run financial IT.
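
As promised above, the kind of end-to-end check that point 3 should have included: pay money into a test account and assert that the balance actually registers it. Another toy sketch in Python, with an invented in-memory "bank"; the real thing would run against a staging copy of the payment system, whatever that looks like inside RBS.

    # Toy end-to-end check: an inbound payment must show up in the balance.
    # The Bank class is a stand-in for a staging instance of the real system.
    class Bank:
        def __init__(self):
            self.balances = {}

        def balance(self, account_id):
            return self.balances.get(account_id, 0)

        def receive_payment(self, account_id, amount):
            # Stand-in for the code path that broke on the 18th of June.
            self.balances[account_id] = self.balance(account_id) + amount

    def test_inbound_payment_registers():
        bank = Bank()
        before = bank.balance("test-account")
        bank.receive_payment("test-account", 10000)  # 100 pounds, in pence
        assert bank.balance("test-account") == before + 10000, \
            "inbound payment did not register - do not ship"

    if __name__ == "__main__":
        test_inbound_payment_registers()
        print("inbound payments register correctly")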

I'm not entirely sure that getting the Government, or even the FSA, involved is a good idea though. There is going to be plenty of hauling-over-the-coals going on in the RBS IT department without regulator intervention -- and, let's face it, designing and testing business-critical software systems is not exactly the FSA's line of expertise. "There's no problem so bad that government intervention can't make worse", and all that.

It also occurs to me that the best way for RBS to prevent a recurrence of this is to pay Good Money to find and recruit a small number of really good engineers from Google, Amazon, Microsoft etc. (all famed for their uptime and reliability - yes, even Microsoft) and give them free rein to fix software, processes and people. Give them a large, meaty bonus conditional on measurable reliability improvements and comprising a significant proportion of RBS shares. Perhaps this is one case where the public and the FSA wouldn't object to large "banker bonuses".

