Reflections on uptime

A couple of conversations this week have made me realise how "uptime" and its unloved stepchild "downtime" are misunderstood in today's world of the always-on Internet. I thought I'd blog a little about this and see where it went. First, the case of fanfiction.net.

This conversation was with a buddy in NYC who is an avid fan-fiction reader. She (and fan-fiction readers are disproportionately "she") was complaining about the site fanfiction.net being down for an hour or two, during which time she was going cold-turkey being deprived of new chapters from her favourite authors. When quizzed further, she admitted that the site was down some time between 1am and 3am NYC time, so perhaps she actually got more sleep than she would have otherwise... anyway, her complaint was "why is fanfiction.net always going down?". So here's my attempt at an answer.

fanfiction.net is hosted by Tiggee, so costs real money to run. The site has some ads - Google's AdChoices - but they seem to be tastefully done and not in-your-face. Neither reading nor uploading fanfic costs anything, so ad income has got to cover the entire cost of running the site. As well as Tiggee's fees for hosting, this also has to cover the time and trouble of the site maintainers. Downtime means a loss of ad revenue as well as readership, so the owners are going to want to minimise the downtime but not spend too much money doing so. Assuming 5TB of storage and 30TB/month bandwidth, a sample hosting company like Cloud Media will charge you about $3700/month. Let's assume then that it costs $4000/month to host the site as it stands, and that ads bring in a steady $8000/month in revenue (about $11 / hour over a 30-day month).

My friend reports an informal estimate of fanfiction.net being down maybe 4% of the time she checks (ignoring rapid re-checking in the 15 minutes after she sees that it has gone down). This would be costing them $320/month in lost ads. This isn't worth getting out of bed for.
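The arithmetic behind that figure can be sketched in a few lines of Python. This is only a back-of-envelope model: it assumes a 30-day month and that ad revenue accrues evenly around the clock, and the function names are my own invention for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def monthly_downtime_loss(monthly_revenue, downtime_fraction):
    """Ad revenue lost per month, assuming revenue accrues evenly over time."""
    return monthly_revenue * downtime_fraction


def monthly_downtime_hours(downtime_fraction, hours_per_month=HOURS_PER_MONTH):
    """Hours per month the site is unavailable at a given downtime fraction."""
    return downtime_fraction * hours_per_month


# Figures from the post: $8000/month in ads, down about 4% of the time.
loss = monthly_downtime_loss(8000, 0.04)
hours = monthly_downtime_hours(0.04)
print(f"Lost ads: ${loss:.0f}/month over roughly {hours:.1f} hours of downtime")
```

On these assumptions 4% downtime is roughly 29 hours a month, which sounds like a lot until you see how little money it represents.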

Not all hours of downtime are equal, however. Web browsing follows a roughly diurnal curve. Working in GMT (and noting that summer skews these results, since some nations don't observe daylight saving time), the trough is at GMT midnight, when people are coming in to work in Japan. If you have some control over your downtime (e.g. for system upgrades) you can schedule it in the window of 5am - 8am GMT when virtually none of your likely audience in English-speaking countries is awake except for die-hard nerds. The opportunity cost of your scheduled downtime is probably halved, or more, so that $320/month loss drops further.

Is it even worth trying to stop unscheduled downtime like this? Your hosting company is already taking care of what outages are in their control (bad hardware, network misconfiguration etc.) and mistakes on their part will result in a refund of your hosting fees for outage time outside their Service Level Agreement. You can probably assume that they'll give your hosted machines something like 99.9% uptime, which is a little bit less than 1 hour of downtime per month. All you have to worry about is misconfiguration or performance problems of the software you run on their hosted machines, which most often happen after you - the site owner - have made a change. There are occasions when your site falls over spontaneously (e.g. because you've run out of storage space) but they are few and far between. Setting up an alerting system which pages you if your service goes down outside your normal working hours would require a substantial technical and financial investment, and probably wreck your sleep patterns.
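To put numbers on those uptime percentages, here is a small sketch that converts an uptime figure into the downtime it permits each month. It assumes a 30-day month, and the function name is just for illustration.

```python
def downtime_minutes_per_month(uptime_percent, days_per_month=30):
    """Minutes of downtime per month permitted by a given uptime percentage."""
    minutes_in_month = days_per_month * 24 * 60
    return (100.0 - uptime_percent) / 100.0 * minutes_in_month


# One, two and three "nines" of uptime.
for uptime in (90.0, 99.0, 99.9):
    minutes = downtime_minutes_per_month(uptime)
    print(f"{uptime}% uptime allows about {minutes:.0f} minutes down per month")
```

At 99.9% that works out to about 43 minutes a month, while 99% allows a bit over 7 hours and 90% allows 3 days.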

The usual way out of this corner is delegation, but here the information economy prices work against you; even a spotty part-time sysadmin who has no idea what he's doing would cost you $2500/month to have on-call. Unless you can pool him with a number of other sites to amortise his cost and increase his utilisation, there's no point parting with your cash. This is why people tell you that adding each "nine" of reliability (going from e.g. 90% to 99% or 99% to 99.9% uptime) increases your costs exponentially - you need new layers of people and systems to a) prevent downtime occurring and b) react extremely quickly when it happens. For situations when your downtime losses are close to your uptime income and costs, it's far better to accept the downtime, fix it during your normal working hours, and be extremely conservative in operating your site within its allocated storage and bandwidth limits. Only make changes when you have to, and ensure you're around and watching closely for several hours after the upgrade.
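Using the rough figures from this post ($320/month lost to downtime, $2500/month for an on-call sysadmin), the trade-off can be sketched like this. The model is deliberately generous to the sysadmin: it assumes he eliminates all downtime, and that any pooled sites lose about the same amount as this one. Both are simplifying assumptions of mine, as are the function names.

```python
import math


def on_call_pays_off(monthly_downtime_loss, on_call_cost):
    """True only if eliminating *all* downtime would save more than on-call costs."""
    return monthly_downtime_loss > on_call_cost


def sites_needed_to_share(on_call_cost, per_site_downtime_loss):
    """Smallest number of similar sites whose combined losses cover one sysadmin."""
    return math.ceil(on_call_cost / per_site_downtime_loss)


print(on_call_pays_off(320, 2500))       # is on-call worth it for one site?
print(sites_needed_to_share(2500, 320))  # how many sites must share the cost?
```

On these numbers a single site never comes close, and you would need around eight similar sites sharing one sysadmin before the sums even break even.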

The conclusion? The reason most "free" websites have somewhat unreliable uptime (between 1 and 2 "nines", i.e. 90-99% uptime) is that it's simply not economically worth their while to pay the additional costs to be down for shorter durations or on fewer occasions. You get what you pay for. Of course, there are situations when downtime costs are not close to operating costs - I'll be addressing that in my next blog.
