So this is interesting. Google is dropping its
cloud storage rates to $10 per month per TB (though 100GB costs $2 a month, twice that per-TB rate).
Amazon S3 storage is currently $85 per TB per month and
Microsoft Azure is $64 per TB per month
for its cheapest option (Locally Redundant Storage). I'd expect both of these prices to drop fairly
soon in response to Google's move.
How much does this actually cost to provide? Let's look at the cost of storing and accessing 1 TB of data.
An internal SATA 1TB hard drive costs about $60 on Amazon - but a 2TB costs $85, and a 4TB costs $160 (retail).
So we can figure on about $40 per TB of storage. How long will this drive last? The typical working life
of a hard drive in constant use is somewhere between 18 months and 3 years, depending on manufacturer and usage;
let's split the difference and say 2 years. Supplying 1TB of storage for 2 years will therefore cost the provider
about $40 in capital costs, against the $240 a user pays over the same period. Isn't this a rip-off?
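Here's that comparison as a quick back-of-the-envelope sketch in Python; the figures are just the retail drive price and the 2-year lifetime assumed above.

```python
# Back-of-the-envelope: raw disk capital cost vs. what a user pays over the
# drive's assumed 2-year life, using the retail prices quoted above.

drive_cost_per_tb = 160 / 4          # $40/TB, based on a $160 4 TB drive
lifetime_months = 24                 # assumed working life of the drive
price_per_tb_month = 10              # Google's new rate

capital_cost = drive_cost_per_tb                 # ~$40 per TB over 2 years
revenue = price_per_tb_month * lifetime_months   # $240 per TB over 2 years

print(f"capital cost per TB over 2 years: ${capital_cost:.0f}")
print(f"revenue per TB over 2 years:      ${revenue}")
```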
Well, having a hard drive is one thing - being able to access it is another. You've got to get data
into that hard drive, and probably you want to get it out again. Assuming that the entire volume of that
drive is written once and read twice over its 2-year life (probably a lowball estimate) at a rate of about 5 Mbit/s:
in 1 day (86400 seconds) you can move (86400 * 5 / 8) MB, or about 54 GB. Each user's traffic is then about
1.5 TB a year (1 TB written plus 2 TB read, spread over 2 years), which ties up the link for about 28 days a year,
so you can support about 13 users on a 5 Mbit connection. Let's say we're using 4 TB drives (4 users each),
so you need a 5 Mbit connection for roughly every 3 computers in your storage.
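The same arithmetic, as a short sketch under the assumptions above (1 TB per user, written once and read twice over 2 years, 5 Mbit/s links):

```python
# How many 1 TB users can share a 5 Mbit/s link, given the assumed access
# pattern of one full write and two full reads per 2-year drive lifetime?

link_mbit_s = 5
gb_per_day = link_mbit_s / 8 * 86400 / 1000            # ~54 GB/day through the link

tb_per_user_per_year = (1 + 2) * 1 / 2                 # 1 TB written + 2 TB read, over 2 years
days_per_user_per_year = tb_per_user_per_year * 1000 / gb_per_day   # ~28 days/year

users_per_link = 365 / days_per_user_per_year          # ~13 users per link
computers_per_link = users_per_link / 4                # ~3, at 4 users per 4 TB drive

print(f"{gb_per_day:.0f} GB/day, {days_per_user_per_year:.0f} days per user per year")
print(f"~{users_per_link:.0f} users, ~{computers_per_link:.0f} computers per 5 Mbit link")
```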
It's not quite that straightforward though. Cloud storage is supposed to be reliable, and
hard drives are manifestly not - they die all the time. Therefore you want at least a second copy of
your data on a separate hard drive, and ideally you want that second copy to be in at least a separate
building in case of a physical disaster (flooding, fire, tornado). Generally the further away the better,
at least up to 100 miles or so, though distance tends to increase the expense of hosting because for every
write to the data you need to send the write to your remote facility as well. Azure lets you choose explicitly how physically separate you want your data to be: Locally Redundant Storage vs. Geo-Redundant Storage (GRS).
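As a toy illustration of that write fan-out - the Replica class and store_block names here are made up purely to show the traffic pattern, not any real provider's API:

```python
# Toy sketch of geo-replicated writes: every user write lands on the local copy
# and is also forwarded to a replica in a second, physically separate facility,
# so each byte written crosses the inter-site link as well.

class Replica:
    """Stand-in for the storage at one site; names are purely illustrative."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store_block(self, key, data):
        self.blocks[key] = data

def replicated_write(key, data, local, remote):
    local.store_block(key, data)     # primary copy in the nearby data center
    remote.store_block(key, data)    # second copy, ideally 100+ miles away

site_a, site_b = Replica("site-A"), Replica("site-B")
replicated_write("user42/backup.tar", b"...", site_a, site_b)
```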
There's also the distinction between data loss and data unavailability; if the primary copy of the user's
data is unavailable (e.g. because the data center has a planned or unplanned outage), cloud
providers may give their customers the option of reading data from (or writing data to) the backup
copy. Customers can buy this kind of read access from Azure as an additional option
(Read-Access Geo-Redundant Storage, RA-GRS).
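A minimal sketch of the read-fallback behaviour this buys you - the names are illustrative, not Azure's actual API, and a missing key stands in for an outage:

```python
# Sketch of read access to the backup copy: serve reads from the primary site,
# and fall back to the secondary when the primary is unavailable.

primary = {"user42/backup.tar": b"..."}     # copy at the primary site
secondary = dict(primary)                   # geo-replicated copy at the second site

def read_with_fallback(key):
    try:
        return primary[key]                 # normal case: read the primary copy
    except KeyError:                        # primary unavailable (simulated)
        return secondary[key]               # outage: serve from the backup copy
```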
If you, as the cloud storage operator, rack up 3600 users, you'll need 900 computers with a total of 3600 TB
of storage in each of 2 sites.
You'll need about 1.4 Gbit/s of bandwidth at each site if you want to offer read redundancy (either site able to
serve the full user traffic), and roughly a third of that - about 460 Mbit/s - between sites to replicate writes.
You don't need customised hardware for this amount
of traffic, but you do need to buy the bandwidth to get the data to and from the user.
Azure quotes $120 for egressing 1TB of data; if we estimate that it actually costs about $100 per TB to provide, then each user will
cost you $200 in egress (reading their 1 TB twice), so you will have $720,000 of bandwidth cost.
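Putting the sizing and bandwidth numbers together under the same assumptions (including the guessed $100 per TB real egress cost):

```python
# Fleet size and bandwidth for 3600 users with 1 TB each, over a 2-year period.

users, tb_per_user, drive_tb = 3600, 1, 4

computers_per_site = users * tb_per_user // drive_tb       # 900 per site
traffic_tb_per_year = users * tb_per_user * 3 / 2          # write once + read twice per 2 years
site_gbit_s = traffic_tb_per_year * 8e12 / (365 * 86400) / 1e9   # ~1.4 Gbit/s user-facing
inter_site_gbit_s = site_gbit_s / 3                        # writes are a third of the traffic

egress_cost_per_tb = 100                 # assumed real cost, under Azure's $120 list price
egress_cost = users * 2 * tb_per_user * egress_cost_per_tb # each TB read twice: $720,000

print(f"{computers_per_site} computers/site, {site_gbit_s:.1f} Gbit/s/site, "
      f"{inter_site_gbit_s * 1000:.0f} Mbit/s between sites, ${egress_cost:,} egress")
```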
If the computers last about as long as the hard drives,
you'd expect them to cost you about $300 plus the storage ($160), say about $500 each once you take into account
rack and network switch hardware. I suspect power won't be too much of an issue since storage isn't a
CPU intensive operation and user access is intermittent - idling computers without a display consume about 20W.
So each site will cost you 900 x $500 over two years, and consume a steady 18 kW of power. Electricity costs about
$100 per MWh in the cheaper parts of the USA, so power will cost you about $2 per hour per site. Power is therefore only
about 7% of your equipment costs over a 2 year lifetime, and you end up paying about $1.7M in total in hardware, power and bandwidth to provide 1 TB of cloud storage each to 3600 users over 2 years. Each user pays $240 over that time, or about $860K in
total. So it would seem that $10 per month per TB is a massively losing proposition for the provider, even before we take into
account the human costs of designing, building and operating the system.
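The whole two-year profit-and-loss estimate in one place, using nothing but the figures above:

```python
# Two-year cost of serving 3600 users vs. what they pay, using the estimates above.

computers = 900 * 2                           # 900 per site, 2 sites
hardware = computers * 500                    # $500 each incl. drive, rack, switch share
power = 2 * 2 * 24 * 365 * 2                  # $2/hour/site x 2 sites x 2 years ~ $70K
egress = 720_000                              # from the bandwidth estimate above

total_cost = hardware + power + egress        # ~ $1.7M
revenue = 3600 * 10 * 24                      # $10/month x 24 months x 3600 users ~ $860K

print(f"cost ${total_cost:,}, revenue ${revenue:,}, shortfall ${total_cost - revenue:,}")
```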
The real picture is more nuanced. We implicitly assumed that every user would use all the storage (and bandwidth)
they paid for. In practice, they could conceivably be consuming only half of what they've paid for. As long as
we can dynamically provision for users (having a small amount of storage headroom, and adding on more computers
and drives as that headroom is threatened) we could get away with buying maybe 60% of the maximum hardware, so
instead of $1.7M our costs would be down to $1M or so. Still something of a loss.
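A sketch of what demand-based provisioning buys you; the 50% utilisation and 10% headroom figures are illustrative, and uneven growth across drives is what pushes the real requirement back up towards the 60% mentioned above:

```python
import math

# Buy drives for what users actually store (plus a little headroom), rather than
# for everything they have paid for, and add more as the headroom is eaten up.

def drives_needed(stored_tb, drive_tb=4, headroom=0.10):
    """Drives required to hold stored_tb with some spare capacity on top."""
    return math.ceil(stored_tb * (1 + headroom) / drive_tb)

paid_for_tb = 3600                      # 3600 users x 1 TB each
stored_tb = 0.5 * paid_for_tb           # suppose only half of that is really used

print(drives_needed(paid_for_tb))       # 990 drives if everyone filled their quota
print(drives_needed(stored_tb))         # 495 drives if you provision to actual demand
```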
I think the way cloud companies can make money - or at least avoid a loss - on this is to make use of the fact that the computers providing
access to the drives are seldom even slightly busy. Instead of buying $300 of low power computing hardware
to support each 4TB drive, just chain a few 4TB drives onto an existing computer that you're using for something
else (say, Bing search, Google maps, Amazon website serving). When a user starts to access their data,
temporarily reserve a core or two on the machines holding that data to serve it. That way you save nearly 60% of
your hardware costs and might just bring your operation into slight profit.
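A rough comparison of the two hardware strategies, using the per-unit estimates above (the $40 rack-and-switch share is just the implied remainder of the $500 figure):

```python
# Dedicated low-power computer per 4 TB drive vs. hanging the drive off a server
# that is already running some other workload (search, maps, web serving).

drives = 900 * 2                        # both sites
dedicated = drives * (300 + 160 + 40)   # computer + drive + rack/switch share = $500 each
piggyback = drives * (160 + 40)         # drive + rack/switch only; CPU already paid for

saving = 1 - piggyback / dedicated      # 60% of the hardware bill disappears
print(f"dedicated ${dedicated:,}, piggyback ${piggyback:,}, saving {saving:.0%}")
```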
You'll still need to pay the design and implementation costs for your system, not to mention the usual
marketing and business operations, but these don't scale in proportion to your number of users. The more
users you have, the better your business looks.
$10 per month per TB is a bit of a game changer. Suddenly storing in the cloud isn't massively more expensive
than storing on your hard drive. I wonder what the next couple of years will bring?