Amazon Black Thursday (and some of Friday)

What an exciting week in the cloud world this turned out to be. All those anti-cloud people loved the fact that Amazon had a problem with some of their services. The problem with all the media, Twitter and blog buzz was that 99% of it was completely wrong.

Even our dear old BBC couldn't resist claiming that "Amazons web hosting" (eh?) had gone off-line.

No, it most definitely had not. I can proudly say that our 11-billion-requests-per-month platform, running in Amazon East (Virginia), was quite happily working away, servicing each of those requests with no errors at all.

The problem wasn't Amazon EC2, but their Elastic Block Store (EBS) service. This is their "USB for the cloud" service that lets you attach network storage volumes to servers, then clone, snapshot and move them around.

Amazon didn't fail web sites; people just got lazy

One of the first things I teach in our cloud bootcamp is to not get seduced by all the toys in the playground and instead focus on the problem at hand. Cloud solutions do not magically remove your problems; they simply give you a whole new set.

The secret of cloud architecture is to build with failure in mind. If you build assuming that everything will work, then you are just asking for trouble. For example, can your application code cope with the fact that the database may disappear for a few seconds and re-appear on another IP address? Does your application need restarting? Configuration files updated?
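To make the disappearing-database question concrete, here is a minimal sketch of one way to cope, not a description of how our platform does it. It assumes the application reaches the database through a hypothetical connect() callable that opens a fresh connection by hostname each time (so a new IP address is picked up), and that a lost connection surfaces as an OSError. The names call_with_retry, connect and operation are made up for illustration.

```python
import time


def call_with_retry(connect, operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(conn), opening a fresh connection on every attempt.

    connect   -- hypothetical callable returning a new connection; it should
                 look the host up by name so a moved database is found again
    operation -- callable that takes the connection and does the real work
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()          # never reuse a possibly dead connection
            return operation(conn)
        except OSError as exc:        # ConnectionError is a subclass of OSError
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

With something like this in place, a database that vanishes for a few seconds costs a short delay rather than an application restart or a hand-edited configuration file. The retry count and delays are arbitrary and would need tuning for a real workload.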

In our experience, the majority of people do not build for the cloud; instead they see it as an inexpensive way of removing the cost of buying and maintaining their own physical racks of hardware. They do not see the inherent danger of a world where servers can be spun up in a couple of minutes and, more importantly, be removed in seconds.

You still have work to do

Amazon has quite the catalogue of services on offer and at first glance you wonder if you really do need to do that much work after all.

As it happens, we try to use as few Amazon services as possible, mainly so that we don't get locked into a single vendor (we also use Rackspace in production, which is not without its own problems). We have also found that many of their services can't cope with the volume of traffic we are pushing. EBS has proved far too slow for anything serious, Elastic Load Balancer falls over beyond a certain level of traffic, and Amazon's much-promised "MySQL Cloud", in the form of RDS, has limitations that make it impractical for any serious enterprise offering: it relies on EBS, which hurts its performance, and you can't tune MySQL specifically.

Any website complaining of downtime has only itself to blame. They got lazy. Amazon has to shoulder some of the blame, but the reality is that you have to plan for the fact that Amazon could go dark at any time.

We consider ourselves Amazon veterans and we know only too well the vulnerabilities of their service. We have instances dying on us all the time, with the dreaded Amazon-Death-Email following some hours behind it " of your instances had to go..". We have never been seduced into thinking that as soon as we start using a particular service, all our problems will magically disappear. Even their grand-daddy S3 service blips in and out.

It was a good week for cloud

This week it was Amazon EBS, next week it could be something/someone completely different. You can't plan for every eventuality, but you can get darn close. The more you lock yourself into a given vendor's services, the more risk you put your company at.

This is particularly true of services like Google's App Engine, Salesforce and Microsoft Azure. If you are relying on these, then you really are betting the farm on their service, as there is very little choice in terms of backup.

This week was good for the cloud ecosystem. It showed that Amazon, one of the biggest players in the industry, does indeed have limits. Things can and will break. People have relied too much on Amazon "keeping the lights on", handling everything that normal bare-metal data centers have had to deal with for years. Moving to the cloud does not remove the responsibility.

I hope this week has inspired many a technical meeting at various companies and teams to discuss their cloud strategy. I imagine that Rackspace/GoGrid got a significant sign-up boost this week as a result of Amazon dropping their EBS service. However, running screaming to another vendor is not going to help your situation in the long run.

Plan for failure. Build for failure. Test for failure.

There is a silver lining in this cloud

Amazon is still one of the most flexible, powerful and cost-effective cloud solutions available today. This week has not dented our faith in Amazon in the slightest. It illustrated that they are human and are not this mythical data center with infinite CPU, disk space and network that some believe they are.

Just try not to have all your eggs in the one basket. But if you do, make sure you have enough eggs in case one does get broken.

