Amazon has been telling conferences, and anyone who cares to listen, that they can handle anything we throw at them. They have always been cagey, to say the least, about revealing any sort of scaling numbers, and when asked just how many instances they can spin up, they never give a straight answer. While this may make for great stage banter, sadly it doesn't cut much ice in the enterprise. We need to know the details.
One of our aw2.0 portfolio companies has been a long-term user of Amazon EC2, running a sizable 24x7 fleet of core instances, with additional instances going up and down as scale demands. Our monthly bill earns us the dubious honor of a first point of contact with an Amazon Account Manager (not that that has been much use). We've pushed the limits of many of their services and continue to do so.
After 3 years of production usage, what we can tell you is this: Amazon do have a breaking point.
In the beginning...
Having come from a traditional co-location setup, we took our services into the cloud to take advantage of fast, scalable servers and services while reducing our costs. That was fine for the first few years as our traffic grew, but we are now in the awkward position of evaluating whether or not to leave the cloud, or at least some of it (cloud computing is not the most cost-effective way of running an enterprise if the majority of your instances are running all the time).
In the early days we tried out a number of cloud providers. We managed to break some of them with our load even back then. Flexiscale was one of the casualties, unable to cope with the volume of network traffic we were pushing through. They had major problems with one of their network drivers, and while we loved their customer service, we couldn't hang around waiting on them. (Aside: Flexiscale has recently been reborn as Flexiant.)
Amazon in the early days was fantastic. Instances started up within a couple of minutes, they rarely had any problems, and even their Small instance was strong enough to power a moderately used MySQL database. For a good 20 months, all was well in the Amazon world, with really no need for concern or complaint.
Neighborhood isn't what it used to be
However, in the last 8 or so months, the chinks in their armour have begun to show. The first signs of weakness came from the performance of newly spun-up Amazon Small instances. According to our monitoring, the new machines in the server farm were underperforming compared to the original ones. At first we thought these freaks of nature just happened to be beside a "noisy neighbor". A quick termination and a new spin-up would usually, through the laws of randomness, land us in a quiet neighborhood where we could do what we needed.
Noisy neighbors are a problem inherent to the virtualized world. You do not get exclusive access to the underlying CPU; you share it with the other guests on the same physical host. If you happen to be on a node beside someone computationally heavy, compared to, say, a network-bound web server, then you will not get as much processing time as your neighbor. Sadly, you can't pick your neighbors with the vast majority of cloud vendors.
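One quick way to gauge whether you've landed beside a noisy neighbor is the CPU "steal" figure a Xen guest exposes (the `st` column in `top` and `vmstat`): time the hypervisor gave to other guests instead of you. Here's a minimal sketch in Python that samples it from /proc/stat; it assumes a Linux guest, and the sampling interval is our own choice:

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    values = [int(v) for v in fields]
    # Field 8 (0-indexed 7) is "steal": ticks spent running other guests.
    steal = values[7] if len(values) > 7 else 0
    return steal, sum(values)

def steal_percent(interval=1.0):
    # Sample twice and report the share of CPU time stolen by neighbors.
    s1, t1 = cpu_times()
    time.sleep(interval)
    s2, t2 = cpu_times()
    total = t2 - t1
    return 100.0 * (s2 - s1) / total if total else 0.0

if __name__ == "__main__":
    print("steal: %.1f%%" % steal_percent())
```

A sustained figure above a few percent is a reasonable hint that the node is oversubscribed and a terminate-and-respin is worth trying.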
As time went on and our load increased, the real usefulness of the Small instances disappeared, with us pretty much writing off any real production use of them. This is a shame, as many of our web servers are not CPU intensive, just I/O intensive. Moving up to the "High-CPU Medium Instance" as our base image has given us back some of that early-pioneer feeling that we are indeed getting the throughput we expect from an instance. We feel somewhat cheated here, as Amazon is forcing us onto a higher-priced instance just because they can't seem to cope with the volume of Small instances.
We haven't been running exclusively Small instances. Our databases enjoy some of the higher-specified instances and, by and large, they haven't given us any problems. Those high-end instances are obviously not oversubscribed and still maintain an edge of exclusivity.
However, in the last month or two, we've noticed that even these "High-CPU Medium Instances" have been suffering a similar fate to the Small instances: new instances coming up don't seem to be performing anywhere near what they should. After some investigation, we discovered a new problem that has crept into Amazon's world: internal network latency.
The commute is such a drag
The network distance between two machines drastically affects the performance of any service-oriented architecture. Whether it is simply connecting to the database or calling a web service, the time it takes for a packet of data to move over the network is a delay you have no choice but to swallow. In normal circumstances, a ping between two internal nodes within Amazon is around the 0.3ms level, with the odd ping reporting 7ms every 30 or so packets. Completely within operational parameters, and what you would expect of an internal network.
We have discovered, though, that when our instances appear to be dying, or at least shaky, this network latency jumps to a whopping 7241ms (yes, 7 seconds to move a packet around internally). Take a look at one of the ping traces we have been experiencing (we have many, many more):
[root@domU-12-31-xx-xx-xx-xx mf]# ping 10.222.111.11
PING 10.222.111.11 (10.222.111.11) 56(84) bytes of data.
64 bytes from 10.215.222.16: icmp_seq=2 ttl=61 time=473 ms
64 bytes from 10.222.111.11: icmp_seq=4 ttl=61 time=334 ms
64 bytes from 10.222.111.11: icmp_seq=5 ttl=61 time=0.488 ms
64 bytes from 10.222.111.11: icmp_seq=6 ttl=61 time=285 ms
64 bytes from 10.222.111.11: icmp_seq=7 ttl=61 time=0.577 ms
64 bytes from 10.222.111.11: icmp_seq=8 ttl=61 time=0.616 ms
64 bytes from 10.222.111.11: icmp_seq=9 ttl=61 time=0.794 ms
64 bytes from 10.222.111.11: icmp_seq=10 ttl=61 time=794 ms
64 bytes from 10.222.111.11: icmp_seq=11 ttl=61 time=0.762 ms
64 bytes from 10.222.111.11: icmp_seq=14 ttl=61 time=20.2 ms
64 bytes from 10.222.111.11: icmp_seq=16 ttl=61 time=0.563 ms
64 bytes from 10.222.111.11: icmp_seq=17 ttl=61 time=0.508 ms
64 bytes from 10.222.111.11: icmp_seq=19 ttl=61 time=706 ms
64 bytes from 10.222.111.11: icmp_seq=20 ttl=61 time=481 ms
64 bytes from 10.222.111.11: icmp_seq=22 ttl=61 time=0.868 ms
64 bytes from 10.222.111.11: icmp_seq=24 ttl=61 time=1350 ms
64 bytes from 10.222.111.11: icmp_seq=25 ttl=61 time=4183 ms
64 bytes from 10.222.111.11: icmp_seq=27 ttl=61 time=2203 ms
64 bytes from 10.222.111.11: icmp_seq=31 ttl=61 time=0.554 ms
64 bytes from 10.222.111.11: icmp_seq=32 ttl=61 time=678 ms
64 bytes from 10.222.111.11: icmp_seq=34 ttl=61 time=0.543 ms
64 bytes from 10.222.111.11: icmp_seq=35 ttl=61 time=25.6 ms
64 bytes from 10.222.111.11: icmp_seq=36 ttl=61 time=1955 ms
64 bytes from 10.222.111.11: icmp_seq=41 ttl=61 time=809 ms
64 bytes from 10.222.111.11: icmp_seq=43 ttl=61 time=2564 ms
64 bytes from 10.222.111.11: icmp_seq=44 ttl=61 time=7241 ms
As you can appreciate, this has considerable knock-on effects on the rest of our system. Everything grinds to a halt. Now, I do not believe for a moment that this is real network delay; more likely the virtual operating system is under extreme load and unable to process the network queue. This is evident from the fact that many of the pings never came back at all (note the gaps in the icmp_seq numbers).
In one particular "fire fighting" session, we spent an hour literally spinning up new instances and terminating them until we landed on a node that actually responded to our network traffic.
This is all too familiar to us - it is exactly how Flexiscale suffered under load. A look over the Amazon EC2 forums shows others complaining of network problems and unresponsive shell sessions.
Different road surfaces
Wire Turf recently wrote about the fact that not all Amazon instances are equal in terms of the underlying hardware, and that which processor you get allocated can make a huge difference to the performance of your running instance.
So not only should we check which CPU we are running on, we must now also take note of the network performance before we can safely push an instance into production.
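The CPU side of that check is trivial to automate at boot. A sketch, assuming a Linux guest where the hypervisor passes the physical CPU model through to /proc/cpuinfo; pair it with whatever network probe you trust before adding the instance to the pool:

```python
def cpu_model():
    # On an EC2 guest, /proc/cpuinfo reveals which physical CPU the
    # hypervisor scheduled this instance onto; two instances of the
    # same type can land on quite different hardware.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    print(cpu_model())
```

Log the model alongside your monitoring data and the slow-node pattern becomes visible across the fleet rather than anecdotal.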
This is not what cloud computing is all about.
Our faith in the almighty Amazon has become dented. I haven't yet told you about our experiences with EBS, their load balancing solution, and their Simple Queue Service. Those I will hold back for the Cloud Computing Expo Cloud Bootcamp I will be hosting this April in New York. Again, great services, just as long as you don't use them too much!
The road ahead
Anyone who uses virtualised computing, whether in the cloud or in their own private setup (VMware, for example), knows you take a performance hit. These hits can be considerable but, on the whole, are tolerable and can be built into an architecture from the start.
The problems we are starting to see from Amazon are more than just the overhead of a virtualized environment. They are deep-rooted scalability problems at their end that need to be addressed sooner rather than later.
Has Amazon become oversubscribed? It sure feels like it, as we are being "taxed", forced to move up their offering stack just to keep the level of performance we have been enjoying.
It appears that even Amazon have a limit to how far they can scale.