alanwilliamson
Amazon has been telling conferences, and anyone who cares to listen, that they can handle anything we throw at them. They have always been cagey, to say the least, about revealing any sort of scaling numbers, and when asked just how many instances they can spin up, they never really give a straight answer. While this may make for great stage banter, sadly it doesn't cut much ice in the enterprise. We need to know the details.
One of our aw2.0 portfolio companies has been a long-term user of Amazon EC2, running a sizable set of core instances 24x7, with a number of additional instances going up and down as scale demands. Our monthly bill gets us the dubious honor of a first point of contact with an Amazon Account Manager (not that that has been much use). We've pushed the limits of many of their services and continue to do so.

After 3 years of production usage, what we can tell you is this: Amazon do have a breaking point.
In the beginning...
Having come from a traditional co-location setup, we took our services into the cloud to take advantage of fast, scalable servers and services while reducing our costs. That was fine for the first few years as our traffic began to grow, but we are now in that awkward position of evaluating whether or not to leave the cloud, or at least some of it (cloud computing is not the most cost-effective way of running an enterprise if the majority of your instances are running all the time).
In the early days we tried out a number of cloud providers. We managed to break some of them with our load even back then. Flexiscale was one of the casualties, not coping with the volume of network traffic we were pushing through. They had major problems with one of their network drivers, and while we loved their customer service, we couldn't hang around waiting on them. (aside: Flexiscale has recently been reborn as Flexiant)
Amazon in the early days was fantastic. Instances started up within a couple of minutes, they rarely had any problems, and even their SMALL INSTANCE was strong enough to power a moderately used MySQL database. For a good 20 months, all was well in the Amazon world, with really no need for concern or complaint.
Neighborhood isn't what it used to be
However, in the last 8 or so months, the chinks in their armour have begun to show. The first signs of weakness came from the performance of newly spun-up Amazon SMALL instances. According to our monitoring, the newly spun-up machines in the server farm were underperforming compared to the original ones. At first we thought these freaks of nature just happened to be beside a "noisy neighbor". A quick termination and a new spin-up would usually, through the laws of randomness, land us in a quiet neighborhood where we could do what we needed.
Noisy neighbors are a problem that comes with the virtualized world. You do not get exclusive access to the underlying CPU; you have to share it with the other users of the virtualized host. If you happen to be on a node where someone is computationally very heavy compared to, say, a network-bound web server, then you will not get as much processing time as your neighbor. Sadly, you can't pick your neighbors with the vast majority of cloud vendors.
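One way to put a number on a noisy neighbor, on a Linux guest, is the "steal" counter in /proc/stat, which records CPU time the hypervisor gave to someone else while the guest wanted to run. The sketch below is an illustration only and not the monitoring described in this post; the function names are hypothetical, and it assumes the guest kernel actually reports steal time.

```python
def read_cpu_times(stat_path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(stat_path) as f:
        fields = f.readline().split()
    if not fields or fields[0] != "cpu":
        raise ValueError("unexpected /proc/stat format")
    return [int(v) for v in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) < 8:
        # Older kernels do not expose a steal column at all.
        return 0.0
    # Column order: user nice system idle iowait irq softirq steal ...
    # Trailing guest columns are already counted inside user/nice, so the
    # total is taken over the first eight columns only.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

Taking two samples a few seconds apart with read_cpu_times() and passing them to steal_percent() gives a rough figure; a persistently high value on a freshly started instance would be one hint that it has landed beside a busy neighbor.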
As time went on and our load increased, the real usefulness of the SMALL instances soon disappeared, with us pretty much writing off any real production use of them. This is a shame, as many of our web servers are not CPU intensive, just I/O intensive. Moving up to the "High-CPU Medium Instance" as our base image has given us back some of that early-pioneer feeling that we are indeed getting the throughput we expect from an instance. We feel somewhat cheated here, as Amazon is forcing us to go to a higher-priced instance just because they can't seem to cope with the volume of Small instances.
We haven't been running exclusively Small instances. Our databases enjoy some of the higher-specified instances and, by and large, they haven't given us any problems. Those higher instances are obviously not oversubscribed and still maintain an edge of exclusivity.
However, in the last month or two, we've noticed that even these "High-CPU Medium Instance" machines have been suffering a similar fate to the Small instances, in that new instances coming up don't seem to be performing anywhere near what they should be. After some investigation, we discovered a new problem that has crept into Amazon's world: Internal Network Latency.
The commute is such a drag
The time it takes for traffic to travel between two machines drastically affects the performance of any service-orientated architecture. Whether it is simply connecting to the database or calling a web service, the time it takes for a packet of data to move over the network is a delay you have no choice but to swallow. In normal circumstances, a ping between two internal nodes within Amazon is around the 0.3ms level, with the odd ping reporting a whopping 7ms every 30 or so packets. That is completely within operational parameters and what you would expect within an internal network.
We have discovered though, that when our instances appear to be dying or at least shaky, then this network latency jumps up to a whopping 7241ms (yes, 7 seconds to move a packet around internally). Take a look at one of the ping traces we have been experiencing (we have many many more):
[root@domU-12-31-xx-xx-xx-xx mf]# ping 10.222.111.11 PING 10.222.111.11 (10.222.111.11) 56(84) bytes of data. 64 bytes from 10.215.222.16: icmp_seq=2 ttl=61 time=473 ms 64 bytes from 10.222.111.11: icmp_seq=4 ttl=61 time=334 ms 64 bytes from 10.222.111.11: icmp_seq=5 ttl=61 time=0.488 ms 64 bytes from 10.222.111.11: icmp_seq=6 ttl=61 time=285 ms 64 bytes from 10.222.111.11: icmp_seq=7 ttl=61 time=0.577 ms 64 bytes from 10.222.111.11: icmp_seq=8 ttl=61 time=0.616 ms 64 bytes from 10.222.111.11: icmp_seq=9 ttl=61 time=0.794 ms 64 bytes from 10.222.111.11: icmp_seq=10 ttl=61 time=794 ms 64 bytes from 10.222.111.11: icmp_seq=11 ttl=61 time=0.762 ms 64 bytes from 10.222.111.11: icmp_seq=14 ttl=61 time=20.2 ms 64 bytes from 10.222.111.11: icmp_seq=16 ttl=61 time=0.563 ms 64 bytes from 10.222.111.11: icmp_seq=17 ttl=61 time=0.508 ms 64 bytes from 10.222.111.11: icmp_seq=19 ttl=61 time=706 ms 64 bytes from 10.222.111.11: icmp_seq=20 ttl=61 time=481 ms 64 bytes from 10.222.111.11: icmp_seq=22 ttl=61 time=0.868 ms 64 bytes from 10.222.111.11: icmp_seq=24 ttl=61 time=1350 ms 64 bytes from 10.222.111.11: icmp_seq=25 ttl=61 time=4183 ms 64 bytes from 10.222.111.11: icmp_seq=27 ttl=61 time=2203 ms 64 bytes from 10.222.111.11: icmp_seq=31 ttl=61 time=0.554 ms 64 bytes from 10.222.111.11: icmp_seq=32 ttl=61 time=678 ms 64 bytes from 10.222.111.11: icmp_seq=34 ttl=61 time=0.543 ms 64 bytes from 10.222.111.11: icmp_seq=35 ttl=61 time=25.6 ms 64 bytes from 10.222.111.11: icmp_seq=36 ttl=61 time=1955 ms 64 bytes from 10.222.111.11: icmp_seq=41 ttl=61 time=809 ms 64 bytes from 10.222.111.11: icmp_seq=43 ttl=61 time=2564 ms 64 bytes from 10.222.111.11: icmp_seq=44 ttl=61 time=7241 ms
As you can appreciate, this has some considerable knock-on effects on the rest of our system. Everything grinds to a halt. Now, I do not believe for a moment that this is the real network delay; more likely the virtualized host is under extreme load and unable to process the network queue. This is evident from the fact that many of the pings never came back at all (note the gaps in the icmp_seq numbers).
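A trace like this is easier to compare across instances once it is reduced to a few numbers. The following is a minimal sketch, assuming standard Linux ping output; the function name and the 100 ms "slow" threshold are illustrative choices and not something we run in production.

```python
import re
import statistics

# Matches reply lines such as:
#   64 bytes from 10.222.111.11: icmp_seq=5 ttl=61 time=0.488 ms
REPLY = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def summarize_ping(output, slow_ms=100.0):
    """Summarize loss and latency from Linux ping output.

    slow_ms is an arbitrary illustrative threshold.
    """
    replies = [(int(seq), float(ms)) for seq, ms in REPLY.findall(output)]
    if not replies:
        raise ValueError("no ping replies found")
    seqs = [seq for seq, _ in replies]
    times = [ms for _, ms in replies]
    # Sequence numbers between the first and last reply that never came back.
    expected = range(min(seqs), max(seqs) + 1)
    missing = sorted(set(expected) - set(seqs))
    return {
        "received": len(replies),
        "missing": missing,
        "loss_pct": 100.0 * len(missing) / len(expected),
        "median_ms": statistics.median(times),
        "max_ms": max(times),
        "slow": sum(1 for ms in times if ms > slow_ms),
    }
```

Feeding it the captured text of a ping run returns the missing sequence numbers alongside the median and worst-case round trip, so loss and latency spikes show up together. Loss is only counted between the first and last reply seen, so packets dropped before the first reply are not included.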
In one particular "fire fighting mode", we spent an hour literally spinning up new instances and terminating them until we found ourselves on a node that actually responded to our network traffic.
This is all too familiar to us - this is exactly how Flexiscale suffered under load. A look over the Amazon EC2 forums and you'll see others complaining of network problems and unresponsive shell sessions.
Different road surfaces
Wire Turf recently wrote about the fact that not all Amazon instances are equal in terms of the underlying hardware, and that which processor you get allocated can make a huge difference to the performance of your running instance.
So not only should we check for the CPU we are running on, we now must also take note of the network performance before we can safely push an instance into production.
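A pre-flight check along these lines could be scripted. The sketch below is an assumption-laden illustration, not our actual tooling: it assumes a Linux guest, a peer with an open TCP port (a database or web server, say), and threshold values picked purely for the example. It times TCP handshakes rather than ICMP pings, so it exercises the same path real application traffic takes.

```python
import socket
import statistics
import time


def cpu_model(cpuinfo_path="/proc/cpuinfo"):
    """Return the first 'model name' entry from a Linux cpuinfo file, or None."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.lower().startswith("model name"):
                return line.split(":", 1)[1].strip()
    return None


def tcp_connect_times(host, port, samples=20, timeout=2.0):
    """Time TCP handshakes to host:port in ms; failed attempts become None."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            results.append(None)
    return results


def looks_healthy(times_ms, max_median_ms=5.0, max_failures=0):
    """Apply illustrative thresholds to a list of connect times."""
    ok = [t for t in times_ms if t is not None]
    failures = len(times_ms) - len(ok)
    if not ok or failures > max_failures:
        return False
    return statistics.median(ok) <= max_median_ms
```

A new instance would call cpu_model() and tcp_connect_times() against a known-good internal peer at boot, and only register itself for production traffic if looks_healthy() returns True; otherwise it is a candidate for termination and a fresh spin-up.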
This is not what cloud computing is all about.
Our faith in the almighty Amazon has been dented. I haven't yet told you about our experiences with EBS, their Load Balancing solution and their Simple Queue Service. Those I will hold back for the Cloud Computing Expo Cloud Bootcamp I will be hosting this April in New York. Again, great services just as long as you don't use them too much!
The road ahead
Anyone who uses virtualised computing, whether it is in the cloud or in their own private setup (VMware, for example), knows you take a performance hit. These performance hits can be considerable but, on the whole, are tolerable and can be built into an architecture from the start.
The problems that we are starting to see from Amazon are more than just the overhead of a virtualized environment. They are deep-rooted scalability problems at their end that need to be addressed sooner rather than later.
Has Amazon become over subscribed? Sure feels like it, as we are being "taxed" by being forced to move up their offering stack to just get the same level of performance we are currently enjoying.
It appears that even Amazon have a limit to what they can scale to.
Article Details
- Published: 9:01 AM GMT, Tuesday, 12 January 2010
- Categories: Technical
- Tags: cloud computing amazon ec2 amazon
- Comments: 22
Have instances been taking longer to start up in your observation? The reason I ask is that if the network performance is so bad, then AMIs stored in S3 should take much longer to be retrieved, and the time from the moment you "hit" the button to when the instance is "available" should be considerably longer than before. Has that been your experience?
Regarding whether this is mostly a reporting problem due to ICMP ping traffic being treated with low priority, I can assure you the problem is real. I ran a Counter Strike server on AWS and in mid-December had a ping of around 50. Around New Year, which is when they start showing problems, I was getting pings of around 500 ms, and the game server had severe lag while playing. That is the reason I came across this article, because I wanted to find out why my server was performing so poorly. I hope they can get this permanently resolved.
Looks like what you needed was a burstable cloud solution where you could utilize the cloud in peak times but mainly relied on a dedicated server. This would allow you to get the cost advantages of a dedicated server with the scalability opportunities of the cloud. Carpathia Hosting has an offer called AlwaysOn/InstantOn that was made for just that problem, might want to check it out.
That explains why I had to switch from an m1.small to a c1.medium instance 6 months ago. I've also experienced some pretty weird behaviour where my instances stay running but lose all network access and, like Alan, I've seen people posting similar issues on the EC2 forums. I even subscribed to premium support for two months to try and resolve it but got nowhere.
I came across the link to this blog post on the Windows Azure forums. Some posts have ominously started to appear on that forum along the same lines as the issues Alan is describing here (search for "Variability of app performance"). The Windows Azure model is good, no OS/IIS concerns etc., but Microsoft still have a long way to go to make it even as good as Amazon's EC2. It currently takes 10 minutes to deploy or upgrade an application through the Windows Azure web page (that's not an exaggeration): make a spelling mistake in an HTML file, forget a stylesheet, 10 minutes; find another spelling mistake, another 10 minutes. In theory cloud infrastructures can solve the difficult issues of scalability and reliability (using failover), but lately I'm starting to think back to the dedicated hosting providers and the absence of these time- and resource-draining cloud OS issues. The irony is I'm now spending more on my EC2 instances to cope with reliability and performance issues than I was previously for a dedicated host. Sigh. Where is that magic cloud out there that just works?
Does this come back to good old-fashioned support and communication? The IT guys at Amazon are pretty talented technically, but if they don't have a culture of supporting people you don't get any straight answers. Every platform will have issues at one point or another; what we need to know is whether this is the new normal or whether it's going to be fixed. Move to Rackspace?
Hi Alan,
Thank you for sharing such detailed information. I have two questions about this post: 1) What is your conclusion? Is AWS a wise choice yet? Or do you think it's better to keep our business on dedicated servers until AWS gets more mature and reliable? 2) What amount of traffic (page views) are you talking about? I'm asking because a not-so-busy server possibly wouldn't have had such a problem, as you hadn't for 20 months. Thank you, Ronan
Man, I was just about to sign up with EC2 for my next project too. Will you do a followup in a few months to let us know if they've settled down any?
Two relevant issues you avoided addressing:
- Do you see differences in performance for the pre-paid vs on-demand instances (small, medium, high, etc)? If not then this is important info. My understanding is that with on-demand instances you are not guaranteed getting anything started, hence the introduction of the pre-paid/purchased instances.
- Did/does this performance correlate with the spot price? If it is not positively correlated, then your overcrowding hypothesis is set back (but not disproved), unless you argue spot prices move lower because demand is higher. If it is positively correlated, you could use this as a predictor of when you are likely to have 'issues'.
To the people saying Amazon should move to VMware: Xen is not the problem here. Amazon has probably miscalculated the number of instances that people are going to use to max capacity. If everyone just ran a low-volume web or email server then everything would be fine. But if everybody thrashes their instances then there just aren't going to be enough cycles, bandwidth, IOPS, etc. to go around.
They should start deploying VMware for the cloud service to overcome these performance issues.
To those having issues with EC2, spin up another services provider, such as one of the several vCloud Express operators. They're VMware based, not Xen, and have the support and stability that comes with that. Plus with VMotion, the providers are able to move heavy VMs to other pods to even up the load. Just google "vCloud Express" to find one.
This is quite disturbing, as cloud computing is "gaining" in its popularity, with Amazon the leader in cloud servers. What is the lag between Amazon's datacenters, and what has been the response from their support in these cases? EC2 suits most people I know.
@Nikolai Pings are treated with lowest priority by switches, routers, and servers. If there is no reply then one or some of these are overloaded.
From what I understand, the hypervisor is good at dividing CPU resources by handling VMs as separate processes (operating systems and processors are designed to do this). Operating systems and server motherboards are not designed to do the same for disk and network IO. In my career the project requirements for reliability and performance were far more important than maximizing hardware utilization. My company offers a server cloud that delivers dedicated servers instead of virtual instances on shared servers. Newservers.com Bare Metal Cloud gives you the ability to create EC2-style elastic computing without sharing servers.
We had the same problems with AWS some months ago. Back then everything came down to a badly configured route. Today we are using AWS high-CPU instances heavily for video transcoding and we can't measure any lack of CPU power. But I must agree that Amazon's community support and transparency are not good if you are a standard customer. One last comment on your ping output: ICMP is really not a valid way to measure network delays, because most routers handle these requests with very low priority. You should measure the time to deliver a small static file or the time needed for a simple MySQL query. Thanks a lot anyway for this very detailed article. We'll surely have a deep look into our servers to see if AWS services are getting worse.
Jack, Xen also supports pinning the machine to arbitrary number of CPUs, but I don't know if that is done or not by amazon. In my experience though CPU is rarely the bottleneck.
I'm not really surprised by this at all. This is still shared hosting that we are talking about here, multiple VMs running on the same hardware. Now admittedly having them managed via Xen is much more advanced than just having accounts on the same server, but it's still on shared hardware. It's unlikely that you'll ever be able to predict exactly what kind of performance you'll get when you start a VM. There are too many other variables at play.
If you have two dedicated servers using the exact same hardware you can be reasonably confident that you'll get the same performance characteristics out of both. With shared hosting that just isn't likely to be the case.
I don't think oversubscription is the problem, since surely they haven't upped the number of running instances per machine. I think it's likely that there are more "big players" using EC2. Say a quarter of instances are run by big players; they are quite likely heavily using network IO. So it's likely you end up on a box where a big player is.
One interesting thing about Windows Azure is that you get dedicated CPU (hardware is fixed partitioned), so the noisy neighbourhood problem might not be so bad?
@Adam: Yes, it has to be said, the fact that you have to pay extra for support irks us. Their forums are really not that helpful, as they are full of people "guessing" at what the problem may be. Very rarely do you get real Amazon input. I have posted these problems on the forum many times, complete with ping traces, but no one from Amazon is acknowledging them.
@Janitha: We see them every so often, but when they do start, they seem to stay for at least 20 to 60 minutes. Completely unacceptable. Looking at our monitoring reports for the last night, we see some of our nodes having the same issues. We are experiencing this within their Virginia data center, and all the pings are to the same location.
I'd be interested to know if the same things happen with Microsoft's offering.
You leave out one of the biggest problems with AWS, the super lack of documentation and well maintained community sites. This is really where much of the problem lies. If people could more effectively communicate, it would make everybody's life easier on there.
Wow, those are some shocking latency times between internal nodes. Do you always see these kinds of delays? I've never seen anything like that within AWS. Were the two machines you were pinging located in the same availability zone or location?