alanwilliamson

Has Amazon EC2 become oversubscribed?

Amazon has been telling conferences, and anyone who cares to listen, that they can handle anything we throw at them. They have always been cagey, to say the least, about revealing any sort of scaling numbers, and when asked just how many instances they can spin up, they never really give a straight answer. While this may make for great stage banter, sadly it doesn't cut much ice in the enterprise. We need to know the details.

One of our aw2.0 portfolio companies has been a long-term user of Amazon EC2, running a sizable set of core instances 24x7, with a number of additional instances going up and down as demand dictates. Our monthly bill gets us the dubious honor of a first point of contact with an Amazon Account Manager (not that that has been much use). We've pushed the limits of many of their services and continue to do so.

After 3 years of production usage, what we can tell you is this: Amazon do have a breaking point.

In the beginning...

Having come from a traditional co-location setup, we took our services into the cloud to take advantage of fast, scalable servers and services while reducing our costs. That was fine for the first few years as our traffic began to grow, but we are now in the awkward position of evaluating whether or not to leave the cloud, or at least some of it (cloud computing is not the most cost-effective way of running an enterprise if the majority of your instances are running all the time).

In the early days we tried out a number of cloud providers. We managed to break some of them with our load even back then. Flexiscale was one of the casualties, not coping with the volume of network traffic we were pushing through. They had major problems with one of their network drivers, and while we loved their customer service, we couldn't hang around waiting on them. (aside: Flexiscale has been recently reborn as Flexiant)

Amazon in the early days was fantastic. Instances started up within a couple of minutes, they rarely had any problems and even their Small instance was strong enough to power a moderately used MySQL database. For a good 20 months, all was well in the Amazon world, with really no need for concern or complaint.


Neighborhood isn't what it used to be

However, in the last 8 or so months, the chinks in their armour have begun to show. The first signs of weakness came from the performance of newly spun up Amazon Small instances. According to our monitoring, the newly spun up machines in the server farm were underperforming compared to the original ones. At first we thought these freaks of nature just happened to be beside a "noisy neighbor". A quick termination and a new spin up would usually, through the laws of randomness, land us in a quiet neighborhood where we could do what we needed.

Noisy neighbors are a problem particular to the virtualized world. You do not get exclusive access to the underlying CPU; you have to share it with the other users of the same physical host. If you happen to be on a node where someone is computationally very heavy, compared to, say, a network-bound web server, then you will not get as much processing time as your neighbor. Sadly, you can't pick your neighbors with the vast majority of cloud vendors.
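One hedged way to look for a noisy neighbor from inside a Linux guest is the "steal" counter in /proc/stat, the eighth value on the aggregate cpu line, which counts time the hypervisor spent running something else while this guest wanted the CPU. The sketch below is only an illustration: the function names are made up for this example, and whether a given instance's kernel reports steal time reliably is an assumption.

```python
import time


def parse_cpu_line(stat_text: str) -> list:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before: list, after: list) -> float:
    """Percentage of CPU time stolen between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) < 8:
        return 0.0  # kernel too old to report steal time
    # Only the first eight fields are summed: guest time is already
    # counted inside user time, so adding it would double count.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval_s: float = 5.0) -> float:
    """Read /proc/stat twice, interval_s apart, and return steal %."""
    with open("/proc/stat") as handle:
        before = parse_cpu_line(handle.read())
    time.sleep(interval_s)
    with open("/proc/stat") as handle:
        after = parse_cpu_line(handle.read())
    return steal_percent(before, after)
```

Calling sample_steal() on a fresh instance and comparing it with the value from a known-good one gives a rough signal; what counts as "too high" is a judgment call, not something this sketch decides.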

As time went on and our load increased, the real usefulness of the Small instances soon disappeared, and we pretty much wrote off any real production use of them. This is a shame, as many of our web servers are not CPU intensive, just I/O intensive. Moving up to the "High-CPU Medium Instance" as our base image has given us back some of that early-pioneer feeling that we are indeed getting the throughput we expect from an instance. We feel somewhat cheated here, as Amazon is forcing us onto a higher priced instance just because they can't seem to cope with the volume of Small instances.

We haven't been running Small instances exclusively. Our databases enjoy some of the higher specified instances and, by and large, they haven't given us any problems. Those higher-end instances are obviously not oversubscribed and still maintain an edge of exclusivity.

However, in the last month or two, we've noticed that even these "High-CPU Medium" instances have been suffering a similar fate to the Small instances, in that new instances coming up don't seem to be performing anywhere near what they should be. After some investigation, we discovered a new problem that has crept into Amazon's world: Internal Network Latency.

The commute is such a drag

The time it takes a packet to travel between two machines drastically affects the performance of any service-oriented architecture. Whether it is simply connecting to the database, or calling a webservice, the time it takes for a packet of data to move over the network is a delay you have no choice but to swallow. In normal circumstances, a ping between two internal nodes within Amazon is around the 0.3ms level, with the odd ping reporting a whopping 7ms every 30 or so packets. Completely within operational parameters and what you would expect within an internal network.
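To put rough numbers on why this matters: the network delay is paid once per round trip, so a request that makes many sequential calls multiplies it. The sketch below is back-of-envelope arithmetic only. The figure of 50 round trips per request and the 500 ms degraded case are hypothetical; the 0.3 ms and 7 ms values are the ping times quoted above.

```python
def request_time_ms(round_trips: int, latency_ms: float, work_ms: float = 0.0) -> float:
    """Time spent on one request that makes sequential network calls.

    Each call waits one full round trip; work_ms is everything else.
    """
    return round_trips * latency_ms + work_ms


# Hypothetical: a page that issues 50 sequential database/webservice calls.
HYPOTHETICAL_ROUND_TRIPS = 50

for label, latency_ms in (
    ("normal ping", 0.3),
    ("odd slow ping", 7.0),
    ("hypothetical degraded network", 500.0),
):
    total = request_time_ms(HYPOTHETICAL_ROUND_TRIPS, latency_ms)
    print(f"{label}: {total:.1f} ms waiting on the network per request")
```

The point of the exercise is the multiplier: a latency that looks harmless for a single packet becomes the dominant cost once a request depends on dozens of them in sequence.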

We have discovered though, that when our instances appear to be dying or at least shaky, then this network latency jumps up to a whopping 7241ms (yes, 7 seconds to move a packet around internally). Take a look at one of the ping traces we have been experiencing (we have many many more):

[root@domU-12-31-xx-xx-xx-xx mf]# ping 10.222.111.11
PING 10.222.111.11 (10.222.111.11) 56(84) bytes of data.
64 bytes from 10.215.222.16: icmp_seq=2 ttl=61 time=473 ms
64 bytes from 10.222.111.11: icmp_seq=4 ttl=61 time=334 ms
64 bytes from 10.222.111.11: icmp_seq=5 ttl=61 time=0.488 ms
64 bytes from 10.222.111.11: icmp_seq=6 ttl=61 time=285 ms
64 bytes from 10.222.111.11: icmp_seq=7 ttl=61 time=0.577 ms
64 bytes from 10.222.111.11: icmp_seq=8 ttl=61 time=0.616 ms
64 bytes from 10.222.111.11: icmp_seq=9 ttl=61 time=0.794 ms
64 bytes from 10.222.111.11: icmp_seq=10 ttl=61 time=794 ms
64 bytes from 10.222.111.11: icmp_seq=11 ttl=61 time=0.762 ms
64 bytes from 10.222.111.11: icmp_seq=14 ttl=61 time=20.2 ms
64 bytes from 10.222.111.11: icmp_seq=16 ttl=61 time=0.563 ms
64 bytes from 10.222.111.11: icmp_seq=17 ttl=61 time=0.508 ms
64 bytes from 10.222.111.11: icmp_seq=19 ttl=61 time=706 ms
64 bytes from 10.222.111.11: icmp_seq=20 ttl=61 time=481 ms
64 bytes from 10.222.111.11: icmp_seq=22 ttl=61 time=0.868 ms
64 bytes from 10.222.111.11: icmp_seq=24 ttl=61 time=1350 ms
64 bytes from 10.222.111.11: icmp_seq=25 ttl=61 time=4183 ms
64 bytes from 10.222.111.11: icmp_seq=27 ttl=61 time=2203 ms
64 bytes from 10.222.111.11: icmp_seq=31 ttl=61 time=0.554 ms
64 bytes from 10.222.111.11: icmp_seq=32 ttl=61 time=678 ms
64 bytes from 10.222.111.11: icmp_seq=34 ttl=61 time=0.543 ms
64 bytes from 10.222.111.11: icmp_seq=35 ttl=61 time=25.6 ms
64 bytes from 10.222.111.11: icmp_seq=36 ttl=61 time=1955 ms
64 bytes from 10.222.111.11: icmp_seq=41 ttl=61 time=809 ms
64 bytes from 10.222.111.11: icmp_seq=43 ttl=61 time=2564 ms
64 bytes from 10.222.111.11: icmp_seq=44 ttl=61 time=7241 ms 

As you can appreciate, this has some considerable knock-on effects on the rest of our system. Everything grinds to a halt. Now, I do not believe for a moment that this is the real network delay; more likely the virtualization host is under extreme load and unable to process the network queue. This is evident from the fact that many of the pings never came back at all (note the gaps in the icmp_seq numbers).
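A trace like this is easier to compare across instances once it is reduced to a few numbers. The following is a minimal sketch, assuming ping output in the Linux format shown above; the function name and the 100 ms "slow" threshold are illustrative choices, not part of any tool we use. It treats gaps in the icmp_seq numbers as lost packets, so anything lost before the first reply or after the last one is not counted.

```python
import re
import statistics

# Matches reply lines such as:
# 64 bytes from 10.222.111.11: icmp_seq=4 ttl=61 time=334 ms
PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def summarize_ping(output: str, slow_ms: float = 100.0) -> dict:
    """Summarize loss and latency from captured ping output.

    slow_ms is an arbitrary threshold chosen for this example.
    """
    replies = {}
    for line in output.splitlines():
        match = PING_LINE.search(line)
        if match:
            replies[int(match.group(1))] = float(match.group(2))
    if not replies:
        raise ValueError("no ping replies found")

    first, last = min(replies), max(replies)
    # Sequence numbers with no reply between the first and last seen.
    missing = [seq for seq in range(first, last + 1) if seq not in replies]
    times = list(replies.values())
    return {
        "received": len(replies),
        "missing": missing,
        "loss_pct": 100.0 * len(missing) / (last - first + 1),
        "median_ms": statistics.median(times),
        "max_ms": max(times),
        "slow": sum(1 for t in times if t > slow_ms),
    }
```

Saving the output of ping to a file and passing its contents to summarize_ping gives a quick per-instance summary; the median is used rather than the mean so that a handful of multi-second replies does not hide what the typical packet experienced.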

In one particular "fire fighting mode", we spent an hour literally spinning up new instances and terminating them until we found ourselves on a node that actually responded to our network traffic.

This is all too familiar to us - this is exactly how Flexiscale suffered under load. A look over the Amazon EC2 forums and you'll see others complaining of network problems and unresponsive shell sessions.

Different road surfaces

Wire Turf recently wrote about the fact that not all Amazon instances are equal in terms of the underlying hardware, and that which processor you get allocated can make a huge difference to the performance of your running instance.

So not only should we check for the CPU we are running on, we now must also take note of the network performance before we can safely push an instance into production.
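Such a check before an instance goes into production could look something like the sketch below. It is a minimal illustration under stated assumptions: the function names, the default thresholds and the idea of timing TCP connects to a known peer (a database host, say) are all choices made for this example. TCP connect time is used instead of ICMP because it exercises the same path real traffic takes.

```python
import socket
import statistics
import time


def cpu_model(cpuinfo_text: str) -> str:
    """Extract the first 'model name' entry from /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("model name"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def connect_times_ms(host: str, port: int, samples: int = 20, timeout_s: float = 2.0) -> list:
    """Time TCP connects to a peer; failed attempts are recorded as None."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:  # includes timeouts and refused connections
            results.append(None)
    return results


def fit_for_production(times_ms: list, max_median_ms: float = 5.0,
                       max_worst_ms: float = 100.0, max_failures: int = 0) -> bool:
    """Decide whether the measured latencies are acceptable.

    The default thresholds are placeholders, not recommendations.
    """
    ok = [t for t in times_ms if t is not None]
    failures = len(times_ms) - len(ok)
    if not ok or failures > max_failures:
        return False
    return statistics.median(ok) <= max_median_ms and max(ok) <= max_worst_ms
```

In use, a start-up script would read /proc/cpuinfo, pass its contents to cpu_model, run connect_times_ms against an internal host and port, and terminate the instance if fit_for_production returns False. Which CPU models to accept is left out deliberately, since that depends on the workload.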

This is not what cloud computing is all about.

Our faith in the almighty Amazon has become dented. I haven't yet told you about our experiences with EBS, their Load Balancing solution and their Simple Queue Service. Those I will hold back for the Cloud Bootcamp at Cloud Computing Expo, which I will be hosting this April in New York. Again, great services just as long as you don't use them too much!

The road ahead

Anyone who uses virtualised computing, whether in the cloud or in their own private setup (VMware, for example), knows you take a performance hit. These performance hits can be considerable but, on the whole, are tolerable and can be built into an architecture from the start.

The problems that we are starting to see from Amazon are more than just the overhead of a virtualized environment. They are deep-rooted scalability problems at their end that need to be addressed sooner rather than later.


Has Amazon become oversubscribed? It sure feels like it, as we are being "taxed" by being forced to move up their offering stack just to keep the level of performance we have been enjoying.

It appears that even Amazon have a limit to what they can scale to.

Update: Amazon Latency: The Pretty Graphs

Comments

Have instances been taking longer to start up, in your observation? The reason I ask is that if the network performance is so bad, then AMIs stored in S3 should take much longer to be retrieved, and the time from the moment you "hit" the button to when the instance is "available" should be considerably longer than before. Has that been your experience?

left by Tarun — Friday, 26 February 2010 8:13 PM — web site

Regarding whether this is mostly a reporting problem due to ICMP ping traffic being treated with low priority, I can assure you the problem is real. I ran a Counter Strike server on AWS and in mid-December had a ping of around 50. Around New Year, which is when they start showing problems, I was getting pings of around 500 ms, and the game server had severe lag while playing. That is the reason I came across this article, because I wanted to find out why my server was performing so poorly. I hope they can get this permanently resolved.

left by Mike — Monday, 22 February 2010 3:48 PM

Looks like what you needed was a burstable cloud solution where you could utilize the cloud in peak times but mainly relied on a dedicated server. This would allow you to get the cost advantages of a dedicated server with the scalability opportunities of the cloud. Carpathia Hosting has an offer called AlwaysOn/InstantOn that was made for just that problem, might want to check it out.

left by Brenna — Tuesday, 16 February 2010 7:39 PM

That explains why I had to switch from an m1.small to a c1.medium instance 6 months ago. I've also experienced some pretty weird behaviour where my instances stay running but lose all network access and, like Alan, I've seen people posting similar issues on the EC2 forums. I even subscribed to premium support for two months to try and resolve it but got nowhere.

I came across the link to this blog post on the Windows Azure forums. Some posts have ominously started to appear on that forum along the same lines as the issues Alan is describing here (search for "Variability of app performance"). The Windows Azure model is good, no OS/IIS concerns etc., but Microsoft still have a long way to go to make it even as good as Amazon's EC2. It currently takes 10 minutes to deploy or upgrade an application through the Windows Azure web page (that's not an exaggeration). Make a spelling mistake in an HTML file or forget a stylesheet: 10 minutes. Find another spelling mistake: another 10 minutes.

In theory cloud infrastructures can solve the difficult issues of scalability and reliability (using failover) but lately I'm starting to think back to the dedicated hosting providers and absence of these time and resource draining cloud OS issues. The irony is I'm now spending more on my EC2 instances to cope with reliability and performance issues than I was previously for a dedicated host.

Sigh. Where is that magic cloud out there that just works.

left by Aaron — Tuesday, 9 February 2010 12:07 AM — web site

Does this come back to good old-fashioned support and communication? The IT guys at Amazon are pretty talented technically, but if they don't have a culture of supporting people you don't get any straight answers. Every platform will have issues at one point or another; what we need to know is whether this is the new normal or whether it's going to be fixed. Move to Rackspace?

left by Neil — Wednesday, 27 January 2010 2:11 AM — web site

Hi Alan,

Thank you for sharing such detailed information. I have two questions about this post:

1) What is your conclusion? Is AWS still a wise choice? Or do you think it's better to keep our business on dedicated servers until AWS gets more mature and reliable?

2) What level of traffic (page views) are you talking about? I ask because a less busy server might not have such a problem, just as you didn't for 20 months.

Thank you, Ronan

left by Ronan Lucio — Tuesday, 26 January 2010 4:01 PM — web site

Man, I was just about to sign up with EC2 for my next project too. Will you do a followup in a few months to let us know if they've settled down any?

left by Nathan Friedly — Thursday, 21 January 2010 3:20 PM — web site

Two relevant issues you avoided addressing:

  • Do you see differences in performance for the pre-paid vs on-demand instances (small, medium, high, etc)? If not, then this is important info. My understanding is that with on-demand instances you are not guaranteed to get anything started, hence the introduction of the pre-paid/purchased instances.

  • Did/does this performance correlate with the spot price? If it is not positively correlated, then your overcrowding hypothesis is set back (but not disproved), unless you argue spot prices move lower because demand is higher. If positively correlated, you could use this as a predictor of when you are likely to have 'issues'.

left by Mark — Thursday, 21 January 2010 5:14 AM

To the people saying Amazon should move to VMware: Xen is not the problem here. Amazon has probably miscalculated the number of instances that people are going to use to max capacity. If everyone just ran a low-volume web or email server then everything would be fine. But if everybody thrashes their instances then there just aren't going to be enough cycles, bandwidth, IOPS, etc., to go around.

left by Huron — Friday, 15 January 2010 5:33 PM

They should start deploying VMware for the cloud service to overcome these performance issues.

left by Craig — Friday, 15 January 2010 5:08 PM — web site

To those having issues with EC2, spin up another services provider, such as one of the several vCloud Express operators. They're VMware based, not Xen, and have the support and stability that comes with that. Plus with VMotion, the providers are able to move heavy VMs to other pods to even up the load. Just google "vCloud Express" to find one.

left by Pete — Friday, 15 January 2010 2:52 PM

This is quite disturbing, as cloud computing is "gaining" in its popularity, with Amazon the leader in cloud servers. What is the lag between Amazon's datacenters, and what has been the response from their support in these cases? EC2 suits most people I know.

left by Levi — Thursday, 14 January 2010 3:57 AM — web site

@Nikolai Pings are treated with lowest priority by switches, routers, and servers. If there is no reply then one or some of these are overloaded.

From what I understand, the hypervisor is good at dividing CPU resources by handling VMs as separate processes (operating systems and processors are designed to do this). Operating systems and server motherboards are not designed to do the same for disk and network I/O.

In my career the project requirements for reliability and performance were far more important than maximizing hardware utilization. My company offers a server cloud that delivers dedicated servers instead of virtual instances on shared servers. Newservers.com Bare Metal Cloud gives you the ability to create EC2 style elastic computing without sharing servers.

left by JP — Wednesday, 13 January 2010 9:35 PM — web site

We had the same problems with AWS some months ago. Back then it all came down to a badly configured route. Today we are using AWS High-CPU instances heavily for video transcoding and we can't measure any lack of CPU power. But I must agree that Amazon's community support and transparency are not good if you are a standard customer. One last comment on your ping output: ICMP is really not a valid way to measure network delays, because most routers handle these requests with very low priority. You should measure the time to deliver a small static file or the time needed for a simple MySQL query. Thanks a lot anyway for this very detailed article. We'll surely take a close look at our servers to see if AWS services are getting worse.

left by Nikolai — Wednesday, 13 January 2010 7:51 PM — web site

Jack, Xen also supports pinning the machine to an arbitrary number of CPUs, but I don't know whether Amazon does that or not. In my experience, though, CPU is rarely the bottleneck.

left by Avalon — Wednesday, 13 January 2010 6:17 PM

I'm not really surprised by this at all. This is still shared hosting that we are talking about here, multiple VMs running on the same hardware. Now admittedly having them managed via Xen is much more advanced than just having accounts on the same server, but it's still shared hardware. It's unlikely that you'll ever be able to predict exactly what kind of performance you'll get when you start a VM. There are too many other variables at play.

If you have two dedicated servers using the exact same hardware you can be reasonably confident that you'll get the same performance characteristics out of both. With shared hosting that just isn't likely to be the case.

left by Joseph Scott — Wednesday, 13 January 2010 5:01 PM — web site

I don't think oversubscription is the problem, since surely they haven't upped the number of running instances per machine. I think it's likely that there are more "big players" using EC2. Say a quarter of instances are run by big players; they are quite likely using network I/O heavily. So it's likely you end up on a box where a big player is.

left by Anon — Wednesday, 13 January 2010 4:37 PM

One interesting thing about Windows Azure is that you get dedicated CPU (hardware is fixed partitioned), so the noisy neighbourhood problem might not be so bad?

left by Jack — Wednesday, 13 January 2010 8:39 AM

@Adam: Yes, it has to be said that the fact you have to pay extra for support irks us. Their forums are really not that helpful, as they are full of people "guessing" at what the problem may be. Very rarely do you get real Amazon input. I have posted about these problems many times on the forum, complete with ping traces, but no one from Amazon is acknowledging them.

@Janitha: We see them every so often, but when they do start, they seem to stay for at least 20 to 60 minutes. Completely unacceptable. Looking at our monitoring reports for last night, we see some of our nodes having the same issues.

We are experiencing this within their Virginia data center, and all the pings are to the same location.

left by Alan Williamson — Wednesday, 13 January 2010 8:26 AM

I'd be interested to know if the same things happen with Microsoft's offering.

left by Robert McLaws — Wednesday, 13 January 2010 5:14 AM

You leave out one of the biggest problems with AWS, the super lack of documentation and well maintained community sites. This is really where much of the problem lies. If people could more effectively communicate, it would make everybody's life easier on there.

left by Adam Nelson — Wednesday, 13 January 2010 12:58 AM

Wow, those are some shocking latency times between internal nodes. Do you always see these kinds of delays? I've never seen anything like that within AWS. Were the two machines you were pinging located in the same availability zone or location?

left by Janitha — Tuesday, 12 January 2010 10:50 PM — web site
