alanwilliamson

What do you do when Java is gobbling up 99% CPU?

One of our servers, which had been running happily for months without any major problem, suddenly started jumping to 99% CPU and never coming back down to normal levels. Traffic on the site hadn't changed, so something else was causing the jump. This went on for days and I still couldn't pin it down. It looked as though the machine was stuck in a tight loop of some sort. I carefully profiled all the CFML running on the machine, but nothing in the pages would cause it to enter a tight loop and never return.

The time had come to explore a thread dump. I had attempted this before, but the JVM never came back from the tight loop that was producing the 99% CPU load, so kill -3 <pid> yielded nothing. After a little searching I found a tip from BEA's WebLogic support suggesting that you start the JVM with the JIT disabled (java -nojit). That didn't help either.
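For readers who want a view from inside the JVM rather than from kill -3, here is a minimal sketch using the standard java.lang.management API (Java 5 and later). It lists live threads ordered by accumulated CPU time, with the top stack frame of each. The class name BusyThreads and its report method are made up for illustration, and this is not something I tried at the time. It would also have to be wired in before the spike starts (for example, called periodically from a watchdog thread), and collecting stack frames may well stall in the same way the thread dump did if the runaway thread never lets the JVM pause it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: names are illustrative, not from the original post.
public class BusyThreads {

    /** One line per live thread, busiest first: CPU ms, thread name, top frame. */
    public static List<String> report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isThreadCpuTimeSupported();

        // Collect {threadId, cpuNanos}; cpuNanos is -1 when unavailable.
        List<long[]> rows = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = cpuSupported ? mx.getThreadCpuTime(id) : -1L;
            rows.add(new long[] { id, cpuNanos });
        }

        // Sort by CPU time, highest first.
        rows.sort(Comparator.comparingLong((long[] r) -> r[1]).reversed());

        List<String> lines = new ArrayList<>();
        for (long[] row : rows) {
            ThreadInfo info = mx.getThreadInfo(row[0], 1); // depth 1: top frame only
            if (info == null) {
                continue; // thread ended between the two calls
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no frame)";
            lines.add((row[1] / 1_000_000) + " ms  " + info.getThreadName() + "  at " + top);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

A thread that sits at the top of this list and keeps climbing between two calls is the likely culprit, and its top frame hints at which library it is stuck in.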

The next tip suggested attaching the Java debugger (jdb) to the process to see whether it could be suspended that way. Again, the source of this information was another J2EE provider, Resin. You start the Java process with the following flags: -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5432 and then, in another shell, attach the debugger with: $JAVA_HOME/bin/jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=5432. That all worked beautifully.

However, when the time came to actually suspend the process, the debugger just hung and never returned. So the debugger route was a washout.

So now I was kinda stuck as to what to do. Then, in a moment of desperation, I simply typed a huge, long, silly question into Google asking why it wasn't working. On page 2 of the results was my answer, from yet another J2EE server, Tomcat. The MySQL JDBC driver I was using had a very strange bug: when you paged through a result set using LIMIT x,y and reached the last page, it would go into an infinite loop if the page boundary was exact. For example, with 100 results at 10 results per page, the 10th page would throw the driver into a loop.
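To make the "exact page boundary" condition concrete, here is a small sketch of the paging arithmetic behind LIMIT x,y. It does not reproduce the driver bug and touches no database; it only shows which inputs land on the boundary described above. The class and method names (PagingBoundary, pageCount, offset, exactBoundary, limitClause) are made up for illustration.

```java
// Hypothetical helper: illustrates the paging arithmetic only, not the driver bug.
public class PagingBoundary {

    /** Number of pages needed to show totalRows at pageSize rows per page. */
    public static int pageCount(int totalRows, int pageSize) {
        return (totalRows + pageSize - 1) / pageSize; // integer ceiling
    }

    /** The x in "LIMIT x,y" for a 1-based page number. */
    public static int offset(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    /** True when the last page is completely full, i.e. the boundary is exact. */
    public static boolean exactBoundary(int totalRows, int pageSize) {
        return totalRows > 0 && totalRows % pageSize == 0;
    }

    /** The LIMIT clause for a 1-based page number. */
    public static String limitClause(int page, int pageSize) {
        return "LIMIT " + offset(page, pageSize) + "," + pageSize;
    }

    public static void main(String[] args) {
        int totalRows = 100;
        int pageSize = 10;
        int lastPage = pageCount(totalRows, pageSize);
        System.out.println("pages=" + lastPage
                + " last=" + limitClause(lastPage, pageSize)
                + " exact=" + exactBoundary(totalRows, pageSize));
    }
}
```

With 100 rows and 10 per page, the last page is page 10, fetched with LIMIT 90,10, and the boundary is exact. With 95 rows the last page is still page 10, but it holds only 5 rows, so the boundary is not exact.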

I upgraded the MySQL driver and, after setting some backward-compatibility switches on it, the whole thing burst into life, with no problems reported since. Armed with this newfound information, I went looking at my queries and noticed that a new feature had been introduced to page through results. This was indeed the reason.

So thanks to WebLogic, Resin and finally Tomcat for helping me get to the root of my MySQL problem!

Comments

Here is a good article about the top 3 reasons an application hangs. Gives good info about those hard-to-detect scenarios.

left by Nick Batell — Thursday, 3 January 2008 2:52 PM — web site

"it was caused due to a race condition our customers code had to a HashMap that brought on an infinite loop. That was a needle in a haystack situation."

Hardly a haystack. That is an easy problem: http://blogs.opensymphony.com/plightbo/2005/07/

left by Dave — Tuesday, 27 September 2005 1:59 PM

An easy problem? I dispute that claim for a start. It is only 'easy' after the event. With respect to your problem, how do you know it's not something really innocent that is manifesting itself in a large production system? You don't.

Production problems are never easy and even with this one, it was not easily reproduced in the staging environment because the unique set of conditions didn't present itself. You are facing the exact same problem.

If your production environment were reproducible in staging then you would find it in a heartbeat. But it's not.

Treat it as the thrill of the chase.

left by Alan Williamson — Monday, 26 September 2005 11:43 AM — web site

This is an easy problem: you could reproduce it. I have faced harder problems with production systems that you need to reboot every X days, so it's impossible to reproduce such a problem in a debug environment. Tracking these production-only problems is almost impossible, and we had to create incredibly complex logs that allowed us to reproduce/understand the problem.

However, now there is a new solution on the horizon: Solaris 10 dtrace... I still haven't run into such a problem on a Solaris 10 machine, but after seeing all the dtrace demos I just can't wait!

BTW last time I had such a problem it occurred after a minor version upgrade of the JDK and it was caused due to a race condition our customers code had to a HashMap that brought on an infinite loop. That was a needle in a haystack situation.

left by Shai Almog — Monday, 26 September 2005 11:18 AM

Mark, the link that gave me the pointer to the fact it might be MySQL, I gave in my original blog entry. The database is 4.0.22, I think.

As for the version of it, let me unzip the JAR file and see if there is any clue as to the version.

  • mysql-connector-java-3.0.11-stable-bin.jar

  • Size: 242323 bytes

I upgraded to the latest 3.1, which has doubled in size to over 400 KB, so I take it some improvements/features were added! :)

BTW Let me thank you for writing a good blog entry on your own blog with respect to the Date=null issue for the latest generation of drivers.

left by Alan Williamson — Thursday, 22 September 2005 6:11 AM

Alan,

I'd be interested to hear what version of the MySQL JDBC driver had the bug, and which version fixed it. I haven't actually had anyone report it that I can easily find in either my e-mail or the bugs database.

left by Mark Matthews — Thursday, 22 September 2005 2:14 AM — web site

Woo hoo, Alan. I liked this post! I'm glad that other developers go through similar thought processes to myself... I enjoy reading about the logic and process of elimination. I've been in your situation myself while debugging a very large (1,000+ requests per min) MS SQL Server problem for a client.

The largest problem was that it was Someone Else's Code, and I had to figure out why the site wasn't running as responsively as it should have been, and why, after a day of usage, the WWW service dllhost.exe ate all the RAM. Fixing a memory leak in someone else's compiled VB code, running on an ASP 3.0 platform, with so many requests per min to simulate...

Fun. ;)

left by Barns — Wednesday, 21 September 2005 10:54 PM — web site
