Saturday, September 23, 2006

Performance Tuning Tips for Java - No tools required

Over the years I've developed a simple methodology for doing performance tuning. To tell the truth, I didn't even know it was a methodology until I realized that whenever people said "The software is slow!" I always responded with the same actions. In this article I'm going to talk about what I now fondly call "Performance tuning by Steve." (Can you tell I've seen one too many infomercials in my day?)

First things first:

Someone comes to you as a developer and says, "The software is too slow." They usually have between 2 and 5 guesses as to why it's slow, and maybe so do you. What do you do? I'll tell you what you NEVER do. You NEVER take action based on a guess of what the problem is. The only thing worse than an outsider's guesses as to why your software is slow is your own guesses. Do the investigation. I can't tell you how many times I hear people insist that some piece of software is slow because of networking, disk or database even though the machine is CPU bound. Before I touch a line of code I always ask these two basic questions.

1) Are you CPU bound?

(If it is a multi-process test, I ask whether they are CPU bound on any of the boxes involved in the test. If it is a multi-CPU machine, I ask whether any one processor is pegged.)

The answer I usually get is "I don't know." Ahh, I say. Can you run your test again with vmstat (that translates to iostat on the Mac and the Performance tab in Task Manager on Windows)? Don't forget to tee the output to a file.
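If you can't easily see per-processor numbers from the OS, the JVM can give a rough per-thread view of the same question. What follows is only a sketch, not a replacement for vmstat: the class name ThreadCpu, the one second window and the 90% threshold are all my assumptions, and the code would have to run inside the application under test to tell you anything about it. It relies on the standard java.lang.management ThreadMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: reports threads that used most of one CPU during a short window.
final class ThreadCpu {

    // Share of one CPU used: cpuNanos of CPU time consumed during wallNanos of elapsed time.
    static double share(long cpuNanos, long wallNanos) {
        return wallNanos <= 0 ? 0.0 : (double) cpuNanos / wallNanos;
    }

    // Snapshot of accumulated CPU time (nanoseconds) per live thread id.
    static Map<Long, Long> snapshot(ThreadMXBean mx) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id); // -1 if the thread died or measurement is disabled
            if (nanos >= 0) {
                cpu.put(id, nanos);
            }
        }
        return cpu;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        Map<Long, Long> before = snapshot(mx);
        long t0 = System.nanoTime();
        Thread.sleep(1000); // observation window, an arbitrary choice
        long wall = System.nanoTime() - t0;
        Map<Long, Long> after = snapshot(mx);

        for (Map.Entry<Long, Long> e : after.entrySet()) {
            Long start = before.get(e.getKey());
            if (start == null) {
                continue; // thread was created during the window
            }
            double used = share(e.getValue() - start, wall);
            if (used > 0.9) { // threshold is an assumption
                System.out.printf("thread %d used %.0f%% of one CPU%n", e.getKey(), 100 * used);
            }
        }
    }
}
```

One thread sitting near 100% while the machine as a whole looks half idle is the "one processor is pegged" case.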

2) Are you GC bound?

The answer I usually get is "I don't know." I usually follow with, "Can you run your test again with all processes having some variant of the -verbose:gc flag turned on? And don't forget to tee it to a file." If they are GC bound, then one needs to do GC/garbage creation tuning, which I will save for another day. It is a big topic on its own.
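The -verbose:gc log is the real evidence, but for a quick sanity check you can also ask the JVM how much collection time it has accumulated. This is a rough sketch under my own assumptions: the class name GcShare is made up, and getCollectionTime() is only an approximate figure (for concurrent collectors it is not the same thing as pause time). It uses the standard GarbageCollectorMXBean and RuntimeMXBean.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: estimates what share of JVM uptime went to garbage collection.
final class GcShare {

    // Pure arithmetic, so it can be checked without a running collector.
    static double fraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    // Sums the accumulated collection time reported by every collector in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, 100.0 * fraction(gc, uptime));
    }
}
```

Called at the end of a load test from inside the application, a large percentage here says look at the GC log before anything else.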

If they come back and say, "Yes, I'm CPU bound, and no, I'm not spending tons of time in GC," my mouth waters a bit. I scratch my chin gently and say, "Great! Can you take a series of thread dumps while your app is running?" At this point I get a funny look. Why? "I don't think it's deadlocked," they usually say. Well, this is an extremely effective way to see where an application is spending its time. In my experience this is more effective than even a good profiler. If you run a load test that is CPU bound and take a series of thread dumps, you will find out where your code is spending time. For a little more on thread dumps, check this out: A little on thread dumps
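If taking dumps by hand is awkward, the same idea can be approximated from inside the JVM. The sketch below is an assumption-laden stand-in for reading a series of real thread dumps: the class name DumpSampler, the 20 samples and the 100 ms spacing are mine. It uses Thread.getAllStackTraces() and counts the top frame of every RUNNABLE thread. Keep in mind that Java also reports threads blocked in native I/O as RUNNABLE, so a frame like a socket read at the top of the list is not CPU work.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: sample stack traces in-process and count the top frame of
// every RUNNABLE thread, which approximates reading a series of thread dumps.
final class DumpSampler {

    // Adds one sample (the top frames of all RUNNABLE threads) to the histogram.
    static void sampleOnce(Map<String, Integer> histogram) {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread thread = e.getKey();
            StackTraceElement[] stack = e.getValue();
            if (thread == Thread.currentThread()            // skip the sampler itself
                    || thread.getState() != Thread.State.RUNNABLE
                    || stack.length == 0) {
                continue;
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            histogram.merge(top, 1, Integer::sum);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < 20; i++) { // 20 samples, 100 ms apart (arbitrary choices)
            sampleOnce(histogram);
            Thread.sleep(100);
        }
        // Print the most frequently seen frames first.
        histogram.entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

The frames that show up in most samples are where the time is going, which is the same conclusion you would draw from eyeballing the dumps.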

I usually analyze the thread dumps, fix the worst offender, then try the test again. Rinse, dry, repeat...

Now you are not CPU bound anymore. What next? I'll tell you what still isn't next. DON'T GUESS! You will be wrong. I promise!
The trick here is to find the bottleneck. It is best to think about a non-CPU-bound performance problem as thread starvation, not as a network problem and not as a disk problem (at least at first; there are exceptions to this).

If you are using SEDA style development (think work queues) then you're lucky. The next thing you have to do is just find out which queue is backing up. Really, any state machine style of programming can use this technique.
- Whichever queue is backing up is where your performance problem is.
- You may need to add more worker threads to the end of the queue, especially if the threads are doing blocking tasks of any kind.
- You may need to do a better job of separating out CPU tasks from blocking tasks.
* If you have a lot of disk writing to do, then the thread that writes to disk should do almost no CPU work; otherwise you will get pipeline stalls going to disk. Big mistake.
- You may need to synchronize more effectively if your threads are all blocked waiting for the same resource.
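Finding the queue that is backing up can be as simple as printing queue sizes periodically during the test. Here is a minimal sketch of that idea. The stage names, the QueueDepths class and the use of BlockingQueue are my assumptions for illustration, not anything specific to a SEDA framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the stage names and the queues are illustrative.
final class QueueDepths {

    // Returns the name of the stage whose queue currently holds the most items,
    // or null if there are no stages.
    static String deepest(Map<String, BlockingQueue<?>> stages) {
        String worst = null;
        int worstSize = -1;
        for (Map.Entry<String, BlockingQueue<?>> e : stages.entrySet()) {
            int size = e.getValue().size();
            if (size > worstSize) {
                worstSize = size;
                worst = e.getKey();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        BlockingQueue<String> parse = new LinkedBlockingQueue<>();
        BlockingQueue<String> disk = new LinkedBlockingQueue<>();
        parse.add("a");
        disk.add("b");
        disk.add("c");

        Map<String, BlockingQueue<?>> stages = new LinkedHashMap<>();
        stages.put("parse", parse);
        stages.put("write-to-disk", disk);

        System.out.println("backing up: " + deepest(stages)); // prints "backing up: write-to-disk"
    }
}
```

A single snapshot can mislead, so in practice you would log the sizes every second or so over the whole run and look for the queue that keeps growing.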

If you are not doing SEDA style development, or if you are and have found the queue that is backing up but don't know why yet, then here is the next fun step.
Warning, it is time consuming but very effective.
Wrap large swaths of your slow code like so:

// total and count are long fields declared outside the timed code
long start = System.currentTimeMillis();
-your code-
long t = System.currentTimeMillis() - start;

total += t;
count++;
if(count % 1000 == 0){ //<-- obviously the number you divide by here is dependent on how often the code is called
    System.out.println("T1 Average: " + (total/count)+" count:"+count);
}

Do this in multiple parts of your code and then keep moving the timing in tighter and tighter until you have found the part or parts of your code that are taking the longest. This takes some practice. You'll probably get fooled by thread starvation a few times, and maybe by some blocking calls, i.e. network or disk. Keep your mind clear. Try to draw the hand-offs between threads on a whiteboard.
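Once you are wrapping more than a couple of places, the wrapping pattern above gets easier to move around if the bookkeeping lives in one small class. This is just one way to package it, with my own assumptions: the SectionTimer name is made up, and I've swapped System.currentTimeMillis() for System.nanoTime() because short sections often measure as 0 ms.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper that accumulates timings for one code path.
final class SectionTimer {
    private final String name;
    private final int reportEvery;
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong count = new AtomicLong();

    SectionTimer(String name, int reportEvery) {
        this.name = name;
        this.reportEvery = reportEvery;
    }

    // Records one measurement and prints the running average every reportEvery calls.
    // With several threads the total and the count are read separately, so the
    // printed average is approximate, which is fine for this kind of hunting.
    void record(long elapsedNanos) {
        long total = totalNanos.addAndGet(elapsedNanos);
        long n = count.incrementAndGet();
        if (n % reportEvery == 0) {
            System.out.println(name + " average: " + (total / n) + " ns, count: " + n);
        }
    }

    // Average of everything recorded so far, in nanoseconds (0 if nothing was recorded).
    double averageNanos() {
        long n = count.get();
        return n == 0 ? 0.0 : (double) totalNanos.get() / n;
    }

    public static void main(String[] args) {
        SectionTimer t1 = new SectionTimer("T1", 1000);
        double sink = 0;
        for (int i = 0; i < 3000; i++) {
            long start = System.nanoTime();
            sink += Math.sqrt(i); // stands in for -your code-
            t1.record(System.nanoTime() - start);
        }
        System.out.println("done, sink=" + sink);
    }
}
```

Give each suspect section its own named timer (T1, T2, ...) so the output from a run tells you which section each number belongs to.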

One last thing. TAKE NOTES, lots of NOTES. Save the output from every run and make a little readme that reminds you what each experiment was and even any thoughts you had during said experiment.


1) Figure out if you are CPU bound
2) Eliminate GC as a concern (tuning GC is a whole other article)
3) If you are CPU bound, use thread dumps to figure out where time is being spent (ignore threads in wait, look only at runnables)
4) If you are not CPU bound, do the following:
- Still take thread dumps to understand where threads are blocked and where they are runnable
- Do bottleneck analysis: find out where things are backing up as they move through your state machine
- Use timings over many calls of each code path, then move those timings in on the areas that are slow until you figure out where the problem is.
5) Take lots of detailed notes and save all artifacts of each run.

Good luck


Steve Harris said...

A friend of mine, Hung, pointed me to a good article on thread dumps for those who aren't experienced in how to get them and what they are.

Thanks Hung

Steve Harris said...

A bit about thread dumps link

Oops, got cut off

saravan said...

It's an interesting article, Steve.

One thing to note is that vmstat (and even some versions of top) shows the total CPU usage on a multiprocessor machine.

This has happened to me a couple of times. vmstat shows the CPU usage at around 50%, but since I am using a dual core machine, one of the CPUs is actually pinned. It can be misleading, and you might think the machine is not CPU bound.

I use gnome-system-monitor to see if a single CPU is pinned.

This generally means that the application is not sufficiently multi-threaded.

Another common mistake people make is introducing unnecessary lock contention and thus introducing stalls in processing. This is a tougher problem to solve. The CPU is not pinned but your application is slow.

Thread dumps help sometimes. Some profilers give you this data. The general rule is to use fine-grained locking and to release locks as soon as you are done.