Why many Java performance tests are wrong

January 28, 2009 20:58

Getting performance statistics right can be hard

A lot of ‘performance tests’ have been posted online lately. Many of these tests are implemented and executed in a way that completely ignores the inner workings of the Java VM. In this post you will find some basic knowledge to improve your performance testing. Remember, I am not a professional performance tester, so put your tips in the comments!

An example

For example, a few days ago a ‘performance test’ comparing while loops, iterators and for loops was posted. This test is wrong and inaccurate. I will use it as an example, but many other tests suffer from the same problems.
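
The original code of that test is not reproduced here, but a minimal sketch of what such a naive micro-benchmark typically looks like (class name, list size and exact loop bodies are my own invention) shows the pattern — and it commits every mistake discussed below:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class NaiveLoopBenchmark {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }

        long sum = 0; // consume results so the JIT cannot remove the loops

        // Iterator test
        long start = System.currentTimeMillis();
        for (Iterator<Integer> it = list.iterator(); it.hasNext(); ) {
            sum += it.next();
        }
        System.out.println("Iterator - Elapsed time in milliseconds: "
                + (System.currentTimeMillis() - start));

        // For test
        start = System.currentTimeMillis();
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        System.out.println("For - Elapsed time in milliseconds: "
                + (System.currentTimeMillis() - start));

        // While test
        start = System.currentTimeMillis();
        int i = 0;
        while (i < list.size()) {
            sum += list.get(i);
            i++;
        }
        System.out.println("While - Elapsed time in milliseconds: "
                + (System.currentTimeMillis() - start));

        System.out.println(sum); // keep 'sum' live
    }
}
```

Each variant runs exactly once, in a fixed order, with no warm-up and no repetition — which is exactly why the numbers below jump around.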

So, let’s execute this test for the first time. It tests the relative performance of some loop constructs on the Java VM. The first results:

Iterator – Elapsed time in milliseconds: 78
For – Elapsed time in milliseconds: 28
While – Elapsed time in milliseconds: 30

Alright, looks interesting. Let’s change the test a bit. When I reshuffle the code, putting the Iterator test at the end, I get:

For – Elapsed time in milliseconds: 37
While – Elapsed time in milliseconds: 28
Iterator – Elapsed time in milliseconds: 30

Hey, suddenly the For loop is the slowest! That’s weird!

So, when I run the test again, the results should be the same, right?

For – Elapsed time in milliseconds: 37
While – Elapsed time in milliseconds: 32
Iterator – Elapsed time in milliseconds: 33

And now the While loop is a lot slower! Why is that?

Getting valid test results is not that easy!

The example above shows that obtaining valid test results can be hard. You have to know something about the Java VM to get more accurate numbers, and you have to prepare a good test environment.

Some tips and tricks

  • Quit all other applications. It is a no-brainer, but many people are testing with their systems loaded with music players, RSS-feed readers and word processors still active. Background processes can reduce the amount of resources available to your program in an unpredictable way. For example, when you have a limited amount of memory available, your system may start swapping memory content to disk. This will not only have a negative effect on your test results, it will also make them non-reproducible.
  • Use a dedicated system. Even better than testing on your developer system is to use a dedicated testing system. Do a clean install of the operating system and the minimum amount of tools needed. Make sure the system stays as clean as possible. If you make an image of the system you can restore it to a previously known state.
  • Repeat your tests. A single test result is worthless if you do not know whether it is accurate (as the example above shows). Therefore, to draw any conclusions from a test, repeat it and use the average result. When the numbers vary too much from run to run, your test is wrong: something in it is not predictable or consistent. Fix your test first.
  • Investigate memory usage. If the code under test is memory intensive, the amount of available memory will have a large impact on your test results. Increase the amount of memory available, buy more memory, or fix the memory usage of the program under test.
  • Investigate CPU usage. If the code under test is CPU intensive, try to determine which part of your test uses the most CPU time. If the CPU graphs fluctuate a lot, try to determine the root cause. Garbage collection, thread-locking or dependencies on external systems, for example, can have a big impact.
  • Investigate dependencies on external systems. If your application does not seem to be CPU-bound or memory intensive, try looking into thread-locking or dependencies on external systems (network connections, database servers, et cetera).
  • Thread-locking can have a big impact, to the extent that running your test on multiple cores can actually decrease performance. Threads that are waiting on each other are really bad for performance.
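
The “repeat your tests” advice can be sketched as a small harness that runs a task several times and reports the mean and standard deviation (the harness and the example task are my own illustration, not code from the test under discussion):

```java
public class RepeatedBenchmark {

    /** Runs the task 'runs' times and prints mean and standard deviation. */
    static void measure(String name, int runs, Runnable task) {
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = System.nanoTime() - start;
        }
        double mean = 0.0;
        for (long t : times) mean += t;
        mean /= runs;
        double variance = 0.0;
        for (long t : times) variance += (t - mean) * (t - mean);
        variance /= runs;
        System.out.printf("%s: mean %.2f ms, std dev %.2f ms%n",
                name, mean / 1_000_000.0, Math.sqrt(variance) / 1_000_000.0);
    }

    public static void main(String[] args) {
        measure("example loop", 10, () -> {
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
            if (sum == 42) System.out.println(); // defeat dead-code elimination
        });
    }
}
```

A standard deviation that is large relative to the mean is exactly the “numbers vary too much from run to run” signal described above.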

The Java HotSpot compiler

The Java HotSpot compiler kicks in when it sees a ‘hot spot’ in your code. It is therefore quite common for your code to run faster over time! You should adapt your testing methods accordingly.

The HotSpot compiler compiles in the background, eating away CPU cycles. So when the compiler is busy, your program is temporarily slower. But after compiling some hot spots, your program will suddenly run faster!

When you make a graph of the throughput of your application over time, you can see when the HotSpot compiler is active:

[Figure: throughput of a running application over time]

The warm up period shows the time the HotSpot compiler needs to get your application up to speed.

Do not draw conclusions from the performance statistics during the warm up time!

  • Execute your test and measure the throughput until it stabilizes. Discard the statistics gathered during the warm-up time.
  • Make sure you know how long the warm-up time is for your test scenario. We use a warm-up time of 10-15 minutes, which is enough for our needs, but test this yourself! It takes time for the JVM to detect the hot spots and compile the running code.
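
These two rules can be sketched as a harness that runs a warm-up phase first and only times the iterations after it (the task and the iteration counts are placeholders of my own; for a real test you would calibrate them, or loop on wall-clock time for a 10-15 minute warm-up):

```java
public class WarmUpBenchmark {

    // The code under test; a dummy CPU-bound task for illustration.
    static long task(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long blackhole = 0; // consume results so the JIT cannot remove the loops

        // Warm-up phase: run the task but discard all timings, giving the
        // HotSpot compiler time to detect and compile the hot code paths.
        for (int i = 0; i < 200; i++) {
            blackhole += task(1_000_000);
        }

        // Measurement phase: only these iterations are timed.
        int runs = 50;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            blackhole += task(1_000_000);
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("Average per run: %.3f ms%n",
                elapsed / (double) runs / 1_000_000.0);
        if (blackhole == 42) System.out.println(); // keeps 'blackhole' live
    }
}
```

The warm-up loop plays the role of the warm-up period in the graph above: by the time the timed loop starts, the throughput should have stabilized.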


From Dries Buytaert I received a link to a paper called Statistically rigorous Java performance evaluation. I highly recommend reading it when you want to know more about measuring Java performance.

Remember, I am not a professional performance tester, so put your tips in the comments!


  • Michael Bar-Sinai

    Excellent article. Thanks!
    You can set hotspot’s warm up time by setting the -XX:CompileThreshold when you launch java.

  • Ran Biron

    Great article. It has already been said years ago, repeated… tens? hundreds? thousands? of times. But it never gets old, since there are more micro-benchmarks every day, and more decisions made using them.

  • Dimitris Menounos

    So tests should be bent over backwards in order to cover for the JVM’s shortcomings? Why should we factor out “warm up” time? This advice keeps coming up and I think it’s wrong. HotSpot has its costs and they should be clearly included in the results.

    It is HotSpot that should be changed to resolve these issues – and yes, it can be done.

  • @Michael Thanks! I did not know that option. When testing, you can use that option, but be aware of the consequences… your tested code may behave very differently when using that option. From http://performance.netbeans.org/howto/jvmswitches/index.html:

    -XX:CompileThreshold=100 – this switch will make startup time slower, by causing HotSpot to compile many more methods down to native code sooner than it otherwise would. The reported result is snappier performance once the IDE is running, since more of the UI code will be compiled rather than interpreted. This value represents the number of times a method must be called before it will be compiled.

    In certain cases (for example in Swing applications) you want to measure the warm up time, because that’s what people will face when using your app. But when comparing loop constructs it doesn’t make sense, because the warmup time will influence your test results too much to be of any value.
    So, it depends when you want to take into account the Hotspot overhead. But many people do not know about it and proudly post test results that mean nothing. Knowing about it, you can make your own decision.

  • Dimitris Menounos

    I am a firm believer that a neutral test should count *everything*, not only the parts that are in favor of the test case. Otherwise the test is biased and really useless IMHO.

    There are solutions to the warm up issue (like ahead-of-time compilation and native code caching), it is just that HotSpot does not support them. Again, it is HotSpot that should be changed to address the issue, not the other way around.

  • Dimitris Menounos

    “But when comparing loop constructs it doesn’t make sense, because the warmup time will influence your test results too much to be of any value.”

    I see your point better now, when you test “A vs B vs C” you don’t want hotspot warm up to influence one against the others. You are right about that! However your post concludes to the misleading idea that tests *in general* that don’t discard warm up time mean nothing.

  • @Dimitris And that is why I’m happy you provided the additional insight that in particular cases you DO want to measure HotSpot warm up time. We’re here to learn :-)

  • turtlewax

    >>The statistics you get during the warm up time should be discarded

    I agree with Dimitris. Or at the very least, if you do discard warm-up, make sure that point is very clearly stated. It’s a well-known issue with Java, and omitting it will only fuel the criticisms.

  • @Dimitris

    The warm up phase is pretty short with the client VM. With micro benchmarks it’s usually over after less than half a second. Usually you’re interested in improving the performance of programs which run longer than that, which is why the warm up phase should usually be ignored (since it only skews the measured results for this kind of scenario).

    E.g. no one will care if the path finding in your game is somewhat slow for the first 100msec. As long as you reach 60+ fps within a few seconds (and manage to stay there) everything is fine.

    So, this isn’t about being objective or not. It’s about getting meaningful numbers. E.g. there are two algorithms. Let’s call ‘em A and B. A needs a whopping 500ms for compilation, but only takes 10ms per iteration. Whereas B only needs 100ms for compilation and 50ms per iteration. Would you really want algorithm B in your application?

    Now… if it’s a command line application, which quits after 1 second you might care for the warm up. Would be pretty pointless, but go ahead and waste your time. However, if it’s run for a few minutes or hours… or even days/weeks/years… you certainly won’t give a damn about warm up.

    Since the warm up gets averaged out in your typical use case, you can get representative times much quicker by ignoring the warm up altogether. Without the warm up algorithm A is the clear winner.


  • Chris Gummer

    Nice article, thanks!

  • Cedric Franz

    I have run many performance tests on our system and one of the settings that makes a big difference is whether the Java VM is started with the -server option or not. It is not clear cut whether the application will run faster with or without this option set, because it largely depends on what the application does, but it is definitely worth trying. Our application runs about 10-15% faster under high load with the -server option.

  • Nils

    Really liked that article, gives some good hints… will do some further research on that topic.

    Thanks man!

  • Hi Nils,

    Glad you liked it!

    - Daan

  • Bill the Lizard

    Good article. I’ve been guilty of this kind of performance non-testing myself in the past. It’s good to read a post that doesn’t just say “That’s wrong,” but also says why it’s wrong and what to do about it.

  • Nice posting. I would add that you can spot interference in the benchmark by calculating the variance; it should be close to zero in this case. Also, you should investigate how the code was optimized to ensure that you don’t lose the effect you are trying to measure. You can use -XX:+PrintCompilation (iirc). Another recommended option is to dump the generated native code using a debug version of the JVM.


  • Jan Cajthaml

    Thank you for that article.
    I will cite you in my thesis.

  • Jan Cajthaml

    Btw try testing the difference between Java’s native a == b comparison and (a ^ b) == 0x0 (on 32-bit architectures) :)
    This micro-optimization helped me speed up my application a lot (an RUDP server)