Bizarre performance issue
Jan. 4th, 2013 12:26 pm
Here in Germany the holidays are long over: I went back to the office on the 2nd, and today is the third working day of the new year.
And now I got an interesting puzzle to solve from a customer. They gave me a microbenchmark that runs in 500 cycles on one system and takes 1000 cycles on another (yes, the numbers are that nice and round!). The systems are very similar: same frequency, no power management, turbo, or SpeedStep. I run the benchmark 1000 times to warm up, then 10000000 times while measuring cycles. The wall-clock time difference is also 2x, matching the TSC difference for individual runs.
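To make the measurement scheme concrete, here is a minimal sketch of a warm-up-then-measure loop. It is not the customer's benchmark: it uses Python's wall-clock timer instead of the TSC, the names `measure_ns_per_call` and `ns_to_cycles` are made up for illustration, and the conversion to cycles assumes a fixed clock frequency, as on the systems described above.

```python
import time


def measure_ns_per_call(fn, warmup=1_000, runs=100_000):
    """Average wall-clock nanoseconds per call of fn after a warm-up phase.

    The result includes the overhead of the Python loop itself.
    """
    # Warm-up: untimed calls so caches and predictors can settle.
    for _ in range(warmup):
        fn()

    # Timed phase: one timestamp before and one after the whole batch.
    start = time.perf_counter_ns()
    for _ in range(runs):
        fn()
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / runs


def ns_to_cycles(ns, freq_hz):
    """Convert nanoseconds to cycles, assuming a fixed clock frequency."""
    return ns * freq_hz / 1e9
```

Timing the whole batch rather than each call keeps the timer overhead out of the per-call figure, which matters when a single run is only a few hundred cycles long.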
As usual, the first thing I did was run it under VTune (well, actually I first ported it from the RTOS to Linux; the port did not change the performance numbers). I expected VTune to show the same number of instructions retired on both systems, and 2x the cycles on the slower one. Then I planned to look at the places where CPI got worse and find the root cause.
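The plan above boils down to comparing cycles per instruction (CPI = cycles / instructions retired) region by region. A rough sketch of that comparison follows; the region names and counter values are invented purely for illustration and are not VTune output, and `rank_by_cpi_regression` is a hypothetical helper.

```python
def cpi(cycles, instructions):
    """Cycles per instruction for one code region."""
    return cycles / instructions


def rank_by_cpi_regression(fast, slow):
    """Rank regions by how much CPI worsened on the slow system.

    fast, slow: dicts mapping region name -> (cycles, instructions retired).
    Returns a list of (region, slow_cpi / fast_cpi), worst regression first.
    """
    ranked = []
    for region, (fast_cycles, fast_instr) in fast.items():
        slow_cycles, slow_instr = slow[region]
        ratio = cpi(slow_cycles, slow_instr) / cpi(fast_cycles, fast_instr)
        ranked.append((region, ratio))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical counters: same instruction counts, 500 vs 1000 total cycles.
fast_system = {"region_a": (200, 400), "region_b": (300, 300)}
slow_system = {"region_a": (200, 400), "region_b": (800, 300)}
ranked = rank_by_cpi_regression(fast_system, slow_system)
```

With numbers like these, the region at the head of `ranked` is where I would start digging.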
It was not the case! Both instruction and clock counts were the same for the two runs under VTune, yet the wall-time difference was still 2x... Now I will have to bisect the benchmark and use other tools (internal ones and IACA) to understand what is actually happening.