Vtune rocks
Aug. 29th, 2016 07:43 pm
I am investigating a performance issue reported by a customer, as usual. Somewhere mid-way, when I had a solid reproducer (or so I thought), I noticed a strange thing:
The workload hit saturation at 44Gb/sec, even though there were free CPU cycles and plenty of memory and PCIe throughput, so I had to find the limiting factor. For a start, I wanted to compare the Vtune profiles at 43Gb/sec (slightly below saturation) and 45Gb/sec (slightly above). What I noticed was that both profiles were the same, and there was no saturation at 45Gb/sec. So I turned up the dial on my traffic generator and found that when the Vtune collector is running, my application starts losing packets at 48Gb/sec, not 44Gb/sec. Wtf? Quantum effect: the observer affects the measurement, and helps a lot.
Before recommending that the customer run a Vtune collector forever (kidding), I double-checked my setup and found that one of the processing threads was running on the wrong NUMA node. When I moved it to the right one, saturation happened at 48Gb/sec whether or not I was profiling the SUT. Pfff.
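For reference, a minimal sketch of the kind of fix involved: binding a worker thread to the NUMA node local to the NIC so its packet buffers don't cross the interconnect. This is not my actual setup; it assumes libnuma (link with -lnuma), and the node number here is purely hypothetical (on Linux you can look up the real one in /sys/class/net/&lt;dev&gt;/device/numa_node or via numactl --hardware).

#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int nic_node = 1;  /* hypothetical: the node the NIC's PCIe slot hangs off */

    /* Restrict the calling thread to CPUs on the NIC-local node... */
    if (numa_run_on_node(nic_node) != 0) {
        perror("numa_run_on_node");
        return EXIT_FAILURE;
    }
    /* ...and prefer allocating its memory there, too. */
    numa_set_preferred(nic_node);

    /* the packet-processing loop would start here */
    printf("worker pinned to NUMA node %d\n", nic_node);
    return EXIT_SUCCESS;
}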