An inquiry, bugz
Nov. 20th, 2015 05:51 am
Two days ago I flew to Tel Aviv to solve a technical puzzle. Now I am at Ben Gurion airport, waiting for my departure.
And the puzzle is solved. This was one of those rare, lucky "veni, vidi, vici" events, so I'll describe the reasoning we used.
In the first 4 hours on Wednesday, I formulated three hypotheses about what had caused a huge and unexpected software performance regression reported by the customer: (1) some code is misusing an architecturally visible CPU feature; (2) an OS kernel bug triggers a u-arch issue that is normally benign but not in this case; or (3) an OS kernel bug that is just a crazy kernel bug.
Yesterday morning a team of the customer's engineers and I started to investigate all three in parallel, trying to find proof for each. One engineer was developing an OS driver that should detect (1). I was gathering more data to prove (2). Another engineer was gathering more data so I could formulate a question about (3) for the OS kernel team. Yet another engineer was running all the required tests on multiple SUTs, and a project manager was bringing me sweets, fruit and fresh juice, and communicating with stakeholders.
By lunch, I was done with one half of (2). There was indeed a performance degradation from this cause, but only ~11%, not 5x. The other half could have contributed ~20% more at most, and that was it.
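To put those numbers side by side, here is a quick back-of-the-envelope check. It is only a sketch: it assumes each percentage is a multiplicative slowdown in run time (the post does not say how they were measured), and the variable names are made up for illustration.

```python
# Assumption: each percentage is a multiplicative slowdown in run time.
observed_slowdown = 5.0   # the "5x" regression reported by the customer
first_half = 1.11         # ~11% measured for the first half of (2)
second_half_max = 1.20    # upper bound for the other half (~20% more)

# Best case for hypothesis (2): both halves contribute fully.
best_case_for_h2 = first_half * second_half_max

# Whatever is left over must come from some other cause.
unexplained = observed_slowdown / best_case_for_h2

print(f"hypothesis (2) explains at most {best_case_for_h2:.2f}x")
print(f"remaining factor to explain: {unexplained:.2f}x")
```

Under that assumption, even the best case for (2) leaves most of the regression unexplained, which is why the other two hypotheses still mattered.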
I did not expect such a fast turnaround, but the device driver that checked for (1) was ready by 4PM. We ran it and found that (1) is true, and is caused by (3). We discussed a prospective fix, and when we reached a conclusion I estimated that developing the fix would have taken me ~1 full work day. The engineer who had developed the tool to check (1) opened its source code, added ~100 more lines of C code in 20 minutes, compiled, got a compiler error, fixed it, compiled, ran, got a runtime error, fixed it, compiled, ran, and the issue was resolved! It took less than an hour. (Others called him a genius, and I tend to agree: I meet coders of this caliber only once or twice per year.)
But Israelis! Experts in bargaining. It was 5PM and the issue was resolved. So they told me: by the way, we have another issue, not as critical, but let's solve it too. Reluctantly I ran VTune, enabling, just in case, its newest feature that collects some other data besides normal PMU events. And it worked: the first run showed the root cause of another strange issue that appears quite rarely. (This has yet to be confirmed with other tools.) Yesterday was just a lucky day!