Nov. 20th, 2015

izard: (Default)
Two days ago I flew to Tel Aviv to solve a technical puzzle. Now I am at Ben Gurion airport, waiting for my departure.

And the puzzle is solved. This was one of those rare, lucky "veni, vidi, vici" events that do not happen often, so I'll describe the reasoning we used.

In the first 4 hours on Wednesday, I formulated three hypotheses about what had caused a huge and unexpected software performance regression reported by the customer: some code is misusing an architecturally visible CPU feature (1), an OS kernel bug triggers a u-arch issue that is normally benign but not in this case (2), or an OS kernel bug that is just a crazy kernel bug (3).

Yesterday morning a team of the customer's engineers and I started to investigate all three in parallel, trying to find proof for each. One engineer was developing an OS driver that should detect (1). I was gathering more data to prove (2). Another engineer was gathering more data so that I could formulate a question about (3) for the OS kernel team. Yet another engineer was running all the required tests on multiple SUTs (systems under test), and a project manager was bringing me sweets, fruit and fresh juice, and communicating with stakeholders.

By lunch, I was done with half of (2). This cause did indeed degrade performance, but only by ~11%, not 5x. The other half could have contributed ~20% more at most, and that was it.
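A quick back-of-the-envelope check shows why (2) could not be the whole story. This is only a sketch: the post does not say whether the ~11% and ~20% stack multiplicatively or additively, so both are computed, and the function and variable names are made up for illustration.

```python
def combined_slowdown(parts, multiplicative=True):
    """Combine fractional slowdowns (0.11 means 11% slower) into one factor."""
    if multiplicative:
        factor = 1.0
        for p in parts:
            factor *= 1.0 + p
        return factor
    return 1.0 + sum(parts)


observed = 5.0                 # the reported regression, "5x"
hypothesis_2 = [0.11, 0.20]    # measured half, plus upper bound for the other half

# Assumption: the two halves compound (1.11 * 1.20).
worst_case = combined_slowdown(hypothesis_2)
# Alternative assumption: they simply add up (1 + 0.11 + 0.20).
additive = combined_slowdown(hypothesis_2, multiplicative=False)

# Factor still left unexplained even if (2) is fully at its upper bound.
unexplained = observed / worst_case

print(f"worst case from (2): {worst_case:.2f}x, additive: {additive:.2f}x")
print(f"still unexplained: {unexplained:.2f}x")
```

Either way (2) tops out at roughly 1.3x, so most of the 5x had to come from somewhere else.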

I did not expect such a fast turnaround, but the device driver that checked for (1) was ready by 4PM. We ran it and found that (1) was true, and was caused by (3). We discussed a prospective fix, and when we reached a conclusion I estimated that developing the fix would have taken me ~1 full work day. The engineer who developed the tool to check (1) opened the tool's source code, added ~100 more lines of C code in 20 minutes, compiled, got a compiler error, fixed it, compiled, ran, got a runtime error, fixed it, compiled, ran, and the issue was resolved! It took less than an hour. (Others called him a genius, and I tend to agree: I meet coders of this caliber only once or twice a year.)

But Israelis! Experts in bargaining. It was 5PM and the issue was resolved. So they told me: by the way, we have another issue, not as critical, but let's solve it too. Reluctantly I ran VTune, enabling, just in case, its newest feature that collects some other data besides the normal PMU events. And it worked: the first run showed the root cause of another strange issue that appears quite rarely. (This has yet to be confirmed with other tools.) Yesterday was just a lucky day!
