Aug. 23rd, 2018

Bug

Aug. 23rd, 2018 08:44 am
izard: (Default)
Right now I am sitting in an airport on the way home, thinking about a bug I tried to track down yesterday. I spent the whole day with a customer but still have little clue. I have evidence that the root cause is not in software alone, and not in hardware alone. So it must be an interesting combination.

The bug manifests as rare, random events in which a core stops for about 12 microseconds. A full processor trace simply shows something like:
...
100.000000000: add r10, rax
100.000000000: test eax, eax
100.000000000: jnz 0x100
100.000012345: add rs1, 0x1
100.000012345: rdtsc
...
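Gaps like this one can be pulled out of a trace dump mechanically. Below is a minimal sketch, assuming every trace line has the `seconds: instruction` shape of the excerpt above. The helper name `find_stalls` and the 1 us threshold are hypothetical, chosen only for illustration, and a real processor trace decoder will have its own output format.

```python
from decimal import Decimal, InvalidOperation

# Sample input copied from the excerpt above (format assumed, not a real decoder dump).
TRACE = """\
...
100.000000000: add r10, rax
100.000000000: test eax, eax
100.000000000: jnz 0x100
100.000012345: add rs1, 0x1
100.000012345: rdtsc
...
""".splitlines()


def find_stalls(lines, threshold_us=1):
    """Return (gap_us, last_insn_before_gap, first_insn_after_gap) tuples.

    Decimal is used so that nanosecond digits are not lost to float rounding.
    """
    threshold = Decimal(threshold_us) / Decimal(1_000_000)
    prev_ts = None
    prev_insn = None
    stalls = []
    for line in lines:
        ts_text, sep, insn = line.partition(": ")
        if not sep:
            continue  # skips "..." and any other line without a timestamp
        try:
            ts = Decimal(ts_text.strip())
        except InvalidOperation:
            continue  # not a numeric timestamp, ignore the line
        if prev_ts is not None and ts - prev_ts >= threshold:
            stalls.append(((ts - prev_ts) * 1_000_000, prev_insn, insn.strip()))
        prev_ts, prev_insn = ts, insn.strip()
    return stalls


stalls = find_stalls(TRACE)
for gap_us, before, after in stalls:
    print(f"{float(gap_us):.3f} us stall between '{before}' and '{after}'")
```

Grouping the results by the instruction before the gap would show quickly whether the stall really always follows the same jnz.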
There are many reasons why a core might stop for ~10 us (thermal, power, SMI, AVX2, turbo, C state, etc.). However, each of these causes leaves a trace (and I have seen none), and if one of them were responsible, the benchmark would be interrupted at a random place, not always at that jnz instruction.

The JNZ instruction itself cannot be the root cause either, because the cost of a branch mispredict is about 3 orders of magnitude lower than the delay we measure.
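A rough back-of-the-envelope check of that ratio. The 20-cycle mispredict penalty and the 3 GHz core clock are illustrative assumptions, not measurements from this system. Only the 12.345 us stall comes from the trace excerpt.

```python
import math

MISPREDICT_CYCLES = 20   # assumed typical penalty, not measured here
CORE_HZ = 3.0e9          # assumed core clock, not measured here
STALL_NS = 12_345        # gap seen in the trace excerpt

# Time lost to one mispredicted branch under the assumptions above.
mispredict_ns = MISPREDICT_CYCLES / CORE_HZ * 1e9

# How many times longer the observed stall is, and the same in orders of magnitude.
ratio = STALL_NS / mispredict_ns
orders = math.log10(ratio)

print(f"mispredict ~{mispredict_ns:.1f} ns, stall is ~{ratio:.0f}x longer "
      f"(~{orders:.1f} orders of magnitude)")
```

Even with a much more pessimistic penalty, the mispredict stays in the nanosecond range, far from the microsecond-scale stall.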

The issue happens more frequently on some CPUs, and on some cores within those CPUs, which points to a hardware root cause. However, when timer interrupts are active on that core, the issue happens and I see the interrupt handler in the trace. If they are stopped, I don't see the interrupt handler in the trace, and the issue still happens. But when I disable the timer interrupt on the core completely, the issue almost never happens. This points to software :)
