Right now I am sitting in an airport on the way home, thinking about a bug I tried to find yesterday. I spent the whole day with a customer but still have little clue. I have evidence that the root cause is not in software, and not in hardware, so it must be an interesting combination of the two.
The bug manifests as rare, random events in which a core stops for about 12 microseconds. The full processor trace simply shows something like:
...
100.000000000: add r10, rax
100.000000000: test eax, eax
100.000000000: jnz 0x100
100.000012345: add rs1, 0x1
100.000012345: rdtsc
...
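Finding such a gap by eye in a long trace is tedious, so here is a minimal sketch of how one might scan for it. It assumes the trace has been exported as text lines of the form `timestamp: instruction`, exactly as in the excerpt above; the names `parse_ns`, `find_stalls` and the 1 us threshold are my own illustration, not part of any tracing tool.

```python
from typing import Iterable, Iterator, NamedTuple


class Stall(NamedTuple):
    gap_ns: int        # time between the two instructions, in nanoseconds
    last_before: str   # last instruction retired before the gap
    first_after: str   # first instruction retired after the gap


def parse_ns(stamp: str) -> int:
    """Convert 'seconds.fraction' to integer nanoseconds (avoids float rounding)."""
    seconds, _, frac = stamp.partition(".")
    return int(seconds) * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def find_stalls(lines: Iterable[str], threshold_ns: int = 1_000) -> Iterator[Stall]:
    """Yield every pair of consecutive instructions separated by more than threshold_ns."""
    prev = None  # (timestamp_ns, instruction) of the previous trace line
    for line in lines:
        stamp, sep, insn = line.partition(": ")
        if not sep:
            continue  # elision markers such as "..." carry no timestamp
        try:
            now = parse_ns(stamp.strip())
        except ValueError:
            continue  # not a trace line in the assumed format
        insn = insn.strip()
        if prev is not None and now - prev[0] > threshold_ns:
            yield Stall(now - prev[0], prev[1], insn)
        prev = (now, insn)


# The excerpt from above, used as sample input.
trace = [
    "...",
    "100.000000000: add r10, rax",
    "100.000000000: test eax, eax",
    "100.000000000: jnz 0x100",
    "100.000012345: add rs1, 0x1",
    "100.000012345: rdtsc",
    "...",
]

stalls = list(find_stalls(trace))
for s in stalls:
    print(f"{s.gap_ns} ns gap after '{s.last_before}', resumed at '{s.first_after}'")
```

Grouping the results by `last_before` over a long run would show whether the stall really always follows the same instruction.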
There are many reasons why a core might stop for ~10 us (thermal, power, SMI, AVX2, turbo, C-state, etc.). However, each of these causes leaves a trace (and I have seen none), and if one of them were responsible, the benchmark would be interrupted at a random place, not always at that jnz instruction.
The JNZ instruction itself cannot be the root cause either, because the cost of a branch mispredict is about three orders of magnitude lower than the delay we measure.
The issue happens more frequently on some CPUs, and on some cores within these CPUs, which points to a hardware root cause. However, when timer interrupts are active on that core, the issue happens and I see the interrupt handler in the trace. If they are stopped, I don't see the interrupt handler in the trace, and the issue still happens. But when I disable the timer interrupt on the core completely, the issue almost never happens. This points to software :)
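To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The mispredict penalty of 20 cycles and the 3 GHz clock are assumptions picked for illustration only; I have not measured either on the customer's machine. Only the 12345 ns gap comes from the trace.

```python
import math

STALL_NS = 12_345              # gap seen in the trace excerpt
ASSUMED_PENALTY_CYCLES = 20    # assumption: typical branch mispredict penalty
ASSUMED_CLOCK_GHZ = 3.0        # assumption: core clock, not taken from the trace

# At N GHz one cycle lasts 1/N nanoseconds.
penalty_ns = ASSUMED_PENALTY_CYCLES / ASSUMED_CLOCK_GHZ
ratio = STALL_NS / penalty_ns
orders = math.log10(ratio)

print(f"assumed mispredict penalty: {penalty_ns:.1f} ns")
print(f"stall is {ratio:.0f}x longer, about {orders:.1f} orders of magnitude")
```

Even if the real penalty were several times larger than assumed here, a single mispredict would still be far too short to explain the stall.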