izard

Right now I am sitting in an airport on the way home, thinking about a bug I tried to find yesterday. I spent the whole day with a customer but still have little clue. I have evidence that the root cause is not purely in software and not purely in hardware, so it must be an interesting combination of the two.

The bug manifests as rare, random events in which a core stops for about 12 microseconds. The full processor trace simply shows something like:
...
100.000000000: add r10, rax
100.000000000: test eax, eax
100.000000000: jnz 0x100
100.000012345: add rs1, 0x1
100.000012345: rdtsc
...
There are many reasons why a core might stop for ~10 us (thermal, power, SMI, AVX2, turbo, C-state, etc.). However, each of these causes leaves a trace, and I have seen none. Besides, if one of them were responsible, the benchmark would be interrupted at a random place, not always at that jnz instruction.
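To check the "always at that jnz" observation over a long trace, one could scan the timestamps for gaps and tally which instruction precedes each gap. The following is only a minimal sketch: it assumes the trace has been exported as text lines in the same `seconds.nanoseconds: instruction` form as the excerpt above, and the names `find_stalls`, `stall_sites` and the 10 us threshold are made up for illustration, not taken from any real tool.

```python
import re
from collections import Counter

# ASSUMPTION: each record looks like "<sec>.<9 digits of ns>: <instruction>",
# modeled on the excerpt in the post. A real trace decoder will differ.
LINE_RE = re.compile(r"^\s*(\d+)\.(\d{9}):\s*(.+?)\s*$")


def find_stalls(lines, threshold_ns=10_000):
    """Return (gap_ns, instruction_before_gap) for every gap >= threshold_ns."""
    stalls = []
    prev_ns = prev_insn = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." and anything that is not a trace record
        # Integer arithmetic avoids float rounding on nanosecond timestamps.
        ns = int(m.group(1)) * 1_000_000_000 + int(m.group(2))
        if prev_ns is not None and ns - prev_ns >= threshold_ns:
            stalls.append((ns - prev_ns, prev_insn))
        prev_ns, prev_insn = ns, m.group(3)
    return stalls


def stall_sites(stalls):
    """Count stalls by the mnemonic of the instruction that preceded them."""
    return Counter(insn.split()[0] for _, insn in stalls)


trace = """\
...
100.000000000: add r10, rax
100.000000000: test eax, eax
100.000000000: jnz 0x100
100.000012345: add rs1, 0x1
100.000012345: rdtsc
...
""".splitlines()

stalls = find_stalls(trace)
print(stalls)
print(stall_sites(stalls))
```

If the stalls came from something asynchronous such as an SMI or a thermal event, the tally would be expected to spread across many instructions rather than pile up on a single one.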

The jnz instruction itself cannot be the root cause either, because a branch mispredict costs about three orders of magnitude less time than the delay we measure.
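The "three orders of magnitude" figure can be sanity-checked with rough arithmetic. The numbers below are assumptions rather than measurements from this system: a mispredict penalty of about 20 cycles and a 3 GHz core clock are typical ballpark values, and the post states neither.

```python
import math

stall_ns = 12_345      # gap taken from the trace excerpt
penalty_cycles = 20    # ASSUMPTION: ballpark branch mispredict penalty
core_ghz = 3.0         # ASSUMPTION: core clock, not stated in the post

penalty_ns = penalty_cycles / core_ghz  # cycles / (cycles per ns)
ratio = stall_ns / penalty_ns
orders = math.log10(ratio)

print(f"mispredict ~{penalty_ns:.1f} ns, stall is ~{ratio:.0f}x longer "
      f"(~{orders:.1f} orders of magnitude)")
```

Even if the assumed penalty were off by a factor of two or three in either direction, the gap would still be far too large to explain by a mispredicted branch.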

The issue happens more frequently on some CPUs, and on some cores within those CPUs, which points to a hardware root cause. However, when timer interrupts are active on that core, the issue happens and I see the interrupt handler in the trace. If they are stopped, I don't see the interrupt handler in the trace, and the issue still happens. But when I disable the timer interrupt on the core completely, the issue almost never happens. This points to software :)


Aug. 23rd, 2018

Bug
