Curious bug
Jan. 21st, 2013 11:11 am
Just found the root cause of an issue that has been entertaining me over the past few work weeks.
I was benchmarking some code on SNB (Sandy Bridge), IVB (Ivy Bridge) and HSW (Haswell). For the comparison to be meaningful, the platforms have to be very similar. That was easy for SNB/IVB: I used the same chipset, the same CPU clock multiplier, the same memory, and very similar motherboards.
Software setup: the benchmarks were small and single-threaded. To rule out disturbances and to model the customer's RTOS, I disabled every possible cause of jitter in h/w, then offlined a core in Linux (echo 0 >/sys/.../cpu/N/online), then started the benchmark on that core. To run something on the offlined core, I needed its APIC CPU ID.
With Hyperthreading on, both the SNB and IVB platforms worked the same way: I offlined core N and used APIC CPU ID N. However, letting the OS schedule something on the sibling hyperthread of the core under test is not a good idea. There are two ways around that: offline the sibling hyperthread too, or disable Hyperthreading in the BIOS. Offlining the sibling hyperthread worked, but disabling Hyperthreading in the BIOS worked on SNB and not on IVB. Disabling HT is cleaner, but I had to resort to the workaround until I found out why IVB and SNB behaved differently.
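For the "offline the sibling too" workaround, you first need to know which logical CPUs share a core. Linux exposes this per CPU in sysfs as a `thread_siblings_list` file in the CPU's `topology` directory, formatted as a CPU list such as `2,6` or `0-1`. Here is a rough Python sketch that parses such a list. The function names are mine, and the caller is assumed to read the file and pass in its text:

```python
def parse_cpu_list(text):
    """Parse a kernel-style CPU list such as "0,4" or "0-3,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_to_offline(cpu_under_test, siblings_text):
    """Return the logical CPUs that share a core with cpu_under_test.

    siblings_text is assumed to be the content of that CPU's
    thread_siblings_list file; the CPU under test itself is excluded.
    """
    return [c for c in parse_cpu_list(siblings_text) if c != cpu_under_test]
```

With siblings text `2,6` and CPU 2 under test, this returns just CPU 6, which is the one to offline in addition.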
What was the reason? When I disabled HT in the BIOS on the SNB platform, the APIC IDs shrank from [0-7] to [0-3], whereas on IVB they became [0,2,4,6]. Both behaviors are legitimate. Assuming a one-to-one mapping between Linux core numbers and APIC IDs is wrong, and it only worked for me by accident.
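One way to avoid the assumption is to look the APIC ID up instead of guessing it. On x86 Linux, /proc/cpuinfo has a `processor` line and an `apicid` line for each logical CPU. Below is a minimal sketch of building the mapping from that text. The function name is made up for illustration, and this is not the tooling I actually used for the benchmarks:

```python
def apic_ids_by_cpu(cpuinfo_text):
    """Map Linux logical CPU number -> APIC ID from /proc/cpuinfo-style text.

    Assumes the x86 layout, where each CPU block has a "processor" line
    followed later by an "apicid" line. "initial apicid" is a different
    key and is ignored on purpose.
    """
    mapping = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU blocks
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "apicid" and current is not None:
            mapping[current] = int(value)
    return mapping
```

Feed it the contents of /proc/cpuinfo. Assuming Linux numbers the four cores 0 to 3, on an HT-off machine that behaves like my IVB box the result would look like {0: 0, 1: 2, 2: 4, 3: 6}, while the SNB-style numbering would give {0: 0, 1: 1, 2: 2, 3: 3}.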