Instruction prefetch
Mar. 16th, 2015 11:23 pm
What if I need to count from 0 to 10000000 in EBX, using code like this:
MOV EBX, 0
ADD EBX, 1
.. repeat 10000000 times, no loops :)
What would the CPI be?

3 clocks per instruction on my favorite wearable Linux workstation, and not much better on any other x86 CPU I tested.
Why is the CPI so high? (Of course I am not expecting anything like 0.3, because there is no ILP in the sequence: only the adder execution unit is used, and every ADD depends on the result of the previous one :)
However, profiling showed that most of the stalls are due to L1I misses. If this were a data access pattern, the hardware prefetcher would pull everything into L1D in time. It did not quite work with code.
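A rough sketch of the arithmetic behind those misses. The sizes are assumptions, not measurements from the post: ADD EBX, 1 is taken in its 3-byte form (83 C3 01), cache lines are 64 bytes, and a 32 KiB L1I is assumed as a typical size. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes (not from the post); check them against your assembler and CPU. */
static const size_t ADD_EBX_BYTES    = 3;          /* add ebx, 1 -> 83 C3 01 */
static const size_t CACHE_LINE_BYTES = 64;
static const size_t ASSUMED_L1I_BYTES = 32 * 1024; /* typical, but CPU-specific */

/* Total bytes of straight-line code for n copies of one instruction. */
static size_t footprint_bytes(size_t n_insns, size_t insn_bytes)
{
    return n_insns * insn_bytes;
}

/* Number of cache lines that many bytes span, rounding up. */
static size_t footprint_lines(size_t bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}

/* How many times larger than the assumed L1I the code is (rounded down). */
static size_t times_l1i(size_t bytes)
{
    return bytes / ASSUMED_L1I_BYTES;
}
```

Under these assumptions, footprint_bytes(10000000, ADD_EBX_BYTES) is 30000000 bytes, which is 468750 cache lines and several hundred times the assumed L1I. Each line holds only about 21 ADDs and is never revisited, so the fetch unit needs a fresh line every couple of dozen instructions.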
What if, every 2 cache lines, I add a PREFETCHT2 on EIP+50*64?
CPI goes down to 1.5! No more i-cache misses; the remaining stalls are due to RESOURCE_STALLS.ANY :)
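One way such a sequence could be generated, as a minimal sketch and not necessarily how the original test was built. It assumes 64-bit mode with RIP-relative addressing (32-bit mode has no EIP-relative form), GNU as Intel syntax, and the byte sizes noted in the comments. The function emit_counter and its constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed encodings; verify against your assembler's listing. */
static const size_t MOV_BYTES      = 5;  /* mov ebx, imm32: BB + imm32 */
static const size_t ADD_BYTES      = 3;  /* add ebx, imm8: 83 C3 ib */
static const size_t PREFETCH_BYTES = 7;  /* prefetcht2 [rip+disp32]: 0F 18 1D + disp32 */
static const size_t LINE_BYTES     = 64;
static const size_t STRIDE_LINES   = 2;  /* one prefetch every 2 cache lines */
static const size_t AHEAD_LINES    = 50; /* prefetch 50 lines ahead */

/*
 * Write the straight-line counter to `out` (pass NULL to only count).
 * Returns the number of prefetch instructions inserted.
 */
static size_t emit_counter(FILE *out, size_t n_adds)
{
    const size_t stride = STRIDE_LINES * LINE_BYTES;
    size_t since_prefetch = MOV_BYTES; /* code bytes emitted since the last prefetch */
    size_t prefetches = 0;

    assert(stride > 0);
    if (out)
        fputs(".intel_syntax noprefix\n    mov ebx, 0\n", out);

    for (size_t i = 0; i < n_adds; i++) {
        if (since_prefetch >= stride) {
            if (out)
                fprintf(out, "    prefetcht2 [rip + %zu]\n", AHEAD_LINES * LINE_BYTES);
            since_prefetch = PREFETCH_BYTES;
            prefetches++;
        }
        if (out)
            fputs("    add ebx, 1\n", out);
        since_prefetch += ADD_BYTES;
    }
    return prefetches;
}
```

Calling emit_counter(stdout, 10000000) would print the whole listing for an assembler. The stride and the 50-line lookahead are the knobs to tune: too short a distance and the line arrives late, too long and it may be evicted before it is fetched.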
Why bother? No compiler or workload emits this kind of code to count to 10000000. Well, if I write something like this:
#define REPEAT100(x) { x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;}
volatile int i;
REPEAT100(REPEAT100(REPEAT100(i+=0x1023)))
I'll get L1I misses. The C code is only slightly less artificial than the assembly above, but in some very rare cases this code prefetch technique might be useful.
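To see why the nested macro blows up the code size, here is a scaled-down sketch. REPEAT10 and count_expanded are hypothetical stand-ins, not part of the original test. Each nesting level multiplies the number of copies, because the preprocessor fully expands a macro argument before substituting it.

```c
/* Scaled-down stand-in for REPEAT100: ten copies of the statement x. */
#define REPEAT10(x) { x;x;x;x;x;x;x;x;x;x; }

/* Three nesting levels give 10 * 10 * 10 = 1000 straight-line increments. */
static int count_expanded(void)
{
    volatile int i = 0; /* volatile keeps the compiler from folding the adds */
    REPEAT10(REPEAT10(REPEAT10(i += 1)))
    return i;
}
```

count_expanded() returns 1000, one increment per expanded copy. With REPEAT100 the same nesting produces 1000000 copies of the volatile update, with no loop to bring the fetch address back to a line that is already cached.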