izard

What if I need to count from 0 to 10000000 in EBX, using code like this:
MOV EBX, 0
ADD EBX, 1
... repeated 10000000 times, no loops :)
What would the CPI be?

3 clocks per instruction on my favorite wearable Linux workstation, and not much better on any other x86 CPU I tested.
Why is the CPI so high? (Of course I am not expecting anything like 0.3, because there is no ILP in the sequence: only the adder EU is used and every operation is a hazard :)
However, profiling showed that most of the stalls are due to L1I misses. If this were a data access pattern, the hardware prefetcher would bring everything into L1D. That did not quite work for code.
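
To see why L1I misses dominate, it helps to work out the size of the code. The sketch below is only a back-of-the-envelope calculation and rests on two assumptions that are not stated in the post: that ADD EBX, 1 is encoded in 3 bytes (the short form with an 8-bit immediate) and that cache lines are 64 bytes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values, not taken from the post: a 3-byte encoding for
   ADD EBX, 1 (8-bit immediate form) and 64-byte cache lines. */
enum { ADD_BYTES = 3, CACHE_LINE_BYTES = 64 };

/* Total bytes of straight-line code for `count` ADD instructions. */
long code_bytes(long count) {
    assert(count >= 0);
    return count * ADD_BYTES;
}

/* Number of cache lines that code occupies, rounded up. */
long code_lines(long count) {
    return (code_bytes(count) + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Prints the footprint so it can be compared with an L1I size. */
void print_footprint(long count) {
    printf("%ld bytes of code, %ld cache lines\n",
           code_bytes(count), code_lines(count));
}
```

Under these assumptions, print_footprint(10000000) reports 30000000 bytes and 468750 cache lines. That is far more than an L1I of a few tens of kilobytes can hold, and since the code never loops, every one of those lines is fetched exactly once.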

What if, every 2 cache lines, I add a PREFETCHT2 on EIP+50*64?
The CPI gets down to 1.5! No I-cache misses; the remaining stalls are due to RESOURCE_STALLS.ANY :)
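
Nobody types ten million lines by hand, so here is a rough sketch of a generator for such a file. It is not the code used for the measurements above. It assumes x86-64 (so RIP-relative addressing stands in for EIP), GNU assembler Intel syntax, a 3-byte ADD and a 7-byte prefetch, and uses PREFETCHT2, the x86 mnemonic for the T2 prefetch hint. The names emit_counter and count_up are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

enum {
    /* Assumed encodings: 40 ADDs * 3 bytes + one 7-byte prefetch is
       127 bytes, i.e. roughly one prefetch every 2 cache lines. */
    ADDS_PER_PREFETCH = 40,
    /* Distance from the post: 50 cache lines of 64 bytes ahead. */
    PREFETCH_AHEAD = 50 * 64
};

/* Writes an assembly function with `count` dependent ADDs and a code
   prefetch before every group of ADDS_PER_PREFETCH of them.
   Returns the number of prefetch instructions emitted. */
long emit_counter(FILE *out, long count) {
    long prefetches = 0;
    assert(out != NULL && count >= 0);

    fputs(".intel_syntax noprefix\n.text\n.globl count_up\ncount_up:\n", out);
    fputs("    push rbx\n    mov ebx, 0\n", out);   /* RBX is callee-saved */

    for (long i = 0; i < count; i++) {
        if (i % ADDS_PER_PREFETCH == 0) {
            fprintf(out, "    prefetcht2 [rip + %d]\n", PREFETCH_AHEAD);
            prefetches++;
        }
        fputs("    add ebx, 1\n", out);
    }

    /* Return the final count in EAX and restore RBX. */
    fputs("    mov eax, ebx\n    pop rbx\n    ret\n", out);
    return prefetches;
}
```

Calling emit_counter(stdout, 10000000) and redirecting the output to a .s file should give something an assembler can build. The last few prefetches point past the end of the function, which is harmless as far as I know, since a prefetch is only a hint.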

Why bother? No compiler or workload emits this kind of code to count to 10000000. Well, if I write something like
#define REPEAT100(x) { x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;}
volatile int i;
REPEAT100(REPEAT100(REPEAT100(i+=0x1023)))
I'll get L1I misses. The C code is only slightly less artificial than the assembly code above, but in some very rare cases this code prefetch technique might be useful.
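
For anyone who wants to try the C variant, here is a minimal compilable sketch of it. It differs from the fragment above in a few ways that are my own choices: the macro is built from a hypothetical REPEAT10 helper so the repetition count is easy to check by eye, only two levels are nested (10000 statements instead of a million) to keep compile time short, and the function name run_unrolled is made up.

```c
#include <assert.h>

/* Hypothetical helper: ten copies of statement x. */
#define REPEAT10(x) x; x; x; x; x; x; x; x; x; x;
/* One hundred copies, wrapped in a block as in the original macro. */
#define REPEAT100(x) { REPEAT10(REPEAT10(x)) }

enum { STEP = 0x1023, REPEATS = 100 * 100 };

/* volatile keeps the compiler from folding the additions together,
   so each one stays a separate piece of straight-line code. */
static volatile int i;

/* Runs REPEATS unrolled additions and returns the final value of i. */
int run_unrolled(void) {
    i = 0;
    REPEAT100(REPEAT100(i += STEP))
    /* Sanity check that the macros expanded to REPEATS statements. */
    assert(i == REPEATS * STEP);
    return i;
}
```

run_unrolled() returns 41310000, which is 10000 * 0x1023. Wrapping it in one more REPEAT100 gets back to the million repetitions of the original, at the cost of a much longer compile.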

Mar. 16th, 2015

Instruction prefetch
