izard: (Default)
[personal profile] izard
What if I need to count from 0 to 10000000 in EBX, using code like this:
MOV EBX, 0
ADD EBX, 1
.. repeat 10000000 times, no loops :)
What would the CPI be?

3 clocks per instruction on my favorite wearable Linux workstation, and not much better on any other x86 CPU I tested.
Why is the CPI so high? (Of course, I was not expecting anything like 0.3, because there is no ILP in the sequence: only the adder execution unit is used, and every instruction depends on the result of the previous one :)
However, profiling showed that most of the stalls are due to L1I misses. If this were a data access pattern, the hardware prefetcher would bring everything into L1D. That did not quite work with code.
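To put the L1I misses in perspective, here is a rough footprint calculation in C. It is only a sketch under assumptions that are not in the post: that the assembler picks the 3-byte encoding of `ADD EBX, 1` (opcode 83 /0 with an 8-bit immediate) and that cache lines are 64 bytes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumptions, not measured on the author's machine. */
enum { ADD_INSN_BYTES = 3, CACHE_LINE_BYTES = 64 };

/* Size in bytes of a straight-line run of `count` ADD instructions. */
static long straight_line_code_bytes(long count) {
    assert(count >= 0);
    return count * ADD_INSN_BYTES;
}

/* Number of cache lines needed to hold `bytes` bytes of code (rounded up). */
static long cache_lines_touched(long bytes) {
    assert(bytes >= 0);
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Prints the footprint for a given instruction count. */
static void print_footprint(long count) {
    long bytes = straight_line_code_bytes(count);
    printf("%ld instructions: %ld bytes, %ld cache lines\n",
           count, bytes, cache_lines_touched(bytes));
}
```

Calling `print_footprint(10000000L)` under these assumptions gives about 30 MB of code, each line of which is executed exactly once, so a cache that only fetches on demand never gets a hit on a line it has not already missed on.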

What if, every 2 cache lines, I add a PREFETCHT2 on EIP+50*64, i.e. 50 cache lines ahead?
The CPI gets down to 1.5! No more i-cache misses; the remaining stalls show up as RESOURCE_STALLS.ANY :)
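One way to produce such a sequence is to generate the assembly text with a small C program. The sketch below is not the generator used for the measurements above; it assumes NASM-style syntax (where `$` stands for the current address), a 3-byte `ADD`, and 64-byte cache lines, and it ignores the bytes taken by the prefetch instructions themselves, so "every 2 cache lines" is only approximate. `emit_sequence` and `adds_per_prefetch` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Assumptions: 64-byte lines, 3-byte ADD, prefetch 50 lines ahead
   once per 2 cache lines of ADDs, as described in the text. */
enum { LINE_BYTES = 64, ADD_BYTES = 3, LINES_PER_PREFETCH = 2, DISTANCE_LINES = 50 };

/* Number of ADDs that roughly fill LINES_PER_PREFETCH cache lines. */
static int adds_per_prefetch(void) {
    return LINES_PER_PREFETCH * LINE_BYTES / ADD_BYTES;
}

/* Writes MOV + total_adds ADDs to `out`, inserting one PREFETCHT2 before
   each group of ADDs. Returns the number of prefetches emitted. */
static long emit_sequence(FILE *out, long total_adds) {
    assert(out != NULL && total_adds >= 0);
    const int group = adds_per_prefetch();
    long prefetches = 0;
    fprintf(out, "    mov ebx, 0\n");
    for (long n = 0; n < total_adds; n++) {
        if (n % group == 0) {
            /* Assumed NASM syntax: address = current position + distance. */
            fprintf(out, "    prefetcht2 [$ + %d]\n", DISTANCE_LINES * LINE_BYTES);
            prefetches++;
        }
        fprintf(out, "    add ebx, 1\n");
    }
    return prefetches;
}
```

For example, `emit_sequence(stdout, 10000000L)` would print the whole listing, which can then be wrapped in a section and a label and assembled. The prefetch distance and spacing are the knobs to tune; the values here are just the ones mentioned above.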

Why bother? No compiler or workload emits this kind of code to count to 10000000. Well, if I write something like
#define REPEAT100(x) { x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;}
volatile int i;
REPEAT100(REPEAT100(REPEAT100(i+=0x1023)))
I'll get L1I misses. You could say this C code is only slightly less artificial than the assembly code above, and indeed it is, but in some very rare cases this code prefetch technique might be useful.