Mar. 16th, 2015

izard: (Default)
What if I need to count from 0 to 10000000 in EBX, using code like this:
MOV EBX, 0
ADD EBX, 1
.. repeat 10000000 times, no loops :)
What would the CPI be?

3 clocks per instruction on my favorite wearable Linux workstation, and not much better on any other x86 CPU I tested.
Why is the CPI so high? (Of course I am not expecting anything like 0.3, because there is no ILP in the sequence: only the adder execution unit is used, and every instruction depends on the result of the previous one :)
However, profiling showed that most of the stalls are due to L1I misses. If this were a data access pattern, the hardware prefetcher would bring everything into L1D. That did not quite work with code.
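
A rough size estimate shows why the instruction cache struggles here. The sketch below assumes the assembler picks the 3-byte imm8 encoding of ADD EBX, 1 (bytes 83 C3 01) and that cache lines are 64 bytes; the helper names `code_bytes` and `cache_lines` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ADD EBX, 1 is encoded in its short form (83 C3 01), 3 bytes,
   and the CPU uses 64-byte cache lines. */
enum { ADD_EBX_IMM8_BYTES = 3, CACHE_LINE_BYTES = 64 };

/* Total size in bytes of n back-to-back instructions of the given length. */
uint64_t code_bytes(uint64_t n_insns, uint64_t insn_bytes) {
    return n_insns * insn_bytes;
}

/* Number of cache lines needed to hold `bytes` bytes, rounded up. */
uint64_t cache_lines(uint64_t bytes, uint64_t line_bytes) {
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}
```

Under these assumptions, `code_bytes(10000000, ADD_EBX_IMM8_BYTES)` gives 30,000,000 bytes, and `cache_lines(30000000, CACHE_LINE_BYTES)` gives 468,750 lines, each executed exactly once. That is far more straight-line code than an L1 instruction cache can hold, so every line has to be fetched fresh.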

What if, every 2 cache lines, I add a PREFETCHT2 on EIP+50*64, i.e. 50 cache lines ahead of the current instruction?
CPI drops to 1.5! No more i-cache misses; the remaining stalls are due to RESOURCE_STALLS.ANY :)
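
Since nobody wants to type ten million lines by hand, here is a minimal sketch of a generator for such a listing. It is not the code used for the measurements above: the function name `emit_counter`, its parameters, and the NASM-style `[$ + offset]` operand (NASM's `$` is the address of the current line) are assumptions, and 64-bit code would probably want `[rel $ + offset]` instead.

```c
#include <assert.h>
#include <stdio.h>

/* Writes NASM-style source for MOV EBX, 0 followed by n_adds copies of
   ADD EBX, 1. Before every group of adds_per_prefetch ADDs it emits a
   PREFETCHT2 aimed distance_bytes ahead of the current address.
   Returns the number of prefetch instructions emitted. */
long emit_counter(FILE *out, long n_adds, long adds_per_prefetch,
                  long distance_bytes) {
    assert(out != NULL && adds_per_prefetch > 0);
    long prefetches = 0;
    fprintf(out, "    mov ebx, 0\n");
    for (long i = 0; i < n_adds; i++) {
        if (i % adds_per_prefetch == 0) {
            /* Hypothetical operand syntax: current address plus offset. */
            fprintf(out, "    prefetcht2 [$ + %ld]\n", distance_bytes);
            prefetches++;
        }
        fprintf(out, "    add ebx, 1\n");
    }
    return prefetches;
}
```

Something like `emit_counter(stdout, 10000000, 40, 50 * 64)` would place one prefetch per roughly two 64-byte cache lines, assuming 3-byte ADDs. Prefetching past the end of the code should be harmless, since the prefetch instructions are only hints and do not fault on bad addresses.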

Why bother? No compiler or real workload emits this kind of code to count to 10000000. Well, if I write something like
#define REPEAT100(x) { x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x; \
x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;x;}
volatile int i;
REPEAT100(REPEAT100(REPEAT100(i+=0x1023)))
I'll get L1I misses. Admittedly, the C code is only slightly less artificial than the assembly code above, but in some very rare cases this code prefetch technique might be useful.
