izard: (Default)
[personal profile] izard
I am trying to optimize some assembly code that is around 10 lines long.
Technical, so it goes under the cut.
The input is two 16-bit values: in and key. The output is a single 16-bit value.

Original code:
4 lookups on 2 small tables that fit in L1 cache, and 8 logical/bit ops with a WAR dependency. The total latency of the logical/bit ops is 10 cycles. It runs so many times that I can assume no TLB misses and only capacity cache misses.

My code:
2 lookups on a bigger table + 1 op with a latency of 1 cycle. 1/4 of the bigger table fits in L1 cache; the rest is in L2.

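So the original is 4*4+10=26 cycles (4 lookups at 4 cycles per L1 hit, plus 10 cycles of ops).
My version should be 2*(4*(1/4) + 11*(3/4))+1=19.5 cycles (an L1 hit at 4 cycles a quarter of the time, an L2 hit at 11 cycles otherwise).
Probably worth trying; I will see on Monday whether I really get a 30% speedup.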
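
The estimate above can be reproduced with a tiny cost model. This is only a sketch of the arithmetic, not a measurement: the names `L1_HIT`, `L2_HIT`, `expected_load_latency` and `serial_cost` are made up for illustration, the 4 and 11 cycle latencies are the ones implied by the formulas, and the model assumes every lookup is fully serialized with no overlap between loads.

```python
# Assumed latencies in cycles, taken from the back-of-envelope formulas.
L1_HIT = 4
L2_HIT = 11


def expected_load_latency(l1_fraction):
    """Average latency of one lookup when l1_fraction of the table is in L1
    and the rest is in L2 (uniform access pattern assumed)."""
    return l1_fraction * L1_HIT + (1.0 - l1_fraction) * L2_HIT


def serial_cost(lookups, l1_fraction, alu_cycles):
    """Total cycles if lookups run back to back, followed by the ALU ops."""
    return lookups * expected_load_latency(l1_fraction) + alu_cycles


original = serial_cost(lookups=4, l1_fraction=1.0, alu_cycles=10)
proposed = serial_cost(lookups=2, l1_fraction=0.25, alu_cycles=1)

print("original:", original)
print("proposed:", proposed)
print("ratio:", original / proposed)
```

The ratio is 26/19.5, so roughly a third more throughput, or a quarter fewer cycles per call, if the assumptions hold. Changing `l1_fraction` shows how sensitive the result is to the share of the big table that stays in L1.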

P.S.
Maybe I'll post a similar calculation about using lookups on a 32-bit table for the same function. Quite surprisingly, the result is very similar when prefetches are used correctly. It is still not very efficient overall, as 8GB of RAM is consumed.
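
The 8GB figure is consistent with a single table indexed by both 16-bit inputs at once, holding one 16-bit result per entry. That layout is an assumption here, and `table_bytes` is a hypothetical helper used only to check the arithmetic:

```python
def table_bytes(index_bits, entry_bytes):
    """Size in bytes of a direct-mapped lookup table with 2**index_bits entries."""
    return (1 << index_bits) * entry_bytes


# Assumed layout: index = 16 bits of `in` plus 16 bits of `key`,
# each entry is one 16-bit (2-byte) output value.
full_table = table_bytes(index_bits=32, entry_bytes=2)

print(full_table, "bytes")
print(full_table / 2**30, "GiB")
```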
