izard: (Default)
[personal profile] izard
I am trying to optimize some assembly code that is around 10 lines long.
Technical, goes under cut
The input is two 16-bit values: in and key. The output is a single 16-bit value.

Original code:
4 lookups in 2 small tables that fit in the L1 cache, plus 8 logical/bit ops with a WAR dependency. The total latency of the logical/bit ops is 10 cycles. The code runs so many times that I can assume no TLB misses and only capacity cache misses.

My code:
2 lookups in a bigger table + 1 op with a latency of 1 cycle. 1/4 of the bigger table fits in the L1 cache; the rest is in L2.
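To make the trade-off concrete, here is a toy sketch in Python of the general idea of merging small tables into one bigger table. It is not the actual function from this post: the table contents (LO, HI), the way the results are combined, and all function names are hypothetical and chosen only so that the lookup counts match (4 lookups in 2 small tables versus 2 lookups in one 65,536-entry table plus 1 op).

```python
# Hypothetical small tables, 256 entries each. The real tables are not
# given in the post; these contents are arbitrary placeholders.
LO = [(i * 167 + 13) & 0xFF for i in range(256)]
HI = [(i * 59 + 101) & 0xFF for i in range(256)]


def substitute_small(x):
    """Map a 16-bit value using 2 small-table lookups plus mask/shift/OR ops."""
    return LO[x & 0xFF] | (HI[(x >> 8) & 0xFF] << 8)


def f_original(inp, key):
    """Toy 'original': 4 lookups in total (2 per input) and a chain of bit ops."""
    return substitute_small(inp) ^ substitute_small(key)


# Merged table: one precomputed 16-bit result for every possible 16-bit input.
BIG = [substitute_small(x) for x in range(1 << 16)]


def f_merged(inp, key):
    """Toy 'optimized': 2 lookups in the bigger table and a single XOR."""
    return BIG[inp] ^ BIG[key]
```

Both versions return the same value for every input pair; the merged one trades the chain of bit ops for a larger memory footprint, which is exactly where the cache-miss cost comes from.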

So the original is 4*4+10=26 cycles (4 cycles per L1 hit).
My version should be 2*(4*(1/4) + 11*(3/4))+1=19.5 cycles (11 cycles per L2 hit, with 3/4 of the lookups going to L2).
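The arithmetic above can be reproduced with a small Python sketch. It assumes what the two formulas imply: 4 cycles per L1 hit, 11 cycles per L2 hit, lookups that are fully serialized with the ops, and an L1 hit rate equal to the fraction of the table that fits in L1. The function names are mine, for illustration only.

```python
L1_LATENCY = 4   # cycles per L1 hit, as implied by the 4*4 term
L2_LATENCY = 11  # cycles per L2 hit, as implied by the 11*(3/4) term


def expected_lookup_latency(l1_hit_fraction):
    """Average cycles per lookup when misses in L1 are served from L2."""
    return L1_LATENCY * l1_hit_fraction + L2_LATENCY * (1.0 - l1_hit_fraction)


def total_cycles(lookups, l1_hit_fraction, op_cycles):
    """Serial estimate: every lookup and the ops sit on one dependency chain."""
    return lookups * expected_lookup_latency(l1_hit_fraction) + op_cycles


original = total_cycles(lookups=4, l1_hit_fraction=1.0, op_cycles=10)
mine = total_cycles(lookups=2, l1_hit_fraction=0.25, op_cycles=1)
speedup = original / mine

print(original, mine, round(speedup, 2))
```

This is only a latency model: it ignores out-of-order overlap between independent lookups, so the measured numbers may well differ.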
Probably worth trying; I'll see on Monday whether I really get the roughly 30% speedup (26/19.5 ≈ 1.33x).

P.S.
Maybe I'll post a similar calculation about using lookups in a 32-bit table for the same function. Quite surprisingly, the result is very similar when prefetches are used correctly. It is still not very efficient overall, as 8GB of RAM is consumed.
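The memory figures are easy to sanity-check. The sketch below assumes that the 32-bit table is indexed by in and key concatenated and that every entry is a 2-byte (16-bit) output; that layout is my reading of the 8GB figure, not something stated above. The same helper applied to a 16-bit index gives a table whose quarter is 32 KiB, a common L1 data cache size.

```python
KIB = 1024
GIB = 1024 ** 3


def table_bytes(index_bits, entry_bytes=2):
    """Size of a flat lookup table with 2**index_bits entries."""
    return (1 << index_bits) * entry_bytes


table16 = table_bytes(16)  # table indexed by one 16-bit value
table32 = table_bytes(32)  # table indexed by in and key together (assumed)

print(table16 // KIB, "KiB")
print(table32 // GIB, "GiB")
```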
