Entry tags:
CPU cycles, bytes.
I am trying to optimize some assembly code that is around 10 lines long.
Technical, goes under cut
Input is two 16-bit values: in and key. Output is a single 16-bit value.
Original code:
4 lookups into 2 small tables that fit in the L1 cache, plus 8 logical/bit ops with a WAR dependency; the logical/bit ops add 10 cycles of latency in total. The function runs so many times that I can assume no TLB misses and only capacity cache misses. A rough sketch of this shape is below.
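A hypothetical C sketch of that structure (the real tables, constants, and bit ops are not shown in the post, so everything here is a stand-in): four byte-indexed lookups into two 256-entry tables, followed by a short chain of logical/shift ops mixing in the key.

#include <stdint.h>

/* Assumed small tables, each 256 bytes, comfortably in L1. */
static uint8_t tab_a[256];
static uint8_t tab_b[256];

static uint16_t f_orig(uint16_t in, uint16_t key)
{
    uint8_t lo = (uint8_t)in;
    uint8_t hi = (uint8_t)(in >> 8);

    /* 4 table lookups */
    uint8_t a = tab_a[lo];
    uint8_t b = tab_b[hi];
    uint8_t c = tab_a[hi];
    uint8_t d = tab_b[lo];

    /* ~8 dependent logical/bit ops combining the results with the key;
     * the rotate is just a placeholder for whatever the real code does. */
    uint16_t t = (uint16_t)((a << 8) | b);
    t ^= key;
    t = (uint16_t)((t << 3) | (t >> 13));
    t ^= (uint16_t)((c << 8) | d);
    return t;
}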
My code:
2 lookups into a bigger table plus 1 op with a latency of 1 cycle. Only 1/4 of the bigger table fits in the L1 cache; the rest sits in L2. A sketch of this version follows.
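A hypothetical sketch of the reworked version, assuming the combined effect of the old lookups and bit ops is precomputed into one 2^16-entry table of 16-bit results (128 KB, so roughly 1/4 fits in a 32 KB L1); the table size and the index layout are my assumptions, not from the post.

#include <stdint.h>

/* Precomputed table folding the old lookups and logic into one step;
 * 2^16 entries of 16 bits = 128 KB, partly resident in L1. */
static uint16_t big_tab[1u << 16];

static uint16_t f_new(uint16_t in, uint16_t key)
{
    /* Each index packs one byte of in with one byte of key
     * (hypothetical layout). */
    uint16_t idx_lo = (uint16_t)(((in & 0x00ff) << 8) | (key & 0x00ff));
    uint16_t idx_hi = (uint16_t)((in & 0xff00) | (key >> 8));

    /* 2 lookups + 1 logical op */
    return (uint16_t)(big_tab[idx_lo] ^ big_tab[idx_hi]);
}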
So the original is 4 lookups at 4 cycles each (L1 hits) plus 10 cycles of logic: 4*4+10 = 26 cycles.
My version should be 2 lookups, each costing 4 cycles with probability 1/4 (L1 hit) and 11 cycles with probability 3/4 (L2 hit), plus 1 cycle of logic: 2*(4*(1/4) + 11*(3/4))+1 = 19.5 cycles.
Probably worth trying; I'll see whether I really get the ~30% speedup on Monday.
P.S.
Maybe I'll post a similar calculation about using lookups into a 32-bit-indexed table for the same function. Quite surprisingly, the result is very similar when prefetches are used correctly, though it is still not very efficient overall, since the table consumes 8 GB of RAM.
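For what that might look like, here is a hypothetical sketch of the 8 GB variant: a single 2^32-entry table of 16-bit results indexed by (in, key) packed together, with the next iteration's entry prefetched while the current one is processed. The index layout, the batch interface, and the prefetch distance of one iteration are all my assumptions; in practice the prefetch would probably need to run several iterations ahead to cover DRAM latency.

#include <stdint.h>
#include <stddef.h>

/* 2^32 entries of 16 bits = 8 GB, allocated and filled elsewhere. */
static uint16_t *huge_tab;

void f_batch(const uint16_t *in, const uint16_t *key,
             uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            /* Prefetch the entry for the next input pair:
             * read-only, low temporal locality. */
            uint32_t next = ((uint32_t)in[i + 1] << 16) | key[i + 1];
            __builtin_prefetch(&huge_tab[next], 0, 0);
        }
        uint32_t idx = ((uint32_t)in[i] << 16) | key[i];
        out[i] = huge_tab[idx];
    }
}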