CPU cycles, bytes #2.
Apr. 7th, 2009 12:39 pm
As promised, here is the second part. It goes under a cut as it's very technical.
I had yet another idea about how to squeeze a few cycles out of that 10-line assembly routine.
The input is two 16-bit values, in and key; the output is a single 16-bit value.
Original code:
Four lookups into two small tables that fit in the L1 cache, plus eight logical/bit operations with a write-after-read dependency.
My code, another option:
A single lookup into an 8 GB table (32-bit address, 16-bit values). It fits in memory but not in any cache. Fortunately, only 24 of its 128 KB "pages" are touched on each invocation, and those fit in the L3 cache, which is 38 cycles away. There is also a pattern in how those 24 pages are accessed, so I could prefetch them into L2 and get the latency down to 11 cycles. Wishful thinking! The L2-L3 bandwidth is nowhere near enough to prefetch 128 KB in just tens of cycles :)
Still a no-go. Maybe later, when Sandy Bridge is out.