CPU cycles, bytes #2.
Apr. 7th, 2009 12:39 pm
As promised, here is the second part. It goes under a cut as it's very technical.
I had yet another idea about how to squeeze a few cycles out of that 10-line assembly routine.
The input is two 16-bit values, in and key; the output is a single 16-bit value.
Original code:
Four lookups into two small tables that fit in the L1 cache, plus eight logical/bit operations with a write-after-read dependency.
My code, another option:
A single lookup into an 8 GB table (32-bit address, 16-bit values). It fits in memory but not in any cache. Fortunately, only 24 of its 128 KB "pages" are touched on each invocation, and those fit in the L3 cache, which is 38 cycles away. There is also a pattern in how those 24 pages are accessed, so I could prefetch them into L2 and get the latency down to 11 cycles. Wishful thinking! The L2-L3 bandwidth is nowhere near enough to prefetch 128 KB in just tens of cycles :)
Still a no-go. Maybe later, when Sandy Bridge is out.