May. 16th, 2018

(Where the kernel has nothing to do with operating systems, and AI has nothing to do with artificial intelligence)
I got a compute kernel from a customer that was not performing up to expectations. It turned out that the kernel's arithmetic intensity (AI, as in the roofline model) was very low, so performance should be limited by memory throughput at a given data locality: the AVX-512 floating point operations were so simple that getting data into registers and writing results back to memory was the slowest part.
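To make that concrete, a low-AI kernel looks roughly like the sketch below (a hypothetical axpy-style loop of my own, not the customer's code): one FMA per pair of elements loaded, i.e. about 16 flops per 192 bytes moved per vector iteration, roughly 0.08 flop/byte, far to the left on the roofline.

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical low-AI kernel: c[i] = a[i] * scale + b[i].
       Per 8-double vector iteration: 16 flops vs. 192 bytes read/written,
       so the memory subsystem, not the FMA unit, sets the pace. */
    void axpy_avx512(double *restrict c, const double *restrict a,
                     const double *restrict b, double scale, size_t n)
    {
        __m512d vs = _mm512_set1_pd(scale);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);                   /* 64 B read  */
            __m512d vb = _mm512_loadu_pd(b + i);                   /* 64 B read  */
            _mm512_storeu_pd(c + i, _mm512_fmadd_pd(va, vs, vb));  /* 64 B write */
        }
        for (; i < n; i++)                                         /* scalar tail */
            c[i] = a[i] * scale + b[i];
    }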

For such workloads, a natural performance metric is GB/sec at a given data size. So in theory, when all data fits in L1, it should run at ~350 GB/sec, ~130 GB/sec from L2, ~19 GB/sec from L3, and ~14 GB/sec from RAM (on the Skylake server I am running on).
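Computing the metric is nothing fancy: bytes read plus bytes written, divided by wall time, with the working-set size chosen to target a particular cache level. A sketch of such a harness (my own illustrative code, reusing the hypothetical axpy_avx512 above) could look like this:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <time.h>

    void axpy_avx512(double *restrict c, const double *restrict a,
                     const double *restrict b, double scale, size_t n);

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        size_t n = 4096;   /* 3 arrays x 32 KiB = 96 KiB working set: sits in L2 */
        int reps = 100000;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t0 = now_sec();
        for (int r = 0; r < reps; r++)
            axpy_avx512(c, a, b, 3.0, n);
        double t1 = now_sec();

        /* 2 reads + 1 write = 24 bytes moved per element */
        double gb = (double)n * 24.0 * reps / 1e9;
        printf("%.1f GB/sec at %zu KiB working set\n",
               gb / (t1 - t0), n * 3 * sizeof(double) / 1024);
        free(a); free(b); free(c);
        return 0;
    }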

Here is what I measured when I added a second stream of processing to the compute loop:

It seems that running 2x streams of processing in a single loop improves it a lot, but only when most of the data is in L3. I am not sure why, with 2x streams and an L3-sized working set, it runs faster than the L3 peak throughput ...
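My reading of "2x streams of processing in a single loop" is something along these lines (again a hypothetical sketch, assuming n is a multiple of 16 so I can skip tail handling): each iteration works on two independent halves of the arrays, so there are two concurrent address streams in flight instead of one.

    #include <immintrin.h>
    #include <stddef.h>

    /* Two independent streams interleaved in one loop body: more outstanding
       misses, i.e. more memory-level parallelism per iteration.
       Assumes n is a multiple of 16. */
    void axpy_avx512_2streams(double *restrict c, const double *restrict a,
                              const double *restrict b, double scale, size_t n)
    {
        size_t half = n / 2;
        __m512d vs = _mm512_set1_pd(scale);
        for (size_t i = 0; i < half; i += 8) {
            /* stream 0: first half of the arrays */
            __m512d a0 = _mm512_loadu_pd(a + i);
            __m512d b0 = _mm512_loadu_pd(b + i);
            _mm512_storeu_pd(c + i, _mm512_fmadd_pd(a0, vs, b0));

            /* stream 1: second half, an independent address stream */
            __m512d a1 = _mm512_loadu_pd(a + half + i);
            __m512d b1 = _mm512_loadu_pd(b + half + i);
            _mm512_storeu_pd(c + half + i, _mm512_fmadd_pd(a1, vs, b1));
        }
    }

More misses in flight per iteration is one plausible contributor to the gain at L3-sized working sets, where latency rather than raw bandwidth tends to be the limiter, but that is only a guess.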

There are many factors at play: the core frequency goes down under AVX-512, h/w prefetchers, the TLB, etc.
