Optimizing a kernel with low AI
May. 16th, 2018 10:10 am
(Where kernel has nothing to do with operating systems, and AI has nothing to do with artificial intelligence)
I got a compute kernel from a customer that was not performing up to expectations. It turned out that the kernel's AI (arithmetic intensity, as in the roofline model) was very low, so performance should be limited by memory throughput at a given data locality. (The floating-point operations in AVX512 were so simple that getting data into registers and writing it back to memory was the slowest part.)
For such workloads, a natural performance metric is GB/sec at a given data size. So in theory, when all data fits in L1, it should run at ~350GB/sec, 130GB/sec from L2, 19GB/sec from L3, and ~14GB/sec from RAM (on the Skylake server I am running on).
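To make the metric concrete, here is a minimal sketch of how GB/sec and the roofline bound could be computed. The function names (`gb_per_sec`, `roofline_gflops`) and the parameter names are my own for illustration, not from the customer code, and 1 GB is assumed to mean 1e9 bytes.

```c
#include <assert.h>

/* Effective throughput in GB/sec (assuming 1 GB = 1e9 bytes) for a kernel
   that moved bytes_moved bytes (loads plus stores) in the given time. */
double gb_per_sec(double bytes_moved, double seconds)
{
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}

/* Roofline bound: attainable GFLOP/s is the smaller of the compute peak
   and (memory bandwidth * arithmetic intensity in flops per byte). */
double roofline_gflops(double peak_gflops, double bw_gb_per_sec,
                       double flops_per_byte)
{
    assert(flops_per_byte >= 0.0);
    double memory_bound = bw_gb_per_sec * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For example, with the ~14GB/sec RAM figure and a hypothetical AI of 0.25 flops per byte, `roofline_gflops(peak, 14.0, 0.25)` is capped at 3.5 GFLOP/s for any compute peak above that, which is why the floating-point side hardly matters for a kernel like this.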
Here is what I measured when I added a second stream of processing to the compute loop:

It seems that running 2x streams of processing in a single loop improves it a lot, but only when most of the data is in L3. I am not sure why, at L3 with 2x streams, it runs faster than the L3 peak throughput ...
There are many factors: frequency goes down with AVX512, h/w prefetchers, TLB, etc.
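For readers wondering what "2x streams in a single loop" can look like, here is a rough sketch. It is not the customer kernel: the multiply-add body, the function names and the choice to split the array into two halves are all assumptions made for illustration, and the plain C loops rely on the compiler to generate the AVX512 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical low-AI kernel: one multiply-add per element loaded and stored. */
void scale_one_stream(float *dst, const float *src, size_t n, float a, float b)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = a * src[i] + b;
}

/* Same work, but each iteration touches two independent regions, so the
   loop carries two load streams and two store streams at once. */
void scale_two_streams(float *dst, const float *src, size_t n, float a, float b)
{
    assert(dst != NULL && src != NULL);
    size_t half = n / 2;
    const float *src0 = src, *src1 = src + half;
    float *dst0 = dst, *dst1 = dst + half;

    for (size_t i = 0; i < half; i++) {
        dst0[i] = a * src0[i] + b;   /* stream 0: first half  */
        dst1[i] = a * src1[i] + b;   /* stream 1: second half */
    }
    if (n % 2)                       /* odd length: last element is left over */
        dst[n - 1] = a * src[n - 1] + b;
}
```

Both functions produce the same output; only the memory access pattern differs. Whether the second form helps depends on where the data sits, as the measurements above suggest.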