May. 16th, 2018

(Where the kernel has nothing to do with operating systems, and AI has nothing to do with artificial intelligence)
I got a compute kernel from a customer that was not performing up to expectations. It turned out that the kernel's arithmetic intensity (AI, as in the roofline model) was very low, so performance should be limited by memory throughput at a given data locality: the AVX-512 floating point operations were so simple that getting data into registers and writing results back to memory was the slowest part.
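To make that concrete, a low-AI kernel looks roughly like the sketch below (a hypothetical axpy-style loop of my own, not the customer's code): one FMA per pair of elements loaded, i.e. about 16 flops per 192 bytes moved per vector iteration, roughly 0.08 flop/byte, far to the left on the roofline.

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical low-AI kernel: c[i] = a[i] * scale + b[i].
       Per 8-double vector iteration: 16 flops vs. 192 bytes read/written,
       so the memory subsystem, not the FMA unit, sets the pace. */
    void axpy_avx512(double *restrict c, const double *restrict a,
                     const double *restrict b, double scale, size_t n)
    {
        __m512d vs = _mm512_set1_pd(scale);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);                   /* 64 B read  */
            __m512d vb = _mm512_loadu_pd(b + i);                   /* 64 B read  */
            _mm512_storeu_pd(c + i, _mm512_fmadd_pd(va, vs, vb));  /* 64 B write */
        }
        for (; i < n; i++)                                         /* scalar tail */
            c[i] = a[i] * scale + b[i];
    }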

For such workloads, a natural performance metric is GB/sec at a given data size. So in theory, when all data fits in L1, it should run at ~350 GB/sec, ~130 GB/sec from L2, ~19 GB/sec from L3, and ~14 GB/sec from RAM (on the Skylake server I am running on).
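Computing the metric is nothing fancy: bytes read plus bytes written, divided by wall time, with the working-set size chosen to target a particular cache level. A sketch of such a harness (my own illustrative code, reusing the hypothetical axpy_avx512 above) could look like this:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <time.h>

    void axpy_avx512(double *restrict c, const double *restrict a,
                     const double *restrict b, double scale, size_t n);

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        size_t n = 4096;   /* 3 arrays x 32 KiB = 96 KiB working set: sits in L2 */
        int reps = 100000;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t0 = now_sec();
        for (int r = 0; r < reps; r++)
            axpy_avx512(c, a, b, 3.0, n);
        double t1 = now_sec();

        /* 2 reads + 1 write = 24 bytes moved per element */
        double gb = (double)n * 24.0 * reps / 1e9;
        printf("%.1f GB/sec at %zu KiB working set\n",
               gb / (t1 - t0), n * 3 * sizeof(double) / 1024);
        free(a); free(b); free(c);
        return 0;
    }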

Here is what I measured when I added a second stream of processing to the compute loop:

It seems that running 2x streams of processing in a single loop improves it a lot, but only when most of the data is in L3. I am not sure why, with 2x streams and an L3-sized working set, it runs faster than the L3 peak throughput ...
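My reading of "2x streams of processing in a single loop" is something along these lines (again a hypothetical sketch, assuming n is a multiple of 16 so I can skip tail handling): each iteration works on two independent halves of the arrays, so there are two concurrent address streams in flight instead of one.

    #include <immintrin.h>
    #include <stddef.h>

    /* Two independent streams interleaved in one loop body: more outstanding
       misses, i.e. more memory-level parallelism per iteration.
       Assumes n is a multiple of 16. */
    void axpy_avx512_2streams(double *restrict c, const double *restrict a,
                              const double *restrict b, double scale, size_t n)
    {
        size_t half = n / 2;
        __m512d vs = _mm512_set1_pd(scale);
        for (size_t i = 0; i < half; i += 8) {
            /* stream 0: first half of the arrays */
            __m512d a0 = _mm512_loadu_pd(a + i);
            __m512d b0 = _mm512_loadu_pd(b + i);
            _mm512_storeu_pd(c + i, _mm512_fmadd_pd(a0, vs, b0));

            /* stream 1: second half, an independent address stream */
            __m512d a1 = _mm512_loadu_pd(a + half + i);
            __m512d b1 = _mm512_loadu_pd(b + half + i);
            _mm512_storeu_pd(c + half + i, _mm512_fmadd_pd(a1, vs, b1));
        }
    }

More misses in flight per iteration is one plausible contributor to the gain at L3-sized working sets, where latency rather than raw bandwidth tends to be the limiter, but that is only a guess.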

There are many factors at play: the core frequency goes down under AVX-512, h/w prefetchers, the TLB, etc.
