Yesterday I wrote a simple function using AVX512 intrinsics for a customer. The customer had created an AVX512 function that was only 8x faster than the non-vectorized version. My version is now ~30x faster.
While working on this function, I noticed that mask register reads and writes only transfer 16 bits, not the whole 64 bits. I then found an interesting thread (4 years old) in which Agner Fog and Intel engineers discuss that limitation.