Jul. 29th, 2019

izard: (Default)
Recently I was wondering why a relatively short and simple piece of SIMD/SSE intrinsics code runs ~1.5x slower when compiled with ICC than when compiled with Clang.

When one uses intrinsics, the code is supposed to be almost assembly: the compiler should only do [optimal] register allocation and deal with input and output variables. However, compilers now do much more than that, and the optimizations they perform on top of intrinsics are often far from obvious.
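To make the "almost assembly" expectation concrete, here is a minimal sketch. It is not the code measured in this post; the function name `mul_add4` and the operation are made up for illustration. Each intrinsic nominally corresponds to one SSE instruction, as the comments note, but the compiler is free to reorder, merge, or replace them.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical example: out[i] = a[i] * b[i] + c[i] for 4 floats. */
void mul_add4(const float *a, const float *b, const float *c, float *out)
{
    __m128 va = _mm_loadu_ps(a);        /* nominally movups */
    __m128 vb = _mm_loadu_ps(b);        /* nominally movups */
    __m128 vc = _mm_loadu_ps(c);        /* nominally movups */
    __m128 prod = _mm_mul_ps(va, vb);   /* nominally mulps  */
    __m128 sum = _mm_add_ps(prod, vc);  /* nominally addps  */
    _mm_storeu_ps(out, sum);            /* nominally movups */
}
```

Comparing the disassembly of such a function across compilers and optimization levels is the quickest way to see how far the generated code departs from the sequence written in the source.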

In the code I examined, ICC produced ~20% faster code for the scalar part and ~10% faster code for one vector basic block, but ~2x slower code for another vector basic block. Combined, these resulted in the ~1.5x difference.

This is not the first time I have noticed compiler optimizations significantly altering the flow of SIMD intrinsics, but before it was always something small, like fusing 2 instructions into one when an equivalent SIMD instruction exists. In this case it just produced more than 2x as many instructions! Alas, just switching to -O0 won't bring back the planned flow, as -O0 adds too many intermediates... So if I am certain that my instruction flow is better than the compiler's, I'll have to use inline asm rather than intrinsics.
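A minimal sketch of the inline asm route, assuming an x86-64 target and a compiler that accepts GCC-style extended asm (GCC, Clang, and ICC on Linux do). The function `mul_add4_asm` is hypothetical and not the code from this post: it computes a[i] * b[i] + c[i] for 4 floats, with the multiply and add written as asm so the compiler cannot rewrite them. Loads and stores are left as intrinsics, so register allocation is still the compiler's job.

```c
#include <xmmintrin.h>  /* SSE intrinsics, used here only for load/store */

/* Hypothetical example: out[i] = a[i] * b[i] + c[i] for 4 floats,
   with the arithmetic pinned to exactly mulps followed by addps. */
void mul_add4_asm(const float *a, const float *b, const float *c, float *out)
{
    __m128 acc = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);

    /* AT&T operand order: "op src, dst".
       "+x" = read-write operand in an SSE register, "x" = input in an SSE register. */
    __asm__ volatile(
        "mulps %1, %0\n\t"   /* acc = acc * vb */
        "addps %2, %0"       /* acc = acc + vc */
        : "+x"(acc)
        : "x"(vb), "x"(vc));

    _mm_storeu_ps(out, acc);
}
```

The trade-off is that the compiler treats the asm block as opaque: it will not fuse or expand these two instructions, but it also cannot schedule or optimize across them, so this only pays off when the hand-written sequence really is better.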
