Jul. 29th, 2019

izard: (Default)
Recently I was wondering why a relatively simple and short piece of SIMD/SSE intrinsics code runs ~1.5x slower when compiled with ICC than when compiled with Clang.

When one uses intrinsics, the result is supposed to be almost assembly: the compiler should only do [optimal] register allocation and deal with input and output variables. However, compilers now do much more than that, and the optimizations they perform on top of intrinsics are often far from obvious.
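To illustrate the kind of freedom meant here, below is a minimal sketch in C. It is not the code examined in this post, and the function names are made up for illustration. The source contains two back-to-back `_mm_add_epi32` calls, but because compilers generally treat intrinsics as ordinary vector operations, an optimizing compiler is free to merge them into a single add of the constant 2. Whether a given compiler version actually does so is best checked in the generated assembly.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: written as two vector adds of 1.
 * The optimizer may legally emit one add of 2 instead. */
static __m128i add_two_steps(__m128i x) {
    const __m128i one = _mm_set1_epi32(1);
    __m128i t = _mm_add_epi32(x, one);  /* step 1: x + 1 */
    return _mm_add_epi32(t, one);       /* step 2: (x + 1) + 1 */
}

/* Helper for checking: broadcast v, run the kernel, return lane 0. */
int add_two_steps_lane0(int v) {
    __m128i r = add_two_steps(_mm_set1_epi32(v));
    return _mm_cvtsi128_si32(r);
}

/* The result is the same either way; only the emitted instructions differ. */
void check_add_two_steps(void) {
    assert(add_two_steps_lane0(5) == 7);
}
```

Compiling this with `-O2 -S` and reading the body of `add_two_steps` shows whether the two adds survived or were combined.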

In the case of the code I examined, ICC produced ~20% faster code for the scalar part and ~10% faster code for one vector basic block, but ~2x slower code for another vector basic block. Combined, these resulted in the ~1.5x difference.

It is not the first time I have noticed compiler optimizations significantly altering the flow of SIMD intrinsics, but before it was always something small, like fusing 2 instructions when a single SIMD equivalent exists. In this case it just produced more than 2x the instructions! Alas, just switching to -O0 won't bring back the planned flow, as -O0 adds too many intermediates... So if I am certain that my instruction flow is better than the compiler's, I'll have to use inline asm rather than intrinsics.
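As a rough sketch of that trade-off, here is the same tiny multiply-add written both ways in C. It is a made-up kernel, not the one from this post, and it assumes an x86-64 target and a compiler that accepts GNU-style extended asm. In the intrinsics version the compiler chooses the instructions. In the inline asm version the `mulps`/`addps` pair is emitted as written, and the compiler only allocates registers for the operands.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Intrinsics version: the compiler may reorder, fuse, or expand this. */
static __m128 madd_intrin(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* Inline asm version (AT&T syntax, GNU extended asm assumed).
 * "+x" = read/write SSE register, "x" = input SSE register. */
static __m128 madd_asm(__m128 a, __m128 b, __m128 c) {
    __asm__("mulps %1, %0\n\t"   /* a = a * b */
            "addps %2, %0"       /* a = a + c */
            : "+x"(a)
            : "x"(b), "x"(c));
    return a;
}

/* Helper for checking: broadcast scalars, return lane 0 of the asm version. */
float madd_asm_lane0(float a, float b, float c) {
    return _mm_cvtss_f32(madd_asm(_mm_set1_ps(a), _mm_set1_ps(b), _mm_set1_ps(c)));
}

/* Both versions must agree lane by lane; the inputs are chosen so that
 * every intermediate value is exactly representable in float. */
void check_madd_versions(void) {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set1_ps(0.5f);
    __m128 c = _mm_set1_ps(1.0f);
    float ri[4], ra[4];
    _mm_storeu_ps(ri, madd_intrin(a, b, c));
    _mm_storeu_ps(ra, madd_asm(a, b, c));
    for (int i = 0; i < 4; i++)
        assert(ri[i] == ra[i]);
}
```

The price of the asm version is that the compiler can no longer schedule, fold, or otherwise optimize across that block, so it only pays off when the hand-written flow really is better.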
