Jul. 29th, 2019

izard: (Default)
Recently I was wondering why a relatively simple and short piece of SIMD/SSE intrinsics code is ~1.5x slower when compiled with ICC than when compiled with Clang.

When one uses intrinsics, the code is supposed to be almost assembly: the compiler should only do [optimal] register allocation and deal with input and output variables. However, compilers now do much more than that, and the optimizations they perform on top of intrinsics are often far from obvious.
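To make the "almost assembly" expectation concrete, here is a minimal sketch in C. It is not the code I examined; the function name muladd_sse and the kernel itself are made up for illustration. Each intrinsic is written with one instruction in mind, but depending on the compiler and flags the generated code may differ, for example by unrolling the loop, reordering the loads, or contracting the multiply and add into one fused instruction on targets where that is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel: y[i] = a[i] * b[i] + y[i], four floats per step.
 * n must be a multiple of 4; unaligned loads/stores keep the sketch simple. */
void muladd_sse(float *y, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    /* intended: movups */
        __m128 vb = _mm_loadu_ps(b + i);    /* intended: movups */
        __m128 vy = _mm_loadu_ps(y + i);    /* intended: movups */
        __m128 prod = _mm_mul_ps(va, vb);   /* intended: mulps  */
        vy = _mm_add_ps(prod, vy);          /* intended: addps  */
        _mm_storeu_ps(y + i, vy);           /* intended: movups */
    }
}
```

Compiling with -S at the optimization level actually used and comparing the output against the comments is the quickest way to see how far the compiler departed from the intended sequence.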

In the case of the code I examined, ICC produced ~20% faster code for the scalar part and ~10% faster code for one vector basic block, but ~2x slower code for another vector basic block. Combined, these resulted in the ~1.5x difference.

This is not the first time I have noticed compiler optimizations significantly altering the flow of SIMD intrinsics, but before it was always something small, like fusing 2 instructions when a single SIMD equivalent exists. In this case the compiler produced more than 2x as many instructions! Alas, just switching to -O0 won't bring back the planned flow, as -O0 adds too many intermediates... So if I am certain that my instruction flow is better than the compiler's, I'll have to use inline asm rather than intrinsics.
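A minimal sketch of the difference, assuming an x86 target and a compiler that accepts GNU-style extended asm in AT&T syntax (GCC, Clang, and compatible compilers). The function names muladd_intrin and muladd_asm are hypothetical, and the kernel is a toy one, not the code from my measurement. In the intrinsics version the compiler is free to pick the instructions; in the asm version the two instructions inside the block are emitted exactly as written, and the compiler only allocates the registers.

```c
#include <xmmintrin.h>  /* SSE intrinsics and the __m128 type */

/* Intrinsics version: the compiler may rewrite or fuse these operations. */
static inline __m128 muladd_intrin(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* Inline-asm version: exactly one mulps followed by one addps.
 * "x" asks for an SSE register. "+&x" marks a as read-write and
 * early-clobbered, so it never shares a register with b or c. */
static inline __m128 muladd_asm(__m128 a, __m128 b, __m128 c)
{
    __asm__("mulps %1, %0\n\t"   /* a = a * b */
            "addps %2, %0"       /* a = a + c */
            : "+&x"(a)
            : "x"(b), "x"(c));
    return a;
}
```

The price is that the compiler treats the asm block as opaque: it cannot schedule across it, fold constants into it, or adapt it to another instruction set, so this only pays off when the hand-written flow is measurably better.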
