Compiler is too smart
Jan. 16th, 2018 10:39 am
A hotspot inside a nested loop:
__m128i result = _mm_add_epi32(a, b);
return ((uint16_t*)&result)[1];
This compiles to:
VPADDD XMM2, XMM3, XMM1
MOVD XMM1, ESP(100)
MOV ESP(102), EAX
Blocked store forwarding, CPI 3+, too slow.
What if I try to get rid of the long store forward block penalty using an obvious workaround:
return ((uint32_t*)&result)[0]>>16;
Then compiler will still generate:
MOVD XMM1, ESP(100)
MOV ESP(102), EAX
Good compiler, smart! It won't let me do
MOVD XMM1, EAX
ROL EAX, 16
no matter what. Even written as inline assembly, it does not like the above.
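
For comparison, here is a minimal sketch of how the same extraction could be spelled with intrinsics only, so the source never takes the address of the vector. The helper names add_extract_shift and add_extract_pextrw are hypothetical, not from the original code. _mm_cvtsi128_si32 corresponds to MOVD from an XMM register to a general-purpose register, and _mm_extract_epi16 corresponds to PEXTRW (both SSE2). Whether a given compiler really keeps the value in registers depends on the compiler and its flags, so the generated assembly still has to be checked.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: move the low 32 bits to a general-purpose
   register, then shift the upper half down. No pointer cast involved. */
static inline uint16_t add_extract_shift(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    uint32_t low = (uint32_t)_mm_cvtsi128_si32(result);  /* MOVD xmm -> r32 */
    return (uint16_t)(low >> 16);
}

/* Hypothetical helper: extract 16-bit lane 1 directly.
   The lane index must be a compile-time constant. */
static inline uint16_t add_extract_pextrw(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    return (uint16_t)_mm_extract_epi16(result, 1);       /* PEXTRW */
}
```

Both helpers return the same 16 bits as ((uint16_t*)&result)[1]; the difference is only in how the request is expressed to the compiler.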