Jan. 16th, 2018

A hotspot inside a nested loop:
__m128i result = _mm_add_epi32(a, b);
return ((uint16_t*)&result)[1];

This compiles to:
VPADDD XMM2, XMM3, XMM1
MOVD XMM1, ESP(100)
MOV ESP(102), EAX

The 16-bit load at offset 102 reads from the middle of the 32-bit store at offset 100, so store forwarding is blocked: CPI 3+, too slow.
What if I try to get rid of the long store-forwarding block penalty with an obvious workaround:
return ((uint32_t*)&result)[0]>>16;

The compiler still generates:
MOVD XMM1, ESP(100)
MOV ESP(102), EAX

Good compiler, smart! It won't let me do:
MOVD XMM1, EAX
ROL EAX, 16

no matter what I try; it does not even accept the sequence above as inline assembly.
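
One more thing that may be worth trying is to ask for the register-to-register move through intrinsics instead of a pointer cast. The sketch below assumes SSE2 is available; the function names sum_hi16_extract and sum_hi16_movd are made up for illustration and are not from the original hotspot. Whether a particular compiler really keeps the value in registers here still has to be checked in the disassembly.

```c
#include <emmintrin.h>  /* SSE2: _mm_add_epi32, _mm_extract_epi16, _mm_cvtsi128_si32 */
#include <stdint.h>

/* Hypothetical helper: upper 16 bits of the lowest 32-bit lane of a + b,
   requested as 16-bit word number 1 of the vector (PEXTRW-style extract). */
static inline uint16_t sum_hi16_extract(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    return (uint16_t)_mm_extract_epi16(result, 1);
}

/* Hypothetical helper: same value, requested as a MOVD-style move of the
   lowest 32-bit lane into a general-purpose register, then a shift. */
static inline uint16_t sum_hi16_movd(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    return (uint16_t)((uint32_t)_mm_cvtsi128_si32(result) >> 16);
}
```

Both helpers return the same value as ((uint16_t*)&result)[1], but neither takes the address of result, so the compiler has no need to spill it to the stack first.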
