[personal profile] izard
A customer recently asked me to write a memcpy: not a generic one, but one conforming to a set of specific constraints. The main requirement was that it be as efficient as possible on Penryn (a u-arch three generations behind the current one). The customer admits that yes, Penryn is a rather old CPU, but that is what the product uses, and I should use whatever means necessary to make a fast primitive.

Agner Fog, in his excellent optimization manual (which is nearly as useful as the official Intel optimization manual), correctly suggests that for a couple of u-archs, including Penryn, the fastest memcpy core consists of aligned loads (SSE3), shifts (SSE4.1) and aligned stores (SSE3), provided that the data sits no further away than the LLC. There is also a special case for non-temporals.
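To make the load/shift/store idea concrete, here is a minimal scalar sketch of it in plain C. It is not the function I delivered and not Agner's code: 64-bit words stand in for the 16-byte XMM registers, ordinary shifts stand in for the SIMD byte shift (an instruction such as PALIGNR), and the name `shift_copy` is made up for illustration. It assumes a little-endian machine and non-overlapping buffers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load/store one 8-byte word. Callers only pass 8-byte-aligned pointers,
   so these play the role of the aligned loads and stores in the text. */
static uint64_t load_aligned(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

static void store_aligned(unsigned char *p, uint64_t w)
{
    memcpy(p, &w, sizeof w);
}

/* Hypothetical helper: copy n bytes with aligned loads, shifts and
   aligned stores. Assumes little-endian, non-overlapping buffers. */
void *shift_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Prologue: byte copy until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) { *d++ = *s++; n--; }

    size_t k = (uintptr_t)s & 7;   /* remaining source misalignment, 0..7 */

    if (k == 0) {
        /* Both pointers aligned: plain aligned word copy, no shifts. */
        for (; n >= 8; n -= 8, s += 8, d += 8)
            store_aligned(d, load_aligned(s));
    } else if (n >= 16) {
        unsigned rs = 8 * (unsigned)k;   /* right shift in bits, 8..56 */
        unsigned ls = 64 - rs;           /* left shift in bits, 8..56 */
        const unsigned char *sa = s + (8 - k);   /* next aligned source word */
        uint64_t lo = 0;

        /* Seed the upper bytes of lo with the first 8-k source bytes.
           A SIMD kernel would typically just read the whole aligned word. */
        memcpy((unsigned char *)&lo + k, s, 8 - k);

        /* n >= 16 keeps the aligned load of hi inside the source buffer. */
        while (n >= 16) {
            uint64_t hi = load_aligned(sa);
            store_aligned(d, (lo >> rs) | (hi << ls));   /* merge two words */
            lo = hi;
            sa += 8; s += 8; d += 8; n -= 8;
        }
    }

    /* Epilogue: whatever is left, byte by byte. */
    while (n > 0) { *d++ = *s++; n--; }
    return dst;
}
```

Calling `shift_copy(dst, src, n)` behaves like memcpy for non-overlapping buffers. The point of the structure is that every access in the main loop is aligned, and the misalignment between source and destination is absorbed by the shift and merge step rather than by unaligned loads.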

It took me two days to develop the function (~200 LOC, including a ~30-line asm kernel), write unit tests and perf tests, and deliver everything to the customer. Then I got feedback: "Sorry, I can't compile it, I only have Visual Studio 2005 which does not understand SSE4.1 intrinsics, and there is no way I can upgrade :)"

Curiously, for the current and next u-archs I need not have bothered: good old rep movs works perfectly and does not use any new instructions (that is what is implemented in libgcc/glibc). Ulrich got it the most efficient way, finally :))). For years many folks asked glibc to add CPU dispatch for memcpy, and the maintainers kept refusing. No need any more: rep movs is back, like in the old Pentium days.
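For reference, a rep movs copy is about as small as a memcpy gets. The sketch below is only an illustration, not the glibc or libgcc source: the name `copy_rep_movsb` is made up, the inline asm uses GCC/Clang extended syntax, and on any other compiler or non-x86 target it simply falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes with a single "rep movsb".
   Assumes non-overlapping buffers and a clear direction flag,
   which the usual x86 calling conventions guarantee on entry. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI]
       and updates all three registers, hence the "+" constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Fallback so the sketch still builds elsewhere. */
    memcpy(dst, src, n);
#endif
    return ret;
}
```

There is no alignment handling and no size dispatch here; the bet is that the CPU's microcode does that work better than hand-written code would.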