Dec. 3rd, 2011

izard: (Default)
I was recently asked by a customer to write a memcpy, not a generic one but one conforming to a set of specific constraints. The main requirement was that it should be as efficient as possible on Penryn (a u-arch which is 3 generations behind the current one). The customer admits that yes, Penryn is a rather old CPU, but that's what is used in the product, and I should use whatever means necessary to make a fast primitive.

Agner Fog, in his excellent optimization manual (which is nearly as useful as the official Intel optimization manual), correctly suggests that for a couple of u-archs, including Penryn, the fastest memcpy core consists of aligned loads (SSE2), shifts (PALIGNR, SSSE3) and aligned stores (SSE2), provided that the data is no further away than the LLC. There is also a special case for non-temporal stores.
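To make the load/shift/store idea concrete, here is a minimal sketch in C with intrinsics. It is not the function delivered to the customer and not Agner Fog's code. The names memcpy_palignr_sketch, COPY_SHIFTED and TARGET_SSSE3 are made up for illustration, and the sketch assumes an x86 target with SSSE3 and a GCC or Clang compiler. The destination is aligned first, then every 16-byte output block is assembled from two aligned source loads with PALIGNR. Because PALIGNR takes its shift as an immediate, each source misalignment gets its own copy of the loop. The non-temporal special case is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))  /* no -mssse3 needed */
#else
#define TARGET_SSSE3
#endif

/* Loop for one fixed source misalignment SHIFT (1..15). Uses the locals
 * s, d and blocks of the enclosing function. */
#define COPY_SHIFTED(SHIFT)                                             \
    do {                                                                \
        __m128i lo = _mm_load_si128(s);            /* aligned load */   \
        for (size_t i = 0; i < blocks; i++) {                           \
            __m128i hi = _mm_load_si128(s + i + 1); /* aligned load */  \
            /* take bytes SHIFT..SHIFT+15 of the 32-byte pair hi:lo */  \
            _mm_store_si128(d + i, _mm_alignr_epi8(hi, lo, SHIFT));     \
            lo = hi;                                                    \
        }                                                               \
    } while (0)

/* Hypothetical name. Buffers must not overlap. */
TARGET_SSSE3
void *memcpy_palignr_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d8 = dst;
    const unsigned char *s8 = src;

    /* Head: plain copy until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d8 & 15)) & 15;
    if (head > n) head = n;
    memcpy(d8, s8, head);
    d8 += head; s8 += head; n -= head;

    size_t blocks = n / 16;
    size_t shift = (uintptr_t)s8 & 15;   /* source misalignment */
    if (blocks > 0) {
        assert(((uintptr_t)d8 & 15) == 0);
        __m128i *d = (__m128i *)d8;
        /* Round the source down to a 16-byte boundary. For shift != 0 the
         * first and last loads touch bytes outside [src, src + n), but only
         * inside aligned 16-byte blocks that also hold valid source bytes,
         * so they cannot cross into another page. Strictly speaking this
         * is outside what ISO C guarantees. */
        const __m128i *s = (const __m128i *)(s8 - shift);
        switch (shift) {
        case 0:
            for (size_t i = 0; i < blocks; i++)
                _mm_store_si128(d + i, _mm_load_si128(s + i));
            break;
        case 1:  COPY_SHIFTED(1);  break;
        case 2:  COPY_SHIFTED(2);  break;
        case 3:  COPY_SHIFTED(3);  break;
        case 4:  COPY_SHIFTED(4);  break;
        case 5:  COPY_SHIFTED(5);  break;
        case 6:  COPY_SHIFTED(6);  break;
        case 7:  COPY_SHIFTED(7);  break;
        case 8:  COPY_SHIFTED(8);  break;
        case 9:  COPY_SHIFTED(9);  break;
        case 10: COPY_SHIFTED(10); break;
        case 11: COPY_SHIFTED(11); break;
        case 12: COPY_SHIFTED(12); break;
        case 13: COPY_SHIFTED(13); break;
        case 14: COPY_SHIFTED(14); break;
        default: COPY_SHIFTED(15); break;
        }
    }

    /* Tail: whatever is left after the last full 16-byte block. */
    memcpy(d8 + blocks * 16, s8 + blocks * 16, n - blocks * 16);
    return dst;
}
```

A production version would unroll the inner loop and choose between this path and non-temporal stores based on the copy size. Those thresholds need measuring on the target CPU, so none are given here.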

It took me two days to develop the function (~200 LOC including a ~30-line asm kernel), unit tests and perf tests, and to deliver it to the customer. Then I got feedback: "Sorry, I can't compile it, I only have Visual Studio 2005 which does not understand SSE4.1 intrinsics, and there is no way I can upgrade :)"

Curiously, for the current and next u-archs I need not have bothered: good old rep movs works perfectly and does not use any new instructions (that is what is implemented in libgcc/glibc). Ulrich got it the most efficient way, finally :))). For years many folks were asking glibc to add a CPU dispatch for memcpy, and the maintainers kept refusing. No need any more: rep movs is back, like in the old Pentium days.
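For comparison, a copy built on the string-move instruction is tiny. The sketch below is an illustration only, not the glibc or libgcc code. The name memcpy_rep_movsb is made up, and it assumes GCC-style extended inline assembly on x86 or x86-64 with the direction flag clear, as the usual ABIs require on function entry.

```c
#include <stddef.h>

/* Hypothetical name. Copies n bytes forward; buffers must not overlap. */
static inline void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    /* rep movsb copies (R/E)CX bytes from [(R/E)SI] to [(R/E)DI] and
     * advances all three registers, hence the read-write constraints. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
    return ret;
}
```

This syntax does not compile with Microsoft compilers. There the same instruction is available through the __movsb intrinsic declared in intrin.h.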
