izard

I was recently asked by a customer to write a memcpy: not a generic one, but one conforming to a set of specific constraints. The main requirement was that it be as efficient as possible on Penryn (a u-arch that is 3 generations behind the current one). The customer admits that yes, Penryn is a rather old CPU, but that's what is used in the product, so I should use whatever means necessary to make a fast primitive.

Agner Fog, in his excellent optimization manual (which is nearly as useful as the official Intel optimization manual), correctly suggests that for a couple of u-archs, including Penryn, the fastest memcpy core consists of aligned loads (SSE3), shifts (SSE4.1) and aligned stores (SSE3), provided that the data is no further away than the LLC. There is also a special case for non-temporal stores.
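To make the "aligned load, shift, aligned store" idea concrete, here is a minimal sketch in C intrinsics. It is not the kernel delivered to the customer (that one is not shown here). It assumes the shift step is done with PALIGNR, reached through `_mm_alignr_epi8`, which compilers expose as an SSSE3 intrinsic. The name `copy_palignr`, the `SHIFT_CASE` macro and the head/tail handling via plain `memcpy` are all made up for illustration, and there is no non-temporal path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* _mm_alignr_epi8 (PALIGNR) */

/* Let GCC/Clang build this one function with SSSE3 enabled. */
#if defined(__GNUC__) || defined(__clang__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* PALIGNR takes its byte count as an immediate, so each source
 * misalignment K gets its own copy of the loop. Every load and store
 * in the loop is 16-byte aligned. The "n >= 32" bound keeps the second
 * aligned load inside the source buffer. */
#define SHIFT_CASE(K)                                                    \
    case K:                                                              \
        while (n >= 32) {                                                \
            __m128i hi = _mm_load_si128((const __m128i *)(sa + 16));     \
            _mm_store_si128((__m128i *)d, _mm_alignr_epi8(hi, lo, K));   \
            lo = hi;                                                     \
            sa += 16;                                                    \
            d += 16;                                                     \
            n -= 16;                                                     \
        }                                                                \
        break;

/* Hypothetical helper: same contract as memcpy (no overlapping buffers). */
SSSE3_FN void *copy_palignr(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head of 16..31 bytes: afterwards d is 16-byte aligned and the
     * aligned block that contains s lies fully inside the source. */
    size_t head = 16 + ((16 - ((uintptr_t)d & 15)) & 15);
    if (n < head + 32)
        return memcpy(dst, src, n); /* too short to be worth it */
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    size_t k = (uintptr_t)s & 15;      /* source misalignment vs. dst */
    const unsigned char *sa = s - k;   /* source rounded down to 16 */
    __m128i lo = _mm_load_si128((const __m128i *)sa);

    switch (k) {
        SHIFT_CASE(0)  SHIFT_CASE(1)  SHIFT_CASE(2)  SHIFT_CASE(3)
        SHIFT_CASE(4)  SHIFT_CASE(5)  SHIFT_CASE(6)  SHIFT_CASE(7)
        SHIFT_CASE(8)  SHIFT_CASE(9)  SHIFT_CASE(10) SHIFT_CASE(11)
        SHIFT_CASE(12) SHIFT_CASE(13) SHIFT_CASE(14) SHIFT_CASE(15)
    }

    memcpy(d, sa + k, n); /* tail: whatever the loop left over */
    return dst;
}
```

The 16-way switch is the ugly part: because the shift count must be a compile-time constant, a real kernel ends up with one unrolled loop per misalignment, which is probably why such a core is usually written in asm.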

It took me two days to develop the function (~200 LOC including a ~30-line asm kernel), unit tests and perf tests, and deliver it all to the customer. Then I got feedback: "Sorry, I can't compile it, I only have Visual Studio 2005 which does not understand SSE4.1 intrinsics, and there is no way I can upgrade :)"

Curiously, for the current and next u-archs I need not have bothered: good old rep movs works perfectly and does not use any new instructions (that is what is implemented in libgcc/glibc). Ulrich got it the most efficient way, finally :))). For years many folks were asking glibc to add CPU dispatch for memcpy, and they were refusing. No need any more: rep movs is back, like in the old Pentium days.
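For comparison, a rep movs copy is tiny. The sketch below assumes GCC or Clang extended inline asm on x86 or x86-64 and the usual ABI guarantee that the direction flag is clear on function entry. The name `copy_rep_movsb` is hypothetical, and this is not a copy of the libgcc/glibc code.

```c
#include <stddef.h>

/* Hypothetical helper: forward byte copy with a single "rep movsb".
 * (R/E)DI = destination, (R/E)SI = source, (R/E)CX = byte count.
 * All three registers are updated by the instruction, hence "+". */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

No alignment handling, no per-CPU code path, no new instructions: the microcode does the work.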


Dec. 3rd, 2011

Work post: memcpy
