Work post: memcpy
Dec. 3rd, 2011 11:23 pm
I was recently asked by a customer to write a memcpy: not a generic one, but one conforming to a set of specific constraints. The main requirement was that it should be as efficient as possible on Penryn (a u-arch that is three generations behind the current one). The customer admits that yes, Penryn is a rather old CPU, but that is what the product uses, and I should use whatever means necessary to make a fast primitive.
Agner Fog, in his excellent optimization manual (which is nearly as useful as the official Intel optimization manual), correctly suggests that for a couple of u-archs, including Penryn, the fastest memcpy core consists of aligned loads (SSE3), shifts (SSE4.1) and aligned stores (SSE3), provided that the data sits no further away than the LLC. There is also a special case for non-temporal accesses.
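To make the "aligned loads, shifts, aligned stores" idea concrete, here is a minimal sketch in C. It is not the kernel delivered to the customer, which is not shown in this post. As an assumption made for illustration, it restricts itself to SSE2 intrinsics (`_mm_srli_si128`, `_mm_slli_si128`, `_mm_or_si128`); on SSSE3 hardware the two shifts and the OR can be replaced by a single `_mm_alignr_epi8`. The names `copy_aligned_shift`, `copy_shift_N`, `COMBINE`, `DEFINE_KERNEL` and `CASE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics only */

/* Build the 16 bytes that start SHIFT bytes into `lo` from two adjacent
 * aligned blocks. SHIFT must be a compile-time constant in 1..15. */
#define COMBINE(lo, hi, SHIFT) \
    _mm_or_si128(_mm_srli_si128((lo), (SHIFT)), \
                 _mm_slli_si128((hi), 16 - (SHIFT)))

/* One kernel per source misalignment. dst is 16-byte aligned, n is a
 * non-zero multiple of 16. The aligned loads touch up to 15 bytes before
 * src and after src + n; they stay inside aligned blocks that also hold
 * valid bytes, so they do not fault, but ISO C does not bless them. */
#define DEFINE_KERNEL(SHIFT) \
    static void copy_shift_##SHIFT(uint8_t *dst, const uint8_t *src, size_t n) \
    { \
        const __m128i *s = (const __m128i *)(src - (SHIFT)); \
        __m128i lo = _mm_load_si128(s); \
        for (size_t i = 0; i < n; i += 16) { \
            __m128i hi = _mm_load_si128(++s); \
            _mm_store_si128((__m128i *)(dst + i), COMBINE(lo, hi, SHIFT)); \
            lo = hi; \
        } \
    }

/* Source and destination are both aligned: no shifting needed. */
static void copy_shift_0(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
        _mm_store_si128((__m128i *)(dst + i),
                        _mm_load_si128((const __m128i *)(src + i)));
}

DEFINE_KERNEL(1) DEFINE_KERNEL(2) DEFINE_KERNEL(3) DEFINE_KERNEL(4)
DEFINE_KERNEL(5) DEFINE_KERNEL(6) DEFINE_KERNEL(7) DEFINE_KERNEL(8)
DEFINE_KERNEL(9) DEFINE_KERNEL(10) DEFINE_KERNEL(11) DEFINE_KERNEL(12)
DEFINE_KERNEL(13) DEFINE_KERNEL(14) DEFINE_KERNEL(15)

#define CASE(K) case K: copy_shift_##K(dst + i, src + i, body); break;
#endif

void *copy_aligned_shift(void *dst_, const void *src_, size_t n)
{
    uint8_t *dst = (uint8_t *)dst_;
    const uint8_t *src = (const uint8_t *)src_;
    size_t i = 0;
#if defined(__SSE2__) || defined(_M_X64)
    /* Head: copy single bytes until the destination is 16-byte aligned. */
    size_t head = (size_t)(-(uintptr_t)dst & 15);
    for (; i < head && i < n; i++)
        dst[i] = src[i];
    /* Body: whole 16-byte blocks, dispatched on the source misalignment. */
    size_t body = (n - i) & ~(size_t)15;
    if (body) {
        assert(((uintptr_t)(dst + i) & 15) == 0);
        switch ((uintptr_t)(src + i) & 15) {
            CASE(0) CASE(1) CASE(2) CASE(3) CASE(4) CASE(5) CASE(6) CASE(7)
            CASE(8) CASE(9) CASE(10) CASE(11) CASE(12) CASE(13) CASE(14)
            CASE(15)
        }
        i += body;
    }
#endif
    /* Tail (or everything, when SSE2 is unavailable): plain byte copy. */
    for (; i < n; i++)
        dst[i] = src[i];
    return dst_;
}
```

The shift count has to be an immediate, which is why the sketch generates one kernel per misalignment and dispatches with a `switch`. A tuned version would presumably unroll the loop and add the non-temporal path; neither is attempted here.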
It took me two days to develop the function (~200 LOC, including a ~30-line asm kernel), the unit tests and the perf tests, and to deliver it all to the customer. Then I got feedback: "Sorry, I can't compile it, I only have Visual Studio 2005 which does not understand SSE4.1 intrinsics, and there is no way I can upgrade :)"
Curiously, for the current and next u-archs I need not have bothered: good old rep movs works perfectly and does not use any new instructions (that is what is implemented in libgcc/glibc). Ulrich got it the most efficient way, finally :))). For years many folks were asking glibc to add CPU dispatch for memcpy, and the maintainers kept refusing. No need any more: rep movs is back, like in the old Pentium days.
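For comparison, the string-move approach fits in a few lines. The sketch below is an illustration only, not the libgcc/glibc code: it assumes GCC or Clang extended inline asm on x86, and it falls back to the library `memcpy` elsewhere. The name `copy_rep_movsb` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI] and
     * advances all three registers, hence the read-write constraints.
     * The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Assumption: on other compilers or targets, defer to the library. */
    memcpy(dst, src, n);
#endif
    return ret;
}
```

There is no alignment handling, no dispatch and no new instruction set involved: whatever optimization happens is done by the CPU itself.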