More on memcpy
Dec. 17th, 2011 04:58 pm
Two weeks ago I posted about a memcpy optimized for Penryn. Now the customer has asked me to rewrite it in assembly, because his Visual Studio 2005 does not understand the intrinsics that were only introduced in 2008.
I normally don't post source code on this blog, because I usually work on customers' code, but in this case it is a generic memcpy, nothing customer-specific.
Here is the inner loop, the first iteration of my hand-written version:
buf_loop: add ecx, 0x10;
movdqa xmm2, [ebx + ecx]; // first 16-byte block from src
add ecx, 0x10;
movdqa xmm3, [ebx + ecx]; // second 16-byte block from src
movdqa xmm1, xmm0;        // xmm0 carries the last block of the previous iteration
movdqa xmm0, xmm3;        // save the newest block for the next iteration
palignr xmm3, xmm2, shift; // shift to match dst-src alignment
palignr xmm2, xmm1, shift; // shift to match dst-src alignment
add ecx, -0x20;
movdqa [ecx + edx], xmm2; // write first 16 bytes to dst
add ecx, 0x10;
movdqa [ecx + edx], xmm3; // write second 16 bytes to dst
add ecx, 0x10;
dec edi;                  // edi counts the 32-byte iterations
jne buf_loop;
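For reference, here is roughly what the intrinsics version from the earlier post looks like at the C level. This is a hedged sketch reconstructed for this post, not the original source: the function name, the fixed SHIFT value and the pointer bookkeeping are mine, and real code also has to deal with the head, the tail and the already-aligned case.

#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_store_si128 */
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8, i.e. palignr */

#define SHIFT 15         /* src-dst misalignment in bytes; palignr needs a compile-time constant */

/* Copies 32*count bytes. dst is 16-byte aligned; src_aligned points at the
   16-byte-aligned block containing the current source position. */
static void copy_core(__m128i *dst, const __m128i *src_aligned, size_t count)
{
    __m128i prev = _mm_load_si128(src_aligned);      /* block carried between iterations */
    while (count--) {
        __m128i a = _mm_load_si128(src_aligned + 1); /* next aligned 16-byte block */
        __m128i b = _mm_load_si128(src_aligned + 2); /* and the one after it */
        /* stitch adjacent aligned blocks together, shifted to match dst alignment */
        _mm_store_si128(dst,     _mm_alignr_epi8(a, prev, SHIFT));
        _mm_store_si128(dst + 1, _mm_alignr_epi8(b, a,    SHIFT));
        prev = b;
        src_aligned += 2;
        dst += 2;
    }
}

The idea is the same as in the asm above: do only aligned loads and stores, and let palignr stitch the bytes back together in registers.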
Of course this was slower than the version the compiler produced from my intrinsics code. For a kernel this small I would not bother with VTune; the Intel Architecture Code Analyzer comes to the rescue.
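(For anyone who has not used IACA: you compile the kernel with IACA's marker macros around the region of interest and then run the iaca tool on the resulting object file. From memory the markers look roughly like this; treat it as a sketch of the workflow rather than the exact setup used here.)

#include <stddef.h>
#include <emmintrin.h>
#include "iacaMarks.h"   /* ships with IACA; defines the marker macros */

void marked_copy(__m128i *dst, const __m128i *src, size_t n)
{
    IACA_START           /* emits the begin-of-analysis marker bytes */
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i]; /* the loop you want analyzed goes between the markers */
    IACA_END             /* emits the end-of-analysis marker bytes */
}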
Here is its output when I fed it the code produced by the MS compiler; the columns show which execution units are busy at each instruction:
| 1 | | | 1 1 | | | | CP | movdqa xmm1, xmmword ptr [edx]
| 1 | | | 1 1 | | | | | movdqa xmm0, xmmword ptr [esi+ecx*1]
| 1 | X | X | | | | 1 | | movdqa xmm3, xmm2
| 1 | 1 | X | | | | X | CP | movdqa xmm4, xmm1
| 1 | X | 1 | | | | X | | movdqa xmm2, xmm0
| 1 | X | | | | | 1 | CP | palignr xmm4, xmm3, 0xf
| 2^ | | | | 1 | 1 | | CP | movdqa xmmword ptr [ecx-0x10], xmm4
| 1 | 1 | | | | | X | | palignr xmm0, xmm1, 0xf
| 2^ | | | | 1 | 1 | | | movdqa xmmword ptr [ecx], xmm0
| 1 | X | 1 | | | | X | | add edx, 0x20
| 1 | X | X | | | | 1 | | add ecx, 0x20
| 1 | X | 1 | | | | X | | dec eax
| 1 | | | | | | 1 | | jnz 0xffffffc
Looking at the output, I see the main reason it is faster than my first iteration: it resolves the store-forwarding stalls by using an extra SIMD register and an extra register-to-register move, which is essentially free anyway.
So I changed my code accordingly. Here is the IACA output for the new version:
| 1 | | | 1 1 | | | | CP | movdqa xmm1, xmmword ptr [eax]
| 1 | | | 1 1 | | | | | movdqa xmm0, xmmword ptr [eax+0x10]
| 1 | 1 | X | | | | X | | movdqa xmm3, xmm2
| 1 | X | 1 | | | | X | CP | movdqa xmm4, xmm1
| 1 | X | X | | | | 1 | | movdqa xmm2, xmm0
| 1 | 1 | | | | | X | CP | palignr xmm4, xmm3, 0xf
| 2^ | | | | 1 | 1 | | CP | movdqa xmmword ptr [ecx], xmm4
| 1 | X | | | | | 1 | | palignr xmm0, xmm1, 0xf
| 1 | X | 1 | | | | X | | add eax, 0x20
| 2^ | | | | 1 | 1 | | | movdqa xmmword ptr [ecx+0x10], xmm0
| 1 | 1 | X | | | | X | | add ecx, 0x20
| 1 | X | 1 | | | | X | | dec edx
| 1 | | | | | | 1 | | jnz 0xffffffc
That's 2 clocks less latency than the MS compiler's version.
That's good enough to give to the customer (in the customer's benchmark it is ~5x faster than std::memcpy, and almost 2x faster than unrolled copying using movdqu), and about the same speed as the initial intrinsics-based version. However, when I ran the benchmark in an environment one step closer to the deployment scenario, the intrinsics version was again 20% faster! I am not sure why; perhaps the MS compiler can inline and further optimize the intrinsics version, but cannot do that with the asm version?
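For context, the movdqu baseline mentioned above is just the obvious unrolled loop with unaligned loads and aligned stores, something like this (again a sketch, not the benchmark code itself):

#include <stddef.h>
#include <emmintrin.h>

/* Copies 32*count bytes: unaligned loads from src, aligned stores to dst. */
static void copy_movdqu(__m128i *dst, const char *src, size_t count)
{
    while (count--) {
        __m128i a = _mm_loadu_si128((const __m128i *)src);        /* movdqu */
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 16)); /* movdqu */
        _mm_store_si128(dst, a);                                  /* movdqa */
        _mm_store_si128(dst + 1, b);
        src += 32;
        dst += 2;
    }
}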
P.S. Perhaps I should publish a post on habrahabr describing how efficient memcpy implementations have evolved: from REP MOVS 20 years ago, through many weird hacks, and back to REP MOVS.
P.P.S. Don't ever try this kind of code on newer CPUs: there the REP MOVS used by std::memcpy is _much_ faster than any of these dirty hacks.