Dec. 17th, 2011

izard: (Default)
Two weeks ago I posted about a memcpy optimized for Penryn. Now the customer has asked me to rewrite it in assembly, because his Visual Studio 2005 does not understand the intrinsics introduced in 2008.

I normally don't post source code on this blog, because I usually work on customers' code, but in this case it is a generic memcpy, nothing customer-specific.

Here is the inner loop, the first iteration of my manual coding:
buf_loop:
    add     ecx, 0x10
    movdqa  xmm2, [ebx + ecx]    // first 16-byte block from src
    add     ecx, 0x10
    movdqa  xmm3, [ebx + ecx]    // second 16-byte block from src
    movdqa  xmm1, xmm0
    movdqa  xmm0, xmm3
    palignr xmm3, xmm2, shift    // shift to match dst-src alignment
    palignr xmm2, xmm1, shift    // shift to match dst-src alignment
    add     ecx, -0x20
    movdqa  [ecx + edx], xmm2    // write 16 bytes to dst
    add     ecx, 0x10
    movdqa  [ecx + edx], xmm3    // write 16 bytes to dst
    add     ecx, 0x10
    dec     edi
    jne     buf_loop

Of course, this was slower than the version the compiler produced from my intrinsics code. For a kernel this small I would not bother with VTune; the Intel Architecture Code Analyzer comes to the rescue.

Here is its output when I fed it the code produced by the MS compiler, showing which execution units are busy at each instruction:
| 1 | | | 1 1 | | | | CP | movdqa xmm1, xmmword ptr [edx]
| 1 | | | 1 1 | | | | | movdqa xmm0, xmmword ptr [esi+ecx*1]
| 1 | X | X | | | | 1 | | movdqa xmm3, xmm2
| 1 | 1 | X | | | | X | CP | movdqa xmm4, xmm1
| 1 | X | 1 | | | | X | | movdqa xmm2, xmm0
| 1 | X | | | | | 1 | CP | palignr xmm4, xmm3, 0xf
| 2^ | | | | 1 | 1 | | CP | movdqa xmmword ptr [ecx-0x10], xmm4
| 1 | 1 | | | | | X | | palignr xmm0, xmm1, 0xf
| 2^ | | | | 1 | 1 | | | movdqa xmmword ptr [ecx], xmm0
| 1 | X | 1 | | | | X | | add edx, 0x20
| 1 | X | X | | | | 1 | | add ecx, 0x20
| 1 | X | 1 | | | | X | | dec eax
| 1 | | | | | | 1 | | jnz 0xffffffc
Looking at the output, I see the main reason it is faster than my first iteration: it resolves the store-forwarding stalls by using an extra SIMD register and an extra move, which is free anyway.

So I changed my code to:
| 1 | | | 1 1 | | | | CP | movdqa xmm1, xmmword ptr [eax]
| 1 | | | 1 1 | | | | | movdqa xmm0, xmmword ptr [eax+0x10]
| 1 | 1 | X | | | | X | | movdqa xmm3, xmm2
| 1 | X | 1 | | | | X | CP | movdqa xmm4, xmm1
| 1 | X | X | | | | 1 | | movdqa xmm2, xmm0
| 1 | 1 | | | | | X | CP | palignr xmm4, xmm3, 0xf
| 2^ | | | | 1 | 1 | | CP | movdqa xmmword ptr [ecx], xmm4
| 1 | X | | | | | 1 | | palignr xmm0, xmm1, 0xf
| 1 | X | 1 | | | | X | | add eax, 0x20
| 2^ | | | | 1 | 1 | | | movdqa xmmword ptr [ecx+0x10], xmm0
| 1 | 1 | X | | | | X | | add ecx, 0x20
| 1 | X | 1 | | | | X | | dec edx
| 1 | | | | | | 1 | | jnz 0xffffffc
Two clocks less latency than the MS compiler's version.

That's good enough to give to the customer (in the customer's benchmark it is ~5x faster than std::memcpy, and almost 2x faster than unrolled copying using movdqu), and about the same speed as the initial intrinsics-based version. However, when I ran the benchmark in an environment one step closer to the deployment scenario, the intrinsics version was again 20% faster! I am not sure why; perhaps the MS compiler can inline and optimize the intrinsics version, but cannot do that with the asm version?

P.S. Perhaps I can publish a post on habrahabr describing how efficient memcpy implementations evolved: from REP MOVS 20 years ago, through many weird hacks, and back to REP MOVS.

P.P.S. Don't ever try this kind of code on newer CPUs: the REP MOVS used by std::memcpy is _much_ faster than any of these dirty hacks.

Squirrel

Dec. 17th, 2011 05:08 pm
When we went shopping today I took my camera with me: we were going to pass a place where I had noticed a pigeon sitting on a nest in an unusual spot. I did not know they breed all year long!

However, it was so dark that I failed to take any meaningful picture. Instead I tried to catch sight of a squirrel eating nuts next to a grocery store.

I don't know why most squirrels are so wary of humans here in Munich. Even my 200mm lens barely helped.
