[PATCH 1/1] whirlpool hash amd64 assembly
And Sch
andsch at inbox.com
Wed Sep 3 17:54:35 CEST 2014
Sorry, I hit the wrong button when replying. Here are benchmarks on the Atom system:
Intel(R) Atom(TM) CPU N570 @ 1.66GHz
original, no patches:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     63.40 ns/B      15.04 MiB/s           - c/B
my C only optimization:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     46.21 ns/B      20.64 MiB/s           - c/B
my edited GCC x64 assembly:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     29.29 ns/B      32.56 MiB/s           - c/B
the SSE assembly by Jussi Kivilinna:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     41.19 ns/B      23.15 MiB/s           - c/B
It is strange that the SSE assembly is much faster than the non-SSE version on the i5 but slower on the Atom. The bswap does not explain the difference, because I also tested an SSE version with the bswap removed and got the same results.
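For reference, here is a rough C sketch of the index-extraction idea Jussi describes in his quoted mail below: read a state word once and peel off the first two table-index bytes, which a compiler can turn into single movzbl-from-%al/%ah extractions on amd64. The function and table names here are made up for illustration and are not taken from his patch:

  #include <stdint.h>

  /* Hypothetical sketch only: the tables C0/C1 and this helper are
   * illustrative, not the actual patch code. */
  static uint64_t lookup_two(uint64_t x,
                             const uint64_t C0[256],
                             const uint64_t C1[256])
  {
    unsigned idx0 = (uint8_t)x;          /* first index byte  (bits 0..7)  */
    unsigned idx1 = (uint8_t)(x >> 8);   /* second index byte (bits 8..15) */

    /* Both indexes come from the same register load; on amd64 the
       compiler can extract them with movzbl from %al and %ah without
       an extra shift. */
    return C0[idx0] ^ C1[idx1];
  }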
-Andrei
> -----Original Message-----
> From: jussi.kivilinna at iki.fi
> Sent: Tue, 02 Sep 2014 18:06:28 +0300
> To: andsch at inbox.com
> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>
> On 02/09/14 04:02, And Sch wrote:
>> That is very impressive. The goal is accomplished, then; I just wanted a
>> faster whirlpool hash in gnupg. I'm no good with assembly, so I have no
>> hope of doing better than the compiler. You may want to title the
>> assembly as sse-amd64 now.
>>
>> Thanks
>
> Did you have a chance to run the implementation on the Atom? I'd be very
> interested to know how the performance is there.
>
> -Jussi
>
> ps. Please keep mailing-list in CC.
>
>>
>>> -----Original Message-----
>>> From: jussi.kivilinna at iki.fi
>>> Sent: Mon, 01 Sep 2014 19:15:03 +0300
>>> To: gcrypt-devel at gnupg.org
>>> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>>>
>>> On 29/08/14 18:45, And Sch wrote:
>>> <snip>
>>>>
>>>> That is more than twice as fast as the original on the Atom system.
>>>>
>>>> I tried to find a way to use macros to sort out parts of the loop, but
>>>> any change in the order of the instructions slows it down a lot. There
>>>> are also only 7 registers available at one time in most parts of the
>>>> loop, so that makes macros and rearrangements even more difficult.
>>>>
>>>> I used a little endian version of the last patch I posted and gcc
>>>> -funroll-loops to generate this assembly. I've looked through it and
>>>> tried to organize it as best I can. Suggestions on how to clean it up
>>>> further would be helpful.
>>>>
>>>
>>> I don't agree that this is a good method for creating assembly
>>> implementations. As I see it, the main point of an assembly
>>> implementation is that you can do optimizations that the compiler has
>>> no way of finding. For example, you could load indexes into the
>>> rax/rbx/rcx/rdx registers so that not only the first index byte but
>>> also the second can be extracted with just one instruction. Or, use
>>> XMM registers to store the key[] and state[] arrays instead of the
>>> stack.
>>>
>>> Well, I ended up making such an implementation, which I've attached. On
>>> an Intel i5-4570 (3.6 GHz turbo), I get:
>>>
>>>> tests/bench-slope --cpu-mhz 3600 hash whirlpool
>>> Hash:
>>>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>>>  WHIRLPOOL      |      4.28 ns/B      222.7 MiB/s        15.42 c/B
>>>
>>> -Jussi
>>>
>>> _______________________________________________
>>> Gcrypt-devel mailing list
>>> Gcrypt-devel at gnupg.org
>>> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel
>>