[PATCH 1/1] whirlpool hash amd64 assembly

Jussi Kivilinna jussi.kivilinna at iki.fi
Wed Sep 3 18:41:18 CEST 2014


On 03/09/14 18:54, And Sch wrote:
> Sorry, I hit the wrong button when replying. Here are benchmarks on the Atom system:
> 
> Intel(R) Atom(TM) CPU N570   @ 1.66GHz
> 
> original, no patches:
> Hash:
>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>  WHIRLPOOL      |     63.40 ns/B     15.04 MiB/s         - c/B
> 
> my C only optimization:
> Hash:
>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>  WHIRLPOOL      |     46.21 ns/B     20.64 MiB/s         - c/B
> 
> my edited GCC x64 assembly:
> Hash:
>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>  WHIRLPOOL      |     29.29 ns/B     32.56 MiB/s         - c/B
> 
> the SSE assembly by Jussi Kivilinna:
> Hash:
>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>  WHIRLPOOL      |     41.19 ns/B     23.15 MiB/s         - c/B
> 
> It is weird that the SSE assembly is much faster than the non-SSE on the i5, but unexpectedly slower on the Atom system. The bswap does not explain the difference, because I also tested an SSE version with bswap removed and got the same results.
> 
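The cycles/byte column is empty in the tables above, presumably because bench-slope was run without --cpu-mhz. As a rough sketch of the conversion, assuming the N570 runs at its nominal 1.66 GHz (an assumption, not something measured in this thread), cycles/byte is simply nanosecs/byte times the clock rate in GHz. The function names below are made up for illustration and are not part of bench-slope:

```c
#include <assert.h>

/* Convert a "nanosecs/byte" figure to cycles/byte.
   cpu_mhz is an ASSUMED clock rate; 1 MHz equals 1e-3 cycles per ns. */
double ns_per_byte_to_cycles(double ns_per_byte, double cpu_mhz)
{
  assert(ns_per_byte >= 0.0 && cpu_mhz > 0.0);
  return ns_per_byte * cpu_mhz / 1000.0;
}

/* Convert "nanosecs/byte" to mebibytes/sec (1 MiB = 1048576 bytes). */
double ns_per_byte_to_mib_per_sec(double ns_per_byte)
{
  assert(ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / 1048576.0;
}
```

Under that 1.66 GHz assumption, the four rows work out to roughly 105, 77, 49 and 68 c/B respectively.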

Well, not totally unexpected to me :)

The problem is that (old) Atoms are in-order, unlike every other current x86 CPU, which are all out-of-order. So Atom would need extra work: tweaking the instruction ordering by hand, maybe making RI4 another RTx for byte indexes, and so on. That work might or might not reduce performance on Core processors. But since I don't have a 64-bit capable Atom, I can't really do that tweaking *.
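For readers unfamiliar with the byte-index extraction mentioned in the earlier message quoted below, here is a minimal C sketch of the idea. It is not the attached assembly: the function names and the two tables are hypothetical, and a compiler is free to emit different instructions. On amd64, the second byte of a register can be read directly (as %ah, %bh, %ch or %dh) only for rax/rbx/rcx/rdx, which is why the choice of registers matters.

```c
#include <stdint.h>

/* Lowest byte of w, used as a table index. */
unsigned idx_byte0(uint64_t w)
{
  return (unsigned)(w & 0xff);
}

/* Second-lowest byte of w. In hand-written amd64 assembly this can be a
   single zero-extending move from %ah/%bh/%ch/%dh when w is held in
   rax/rbx/rcx/rdx; other registers need a shift first. */
unsigned idx_byte1(uint64_t w)
{
  return (unsigned)((w >> 8) & 0xff);
}

/* Hypothetical lookup step: XOR two table entries selected by the two
   low bytes of w. t0 and t1 stand in for real lookup tables. */
uint64_t lookup_two(const uint64_t t0[256], const uint64_t t1[256],
                    uint64_t w)
{
  return t0[idx_byte0(w)] ^ t1[idx_byte1(w)];
}
```

On an in-order core, each such lookup stalls whatever follows it until the load completes, so the order in which these steps are written has a much larger effect than on an out-of-order core.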

-Jussi

* (And since the latest Atoms have gained limited out-of-order scheduling, optimizing for the old in-order ones does not seem like time well spent.) (And I haven't seen Intel contributing Atom-optimized implementations to open source projects.)

> -Andrei
>> -----Original Message-----
>> From: jussi.kivilinna at iki.fi
>> Sent: Tue, 02 Sep 2014 18:06:28 +0300
>> To: andsch at inbox.com
>> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>>
>> On 02/09/14 04:02, And Sch wrote:
>>> That is very impressive. The goal is accomplished then, I just wanted a
>>> faster whirlpool hash in gnupg. I'm no good with assembly, so I have no
>>> hope of doing better than the compiler. You may want to title the
>>> assembly as sse-amd64 now.
>>>
>>> Thanks
>>
>> Did you have a chance to run the implementation on Atom? I'd be very
>> interested to know how the performance is there.
>>
>> -Jussi
>>
>> ps. Please keep mailing-list in CC.
>>
>>>
>>>> -----Original Message-----
>>>> From: jussi.kivilinna at iki.fi
>>>> Sent: Mon, 01 Sep 2014 19:15:03 +0300
>>>> To: gcrypt-devel at gnupg.org
>>>> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>>>>
>>>> On 29/08/14 18:45, And Sch wrote:
>>>> <snip>
>>>>>
>>>>> That is more than twice as fast as the original on the Atom system.
>>>>>
>>>>> I tried to find a way to use macros to sort out parts of the loop, but
>>>>> any change in the order of the instructions slows it down a lot. There
>>>>> are also only 7 registers available at one time in most parts of the
>>>>> loop, so that makes macros and rearrangements even more difficult.
>>>>>
>>>>> I used a little endian version of the last patch I posted and gcc
>>>>> -funroll-loops to generate this assembly. I've looked through it and
>>>>> tried to organize it as best I can. Suggestions on how to clean it up
>>>>> further would be helpful.
>>>>>
>>>>
>>>> I don't agree that this is a good method for creating assembly
>>>> implementations. As I see it, the main point of assembly
>>>> implementations is that you can do optimizations that the compiler has
>>>> no way of finding. For example, you could load indexes into the
>>>> rax/rbx/rcx/rdx registers, which allow extracting not only the first
>>>> index byte but also the second byte with just one instruction. Or use
>>>> XMM registers to store the key[] and state[] arrays instead of the
>>>> stack.
>>>>
>>>> Well, I ended up making such an implementation, which I've attached. On
>>>> Intel i5-4570 (3.6 GHz turbo), I get:
>>>>
>>>>> tests/bench-slope --cpu-mhz 3600 hash whirlpool
>>>> Hash:
>>>>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>>>>  WHIRLPOOL      |      4.28 ns/B     222.7 MiB/s     15.42 c/B
>>>>
>>>> -Jussi
>>>>
>>>> _______________________________________________
>>>> Gcrypt-devel mailing list
>>>> Gcrypt-devel at gnupg.org
>>>> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel
>>>
> 



