[PATCH 1/1] whirlpool hash amd64 assembly
And Sch
andsch at inbox.com
Wed Sep 3 17:54:35 CEST 2014
Sorry, I hit the wrong button when replying. Here are benchmarks on the Atom system:
Intel(R) Atom(TM) CPU N570 @ 1.66GHz
original, no patches:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     63.40 ns/B      15.04 MiB/s           - c/B
my C only optimization:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     46.21 ns/B      20.64 MiB/s           - c/B
my edited GCC x64 assembly:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     29.29 ns/B      32.56 MiB/s           - c/B
the SSE assembly by Jussi Kivilinna:
Hash:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 WHIRLPOOL      |     41.19 ns/B      23.15 MiB/s           - c/B
It is strange that the SSE assembly is much faster than the non-SSE version on the i5 but slower on the Atom. The bswap does not explain the difference, because I also tested an SSE version with the bswap removed and got the same results.
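For reference, here is a rough C sketch of the index-extraction idea Jussi describes in his quoted mail below: read a state word once and peel off the first two table-index bytes, which a compiler can turn into single movzbl-from-%al/%ah extractions on amd64. The function and table names here are made up for illustration and are not taken from his patch:

  #include <stdint.h>

  /* Hypothetical sketch only: the tables C0/C1 and this helper are
   * illustrative, not the actual patch code. */
  static uint64_t lookup_two(uint64_t x,
                             const uint64_t C0[256],
                             const uint64_t C1[256])
  {
    unsigned idx0 = (uint8_t)x;          /* first index byte  (bits 0..7)  */
    unsigned idx1 = (uint8_t)(x >> 8);   /* second index byte (bits 8..15) */

    /* Both indexes come from the same register load; on amd64 the
       compiler can extract them with movzbl from %al and %ah without
       an extra shift. */
    return C0[idx0] ^ C1[idx1];
  }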
-Andrei
> -----Original Message-----
> From: jussi.kivilinna at iki.fi
> Sent: Tue, 02 Sep 2014 18:06:28 +0300
> To: andsch at inbox.com
> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>
> On 02/09/14 04:02, And Sch wrote:
>> That is very impressive. The goal is accomplished, then; I just wanted a
>> faster whirlpool hash in gnupg. I'm no good with assembly, so I have no
>> hope of doing better than the compiler. You may want to title the
>> assembly as sse-amd64 now.
>>
>> Thanks
>
> Did you have a chance to run the implementation on the Atom? I'd be very
> interested to know how the performance is there.
>
> -Jussi
>
> ps. Please keep mailing-list in CC.
>
>>
>>> -----Original Message-----
>>> From: jussi.kivilinna at iki.fi
>>> Sent: Mon, 01 Sep 2014 19:15:03 +0300
>>> To: gcrypt-devel at gnupg.org
>>> Subject: Re: [PATCH 1/1] whirlpool hash amd64 assembly
>>>
>>> On 29/08/14 18:45, And Sch wrote:
>>> <snip>
>>>>
>>>> That is more than twice as fast as the original on the Atom system.
>>>>
>>>> I tried to find a way to use macros to sort out parts of the loop, but
>>>> any change in the order of the instructions slows it down a lot. There
>>>> are also only 7 registers available at one time in most parts of the
>>>> loop, so that makes macros and rearrangements even more difficult.
>>>>
>>>> I used a little endian version of the last patch I posted and gcc
>>>> -funroll-loops to generate this assembly. I've looked through it and
>>>> tried to organize it as best I can. Suggestions on how to clean it up
>>>> further would be helpful.
>>>>
>>>
>>> I don't agree that this is a good method for creating assembly
>>> implementations. As I see it, the main point of an assembly
>>> implementation is that you can do optimizations that the compiler has
>>> no way of finding. For example, you could load indexes into the
>>> rax/rbx/rcx/rdx registers so that not only the first index byte but
>>> also the second can be extracted with just one instruction. Or, use
>>> XMM registers to store the key[] and state[] arrays instead of the
>>> stack.
>>>
>>> Well, I ended up making such an implementation, which I've attached. On
>>> an Intel i5-4570 (3.6 GHz turbo), I get:
>>>
>>>> tests/bench-slope --cpu-mhz 3600 hash whirlpool
>>> Hash:
>>>                 |  nanosecs/byte   mebibytes/sec   cycles/byte
>>>  WHIRLPOOL      |      4.28 ns/B      222.7 MiB/s        15.42 c/B
>>>
>>> -Jussi
>>>
>>> _______________________________________________
>>> Gcrypt-devel mailing list
>>> Gcrypt-devel at gnupg.org
>>> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel
>>