Low level ops?

Tue Jun 19 18:28:37 CEST 2018

Hello,

On 19.06.2018 08:27, Stef Bon wrote:
> On Wed, 13 Jun 2018 at 21:21, Stef Bon <stefbon at gmail.com> wrote:
>>
>>
>> So as I see it, it would be worth trying to reduce the overhead for
>> AES-CBC//DEC, since they vary from 99% to 12.5%, as the size of most
>> ssh messages is between 128 and 1024 bytes.
>>
>> You mention parallel mode for AES-CBC/DEC. Is it possible to use this
>> from the api?
>> And do you know how this applies to chacha20-poly1305@openssh.com?
>>
> 
> Hi,
> 
> Can you please take a look at my remarks? I think that it's useful to
> reduce the overhead for the mentioned ciphers.

I made changes over the weekend to reduce the overhead for cipher
operations.

When I tried to send those patches to the mailing list, they just would
not get through. I've spent the past two nights trying to figure out
what the ____ is wrong with my mail setup.

Anyway, the overhead for AESNI/CBC decryption, for example, has been
reduced from ~80 cycles per call to ~30 cycles. The remaining 30 cycles
seem to be mainly caused by the optimized AESNI/CBC decryption function
itself. The AESNI/CBC encryption function is less complex, and its
overhead is now 9 cycles per call (was 40 cycles).
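
To put these per-call figures in relation to message size, the fixed
overhead can be spread over the bytes of one message. The following is
only a back-of-the-envelope sketch; the helper names are made up for
illustration and are not part of libgcrypt, and the cycle counts are the
ones quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed per-call overhead spread over one message, in cycles per byte.
   Illustrative helper, not a libgcrypt function. */
double overhead_cycles_per_byte(double overhead_cycles, size_t msg_bytes)
{
    assert(msg_bytes > 0);
    return overhead_cycles / (double)msg_bytes;
}

/* Cycles saved on every call when the fixed overhead drops. */
double cycles_saved_per_call(double old_overhead, double new_overhead)
{
    return old_overhead - new_overhead;
}
```

For a 128-byte message, 80 cycles of overhead is 0.625 cycles per byte,
and 30 cycles is about 0.234. For a 1024-byte message the same overheads
come to about 0.078 and 0.029 cycles per byte, so the fixed cost matters
most for the small messages.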

> And what about chacha20-poly1305@openssh.com?

If you check the chacha20-poly1305 implementation in OpenSSH, you will
see that each packet requires one extra chacha20 block encryption, which
alone is going to cost over 400 cycles.
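
As a rough sketch of what that extra block means for small packets, the
number of chacha20 blocks per packet can be counted as the payload
blocks plus one. The function names below are hypothetical, and the
model counts only the single extra block mentioned above; any other
per-packet work is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHACHA20_BLOCK_SIZE 64u  /* chacha20 produces 64-byte blocks */

/* Number of 64-byte chacha20 blocks needed to cover the payload. */
size_t chacha20_payload_blocks(size_t payload_bytes)
{
    /* Guard the round-up addition against overflow. */
    assert(payload_bytes <= SIZE_MAX - (CHACHA20_BLOCK_SIZE - 1));
    return (payload_bytes + CHACHA20_BLOCK_SIZE - 1) / CHACHA20_BLOCK_SIZE;
}

/* Payload blocks plus the one extra per-packet block (assumed model). */
size_t chacha20_blocks_per_packet(size_t payload_bytes)
{
    return chacha20_payload_blocks(payload_bytes) + 1;
}
```

Under this model a 128-byte payload needs 2 + 1 = 3 blocks, while a
1024-byte payload needs 16 + 1 = 17, so the extra block weighs much
more heavily on short packets.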

If you want to see how to implement chacha20-poly1305@openssh.com with
libgcrypt, check the following commit, where I've changed OpenSSH to use
libgcrypt:
 https://github.com/jkivilin/openssh-portable/commit/dd4d06bb47cbbbe3607b9be30f17f1495adbeb12

> And what about controlling the parallel handling through the API?

Parallel handling is automatic for cipher modes that are
parallelizable (depending on your CPU's feature set and which
implementations are available). These are CTR mode, CBC decryption,
CFB decryption, XTS, and OCB. EAX, GCM and CCM modes use CTR for
encryption/decryption and benefit from CTR-mode optimizations too.
The Chacha20 and Salsa20 stream ciphers also have parallel block
optimizations.

To utilize this, you need to provide libgcrypt with input buffers
larger than the block size. For AESNI implementations, you get the best
performance starting with a buffer size of 8 blocks, or 8*16=128
bytes. For Chacha20, you need 4 blocks, or 4*64=256 bytes.
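
A caller could check whether a message is large enough to reach these
sizes before deciding how to batch its data. This is a minimal sketch
with made-up helper names; the block counts are the ones given above and
may differ for other CPUs or implementations. In practice the point is
to hand the whole buffer to a single gcry_cipher_decrypt() call rather
than looping over it one block at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Figures taken from the text above; not queried from libgcrypt. */
enum { AES_BLOCK_BYTES = 16, AESNI_PARALLEL_BLOCKS = 8 };
enum { CHACHA20_BLOCK_BYTES = 64, CHACHA20_PARALLEL_BLOCKS = 4 };

/* Smallest buffer size, in bytes, that covers parallel_blocks blocks. */
size_t parallel_threshold_bytes(size_t block_bytes, size_t parallel_blocks)
{
    return block_bytes * parallel_blocks;
}

/* Returns 1 if buf_bytes is at least the threshold, 0 otherwise. */
int reaches_parallel_threshold(size_t buf_bytes, size_t block_bytes,
                               size_t parallel_blocks)
{
    assert(block_bytes > 0 && parallel_blocks > 0);
    return buf_bytes >= parallel_threshold_bytes(block_bytes,
                                                 parallel_blocks);
}
```

With these figures, a 128-byte ssh message just reaches the AESNI
threshold but is still below the 256 bytes needed for Chacha20.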

-Jussi

> 
> Thanks,
> 
> Stef