Fwd: Low level ops?

Wed Jun 20 12:11:02 CEST 2018

Sorry for the double post. My mail client did not show enough information.

---------- Forwarded message ---------
From: Stef Bon <stefbon at gmail.com>
Date: Tue 19 Jun 2018 at 23:10
Subject: Re: Low level ops?
To: Jussi Kivilinna <jussi.kivilinna at iki.fi>

On Tue 19 Jun 2018 at 18:28, Jussi Kivilinna <jussi.kivilinna at iki.fi> wrote:
>
> I made changes on weekend to reduce the overhead for cipher operations.
>

OK! I don't want to seem impatient, but I did not know you had been working
on the issue, so I posted again.

> When I tried to get those patches to the mailing-list they just would
> not get through. I've spent the past two nights trying to figure out what
> the ____ is wrong with my mail setup.
>
Ugh.

> Anyway, overhead for example for AESNI/CBC decryption has reduced
> from ~80 cycles per call to ~30 cycles. The remaining 30 cycles seem
> to be mainly caused by the optimized AESNI/CBC decryption function
> itself. AESNI/CBC encryption function is less complex and overhead
> for it is now 9 cycles per call (was 40 cycles).
>
From ~80 to ~30 is already impressive. I'm very interested in testing this.

> > And what about chacha20-poly1305@openssh.com?
>
> If you check the chacha20-poly1305 in OpenSSH, you see that for each
> packet you need to perform one extra chacha20 block encryption, which
> alone is going to cost over 400 cycles.
>

I've already implemented this, using your example code. It's working
perfectly. I know I mailed earlier about this and reported that it's working.

>
> To utilize this, you need to provide input buffers larger than
> blocksize to libgcrypt. For AESNI implementations, you get best
> performance starting with buffer size of 8 blocks or 8*16=128
> bytes. For Chacha20, you need 4 blocks or 4*64=256 bytes.
>

Uhm, larger input buffers? I cannot play with that. My code calls
gcry_cipher_encrypt(c_handle, packet->buffer, packet->len, NULL, 0)
to encrypt and gcry_cipher_decrypt(c_handle, packet->buffer,
packet->len, NULL, 0) to decrypt (leaving out some details, such as
decrypting the first block first to get the length).

So the whole message is passed in one call. As mentioned before, the
messages vary from 128 to 1024 bytes, with some bigger ones (readdir and
read/write).
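
To put numbers on this, here is a rough sketch in C that uses only the
figures quoted above (8 blocks of 16 bytes for AESNI, 4 blocks of 64 bytes
for Chacha20). The constant and function names are hypothetical and not part
of libgcrypt; the code only checks whether a packet passed in a single call
contains enough whole blocks to reach the quoted threshold.

```c
#include <stddef.h>

/* Figures taken from the quoted text; the names are illustrative only. */
enum {
    AES_BLOCK_SIZE     = 16,
    AES_BULK_BLOCKS    = 8,   /* 8 * 16 = 128 bytes */
    CHACHA_BLOCK_SIZE  = 64,
    CHACHA_BULK_BLOCKS = 4    /* 4 * 64 = 256 bytes */
};

/* Number of whole cipher blocks contained in a packet of len bytes. */
size_t whole_blocks(size_t len, size_t blocksize)
{
    return len / blocksize;
}

/* Nonzero if one call with len bytes offers at least min_blocks whole blocks. */
int reaches_bulk_threshold(size_t len, size_t blocksize, size_t min_blocks)
{
    return whole_blocks(len, blocksize) >= min_blocks;
}
```

By these figures a 128-byte packet just reaches the AESNI threshold (8
blocks) but not the Chacha20 one (2 blocks), while a 1024-byte packet
exceeds both.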

I think it's good to distinguish two different uses of parallel processing:

1. Parallel processing of one message by splitting it into blocks
(each of size blocksize, of course). As I understand it, libgcrypt handles
this automatically when the message is large enough and the cipher allows it.

2. Parallel processing of two or more messages at the same time. Some
ciphers, like chacha20-poly1305@openssh.com and aes256-ctr, allow this:
the starting state of the cipher does not depend on the previous
message. I have not tested this, but in some "heavy load" situations it
may give some performance improvement. With fuse/sftp such situations
arise after a readdir, when a lot of lookup calls are done, one for every
entry in the directory.
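
As a small illustration of the second case: for
chacha20-poly1305@openssh.com the nonce is, as far as I understand the
OpenSSH protocol description, the packet sequence number encoded as a
big-endian 64-bit value. The starting state of each packet can therefore be
derived without looking at the previous packet. A minimal sketch in C; the
function name is hypothetical and no libgcrypt calls are shown.

```c
#include <stdint.h>

/* Encode a packet sequence number as an 8-byte big-endian nonce.
 * Hypothetical helper: the result depends only on seqnr, not on the
 * contents of any earlier packet. */
void seqnr_to_nonce(uint64_t seqnr, unsigned char nonce[8])
{
    for (int i = 7; i >= 0; i--) {
        nonce[i] = (unsigned char)(seqnr & 0xff);  /* lowest byte goes last */
        seqnr >>= 8;
    }
}
```

Because every nonce can be computed up front from the sequence number,
several packets could in principle be prepared independently of each other.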

Stef Bon