Low level ops?

Jussi Kivilinna jussi.kivilinna at iki.fi
Mon Jun 11 22:09:55 CEST 2018


Hello,

On 10.06.2018 12:25, Stef Bon wrote:
> Hi,
> 
> I've got a ssh client to access sftp via fuse.
> Now I'm working on making parallel encryption and decryption work. I
> hope I can achieve some performance improvements.
> 
> Now I'm asking whether "low level" function calls in gcrypt can make
> things run faster. Let me explain what I mean. Take the encrypt and
> decrypt functions in cipher-cbc: these functions first check the
> blocksize and the buffer (and both). These checks are done over and
> over again, for every message. Does that slow things down a bit? If
> so, it may be worth the effort to create encrypt/decrypt calls
> without these checks. In my application the length of the output
> buffer is always equal to the length of the input buffer, and the
> blocksize is always the default blocksize for the cipher. And in ssh
> the input buffer length is always a multiple of the blocksize
> (padding is done).
> 

This depends on the size of the buffers that are passed to gcrypt. For
best performance one should pass data in buffers that are as large as
possible, although buffers of roughly 1024-2048 bytes already give close
to maximum throughput.

I ran a quick overhead test by splitting the input buffer in bench-slope
into chunks whose sizes are multiples of 16 bytes. Here are the results
for AES-CBC encryption with all HW acceleration disabled:

AES-CBC/ENC: no-overhead test, full benchmark buffers, speed 13.25 cycles/byte
AES-CBC/ENC: overhead test, benchmark buffers processed in 16 byte chunks, speed: 19.47 cycles/byte, overhead +46.9%
AES-CBC/ENC: overhead test, benchmark buffers processed in 32 byte chunks, speed: 16.36 cycles/byte, overhead +23.4%
AES-CBC/ENC: overhead test, benchmark buffers processed in 64 byte chunks, speed: 14.91 cycles/byte, overhead +12.5%
AES-CBC/ENC: overhead test, benchmark buffers processed in 128 byte chunks, speed: 14.12 cycles/byte, overhead +6.6%
AES-CBC/ENC: overhead test, benchmark buffers processed in 256 byte chunks, speed: 13.69 cycles/byte, overhead +3.3%
AES-CBC/ENC: overhead test, benchmark buffers processed in 512 byte chunks, speed: 13.55 cycles/byte, overhead +2.3%
AES-CBC/ENC: overhead test, benchmark buffers processed in 1024 byte chunks, speed: 13.42 cycles/byte, overhead +1.3%
AES-CBC/ENC: overhead test, benchmark buffers processed in 2048 byte chunks, speed: 13.36 cycles/byte, overhead +0.8%

The absolute overhead per gcry_cipher_encrypt call here is
(19.47-13.25)*16 ≃ 100 cycles. The software AES implementation has
higher overhead than other SW cipher implementations because AES needs
to prefetch its look-up tables on every gcry_cipher_encrypt call
(side-channel mitigation). Also, with a slower cipher the overhead is
smaller relative to the cost of the encryption itself. For example,
here are the results for Serpent:

SERPENT128-CBC/ENC: no-overhead test, full benchmark buffers, speed 38.23 cycles/byte
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 16 byte chunks, speed: 42.46 cycles/byte, overhead +11.1%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 32 byte chunks, speed: 40.62 cycles/byte, overhead +6.2%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 64 byte chunks, speed: 39.42 cycles/byte, overhead +3.1%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 128 byte chunks, speed: 38.83 cycles/byte, overhead +1.6%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 256 byte chunks, speed: 38.53 cycles/byte, overhead +0.8%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 512 byte chunks, speed: 38.46 cycles/byte, overhead +0.6%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 1024 byte chunks, speed: 38.36 cycles/byte, overhead +0.3%
SERPENT128-CBC/ENC: overhead test, benchmark buffers processed in 2048 byte chunks, speed: 38.27 cycles/byte, overhead +0.1%

Here the overhead per call is ~70 cycles.
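
The per-call figures above follow from the difference in cycles/byte
multiplied by the chunk size. A minimal sketch of that arithmetic in C;
the function name is made up for illustration and is not part of
libgcrypt or bench-slope:

```c
#include <assert.h>
#include <stddef.h>

/* Estimate the fixed cost of one call, in cycles, from two bench-slope
 * style measurements: cycles/byte with the whole buffer in one call, and
 * cycles/byte when the same buffer is fed in chunk_bytes-sized pieces.
 * Hypothetical helper, not a libgcrypt function. */
double per_call_overhead_cycles(double chunked_cpb, double full_cpb,
                                size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    /* Extra cycles per byte, times the bytes handled per call. */
    return (chunked_cpb - full_cpb) * (double)chunk_bytes;
}
```

With the 16 byte rows above, the AES inputs (19.47, 13.25, 16) give
about 99.5 cycles and the Serpent inputs (42.46, 38.23, 16) give about
67.7 cycles.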

With AES-NI accelerated AES, the relative overhead is larger because
the encryption itself is fast. With sufficiently large buffers,
however, the overhead becomes insignificant:

AES-CBC/ENC: no-overhead test, full benchmark buffers, speed 4.50 cycles/byte
AES-CBC/ENC: overhead test, benchmark buffers processed in 16 byte chunks, speed: 7.09 cycles/byte, overhead +57.5%
AES-CBC/ENC: overhead test, benchmark buffers processed in 32 byte chunks, speed: 5.82 cycles/byte, overhead +29.2%
AES-CBC/ENC: overhead test, benchmark buffers processed in 64 byte chunks, speed: 5.16 cycles/byte, overhead +14.5%
AES-CBC/ENC: overhead test, benchmark buffers processed in 128 byte chunks, speed: 4.83 cycles/byte, overhead +7.2%
AES-CBC/ENC: overhead test, benchmark buffers processed in 256 byte chunks, speed: 4.67 cycles/byte, overhead +3.6%
AES-CBC/ENC: overhead test, benchmark buffers processed in 512 byte chunks, speed: 4.58 cycles/byte, overhead +1.8%
AES-CBC/ENC: overhead test, benchmark buffers processed in 1024 byte chunks, speed: 4.54 cycles/byte, overhead +0.9%
AES-CBC/ENC: overhead test, benchmark buffers processed in 2048 byte chunks, speed: 4.52 cycles/byte, overhead +0.4%

Per-call overhead for AES-NI CBC/ENC is ~40 cycles.
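
One way to use such a per-call figure is a simple model, assumed here
for illustration rather than taken from libgcrypt: every call costs a
fixed number of cycles on top of the bulk cycles/byte rate. Under that
assumption one can estimate the speed for a given chunk size, or the
chunk size needed to keep the relative overhead below a target. Both
function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Modelled cycles/byte when data is fed in chunk_bytes-sized calls,
 * assuming a constant per-call cost on top of the bulk rate. */
double predicted_cpb(double full_cpb, double call_overhead_cycles,
                     size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return full_cpb + call_overhead_cycles / (double)chunk_bytes;
}

/* Smallest chunk size, rounded up to a multiple of block_bytes, for
 * which the modelled relative overhead is at most max_relative
 * (0.01 means 1%). */
size_t min_chunk_bytes(double full_cpb, double call_overhead_cycles,
                       double max_relative, size_t block_bytes)
{
    assert(full_cpb > 0.0 && max_relative > 0.0 && block_bytes > 0);
    /* overhead / (n * full_cpb) <= max_relative, solved for n */
    double needed = call_overhead_cycles / (full_cpb * max_relative);
    size_t blocks = (size_t)(needed / (double)block_bytes);
    if ((double)(blocks * block_bytes) < needed)
        blocks++;               /* round up to a whole block */
    if (blocks == 0)
        blocks = 1;             /* never return a zero-sized chunk */
    return blocks * block_bytes;
}
```

For the AES-NI numbers above (4.50 cycles/byte in bulk, ~40 cycles per
call), the model puts 1024 byte chunks at about 4.54 cycles/byte, which
matches the measured row, and min_chunk_bytes(4.50, 40.0, 0.01, 16)
returns 896, so chunks of roughly 1 KiB keep the modelled overhead near
1%.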

With parallelizable modes, such as CBC decryption and CTR, this test
no longer measures the actual call overhead, because the underlying
algorithm changes with the chunk size. With AES-NI on x86_64, CBC
decryption is done with 8 parallel blocks (128 bytes), so the results
below for 16 to 64 byte chunks show how slow the non-parallel code is
compared to the parallel code. The results from 128 byte chunks upwards
show the overhead for the parallel code:

AES-CBC/DEC: no-overhead test, full benchmark buffers, speed 0.632 cycles/byte
AES-CBC/DEC: overhead test, benchmark buffers processed in 16 byte chunks, speed: 6.56 cycles/byte, overhead +937.4%
AES-CBC/DEC: overhead test, benchmark buffers processed in 32 byte chunks, speed: 3.94 cycles/byte, overhead +523.3%
AES-CBC/DEC: overhead test, benchmark buffers processed in 64 byte chunks, speed: 1.88 cycles/byte, overhead +197.6%
AES-CBC/DEC: overhead test, benchmark buffers processed in 128 byte chunks, speed: 1.26 cycles/byte, overhead +99.4%
AES-CBC/DEC: overhead test, benchmark buffers processed in 256 byte chunks, speed: 0.946 cycles/byte, overhead +49.7%
AES-CBC/DEC: overhead test, benchmark buffers processed in 512 byte chunks, speed: 0.794 cycles/byte, overhead +25.5%
AES-CBC/DEC: overhead test, benchmark buffers processed in 1024 byte chunks, speed: 0.711 cycles/byte, overhead +12.5%
AES-CBC/DEC: overhead test, benchmark buffers processed in 2048 byte chunks, speed: 0.664 cycles/byte, overhead +5.0%

Per-call overhead for 128 byte chunks is ~80 cycles.
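
Why CBC decryption can work on several blocks at once while CBC
encryption cannot is visible in the mode itself: encrypting block i
needs the ciphertext of block i-1, whereas decrypting block i needs
only ciphertext blocks i and i-1, which are all available up front. The
sketch below illustrates this with a toy, insecure stand-in for the
block cipher. All names are hypothetical and this is not libgcrypt's
implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CBC_BLOCK_BYTES 16

/* Toy stand-in for a block cipher: NOT secure, merely invertible. */
void toy_encrypt_block(unsigned char *out, const unsigned char *in,
                       const unsigned char *key)
{
    for (size_t i = 0; i < CBC_BLOCK_BYTES; i++)
        out[i] = (unsigned char)((in[i] ^ key[i]) + 1u);
}

void toy_decrypt_block(unsigned char *out, const unsigned char *in,
                       const unsigned char *key)
{
    for (size_t i = 0; i < CBC_BLOCK_BYTES; i++)
        out[i] = (unsigned char)((in[i] - 1u) ^ key[i]);
}

/* CBC encryption: block b needs ciphertext block b-1, so it is serial. */
void cbc_encrypt(unsigned char *ct, const unsigned char *pt, size_t nblocks,
                 const unsigned char *iv, const unsigned char *key)
{
    const unsigned char *prev = iv;
    unsigned char tmp[CBC_BLOCK_BYTES];
    for (size_t b = 0; b < nblocks; b++) {
        for (size_t i = 0; i < CBC_BLOCK_BYTES; i++)
            tmp[i] = (unsigned char)(pt[b * CBC_BLOCK_BYTES + i] ^ prev[i]);
        toy_encrypt_block(ct + b * CBC_BLOCK_BYTES, tmp, key);
        prev = ct + b * CBC_BLOCK_BYTES;
    }
}

/* CBC decryption of a single block: reads only ciphertext blocks b and
 * b-1, so blocks can be handled independently and in any order. */
void cbc_decrypt_one(unsigned char *pt, const unsigned char *ct, size_t b,
                     const unsigned char *iv, const unsigned char *key)
{
    /* Out-of-place only, so earlier ciphertext blocks stay available. */
    assert(pt != ct);
    const unsigned char *prev = (b == 0) ? iv : ct + (b - 1) * CBC_BLOCK_BYTES;
    unsigned char tmp[CBC_BLOCK_BYTES];
    toy_decrypt_block(tmp, ct + b * CBC_BLOCK_BYTES, key);
    for (size_t i = 0; i < CBC_BLOCK_BYTES; i++)
        pt[b * CBC_BLOCK_BYTES + i] = (unsigned char)(tmp[i] ^ prev[i]);
}

/* Returns 1 if decrypting 8 blocks in reverse order recovers the input. */
int cbc_reverse_order_roundtrip(void)
{
    unsigned char key[CBC_BLOCK_BYTES], iv[CBC_BLOCK_BYTES];
    unsigned char pt[8 * CBC_BLOCK_BYTES], ct[8 * CBC_BLOCK_BYTES];
    unsigned char out[8 * CBC_BLOCK_BYTES];
    for (size_t i = 0; i < CBC_BLOCK_BYTES; i++) {
        key[i] = (unsigned char)(7 * i + 3);
        iv[i] = (unsigned char)(255 - i);
    }
    for (size_t i = 0; i < sizeof pt; i++)
        pt[i] = (unsigned char)(5 * i);
    cbc_encrypt(ct, pt, 8, iv, key);
    for (size_t b = 8; b-- > 0; )
        cbc_decrypt_one(out, ct, b, iv, key);
    return memcmp(out, pt, sizeof pt) == 0;
}
```

Because each decrypted block is independent, an implementation is free
to push 8 of them through the cipher together, which is why chunks
smaller than 128 bytes fall back to slower code in the results above.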

> It's also possible that these checks do not cost anything. I don't know.

The checks should not cost that much. Stack burning and prefetching
(SW AES) cost more, but you probably won't want to remove those.

-Jussi

> 
> Stef Bon
> the Netherlands