[PATCH 4/5] aria-avx512: small optimization for aria_diff_m

Wed Feb 22 13:07:43 CET 2023

On 2023. 2. 21. 2:38 AM, Jussi Kivilinna wrote:

Hi Jussi,

 > Hello,
 >
 > On 20.2.2023 12.54, Taehee Yoo wrote:
 >> On 2/19/23 17:49, Jussi Kivilinna wrote:
 >>
 >> Hi Jussi,
 >> Thank you so much for this optimization!
 >>
 >> I tested this optimization in the kernel.
 >> It works very well.
 >> On my machine (i3-12100), it improves performance by ~9%, awesome!
 >
 > Interesting.. I'd expect alderlake to behave similarly to tigerlake. Did you
 > test with the version that has unrolled round functions?
 >
 > In libgcrypt, I changed from round unrolling to using loops in order to
 > reduce code size and to allow the code to fit into the uop-cache. Maybe the
 > speed increase happens because vpternlogq reduces the code size of the
 > unrolled version enough that the algorithm fits into the i3-12100's
 > uop-cache, giving the extra performance.
 >

After your response, I retested and found that my benchmark data was wrong.
When I implemented aria-avx512, the benchmark result was as follows.

testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)

But the current result is as follows.
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1443 cycles (1024 bytes)
tcrypt: 1 operation in 4396 cycles (4096 bytes)
tcrypt: 1 operation in 1683 cycles (1024 bytes)
tcrypt: 1 operation in 5368 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1458 cycles (1024 bytes)
tcrypt: 1 operation in 4416 cycles (4096 bytes)
tcrypt: 1 operation in 1723 cycles (1024 bytes)
tcrypt: 1 operation in 5358 cycles (4096 bytes)

And the result after applying your optimization is as follows.
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1388 cycles (1024 bytes)
tcrypt: 1 operation in 4107 cycles (4096 bytes)
tcrypt: 1 operation in 1595 cycles (1024 bytes)
tcrypt: 1 operation in 5011 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1379 cycles (1024 bytes)
tcrypt: 1 operation in 4163 cycles (4096 bytes)
tcrypt: 1 operation in 1603 cycles (1024 bytes)
tcrypt: 1 operation in 5098 cycles (4096 bytes)

The 9% performance improvement I reported was actually wrong.
I don't know why the result changed... anyway, this optimization
increases performance by 5~7%.
Also, I tested it with both the looped and the unrolled versions, but I
couldn't find any performance difference between them.
I don't have enough knowledge about the uop-cache, so I couldn't provide
useful information on that point.
Sorry that the previous benchmark result was wrong.
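
As a rough cross-check of the figure above, the per-line cycle reduction can
be computed directly from the tcrypt numbers. This is only a minimal sketch in
Python; the names (before, after, reduction_pct) are illustrative and not
part of tcrypt, and the lists simply follow the order in which the lines were
printed.

```python
# Cycle counts copied from the tcrypt output above (ecb-aria-avx512).
# "before" is the current result, "after" is with the vpternlogq patch.
# Order within each list follows the printed lines:
# 1024 bytes, 4096 bytes, 1024 bytes, 4096 bytes.
before = {
    "enc": [1443, 4396, 1683, 5368],
    "dec": [1458, 4416, 1723, 5358],
}
after = {
    "enc": [1388, 4107, 1595, 5011],
    "dec": [1379, 4163, 1603, 5098],
}


def reduction_pct(old, new):
    """Percentage of cycles saved when going from old to new."""
    return 100.0 * (old - new) / old


results = {
    op: [round(reduction_pct(o, n), 1) for o, n in zip(before[op], after[op])]
    for op in before
}

for op, pcts in results.items():
    print(op, pcts)
```

With these inputs the individual lines land between roughly 4% and 7%, so
the exact figure depends on which line is picked.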

Thank you so much!
Taehee Yoo

 > -Jussi
 >
 >> It will be really helpful to the kernel side aria-avx512 driver for
 >> improving performance.
 >>
 >>  > * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
 >>  > 3-way XOR operation.
 >>  > ---
 >>  >
 >>  > Using vpternlogq gives small performance improvement on AMD Zen4. With
 >>  > Intel tiger-lake speed is the same as before.
 >>  >
 >>  > Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
 >>  >
 >>  > Before:
 >>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 >>  >          ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
 >>  >          ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
 >>  >          CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
 >>  >          CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
 >>  >
 >>  > After (~3% faster):
 >>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 >>  >          ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
 >>  >          ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
 >>  >          CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
 >>  >          CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
 >>  >
 >>  > Cc: Taehee Yoo <ap420073 at gmail.com>
 >>  > Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
 >>  > ---
 >>  >   cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
 >>  >   1 file changed, 6 insertions(+), 10 deletions(-)
 >>  >
 >>  > diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
 >>  > index 849c744b..24a49a89 100644
 >>  > --- a/cipher/aria-gfni-avx512-amd64.S
 >>  > +++ b/cipher/aria-gfni-avx512-amd64.S
 >>  > @@ -406,21 +406,17 @@
 >>  >       vgf2p8affineinvqb $0, t2, y3, y3;        \
 >>  >       vgf2p8affineinvqb $0, t2, y7, y7;
 >>  >
 >>  > -
 >>  >   #define aria_diff_m(x0, x1, x2, x3,            \
 >>  >               t0, t1, t2, t3)            \
 >>  >       /* T = rotr32(X, 8); */                \
 >>  >       /* X ^= T */                    \
 >>  > -    vpxorq x0, x3, t0;                \
 >>  > -    vpxorq x1, x0, t1;                \
 >>  > -    vpxorq x2, x1, t2;                \
 >>  > -    vpxorq x3, x2, t3;                \
 >>  >       /* X = T ^ rotr(X, 16); */            \
 >>  > -    vpxorq t2, x0, x0;                \
 >>  > -    vpxorq x1, t3, t3;                \
 >>  > -    vpxorq t0, x2, x2;                \
 >>  > -    vpxorq t1, x3, x1;                \
 >>  > -    vmovdqu64 t3, x3;
 >>  > +    vmovdqa64 x0, t0;                \
 >>  > +    vmovdqa64 x3, t3;                \
 >>  > +    vpternlogq $0x96, x2, x1, x0;            \
 >>  > +    vpternlogq $0x96, x2, x1, x3;            \
 >>  > +    vpternlogq $0x96, t0, t3, x2;            \
 >>  > +    vpternlogq $0x96, t0, t3, x1;
 >>  >
 >>  >   #define aria_diff_word(x0, x1, x2, x3,            \
 >>  >                  x4, x5, x6, x7,            \
 >>
 >> Thank you so much!
 >> Taehee Yoo
 >>
 >
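
For reference, the quoted patch relies on vpternlogq with immediate 0x96
acting as a 3-way XOR: each result bit is the bit of the immediate selected
by the three input bits, and 0x96 has exactly the odd-parity positions set.
The sketch below models one 64-bit lane in plain Python and checks that the
removed vpxorq sequence and the new vpternlogq sequence of aria_diff_m give
the same outputs. It is only an illustrative model, not code from libgcrypt
or the kernel; ternlog, diff_m_old and diff_m_new are hypothetical names.

```python
import random


def ternlog(imm8, a, b, c):
    """Bitwise model of vpternlogq on one 64-bit lane.

    For every bit position, the bits of a, b and c form a 3-bit index
    (a is the most significant bit) that selects one bit of imm8.
    Here a plays the role of the destination operand.
    """
    out = 0
    for i in range(64):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out


def diff_m_old(x0, x1, x2, x3):
    """Model of the removed sequence: eight vpxorq plus one vmovdqu64."""
    t0 = x3 ^ x0
    t1 = x0 ^ x1
    t2 = x1 ^ x2
    t3 = x2 ^ x3
    x0 = x0 ^ t2
    t3 = t3 ^ x1
    x2 = x2 ^ t0
    x1 = x3 ^ t1
    x3 = t3
    return x0, x1, x2, x3


def diff_m_new(x0, x1, x2, x3):
    """Model of the new sequence: two vmovdqa64 plus four vpternlogq."""
    t0, t3 = x0, x3
    x0 = ternlog(0x96, x0, x1, x2)
    x3 = ternlog(0x96, x3, x1, x2)
    x2 = ternlog(0x96, x2, t3, t0)
    x1 = ternlog(0x96, x1, t3, t0)
    return x0, x1, x2, x3


# Randomized comparison of both models on 64-bit inputs.
rng = random.Random(0)
all_equal = True
for _ in range(200):
    xs = [rng.getrandbits(64) for _ in range(4)]
    if diff_m_old(*xs) != diff_m_new(*xs):
        all_equal = False

print("old and new sequences agree:", all_equal)
```

Because the operation is purely bitwise, modelling a single lane is enough;
the real instruction applies the same table to every bit of the 512-bit
register. The new sequence uses 6 instructions where the old one used 9.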