[PATCH 4/5] aria-avx512: small optimization for aria_diff_m

Thu Feb 23 17:30:32 CET 2023

On 22.2.2023 14.07, Taehee Yoo wrote:
> On 2023. 2. 21. 2:38 AM, Jussi Kivilinna wrote:
> 
> Hi Jussi,
> 
>  > Hello,
>  >
>  > On 20.2.2023 12.54, Taehee Yoo wrote:
>  >> On 2/19/23 17:49, Jussi Kivilinna wrote:
>  >>
>  >> Hi Jussi,
>  >> Thank you so much for this optimization!
>  >>
>  >> I tested this optimization in the kernel.
>  >> It works very well.
>  >> On my machine (i3-12100), it improves performance by ~9%, awesome!
>  >
>  > Interesting.. I'd expect alderlake to behave similarly to tigerlake. Did you
>  > test with the version that has unrolled round functions?
>  >
>  > In libgcrypt, I changed from round unrolling to using loops in order to
>  > reduce code size and to allow the code to fit into the uop-cache. Maybe the
>  > speed increase happens because vpternlogq reduces the code size of the
>  > unrolled version enough that the algorithm fits into the i3-12100's
>  > uop-cache, giving the extra performance.
>  >
> 
> After your response, I retested it and found that my benchmark data was wrong.
> When I implemented aria-avx512, the benchmark result was as below.
> 
> testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1504 cycles (1024 bytes)
> tcrypt: 1 operation in 4595 cycles (4096 bytes)
> tcrypt: 1 operation in 1763 cycles (1024 bytes)
> tcrypt: 1 operation in 5540 cycles (4096 bytes)
> testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1502 cycles (1024 bytes)
> tcrypt: 1 operation in 4615 cycles (4096 bytes)
> tcrypt: 1 operation in 1759 cycles (1024 bytes)
> tcrypt: 1 operation in 5554 cycles (4096 bytes)
> 
> But the current result (without your optimization) is like this.
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1443 cycles (1024 bytes)
> tcrypt: 1 operation in 4396 cycles (4096 bytes)
> tcrypt: 1 operation in 1683 cycles (1024 bytes)
> tcrypt: 1 operation in 5368 cycles (4096 bytes)
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1458 cycles (1024 bytes)
> tcrypt: 1 operation in 4416 cycles (4096 bytes)
> tcrypt: 1 operation in 1723 cycles (1024 bytes)
> tcrypt: 1 operation in 5358 cycles (4096 bytes)
> 
> And after applying your optimization, the result is like this.
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1388 cycles (1024 bytes)
> tcrypt: 1 operation in 4107 cycles (4096 bytes)
> tcrypt: 1 operation in 1595 cycles (1024 bytes)
> tcrypt: 1 operation in 5011 cycles (4096 bytes)
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1379 cycles (1024 bytes)
> tcrypt: 1 operation in 4163 cycles (4096 bytes)
> tcrypt: 1 operation in 1603 cycles (1024 bytes)
> tcrypt: 1 operation in 5098 cycles (4096 bytes)
> 
> The 9% performance gain I reported was actually wrong.
> I don't know why the result changed... anyway, this optimization increases performance by 5~7%.
> Also, I tested both the looped and the unrolled versions, but I couldn't find any performance gap between them.
> I don't have enough knowledge about the uop-cache, so I couldn't provide anything useful on that point.
> Sorry that the previous benchmark result was wrong.

Ok, thanks for testing. I was just wondering where the improvement came from.
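
For reference, the relative gain can be worked out directly from the tcrypt
cycle counts quoted above. A minimal sketch in C; the helper names are made up
for illustration and are not part of tcrypt, and pairing the rows of the
"current" and "after" runs one-to-one is an assumption:

```c
#include <stdio.h>

/* Hypothetical helper: percentage of cycles saved going from 'before'
 * to 'after'. */
double cycle_reduction_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Prints the reduction for the first two encryption rows quoted above
 * (current result vs. result with the optimization applied). */
void print_quoted_rows(void)
{
    printf("1024 bytes: %.1f%%\n", cycle_reduction_pct(1443, 1388));
    printf("4096 bytes: %.1f%%\n", cycle_reduction_pct(4396, 4107));
}
```

The same helper can be applied to the remaining rows to see how much the gain
varies between block sizes and between encryption and decryption.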

Anyway, good to see that there was a performance increase on another CPU in
addition to AMD Zen4.
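
To spell out why the rewrite in the patch is safe: with imm8 = 0x96 the
truth table of vpternlogq has its bits set exactly where an odd number of the
three inputs is 1, so it is a 3-way XOR. Below is a small scalar model in C
that checks the old and new aria_diff_m sequences against each other. It is
only a sketch and not part of the patch: the function names are made up, and a
single uint64_t stands in for a 512-bit register.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of vpternlogq: at every bit position
 * the bits of (dst, src1, src2) form a 3-bit index into imm8. */
uint64_t ternlog64(uint8_t imm8, uint64_t dst, uint64_t src1, uint64_t src2)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((dst  >> i) & 1) << 2) |
                                  (((src1 >> i) & 1) << 1) |
                                   ((src2 >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Model of the removed sequence: eight vpxorq and one vmovdqu64. */
void diff_m_old(uint64_t x[4])
{
    uint64_t t0 = x[3] ^ x[0];
    uint64_t t1 = x[0] ^ x[1];
    uint64_t t2 = x[1] ^ x[2];
    uint64_t t3 = x[2] ^ x[3];
    x[0] ^= t2;
    t3   ^= x[1];
    x[2] ^= t0;
    x[1]  = x[3] ^ t1;
    x[3]  = t3;
}

/* Model of the added sequence: two moves and four vpternlogq $0x96. */
void diff_m_new(uint64_t x[4])
{
    uint64_t t0 = x[0], t3 = x[3];
    x[0] = ternlog64(0x96, x[0], x[1], x[2]);
    x[3] = ternlog64(0x96, x[3], x[1], x[2]);
    x[2] = ternlog64(0x96, x[2], t3, t0);
    x[1] = ternlog64(0x96, x[1], t3, t0);
}

/* Returns 1 if both models produce the same four outputs for 'in'. */
int diff_m_matches(const uint64_t in[4])
{
    uint64_t a[4], b[4];
    for (int i = 0; i < 4; i++)
        a[i] = b[i] = in[i];
    diff_m_old(a);
    diff_m_new(b);
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling diff_m_matches() on a few arbitrary inputs is enough to convince
oneself here, since both sequences are linear over XOR and every output is the
XOR of three of the four inputs.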

-Jussi

> 
> Thank you so much!
> Taehee Yoo
> 
> 
>  > -Jussi
>  >
>  >> It will be really helpful to the kernel side aria-avx512 driver for
>  >> improving performance.
>  >>
>  >>  > * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
>  >>  > 3-way XOR operation.
>  >>  > ---
>  >>  >
>  >>  > Using vpternlogq gives a small performance improvement on AMD Zen4. On
>  >>  > Intel tiger-lake, speed is the same as before.
>  >>  >
>  >>  > Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
>  >>  >
>  >>  > Before:
>  >>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>  >>  >          ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
>  >>  >          ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
>  >>  >          CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
>  >>  >          CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
>  >>  >
>  >>  > After (~3% faster):
>  >>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>  >>  >          ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
>  >>  >          ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
>  >>  >          CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
>  >>  >          CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
>  >>  >
>  >>  > Cc: Taehee Yoo <ap420073 at gmail.com>
>  >>  > Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
>  >>  > ---
>  >>  >   cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
>  >>  >   1 file changed, 6 insertions(+), 10 deletions(-)
>  >>  >
>  >>  > diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
>  >>  > index 849c744b..24a49a89 100644
>  >>  > --- a/cipher/aria-gfni-avx512-amd64.S
>  >>  > +++ b/cipher/aria-gfni-avx512-amd64.S
>  >>  > @@ -406,21 +406,17 @@
>  >>  >       vgf2p8affineinvqb $0, t2, y3, y3;        \
>  >>  >       vgf2p8affineinvqb $0, t2, y7, y7;
>  >>  >
>  >>  > -
>  >>  >   #define aria_diff_m(x0, x1, x2, x3,            \
>  >>  >               t0, t1, t2, t3)            \
>  >>  >       /* T = rotr32(X, 8); */                \
>  >>  >       /* X ^= T */                    \
>  >>  > -    vpxorq x0, x3, t0;                \
>  >>  > -    vpxorq x1, x0, t1;                \
>  >>  > -    vpxorq x2, x1, t2;                \
>  >>  > -    vpxorq x3, x2, t3;                \
>  >>  >       /* X = T ^ rotr(X, 16); */            \
>  >>  > -    vpxorq t2, x0, x0;                \
>  >>  > -    vpxorq x1, t3, t3;                \
>  >>  > -    vpxorq t0, x2, x2;                \
>  >>  > -    vpxorq t1, x3, x1;                \
>  >>  > -    vmovdqu64 t3, x3;
>  >>  > +    vmovdqa64 x0, t0;                \
>  >>  > +    vmovdqa64 x3, t3;                \
>  >>  > +    vpternlogq $0x96, x2, x1, x0;            \
>  >>  > +    vpternlogq $0x96, x2, x1, x3;            \
>  >>  > +    vpternlogq $0x96, t0, t3, x2;            \
>  >>  > +    vpternlogq $0x96, t0, t3, x1;
>  >>  >
>  >>  >   #define aria_diff_word(x0, x1, x2, x3,            \
>  >>  >                  x4, x5, x6, x7,            \
>  >>
>  >> Thank you so much!
>  >> Taehee Yoo
>  >>
>  >
>