[PATCH 4/5] aria-avx512: small optimization for aria_diff_m

Mon Feb 20 18:38:49 CET 2023

Hello,

On 20.2.2023 12.54, Taehee Yoo wrote:
> On 2/19/23 17:49, Jussi Kivilinna wrote:
> 
> Hi Jussi,
> Thank you so much for this optimization!
> 
> I tested this optimization in the kernel.
> It works very well.
> On my machine (i3-12100), it improves performance by ~9%, awesome!

Interesting... I'd expect Alder Lake to behave similarly to Tiger Lake. Did you
test with the version that has unrolled round functions?

In libgcrypt, I changed from round unrolling to using loops in order to reduce
code size and to allow the code to fit into the uop cache. Maybe the speed
increase happens because vpternlogq shrinks the unrolled version enough that the
algorithm fits into the i3-12100's uop cache, giving the extra performance.
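
For reference, here is a minimal C sketch (not part of the patch) of why the
rewrite is safe: immediate 0x96 makes vpternlogq a 3-way XOR, so the new
six-instruction sequence computes the same values as the old nine-instruction
one. The helper names (ternlog64, diff_m_old, diff_m_new,
diff_m_sequences_agree) are hypothetical, and a single uint64_t stands in for
one 64-bit lane of a zmm register.

```c
#include <stdint.h>

/* Portable model of one 64-bit lane of vpternlogq: at every bit position
 * the bits of a (destination), b and c form a 3-bit index into imm8.
 * With imm8 = 0x96 (bits 1, 2, 4, 7 set = odd parity) this is a ^ b ^ c. */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)(((a >> i) & 1) << 2 |
                                  ((b >> i) & 1) << 1 |
                                  ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Old aria_diff_m: eight vpxorq plus one move. */
void diff_m_old(uint64_t x[4])
{
    uint64_t t0 = x[0] ^ x[3];
    uint64_t t1 = x[1] ^ x[0];
    uint64_t t2 = x[2] ^ x[1];
    uint64_t t3 = x[3] ^ x[2];
    x[0] ^= t2;          /* x0 ^ x1 ^ x2 */
    t3 ^= x[1];          /* x1 ^ x2 ^ x3 */
    x[2] ^= t0;          /* x0 ^ x2 ^ x3 */
    x[1] = t1 ^ x[3];    /* x0 ^ x1 ^ x3 */
    x[3] = t3;
}

/* New aria_diff_m: two moves plus four 3-way XORs. */
void diff_m_new(uint64_t x[4])
{
    uint64_t t0 = x[0], t3 = x[3];
    x[0] = ternlog64(x[0], x[1], x[2], 0x96);
    x[3] = ternlog64(x[3], x[1], x[2], 0x96);
    x[2] = ternlog64(x[2], t3, t0, 0x96);
    x[1] = ternlog64(x[1], t3, t0, 0x96);
}

/* Returns 1 if both sequences agree on 1000 pseudo-random inputs. */
int diff_m_sequences_agree(void)
{
    uint64_t s = 0x9e3779b97f4a7c15ULL;  /* arbitrary non-zero seed */
    for (int n = 0; n < 1000; n++) {
        uint64_t a[4], b[4];
        for (int i = 0; i < 4; i++) {
            s ^= s << 13; s ^= s >> 7; s ^= s << 17;  /* xorshift64 */
            a[i] = b[i] = s;
        }
        diff_m_old(a);
        diff_m_new(b);
        for (int i = 0; i < 4; i++)
            if (a[i] != b[i])
                return 0;
    }
    return 1;
}
```

This only checks functional equivalence; it says nothing about code size or
uop-cache behaviour, which still needs measuring on the actual hardware.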

-Jussi

> It will be really helpful to the kernel side aria-avx512 driver for improving performance.
> 
>  > * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
>  > 3-way XOR operation.
>  > ---
>  >
>  > Using vpternlogq gives small performance improvement on AMD Zen4. With
>  > Intel tiger-lake speed is the same as before.
>  >
>  > Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
>  >
>  > Before:
>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>  >          ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
>  >          ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
>  >          CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
>  >          CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
>  >
>  > After (~3% faster):
>  >   ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>  >          ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
>  >          ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
>  >          CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
>  >          CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
>  >
>  > Cc: Taehee Yoo <ap420073 at gmail.com>
>  > Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
>  > ---
>  >   cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
>  >   1 file changed, 6 insertions(+), 10 deletions(-)
>  >
>  > diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
>  > index 849c744b..24a49a89 100644
>  > --- a/cipher/aria-gfni-avx512-amd64.S
>  > +++ b/cipher/aria-gfni-avx512-amd64.S
>  > @@ -406,21 +406,17 @@
>  >       vgf2p8affineinvqb $0, t2, y3, y3;        \
>  >       vgf2p8affineinvqb $0, t2, y7, y7;
>  >
>  > -
>  >   #define aria_diff_m(x0, x1, x2, x3,            \
>  >               t0, t1, t2, t3)            \
>  >       /* T = rotr32(X, 8); */                \
>  >       /* X ^= T */                    \
>  > -    vpxorq x0, x3, t0;                \
>  > -    vpxorq x1, x0, t1;                \
>  > -    vpxorq x2, x1, t2;                \
>  > -    vpxorq x3, x2, t3;                \
>  >       /* X = T ^ rotr(X, 16); */            \
>  > -    vpxorq t2, x0, x0;                \
>  > -    vpxorq x1, t3, t3;                \
>  > -    vpxorq t0, x2, x2;                \
>  > -    vpxorq t1, x3, x1;                \
>  > -    vmovdqu64 t3, x3;
>  > +    vmovdqa64 x0, t0;                \
>  > +    vmovdqa64 x3, t3;                \
>  > +    vpternlogq $0x96, x2, x1, x0;            \
>  > +    vpternlogq $0x96, x2, x1, x3;            \
>  > +    vpternlogq $0x96, t0, t3, x2;            \
>  > +    vpternlogq $0x96, t0, t3, x1;
>  >
>  >   #define aria_diff_word(x0, x1, x2, x3,            \
>  >                  x4, x5, x6, x7,            \
> 
> Thank you so much!
> Taehee Yoo
>