[PATCH 4/5] aria-avx512: small optimization for aria_diff_m
Jussi Kivilinna
jussi.kivilinna at iki.fi
Thu Feb 23 17:30:32 CET 2023
On 22.2.2023 14.07, Taehee Yoo wrote:
> On 2023. 2. 21. 2:38 AM, Jussi Kivilinna wrote:
>
> Hi Jussi,
>
> > Hello,
> >
> > On 20.2.2023 12.54, Taehee Yoo wrote:
> >> On 2/19/23 17:49, Jussi Kivilinna wrote:
> >>
> >> Hi Jussi,
> >> Thank you so much for this optimization!
> >>
> >> I tested this optimization in the kernel.
> >> It works very well.
> >> On my machine (i3-12100), it improves performance by ~9%, awesome!
> >
> > Interesting.. I'd expect alderlake to behave similarly to tigerlake. Did you
> > test with the version that has unrolled round functions?
> >
> > In libgcrypt, I changed from round unrolling to using loops in order to
> > reduce code size and to allow the code to fit into the uop-cache. Maybe the
> > speed increase happens because vpternlogq reduces the code size of the
> > unrolled version enough that the algorithm fits into the i3-12100's
> > uop-cache, giving the extra performance.
> >
>
> After your response, I retested it and found that my benchmark data was wrong.
> When I first implemented aria-avx512, the benchmark result was as follows.
>
> testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1504 cycles (1024 bytes)
> tcrypt: 1 operation in 4595 cycles (4096 bytes)
> tcrypt: 1 operation in 1763 cycles (1024 bytes)
> tcrypt: 1 operation in 5540 cycles (4096 bytes)
> testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1502 cycles (1024 bytes)
> tcrypt: 1 operation in 4615 cycles (4096 bytes)
> tcrypt: 1 operation in 1759 cycles (1024 bytes)
> tcrypt: 1 operation in 5554 cycles (4096 bytes)
>
> But the current result is as follows.
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1443 cycles (1024 bytes)
> tcrypt: 1 operation in 4396 cycles (4096 bytes)
> tcrypt: 1 operation in 1683 cycles (1024 bytes)
> tcrypt: 1 operation in 5368 cycles (4096 bytes)
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1458 cycles (1024 bytes)
> tcrypt: 1 operation in 4416 cycles (4096 bytes)
> tcrypt: 1 operation in 1723 cycles (1024 bytes)
> tcrypt: 1 operation in 5358 cycles (4096 bytes)
>
> And with your optimization applied, the result is as follows.
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
> tcrypt: 1 operation in 1388 cycles (1024 bytes)
> tcrypt: 1 operation in 4107 cycles (4096 bytes)
> tcrypt: 1 operation in 1595 cycles (1024 bytes)
> tcrypt: 1 operation in 5011 cycles (4096 bytes)
> tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
> tcrypt: 1 operation in 1379 cycles (1024 bytes)
> tcrypt: 1 operation in 4163 cycles (4096 bytes)
> tcrypt: 1 operation in 1603 cycles (1024 bytes)
> tcrypt: 1 operation in 5098 cycles (4096 bytes)
>
> The 9% performance gain I reported earlier was actually wrong.
> I don't know why the result changed... anyway, this optimization increases performance by 5~7%.
> Also, I tested it with both the looped and the unrolled versions, but I couldn't find any performance difference.
> I don't have enough knowledge about the uop-cache, so I couldn't provide anything useful on that point.
> Sorry that the previous benchmark result was wrong.
Ok, thanks for testing. I was just wondering where the improvement came from.
Anyway, good to see that there was a performance increase on another CPU in
addition to AMD Zen4.
-Jussi
>
> Thank you so much!
> Taehee Yoo
>
>
> > -Jussi
> >
> >> It will be really helpful for improving the performance of the
> >> kernel-side aria-avx512 driver.
> >>
> >> > * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
> >> > 3-way XOR operation.
> >> > ---
> >> >
> >> > Using vpternlogq gives a small performance improvement on AMD Zen4. With
> >> > Intel tiger-lake, the speed is the same as before.
> >> >
> >> > Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
> >> >
> >> > Before:
> >> >  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> >> >         ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
> >> >         ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
> >> >         CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
> >> >         CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
> >> >
> >> > After (~3% faster):
> >> >  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> >> >         ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
> >> >         ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
> >> >         CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
> >> >         CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
> >> >
> >> > Cc: Taehee Yoo <ap420073 at gmail.com>
> >> > Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
> >> > ---
> >> > cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
> >> > 1 file changed, 6 insertions(+), 10 deletions(-)
> >> >
> >> > diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
> >> > index 849c744b..24a49a89 100644
> >> > --- a/cipher/aria-gfni-avx512-amd64.S
> >> > +++ b/cipher/aria-gfni-avx512-amd64.S
> >> > @@ -406,21 +406,17 @@
> >> > vgf2p8affineinvqb $0, t2, y3, y3; \
> >> > vgf2p8affineinvqb $0, t2, y7, y7;
> >> >
> >> > -
> >> > #define aria_diff_m(x0, x1, x2, x3, \
> >> > t0, t1, t2, t3) \
> >> > /* T = rotr32(X, 8); */ \
> >> > /* X ^= T */ \
> >> > - vpxorq x0, x3, t0; \
> >> > - vpxorq x1, x0, t1; \
> >> > - vpxorq x2, x1, t2; \
> >> > - vpxorq x3, x2, t3; \
> >> > /* X = T ^ rotr(X, 16); */ \
> >> > - vpxorq t2, x0, x0; \
> >> > - vpxorq x1, t3, t3; \
> >> > - vpxorq t0, x2, x2; \
> >> > - vpxorq t1, x3, x1; \
> >> > - vmovdqu64 t3, x3;
> >> > + vmovdqa64 x0, t0; \
> >> > + vmovdqa64 x3, t3; \
> >> > + vpternlogq $0x96, x2, x1, x0; \
> >> > + vpternlogq $0x96, x2, x1, x3; \
> >> > + vpternlogq $0x96, t0, t3, x2; \
> >> > + vpternlogq $0x96, t0, t3, x1;
> >> >
> >> > #define aria_diff_word(x0, x1, x2, x3, \
> >> > x4, x5, x6, x7, \
> >>
> >> Thank you so much!
> >> Taehee Yoo
> >>
> >
>
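
For readers following the aria_diff_m change in the quoted patch: the
immediate 0x96 selects the three-input XOR truth table, so each
'vpternlogq $0x96' replaces two dependent vpxorq operations. The sketch
below is only a bit-level model in Python, not the libgcrypt or kernel
code. It assumes each ZMM register can be treated as a 512-bit integer,
and the function names (vpternlog, diff_m_old, diff_m_new) are made up
for illustration. It checks that the removed and added instruction
sequences produce the same four outputs.

```python
import random

MASK512 = (1 << 512) - 1  # one ZMM register modelled as a 512-bit integer


def vpternlog(imm8, dst, src1, src2):
    """Bitwise model of vpternlogq.

    Each result bit is bit number (dst<<2 | src1<<1 | src2) of imm8,
    where dst, src1, src2 are the corresponding input bits.
    """
    result = 0
    for index in range(8):
        if not (imm8 >> index) & 1:
            continue
        # Select the bit positions whose (dst, src1, src2) bits spell `index`.
        a = dst if index & 4 else ~dst
        b = src1 if index & 2 else ~src1
        c = src2 if index & 1 else ~src2
        result |= a & b & c
    return result & MASK512


def diff_m_old(x0, x1, x2, x3):
    """Model of the removed sequence: eight vpxorq plus one register move."""
    t0 = x0 ^ x3
    t1 = x1 ^ x0
    t2 = x2 ^ x1
    t3 = x3 ^ x2
    x0 = t2 ^ x0
    t3 = x1 ^ t3
    x2 = t0 ^ x2
    x1 = t1 ^ x3
    x3 = t3
    return x0, x1, x2, x3


def diff_m_new(x0, x1, x2, x3):
    """Model of the added sequence: two moves plus four vpternlogq $0x96."""
    t0 = x0
    t3 = x3
    # AT&T operand order in the patch is: imm8, src2, src1, dst.
    x0 = vpternlog(0x96, x0, x1, x2)
    x3 = vpternlog(0x96, x3, x1, x2)
    x2 = vpternlog(0x96, x2, t3, t0)
    x1 = vpternlog(0x96, x1, t3, t0)
    return x0, x1, x2, x3


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        regs = [rng.getrandbits(512) for _ in range(4)]
        assert diff_m_old(*regs) == diff_m_new(*regs)
    print("old and new aria_diff_m sequences agree")
```

In both versions every output is the XOR of three of the four inputs
(for example, the new x0 is x0 ^ x1 ^ x2), which is why a single ternary
logic instruction per output is enough.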