[PATCH 4/5] aria-avx512: small optimization for aria_diff_m
Taehee Yoo
ap420073 at gmail.com
Mon Feb 20 11:54:30 CET 2023
On 2/19/23 17:49, Jussi Kivilinna wrote:
Hi Jussi,
Thank you so much for this optimization!
I tested this optimization in the kernel.
It works very well.
On my machine (i3-12100), it improves performance by ~9%, awesome!
It will be really helpful for improving the performance of the
kernel-side aria-avx512 driver.
> * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
> 3-way XOR operation.
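
For readers wondering where the $0x96 immediate in the patch comes from: vpternlogq uses its 8-bit immediate as a truth table indexed by the three input bits. Below is a minimal scalar sketch in Python, not the actual instruction. The helper names (ternlog, imm8_for, XOR3_IMM) are made up for illustration, and one 64-bit lane is modelled as a plain integer.

```python
def ternlog(imm8, a, b, c, width=64):
    """Scalar model of a three-input ternary-logic operation.

    For each bit position, the bits of a, b and c form a 3-bit index
    (a is the most significant bit) and the result bit is that bit of imm8.
    """
    mask = (1 << width) - 1
    result = 0
    for idx in range(8):
        if (imm8 >> idx) & 1:
            # Select positions where (a, b, c) equals this index's bit pattern.
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            result |= ta & tb & tc
    return result & mask


def imm8_for(func):
    """Build the 8-bit truth table for a one-bit function of (a, b, c)."""
    imm = 0
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        imm |= (func(a, b, c) & 1) << idx
    return imm


# Truth table of a 3-way XOR: set for inputs with an odd number of 1 bits.
XOR3_IMM = imm8_for(lambda a, b, c: a ^ b ^ c)
```

Under this model XOR3_IMM comes out as 0x96, so a single ternary-logic operation with that immediate replaces two chained XORs.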
> ---
>
> Using vpternlogq gives small performance improvement on AMD Zen4. With
> Intel tiger-lake speed is the same as before.
>
> Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
>
> Before:
>  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
>         ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
>         CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
>         CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
>
> After (~3% faster):
>  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
>         ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
>         CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
>         CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
>
> Cc: Taehee Yoo <ap420073 at gmail.com>
> Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
> ---
> cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
> index 849c744b..24a49a89 100644
> --- a/cipher/aria-gfni-avx512-amd64.S
> +++ b/cipher/aria-gfni-avx512-amd64.S
> @@ -406,21 +406,17 @@
> vgf2p8affineinvqb $0, t2, y3, y3; \
> vgf2p8affineinvqb $0, t2, y7, y7;
>
> -
> #define aria_diff_m(x0, x1, x2, x3, \
> t0, t1, t2, t3) \
> /* T = rotr32(X, 8); */ \
> /* X ^= T */ \
> - vpxorq x0, x3, t0; \
> - vpxorq x1, x0, t1; \
> - vpxorq x2, x1, t2; \
> - vpxorq x3, x2, t3; \
> /* X = T ^ rotr(X, 16); */ \
> - vpxorq t2, x0, x0; \
> - vpxorq x1, t3, t3; \
> - vpxorq t0, x2, x2; \
> - vpxorq t1, x3, x1; \
> - vmovdqu64 t3, x3;
> + vmovdqa64 x0, t0; \
> + vmovdqa64 x3, t3; \
> + vpternlogq $0x96, x2, x1, x0; \
> + vpternlogq $0x96, x2, x1, x3; \
> + vpternlogq $0x96, t0, t3, x2; \
> + vpternlogq $0x96, t0, t3, x1;
>
> #define aria_diff_word(x0, x1, x2, x3, \
> x4, x5, x6, x7, \
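
As a sanity check on the rewritten aria_diff_m, here is a rough scalar sketch in Python that models both instruction sequences from the diff above and compares their outputs on random inputs. It treats each register as one 64-bit integer rather than a 512-bit vector, and the function names (diff_m_old, diff_m_new, check_equivalence) are hypothetical, chosen only for this illustration.

```python
import random

MASK = (1 << 64) - 1  # one 64-bit lane stands in for a full vector register


def diff_m_old(x0, x1, x2, x3):
    """Model of the removed sequence: eight vpxorq plus one vmovdqu64."""
    t0 = x0 ^ x3   # vpxorq x0, x3, t0
    t1 = x1 ^ x0   # vpxorq x1, x0, t1
    t2 = x2 ^ x1   # vpxorq x2, x1, t2
    t3 = x3 ^ x2   # vpxorq x3, x2, t3
    x0 = t2 ^ x0   # vpxorq t2, x0, x0
    t3 = x1 ^ t3   # vpxorq x1, t3, t3
    x2 = t0 ^ x2   # vpxorq t0, x2, x2
    x1 = t1 ^ x3   # vpxorq t1, x3, x1
    x3 = t3        # vmovdqu64 t3, x3
    return x0, x1, x2, x3


def diff_m_new(x0, x1, x2, x3):
    """Model of the added sequence: two vmovdqa64 plus four vpternlogq."""
    t0 = x0             # vmovdqa64 x0, t0
    t3 = x3             # vmovdqa64 x3, t3
    x0 = x0 ^ x1 ^ x2   # vpternlogq $0x96, x2, x1, x0
    x3 = x3 ^ x1 ^ x2   # vpternlogq $0x96, x2, x1, x3
    x2 = x2 ^ t3 ^ t0   # vpternlogq $0x96, t0, t3, x2
    x1 = x1 ^ t3 ^ t0   # vpternlogq $0x96, t0, t3, x1
    return x0, x1, x2, x3


def check_equivalence(trials=1000, seed=0):
    """Return True if both models agree on `trials` random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.getrandbits(64) & MASK for _ in range(4)]
        if diff_m_old(*xs) != diff_m_new(*xs):
            return False
    return True
```

In both versions each output is the XOR of three of the four inputs (x0 becomes x0 ^ x1 ^ x2, and so on), which is why the nine-instruction sequence collapses to six.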
Thank you so much!
Taehee Yoo