[PATCH 4/5] aria-avx512: small optimization for aria_diff_m

Jussi Kivilinna jussi.kivilinna at iki.fi
Sun Feb 19 09:49:09 CET 2023


* cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
3-way XOR operation.
---

Using vpternlogq gives a small performance improvement on AMD Zen4. On
Intel Tiger Lake, speed is the same as before.
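
For readers who want to check the immediate and the rewritten sequence, here
is a minimal scalar sketch in portable C. It is not part of the patch.
ternlog64, xor3_imm8, diff_m_old, diff_m_new, sequences_agree and self_check
are hypothetical helper names, and one 64-bit lane stands in for a full
vector register. The model assumes the documented vpternlogq behaviour: for
each bit position, the destination, first source and second source bits form
a 3-bit index into imm8.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise model of vpternlogq on one 64-bit lane. 'a' plays the role of the
 * destination register, 'b' and 'c' the two source registers. */
static uint64_t ternlog64(uint8_t imm8, uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)(((a >> i) & 1) << 2 |
                                  ((b >> i) & 1) << 1 |
                                  ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Build the truth table of a ^ b ^ c: bit idx is set when idx has an odd
 * number of set bits (idx = 1, 2, 4, 7). */
static uint8_t xor3_imm8(void)
{
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned bit = ((idx >> 2) ^ (idx >> 1) ^ idx) & 1;
        imm |= (uint8_t)(bit << idx);
    }
    return imm;
}

/* Scalar transcription of the removed vpxorq/vmovdqu64 sequence. */
static void diff_m_old(uint64_t x[4])
{
    uint64_t t0 = x[3] ^ x[0];
    uint64_t t1 = x[0] ^ x[1];
    uint64_t t2 = x[1] ^ x[2];
    uint64_t t3 = x[2] ^ x[3];
    x[0] = x[0] ^ t2;
    t3 = t3 ^ x[1];
    x[2] = x[2] ^ t0;
    x[1] = x[3] ^ t1;
    x[3] = t3;
}

/* Scalar transcription of the added vmovdqa64/vpternlogq sequence. */
static void diff_m_new(uint64_t x[4])
{
    uint64_t t0 = x[0];
    uint64_t t3 = x[3];
    x[0] = ternlog64(0x96, x[0], x[1], x[2]);
    x[3] = ternlog64(0x96, x[3], x[1], x[2]);
    x[2] = ternlog64(0x96, x[2], t3, t0);
    x[1] = ternlog64(0x96, x[1], t3, t0);
}

/* Returns 1 when both sequences produce the same four outputs. */
static int sequences_agree(const uint64_t in[4])
{
    uint64_t a[4], b[4];
    for (int i = 0; i < 4; i++)
        a[i] = b[i] = in[i];
    diff_m_old(a);
    diff_m_new(b);
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Small self-check on arbitrary test values. */
static void self_check(void)
{
    const uint64_t in[4] = {
        0x0123456789abcdefULL, 0xfedcba9876543210ULL,
        0x0f1e2d3c4b5a6978ULL, 0xdeadbeefcafef00dULL
    };
    assert(xor3_imm8() == 0x96);
    assert(sequences_agree(in));
}
```

Since XOR is bitwise, agreement on one 64-bit lane carries over to every lane
of the vector registers. In both versions each output is the XOR of three of
the four inputs; the new sequence computes each one with a single vpternlogq
instead of two dependent vpxorq instructions.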

Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

Before:
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
        ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
        CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
        CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700

After (~3% faster):
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
        ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
        CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
        CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
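
The "~3% faster" figure can be reproduced from the cycles/byte columns of the
two tables above. The sketch below is only an arithmetic check, not part of
the patch; the array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* cycles/byte copied from the Before/After tables, in the order
 * ECB enc, ECB dec, CTR enc, CTR dec. */
static const double cpb_before[4] = { 0.957, 0.960, 0.994, 0.998 };
static const double cpb_after[4]  = { 0.932, 0.929, 0.961, 0.968 };

/* Speedup of one mode in percent: (before / after - 1) * 100. */
static double speedup_percent(size_t i)
{
    assert(i < 4 && cpb_after[i] > 0.0);
    return (cpb_before[i] / cpb_after[i] - 1.0) * 100.0;
}

/* Arithmetic mean of the four per-mode speedups. */
static double mean_speedup_percent(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < 4; i++)
        sum += speedup_percent(i);
    return sum / 4.0;
}
```

Each of the four modes comes out between 2.5% and 3.5%, which is consistent
with the rounded figure in the heading.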

Cc: Taehee Yoo <ap420073 at gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
index 849c744b..24a49a89 100644
--- a/cipher/aria-gfni-avx512-amd64.S
+++ b/cipher/aria-gfni-avx512-amd64.S
@@ -406,21 +406,17 @@
 	vgf2p8affineinvqb $0, t2, y3, y3;		\
 	vgf2p8affineinvqb $0, t2, y7, y7;
 
-
 #define aria_diff_m(x0, x1, x2, x3,			\
 		    t0, t1, t2, t3)			\
 	/* T = rotr32(X, 8); */				\
 	/* X ^= T */					\
-	vpxorq x0, x3, t0;				\
-	vpxorq x1, x0, t1;				\
-	vpxorq x2, x1, t2;				\
-	vpxorq x3, x2, t3;				\
 	/* X = T ^ rotr(X, 16); */			\
-	vpxorq t2, x0, x0;				\
-	vpxorq x1, t3, t3;				\
-	vpxorq t0, x2, x2;				\
-	vpxorq t1, x3, x1;				\
-	vmovdqu64 t3, x3;
+	vmovdqa64 x0, t0;				\
+	vmovdqa64 x3, t3;				\
+	vpternlogq $0x96, x2, x1, x0;			\
+	vpternlogq $0x96, x2, x1, x3;			\
+	vpternlogq $0x96, t0, t3, x2;			\
+	vpternlogq $0x96, t0, t3, x1;
 
 #define aria_diff_word(x0, x1, x2, x3,			\
 		       x4, x5, x6, x7,			\
-- 
2.37.2



