From tianjia.zhang at linux.alibaba.com  Tue Mar  1 05:38:35 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue,  1 Mar 2022 12:38:35 +0800
Subject: [PATCH v2 1/2] hwf-arm: add ARMv8.2 optional crypto extension HW
 features
Message-ID: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com>

* src/g10lib.h (HWF_ARM_SHA3, HWF_ARM_SM3, HWF_ARM_SM4)
(HWF_ARM_SHA512): New.
* src/hwf-arm.c (arm_features): Add sha3, sm3, sm4, sha512 HW features.
* src/hwfeatures.c (hwflist): Add sha3, sm3, sm4, sha512 HW features.
--

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 src/g10lib.h     |  4 ++++
 src/hwf-arm.c    | 16 ++++++++++++++++
 src/hwfeatures.c |  4 ++++
 3 files changed, 24 insertions(+)

diff --git a/src/g10lib.h b/src/g10lib.h
index 22c0f0c2..985e75c6 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -245,6 +245,10 @@ char **_gcry_strtokenize (const char *string, const char *delim);
 #define HWF_ARM_SHA1            (1 << 2)
 #define HWF_ARM_SHA2            (1 << 3)
 #define HWF_ARM_PMULL           (1 << 4)
+#define HWF_ARM_SHA3            (1 << 5)
+#define HWF_ARM_SM3             (1 << 6)
+#define HWF_ARM_SM4             (1 << 7)
+#define HWF_ARM_SHA512          (1 << 8)
 
 #elif defined(HAVE_CPU_ARCH_PPC)
 
diff --git a/src/hwf-arm.c b/src/hwf-arm.c
index 60107f36..70d375b2 100644
--- a/src/hwf-arm.c
+++ b/src/hwf-arm.c
@@ -137,6 +137,18 @@ static const struct feature_map_s arm_features[] =
 #ifndef HWCAP_SHA2
 # define HWCAP_SHA2  64
 #endif
+#ifndef HWCAP_SHA3
+# define HWCAP_SHA3  (1 << 17)
+#endif
+#ifndef HWCAP_SM3
+# define HWCAP_SM3   (1 << 18)
+#endif
+#ifndef HWCAP_SM4
+# define HWCAP_SM4   (1 << 19)
+#endif
+#ifndef HWCAP_SHA512
+# define HWCAP_SHA512 (1 << 21)
+#endif
 
 static const struct feature_map_s arm_features[] =
   {
@@ -148,6 +160,10 @@ static const struct feature_map_s arm_features[] =
     { HWCAP_SHA1, 0, " sha1", HWF_ARM_SHA1 },
     { HWCAP_SHA2, 0, " sha2", HWF_ARM_SHA2 },
     { HWCAP_PMULL, 0, " pmull", HWF_ARM_PMULL },
+    { HWCAP_SHA3, 0, " sha3",  HWF_ARM_SHA3 },
+    { HWCAP_SM3, 0, " sm3",  HWF_ARM_SM3 },
+    { HWCAP_SM4, 0, " sm4",  HWF_ARM_SM4 },
+    { HWCAP_SHA512, 0, " sha512",  HWF_ARM_SHA512 },
 #endif
   };
 
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 97e67b3c..7060d995 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -68,6 +68,10 @@ static struct
     { HWF_ARM_SHA1,            "arm-sha1" },
     { HWF_ARM_SHA2,            "arm-sha2" },
     { HWF_ARM_PMULL,           "arm-pmull" },
+    { HWF_ARM_SHA3,            "arm-sha3" },
+    { HWF_ARM_SM3,             "arm-sm3" },
+    { HWF_ARM_SM4,             "arm-sm4" },
+    { HWF_ARM_SHA512,          "arm-sha512" },
 #elif defined(HAVE_CPU_ARCH_PPC)
     { HWF_PPC_VCRYPTO,         "ppc-vcrypto" },
     { HWF_PPC_ARCH_3_00,       "ppc-arch_3_00" },
-- 
2.34.1


From tianjia.zhang at linux.alibaba.com  Tue Mar  1 05:38:36 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue,  1 Mar 2022 12:38:36 +0800
Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
In-Reply-To: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com>
References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com>

* cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'.
* cipher/sm4-armv8-aarch64-ce.S: New.
* cipher/sm4.c (USE_ARM_CE): New.
(SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'.
[USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key)
(_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc)
(_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec)
(_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New.
(sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup.
(sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW.
(sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption.
(sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add
ARMv8/AArch64/CE bulk functions.
* configure.ac (gcry_cv_gcc_inline_asm_aarch64_crypto_sm): Check for GCC
inline assembler supports AArch64 SM Crypto Extension instructions;
Add 'sm4-armv8-aarch64-ce.lo'.
--

This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk
functions process eight blocks in parallel.

Benchmark on T-Head Yitian-710 2.75 GHz:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |     12.10 ns/B     78.79 MiB/s     33.28 c/B      2750
        CBC dec |      4.63 ns/B     205.9 MiB/s     12.74 c/B      2749
        CFB enc |     12.14 ns/B     78.58 MiB/s     33.37 c/B      2750
        CFB dec |      4.64 ns/B     205.5 MiB/s     12.76 c/B      2750
        CTR enc |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        CTR dec |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        GCM enc |      4.88 ns/B     195.4 MiB/s     13.42 c/B      2750
        GCM dec |      4.88 ns/B     195.5 MiB/s     13.42 c/B      2750
       GCM auth |     0.189 ns/B      5048 MiB/s     0.520 c/B      2750
        OCB enc |      4.86 ns/B     196.0 MiB/s     13.38 c/B      2750
        OCB dec |      4.90 ns/B     194.7 MiB/s     13.47 c/B      2750
       OCB auth |      4.79 ns/B     199.0 MiB/s     13.18 c/B      2750

After (10x - 19x faster than ARMv8/AArch64 impl):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      1.25 ns/B     762.7 MiB/s      3.44 c/B      2749
        CBC dec |     0.243 ns/B      3927 MiB/s     0.668 c/B      2750
        CFB enc |      1.25 ns/B     763.1 MiB/s      3.44 c/B      2750
        CFB dec |     0.245 ns/B      3899 MiB/s     0.673 c/B      2750
        CTR enc |     0.298 ns/B      3199 MiB/s     0.820 c/B      2750
        CTR dec |     0.298 ns/B      3198 MiB/s     0.820 c/B      2750
        GCM enc |     0.487 ns/B      1957 MiB/s      1.34 c/B      2749
        GCM dec |     0.487 ns/B      1959 MiB/s      1.34 c/B      2750
       GCM auth |     0.189 ns/B      5048 MiB/s     0.519 c/B      2750
        OCB enc |     0.443 ns/B      2150 MiB/s      1.22 c/B      2749
        OCB dec |     0.486 ns/B      1964 MiB/s      1.34 c/B      2750
       OCB auth |     0.369 ns/B      2585 MiB/s      1.01 c/B      2749

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 cipher/Makefile.am            |   1 +
 cipher/sm4-armv8-aarch64-ce.S | 568 ++++++++++++++++++++++++++++++++++
 cipher/sm4.c                  | 152 +++++++++
 configure.ac                  |  35 +++
 4 files changed, 756 insertions(+)
 create mode 100644 cipher/sm4-armv8-aarch64-ce.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index a7cbf3fc..3339c463 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
 	sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \
+	sm4-armv8-aarch64-ce.S \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S
new file mode 100644
index 00000000..57e84683
--- /dev/null
+++ b/cipher/sm4-armv8-aarch64-ce.S
@@ -0,0 +1,568 @@
+/* sm4-armv8-aarch64-ce.S  -  ARMv8/AArch64/CE accelerated SM4 cipher
+ *
+ * Copyright (C) 2022 Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "asm-common-aarch64.h"
+
+#if defined(__AARCH64EL__) && \
+    defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
+    defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \
+    defined(USE_SM4)
+
+.cpu generic+simd+crypto
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31
+    .set .Lv\b\().4s, \b
+.endr
+
+.macro sm4e, vd, vn
+    .inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4ekey, vd, vn, vm
+    .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd
+.endm
+
+.text
+
+/* Register macros */
+
+#define RTMP0   v16
+#define RTMP1   v17
+#define RTMP2   v18
+#define RTMP3   v19
+
+#define RIV     v20
+
+/* Helper macros. */
+
+#define load_rkey(ptr)                     \
+        ld1 {v24.16b-v27.16b}, [ptr], #64; \
+        ld1 {v28.16b-v31.16b}, [ptr];
+
+#define crypt_blk4(b0, b1, b2, b3)         \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        sm4e b0.4s, v24.4s;                \
+        sm4e b1.4s, v24.4s;                \
+        sm4e b2.4s, v24.4s;                \
+        sm4e b3.4s, v24.4s;                \
+        sm4e b0.4s, v25.4s;                \
+        sm4e b1.4s, v25.4s;                \
+        sm4e b2.4s, v25.4s;                \
+        sm4e b3.4s, v25.4s;                \
+        sm4e b0.4s, v26.4s;                \
+        sm4e b1.4s, v26.4s;                \
+        sm4e b2.4s, v26.4s;                \
+        sm4e b3.4s, v26.4s;                \
+        sm4e b0.4s, v27.4s;                \
+        sm4e b1.4s, v27.4s;                \
+        sm4e b2.4s, v27.4s;                \
+        sm4e b3.4s, v27.4s;                \
+        sm4e b0.4s, v28.4s;                \
+        sm4e b1.4s, v28.4s;                \
+        sm4e b2.4s, v28.4s;                \
+        sm4e b3.4s, v28.4s;                \
+        sm4e b0.4s, v29.4s;                \
+        sm4e b1.4s, v29.4s;                \
+        sm4e b2.4s, v29.4s;                \
+        sm4e b3.4s, v29.4s;                \
+        sm4e b0.4s, v30.4s;                \
+        sm4e b1.4s, v30.4s;                \
+        sm4e b2.4s, v30.4s;                \
+        sm4e b3.4s, v30.4s;                \
+        sm4e b0.4s, v31.4s;                \
+        sm4e b1.4s, v31.4s;                \
+        sm4e b2.4s, v31.4s;                \
+        sm4e b3.4s, v31.4s;                \
+        rev64 b0.4s, b0.4s;                \
+        rev64 b1.4s, b1.4s;                \
+        rev64 b2.4s, b2.4s;                \
+        rev64 b3.4s, b3.4s;                \
+        ext b0.16b, b0.16b, b0.16b, #8;    \
+        ext b1.16b, b1.16b, b1.16b, #8;    \
+        ext b2.16b, b2.16b, b2.16b, #8;    \
+        ext b3.16b, b3.16b, b3.16b, #8;    \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;
+
+#define crypt_blk8(b0, b1, b2, b3, b4, b5, b6, b7) \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        rev32 b4.16b, b4.16b;              \
+        rev32 b5.16b, b5.16b;              \
+        rev32 b6.16b, b6.16b;              \
+        rev32 b7.16b, b7.16b;              \
+        sm4e b0.4s, v24.4s;                \
+        sm4e b1.4s, v24.4s;                \
+        sm4e b2.4s, v24.4s;                \
+        sm4e b3.4s, v24.4s;                \
+        sm4e b4.4s, v24.4s;                \
+        sm4e b5.4s, v24.4s;                \
+        sm4e b6.4s, v24.4s;                \
+        sm4e b7.4s, v24.4s;                \
+        sm4e b0.4s, v25.4s;                \
+        sm4e b1.4s, v25.4s;                \
+        sm4e b2.4s, v25.4s;                \
+        sm4e b3.4s, v25.4s;                \
+        sm4e b4.4s, v25.4s;                \
+        sm4e b5.4s, v25.4s;                \
+        sm4e b6.4s, v25.4s;                \
+        sm4e b7.4s, v25.4s;                \
+        sm4e b0.4s, v26.4s;                \
+        sm4e b1.4s, v26.4s;                \
+        sm4e b2.4s, v26.4s;                \
+        sm4e b3.4s, v26.4s;                \
+        sm4e b4.4s, v26.4s;                \
+        sm4e b5.4s, v26.4s;                \
+        sm4e b6.4s, v26.4s;                \
+        sm4e b7.4s, v26.4s;                \
+        sm4e b0.4s, v27.4s;                \
+        sm4e b1.4s, v27.4s;                \
+        sm4e b2.4s, v27.4s;                \
+        sm4e b3.4s, v27.4s;                \
+        sm4e b4.4s, v27.4s;                \
+        sm4e b5.4s, v27.4s;                \
+        sm4e b6.4s, v27.4s;                \
+        sm4e b7.4s, v27.4s;                \
+        sm4e b0.4s, v28.4s;                \
+        sm4e b1.4s, v28.4s;                \
+        sm4e b2.4s, v28.4s;                \
+        sm4e b3.4s, v28.4s;                \
+        sm4e b4.4s, v28.4s;                \
+        sm4e b5.4s, v28.4s;                \
+        sm4e b6.4s, v28.4s;                \
+        sm4e b7.4s, v28.4s;                \
+        sm4e b0.4s, v29.4s;                \
+        sm4e b1.4s, v29.4s;                \
+        sm4e b2.4s, v29.4s;                \
+        sm4e b3.4s, v29.4s;                \
+        sm4e b4.4s, v29.4s;                \
+        sm4e b5.4s, v29.4s;                \
+        sm4e b6.4s, v29.4s;                \
+        sm4e b7.4s, v29.4s;                \
+        sm4e b0.4s, v30.4s;                \
+        sm4e b1.4s, v30.4s;                \
+        sm4e b2.4s, v30.4s;                \
+        sm4e b3.4s, v30.4s;                \
+        sm4e b4.4s, v30.4s;                \
+        sm4e b5.4s, v30.4s;                \
+        sm4e b6.4s, v30.4s;                \
+        sm4e b7.4s, v30.4s;                \
+        sm4e b0.4s, v31.4s;                \
+        sm4e b1.4s, v31.4s;                \
+        sm4e b2.4s, v31.4s;                \
+        sm4e b3.4s, v31.4s;                \
+        sm4e b4.4s, v31.4s;                \
+        sm4e b5.4s, v31.4s;                \
+        sm4e b6.4s, v31.4s;                \
+        sm4e b7.4s, v31.4s;                \
+        rev64 b0.4s, b0.4s;                \
+        rev64 b1.4s, b1.4s;                \
+        rev64 b2.4s, b2.4s;                \
+        rev64 b3.4s, b3.4s;                \
+        rev64 b4.4s, b4.4s;                \
+        rev64 b5.4s, b5.4s;                \
+        rev64 b6.4s, b6.4s;                \
+        rev64 b7.4s, b7.4s;                \
+        ext b0.16b, b0.16b, b0.16b, #8;    \
+        ext b1.16b, b1.16b, b1.16b, #8;    \
+        ext b2.16b, b2.16b, b2.16b, #8;    \
+        ext b3.16b, b3.16b, b3.16b, #8;    \
+        ext b4.16b, b4.16b, b4.16b, #8;    \
+        ext b5.16b, b5.16b, b5.16b, #8;    \
+        ext b6.16b, b6.16b, b6.16b, #8;    \
+        ext b7.16b, b7.16b, b7.16b, #8;    \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        rev32 b4.16b, b4.16b;              \
+        rev32 b5.16b, b5.16b;              \
+        rev32 b6.16b, b6.16b;              \
+        rev32 b7.16b, b7.16b;
+
+
+.align 3
+.global _gcry_sm4_armv8_ce_expand_key
+ELF(.type _gcry_sm4_armv8_ce_expand_key,%function;)
+_gcry_sm4_armv8_ce_expand_key:
+    /* input:
+     *   x0: 128-bit key
+     *   x1: rkey_enc
+     *   x2: rkey_dec
+     *   x3: fk array
+     *   x4: ck array
+     */
+    CFI_STARTPROC();
+
+    ld1 {v0.16b}, [x0];
+    rev32 v0.16b, v0.16b;
+    ld1 {v1.16b}, [x3];
+    load_rkey(x4);
+
+    /* input ^ fk */
+    eor v0.16b, v0.16b, v1.16b;
+
+    sm4ekey v0.4s, v0.4s, v24.4s;
+    sm4ekey v1.4s, v0.4s, v25.4s;
+    sm4ekey v2.4s, v1.4s, v26.4s;
+    sm4ekey v3.4s, v2.4s, v27.4s;
+    sm4ekey v4.4s, v3.4s, v28.4s;
+    sm4ekey v5.4s, v4.4s, v29.4s;
+    sm4ekey v6.4s, v5.4s, v30.4s;
+    sm4ekey v7.4s, v6.4s, v31.4s;
+
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b-v7.16b}, [x1];
+    rev64 v7.4s, v7.4s;
+    rev64 v6.4s, v6.4s;
+    rev64 v5.4s, v5.4s;
+    rev64 v4.4s, v4.4s;
+    rev64 v3.4s, v3.4s;
+    rev64 v2.4s, v2.4s;
+    rev64 v1.4s, v1.4s;
+    rev64 v0.4s, v0.4s;
+    ext v7.16b, v7.16b, v7.16b, #8;
+    ext v6.16b, v6.16b, v6.16b, #8;
+    ext v5.16b, v5.16b, v5.16b, #8;
+    ext v4.16b, v4.16b, v4.16b, #8;
+    ext v3.16b, v3.16b, v3.16b, #8;
+    ext v2.16b, v2.16b, v2.16b, #8;
+    ext v1.16b, v1.16b, v1.16b, #8;
+    ext v0.16b, v0.16b, v0.16b, #8;
+    st1 {v7.16b}, [x2], #16;
+    st1 {v6.16b}, [x2], #16;
+    st1 {v5.16b}, [x2], #16;
+    st1 {v4.16b}, [x2], #16;
+    st1 {v3.16b}, [x2], #16;
+    st1 {v2.16b}, [x2], #16;
+    st1 {v1.16b}, [x2], #16;
+    st1 {v0.16b}, [x2];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;)
+
+.align 3
+ELF(.type sm4_armv8_ce_crypt_blk1_4,%function;)
+sm4_armv8_ce_crypt_blk1_4:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: num blocks (1..4)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+    ld1 {v0.16b}, [x2], #16;
+    mov v1.16b, v0.16b;
+    mov v2.16b, v0.16b;
+    mov v3.16b, v0.16b;
+    cmp x3, #2;
+    blt .Lblk4_load_input_done;
+    ld1 {v1.16b}, [x2], #16;
+    beq .Lblk4_load_input_done;
+    ld1 {v2.16b}, [x2], #16;
+    cmp x3, #3;
+    beq .Lblk4_load_input_done;
+    ld1 {v3.16b}, [x2];
+
+.Lblk4_load_input_done:
+    crypt_blk4(v0, v1, v2, v3);
+
+    st1 {v0.16b}, [x1], #16;
+    cmp x3, #2;
+    blt .Lblk4_store_output_done;
+    st1 {v1.16b}, [x1], #16;
+    beq .Lblk4_store_output_done;
+    st1 {v2.16b}, [x1], #16;
+    cmp x3, #3;
+    beq .Lblk4_store_output_done;
+    st1 {v3.16b}, [x1];
+
+.Lblk4_store_output_done:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size sm4_armv8_ce_crypt_blk1_4,.-sm4_armv8_ce_crypt_blk1_4;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_crypt_blk1_8
+ELF(.type _gcry_sm4_armv8_ce_crypt_blk1_8,%function;)
+_gcry_sm4_armv8_ce_crypt_blk1_8:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: num blocks (1..8)
+     */
+    CFI_STARTPROC();
+
+    cmp x3, #5;
+    blt sm4_armv8_ce_crypt_blk1_4;
+
+    load_rkey(x0);
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b}, [x2], #16;
+    mov v5.16b, v4.16b;
+    mov v6.16b, v4.16b;
+    mov v7.16b, v4.16b;
+    beq .Lblk8_load_input_done;
+    ld1 {v5.16b}, [x2], #16;
+    cmp x3, #7;
+    blt .Lblk8_load_input_done;
+    ld1 {v6.16b}, [x2], #16;
+    beq .Lblk8_load_input_done;
+    ld1 {v7.16b}, [x2];
+
+.Lblk8_load_input_done:
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    cmp x3, #6;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b}, [x1], #16;
+    blt .Lblk8_store_output_done;
+    st1 {v5.16b}, [x1], #16;
+    beq .Lblk8_store_output_done;
+    st1 {v6.16b}, [x1], #16;
+    cmp x3, #7;
+    beq .Lblk8_store_output_done;
+    st1 {v7.16b}, [x1];
+
+.Lblk8_store_output_done:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_crypt
+ELF(.type _gcry_sm4_armv8_ce_crypt,%function;)
+_gcry_sm4_armv8_ce_crypt:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+.Lcrypt_loop_blk:
+    subs x3, x3, #8;
+    bmi .Lcrypt_end;
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b-v7.16b}, [x2], #64;
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lcrypt_loop_blk;
+
+.Lcrypt_end:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_cbc_dec
+ELF(.type _gcry_sm4_armv8_ce_cbc_dec,%function;)
+_gcry_sm4_armv8_ce_cbc_dec:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: iv (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+    ld1 {RIV.16b}, [x3];
+
+.Lcbc_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lcbc_end;
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b-v7.16b}, [x2];
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    sub x2, x2, #64;
+    eor v0.16b, v0.16b, RIV.16b;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v1.16b, v1.16b, RTMP0.16b;
+    eor v2.16b, v2.16b, RTMP1.16b;
+    eor v3.16b, v3.16b, RTMP2.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    eor v4.16b, v4.16b, RTMP3.16b;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v5.16b, v5.16b, RTMP0.16b;
+    eor v6.16b, v6.16b, RTMP1.16b;
+    eor v7.16b, v7.16b, RTMP2.16b;
+
+    mov RIV.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lcbc_loop_blk;
+
+.Lcbc_end:
+    /* store new IV */
+    st1 {RIV.16b}, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_cfb_dec
+ELF(.type _gcry_sm4_armv8_ce_cfb_dec,%function;)
+_gcry_sm4_armv8_ce_cfb_dec:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: iv (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+    ld1 {v0.16b}, [x3];
+
+.Lcfb_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lcfb_end;
+
+    ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
+    ld1 {v4.16b-v7.16b}, [x2];
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    sub x2, x2, #48;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v0.16b, v0.16b, RTMP0.16b;
+    eor v1.16b, v1.16b, RTMP1.16b;
+    eor v2.16b, v2.16b, RTMP2.16b;
+    eor v3.16b, v3.16b, RTMP3.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v4.16b, v4.16b, RTMP0.16b;
+    eor v5.16b, v5.16b, RTMP1.16b;
+    eor v6.16b, v6.16b, RTMP2.16b;
+    eor v7.16b, v7.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    mov v0.16b, RTMP3.16b;
+
+    b .Lcfb_loop_blk;
+
+.Lcfb_end:
+    /* store new IV */
+    st1 {v0.16b}, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_ctr_enc
+ELF(.type _gcry_sm4_armv8_ce_ctr_enc,%function;)
+_gcry_sm4_armv8_ce_ctr_enc:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: ctr (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+    ldp x7, x8, [x3];
+    rev x7, x7;
+    rev x8, x8;
+
+.Lctr_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lctr_end;
+
+#define inc_le128(vctr)       \
+    mov vctr.d[1], x8;        \
+    mov vctr.d[0], x7;        \
+    adds x8, x8, #1;          \
+    adc x7, x7, xzr;          \
+    rev64 vctr.16b, vctr.16b;
+
+    /* construct CTRs */
+    inc_le128(v0);      /* +0 */
+    inc_le128(v1);      /* +1 */
+    inc_le128(v2);      /* +2 */
+    inc_le128(v3);      /* +3 */
+    inc_le128(v4);      /* +4 */
+    inc_le128(v5);      /* +5 */
+    inc_le128(v6);      /* +6 */
+    inc_le128(v7);      /* +7 */
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v0.16b, v0.16b, RTMP0.16b;
+    eor v1.16b, v1.16b, RTMP1.16b;
+    eor v2.16b, v2.16b, RTMP2.16b;
+    eor v3.16b, v3.16b, RTMP3.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v4.16b, v4.16b, RTMP0.16b;
+    eor v5.16b, v5.16b, RTMP1.16b;
+    eor v6.16b, v6.16b, RTMP2.16b;
+    eor v7.16b, v7.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lctr_loop_blk;
+
+.Lctr_end:
+    /* store new CTR */
+    rev x7, x7;
+    rev x8, x8;
+    stp x7, x8, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;)
+
+#endif
diff --git a/cipher/sm4.c b/cipher/sm4.c
index ec2281b6..1fef664b 100644
--- a/cipher/sm4.c
+++ b/cipher/sm4.c
@@ -76,6 +76,15 @@
 # endif
 #endif
 
+#undef USE_ARM_CE
+#ifdef ENABLE_ARM_CRYPTO_SUPPORT
+# if defined(__AARCH64EL__) && \
+     defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
+     defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM)
+#   define USE_ARM_CE 1
+# endif
+#endif
+
 static const char *sm4_selftest (void);
 
 static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr,
@@ -106,6 +115,9 @@ typedef struct
 #ifdef USE_AARCH64_SIMD
   unsigned int use_aarch64_simd:1;
 #endif
+#ifdef USE_ARM_CE
+  unsigned int use_arm_ce:1;
+#endif
 } SM4_context;
 
 static const u32 fk[4] =
@@ -286,6 +298,43 @@ sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in,
 }
 #endif /* USE_AARCH64_SIMD */
 
+#ifdef USE_ARM_CE
+extern void _gcry_sm4_armv8_ce_expand_key(const byte *key,
+					  u32 *rkey_enc, u32 *rkey_dec,
+					  const u32 *fk, const u32 *ck);
+
+extern void _gcry_sm4_armv8_ce_crypt(const u32 *rk, byte *out,
+				     const byte *in,
+				     size_t num_blocks);
+
+extern void _gcry_sm4_armv8_ce_ctr_enc(const u32 *rk_enc, byte *out,
+				       const byte *in,
+				       byte *ctr,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_cbc_dec(const u32 *rk_dec, byte *out,
+				       const byte *in,
+				       byte *iv,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_cfb_dec(const u32 *rk_enc, byte *out,
+				       const byte *in,
+				       byte *iv,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out,
+					    const byte *in,
+					    size_t num_blocks);
+
+static inline unsigned int
+sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in,
+			 unsigned int num_blks)
+{
+  _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks);
+  return 0;
+}
+#endif /* USE_ARM_CE */
+
 static inline void prefetch_sbox_table(void)
 {
   const volatile byte *vtab = (void *)&sbox_table;
@@ -363,6 +412,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key)
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      _gcry_sm4_armv8_ce_expand_key (key, ctx->rkey_enc, ctx->rkey_dec,
+				     fk, ck);
+      return;
+    }
+#endif
+
   rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0];
   rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1];
   rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2];
@@ -420,6 +478,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
 #ifdef USE_AARCH64_SIMD
   ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON);
 #endif
+#ifdef USE_ARM_CE
+  ctx->use_arm_ce = !!(hwf & HWF_ARM_SM4);
+#endif
 
   /* Setup bulk encryption routines.  */
   memset (bulk_ops, 0, sizeof(*bulk_ops));
@@ -465,6 +526,11 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_enc, outbuf, inbuf, 1);
+#endif
+
   prefetch_sbox_table ();
 
   return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);
@@ -475,6 +541,11 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_dec, outbuf, inbuf, 1);
+#endif
+
   prefetch_sbox_table ();
 
   return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf);
@@ -601,6 +672,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -634,6 +722,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -725,6 +819,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -758,6 +869,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -842,6 +959,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -875,6 +1009,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -1037,6 +1177,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -1203,6 +1349,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
diff --git a/configure.ac b/configure.ac
index f5363f22..2ff053f7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1906,6 +1906,40 @@ if test "$gcry_cv_gcc_inline_asm_aarch64_crypto" = "yes" ; then
 fi
 
 
+#
+# Check whether GCC inline assembler supports AArch64 SM Crypto Extension instructions
+#
+AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 SM Crypto Extension instructions],
+       [gcry_cv_gcc_inline_asm_aarch64_crypto_sm],
+       [if test "$mpi_cpu_arch" != "aarch64" ||
+           test "$try_asm_modules" != "yes" ; then
+          gcry_cv_gcc_inline_asm_aarch64_crypto_sm="n/a"
+        else
+          gcry_cv_gcc_inline_asm_aarch64_crypto_sm=no
+          AC_LINK_IFELSE([AC_LANG_PROGRAM(
+          [[__asm__(
+                ".cpu generic+simd+crypto\n\t"
+                ".text\n\t"
+                "testfn:\n\t"
+                ".inst 0xce63c004 /* sm3partw1	v4.4s, v0.4s, v3.4s */\n\t"
+                ".inst 0xce66c4e4 /* sm3partw2	v4.4s, v7.4s, v6.4s */\n\t"
+                ".inst 0xce4b2505 /* sm3ss1	v5.4s, v8.4s, v11.4s, v9.4s */\n\t"
+                ".inst 0xce4a80a8 /* sm3tt1a	v8.4s, v5.4s, v10.s[0] */\n\t"
+                ".inst 0xce4a84a8 /* sm3tt1b	v8.4s, v5.4s, v10.s[0] */\n\t"
+                ".inst 0xce4088a9 /* sm3tt2a	v9.4s, v5.4s, v0.s[0] */\n\t"
+                ".inst 0xce448ca9 /* sm3tt2b	v9.4s, v5.4s, v4.s[0] */\n\t"
+                ".inst 0xcec08408 /* sm4e	v8.4s, v0.4s */\n\t"
+                ".inst 0xce70c800 /* sm4ekey	v0.4s, v0.4s, v16.4s */\n\t"
+                );
+            ]], [ testfn(); ])],
+          [gcry_cv_gcc_inline_asm_aarch64_crypto_sm=yes])
+        fi])
+if test "$gcry_cv_gcc_inline_asm_aarch64_crypto_sm" = "yes" ; then
+   AC_DEFINE(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM,1,
+     [Defined if inline assembler supports AArch64 SM Crypto Extension instructions])
+fi
+
+
 #
 # Check whether PowerPC AltiVec/VSX intrinsics
 #
@@ -2755,6 +2789,7 @@ if test "$found" = "1" ; then
       aarch64-*-*)
          # Build with the assembly implementation
          GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo"
+         GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-armv8-aarch64-ce.lo"
    esac
 fi
 
-- 
2.34.1


From tianjia.zhang at linux.alibaba.com  Tue Mar  1 08:04:06 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue,  1 Mar 2022 15:04:06 +0800
Subject: [PATCH] Ignore tests binary file in git repo
Message-ID: <20220301070406.12591-1-tianjia.zhang@linux.alibaba.com>

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 .gitignore | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/.gitignore b/.gitignore
index 99741c18..ec456c71 100644
--- a/.gitignore
+++ b/.gitignore
@@ -67,6 +67,7 @@ src/gcrypt.h
 src/hmac256
 src/libgcrypt-config
 src/libgcrypt.la
+src/libgcrypt.la.done
 src/libgcrypt.pc
 src/mpicalc
 src/versioninfo.rc
@@ -100,14 +101,20 @@ tests/register
 tests/rsacvt
 tests/t-convert
 tests/t-cv25519
+tests/t-dsa
+tests/t-ecdsa
 tests/t-ed25519
+tests/t-ed448
 tests/t-kdf
 tests/t-lock
 tests/t-mpi-bit
 tests/t-mpi-point
+tests/t-rsa-15
+tests/t-rsa-pss
 tests/t-sexp
 tests/t-secmem
 tests/t-x448
+tests/testdrv
 tests/tsexp
 tests/version
 tests/*.exe
-- 
2.34.1


From jussi.kivilinna at iki.fi  Tue Mar  1 08:26:28 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 1 Mar 2022 09:26:28 +0200
Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
In-Reply-To: <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com>
References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com>
 <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com>
Message-ID: <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi>

Hello,

On 1.3.2022 6.38, Tianjia Zhang wrote:
> new file mode 100644
> index 00000000..57e84683
> --- /dev/null
> +++ b/cipher/sm4-armv8-aarch64-ce.S
> @@ -0,0 +1,568 @@
> +/* sm4-armv8-aarch64-ce.S  -  ARMv8/AArch64/CE accelerated SM4 cipher
> + *
> + * Copyright (C) 2022 Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
> + *
> + * This file is part of Libgcrypt.
> + *
> + * Libgcrypt is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as
> + * published by the Free Software Foundation; either version 2.1 of
> + * the License, or (at your option) any later version.
> + *
> + * Libgcrypt is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "asm-common-aarch64.h"
> +
> +#if defined(__AARCH64EL__) && \
> +    defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
> +    defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \
> +    defined(USE_SM4)
> +
> +.cpu generic+simd+crypto
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31
> +    .set .Lv\b\().4s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> +    .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4ekey, vd, vn, vm
> +    .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd
> +.endm

I meant that the problem is that ".macro"/".endm"/".set"/".irp" may not be not supported by all compilers/assemblers. Implementation here could either:
- Rely on assembler supporting these instructions and use "sm4e" and "sm4ekey" directly or
- Use preprocessor #define macros instead of assembler .macros to provide these instructions. Something like this could work:

#define vecnum_v0 0
#define vecnum_v1 1
#define vecnum_v2 2
#define vecnum_v3 3
#define vecnum_v4 4
#define vecnum_v5 5
#define vecnum_v6 6
#define vecnum_v7 7
#define vecnum_v16 16
#define vecnum_v24 24
#define vecnum_v25 25
#define vecnum_v26 26
#define vecnum_v27 27
#define vecnum_v28 28
#define vecnum_v29 29
#define vecnum_v30 30
#define vecnum_v31 31

#define sm4e(vd,vn) \
   .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd)

#define sm4ekey(vd, vn, vm) \
   .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd)

...

#define crypt_blk4(b0, b1, b2, b3)         \
         rev32 b0.16b, b0.16b;              \
         rev32 b1.16b, b1.16b;              \
         rev32 b2.16b, b2.16b;              \
         rev32 b3.16b, b3.16b;              \
         sm4e(b0, v24);                     \
         sm4e(b1, v24);                     \

-Jussi


From tianjia.zhang at linux.alibaba.com  Tue Mar  1 09:58:34 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 1 Mar 2022 16:58:34 +0800
Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
In-Reply-To: <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi>
References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com>
 <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com>
 <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi>
Message-ID: <9b1407b0-2190-0ebf-5bbe-15265cccfc63@linux.alibaba.com>

Hi Jussi,

On 3/1/22 3:26 PM, Jussi Kivilinna wrote:
> Hello,
> 
> On 1.3.2022 6.38, Tianjia Zhang wrote:
>> new file mode 100644
>> index 00000000..57e84683
>> --- /dev/null
>> +++ b/cipher/sm4-armv8-aarch64-ce.S
>> @@ -0,0 +1,568 @@
>> +/* sm4-armv8-aarch64-ce.S? -? ARMv8/AArch64/CE accelerated SM4 cipher
>> + *
>> + * Copyright (C) 2022 Alibaba Group.
>> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
>> + *
>> + * This file is part of Libgcrypt.
>> + *
>> + * Libgcrypt is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU Lesser General Public License as
>> + * published by the Free Software Foundation; either version 2.1 of
>> + * the License, or (at your option) any later version.
>> + *
>> + * Libgcrypt is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.? See the
>> + * GNU Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this program; if not, see 
>> <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "asm-common-aarch64.h"
>> +
>> +#if defined(__AARCH64EL__) && \
>> +??? defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
>> +??? defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \
>> +??? defined(USE_SM4)
>> +
>> +.cpu generic+simd+crypto
>> +
>> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31
>> +??? .set .Lv\b\().4s, \b
>> +.endr
>> +
>> +.macro sm4e, vd, vn
>> +??? .inst 0xcec08400 | (.L\vn << 5) | .L\vd
>> +.endm
>> +
>> +.macro sm4ekey, vd, vn, vm
>> +??? .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd
>> +.endm
> 
> I meant that the problem is that ".macro"/".endm"/".set"/".irp" may not 
> be not supported by all compilers/assemblers. Implementation here could 
> either:
> - Rely on assembler supporting these instructions and use "sm4e" and 
> "sm4ekey" directly or
> - Use preprocessor #define macros instead of assembler .macros to 
> provide these instructions. Something like this could work:
> 
> #define vecnum_v0 0
> #define vecnum_v1 1
> #define vecnum_v2 2
> #define vecnum_v3 3
> #define vecnum_v4 4
> #define vecnum_v5 5
> #define vecnum_v6 6
> #define vecnum_v7 7
> #define vecnum_v16 16
> #define vecnum_v24 24
> #define vecnum_v25 25
> #define vecnum_v26 26
> #define vecnum_v27 27
> #define vecnum_v28 28
> #define vecnum_v29 29
> #define vecnum_v30 30
> #define vecnum_v31 31
> 
> #define sm4e(vd,vn) \
>  ? .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd)
> 
> #define sm4ekey(vd, vn, vm) \
>  ? .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | 
> vecnum_##vd)
> 
> ...
> 
> #define crypt_blk4(b0, b1, b2, b3)???????? \
>  ??????? rev32 b0.16b, b0.16b;????????????? \
>  ??????? rev32 b1.16b, b1.16b;????????????? \
>  ??????? rev32 b2.16b, b2.16b;????????????? \
>  ??????? rev32 b3.16b, b3.16b;????????????? \
>  ??????? sm4e(b0, v24);???????????????????? \
>  ??????? sm4e(b1, v24);???????????????????? \
> 
> -Jussi

Thanks for your suggestion, using #define can solve this problem.

Cheers,
Tianjia


From tianjia.zhang at linux.alibaba.com  Tue Mar  1 10:56:54 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue,  1 Mar 2022 17:56:54 +0800
Subject: [PATCH v3 1/2] hwf-arm: add ARMv8.2 optional crypto extension HW
 features
Message-ID: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com>

* src/g10lib.h (HWF_ARM_SHA3, HWF_ARM_SM3, HWF_ARM_SM4)
(HWF_ARM_SHA512): New.
* src/hwf-arm.c (arm_features): Add sha3, sm3, sm4, sha512 HW features.
* src/hwfeatures.c (hwflist): Add sha3, sm3, sm4, sha512 HW features.
--

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 src/g10lib.h     |  4 ++++
 src/hwf-arm.c    | 16 ++++++++++++++++
 src/hwfeatures.c |  4 ++++
 3 files changed, 24 insertions(+)

diff --git a/src/g10lib.h b/src/g10lib.h
index 22c0f0c2..985e75c6 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -245,6 +245,10 @@ char **_gcry_strtokenize (const char *string, const char *delim);
 #define HWF_ARM_SHA1            (1 << 2)
 #define HWF_ARM_SHA2            (1 << 3)
 #define HWF_ARM_PMULL           (1 << 4)
+#define HWF_ARM_SHA3            (1 << 5)
+#define HWF_ARM_SM3             (1 << 6)
+#define HWF_ARM_SM4             (1 << 7)
+#define HWF_ARM_SHA512          (1 << 8)
 
 #elif defined(HAVE_CPU_ARCH_PPC)
 
diff --git a/src/hwf-arm.c b/src/hwf-arm.c
index 60107f36..70d375b2 100644
--- a/src/hwf-arm.c
+++ b/src/hwf-arm.c
@@ -137,6 +137,18 @@ static const struct feature_map_s arm_features[] =
 #ifndef HWCAP_SHA2
 # define HWCAP_SHA2  64
 #endif
+#ifndef HWCAP_SHA3
+# define HWCAP_SHA3  (1 << 17)
+#endif
+#ifndef HWCAP_SM3
+# define HWCAP_SM3   (1 << 18)
+#endif
+#ifndef HWCAP_SM4
+# define HWCAP_SM4   (1 << 19)
+#endif
+#ifndef HWCAP_SHA512
+# define HWCAP_SHA512 (1 << 21)
+#endif
 
 static const struct feature_map_s arm_features[] =
   {
@@ -148,6 +160,10 @@ static const struct feature_map_s arm_features[] =
     { HWCAP_SHA1, 0, " sha1", HWF_ARM_SHA1 },
     { HWCAP_SHA2, 0, " sha2", HWF_ARM_SHA2 },
     { HWCAP_PMULL, 0, " pmull", HWF_ARM_PMULL },
+    { HWCAP_SHA3, 0, " sha3",  HWF_ARM_SHA3 },
+    { HWCAP_SM3, 0, " sm3",  HWF_ARM_SM3 },
+    { HWCAP_SM4, 0, " sm4",  HWF_ARM_SM4 },
+    { HWCAP_SHA512, 0, " sha512",  HWF_ARM_SHA512 },
 #endif
   };
 
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 97e67b3c..7060d995 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -68,6 +68,10 @@ static struct
     { HWF_ARM_SHA1,            "arm-sha1" },
     { HWF_ARM_SHA2,            "arm-sha2" },
     { HWF_ARM_PMULL,           "arm-pmull" },
+    { HWF_ARM_SHA3,            "arm-sha3" },
+    { HWF_ARM_SM3,             "arm-sm3" },
+    { HWF_ARM_SM4,             "arm-sm4" },
+    { HWF_ARM_SHA512,          "arm-sha512" },
 #elif defined(HAVE_CPU_ARCH_PPC)
     { HWF_PPC_VCRYPTO,         "ppc-vcrypto" },
     { HWF_PPC_ARCH_3_00,       "ppc-arch_3_00" },
-- 
2.34.1


From tianjia.zhang at linux.alibaba.com  Tue Mar  1 10:56:55 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue,  1 Mar 2022 17:56:55 +0800
Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
In-Reply-To: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com>
References: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com>

* cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'.
* cipher/sm4-armv8-aarch64-ce.S: New.
* cipher/sm4.c (USE_ARM_CE): New.
(SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'.
[USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key)
(_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc)
(_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec)
(_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New.
(sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup.
(sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW.
(sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption.
(sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add
ARMv8/AArch64/CE bulk functions.
* configure.ac: Add 'sm4-armv8-aarch64-ce.lo'.
--

This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk
functions process eight blocks in parallel.

Benchmark on T-Head Yitian-710 2.75 GHz:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |     12.10 ns/B     78.79 MiB/s     33.28 c/B      2750
        CBC dec |      4.63 ns/B     205.9 MiB/s     12.74 c/B      2749
        CFB enc |     12.14 ns/B     78.58 MiB/s     33.37 c/B      2750
        CFB dec |      4.64 ns/B     205.5 MiB/s     12.76 c/B      2750
        CTR enc |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        CTR dec |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        GCM enc |      4.88 ns/B     195.4 MiB/s     13.42 c/B      2750
        GCM dec |      4.88 ns/B     195.5 MiB/s     13.42 c/B      2750
       GCM auth |     0.189 ns/B      5048 MiB/s     0.520 c/B      2750
        OCB enc |      4.86 ns/B     196.0 MiB/s     13.38 c/B      2750
        OCB dec |      4.90 ns/B     194.7 MiB/s     13.47 c/B      2750
       OCB auth |      4.79 ns/B     199.0 MiB/s     13.18 c/B      2750

After (10x - 19x faster than ARMv8/AArch64 impl):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      1.25 ns/B     762.7 MiB/s      3.44 c/B      2749
        CBC dec |     0.243 ns/B      3927 MiB/s     0.668 c/B      2750
        CFB enc |      1.25 ns/B     763.1 MiB/s      3.44 c/B      2750
        CFB dec |     0.245 ns/B      3899 MiB/s     0.673 c/B      2750
        CTR enc |     0.298 ns/B      3199 MiB/s     0.820 c/B      2750
        CTR dec |     0.298 ns/B      3198 MiB/s     0.820 c/B      2750
        GCM enc |     0.487 ns/B      1957 MiB/s      1.34 c/B      2749
        GCM dec |     0.487 ns/B      1959 MiB/s      1.34 c/B      2750
       GCM auth |     0.189 ns/B      5048 MiB/s     0.519 c/B      2750
        OCB enc |     0.443 ns/B      2150 MiB/s      1.22 c/B      2749
        OCB dec |     0.486 ns/B      1964 MiB/s      1.34 c/B      2750
       OCB auth |     0.369 ns/B      2585 MiB/s      1.01 c/B      2749

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 cipher/Makefile.am            |   1 +
 cipher/sm4-armv8-aarch64-ce.S | 580 ++++++++++++++++++++++++++++++++++
 cipher/sm4.c                  | 152 +++++++++
 configure.ac                  |   1 +
 4 files changed, 734 insertions(+)
 create mode 100644 cipher/sm4-armv8-aarch64-ce.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index a7cbf3fc..3339c463 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
 	sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \
+	sm4-armv8-aarch64-ce.S \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S
new file mode 100644
index 00000000..5fb55947
--- /dev/null
+++ b/cipher/sm4-armv8-aarch64-ce.S
@@ -0,0 +1,580 @@
+/* sm4-armv8-aarch64-ce.S  -  ARMv8/AArch64/CE accelerated SM4 cipher
+ *
+ * Copyright (C) 2022 Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "asm-common-aarch64.h"
+
+#if defined(__AARCH64EL__) && \
+    defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
+    defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \
+    defined(USE_SM4)
+
+.cpu generic+simd+crypto
+
+#define vecnum_v0 0
+#define vecnum_v1 1
+#define vecnum_v2 2
+#define vecnum_v3 3
+#define vecnum_v4 4
+#define vecnum_v5 5
+#define vecnum_v6 6
+#define vecnum_v7 7
+#define vecnum_v16 16
+#define vecnum_v24 24
+#define vecnum_v25 25
+#define vecnum_v26 26
+#define vecnum_v27 27
+#define vecnum_v28 28
+#define vecnum_v29 29
+#define vecnum_v30 30
+#define vecnum_v31 31
+
+#define sm4e(vd, vn) \
+   .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd)
+
+#define sm4ekey(vd, vn, vm) \
+   .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd)
+
+.text
+
+/* Register macros */
+
+#define RTMP0   v16
+#define RTMP1   v17
+#define RTMP2   v18
+#define RTMP3   v19
+
+#define RIV     v20
+
+/* Helper macros. */
+
+#define load_rkey(ptr)                     \
+        ld1 {v24.16b-v27.16b}, [ptr], #64; \
+        ld1 {v28.16b-v31.16b}, [ptr];
+
+#define crypt_blk4(b0, b1, b2, b3)         \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        sm4e(b0, v24);                     \
+        sm4e(b1, v24);                     \
+        sm4e(b2, v24);                     \
+        sm4e(b3, v24);                     \
+        sm4e(b0, v25);                     \
+        sm4e(b1, v25);                     \
+        sm4e(b2, v25);                     \
+        sm4e(b3, v25);                     \
+        sm4e(b0, v26);                     \
+        sm4e(b1, v26);                     \
+        sm4e(b2, v26);                     \
+        sm4e(b3, v26);                     \
+        sm4e(b0, v27);                     \
+        sm4e(b1, v27);                     \
+        sm4e(b2, v27);                     \
+        sm4e(b3, v27);                     \
+        sm4e(b0, v28);                     \
+        sm4e(b1, v28);                     \
+        sm4e(b2, v28);                     \
+        sm4e(b3, v28);                     \
+        sm4e(b0, v29);                     \
+        sm4e(b1, v29);                     \
+        sm4e(b2, v29);                     \
+        sm4e(b3, v29);                     \
+        sm4e(b0, v30);                     \
+        sm4e(b1, v30);                     \
+        sm4e(b2, v30);                     \
+        sm4e(b3, v30);                     \
+        sm4e(b0, v31);                     \
+        sm4e(b1, v31);                     \
+        sm4e(b2, v31);                     \
+        sm4e(b3, v31);                     \
+        rev64 b0.4s, b0.4s;                \
+        rev64 b1.4s, b1.4s;                \
+        rev64 b2.4s, b2.4s;                \
+        rev64 b3.4s, b3.4s;                \
+        ext b0.16b, b0.16b, b0.16b, #8;    \
+        ext b1.16b, b1.16b, b1.16b, #8;    \
+        ext b2.16b, b2.16b, b2.16b, #8;    \
+        ext b3.16b, b3.16b, b3.16b, #8;    \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;
+
+#define crypt_blk8(b0, b1, b2, b3, b4, b5, b6, b7) \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        rev32 b4.16b, b4.16b;              \
+        rev32 b5.16b, b5.16b;              \
+        rev32 b6.16b, b6.16b;              \
+        rev32 b7.16b, b7.16b;              \
+        sm4e(b0, v24);                     \
+        sm4e(b1, v24);                     \
+        sm4e(b2, v24);                     \
+        sm4e(b3, v24);                     \
+        sm4e(b4, v24);                     \
+        sm4e(b5, v24);                     \
+        sm4e(b6, v24);                     \
+        sm4e(b7, v24);                     \
+        sm4e(b0, v25);                     \
+        sm4e(b1, v25);                     \
+        sm4e(b2, v25);                     \
+        sm4e(b3, v25);                     \
+        sm4e(b4, v25);                     \
+        sm4e(b5, v25);                     \
+        sm4e(b6, v25);                     \
+        sm4e(b7, v25);                     \
+        sm4e(b0, v26);                     \
+        sm4e(b1, v26);                     \
+        sm4e(b2, v26);                     \
+        sm4e(b3, v26);                     \
+        sm4e(b4, v26);                     \
+        sm4e(b5, v26);                     \
+        sm4e(b6, v26);                     \
+        sm4e(b7, v26);                     \
+        sm4e(b0, v27);                     \
+        sm4e(b1, v27);                     \
+        sm4e(b2, v27);                     \
+        sm4e(b3, v27);                     \
+        sm4e(b4, v27);                     \
+        sm4e(b5, v27);                     \
+        sm4e(b6, v27);                     \
+        sm4e(b7, v27);                     \
+        sm4e(b0, v28);                     \
+        sm4e(b1, v28);                     \
+        sm4e(b2, v28);                     \
+        sm4e(b3, v28);                     \
+        sm4e(b4, v28);                     \
+        sm4e(b5, v28);                     \
+        sm4e(b6, v28);                     \
+        sm4e(b7, v28);                     \
+        sm4e(b0, v29);                     \
+        sm4e(b1, v29);                     \
+        sm4e(b2, v29);                     \
+        sm4e(b3, v29);                     \
+        sm4e(b4, v29);                     \
+        sm4e(b5, v29);                     \
+        sm4e(b6, v29);                     \
+        sm4e(b7, v29);                     \
+        sm4e(b0, v30);                     \
+        sm4e(b1, v30);                     \
+        sm4e(b2, v30);                     \
+        sm4e(b3, v30);                     \
+        sm4e(b4, v30);                     \
+        sm4e(b5, v30);                     \
+        sm4e(b6, v30);                     \
+        sm4e(b7, v30);                     \
+        sm4e(b0, v31);                     \
+        sm4e(b1, v31);                     \
+        sm4e(b2, v31);                     \
+        sm4e(b3, v31);                     \
+        sm4e(b4, v31);                     \
+        sm4e(b5, v31);                     \
+        sm4e(b6, v31);                     \
+        sm4e(b7, v31);                     \
+        rev64 b0.4s, b0.4s;                \
+        rev64 b1.4s, b1.4s;                \
+        rev64 b2.4s, b2.4s;                \
+        rev64 b3.4s, b3.4s;                \
+        rev64 b4.4s, b4.4s;                \
+        rev64 b5.4s, b5.4s;                \
+        rev64 b6.4s, b6.4s;                \
+        rev64 b7.4s, b7.4s;                \
+        ext b0.16b, b0.16b, b0.16b, #8;    \
+        ext b1.16b, b1.16b, b1.16b, #8;    \
+        ext b2.16b, b2.16b, b2.16b, #8;    \
+        ext b3.16b, b3.16b, b3.16b, #8;    \
+        ext b4.16b, b4.16b, b4.16b, #8;    \
+        ext b5.16b, b5.16b, b5.16b, #8;    \
+        ext b6.16b, b6.16b, b6.16b, #8;    \
+        ext b7.16b, b7.16b, b7.16b, #8;    \
+        rev32 b0.16b, b0.16b;              \
+        rev32 b1.16b, b1.16b;              \
+        rev32 b2.16b, b2.16b;              \
+        rev32 b3.16b, b3.16b;              \
+        rev32 b4.16b, b4.16b;              \
+        rev32 b5.16b, b5.16b;              \
+        rev32 b6.16b, b6.16b;              \
+        rev32 b7.16b, b7.16b;
+
+
+.align 3
+.global _gcry_sm4_armv8_ce_expand_key
+ELF(.type _gcry_sm4_armv8_ce_expand_key,%function;)
+_gcry_sm4_armv8_ce_expand_key:
+    /* input:
+     *   x0: 128-bit key
+     *   x1: rkey_enc
+     *   x2: rkey_dec
+     *   x3: fk array
+     *   x4: ck array
+     */
+    CFI_STARTPROC();
+
+    ld1 {v0.16b}, [x0];
+    rev32 v0.16b, v0.16b;
+    ld1 {v1.16b}, [x3];
+    load_rkey(x4);
+
+    /* input ^ fk */
+    eor v0.16b, v0.16b, v1.16b;
+
+    sm4ekey(v0, v0, v24);
+    sm4ekey(v1, v0, v25);
+    sm4ekey(v2, v1, v26);
+    sm4ekey(v3, v2, v27);
+    sm4ekey(v4, v3, v28);
+    sm4ekey(v5, v4, v29);
+    sm4ekey(v6, v5, v30);
+    sm4ekey(v7, v6, v31);
+
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b-v7.16b}, [x1];
+    rev64 v7.4s, v7.4s;
+    rev64 v6.4s, v6.4s;
+    rev64 v5.4s, v5.4s;
+    rev64 v4.4s, v4.4s;
+    rev64 v3.4s, v3.4s;
+    rev64 v2.4s, v2.4s;
+    rev64 v1.4s, v1.4s;
+    rev64 v0.4s, v0.4s;
+    ext v7.16b, v7.16b, v7.16b, #8;
+    ext v6.16b, v6.16b, v6.16b, #8;
+    ext v5.16b, v5.16b, v5.16b, #8;
+    ext v4.16b, v4.16b, v4.16b, #8;
+    ext v3.16b, v3.16b, v3.16b, #8;
+    ext v2.16b, v2.16b, v2.16b, #8;
+    ext v1.16b, v1.16b, v1.16b, #8;
+    ext v0.16b, v0.16b, v0.16b, #8;
+    st1 {v7.16b}, [x2], #16;
+    st1 {v6.16b}, [x2], #16;
+    st1 {v5.16b}, [x2], #16;
+    st1 {v4.16b}, [x2], #16;
+    st1 {v3.16b}, [x2], #16;
+    st1 {v2.16b}, [x2], #16;
+    st1 {v1.16b}, [x2], #16;
+    st1 {v0.16b}, [x2];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;)
+
+.align 3
+ELF(.type sm4_armv8_ce_crypt_blk1_4,%function;)
+sm4_armv8_ce_crypt_blk1_4:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: num blocks (1..4)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+    ld1 {v0.16b}, [x2], #16;
+    mov v1.16b, v0.16b;
+    mov v2.16b, v0.16b;
+    mov v3.16b, v0.16b;
+    cmp x3, #2;
+    blt .Lblk4_load_input_done;
+    ld1 {v1.16b}, [x2], #16;
+    beq .Lblk4_load_input_done;
+    ld1 {v2.16b}, [x2], #16;
+    cmp x3, #3;
+    beq .Lblk4_load_input_done;
+    ld1 {v3.16b}, [x2];
+
+.Lblk4_load_input_done:
+    crypt_blk4(v0, v1, v2, v3);
+
+    st1 {v0.16b}, [x1], #16;
+    cmp x3, #2;
+    blt .Lblk4_store_output_done;
+    st1 {v1.16b}, [x1], #16;
+    beq .Lblk4_store_output_done;
+    st1 {v2.16b}, [x1], #16;
+    cmp x3, #3;
+    beq .Lblk4_store_output_done;
+    st1 {v3.16b}, [x1];
+
+.Lblk4_store_output_done:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size sm4_armv8_ce_crypt_blk1_4,.-sm4_armv8_ce_crypt_blk1_4;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_crypt_blk1_8
+ELF(.type _gcry_sm4_armv8_ce_crypt_blk1_8,%function;)
+_gcry_sm4_armv8_ce_crypt_blk1_8:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: num blocks (1..8)
+     */
+    CFI_STARTPROC();
+
+    cmp x3, #5;
+    blt sm4_armv8_ce_crypt_blk1_4;
+
+    load_rkey(x0);
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b}, [x2], #16;
+    mov v5.16b, v4.16b;
+    mov v6.16b, v4.16b;
+    mov v7.16b, v4.16b;
+    beq .Lblk8_load_input_done;
+    ld1 {v5.16b}, [x2], #16;
+    cmp x3, #7;
+    blt .Lblk8_load_input_done;
+    ld1 {v6.16b}, [x2], #16;
+    beq .Lblk8_load_input_done;
+    ld1 {v7.16b}, [x2];
+
+.Lblk8_load_input_done:
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    cmp x3, #6;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b}, [x1], #16;
+    blt .Lblk8_store_output_done;
+    st1 {v5.16b}, [x1], #16;
+    beq .Lblk8_store_output_done;
+    st1 {v6.16b}, [x1], #16;
+    cmp x3, #7;
+    beq .Lblk8_store_output_done;
+    st1 {v7.16b}, [x1];
+
+.Lblk8_store_output_done:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_crypt
+ELF(.type _gcry_sm4_armv8_ce_crypt,%function;)
+_gcry_sm4_armv8_ce_crypt:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+.Lcrypt_loop_blk:
+    subs x3, x3, #8;
+    bmi .Lcrypt_end;
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b-v7.16b}, [x2], #64;
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    st1 {v0.16b-v3.16b}, [x1], #64;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lcrypt_loop_blk;
+
+.Lcrypt_end:
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_cbc_dec
+ELF(.type _gcry_sm4_armv8_ce_cbc_dec,%function;)
+_gcry_sm4_armv8_ce_cbc_dec:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: iv (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+    ld1 {RIV.16b}, [x3];
+
+.Lcbc_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lcbc_end;
+
+    ld1 {v0.16b-v3.16b}, [x2], #64;
+    ld1 {v4.16b-v7.16b}, [x2];
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    sub x2, x2, #64;
+    eor v0.16b, v0.16b, RIV.16b;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v1.16b, v1.16b, RTMP0.16b;
+    eor v2.16b, v2.16b, RTMP1.16b;
+    eor v3.16b, v3.16b, RTMP2.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    eor v4.16b, v4.16b, RTMP3.16b;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v5.16b, v5.16b, RTMP0.16b;
+    eor v6.16b, v6.16b, RTMP1.16b;
+    eor v7.16b, v7.16b, RTMP2.16b;
+
+    mov RIV.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lcbc_loop_blk;
+
+.Lcbc_end:
+    /* store new IV */
+    st1 {RIV.16b}, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_cfb_dec
+ELF(.type _gcry_sm4_armv8_ce_cfb_dec,%function;)
+_gcry_sm4_armv8_ce_cfb_dec:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: iv (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+    ld1 {v0.16b}, [x3];
+
+.Lcfb_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lcfb_end;
+
+    ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48;
+    ld1 {v4.16b-v7.16b}, [x2];
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    sub x2, x2, #48;
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v0.16b, v0.16b, RTMP0.16b;
+    eor v1.16b, v1.16b, RTMP1.16b;
+    eor v2.16b, v2.16b, RTMP2.16b;
+    eor v3.16b, v3.16b, RTMP3.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v4.16b, v4.16b, RTMP0.16b;
+    eor v5.16b, v5.16b, RTMP1.16b;
+    eor v6.16b, v6.16b, RTMP2.16b;
+    eor v7.16b, v7.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    mov v0.16b, RTMP3.16b;
+
+    b .Lcfb_loop_blk;
+
+.Lcfb_end:
+    /* store new IV */
+    st1 {v0.16b}, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;)
+
+.align 3
+.global _gcry_sm4_armv8_ce_ctr_enc
+ELF(.type _gcry_sm4_armv8_ce_ctr_enc,%function;)
+_gcry_sm4_armv8_ce_ctr_enc:
+    /* input:
+     *   x0: round key array, CTX
+     *   x1: dst
+     *   x2: src
+     *   x3: ctr (big endian, 128 bit)
+     *   x4: nblocks (multiples of 8)
+     */
+    CFI_STARTPROC();
+
+    load_rkey(x0);
+
+    ldp x7, x8, [x3];
+    rev x7, x7;
+    rev x8, x8;
+
+.Lctr_loop_blk:
+    subs x4, x4, #8;
+    bmi .Lctr_end;
+
+#define inc_le128(vctr)       \
+    mov vctr.d[1], x8;        \
+    mov vctr.d[0], x7;        \
+    adds x8, x8, #1;          \
+    adc x7, x7, xzr;          \
+    rev64 vctr.16b, vctr.16b;
+
+    /* construct CTRs */
+    inc_le128(v0);      /* +0 */
+    inc_le128(v1);      /* +1 */
+    inc_le128(v2);      /* +2 */
+    inc_le128(v3);      /* +3 */
+    inc_le128(v4);      /* +4 */
+    inc_le128(v5);      /* +5 */
+    inc_le128(v6);      /* +6 */
+    inc_le128(v7);      /* +7 */
+
+    crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7);
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v0.16b, v0.16b, RTMP0.16b;
+    eor v1.16b, v1.16b, RTMP1.16b;
+    eor v2.16b, v2.16b, RTMP2.16b;
+    eor v3.16b, v3.16b, RTMP3.16b;
+    st1 {v0.16b-v3.16b}, [x1], #64;
+
+    ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64;
+    eor v4.16b, v4.16b, RTMP0.16b;
+    eor v5.16b, v5.16b, RTMP1.16b;
+    eor v6.16b, v6.16b, RTMP2.16b;
+    eor v7.16b, v7.16b, RTMP3.16b;
+    st1 {v4.16b-v7.16b}, [x1], #64;
+
+    b .Lctr_loop_blk;
+
+.Lctr_end:
+    /* store new CTR */
+    rev x7, x7;
+    rev x8, x8;
+    stp x7, x8, [x3];
+
+    ret_spec_stop;
+    CFI_ENDPROC();
+ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;)
+
+#endif
diff --git a/cipher/sm4.c b/cipher/sm4.c
index ec2281b6..79e6dbf1 100644
--- a/cipher/sm4.c
+++ b/cipher/sm4.c
@@ -76,6 +76,15 @@
 # endif
 #endif
 
+#undef USE_ARM_CE
+#ifdef ENABLE_ARM_CRYPTO_SUPPORT
+# if defined(__AARCH64EL__) && \
+     defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
+     defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO)
+#   define USE_ARM_CE 1
+# endif
+#endif
+
 static const char *sm4_selftest (void);
 
 static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr,
@@ -106,6 +115,9 @@ typedef struct
 #ifdef USE_AARCH64_SIMD
   unsigned int use_aarch64_simd:1;
 #endif
+#ifdef USE_ARM_CE
+  unsigned int use_arm_ce:1;
+#endif
 } SM4_context;
 
 static const u32 fk[4] =
@@ -286,6 +298,43 @@ sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in,
 }
 #endif /* USE_AARCH64_SIMD */
 
+#ifdef USE_ARM_CE
+extern void _gcry_sm4_armv8_ce_expand_key(const byte *key,
+					  u32 *rkey_enc, u32 *rkey_dec,
+					  const u32 *fk, const u32 *ck);
+
+extern void _gcry_sm4_armv8_ce_crypt(const u32 *rk, byte *out,
+				     const byte *in,
+				     size_t num_blocks);
+
+extern void _gcry_sm4_armv8_ce_ctr_enc(const u32 *rk_enc, byte *out,
+				       const byte *in,
+				       byte *ctr,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_cbc_dec(const u32 *rk_dec, byte *out,
+				       const byte *in,
+				       byte *iv,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_cfb_dec(const u32 *rk_enc, byte *out,
+				       const byte *in,
+				       byte *iv,
+				       size_t nblocks);
+
+extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out,
+					    const byte *in,
+					    size_t num_blocks);
+
+static inline unsigned int
+sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in,
+			 unsigned int num_blks)
+{
+  _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks);
+  return 0;
+}
+#endif /* USE_ARM_CE */
+
 static inline void prefetch_sbox_table(void)
 {
   const volatile byte *vtab = (void *)&sbox_table;
@@ -363,6 +412,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key)
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      _gcry_sm4_armv8_ce_expand_key (key, ctx->rkey_enc, ctx->rkey_dec,
+				     fk, ck);
+      return;
+    }
+#endif
+
   rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0];
   rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1];
   rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2];
@@ -420,6 +478,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
 #ifdef USE_AARCH64_SIMD
   ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON);
 #endif
+#ifdef USE_ARM_CE
+  ctx->use_arm_ce = !!(hwf & HWF_ARM_SM4);
+#endif
 
   /* Setup bulk encryption routines.  */
   memset (bulk_ops, 0, sizeof(*bulk_ops));
@@ -465,6 +526,11 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_enc, outbuf, inbuf, 1);
+#endif
+
   prefetch_sbox_table ();
 
   return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);
@@ -475,6 +541,11 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_dec, outbuf, inbuf, 1);
+#endif
+
   prefetch_sbox_table ();
 
   return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf);
@@ -601,6 +672,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -634,6 +722,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -725,6 +819,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -758,6 +869,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -842,6 +959,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
     }
 #endif
 
+#ifdef USE_ARM_CE
+  if (ctx->use_arm_ce)
+    {
+      /* Process multiples of 8 blocks at a time. */
+      if (nblocks >= 8)
+        {
+          size_t nblks = nblocks & ~(8 - 1);
+
+          _gcry_sm4_armv8_ce_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks);
+
+          nblocks -= nblks;
+          outbuf += nblks * 16;
+          inbuf += nblks * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AARCH64_SIMD
   if (ctx->use_aarch64_simd)
     {
@@ -875,6 +1009,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -1037,6 +1177,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
@@ -1203,6 +1349,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
 	  crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8;
 	}
 #endif
+#ifdef USE_ARM_CE
+      else if (ctx->use_arm_ce)
+	{
+	  crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8;
+	}
+#endif
 #ifdef USE_AARCH64_SIMD
       else if (ctx->use_aarch64_simd)
 	{
diff --git a/configure.ac b/configure.ac
index f5363f22..e20f9d13 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2755,6 +2755,7 @@ if test "$found" = "1" ; then
       aarch64-*-*)
          # Build with the assembly implementation
          GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo"
+         GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-armv8-aarch64-ce.lo"
    esac
 fi
 
-- 
2.34.1


From fweimer at redhat.com  Tue Mar  1 12:17:06 2022
From: fweimer at redhat.com (Florian Weimer)
Date: Tue, 01 Mar 2022 12:17:06 +0100
Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section
In-Reply-To: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> (Clemens Lang's
 message of "Mon, 14 Feb 2022 13:46:19 +0100")
References: <20220211155723.86516-1-cllang@redhat.com>
 <87pmnttmep.fsf@oldenburg.str.redhat.com>
 <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com>
Message-ID: <87bkypq5vx.fsf@oldenburg.str.redhat.com>

* Clemens Lang:

> From what I can see, it currently uses the same approach, and probably
> has the same issue where the compiler could assume that the HMAC is 0
> and constant-propagate that. Again, this currently works just fine with
> GCC, but I don?t think it?s a good idea to rely on GCC?s unwillingness
> to replace a memcmp(3) with a few assembly instructions.

Maybe GCC should provide an explicit way to treat a data object as
constant for linking purposes, while having a compiler barrier around
that.

Separate compilation does that, but the data object must not be compiled
with LTO, and that probably any whole-world assumption that makes LTO so
successful.

> The currently merged state assumes the offset in the file matches the
> address at runtime. This is probably not a good assumption to make. How
> would you determine the offset of a symbol in a file given its runtime
> address? Find the matching program header entry that must have loaded it
> and subtracting the difference between p_vaddr and p_offset?

Yes, I think that should work.

Thanks,
Florian


From jussi.kivilinna at iki.fi  Wed Mar  2 20:13:36 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Wed, 2 Mar 2022 21:13:36 +0200
Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
In-Reply-To: <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com>
References: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com>
 <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com>
Message-ID: <2898303a-2dda-f96c-30bd-ded69528b7fa@iki.fi>

Hello,

Applied to master. Thanks.

-Jussi

On 1.3.2022 11.56, Tianjia Zhang wrote:
> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'.
> * cipher/sm4-armv8-aarch64-ce.S: New.
> * cipher/sm4.c (USE_ARM_CE): New.
> (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'.
> [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key)
> (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc)
> (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec)
> (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New.
> (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup.
> (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW.
> (sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption.
> (sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption.
> (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
> (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add
> ARMv8/AArch64/CE bulk functions.
> * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'.
> --
> 
> This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk
> functions process eight blocks in parallel.
> 


From jussi.kivilinna at iki.fi  Sun Mar  6 18:19:09 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun,  6 Mar 2022 19:19:09 +0200
Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512"
In-Reply-To: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
Message-ID: <20220306171910.1011180-2-jussi.kivilinna@iki.fi>

* configure.ac (avx512support, gcry_cv_gcc_inline_asm_avx512)
(ENABLE_AVX512_SUPPORT): New.
* src/g10lib.h (HWF_INTEL_AVX512): New.
* src/hwf-x86.c (detect_x86_gnuc): Add AVX512 detection.
* src/hwfeatures.c (hwflist): Add "intel-avx512".
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 configure.ac     | 36 +++++++++++++++++++++++++++++++++++
 src/g10lib.h     |  1 +
 src/hwf-x86.c    | 49 +++++++++++++++++++++++++++++++++++++++++++++---
 src/hwfeatures.c |  1 +
 4 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/configure.ac b/configure.ac
index e20f9d13..27d72141 100644
--- a/configure.ac
+++ b/configure.ac
@@ -667,6 +667,14 @@ AC_ARG_ENABLE(avx2-support,
 	      avx2support=$enableval,avx2support=yes)
 AC_MSG_RESULT($avx2support)
 
+# Implementation of the --disable-avx512-support switch.
+AC_MSG_CHECKING([whether AVX512 support is requested])
+AC_ARG_ENABLE(avx512-support,
+              AS_HELP_STRING([--disable-avx512-support],
+                 [Disable support for the Intel AVX512 instructions]),
+	      avx512support=$enableval,avx512support=yes)
+AC_MSG_RESULT($avx512support)
+
 # Implementation of the --disable-neon-support switch.
 AC_MSG_CHECKING([whether NEON support is requested])
 AC_ARG_ENABLE(neon-support,
@@ -1545,6 +1553,29 @@ if test "$gcry_cv_gcc_inline_asm_avx2" = "yes" ; then
 fi
 
 
+#
+# Check whether GCC inline assembler supports AVX512 instructions
+#
+AC_CACHE_CHECK([whether GCC inline assembler supports AVX512 instructions],
+       [gcry_cv_gcc_inline_asm_avx512],
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
+          gcry_cv_gcc_inline_asm_avx512="n/a"
+        else
+          gcry_cv_gcc_inline_asm_avx512=no
+          AC_LINK_IFELSE([AC_LANG_PROGRAM(
+          [[void a(void) {
+              __asm__("xgetbv; vpopcntq %%zmm7, %%zmm1%{%%k1%}%{z%};\n\t":::"cc");
+              __asm__("vpexpandb %%zmm3, %%zmm1;\n\t":::"cc");
+            }]], [ a(); ] )],
+          [gcry_cv_gcc_inline_asm_avx512=yes])
+        fi])
+if test "$gcry_cv_gcc_inline_asm_avx512" = "yes" ; then
+   AC_DEFINE(HAVE_GCC_INLINE_ASM_AVX512,1,
+     [Defined if inline assembler supports AVX512 instructions])
+fi
+
+
 #
 # Check whether GCC inline assembler supports VAES and VPCLMUL instructions
 #
@@ -2409,6 +2440,10 @@ if test x"$avx2support" = xyes ; then
   AC_DEFINE(ENABLE_AVX2_SUPPORT,1,
             [Enable support for Intel AVX2 instructions.])
 fi
+if test x"$avx512support" = xyes ; then
+  AC_DEFINE(ENABLE_AVX512_SUPPORT,1,
+            [Enable support for Intel AVX512 instructions.])
+fi
 if test x"$neonsupport" = xyes ; then
   AC_DEFINE(ENABLE_NEON_SUPPORT,1,
             [Enable support for ARM NEON instructions.])
@@ -3266,6 +3301,7 @@ GCRY_MSG_SHOW([Try using Intel SSE4.1:   ],[$sse41support])
 GCRY_MSG_SHOW([Try using DRNG (RDRAND):  ],[$drngsupport])
 GCRY_MSG_SHOW([Try using Intel AVX:      ],[$avxsupport])
 GCRY_MSG_SHOW([Try using Intel AVX2:     ],[$avx2support])
+GCRY_MSG_SHOW([Try using Intel AVX512:   ],[$avx512support])
 GCRY_MSG_SHOW([Try using ARM NEON:       ],[$neonsupport])
 GCRY_MSG_SHOW([Try using ARMv8 crypto:   ],[$armcryptosupport])
 GCRY_MSG_SHOW([Try using PPC crypto:     ],[$ppccryptosupport])
diff --git a/src/g10lib.h b/src/g10lib.h
index 985e75c6..c07ed788 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -237,6 +237,7 @@ char **_gcry_strtokenize (const char *string, const char *delim);
 #define HWF_INTEL_RDTSC         (1 << 15)
 #define HWF_INTEL_SHAEXT        (1 << 16)
 #define HWF_INTEL_VAES_VPCLMUL  (1 << 17)
+#define HWF_INTEL_AVX512        (1 << 18)
 
 #elif defined(HAVE_CPU_ARCH_ARM)
 
diff --git a/src/hwf-x86.c b/src/hwf-x86.c
index a1aa02e7..0a5266c6 100644
--- a/src/hwf-x86.c
+++ b/src/hwf-x86.c
@@ -182,12 +182,14 @@ detect_x86_gnuc (void)
   } vendor_id;
   unsigned int features, features2;
   unsigned int os_supports_avx_avx2_registers = 0;
+  unsigned int os_supports_avx512_registers = 0;
   unsigned int max_cpuid_level;
   unsigned int fms, family, model;
   unsigned int result = 0;
   unsigned int avoid_vpgather = 0;
 
   (void)os_supports_avx_avx2_registers;
+  (void)os_supports_avx512_registers;
 
   if (!is_cpuid_available())
     return 0;
@@ -338,13 +340,22 @@ detect_x86_gnuc (void)
   if (features & 0x02000000)
      result |= HWF_INTEL_AESNI;
 #endif /*ENABLE_AESNI_SUPPORT*/
-#if defined(ENABLE_AVX_SUPPORT) || defined(ENABLE_AVX2_SUPPORT)
-  /* Test bit 27 for OSXSAVE (required for AVX/AVX2).  */
+#if defined(ENABLE_AVX_SUPPORT) || defined(ENABLE_AVX2_SUPPORT) \
+    || defined(ENABLE_AVX512_SUPPORT)
+  /* Test bit 27 for OSXSAVE (required for AVX/AVX2/AVX512).  */
   if (features & 0x08000000)
     {
+      unsigned int xmm_ymm_mask = (1 << 2) | (1 << 1);
+      unsigned int zmm15_ymm31_k7_mask = (1 << 7) | (1 << 6) | (1 << 5);
+      unsigned int xgetbv = get_xgetbv();
+
       /* Check that OS has enabled both XMM and YMM state support.  */
-      if ((get_xgetbv() & 0x6) == 0x6)
+      if ((xgetbv & xmm_ymm_mask) == xmm_ymm_mask)
         os_supports_avx_avx2_registers = 1;
+
+      /* Check that OS has enabled both XMM and YMM state support.  */
+      if ((xgetbv & zmm15_ymm31_k7_mask) == zmm15_ymm31_k7_mask)
+        os_supports_avx512_registers = 1;
     }
 #endif
 #ifdef ENABLE_AVX_SUPPORT
@@ -396,6 +407,38 @@ detect_x86_gnuc (void)
       if ((features2 & 0x00000200) && (features2 & 0x00000400))
         result |= HWF_INTEL_VAES_VPCLMUL;
 #endif
+
+#ifdef ENABLE_AVX512_SUPPORT
+      /* Test for AVX512 features. List of features is selected so that
+       * supporting CPUs are new enough not to suffer from reduced clock
+       * frequencies when AVX512 is used, which was issue on early AVX512
+       * capable CPUs.
+       *  - AVX512F (features bit 16)
+       *  - AVX512DQ (features bit 17)
+       *  - AVX512IFMA (features bit 21)
+       *  - AVX512CD (features bit 28)
+       *  - AVX512BW (features bit 30)
+       *  - AVX512VL (features bit 31)
+       *  - AVX512_VBMI (features2 bit 1)
+       *  - AVX512_VBMI2 (features2 bit 6)
+       *  - AVX512_VNNI (features2 bit 11)
+       *  - AVX512_BITALG (features2 bit 12)
+       *  - AVX512_VPOPCNTDQ (features2 bit 14)
+       */
+      if (os_supports_avx512_registers
+	  && (features & (1 << 16))
+	  && (features & (1 << 17))
+	  && (features & (1 << 21))
+	  && (features & (1 << 28))
+	  && (features & (1 << 30))
+	  && (features & (1 << 31))
+	  && (features2 & (1 << 1))
+	  && (features2 & (1 << 6))
+	  && (features2 & (1 << 11))
+	  && (features2 & (1 << 12))
+	  && (features2 & (1 << 14)))
+	result |= HWF_INTEL_AVX512;
+#endif
     }
 
   return result;
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 7060d995..8e92cbdd 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -62,6 +62,7 @@ static struct
     { HWF_INTEL_RDTSC,         "intel-rdtsc" },
     { HWF_INTEL_SHAEXT,        "intel-shaext" },
     { HWF_INTEL_VAES_VPCLMUL,  "intel-vaes-vpclmul" },
+    { HWF_INTEL_AVX512,        "intel-avx512" },
 #elif defined(HAVE_CPU_ARCH_ARM)
     { HWF_ARM_NEON,            "arm-neon" },
     { HWF_ARM_AES,             "arm-aes" },
-- 
2.32.0


From jussi.kivilinna at iki.fi  Sun Mar  6 18:19:08 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun,  6 Mar 2022 19:19:08 +0200
Subject: [PATCH 1/3] ghash|polyval: add x86_64 VPCLMUL/AVX2 accelerated
 implementation
Message-ID: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>

* cipher/cipher-gcm-intel-pclmul.c (GCM_INTEL_USE_VPCLMUL_AVX2)
(GCM_INTEL_AGGR8_TABLE_INITIALIZED)
(GCM_INTEL_AGGR16_TABLE_INITIALIZED): New.
(gfmul_pclmul): Fixes to comments.
[GCM_USE_INTEL_VPCLMUL_AVX2] (GFMUL_AGGR16_ASM_VPCMUL_AVX2)
(gfmul_vpclmul_avx2_aggr16, gfmul_vpclmul_avx2_aggr16_le)
(gfmul_pclmul_avx2, gcm_lsh_avx2, load_h1h2_to_ymm1)
(ghash_setup_aggr8_avx2, ghash_setup_aggr16_avx2): New.
(_gcry_ghash_setup_intel_pclmul): Add 'hw_features' parameter; Setup
ghash and polyval function pointers for context; Add VPCLMUL/AVX2 code
path; Defer aggr8 and aggr16 table initialization to until first use in
'_gcry_ghash_intel_pclmul' or '_gcry_polyval_intel_pclmul'.
[__x86_64__] (ghash_setup_aggr8): New.
(_gcry_ghash_intel_pclmul): Add VPCLMUL/AVX2 code path; Add call for
aggr8 table initialization.
(_gcry_polyval_intel_pclmul): Add VPCLMUL/AVX2 code path; Add call for
aggr8 table initialization.
* cipher/cipher-gcm.c [GCM_USE_INTEL_PCLMUL] (_gcry_ghash_intel_pclmul)
(_gcry_polyval_intel_pclmul): Remove.
[GCM_USE_INTEL_PCLMUL] (_gcry_ghash_setup_intel_pclmul): Add
'hw_features' parameter.
(setupM) [GCM_USE_INTEL_PCLMUL]: Pass HW features to
'_gcry_ghash_setup_intel_pclmul'; Let '_gcry_ghash_setup_intel_pclmul'
setup function pointers.
* cipher/cipher-internal.h (GCM_USE_INTEL_VPCLMUL_AVX2): New.
(gcry_cipher_handle): Add member 'gcm.hw_impl_flags'.
--

Patch adds VPCLMUL/AVX2 accelerated implementation for GHASH (GCM) and
POLYVAL (GCM-SIV).

Benchmark on AMD Ryzen 5800X (zen3):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.088 ns/B     10825 MiB/s     0.427 c/B      4850
   GCM-SIV auth |     0.083 ns/B     11472 MiB/s     0.403 c/B      4850

After: (~1.93x faster)
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.045 ns/B     21098 MiB/s     0.219 c/B      4850
   GCM-SIV auth |     0.043 ns/B     22181 MiB/s     0.209 c/B      4850

AES128-GCM / AES128-GCM-SIV encryption:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        GCM enc |     0.079 ns/B     12073 MiB/s     0.383 c/B      4850
    GCM-SIV enc |     0.076 ns/B     12500 MiB/s     0.370 c/B      4850

Benchmark on Intel Core i3-1115G4 (tigerlake):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.080 ns/B     11919 MiB/s     0.327 c/B      4090
   GCM-SIV auth |     0.075 ns/B     12643 MiB/s     0.309 c/B      4090

After: (~1.28x faster)
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.062 ns/B     15348 MiB/s     0.254 c/B      4090
   GCM-SIV auth |     0.058 ns/B     16381 MiB/s     0.238 c/B      4090

AES128-GCM / AES128-GCM-SIV encryption:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        GCM enc |     0.101 ns/B      9441 MiB/s     0.413 c/B      4090
    GCM-SIV enc |     0.098 ns/B      9692 MiB/s     0.402 c/B      4089

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/cipher-gcm-intel-pclmul.c | 809 +++++++++++++++++++++++++++----
 cipher/cipher-gcm.c              |  15 +-
 cipher/cipher-internal.h         |  11 +
 3 files changed, 724 insertions(+), 111 deletions(-)

diff --git a/cipher/cipher-gcm-intel-pclmul.c b/cipher/cipher-gcm-intel-pclmul.c
index daf807d0..b7324e8f 100644
--- a/cipher/cipher-gcm-intel-pclmul.c
+++ b/cipher/cipher-gcm-intel-pclmul.c
@@ -1,6 +1,6 @@
 /* cipher-gcm-intel-pclmul.c  -  Intel PCLMUL accelerated Galois Counter Mode
  *                               implementation
- * Copyright (C) 2013-2014,2019 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ * Copyright (C) 2013-2014,2019,2022 Jussi Kivilinna <jussi.kivilinna at iki.fi>
  *
  * This file is part of Libgcrypt.
  *
@@ -49,12 +49,18 @@
 #define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE
 
 
+#define GCM_INTEL_USE_VPCLMUL_AVX2         (1 << 0)
+#define GCM_INTEL_AGGR8_TABLE_INITIALIZED  (1 << 1)
+#define GCM_INTEL_AGGR16_TABLE_INITIALIZED (1 << 2)
+
+
 /*
  Intel PCLMUL ghash based on white paper:
   "Intel? Carry-Less Multiplication Instruction and its Usage for Computing the
    GCM Mode - Rev 2.01"; Shay Gueron, Michael E. Kounavis.
  */
-static ASM_FUNC_ATTR_INLINE void reduction(void)
+static ASM_FUNC_ATTR_INLINE
+void reduction(void)
 {
   /* input: <xmm1:xmm3> */
 
@@ -83,7 +89,8 @@ static ASM_FUNC_ATTR_INLINE void reduction(void)
                 ::: "memory" );
 }
 
-static ASM_FUNC_ATTR_INLINE void gfmul_pclmul(void)
+static ASM_FUNC_ATTR_INLINE
+void gfmul_pclmul(void)
 {
   /* Input: XMM0 and XMM1, Output: XMM1. Input XMM0 stays unmodified.
      Input must be converted to little-endian.
@@ -358,12 +365,12 @@ gfmul_pclmul_aggr4_le(const void *buf, const void *h_1, const void *h_table)
                                                                                \
     "pshufd $78, %%xmm8, %%xmm11\n\t"                                          \
     "pshufd $78, %%xmm5, %%xmm7\n\t"                                           \
-    "pxor %%xmm8, %%xmm11\n\t"  /* xmm11 holds 4:a0+a1 */                      \
-    "pxor %%xmm5, %%xmm7\n\t"   /* xmm7 holds 4:b0+b1 */                       \
+    "pxor %%xmm8, %%xmm11\n\t"  /* xmm11 holds 2:a0+a1 */                      \
+    "pxor %%xmm5, %%xmm7\n\t"   /* xmm7 holds 2:b0+b1 */                       \
     "movdqa %%xmm8, %%xmm6\n\t"                                                \
-    "pclmulqdq $0, %%xmm5, %%xmm6\n\t"   /* xmm6 holds 4:a0*b0 */              \
-    "pclmulqdq $17, %%xmm8, %%xmm5\n\t"  /* xmm5 holds 4:a1*b1 */              \
-    "pclmulqdq $0, %%xmm11, %%xmm7\n\t"  /* xmm7 holds 4:(a0+a1)*(b0+b1) */    \
+    "pclmulqdq $0, %%xmm5, %%xmm6\n\t"   /* xmm6 holds 2:a0*b0 */              \
+    "pclmulqdq $17, %%xmm8, %%xmm5\n\t"  /* xmm5 holds 2:a1*b1 */              \
+    "pclmulqdq $0, %%xmm11, %%xmm7\n\t"  /* xmm7 holds 2:(a0+a1)*(b0+b1) */    \
                                                                                \
     "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 2+3+4+5+6+7+8:a0*b0 */             \
     "pxor %%xmm5, %%xmm1\n\t" /* xmm1 holds 2+3+4+5+6+7+8:a1*b1 */             \
@@ -371,16 +378,16 @@ gfmul_pclmul_aggr4_le(const void *buf, const void *h_1, const void *h_table)
                                                                                \
     "pshufd $78, %%xmm0, %%xmm11\n\t"                                          \
     "pshufd $78, %%xmm2, %%xmm7\n\t"                                           \
-    "pxor %%xmm0, %%xmm11\n\t" /* xmm11 holds 3:a0+a1 */                       \
-    "pxor %%xmm2, %%xmm7\n\t"  /* xmm7 holds 3:b0+b1 */                        \
+    "pxor %%xmm0, %%xmm11\n\t" /* xmm11 holds 1:a0+a1 */                       \
+    "pxor %%xmm2, %%xmm7\n\t"  /* xmm7 holds 1:b0+b1 */                        \
     "movdqa %%xmm0, %%xmm6\n\t"                                                \
-    "pclmulqdq $0, %%xmm2, %%xmm6\n\t"  /* xmm6 holds 3:a0*b0 */               \
-    "pclmulqdq $17, %%xmm0, %%xmm2\n\t" /* xmm2 holds 3:a1*b1 */               \
-    "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 3:(a0+a1)*(b0+b1) */     \
+    "pclmulqdq $0, %%xmm2, %%xmm6\n\t"  /* xmm6 holds 1:a0*b0 */               \
+    "pclmulqdq $17, %%xmm0, %%xmm2\n\t" /* xmm2 holds 1:a1*b1 */               \
+    "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 1:(a0+a1)*(b0+b1) */     \
                                                                                \
-    "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 1+2+3+3+4+5+6+7+8:a0*b0 */         \
-    "pxor %%xmm2, %%xmm1\n\t" /* xmm1 holds 1+2+3+3+4+5+6+7+8:a1*b1 */         \
-    "pxor %%xmm7, %%xmm4\n\t"/* xmm4 holds 1+2+3+3+4+5+6+7+8:(a0+a1)*(b0+b1) */\
+    "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 1+2+3+4+5+6+7+8:a0*b0 */           \
+    "pxor %%xmm2, %%xmm1\n\t" /* xmm1 holds 1+2+3+4+5+6+7+8:a1*b1 */           \
+    "pxor %%xmm7, %%xmm4\n\t"/* xmm4 holds 1+2+3+4+5+6+7+8:(a0+a1)*(b0+b1) */  \
                                                                                \
     /* aggregated reduction... */                                              \
     "movdqa %%xmm3, %%xmm5\n\t"                                                \
@@ -432,14 +439,409 @@ gfmul_pclmul_aggr8_le(const void *buf, const void *h_table)
 
   reduction();
 }
-#endif
 
-static ASM_FUNC_ATTR_INLINE void gcm_lsh(void *h, unsigned int hoffs)
+#ifdef GCM_USE_INTEL_VPCLMUL_AVX2
+
+#define GFMUL_AGGR16_ASM_VPCMUL_AVX2(be_to_le)                                          \
+    /* perform clmul and merge results... */                                            \
+    "vmovdqu 0*16(%[buf]), %%ymm5\n\t"                                                  \
+    "vmovdqu 2*16(%[buf]), %%ymm2\n\t"                                                  \
+    be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */                      \
+    be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */                      \
+    "vpxor %%ymm5, %%ymm1, %%ymm1\n\t"                                                  \
+                                                                                        \
+    "vpshufd $78, %%ymm0, %%ymm5\n\t"                                                   \
+    "vpshufd $78, %%ymm1, %%ymm4\n\t"                                                   \
+    "vpxor %%ymm0, %%ymm5, %%ymm5\n\t" /* ymm5 holds 15|16:a0+a1 */                     \
+    "vpxor %%ymm1, %%ymm4, %%ymm4\n\t" /* ymm4 holds 15|16:b0+b1 */                     \
+    "vpclmulqdq $0, %%ymm1, %%ymm0, %%ymm3\n\t"  /* ymm3 holds 15|16:a0*b0 */           \
+    "vpclmulqdq $17, %%ymm0, %%ymm1, %%ymm1\n\t" /* ymm1 holds 15|16:a1*b1 */           \
+    "vpclmulqdq $0, %%ymm5, %%ymm4, %%ymm4\n\t"  /* ymm4 holds 15|16:(a0+a1)*(b0+b1) */ \
+                                                                                        \
+    "vmovdqu %[h1_h2], %%ymm0\n\t"                                                      \
+                                                                                        \
+    "vpshufd $78, %%ymm13, %%ymm14\n\t"                                                 \
+    "vpshufd $78, %%ymm2, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm13, %%ymm14, %%ymm14\n\t" /* ymm14 holds 13|14:a0+a1 */                 \
+    "vpxor %%ymm2, %%ymm7, %%ymm7\n\t"    /* ymm7 holds 13|14:b0+b1 */                  \
+    "vpclmulqdq $0, %%ymm2, %%ymm13, %%ymm6\n\t"  /* ymm6 holds 13|14:a0*b0 */          \
+    "vpclmulqdq $17, %%ymm13, %%ymm2, %%ymm2\n\t" /* ymm2 holds 13|14:a1*b1 */          \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t"  /* ymm7 holds 13|14:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 13+15|14+16:a0*b0 */               \
+    "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 13+15|14+16:a1*b1 */               \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 13+15|14+16:(a0+a1)*(b0+b1) */     \
+                                                                                        \
+    "vmovdqu 4*16(%[buf]), %%ymm5\n\t"                                                  \
+    "vmovdqu 6*16(%[buf]), %%ymm2\n\t"                                                  \
+    be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */                      \
+    be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */                      \
+                                                                                        \
+    "vpshufd $78, %%ymm12, %%ymm14\n\t"                                                 \
+    "vpshufd $78, %%ymm5, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm12, %%ymm14, %%ymm14\n\t" /* ymm14 holds 11|12:a0+a1 */                 \
+    "vpxor %%ymm5, %%ymm7, %%ymm7\n\t"    /* ymm7 holds 11|12:b0+b1 */                  \
+    "vpclmulqdq $0, %%ymm5, %%ymm12, %%ymm6\n\t"  /* ymm6 holds 11|12:a0*b0 */          \
+    "vpclmulqdq $17, %%ymm12, %%ymm5, %%ymm5\n\t" /* ymm5 holds 11|12:a1*b1 */          \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t"  /* ymm7 holds 11|12:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 11+13+15|12+14+16:a0*b0 */         \
+    "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 11+13+15|12+14+16:a1*b1 */         \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 11+13+15|12+14+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vpshufd $78, %%ymm11, %%ymm14\n\t"                                                 \
+    "vpshufd $78, %%ymm2, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm11, %%ymm14, %%ymm14\n\t" /* ymm14 holds 9|10:a0+a1 */                  \
+    "vpxor %%ymm2, %%ymm7, %%ymm7\n\t"    /* ymm7 holds 9|10:b0+b1 */                   \
+    "vpclmulqdq $0, %%ymm2, %%ymm11, %%ymm6\n\t"  /* ymm6 holds 9|10:a0*b0 */           \
+    "vpclmulqdq $17, %%ymm11, %%ymm2, %%ymm2\n\t" /* ymm2 holds 9|10:a1*b1 */           \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 9|10:(a0+a1)*(b0+b1) */  \
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 9+11+?+15|10+12+?+16:a0*b0 */      \
+    "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 9+11+?+15|10+12+?+16:a1*b1 */      \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 9+11+?+15|10+12+?+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vmovdqu 8*16(%[buf]), %%ymm5\n\t"                                                  \
+    "vmovdqu 10*16(%[buf]), %%ymm2\n\t"                                                 \
+    be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */                      \
+    be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */                      \
+                                                                                        \
+    "vpshufd $78, %%ymm10, %%ymm14\n\t"                                                 \
+    "vpshufd $78, %%ymm5, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm10, %%ymm14, %%ymm14\n\t" /* ymm14 holds 7|8:a0+a1 */                   \
+    "vpxor %%ymm5, %%ymm7, %%ymm7\n\t"    /* ymm7 holds 7|8:b0+b1 */                    \
+    "vpclmulqdq $0, %%ymm5, %%ymm10, %%ymm6\n\t"  /* ymm6 holds 7|8:a0*b0 */            \
+    "vpclmulqdq $17, %%ymm10, %%ymm5, %%ymm5\n\t" /* ymm5 holds 7|8:a1*b1 */            \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 7|8:(a0+a1)*(b0+b1) */   \
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 7+9+?+15|8+10+?+16:a0*b0 */        \
+    "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 7+9+?+15|8+10+?+16:a1*b1 */        \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 7+9+?+15|8+10+?+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vpshufd $78, %%ymm9, %%ymm14\n\t"                                                  \
+    "vpshufd $78, %%ymm2, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm9, %%ymm14, %%ymm14\n\t" /* ymm14 holds 5|6:a0+a1 */                    \
+    "vpxor %%ymm2, %%ymm7, %%ymm7\n\t"   /* ymm7 holds 5|6:b0+b1 */                     \
+    "vpclmulqdq $0, %%ymm2, %%ymm9, %%ymm6\n\t"  /* ymm6 holds 5|6:a0*b0 */             \
+    "vpclmulqdq $17, %%ymm9, %%ymm2, %%ymm2\n\t" /* ymm2 holds 5|6:a1*b1 */             \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 5|6:(a0+a1)*(b0+b1) */   \
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 5+7+?+15|6+8+?+16:a0*b0 */         \
+    "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 5+7+?+15|6+8+?+16:a1*b1 */         \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 5+7+?+15|6+8+?+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vmovdqu 12*16(%[buf]), %%ymm5\n\t"                                                 \
+    "vmovdqu 14*16(%[buf]), %%ymm2\n\t"                                                 \
+    be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */                      \
+    be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */                      \
+                                                                                        \
+    "vpshufd $78, %%ymm8, %%ymm14\n\t"                                                  \
+    "vpshufd $78, %%ymm5, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm8, %%ymm14, %%ymm14\n\t" /* ymm14 holds 3|4:a0+a1 */                    \
+    "vpxor %%ymm5, %%ymm7, %%ymm7\n\t"   /* ymm7 holds 3|4:b0+b1 */                     \
+    "vpclmulqdq $0, %%ymm5, %%ymm8, %%ymm6\n\t"  /* ymm6 holds 3|4:a0*b0 */             \
+    "vpclmulqdq $17, %%ymm8, %%ymm5, %%ymm5\n\t" /* ymm5 holds 3|4:a1*b1 */             \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 3|4:(a0+a1)*(b0+b1) */   \
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 3+5+?+15|4+6+?+16:a0*b0 */         \
+    "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 3+5+?+15|4+6+?+16:a1*b1 */         \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 3+5+?+15|4+6+?+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    "vpshufd $78, %%ymm0, %%ymm14\n\t"                                                  \
+    "vpshufd $78, %%ymm2, %%ymm7\n\t"                                                   \
+    "vpxor %%ymm0, %%ymm14, %%ymm14\n\t" /* ymm14 holds 1|2:a0+a1 */                    \
+    "vpxor %%ymm2, %%ymm7, %%ymm7\n\t"   /* ymm7 holds 1|2:b0+b1 */                     \
+    "vpclmulqdq $0, %%ymm2, %%ymm0, %%ymm6\n\t"  /* ymm6 holds 1|2:a0*b0 */             \
+    "vpclmulqdq $17, %%ymm0, %%ymm2, %%ymm2\n\t" /* ymm2 holds 1|2:a1*b1 */             \
+    "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 1|2:(a0+a1)*(b0+b1) */   \
+                                                                                        \
+    "vmovdqu %[h15_h16], %%ymm0\n\t"                                                    \
+                                                                                        \
+    "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 1+3+?+15|2+4+?+16:a0*b0 */         \
+    "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 1+3+?+15|2+4+?+16:a1*b1 */         \
+    "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 1+3+?+15|2+4+?+16:(a0+a1)*(b0+b1) */\
+                                                                                        \
+    /* aggregated reduction... */                                                       \
+    "vpxor %%ymm1, %%ymm3, %%ymm5\n\t" /* ymm5 holds a0*b0+a1*b1 */                     \
+    "vpxor %%ymm5, %%ymm4, %%ymm4\n\t" /* ymm4 holds a0*b0+a1*b1+(a0+a1)*(b0+b1) */     \
+    "vpslldq $8, %%ymm4, %%ymm5\n\t"                                                    \
+    "vpsrldq $8, %%ymm4, %%ymm4\n\t"                                                    \
+    "vpxor %%ymm5, %%ymm3, %%ymm3\n\t"                                                  \
+    "vpxor %%ymm4, %%ymm1, %%ymm1\n\t" /* <ymm1:xmm3> holds the result of the           \
+                                          carry-less multiplication of ymm0             \
+                                          by ymm1 */                                    \
+                                                                                        \
+    /* first phase of the reduction */                                                  \
+    "vpsllq $1, %%ymm3, %%ymm6\n\t"  /* packed right shifting << 63 */                  \
+    "vpxor %%ymm3, %%ymm6, %%ymm6\n\t"                                                  \
+    "vpsllq $57, %%ymm3, %%ymm5\n\t"  /* packed right shifting << 57 */                 \
+    "vpsllq $62, %%ymm6, %%ymm6\n\t"  /* packed right shifting << 62 */                 \
+    "vpxor %%ymm5, %%ymm6, %%ymm6\n\t" /* xor the shifted versions */                   \
+    "vpshufd $0x6a, %%ymm6, %%ymm5\n\t"                                                 \
+    "vpshufd $0xae, %%ymm6, %%ymm6\n\t"                                                 \
+    "vpxor %%ymm5, %%ymm3, %%ymm3\n\t" /* first phase of the reduction complete */      \
+                                                                                        \
+    /* second phase of the reduction */                                                 \
+    "vpxor %%ymm3, %%ymm1, %%ymm1\n\t" /* xor the shifted versions */                   \
+    "vpsrlq $1, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 1 */                  \
+    "vpxor %%ymm3, %%ymm6, %%ymm6\n\t"                                                  \
+    "vpsrlq $1, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 2 */                  \
+    "vpxor %%ymm3, %%ymm1, %%ymm1\n\t"                                                  \
+    "vpsrlq $5, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 7 */                  \
+    "vpxor %%ymm3, %%ymm6, %%ymm6\n\t"                                                  \
+    "vpxor %%ymm6, %%ymm1, %%ymm1\n\t" /* the result is in ymm1 */                      \
+                                                                                        \
+    /* merge 128-bit halves */                                                          \
+    "vextracti128 $1, %%ymm1, %%xmm2\n\t"                                               \
+    "vpxor %%xmm2, %%xmm1, %%xmm1\n\t"
+
+static ASM_FUNC_ATTR_INLINE void
+gfmul_vpclmul_avx2_aggr16(const void *buf, const void *h_table,
+			  const u64 *h1_h2_h15_h16)
+{
+  /* Input:
+      Hx: YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13
+      bemask: YMM15
+      Hash: XMM1
+    Output:
+      Hash: XMM1
+    Inputs YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 and YMM15 stay
+    unmodified.
+  */
+  asm volatile (GFMUL_AGGR16_ASM_VPCMUL_AVX2(be_to_le)
+		:
+		: [buf] "r" (buf),
+		  [h_table] "r" (h_table),
+		  [h1_h2] "m" (h1_h2_h15_h16[0]),
+		  [h15_h16] "m" (h1_h2_h15_h16[4])
+		: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+gfmul_vpclmul_avx2_aggr16_le(const void *buf, const void *h_table,
+			     const u64 *h1_h2_h15_h16)
+{
+  /* Input:
+      Hx: YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13
+      bemask: YMM15
+      Hash: XMM1
+    Output:
+      Hash: XMM1
+    Inputs YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 and YMM15 stay
+    unmodified.
+  */
+  asm volatile (GFMUL_AGGR16_ASM_VPCMUL_AVX2(le_to_le)
+		:
+		: [buf] "r" (buf),
+		  [h_table] "r" (h_table),
+		  [h1_h2] "m" (h1_h2_h15_h16[0]),
+		  [h15_h16] "m" (h1_h2_h15_h16[4])
+		: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE
+void gfmul_pclmul_avx2(void)
+{
+  /* Input: YMM0 and YMM1, Output: YMM1. Input YMM0 stays unmodified.
+     Input must be converted to little-endian.
+   */
+  asm volatile (/* gfmul, ymm0 has operator a and ymm1 has operator b. */
+		"vpshufd $78, %%ymm0, %%ymm2\n\t"
+		"vpshufd $78, %%ymm1, %%ymm4\n\t"
+		"vpxor %%ymm0, %%ymm2, %%ymm2\n\t" /* ymm2 holds a0+a1 */
+		"vpxor %%ymm1, %%ymm4, %%ymm4\n\t" /* ymm4 holds b0+b1 */
+
+		"vpclmulqdq $0, %%ymm1, %%ymm0, %%ymm3\n\t"  /* ymm3 holds a0*b0 */
+		"vpclmulqdq $17, %%ymm0, %%ymm1, %%ymm1\n\t" /* ymm6 holds a1*b1 */
+		"vpclmulqdq $0, %%ymm2, %%ymm4, %%ymm4\n\t"  /* ymm4 holds (a0+a1)*(b0+b1) */
+
+		"vpxor %%ymm1, %%ymm3, %%ymm5\n\t" /* ymm5 holds a0*b0+a1*b1 */
+		"vpxor %%ymm5, %%ymm4, %%ymm4\n\t" /* ymm4 holds a0*b0+a1*b1+(a0+a1)*(b0+b1) */
+		"vpslldq $8, %%ymm4, %%ymm5\n\t"
+		"vpsrldq $8, %%ymm4, %%ymm4\n\t"
+		"vpxor %%ymm5, %%ymm3, %%ymm3\n\t"
+		"vpxor %%ymm4, %%ymm1, %%ymm1\n\t" /* <ymm1:ymm3> holds the result of the
+						      carry-less multiplication of ymm0
+						      by ymm1 */
+
+		/* first phase of the reduction */
+		"vpsllq $1, %%ymm3, %%ymm6\n\t"  /* packed right shifting << 63 */
+		"vpxor %%ymm3, %%ymm6, %%ymm6\n\t"
+		"vpsllq $57, %%ymm3, %%ymm5\n\t"  /* packed right shifting << 57 */
+		"vpsllq $62, %%ymm6, %%ymm6\n\t"  /* packed right shifting << 62 */
+		"vpxor %%ymm5, %%ymm6, %%ymm6\n\t" /* xor the shifted versions */
+		"vpshufd $0x6a, %%ymm6, %%ymm5\n\t"
+		"vpshufd $0xae, %%ymm6, %%ymm6\n\t"
+		"vpxor %%ymm5, %%ymm3, %%ymm3\n\t" /* first phase of the reduction complete */
+
+		/* second phase of the reduction */
+		"vpxor %%ymm3, %%ymm1, %%ymm1\n\t" /* xor the shifted versions */
+		"vpsrlq $1, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 1 */
+		"vpxor %%ymm3, %%ymm6, %%ymm6\n\t"
+		"vpsrlq $1, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 2 */
+		"vpxor %%ymm3, %%ymm1, %%ymm1\n\t"
+		"vpsrlq $5, %%ymm3, %%ymm3\n\t"    /* packed left shifting >> 7 */
+		"vpxor %%ymm3, %%ymm6, %%ymm6\n\t"
+		"vpxor %%ymm6, %%ymm1, %%ymm1\n\t" /* the result is in ymm1 */
+                ::: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+gcm_lsh_avx2(void *h, unsigned int hoffs)
+{
+  static const u64 pconst[4] __attribute__ ((aligned (32))) =
+    {
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000),
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000)
+    };
+
+  asm volatile ("vmovdqu %[h], %%ymm2\n\t"
+                "vpshufd $0xff, %%ymm2, %%ymm3\n\t"
+                "vpsrad $31, %%ymm3, %%ymm3\n\t"
+                "vpslldq $8, %%ymm2, %%ymm4\n\t"
+                "vpand %[pconst], %%ymm3, %%ymm3\n\t"
+                "vpaddq %%ymm2, %%ymm2, %%ymm2\n\t"
+                "vpsrlq $63, %%ymm4, %%ymm4\n\t"
+                "vpxor %%ymm3, %%ymm2, %%ymm2\n\t"
+                "vpxor %%ymm4, %%ymm2, %%ymm2\n\t"
+                "vmovdqu %%ymm2, %[h]\n\t"
+                : [h] "+m" (*((byte *)h + hoffs))
+                : [pconst] "m" (*pconst)
+                : "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+load_h1h2_to_ymm1(gcry_cipher_hd_t c)
+{
+  unsigned int key_pos =
+    offsetof(struct gcry_cipher_handle, u_mode.gcm.u_ghash_key.key);
+  unsigned int table_pos =
+    offsetof(struct gcry_cipher_handle, u_mode.gcm.gcm_table);
+
+  if (key_pos + 16 == table_pos)
+    {
+      /* Optimization: Table follows immediately after key. */
+      asm volatile ("vmovdqu %[key], %%ymm1\n\t"
+		    :
+		    : [key] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory");
+    }
+  else
+    {
+      asm volatile ("vmovdqa %[key], %%xmm1\n\t"
+		    "vinserti128 $1, 0*16(%[h_table]), %%ymm1, %%ymm1\n\t"
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table),
+		      [key] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory");
+    }
+}
+
+static ASM_FUNC_ATTR void
+ghash_setup_aggr8_avx2(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED;
+
+  asm volatile (/* load H? */
+		"vbroadcasti128 3*16(%[h_table]), %%ymm0\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+  /* load H <<< 1, H? <<< 1 */
+  load_h1h2_to_ymm1 (c);
+
+  gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H? */
+
+  asm volatile ("vmovdqu %%ymm1, 3*16(%[h_table])\n\t"
+		/* load H? <<< 1, H? <<< 1 */
+		"vmovdqu 1*16(%[h_table]), %%ymm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx2 (); /* H?<<<1?H? => H?, H?<<<1?H? => H? */
+
+  asm volatile ("vmovdqu %%ymm1, 6*16(%[h_table])\n\t" /* store H? for aggr16 setup */
+		"vmovdqu %%ymm1, 5*16(%[h_table])\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1, H? <<< 1 */
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1, H? <<< 1 */
+}
+
+static ASM_FUNC_ATTR void
+ghash_setup_aggr16_avx2(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR16_TABLE_INITIALIZED;
+
+  asm volatile (/* load H? */
+		"vbroadcasti128 7*16(%[h_table]), %%ymm0\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+  /* load H <<< 1, H? <<< 1 */
+  load_h1h2_to_ymm1 (c);
+
+  gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu %%ymm1, 7*16(%[h_table])\n\t"
+		/* load H? <<< 1, H? <<< 1 */
+		"vmovdqu 1*16(%[h_table]), %%ymm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu %%ymm1, 9*16(%[h_table])\n\t"
+		/* load H? <<< 1, H? <<< 1 */
+		"vmovdqu 3*16(%[h_table]), %%ymm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu %%ymm1, 11*16(%[h_table])\n\t"
+		/* load H? <<< 1, H? <<< 1 */
+		"vmovdqu 5*16(%[h_table]), %%ymm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu %%ymm1, 13*16(%[h_table])\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 7 * 16); /* H? <<< 1, H?? <<< 1 */
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 9 * 16); /* H?? <<< 1, H?? <<< 1 */
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 11 * 16); /* H?? <<< 1, H?? <<< 1 */
+  gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 13 * 16); /* H?? <<< 1, H?? <<< 1 */
+}
+
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+#endif /* __x86_64__ */
+
+static unsigned int ASM_FUNC_ATTR
+_gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
+			  size_t nblocks);
+
+static unsigned int ASM_FUNC_ATTR
+_gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
+			    size_t nblocks);
+
+static ASM_FUNC_ATTR_INLINE void
+gcm_lsh(void *h, unsigned int hoffs)
 {
   static const u64 pconst[2] __attribute__ ((aligned (16))) =
     { U64_C(0x0000000000000001), U64_C(0xc200000000000000) };
 
-  asm volatile ("movdqu (%[h]), %%xmm2\n\t"
+  asm volatile ("movdqu %[h], %%xmm2\n\t"
                 "pshufd $0xff, %%xmm2, %%xmm3\n\t"
                 "movdqa %%xmm2, %%xmm4\n\t"
                 "psrad $31, %%xmm3\n\t"
@@ -449,15 +851,14 @@ static ASM_FUNC_ATTR_INLINE void gcm_lsh(void *h, unsigned int hoffs)
                 "psrlq $63, %%xmm4\n\t"
                 "pxor %%xmm3, %%xmm2\n\t"
                 "pxor %%xmm4, %%xmm2\n\t"
-                "movdqu %%xmm2, (%[h])\n\t"
-                :
-                : [pconst] "m" (*pconst),
-                  [h] "r" ((byte *)h + hoffs)
+                "movdqu %%xmm2, %[h]\n\t"
+                : [h] "+m" (*((byte *)h + hoffs))
+                : [pconst] "m" (*pconst)
                 : "memory" );
 }
 
 void ASM_FUNC_ATTR
-_gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
+_gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c, unsigned int hw_features)
 {
   static const unsigned char be_mask[16] __attribute__ ((aligned (16))) =
     { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
@@ -480,6 +881,12 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
                 : "memory" );
 #endif
 
+  (void)hw_features;
+
+  c->u_mode.gcm.hw_impl_flags = 0;
+  c->u_mode.gcm.ghash_fn = _gcry_ghash_intel_pclmul;
+  c->u_mode.gcm.polyval_fn = _gcry_polyval_intel_pclmul;
+
   /* Swap endianness of hsub. */
   asm volatile ("movdqu (%[key]), %%xmm0\n\t"
                 "pshufb %[be_mask], %%xmm0\n\t"
@@ -489,7 +896,7 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
                   [be_mask] "m" (*be_mask)
                 : "memory");
 
-  gcm_lsh(c->u_mode.gcm.u_ghash_key.key, 0); /* H <<< 1 */
+  gcm_lsh (c->u_mode.gcm.u_ghash_key.key, 0); /* H <<< 1 */
 
   asm volatile ("movdqa %%xmm0, %%xmm1\n\t"
                 "movdqu (%[key]), %%xmm0\n\t" /* load H <<< 1 */
@@ -500,80 +907,81 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
   gfmul_pclmul (); /* H<<<1?H => H? */
 
   asm volatile ("movdqu %%xmm1, 0*16(%[h_table])\n\t"
-                "movdqa %%xmm1, %%xmm7\n\t"
                 :
                 : [h_table] "r" (c->u_mode.gcm.gcm_table)
                 : "memory");
 
-  gcm_lsh(c->u_mode.gcm.gcm_table, 0 * 16); /* H? <<< 1 */
-  gfmul_pclmul (); /* H<<<1?H? => H? */
+  gcm_lsh (c->u_mode.gcm.gcm_table, 0 * 16); /* H? <<< 1 */
 
-  asm volatile ("movdqa %%xmm7, %%xmm0\n\t"
-                "movdqu %%xmm1, 1*16(%[h_table])\n\t"
-                "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table)
-                : "memory");
+  if (0)
+    { }
+#ifdef GCM_USE_INTEL_VPCLMUL_AVX2
+  else if ((hw_features & HWF_INTEL_VAES_VPCLMUL)
+           && (hw_features & HWF_INTEL_AVX2))
+    {
+      c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX2;
 
-  gfmul_pclmul (); /* H?<<<1?H? => H? */
+      asm volatile (/* H? */
+		    "vinserti128 $1, %%xmm1, %%ymm1, %%ymm1\n\t"
+		    /* load H <<< 1, H? <<< 1 */
+		    "vinserti128 $1, 0*16(%[h_table]), %%ymm0, %%ymm0\n\t"
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory");
 
-  asm volatile ("movdqu %%xmm1, 2*16(%[h_table])\n\t"
-                "movdqa %%xmm1, %%xmm0\n\t"
-                "movdqu (%[key]), %%xmm1\n\t" /* load H <<< 1 */
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table),
-                  [key] "r" (c->u_mode.gcm.u_ghash_key.key)
-                : "memory");
+      gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H? */
 
-  gcm_lsh(c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1 */
-  gcm_lsh(c->u_mode.gcm.gcm_table, 2 * 16); /* H? <<< 1 */
+      asm volatile ("vmovdqu %%ymm1, 2*16(%[h_table])\n\t" /* store H? for aggr8 setup */
+		    "vmovdqu %%ymm1, 1*16(%[h_table])\n\t"
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory");
 
-#ifdef __x86_64__
-  gfmul_pclmul (); /* H<<<1?H? => H? */
+      gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1, H? <<< 1 */
 
-  asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t"
-                "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table)
-                : "memory");
-
-  gfmul_pclmul (); /* H?<<<1?H? => H? */
-
-  asm volatile ("movdqu %%xmm1, 4*16(%[h_table])\n\t"
-                "movdqu 1*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table)
-                : "memory");
+      asm volatile ("vzeroupper\n\t"
+		    ::: "memory" );
+    }
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+  else
+    {
+      asm volatile ("movdqa %%xmm1, %%xmm7\n\t"
+		    ::: "memory");
 
-  gfmul_pclmul (); /* H?<<<1?H? => H? */
+      gfmul_pclmul (); /* H<<<1?H? => H? */
 
-  asm volatile ("movdqu %%xmm1, 5*16(%[h_table])\n\t"
-                "movdqu 2*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table)
-                : "memory");
+      asm volatile ("movdqa %%xmm7, %%xmm0\n\t"
+		    "movdqu %%xmm1, 1*16(%[h_table])\n\t"
+		    "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory");
 
-  gfmul_pclmul (); /* H?<<<1?H? => H? */
+      gfmul_pclmul (); /* H?<<<1?H? => H? */
 
-  asm volatile ("movdqu %%xmm1, 6*16(%[h_table])\n\t"
-                :
-                : [h_table] "r" (c->u_mode.gcm.gcm_table)
-                : "memory");
+      asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t" /* store H? for aggr8 setup */
+		    "movdqu %%xmm1, 2*16(%[h_table])\n\t"
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory");
 
-  gcm_lsh(c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1 */
-  gcm_lsh(c->u_mode.gcm.gcm_table, 4 * 16); /* H? <<< 1 */
-  gcm_lsh(c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1 */
-  gcm_lsh(c->u_mode.gcm.gcm_table, 6 * 16); /* H? <<< 1 */
+      gcm_lsh (c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1 */
+      gcm_lsh (c->u_mode.gcm.gcm_table, 2 * 16); /* H? <<< 1 */
+    }
 
-#ifdef __WIN64__
   /* Clear/restore used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "movdqu 0*16(%0), %%xmm6\n\t"
+  asm volatile ("pxor %%xmm0, %%xmm0\n\t"
+		"pxor %%xmm1, %%xmm1\n\t"
+		"pxor %%xmm2, %%xmm2\n\t"
+		"pxor %%xmm3, %%xmm3\n\t"
+		"pxor %%xmm4, %%xmm4\n\t"
+		"pxor %%xmm5, %%xmm5\n\t"
+		"pxor %%xmm6, %%xmm6\n\t"
+		"pxor %%xmm7, %%xmm7\n\t"
+		::: "memory" );
+#ifdef __x86_64__
+#ifdef __WIN64__
+  asm volatile ("movdqu 0*16(%0), %%xmm6\n\t"
                 "movdqu 1*16(%0), %%xmm7\n\t"
                 "movdqu 2*16(%0), %%xmm8\n\t"
                 "movdqu 3*16(%0), %%xmm9\n\t"
@@ -587,16 +995,7 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
                 : "r" (win64tmp)
                 : "memory" );
 #else
-  /* Clear used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "pxor %%xmm6, %%xmm6\n\t"
-                "pxor %%xmm7, %%xmm7\n\t"
-                "pxor %%xmm8, %%xmm8\n\t"
+  asm volatile ("pxor %%xmm8, %%xmm8\n\t"
                 "pxor %%xmm9, %%xmm9\n\t"
                 "pxor %%xmm10, %%xmm10\n\t"
                 "pxor %%xmm11, %%xmm11\n\t"
@@ -605,14 +1004,67 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c)
                 "pxor %%xmm14, %%xmm14\n\t"
                 "pxor %%xmm15, %%xmm15\n\t"
                 ::: "memory" );
-#endif
-#endif
+#endif /* __WIN64__ */
+#endif /* __x86_64__ */
 }
 
 
+#ifdef __x86_64__
+static ASM_FUNC_ATTR void
+ghash_setup_aggr8(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED;
+
+  asm volatile ("movdqa 3*16(%[h_table]), %%xmm0\n\t" /* load H? */
+		"movdqu %[key], %%xmm1\n\t" /* load H <<< 1 */
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table),
+		  [key] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		: "memory");
+
+  gfmul_pclmul (); /* H<<<1?H? => H? */
+
+  asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t"
+		"movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul (); /* H?<<<1?H? => H? */
+
+  asm volatile ("movdqu %%xmm1, 4*16(%[h_table])\n\t"
+		"movdqu 1*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul (); /* H?<<<1?H? => H? */
+
+  asm volatile ("movdqu %%xmm1, 5*16(%[h_table])\n\t"
+		"movdqu 2*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul (); /* H?<<<1?H? => H? */
+
+  asm volatile ("movdqu %%xmm1, 6*16(%[h_table])\n\t"
+		"movdqu %%xmm1, 7*16(%[h_table])\n\t" /* store H? for aggr16 setup */
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1 */
+  gcm_lsh (c->u_mode.gcm.gcm_table, 4 * 16); /* H? <<< 1 */
+  gcm_lsh (c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1 */
+  gcm_lsh (c->u_mode.gcm.gcm_table, 6 * 16); /* H? <<< 1 */
+}
+#endif /* __x86_64__ */
+
+
 unsigned int ASM_FUNC_ATTR
 _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
-                          size_t nblocks)
+			  size_t nblocks)
 {
   static const unsigned char be_mask[16] __attribute__ ((aligned (16))) =
     { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
@@ -650,12 +1102,93 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
                   [be_mask] "m" (*be_mask)
                 : "memory" );
 
+#if defined(GCM_USE_INTEL_VPCLMUL_AVX2)
+  if (nblocks >= 16
+      && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2))
+    {
+      u64 h1_h2_h15_h16[4*2];
+
+      asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t"
+		    "vmovdqa %%xmm1, %%xmm8\n\t"
+		    ::: "memory" );
+
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+	{
+	  ghash_setup_aggr8_avx2 (c);
+	}
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+	{
+	  ghash_setup_aggr16_avx2 (c);
+	}
+
+      /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. */
+      asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t"
+		    "vmovdqu 0*16(%[h_table]), %%xmm7\n\t"
+		    "vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
+		    "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
+		    "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
+		    "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
+		    "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
+		    "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
+		    "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
+		    "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
+		    "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */
+		    "vmovdqu %%ymm0, %[h15_h16]\n\t"
+		    "vmovdqu %%ymm7, %[h1_h2]\n\t"
+		    : [h1_h2] "=m" (h1_h2_h15_h16[0]),
+		      [h15_h16] "=m" (h1_h2_h15_h16[4])
+		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+		      [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory" );
+
+      while (nblocks >= 16)
+	{
+	  gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table,
+				     h1_h2_h15_h16);
+
+	  buf += 16 * blocksize;
+	  nblocks -= 16;
+	}
+
+      /* Clear used x86-64/XMM registers. */
+      asm volatile("vmovdqu %%ymm15, %[h15_h16]\n\t"
+		   "vmovdqu %%ymm15, %[h1_h2]\n\t"
+		   "vzeroupper\n\t"
+#ifndef __WIN64__
+		   "pxor %%xmm8, %%xmm8\n\t"
+		   "pxor %%xmm9, %%xmm9\n\t"
+		   "pxor %%xmm10, %%xmm10\n\t"
+		   "pxor %%xmm11, %%xmm11\n\t"
+		   "pxor %%xmm12, %%xmm12\n\t"
+		   "pxor %%xmm13, %%xmm13\n\t"
+		   "pxor %%xmm14, %%xmm14\n\t"
+		   "pxor %%xmm15, %%xmm15\n\t"
+#endif
+		   "movdqa %[be_mask], %%xmm7\n\t"
+		   : [h1_h2] "=m" (h1_h2_h15_h16[0]),
+		     [h15_h16] "=m" (h1_h2_h15_h16[4])
+		   : [be_mask] "m" (*be_mask)
+		   : "memory" );
+    }
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+
 #ifdef __x86_64__
   if (nblocks >= 8)
     {
-      /* Preload H1. */
       asm volatile ("movdqa %%xmm7, %%xmm15\n\t"
-                    "movdqa %[h_1], %%xmm0\n\t"
+		    ::: "memory" );
+
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+	{
+	  asm volatile ("movdqa %%xmm1, %%xmm8\n\t"
+			::: "memory" );
+	  ghash_setup_aggr8 (c);
+	  asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
+			::: "memory" );
+	}
+
+      /* Preload H1. */
+      asm volatile ("movdqa %[h_1], %%xmm0\n\t"
                     :
                     : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key)
                     : "memory" );
@@ -667,6 +1200,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
           buf += 8 * blocksize;
           nblocks -= 8;
         }
+
 #ifndef __WIN64__
       /* Clear used x86-64/XMM registers. */
       asm volatile( "pxor %%xmm8, %%xmm8\n\t"
@@ -680,7 +1214,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
                     ::: "memory" );
 #endif
     }
-#endif
+#endif /* __x86_64__ */
 
   while (nblocks >= 4)
     {
@@ -761,7 +1295,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
 
 unsigned int ASM_FUNC_ATTR
 _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
-                            size_t nblocks)
+			    size_t nblocks)
 {
   static const unsigned char be_mask[16] __attribute__ ((aligned (16))) =
     { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
@@ -799,9 +1333,86 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
                   [be_mask] "m" (*be_mask)
                 : "memory" );
 
+#if defined(GCM_USE_INTEL_VPCLMUL_AVX2)
+  if (nblocks >= 16
+      && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2))
+    {
+      u64 h1_h2_h15_h16[4*2];
+
+      asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t"
+		    ::: "memory" );
+
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+	{
+	  ghash_setup_aggr8_avx2 (c);
+	}
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+	{
+	  ghash_setup_aggr16_avx2 (c);
+	}
+
+      /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. */
+      asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t"
+		    "vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
+		    "vmovdqu 0*16(%[h_table]), %%xmm7\n\t"
+		    "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
+		    "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
+		    "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
+		    "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
+		    "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
+		    "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
+		    "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
+		    "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */
+		    "vmovdqu %%ymm0, %[h15_h16]\n\t"
+		    "vmovdqu %%ymm7, %[h1_h2]\n\t"
+		    : [h1_h2] "=m" (h1_h2_h15_h16[0]),
+		      [h15_h16] "=m" (h1_h2_h15_h16[4])
+		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+		      [h_table] "r" (c->u_mode.gcm.gcm_table)
+		    : "memory" );
+
+      while (nblocks >= 16)
+	{
+	  gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table,
+					h1_h2_h15_h16);
+
+	  buf += 16 * blocksize;
+	  nblocks -= 16;
+	}
+
+      /* Clear used x86-64/XMM registers. */
+      asm volatile("vpxor %%xmm7, %%xmm7, %%xmm7\n\t"
+		   "vmovdqu %%ymm7, %[h15_h16]\n\t"
+		   "vmovdqu %%ymm7, %[h1_h2]\n\t"
+		   "vzeroupper\n\t"
+#ifndef __WIN64__
+		   "pxor %%xmm8, %%xmm8\n\t"
+		   "pxor %%xmm9, %%xmm9\n\t"
+		   "pxor %%xmm10, %%xmm10\n\t"
+		   "pxor %%xmm11, %%xmm11\n\t"
+		   "pxor %%xmm12, %%xmm12\n\t"
+		   "pxor %%xmm13, %%xmm13\n\t"
+		   "pxor %%xmm14, %%xmm14\n\t"
+#endif
+		   : [h1_h2] "=m" (h1_h2_h15_h16[0]),
+		     [h15_h16] "=m" (h1_h2_h15_h16[4])
+		   :
+		   : "memory" );
+    }
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+
 #ifdef __x86_64__
   if (nblocks >= 8)
     {
+      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+	{
+	  asm volatile ("movdqa %%xmm1, %%xmm8\n\t"
+			::: "memory" );
+	  ghash_setup_aggr8 (c);
+	  asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
+			::: "memory" );
+	}
+
       /* Preload H1. */
       asm volatile ("pxor %%xmm15, %%xmm15\n\t"
                     "movdqa %[h_1], %%xmm0\n\t"
diff --git a/cipher/cipher-gcm.c b/cipher/cipher-gcm.c
index 69ff0de6..683f07b0 100644
--- a/cipher/cipher-gcm.c
+++ b/cipher/cipher-gcm.c
@@ -39,15 +39,8 @@
 
 
 #ifdef GCM_USE_INTEL_PCLMUL
-extern void _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c);
-
-extern unsigned int _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result,
-                                              const byte *buf, size_t nblocks);
-
-extern unsigned int _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c,
-                                                byte *result,
-                                                const byte *buf,
-                                                size_t nblocks);
+extern void _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c,
+					    unsigned int hw_features);
 #endif
 
 #ifdef GCM_USE_ARM_PMULL
@@ -594,9 +587,7 @@ setupM (gcry_cipher_hd_t c)
 #ifdef GCM_USE_INTEL_PCLMUL
   else if (features & HWF_INTEL_PCLMUL)
     {
-      c->u_mode.gcm.ghash_fn = _gcry_ghash_intel_pclmul;
-      c->u_mode.gcm.polyval_fn = _gcry_polyval_intel_pclmul;
-      _gcry_ghash_setup_intel_pclmul (c);
+      _gcry_ghash_setup_intel_pclmul (c, features);
     }
 #endif
 #ifdef GCM_USE_ARM_PMULL
diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h
index c8a1097a..e31ac860 100644
--- a/cipher/cipher-internal.h
+++ b/cipher/cipher-internal.h
@@ -72,6 +72,14 @@
 # endif
 #endif /* GCM_USE_INTEL_PCLMUL */
 
+/* GCM_USE_INTEL_VPCLMUL_AVX2 indicates whether to compile GCM with Intel
+   VPCLMUL/AVX2 code.  */
+#undef GCM_USE_INTEL_VPCLMUL_AVX2
+#if defined(__x86_64__) && defined(GCM_USE_INTEL_PCLMUL) && \
+    defined(ENABLE_AVX2_SUPPORT) && defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL)
+# define GCM_USE_INTEL_VPCLMUL_AVX2 1
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+
 /* GCM_USE_ARM_PMULL indicates whether to compile GCM with ARMv8 PMULL code. */
 #undef GCM_USE_ARM_PMULL
 #if defined(ENABLE_ARM_CRYPTO_SUPPORT) && defined(GCM_USE_TABLES)
@@ -355,6 +363,9 @@ struct gcry_cipher_handle
 
       /* Key length used for GCM-SIV key generating key. */
       unsigned int siv_keylen;
+
+      /* Flags for accelerated implementations. */
+      unsigned int hw_impl_flags;
     } gcm;
 
     /* Mode specific storage for OCB mode. */
-- 
2.32.0


From jussi.kivilinna at iki.fi  Sun Mar  6 18:19:10 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun,  6 Mar 2022 19:19:10 +0200
Subject: [PATCH 3/3] ghash|polyval: add x86_64 VPCLMUL/AVX512 accelerated
 implementation
In-Reply-To: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
Message-ID: <20220306171910.1011180-3-jussi.kivilinna@iki.fi>

* cipher/cipher-gcm-intel-pclmul.c (GCM_INTEL_USE_VPCLMUL_AVX512)
(GCM_INTEL_AGGR32_TABLE_INITIALIZED): New.
(ghash_setup_aggr16_avx2): Store H16 for aggr32 setup.
[GCM_USE_INTEL_VPCLMUL_AVX512] (GFMUL_AGGR32_ASM_VPCMUL_AVX512)
(gfmul_vpclmul_avx512_aggr32, gfmul_vpclmul_avx512_aggr32_le)
(gfmul_pclmul_avx512, gcm_lsh_avx512, load_h1h4_to_zmm1)
(ghash_setup_aggr8_avx512, ghash_setup_aggr16_avx512)
(ghash_setup_aggr32_avx512, swap128b_perm): New.
(_gcry_ghash_setup_intel_pclmul) [GCM_USE_INTEL_VPCLMUL_AVX512]: Enable
AVX512 implementation based on HW features.
(_gcry_ghash_intel_pclmul, _gcry_polyval_intel_pclmul): Add
VPCLMUL/AVX512 code path; Small tweaks to VPCLMUL/AVX2 code path; Tweaks
on register clearing.
--

Patch adds VPCLMUL/AVX512 accelerated implementation for GHASH (GCM) and
POLYVAL (GCM-SIV).

Benchmark on Intel Core i3-1115G4:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.063 ns/B     15200 MiB/s     0.257 c/B      4090
   GCM-SIV auth |     0.061 ns/B     15704 MiB/s     0.248 c/B      4090

After (ghash ~41% faster, polyval ~34% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
       GCM auth |     0.044 ns/B     21614 MiB/s     0.181 c/B      4096?3
   GCM-SIV auth |     0.045 ns/B     21108 MiB/s     0.185 c/B      4097?3

AES128-GCM / AES128-GCM-SIV encryption:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        GCM enc |     0.084 ns/B     11306 MiB/s     0.346 c/B      4097?3
    GCM-SIV enc |     0.086 ns/B     11026 MiB/s     0.354 c/B      4096?3

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/cipher-gcm-intel-pclmul.c | 940 +++++++++++++++++++++++--------
 cipher/cipher-internal.h         |   8 +
 2 files changed, 728 insertions(+), 220 deletions(-)

diff --git a/cipher/cipher-gcm-intel-pclmul.c b/cipher/cipher-gcm-intel-pclmul.c
index b7324e8f..78a9e338 100644
--- a/cipher/cipher-gcm-intel-pclmul.c
+++ b/cipher/cipher-gcm-intel-pclmul.c
@@ -52,6 +52,8 @@
 #define GCM_INTEL_USE_VPCLMUL_AVX2         (1 << 0)
 #define GCM_INTEL_AGGR8_TABLE_INITIALIZED  (1 << 1)
 #define GCM_INTEL_AGGR16_TABLE_INITIALIZED (1 << 2)
+#define GCM_INTEL_USE_VPCLMUL_AVX512       (1 << 3)
+#define GCM_INTEL_AGGR32_TABLE_INITIALIZED (1 << 4)
 
 
 /*
@@ -813,7 +815,8 @@ ghash_setup_aggr16_avx2(gcry_cipher_hd_t c)
 
   gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */
 
-  asm volatile ("vmovdqu %%ymm1, 13*16(%[h_table])\n\t"
+  asm volatile ("vmovdqu %%ymm1, 14*16(%[h_table])\n\t" /* store H?? for aggr32 setup */
+                "vmovdqu %%ymm1, 13*16(%[h_table])\n\t"
 		:
 		: [h_table] "r" (c->u_mode.gcm.gcm_table)
 		: "memory");
@@ -825,6 +828,400 @@ ghash_setup_aggr16_avx2(gcry_cipher_hd_t c)
 }
 
 #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
+
+#ifdef GCM_USE_INTEL_VPCLMUL_AVX512
+
+#define GFMUL_AGGR32_ASM_VPCMUL_AVX512(be_to_le)                                          \
+    /* perform clmul and merge results... */                                              \
+    "vmovdqu64 0*16(%[buf]), %%zmm5\n\t"                                                  \
+    "vmovdqu64 4*16(%[buf]), %%zmm2\n\t"                                                  \
+    be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */                        \
+    be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */                        \
+    "vpxorq %%zmm5, %%zmm1, %%zmm1\n\t"                                                   \
+                                                                                          \
+    "vpshufd $78, %%zmm0, %%zmm5\n\t"                                                     \
+    "vpshufd $78, %%zmm1, %%zmm4\n\t"                                                     \
+    "vpxorq %%zmm0, %%zmm5, %%zmm5\n\t" /* zmm5 holds 29|?|32:a0+a1 */                    \
+    "vpxorq %%zmm1, %%zmm4, %%zmm4\n\t" /* zmm4 holds 29|?|32:b0+b1 */                    \
+    "vpclmulqdq $0, %%zmm1, %%zmm0, %%zmm3\n\t"  /* zmm3 holds 29|?|32:a0*b0 */           \
+    "vpclmulqdq $17, %%zmm0, %%zmm1, %%zmm1\n\t" /* zmm1 holds 29|?|32:a1*b1 */           \
+    "vpclmulqdq $0, %%zmm5, %%zmm4, %%zmm4\n\t"  /* zmm4 holds 29|?|32:(a0+a1)*(b0+b1) */ \
+                                                                                          \
+    "vpshufd $78, %%zmm13, %%zmm14\n\t"                                                   \
+    "vpshufd $78, %%zmm2, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm13, %%zmm14, %%zmm14\n\t" /* zmm14 holds 25|?|28:a0+a1 */                \
+    "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t"    /* zmm7 holds 25|?|28:b0+b1 */                 \
+    "vpclmulqdq $0, %%zmm2, %%zmm13, %%zmm17\n\t"  /* zmm17 holds 25|?|28:a0*b0 */        \
+    "vpclmulqdq $17, %%zmm13, %%zmm2, %%zmm18\n\t" /* zmm18 holds 25|?|28:a1*b1 */        \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t"  /* zmm19 holds 25|?|28:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vmovdqu64 8*16(%[buf]), %%zmm5\n\t"                                                  \
+    "vmovdqu64 12*16(%[buf]), %%zmm2\n\t"                                                 \
+    be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */                        \
+    be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */                        \
+                                                                                          \
+    "vpshufd $78, %%zmm12, %%zmm14\n\t"                                                   \
+    "vpshufd $78, %%zmm5, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm12, %%zmm14, %%zmm14\n\t" /* zmm14 holds 21|?|24:a0+a1 */                \
+    "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t"    /* zmm7 holds 21|?|24:b0+b1 */                 \
+    "vpclmulqdq $0, %%zmm5, %%zmm12, %%zmm6\n\t"  /* zmm6 holds 21|?|24:a0*b0 */          \
+    "vpclmulqdq $17, %%zmm12, %%zmm5, %%zmm5\n\t" /* zmm5 holds 21|?|24:a1*b1 */          \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t"  /* zmm7 holds 21|?|24:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 21+?|?|?+32:a0*b0 */    \
+    "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 21+?|?|?+32:a1*b1 */    \
+    "vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 21+?|?|?+32:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vpshufd $78, %%zmm11, %%zmm14\n\t"                                                   \
+    "vpshufd $78, %%zmm2, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm11, %%zmm14, %%zmm14\n\t" /* zmm14 holds 17|?|20:a0+a1 */                \
+    "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t"    /* zmm7 holds 17|?|20:b0+b1 */                 \
+    "vpclmulqdq $0, %%zmm2, %%zmm11, %%zmm17\n\t"  /* zmm17 holds 17|?|20:a0*b0 */        \
+    "vpclmulqdq $17, %%zmm11, %%zmm2, %%zmm18\n\t" /* zmm18 holds 17|?|20:a1*b1 */        \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t" /* zmm19 holds 17|?|20:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vmovdqu64 16*16(%[buf]), %%zmm5\n\t"                                                 \
+    "vmovdqu64 20*16(%[buf]), %%zmm2\n\t"                                                 \
+    be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */                        \
+    be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */                        \
+                                                                                          \
+    "vpshufd $78, %%zmm10, %%zmm14\n\t"                                                   \
+    "vpshufd $78, %%zmm5, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm10, %%zmm14, %%zmm14\n\t" /* zmm14 holds 13|?|16:a0+a1 */                \
+    "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t"    /* zmm7 holds 13|?|16:b0+b1 */                 \
+    "vpclmulqdq $0, %%zmm5, %%zmm10, %%zmm6\n\t"  /* zmm6 holds 13|?|16:a0*b0 */          \
+    "vpclmulqdq $17, %%zmm10, %%zmm5, %%zmm5\n\t" /* zmm5 holds 13|?|16:a1*b1 */          \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 13|?|16:(a0+a1)*(b0+b1) */ \
+                                                                                          \
+    "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 13+?|?|?+32:a0*b0 */    \
+    "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 13+?|?|?+32:a1*b1 */    \
+    "vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 13+?|?|?+32:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vpshufd $78, %%zmm9, %%zmm14\n\t"                                                    \
+    "vpshufd $78, %%zmm2, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm9, %%zmm14, %%zmm14\n\t" /* zmm14 holds 9|?|12:a0+a1 */                  \
+    "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t"   /* zmm7 holds 9|?|12:b0+b1 */                   \
+    "vpclmulqdq $0, %%zmm2, %%zmm9, %%zmm17\n\t"  /* zmm17 holds 9|?|12:a0*b0 */          \
+    "vpclmulqdq $17, %%zmm9, %%zmm2, %%zmm18\n\t" /* zmm18 holds 9|?|12:a1*b1 */          \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t" /* zmm19 holds 9|?|12:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vmovdqu64 24*16(%[buf]), %%zmm5\n\t"                                                 \
+    "vmovdqu64 28*16(%[buf]), %%zmm2\n\t"                                                 \
+    be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */                        \
+    be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */                        \
+                                                                                          \
+    "vpshufd $78, %%zmm8, %%zmm14\n\t"                                                    \
+    "vpshufd $78, %%zmm5, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm8, %%zmm14, %%zmm14\n\t" /* zmm14 holds 5|?|8:a0+a1 */                   \
+    "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t"   /* zmm7 holds 5|?|8:b0+b1 */                    \
+    "vpclmulqdq $0, %%zmm5, %%zmm8, %%zmm6\n\t"  /* zmm6 holds 5|?|8:a0*b0 */             \
+    "vpclmulqdq $17, %%zmm8, %%zmm5, %%zmm5\n\t" /* zmm5 holds 5|?|8:a1*b1 */             \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 5|?|8:(a0+a1)*(b0+b1) */   \
+                                                                                          \
+    "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 5+?|?|?+32:a0*b0 */     \
+    "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 5+?|?|?+32:a1*b1 */     \
+    "vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 5+?|?|?+32:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    "vpshufd $78, %%zmm16, %%zmm14\n\t"                                                   \
+    "vpshufd $78, %%zmm2, %%zmm7\n\t"                                                     \
+    "vpxorq %%zmm16, %%zmm14, %%zmm14\n\t" /* zmm14 holds 1|?|4:a0+a1 */                  \
+    "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t"   /* zmm7 holds 1|2:b0+b1 */                      \
+    "vpclmulqdq $0, %%zmm2, %%zmm16, %%zmm6\n\t"  /* zmm6 holds 1|2:a0*b0 */              \
+    "vpclmulqdq $17, %%zmm16, %%zmm2, %%zmm2\n\t" /* zmm2 holds 1|2:a1*b1 */              \
+    "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 1|2:(a0+a1)*(b0+b1) */     \
+                                                                                          \
+    "vpxorq %%zmm6, %%zmm3, %%zmm3\n\t" /* zmm3 holds 1+3+?+15|2+4+?+16:a0*b0 */          \
+    "vpxorq %%zmm2, %%zmm1, %%zmm1\n\t" /* zmm1 holds 1+3+?+15|2+4+?+16:a1*b1 */          \
+    "vpxorq %%zmm7, %%zmm4, %%zmm4\n\t" /* zmm4 holds 1+3+?+15|2+4+?+16:(a0+a1)*(b0+b1) */\
+                                                                                          \
+    /* aggregated reduction... */                                                         \
+    "vpternlogq $0x96, %%zmm1, %%zmm3, %%zmm4\n\t" /* zmm4 holds                          \
+                                                    * a0*b0+a1*b1+(a0+a1)*(b0+b1) */      \
+    "vpslldq $8, %%zmm4, %%zmm5\n\t"                                                      \
+    "vpsrldq $8, %%zmm4, %%zmm4\n\t"                                                      \
+    "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t"                                                   \
+    "vpxorq %%zmm4, %%zmm1, %%zmm1\n\t" /* <zmm1:zmm3> holds the result of the            \
+                                          carry-less multiplication of zmm0               \
+                                          by zmm1 */                                      \
+                                                                                          \
+    /* first phase of the reduction */                                                    \
+    "vpsllq $1, %%zmm3, %%zmm6\n\t"  /* packed right shifting << 63 */                    \
+    "vpxorq %%zmm3, %%zmm6, %%zmm6\n\t"                                                   \
+    "vpsllq $57, %%zmm3, %%zmm5\n\t"  /* packed right shifting << 57 */                   \
+    "vpsllq $62, %%zmm6, %%zmm6\n\t"  /* packed right shifting << 62 */                   \
+    "vpxorq %%zmm5, %%zmm6, %%zmm6\n\t" /* xor the shifted versions */                    \
+    "vpshufd $0x6a, %%zmm6, %%zmm5\n\t"                                                   \
+    "vpshufd $0xae, %%zmm6, %%zmm6\n\t"                                                   \
+    "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" /* first phase of the reduction complete */       \
+                                                                                          \
+    /* second phase of the reduction */                                                   \
+    "vpsrlq $1, %%zmm3, %%zmm2\n\t"    /* packed left shifting >> 1 */                    \
+    "vpsrlq $2, %%zmm3, %%zmm4\n\t"    /* packed left shifting >> 2 */                    \
+    "vpsrlq $7, %%zmm3, %%zmm5\n\t"    /* packed left shifting >> 7 */                    \
+    "vpternlogq $0x96, %%zmm3, %%zmm2, %%zmm1\n\t" /* xor the shifted versions */         \
+    "vpternlogq $0x96, %%zmm4, %%zmm5, %%zmm6\n\t"                                        \
+    "vpxorq %%zmm6, %%zmm1, %%zmm1\n\t" /* the result is in zmm1 */                       \
+                                                                                          \
+    /* merge 256-bit halves */                                                            \
+    "vextracti64x4 $1, %%zmm1, %%ymm2\n\t"                                                \
+    "vpxor %%ymm2, %%ymm1, %%ymm1\n\t"                                                    \
+    /* merge 128-bit halves */                                                            \
+    "vextracti128 $1, %%ymm1, %%xmm2\n\t"                                                 \
+    "vpxor %%xmm2, %%xmm1, %%xmm1\n\t"
+
+static ASM_FUNC_ATTR_INLINE void
+gfmul_vpclmul_avx512_aggr32(const void *buf, const void *h_table)
+{
+  /* Input:
+      Hx: ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16
+      bemask: ZMM15
+      Hash: XMM1
+    Output:
+      Hash: XMM1
+    Inputs ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 and YMM15 stay
+    unmodified.
+  */
+  asm volatile (GFMUL_AGGR32_ASM_VPCMUL_AVX512(be_to_le)
+		:
+		: [buf] "r" (buf),
+		  [h_table] "r" (h_table)
+		: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+gfmul_vpclmul_avx512_aggr32_le(const void *buf, const void *h_table)
+{
+  /* Input:
+      Hx: ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16
+      bemask: ZMM15
+      Hash: XMM1
+    Output:
+      Hash: XMM1
+    Inputs ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 and YMM15 stay
+    unmodified.
+  */
+  asm volatile (GFMUL_AGGR32_ASM_VPCMUL_AVX512(le_to_le)
+		:
+		: [buf] "r" (buf),
+		  [h_table] "r" (h_table)
+		: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE
+void gfmul_pclmul_avx512(void)
+{
+  /* Input: ZMM0 and ZMM1, Output: ZMM1. Input ZMM0 stays unmodified.
+     Input must be converted to little-endian.
+   */
+  asm volatile (/* gfmul, zmm0 has operator a and zmm1 has operator b. */
+		"vpshufd $78, %%zmm0, %%zmm2\n\t"
+		"vpshufd $78, %%zmm1, %%zmm4\n\t"
+		"vpxorq %%zmm0, %%zmm2, %%zmm2\n\t" /* zmm2 holds a0+a1 */
+		"vpxorq %%zmm1, %%zmm4, %%zmm4\n\t" /* zmm4 holds b0+b1 */
+
+		"vpclmulqdq $0, %%zmm1, %%zmm0, %%zmm3\n\t"  /* zmm3 holds a0*b0 */
+		"vpclmulqdq $17, %%zmm0, %%zmm1, %%zmm1\n\t" /* zmm6 holds a1*b1 */
+		"vpclmulqdq $0, %%zmm2, %%zmm4, %%zmm4\n\t"  /* zmm4 holds (a0+a1)*(b0+b1) */
+
+		"vpternlogq $0x96, %%zmm1, %%zmm3, %%zmm4\n\t" /* zmm4 holds
+								* a0*b0+a1*b1+(a0+a1)*(b0+b1) */
+		"vpslldq $8, %%zmm4, %%zmm5\n\t"
+		"vpsrldq $8, %%zmm4, %%zmm4\n\t"
+		"vpxorq %%zmm5, %%zmm3, %%zmm3\n\t"
+		"vpxorq %%zmm4, %%zmm1, %%zmm1\n\t" /* <zmm1:zmm3> holds the result of the
+						      carry-less multiplication of zmm0
+						      by zmm1 */
+
+		/* first phase of the reduction */
+		"vpsllq $1, %%zmm3, %%zmm6\n\t"  /* packed right shifting << 63 */
+		"vpxorq %%zmm3, %%zmm6, %%zmm6\n\t"
+		"vpsllq $57, %%zmm3, %%zmm5\n\t"  /* packed right shifting << 57 */
+		"vpsllq $62, %%zmm6, %%zmm6\n\t"  /* packed right shifting << 62 */
+		"vpxorq %%zmm5, %%zmm6, %%zmm6\n\t" /* xor the shifted versions */
+		"vpshufd $0x6a, %%zmm6, %%zmm5\n\t"
+		"vpshufd $0xae, %%zmm6, %%zmm6\n\t"
+		"vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" /* first phase of the reduction complete */
+
+		/* second phase of the reduction */
+		"vpsrlq $1, %%zmm3, %%zmm2\n\t"    /* packed left shifting >> 1 */
+		"vpsrlq $2, %%zmm3, %%zmm4\n\t"    /* packed left shifting >> 2 */
+		"vpsrlq $7, %%zmm3, %%zmm5\n\t"    /* packed left shifting >> 7 */
+		"vpternlogq $0x96, %%zmm3, %%zmm2, %%zmm1\n\t" /* xor the shifted versions */
+		"vpternlogq $0x96, %%zmm4, %%zmm5, %%zmm6\n\t"
+		"vpxorq %%zmm6, %%zmm1, %%zmm1\n\t" /* the result is in zmm1 */
+                ::: "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+gcm_lsh_avx512(void *h, unsigned int hoffs)
+{
+  static const u64 pconst[8] __attribute__ ((aligned (64))) =
+    {
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000),
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000),
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000),
+      U64_C(0x0000000000000001), U64_C(0xc200000000000000)
+    };
+
+  asm volatile ("vmovdqu64 %[h], %%zmm2\n\t"
+                "vpshufd $0xff, %%zmm2, %%zmm3\n\t"
+                "vpsrad $31, %%zmm3, %%zmm3\n\t"
+                "vpslldq $8, %%zmm2, %%zmm4\n\t"
+                "vpandq %[pconst], %%zmm3, %%zmm3\n\t"
+                "vpaddq %%zmm2, %%zmm2, %%zmm2\n\t"
+                "vpsrlq $63, %%zmm4, %%zmm4\n\t"
+                "vpternlogq $0x96, %%zmm4, %%zmm3, %%zmm2\n\t"
+                "vmovdqu64 %%zmm2, %[h]\n\t"
+                : [h] "+m" (*((byte *)h + hoffs))
+                : [pconst] "m" (*pconst)
+                : "memory" );
+}
+
+static ASM_FUNC_ATTR_INLINE void
+load_h1h4_to_zmm1(gcry_cipher_hd_t c)
+{
+  unsigned int key_pos =
+    offsetof(struct gcry_cipher_handle, u_mode.gcm.u_ghash_key.key);
+  unsigned int table_pos =
+    offsetof(struct gcry_cipher_handle, u_mode.gcm.gcm_table);
+
+  if (key_pos + 16 == table_pos)
+    {
+      /* Optimization: Table follows immediately after key. */
+      asm volatile ("vmovdqu64 %[key], %%zmm1\n\t"
+		    :
+		    : [key] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory");
+    }
+  else
+    {
+      asm volatile ("vmovdqu64 -1*16(%[h_table]), %%zmm1\n\t"
+		    "vinserti64x2 $0, %[key], %%zmm1, %%zmm1\n\t"
+		    :
+		    : [h_table] "r" (c->u_mode.gcm.gcm_table),
+		      [key] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory");
+    }
+}
+
+static ASM_FUNC_ATTR void
+ghash_setup_aggr8_avx512(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED;
+
+  asm volatile (/* load H? */
+		"vbroadcasti64x2 3*16(%[h_table]), %%zmm0\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+  /* load H <<< 1, H? <<< 1, H? <<< 1, H? <<< 1 */
+  load_h1h4_to_zmm1 (c);
+
+  gfmul_pclmul_avx512 (); /* H<<<1?H? => H?, ?, H?<<<1?H? => H? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 4*16(%[h_table])\n\t" /* store H? for aggr16 setup */
+		"vmovdqu64 %%zmm1, 3*16(%[h_table])\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1, ?, H? <<< 1 */
+}
+
+static ASM_FUNC_ATTR void
+ghash_setup_aggr16_avx512(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR16_TABLE_INITIALIZED;
+
+  asm volatile (/* load H? */
+		"vbroadcasti64x2 7*16(%[h_table]), %%zmm0\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+  /* load H <<< 1, H? <<< 1, H? <<< 1, H? <<< 1 */
+  load_h1h4_to_zmm1 (c);
+
+  gfmul_pclmul_avx512 (); /* H<<<1?H? => H?, ? , H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 7*16(%[h_table])\n\t"
+		/* load H? <<< 1, ?, H? <<< 1 */
+		"vmovdqu64 3*16(%[h_table]), %%zmm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx512 (); /* H?<<<1?H? => H??, ? , H?<<<1?H? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 12*16(%[h_table])\n\t" /* store H?? for aggr32 setup */
+                "vmovdqu64 %%zmm1, 11*16(%[h_table])\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 7 * 16); /* H? <<< 1, ?, H?? <<< 1 */
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 11 * 16); /* H?? <<< 1, ?, H?? <<< 1 */
+}
+
+static ASM_FUNC_ATTR void
+ghash_setup_aggr32_avx512(gcry_cipher_hd_t c)
+{
+  c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR32_TABLE_INITIALIZED;
+
+  asm volatile (/* load H?? */
+		"vbroadcasti64x2 15*16(%[h_table]), %%zmm0\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+  /* load H <<< 1, H? <<< 1, H? <<< 1, H? <<< 1 */
+  load_h1h4_to_zmm1 (c);
+
+  gfmul_pclmul_avx512 (); /* H<<<1?H?? => H??, ?, H?<<<1?H?? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 15*16(%[h_table])\n\t"
+		/* load H? <<< 1, ?, H? <<< 1 */
+		"vmovdqu64 3*16(%[h_table]), %%zmm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx512 (); /* H?<<<1?H?? => H??, ?, H?<<<1?H?? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 19*16(%[h_table])\n\t"
+		/* load H? <<< 1, ?, H?? <<< 1 */
+		"vmovdqu64 7*16(%[h_table]), %%zmm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx512 (); /* H?<<<1?H?? => H??, ?, H??<<<1?H?? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 23*16(%[h_table])\n\t"
+		/* load H?? <<< 1, ?, H?? <<< 1 */
+		"vmovdqu64 11*16(%[h_table]), %%zmm1\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gfmul_pclmul_avx512 (); /* H??<<<1?H?? => H??, ?, H??<<<1?H?? => H?? */
+
+  asm volatile ("vmovdqu64 %%zmm1, 27*16(%[h_table])\n\t"
+		:
+		: [h_table] "r" (c->u_mode.gcm.gcm_table)
+		: "memory");
+
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 15 * 16);
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 19 * 16);
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 23 * 16);
+  gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 27 * 16);
+}
+
+static const u64 swap128b_perm[8] __attribute__ ((aligned (64))) =
+  {
+    /* For swapping order of 128bit lanes in 512bit register using vpermq. */
+    6, 7, 4, 5, 2, 3, 0, 1
+  };
+
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */
 #endif /* __x86_64__ */
 
 static unsigned int ASM_FUNC_ATTR
@@ -921,6 +1318,11 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c, unsigned int hw_features)
     {
       c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX2;
 
+#ifdef GCM_USE_INTEL_VPCLMUL_AVX512
+      if (hw_features & HWF_INTEL_AVX512)
+	c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX512;
+#endif
+
       asm volatile (/* H? */
 		    "vinserti128 $1, %%xmm1, %%ymm1, %%ymm1\n\t"
 		    /* load H <<< 1, H? <<< 1 */
@@ -1104,71 +1506,126 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
 
 #if defined(GCM_USE_INTEL_VPCLMUL_AVX2)
   if (nblocks >= 16
-      && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2))
+      && ((c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)
+          || (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512)))
     {
-      u64 h1_h2_h15_h16[4*2];
-
-      asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t"
-		    "vmovdqa %%xmm1, %%xmm8\n\t"
-		    ::: "memory" );
-
-      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+#if defined(GCM_USE_INTEL_VPCLMUL_AVX512)
+      if (nblocks >= 32
+	  && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512))
 	{
-	  ghash_setup_aggr8_avx2 (c);
+	  asm volatile ("vpopcntb %%zmm7, %%zmm15\n\t" /* spec stop for old AVX512 CPUs */
+			"vshufi64x2 $0, %%zmm7, %%zmm7, %%zmm15\n\t"
+			"vmovdqa %%xmm1, %%xmm8\n\t"
+			"vmovdqu64 %[swapperm], %%zmm14\n\t"
+			:
+			: [swapperm] "m" (swap128b_perm),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR32_TABLE_INITIALIZED))
+	    {
+	      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+		{
+		  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+		    ghash_setup_aggr8_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+
+		  ghash_setup_aggr16_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+		}
+
+	      ghash_setup_aggr32_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+	    }
+
+	  /* Preload H1-H32. */
+	  load_h1h4_to_zmm1 (c);
+	  asm volatile ("vpermq %%zmm1, %%zmm14, %%zmm16\n\t" /* H1|H2|H3|H4 */
+			"vmovdqa %%xmm8, %%xmm1\n\t"
+			"vpermq 27*16(%[h_table]), %%zmm14, %%zmm0\n\t"  /* H28|H29|H31|H32 */
+			"vpermq 23*16(%[h_table]), %%zmm14, %%zmm13\n\t" /* H25|H26|H27|H28 */
+			"vpermq 19*16(%[h_table]), %%zmm14, %%zmm12\n\t" /* H21|H22|H23|H24 */
+			"vpermq 15*16(%[h_table]), %%zmm14, %%zmm11\n\t" /* H17|H18|H19|H20 */
+			"vpermq 11*16(%[h_table]), %%zmm14, %%zmm10\n\t" /* H13|H14|H15|H16 */
+			"vpermq 7*16(%[h_table]), %%zmm14, %%zmm9\n\t"   /* H9|H10|H11|H12 */
+			"vpermq 3*16(%[h_table]), %%zmm14, %%zmm8\n\t"   /* H4|H6|H7|H8 */
+			:
+			: [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  while (nblocks >= 32)
+	    {
+	      gfmul_vpclmul_avx512_aggr32 (buf, c->u_mode.gcm.gcm_table);
+
+	      buf += 32 * blocksize;
+	      nblocks -= 32;
+	    }
+
+	  asm volatile ("vmovdqa %%xmm15, %%xmm7\n\t"
+			"vpxorq %%zmm16, %%zmm16, %%zmm16\n\t"
+			"vpxorq %%zmm17, %%zmm17, %%zmm17\n\t"
+			"vpxorq %%zmm18, %%zmm18, %%zmm18\n\t"
+			"vpxorq %%zmm19, %%zmm19, %%zmm19\n\t"
+			:
+			:
+			: "memory" );
 	}
-      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */
+
+      if (nblocks >= 16)
 	{
-	  ghash_setup_aggr16_avx2 (c);
-	}
+	  u64 h1_h2_h15_h16[4*2];
 
-      /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. */
-      asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t"
-		    "vmovdqu 0*16(%[h_table]), %%xmm7\n\t"
-		    "vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
-		    "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
-		    "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
-		    "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
-		    "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
-		    "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
-		    "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
-		    "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
-		    "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */
-		    "vmovdqu %%ymm0, %[h15_h16]\n\t"
-		    "vmovdqu %%ymm7, %[h1_h2]\n\t"
-		    : [h1_h2] "=m" (h1_h2_h15_h16[0]),
-		      [h15_h16] "=m" (h1_h2_h15_h16[4])
-		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
-		      [h_table] "r" (c->u_mode.gcm.gcm_table)
-		    : "memory" );
+	  asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t"
+			"vmovdqa %%xmm1, %%xmm8\n\t"
+			::: "memory" );
 
-      while (nblocks >= 16)
-	{
-	  gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table,
-				     h1_h2_h15_h16);
+	  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+	    {
+	      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+		ghash_setup_aggr8_avx2 (c); /* Clobbers registers XMM0-XMM7. */
+
+	      ghash_setup_aggr16_avx2 (c); /* Clobbers registers XMM0-XMM7. */
+	    }
+
+	  /* Preload H1-H16. */
+	  load_h1h2_to_ymm1 (c);
+	  asm volatile ("vperm2i128 $0x23, %%ymm1, %%ymm1, %%ymm7\n\t" /* H1|H2 */
+			"vmovdqa %%xmm8, %%xmm1\n\t"
+			"vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
+			"vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
+			"vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
+			"vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
+			"vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
+			"vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
+			"vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
+			"vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
+			"vmovdqu %%ymm0, %[h15_h16]\n\t"
+			"vmovdqu %%ymm7, %[h1_h2]\n\t"
+			: [h1_h2] "=m" (h1_h2_h15_h16[0]),
+			  [h15_h16] "=m" (h1_h2_h15_h16[4])
+			: [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  while (nblocks >= 16)
+	    {
+	      gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table,
+					h1_h2_h15_h16);
 
-	  buf += 16 * blocksize;
-	  nblocks -= 16;
+	      buf += 16 * blocksize;
+	      nblocks -= 16;
+	    }
+
+	  asm volatile ("vmovdqu %%ymm15, %[h15_h16]\n\t"
+			"vmovdqu %%ymm15, %[h1_h2]\n\t"
+			"vmovdqa %%xmm15, %%xmm7\n\t"
+			:
+			  [h1_h2] "=m" (h1_h2_h15_h16[0]),
+			  [h15_h16] "=m" (h1_h2_h15_h16[4])
+			:
+			: "memory" );
 	}
 
-      /* Clear used x86-64/XMM registers. */
-      asm volatile("vmovdqu %%ymm15, %[h15_h16]\n\t"
-		   "vmovdqu %%ymm15, %[h1_h2]\n\t"
-		   "vzeroupper\n\t"
-#ifndef __WIN64__
-		   "pxor %%xmm8, %%xmm8\n\t"
-		   "pxor %%xmm9, %%xmm9\n\t"
-		   "pxor %%xmm10, %%xmm10\n\t"
-		   "pxor %%xmm11, %%xmm11\n\t"
-		   "pxor %%xmm12, %%xmm12\n\t"
-		   "pxor %%xmm13, %%xmm13\n\t"
-		   "pxor %%xmm14, %%xmm14\n\t"
-		   "pxor %%xmm15, %%xmm15\n\t"
-#endif
-		   "movdqa %[be_mask], %%xmm7\n\t"
-		   : [h1_h2] "=m" (h1_h2_h15_h16[0]),
-		     [h15_h16] "=m" (h1_h2_h15_h16[4])
-		   : [be_mask] "m" (*be_mask)
-		   : "memory" );
+      asm volatile ("vzeroupper\n\t" ::: "memory" );
     }
 #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
 
@@ -1176,22 +1633,18 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
   if (nblocks >= 8)
     {
       asm volatile ("movdqa %%xmm7, %%xmm15\n\t"
+		    "movdqa %%xmm1, %%xmm8\n\t"
 		    ::: "memory" );
 
       if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
-	{
-	  asm volatile ("movdqa %%xmm1, %%xmm8\n\t"
-			::: "memory" );
-	  ghash_setup_aggr8 (c);
-	  asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
-			::: "memory" );
-	}
+	ghash_setup_aggr8 (c); /* Clobbers registers XMM0-XMM7. */
 
       /* Preload H1. */
-      asm volatile ("movdqa %[h_1], %%xmm0\n\t"
-                    :
-                    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key)
-                    : "memory" );
+      asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
+		    "movdqa %[h_1], %%xmm0\n\t"
+		    :
+		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory" );
 
       while (nblocks >= 8)
         {
@@ -1200,19 +1653,6 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
           buf += 8 * blocksize;
           nblocks -= 8;
         }
-
-#ifndef __WIN64__
-      /* Clear used x86-64/XMM registers. */
-      asm volatile( "pxor %%xmm8, %%xmm8\n\t"
-                    "pxor %%xmm9, %%xmm9\n\t"
-                    "pxor %%xmm10, %%xmm10\n\t"
-                    "pxor %%xmm11, %%xmm11\n\t"
-                    "pxor %%xmm12, %%xmm12\n\t"
-                    "pxor %%xmm13, %%xmm13\n\t"
-                    "pxor %%xmm14, %%xmm14\n\t"
-                    "pxor %%xmm15, %%xmm15\n\t"
-                    ::: "memory" );
-#endif
     }
 #endif /* __x86_64__ */
 
@@ -1256,39 +1696,49 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
                 : [be_mask] "m" (*be_mask)
                 : "memory" );
 
-#if defined(__x86_64__) && defined(__WIN64__)
   /* Clear/restore used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "movdqu 0*16(%0), %%xmm6\n\t"
-                "movdqu 1*16(%0), %%xmm7\n\t"
-                "movdqu 2*16(%0), %%xmm8\n\t"
-                "movdqu 3*16(%0), %%xmm9\n\t"
-                "movdqu 4*16(%0), %%xmm10\n\t"
-                "movdqu 5*16(%0), %%xmm11\n\t"
-                "movdqu 6*16(%0), %%xmm12\n\t"
-                "movdqu 7*16(%0), %%xmm13\n\t"
-                "movdqu 8*16(%0), %%xmm14\n\t"
-                "movdqu 9*16(%0), %%xmm15\n\t"
-                :
-                : "r" (win64tmp)
-                : "memory" );
+  asm volatile ("pxor %%xmm0, %%xmm0\n\t"
+		"pxor %%xmm1, %%xmm1\n\t"
+		"pxor %%xmm2, %%xmm2\n\t"
+		"pxor %%xmm3, %%xmm3\n\t"
+		"pxor %%xmm4, %%xmm4\n\t"
+		"pxor %%xmm5, %%xmm5\n\t"
+		"pxor %%xmm6, %%xmm6\n\t"
+		"pxor %%xmm7, %%xmm7\n\t"
+		:
+		:
+		: "memory" );
+#ifdef __x86_64__
+#ifdef __WIN64__
+  asm volatile ("movdqu 0*16(%0), %%xmm6\n\t"
+		"movdqu 1*16(%0), %%xmm7\n\t"
+		"movdqu 2*16(%0), %%xmm8\n\t"
+		"movdqu 3*16(%0), %%xmm9\n\t"
+		"movdqu 4*16(%0), %%xmm10\n\t"
+		"movdqu 5*16(%0), %%xmm11\n\t"
+		"movdqu 6*16(%0), %%xmm12\n\t"
+		"movdqu 7*16(%0), %%xmm13\n\t"
+		"movdqu 8*16(%0), %%xmm14\n\t"
+		"movdqu 9*16(%0), %%xmm15\n\t"
+		:
+		: "r" (win64tmp)
+		: "memory" );
 #else
   /* Clear used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "pxor %%xmm6, %%xmm6\n\t"
-                "pxor %%xmm7, %%xmm7\n\t"
-                ::: "memory" );
-#endif
+  asm volatile (
+		"pxor %%xmm8, %%xmm8\n\t"
+		"pxor %%xmm9, %%xmm9\n\t"
+		"pxor %%xmm10, %%xmm10\n\t"
+		"pxor %%xmm11, %%xmm11\n\t"
+		"pxor %%xmm12, %%xmm12\n\t"
+		"pxor %%xmm13, %%xmm13\n\t"
+		"pxor %%xmm14, %%xmm14\n\t"
+		"pxor %%xmm15, %%xmm15\n\t"
+		:
+		:
+		: "memory" );
+#endif /* __WIN64__ */
+#endif /* __x86_64__ */
 
   return 0;
 }
@@ -1335,90 +1785,142 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
 
 #if defined(GCM_USE_INTEL_VPCLMUL_AVX2)
   if (nblocks >= 16
-      && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2))
+      && ((c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)
+          || (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512)))
     {
-      u64 h1_h2_h15_h16[4*2];
-
-      asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t"
-		    ::: "memory" );
-
-      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+#if defined(GCM_USE_INTEL_VPCLMUL_AVX512)
+      if (nblocks >= 32
+	  && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512))
 	{
-	  ghash_setup_aggr8_avx2 (c);
+	  asm volatile ("vpopcntb %%zmm7, %%zmm15\n\t" /* spec stop for old AVX512 CPUs */
+			"vmovdqa %%xmm1, %%xmm8\n\t"
+			"vmovdqu64 %[swapperm], %%zmm14\n\t"
+			:
+			: [swapperm] "m" (swap128b_perm),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR32_TABLE_INITIALIZED))
+	    {
+	      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+		{
+		  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+		    ghash_setup_aggr8_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+
+		  ghash_setup_aggr16_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+		}
+
+	      ghash_setup_aggr32_avx512 (c); /* Clobbers registers XMM0-XMM7. */
+	    }
+
+	  /* Preload H1-H32. */
+	  load_h1h4_to_zmm1 (c);
+	  asm volatile ("vpermq %%zmm1, %%zmm14, %%zmm16\n\t" /* H1|H2|H3|H4 */
+			"vmovdqa %%xmm8, %%xmm1\n\t"
+			"vpermq 27*16(%[h_table]), %%zmm14, %%zmm0\n\t"  /* H28|H29|H31|H32 */
+			"vpermq 23*16(%[h_table]), %%zmm14, %%zmm13\n\t" /* H25|H26|H27|H28 */
+			"vpermq 19*16(%[h_table]), %%zmm14, %%zmm12\n\t" /* H21|H22|H23|H24 */
+			"vpermq 15*16(%[h_table]), %%zmm14, %%zmm11\n\t" /* H17|H18|H19|H20 */
+			"vpermq 11*16(%[h_table]), %%zmm14, %%zmm10\n\t" /* H13|H14|H15|H16 */
+			"vpermq 7*16(%[h_table]), %%zmm14, %%zmm9\n\t"   /* H9|H10|H11|H12 */
+			"vpermq 3*16(%[h_table]), %%zmm14, %%zmm8\n\t"   /* H4|H6|H7|H8 */
+			:
+			: [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  while (nblocks >= 32)
+	    {
+	      gfmul_vpclmul_avx512_aggr32_le (buf, c->u_mode.gcm.gcm_table);
+
+	      buf += 32 * blocksize;
+	      nblocks -= 32;
+	    }
+
+	  asm volatile ("vpxor %%xmm7, %%xmm7, %%xmm7\n\t"
+			"vpxorq %%zmm16, %%zmm16, %%zmm16\n\t"
+			"vpxorq %%zmm17, %%zmm17, %%zmm17\n\t"
+			"vpxorq %%zmm18, %%zmm18, %%zmm18\n\t"
+			"vpxorq %%zmm19, %%zmm19, %%zmm19\n\t"
+			:
+			:
+			: "memory" );
 	}
-      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
-	{
-	  ghash_setup_aggr16_avx2 (c);
-	}
-
-      /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. */
-      asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t"
-		    "vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
-		    "vmovdqu 0*16(%[h_table]), %%xmm7\n\t"
-		    "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
-		    "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
-		    "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
-		    "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
-		    "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
-		    "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
-		    "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
-		    "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */
-		    "vmovdqu %%ymm0, %[h15_h16]\n\t"
-		    "vmovdqu %%ymm7, %[h1_h2]\n\t"
-		    : [h1_h2] "=m" (h1_h2_h15_h16[0]),
-		      [h15_h16] "=m" (h1_h2_h15_h16[4])
-		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
-		      [h_table] "r" (c->u_mode.gcm.gcm_table)
-		    : "memory" );
+#endif
 
-      while (nblocks >= 16)
+      if (nblocks >= 16)
 	{
-	  gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table,
-					h1_h2_h15_h16);
+	  u64 h1_h2_h15_h16[4*2];
+
+	  asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t"
+			::: "memory" );
 
-	  buf += 16 * blocksize;
-	  nblocks -= 16;
+	  if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED))
+	    {
+	      if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
+		ghash_setup_aggr8_avx2 (c); /* Clobbers registers XMM0-XMM7. */
+
+	      ghash_setup_aggr16_avx2 (c); /* Clobbers registers XMM0-XMM7. */
+	    }
+
+	  /* Preload H1-H16. */
+	  load_h1h2_to_ymm1 (c);
+	  asm volatile ("vperm2i128 $0x23, %%ymm1, %%ymm1, %%ymm7\n\t" /* H1|H2 */
+			"vmovdqa %%xmm8, %%xmm1\n\t"
+			"vpxor %%xmm8, %%xmm8, %%xmm8\n\t"
+			"vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t"  /* H15|H16 */
+			"vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */
+			"vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t"  /* H11|H12 */
+			"vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t"  /* H9|H10 */
+			"vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t"  /* H7|H8 */
+			"vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t"   /* H5|H6 */
+			"vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t"   /* H3|H4 */
+			"vmovdqu %%ymm0, %[h15_h16]\n\t"
+			"vmovdqu %%ymm7, %[h1_h2]\n\t"
+			: [h1_h2] "=m" (h1_h2_h15_h16[0]),
+			  [h15_h16] "=m" (h1_h2_h15_h16[4])
+			: [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key),
+			  [h_table] "r" (c->u_mode.gcm.gcm_table)
+			: "memory" );
+
+	  while (nblocks >= 16)
+	    {
+	      gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table,
+					    h1_h2_h15_h16);
+
+	      buf += 16 * blocksize;
+	      nblocks -= 16;
+	    }
+
+	  asm volatile ("vpxor %%xmm7, %%xmm7, %%xmm7\n\t"
+			"vmovdqu %%ymm7, %[h15_h16]\n\t"
+			"vmovdqu %%ymm7, %[h1_h2]\n\t"
+			: [h1_h2] "=m" (h1_h2_h15_h16[0]),
+			  [h15_h16] "=m" (h1_h2_h15_h16[4])
+			:
+			: "memory" );
 	}
 
-      /* Clear used x86-64/XMM registers. */
-      asm volatile("vpxor %%xmm7, %%xmm7, %%xmm7\n\t"
-		   "vmovdqu %%ymm7, %[h15_h16]\n\t"
-		   "vmovdqu %%ymm7, %[h1_h2]\n\t"
-		   "vzeroupper\n\t"
-#ifndef __WIN64__
-		   "pxor %%xmm8, %%xmm8\n\t"
-		   "pxor %%xmm9, %%xmm9\n\t"
-		   "pxor %%xmm10, %%xmm10\n\t"
-		   "pxor %%xmm11, %%xmm11\n\t"
-		   "pxor %%xmm12, %%xmm12\n\t"
-		   "pxor %%xmm13, %%xmm13\n\t"
-		   "pxor %%xmm14, %%xmm14\n\t"
-#endif
-		   : [h1_h2] "=m" (h1_h2_h15_h16[0]),
-		     [h15_h16] "=m" (h1_h2_h15_h16[4])
-		   :
-		   : "memory" );
+      asm volatile ("vzeroupper\n\t" ::: "memory" );
     }
 #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
 
 #ifdef __x86_64__
   if (nblocks >= 8)
     {
+      asm volatile ("movdqa %%xmm1, %%xmm8\n\t"
+		    ::: "memory" );
+
       if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED))
-	{
-	  asm volatile ("movdqa %%xmm1, %%xmm8\n\t"
-			::: "memory" );
-	  ghash_setup_aggr8 (c);
-	  asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
-			::: "memory" );
-	}
+	ghash_setup_aggr8 (c); /* Clobbers registers XMM0-XMM7. */
 
       /* Preload H1. */
-      asm volatile ("pxor %%xmm15, %%xmm15\n\t"
-                    "movdqa %[h_1], %%xmm0\n\t"
-                    :
-                    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key)
-                    : "memory" );
+      asm volatile ("movdqa %%xmm8, %%xmm1\n\t"
+		    "pxor %%xmm15, %%xmm15\n\t"
+		    "movdqa %[h_1], %%xmm0\n\t"
+		    :
+		    : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key)
+		    : "memory" );
 
       while (nblocks >= 8)
         {
@@ -1427,18 +1929,6 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
           buf += 8 * blocksize;
           nblocks -= 8;
         }
-#ifndef __WIN64__
-      /* Clear used x86-64/XMM registers. */
-      asm volatile( "pxor %%xmm8, %%xmm8\n\t"
-                    "pxor %%xmm9, %%xmm9\n\t"
-                    "pxor %%xmm10, %%xmm10\n\t"
-                    "pxor %%xmm11, %%xmm11\n\t"
-                    "pxor %%xmm12, %%xmm12\n\t"
-                    "pxor %%xmm13, %%xmm13\n\t"
-                    "pxor %%xmm14, %%xmm14\n\t"
-                    "pxor %%xmm15, %%xmm15\n\t"
-                    ::: "memory" );
-#endif
     }
 #endif
 
@@ -1481,39 +1971,49 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf,
                 : [be_mask] "m" (*be_mask)
                 : "memory" );
 
-#if defined(__x86_64__) && defined(__WIN64__)
   /* Clear/restore used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "movdqu 0*16(%0), %%xmm6\n\t"
-                "movdqu 1*16(%0), %%xmm7\n\t"
-                "movdqu 2*16(%0), %%xmm8\n\t"
-                "movdqu 3*16(%0), %%xmm9\n\t"
-                "movdqu 4*16(%0), %%xmm10\n\t"
-                "movdqu 5*16(%0), %%xmm11\n\t"
-                "movdqu 6*16(%0), %%xmm12\n\t"
-                "movdqu 7*16(%0), %%xmm13\n\t"
-                "movdqu 8*16(%0), %%xmm14\n\t"
-                "movdqu 9*16(%0), %%xmm15\n\t"
-                :
-                : "r" (win64tmp)
-                : "memory" );
+  asm volatile ("pxor %%xmm0, %%xmm0\n\t"
+		"pxor %%xmm1, %%xmm1\n\t"
+		"pxor %%xmm2, %%xmm2\n\t"
+		"pxor %%xmm3, %%xmm3\n\t"
+		"pxor %%xmm4, %%xmm4\n\t"
+		"pxor %%xmm5, %%xmm5\n\t"
+		"pxor %%xmm6, %%xmm6\n\t"
+		"pxor %%xmm7, %%xmm7\n\t"
+		:
+		:
+		: "memory" );
+#ifdef __x86_64__
+#ifdef __WIN64__
+  asm volatile ("movdqu 0*16(%0), %%xmm6\n\t"
+		"movdqu 1*16(%0), %%xmm7\n\t"
+		"movdqu 2*16(%0), %%xmm8\n\t"
+		"movdqu 3*16(%0), %%xmm9\n\t"
+		"movdqu 4*16(%0), %%xmm10\n\t"
+		"movdqu 5*16(%0), %%xmm11\n\t"
+		"movdqu 6*16(%0), %%xmm12\n\t"
+		"movdqu 7*16(%0), %%xmm13\n\t"
+		"movdqu 8*16(%0), %%xmm14\n\t"
+		"movdqu 9*16(%0), %%xmm15\n\t"
+		:
+		: "r" (win64tmp)
+		: "memory" );
 #else
   /* Clear used registers. */
-  asm volatile( "pxor %%xmm0, %%xmm0\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                "pxor %%xmm4, %%xmm4\n\t"
-                "pxor %%xmm5, %%xmm5\n\t"
-                "pxor %%xmm6, %%xmm6\n\t"
-                "pxor %%xmm7, %%xmm7\n\t"
-                ::: "memory" );
-#endif
+  asm volatile (
+		"pxor %%xmm8, %%xmm8\n\t"
+		"pxor %%xmm9, %%xmm9\n\t"
+		"pxor %%xmm10, %%xmm10\n\t"
+		"pxor %%xmm11, %%xmm11\n\t"
+		"pxor %%xmm12, %%xmm12\n\t"
+		"pxor %%xmm13, %%xmm13\n\t"
+		"pxor %%xmm14, %%xmm14\n\t"
+		"pxor %%xmm15, %%xmm15\n\t"
+		:
+		:
+		: "memory" );
+#endif /* __WIN64__ */
+#endif /* __x86_64__ */
 
   return 0;
 }
diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h
index e31ac860..e1ff0437 100644
--- a/cipher/cipher-internal.h
+++ b/cipher/cipher-internal.h
@@ -80,6 +80,14 @@
 # define GCM_USE_INTEL_VPCLMUL_AVX2 1
 #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */
 
+/* GCM_USE_INTEL_VPCLMUL_AVX512 indicates whether to compile GCM with Intel
+   VPCLMUL/AVX512 code.  */
+#undef GCM_USE_INTEL_VPCLMUL_AVX512
+#if defined(__x86_64__) && defined(GCM_USE_INTEL_VPCLMUL_AVX2) && \
+    defined(ENABLE_AVX512_SUPPORT) && defined(HAVE_GCC_INLINE_ASM_AVX512)
+# define GCM_USE_INTEL_VPCLMUL_AVX512 1
+#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */
+
 /* GCM_USE_ARM_PMULL indicates whether to compile GCM with ARMv8 PMULL code. */
 #undef GCM_USE_ARM_PMULL
 #if defined(ENABLE_ARM_CRYPTO_SUPPORT) && defined(GCM_USE_TABLES)
-- 
2.32.0


From jjelen at redhat.com  Mon Mar  7 11:38:27 2022
From: jjelen at redhat.com (Jakub Jelen)
Date: Mon, 7 Mar 2022 11:38:27 +0100
Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512"
In-Reply-To: <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com>
References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
 <20220306171910.1011180-2-jussi.kivilinna@iki.fi>
 <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com>
Message-ID: <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com>

On 3/7/22 11:04, Jakub Jelen wrote:
> On 3/6/22 18:19, Jussi Kivilinna wrote:
>> [...]
>> diff --git a/src/hwfeatures.c b/src/hwfeatures.c
>> index 7060d995..8e92cbdd 100644
>> --- a/src/hwfeatures.c
>> +++ b/src/hwfeatures.c
>> @@ -62,6 +62,7 @@ static struct
>> ????? { HWF_INTEL_RDTSC,???????? "intel-rdtsc" },
>> ????? { HWF_INTEL_SHAEXT,??????? "intel-shaext" },
>> ????? { HWF_INTEL_VAES_VPCLMUL,? "intel-vaes-vpclmul" },
>> +??? { HWF_INTEL_AVX512,??????? "intel-avx512" },
>> ? #elif defined(HAVE_CPU_ARCH_ARM)
>> ????? { HWF_ARM_NEON,??????????? "arm-neon" },
>> ????? { HWF_ARM_AES,???????????? "arm-aes" },
> 
> Hi,
> can you make sure to update also the doc/gcrypt.texi with the new 
> hwfeatures value so we do not have to add it retrospectively?
> 
> https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/blob/master/doc/gcrypt.texi#L574 
> 
> 
> Thanks,

Trying again with the gcrypt-devel(at)gnupg.org (without the lists. 
which still shows as a reply address when I use Thunderbird (probably 
because of the List-Post header still pointing to this email).

Regards,
-- 
Jakub Jelen
Crypto Team, Security Engineering
Red Hat, Inc.


From jussi.kivilinna at iki.fi  Mon Mar  7 17:59:39 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon, 7 Mar 2022 18:59:39 +0200
Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512"
In-Reply-To: <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com>
References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi>
 <20220306171910.1011180-2-jussi.kivilinna@iki.fi>
 <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com>
 <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com>
Message-ID: <a67805c3-ec43-1ee4-46b6-e0b2f70568e1@iki.fi>

Hello,

On 7.3.2022 12.38, Jakub Jelen via Gcrypt-devel wrote:
> On 3/7/22 11:04, Jakub Jelen wrote:
>> On 3/6/22 18:19, Jussi Kivilinna wrote:
>>> [...]
>>> diff --git a/src/hwfeatures.c b/src/hwfeatures.c
>>> index 7060d995..8e92cbdd 100644
>>> --- a/src/hwfeatures.c
>>> +++ b/src/hwfeatures.c
>>> @@ -62,6 +62,7 @@ static struct
>>> ????? { HWF_INTEL_RDTSC,???????? "intel-rdtsc" },
>>> ????? { HWF_INTEL_SHAEXT,??????? "intel-shaext" },
>>> ????? { HWF_INTEL_VAES_VPCLMUL,? "intel-vaes-vpclmul" },
>>> +??? { HWF_INTEL_AVX512,??????? "intel-avx512" },
>>> ? #elif defined(HAVE_CPU_ARCH_ARM)
>>> ????? { HWF_ARM_NEON,??????????? "arm-neon" },
>>> ????? { HWF_ARM_AES,???????????? "arm-aes" },
>>
>> Hi,
>> can you make sure to update also the doc/gcrypt.texi with the new hwfeatures value so we do not have to add it retrospectively?
>>
>> https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/blob/master/doc/gcrypt.texi#L574
>>

Thanks for reminder. I'll add intel-avx512 to the HW features list.

-Jussi


From jussi.kivilinna at iki.fi  Thu Mar 10 21:26:49 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Thu, 10 Mar 2022 22:26:49 +0200
Subject: [PATCH] SHA512: Add AVX512 implementation
Message-ID: <20220310202649.1673477-1-jussi.kivilinna@iki.fi>

* LICENSES: Add 'cipher/sha512-avx512-amd64.S'.
* cipher/Makefile.am: Add 'sha512-avx512-amd64.S'.
* cipher/sha512-avx512-amd64.S: New.
* cipher/sha512.c (USE_AVX512): New.
(do_sha512_transform_amd64_ssse3, do_sha512_transform_amd64_avx)
(do_sha512_transform_amd64_avx2): Add ASM_EXTRA_STACK to return value
only if assembly routine returned non-zero value.
[USE_AVX512] (_gcry_sha512_transform_amd64_avx512)
(do_sha512_transform_amd64_avx512): New.
(sha512_init_common) [USE_AVX512]: Use AVX512 implementation if HW
feature supported.
---

Benchmark on Intel Core i3-1115G4 (tigerlake):

 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SHA512         |      1.51 ns/B     631.6 MiB/s      6.17 c/B      4089

 After (~29% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SHA512         |      1.16 ns/B     819.0 MiB/s      4.76 c/B      4090

GnuPG-bug-id: T4460
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 LICENSES                     |   1 +
 cipher/Makefile.am           |   2 +-
 cipher/sha512-avx512-amd64.S | 461 +++++++++++++++++++++++++++++++++++
 cipher/sha512.c              |  52 +++-
 configure.ac                 |   1 +
 5 files changed, 509 insertions(+), 8 deletions(-)
 create mode 100644 cipher/sha512-avx512-amd64.S

diff --git a/LICENSES b/LICENSES
index 8be7fb24..94499501 100644
--- a/LICENSES
+++ b/LICENSES
@@ -19,6 +19,7 @@ with any binary distributions derived from the GNU C Library.
   - cipher/sha512-avx2-bmi2-amd64.S
   - cipher/sha512-ssse3-amd64.S
   - cipher/sha512-ssse3-i386.c
+  - cipher/sha512-avx512-amd64.S
 
 #+begin_quote
   Copyright (c) 2012, Intel Corporation
diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 3339c463..6eec9bd8 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -127,7 +127,7 @@ EXTRA_libcipher_la_SOURCES = \
 	sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \
 	sha256-intel-shaext.c sha256-ppc.c \
 	sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S \
-	sha512-avx2-bmi2-amd64.S \
+	sha512-avx2-bmi2-amd64.S sha512-avx512-amd64.S \
 	sha512-armv7-neon.S sha512-arm.S \
 	sha512-ppc.c sha512-ssse3-i386.c \
 	sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S \
diff --git a/cipher/sha512-avx512-amd64.S b/cipher/sha512-avx512-amd64.S
new file mode 100644
index 00000000..317f3e5c
--- /dev/null
+++ b/cipher/sha512-avx512-amd64.S
@@ -0,0 +1,461 @@
+/* sha512-avx512-amd64.c - amd64/AVX512 implementation of SHA-512 transform
+ * Copyright (C) 2022 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+/*
+ * Based on implementation from file "sha512-avx2-bmi2-amd64.S":
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; Copyright (c) 2012, Intel Corporation
+;
+; All rights reserved.
+;
+; Redistribution and use in source and binary forms, with or without
+; modification, are permitted provided that the following conditions are
+; met:
+;
+; * Redistributions of source code must retain the above copyright
+;   notice, this list of conditions and the following disclaimer.
+;
+; * Redistributions in binary form must reproduce the above copyright
+;   notice, this list of conditions and the following disclaimer in the
+;   documentation and/or other materials provided with the
+;   distribution.
+;
+; * Neither the name of the Intel Corporation nor the names of its
+;   contributors may be used to endorse or promote products derived from
+;   this software without specific prior written permission.
+;
+;
+; THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION "AS IS" AND ANY
+; EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+; IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+; PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR
+; CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+; EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+; PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+; PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+; LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+; NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; This code schedules 1 blocks at a time, with 4 lanes per block
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+*/
+
+#ifdef __x86_64
+#include <config.h>
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+    defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \
+    defined(HAVE_GCC_INLINE_ASM_AVX512) && \
+    defined(USE_SHA512)
+
+#include "asm-common-amd64.h"
+
+.intel_syntax noprefix
+
+.text
+
+/* Virtual Registers */
+#define Y_0 ymm0
+#define Y_1 ymm1
+#define Y_2 ymm2
+#define Y_3 ymm3
+
+#define YTMP0 ymm4
+#define YTMP1 ymm5
+#define YTMP2 ymm6
+#define YTMP3 ymm7
+#define YTMP4 ymm8
+#define XFER YTMP0
+
+#define BYTE_FLIP_MASK ymm9
+#define PERM_VPALIGNR_8 ymm10
+
+#define MASK_DC_00 k1
+
+#define INP rdi /* 1st arg */
+#define CTX rsi /* 2nd arg */
+#define NUM_BLKS rdx /* 3rd arg */
+#define SRND r8d
+#define RSP_SAVE r9
+
+#define TBL rcx
+
+#define a xmm11
+#define b xmm12
+#define c xmm13
+#define d xmm14
+#define e xmm15
+#define f xmm16
+#define g xmm17
+#define h xmm18
+
+#define y0 xmm19
+#define y1 xmm20
+#define y2 xmm21
+#define y3 xmm22
+
+/* Local variables (stack frame) */
+#define frame_XFER         0
+#define frame_XFER_size    (4*4*8)
+#define frame_size         (frame_XFER + frame_XFER_size)
+
+#define clear_reg(x) vpxorq x,x,x
+
+/* addm [mem], reg */
+/* Add reg to mem using reg-mem add and store */
+#define addm(p1, p2) \
+	vmovq	y0, p1; \
+	vpaddq	p2, p2, y0; \
+	vmovq	p1, p2;
+
+/* COPY_YMM_AND_BSWAP ymm, [mem], byte_flip_mask */
+/* Load ymm with mem and byte swap each dword */
+#define COPY_YMM_AND_BSWAP(p1, p2, p3) \
+	vmovdqu p1, p2; \
+	vpshufb p1, p1, p3
+
+/* %macro MY_VPALIGNR	YDST, YSRC1, YSRC2, RVAL */
+/* YDST = {YSRC1, YSRC2} >> RVAL*8 */
+#define MY_VPALIGNR(YDST_SRC1, YSRC2, RVAL) \
+	vpermt2q YDST_SRC1, PERM_VPALIGNR_##RVAL, YSRC2;
+
+#define ONE_ROUND_PART1(XFERIN, a, b, c, d, e, f, g, h) \
+	/* h += Sum1 (e) + Ch (e, f, g) + (k[t] + w[0]); \
+	 * d += h; \
+	 * h += Sum0 (a) + Maj (a, b, c); \
+	 * \
+	 * Ch(x, y, z) => ((x & y) + (~x & z)) \
+	 * Maj(x, y, z) => ((x & y) + (z & (x ^ y))) \
+	 */ \
+	\
+	vmovq y3, [XFERIN]; \
+	vmovdqa64 y2, e; \
+	vpaddq h, h, y3; \
+	vprorq y0, e, 41; \
+	vpternlogq y2, f, g, 0xca; /* Ch (e, f, g) */ \
+	vprorq y1, e, 18; \
+	vprorq y3, e, 14; \
+	vpaddq h, h, y2; \
+	vpternlogq y0, y1, y3, 0x96; /* Sum1 (e) */ \
+	vpaddq h, h, y0; /* h += Sum1 (e) + Ch (e, f, g) + (k[t] + w[0]) */ \
+	vpaddq d, d, h; /* d += h */
+
+#define ONE_ROUND_PART2(a, b, c, d, e, f, g, h) \
+	vmovdqa64 y1, a; \
+	vprorq y0, a, 39; \
+	vpternlogq y1, b, c, 0xe8; /* Maj (a, b, c) */ \
+	vprorq y2, a, 34; \
+	vprorq y3, a, 28; \
+	vpternlogq y0, y2, y3, 0x96; /* Sum0 (a) */ \
+	vpaddq h, h, y1; \
+	vpaddq h, h, y0; /* h += Sum0 (a) + Maj (a, b, c) */
+
+#define FOUR_ROUNDS_AND_SCHED(X, Y_0, Y_1, Y_2, Y_3, a, b, c, d, e, f, g, h) \
+	/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 0 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \
+		vmovdqa		YTMP0, Y_3; \
+		vmovdqa		YTMP1, Y_1; \
+		/* Extract w[t-7] */; \
+		vpermt2q	YTMP0, PERM_VPALIGNR_8, Y_2	/* YTMP0 = W[-7] */; \
+		/* Calculate w[t-16] + w[t-7] */; \
+		vpaddq		YTMP0, YTMP0, Y_0		/* YTMP0 = W[-7] + W[-16] */; \
+		/* Extract w[t-15] */; \
+		vpermt2q	YTMP1, PERM_VPALIGNR_8, Y_0	/* YTMP1 = W[-15] */; \
+	ONE_ROUND_PART1(rsp+frame_XFER+0*8+X*32, a, b, c, d, e, f, g, h); \
+		\
+		/* Calculate sigma0 */; \
+		\
+		/* Calculate w[t-15] ror 1 */; \
+		vprorq		YTMP3, YTMP1, 1;		/* YTMP3 = W[-15] ror 1 */; \
+		/* Calculate w[t-15] shr 7 */; \
+		vpsrlq		YTMP4, YTMP1, 7			/* YTMP4 = W[-15] >> 7 */; \
+	\
+	ONE_ROUND_PART2(a, b, c, d, e, f, g, h); \
+	\
+	/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 1 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \
+		/* Calculate w[t-15] ror 8 */; \
+		vprorq		YTMP1, YTMP1, 8			/* YTMP1 = W[-15] ror 8 */; \
+		/* XOR the three components */; \
+		vpternlogq	YTMP1, YTMP3, YTMP4, 0x96	/* YTMP1 = s0 = W[-15] ror 1 ^ W[-15] >> 7 ^ W[-15] ror 8 */; \
+		\
+		/* Add three components, w[t-16], w[t-7] and sigma0 */; \
+		vpaddq		YTMP0, YTMP0, YTMP1		/* YTMP0 = W[-16] + W[-7] + s0 */; \
+	ONE_ROUND_PART1(rsp+frame_XFER+1*8+X*32, h, a, b, c, d, e, f, g); \
+		/* Move to appropriate lanes for calculating w[16] and w[17] */; \
+		vshufi64x2	Y_0, YTMP0, YTMP0, 0x0		/* Y_0 = W[-16] + W[-7] + s0 {BABA} */; \
+		\
+		/* Calculate w[16] and w[17] in both 128 bit lanes */; \
+		\
+		/* Calculate sigma1 for w[16] and w[17] on both 128 bit lanes */; \
+		vshufi64x2	YTMP2, Y_3, Y_3, 0b11		/* YTMP2 = W[-2] {BABA} */; \
+		vpsrlq		YTMP4, YTMP2, 6			/* YTMP4 = W[-2] >> 6 {BABA} */; \
+	\
+	ONE_ROUND_PART2(h, a, b, c, d, e, f, g); \
+	\
+	/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 2 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \
+		vprorq		YTMP3, YTMP2, 19		/* YTMP3 = W[-2] ror 19 {BABA} */; \
+		vprorq		YTMP1, YTMP2, 61		/* YTMP3 = W[-2] ror 61 {BABA} */; \
+		vpternlogq	YTMP4, YTMP3, YTMP1, 0x96	/* YTMP4 = s1 = (W[-2] ror 19) ^ (W[-2] ror 61) ^ (W[-2] >> 6) {BABA} */; \
+		\
+	ONE_ROUND_PART1(rsp+frame_XFER+2*8+X*32, g, h, a, b, c, d, e, f); \
+		/* Add sigma1 to the other compunents to get w[16] and w[17] */; \
+		vpaddq		Y_0, Y_0, YTMP4			/* Y_0 = {W[1], W[0], W[1], W[0]} */; \
+		\
+		/* Calculate sigma1 for w[18] and w[19] for upper 128 bit lane */; \
+		vpsrlq		YTMP4, Y_0, 6			/* YTMP4 = W[-2] >> 6 {DC--} */; \
+	\
+	ONE_ROUND_PART2(g, h, a, b, c, d, e, f); \
+	\
+	/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 3 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \
+		vprorq		YTMP3, Y_0, 19			/* YTMP3 = W[-2] ror 19 {DC--} */; \
+		vprorq		YTMP1, Y_0, 61			/* YTMP1 = W[-2] ror 61 {DC--} */; \
+		vpternlogq	YTMP4, YTMP3, YTMP1, 0x96	/* YTMP4 = s1 = (W[-2] ror 19) ^ (W[-2] ror 61) ^ (W[-2] >> 6) {DC--} */; \
+		\
+	ONE_ROUND_PART1(rsp+frame_XFER+3*8+X*32, f, g, h, a, b, c, d, e); \
+		/* Add the sigma0 + w[t-7] + w[t-16] for w[18] and w[19] to newly calculated sigma1 to get w[18] and w[19] */; \
+		/* Form w[19, w[18], w17], w[16] */; \
+		vpaddq		Y_0{MASK_DC_00}, YTMP0, YTMP4	/* YTMP2 = {W[3], W[2], W[1], W[0]} */; \
+		\
+		vpaddq		XFER, Y_0, [TBL + (4+X)*32]; \
+		vmovdqa		[rsp + frame_XFER + X*32], XFER; \
+	ONE_ROUND_PART2(f, g, h, a, b, c, d, e)
+
+#define ONE_ROUND(XFERIN, a, b, c, d, e, f, g, h) \
+	ONE_ROUND_PART1(XFERIN, a, b, c, d, e, f, g, h); \
+	ONE_ROUND_PART2(a, b, c, d, e, f, g, h)
+
+#define DO_4ROUNDS(X, a, b, c, d, e, f, g, h) \
+	ONE_ROUND(rsp+frame_XFER+0*8+X*32, a, b, c, d, e, f, g, h); \
+	ONE_ROUND(rsp+frame_XFER+1*8+X*32, h, a, b, c, d, e, f, g); \
+	ONE_ROUND(rsp+frame_XFER+2*8+X*32, g, h, a, b, c, d, e, f); \
+	ONE_ROUND(rsp+frame_XFER+3*8+X*32, f, g, h, a, b, c, d, e)
+
+/*
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; void sha512_avx512(const void* M, void* D, uint64_t L);
+; Purpose: Updates the SHA512 digest stored at D with the message stored in M.
+; The size of the message pointed to by M must be an integer multiple of SHA512
+;   message blocks.
+; L is the message length in SHA512 blocks
+*/
+.globl _gcry_sha512_transform_amd64_avx512
+ELF(.type _gcry_sha512_transform_amd64_avx512, at function;)
+.align 16
+_gcry_sha512_transform_amd64_avx512:
+	CFI_STARTPROC()
+	xor	eax, eax
+
+	cmp	rdx, 0
+	je	.Lnowork
+
+	/* Setup mask register for DC:BA merging. */
+	mov	eax, 0b1100
+	kmovd	MASK_DC_00, eax
+
+	/* Allocate Stack Space */
+	mov	RSP_SAVE, rsp
+	CFI_DEF_CFA_REGISTER(RSP_SAVE);
+	sub	rsp, frame_size
+	and	rsp, ~(0x40 - 1)
+
+	/*; load initial digest */
+	vmovq	a,[8*0 + CTX]
+	vmovq	b,[8*1 + CTX]
+	vmovq	c,[8*2 + CTX]
+	vmovq	d,[8*3 + CTX]
+	vmovq	e,[8*4 + CTX]
+	vmovq	f,[8*5 + CTX]
+	vmovq	g,[8*6 + CTX]
+	vmovq	h,[8*7 + CTX]
+
+	vmovdqa	BYTE_FLIP_MASK, [.LPSHUFFLE_BYTE_FLIP_MASK ADD_RIP]
+	vpmovzxbq PERM_VPALIGNR_8, [.LPERM_VPALIGNR_8 ADD_RIP]
+
+	lea	TBL,[.LK512 ADD_RIP]
+
+	/*; byte swap first 16 dwords */
+	COPY_YMM_AND_BSWAP(Y_0, [INP + 0*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_1, [INP + 1*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_2, [INP + 2*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_3, [INP + 3*32], BYTE_FLIP_MASK)
+
+	lea	INP, [INP + 128]
+
+	vpaddq	XFER, Y_0, [TBL + 0*32]
+	vmovdqa	[rsp + frame_XFER + 0*32], XFER
+	vpaddq	XFER, Y_1, [TBL + 1*32]
+	vmovdqa	[rsp + frame_XFER + 1*32], XFER
+	vpaddq	XFER, Y_2, [TBL + 2*32]
+	vmovdqa	[rsp + frame_XFER + 2*32], XFER
+	vpaddq	XFER, Y_3, [TBL + 3*32]
+	vmovdqa	[rsp + frame_XFER + 3*32], XFER
+
+	/*; schedule 64 input dwords, by doing 12 rounds of 4 each */
+	mov	SRND, 4
+
+.align 16
+.Loop0:
+	FOUR_ROUNDS_AND_SCHED(0, Y_0, Y_1, Y_2, Y_3, a, b, c, d, e, f, g, h)
+	FOUR_ROUNDS_AND_SCHED(1, Y_1, Y_2, Y_3, Y_0, e, f, g, h, a, b, c, d)
+	FOUR_ROUNDS_AND_SCHED(2, Y_2, Y_3, Y_0, Y_1, a, b, c, d, e, f, g, h)
+	FOUR_ROUNDS_AND_SCHED(3, Y_3, Y_0, Y_1, Y_2, e, f, g, h, a, b, c, d)
+	lea	TBL, [TBL + 4*32]
+
+	sub	SRND, 1
+	jne	.Loop0
+
+	sub	NUM_BLKS, 1
+	je	.Ldone_hash
+
+	lea	TBL, [.LK512 ADD_RIP]
+
+	/* load next block and byte swap */
+	COPY_YMM_AND_BSWAP(Y_0, [INP + 0*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_1, [INP + 1*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_2, [INP + 2*32], BYTE_FLIP_MASK)
+	COPY_YMM_AND_BSWAP(Y_3, [INP + 3*32], BYTE_FLIP_MASK)
+
+	lea	INP, [INP + 128]
+
+	DO_4ROUNDS(0, a, b, c, d, e, f, g, h)
+	vpaddq	XFER, Y_0, [TBL + 0*32]
+	vmovdqa	[rsp + frame_XFER + 0*32], XFER
+	DO_4ROUNDS(1, e, f, g, h, a, b, c, d)
+	vpaddq	XFER, Y_1, [TBL + 1*32]
+	vmovdqa	[rsp + frame_XFER + 1*32], XFER
+	DO_4ROUNDS(2, a, b, c, d, e, f, g, h)
+	vpaddq	XFER, Y_2, [TBL + 2*32]
+	vmovdqa	[rsp + frame_XFER + 2*32], XFER
+	DO_4ROUNDS(3, e, f, g, h, a, b, c, d)
+	vpaddq	XFER, Y_3, [TBL + 3*32]
+	vmovdqa	[rsp + frame_XFER + 3*32], XFER
+
+	addm([8*0 + CTX],a)
+	addm([8*1 + CTX],b)
+	addm([8*2 + CTX],c)
+	addm([8*3 + CTX],d)
+	addm([8*4 + CTX],e)
+	addm([8*5 + CTX],f)
+	addm([8*6 + CTX],g)
+	addm([8*7 + CTX],h)
+
+	/*; schedule 64 input dwords, by doing 12 rounds of 4 each */
+	mov	SRND, 4
+
+	jmp	.Loop0
+
+.Ldone_hash:
+	DO_4ROUNDS(0, a, b, c, d, e, f, g, h)
+	DO_4ROUNDS(1, e, f, g, h, a, b, c, d)
+	DO_4ROUNDS(2, a, b, c, d, e, f, g, h)
+	DO_4ROUNDS(3, e, f, g, h, a, b, c, d)
+
+	addm([8*0 + CTX],a)
+	xor	eax, eax /* burn stack */
+	addm([8*1 + CTX],b)
+	addm([8*2 + CTX],c)
+	addm([8*3 + CTX],d)
+	addm([8*4 + CTX],e)
+	addm([8*5 + CTX],f)
+	addm([8*6 + CTX],g)
+	addm([8*7 + CTX],h)
+	kmovd	MASK_DC_00, eax
+
+	vzeroall
+	vmovdqa	[rsp + frame_XFER + 0*32], ymm0 /* burn stack */
+	vmovdqa	[rsp + frame_XFER + 1*32], ymm0 /* burn stack */
+	vmovdqa	[rsp + frame_XFER + 2*32], ymm0 /* burn stack */
+	vmovdqa	[rsp + frame_XFER + 3*32], ymm0 /* burn stack */
+	clear_reg(%xmm16);
+	clear_reg(%xmm17);
+	clear_reg(%xmm18);
+	clear_reg(%xmm19);
+	clear_reg(%xmm20);
+	clear_reg(%xmm21);
+	clear_reg(%xmm22);
+
+	/* Restore Stack Pointer */
+	mov	rsp, RSP_SAVE
+	CFI_DEF_CFA_REGISTER(rsp)
+
+.Lnowork:
+	ret_spec_stop
+	CFI_ENDPROC()
+ELF(.size _gcry_sha512_transform_amd64_avx512,.-_gcry_sha512_transform_amd64_avx512)
+
+/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */
+/*;; Binary Data */
+
+ELF(.type _gcry_sha512_avx512_consts, at object)
+_gcry_sha512_avx512_consts:
+.align 64
+/* K[t] used in SHA512 hashing */
+.LK512:
+	.quad	0x428a2f98d728ae22,0x7137449123ef65cd
+	.quad	0xb5c0fbcfec4d3b2f,0xe9b5dba58189dbbc
+	.quad	0x3956c25bf348b538,0x59f111f1b605d019
+	.quad	0x923f82a4af194f9b,0xab1c5ed5da6d8118
+	.quad	0xd807aa98a3030242,0x12835b0145706fbe
+	.quad	0x243185be4ee4b28c,0x550c7dc3d5ffb4e2
+	.quad	0x72be5d74f27b896f,0x80deb1fe3b1696b1
+	.quad	0x9bdc06a725c71235,0xc19bf174cf692694
+	.quad	0xe49b69c19ef14ad2,0xefbe4786384f25e3
+	.quad	0x0fc19dc68b8cd5b5,0x240ca1cc77ac9c65
+	.quad	0x2de92c6f592b0275,0x4a7484aa6ea6e483
+	.quad	0x5cb0a9dcbd41fbd4,0x76f988da831153b5
+	.quad	0x983e5152ee66dfab,0xa831c66d2db43210
+	.quad	0xb00327c898fb213f,0xbf597fc7beef0ee4
+	.quad	0xc6e00bf33da88fc2,0xd5a79147930aa725
+	.quad	0x06ca6351e003826f,0x142929670a0e6e70
+	.quad	0x27b70a8546d22ffc,0x2e1b21385c26c926
+	.quad	0x4d2c6dfc5ac42aed,0x53380d139d95b3df
+	.quad	0x650a73548baf63de,0x766a0abb3c77b2a8
+	.quad	0x81c2c92e47edaee6,0x92722c851482353b
+	.quad	0xa2bfe8a14cf10364,0xa81a664bbc423001
+	.quad	0xc24b8b70d0f89791,0xc76c51a30654be30
+	.quad	0xd192e819d6ef5218,0xd69906245565a910
+	.quad	0xf40e35855771202a,0x106aa07032bbd1b8
+	.quad	0x19a4c116b8d2d0c8,0x1e376c085141ab53
+	.quad	0x2748774cdf8eeb99,0x34b0bcb5e19b48a8
+	.quad	0x391c0cb3c5c95a63,0x4ed8aa4ae3418acb
+	.quad	0x5b9cca4f7763e373,0x682e6ff3d6b2b8a3
+	.quad	0x748f82ee5defb2fc,0x78a5636f43172f60
+	.quad	0x84c87814a1f0ab72,0x8cc702081a6439ec
+	.quad	0x90befffa23631e28,0xa4506cebde82bde9
+	.quad	0xbef9a3f7b2c67915,0xc67178f2e372532b
+	.quad	0xca273eceea26619c,0xd186b8c721c0c207
+	.quad	0xeada7dd6cde0eb1e,0xf57d4f7fee6ed178
+	.quad	0x06f067aa72176fba,0x0a637dc5a2c898a6
+	.quad	0x113f9804bef90dae,0x1b710b35131c471b
+	.quad	0x28db77f523047d84,0x32caab7b40c72493
+	.quad	0x3c9ebe0a15c9bebc,0x431d67c49c100d4c
+	.quad	0x4cc5d4becb3e42b6,0x597f299cfc657e2a
+	.quad	0x5fcb6fab3ad6faec,0x6c44198c4a475817
+
+/* Mask for byte-swapping a couple of qwords in an XMM register using (v)pshufb. */
+.align 32
+.LPSHUFFLE_BYTE_FLIP_MASK:	.octa 0x08090a0b0c0d0e0f0001020304050607
+				.octa 0x18191a1b1c1d1e1f1011121314151617
+
+.align 4
+.LPERM_VPALIGNR_8:		.byte 5, 6, 7, 0
+ELF(.size _gcry_sha512_avx512_consts,.-_gcry_sha512_avx512_consts)
+
+#endif
+#endif
diff --git a/cipher/sha512.c b/cipher/sha512.c
index 9cab33d6..05c8943e 100644
--- a/cipher/sha512.c
+++ b/cipher/sha512.c
@@ -104,6 +104,16 @@
 #endif
 
 
+/* USE_AVX512 indicates whether to compile with Intel AVX512 code. */
+#undef USE_AVX512
+#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX512) && \
+    defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \
+    (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
+# define USE_AVX512 1
+#endif
+
+
 /* USE_SSSE3_I386 indicates whether to compile with Intel SSSE3/i386 code. */
 #undef USE_SSSE3_I386
 #if defined(__i386__) && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ >= 4 && \
@@ -197,7 +207,8 @@ static const u64 k[] =
  * stack to store XMM6-XMM15 needed on Win64. */
 #undef ASM_FUNC_ABI
 #undef ASM_EXTRA_STACK
-#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2)
+#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) \
+    || defined(USE_AVX512)
 # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
 #  define ASM_FUNC_ABI __attribute__((sysv_abi))
 #  define ASM_EXTRA_STACK (10 * 16 + 4 * sizeof(void *))
@@ -232,8 +243,10 @@ do_sha512_transform_amd64_ssse3(void *ctx, const unsigned char *data,
                                 size_t nblks)
 {
   SHA512_CONTEXT *hd = ctx;
-  return _gcry_sha512_transform_amd64_ssse3 (data, &hd->state, nblks)
-         + ASM_EXTRA_STACK;
+  unsigned int burn;
+  burn = _gcry_sha512_transform_amd64_ssse3 (data, &hd->state, nblks);
+  burn += burn > 0 ? ASM_EXTRA_STACK : 0;
+  return burn;
 }
 #endif
 
@@ -247,8 +260,10 @@ do_sha512_transform_amd64_avx(void *ctx, const unsigned char *data,
                               size_t nblks)
 {
   SHA512_CONTEXT *hd = ctx;
-  return _gcry_sha512_transform_amd64_avx (data, &hd->state, nblks)
-         + ASM_EXTRA_STACK;
+  unsigned int burn;
+  burn = _gcry_sha512_transform_amd64_avx (data, &hd->state, nblks);
+  burn += burn > 0 ? ASM_EXTRA_STACK : 0;
+  return burn;
 }
 #endif
 
@@ -262,8 +277,27 @@ do_sha512_transform_amd64_avx2(void *ctx, const unsigned char *data,
                                size_t nblks)
 {
   SHA512_CONTEXT *hd = ctx;
-  return _gcry_sha512_transform_amd64_avx2 (data, &hd->state, nblks)
-         + ASM_EXTRA_STACK;
+  unsigned int burn;
+  burn = _gcry_sha512_transform_amd64_avx2 (data, &hd->state, nblks);
+  burn += burn > 0 ? ASM_EXTRA_STACK : 0;
+  return burn;
+}
+#endif
+
+#ifdef USE_AVX512
+unsigned int _gcry_sha512_transform_amd64_avx512(const void *input_data,
+						 void *state,
+						 size_t num_blks) ASM_FUNC_ABI;
+
+static unsigned int
+do_sha512_transform_amd64_avx512(void *ctx, const unsigned char *data,
+                                 size_t nblks)
+{
+  SHA512_CONTEXT *hd = ctx;
+  unsigned int burn;
+  burn = _gcry_sha512_transform_amd64_avx512 (data, &hd->state, nblks);
+  burn += burn > 0 ? ASM_EXTRA_STACK : 0;
+  return burn;
 }
 #endif
 
@@ -393,6 +427,10 @@ sha512_init_common (SHA512_CONTEXT *ctx, unsigned int flags)
   if ((features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2))
     ctx->bctx.bwrite = do_sha512_transform_amd64_avx2;
 #endif
+#ifdef USE_AVX512
+  if ((features & HWF_INTEL_AVX512) != 0)
+    ctx->bctx.bwrite = do_sha512_transform_amd64_avx512;
+#endif
 #ifdef USE_PPC_CRYPTO
   if ((features & HWF_PPC_VCRYPTO) != 0)
     ctx->bctx.bwrite = do_sha512_transform_ppc8;
diff --git a/configure.ac b/configure.ac
index 27d72141..cf255bf3 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2942,6 +2942,7 @@ if test "$found" = "1" ; then
          GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-ssse3-amd64.lo"
          GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx-amd64.lo"
          GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx2-bmi2-amd64.lo"
+         GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx512-amd64.lo"
       ;;
       i?86-*-*)
          # Build with the assembly implementation
-- 
2.32.0


From ametzler at bebt.de  Wed Mar 16 19:04:00 2022
From: ametzler at bebt.de (Andreas Metzler)
Date: Wed, 16 Mar 2022 19:04:00 +0100
Subject: libgcrypt 1.10 status
Message-ID: <YjImkIcY4Uzv59NS@argenau.bebt.de>

Hello,

libgcrypt 1.10 was not announced on this list and is not listed on
https://gnupg.org/download/index.html#libgcrypt (either as available
version nor its end of life). Is this a stable release, is it a LTS
version?

TIA, cu Andreas
-- 
`What a good friend you are to him, Dr. Maturin. His other friends are
so grateful to you.'
`I sew his ears on from time to time, sure'


From wk at gnupg.org  Fri Mar 18 16:08:26 2022
From: wk at gnupg.org (Werner Koch)
Date: Fri, 18 Mar 2022 16:08:26 +0100
Subject: libgcrypt 1.10 status
In-Reply-To: <YjImkIcY4Uzv59NS@argenau.bebt.de> (Andreas Metzler's message of
 "Wed, 16 Mar 2022 19:04:00 +0100")
References: <YjImkIcY4Uzv59NS@argenau.bebt.de>
Message-ID: <87o82347th.fsf@wheatstone.g10code.de>

On Wed, 16 Mar 2022 19:04, Andreas Metzler said:

> libgcrypt 1.10 was not announced on this list and is not listed on
> https://gnupg.org/download/index.html#libgcrypt (either as available
> version nor its end of life). Is this a stable release, is it a LTS

It was not announced to see whether something broke.  It will be the
stable versions and there will soon be announcement for 1.10.1.

From our versions (source) file:

# We will probably wait for the 1.10.1 before we take 1.10 in public use.
# xxx +macro: libgcrypt_branch LIBGCRYPT-1.10-BRANCH
# xxx +macro: libgcrypt_ver  1.10.0
# xxx +macro: libgcrypt_date 2022-02-01


Shalom-Salam,

   Werner


-- 
Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20220318/25df2bc8/attachment.sig>

From geoff at knauth.org  Mon Mar 28 17:33:01 2022
From: geoff at knauth.org (Geoffrey S. Knauth)
Date: Mon, 28 Mar 2022 11:33:01 -0400
Subject: Libgcrypt 1.10.1 released
In-Reply-To: <87czi6w376.fsf@wheatstone.g10code.de>
References: <87czi6w376.fsf@wheatstone.g10code.de>
Message-ID: <7dd0b019-da32-4bb7-839c-e7d78a8c7425@www.fastmail.com>

Thank you!

On Mon, Mar 28, 2022, at 10:40, Werner Koch wrote:
> Hello!
>
> We are pleased to announce the availability of Libgcrypt version 1.10.1.
> This release starts a new stable branch of Libgcrypt with full API and
> ABI compatibility to the 1.9 series.  Over the last year Jussi Kivilinna
> put again a lot of work into speeding up the algorithms for the most
> commonly used CPUs.  See below for a list of improvements and new
> features in 1.10.
>
> Libgcrypt is a general purpose library of cryptographic building blocks.
> It is originally based on code used by GnuPG.  It does not provide any
> implementation of OpenPGP or other protocols.  Thorough understanding of
> applied cryptography is required to use Libgcrypt.
>
>
> Noteworthy changes in Libgcrypt 1.10.0 and 1.10.1
> =================================================
>
>  * New and extended interfaces:
>
>    - New control codes to check for FIPS 140-3 approved algorithms.
>
>    - New control code to switch into non-FIPS mode.
>
>    - New cipher modes SIV and GCM-SIV as specified by RFC-5297.
>
>    - Extended cipher mode AESWRAP with padding as specified by
>      RFC-5649.  [T5752]
>
>    - New set of KDF functions.
>
>    - New KDF modes Argon2 and Balloon.
>
>    - New functions for combining hashing and signing/verification.  [T4894]
>
>  * Performance:
>
>    - Improved support for PowerPC architectures.
>
>    - Improved ECC performance on zSeries/s390x by using accelerated
>      scalar multiplication.
>
>    - Many more assembler performance improvements for several
>      architectures.
>
>  * Bug fixes:
>
>    - Fix Elgamal encryption for other implementations.
>      [R5328,CVE-2021-40528]
>
>    - Fix alignment problem on macOS.  [T5440]
>
>    - Check the input length of the point in ECDH.  [T5423]
>
>    - Fix an abort in gcry_pk_get_param for "Curve25519".  [T5490]
>
>    - Fix minor memory leaks in FIPS mode.
>
>    - Build fixes for MUSL libc.  [rCffaef0be61]
>
>  * Other features:
>
>    - The control code GCRYCTL_SET_ENFORCED_FIPS_FLAG is ignored
>      because it is useless with the FIPS 140-3 related changes.
>
>    - Update of the jitter entropy RNG code.  [T5523]
>
>    - Simplification of the entropy gatherer when using the getentropy
>      system call.
>
>    - More portable integrity check in FIPS mode.  [rC9fa4c8946a,T5835]
>
>    - Add X9.62 OIDs to sha256 and sha512 modules.  [rC52fd2305ba]
>
>  Note that 1.10.0 was already released on 2022-02-01 without a public
>  announcement to allow for some extra test time.
>
>  For a list of links to commits and bug numbers see the release info at
>  https://dev.gnupg.org/T5691 and https://dev.gnupg.org/T5810
>
>
>
> Download
> ========
>
> Source code is hosted at the GnuPG FTP server and its mirrors as listed
> at https://gnupg.org/download/mirrors.html.  On the primary server
> the source tarball and its digital signature are:
>
>  https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.bz2
>  https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.bz2.sig
>
> or gzip compressed:
>
>  https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.gz
>  https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.gz.sig
>
> In order to check that the version of Libgcrypt you downloaded is an
> original and unmodified file please follow the instructions found at
> https://gnupg.org/download/integrity_check.html.  In short, you may
> use one of the following methods:
>
>  - Check the supplied OpenPGP signature.  For example to check the
>    signature of the file libgcrypt-1.10.1.tar.bz2 you would use this
>    command:
>
>      gpg --verify libgcrypt-1.10.1.tar.bz2.sig libgcrypt-1.10.1.tar.bz2
>
>    This checks whether the signature file matches the source file.
>    You should see a message indicating that the signature is good and
>    made by one or more of the release signing keys.  Make sure that
>    this is a valid key, either by matching the shown fingerprint
>    against a trustworthy list of valid release signing keys or by
>    checking that the key has been signed by trustworthy other keys.
>    See the end of this mail for information on the signing keys.
>
>  - If you are not able to use an existing version of GnuPG, you have
>    to verify the SHA-1 checksum.  On Unix systems the command to do
>    this is either "sha1sum" or "shasum".  Assuming you downloaded the
>    file libgcrypt-1.10.1.tar.bz2, you run the command like this:
>
>      sha1sum libgcrypt-1.10.1.tar.bz2
>
>    and check that the output matches the first line from the
>    this list:
>
> de2cc32e7538efa376de7bf5d3eafa85626fb95f  libgcrypt-1.10.1.tar.bz2
> 9db3ef0ec74bd2915fa7ca6f32ea9ba7e013e1a1  libgcrypt-1.10.1.tar.gz
>
>    You should also verify that the checksums above are authentic by
>    matching them with copies of this announcement.  Those copies can be
>    found at other mailing lists, web sites, and search engines.
>
>
> Copying
> =======
>
> Libgcrypt is distributed under the terms of the GNU Lesser General
> Public License (LGPLv2.1+).  The helper programs as well as the
> documentation are distributed under the terms of the GNU General Public
> License (GPLv2+).  The file LICENSES has notices about contributions
> that require that these additional notices are distributed.
>
>
> Support
> =======
>
> For help on developing with Libgcrypt you should read the included
> manual and if needed ask on the gcrypt-devel mailing list.
>
> In case of problems specific to this release please first check
> https://dev.gnupg.org/T5810 for updated information.
>
> Please also consult the archive of the gcrypt-devel mailing list before
> reporting a bug: https://gnupg.org/documentation/mailing-lists.html .
> We suggest to send bug reports for a new release to this list in favor
> of filing a bug at https://bugs.gnupg.org.  If you need commercial
> support go to https://gnupg.com or https://gnupg.org/service.html .
>
> If you are a developer and you need a certain feature for your project,
> please do not hesitate to bring it to the gcrypt-devel mailing list for
> discussion.
>
>
>
> Thanks
> ======
>
> Since 2001 maintenance and development of GnuPG is done by g10 Code GmbH
> and has mostly been financed by donations.  Three full-time employed
> developers as well as two contractors exclusively work on GnuPG and
> closely related software like Libgcrypt, GPGME and Gpg4win.
>
> Fortunately, and this is still not common with free software, we have
> now established a way of financing the development while keeping all our
> software free and freely available for everyone.  Our model is similar
> to the way RedHat manages RHEL and Fedora: Except for the actual binary
> of the MSI installer for Windows and client specific configuration
> files, all the software is available under the GNU GPL and other Open
> Source licenses.  Thus customers may even build and distribute their own
> version of the software as long as they do not use our trademark
> GnuPG VS-Desktop?.
>
> We like to thank all the nice people who are helping the GnuPG project,
> be it testing, coding, translating, suggesting, auditing, administering
> the servers, spreading the word, answering questions on the mailing
> lists, or helping with donations.
>
> *Thank you all*
>
>    Your Libgcrypt hackers
>
>
>
> p.s.
> This is an announcement only mailing list.  Please send replies only to
> the gnupg-users'at'gnupg.org mailing list.
>
> List of Release Signing Keys:
> To guarantee that a downloaded GnuPG version has not been tampered by
> malicious entities we provide signature files for all tarballs and
> binary versions.  The keys are also signed by the long term keys of
> their respective owners.  Current releases are signed by one or more
> of these keys:
>
>   rsa3072 2017-03-17 [expires: 2027-03-15]
>   5B80 C575 4298 F0CB 55D8  ED6A BCEF 7E29 4B09 2E28
>   Andre Heinecke (Release Signing Key)
>
>   ed25519 2020-08-24 [expires: 2030-06-30]
>   6DAA 6E64 A76D 2840 571B  4902 5288 97B8 2640 3ADA
>   Werner Koch (dist signing 2020)
>
>   ed25519 2021-05-19 [expires: 2027-04-04]
>   AC8E 115B F73E 2D8D 47FA  9908 E98E 9B2D 19C6 C8BD
>   Niibe Yutaka (GnuPG Release Key)
>
>   brainpoolP256r1 2021-10-15 [expires: 2029-12-31]
>   02F3 8DFF 731F F97C B039  A1DA 549E 695E 905B A208
>   GnuPG.com (Release Signing Key 2021)
>
> The keys are available at https://gnupg.org/signature_key.html and
> in any recently released GnuPG tarball in the file g10/distsigkey.gpg .
> Note that this mail has been signed by a different key.
>
>
> --
> The pioneers of a warless world are the youth that
> refuse military service.             - A. Einstein
>
> Attachments:
> * signature.asc

-- 
Geoffrey S. Knauth | https://knauth.org/gsk


From guidovranken at gmail.com  Mon Mar 28 22:47:20 2022
From: guidovranken at gmail.com (Guido Vranken)
Date: Mon, 28 Mar 2022 22:47:20 +0200
Subject: Argon2 incorrect result and division by zero
Message-ID: <CAO5O-E+-BfA=yrX8YQzg2CnJo22XMuY2TE2SMuS5j+zW1OdF4A@mail.gmail.com>

Fuzzer debug output for the reproducer (included at end of this message):

Module libgcrypt result:

{0x0f, 0xb5, 0x95, 0x20, 0xf8, 0x1a, 0x3f, 0xec, 0xac, 0xc0, 0xa4, 0x68,
0x78, 0x33, 0xf7, 0xce,
 0xb0, 0xbd, 0x42, 0x95, 0xc2, 0x63, 0x45, 0x38, 0xc2, 0x06, 0x6e, 0x8c,
0x39, 0x2a, 0xb4, 0xd5,
 0x84, 0x6b, 0x19, 0xf2, 0x5f, 0x00, 0x7b, 0xbf, 0x66, 0xfe, 0xc8, 0xd2,
0xe7, 0x98, 0x2d, 0xa7,
 0x00, 0xdb, 0xf9, 0x43, 0x13, 0xd5, 0x5d, 0x19, 0xec, 0x0e, 0x5c, 0x69,
0x06, 0xd2, 0xb6, 0xd7,
 0xcf, 0x72, 0xbb, 0x3b, 0xa7, 0x29} (70 bytes)

Module Botan result:

{0x0f, 0xb5, 0x95, 0x20, 0xf8, 0x1a, 0x3f, 0xec, 0xac, 0xc0, 0xa4, 0x68,
0x78, 0x33, 0xf7, 0xce,
 0xb0, 0xbd, 0x42, 0x95, 0xc2, 0x63, 0x45, 0x38, 0xc2, 0x06, 0x6e, 0x8c,
0x39, 0x2a, 0xb4, 0xd5,
 0xc1, 0x16, 0x48, 0x32, 0x7c, 0xed, 0xe1, 0x56, 0x90, 0xab, 0x49, 0x32,
0xd0, 0x51, 0x48, 0x55,
 0x6d, 0x96, 0xcc, 0xd1, 0x33, 0xe2, 0xb2, 0x2b, 0x88, 0xf8, 0x35, 0x74,
0xf8, 0x90, 0x78, 0x27,
 0x45, 0xa4, 0x37, 0x99, 0xc6, 0x86} (70 bytes)

It seems that the 64 bytes are always correct but with output sizes larger
than that, discrepancies occur.

Additionally there is a division by zero on this line in argon2_init() if
parallelism is set to 0:

  segment_length = memory_blocks / (parallelism * 4);

it would be better to return an error in this case.

Reproducer:

#include <gcrypt.h>

#define CF_CHECK_EQ(expr, res) if ( (expr) != (res) ) { goto end; }

int main(void)
{
    const unsigned char password[32] = {
        0xa3, 0x18, 0xc3, 0x65, 0x45, 0xbb, 0x67, 0xb8, 0x26, 0xab, 0x1d,
0x8c, 0xa7, 0x0f, 0xc7, 0x8c,
        0x33, 0x0d, 0x4c, 0x57, 0x2b, 0xcf, 0x95, 0x94, 0xfd, 0x75, 0x85,
0xf0, 0x08, 0x2a, 0x04, 0x05};
    const unsigned char salt[32] = {
        0xb4, 0x0f, 0xf9, 0x84, 0x68, 0x4e, 0x44, 0x0c, 0x86, 0x0b, 0xd1,
0x4b, 0x7c, 0x71, 0x85, 0xdb,
        0xa7, 0x9b, 0x47, 0x6e, 0x76, 0xba, 0xf9, 0xa3, 0x47, 0xf0, 0x82,
0x20, 0x84, 0x60, 0xca, 0x8e};

    gcry_kdf_hd_t hd;
    unsigned char out[70];
    const unsigned long params[4] = {sizeof(out), 1, 27701, 1};

    CF_CHECK_EQ(gcry_kdf_open(&hd, GCRY_KDF_ARGON2, GCRY_KDF_ARGON2ID,
params, 4,
                password, sizeof(password),
                salt, sizeof(salt),
                NULL, 0,
                NULL, 0), GPG_ERR_NO_ERROR);
    CF_CHECK_EQ(gcry_kdf_compute(hd, NULL), GPG_ERR_NO_ERROR);
    CF_CHECK_EQ(gcry_kdf_final(hd, sizeof(out), out), GPG_ERR_NO_ERROR);

    for (size_t i = 0; i < sizeof(out); i++) {
        if ( !(i % 16) ) printf("\n");
        printf("0x%02x, ", out[i]);
    }
    printf("\n");

end:
    gcry_kdf_close(hd);
    return 0;
}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20220328/b66e6500/attachment.html>

From tianjia.zhang at linux.alibaba.com  Tue Mar 29 10:26:00 2022
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 29 Mar 2022 16:26:00 +0800
Subject: [PATCH] Fix configure.ac error of intel-avx512
Message-ID: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com>

* configure.ac: Correctly set value for avx512support.
--

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 configure.ac | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/configure.ac b/configure.ac
index b41322e357ca..467151391a73 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1303,6 +1303,7 @@ if test "$mpi_cpu_arch" != "x86" ; then
    sse41support="n/a"
    avxsupport="n/a"
    avx2support="n/a"
+   avx512support="n/a"
    padlocksupport="n/a"
    drngsupport="n/a"
 fi
@@ -2404,6 +2405,11 @@ if test x"$avx2support" = xyes ; then
     avx2support="no (unsupported by compiler)"
   fi
 fi
+if test x"$avx512support" = xyes ; then
+  if test "$gcry_cv_gcc_inline_asm_avx512" != "yes" ; then
+    avx512support="no (unsupported by compiler)"
+  fi
+fi
 if test x"$neonsupport" = xyes ; then
   if test "$gcry_cv_gcc_inline_asm_neon" != "yes" ; then
     if test "$gcry_cv_gcc_inline_asm_aarch64_neon" != "yes" ; then
-- 
2.24.3 (Apple Git-128)


From rms at gnu.org  Tue Mar 29 05:30:46 2022
From: rms at gnu.org (Richard Stallman)
Date: Mon, 28 Mar 2022 23:30:46 -0400
Subject: Libgcrypt 1.10.1 released
In-Reply-To: <87czi6w376.fsf@wheatstone.g10code.de> (message from Werner Koch
 on Mon, 28 Mar 2022 16:40:13 +0200)
References: <87czi6w376.fsf@wheatstone.g10code.de>
Message-ID: <E1nZ2ZC-0005wV-4n@fencepost.gnu.org>

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Congratulations on the new release.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


From jussi.kivilinna at iki.fi  Tue Mar 29 18:06:04 2022
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 29 Mar 2022 19:06:04 +0300
Subject: [PATCH] Fix configure.ac error of intel-avx512
In-Reply-To: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com>
References: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <b10d0b25-b493-5bbb-3080-aa81be0a01f3@iki.fi>

Hello,

On 29.3.2022 11.26, Tianjia Zhang via Gcrypt-devel wrote:
> * configure.ac: Correctly set value for avx512support.
> --
> 
> Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
> ---
>   configure.ac | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/configure.ac b/configure.ac
> index b41322e357ca..467151391a73 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1303,6 +1303,7 @@ if test "$mpi_cpu_arch" != "x86" ; then
>      sse41support="n/a"
>      avxsupport="n/a"
>      avx2support="n/a"
> +   avx512support="n/a"
>      padlocksupport="n/a"
>      drngsupport="n/a"
>   fi
> @@ -2404,6 +2405,11 @@ if test x"$avx2support" = xyes ; then
>       avx2support="no (unsupported by compiler)"
>     fi
>   fi
> +if test x"$avx512support" = xyes ; then
> +  if test "$gcry_cv_gcc_inline_asm_avx512" != "yes" ; then
> +    avx512support="no (unsupported by compiler)"
> +  fi
> +fi
>   if test x"$neonsupport" = xyes ; then
>     if test "$gcry_cv_gcc_inline_asm_neon" != "yes" ; then
>       if test "$gcry_cv_gcc_inline_asm_aarch64_neon" != "yes" ; then

Applied to master, thanks.

-Jussi