From tianjia.zhang at linux.alibaba.com Tue Mar 1 05:38:35 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 12:38:35 +0800 Subject: [PATCH v2 1/2] hwf-arm: add ARMv8.2 optional crypto extension HW features Message-ID: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com> * src/g10lib.h (HWF_ARM_SHA3, HWF_ARM_SM3, HWF_ARM_SM4) (HWF_ARM_SHA512): New. * src/hwf-arm.c (arm_features): Add sha3, sm3, sm4, sha512 HW features. * src/hwfeatures.c (hwflist): Add sha3, sm3, sm4, sha512 HW features. -- Signed-off-by: Tianjia Zhang --- src/g10lib.h | 4 ++++ src/hwf-arm.c | 16 ++++++++++++++++ src/hwfeatures.c | 4 ++++ 3 files changed, 24 insertions(+) diff --git a/src/g10lib.h b/src/g10lib.h index 22c0f0c2..985e75c6 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -245,6 +245,10 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_ARM_SHA1 (1 << 2) #define HWF_ARM_SHA2 (1 << 3) #define HWF_ARM_PMULL (1 << 4) +#define HWF_ARM_SHA3 (1 << 5) +#define HWF_ARM_SM3 (1 << 6) +#define HWF_ARM_SM4 (1 << 7) +#define HWF_ARM_SHA512 (1 << 8) #elif defined(HAVE_CPU_ARCH_PPC) diff --git a/src/hwf-arm.c b/src/hwf-arm.c index 60107f36..70d375b2 100644 --- a/src/hwf-arm.c +++ b/src/hwf-arm.c @@ -137,6 +137,18 @@ static const struct feature_map_s arm_features[] = #ifndef HWCAP_SHA2 # define HWCAP_SHA2 64 #endif +#ifndef HWCAP_SHA3 +# define HWCAP_SHA3 (1 << 17) +#endif +#ifndef HWCAP_SM3 +# define HWCAP_SM3 (1 << 18) +#endif +#ifndef HWCAP_SM4 +# define HWCAP_SM4 (1 << 19) +#endif +#ifndef HWCAP_SHA512 +# define HWCAP_SHA512 (1 << 21) +#endif static const struct feature_map_s arm_features[] = { @@ -148,6 +160,10 @@ static const struct feature_map_s arm_features[] = { HWCAP_SHA1, 0, " sha1", HWF_ARM_SHA1 }, { HWCAP_SHA2, 0, " sha2", HWF_ARM_SHA2 }, { HWCAP_PMULL, 0, " pmull", HWF_ARM_PMULL }, + { HWCAP_SHA3, 0, " sha3", HWF_ARM_SHA3 }, + { HWCAP_SM3, 0, " sm3", HWF_ARM_SM3 }, + { HWCAP_SM4, 0, " sm4", HWF_ARM_SM4 }, + { HWCAP_SHA512, 0, " sha512", HWF_ARM_SHA512 }, #endif }; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 97e67b3c..7060d995 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -68,6 +68,10 @@ static struct { HWF_ARM_SHA1, "arm-sha1" }, { HWF_ARM_SHA2, "arm-sha2" }, { HWF_ARM_PMULL, "arm-pmull" }, + { HWF_ARM_SHA3, "arm-sha3" }, + { HWF_ARM_SM3, "arm-sm3" }, + { HWF_ARM_SM4, "arm-sm4" }, + { HWF_ARM_SHA512, "arm-sha512" }, #elif defined(HAVE_CPU_ARCH_PPC) { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, -- 2.34.1 From tianjia.zhang at linux.alibaba.com Tue Mar 1 05:38:36 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 12:38:36 +0800 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com> References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. * cipher/sm4-armv8-aarch64-ce.S: New. * cipher/sm4.c (USE_ARM_CE): New. (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. (sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption. 
(sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add ARMv8/AArch64/CE bulk functions. * configure.ac (gcry_cv_gcc_inline_asm_aarch64_crypto_sm): Check for GCC inline assembler supports AArch64 SM Crypto Extension instructions; Add 'sm4-armv8-aarch64-ce.lo'. -- This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk functions process eight blocks in parallel. Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 After (10x - 19x faster than ARMv8/AArch64 impl): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 1.25 ns/B 762.7 MiB/s 3.44 c/B 2749 CBC dec | 0.243 ns/B 3927 MiB/s 0.668 c/B 2750 CFB enc | 1.25 ns/B 763.1 MiB/s 3.44 c/B 2750 CFB dec | 0.245 ns/B 3899 MiB/s 0.673 c/B 2750 CTR enc | 0.298 ns/B 3199 MiB/s 0.820 c/B 2750 CTR dec | 0.298 ns/B 3198 MiB/s 0.820 c/B 2750 GCM enc | 0.487 ns/B 1957 MiB/s 1.34 c/B 2749 GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.519 c/B 2750 OCB enc | 0.443 ns/B 2150 MiB/s 1.22 c/B 2749 OCB dec | 0.486 ns/B 1964 MiB/s 1.34 c/B 2750 OCB auth | 0.369 ns/B 2585 MiB/s 1.01 c/B 2749 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 1 + cipher/sm4-armv8-aarch64-ce.S | 568 ++++++++++++++++++++++++++++++++++ cipher/sm4.c | 152 +++++++++ configure.ac | 35 +++ 4 files changed, 756 insertions(+) create mode 100644 cipher/sm4-armv8-aarch64-ce.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index a7cbf3fc..3339c463 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ + sm4-armv8-aarch64-ce.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S new file mode 100644 index 00000000..57e84683 --- /dev/null +++ b/cipher/sm4-armv8-aarch64-ce.S @@ -0,0 +1,568 @@ +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \ + defined(USE_SM4) + +.cpu generic+simd+crypto + +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lv\b\().4s, \b +.endr + +.macro sm4e, vd, vn + .inst 0xcec08400 | (.L\vn << 5) | .L\vd +.endm + +.macro sm4ekey, vd, vn, vm + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd +.endm + +.text + +/* Register macros */ + +#define RTMP0 v16 +#define RTMP1 v17 +#define RTMP2 v18 +#define RTMP3 v19 + +#define RIV v20 + +/* Helper macros. */ + +#define load_rkey(ptr) \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define crypt_blk4(b0, b1, b2, b3) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; + +#define crypt_blk8(b0, b1, b2, b3, b4, b5, b6, b7) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b4.4s, v24.4s; \ + sm4e b5.4s, v24.4s; \ + sm4e b6.4s, v24.4s; \ + sm4e b7.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b4.4s, v25.4s; \ + sm4e b5.4s, v25.4s; \ + sm4e b6.4s, v25.4s; \ + sm4e b7.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b4.4s, v26.4s; \ + sm4e b5.4s, v26.4s; \ + sm4e b6.4s, v26.4s; \ + sm4e b7.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b4.4s, v27.4s; \ + sm4e b5.4s, v27.4s; \ + sm4e b6.4s, v27.4s; \ + sm4e b7.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b4.4s, v28.4s; \ + sm4e b5.4s, v28.4s; \ + sm4e b6.4s, v28.4s; \ + sm4e b7.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b4.4s, v29.4s; \ + sm4e b5.4s, v29.4s; \ + sm4e b6.4s, v29.4s; \ + sm4e b7.4s, v29.4s; \ + sm4e 
b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b4.4s, v30.4s; \ + sm4e b5.4s, v30.4s; \ + sm4e b6.4s, v30.4s; \ + sm4e b7.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + sm4e b4.4s, v31.4s; \ + sm4e b5.4s, v31.4s; \ + sm4e b6.4s, v31.4s; \ + sm4e b7.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + rev64 b4.4s, b4.4s; \ + rev64 b5.4s, b5.4s; \ + rev64 b6.4s, b6.4s; \ + rev64 b7.4s, b7.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + ext b4.16b, b4.16b, b4.16b, #8; \ + ext b5.16b, b5.16b, b5.16b, #8; \ + ext b6.16b, b6.16b, b6.16b, #8; \ + ext b7.16b, b7.16b, b7.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; + + +.align 3 +.global _gcry_sm4_armv8_ce_expand_key +ELF(.type _gcry_sm4_armv8_ce_expand_key,%function;) +_gcry_sm4_armv8_ce_expand_key: + /* input: + * x0: 128-bit key + * x1: rkey_enc + * x2: rkey_dec + * x3: fk array + * x4: ck array + */ + CFI_STARTPROC(); + + ld1 {v0.16b}, [x0]; + rev32 v0.16b, v0.16b; + ld1 {v1.16b}, [x3]; + load_rkey(x4); + + /* input ^ fk */ + eor v0.16b, v0.16b, v1.16b; + + sm4ekey v0.4s, v0.4s, v24.4s; + sm4ekey v1.4s, v0.4s, v25.4s; + sm4ekey v2.4s, v1.4s, v26.4s; + sm4ekey v3.4s, v2.4s, v27.4s; + sm4ekey v4.4s, v3.4s, v28.4s; + sm4ekey v5.4s, v4.4s, v29.4s; + sm4ekey v6.4s, v5.4s, v30.4s; + sm4ekey v7.4s, v6.4s, v31.4s; + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1]; + rev64 v7.4s, v7.4s; + rev64 v6.4s, v6.4s; + rev64 v5.4s, v5.4s; + rev64 v4.4s, v4.4s; + rev64 v3.4s, v3.4s; + rev64 v2.4s, v2.4s; + rev64 v1.4s, v1.4s; + rev64 v0.4s, v0.4s; + ext v7.16b, v7.16b, v7.16b, #8; + ext v6.16b, v6.16b, v6.16b, #8; + ext v5.16b, v5.16b, v5.16b, #8; + ext v4.16b, v4.16b, v4.16b, #8; + ext v3.16b, v3.16b, v3.16b, #8; + ext v2.16b, v2.16b, v2.16b, #8; + ext v1.16b, v1.16b, v1.16b, #8; + ext v0.16b, v0.16b, v0.16b, #8; + st1 {v7.16b}, [x2], #16; + st1 {v6.16b}, [x2], #16; + st1 {v5.16b}, [x2], #16; + st1 {v4.16b}, [x2], #16; + st1 {v3.16b}, [x2], #16; + st1 {v2.16b}, [x2], #16; + st1 {v1.16b}, [x2], #16; + st1 {v0.16b}, [x2]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;) + +.align 3 +ELF(.type sm4_armv8_ce_crypt_blk1_4,%function;) +sm4_armv8_ce_crypt_blk1_4: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..4) + */ + CFI_STARTPROC(); + + load_rkey(x0); + + ld1 {v0.16b}, [x2], #16; + mov v1.16b, v0.16b; + mov v2.16b, v0.16b; + mov v3.16b, v0.16b; + cmp x3, #2; + blt .Lblk4_load_input_done; + ld1 {v1.16b}, [x2], #16; + beq .Lblk4_load_input_done; + ld1 {v2.16b}, [x2], #16; + cmp x3, #3; + beq .Lblk4_load_input_done; + ld1 {v3.16b}, [x2]; + +.Lblk4_load_input_done: + crypt_blk4(v0, v1, v2, v3); + + st1 {v0.16b}, [x1], #16; + cmp x3, #2; + blt .Lblk4_store_output_done; + st1 {v1.16b}, [x1], #16; + beq .Lblk4_store_output_done; + st1 {v2.16b}, [x1], #16; + cmp x3, #3; + beq .Lblk4_store_output_done; + st1 {v3.16b}, [x1]; + +.Lblk4_store_output_done: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_armv8_ce_crypt_blk1_4,.-sm4_armv8_ce_crypt_blk1_4;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt_blk1_8 +ELF(.type 
_gcry_sm4_armv8_ce_crypt_blk1_8,%function;) +_gcry_sm4_armv8_ce_crypt_blk1_8: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..8) + */ + CFI_STARTPROC(); + + cmp x3, #5; + blt sm4_armv8_ce_crypt_blk1_4; + + load_rkey(x0); + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b}, [x2], #16; + mov v5.16b, v4.16b; + mov v6.16b, v4.16b; + mov v7.16b, v4.16b; + beq .Lblk8_load_input_done; + ld1 {v5.16b}, [x2], #16; + cmp x3, #7; + blt .Lblk8_load_input_done; + ld1 {v6.16b}, [x2], #16; + beq .Lblk8_load_input_done; + ld1 {v7.16b}, [x2]; + +.Lblk8_load_input_done: + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + cmp x3, #6; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b}, [x1], #16; + blt .Lblk8_store_output_done; + st1 {v5.16b}, [x1], #16; + beq .Lblk8_store_output_done; + st1 {v6.16b}, [x1], #16; + cmp x3, #7; + beq .Lblk8_store_output_done; + st1 {v7.16b}, [x1]; + +.Lblk8_store_output_done: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt +ELF(.type _gcry_sm4_armv8_ce_crypt,%function;) +_gcry_sm4_armv8_ce_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + +.Lcrypt_loop_blk: + subs x3, x3, #8; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2], #64; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcrypt_loop_blk; + +.Lcrypt_end: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;) + +.align 3 +.global _gcry_sm4_armv8_ce_cbc_dec +ELF(.type _gcry_sm4_armv8_ce_cbc_dec,%function;) +_gcry_sm4_armv8_ce_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + ld1 {RIV.16b}, [x3]; + +.Lcbc_loop_blk: + subs x4, x4, #8; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #64; + eor v0.16b, v0.16b, RIV.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + eor v4.16b, v4.16b, RTMP3.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v5.16b, v5.16b, RTMP0.16b; + eor v6.16b, v6.16b, RTMP1.16b; + eor v7.16b, v7.16b, RTMP2.16b; + + mov RIV.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcbc_loop_blk; + +.Lcbc_end: + /* store new IV */ + st1 {RIV.16b}, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_cfb_dec +ELF(.type _gcry_sm4_armv8_ce_cfb_dec,%function;) +_gcry_sm4_armv8_ce_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk: + subs x4, x4, #8; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #48; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, 
[x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_ctr_enc +ELF(.type _gcry_sm4_armv8_ce_ctr_enc,%function;) +_gcry_sm4_armv8_ce_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk: + subs x4, x4, #8; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + inc_le128(v4); /* +4 */ + inc_le128(v5); /* +5 */ + inc_le128(v6); /* +6 */ + inc_le128(v7); /* +7 */ + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lctr_loop_blk; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index ec2281b6..1fef664b 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -76,6 +76,15 @@ # endif #endif +#undef USE_ARM_CE +#ifdef ENABLE_ARM_CRYPTO_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) +# define USE_ARM_CE 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -106,6 +115,9 @@ typedef struct #ifdef USE_AARCH64_SIMD unsigned int use_aarch64_simd:1; #endif +#ifdef USE_ARM_CE + unsigned int use_arm_ce:1; +#endif } SM4_context; static const u32 fk[4] = @@ -286,6 +298,43 @@ sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, } #endif /* USE_AARCH64_SIMD */ +#ifdef USE_ARM_CE +extern void _gcry_sm4_armv8_ce_expand_key(const byte *key, + u32 *rkey_enc, u32 *rkey_dec, + const u32 *fk, const u32 *ck); + +extern void _gcry_sm4_armv8_ce_crypt(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +extern void _gcry_sm4_armv8_ce_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +static inline unsigned int +sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) +{ 
+ _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks); + return 0; +} +#endif /* USE_ARM_CE */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -363,6 +412,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key) } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + _gcry_sm4_armv8_ce_expand_key (key, ctx->rkey_enc, ctx->rkey_dec, + fk, ck); + return; + } +#endif + rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0]; rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1]; rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2]; @@ -420,6 +478,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AARCH64_SIMD ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); #endif +#ifdef USE_ARM_CE + ctx->use_arm_ce = !!(hwf & HWF_ARM_SM4); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -465,6 +526,11 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_enc, outbuf, inbuf, 1); +#endif + prefetch_sbox_table (); return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf); @@ -475,6 +541,11 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_dec, outbuf, inbuf, 1); +#endif + prefetch_sbox_table (); return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf); @@ -601,6 +672,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -634,6 +722,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -725,6 +819,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -758,6 +869,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -842,6 +959,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. 
*/ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -875,6 +1009,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1037,6 +1177,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1203,6 +1349,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { diff --git a/configure.ac b/configure.ac index f5363f22..2ff053f7 100644 --- a/configure.ac +++ b/configure.ac @@ -1906,6 +1906,40 @@ if test "$gcry_cv_gcc_inline_asm_aarch64_crypto" = "yes" ; then fi +# +# Check whether GCC inline assembler supports AArch64 SM Crypto Extension instructions +# +AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 SM Crypto Extension instructions], + [gcry_cv_gcc_inline_asm_aarch64_crypto_sm], + [if test "$mpi_cpu_arch" != "aarch64" || + test "$try_asm_modules" != "yes" ; then + gcry_cv_gcc_inline_asm_aarch64_crypto_sm="n/a" + else + gcry_cv_gcc_inline_asm_aarch64_crypto_sm=no + AC_LINK_IFELSE([AC_LANG_PROGRAM( + [[__asm__( + ".cpu generic+simd+crypto\n\t" + ".text\n\t" + "testfn:\n\t" + ".inst 0xce63c004 /* sm3partw1 v4.4s, v0.4s, v3.4s */\n\t" + ".inst 0xce66c4e4 /* sm3partw2 v4.4s, v7.4s, v6.4s */\n\t" + ".inst 0xce4b2505 /* sm3ss1 v5.4s, v8.4s, v11.4s, v9.4s */\n\t" + ".inst 0xce4a80a8 /* sm3tt1a v8.4s, v5.4s, v10.s[0] */\n\t" + ".inst 0xce4a84a8 /* sm3tt1b v8.4s, v5.4s, v10.s[0] */\n\t" + ".inst 0xce4088a9 /* sm3tt2a v9.4s, v5.4s, v0.s[0] */\n\t" + ".inst 0xce448ca9 /* sm3tt2b v9.4s, v5.4s, v4.s[0] */\n\t" + ".inst 0xcec08408 /* sm4e v8.4s, v0.4s */\n\t" + ".inst 0xce70c800 /* sm4ekey v0.4s, v0.4s, v16.4s */\n\t" + ); + ]], [ testfn(); ])], + [gcry_cv_gcc_inline_asm_aarch64_crypto_sm=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_aarch64_crypto_sm" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM,1, + [Defined if inline assembler supports AArch64 SM Crypto Extension instructions]) +fi + + # # Check whether PowerPC AltiVec/VSX intrinsics # @@ -2755,6 +2789,7 @@ if test "$found" = "1" ; then aarch64-*-*) # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-armv8-aarch64-ce.lo" esac fi -- 2.34.1 From tianjia.zhang at linux.alibaba.com Tue Mar 1 08:04:06 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 15:04:06 +0800 Subject: [PATCH] Ignore tests binary file in git repo Message-ID: <20220301070406.12591-1-tianjia.zhang@linux.alibaba.com> Signed-off-by: Tianjia Zhang --- .gitignore | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/.gitignore b/.gitignore index 99741c18..ec456c71 100644 --- a/.gitignore +++ b/.gitignore @@ -67,6 +67,7 @@ 
src/gcrypt.h src/hmac256 src/libgcrypt-config src/libgcrypt.la +src/libgcrypt.la.done src/libgcrypt.pc src/mpicalc src/versioninfo.rc @@ -100,14 +101,20 @@ tests/register tests/rsacvt tests/t-convert tests/t-cv25519 +tests/t-dsa +tests/t-ecdsa tests/t-ed25519 +tests/t-ed448 tests/t-kdf tests/t-lock tests/t-mpi-bit tests/t-mpi-point +tests/t-rsa-15 +tests/t-rsa-pss tests/t-sexp tests/t-secmem tests/t-x448 +tests/testdrv tests/tsexp tests/version tests/*.exe -- 2.34.1 From jussi.kivilinna at iki.fi Tue Mar 1 08:26:28 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 1 Mar 2022 09:26:28 +0200 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com> References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com> <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com> Message-ID: <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi> Hello, On 1.3.2022 6.38, Tianjia Zhang wrote: > new file mode 100644 > index 00000000..57e84683 > --- /dev/null > +++ b/cipher/sm4-armv8-aarch64-ce.S > @@ -0,0 +1,568 @@ > +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher > + * > + * Copyright (C) 2022 Alibaba Group. > + * Copyright (C) 2022 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . > + */ > + > +#include "asm-common-aarch64.h" > + > +#if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \ > + defined(USE_SM4) > + > +.cpu generic+simd+crypto > + > +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 > + .set .Lv\b\().4s, \b > +.endr > + > +.macro sm4e, vd, vn > + .inst 0xcec08400 | (.L\vn << 5) | .L\vd > +.endm > + > +.macro sm4ekey, vd, vn, vm > + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd > +.endm I meant that the problem is that ".macro"/".endm"/".set"/".irp" may not be not supported by all compilers/assemblers. Implementation here could either: - Rely on assembler supporting these instructions and use "sm4e" and "sm4ekey" directly or - Use preprocessor #define macros instead of assembler .macros to provide these instructions. Something like this could work: #define vecnum_v0 0 #define vecnum_v1 1 #define vecnum_v2 2 #define vecnum_v3 3 #define vecnum_v4 4 #define vecnum_v5 5 #define vecnum_v6 6 #define vecnum_v7 7 #define vecnum_v16 16 #define vecnum_v24 24 #define vecnum_v25 25 #define vecnum_v26 26 #define vecnum_v27 27 #define vecnum_v28 28 #define vecnum_v29 29 #define vecnum_v30 30 #define vecnum_v31 31 #define sm4e(vd,vn) \ .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd) #define sm4ekey(vd, vn, vm) \ .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) ... 
#define crypt_blk4(b0, b1, b2, b3) \ rev32 b0.16b, b0.16b; \ rev32 b1.16b, b1.16b; \ rev32 b2.16b, b2.16b; \ rev32 b3.16b, b3.16b; \ sm4e(b0, v24); \ sm4e(b1, v24); \ -Jussi From tianjia.zhang at linux.alibaba.com Tue Mar 1 09:58:34 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 16:58:34 +0800 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi> References: <20220301043836.28709-1-tianjia.zhang@linux.alibaba.com> <20220301043836.28709-2-tianjia.zhang@linux.alibaba.com> <30158843-b4ec-6864-5d85-1a7031a18c1e@iki.fi> Message-ID: <9b1407b0-2190-0ebf-5bbe-15265cccfc63@linux.alibaba.com> Hi Jussi, On 3/1/22 3:26 PM, Jussi Kivilinna wrote: > Hello, > > On 1.3.2022 6.38, Tianjia Zhang wrote: >> new file mode 100644 >> index 00000000..57e84683 >> --- /dev/null >> +++ b/cipher/sm4-armv8-aarch64-ce.S >> @@ -0,0 +1,568 @@ >> +/* sm4-armv8-aarch64-ce.S? -? ARMv8/AArch64/CE accelerated SM4 cipher >> + * >> + * Copyright (C) 2022 Alibaba Group. >> + * Copyright (C) 2022 Tianjia Zhang >> + * >> + * This file is part of Libgcrypt. >> + * >> + * Libgcrypt is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU Lesser General Public License as >> + * published by the Free Software Foundation; either version 2.1 of >> + * the License, or (at your option) any later version. >> + * >> + * Libgcrypt is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.? See the >> + * GNU Lesser General Public License for more details. >> + * >> + * You should have received a copy of the GNU Lesser General Public >> + * License along with this program; if not, see >> . >> + */ >> + >> +#include "asm-common-aarch64.h" >> + >> +#if defined(__AARCH64EL__) && \ >> +??? defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ >> +??? defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO_SM) && \ >> +??? defined(USE_SM4) >> + >> +.cpu generic+simd+crypto >> + >> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 >> +??? .set .Lv\b\().4s, \b >> +.endr >> + >> +.macro sm4e, vd, vn >> +??? .inst 0xcec08400 | (.L\vn << 5) | .L\vd >> +.endm >> + >> +.macro sm4ekey, vd, vn, vm >> +??? .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd >> +.endm > > I meant that the problem is that ".macro"/".endm"/".set"/".irp" may not > be not supported by all compilers/assemblers. Implementation here could > either: > - Rely on assembler supporting these instructions and use "sm4e" and > "sm4ekey" directly or > - Use preprocessor #define macros instead of assembler .macros to > provide these instructions. Something like this could work: > > #define vecnum_v0 0 > #define vecnum_v1 1 > #define vecnum_v2 2 > #define vecnum_v3 3 > #define vecnum_v4 4 > #define vecnum_v5 5 > #define vecnum_v6 6 > #define vecnum_v7 7 > #define vecnum_v16 16 > #define vecnum_v24 24 > #define vecnum_v25 25 > #define vecnum_v26 26 > #define vecnum_v27 27 > #define vecnum_v28 28 > #define vecnum_v29 29 > #define vecnum_v30 30 > #define vecnum_v31 31 > > #define sm4e(vd,vn) \ > ? .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd) > > #define sm4ekey(vd, vn, vm) \ > ? .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | > vecnum_##vd) > > ... > > #define crypt_blk4(b0, b1, b2, b3)???????? \ > ??????? rev32 b0.16b, b0.16b;????????????? \ > ??????? rev32 b1.16b, b1.16b;????????????? 
\ > ??????? rev32 b2.16b, b2.16b;????????????? \ > ??????? rev32 b3.16b, b3.16b;????????????? \ > ??????? sm4e(b0, v24);???????????????????? \ > ??????? sm4e(b1, v24);???????????????????? \ > > -Jussi Thanks for your suggestion, using #define can solve this problem. Cheers, Tianjia From tianjia.zhang at linux.alibaba.com Tue Mar 1 10:56:54 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 17:56:54 +0800 Subject: [PATCH v3 1/2] hwf-arm: add ARMv8.2 optional crypto extension HW features Message-ID: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com> * src/g10lib.h (HWF_ARM_SHA3, HWF_ARM_SM3, HWF_ARM_SM4) (HWF_ARM_SHA512): New. * src/hwf-arm.c (arm_features): Add sha3, sm3, sm4, sha512 HW features. * src/hwfeatures.c (hwflist): Add sha3, sm3, sm4, sha512 HW features. -- Signed-off-by: Tianjia Zhang --- src/g10lib.h | 4 ++++ src/hwf-arm.c | 16 ++++++++++++++++ src/hwfeatures.c | 4 ++++ 3 files changed, 24 insertions(+) diff --git a/src/g10lib.h b/src/g10lib.h index 22c0f0c2..985e75c6 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -245,6 +245,10 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_ARM_SHA1 (1 << 2) #define HWF_ARM_SHA2 (1 << 3) #define HWF_ARM_PMULL (1 << 4) +#define HWF_ARM_SHA3 (1 << 5) +#define HWF_ARM_SM3 (1 << 6) +#define HWF_ARM_SM4 (1 << 7) +#define HWF_ARM_SHA512 (1 << 8) #elif defined(HAVE_CPU_ARCH_PPC) diff --git a/src/hwf-arm.c b/src/hwf-arm.c index 60107f36..70d375b2 100644 --- a/src/hwf-arm.c +++ b/src/hwf-arm.c @@ -137,6 +137,18 @@ static const struct feature_map_s arm_features[] = #ifndef HWCAP_SHA2 # define HWCAP_SHA2 64 #endif +#ifndef HWCAP_SHA3 +# define HWCAP_SHA3 (1 << 17) +#endif +#ifndef HWCAP_SM3 +# define HWCAP_SM3 (1 << 18) +#endif +#ifndef HWCAP_SM4 +# define HWCAP_SM4 (1 << 19) +#endif +#ifndef HWCAP_SHA512 +# define HWCAP_SHA512 (1 << 21) +#endif static const struct feature_map_s arm_features[] = { @@ -148,6 +160,10 @@ static const struct feature_map_s arm_features[] = { HWCAP_SHA1, 0, " sha1", HWF_ARM_SHA1 }, { HWCAP_SHA2, 0, " sha2", HWF_ARM_SHA2 }, { HWCAP_PMULL, 0, " pmull", HWF_ARM_PMULL }, + { HWCAP_SHA3, 0, " sha3", HWF_ARM_SHA3 }, + { HWCAP_SM3, 0, " sm3", HWF_ARM_SM3 }, + { HWCAP_SM4, 0, " sm4", HWF_ARM_SM4 }, + { HWCAP_SHA512, 0, " sha512", HWF_ARM_SHA512 }, #endif }; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 97e67b3c..7060d995 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -68,6 +68,10 @@ static struct { HWF_ARM_SHA1, "arm-sha1" }, { HWF_ARM_SHA2, "arm-sha2" }, { HWF_ARM_PMULL, "arm-pmull" }, + { HWF_ARM_SHA3, "arm-sha3" }, + { HWF_ARM_SM3, "arm-sm3" }, + { HWF_ARM_SM4, "arm-sm4" }, + { HWF_ARM_SHA512, "arm-sha512" }, #elif defined(HAVE_CPU_ARCH_PPC) { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, -- 2.34.1 From tianjia.zhang at linux.alibaba.com Tue Mar 1 10:56:55 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 1 Mar 2022 17:56:55 +0800 Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com> References: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. * cipher/sm4-armv8-aarch64-ce.S: New. * cipher/sm4.c (USE_ARM_CE): New. (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. 
[USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. (sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption. (sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add ARMv8/AArch64/CE bulk functions. * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. -- This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk functions process eight blocks in parallel. Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 After (10x - 19x faster than ARMv8/AArch64 impl): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 1.25 ns/B 762.7 MiB/s 3.44 c/B 2749 CBC dec | 0.243 ns/B 3927 MiB/s 0.668 c/B 2750 CFB enc | 1.25 ns/B 763.1 MiB/s 3.44 c/B 2750 CFB dec | 0.245 ns/B 3899 MiB/s 0.673 c/B 2750 CTR enc | 0.298 ns/B 3199 MiB/s 0.820 c/B 2750 CTR dec | 0.298 ns/B 3198 MiB/s 0.820 c/B 2750 GCM enc | 0.487 ns/B 1957 MiB/s 1.34 c/B 2749 GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.519 c/B 2750 OCB enc | 0.443 ns/B 2150 MiB/s 1.22 c/B 2749 OCB dec | 0.486 ns/B 1964 MiB/s 1.34 c/B 2750 OCB auth | 0.369 ns/B 2585 MiB/s 1.01 c/B 2749 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 1 + cipher/sm4-armv8-aarch64-ce.S | 580 ++++++++++++++++++++++++++++++++++ cipher/sm4.c | 152 +++++++++ configure.ac | 1 + 4 files changed, 734 insertions(+) create mode 100644 cipher/sm4-armv8-aarch64-ce.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index a7cbf3fc..3339c463 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ + sm4-armv8-aarch64-ce.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S new file mode 100644 index 00000000..5fb55947 --- /dev/null +++ b/cipher/sm4-armv8-aarch64-ce.S @@ -0,0 +1,580 @@ +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. 
+ * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ + defined(USE_SM4) + +.cpu generic+simd+crypto + +#define vecnum_v0 0 +#define vecnum_v1 1 +#define vecnum_v2 2 +#define vecnum_v3 3 +#define vecnum_v4 4 +#define vecnum_v5 5 +#define vecnum_v6 6 +#define vecnum_v7 7 +#define vecnum_v16 16 +#define vecnum_v24 24 +#define vecnum_v25 25 +#define vecnum_v26 26 +#define vecnum_v27 27 +#define vecnum_v28 28 +#define vecnum_v29 29 +#define vecnum_v30 30 +#define vecnum_v31 31 + +#define sm4e(vd, vn) \ + .inst (0xcec08400 | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm4ekey(vd, vn, vm) \ + .inst (0xce60c800 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) + +.text + +/* Register macros */ + +#define RTMP0 v16 +#define RTMP1 v17 +#define RTMP2 v18 +#define RTMP3 v19 + +#define RIV v20 + +/* Helper macros. */ + +#define load_rkey(ptr) \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define crypt_blk4(b0, b1, b2, b3) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + sm4e(b0, v24); \ + sm4e(b1, v24); \ + sm4e(b2, v24); \ + sm4e(b3, v24); \ + sm4e(b0, v25); \ + sm4e(b1, v25); \ + sm4e(b2, v25); \ + sm4e(b3, v25); \ + sm4e(b0, v26); \ + sm4e(b1, v26); \ + sm4e(b2, v26); \ + sm4e(b3, v26); \ + sm4e(b0, v27); \ + sm4e(b1, v27); \ + sm4e(b2, v27); \ + sm4e(b3, v27); \ + sm4e(b0, v28); \ + sm4e(b1, v28); \ + sm4e(b2, v28); \ + sm4e(b3, v28); \ + sm4e(b0, v29); \ + sm4e(b1, v29); \ + sm4e(b2, v29); \ + sm4e(b3, v29); \ + sm4e(b0, v30); \ + sm4e(b1, v30); \ + sm4e(b2, v30); \ + sm4e(b3, v30); \ + sm4e(b0, v31); \ + sm4e(b1, v31); \ + sm4e(b2, v31); \ + sm4e(b3, v31); \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; + +#define crypt_blk8(b0, b1, b2, b3, b4, b5, b6, b7) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; \ + sm4e(b0, v24); \ + sm4e(b1, v24); \ + sm4e(b2, v24); \ + sm4e(b3, v24); \ + sm4e(b4, v24); \ + sm4e(b5, v24); \ + sm4e(b6, v24); \ + sm4e(b7, v24); \ + sm4e(b0, v25); \ + sm4e(b1, v25); \ + sm4e(b2, v25); \ + sm4e(b3, v25); \ + sm4e(b4, v25); \ + sm4e(b5, v25); \ + sm4e(b6, v25); \ + sm4e(b7, v25); \ + sm4e(b0, v26); \ + sm4e(b1, v26); \ + sm4e(b2, v26); \ + sm4e(b3, v26); \ + sm4e(b4, v26); \ + sm4e(b5, v26); \ + sm4e(b6, v26); \ + sm4e(b7, v26); \ + sm4e(b0, v27); \ + sm4e(b1, v27); \ + sm4e(b2, v27); \ + sm4e(b3, v27); \ + sm4e(b4, v27); \ + sm4e(b5, v27); \ + sm4e(b6, v27); \ + sm4e(b7, v27); \ + sm4e(b0, v28); \ + sm4e(b1, v28); \ + sm4e(b2, v28); \ + sm4e(b3, v28); \ + sm4e(b4, v28); \ + sm4e(b5, v28); \ + sm4e(b6, v28); \ + sm4e(b7, v28); \ + 
sm4e(b0, v29); \ + sm4e(b1, v29); \ + sm4e(b2, v29); \ + sm4e(b3, v29); \ + sm4e(b4, v29); \ + sm4e(b5, v29); \ + sm4e(b6, v29); \ + sm4e(b7, v29); \ + sm4e(b0, v30); \ + sm4e(b1, v30); \ + sm4e(b2, v30); \ + sm4e(b3, v30); \ + sm4e(b4, v30); \ + sm4e(b5, v30); \ + sm4e(b6, v30); \ + sm4e(b7, v30); \ + sm4e(b0, v31); \ + sm4e(b1, v31); \ + sm4e(b2, v31); \ + sm4e(b3, v31); \ + sm4e(b4, v31); \ + sm4e(b5, v31); \ + sm4e(b6, v31); \ + sm4e(b7, v31); \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + rev64 b4.4s, b4.4s; \ + rev64 b5.4s, b5.4s; \ + rev64 b6.4s, b6.4s; \ + rev64 b7.4s, b7.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + ext b4.16b, b4.16b, b4.16b, #8; \ + ext b5.16b, b5.16b, b5.16b, #8; \ + ext b6.16b, b6.16b, b6.16b, #8; \ + ext b7.16b, b7.16b, b7.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; + + +.align 3 +.global _gcry_sm4_armv8_ce_expand_key +ELF(.type _gcry_sm4_armv8_ce_expand_key,%function;) +_gcry_sm4_armv8_ce_expand_key: + /* input: + * x0: 128-bit key + * x1: rkey_enc + * x2: rkey_dec + * x3: fk array + * x4: ck array + */ + CFI_STARTPROC(); + + ld1 {v0.16b}, [x0]; + rev32 v0.16b, v0.16b; + ld1 {v1.16b}, [x3]; + load_rkey(x4); + + /* input ^ fk */ + eor v0.16b, v0.16b, v1.16b; + + sm4ekey(v0, v0, v24); + sm4ekey(v1, v0, v25); + sm4ekey(v2, v1, v26); + sm4ekey(v3, v2, v27); + sm4ekey(v4, v3, v28); + sm4ekey(v5, v4, v29); + sm4ekey(v6, v5, v30); + sm4ekey(v7, v6, v31); + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1]; + rev64 v7.4s, v7.4s; + rev64 v6.4s, v6.4s; + rev64 v5.4s, v5.4s; + rev64 v4.4s, v4.4s; + rev64 v3.4s, v3.4s; + rev64 v2.4s, v2.4s; + rev64 v1.4s, v1.4s; + rev64 v0.4s, v0.4s; + ext v7.16b, v7.16b, v7.16b, #8; + ext v6.16b, v6.16b, v6.16b, #8; + ext v5.16b, v5.16b, v5.16b, #8; + ext v4.16b, v4.16b, v4.16b, #8; + ext v3.16b, v3.16b, v3.16b, #8; + ext v2.16b, v2.16b, v2.16b, #8; + ext v1.16b, v1.16b, v1.16b, #8; + ext v0.16b, v0.16b, v0.16b, #8; + st1 {v7.16b}, [x2], #16; + st1 {v6.16b}, [x2], #16; + st1 {v5.16b}, [x2], #16; + st1 {v4.16b}, [x2], #16; + st1 {v3.16b}, [x2], #16; + st1 {v2.16b}, [x2], #16; + st1 {v1.16b}, [x2], #16; + st1 {v0.16b}, [x2]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;) + +.align 3 +ELF(.type sm4_armv8_ce_crypt_blk1_4,%function;) +sm4_armv8_ce_crypt_blk1_4: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..4) + */ + CFI_STARTPROC(); + + load_rkey(x0); + + ld1 {v0.16b}, [x2], #16; + mov v1.16b, v0.16b; + mov v2.16b, v0.16b; + mov v3.16b, v0.16b; + cmp x3, #2; + blt .Lblk4_load_input_done; + ld1 {v1.16b}, [x2], #16; + beq .Lblk4_load_input_done; + ld1 {v2.16b}, [x2], #16; + cmp x3, #3; + beq .Lblk4_load_input_done; + ld1 {v3.16b}, [x2]; + +.Lblk4_load_input_done: + crypt_blk4(v0, v1, v2, v3); + + st1 {v0.16b}, [x1], #16; + cmp x3, #2; + blt .Lblk4_store_output_done; + st1 {v1.16b}, [x1], #16; + beq .Lblk4_store_output_done; + st1 {v2.16b}, [x1], #16; + cmp x3, #3; + beq .Lblk4_store_output_done; + st1 {v3.16b}, [x1]; + +.Lblk4_store_output_done: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_armv8_ce_crypt_blk1_4,.-sm4_armv8_ce_crypt_blk1_4;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt_blk1_8 
+ELF(.type _gcry_sm4_armv8_ce_crypt_blk1_8,%function;) +_gcry_sm4_armv8_ce_crypt_blk1_8: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..8) + */ + CFI_STARTPROC(); + + cmp x3, #5; + blt sm4_armv8_ce_crypt_blk1_4; + + load_rkey(x0); + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b}, [x2], #16; + mov v5.16b, v4.16b; + mov v6.16b, v4.16b; + mov v7.16b, v4.16b; + beq .Lblk8_load_input_done; + ld1 {v5.16b}, [x2], #16; + cmp x3, #7; + blt .Lblk8_load_input_done; + ld1 {v6.16b}, [x2], #16; + beq .Lblk8_load_input_done; + ld1 {v7.16b}, [x2]; + +.Lblk8_load_input_done: + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + cmp x3, #6; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b}, [x1], #16; + blt .Lblk8_store_output_done; + st1 {v5.16b}, [x1], #16; + beq .Lblk8_store_output_done; + st1 {v6.16b}, [x1], #16; + cmp x3, #7; + beq .Lblk8_store_output_done; + st1 {v7.16b}, [x1]; + +.Lblk8_store_output_done: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt +ELF(.type _gcry_sm4_armv8_ce_crypt,%function;) +_gcry_sm4_armv8_ce_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + +.Lcrypt_loop_blk: + subs x3, x3, #8; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2], #64; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcrypt_loop_blk; + +.Lcrypt_end: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;) + +.align 3 +.global _gcry_sm4_armv8_ce_cbc_dec +ELF(.type _gcry_sm4_armv8_ce_cbc_dec,%function;) +_gcry_sm4_armv8_ce_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + ld1 {RIV.16b}, [x3]; + +.Lcbc_loop_blk: + subs x4, x4, #8; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #64; + eor v0.16b, v0.16b, RIV.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + eor v4.16b, v4.16b, RTMP3.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v5.16b, v5.16b, RTMP0.16b; + eor v6.16b, v6.16b, RTMP1.16b; + eor v7.16b, v7.16b, RTMP2.16b; + + mov RIV.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcbc_loop_blk; + +.Lcbc_end: + /* store new IV */ + st1 {RIV.16b}, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_cfb_dec +ELF(.type _gcry_sm4_armv8_ce_cfb_dec,%function;) +_gcry_sm4_armv8_ce_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk: + subs x4, x4, #8; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #48; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 
{v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_ctr_enc +ELF(.type _gcry_sm4_armv8_ce_ctr_enc,%function;) +_gcry_sm4_armv8_ce_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk: + subs x4, x4, #8; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + inc_le128(v4); /* +4 */ + inc_le128(v5); /* +5 */ + inc_le128(v6); /* +6 */ + inc_le128(v7); /* +7 */ + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lctr_loop_blk; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index ec2281b6..79e6dbf1 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -76,6 +76,15 @@ # endif #endif +#undef USE_ARM_CE +#ifdef ENABLE_ARM_CRYPTO_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -106,6 +115,9 @@ typedef struct #ifdef USE_AARCH64_SIMD unsigned int use_aarch64_simd:1; #endif +#ifdef USE_ARM_CE + unsigned int use_arm_ce:1; +#endif } SM4_context; static const u32 fk[4] = @@ -286,6 +298,43 @@ sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, } #endif /* USE_AARCH64_SIMD */ +#ifdef USE_ARM_CE +extern void _gcry_sm4_armv8_ce_expand_key(const byte *key, + u32 *rkey_enc, u32 *rkey_dec, + const u32 *fk, const u32 *ck); + +extern void _gcry_sm4_armv8_ce_crypt(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +extern void _gcry_sm4_armv8_ce_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +static inline unsigned int +sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int 
num_blks) +{ + _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks); + return 0; +} +#endif /* USE_ARM_CE */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -363,6 +412,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key) } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + _gcry_sm4_armv8_ce_expand_key (key, ctx->rkey_enc, ctx->rkey_dec, + fk, ck); + return; + } +#endif + rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0]; rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1]; rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2]; @@ -420,6 +478,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AARCH64_SIMD ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); #endif +#ifdef USE_ARM_CE + ctx->use_arm_ce = !!(hwf & HWF_ARM_SM4); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -465,6 +526,11 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_enc, outbuf, inbuf, 1); +#endif + prefetch_sbox_table (); return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf); @@ -475,6 +541,11 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_dec, outbuf, inbuf, 1); +#endif + prefetch_sbox_table (); return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf); @@ -601,6 +672,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -634,6 +722,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -725,6 +819,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -758,6 +869,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -842,6 +959,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. 
*/ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -875,6 +1009,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1037,6 +1177,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1203,6 +1349,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { diff --git a/configure.ac b/configure.ac index f5363f22..e20f9d13 100644 --- a/configure.ac +++ b/configure.ac @@ -2755,6 +2755,7 @@ if test "$found" = "1" ; then aarch64-*-*) # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-armv8-aarch64-ce.lo" esac fi -- 2.34.1 From fweimer at redhat.com Tue Mar 1 12:17:06 2022 From: fweimer at redhat.com (Florian Weimer) Date: Tue, 01 Mar 2022 12:17:06 +0100 Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section In-Reply-To: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> (Clemens Lang's message of "Mon, 14 Feb 2022 13:46:19 +0100") References: <20220211155723.86516-1-cllang@redhat.com> <87pmnttmep.fsf@oldenburg.str.redhat.com> <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> Message-ID: <87bkypq5vx.fsf@oldenburg.str.redhat.com> * Clemens Lang: > From what I can see, it currently uses the same approach, and probably > has the same issue where the compiler could assume that the HMAC is 0 > and constant-propagate that. Again, this currently works just fine with > GCC, but I don?t think it?s a good idea to rely on GCC?s unwillingness > to replace a memcmp(3) with a few assembly instructions. Maybe GCC should provide an explicit way to treat a data object as constant for linking purposes, while having a compiler barrier around that. Separate compilation does that, but the data object must not be compiled with LTO, and that probably any whole-world assumption that makes LTO so successful. > The currently merged state assumes the offset in the file matches the > address at runtime. This is probably not a good assumption to make. How > would you determine the offset of a symbol in a file given its runtime > address? Find the matching program header entry that must have loaded it > and subtracting the difference between p_vaddr and p_offset? Yes, I think that should work. 
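Concretely, that lookup could be done with dl_iterate_phdr(): walk the PT_LOAD entries of the object that covers the address and the file offset falls out as (runtime address - load bias) - p_vaddr + p_offset.  A minimal sketch, with helper names and error handling of my own choosing rather than anything from the merged code:

  #define _GNU_SOURCE
  #include <link.h>
  #include <stddef.h>
  #include <stdint.h>

  struct addr_to_off { uintptr_t addr; long offset; };

  static int
  phdr_cb (struct dl_phdr_info *info, size_t size, void *data)
  {
    struct addr_to_off *a = data;
    size_t i;

    (void)size;
    for (i = 0; i < info->dlpi_phnum; i++)
      {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        uintptr_t seg_start = info->dlpi_addr + ph->p_vaddr;

        /* Use p_filesz, not p_memsz, so that addresses in .bss (which
           have no backing bytes in the file) are not mapped to a bogus
           file offset.  */
        if (ph->p_type == PT_LOAD
            && a->addr >= seg_start
            && a->addr < seg_start + ph->p_filesz)
          {
            /* file offset = (runtime addr - load bias) - p_vaddr + p_offset */
            a->offset = (long)(a->addr - info->dlpi_addr
                               - ph->p_vaddr + ph->p_offset);
            return 1;   /* stop iterating */
          }
      }
    return 0;           /* not this object, try the next one */
  }

  static long
  file_offset_of_address (const void *addr)
  {
    struct addr_to_off a = { (uintptr_t) addr, -1 };

    dl_iterate_phdr (phdr_cb, &a);
    return a.offset;    /* -1 if no PT_LOAD segment covers the address */
  }

For the HMAC self-check the address of interest lives in libgcrypt.so itself, so only that object's PT_LOAD entries can ever match.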
Thanks, Florian From jussi.kivilinna at iki.fi Wed Mar 2 20:13:36 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 2 Mar 2022 21:13:36 +0200 Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com> References: <20220301095655.31234-1-tianjia.zhang@linux.alibaba.com> <20220301095655.31234-2-tianjia.zhang@linux.alibaba.com> Message-ID: <2898303a-2dda-f96c-30bd-ded69528b7fa@iki.fi> Hello, Applied to master. Thanks. -Jussi On 1.3.2022 11.56, Tianjia Zhang wrote: > * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. > * cipher/sm4-armv8-aarch64-ce.S: New. > * cipher/sm4.c (USE_ARM_CE): New. > (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. > [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) > (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) > (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) > (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. > (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. > (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. > (sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption. > (sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption. > (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) > (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add > ARMv8/AArch64/CE bulk functions. > * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. > -- > > This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk > functions process eight blocks in parallel. > From jussi.kivilinna at iki.fi Sun Mar 6 18:19:09 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 6 Mar 2022 19:19:09 +0200 Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512" In-Reply-To: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> Message-ID: <20220306171910.1011180-2-jussi.kivilinna@iki.fi> * configure.ac (avx512support, gcry_cv_gcc_inline_asm_avx512) (ENABLE_AVX512_SUPPORT): New. * src/g10lib.h (HWF_INTEL_AVX512): New. * src/hwf-x86.c (detect_x86_gnuc): Add AVX512 detection. * src/hwfeatures.c (hwflist): Add "intel-avx512". -- Signed-off-by: Jussi Kivilinna --- configure.ac | 36 +++++++++++++++++++++++++++++++++++ src/g10lib.h | 1 + src/hwf-x86.c | 49 +++++++++++++++++++++++++++++++++++++++++++++--- src/hwfeatures.c | 1 + 4 files changed, 84 insertions(+), 3 deletions(-) diff --git a/configure.ac b/configure.ac index e20f9d13..27d72141 100644 --- a/configure.ac +++ b/configure.ac @@ -667,6 +667,14 @@ AC_ARG_ENABLE(avx2-support, avx2support=$enableval,avx2support=yes) AC_MSG_RESULT($avx2support) +# Implementation of the --disable-avx512-support switch. +AC_MSG_CHECKING([whether AVX512 support is requested]) +AC_ARG_ENABLE(avx512-support, + AS_HELP_STRING([--disable-avx512-support], + [Disable support for the Intel AVX512 instructions]), + avx512support=$enableval,avx512support=yes) +AC_MSG_RESULT($avx512support) + # Implementation of the --disable-neon-support switch. 
AC_MSG_CHECKING([whether NEON support is requested]) AC_ARG_ENABLE(neon-support, @@ -1545,6 +1553,29 @@ if test "$gcry_cv_gcc_inline_asm_avx2" = "yes" ; then fi +# +# Check whether GCC inline assembler supports AVX512 instructions +# +AC_CACHE_CHECK([whether GCC inline assembler supports AVX512 instructions], + [gcry_cv_gcc_inline_asm_avx512], + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then + gcry_cv_gcc_inline_asm_avx512="n/a" + else + gcry_cv_gcc_inline_asm_avx512=no + AC_LINK_IFELSE([AC_LANG_PROGRAM( + [[void a(void) { + __asm__("xgetbv; vpopcntq %%zmm7, %%zmm1%{%%k1%}%{z%};\n\t":::"cc"); + __asm__("vpexpandb %%zmm3, %%zmm1;\n\t":::"cc"); + }]], [ a(); ] )], + [gcry_cv_gcc_inline_asm_avx512=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_avx512" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_AVX512,1, + [Defined if inline assembler supports AVX512 instructions]) +fi + + # # Check whether GCC inline assembler supports VAES and VPCLMUL instructions # @@ -2409,6 +2440,10 @@ if test x"$avx2support" = xyes ; then AC_DEFINE(ENABLE_AVX2_SUPPORT,1, [Enable support for Intel AVX2 instructions.]) fi +if test x"$avx512support" = xyes ; then + AC_DEFINE(ENABLE_AVX512_SUPPORT,1, + [Enable support for Intel AVX512 instructions.]) +fi if test x"$neonsupport" = xyes ; then AC_DEFINE(ENABLE_NEON_SUPPORT,1, [Enable support for ARM NEON instructions.]) @@ -3266,6 +3301,7 @@ GCRY_MSG_SHOW([Try using Intel SSE4.1: ],[$sse41support]) GCRY_MSG_SHOW([Try using DRNG (RDRAND): ],[$drngsupport]) GCRY_MSG_SHOW([Try using Intel AVX: ],[$avxsupport]) GCRY_MSG_SHOW([Try using Intel AVX2: ],[$avx2support]) +GCRY_MSG_SHOW([Try using Intel AVX512: ],[$avx512support]) GCRY_MSG_SHOW([Try using ARM NEON: ],[$neonsupport]) GCRY_MSG_SHOW([Try using ARMv8 crypto: ],[$armcryptosupport]) GCRY_MSG_SHOW([Try using PPC crypto: ],[$ppccryptosupport]) diff --git a/src/g10lib.h b/src/g10lib.h index 985e75c6..c07ed788 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -237,6 +237,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_INTEL_RDTSC (1 << 15) #define HWF_INTEL_SHAEXT (1 << 16) #define HWF_INTEL_VAES_VPCLMUL (1 << 17) +#define HWF_INTEL_AVX512 (1 << 18) #elif defined(HAVE_CPU_ARCH_ARM) diff --git a/src/hwf-x86.c b/src/hwf-x86.c index a1aa02e7..0a5266c6 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -182,12 +182,14 @@ detect_x86_gnuc (void) } vendor_id; unsigned int features, features2; unsigned int os_supports_avx_avx2_registers = 0; + unsigned int os_supports_avx512_registers = 0; unsigned int max_cpuid_level; unsigned int fms, family, model; unsigned int result = 0; unsigned int avoid_vpgather = 0; (void)os_supports_avx_avx2_registers; + (void)os_supports_avx512_registers; if (!is_cpuid_available()) return 0; @@ -338,13 +340,22 @@ detect_x86_gnuc (void) if (features & 0x02000000) result |= HWF_INTEL_AESNI; #endif /*ENABLE_AESNI_SUPPORT*/ -#if defined(ENABLE_AVX_SUPPORT) || defined(ENABLE_AVX2_SUPPORT) - /* Test bit 27 for OSXSAVE (required for AVX/AVX2). */ +#if defined(ENABLE_AVX_SUPPORT) || defined(ENABLE_AVX2_SUPPORT) \ + || defined(ENABLE_AVX512_SUPPORT) + /* Test bit 27 for OSXSAVE (required for AVX/AVX2/AVX512). */ if (features & 0x08000000) { + unsigned int xmm_ymm_mask = (1 << 2) | (1 << 1); + unsigned int zmm15_ymm31_k7_mask = (1 << 7) | (1 << 6) | (1 << 5); + unsigned int xgetbv = get_xgetbv(); + /* Check that OS has enabled both XMM and YMM state support. 
*/ - if ((get_xgetbv() & 0x6) == 0x6) + if ((xgetbv & xmm_ymm_mask) == xmm_ymm_mask) os_supports_avx_avx2_registers = 1; + + /* Check that OS has enabled both XMM and YMM state support. */ + if ((xgetbv & zmm15_ymm31_k7_mask) == zmm15_ymm31_k7_mask) + os_supports_avx512_registers = 1; } #endif #ifdef ENABLE_AVX_SUPPORT @@ -396,6 +407,38 @@ detect_x86_gnuc (void) if ((features2 & 0x00000200) && (features2 & 0x00000400)) result |= HWF_INTEL_VAES_VPCLMUL; #endif + +#ifdef ENABLE_AVX512_SUPPORT + /* Test for AVX512 features. List of features is selected so that + * supporting CPUs are new enough not to suffer from reduced clock + * frequencies when AVX512 is used, which was issue on early AVX512 + * capable CPUs. + * - AVX512F (features bit 16) + * - AVX512DQ (features bit 17) + * - AVX512IFMA (features bit 21) + * - AVX512CD (features bit 28) + * - AVX512BW (features bit 30) + * - AVX512VL (features bit 31) + * - AVX512_VBMI (features2 bit 1) + * - AVX512_VBMI2 (features2 bit 6) + * - AVX512_VNNI (features2 bit 11) + * - AVX512_BITALG (features2 bit 12) + * - AVX512_VPOPCNTDQ (features2 bit 14) + */ + if (os_supports_avx512_registers + && (features & (1 << 16)) + && (features & (1 << 17)) + && (features & (1 << 21)) + && (features & (1 << 28)) + && (features & (1 << 30)) + && (features & (1 << 31)) + && (features2 & (1 << 1)) + && (features2 & (1 << 6)) + && (features2 & (1 << 11)) + && (features2 & (1 << 12)) + && (features2 & (1 << 14))) + result |= HWF_INTEL_AVX512; +#endif } return result; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 7060d995..8e92cbdd 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -62,6 +62,7 @@ static struct { HWF_INTEL_RDTSC, "intel-rdtsc" }, { HWF_INTEL_SHAEXT, "intel-shaext" }, { HWF_INTEL_VAES_VPCLMUL, "intel-vaes-vpclmul" }, + { HWF_INTEL_AVX512, "intel-avx512" }, #elif defined(HAVE_CPU_ARCH_ARM) { HWF_ARM_NEON, "arm-neon" }, { HWF_ARM_AES, "arm-aes" }, -- 2.32.0 From jussi.kivilinna at iki.fi Sun Mar 6 18:19:08 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 6 Mar 2022 19:19:08 +0200 Subject: [PATCH 1/3] ghash|polyval: add x86_64 VPCLMUL/AVX2 accelerated implementation Message-ID: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> * cipher/cipher-gcm-intel-pclmul.c (GCM_INTEL_USE_VPCLMUL_AVX2) (GCM_INTEL_AGGR8_TABLE_INITIALIZED) (GCM_INTEL_AGGR16_TABLE_INITIALIZED): New. (gfmul_pclmul): Fixes to comments. [GCM_USE_INTEL_VPCLMUL_AVX2] (GFMUL_AGGR16_ASM_VPCMUL_AVX2) (gfmul_vpclmul_avx2_aggr16, gfmul_vpclmul_avx2_aggr16_le) (gfmul_pclmul_avx2, gcm_lsh_avx2, load_h1h2_to_ymm1) (ghash_setup_aggr8_avx2, ghash_setup_aggr16_avx2): New. (_gcry_ghash_setup_intel_pclmul): Add 'hw_features' parameter; Setup ghash and polyval function pointers for context; Add VPCLMUL/AVX2 code path; Defer aggr8 and aggr16 table initialization to until first use in '_gcry_ghash_intel_pclmul' or '_gcry_polyval_intel_pclmul'. [__x86_64__] (ghash_setup_aggr8): New. (_gcry_ghash_intel_pclmul): Add VPCLMUL/AVX2 code path; Add call for aggr8 table initialization. (_gcry_polyval_intel_pclmul): Add VPCLMUL/AVX2 code path; Add call for aggr8 table initialization. * cipher/cipher-gcm.c [GCM_USE_INTEL_PCLMUL] (_gcry_ghash_intel_pclmul) (_gcry_polyval_intel_pclmul): Remove. [GCM_USE_INTEL_PCLMUL] (_gcry_ghash_setup_intel_pclmul): Add 'hw_features' parameter. (setupM) [GCM_USE_INTEL_PCLMUL]: Pass HW features to '_gcry_ghash_setup_intel_pclmul'; Let '_gcry_ghash_setup_intel_pclmul' setup function pointers. * cipher/cipher-internal.h (GCM_USE_INTEL_VPCLMUL_AVX2): New. 
(gcry_cipher_handle): Add member 'gcm.hw_impl_flags'. -- Patch adds VPCLMUL/AVX2 accelerated implementation for GHASH (GCM) and POLYVAL (GCM-SIV). Benchmark on AMD Ryzen 5800X (zen3): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.088 ns/B 10825 MiB/s 0.427 c/B 4850 GCM-SIV auth | 0.083 ns/B 11472 MiB/s 0.403 c/B 4850 After: (~1.93x faster) | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.045 ns/B 21098 MiB/s 0.219 c/B 4850 GCM-SIV auth | 0.043 ns/B 22181 MiB/s 0.209 c/B 4850 AES128-GCM / AES128-GCM-SIV encryption: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM enc | 0.079 ns/B 12073 MiB/s 0.383 c/B 4850 GCM-SIV enc | 0.076 ns/B 12500 MiB/s 0.370 c/B 4850 Benchmark on Intel Core i3-1115G4 (tigerlake): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.080 ns/B 11919 MiB/s 0.327 c/B 4090 GCM-SIV auth | 0.075 ns/B 12643 MiB/s 0.309 c/B 4090 After: (~1.28x faster) | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.062 ns/B 15348 MiB/s 0.254 c/B 4090 GCM-SIV auth | 0.058 ns/B 16381 MiB/s 0.238 c/B 4090 AES128-GCM / AES128-GCM-SIV encryption: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM enc | 0.101 ns/B 9441 MiB/s 0.413 c/B 4090 GCM-SIV enc | 0.098 ns/B 9692 MiB/s 0.402 c/B 4089 Signed-off-by: Jussi Kivilinna --- cipher/cipher-gcm-intel-pclmul.c | 809 +++++++++++++++++++++++++++---- cipher/cipher-gcm.c | 15 +- cipher/cipher-internal.h | 11 + 3 files changed, 724 insertions(+), 111 deletions(-) diff --git a/cipher/cipher-gcm-intel-pclmul.c b/cipher/cipher-gcm-intel-pclmul.c index daf807d0..b7324e8f 100644 --- a/cipher/cipher-gcm-intel-pclmul.c +++ b/cipher/cipher-gcm-intel-pclmul.c @@ -1,6 +1,6 @@ /* cipher-gcm-intel-pclmul.c - Intel PCLMUL accelerated Galois Counter Mode * implementation - * Copyright (C) 2013-2014,2019 Jussi Kivilinna + * Copyright (C) 2013-2014,2019,2022 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -49,12 +49,18 @@ #define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define GCM_INTEL_USE_VPCLMUL_AVX2 (1 << 0) +#define GCM_INTEL_AGGR8_TABLE_INITIALIZED (1 << 1) +#define GCM_INTEL_AGGR16_TABLE_INITIALIZED (1 << 2) + + /* Intel PCLMUL ghash based on white paper: "Intel? Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode - Rev 2.01"; Shay Gueron, Michael E. Kounavis. */ -static ASM_FUNC_ATTR_INLINE void reduction(void) +static ASM_FUNC_ATTR_INLINE +void reduction(void) { /* input: */ @@ -83,7 +89,8 @@ static ASM_FUNC_ATTR_INLINE void reduction(void) ::: "memory" ); } -static ASM_FUNC_ATTR_INLINE void gfmul_pclmul(void) +static ASM_FUNC_ATTR_INLINE +void gfmul_pclmul(void) { /* Input: XMM0 and XMM1, Output: XMM1. Input XMM0 stays unmodified. Input must be converted to little-endian. 
@@ -358,12 +365,12 @@ gfmul_pclmul_aggr4_le(const void *buf, const void *h_1, const void *h_table) \ "pshufd $78, %%xmm8, %%xmm11\n\t" \ "pshufd $78, %%xmm5, %%xmm7\n\t" \ - "pxor %%xmm8, %%xmm11\n\t" /* xmm11 holds 4:a0+a1 */ \ - "pxor %%xmm5, %%xmm7\n\t" /* xmm7 holds 4:b0+b1 */ \ + "pxor %%xmm8, %%xmm11\n\t" /* xmm11 holds 2:a0+a1 */ \ + "pxor %%xmm5, %%xmm7\n\t" /* xmm7 holds 2:b0+b1 */ \ "movdqa %%xmm8, %%xmm6\n\t" \ - "pclmulqdq $0, %%xmm5, %%xmm6\n\t" /* xmm6 holds 4:a0*b0 */ \ - "pclmulqdq $17, %%xmm8, %%xmm5\n\t" /* xmm5 holds 4:a1*b1 */ \ - "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 4:(a0+a1)*(b0+b1) */ \ + "pclmulqdq $0, %%xmm5, %%xmm6\n\t" /* xmm6 holds 2:a0*b0 */ \ + "pclmulqdq $17, %%xmm8, %%xmm5\n\t" /* xmm5 holds 2:a1*b1 */ \ + "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 2:(a0+a1)*(b0+b1) */ \ \ "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 2+3+4+5+6+7+8:a0*b0 */ \ "pxor %%xmm5, %%xmm1\n\t" /* xmm1 holds 2+3+4+5+6+7+8:a1*b1 */ \ @@ -371,16 +378,16 @@ gfmul_pclmul_aggr4_le(const void *buf, const void *h_1, const void *h_table) \ "pshufd $78, %%xmm0, %%xmm11\n\t" \ "pshufd $78, %%xmm2, %%xmm7\n\t" \ - "pxor %%xmm0, %%xmm11\n\t" /* xmm11 holds 3:a0+a1 */ \ - "pxor %%xmm2, %%xmm7\n\t" /* xmm7 holds 3:b0+b1 */ \ + "pxor %%xmm0, %%xmm11\n\t" /* xmm11 holds 1:a0+a1 */ \ + "pxor %%xmm2, %%xmm7\n\t" /* xmm7 holds 1:b0+b1 */ \ "movdqa %%xmm0, %%xmm6\n\t" \ - "pclmulqdq $0, %%xmm2, %%xmm6\n\t" /* xmm6 holds 3:a0*b0 */ \ - "pclmulqdq $17, %%xmm0, %%xmm2\n\t" /* xmm2 holds 3:a1*b1 */ \ - "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 3:(a0+a1)*(b0+b1) */ \ + "pclmulqdq $0, %%xmm2, %%xmm6\n\t" /* xmm6 holds 1:a0*b0 */ \ + "pclmulqdq $17, %%xmm0, %%xmm2\n\t" /* xmm2 holds 1:a1*b1 */ \ + "pclmulqdq $0, %%xmm11, %%xmm7\n\t" /* xmm7 holds 1:(a0+a1)*(b0+b1) */ \ \ - "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 1+2+3+3+4+5+6+7+8:a0*b0 */ \ - "pxor %%xmm2, %%xmm1\n\t" /* xmm1 holds 1+2+3+3+4+5+6+7+8:a1*b1 */ \ - "pxor %%xmm7, %%xmm4\n\t"/* xmm4 holds 1+2+3+3+4+5+6+7+8:(a0+a1)*(b0+b1) */\ + "pxor %%xmm6, %%xmm3\n\t" /* xmm3 holds 1+2+3+4+5+6+7+8:a0*b0 */ \ + "pxor %%xmm2, %%xmm1\n\t" /* xmm1 holds 1+2+3+4+5+6+7+8:a1*b1 */ \ + "pxor %%xmm7, %%xmm4\n\t"/* xmm4 holds 1+2+3+4+5+6+7+8:(a0+a1)*(b0+b1) */ \ \ /* aggregated reduction... */ \ "movdqa %%xmm3, %%xmm5\n\t" \ @@ -432,14 +439,409 @@ gfmul_pclmul_aggr8_le(const void *buf, const void *h_table) reduction(); } -#endif -static ASM_FUNC_ATTR_INLINE void gcm_lsh(void *h, unsigned int hoffs) +#ifdef GCM_USE_INTEL_VPCLMUL_AVX2 + +#define GFMUL_AGGR16_ASM_VPCMUL_AVX2(be_to_le) \ + /* perform clmul and merge results... 
*/ \ + "vmovdqu 0*16(%[buf]), %%ymm5\n\t" \ + "vmovdqu 2*16(%[buf]), %%ymm2\n\t" \ + be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */ \ + "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" \ + \ + "vpshufd $78, %%ymm0, %%ymm5\n\t" \ + "vpshufd $78, %%ymm1, %%ymm4\n\t" \ + "vpxor %%ymm0, %%ymm5, %%ymm5\n\t" /* ymm5 holds 15|16:a0+a1 */ \ + "vpxor %%ymm1, %%ymm4, %%ymm4\n\t" /* ymm4 holds 15|16:b0+b1 */ \ + "vpclmulqdq $0, %%ymm1, %%ymm0, %%ymm3\n\t" /* ymm3 holds 15|16:a0*b0 */ \ + "vpclmulqdq $17, %%ymm0, %%ymm1, %%ymm1\n\t" /* ymm1 holds 15|16:a1*b1 */ \ + "vpclmulqdq $0, %%ymm5, %%ymm4, %%ymm4\n\t" /* ymm4 holds 15|16:(a0+a1)*(b0+b1) */ \ + \ + "vmovdqu %[h1_h2], %%ymm0\n\t" \ + \ + "vpshufd $78, %%ymm13, %%ymm14\n\t" \ + "vpshufd $78, %%ymm2, %%ymm7\n\t" \ + "vpxor %%ymm13, %%ymm14, %%ymm14\n\t" /* ymm14 holds 13|14:a0+a1 */ \ + "vpxor %%ymm2, %%ymm7, %%ymm7\n\t" /* ymm7 holds 13|14:b0+b1 */ \ + "vpclmulqdq $0, %%ymm2, %%ymm13, %%ymm6\n\t" /* ymm6 holds 13|14:a0*b0 */ \ + "vpclmulqdq $17, %%ymm13, %%ymm2, %%ymm2\n\t" /* ymm2 holds 13|14:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 13|14:(a0+a1)*(b0+b1) */\ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 13+15|14+16:a0*b0 */ \ + "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 13+15|14+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 13+15|14+16:(a0+a1)*(b0+b1) */ \ + \ + "vmovdqu 4*16(%[buf]), %%ymm5\n\t" \ + "vmovdqu 6*16(%[buf]), %%ymm2\n\t" \ + be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%ymm12, %%ymm14\n\t" \ + "vpshufd $78, %%ymm5, %%ymm7\n\t" \ + "vpxor %%ymm12, %%ymm14, %%ymm14\n\t" /* ymm14 holds 11|12:a0+a1 */ \ + "vpxor %%ymm5, %%ymm7, %%ymm7\n\t" /* ymm7 holds 11|12:b0+b1 */ \ + "vpclmulqdq $0, %%ymm5, %%ymm12, %%ymm6\n\t" /* ymm6 holds 11|12:a0*b0 */ \ + "vpclmulqdq $17, %%ymm12, %%ymm5, %%ymm5\n\t" /* ymm5 holds 11|12:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 11|12:(a0+a1)*(b0+b1) */\ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 11+13+15|12+14+16:a0*b0 */ \ + "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 11+13+15|12+14+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 11+13+15|12+14+16:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%ymm11, %%ymm14\n\t" \ + "vpshufd $78, %%ymm2, %%ymm7\n\t" \ + "vpxor %%ymm11, %%ymm14, %%ymm14\n\t" /* ymm14 holds 9|10:a0+a1 */ \ + "vpxor %%ymm2, %%ymm7, %%ymm7\n\t" /* ymm7 holds 9|10:b0+b1 */ \ + "vpclmulqdq $0, %%ymm2, %%ymm11, %%ymm6\n\t" /* ymm6 holds 9|10:a0*b0 */ \ + "vpclmulqdq $17, %%ymm11, %%ymm2, %%ymm2\n\t" /* ymm2 holds 9|10:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 9|10:(a0+a1)*(b0+b1) */ \ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 9+11+?+15|10+12+?+16:a0*b0 */ \ + "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 9+11+?+15|10+12+?+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 9+11+?+15|10+12+?+16:(a0+a1)*(b0+b1) */\ + \ + "vmovdqu 8*16(%[buf]), %%ymm5\n\t" \ + "vmovdqu 10*16(%[buf]), %%ymm2\n\t" \ + be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%ymm10, %%ymm14\n\t" \ + "vpshufd $78, %%ymm5, %%ymm7\n\t" \ + "vpxor %%ymm10, %%ymm14, %%ymm14\n\t" /* ymm14 holds 7|8:a0+a1 */ \ + "vpxor %%ymm5, %%ymm7, %%ymm7\n\t" /* ymm7 holds 7|8:b0+b1 */ \ + 
"vpclmulqdq $0, %%ymm5, %%ymm10, %%ymm6\n\t" /* ymm6 holds 7|8:a0*b0 */ \ + "vpclmulqdq $17, %%ymm10, %%ymm5, %%ymm5\n\t" /* ymm5 holds 7|8:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 7|8:(a0+a1)*(b0+b1) */ \ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 7+9+?+15|8+10+?+16:a0*b0 */ \ + "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 7+9+?+15|8+10+?+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 7+9+?+15|8+10+?+16:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%ymm9, %%ymm14\n\t" \ + "vpshufd $78, %%ymm2, %%ymm7\n\t" \ + "vpxor %%ymm9, %%ymm14, %%ymm14\n\t" /* ymm14 holds 5|6:a0+a1 */ \ + "vpxor %%ymm2, %%ymm7, %%ymm7\n\t" /* ymm7 holds 5|6:b0+b1 */ \ + "vpclmulqdq $0, %%ymm2, %%ymm9, %%ymm6\n\t" /* ymm6 holds 5|6:a0*b0 */ \ + "vpclmulqdq $17, %%ymm9, %%ymm2, %%ymm2\n\t" /* ymm2 holds 5|6:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 5|6:(a0+a1)*(b0+b1) */ \ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 5+7+?+15|6+8+?+16:a0*b0 */ \ + "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 5+7+?+15|6+8+?+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 5+7+?+15|6+8+?+16:(a0+a1)*(b0+b1) */\ + \ + "vmovdqu 12*16(%[buf]), %%ymm5\n\t" \ + "vmovdqu 14*16(%[buf]), %%ymm2\n\t" \ + be_to_le("vpshufb %%ymm15, %%ymm5, %%ymm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%ymm15, %%ymm2, %%ymm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%ymm8, %%ymm14\n\t" \ + "vpshufd $78, %%ymm5, %%ymm7\n\t" \ + "vpxor %%ymm8, %%ymm14, %%ymm14\n\t" /* ymm14 holds 3|4:a0+a1 */ \ + "vpxor %%ymm5, %%ymm7, %%ymm7\n\t" /* ymm7 holds 3|4:b0+b1 */ \ + "vpclmulqdq $0, %%ymm5, %%ymm8, %%ymm6\n\t" /* ymm6 holds 3|4:a0*b0 */ \ + "vpclmulqdq $17, %%ymm8, %%ymm5, %%ymm5\n\t" /* ymm5 holds 3|4:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 3|4:(a0+a1)*(b0+b1) */ \ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 3+5+?+15|4+6+?+16:a0*b0 */ \ + "vpxor %%ymm5, %%ymm1, %%ymm1\n\t" /* ymm1 holds 3+5+?+15|4+6+?+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 3+5+?+15|4+6+?+16:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%ymm0, %%ymm14\n\t" \ + "vpshufd $78, %%ymm2, %%ymm7\n\t" \ + "vpxor %%ymm0, %%ymm14, %%ymm14\n\t" /* ymm14 holds 1|2:a0+a1 */ \ + "vpxor %%ymm2, %%ymm7, %%ymm7\n\t" /* ymm7 holds 1|2:b0+b1 */ \ + "vpclmulqdq $0, %%ymm2, %%ymm0, %%ymm6\n\t" /* ymm6 holds 1|2:a0*b0 */ \ + "vpclmulqdq $17, %%ymm0, %%ymm2, %%ymm2\n\t" /* ymm2 holds 1|2:a1*b1 */ \ + "vpclmulqdq $0, %%ymm14, %%ymm7, %%ymm7\n\t" /* ymm7 holds 1|2:(a0+a1)*(b0+b1) */ \ + \ + "vmovdqu %[h15_h16], %%ymm0\n\t" \ + \ + "vpxor %%ymm6, %%ymm3, %%ymm3\n\t" /* ymm3 holds 1+3+?+15|2+4+?+16:a0*b0 */ \ + "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" /* ymm1 holds 1+3+?+15|2+4+?+16:a1*b1 */ \ + "vpxor %%ymm7, %%ymm4, %%ymm4\n\t" /* ymm4 holds 1+3+?+15|2+4+?+16:(a0+a1)*(b0+b1) */\ + \ + /* aggregated reduction... 
*/ \ + "vpxor %%ymm1, %%ymm3, %%ymm5\n\t" /* ymm5 holds a0*b0+a1*b1 */ \ + "vpxor %%ymm5, %%ymm4, %%ymm4\n\t" /* ymm4 holds a0*b0+a1*b1+(a0+a1)*(b0+b1) */ \ + "vpslldq $8, %%ymm4, %%ymm5\n\t" \ + "vpsrldq $8, %%ymm4, %%ymm4\n\t" \ + "vpxor %%ymm5, %%ymm3, %%ymm3\n\t" \ + "vpxor %%ymm4, %%ymm1, %%ymm1\n\t" /* holds the result of the \ + carry-less multiplication of ymm0 \ + by ymm1 */ \ + \ + /* first phase of the reduction */ \ + "vpsllq $1, %%ymm3, %%ymm6\n\t" /* packed right shifting << 63 */ \ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" \ + "vpsllq $57, %%ymm3, %%ymm5\n\t" /* packed right shifting << 57 */ \ + "vpsllq $62, %%ymm6, %%ymm6\n\t" /* packed right shifting << 62 */ \ + "vpxor %%ymm5, %%ymm6, %%ymm6\n\t" /* xor the shifted versions */ \ + "vpshufd $0x6a, %%ymm6, %%ymm5\n\t" \ + "vpshufd $0xae, %%ymm6, %%ymm6\n\t" \ + "vpxor %%ymm5, %%ymm3, %%ymm3\n\t" /* first phase of the reduction complete */ \ + \ + /* second phase of the reduction */ \ + "vpxor %%ymm3, %%ymm1, %%ymm1\n\t" /* xor the shifted versions */ \ + "vpsrlq $1, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 1 */ \ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" \ + "vpsrlq $1, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 2 */ \ + "vpxor %%ymm3, %%ymm1, %%ymm1\n\t" \ + "vpsrlq $5, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 7 */ \ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" \ + "vpxor %%ymm6, %%ymm1, %%ymm1\n\t" /* the result is in ymm1 */ \ + \ + /* merge 128-bit halves */ \ + "vextracti128 $1, %%ymm1, %%xmm2\n\t" \ + "vpxor %%xmm2, %%xmm1, %%xmm1\n\t" + +static ASM_FUNC_ATTR_INLINE void +gfmul_vpclmul_avx2_aggr16(const void *buf, const void *h_table, + const u64 *h1_h2_h15_h16) +{ + /* Input: + Hx: YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 + bemask: YMM15 + Hash: XMM1 + Output: + Hash: XMM1 + Inputs YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 and YMM15 stay + unmodified. + */ + asm volatile (GFMUL_AGGR16_ASM_VPCMUL_AVX2(be_to_le) + : + : [buf] "r" (buf), + [h_table] "r" (h_table), + [h1_h2] "m" (h1_h2_h15_h16[0]), + [h15_h16] "m" (h1_h2_h15_h16[4]) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +gfmul_vpclmul_avx2_aggr16_le(const void *buf, const void *h_table, + const u64 *h1_h2_h15_h16) +{ + /* Input: + Hx: YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 + bemask: YMM15 + Hash: XMM1 + Output: + Hash: XMM1 + Inputs YMM0, YMM8, YMM9, YMM10, YMM11, YMM12, YMM13 and YMM15 stay + unmodified. + */ + asm volatile (GFMUL_AGGR16_ASM_VPCMUL_AVX2(le_to_le) + : + : [buf] "r" (buf), + [h_table] "r" (h_table), + [h1_h2] "m" (h1_h2_h15_h16[0]), + [h15_h16] "m" (h1_h2_h15_h16[4]) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE +void gfmul_pclmul_avx2(void) +{ + /* Input: YMM0 and YMM1, Output: YMM1. Input YMM0 stays unmodified. + Input must be converted to little-endian. + */ + asm volatile (/* gfmul, ymm0 has operator a and ymm1 has operator b. 
*/ + "vpshufd $78, %%ymm0, %%ymm2\n\t" + "vpshufd $78, %%ymm1, %%ymm4\n\t" + "vpxor %%ymm0, %%ymm2, %%ymm2\n\t" /* ymm2 holds a0+a1 */ + "vpxor %%ymm1, %%ymm4, %%ymm4\n\t" /* ymm4 holds b0+b1 */ + + "vpclmulqdq $0, %%ymm1, %%ymm0, %%ymm3\n\t" /* ymm3 holds a0*b0 */ + "vpclmulqdq $17, %%ymm0, %%ymm1, %%ymm1\n\t" /* ymm6 holds a1*b1 */ + "vpclmulqdq $0, %%ymm2, %%ymm4, %%ymm4\n\t" /* ymm4 holds (a0+a1)*(b0+b1) */ + + "vpxor %%ymm1, %%ymm3, %%ymm5\n\t" /* ymm5 holds a0*b0+a1*b1 */ + "vpxor %%ymm5, %%ymm4, %%ymm4\n\t" /* ymm4 holds a0*b0+a1*b1+(a0+a1)*(b0+b1) */ + "vpslldq $8, %%ymm4, %%ymm5\n\t" + "vpsrldq $8, %%ymm4, %%ymm4\n\t" + "vpxor %%ymm5, %%ymm3, %%ymm3\n\t" + "vpxor %%ymm4, %%ymm1, %%ymm1\n\t" /* holds the result of the + carry-less multiplication of ymm0 + by ymm1 */ + + /* first phase of the reduction */ + "vpsllq $1, %%ymm3, %%ymm6\n\t" /* packed right shifting << 63 */ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" + "vpsllq $57, %%ymm3, %%ymm5\n\t" /* packed right shifting << 57 */ + "vpsllq $62, %%ymm6, %%ymm6\n\t" /* packed right shifting << 62 */ + "vpxor %%ymm5, %%ymm6, %%ymm6\n\t" /* xor the shifted versions */ + "vpshufd $0x6a, %%ymm6, %%ymm5\n\t" + "vpshufd $0xae, %%ymm6, %%ymm6\n\t" + "vpxor %%ymm5, %%ymm3, %%ymm3\n\t" /* first phase of the reduction complete */ + + /* second phase of the reduction */ + "vpxor %%ymm3, %%ymm1, %%ymm1\n\t" /* xor the shifted versions */ + "vpsrlq $1, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 1 */ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" + "vpsrlq $1, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 2 */ + "vpxor %%ymm3, %%ymm1, %%ymm1\n\t" + "vpsrlq $5, %%ymm3, %%ymm3\n\t" /* packed left shifting >> 7 */ + "vpxor %%ymm3, %%ymm6, %%ymm6\n\t" + "vpxor %%ymm6, %%ymm1, %%ymm1\n\t" /* the result is in ymm1 */ + ::: "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +gcm_lsh_avx2(void *h, unsigned int hoffs) +{ + static const u64 pconst[4] __attribute__ ((aligned (32))) = + { + U64_C(0x0000000000000001), U64_C(0xc200000000000000), + U64_C(0x0000000000000001), U64_C(0xc200000000000000) + }; + + asm volatile ("vmovdqu %[h], %%ymm2\n\t" + "vpshufd $0xff, %%ymm2, %%ymm3\n\t" + "vpsrad $31, %%ymm3, %%ymm3\n\t" + "vpslldq $8, %%ymm2, %%ymm4\n\t" + "vpand %[pconst], %%ymm3, %%ymm3\n\t" + "vpaddq %%ymm2, %%ymm2, %%ymm2\n\t" + "vpsrlq $63, %%ymm4, %%ymm4\n\t" + "vpxor %%ymm3, %%ymm2, %%ymm2\n\t" + "vpxor %%ymm4, %%ymm2, %%ymm2\n\t" + "vmovdqu %%ymm2, %[h]\n\t" + : [h] "+m" (*((byte *)h + hoffs)) + : [pconst] "m" (*pconst) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +load_h1h2_to_ymm1(gcry_cipher_hd_t c) +{ + unsigned int key_pos = + offsetof(struct gcry_cipher_handle, u_mode.gcm.u_ghash_key.key); + unsigned int table_pos = + offsetof(struct gcry_cipher_handle, u_mode.gcm.gcm_table); + + if (key_pos + 16 == table_pos) + { + /* Optimization: Table follows immediately after key. */ + asm volatile ("vmovdqu %[key], %%ymm1\n\t" + : + : [key] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory"); + } + else + { + asm volatile ("vmovdqa %[key], %%xmm1\n\t" + "vinserti128 $1, 0*16(%[h_table]), %%ymm1, %%ymm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table), + [key] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory"); + } +} + +static ASM_FUNC_ATTR void +ghash_setup_aggr8_avx2(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED; + + asm volatile (/* load H? */ + "vbroadcasti128 3*16(%[h_table]), %%ymm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + /* load H <<< 1, H? 
<<< 1 */ + load_h1h2_to_ymm1 (c); + + gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H? */ + + asm volatile ("vmovdqu %%ymm1, 3*16(%[h_table])\n\t" + /* load H? <<< 1, H? <<< 1 */ + "vmovdqu 1*16(%[h_table]), %%ymm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx2 (); /* H?<<<1?H? => H?, H?<<<1?H? => H? */ + + asm volatile ("vmovdqu %%ymm1, 6*16(%[h_table])\n\t" /* store H? for aggr16 setup */ + "vmovdqu %%ymm1, 5*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1, H? <<< 1 */ + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1, H? <<< 1 */ +} + +static ASM_FUNC_ATTR void +ghash_setup_aggr16_avx2(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR16_TABLE_INITIALIZED; + + asm volatile (/* load H? */ + "vbroadcasti128 7*16(%[h_table]), %%ymm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + /* load H <<< 1, H? <<< 1 */ + load_h1h2_to_ymm1 (c); + + gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu %%ymm1, 7*16(%[h_table])\n\t" + /* load H? <<< 1, H? <<< 1 */ + "vmovdqu 1*16(%[h_table]), %%ymm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu %%ymm1, 9*16(%[h_table])\n\t" + /* load H? <<< 1, H? <<< 1 */ + "vmovdqu 3*16(%[h_table]), %%ymm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu %%ymm1, 11*16(%[h_table])\n\t" + /* load H? <<< 1, H? <<< 1 */ + "vmovdqu 5*16(%[h_table]), %%ymm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu %%ymm1, 13*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 7 * 16); /* H? <<< 1, H?? <<< 1 */ + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 9 * 16); /* H?? <<< 1, H?? <<< 1 */ + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 11 * 16); /* H?? <<< 1, H?? <<< 1 */ + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 13 * 16); /* H?? <<< 1, H?? 
<<< 1 */ +} + +#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ +#endif /* __x86_64__ */ + +static unsigned int ASM_FUNC_ATTR +_gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, + size_t nblocks); + +static unsigned int ASM_FUNC_ATTR +_gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, + size_t nblocks); + +static ASM_FUNC_ATTR_INLINE void +gcm_lsh(void *h, unsigned int hoffs) { static const u64 pconst[2] __attribute__ ((aligned (16))) = { U64_C(0x0000000000000001), U64_C(0xc200000000000000) }; - asm volatile ("movdqu (%[h]), %%xmm2\n\t" + asm volatile ("movdqu %[h], %%xmm2\n\t" "pshufd $0xff, %%xmm2, %%xmm3\n\t" "movdqa %%xmm2, %%xmm4\n\t" "psrad $31, %%xmm3\n\t" @@ -449,15 +851,14 @@ static ASM_FUNC_ATTR_INLINE void gcm_lsh(void *h, unsigned int hoffs) "psrlq $63, %%xmm4\n\t" "pxor %%xmm3, %%xmm2\n\t" "pxor %%xmm4, %%xmm2\n\t" - "movdqu %%xmm2, (%[h])\n\t" - : - : [pconst] "m" (*pconst), - [h] "r" ((byte *)h + hoffs) + "movdqu %%xmm2, %[h]\n\t" + : [h] "+m" (*((byte *)h + hoffs)) + : [pconst] "m" (*pconst) : "memory" ); } void ASM_FUNC_ATTR -_gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) +_gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c, unsigned int hw_features) { static const unsigned char be_mask[16] __attribute__ ((aligned (16))) = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; @@ -480,6 +881,12 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) : "memory" ); #endif + (void)hw_features; + + c->u_mode.gcm.hw_impl_flags = 0; + c->u_mode.gcm.ghash_fn = _gcry_ghash_intel_pclmul; + c->u_mode.gcm.polyval_fn = _gcry_polyval_intel_pclmul; + /* Swap endianness of hsub. */ asm volatile ("movdqu (%[key]), %%xmm0\n\t" "pshufb %[be_mask], %%xmm0\n\t" @@ -489,7 +896,7 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) [be_mask] "m" (*be_mask) : "memory"); - gcm_lsh(c->u_mode.gcm.u_ghash_key.key, 0); /* H <<< 1 */ + gcm_lsh (c->u_mode.gcm.u_ghash_key.key, 0); /* H <<< 1 */ asm volatile ("movdqa %%xmm0, %%xmm1\n\t" "movdqu (%[key]), %%xmm0\n\t" /* load H <<< 1 */ @@ -500,80 +907,81 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) gfmul_pclmul (); /* H<<<1?H => H? */ asm volatile ("movdqu %%xmm1, 0*16(%[h_table])\n\t" - "movdqa %%xmm1, %%xmm7\n\t" : : [h_table] "r" (c->u_mode.gcm.gcm_table) : "memory"); - gcm_lsh(c->u_mode.gcm.gcm_table, 0 * 16); /* H? <<< 1 */ - gfmul_pclmul (); /* H<<<1?H? => H? */ + gcm_lsh (c->u_mode.gcm.gcm_table, 0 * 16); /* H? <<< 1 */ - asm volatile ("movdqa %%xmm7, %%xmm0\n\t" - "movdqu %%xmm1, 1*16(%[h_table])\n\t" - "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ - : - : [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory"); + if (0) + { } +#ifdef GCM_USE_INTEL_VPCLMUL_AVX2 + else if ((hw_features & HWF_INTEL_VAES_VPCLMUL) + && (hw_features & HWF_INTEL_AVX2)) + { + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX2; - gfmul_pclmul (); /* H?<<<1?H? => H? */ + asm volatile (/* H? */ + "vinserti128 $1, %%xmm1, %%ymm1, %%ymm1\n\t" + /* load H <<< 1, H? <<< 1 */ + "vinserti128 $1, 0*16(%[h_table]), %%ymm0, %%ymm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); - asm volatile ("movdqu %%xmm1, 2*16(%[h_table])\n\t" - "movdqa %%xmm1, %%xmm0\n\t" - "movdqu (%[key]), %%xmm1\n\t" /* load H <<< 1 */ - : - : [h_table] "r" (c->u_mode.gcm.gcm_table), - [key] "r" (c->u_mode.gcm.u_ghash_key.key) - : "memory"); + gfmul_pclmul_avx2 (); /* H<<<1?H? => H?, H?<<<1?H? => H? */ - gcm_lsh(c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1 */ - gcm_lsh(c->u_mode.gcm.gcm_table, 2 * 16); /* H? 
<<< 1 */ + asm volatile ("vmovdqu %%ymm1, 2*16(%[h_table])\n\t" /* store H? for aggr8 setup */ + "vmovdqu %%ymm1, 1*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); -#ifdef __x86_64__ - gfmul_pclmul (); /* H<<<1?H? => H? */ + gcm_lsh_avx2 (c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1, H? <<< 1 */ - asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t" - "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ - : - : [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory"); - - gfmul_pclmul (); /* H?<<<1?H? => H? */ - - asm volatile ("movdqu %%xmm1, 4*16(%[h_table])\n\t" - "movdqu 1*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ - : - : [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory"); + asm volatile ("vzeroupper\n\t" + ::: "memory" ); + } +#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ + else + { + asm volatile ("movdqa %%xmm1, %%xmm7\n\t" + ::: "memory"); - gfmul_pclmul (); /* H?<<<1?H? => H? */ + gfmul_pclmul (); /* H<<<1?H? => H? */ - asm volatile ("movdqu %%xmm1, 5*16(%[h_table])\n\t" - "movdqu 2*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ - : - : [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory"); + asm volatile ("movdqa %%xmm7, %%xmm0\n\t" + "movdqu %%xmm1, 1*16(%[h_table])\n\t" + "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); - gfmul_pclmul (); /* H?<<<1?H? => H? */ + gfmul_pclmul (); /* H?<<<1?H? => H? */ - asm volatile ("movdqu %%xmm1, 6*16(%[h_table])\n\t" - : - : [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory"); + asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t" /* store H? for aggr8 setup */ + "movdqu %%xmm1, 2*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); - gcm_lsh(c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1 */ - gcm_lsh(c->u_mode.gcm.gcm_table, 4 * 16); /* H? <<< 1 */ - gcm_lsh(c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1 */ - gcm_lsh(c->u_mode.gcm.gcm_table, 6 * 16); /* H? <<< 1 */ + gcm_lsh (c->u_mode.gcm.gcm_table, 1 * 16); /* H? <<< 1 */ + gcm_lsh (c->u_mode.gcm.gcm_table, 2 * 16); /* H? <<< 1 */ + } -#ifdef __WIN64__ /* Clear/restore used registers. */ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "movdqu 0*16(%0), %%xmm6\n\t" + asm volatile ("pxor %%xmm0, %%xmm0\n\t" + "pxor %%xmm1, %%xmm1\n\t" + "pxor %%xmm2, %%xmm2\n\t" + "pxor %%xmm3, %%xmm3\n\t" + "pxor %%xmm4, %%xmm4\n\t" + "pxor %%xmm5, %%xmm5\n\t" + "pxor %%xmm6, %%xmm6\n\t" + "pxor %%xmm7, %%xmm7\n\t" + ::: "memory" ); +#ifdef __x86_64__ +#ifdef __WIN64__ + asm volatile ("movdqu 0*16(%0), %%xmm6\n\t" "movdqu 1*16(%0), %%xmm7\n\t" "movdqu 2*16(%0), %%xmm8\n\t" "movdqu 3*16(%0), %%xmm9\n\t" @@ -587,16 +995,7 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) : "r" (win64tmp) : "memory" ); #else - /* Clear used registers. 
*/ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "pxor %%xmm6, %%xmm6\n\t" - "pxor %%xmm7, %%xmm7\n\t" - "pxor %%xmm8, %%xmm8\n\t" + asm volatile ("pxor %%xmm8, %%xmm8\n\t" "pxor %%xmm9, %%xmm9\n\t" "pxor %%xmm10, %%xmm10\n\t" "pxor %%xmm11, %%xmm11\n\t" @@ -605,14 +1004,67 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c) "pxor %%xmm14, %%xmm14\n\t" "pxor %%xmm15, %%xmm15\n\t" ::: "memory" ); -#endif -#endif +#endif /* __WIN64__ */ +#endif /* __x86_64__ */ } +#ifdef __x86_64__ +static ASM_FUNC_ATTR void +ghash_setup_aggr8(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED; + + asm volatile ("movdqa 3*16(%[h_table]), %%xmm0\n\t" /* load H? */ + "movdqu %[key], %%xmm1\n\t" /* load H <<< 1 */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table), + [key] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory"); + + gfmul_pclmul (); /* H<<<1?H? => H? */ + + asm volatile ("movdqu %%xmm1, 3*16(%[h_table])\n\t" + "movdqu 0*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul (); /* H?<<<1?H? => H? */ + + asm volatile ("movdqu %%xmm1, 4*16(%[h_table])\n\t" + "movdqu 1*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul (); /* H?<<<1?H? => H? */ + + asm volatile ("movdqu %%xmm1, 5*16(%[h_table])\n\t" + "movdqu 2*16(%[h_table]), %%xmm1\n\t" /* load H? <<< 1 */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul (); /* H?<<<1?H? => H? */ + + asm volatile ("movdqu %%xmm1, 6*16(%[h_table])\n\t" + "movdqu %%xmm1, 7*16(%[h_table])\n\t" /* store H? for aggr16 setup */ + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1 */ + gcm_lsh (c->u_mode.gcm.gcm_table, 4 * 16); /* H? <<< 1 */ + gcm_lsh (c->u_mode.gcm.gcm_table, 5 * 16); /* H? <<< 1 */ + gcm_lsh (c->u_mode.gcm.gcm_table, 6 * 16); /* H? <<< 1 */ +} +#endif /* __x86_64__ */ + + unsigned int ASM_FUNC_ATTR _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, - size_t nblocks) + size_t nblocks) { static const unsigned char be_mask[16] __attribute__ ((aligned (16))) = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; @@ -650,12 +1102,93 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, [be_mask] "m" (*be_mask) : "memory" ); +#if defined(GCM_USE_INTEL_VPCLMUL_AVX2) + if (nblocks >= 16 + && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)) + { + u64 h1_h2_h15_h16[4*2]; + + asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t" + "vmovdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); + + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + { + ghash_setup_aggr8_avx2 (c); + } + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + ghash_setup_aggr16_avx2 (c); + } + + /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. 
*/ + asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t" + "vmovdqu 0*16(%[h_table]), %%xmm7\n\t" + "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" + "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ + "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ + "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ + "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ + "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ + "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ + "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ + "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */ + "vmovdqu %%ymm0, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 16) + { + gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table, + h1_h2_h15_h16); + + buf += 16 * blocksize; + nblocks -= 16; + } + + /* Clear used x86-64/XMM registers. */ + asm volatile("vmovdqu %%ymm15, %[h15_h16]\n\t" + "vmovdqu %%ymm15, %[h1_h2]\n\t" + "vzeroupper\n\t" +#ifndef __WIN64__ + "pxor %%xmm8, %%xmm8\n\t" + "pxor %%xmm9, %%xmm9\n\t" + "pxor %%xmm10, %%xmm10\n\t" + "pxor %%xmm11, %%xmm11\n\t" + "pxor %%xmm12, %%xmm12\n\t" + "pxor %%xmm13, %%xmm13\n\t" + "pxor %%xmm14, %%xmm14\n\t" + "pxor %%xmm15, %%xmm15\n\t" +#endif + "movdqa %[be_mask], %%xmm7\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : [be_mask] "m" (*be_mask) + : "memory" ); + } +#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ + #ifdef __x86_64__ if (nblocks >= 8) { - /* Preload H1. */ asm volatile ("movdqa %%xmm7, %%xmm15\n\t" - "movdqa %[h_1], %%xmm0\n\t" + ::: "memory" ); + + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + { + asm volatile ("movdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); + ghash_setup_aggr8 (c); + asm volatile ("movdqa %%xmm8, %%xmm1\n\t" + ::: "memory" ); + } + + /* Preload H1. */ + asm volatile ("movdqa %[h_1], %%xmm0\n\t" : : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key) : "memory" ); @@ -667,6 +1200,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, buf += 8 * blocksize; nblocks -= 8; } + #ifndef __WIN64__ /* Clear used x86-64/XMM registers. 
*/ asm volatile( "pxor %%xmm8, %%xmm8\n\t" @@ -680,7 +1214,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, ::: "memory" ); #endif } -#endif +#endif /* __x86_64__ */ while (nblocks >= 4) { @@ -761,7 +1295,7 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, unsigned int ASM_FUNC_ATTR _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, - size_t nblocks) + size_t nblocks) { static const unsigned char be_mask[16] __attribute__ ((aligned (16))) = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; @@ -799,9 +1333,86 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, [be_mask] "m" (*be_mask) : "memory" ); +#if defined(GCM_USE_INTEL_VPCLMUL_AVX2) + if (nblocks >= 16 + && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)) + { + u64 h1_h2_h15_h16[4*2]; + + asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); + + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + { + ghash_setup_aggr8_avx2 (c); + } + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + ghash_setup_aggr16_avx2 (c); + } + + /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. */ + asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t" + "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" + "vmovdqu 0*16(%[h_table]), %%xmm7\n\t" + "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ + "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ + "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ + "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ + "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ + "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ + "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ + "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */ + "vmovdqu %%ymm0, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 16) + { + gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table, + h1_h2_h15_h16); + + buf += 16 * blocksize; + nblocks -= 16; + } + + /* Clear used x86-64/XMM registers. */ + asm volatile("vpxor %%xmm7, %%xmm7, %%xmm7\n\t" + "vmovdqu %%ymm7, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + "vzeroupper\n\t" +#ifndef __WIN64__ + "pxor %%xmm8, %%xmm8\n\t" + "pxor %%xmm9, %%xmm9\n\t" + "pxor %%xmm10, %%xmm10\n\t" + "pxor %%xmm11, %%xmm11\n\t" + "pxor %%xmm12, %%xmm12\n\t" + "pxor %%xmm13, %%xmm13\n\t" + "pxor %%xmm14, %%xmm14\n\t" +#endif + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : + : "memory" ); + } +#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ + #ifdef __x86_64__ if (nblocks >= 8) { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + { + asm volatile ("movdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); + ghash_setup_aggr8 (c); + asm volatile ("movdqa %%xmm8, %%xmm1\n\t" + ::: "memory" ); + } + /* Preload H1. 
*/ asm volatile ("pxor %%xmm15, %%xmm15\n\t" "movdqa %[h_1], %%xmm0\n\t" diff --git a/cipher/cipher-gcm.c b/cipher/cipher-gcm.c index 69ff0de6..683f07b0 100644 --- a/cipher/cipher-gcm.c +++ b/cipher/cipher-gcm.c @@ -39,15 +39,8 @@ #ifdef GCM_USE_INTEL_PCLMUL -extern void _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c); - -extern unsigned int _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, - const byte *buf, size_t nblocks); - -extern unsigned int _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, - byte *result, - const byte *buf, - size_t nblocks); +extern void _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c, + unsigned int hw_features); #endif #ifdef GCM_USE_ARM_PMULL @@ -594,9 +587,7 @@ setupM (gcry_cipher_hd_t c) #ifdef GCM_USE_INTEL_PCLMUL else if (features & HWF_INTEL_PCLMUL) { - c->u_mode.gcm.ghash_fn = _gcry_ghash_intel_pclmul; - c->u_mode.gcm.polyval_fn = _gcry_polyval_intel_pclmul; - _gcry_ghash_setup_intel_pclmul (c); + _gcry_ghash_setup_intel_pclmul (c, features); } #endif #ifdef GCM_USE_ARM_PMULL diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index c8a1097a..e31ac860 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -72,6 +72,14 @@ # endif #endif /* GCM_USE_INTEL_PCLMUL */ +/* GCM_USE_INTEL_VPCLMUL_AVX2 indicates whether to compile GCM with Intel + VPCLMUL/AVX2 code. */ +#undef GCM_USE_INTEL_VPCLMUL_AVX2 +#if defined(__x86_64__) && defined(GCM_USE_INTEL_PCLMUL) && \ + defined(ENABLE_AVX2_SUPPORT) && defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL) +# define GCM_USE_INTEL_VPCLMUL_AVX2 1 +#endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ + /* GCM_USE_ARM_PMULL indicates whether to compile GCM with ARMv8 PMULL code. */ #undef GCM_USE_ARM_PMULL #if defined(ENABLE_ARM_CRYPTO_SUPPORT) && defined(GCM_USE_TABLES) @@ -355,6 +363,9 @@ struct gcry_cipher_handle /* Key length used for GCM-SIV key generating key. */ unsigned int siv_keylen; + + /* Flags for accelerated implementations. */ + unsigned int hw_impl_flags; } gcm; /* Mode specific storage for OCB mode. */ -- 2.32.0 From jussi.kivilinna at iki.fi Sun Mar 6 18:19:10 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 6 Mar 2022 19:19:10 +0200 Subject: [PATCH 3/3] ghash|polyval: add x86_64 VPCLMUL/AVX512 accelerated implementation In-Reply-To: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> Message-ID: <20220306171910.1011180-3-jussi.kivilinna@iki.fi> * cipher/cipher-gcm-intel-pclmul.c (GCM_INTEL_USE_VPCLMUL_AVX512) (GCM_INTEL_AGGR32_TABLE_INITIALIZED): New. (ghash_setup_aggr16_avx2): Store H16 for aggr32 setup. [GCM_USE_INTEL_VPCLMUL_AVX512] (GFMUL_AGGR32_ASM_VPCMUL_AVX512) (gfmul_vpclmul_avx512_aggr32, gfmul_vpclmul_avx512_aggr32_le) (gfmul_pclmul_avx512, gcm_lsh_avx512, load_h1h4_to_zmm1) (ghash_setup_aggr8_avx512, ghash_setup_aggr16_avx512) (ghash_setup_aggr32_avx512, swap128b_perm): New. (_gcry_ghash_setup_intel_pclmul) [GCM_USE_INTEL_VPCLMUL_AVX512]: Enable AVX512 implementation based on HW features. (_gcry_ghash_intel_pclmul, _gcry_polyval_intel_pclmul): Add VPCLMUL/AVX512 code path; Small tweaks to VPCLMUL/AVX2 code path; Tweaks on register clearing. -- Patch adds VPCLMUL/AVX512 accelerated implementation for GHASH (GCM) and POLYVAL (GCM-SIV). 
Benchmark on Intel Core i3-1115G4: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.063 ns/B 15200 MiB/s 0.257 c/B 4090 GCM-SIV auth | 0.061 ns/B 15704 MiB/s 0.248 c/B 4090 After (ghash ~41% faster, polyval ~34% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM auth | 0.044 ns/B 21614 MiB/s 0.181 c/B 4096?3 GCM-SIV auth | 0.045 ns/B 21108 MiB/s 0.185 c/B 4097?3 AES128-GCM / AES128-GCM-SIV encryption: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz GCM enc | 0.084 ns/B 11306 MiB/s 0.346 c/B 4097?3 GCM-SIV enc | 0.086 ns/B 11026 MiB/s 0.354 c/B 4096?3 Signed-off-by: Jussi Kivilinna --- cipher/cipher-gcm-intel-pclmul.c | 940 +++++++++++++++++++++++-------- cipher/cipher-internal.h | 8 + 2 files changed, 728 insertions(+), 220 deletions(-) diff --git a/cipher/cipher-gcm-intel-pclmul.c b/cipher/cipher-gcm-intel-pclmul.c index b7324e8f..78a9e338 100644 --- a/cipher/cipher-gcm-intel-pclmul.c +++ b/cipher/cipher-gcm-intel-pclmul.c @@ -52,6 +52,8 @@ #define GCM_INTEL_USE_VPCLMUL_AVX2 (1 << 0) #define GCM_INTEL_AGGR8_TABLE_INITIALIZED (1 << 1) #define GCM_INTEL_AGGR16_TABLE_INITIALIZED (1 << 2) +#define GCM_INTEL_USE_VPCLMUL_AVX512 (1 << 3) +#define GCM_INTEL_AGGR32_TABLE_INITIALIZED (1 << 4) /* @@ -813,7 +815,8 @@ ghash_setup_aggr16_avx2(gcry_cipher_hd_t c) gfmul_pclmul_avx2 (); /* H?<<<1?H? => H??, H?<<<1?H? => H?? */ - asm volatile ("vmovdqu %%ymm1, 13*16(%[h_table])\n\t" + asm volatile ("vmovdqu %%ymm1, 14*16(%[h_table])\n\t" /* store H?? for aggr32 setup */ + "vmovdqu %%ymm1, 13*16(%[h_table])\n\t" : : [h_table] "r" (c->u_mode.gcm.gcm_table) : "memory"); @@ -825,6 +828,400 @@ ghash_setup_aggr16_avx2(gcry_cipher_hd_t c) } #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ + +#ifdef GCM_USE_INTEL_VPCLMUL_AVX512 + +#define GFMUL_AGGR32_ASM_VPCMUL_AVX512(be_to_le) \ + /* perform clmul and merge results... 
*/ \ + "vmovdqu64 0*16(%[buf]), %%zmm5\n\t" \ + "vmovdqu64 4*16(%[buf]), %%zmm2\n\t" \ + be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */ \ + "vpxorq %%zmm5, %%zmm1, %%zmm1\n\t" \ + \ + "vpshufd $78, %%zmm0, %%zmm5\n\t" \ + "vpshufd $78, %%zmm1, %%zmm4\n\t" \ + "vpxorq %%zmm0, %%zmm5, %%zmm5\n\t" /* zmm5 holds 29|?|32:a0+a1 */ \ + "vpxorq %%zmm1, %%zmm4, %%zmm4\n\t" /* zmm4 holds 29|?|32:b0+b1 */ \ + "vpclmulqdq $0, %%zmm1, %%zmm0, %%zmm3\n\t" /* zmm3 holds 29|?|32:a0*b0 */ \ + "vpclmulqdq $17, %%zmm0, %%zmm1, %%zmm1\n\t" /* zmm1 holds 29|?|32:a1*b1 */ \ + "vpclmulqdq $0, %%zmm5, %%zmm4, %%zmm4\n\t" /* zmm4 holds 29|?|32:(a0+a1)*(b0+b1) */ \ + \ + "vpshufd $78, %%zmm13, %%zmm14\n\t" \ + "vpshufd $78, %%zmm2, %%zmm7\n\t" \ + "vpxorq %%zmm13, %%zmm14, %%zmm14\n\t" /* zmm14 holds 25|?|28:a0+a1 */ \ + "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t" /* zmm7 holds 25|?|28:b0+b1 */ \ + "vpclmulqdq $0, %%zmm2, %%zmm13, %%zmm17\n\t" /* zmm17 holds 25|?|28:a0*b0 */ \ + "vpclmulqdq $17, %%zmm13, %%zmm2, %%zmm18\n\t" /* zmm18 holds 25|?|28:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t" /* zmm19 holds 25|?|28:(a0+a1)*(b0+b1) */\ + \ + "vmovdqu64 8*16(%[buf]), %%zmm5\n\t" \ + "vmovdqu64 12*16(%[buf]), %%zmm2\n\t" \ + be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%zmm12, %%zmm14\n\t" \ + "vpshufd $78, %%zmm5, %%zmm7\n\t" \ + "vpxorq %%zmm12, %%zmm14, %%zmm14\n\t" /* zmm14 holds 21|?|24:a0+a1 */ \ + "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t" /* zmm7 holds 21|?|24:b0+b1 */ \ + "vpclmulqdq $0, %%zmm5, %%zmm12, %%zmm6\n\t" /* zmm6 holds 21|?|24:a0*b0 */ \ + "vpclmulqdq $17, %%zmm12, %%zmm5, %%zmm5\n\t" /* zmm5 holds 21|?|24:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 21|?|24:(a0+a1)*(b0+b1) */\ + \ + "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 21+?|?|?+32:a0*b0 */ \ + "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 21+?|?|?+32:a1*b1 */ \ + "vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 21+?|?|?+32:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%zmm11, %%zmm14\n\t" \ + "vpshufd $78, %%zmm2, %%zmm7\n\t" \ + "vpxorq %%zmm11, %%zmm14, %%zmm14\n\t" /* zmm14 holds 17|?|20:a0+a1 */ \ + "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t" /* zmm7 holds 17|?|20:b0+b1 */ \ + "vpclmulqdq $0, %%zmm2, %%zmm11, %%zmm17\n\t" /* zmm17 holds 17|?|20:a0*b0 */ \ + "vpclmulqdq $17, %%zmm11, %%zmm2, %%zmm18\n\t" /* zmm18 holds 17|?|20:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t" /* zmm19 holds 17|?|20:(a0+a1)*(b0+b1) */\ + \ + "vmovdqu64 16*16(%[buf]), %%zmm5\n\t" \ + "vmovdqu64 20*16(%[buf]), %%zmm2\n\t" \ + be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%zmm10, %%zmm14\n\t" \ + "vpshufd $78, %%zmm5, %%zmm7\n\t" \ + "vpxorq %%zmm10, %%zmm14, %%zmm14\n\t" /* zmm14 holds 13|?|16:a0+a1 */ \ + "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t" /* zmm7 holds 13|?|16:b0+b1 */ \ + "vpclmulqdq $0, %%zmm5, %%zmm10, %%zmm6\n\t" /* zmm6 holds 13|?|16:a0*b0 */ \ + "vpclmulqdq $17, %%zmm10, %%zmm5, %%zmm5\n\t" /* zmm5 holds 13|?|16:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 13|?|16:(a0+a1)*(b0+b1) */ \ + \ + "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 13+?|?|?+32:a0*b0 */ \ + "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 13+?|?|?+32:a1*b1 */ \ + 
"vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 13+?|?|?+32:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%zmm9, %%zmm14\n\t" \ + "vpshufd $78, %%zmm2, %%zmm7\n\t" \ + "vpxorq %%zmm9, %%zmm14, %%zmm14\n\t" /* zmm14 holds 9|?|12:a0+a1 */ \ + "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t" /* zmm7 holds 9|?|12:b0+b1 */ \ + "vpclmulqdq $0, %%zmm2, %%zmm9, %%zmm17\n\t" /* zmm17 holds 9|?|12:a0*b0 */ \ + "vpclmulqdq $17, %%zmm9, %%zmm2, %%zmm18\n\t" /* zmm18 holds 9|?|12:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm19\n\t" /* zmm19 holds 9|?|12:(a0+a1)*(b0+b1) */\ + \ + "vmovdqu64 24*16(%[buf]), %%zmm5\n\t" \ + "vmovdqu64 28*16(%[buf]), %%zmm2\n\t" \ + be_to_le("vpshufb %%zmm15, %%zmm5, %%zmm5\n\t") /* be => le */ \ + be_to_le("vpshufb %%zmm15, %%zmm2, %%zmm2\n\t") /* be => le */ \ + \ + "vpshufd $78, %%zmm8, %%zmm14\n\t" \ + "vpshufd $78, %%zmm5, %%zmm7\n\t" \ + "vpxorq %%zmm8, %%zmm14, %%zmm14\n\t" /* zmm14 holds 5|?|8:a0+a1 */ \ + "vpxorq %%zmm5, %%zmm7, %%zmm7\n\t" /* zmm7 holds 5|?|8:b0+b1 */ \ + "vpclmulqdq $0, %%zmm5, %%zmm8, %%zmm6\n\t" /* zmm6 holds 5|?|8:a0*b0 */ \ + "vpclmulqdq $17, %%zmm8, %%zmm5, %%zmm5\n\t" /* zmm5 holds 5|?|8:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 5|?|8:(a0+a1)*(b0+b1) */ \ + \ + "vpternlogq $0x96, %%zmm6, %%zmm17, %%zmm3\n\t" /* zmm3 holds 5+?|?|?+32:a0*b0 */ \ + "vpternlogq $0x96, %%zmm5, %%zmm18, %%zmm1\n\t" /* zmm1 holds 5+?|?|?+32:a1*b1 */ \ + "vpternlogq $0x96, %%zmm7, %%zmm19, %%zmm4\n\t" /* zmm4 holds 5+?|?|?+32:(a0+a1)*(b0+b1) */\ + \ + "vpshufd $78, %%zmm16, %%zmm14\n\t" \ + "vpshufd $78, %%zmm2, %%zmm7\n\t" \ + "vpxorq %%zmm16, %%zmm14, %%zmm14\n\t" /* zmm14 holds 1|?|4:a0+a1 */ \ + "vpxorq %%zmm2, %%zmm7, %%zmm7\n\t" /* zmm7 holds 1|2:b0+b1 */ \ + "vpclmulqdq $0, %%zmm2, %%zmm16, %%zmm6\n\t" /* zmm6 holds 1|2:a0*b0 */ \ + "vpclmulqdq $17, %%zmm16, %%zmm2, %%zmm2\n\t" /* zmm2 holds 1|2:a1*b1 */ \ + "vpclmulqdq $0, %%zmm14, %%zmm7, %%zmm7\n\t" /* zmm7 holds 1|2:(a0+a1)*(b0+b1) */ \ + \ + "vpxorq %%zmm6, %%zmm3, %%zmm3\n\t" /* zmm3 holds 1+3+?+15|2+4+?+16:a0*b0 */ \ + "vpxorq %%zmm2, %%zmm1, %%zmm1\n\t" /* zmm1 holds 1+3+?+15|2+4+?+16:a1*b1 */ \ + "vpxorq %%zmm7, %%zmm4, %%zmm4\n\t" /* zmm4 holds 1+3+?+15|2+4+?+16:(a0+a1)*(b0+b1) */\ + \ + /* aggregated reduction... 
*/ \ + "vpternlogq $0x96, %%zmm1, %%zmm3, %%zmm4\n\t" /* zmm4 holds \ + * a0*b0+a1*b1+(a0+a1)*(b0+b1) */ \ + "vpslldq $8, %%zmm4, %%zmm5\n\t" \ + "vpsrldq $8, %%zmm4, %%zmm4\n\t" \ + "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" \ + "vpxorq %%zmm4, %%zmm1, %%zmm1\n\t" /* holds the result of the \ + carry-less multiplication of zmm0 \ + by zmm1 */ \ + \ + /* first phase of the reduction */ \ + "vpsllq $1, %%zmm3, %%zmm6\n\t" /* packed right shifting << 63 */ \ + "vpxorq %%zmm3, %%zmm6, %%zmm6\n\t" \ + "vpsllq $57, %%zmm3, %%zmm5\n\t" /* packed right shifting << 57 */ \ + "vpsllq $62, %%zmm6, %%zmm6\n\t" /* packed right shifting << 62 */ \ + "vpxorq %%zmm5, %%zmm6, %%zmm6\n\t" /* xor the shifted versions */ \ + "vpshufd $0x6a, %%zmm6, %%zmm5\n\t" \ + "vpshufd $0xae, %%zmm6, %%zmm6\n\t" \ + "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" /* first phase of the reduction complete */ \ + \ + /* second phase of the reduction */ \ + "vpsrlq $1, %%zmm3, %%zmm2\n\t" /* packed left shifting >> 1 */ \ + "vpsrlq $2, %%zmm3, %%zmm4\n\t" /* packed left shifting >> 2 */ \ + "vpsrlq $7, %%zmm3, %%zmm5\n\t" /* packed left shifting >> 7 */ \ + "vpternlogq $0x96, %%zmm3, %%zmm2, %%zmm1\n\t" /* xor the shifted versions */ \ + "vpternlogq $0x96, %%zmm4, %%zmm5, %%zmm6\n\t" \ + "vpxorq %%zmm6, %%zmm1, %%zmm1\n\t" /* the result is in zmm1 */ \ + \ + /* merge 256-bit halves */ \ + "vextracti64x4 $1, %%zmm1, %%ymm2\n\t" \ + "vpxor %%ymm2, %%ymm1, %%ymm1\n\t" \ + /* merge 128-bit halves */ \ + "vextracti128 $1, %%ymm1, %%xmm2\n\t" \ + "vpxor %%xmm2, %%xmm1, %%xmm1\n\t" + +static ASM_FUNC_ATTR_INLINE void +gfmul_vpclmul_avx512_aggr32(const void *buf, const void *h_table) +{ + /* Input: + Hx: ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 + bemask: ZMM15 + Hash: XMM1 + Output: + Hash: XMM1 + Inputs ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 and YMM15 stay + unmodified. + */ + asm volatile (GFMUL_AGGR32_ASM_VPCMUL_AVX512(be_to_le) + : + : [buf] "r" (buf), + [h_table] "r" (h_table) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +gfmul_vpclmul_avx512_aggr32_le(const void *buf, const void *h_table) +{ + /* Input: + Hx: ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 + bemask: ZMM15 + Hash: XMM1 + Output: + Hash: XMM1 + Inputs ZMM0, ZMM8, ZMM9, ZMM10, ZMM11, ZMM12, ZMM13, ZMM16 and YMM15 stay + unmodified. + */ + asm volatile (GFMUL_AGGR32_ASM_VPCMUL_AVX512(le_to_le) + : + : [buf] "r" (buf), + [h_table] "r" (h_table) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE +void gfmul_pclmul_avx512(void) +{ + /* Input: ZMM0 and ZMM1, Output: ZMM1. Input ZMM0 stays unmodified. + Input must be converted to little-endian. + */ + asm volatile (/* gfmul, zmm0 has operator a and zmm1 has operator b. 
*/ + "vpshufd $78, %%zmm0, %%zmm2\n\t" + "vpshufd $78, %%zmm1, %%zmm4\n\t" + "vpxorq %%zmm0, %%zmm2, %%zmm2\n\t" /* zmm2 holds a0+a1 */ + "vpxorq %%zmm1, %%zmm4, %%zmm4\n\t" /* zmm4 holds b0+b1 */ + + "vpclmulqdq $0, %%zmm1, %%zmm0, %%zmm3\n\t" /* zmm3 holds a0*b0 */ + "vpclmulqdq $17, %%zmm0, %%zmm1, %%zmm1\n\t" /* zmm6 holds a1*b1 */ + "vpclmulqdq $0, %%zmm2, %%zmm4, %%zmm4\n\t" /* zmm4 holds (a0+a1)*(b0+b1) */ + + "vpternlogq $0x96, %%zmm1, %%zmm3, %%zmm4\n\t" /* zmm4 holds + * a0*b0+a1*b1+(a0+a1)*(b0+b1) */ + "vpslldq $8, %%zmm4, %%zmm5\n\t" + "vpsrldq $8, %%zmm4, %%zmm4\n\t" + "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" + "vpxorq %%zmm4, %%zmm1, %%zmm1\n\t" /* holds the result of the + carry-less multiplication of zmm0 + by zmm1 */ + + /* first phase of the reduction */ + "vpsllq $1, %%zmm3, %%zmm6\n\t" /* packed right shifting << 63 */ + "vpxorq %%zmm3, %%zmm6, %%zmm6\n\t" + "vpsllq $57, %%zmm3, %%zmm5\n\t" /* packed right shifting << 57 */ + "vpsllq $62, %%zmm6, %%zmm6\n\t" /* packed right shifting << 62 */ + "vpxorq %%zmm5, %%zmm6, %%zmm6\n\t" /* xor the shifted versions */ + "vpshufd $0x6a, %%zmm6, %%zmm5\n\t" + "vpshufd $0xae, %%zmm6, %%zmm6\n\t" + "vpxorq %%zmm5, %%zmm3, %%zmm3\n\t" /* first phase of the reduction complete */ + + /* second phase of the reduction */ + "vpsrlq $1, %%zmm3, %%zmm2\n\t" /* packed left shifting >> 1 */ + "vpsrlq $2, %%zmm3, %%zmm4\n\t" /* packed left shifting >> 2 */ + "vpsrlq $7, %%zmm3, %%zmm5\n\t" /* packed left shifting >> 7 */ + "vpternlogq $0x96, %%zmm3, %%zmm2, %%zmm1\n\t" /* xor the shifted versions */ + "vpternlogq $0x96, %%zmm4, %%zmm5, %%zmm6\n\t" + "vpxorq %%zmm6, %%zmm1, %%zmm1\n\t" /* the result is in zmm1 */ + ::: "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +gcm_lsh_avx512(void *h, unsigned int hoffs) +{ + static const u64 pconst[8] __attribute__ ((aligned (64))) = + { + U64_C(0x0000000000000001), U64_C(0xc200000000000000), + U64_C(0x0000000000000001), U64_C(0xc200000000000000), + U64_C(0x0000000000000001), U64_C(0xc200000000000000), + U64_C(0x0000000000000001), U64_C(0xc200000000000000) + }; + + asm volatile ("vmovdqu64 %[h], %%zmm2\n\t" + "vpshufd $0xff, %%zmm2, %%zmm3\n\t" + "vpsrad $31, %%zmm3, %%zmm3\n\t" + "vpslldq $8, %%zmm2, %%zmm4\n\t" + "vpandq %[pconst], %%zmm3, %%zmm3\n\t" + "vpaddq %%zmm2, %%zmm2, %%zmm2\n\t" + "vpsrlq $63, %%zmm4, %%zmm4\n\t" + "vpternlogq $0x96, %%zmm4, %%zmm3, %%zmm2\n\t" + "vmovdqu64 %%zmm2, %[h]\n\t" + : [h] "+m" (*((byte *)h + hoffs)) + : [pconst] "m" (*pconst) + : "memory" ); +} + +static ASM_FUNC_ATTR_INLINE void +load_h1h4_to_zmm1(gcry_cipher_hd_t c) +{ + unsigned int key_pos = + offsetof(struct gcry_cipher_handle, u_mode.gcm.u_ghash_key.key); + unsigned int table_pos = + offsetof(struct gcry_cipher_handle, u_mode.gcm.gcm_table); + + if (key_pos + 16 == table_pos) + { + /* Optimization: Table follows immediately after key. */ + asm volatile ("vmovdqu64 %[key], %%zmm1\n\t" + : + : [key] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory"); + } + else + { + asm volatile ("vmovdqu64 -1*16(%[h_table]), %%zmm1\n\t" + "vinserti64x2 $0, %[key], %%zmm1, %%zmm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table), + [key] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory"); + } +} + +static ASM_FUNC_ATTR void +ghash_setup_aggr8_avx512(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR8_TABLE_INITIALIZED; + + asm volatile (/* load H? */ + "vbroadcasti64x2 3*16(%[h_table]), %%zmm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + /* load H <<< 1, H? <<< 1, H? <<< 1, H? 
<<< 1 */ + load_h1h4_to_zmm1 (c); + + gfmul_pclmul_avx512 (); /* H<<<1?H? => H?, ?, H?<<<1?H? => H? */ + + asm volatile ("vmovdqu64 %%zmm1, 4*16(%[h_table])\n\t" /* store H? for aggr16 setup */ + "vmovdqu64 %%zmm1, 3*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 3 * 16); /* H? <<< 1, ?, H? <<< 1 */ +} + +static ASM_FUNC_ATTR void +ghash_setup_aggr16_avx512(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR16_TABLE_INITIALIZED; + + asm volatile (/* load H? */ + "vbroadcasti64x2 7*16(%[h_table]), %%zmm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + /* load H <<< 1, H? <<< 1, H? <<< 1, H? <<< 1 */ + load_h1h4_to_zmm1 (c); + + gfmul_pclmul_avx512 (); /* H<<<1?H? => H?, ? , H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 7*16(%[h_table])\n\t" + /* load H? <<< 1, ?, H? <<< 1 */ + "vmovdqu64 3*16(%[h_table]), %%zmm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx512 (); /* H?<<<1?H? => H??, ? , H?<<<1?H? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 12*16(%[h_table])\n\t" /* store H?? for aggr32 setup */ + "vmovdqu64 %%zmm1, 11*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 7 * 16); /* H? <<< 1, ?, H?? <<< 1 */ + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 11 * 16); /* H?? <<< 1, ?, H?? <<< 1 */ +} + +static ASM_FUNC_ATTR void +ghash_setup_aggr32_avx512(gcry_cipher_hd_t c) +{ + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_AGGR32_TABLE_INITIALIZED; + + asm volatile (/* load H?? */ + "vbroadcasti64x2 15*16(%[h_table]), %%zmm0\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + /* load H <<< 1, H? <<< 1, H? <<< 1, H? <<< 1 */ + load_h1h4_to_zmm1 (c); + + gfmul_pclmul_avx512 (); /* H<<<1?H?? => H??, ?, H?<<<1?H?? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 15*16(%[h_table])\n\t" + /* load H? <<< 1, ?, H? <<< 1 */ + "vmovdqu64 3*16(%[h_table]), %%zmm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx512 (); /* H?<<<1?H?? => H??, ?, H?<<<1?H?? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 19*16(%[h_table])\n\t" + /* load H? <<< 1, ?, H?? <<< 1 */ + "vmovdqu64 7*16(%[h_table]), %%zmm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx512 (); /* H?<<<1?H?? => H??, ?, H??<<<1?H?? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 23*16(%[h_table])\n\t" + /* load H?? <<< 1, ?, H?? <<< 1 */ + "vmovdqu64 11*16(%[h_table]), %%zmm1\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gfmul_pclmul_avx512 (); /* H??<<<1?H?? => H??, ?, H??<<<1?H?? => H?? */ + + asm volatile ("vmovdqu64 %%zmm1, 27*16(%[h_table])\n\t" + : + : [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory"); + + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 15 * 16); + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 19 * 16); + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 23 * 16); + gcm_lsh_avx512 (c->u_mode.gcm.gcm_table, 27 * 16); +} + +static const u64 swap128b_perm[8] __attribute__ ((aligned (64))) = + { + /* For swapping order of 128bit lanes in 512bit register using vpermq. 
*/ + 6, 7, 4, 5, 2, 3, 0, 1 + }; + +#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */ #endif /* __x86_64__ */ static unsigned int ASM_FUNC_ATTR @@ -921,6 +1318,11 @@ _gcry_ghash_setup_intel_pclmul (gcry_cipher_hd_t c, unsigned int hw_features) { c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX2; +#ifdef GCM_USE_INTEL_VPCLMUL_AVX512 + if (hw_features & HWF_INTEL_AVX512) + c->u_mode.gcm.hw_impl_flags |= GCM_INTEL_USE_VPCLMUL_AVX512; +#endif + asm volatile (/* H? */ "vinserti128 $1, %%xmm1, %%ymm1, %%ymm1\n\t" /* load H <<< 1, H? <<< 1 */ @@ -1104,71 +1506,126 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, #if defined(GCM_USE_INTEL_VPCLMUL_AVX2) if (nblocks >= 16 - && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)) + && ((c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2) + || (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512))) { - u64 h1_h2_h15_h16[4*2]; - - asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t" - "vmovdqa %%xmm1, %%xmm8\n\t" - ::: "memory" ); - - if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) +#if defined(GCM_USE_INTEL_VPCLMUL_AVX512) + if (nblocks >= 32 + && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512)) { - ghash_setup_aggr8_avx2 (c); + asm volatile ("vpopcntb %%zmm7, %%zmm15\n\t" /* spec stop for old AVX512 CPUs */ + "vshufi64x2 $0, %%zmm7, %%zmm7, %%zmm15\n\t" + "vmovdqa %%xmm1, %%xmm8\n\t" + "vmovdqu64 %[swapperm], %%zmm14\n\t" + : + : [swapperm] "m" (swap128b_perm), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR32_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + ghash_setup_aggr8_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + + ghash_setup_aggr16_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + } + + ghash_setup_aggr32_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + } + + /* Preload H1-H32. */ + load_h1h4_to_zmm1 (c); + asm volatile ("vpermq %%zmm1, %%zmm14, %%zmm16\n\t" /* H1|H2|H3|H4 */ + "vmovdqa %%xmm8, %%xmm1\n\t" + "vpermq 27*16(%[h_table]), %%zmm14, %%zmm0\n\t" /* H28|H29|H31|H32 */ + "vpermq 23*16(%[h_table]), %%zmm14, %%zmm13\n\t" /* H25|H26|H27|H28 */ + "vpermq 19*16(%[h_table]), %%zmm14, %%zmm12\n\t" /* H21|H22|H23|H24 */ + "vpermq 15*16(%[h_table]), %%zmm14, %%zmm11\n\t" /* H17|H18|H19|H20 */ + "vpermq 11*16(%[h_table]), %%zmm14, %%zmm10\n\t" /* H13|H14|H15|H16 */ + "vpermq 7*16(%[h_table]), %%zmm14, %%zmm9\n\t" /* H9|H10|H11|H12 */ + "vpermq 3*16(%[h_table]), %%zmm14, %%zmm8\n\t" /* H4|H6|H7|H8 */ + : + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 32) + { + gfmul_vpclmul_avx512_aggr32 (buf, c->u_mode.gcm.gcm_table); + + buf += 32 * blocksize; + nblocks -= 32; + } + + asm volatile ("vmovdqa %%xmm15, %%xmm7\n\t" + "vpxorq %%zmm16, %%zmm16, %%zmm16\n\t" + "vpxorq %%zmm17, %%zmm17, %%zmm17\n\t" + "vpxorq %%zmm18, %%zmm18, %%zmm18\n\t" + "vpxorq %%zmm19, %%zmm19, %%zmm19\n\t" + : + : + : "memory" ); } - if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) +#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */ + + if (nblocks >= 16) { - ghash_setup_aggr16_avx2 (c); - } + u64 h1_h2_h15_h16[4*2]; - /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. 
*/ - asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t" - "vmovdqu 0*16(%[h_table]), %%xmm7\n\t" - "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" - "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ - "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ - "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ - "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ - "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ - "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ - "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ - "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */ - "vmovdqu %%ymm0, %[h15_h16]\n\t" - "vmovdqu %%ymm7, %[h1_h2]\n\t" - : [h1_h2] "=m" (h1_h2_h15_h16[0]), - [h15_h16] "=m" (h1_h2_h15_h16[4]) - : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), - [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory" ); + asm volatile ("vinserti128 $1, %%xmm7, %%ymm7, %%ymm15\n\t" + "vmovdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); - while (nblocks >= 16) - { - gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table, - h1_h2_h15_h16); + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + ghash_setup_aggr8_avx2 (c); /* Clobbers registers XMM0-XMM7. */ + + ghash_setup_aggr16_avx2 (c); /* Clobbers registers XMM0-XMM7. */ + } + + /* Preload H1-H16. */ + load_h1h2_to_ymm1 (c); + asm volatile ("vperm2i128 $0x23, %%ymm1, %%ymm1, %%ymm7\n\t" /* H1|H2 */ + "vmovdqa %%xmm8, %%xmm1\n\t" + "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" + "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ + "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ + "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ + "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ + "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ + "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ + "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ + "vmovdqu %%ymm0, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 16) + { + gfmul_vpclmul_avx2_aggr16 (buf, c->u_mode.gcm.gcm_table, + h1_h2_h15_h16); - buf += 16 * blocksize; - nblocks -= 16; + buf += 16 * blocksize; + nblocks -= 16; + } + + asm volatile ("vmovdqu %%ymm15, %[h15_h16]\n\t" + "vmovdqu %%ymm15, %[h1_h2]\n\t" + "vmovdqa %%xmm15, %%xmm7\n\t" + : + [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : + : "memory" ); } - /* Clear used x86-64/XMM registers. 
*/ - asm volatile("vmovdqu %%ymm15, %[h15_h16]\n\t" - "vmovdqu %%ymm15, %[h1_h2]\n\t" - "vzeroupper\n\t" -#ifndef __WIN64__ - "pxor %%xmm8, %%xmm8\n\t" - "pxor %%xmm9, %%xmm9\n\t" - "pxor %%xmm10, %%xmm10\n\t" - "pxor %%xmm11, %%xmm11\n\t" - "pxor %%xmm12, %%xmm12\n\t" - "pxor %%xmm13, %%xmm13\n\t" - "pxor %%xmm14, %%xmm14\n\t" - "pxor %%xmm15, %%xmm15\n\t" -#endif - "movdqa %[be_mask], %%xmm7\n\t" - : [h1_h2] "=m" (h1_h2_h15_h16[0]), - [h15_h16] "=m" (h1_h2_h15_h16[4]) - : [be_mask] "m" (*be_mask) - : "memory" ); + asm volatile ("vzeroupper\n\t" ::: "memory" ); } #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ @@ -1176,22 +1633,18 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, if (nblocks >= 8) { asm volatile ("movdqa %%xmm7, %%xmm15\n\t" + "movdqa %%xmm1, %%xmm8\n\t" ::: "memory" ); if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) - { - asm volatile ("movdqa %%xmm1, %%xmm8\n\t" - ::: "memory" ); - ghash_setup_aggr8 (c); - asm volatile ("movdqa %%xmm8, %%xmm1\n\t" - ::: "memory" ); - } + ghash_setup_aggr8 (c); /* Clobbers registers XMM0-XMM7. */ /* Preload H1. */ - asm volatile ("movdqa %[h_1], %%xmm0\n\t" - : - : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key) - : "memory" ); + asm volatile ("movdqa %%xmm8, %%xmm1\n\t" + "movdqa %[h_1], %%xmm0\n\t" + : + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory" ); while (nblocks >= 8) { @@ -1200,19 +1653,6 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, buf += 8 * blocksize; nblocks -= 8; } - -#ifndef __WIN64__ - /* Clear used x86-64/XMM registers. */ - asm volatile( "pxor %%xmm8, %%xmm8\n\t" - "pxor %%xmm9, %%xmm9\n\t" - "pxor %%xmm10, %%xmm10\n\t" - "pxor %%xmm11, %%xmm11\n\t" - "pxor %%xmm12, %%xmm12\n\t" - "pxor %%xmm13, %%xmm13\n\t" - "pxor %%xmm14, %%xmm14\n\t" - "pxor %%xmm15, %%xmm15\n\t" - ::: "memory" ); -#endif } #endif /* __x86_64__ */ @@ -1256,39 +1696,49 @@ _gcry_ghash_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, : [be_mask] "m" (*be_mask) : "memory" ); -#if defined(__x86_64__) && defined(__WIN64__) /* Clear/restore used registers. */ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "movdqu 0*16(%0), %%xmm6\n\t" - "movdqu 1*16(%0), %%xmm7\n\t" - "movdqu 2*16(%0), %%xmm8\n\t" - "movdqu 3*16(%0), %%xmm9\n\t" - "movdqu 4*16(%0), %%xmm10\n\t" - "movdqu 5*16(%0), %%xmm11\n\t" - "movdqu 6*16(%0), %%xmm12\n\t" - "movdqu 7*16(%0), %%xmm13\n\t" - "movdqu 8*16(%0), %%xmm14\n\t" - "movdqu 9*16(%0), %%xmm15\n\t" - : - : "r" (win64tmp) - : "memory" ); + asm volatile ("pxor %%xmm0, %%xmm0\n\t" + "pxor %%xmm1, %%xmm1\n\t" + "pxor %%xmm2, %%xmm2\n\t" + "pxor %%xmm3, %%xmm3\n\t" + "pxor %%xmm4, %%xmm4\n\t" + "pxor %%xmm5, %%xmm5\n\t" + "pxor %%xmm6, %%xmm6\n\t" + "pxor %%xmm7, %%xmm7\n\t" + : + : + : "memory" ); +#ifdef __x86_64__ +#ifdef __WIN64__ + asm volatile ("movdqu 0*16(%0), %%xmm6\n\t" + "movdqu 1*16(%0), %%xmm7\n\t" + "movdqu 2*16(%0), %%xmm8\n\t" + "movdqu 3*16(%0), %%xmm9\n\t" + "movdqu 4*16(%0), %%xmm10\n\t" + "movdqu 5*16(%0), %%xmm11\n\t" + "movdqu 6*16(%0), %%xmm12\n\t" + "movdqu 7*16(%0), %%xmm13\n\t" + "movdqu 8*16(%0), %%xmm14\n\t" + "movdqu 9*16(%0), %%xmm15\n\t" + : + : "r" (win64tmp) + : "memory" ); #else /* Clear used registers. 
*/ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "pxor %%xmm6, %%xmm6\n\t" - "pxor %%xmm7, %%xmm7\n\t" - ::: "memory" ); -#endif + asm volatile ( + "pxor %%xmm8, %%xmm8\n\t" + "pxor %%xmm9, %%xmm9\n\t" + "pxor %%xmm10, %%xmm10\n\t" + "pxor %%xmm11, %%xmm11\n\t" + "pxor %%xmm12, %%xmm12\n\t" + "pxor %%xmm13, %%xmm13\n\t" + "pxor %%xmm14, %%xmm14\n\t" + "pxor %%xmm15, %%xmm15\n\t" + : + : + : "memory" ); +#endif /* __WIN64__ */ +#endif /* __x86_64__ */ return 0; } @@ -1335,90 +1785,142 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, #if defined(GCM_USE_INTEL_VPCLMUL_AVX2) if (nblocks >= 16 - && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2)) + && ((c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX2) + || (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512))) { - u64 h1_h2_h15_h16[4*2]; - - asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t" - ::: "memory" ); - - if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) +#if defined(GCM_USE_INTEL_VPCLMUL_AVX512) + if (nblocks >= 32 + && (c->u_mode.gcm.hw_impl_flags & GCM_INTEL_USE_VPCLMUL_AVX512)) { - ghash_setup_aggr8_avx2 (c); + asm volatile ("vpopcntb %%zmm7, %%zmm15\n\t" /* spec stop for old AVX512 CPUs */ + "vmovdqa %%xmm1, %%xmm8\n\t" + "vmovdqu64 %[swapperm], %%zmm14\n\t" + : + : [swapperm] "m" (swap128b_perm), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR32_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + ghash_setup_aggr8_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + + ghash_setup_aggr16_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + } + + ghash_setup_aggr32_avx512 (c); /* Clobbers registers XMM0-XMM7. */ + } + + /* Preload H1-H32. */ + load_h1h4_to_zmm1 (c); + asm volatile ("vpermq %%zmm1, %%zmm14, %%zmm16\n\t" /* H1|H2|H3|H4 */ + "vmovdqa %%xmm8, %%xmm1\n\t" + "vpermq 27*16(%[h_table]), %%zmm14, %%zmm0\n\t" /* H28|H29|H31|H32 */ + "vpermq 23*16(%[h_table]), %%zmm14, %%zmm13\n\t" /* H25|H26|H27|H28 */ + "vpermq 19*16(%[h_table]), %%zmm14, %%zmm12\n\t" /* H21|H22|H23|H24 */ + "vpermq 15*16(%[h_table]), %%zmm14, %%zmm11\n\t" /* H17|H18|H19|H20 */ + "vpermq 11*16(%[h_table]), %%zmm14, %%zmm10\n\t" /* H13|H14|H15|H16 */ + "vpermq 7*16(%[h_table]), %%zmm14, %%zmm9\n\t" /* H9|H10|H11|H12 */ + "vpermq 3*16(%[h_table]), %%zmm14, %%zmm8\n\t" /* H4|H6|H7|H8 */ + : + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 32) + { + gfmul_vpclmul_avx512_aggr32_le (buf, c->u_mode.gcm.gcm_table); + + buf += 32 * blocksize; + nblocks -= 32; + } + + asm volatile ("vpxor %%xmm7, %%xmm7, %%xmm7\n\t" + "vpxorq %%zmm16, %%zmm16, %%zmm16\n\t" + "vpxorq %%zmm17, %%zmm17, %%zmm17\n\t" + "vpxorq %%zmm18, %%zmm18, %%zmm18\n\t" + "vpxorq %%zmm19, %%zmm19, %%zmm19\n\t" + : + : + : "memory" ); } - if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) - { - ghash_setup_aggr16_avx2 (c); - } - - /* Preload H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12. 
*/ - asm volatile ("vmovdqa %%xmm8, %%xmm1\n\t" - "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" - "vmovdqu 0*16(%[h_table]), %%xmm7\n\t" - "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ - "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ - "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ - "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ - "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ - "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ - "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ - "vinserti128 $1, %[h_1], %%ymm7, %%ymm7\n\t" /* H1|H2 */ - "vmovdqu %%ymm0, %[h15_h16]\n\t" - "vmovdqu %%ymm7, %[h1_h2]\n\t" - : [h1_h2] "=m" (h1_h2_h15_h16[0]), - [h15_h16] "=m" (h1_h2_h15_h16[4]) - : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), - [h_table] "r" (c->u_mode.gcm.gcm_table) - : "memory" ); +#endif - while (nblocks >= 16) + if (nblocks >= 16) { - gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table, - h1_h2_h15_h16); + u64 h1_h2_h15_h16[4*2]; + + asm volatile ("vmovdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); - buf += 16 * blocksize; - nblocks -= 16; + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR16_TABLE_INITIALIZED)) + { + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) + ghash_setup_aggr8_avx2 (c); /* Clobbers registers XMM0-XMM7. */ + + ghash_setup_aggr16_avx2 (c); /* Clobbers registers XMM0-XMM7. */ + } + + /* Preload H1-H16. */ + load_h1h2_to_ymm1 (c); + asm volatile ("vperm2i128 $0x23, %%ymm1, %%ymm1, %%ymm7\n\t" /* H1|H2 */ + "vmovdqa %%xmm8, %%xmm1\n\t" + "vpxor %%xmm8, %%xmm8, %%xmm8\n\t" + "vperm2i128 $0x23, 13*16(%[h_table]), %%ymm8, %%ymm0\n\t" /* H15|H16 */ + "vperm2i128 $0x23, 11*16(%[h_table]), %%ymm8, %%ymm13\n\t" /* H13|H14 */ + "vperm2i128 $0x23, 9*16(%[h_table]), %%ymm8, %%ymm12\n\t" /* H11|H12 */ + "vperm2i128 $0x23, 7*16(%[h_table]), %%ymm8, %%ymm11\n\t" /* H9|H10 */ + "vperm2i128 $0x23, 5*16(%[h_table]), %%ymm8, %%ymm10\n\t" /* H7|H8 */ + "vperm2i128 $0x23, 3*16(%[h_table]), %%ymm8, %%ymm9\n\t" /* H5|H6 */ + "vperm2i128 $0x23, 1*16(%[h_table]), %%ymm8, %%ymm8\n\t" /* H3|H4 */ + "vmovdqu %%ymm0, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key), + [h_table] "r" (c->u_mode.gcm.gcm_table) + : "memory" ); + + while (nblocks >= 16) + { + gfmul_vpclmul_avx2_aggr16_le (buf, c->u_mode.gcm.gcm_table, + h1_h2_h15_h16); + + buf += 16 * blocksize; + nblocks -= 16; + } + + asm volatile ("vpxor %%xmm7, %%xmm7, %%xmm7\n\t" + "vmovdqu %%ymm7, %[h15_h16]\n\t" + "vmovdqu %%ymm7, %[h1_h2]\n\t" + : [h1_h2] "=m" (h1_h2_h15_h16[0]), + [h15_h16] "=m" (h1_h2_h15_h16[4]) + : + : "memory" ); } - /* Clear used x86-64/XMM registers. 
*/ - asm volatile("vpxor %%xmm7, %%xmm7, %%xmm7\n\t" - "vmovdqu %%ymm7, %[h15_h16]\n\t" - "vmovdqu %%ymm7, %[h1_h2]\n\t" - "vzeroupper\n\t" -#ifndef __WIN64__ - "pxor %%xmm8, %%xmm8\n\t" - "pxor %%xmm9, %%xmm9\n\t" - "pxor %%xmm10, %%xmm10\n\t" - "pxor %%xmm11, %%xmm11\n\t" - "pxor %%xmm12, %%xmm12\n\t" - "pxor %%xmm13, %%xmm13\n\t" - "pxor %%xmm14, %%xmm14\n\t" -#endif - : [h1_h2] "=m" (h1_h2_h15_h16[0]), - [h15_h16] "=m" (h1_h2_h15_h16[4]) - : - : "memory" ); + asm volatile ("vzeroupper\n\t" ::: "memory" ); } #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ #ifdef __x86_64__ if (nblocks >= 8) { + asm volatile ("movdqa %%xmm1, %%xmm8\n\t" + ::: "memory" ); + if (!(c->u_mode.gcm.hw_impl_flags & GCM_INTEL_AGGR8_TABLE_INITIALIZED)) - { - asm volatile ("movdqa %%xmm1, %%xmm8\n\t" - ::: "memory" ); - ghash_setup_aggr8 (c); - asm volatile ("movdqa %%xmm8, %%xmm1\n\t" - ::: "memory" ); - } + ghash_setup_aggr8 (c); /* Clobbers registers XMM0-XMM7. */ /* Preload H1. */ - asm volatile ("pxor %%xmm15, %%xmm15\n\t" - "movdqa %[h_1], %%xmm0\n\t" - : - : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key) - : "memory" ); + asm volatile ("movdqa %%xmm8, %%xmm1\n\t" + "pxor %%xmm15, %%xmm15\n\t" + "movdqa %[h_1], %%xmm0\n\t" + : + : [h_1] "m" (*c->u_mode.gcm.u_ghash_key.key) + : "memory" ); while (nblocks >= 8) { @@ -1427,18 +1929,6 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, buf += 8 * blocksize; nblocks -= 8; } -#ifndef __WIN64__ - /* Clear used x86-64/XMM registers. */ - asm volatile( "pxor %%xmm8, %%xmm8\n\t" - "pxor %%xmm9, %%xmm9\n\t" - "pxor %%xmm10, %%xmm10\n\t" - "pxor %%xmm11, %%xmm11\n\t" - "pxor %%xmm12, %%xmm12\n\t" - "pxor %%xmm13, %%xmm13\n\t" - "pxor %%xmm14, %%xmm14\n\t" - "pxor %%xmm15, %%xmm15\n\t" - ::: "memory" ); -#endif } #endif @@ -1481,39 +1971,49 @@ _gcry_polyval_intel_pclmul (gcry_cipher_hd_t c, byte *result, const byte *buf, : [be_mask] "m" (*be_mask) : "memory" ); -#if defined(__x86_64__) && defined(__WIN64__) /* Clear/restore used registers. */ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "movdqu 0*16(%0), %%xmm6\n\t" - "movdqu 1*16(%0), %%xmm7\n\t" - "movdqu 2*16(%0), %%xmm8\n\t" - "movdqu 3*16(%0), %%xmm9\n\t" - "movdqu 4*16(%0), %%xmm10\n\t" - "movdqu 5*16(%0), %%xmm11\n\t" - "movdqu 6*16(%0), %%xmm12\n\t" - "movdqu 7*16(%0), %%xmm13\n\t" - "movdqu 8*16(%0), %%xmm14\n\t" - "movdqu 9*16(%0), %%xmm15\n\t" - : - : "r" (win64tmp) - : "memory" ); + asm volatile ("pxor %%xmm0, %%xmm0\n\t" + "pxor %%xmm1, %%xmm1\n\t" + "pxor %%xmm2, %%xmm2\n\t" + "pxor %%xmm3, %%xmm3\n\t" + "pxor %%xmm4, %%xmm4\n\t" + "pxor %%xmm5, %%xmm5\n\t" + "pxor %%xmm6, %%xmm6\n\t" + "pxor %%xmm7, %%xmm7\n\t" + : + : + : "memory" ); +#ifdef __x86_64__ +#ifdef __WIN64__ + asm volatile ("movdqu 0*16(%0), %%xmm6\n\t" + "movdqu 1*16(%0), %%xmm7\n\t" + "movdqu 2*16(%0), %%xmm8\n\t" + "movdqu 3*16(%0), %%xmm9\n\t" + "movdqu 4*16(%0), %%xmm10\n\t" + "movdqu 5*16(%0), %%xmm11\n\t" + "movdqu 6*16(%0), %%xmm12\n\t" + "movdqu 7*16(%0), %%xmm13\n\t" + "movdqu 8*16(%0), %%xmm14\n\t" + "movdqu 9*16(%0), %%xmm15\n\t" + : + : "r" (win64tmp) + : "memory" ); #else /* Clear used registers. 
*/ - asm volatile( "pxor %%xmm0, %%xmm0\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - "pxor %%xmm4, %%xmm4\n\t" - "pxor %%xmm5, %%xmm5\n\t" - "pxor %%xmm6, %%xmm6\n\t" - "pxor %%xmm7, %%xmm7\n\t" - ::: "memory" ); -#endif + asm volatile ( + "pxor %%xmm8, %%xmm8\n\t" + "pxor %%xmm9, %%xmm9\n\t" + "pxor %%xmm10, %%xmm10\n\t" + "pxor %%xmm11, %%xmm11\n\t" + "pxor %%xmm12, %%xmm12\n\t" + "pxor %%xmm13, %%xmm13\n\t" + "pxor %%xmm14, %%xmm14\n\t" + "pxor %%xmm15, %%xmm15\n\t" + : + : + : "memory" ); +#endif /* __WIN64__ */ +#endif /* __x86_64__ */ return 0; } diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index e31ac860..e1ff0437 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -80,6 +80,14 @@ # define GCM_USE_INTEL_VPCLMUL_AVX2 1 #endif /* GCM_USE_INTEL_VPCLMUL_AVX2 */ +/* GCM_USE_INTEL_VPCLMUL_AVX512 indicates whether to compile GCM with Intel + VPCLMUL/AVX512 code. */ +#undef GCM_USE_INTEL_VPCLMUL_AVX512 +#if defined(__x86_64__) && defined(GCM_USE_INTEL_VPCLMUL_AVX2) && \ + defined(ENABLE_AVX512_SUPPORT) && defined(HAVE_GCC_INLINE_ASM_AVX512) +# define GCM_USE_INTEL_VPCLMUL_AVX512 1 +#endif /* GCM_USE_INTEL_VPCLMUL_AVX512 */ + /* GCM_USE_ARM_PMULL indicates whether to compile GCM with ARMv8 PMULL code. */ #undef GCM_USE_ARM_PMULL #if defined(ENABLE_ARM_CRYPTO_SUPPORT) && defined(GCM_USE_TABLES) -- 2.32.0 From jjelen at redhat.com Mon Mar 7 11:38:27 2022 From: jjelen at redhat.com (Jakub Jelen) Date: Mon, 7 Mar 2022 11:38:27 +0100 Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512" In-Reply-To: <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com> References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> <20220306171910.1011180-2-jussi.kivilinna@iki.fi> <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com> Message-ID: <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com> On 3/7/22 11:04, Jakub Jelen wrote: > On 3/6/22 18:19, Jussi Kivilinna wrote: >> [...] >> diff --git a/src/hwfeatures.c b/src/hwfeatures.c >> index 7060d995..8e92cbdd 100644 >> --- a/src/hwfeatures.c >> +++ b/src/hwfeatures.c >> @@ -62,6 +62,7 @@ static struct >> ????? { HWF_INTEL_RDTSC,???????? "intel-rdtsc" }, >> ????? { HWF_INTEL_SHAEXT,??????? "intel-shaext" }, >> ????? { HWF_INTEL_VAES_VPCLMUL,? "intel-vaes-vpclmul" }, >> +??? { HWF_INTEL_AVX512,??????? "intel-avx512" }, >> ? #elif defined(HAVE_CPU_ARCH_ARM) >> ????? { HWF_ARM_NEON,??????????? "arm-neon" }, >> ????? { HWF_ARM_AES,???????????? "arm-aes" }, > > Hi, > can you make sure to update also the doc/gcrypt.texi with the new > hwfeatures value so we do not have to add it retrospectively? > > https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/blob/master/doc/gcrypt.texi#L574 > > > Thanks, Trying again with the gcrypt-devel(at)gnupg.org (without the lists. which still shows as a reply address when I use Thunderbird (probably because of the List-Post header still pointing to this email). Regards, -- Jakub Jelen Crypto Team, Security Engineering Red Hat, Inc. 
From jussi.kivilinna at iki.fi Mon Mar 7 17:59:39 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 7 Mar 2022 18:59:39 +0200 Subject: [PATCH 2/3] Add detection for HW feature "intel-avx512" In-Reply-To: <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com> References: <20220306171910.1011180-1-jussi.kivilinna@iki.fi> <20220306171910.1011180-2-jussi.kivilinna@iki.fi> <62982660-0afb-b9ba-e14b-4536352694a3@redhat.com> <35dcc3b3-9dd6-2d55-866d-322688c9545f@redhat.com> Message-ID: Hello, On 7.3.2022 12.38, Jakub Jelen via Gcrypt-devel wrote: > On 3/7/22 11:04, Jakub Jelen wrote: >> On 3/6/22 18:19, Jussi Kivilinna wrote: >>> [...] >>> diff --git a/src/hwfeatures.c b/src/hwfeatures.c >>> index 7060d995..8e92cbdd 100644 >>> --- a/src/hwfeatures.c >>> +++ b/src/hwfeatures.c >>> @@ -62,6 +62,7 @@ static struct >>> ????? { HWF_INTEL_RDTSC,???????? "intel-rdtsc" }, >>> ????? { HWF_INTEL_SHAEXT,??????? "intel-shaext" }, >>> ????? { HWF_INTEL_VAES_VPCLMUL,? "intel-vaes-vpclmul" }, >>> +??? { HWF_INTEL_AVX512,??????? "intel-avx512" }, >>> ? #elif defined(HAVE_CPU_ARCH_ARM) >>> ????? { HWF_ARM_NEON,??????????? "arm-neon" }, >>> ????? { HWF_ARM_AES,???????????? "arm-aes" }, >> >> Hi, >> can you make sure to update also the doc/gcrypt.texi with the new hwfeatures value so we do not have to add it retrospectively? >> >> https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/blob/master/doc/gcrypt.texi#L574 >> Thanks for reminder. I'll add intel-avx512 to the HW features list. -Jussi From jussi.kivilinna at iki.fi Thu Mar 10 21:26:49 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 10 Mar 2022 22:26:49 +0200 Subject: [PATCH] SHA512: Add AVX512 implementation Message-ID: <20220310202649.1673477-1-jussi.kivilinna@iki.fi> * LICENSES: Add 'cipher/sha512-avx512-amd64.S'. * cipher/Makefile.am: Add 'sha512-avx512-amd64.S'. * cipher/sha512-avx512-amd64.S: New. * cipher/sha512.c (USE_AVX512): New. (do_sha512_transform_amd64_ssse3, do_sha512_transform_amd64_avx) (do_sha512_transform_amd64_avx2): Add ASM_EXTRA_STACK to return value only if assembly routine returned non-zero value. [USE_AVX512] (_gcry_sha512_transform_amd64_avx512) (do_sha512_transform_amd64_avx512): New. (sha512_init_common) [USE_AVX512]: Use AVX512 implementation if HW feature supported. --- Benchmark on Intel Core i3-1115G4 (tigerlake): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz SHA512 | 1.51 ns/B 631.6 MiB/s 6.17 c/B 4089 After (~29% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz SHA512 | 1.16 ns/B 819.0 MiB/s 4.76 c/B 4090 GnuPG-bug-id: T4460 Signed-off-by: Jussi Kivilinna --- LICENSES | 1 + cipher/Makefile.am | 2 +- cipher/sha512-avx512-amd64.S | 461 +++++++++++++++++++++++++++++++++++ cipher/sha512.c | 52 +++- configure.ac | 1 + 5 files changed, 509 insertions(+), 8 deletions(-) create mode 100644 cipher/sha512-avx512-amd64.S diff --git a/LICENSES b/LICENSES index 8be7fb24..94499501 100644 --- a/LICENSES +++ b/LICENSES @@ -19,6 +19,7 @@ with any binary distributions derived from the GNU C Library. 
- cipher/sha512-avx2-bmi2-amd64.S - cipher/sha512-ssse3-amd64.S - cipher/sha512-ssse3-i386.c + - cipher/sha512-avx512-amd64.S #+begin_quote Copyright (c) 2012, Intel Corporation diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 3339c463..6eec9bd8 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -127,7 +127,7 @@ EXTRA_libcipher_la_SOURCES = \ sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ sha256-intel-shaext.c sha256-ppc.c \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S \ - sha512-avx2-bmi2-amd64.S \ + sha512-avx2-bmi2-amd64.S sha512-avx512-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ sha512-ppc.c sha512-ssse3-i386.c \ sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S \ diff --git a/cipher/sha512-avx512-amd64.S b/cipher/sha512-avx512-amd64.S new file mode 100644 index 00000000..317f3e5c --- /dev/null +++ b/cipher/sha512-avx512-amd64.S @@ -0,0 +1,461 @@ +/* sha512-avx512-amd64.c - amd64/AVX512 implementation of SHA-512 transform + * Copyright (C) 2022 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ +/* + * Based on implementation from file "sha512-avx2-bmi2-amd64.S": +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +; Copyright (c) 2012, Intel Corporation +; +; All rights reserved. +; +; Redistribution and use in source and binary forms, with or without +; modification, are permitted provided that the following conditions are +; met: +; +; * Redistributions of source code must retain the above copyright +; notice, this list of conditions and the following disclaimer. +; +; * Redistributions in binary form must reproduce the above copyright +; notice, this list of conditions and the following disclaimer in the +; documentation and/or other materials provided with the +; distribution. +; +; * Neither the name of the Intel Corporation nor the names of its +; contributors may be used to endorse or promote products derived from +; this software without specific prior written permission. +; +; +; THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION "AS IS" AND ANY +; EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +; IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +; PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR +; CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +; EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +; PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +; PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +; LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +; NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +; SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +; This code schedules 1 blocks at a time, with 4 lanes per block +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +*/ + +#ifdef __x86_64 +#include +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AVX512) && \ + defined(USE_SHA512) + +#include "asm-common-amd64.h" + +.intel_syntax noprefix + +.text + +/* Virtual Registers */ +#define Y_0 ymm0 +#define Y_1 ymm1 +#define Y_2 ymm2 +#define Y_3 ymm3 + +#define YTMP0 ymm4 +#define YTMP1 ymm5 +#define YTMP2 ymm6 +#define YTMP3 ymm7 +#define YTMP4 ymm8 +#define XFER YTMP0 + +#define BYTE_FLIP_MASK ymm9 +#define PERM_VPALIGNR_8 ymm10 + +#define MASK_DC_00 k1 + +#define INP rdi /* 1st arg */ +#define CTX rsi /* 2nd arg */ +#define NUM_BLKS rdx /* 3rd arg */ +#define SRND r8d +#define RSP_SAVE r9 + +#define TBL rcx + +#define a xmm11 +#define b xmm12 +#define c xmm13 +#define d xmm14 +#define e xmm15 +#define f xmm16 +#define g xmm17 +#define h xmm18 + +#define y0 xmm19 +#define y1 xmm20 +#define y2 xmm21 +#define y3 xmm22 + +/* Local variables (stack frame) */ +#define frame_XFER 0 +#define frame_XFER_size (4*4*8) +#define frame_size (frame_XFER + frame_XFER_size) + +#define clear_reg(x) vpxorq x,x,x + +/* addm [mem], reg */ +/* Add reg to mem using reg-mem add and store */ +#define addm(p1, p2) \ + vmovq y0, p1; \ + vpaddq p2, p2, y0; \ + vmovq p1, p2; + +/* COPY_YMM_AND_BSWAP ymm, [mem], byte_flip_mask */ +/* Load ymm with mem and byte swap each dword */ +#define COPY_YMM_AND_BSWAP(p1, p2, p3) \ + vmovdqu p1, p2; \ + vpshufb p1, p1, p3 + +/* %macro MY_VPALIGNR YDST, YSRC1, YSRC2, RVAL */ +/* YDST = {YSRC1, YSRC2} >> RVAL*8 */ +#define MY_VPALIGNR(YDST_SRC1, YSRC2, RVAL) \ + vpermt2q YDST_SRC1, PERM_VPALIGNR_##RVAL, YSRC2; + +#define ONE_ROUND_PART1(XFERIN, a, b, c, d, e, f, g, h) \ + /* h += Sum1 (e) + Ch (e, f, g) + (k[t] + w[0]); \ + * d += h; \ + * h += Sum0 (a) + Maj (a, b, c); \ + * \ + * Ch(x, y, z) => ((x & y) + (~x & z)) \ + * Maj(x, y, z) => ((x & y) + (z & (x ^ y))) \ + */ \ + \ + vmovq y3, [XFERIN]; \ + vmovdqa64 y2, e; \ + vpaddq h, h, y3; \ + vprorq y0, e, 41; \ + vpternlogq y2, f, g, 0xca; /* Ch (e, f, g) */ \ + vprorq y1, e, 18; \ + vprorq y3, e, 14; \ + vpaddq h, h, y2; \ + vpternlogq y0, y1, y3, 0x96; /* Sum1 (e) */ \ + vpaddq h, h, y0; /* h += Sum1 (e) + Ch (e, f, g) + (k[t] + w[0]) */ \ + vpaddq d, d, h; /* d += h */ + +#define ONE_ROUND_PART2(a, b, c, d, e, f, g, h) \ + vmovdqa64 y1, a; \ + vprorq y0, a, 39; \ + vpternlogq y1, b, c, 0xe8; /* Maj (a, b, c) */ \ + vprorq y2, a, 34; \ + vprorq y3, a, 28; \ + vpternlogq y0, y2, y3, 0x96; /* Sum0 (a) */ \ + vpaddq h, h, y1; \ + vpaddq h, h, y0; /* h += Sum0 (a) + Maj (a, b, c) */ + +#define FOUR_ROUNDS_AND_SCHED(X, Y_0, Y_1, Y_2, Y_3, a, b, c, d, e, f, g, h) \ + /*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 0 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \ + vmovdqa YTMP0, Y_3; \ + vmovdqa YTMP1, Y_1; \ + /* Extract w[t-7] */; \ + vpermt2q YTMP0, PERM_VPALIGNR_8, Y_2 /* YTMP0 = W[-7] */; \ + /* Calculate w[t-16] + w[t-7] */; \ + vpaddq YTMP0, YTMP0, Y_0 /* YTMP0 = W[-7] + W[-16] */; \ + /* Extract w[t-15] */; \ + vpermt2q YTMP1, PERM_VPALIGNR_8, Y_0 /* YTMP1 = W[-15] */; \ + ONE_ROUND_PART1(rsp+frame_XFER+0*8+X*32, a, b, c, d, e, f, g, h); \ + \ + /* Calculate sigma0 */; \ + \ + /* Calculate w[t-15] ror 1 */; \ + vprorq YTMP3, YTMP1, 1; /* YTMP3 = 
W[-15] ror 1 */; \ + /* Calculate w[t-15] shr 7 */; \ + vpsrlq YTMP4, YTMP1, 7 /* YTMP4 = W[-15] >> 7 */; \ + \ + ONE_ROUND_PART2(a, b, c, d, e, f, g, h); \ + \ + /*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 1 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \ + /* Calculate w[t-15] ror 8 */; \ + vprorq YTMP1, YTMP1, 8 /* YTMP1 = W[-15] ror 8 */; \ + /* XOR the three components */; \ + vpternlogq YTMP1, YTMP3, YTMP4, 0x96 /* YTMP1 = s0 = W[-15] ror 1 ^ W[-15] >> 7 ^ W[-15] ror 8 */; \ + \ + /* Add three components, w[t-16], w[t-7] and sigma0 */; \ + vpaddq YTMP0, YTMP0, YTMP1 /* YTMP0 = W[-16] + W[-7] + s0 */; \ + ONE_ROUND_PART1(rsp+frame_XFER+1*8+X*32, h, a, b, c, d, e, f, g); \ + /* Move to appropriate lanes for calculating w[16] and w[17] */; \ + vshufi64x2 Y_0, YTMP0, YTMP0, 0x0 /* Y_0 = W[-16] + W[-7] + s0 {BABA} */; \ + \ + /* Calculate w[16] and w[17] in both 128 bit lanes */; \ + \ + /* Calculate sigma1 for w[16] and w[17] on both 128 bit lanes */; \ + vshufi64x2 YTMP2, Y_3, Y_3, 0b11 /* YTMP2 = W[-2] {BABA} */; \ + vpsrlq YTMP4, YTMP2, 6 /* YTMP4 = W[-2] >> 6 {BABA} */; \ + \ + ONE_ROUND_PART2(h, a, b, c, d, e, f, g); \ + \ + /*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 2 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \ + vprorq YTMP3, YTMP2, 19 /* YTMP3 = W[-2] ror 19 {BABA} */; \ + vprorq YTMP1, YTMP2, 61 /* YTMP3 = W[-2] ror 61 {BABA} */; \ + vpternlogq YTMP4, YTMP3, YTMP1, 0x96 /* YTMP4 = s1 = (W[-2] ror 19) ^ (W[-2] ror 61) ^ (W[-2] >> 6) {BABA} */; \ + \ + ONE_ROUND_PART1(rsp+frame_XFER+2*8+X*32, g, h, a, b, c, d, e, f); \ + /* Add sigma1 to the other compunents to get w[16] and w[17] */; \ + vpaddq Y_0, Y_0, YTMP4 /* Y_0 = {W[1], W[0], W[1], W[0]} */; \ + \ + /* Calculate sigma1 for w[18] and w[19] for upper 128 bit lane */; \ + vpsrlq YTMP4, Y_0, 6 /* YTMP4 = W[-2] >> 6 {DC--} */; \ + \ + ONE_ROUND_PART2(g, h, a, b, c, d, e, f); \ + \ + /*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; RND N + 3 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */; \ + vprorq YTMP3, Y_0, 19 /* YTMP3 = W[-2] ror 19 {DC--} */; \ + vprorq YTMP1, Y_0, 61 /* YTMP1 = W[-2] ror 61 {DC--} */; \ + vpternlogq YTMP4, YTMP3, YTMP1, 0x96 /* YTMP4 = s1 = (W[-2] ror 19) ^ (W[-2] ror 61) ^ (W[-2] >> 6) {DC--} */; \ + \ + ONE_ROUND_PART1(rsp+frame_XFER+3*8+X*32, f, g, h, a, b, c, d, e); \ + /* Add the sigma0 + w[t-7] + w[t-16] for w[18] and w[19] to newly calculated sigma1 to get w[18] and w[19] */; \ + /* Form w[19, w[18], w17], w[16] */; \ + vpaddq Y_0{MASK_DC_00}, YTMP0, YTMP4 /* YTMP2 = {W[3], W[2], W[1], W[0]} */; \ + \ + vpaddq XFER, Y_0, [TBL + (4+X)*32]; \ + vmovdqa [rsp + frame_XFER + X*32], XFER; \ + ONE_ROUND_PART2(f, g, h, a, b, c, d, e) + +#define ONE_ROUND(XFERIN, a, b, c, d, e, f, g, h) \ + ONE_ROUND_PART1(XFERIN, a, b, c, d, e, f, g, h); \ + ONE_ROUND_PART2(a, b, c, d, e, f, g, h) + +#define DO_4ROUNDS(X, a, b, c, d, e, f, g, h) \ + ONE_ROUND(rsp+frame_XFER+0*8+X*32, a, b, c, d, e, f, g, h); \ + ONE_ROUND(rsp+frame_XFER+1*8+X*32, h, a, b, c, d, e, f, g); \ + ONE_ROUND(rsp+frame_XFER+2*8+X*32, g, h, a, b, c, d, e, f); \ + ONE_ROUND(rsp+frame_XFER+3*8+X*32, f, g, h, a, b, c, d, e) + +/* +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +; void sha512_avx512(const void* M, void* D, uint64_t L); +; Purpose: Updates the SHA512 digest stored at D with the message stored in M. +; The size of the message pointed to by M must be an integer multiple of SHA512 +; message blocks. 
+; L is the message length in SHA512 blocks +*/ +.globl _gcry_sha512_transform_amd64_avx512 +ELF(.type _gcry_sha512_transform_amd64_avx512, at function;) +.align 16 +_gcry_sha512_transform_amd64_avx512: + CFI_STARTPROC() + xor eax, eax + + cmp rdx, 0 + je .Lnowork + + /* Setup mask register for DC:BA merging. */ + mov eax, 0b1100 + kmovd MASK_DC_00, eax + + /* Allocate Stack Space */ + mov RSP_SAVE, rsp + CFI_DEF_CFA_REGISTER(RSP_SAVE); + sub rsp, frame_size + and rsp, ~(0x40 - 1) + + /*; load initial digest */ + vmovq a,[8*0 + CTX] + vmovq b,[8*1 + CTX] + vmovq c,[8*2 + CTX] + vmovq d,[8*3 + CTX] + vmovq e,[8*4 + CTX] + vmovq f,[8*5 + CTX] + vmovq g,[8*6 + CTX] + vmovq h,[8*7 + CTX] + + vmovdqa BYTE_FLIP_MASK, [.LPSHUFFLE_BYTE_FLIP_MASK ADD_RIP] + vpmovzxbq PERM_VPALIGNR_8, [.LPERM_VPALIGNR_8 ADD_RIP] + + lea TBL,[.LK512 ADD_RIP] + + /*; byte swap first 16 dwords */ + COPY_YMM_AND_BSWAP(Y_0, [INP + 0*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_1, [INP + 1*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_2, [INP + 2*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_3, [INP + 3*32], BYTE_FLIP_MASK) + + lea INP, [INP + 128] + + vpaddq XFER, Y_0, [TBL + 0*32] + vmovdqa [rsp + frame_XFER + 0*32], XFER + vpaddq XFER, Y_1, [TBL + 1*32] + vmovdqa [rsp + frame_XFER + 1*32], XFER + vpaddq XFER, Y_2, [TBL + 2*32] + vmovdqa [rsp + frame_XFER + 2*32], XFER + vpaddq XFER, Y_3, [TBL + 3*32] + vmovdqa [rsp + frame_XFER + 3*32], XFER + + /*; schedule 64 input dwords, by doing 12 rounds of 4 each */ + mov SRND, 4 + +.align 16 +.Loop0: + FOUR_ROUNDS_AND_SCHED(0, Y_0, Y_1, Y_2, Y_3, a, b, c, d, e, f, g, h) + FOUR_ROUNDS_AND_SCHED(1, Y_1, Y_2, Y_3, Y_0, e, f, g, h, a, b, c, d) + FOUR_ROUNDS_AND_SCHED(2, Y_2, Y_3, Y_0, Y_1, a, b, c, d, e, f, g, h) + FOUR_ROUNDS_AND_SCHED(3, Y_3, Y_0, Y_1, Y_2, e, f, g, h, a, b, c, d) + lea TBL, [TBL + 4*32] + + sub SRND, 1 + jne .Loop0 + + sub NUM_BLKS, 1 + je .Ldone_hash + + lea TBL, [.LK512 ADD_RIP] + + /* load next block and byte swap */ + COPY_YMM_AND_BSWAP(Y_0, [INP + 0*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_1, [INP + 1*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_2, [INP + 2*32], BYTE_FLIP_MASK) + COPY_YMM_AND_BSWAP(Y_3, [INP + 3*32], BYTE_FLIP_MASK) + + lea INP, [INP + 128] + + DO_4ROUNDS(0, a, b, c, d, e, f, g, h) + vpaddq XFER, Y_0, [TBL + 0*32] + vmovdqa [rsp + frame_XFER + 0*32], XFER + DO_4ROUNDS(1, e, f, g, h, a, b, c, d) + vpaddq XFER, Y_1, [TBL + 1*32] + vmovdqa [rsp + frame_XFER + 1*32], XFER + DO_4ROUNDS(2, a, b, c, d, e, f, g, h) + vpaddq XFER, Y_2, [TBL + 2*32] + vmovdqa [rsp + frame_XFER + 2*32], XFER + DO_4ROUNDS(3, e, f, g, h, a, b, c, d) + vpaddq XFER, Y_3, [TBL + 3*32] + vmovdqa [rsp + frame_XFER + 3*32], XFER + + addm([8*0 + CTX],a) + addm([8*1 + CTX],b) + addm([8*2 + CTX],c) + addm([8*3 + CTX],d) + addm([8*4 + CTX],e) + addm([8*5 + CTX],f) + addm([8*6 + CTX],g) + addm([8*7 + CTX],h) + + /*; schedule 64 input dwords, by doing 12 rounds of 4 each */ + mov SRND, 4 + + jmp .Loop0 + +.Ldone_hash: + DO_4ROUNDS(0, a, b, c, d, e, f, g, h) + DO_4ROUNDS(1, e, f, g, h, a, b, c, d) + DO_4ROUNDS(2, a, b, c, d, e, f, g, h) + DO_4ROUNDS(3, e, f, g, h, a, b, c, d) + + addm([8*0 + CTX],a) + xor eax, eax /* burn stack */ + addm([8*1 + CTX],b) + addm([8*2 + CTX],c) + addm([8*3 + CTX],d) + addm([8*4 + CTX],e) + addm([8*5 + CTX],f) + addm([8*6 + CTX],g) + addm([8*7 + CTX],h) + kmovd MASK_DC_00, eax + + vzeroall + vmovdqa [rsp + frame_XFER + 0*32], ymm0 /* burn stack */ + vmovdqa [rsp + frame_XFER + 1*32], ymm0 /* burn stack */ + vmovdqa [rsp + frame_XFER + 2*32], ymm0 /* burn 
stack */ + vmovdqa [rsp + frame_XFER + 3*32], ymm0 /* burn stack */ + clear_reg(%xmm16); + clear_reg(%xmm17); + clear_reg(%xmm18); + clear_reg(%xmm19); + clear_reg(%xmm20); + clear_reg(%xmm21); + clear_reg(%xmm22); + + /* Restore Stack Pointer */ + mov rsp, RSP_SAVE + CFI_DEF_CFA_REGISTER(rsp) + +.Lnowork: + ret_spec_stop + CFI_ENDPROC() +ELF(.size _gcry_sha512_transform_amd64_avx512,.-_gcry_sha512_transform_amd64_avx512) + +/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; */ +/*;; Binary Data */ + +ELF(.type _gcry_sha512_avx512_consts, at object) +_gcry_sha512_avx512_consts: +.align 64 +/* K[t] used in SHA512 hashing */ +.LK512: + .quad 0x428a2f98d728ae22,0x7137449123ef65cd + .quad 0xb5c0fbcfec4d3b2f,0xe9b5dba58189dbbc + .quad 0x3956c25bf348b538,0x59f111f1b605d019 + .quad 0x923f82a4af194f9b,0xab1c5ed5da6d8118 + .quad 0xd807aa98a3030242,0x12835b0145706fbe + .quad 0x243185be4ee4b28c,0x550c7dc3d5ffb4e2 + .quad 0x72be5d74f27b896f,0x80deb1fe3b1696b1 + .quad 0x9bdc06a725c71235,0xc19bf174cf692694 + .quad 0xe49b69c19ef14ad2,0xefbe4786384f25e3 + .quad 0x0fc19dc68b8cd5b5,0x240ca1cc77ac9c65 + .quad 0x2de92c6f592b0275,0x4a7484aa6ea6e483 + .quad 0x5cb0a9dcbd41fbd4,0x76f988da831153b5 + .quad 0x983e5152ee66dfab,0xa831c66d2db43210 + .quad 0xb00327c898fb213f,0xbf597fc7beef0ee4 + .quad 0xc6e00bf33da88fc2,0xd5a79147930aa725 + .quad 0x06ca6351e003826f,0x142929670a0e6e70 + .quad 0x27b70a8546d22ffc,0x2e1b21385c26c926 + .quad 0x4d2c6dfc5ac42aed,0x53380d139d95b3df + .quad 0x650a73548baf63de,0x766a0abb3c77b2a8 + .quad 0x81c2c92e47edaee6,0x92722c851482353b + .quad 0xa2bfe8a14cf10364,0xa81a664bbc423001 + .quad 0xc24b8b70d0f89791,0xc76c51a30654be30 + .quad 0xd192e819d6ef5218,0xd69906245565a910 + .quad 0xf40e35855771202a,0x106aa07032bbd1b8 + .quad 0x19a4c116b8d2d0c8,0x1e376c085141ab53 + .quad 0x2748774cdf8eeb99,0x34b0bcb5e19b48a8 + .quad 0x391c0cb3c5c95a63,0x4ed8aa4ae3418acb + .quad 0x5b9cca4f7763e373,0x682e6ff3d6b2b8a3 + .quad 0x748f82ee5defb2fc,0x78a5636f43172f60 + .quad 0x84c87814a1f0ab72,0x8cc702081a6439ec + .quad 0x90befffa23631e28,0xa4506cebde82bde9 + .quad 0xbef9a3f7b2c67915,0xc67178f2e372532b + .quad 0xca273eceea26619c,0xd186b8c721c0c207 + .quad 0xeada7dd6cde0eb1e,0xf57d4f7fee6ed178 + .quad 0x06f067aa72176fba,0x0a637dc5a2c898a6 + .quad 0x113f9804bef90dae,0x1b710b35131c471b + .quad 0x28db77f523047d84,0x32caab7b40c72493 + .quad 0x3c9ebe0a15c9bebc,0x431d67c49c100d4c + .quad 0x4cc5d4becb3e42b6,0x597f299cfc657e2a + .quad 0x5fcb6fab3ad6faec,0x6c44198c4a475817 + +/* Mask for byte-swapping a couple of qwords in an XMM register using (v)pshufb. */ +.align 32 +.LPSHUFFLE_BYTE_FLIP_MASK: .octa 0x08090a0b0c0d0e0f0001020304050607 + .octa 0x18191a1b1c1d1e1f1011121314151617 + +.align 4 +.LPERM_VPALIGNR_8: .byte 5, 6, 7, 0 +ELF(.size _gcry_sha512_avx512_consts,.-_gcry_sha512_avx512_consts) + +#endif +#endif diff --git a/cipher/sha512.c b/cipher/sha512.c index 9cab33d6..05c8943e 100644 --- a/cipher/sha512.c +++ b/cipher/sha512.c @@ -104,6 +104,16 @@ #endif +/* USE_AVX512 indicates whether to compile with Intel AVX512 code. */ +#undef USE_AVX512 +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX512) && \ + defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AVX512 1 +#endif + + /* USE_SSSE3_I386 indicates whether to compile with Intel SSSE3/i386 code. 
*/ #undef USE_SSSE3_I386 #if defined(__i386__) && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ >= 4 && \ @@ -197,7 +207,8 @@ static const u64 k[] = * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI #undef ASM_EXTRA_STACK -#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) +#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) \ + || defined(USE_AVX512) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # define ASM_EXTRA_STACK (10 * 16 + 4 * sizeof(void *)) @@ -232,8 +243,10 @@ do_sha512_transform_amd64_ssse3(void *ctx, const unsigned char *data, size_t nblks) { SHA512_CONTEXT *hd = ctx; - return _gcry_sha512_transform_amd64_ssse3 (data, &hd->state, nblks) - + ASM_EXTRA_STACK; + unsigned int burn; + burn = _gcry_sha512_transform_amd64_ssse3 (data, &hd->state, nblks); + burn += burn > 0 ? ASM_EXTRA_STACK : 0; + return burn; } #endif @@ -247,8 +260,10 @@ do_sha512_transform_amd64_avx(void *ctx, const unsigned char *data, size_t nblks) { SHA512_CONTEXT *hd = ctx; - return _gcry_sha512_transform_amd64_avx (data, &hd->state, nblks) - + ASM_EXTRA_STACK; + unsigned int burn; + burn = _gcry_sha512_transform_amd64_avx (data, &hd->state, nblks); + burn += burn > 0 ? ASM_EXTRA_STACK : 0; + return burn; } #endif @@ -262,8 +277,27 @@ do_sha512_transform_amd64_avx2(void *ctx, const unsigned char *data, size_t nblks) { SHA512_CONTEXT *hd = ctx; - return _gcry_sha512_transform_amd64_avx2 (data, &hd->state, nblks) - + ASM_EXTRA_STACK; + unsigned int burn; + burn = _gcry_sha512_transform_amd64_avx2 (data, &hd->state, nblks); + burn += burn > 0 ? ASM_EXTRA_STACK : 0; + return burn; +} +#endif + +#ifdef USE_AVX512 +unsigned int _gcry_sha512_transform_amd64_avx512(const void *input_data, + void *state, + size_t num_blks) ASM_FUNC_ABI; + +static unsigned int +do_sha512_transform_amd64_avx512(void *ctx, const unsigned char *data, + size_t nblks) +{ + SHA512_CONTEXT *hd = ctx; + unsigned int burn; + burn = _gcry_sha512_transform_amd64_avx512 (data, &hd->state, nblks); + burn += burn > 0 ? ASM_EXTRA_STACK : 0; + return burn; } #endif @@ -393,6 +427,10 @@ sha512_init_common (SHA512_CONTEXT *ctx, unsigned int flags) if ((features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2)) ctx->bctx.bwrite = do_sha512_transform_amd64_avx2; #endif +#ifdef USE_AVX512 + if ((features & HWF_INTEL_AVX512) != 0) + ctx->bctx.bwrite = do_sha512_transform_amd64_avx512; +#endif #ifdef USE_PPC_CRYPTO if ((features & HWF_PPC_VCRYPTO) != 0) ctx->bctx.bwrite = do_sha512_transform_ppc8; diff --git a/configure.ac b/configure.ac index 27d72141..cf255bf3 100644 --- a/configure.ac +++ b/configure.ac @@ -2942,6 +2942,7 @@ if test "$found" = "1" ; then GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-ssse3-amd64.lo" GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx-amd64.lo" GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx2-bmi2-amd64.lo" + GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sha512-avx512-amd64.lo" ;; i?86-*-*) # Build with the assembly implementation -- 2.32.0 From ametzler at bebt.de Wed Mar 16 19:04:00 2022 From: ametzler at bebt.de (Andreas Metzler) Date: Wed, 16 Mar 2022 19:04:00 +0100 Subject: libgcrypt 1.10 status Message-ID: Hello, libgcrypt 1.10 was not announced on this list and is not listed on https://gnupg.org/download/index.html#libgcrypt (either as available version nor its end of life). Is this a stable release, is it a LTS version? TIA, cu Andreas -- `What a good friend you are to him, Dr. Maturin. His other friends are so grateful to you.' 
`I sew his ears on from time to time, sure' From wk at gnupg.org Fri Mar 18 16:08:26 2022 From: wk at gnupg.org (Werner Koch) Date: Fri, 18 Mar 2022 16:08:26 +0100 Subject: libgcrypt 1.10 status In-Reply-To: (Andreas Metzler's message of "Wed, 16 Mar 2022 19:04:00 +0100") References: Message-ID: <87o82347th.fsf@wheatstone.g10code.de> On Wed, 16 Mar 2022 19:04, Andreas Metzler said: > libgcrypt 1.10 was not announced on this list and is not listed on > https://gnupg.org/download/index.html#libgcrypt (either as available > version nor its end of life). Is this a stable release, is it a LTS It was not announced to see whether something broke. It will be the stable versions and there will soon be announcement for 1.10.1. From our versions (source) file: # We will probably wait for the 1.10.1 before we take 1.10 in public use. # xxx +macro: libgcrypt_branch LIBGCRYPT-1.10-BRANCH # xxx +macro: libgcrypt_ver 1.10.0 # xxx +macro: libgcrypt_date 2022-02-01 Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From geoff at knauth.org Mon Mar 28 17:33:01 2022 From: geoff at knauth.org (Geoffrey S. Knauth) Date: Mon, 28 Mar 2022 11:33:01 -0400 Subject: Libgcrypt 1.10.1 released In-Reply-To: <87czi6w376.fsf@wheatstone.g10code.de> References: <87czi6w376.fsf@wheatstone.g10code.de> Message-ID: <7dd0b019-da32-4bb7-839c-e7d78a8c7425@www.fastmail.com> Thank you! On Mon, Mar 28, 2022, at 10:40, Werner Koch wrote: > Hello! > > We are pleased to announce the availability of Libgcrypt version 1.10.1. > This release starts a new stable branch of Libgcrypt with full API and > ABI compatibility to the 1.9 series. Over the last year Jussi Kivilinna > put again a lot of work into speeding up the algorithms for the most > commonly used CPUs. See below for a list of improvements and new > features in 1.10. > > Libgcrypt is a general purpose library of cryptographic building blocks. > It is originally based on code used by GnuPG. It does not provide any > implementation of OpenPGP or other protocols. Thorough understanding of > applied cryptography is required to use Libgcrypt. > > > Noteworthy changes in Libgcrypt 1.10.0 and 1.10.1 > ================================================= > > * New and extended interfaces: > > - New control codes to check for FIPS 140-3 approved algorithms. > > - New control code to switch into non-FIPS mode. > > - New cipher modes SIV and GCM-SIV as specified by RFC-5297. > > - Extended cipher mode AESWRAP with padding as specified by > RFC-5649. [T5752] > > - New set of KDF functions. > > - New KDF modes Argon2 and Balloon. > > - New functions for combining hashing and signing/verification. [T4894] > > * Performance: > > - Improved support for PowerPC architectures. > > - Improved ECC performance on zSeries/s390x by using accelerated > scalar multiplication. > > - Many more assembler performance improvements for several > architectures. > > * Bug fixes: > > - Fix Elgamal encryption for other implementations. > [R5328,CVE-2021-40528] > > - Fix alignment problem on macOS. [T5440] > > - Check the input length of the point in ECDH. [T5423] > > - Fix an abort in gcry_pk_get_param for "Curve25519". [T5490] > > - Fix minor memory leaks in FIPS mode. > > - Build fixes for MUSL libc. 
[rCffaef0be61] > > * Other features: > > - The control code GCRYCTL_SET_ENFORCED_FIPS_FLAG is ignored > because it is useless with the FIPS 140-3 related changes. > > - Update of the jitter entropy RNG code. [T5523] > > - Simplification of the entropy gatherer when using the getentropy > system call. > > - More portable integrity check in FIPS mode. [rC9fa4c8946a,T5835] > > - Add X9.62 OIDs to sha256 and sha512 modules. [rC52fd2305ba] > > Note that 1.10.0 was already released on 2022-02-01 without a public > announcement to allow for some extra test time. > > For a list of links to commits and bug numbers see the release info at > https://dev.gnupg.org/T5691 and https://dev.gnupg.org/T5810 > > > > Download > ======== > > Source code is hosted at the GnuPG FTP server and its mirrors as listed > at https://gnupg.org/download/mirrors.html. On the primary server > the source tarball and its digital signature are: > > https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.bz2 > https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.bz2.sig > > or gzip compressed: > > https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.gz > https://gnupg.org/ftp/gcrypt/libgcrypt/libgcrypt-1.10.1.tar.gz.sig > > In order to check that the version of Libgcrypt you downloaded is an > original and unmodified file please follow the instructions found at > https://gnupg.org/download/integrity_check.html. In short, you may > use one of the following methods: > > - Check the supplied OpenPGP signature. For example to check the > signature of the file libgcrypt-1.10.1.tar.bz2 you would use this > command: > > gpg --verify libgcrypt-1.10.1.tar.bz2.sig libgcrypt-1.10.1.tar.bz2 > > This checks whether the signature file matches the source file. > You should see a message indicating that the signature is good and > made by one or more of the release signing keys. Make sure that > this is a valid key, either by matching the shown fingerprint > against a trustworthy list of valid release signing keys or by > checking that the key has been signed by trustworthy other keys. > See the end of this mail for information on the signing keys. > > - If you are not able to use an existing version of GnuPG, you have > to verify the SHA-1 checksum. On Unix systems the command to do > this is either "sha1sum" or "shasum". Assuming you downloaded the > file libgcrypt-1.10.1.tar.bz2, you run the command like this: > > sha1sum libgcrypt-1.10.1.tar.bz2 > > and check that the output matches the first line from the > this list: > > de2cc32e7538efa376de7bf5d3eafa85626fb95f libgcrypt-1.10.1.tar.bz2 > 9db3ef0ec74bd2915fa7ca6f32ea9ba7e013e1a1 libgcrypt-1.10.1.tar.gz > > You should also verify that the checksums above are authentic by > matching them with copies of this announcement. Those copies can be > found at other mailing lists, web sites, and search engines. > > > Copying > ======= > > Libgcrypt is distributed under the terms of the GNU Lesser General > Public License (LGPLv2.1+). The helper programs as well as the > documentation are distributed under the terms of the GNU General Public > License (GPLv2+). The file LICENSES has notices about contributions > that require that these additional notices are distributed. > > > Support > ======= > > For help on developing with Libgcrypt you should read the included > manual and if needed ask on the gcrypt-devel mailing list. > > In case of problems specific to this release please first check > https://dev.gnupg.org/T5810 for updated information. 
> > Please also consult the archive of the gcrypt-devel mailing list before > reporting a bug: https://gnupg.org/documentation/mailing-lists.html . > We suggest to send bug reports for a new release to this list in favor > of filing a bug at https://bugs.gnupg.org. If you need commercial > support go to https://gnupg.com or https://gnupg.org/service.html . > > If you are a developer and you need a certain feature for your project, > please do not hesitate to bring it to the gcrypt-devel mailing list for > discussion. > > > > Thanks > ====== > > Since 2001 maintenance and development of GnuPG is done by g10 Code GmbH > and has mostly been financed by donations. Three full-time employed > developers as well as two contractors exclusively work on GnuPG and > closely related software like Libgcrypt, GPGME and Gpg4win. > > Fortunately, and this is still not common with free software, we have > now established a way of financing the development while keeping all our > software free and freely available for everyone. Our model is similar > to the way RedHat manages RHEL and Fedora: Except for the actual binary > of the MSI installer for Windows and client specific configuration > files, all the software is available under the GNU GPL and other Open > Source licenses. Thus customers may even build and distribute their own > version of the software as long as they do not use our trademark > GnuPG VS-Desktop?. > > We like to thank all the nice people who are helping the GnuPG project, > be it testing, coding, translating, suggesting, auditing, administering > the servers, spreading the word, answering questions on the mailing > lists, or helping with donations. > > *Thank you all* > > Your Libgcrypt hackers > > > > p.s. > This is an announcement only mailing list. Please send replies only to > the gnupg-users'at'gnupg.org mailing list. > > List of Release Signing Keys: > To guarantee that a downloaded GnuPG version has not been tampered by > malicious entities we provide signature files for all tarballs and > binary versions. The keys are also signed by the long term keys of > their respective owners. Current releases are signed by one or more > of these keys: > > rsa3072 2017-03-17 [expires: 2027-03-15] > 5B80 C575 4298 F0CB 55D8 ED6A BCEF 7E29 4B09 2E28 > Andre Heinecke (Release Signing Key) > > ed25519 2020-08-24 [expires: 2030-06-30] > 6DAA 6E64 A76D 2840 571B 4902 5288 97B8 2640 3ADA > Werner Koch (dist signing 2020) > > ed25519 2021-05-19 [expires: 2027-04-04] > AC8E 115B F73E 2D8D 47FA 9908 E98E 9B2D 19C6 C8BD > Niibe Yutaka (GnuPG Release Key) > > brainpoolP256r1 2021-10-15 [expires: 2029-12-31] > 02F3 8DFF 731F F97C B039 A1DA 549E 695E 905B A208 > GnuPG.com (Release Signing Key 2021) > > The keys are available at https://gnupg.org/signature_key.html and > in any recently released GnuPG tarball in the file g10/distsigkey.gpg . > Note that this mail has been signed by a different key. > > > -- > The pioneers of a warless world are the youth that > refuse military service. - A. Einstein > > Attachments: > * signature.asc -- Geoffrey S. 
Knauth | https://knauth.org/gsk From guidovranken at gmail.com Mon Mar 28 22:47:20 2022 From: guidovranken at gmail.com (Guido Vranken) Date: Mon, 28 Mar 2022 22:47:20 +0200 Subject: Argon2 incorrect result and division by zero Message-ID: Fuzzer debug output for the reproducer (included at end of this message): Module libgcrypt result: {0x0f, 0xb5, 0x95, 0x20, 0xf8, 0x1a, 0x3f, 0xec, 0xac, 0xc0, 0xa4, 0x68, 0x78, 0x33, 0xf7, 0xce, 0xb0, 0xbd, 0x42, 0x95, 0xc2, 0x63, 0x45, 0x38, 0xc2, 0x06, 0x6e, 0x8c, 0x39, 0x2a, 0xb4, 0xd5, 0x84, 0x6b, 0x19, 0xf2, 0x5f, 0x00, 0x7b, 0xbf, 0x66, 0xfe, 0xc8, 0xd2, 0xe7, 0x98, 0x2d, 0xa7, 0x00, 0xdb, 0xf9, 0x43, 0x13, 0xd5, 0x5d, 0x19, 0xec, 0x0e, 0x5c, 0x69, 0x06, 0xd2, 0xb6, 0xd7, 0xcf, 0x72, 0xbb, 0x3b, 0xa7, 0x29} (70 bytes) Module Botan result: {0x0f, 0xb5, 0x95, 0x20, 0xf8, 0x1a, 0x3f, 0xec, 0xac, 0xc0, 0xa4, 0x68, 0x78, 0x33, 0xf7, 0xce, 0xb0, 0xbd, 0x42, 0x95, 0xc2, 0x63, 0x45, 0x38, 0xc2, 0x06, 0x6e, 0x8c, 0x39, 0x2a, 0xb4, 0xd5, 0xc1, 0x16, 0x48, 0x32, 0x7c, 0xed, 0xe1, 0x56, 0x90, 0xab, 0x49, 0x32, 0xd0, 0x51, 0x48, 0x55, 0x6d, 0x96, 0xcc, 0xd1, 0x33, 0xe2, 0xb2, 0x2b, 0x88, 0xf8, 0x35, 0x74, 0xf8, 0x90, 0x78, 0x27, 0x45, 0xa4, 0x37, 0x99, 0xc6, 0x86} (70 bytes) It seems that the 64 bytes are always correct but with output sizes larger than that, discrepancies occur. Additionally there is a division by zero on this line in argon2_init() if parallelism is set to 0: segment_length = memory_blocks / (parallelism * 4); it would be better to return an error in this case. Reproducer: #include #define CF_CHECK_EQ(expr, res) if ( (expr) != (res) ) { goto end; } int main(void) { const unsigned char password[32] = { 0xa3, 0x18, 0xc3, 0x65, 0x45, 0xbb, 0x67, 0xb8, 0x26, 0xab, 0x1d, 0x8c, 0xa7, 0x0f, 0xc7, 0x8c, 0x33, 0x0d, 0x4c, 0x57, 0x2b, 0xcf, 0x95, 0x94, 0xfd, 0x75, 0x85, 0xf0, 0x08, 0x2a, 0x04, 0x05}; const unsigned char salt[32] = { 0xb4, 0x0f, 0xf9, 0x84, 0x68, 0x4e, 0x44, 0x0c, 0x86, 0x0b, 0xd1, 0x4b, 0x7c, 0x71, 0x85, 0xdb, 0xa7, 0x9b, 0x47, 0x6e, 0x76, 0xba, 0xf9, 0xa3, 0x47, 0xf0, 0x82, 0x20, 0x84, 0x60, 0xca, 0x8e}; gcry_kdf_hd_t hd; unsigned char out[70]; const unsigned long params[4] = {sizeof(out), 1, 27701, 1}; CF_CHECK_EQ(gcry_kdf_open(&hd, GCRY_KDF_ARGON2, GCRY_KDF_ARGON2ID, params, 4, password, sizeof(password), salt, sizeof(salt), NULL, 0, NULL, 0), GPG_ERR_NO_ERROR); CF_CHECK_EQ(gcry_kdf_compute(hd, NULL), GPG_ERR_NO_ERROR); CF_CHECK_EQ(gcry_kdf_final(hd, sizeof(out), out), GPG_ERR_NO_ERROR); for (size_t i = 0; i < sizeof(out); i++) { if ( !(i % 16) ) printf("\n"); printf("0x%02x, ", out[i]); } printf("\n"); end: gcry_kdf_close(hd); return 0; } -------------- next part -------------- An HTML attachment was scrubbed... URL: From tianjia.zhang at linux.alibaba.com Tue Mar 29 10:26:00 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 29 Mar 2022 16:26:00 +0800 Subject: [PATCH] Fix configure.ac error of intel-avx512 Message-ID: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com> * configure.ac: Correctly set value for avx512support. 
-- Signed-off-by: Tianjia Zhang --- configure.ac | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/configure.ac b/configure.ac index b41322e357ca..467151391a73 100644 --- a/configure.ac +++ b/configure.ac @@ -1303,6 +1303,7 @@ if test "$mpi_cpu_arch" != "x86" ; then sse41support="n/a" avxsupport="n/a" avx2support="n/a" + avx512support="n/a" padlocksupport="n/a" drngsupport="n/a" fi @@ -2404,6 +2405,11 @@ if test x"$avx2support" = xyes ; then avx2support="no (unsupported by compiler)" fi fi +if test x"$avx512support" = xyes ; then + if test "$gcry_cv_gcc_inline_asm_avx512" != "yes" ; then + avx512support="no (unsupported by compiler)" + fi +fi if test x"$neonsupport" = xyes ; then if test "$gcry_cv_gcc_inline_asm_neon" != "yes" ; then if test "$gcry_cv_gcc_inline_asm_aarch64_neon" != "yes" ; then -- 2.24.3 (Apple Git-128) From rms at gnu.org Tue Mar 29 05:30:46 2022 From: rms at gnu.org (Richard Stallman) Date: Mon, 28 Mar 2022 23:30:46 -0400 Subject: Libgcrypt 1.10.1 released In-Reply-To: <87czi6w376.fsf@wheatstone.g10code.de> (message from Werner Koch on Mon, 28 Mar 2022 16:40:13 +0200) References: <87czi6w376.fsf@wheatstone.g10code.de> Message-ID: [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Congratulations on the new release. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) From jussi.kivilinna at iki.fi Tue Mar 29 18:06:04 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 29 Mar 2022 19:06:04 +0300 Subject: [PATCH] Fix configure.ac error of intel-avx512 In-Reply-To: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com> References: <20220329082600.42934-1-tianjia.zhang@linux.alibaba.com> Message-ID: Hello, On 29.3.2022 11.26, Tianjia Zhang via Gcrypt-devel wrote: > * configure.ac: Correctly set value for avx512support. > -- > > Signed-off-by: Tianjia Zhang > --- > configure.ac | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/configure.ac b/configure.ac > index b41322e357ca..467151391a73 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -1303,6 +1303,7 @@ if test "$mpi_cpu_arch" != "x86" ; then > sse41support="n/a" > avxsupport="n/a" > avx2support="n/a" > + avx512support="n/a" > padlocksupport="n/a" > drngsupport="n/a" > fi > @@ -2404,6 +2405,11 @@ if test x"$avx2support" = xyes ; then > avx2support="no (unsupported by compiler)" > fi > fi > +if test x"$avx512support" = xyes ; then > + if test "$gcry_cv_gcc_inline_asm_avx512" != "yes" ; then > + avx512support="no (unsupported by compiler)" > + fi > +fi > if test x"$neonsupport" = xyes ; then > if test "$gcry_cv_gcc_inline_asm_neon" != "yes" ; then > if test "$gcry_cv_gcc_inline_asm_aarch64_neon" != "yes" ; then Applied to master, thanks. -Jussi
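A note on the Argon2 report from Guido Vranken earlier in this digest: the
division by zero he points out (segment_length = memory_blocks /
(parallelism * 4) in argon2_init()) is easiest to avoid by validating the
parameters up front and returning an error, as he suggests. The sketch below
shows one possible shape for such a check. It is illustrative only: the
struct layout and the helper name are assumptions made for this note, not
libgcrypt's actual internals; only gpg_err_code_t and GPG_ERR_INV_VALUE are
real libgpg-error names.

  #include <gpg-error.h>

  /* Stand-in for the relevant fields of libgcrypt's internal Argon2
     context; the real structure differs.  */
  struct argon2_params
  {
    unsigned long parallelism;   /* lanes (p) */
    unsigned long m_cost;        /* memory cost in KiB (m) */
  };

  /* Hypothetical guard, run before segment_length is computed.  */
  static gpg_err_code_t
  argon2_check_params (const struct argon2_params *p)
  {
    if (p->parallelism == 0)
      return GPG_ERR_INV_VALUE;    /* would otherwise divide by zero */
    if (p->m_cost < 8 * p->parallelism)
      return GPG_ERR_INV_VALUE;    /* RFC 9106 requires m >= 8*p KiB */
    return 0;
  }

With a check of this kind in place, the parallelism == 0 case from the
reproducer is rejected cleanly at setup time instead of ever reaching the
division.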