From gniibe at fsij.org Fri Apr 1 04:18:32 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Fri, 01 Apr 2022 11:18:32 +0900 Subject: Argon2 incorrect result and division by zero In-Reply-To: References: Message-ID: <87a6d5o8av.fsf@akagi.fsij.org> Hello, Thanks a lot for your test case. It's my mistake in the implementation. Guido Vranken wrote: > It seems that the 64 bytes are always correct but with output sizes larger > than that, discrepancies occur. > > Additionally there is a division by zero on this line in argon2_init() if > parallelism is set to 0: > > segment_length = memory_blocks / (parallelism * 4); > > it would be better to return an error in this case. Two problems are fixed in master by the commit: 564739a58426d89db2f0c9334659949e503d2c59 And 1.10 branch by: 13b5454d2620701863f6e89221f5f4c98d2aba8e -- From tianjia.zhang at linux.alibaba.com Fri Apr 1 11:17:36 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Fri, 1 Apr 2022 17:17:36 +0800 Subject: [PATCH] Add SM3 ARMv8/AArch64/CE assembly implementation Message-ID: <20220401091736.30541-1-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm3-armv8-aarch64-ce.S'. * cipher/sm3-armv8-aarch64-ce.S: New. * cipher/sm3.c (USE_ARM_CE): New. [USE_ARM_CE] (_gcry_sm3_transform_armv8_ce) (do_sm3_transform_armv8_ce): New. (sm3_init) [USE_ARM_CE]: New. * configure.ac: Add 'sm3-armv8-aarch64-ce.lo'. -- Benchmark on T-Head Yitian-710 2.75 GHz: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz SM3 | 2.84 ns/B 335.3 MiB/s 7.82 c/B 2749 After (~55% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz SM3 | 1.84 ns/B 518.1 MiB/s 5.06 c/B 2749 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 2 +- cipher/sm3-armv8-aarch64-ce.S | 218 ++++++++++++++++++++++++++++++++++ cipher/sm3.c | 28 +++++ configure.ac | 1 + 4 files changed, 248 insertions(+), 1 deletion(-) create mode 100644 cipher/sm3-armv8-aarch64-ce.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 1ac1923b7ce5..30be9f982883 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -130,7 +130,7 @@ EXTRA_libcipher_la_SOURCES = \ sha512-avx2-bmi2-amd64.S sha512-avx512-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ sha512-ppc.c sha512-ssse3-i386.c \ - sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S \ + sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S sm3-armv8-aarch64-ce.S \ keccak.c keccak_permute_32.h keccak_permute_64.h keccak-armv7-neon.S \ stribog.c \ tiger.c \ diff --git a/cipher/sm3-armv8-aarch64-ce.S b/cipher/sm3-armv8-aarch64-ce.S new file mode 100644 index 000000000000..0900b84fe2bf --- /dev/null +++ b/cipher/sm3-armv8-aarch64-ce.S @@ -0,0 +1,218 @@ +/* sm3-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM3 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ + defined(USE_SM3) + +.cpu generic+simd+crypto + +/* Must be consistent with register macros */ +#define vecnum_v0 0 +#define vecnum_v1 1 +#define vecnum_v2 2 +#define vecnum_v3 3 +#define vecnum_v4 4 +#define vecnum_CTX1 16 +#define vecnum_CTX2 17 +#define vecnum_SS1 18 +#define vecnum_WT 19 +#define vecnum_K0 20 +#define vecnum_K1 21 +#define vecnum_K2 22 +#define vecnum_K3 23 +#define vecnum_RTMP0 24 +#define vecnum_RTMP1 25 + +#define sm3partw1(vd, vn, vm) \ + .inst (0xce60c000 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3partw2(vd, vn, vm) \ + .inst (0xce60c400 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3ss1(vd, vn, vm, va) \ + .inst (0xce400000 | (vecnum_##vm << 16) | (vecnum_##va << 10) \ + | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3tt1a(vd, vn, vm, imm2) \ + .inst (0xce408000 | (vecnum_##vm << 16) | imm2 << 12 \ + | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3tt1b(vd, vn, vm, imm2) \ + .inst (0xce408400 | (vecnum_##vm << 16) | imm2 << 12 \ + | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3tt2a(vd, vn, vm, imm2) \ + .inst (0xce408800 | (vecnum_##vm << 16) | imm2 << 12 \ + | (vecnum_##vn << 5) | vecnum_##vd) + +#define sm3tt2b(vd, vn, vm, imm2) \ + .inst (0xce408c00 | (vecnum_##vm << 16) | imm2 << 12 \ + | (vecnum_##vn << 5) | vecnum_##vd) + +/* Constants */ + +.text +.align 4 +ELF(.type _gcry_sm3_armv8_ce_consts, at object) +_gcry_sm3_armv8_ce_consts: +.Lsm3_Ktable: + .long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb + .long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc + .long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce + .long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6 + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 + .long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53 + .long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d + .long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4 + .long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43 + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 +ELF(.size _gcry_sm3_armv8_ce_consts,.-_gcry_sm3_armv8_ce_consts) + +/* Register macros */ + +/* Must be consistent with vecnum_ macros */ +#define CTX1 v16 +#define CTX2 v17 +#define SS1 v18 +#define WT v19 + +#define K0 v20 +#define K1 v21 +#define K2 v22 +#define K3 v23 + +#define RTMP0 v24 +#define RTMP1 v25 + +/* Helper macros. */ + +#define _(...) 
/*_*/ + +#define SCHED_W_1(s0, s1, s2, s3, s4) ext s4.16b, s1.16b, s2.16b, #12 +#define SCHED_W_2(s0, s1, s2, s3, s4) ext RTMP0.16b, s0.16b, s1.16b, #12 +#define SCHED_W_3(s0, s1, s2, s3, s4) ext RTMP1.16b, s2.16b, s3.16b, #8 +#define SCHED_W_4(s0, s1, s2, s3, s4) sm3partw1(s4, s0, s3) +#define SCHED_W_5(s0, s1, s2, s3, s4) sm3partw2(s4, RTMP1, RTMP0) + +#define SCHED_W(n, s0, s1, s2, s3, s4) SCHED_W_##n(s0, s1, s2, s3, s4) + +#define R(ab, s0, s1, s2, s3, s4, IOP) \ + ld4 {K0.s, K1.s, K2.s, K3.s}[3], [x3], #16; \ + eor WT.16b, s0.16b, s1.16b; \ + \ + sm3ss1(SS1, CTX1, CTX2, K0); \ + IOP(1, s0, s1, s2, s3, s4); \ + sm3tt1##ab(CTX1, SS1, WT, 0); \ + sm3tt2##ab(CTX2, SS1, s0, 0); \ + \ + IOP(2, s0, s1, s2, s3, s4); \ + sm3ss1(SS1, CTX1, CTX2, K1); \ + IOP(3, s0, s1, s2, s3, s4); \ + sm3tt1##ab(CTX1, SS1, WT, 1); \ + sm3tt2##ab(CTX2, SS1, s0, 1); \ + \ + sm3ss1(SS1, CTX1, CTX2, K2); \ + IOP(4, s0, s1, s2, s3, s4); \ + sm3tt1##ab(CTX1, SS1, WT, 2); \ + sm3tt2##ab(CTX2, SS1, s0, 2); \ + \ + sm3ss1(SS1, CTX1, CTX2, K3); \ + IOP(5, s0, s1, s2, s3, s4); \ + sm3tt1##ab(CTX1, SS1, WT, 3); \ + sm3tt2##ab(CTX2, SS1, s0, 3); + +#define R1(s0, s1, s2, s3, s4, IOP) R(a, s0, s1, s2, s3, s4, IOP) +#define R2(s0, s1, s2, s3, s4, IOP) R(b, s0, s1, s2, s3, s4, IOP) + +.align 3 +.global _gcry_sm3_transform_armv8_ce +ELF(.type _gcry_sm3_transform_armv8_ce,%function;) +_gcry_sm3_transform_armv8_ce: + /* input: + * x0: CTX + * x1: data + * x2: nblocks + */ + CFI_STARTPROC(); + + ld1 {CTX1.4s, CTX2.4s}, [x0]; + rev64 CTX1.4s, CTX1.4s; + rev64 CTX2.4s, CTX2.4s; + ext CTX1.16b, CTX1.16b, CTX1.16b, #8; + ext CTX2.16b, CTX2.16b, CTX2.16b, #8; + +.Lloop: + GET_DATA_POINTER(x3, .Lsm3_Ktable); + ld1 {v0.16b-v3.16b}, [x1], #64; + sub x2, x2, #1; + + mov v6.16b, CTX1.16b; + mov v7.16b, CTX2.16b; + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + R1(v0, v1, v2, v3, v4, SCHED_W); + R1(v1, v2, v3, v4, v0, SCHED_W); + R1(v2, v3, v4, v0, v1, SCHED_W); + R1(v3, v4, v0, v1, v2, SCHED_W); + R2(v4, v0, v1, v2, v3, SCHED_W); + R2(v0, v1, v2, v3, v4, SCHED_W); + R2(v1, v2, v3, v4, v0, SCHED_W); + R2(v2, v3, v4, v0, v1, SCHED_W); + R2(v3, v4, v0, v1, v2, SCHED_W); + R2(v4, v0, v1, v2, v3, SCHED_W); + R2(v0, v1, v2, v3, v4, SCHED_W); + R2(v1, v2, v3, v4, v0, SCHED_W); + R2(v2, v3, v4, v0, v1, SCHED_W); + R2(v3, v4, v0, v1, v2, _); + R2(v4, v0, v1, v2, v3, _); + R2(v0, v1, v2, v3, v4, _); + + eor CTX1.16b, CTX1.16b, v6.16b; + eor CTX2.16b, CTX2.16b, v7.16b; + + cbnz x2, .Lloop; + + /* save state */ + rev64 CTX1.4s, CTX1.4s; + rev64 CTX2.4s, CTX2.4s; + ext CTX1.16b, CTX1.16b, CTX1.16b, #8; + ext CTX2.16b, CTX2.16b, CTX2.16b, #8; + st1 {CTX1.4s, CTX2.4s}, [x0]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm3_transform_armv8_ce, .-_gcry_sm3_transform_armv8_ce;) + +#endif diff --git a/cipher/sm3.c b/cipher/sm3.c index 0ab5f5067edb..bfe9f4c25225 100644 --- a/cipher/sm3.c +++ b/cipher/sm3.c @@ -67,6 +67,16 @@ # endif #endif +/* USE_ARM_CE indicates whether to enable ARMv8 Crypto Extension code. 
*/ +#undef USE_ARM_CE +#ifdef ENABLE_ARM_CRYPTO_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 +# endif +#endif + typedef struct { gcry_md_block_ctx_t bctx; @@ -117,6 +127,20 @@ do_sm3_transform_aarch64(void *context, const unsigned char *data, size_t nblks) } #endif /* USE_AARCH64_SIMD */ +#ifdef USE_ARM_CE +void _gcry_sm3_transform_armv8_ce(void *state, const void *input_data, + size_t num_blks); + +static unsigned int +do_sm3_transform_armv8_ce(void *context, const unsigned char *data, + size_t nblks) +{ + SM3_CONTEXT *hd = context; + _gcry_sm3_transform_armv8_ce (hd->h, data, nblks); + return 0; +} +#endif /* USE_ARM_CE */ + static unsigned int transform (void *c, const unsigned char *data, size_t nblks); @@ -153,6 +177,10 @@ sm3_init (void *context, unsigned int flags) if (features & HWF_ARM_NEON) hd->bctx.bwrite = do_sm3_transform_aarch64; #endif +#ifdef USE_ARM_CE + if (features & HWF_ARM_SM3) + hd->bctx.bwrite = do_sm3_transform_armv8_ce; +#endif (void)features; } diff --git a/configure.ac b/configure.ac index e214082b2603..fc49bb86fc2b 100644 --- a/configure.ac +++ b/configure.ac @@ -3049,6 +3049,7 @@ if test "$found" = "1" ; then aarch64-*-*) # Build with the assembly implementation GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sm3-aarch64.lo" + GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sm3-armv8-aarch64-ce.lo" ;; esac fi -- 2.24.3 (Apple Git-128) From jussi.kivilinna at iki.fi Sun Apr 3 17:10:43 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Apr 2022 18:10:43 +0300 Subject: [PATCH 2/2] chacha20: add AVX512 implementation In-Reply-To: <20220403151043.4096276-1-jussi.kivilinna@iki.fi> References: <20220403151043.4096276-1-jussi.kivilinna@iki.fi> Message-ID: <20220403151043.4096276-2-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add 'chacha20-amd64-avx512.S'. * cipher/chacha20-amd64-avx512.S: New. * cipher/chacha20.c (USE_AVX512): New. (CHACHA20_context_s): Add 'use_avx512'. [USE_AVX512] (_gcry_chacha20_amd64_avx512_blocks16): New. (chacha20_do_setkey) [USE_AVX512]: Setup 'use_avx512' based on HW features. (do_chacha20_encrypt_stream_tail) [USE_AVX512]: Use AVX512 implementation if supported. (_gcry_chacha20_poly1305_encrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used. (_gcry_chacha20_poly1305_decrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used. 
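Illustration (not part of the change itself): the AVX512 kernel consumes
whole multiples of 16 ChaCha20 blocks (16 * 64 = 1024 bytes), and
do_chacha20_encrypt_stream_tail() hands it the largest such chunk before
falling through to the existing AVX2/SSSE3/generic paths for the rest.
A standalone sketch of that rounding, with a hypothetical helper name:

  #include <stddef.h>

  #define CHACHA20_BLOCK_SIZE 64   /* ChaCha20 block size in bytes */

  /* Hypothetical helper, illustration only: number of bytes the AVX512
   * path would take for a given request (whole multiples of 16 blocks
   * only, mirroring the check in do_chacha20_encrypt_stream_tail). */
  static size_t
  avx512_chunk_bytes (size_t length)
  {
    size_t nblocks = length / CHACHA20_BLOCK_SIZE;
    nblocks -= nblocks % 16;
    return nblocks * CHACHA20_BLOCK_SIZE;
  }

For example, a 4096-byte request is handled entirely by
_gcry_chacha20_amd64_avx512_blocks16(), while a 1000-byte request never
reaches it and is processed by the shorter-vector code instead.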
-- Benchmark on Intel Core i3-1115G4 (tigerlake): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.276 ns/B 3451 MiB/s 1.13 c/B 4090 STREAM dec | 0.284 ns/B 3359 MiB/s 1.16 c/B 4090 POLY1305 enc | 0.411 ns/B 2320 MiB/s 1.68 c/B 4098?3 POLY1305 dec | 0.408 ns/B 2338 MiB/s 1.67 c/B 4091?1 POLY1305 auth | 0.060 ns/B 15785 MiB/s 0.247 c/B 4090?1 After (stream 1.7x faster, poly1305-aead 1.8x faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.162 ns/B 5869 MiB/s 0.665 c/B 4092?1 STREAM dec | 0.162 ns/B 5884 MiB/s 0.664 c/B 4096?3 POLY1305 enc | 0.221 ns/B 4306 MiB/s 0.907 c/B 4097?3 POLY1305 dec | 0.220 ns/B 4342 MiB/s 0.900 c/B 4096?3 POLY1305 auth | 0.060 ns/B 15797 MiB/s 0.247 c/B 4085?2 Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 +- cipher/chacha20-amd64-avx512.S | 300 +++++++++++++++++++++++++++++++++ cipher/chacha20.c | 60 ++++++- configure.ac | 1 + 4 files changed, 357 insertions(+), 6 deletions(-) create mode 100644 cipher/chacha20-amd64-avx512.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index b6319d35..ed6d7c35 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -81,7 +81,7 @@ EXTRA_libcipher_la_SOURCES = \ blowfish.c blowfish-amd64.S blowfish-arm.S \ cast5.c cast5-amd64.S cast5-arm.S \ chacha20.c chacha20-amd64-ssse3.S chacha20-amd64-avx2.S \ - chacha20-armv7-neon.S chacha20-aarch64.S \ + chacha20-amd64-avx512.S chacha20-armv7-neon.S chacha20-aarch64.S \ chacha20-ppc.c chacha20-s390x.S \ cipher-gcm-ppc.c cipher-gcm-intel-pclmul.c cipher-gcm-armv7-neon.S \ cipher-gcm-armv8-aarch32-ce.S cipher-gcm-armv8-aarch64-ce.S \ diff --git a/cipher/chacha20-amd64-avx512.S b/cipher/chacha20-amd64-avx512.S new file mode 100644 index 00000000..da24286e --- /dev/null +++ b/cipher/chacha20-amd64-avx512.S @@ -0,0 +1,300 @@ +/* chacha20-amd64-avx512.S - AVX512 implementation of ChaCha20 cipher + * + * Copyright (C) 2022 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +/* + * Based on D. J. Bernstein reference implementation at + * http://cr.yp.to/chacha.html: + * + * chacha-regs.c version 20080118 + * D. J. Bernstein + * Public domain. 
+ */ + +#ifdef __x86_64 +#include +#if defined(HAVE_GCC_INLINE_ASM_AVX512) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) + +.text + +#include "asm-common-amd64.h" + +/* register macros */ +#define INPUT %rdi +#define DST %rsi +#define SRC %rdx +#define NBLKS %rcx +#define ROUND %eax + +/* vector registers */ +#define X0 %zmm0 +#define X1 %zmm1 +#define X2 %zmm2 +#define X3 %zmm3 +#define X4 %zmm4 +#define X5 %zmm5 +#define X6 %zmm6 +#define X7 %zmm7 +#define X8 %zmm8 +#define X9 %zmm9 +#define X10 %zmm10 +#define X11 %zmm11 +#define X12 %zmm12 +#define X13 %zmm13 +#define X14 %zmm14 +#define X15 %zmm15 + +#define TMP0 %zmm16 +#define TMP1 %zmm17 + +#define COUNTER_ADD %zmm18 + +#define X12_SAVE %zmm19 +#define X13_SAVE %zmm20 + +#define S0 %zmm21 +#define S1 %zmm22 +#define S2 %zmm23 +#define S3 %zmm24 +#define S4 %zmm25 +#define S5 %zmm26 +#define S6 %zmm27 +#define S7 %zmm28 +#define S8 %zmm29 +#define S14 %zmm30 +#define S15 %zmm31 + +/********************************************************************** + helper macros + **********************************************************************/ + +/* 4x4 32-bit integer matrix transpose */ +#define transpose_4x4(x0,x1,x2,x3,t1,t2) \ + vpunpckhdq x1, x0, t2; \ + vpunpckldq x1, x0, x0; \ + \ + vpunpckldq x3, x2, t1; \ + vpunpckhdq x3, x2, x2; \ + \ + vpunpckhqdq t1, x0, x1; \ + vpunpcklqdq t1, x0, x0; \ + \ + vpunpckhqdq x2, t2, x3; \ + vpunpcklqdq x2, t2, x2; + +/* 4x4 128-bit matrix transpose */ +#define transpose_16byte_4x4(x0,x1,x2,x3,t1,t2) \ + vshufi32x4 $0xee, x1, x0, t2; \ + vshufi32x4 $0x44, x1, x0, x0; \ + \ + vshufi32x4 $0x44, x3, x2, t1; \ + vshufi32x4 $0xee, x3, x2, x2; \ + \ + vshufi32x4 $0xdd, t1, x0, x1; \ + vshufi32x4 $0x88, t1, x0, x0; \ + \ + vshufi32x4 $0xdd, x2, t2, x3; \ + vshufi32x4 $0x88, x2, t2, x2; + +#define xor_src_dst_4x4(dst, src, offset, add, x0, x4, x8, x12) \ + vpxord (offset + 0 * (add))(src), x0, x0; \ + vpxord (offset + 1 * (add))(src), x4, x4; \ + vpxord (offset + 2 * (add))(src), x8, x8; \ + vpxord (offset + 3 * (add))(src), x12, x12; \ + vmovdqu32 x0, (offset + 0 * (add))(dst); \ + vmovdqu32 x4, (offset + 1 * (add))(dst); \ + vmovdqu32 x8, (offset + 2 * (add))(dst); \ + vmovdqu32 x12, (offset + 3 * (add))(dst); + +#define xor_src_dst(dst, src, offset, xreg) \ + vpxord offset(src), xreg, xreg; \ + vmovdqu32 xreg, offset(dst); + +#define clear_vec4(v0,v1,v2,v3) \ + vpxord v0, v0, v0; \ + vpxord v1, v1, v1; \ + vpxord v2, v2, v2; \ + vpxord v3, v3, v3; + +#define clear_zmm16_zmm31() \ + clear_vec4(%xmm16, %xmm20, %xmm24, %xmm28); \ + clear_vec4(%xmm17, %xmm21, %xmm25, %xmm29); \ + clear_vec4(%xmm18, %xmm22, %xmm26, %xmm30); \ + clear_vec4(%xmm19, %xmm23, %xmm27, %xmm31); + +/********************************************************************** + 16-way chacha20 + **********************************************************************/ + +#define ROTATE2(v1,v2,c) \ + vprold $(c), v1, v1; \ + vprold $(c), v2, v2; + +#define XOR(ds,s) \ + vpxord s, ds, ds; + +#define PLUS(ds,s) \ + vpaddd s, ds, ds; + +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2) \ + PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2); \ + ROTATE2(d1, d2, 16); \ + PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2); \ + ROTATE2(b1, b2, 12); \ + PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2); \ + ROTATE2(d1, d2, 8); \ + PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2); \ + ROTATE2(b1, b2, 7); + +.align 64 +ELF(.type _gcry_chacha20_amd64_avx512_data, at object;) 
+_gcry_chacha20_amd64_avx512_data: +.Linc_counter: + .byte 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 +.Lone: + .long 1,0,0,0 +ELF(.size _gcry_chacha20_amd64_avx512_data,.-_gcry_chacha20_amd64_avx512_data) + +.align 16 +.globl _gcry_chacha20_amd64_avx512_blocks16 +ELF(.type _gcry_chacha20_amd64_avx512_blocks16, at function;) +_gcry_chacha20_amd64_avx512_blocks16: + /* input: + * %rdi: input + * %rsi: dst + * %rdx: src + * %rcx: nblks (multiple of 16) + */ + CFI_STARTPROC(); + + vpxord %xmm16, %xmm16, %xmm16; + vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */ + + vpmovzxbd .Linc_counter rRIP, COUNTER_ADD; + + /* Preload state */ + vpbroadcastd (0 * 4)(INPUT), S0; + vpbroadcastd (1 * 4)(INPUT), S1; + vpbroadcastd (2 * 4)(INPUT), S2; + vpbroadcastd (3 * 4)(INPUT), S3; + vpbroadcastd (4 * 4)(INPUT), S4; + vpbroadcastd (5 * 4)(INPUT), S5; + vpbroadcastd (6 * 4)(INPUT), S6; + vpbroadcastd (7 * 4)(INPUT), S7; + vpbroadcastd (8 * 4)(INPUT), S8; + vpbroadcastd (14 * 4)(INPUT), S14; + vpbroadcastd (15 * 4)(INPUT), S15; + +.align 16 +.Loop16: + movl $20, ROUND; + + /* Construct counter vectors X12 and X13 */ + vpbroadcastd (12 * 4)(INPUT), X12; + vpbroadcastd (13 * 4)(INPUT), X13; + vpaddd COUNTER_ADD, X12, X12; + vpcmpud $6, X12, COUNTER_ADD, %k2; + vpaddd .Lone rRIP {1to16}, X13, X13{%k2}; + vmovdqa32 X12, X12_SAVE; + vmovdqa32 X13, X13_SAVE; + + /* Load vectors */ + vmovdqa32 S0, X0; + vmovdqa32 S4, X4; + vmovdqa32 S8, X8; + vmovdqa32 S1, X1; + vmovdqa32 S5, X5; + vpbroadcastd (9 * 4)(INPUT), X9; + QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13) + vmovdqa32 S2, X2; + vmovdqa32 S6, X6; + vpbroadcastd (10 * 4)(INPUT), X10; + vmovdqa32 S14, X14; + vmovdqa32 S3, X3; + vmovdqa32 S7, X7; + vpbroadcastd (11 * 4)(INPUT), X11; + vmovdqa32 S15, X15; + + /* Update counter */ + addq $16, (12 * 4)(INPUT); + jmp .Lround2_entry; + +.align 16 +.Lround2: + QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14) + QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13) +.Lround2_entry: + subl $2, ROUND; + QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15) + QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12) + jnz .Lround2; + +.Lround2_end: + PLUS(X0, S0); + PLUS(X1, S1); + PLUS(X5, S5); + PLUS(X6, S6); + PLUS(X10, (10 * 4)(INPUT){1to16}); + PLUS(X11, (11 * 4)(INPUT){1to16}); + PLUS(X15, S15); + PLUS(X12, X12_SAVE); + QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14) + + PLUS(X2, S2); + PLUS(X3, S3); + PLUS(X4, S4); + PLUS(X7, S7); + transpose_4x4(X0, X1, X2, X3, TMP0, TMP1); + transpose_4x4(X4, X5, X6, X7, TMP0, TMP1); + PLUS(X8, S8); + PLUS(X9, (9 * 4)(INPUT){1to16}); + PLUS(X13, X13_SAVE); + PLUS(X14, S14); + transpose_4x4(X8, X9, X10, X11, TMP0, TMP1); + transpose_4x4(X12, X13, X14, X15, TMP0, TMP1); + + transpose_16byte_4x4(X0, X4, X8, X12, TMP0, TMP1); + xor_src_dst_4x4(DST, SRC, (64 * 0), (64 * 4), X0, X4, X8, X12); + transpose_16byte_4x4(X1, X5, X9, X13, TMP0, TMP1); + xor_src_dst_4x4(DST, SRC, (64 * 1), (64 * 4), X1, X5, X9, X13); + transpose_16byte_4x4(X2, X6, X10, X14, TMP0, TMP1); + xor_src_dst_4x4(DST, SRC, (64 * 2), (64 * 4), X2, X6, X10, X14); + transpose_16byte_4x4(X3, X7, X11, X15, TMP0, TMP1); + xor_src_dst_4x4(DST, SRC, (64 * 3), (64 * 4), X3, X7, X11, X15); + + subq $16, NBLKS; + leaq (16 * 64)(SRC), SRC; + leaq (16 * 64)(DST), DST; + jnz .Loop16; + + /* clear the used vector registers */ + clear_zmm16_zmm31(); + kmovd %eax, %k2; + vzeroall; /* clears ZMM0-ZMM15 */ + + /* eax zeroed by round loop. 
*/ + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_chacha20_amd64_avx512_blocks16, + .-_gcry_chacha20_amd64_avx512_blocks16;) + +#endif /*defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)*/ +#endif /*__x86_64*/ diff --git a/cipher/chacha20.c b/cipher/chacha20.c index 870cfa18..8dec4317 100644 --- a/cipher/chacha20.c +++ b/cipher/chacha20.c @@ -64,6 +64,14 @@ # define USE_AVX2 1 #endif +/* USE_AVX512 indicates whether to compile with Intel AVX512 code. */ +#undef USE_AVX512 +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX512) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AVX512 1 +#endif + /* USE_ARMV7_NEON indicates whether to enable ARMv7 NEON assembly code. */ #undef USE_ARMV7_NEON #ifdef ENABLE_NEON_SUPPORT @@ -123,6 +131,7 @@ typedef struct CHACHA20_context_s unsigned int unused; /* bytes in the pad. */ unsigned int use_ssse3:1; unsigned int use_avx2:1; + unsigned int use_avx512:1; unsigned int use_neon:1; unsigned int use_ppc:1; unsigned int use_s390x:1; @@ -161,6 +170,14 @@ unsigned int _gcry_chacha20_poly1305_amd64_avx2_blocks8( #endif /* USE_AVX2 */ +#ifdef USE_AVX512 + +unsigned int _gcry_chacha20_amd64_avx512_blocks16(u32 *state, byte *dst, + const byte *src, + size_t nblks) ASM_FUNC_ABI; + +#endif /* USE_AVX2 */ + #ifdef USE_PPC_VEC unsigned int _gcry_chacha20_ppc8_blocks4(u32 *state, byte *dst, @@ -464,6 +481,9 @@ chacha20_do_setkey (CHACHA20_context_t *ctx, #ifdef USE_SSSE3 ctx->use_ssse3 = (features & HWF_INTEL_SSSE3) != 0; #endif +#ifdef USE_AVX512 + ctx->use_avx512 = (features & HWF_INTEL_AVX512) != 0; +#endif #ifdef USE_AVX2 ctx->use_avx2 = (features & HWF_INTEL_AVX2) != 0; #endif @@ -510,6 +530,20 @@ do_chacha20_encrypt_stream_tail (CHACHA20_context_t *ctx, byte *outbuf, static const unsigned char zero_pad[CHACHA20_BLOCK_SIZE] = { 0, }; unsigned int nburn, burn = 0; +#ifdef USE_AVX512 + if (ctx->use_avx512 && length >= CHACHA20_BLOCK_SIZE * 16) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 16; + nburn = _gcry_chacha20_amd64_avx512_blocks16(ctx->input, outbuf, inbuf, + nblocks); + burn = nburn > burn ? nburn : burn; + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + #ifdef USE_AVX2 if (ctx->use_avx2 && length >= CHACHA20_BLOCK_SIZE * 8) { @@ -703,6 +737,13 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, if (0) { } +#ifdef USE_AVX512 + else if (ctx->use_avx512) + { + /* Skip stitched chacha20-poly1305 for AVX512. */ + authptr = NULL; + } +#endif #ifdef USE_AVX2 else if (ctx->use_avx2 && length >= CHACHA20_BLOCK_SIZE * 8) { @@ -1000,6 +1041,7 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, { CHACHA20_context_t *ctx = (void *) &c->context.c; unsigned int nburn, burn = 0; + int skip_stitched = 0; if (!length) return 0; @@ -1035,8 +1077,16 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, gcry_assert (c->u_mode.poly1305.ctx.leftover == 0); +#ifdef USE_AVX512 + if (ctx->use_avx512) + { + /* Skip stitched chacha20-poly1305 for AVX512. 
*/ + skip_stitched = 1; + } +#endif + #ifdef USE_AVX2 - if (ctx->use_avx2 && length >= 8 * CHACHA20_BLOCK_SIZE) + if (!skip_stitched && ctx->use_avx2 && length >= 8 * CHACHA20_BLOCK_SIZE) { size_t nblocks = length / CHACHA20_BLOCK_SIZE; nblocks -= nblocks % 8; @@ -1053,7 +1103,7 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, #endif #ifdef USE_SSSE3 - if (ctx->use_ssse3) + if (!skip_stitched && ctx->use_ssse3) { if (length >= 4 * CHACHA20_BLOCK_SIZE) { @@ -1087,7 +1137,7 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, #endif #ifdef USE_AARCH64_SIMD - if (ctx->use_neon && length >= 4 * CHACHA20_BLOCK_SIZE) + if (!skip_stitched && ctx->use_neon && length >= 4 * CHACHA20_BLOCK_SIZE) { size_t nblocks = length / CHACHA20_BLOCK_SIZE; nblocks -= nblocks % 4; @@ -1104,7 +1154,7 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, #endif #ifdef USE_PPC_VEC_POLY1305 - if (ctx->use_ppc && length >= 4 * CHACHA20_BLOCK_SIZE) + if (!skip_stitched && ctx->use_ppc && length >= 4 * CHACHA20_BLOCK_SIZE) { size_t nblocks = length / CHACHA20_BLOCK_SIZE; nblocks -= nblocks % 4; @@ -1121,7 +1171,7 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, #endif #ifdef USE_S390X_VX_POLY1305 - if (ctx->use_s390x) + if (!skip_stitched && ctx->use_s390x) { if (length >= 8 * CHACHA20_BLOCK_SIZE) { diff --git a/configure.ac b/configure.ac index 778dc633..582678e6 100644 --- a/configure.ac +++ b/configure.ac @@ -2759,6 +2759,7 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS chacha20-amd64-ssse3.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS chacha20-amd64-avx2.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS chacha20-amd64-avx512.lo" ;; aarch64-*-*) # Build with the assembly implementation -- 2.32.0 From jussi.kivilinna at iki.fi Sun Apr 3 17:10:42 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Apr 2022 18:10:42 +0300 Subject: [PATCH 1/2] poly1305: add AVX512 implementation Message-ID: <20220403151043.4096276-1-jussi.kivilinna@iki.fi> * LICENSES: Add 3-clause BSD license for poly1305-amd64-avx512.S. * cipher/Makefile.am: Add 'poly1305-amd64-avx512.S'. * cipher/poly1305-amd64-avx512.S: New. * cipher/poly1305-internal.h (POLY1305_USE_AVX512): New. (poly1305_context_s): Add 'use_avx512'. * cipher/poly1305.c (ASM_FUNC_ABI, ASM_FUNC_WRAPPER_ATTR): New. [POLY1305_USE_AVX512] (_gcry_poly1305_amd64_avx512_blocks) (poly1305_amd64_avx512_blocks): New. (poly1305_init): Use AVX512 is HW feature available (set use_avx512). [USE_MPI_64BIT] (poly1305_blocks): Rename to ... [USE_MPI_64BIT] (poly1305_blocks_generic): ... this. [USE_MPI_64BIT] (poly1305_blocks): New. -- Patch adds AMD64 AVX512-FMA52 implementation for Poly1305. 
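Here "FMA52" refers to the AVX512-IFMA vpmadd52luq/vpmadd52huq instructions
used below: the 130-bit accumulator is kept in three limbs of 44, 44 and 42
bits (see the .Lmask_44/.Lmask_42 constants), so the partial products stay
within the 52-bit multiplier datapath, and anything carried out above bit
130 is folded back multiplied by 5, since 2^130 = 5 (mod 2^130 - 5).
A scalar sketch of that partial reduction, for illustration only (this
helper is not part of the patch; the vector code does the same thing per
8 or 16 blocks with vpsrlq/vpandq/vpaddq):

  #include <stdint.h>

  /* Partially reduce h = h0 + 2^44*h1 + 2^88*h2 modulo 2^130 - 5.
   * Carries ripple up the limbs; the carry out of bit 130 is added
   * back into the lowest limb multiplied by 5. */
  static void
  poly1305_fold_130_5 (uint64_t h[3])
  {
    uint64_t c;

    c = h[0] >> 44;  h[0] &= (1ULL << 44) - 1;  h[1] += c;
    c = h[1] >> 44;  h[1] &= (1ULL << 44) - 1;  h[2] += c;
    c = h[2] >> 42;  h[2] &= (1ULL << 42) - 1;
    h[0] += c * 5;   /* 2^130 == 5 (mod 2^130 - 5) */
    c = h[0] >> 44;  h[0] &= (1ULL << 44) - 1;  h[1] += c;
  }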
Benchmark on Intel Core i3-1115G4 (tigerlake): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz POLY1305 | 0.306 ns/B 3117 MiB/s 1.25 c/B 4090 After (5.0x faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz POLY1305 | 0.061 ns/B 15699 MiB/s 0.249 c/B 4095?3 Signed-off-by: Jussi Kivilinna --- LICENSES | 30 + cipher/Makefile.am | 2 +- cipher/poly1305-amd64-avx512.S | 1625 ++++++++++++++++++++++++++++++++ cipher/poly1305-internal.h | 13 + cipher/poly1305.c | 50 +- configure.ac | 3 + 6 files changed, 1720 insertions(+), 3 deletions(-) create mode 100644 cipher/poly1305-amd64-avx512.S diff --git a/LICENSES b/LICENSES index 94499501..67b80e64 100644 --- a/LICENSES +++ b/LICENSES @@ -56,6 +56,36 @@ with any binary distributions derived from the GNU C Library. SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #+end_quote + For files: + - cipher/poly1305-amd64-avx512.S + +#+begin_quote + Copyright (c) 2021-2022, Intel Corporation + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + * Neither the name of Intel Corporation nor the names of its contributors + may be used to endorse or promote products derived from this software + without specific prior written permission. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE + FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +#+end_quote + For files: - random/jitterentropy-base.c - random/jitterentropy-gcd.c diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 1ac1923b..b6319d35 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -98,7 +98,7 @@ EXTRA_libcipher_la_SOURCES = \ gostr3411-94.c \ md4.c \ md5.c \ - poly1305-s390x.S \ + poly1305-s390x.S poly1305-amd64-avx512.S \ rijndael.c rijndael-internal.h rijndael-tables.h \ rijndael-aesni.c rijndael-padlock.c \ rijndael-amd64.S rijndael-arm.S \ diff --git a/cipher/poly1305-amd64-avx512.S b/cipher/poly1305-amd64-avx512.S new file mode 100644 index 00000000..48892777 --- /dev/null +++ b/cipher/poly1305-amd64-avx512.S @@ -0,0 +1,1625 @@ +/* +;; +;; Copyright (c) 2021-2022, Intel Corporation +;; +;; Redistribution and use in source and binary forms, with or without +;; modification, are permitted provided that the following conditions are met: +;; +;; * Redistributions of source code must retain the above copyright notice, +;; this list of conditions and the following disclaimer. 
+;; * Redistributions in binary form must reproduce the above copyright +;; notice, this list of conditions and the following disclaimer in the +;; documentation and/or other materials provided with the distribution. +;; * Neither the name of Intel Corporation nor the names of its contributors +;; may be used to endorse or promote products derived from this software +;; without specific prior written permission. +;; +;; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +;; AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +;; IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +;; DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE +;; FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +;; DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +;; SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +;; CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +;; OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +;; OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +;; +*/ +/* + * From: + * https://github.com/intel/intel-ipsec-mb/blob/f0cad21a644231c0f5d4af51f56061a5796343fb/lib/avx512/poly_fma_avx512.asm + * + * Conversion to GAS assembly and integration to libgcrypt + * by Jussi Kivilinna + */ + +#ifdef __x86_64 +#include +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AVX512) +#include "asm-common-amd64.h" + +.intel_syntax noprefix + +.text + +ELF(.type _gcry_poly1305_avx512_consts, at object) +_gcry_poly1305_avx512_consts: + +.align 64 +.Lmask_44: + .quad 0xfffffffffff, 0xfffffffffff, 0xfffffffffff, 0xfffffffffff + .quad 0xfffffffffff, 0xfffffffffff, 0xfffffffffff, 0xfffffffffff + +.align 64 +.Lmask_42: + .quad 0x3ffffffffff, 0x3ffffffffff, 0x3ffffffffff, 0x3ffffffffff + .quad 0x3ffffffffff, 0x3ffffffffff, 0x3ffffffffff, 0x3ffffffffff + +.align 64 +.Lhigh_bit: + .quad 0x10000000000, 0x10000000000, 0x10000000000, 0x10000000000 + .quad 0x10000000000, 0x10000000000, 0x10000000000, 0x10000000000 + +.Lbyte_len_to_mask_table: + .short 0x0000, 0x0001, 0x0003, 0x0007 + .short 0x000f, 0x001f, 0x003f, 0x007f + .short 0x00ff, 0x01ff, 0x03ff, 0x07ff + .short 0x0fff, 0x1fff, 0x3fff, 0x7fff + .short 0xffff + +.align 64 +.Lbyte64_len_to_mask_table: + .quad 0x0000000000000000, 0x0000000000000001 + .quad 0x0000000000000003, 0x0000000000000007 + .quad 0x000000000000000f, 0x000000000000001f + .quad 0x000000000000003f, 0x000000000000007f + .quad 0x00000000000000ff, 0x00000000000001ff + .quad 0x00000000000003ff, 0x00000000000007ff + .quad 0x0000000000000fff, 0x0000000000001fff + .quad 0x0000000000003fff, 0x0000000000007fff + .quad 0x000000000000ffff, 0x000000000001ffff + .quad 0x000000000003ffff, 0x000000000007ffff + .quad 0x00000000000fffff, 0x00000000001fffff + .quad 0x00000000003fffff, 0x00000000007fffff + .quad 0x0000000000ffffff, 0x0000000001ffffff + .quad 0x0000000003ffffff, 0x0000000007ffffff + .quad 0x000000000fffffff, 0x000000001fffffff + .quad 0x000000003fffffff, 0x000000007fffffff + .quad 0x00000000ffffffff, 0x00000001ffffffff + .quad 0x00000003ffffffff, 0x00000007ffffffff + .quad 0x0000000fffffffff, 0x0000001fffffffff + .quad 0x0000003fffffffff, 0x0000007fffffffff + .quad 0x000000ffffffffff, 0x000001ffffffffff + .quad 
0x000003ffffffffff, 0x000007ffffffffff + .quad 0x00000fffffffffff, 0x00001fffffffffff + .quad 0x00003fffffffffff, 0x00007fffffffffff + .quad 0x0000ffffffffffff, 0x0001ffffffffffff + .quad 0x0003ffffffffffff, 0x0007ffffffffffff + .quad 0x000fffffffffffff, 0x001fffffffffffff + .quad 0x003fffffffffffff, 0x007fffffffffffff + .quad 0x00ffffffffffffff, 0x01ffffffffffffff + .quad 0x03ffffffffffffff, 0x07ffffffffffffff + .quad 0x0fffffffffffffff, 0x1fffffffffffffff + .quad 0x3fffffffffffffff, 0x7fffffffffffffff + .quad 0xffffffffffffffff + +.Lqword_high_bit_mask: + .short 0, 0x1, 0x5, 0x15, 0x55, 0x57, 0x5f, 0x7f, 0xff + +ELF(.size _gcry_poly1305_avx512_consts,.-_gcry_poly1305_avx512_consts) + +#define raxd eax +#define rbxd ebx +#define rcxd ecx +#define rdxd edx +#define rsid esi +#define rdid edi +#define rbpd ebp +#define rspd esp +#define __DWORD(X) X##d +#define DWORD(R) __DWORD(R) + +#define arg1 rdi +#define arg2 rsi +#define arg3 rdx +#define arg4 rcx + +#define job arg1 +#define gp1 rsi +#define gp2 rcx + +/* ;; don't use rdx and rax - they are needed for multiply operation */ +#define gp3 rbp +#define gp4 r8 +#define gp5 r9 +#define gp6 r10 +#define gp7 r11 +#define gp8 r12 +#define gp9 r13 +#define gp10 r14 +#define gp11 r15 + +#define len gp11 +#define msg gp10 + +#define POLY1305_BLOCK_SIZE 16 + +#define STACK_r_save 0 +#define STACK_r_save_size (6 * 64) +#define STACK_gpr_save (STACK_r_save + STACK_r_save_size) +#define STACK_gpr_save_size (8 * 8) +#define STACK_rsp_save (STACK_gpr_save + STACK_gpr_save_size) +#define STACK_rsp_save_size (1 * 8) +#define STACK_SIZE (STACK_rsp_save + STACK_rsp_save_size) + +#define A2_ZERO(...) /**/ +#define A2_ZERO_INVERT(...) __VA_ARGS__ +#define A2_NOT_ZERO(...) __VA_ARGS__ +#define A2_NOT_ZERO_INVERT(...) /**/ + +#define clear_zmm(vec) vpxord vec, vec, vec + +/* +;; ============================================================================= +;; ============================================================================= +;; Computes hash for message length being multiple of block size +;; ============================================================================= +;; Combining 64-bit x 64-bit multiplication with reduction steps +;; +;; NOTES: +;; 1) A2 here is only two bits so anything above is subject of reduction. +;; Constant C1 = R1 + (R1 >> 2) simplifies multiply with less operations +;; 2) Magic 5x comes from mod 2^130-5 property and incorporating +;; reduction into multiply phase. +;; See "Cheating at modular arithmetic" and "Poly1305's prime: 2^130 - 5" +;; paragraphs at https://loup-vaillant.fr/tutorials/poly1305-design for more details. 
+;; +;; Flow of the code below is as follows: +;; +;; A2 A1 A0 +;; x R1 R0 +;; ----------------------------- +;; A2?R0 A1?R0 A0?R0 +;; + A0?R1 +;; + 5xA2xR1 5xA1xR1 +;; ----------------------------- +;; [0|L2L] [L1H|L1L] [L0H|L0L] +;; +;; Registers: T3:T2 T1:A0 +;; +;; Completing the multiply and adding (with carry) 3x128-bit limbs into +;; 192-bits again (3x64-bits): +;; A0 = L0L +;; A1 = L0H + L1L +;; T3 = L1H + L2L +; A0 [in/out] GPR with accumulator bits 63:0 +; A1 [in/out] GPR with accumulator bits 127:64 +; A2 [in/out] GPR with accumulator bits 195:128 +; R0 [in] GPR with R constant bits 63:0 +; R1 [in] GPR with R constant bits 127:64 +; C1 [in] C1 = R1 + (R1 >> 2) +; T1 [clobbered] GPR register +; T2 [clobbered] GPR register +; T3 [clobbered] GPR register +; GP_RAX [clobbered] RAX register +; GP_RDX [clobbered] RDX register +; IF_A2 [in] Used if input A2 is not 0 +*/ +#define POLY1305_MUL_REDUCE(A0, A1, A2, R0, R1, C1, T1, T2, T3, GP_RAX, GP_RDX, IF_A2) \ + /* T3:T2 = (A0 * R1) */ \ + mov GP_RAX, R1; \ + mul A0; \ + mov T2, GP_RAX; \ + mov GP_RAX, R0; \ + mov T3, GP_RDX; \ + \ + /* T1:A0 = (A0 * R0) */ \ + mul A0; \ + mov A0, GP_RAX; /* A0 not used in other operations */ \ + mov GP_RAX, R0; \ + mov T1, GP_RDX; \ + \ + /* T3:T2 += (A1 * R0) */ \ + mul A1; \ + add T2, GP_RAX; \ + mov GP_RAX, C1; \ + adc T3, GP_RDX; \ + \ + /* T1:A0 += (A1 * R1x5) */ \ + mul A1; \ + IF_A2(mov A1, A2); /* use A1 for A2 */ \ + add A0, GP_RAX; \ + adc T1, GP_RDX; \ + \ + /* NOTE: A2 is clamped to 2-bits, */ \ + /* R1/R0 is clamped to 60-bits, */ \ + /* their product is less than 2^64. */ \ + \ + IF_A2(/* T3:T2 += (A2 * R1x5) */); \ + IF_A2(imul A1, C1); \ + IF_A2(add T2, A1); \ + IF_A2(mov A1, T1); /* T1:A0 => A1:A0 */ \ + IF_A2(adc T3, 0); \ + \ + IF_A2(/* T3:A1 += (A2 * R0) */); \ + IF_A2(imul A2, R0); \ + IF_A2(add A1, T2); \ + IF_A2(adc T3, A2); \ + \ + IF_A2##_INVERT(/* If A2 == 0, just move and add T1-T2 to A1 */); \ + IF_A2##_INVERT(mov A1, T1); \ + IF_A2##_INVERT(add A1, T2); \ + IF_A2##_INVERT(adc T3, 0); \ + \ + /* At this point, 3 64-bit limbs are in T3:A1:A0 */ \ + /* T3 can span over more than 2 bits so final partial reduction step is needed. */ \ + \ + /* Partial reduction (just to fit into 130 bits) */ \ + /* A2 = T3 & 3 */ \ + /* k = (T3 & ~3) + (T3 >> 2) */ \ + /* Y x4 + Y x1 */ \ + /* A2:A1:A0 += k */ \ + \ + /* Result will be in A2:A1:A0 */ \ + mov T1, T3; \ + mov DWORD(A2), DWORD(T3); \ + and T1, ~3; \ + shr T3, 2; \ + and DWORD(A2), 3; \ + add T1, T3; \ + \ + /* A2:A1:A0 += k (kept in T1) */ \ + add A0, T1; \ + adc A1, 0; \ + adc DWORD(A2), 0 + +/* +;; ============================================================================= +;; ============================================================================= +;; Computes hash for 8 16-byte message blocks, +;; and adds new message blocks to accumulator. +;; +;; It first multiplies all 8 blocks with powers of R: +;; +;; a2 a1 a0 +;; ? b2 b1 b0 +;; --------------------------------------- +;; a2?b0 a1?b0 a0?b0 +;; + a1?b1 a0?b1 5?a2?b1 +;; + a0?b2 5?a2?b2 5?a1?b2 +;; --------------------------------------- +;; p2 p1 p0 +;; +;; Then, it propagates the carry (higher bits after bit 43) from lower limbs into higher limbs, +;; multiplying by 5 in case of the carry of p2. 
+;; +;A0 [in/out] ZMM register containing 1st 44-bit limb of the 8 blocks +;A1 [in/out] ZMM register containing 2nd 44-bit limb of the 8 blocks +;A2 [in/out] ZMM register containing 3rd 44-bit limb of the 8 blocks +;R0 [in] ZMM register (R0) to include the 1st limb of R +;R1 [in] ZMM register (R1) to include the 2nd limb of R +;R2 [in] ZMM register (R2) to include the 3rd limb of R +;R1P [in] ZMM register (R1') to include the 2nd limb of R (multiplied by 5) +;R2P [in] ZMM register (R2') to include the 3rd limb of R (multiplied by 5) +;P0_L [clobbered] ZMM register to contain p[0] of the 8 blocks +;P0_H [clobbered] ZMM register to contain p[0] of the 8 blocks +;P1_L [clobbered] ZMM register to contain p[1] of the 8 blocks +;P1_H [clobbered] ZMM register to contain p[1] of the 8 blocks +;P2_L [clobbered] ZMM register to contain p[2] of the 8 blocks +;P2_H [clobbered] ZMM register to contain p[2] of the 8 blocks +;ZTMP1 [clobbered] Temporary ZMM register +*/ +#define POLY1305_MUL_REDUCE_VEC(A0, A1, A2, R0, R1, R2, R1P, R2P, P0_L, P0_H, \ + P1_L, P1_H, P2_L, P2_H, ZTMP1) \ + /* ;; Reset accumulator */ \ + vpxorq P0_L, P0_L, P0_L; \ + vpxorq P0_H, P0_H, P0_H; \ + vpxorq P1_L, P1_L, P1_L; \ + vpxorq P1_H, P1_H, P1_H; \ + vpxorq P2_L, P2_L, P2_L; \ + vpxorq P2_H, P2_H, P2_H; \ + \ + /* ; Reset accumulator and calculate products */ \ + vpmadd52luq P0_L, A2, R1P; \ + vpmadd52huq P0_H, A2, R1P; \ + vpmadd52luq P1_L, A2, R2P; \ + vpmadd52huq P1_H, A2, R2P; \ + vpmadd52luq P2_L, A2, R0; \ + vpmadd52huq P2_H, A2, R0; \ + \ + vpmadd52luq P1_L, A0, R1; \ + vpmadd52huq P1_H, A0, R1; \ + vpmadd52luq P2_L, A0, R2; \ + vpmadd52huq P2_H, A0, R2; \ + vpmadd52luq P0_L, A0, R0; \ + vpmadd52huq P0_H, A0, R0; \ + \ + vpmadd52luq P0_L, A1, R2P; \ + vpmadd52huq P0_H, A1, R2P; \ + vpmadd52luq P1_L, A1, R0; \ + vpmadd52huq P1_H, A1, R0; \ + vpmadd52luq P2_L, A1, R1; \ + vpmadd52huq P2_H, A1, R1; \ + \ + /* ; Carry propagation (first pass) */ \ + vpsrlq ZTMP1, P0_L, 44; \ + vpandq A0, P0_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpsllq P0_H, P0_H, 8; \ + vpaddq P0_H, P0_H, ZTMP1; \ + vpaddq P1_L, P1_L, P0_H; \ + vpandq A1, P1_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpsrlq ZTMP1, P1_L, 44; \ + vpsllq P1_H, P1_H, 8; \ + vpaddq P1_H, P1_H, ZTMP1; \ + vpaddq P2_L, P2_L, P1_H; \ + vpandq A2, P2_L, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpsrlq ZTMP1, P2_L, 42; \ + vpsllq P2_H, P2_H, 10; \ + vpaddq P2_H, P2_H, ZTMP1; \ + \ + /* ; Carry propagation (second pass) */ \ + \ + /* ; Multiply by 5 the highest bits (above 130 bits) */ \ + vpaddq A0, A0, P2_H; \ + vpsllq P2_H, P2_H, 2; \ + vpaddq A0, A0, P2_H; \ + vpsrlq ZTMP1, A0, 44; \ + vpandq A0, A0, [.Lmask_44 ADD_RIP]; \ + vpaddq A1, A1, ZTMP1; + +/* +;; ============================================================================= +;; ============================================================================= +;; Computes hash for 16 16-byte message blocks, +;; and adds new message blocks to accumulator, +;; interleaving this computation with the loading and splatting +;; of new data. +;; +;; It first multiplies all 16 blocks with powers of R (8 blocks from A0-A2 +;; and 8 blocks from B0-B2, multiplied by R0-R2) +;; +;; a2 a1 a0 +;; ? 
b2 b1 b0 +;; --------------------------------------- +;; a2?b0 a1?b0 a0?b0 +;; + a1?b1 a0?b1 5?a2?b1 +;; + a0?b2 5?a2?b2 5?a1?b2 +;; --------------------------------------- +;; p2 p1 p0 +;; +;; Then, it propagates the carry (higher bits after bit 43) +;; from lower limbs into higher limbs, +;; multiplying by 5 in case of the carry of p2, and adds +;; the results to A0-A2 and B0-B2. +;; +;; ============================================================================= +;A0 [in/out] ZMM register containing 1st 44-bit limb of blocks 1-8 +;A1 [in/out] ZMM register containing 2nd 44-bit limb of blocks 1-8 +;A2 [in/out] ZMM register containing 3rd 44-bit limb of blocks 1-8 +;B0 [in/out] ZMM register containing 1st 44-bit limb of blocks 9-16 +;B1 [in/out] ZMM register containing 2nd 44-bit limb of blocks 9-16 +;B2 [in/out] ZMM register containing 3rd 44-bit limb of blocks 9-16 +;R0 [in] ZMM register (R0) to include the 1st limb of R +;R1 [in] ZMM register (R1) to include the 2nd limb of R +;R2 [in] ZMM register (R2) to include the 3rd limb of R +;R1P [in] ZMM register (R1') to include the 2nd limb of R (multiplied by 5) +;R2P [in] ZMM register (R2') to include the 3rd limb of R (multiplied by 5) +;P0_L [clobbered] ZMM register to contain p[0] of the 8 blocks 1-8 +;P0_H [clobbered] ZMM register to contain p[0] of the 8 blocks 1-8 +;P1_L [clobbered] ZMM register to contain p[1] of the 8 blocks 1-8 +;P1_H [clobbered] ZMM register to contain p[1] of the 8 blocks 1-8 +;P2_L [clobbered] ZMM register to contain p[2] of the 8 blocks 1-8 +;P2_H [clobbered] ZMM register to contain p[2] of the 8 blocks 1-8 +;Q0_L [clobbered] ZMM register to contain p[0] of the 8 blocks 9-16 +;Q0_H [clobbered] ZMM register to contain p[0] of the 8 blocks 9-16 +;Q1_L [clobbered] ZMM register to contain p[1] of the 8 blocks 9-16 +;Q1_H [clobbered] ZMM register to contain p[1] of the 8 blocks 9-16 +;Q2_L [clobbered] ZMM register to contain p[2] of the 8 blocks 9-16 +;Q2_H [clobbered] ZMM register to contain p[2] of the 8 blocks 9-16 +;ZTMP1 [clobbered] Temporary ZMM register +;ZTMP2 [clobbered] Temporary ZMM register +;ZTMP3 [clobbered] Temporary ZMM register +;ZTMP4 [clobbered] Temporary ZMM register +;ZTMP5 [clobbered] Temporary ZMM register +;ZTMP6 [clobbered] Temporary ZMM register +;ZTMP7 [clobbered] Temporary ZMM register +;ZTMP8 [clobbered] Temporary ZMM register +;ZTMP9 [clobbered] Temporary ZMM register +;MSG [in/out] Pointer to message +;LEN [in/out] Length left of message +*/ +#define POLY1305_MSG_MUL_REDUCE_VEC16(A0, A1, A2, B0, B1, B2, R0, R1, R2, R1P, \ + R2P, P0_L, P0_H, P1_L, P1_H, P2_L, P2_H, \ + Q0_L, Q0_H, Q1_L, Q1_H, Q2_L, Q2_H, \ + ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, \ + ZTMP6, ZTMP7, ZTMP8, ZTMP9, MSG, LEN) \ + /* ;; Reset accumulator */ \ + vpxorq P0_L, P0_L, P0_L; \ + vpxorq P0_H, P0_H, P0_H; \ + vpxorq P1_L, P1_L, P1_L; \ + vpxorq P1_H, P1_H, P1_H; \ + vpxorq P2_L, P2_L, P2_L; \ + vpxorq P2_H, P2_H, P2_H; \ + vpxorq Q0_L, Q0_L, Q0_L; \ + vpxorq Q0_H, Q0_H, Q0_H; \ + vpxorq Q1_L, Q1_L, Q1_L; \ + vpxorq Q1_H, Q1_H, Q1_H; \ + vpxorq Q2_L, Q2_L, Q2_L; \ + vpxorq Q2_H, Q2_H, Q2_H; \ + \ + /* ;; This code interleaves hash computation with input loading/splatting */ \ + \ + /* ; Calculate products */ \ + vpmadd52luq P0_L, A2, R1P; \ + vpmadd52huq P0_H, A2, R1P; \ + /* ;; input loading of new blocks */ \ + add MSG, POLY1305_BLOCK_SIZE*16; \ + sub LEN, POLY1305_BLOCK_SIZE*16; \ + \ + vpmadd52luq Q0_L, B2, R1P; \ + vpmadd52huq Q0_H, B2, R1P; \ + \ + vpmadd52luq P1_L, A2, R2P; \ + vpmadd52huq P1_H, A2, R2P; \ + /* ; 
Load next block of data (128 bytes) */ \ + vmovdqu64 ZTMP5, [MSG]; \ + vmovdqu64 ZTMP2, [MSG + 64]; \ + \ + vpmadd52luq Q1_L, B2, R2P; \ + vpmadd52huq Q1_H, B2, R2P; \ + \ + /* ; Interleave new blocks of data */ \ + vpunpckhqdq ZTMP3, ZTMP5, ZTMP2; \ + vpunpcklqdq ZTMP5, ZTMP5, ZTMP2; \ + \ + vpmadd52luq P0_L, A0, R0; \ + vpmadd52huq P0_H, A0, R0; \ + /* ; Highest 42-bit limbs of new blocks */ \ + vpsrlq ZTMP6, ZTMP3, 24; \ + vporq ZTMP6, ZTMP6, [.Lhigh_bit ADD_RIP]; /* ; Add 2^128 to all 8 final qwords of the message */ \ + \ + vpmadd52luq Q0_L, B0, R0; \ + vpmadd52huq Q0_H, B0, R0; \ + \ + /* ; Middle 44-bit limbs of new blocks */ \ + vpsrlq ZTMP2, ZTMP5, 44; \ + vpsllq ZTMP4, ZTMP3, 20; \ + \ + vpmadd52luq P2_L, A2, R0; \ + vpmadd52huq P2_H, A2, R0; \ + vpternlogq ZTMP2, ZTMP4, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + /* ; Lowest 44-bit limbs of new blocks */ \ + vpandq ZTMP5, ZTMP5, [.Lmask_44 ADD_RIP]; \ + \ + vpmadd52luq Q2_L, B2, R0; \ + vpmadd52huq Q2_H, B2, R0; \ + \ + /* ; Load next block of data (128 bytes) */ \ + vmovdqu64 ZTMP8, [MSG + 64*2]; \ + vmovdqu64 ZTMP9, [MSG + 64*3]; \ + \ + vpmadd52luq P1_L, A0, R1; \ + vpmadd52huq P1_H, A0, R1; \ + /* ; Interleave new blocks of data */ \ + vpunpckhqdq ZTMP3, ZTMP8, ZTMP9; \ + vpunpcklqdq ZTMP8, ZTMP8, ZTMP9; \ + \ + vpmadd52luq Q1_L, B0, R1; \ + vpmadd52huq Q1_H, B0, R1; \ + \ + /* ; Highest 42-bit limbs of new blocks */ \ + vpsrlq ZTMP7, ZTMP3, 24; \ + vporq ZTMP7, ZTMP7, [.Lhigh_bit ADD_RIP]; /* ; Add 2^128 to all 8 final qwords of the message */ \ + \ + vpmadd52luq P0_L, A1, R2P; \ + vpmadd52huq P0_H, A1, R2P; \ + \ + /* ; Middle 44-bit limbs of new blocks */ \ + vpsrlq ZTMP9, ZTMP8, 44; \ + vpsllq ZTMP4, ZTMP3, 20; \ + \ + vpmadd52luq Q0_L, B1, R2P; \ + vpmadd52huq Q0_H, B1, R2P; \ + \ + vpternlogq ZTMP9, ZTMP4, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + /* ; Lowest 44-bit limbs of new blocks */ \ + vpandq ZTMP8, ZTMP8, [.Lmask_44 ADD_RIP]; \ + \ + vpmadd52luq P2_L, A0, R2; \ + vpmadd52huq P2_H, A0, R2; \ + /* ; Carry propagation (first pass) */ \ + vpsrlq ZTMP1, P0_L, 44; \ + vpsllq P0_H, P0_H, 8; \ + vpmadd52luq Q2_L, B0, R2; \ + vpmadd52huq Q2_H, B0, R2; \ + \ + vpsrlq ZTMP3, Q0_L, 44; \ + vpsllq Q0_H, Q0_H, 8; \ + \ + vpmadd52luq P1_L, A1, R0; \ + vpmadd52huq P1_H, A1, R0; \ + /* ; Carry propagation (first pass) - continue */ \ + vpandq A0, P0_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq P0_H, P0_H, ZTMP1; \ + vpmadd52luq Q1_L, B1, R0; \ + vpmadd52huq Q1_H, B1, R0; \ + \ + vpandq B0, Q0_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq Q0_H, Q0_H, ZTMP3; \ + \ + vpmadd52luq P2_L, A1, R1; \ + vpmadd52huq P2_H, A1, R1; \ + /* ; Carry propagation (first pass) - continue */ \ + vpaddq P1_L, P1_L, P0_H; \ + vpsllq P1_H, P1_H, 8; \ + vpsrlq ZTMP1, P1_L, 44; \ + vpmadd52luq Q2_L, B1, R1; \ + vpmadd52huq Q2_H, B1, R1; \ + \ + vpandq A1, P1_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq Q1_L, Q1_L, Q0_H; \ + vpsllq Q1_H, Q1_H, 8; \ + vpsrlq ZTMP3, Q1_L, 44; \ + vpandq B1, Q1_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + \ + vpaddq P2_L, P2_L, P1_H; /* ; P2_L += P1_H + P1_L[63:44] */ \ + vpaddq P2_L, P2_L, ZTMP1; \ + vpandq A2, P2_L, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpaddq A2, A2, ZTMP6; /* ; Add highest bits from new blocks to accumulator */ \ + vpsrlq ZTMP1, P2_L, 42; \ + vpsllq P2_H, P2_H, 10; \ + vpaddq P2_H, P2_H, ZTMP1; \ + \ + vpaddq Q2_L, Q2_L, Q1_H; /* ; Q2_L += P1_H + P1_L[63:44] */ \ + vpaddq Q2_L, Q2_L, ZTMP3; \ + vpandq B2, 
Q2_L, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpaddq B2, B2, ZTMP7; /* ; Add highest bits from new blocks to accumulator */ \ + vpsrlq ZTMP3, Q2_L, 42; \ + vpsllq Q2_H, Q2_H, 10; \ + vpaddq Q2_H, Q2_H, ZTMP3; \ + \ + /* ; Carry propagation (second pass) */ \ + /* ; Multiply by 5 the highest bits (above 130 bits) */ \ + vpaddq A0, A0, P2_H; \ + vpsllq P2_H, P2_H, 2; \ + vpaddq A0, A0, P2_H; \ + vpaddq B0, B0, Q2_H; \ + vpsllq Q2_H, Q2_H, 2; \ + vpaddq B0, B0, Q2_H; \ + \ + vpsrlq ZTMP1, A0, 44; \ + vpandq A0, A0, [.Lmask_44 ADD_RIP]; \ + vpaddq A0, A0, ZTMP5; /* ; Add low 42-bit bits from new blocks to accumulator */ \ + vpaddq A1, A1, ZTMP2; /* ; Add medium 42-bit bits from new blocks to accumulator */ \ + vpaddq A1, A1, ZTMP1; \ + vpsrlq ZTMP3, B0, 44; \ + vpandq B0, B0, [.Lmask_44 ADD_RIP]; \ + vpaddq B0, B0, ZTMP8; /* ; Add low 42-bit bits from new blocks to accumulator */ \ + vpaddq B1, B1, ZTMP9; /* ; Add medium 42-bit bits from new blocks to accumulator */ \ + vpaddq B1, B1, ZTMP3 + +/* +;; ============================================================================= +;; ============================================================================= +;; Computes hash for 16 16-byte message blocks. +;; +;; It first multiplies all 16 blocks with powers of R (8 blocks from A0-A2 +;; and 8 blocks from B0-B2, multiplied by R0-R2 and S0-S2) +;; +;; +;; a2 a1 a0 +;; ? b2 b1 b0 +;; --------------------------------------- +;; a2?b0 a1?b0 a0?b0 +;; + a1?b1 a0?b1 5?a2?b1 +;; + a0?b2 5?a2?b2 5?a1?b2 +;; --------------------------------------- +;; p2 p1 p0 +;; +;; Then, it propagates the carry (higher bits after bit 43) from lower limbs into higher limbs, +;; multiplying by 5 in case of the carry of p2. +;; +;; ============================================================================= +;A0 [in/out] ZMM register containing 1st 44-bit limb of the 8 blocks +;A1 [in/out] ZMM register containing 2nd 44-bit limb of the 8 blocks +;A2 [in/out] ZMM register containing 3rd 44-bit limb of the 8 blocks +;B0 [in/out] ZMM register containing 1st 44-bit limb of the 8 blocks +;B1 [in/out] ZMM register containing 2nd 44-bit limb of the 8 blocks +;B2 [in/out] ZMM register containing 3rd 44-bit limb of the 8 blocks +;R0 [in] ZMM register (R0) to include the 1st limb in IDX +;R1 [in] ZMM register (R1) to include the 2nd limb in IDX +;R2 [in] ZMM register (R2) to include the 3rd limb in IDX +;R1P [in] ZMM register (R1') to include the 2nd limb (multiplied by 5) in IDX +;R2P [in] ZMM register (R2') to include the 3rd limb (multiplied by 5) in IDX +;S0 [in] ZMM register (R0) to include the 1st limb in IDX +;S1 [in] ZMM register (R1) to include the 2nd limb in IDX +;S2 [in] ZMM register (R2) to include the 3rd limb in IDX +;S1P [in] ZMM register (R1') to include the 2nd limb (multiplied by 5) in IDX +;S2P [in] ZMM register (R2') to include the 3rd limb (multiplied by 5) in IDX +;P0_L [clobbered] ZMM register to contain p[0] of the 8 blocks +;P0_H [clobbered] ZMM register to contain p[0] of the 8 blocks +;P1_L [clobbered] ZMM register to contain p[1] of the 8 blocks +;P1_H [clobbered] ZMM register to contain p[1] of the 8 blocks +;P2_L [clobbered] ZMM register to contain p[2] of the 8 blocks +;P2_H [clobbered] ZMM register to contain p[2] of the 8 blocks +;Q0_L [clobbered] ZMM register to contain p[0] of the 8 blocks +;Q0_H [clobbered] ZMM register to contain p[0] of the 8 blocks +;Q1_L [clobbered] ZMM register to contain p[1] of the 8 blocks +;Q1_H [clobbered] ZMM register to contain p[1] of the 8 blocks 
+;Q2_L [clobbered] ZMM register to contain p[2] of the 8 blocks +;Q2_H [clobbered] ZMM register to contain p[2] of the 8 blocks +;ZTMP1 [clobbered] Temporary ZMM register +;ZTMP2 [clobbered] Temporary ZMM register +*/ +#define POLY1305_MUL_REDUCE_VEC16(A0, A1, A2, B0, B1, B2, R0, R1, R2, R1P, R2P,\ + S0, S1, S2, S1P, S2P, P0_L, P0_H, P1_L, P1_H,\ + P2_L, P2_H, Q0_L, Q0_H, Q1_L, Q1_H, Q2_L,\ + Q2_H, ZTMP1, ZTMP2) \ + /* ;; Reset accumulator */ \ + vpxorq P0_L, P0_L, P0_L; \ + vpxorq P0_H, P0_H, P0_H; \ + vpxorq P1_L, P1_L, P1_L; \ + vpxorq P1_H, P1_H, P1_H; \ + vpxorq P2_L, P2_L, P2_L; \ + vpxorq P2_H, P2_H, P2_H; \ + vpxorq Q0_L, Q0_L, Q0_L; \ + vpxorq Q0_H, Q0_H, Q0_H; \ + vpxorq Q1_L, Q1_L, Q1_L; \ + vpxorq Q1_H, Q1_H, Q1_H; \ + vpxorq Q2_L, Q2_L, Q2_L; \ + vpxorq Q2_H, Q2_H, Q2_H; \ + \ + /* ;; This code interleaves hash computation with input loading/splatting */ \ + \ + /* ; Calculate products */ \ + vpmadd52luq P0_L, A2, R1P; \ + vpmadd52huq P0_H, A2, R1P; \ + \ + vpmadd52luq Q0_L, B2, S1P; \ + vpmadd52huq Q0_H, B2, S1P; \ + \ + vpmadd52luq P1_L, A2, R2P; \ + vpmadd52huq P1_H, A2, R2P; \ + \ + vpmadd52luq Q1_L, B2, S2P; \ + vpmadd52huq Q1_H, B2, S2P; \ + \ + vpmadd52luq P0_L, A0, R0; \ + vpmadd52huq P0_H, A0, R0; \ + \ + vpmadd52luq Q0_L, B0, S0; \ + vpmadd52huq Q0_H, B0, S0; \ + \ + vpmadd52luq P2_L, A2, R0; \ + vpmadd52huq P2_H, A2, R0; \ + vpmadd52luq Q2_L, B2, S0; \ + vpmadd52huq Q2_H, B2, S0; \ + \ + vpmadd52luq P1_L, A0, R1; \ + vpmadd52huq P1_H, A0, R1; \ + vpmadd52luq Q1_L, B0, S1; \ + vpmadd52huq Q1_H, B0, S1; \ + \ + vpmadd52luq P0_L, A1, R2P; \ + vpmadd52huq P0_H, A1, R2P; \ + \ + vpmadd52luq Q0_L, B1, S2P; \ + vpmadd52huq Q0_H, B1, S2P; \ + \ + vpmadd52luq P2_L, A0, R2; \ + vpmadd52huq P2_H, A0, R2; \ + \ + vpmadd52luq Q2_L, B0, S2; \ + vpmadd52huq Q2_H, B0, S2; \ + \ + /* ; Carry propagation (first pass) */ \ + vpsrlq ZTMP1, P0_L, 44; \ + vpsllq P0_H, P0_H, 8; \ + vpsrlq ZTMP2, Q0_L, 44; \ + vpsllq Q0_H, Q0_H, 8; \ + \ + vpmadd52luq P1_L, A1, R0; \ + vpmadd52huq P1_H, A1, R0; \ + vpmadd52luq Q1_L, B1, S0; \ + vpmadd52huq Q1_H, B1, S0; \ + \ + /* ; Carry propagation (first pass) - continue */ \ + vpandq A0, P0_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq P0_H, P0_H, ZTMP1; \ + vpandq B0, Q0_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq Q0_H, Q0_H, ZTMP2; \ + \ + vpmadd52luq P2_L, A1, R1; \ + vpmadd52huq P2_H, A1, R1; \ + vpmadd52luq Q2_L, B1, S1; \ + vpmadd52huq Q2_H, B1, S1; \ + \ + /* ; Carry propagation (first pass) - continue */ \ + vpaddq P1_L, P1_L, P0_H; \ + vpsllq P1_H, P1_H, 8; \ + vpsrlq ZTMP1, P1_L, 44; \ + vpandq A1, P1_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq Q1_L, Q1_L, Q0_H; \ + vpsllq Q1_H, Q1_H, 8; \ + vpsrlq ZTMP2, Q1_L, 44; \ + vpandq B1, Q1_L, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + \ + vpaddq P2_L, P2_L, P1_H; /* ; P2_L += P1_H + P1_L[63:44] */ \ + vpaddq P2_L, P2_L, ZTMP1; \ + vpandq A2, P2_L, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpsrlq ZTMP1, P2_L, 42; \ + vpsllq P2_H, P2_H, 10; \ + vpaddq P2_H, P2_H, ZTMP1; \ + \ + vpaddq Q2_L, Q2_L, Q1_H; /* ; Q2_L += P1_H + P1_L[63:44] */ \ + vpaddq Q2_L, Q2_L, ZTMP2; \ + vpandq B2, Q2_L, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpsrlq ZTMP2, Q2_L, 42; \ + vpsllq Q2_H, Q2_H, 10; \ + vpaddq Q2_H, Q2_H, ZTMP2; \ + \ + /* ; Carry propagation (second pass) */ \ + /* ; Multiply by 5 the highest bits (above 130 bits) */ \ + vpaddq A0, A0, P2_H; \ + vpsllq P2_H, P2_H, 2; \ + vpaddq A0, A0, P2_H; \ + vpaddq B0, B0, Q2_H; \ + vpsllq 
Q2_H, Q2_H, 2; \ + vpaddq B0, B0, Q2_H; \ + \ + vpsrlq ZTMP1, A0, 44; \ + vpandq A0, A0, [.Lmask_44 ADD_RIP]; \ + vpaddq A1, A1, ZTMP1; \ + vpsrlq ZTMP2, B0, 44; \ + vpandq B0, B0, [.Lmask_44 ADD_RIP]; \ + vpaddq B1, B1, ZTMP2; + +/* +;; ============================================================================= +;; ============================================================================= +;; Shuffle data blocks, so they match the right power of R. +;; Powers of R are in this order: R^8 R^4 R^7 R^3 R^6 R^2 R^5 R +;; Data blocks are coming in this order: A0 A4 A1 A5 A2 A6 A3 A7 +;; Generally the computation is: A0*R^8 + A1*R^7 + A2*R^6 + A3*R^5 + +;; A4*R^4 + A5*R^3 + A6*R^2 + A7*R +;; When there are less data blocks, less powers of R are used, so data needs to +;; be shuffled. Example: if 4 blocks are left, only A0-A3 are available and only +;; R-R^4 are used (A0*R^4 + A1*R^3 + A2*R^2 + A3*R), so A0-A3 need to be shifted +;; ============================================================================= +;A_L [in/out] 0-43 bits of input data +;A_M [in/out] 44-87 bits of input data +;A_H [in/out] 88-129 bits of input data +;TMP [clobbered] Temporary GP register +;N_BLOCKS [in] Number of remaining input blocks +*/ +#define SHUFFLE_DATA_SMASK_1 0x39 +#define SHUFFLE_DATA_KMASK_1 0xffff +#define SHUFFLE_DATA_SMASK_2 0x4E +#define SHUFFLE_DATA_KMASK_2 0xffff +#define SHUFFLE_DATA_SMASK_3 0x93 +#define SHUFFLE_DATA_KMASK_3 0xffff +#define SHUFFLE_DATA_KMASK_4 0xffff +#define SHUFFLE_DATA_SMASK_5 0x39 +#define SHUFFLE_DATA_KMASK_5 0xfff0 +#define SHUFFLE_DATA_SMASK_6 0x4E +#define SHUFFLE_DATA_KMASK_6 0xff00 +#define SHUFFLE_DATA_SMASK_7 0x93 +#define SHUFFLE_DATA_KMASK_7 0xf000 + +#define SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, N_BLOCKS) \ + mov TMP, SHUFFLE_DATA_KMASK_##N_BLOCKS; \ + kmovq k1, TMP; \ + vpshufd A_L{k1}, A_L, 0x4E; \ + vpshufd A_M{k1}, A_M, 0x4E; \ + vpshufd A_H{k1}, A_H, 0x4E; \ + vshufi64x2 A_L, A_L, A_L, SHUFFLE_DATA_SMASK_##N_BLOCKS; \ + vshufi64x2 A_M, A_M, A_M, SHUFFLE_DATA_SMASK_##N_BLOCKS; \ + vshufi64x2 A_H, A_H, A_H, SHUFFLE_DATA_SMASK_##N_BLOCKS + +#define SHUFFLE_DATA_BLOCKS_1(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 1) + +#define SHUFFLE_DATA_BLOCKS_2(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 2) + +#define SHUFFLE_DATA_BLOCKS_3(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 3) + +#define SHUFFLE_DATA_BLOCKS_4(A_L, A_M, A_H, TMP) \ + mov TMP, SHUFFLE_DATA_KMASK_4; \ + kmovq k1, TMP; \ + vpshufd A_L{k1}, A_L, 0x4E; \ + vpshufd A_M{k1}, A_M, 0x4E; \ + vpshufd A_H{k1}, A_H, 0x4E; + +#define SHUFFLE_DATA_BLOCKS_5(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 5) + +#define SHUFFLE_DATA_BLOCKS_6(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 6) + +#define SHUFFLE_DATA_BLOCKS_7(A_L, A_M, A_H, TMP) \ + SHUFFLE_DATA_BLOCKS_GENERIC(A_L, A_M, A_H, TMP, 7) + +/* +;; ============================================================================= +;; ============================================================================= +;; Computes hash for message length being multiple of block size +;; ============================================================================= +;MSG [in/out] GPR pointer to input message (updated) +;LEN [in/out] GPR in: length in bytes / out: length mod 16 +;A0 [in/out] accumulator bits 63..0 +;A1 [in/out] accumulator bits 127..64 +;A2 [in/out] accumulator bits 195..128 +;R0 [in] R constant bits 63..0 +;R1 
[in] R constant bits 127..64 +;T0 [clobbered] GPR register +;T1 [clobbered] GPR register +;T2 [clobbered] GPR register +;T3 [clobbered] GPR register +;GP_RAX [clobbered] RAX register +;GP_RDX [clobbered] RDX register +*/ +#define POLY1305_BLOCKS(MSG, LEN, A0, A1, A2, R0, R1, T0, T1, T2, T3, \ + GP_RAX, GP_RDX) \ + /* ; Minimum of 256 bytes to run vectorized code */ \ + cmp LEN, POLY1305_BLOCK_SIZE*16; \ + jb .L_final_loop; \ + \ + /* ; Spread accumulator into 44-bit limbs in quadwords */ \ + mov T0, A0; \ + and T0, [.Lmask_44 ADD_RIP]; /* ;; First limb (A[43:0]) */ \ + vmovq xmm5, T0; \ + \ + mov T0, A1; \ + shrd A0, T0, 44; \ + and A0, [.Lmask_44 ADD_RIP]; /* ;; Second limb (A[77:52]) */ \ + vmovq xmm6, A0; \ + \ + shrd A1, A2, 24; \ + and A1, [.Lmask_42 ADD_RIP]; /* ;; Third limb (A[129:88]) */ \ + vmovq xmm7, A1; \ + \ + /* ; Load first block of data (128 bytes) */ \ + vmovdqu64 zmm0, [MSG]; \ + vmovdqu64 zmm1, [MSG + 64]; \ + \ + /* ; Interleave the data to form 44-bit limbs */ \ + /* ; */ \ + /* ; zmm13 to have bits 0-43 of all 8 blocks in 8 qwords */ \ + /* ; zmm14 to have bits 87-44 of all 8 blocks in 8 qwords */ \ + /* ; zmm15 to have bits 127-88 of all 8 blocks in 8 qwords */ \ + vpunpckhqdq zmm15, zmm0, zmm1; \ + vpunpcklqdq zmm13, zmm0, zmm1; \ + \ + vpsrlq zmm14, zmm13, 44; \ + vpsllq zmm18, zmm15, 20; \ + vpternlogq zmm14, zmm18, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + vpandq zmm13, zmm13, [.Lmask_44 ADD_RIP]; \ + vpsrlq zmm15, zmm15, 24; \ + \ + /* ; Add 2^128 to all 8 final qwords of the message */ \ + vporq zmm15, zmm15, [.Lhigh_bit ADD_RIP]; \ + \ + vpaddq zmm13, zmm13, zmm5; \ + vpaddq zmm14, zmm14, zmm6; \ + vpaddq zmm15, zmm15, zmm7; \ + \ + /* ; Load next blocks of data (128 bytes) */ \ + vmovdqu64 zmm0, [MSG + 64*2]; \ + vmovdqu64 zmm1, [MSG + 64*3]; \ + \ + /* ; Interleave the data to form 44-bit limbs */ \ + /* ; */ \ + /* ; zmm13 to have bits 0-43 of all 8 blocks in 8 qwords */ \ + /* ; zmm14 to have bits 87-44 of all 8 blocks in 8 qwords */ \ + /* ; zmm15 to have bits 127-88 of all 8 blocks in 8 qwords */ \ + vpunpckhqdq zmm18, zmm0, zmm1; \ + vpunpcklqdq zmm16, zmm0, zmm1; \ + \ + vpsrlq zmm17, zmm16, 44; \ + vpsllq zmm19, zmm18, 20; \ + vpternlogq zmm17, zmm19, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + vpandq zmm16, zmm16, [.Lmask_44 ADD_RIP]; \ + vpsrlq zmm18, zmm18, 24; \ + \ + /* ; Add 2^128 to all 8 final qwords of the message */ \ + vporq zmm18, zmm18, [.Lhigh_bit ADD_RIP]; \ + \ + /* ; Use memory in stack to save powers of R, before loading them into ZMM registers */ \ + /* ; The first 16*8 bytes will contain the 16 bytes of the 8 powers of R */ \ + /* ; The last 64 bytes will contain the last 2 bits of powers of R, spread in 8 qwords, */ \ + /* ; to be OR'd with the highest qwords (in zmm26) */ \ + vmovq xmm3, R0; \ + vpinsrq xmm3, xmm3, R1, 1; \ + vinserti32x4 zmm1, zmm1, xmm3, 3; \ + \ + vpxorq zmm0, zmm0, zmm0; \ + vpxorq zmm2, zmm2, zmm2; \ + \ + /* ; Calculate R^2 */ \ + mov T0, R1; \ + shr T0, 2; \ + add T0, R1; /* ;; T0 = R1 + (R1 >> 2) */ \ + \ + mov A0, R0; \ + mov A1, R1; \ + \ + POLY1305_MUL_REDUCE(A0, A1, A2, R0, R1, T0, T1, T2, T3, GP_RAX, GP_RDX, A2_ZERO); \ + \ + vmovq xmm3, A0; \ + vpinsrq xmm3, xmm3, A1, 1; \ + vinserti32x4 zmm1, zmm1, xmm3, 2; \ + \ + vmovq xmm4, A2; \ + vinserti32x4 zmm2, zmm2, xmm4, 2; \ + \ + /* ; Calculate R^3 */ \ + POLY1305_MUL_REDUCE(A0, A1, A2, R0, R1, T0, T1, T2, T3, GP_RAX, GP_RDX, A2_NOT_ZERO); \ + \ + vmovq xmm3, A0; \ + vpinsrq xmm3, xmm3, A1, 1; \ + vinserti32x4 zmm1, zmm1, 
xmm3, 1; \ + \ + vmovq xmm4, A2; \ + vinserti32x4 zmm2, zmm2, xmm4, 1; \ + \ + /* ; Calculate R^4 */ \ + POLY1305_MUL_REDUCE(A0, A1, A2, R0, R1, T0, T1, T2, T3, GP_RAX, GP_RDX, A2_NOT_ZERO); \ + \ + vmovq xmm3, A0; \ + vpinsrq xmm3, xmm3, A1, 1; \ + vinserti32x4 zmm1, zmm1, xmm3, 0; \ + \ + vmovq xmm4, A2; \ + vinserti32x4 zmm2, zmm2, xmm4, 0; \ + \ + /* ; Move 2 MSbits to top 24 bits, to be OR'ed later */ \ + vpsllq zmm2, zmm2, 40; \ + \ + vpunpckhqdq zmm21, zmm1, zmm0; \ + vpunpcklqdq zmm19, zmm1, zmm0; \ + \ + vpsrlq zmm20, zmm19, 44; \ + vpsllq zmm4, zmm21, 20; \ + vpternlogq zmm20, zmm4, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + vpandq zmm19, zmm19, [.Lmask_44 ADD_RIP]; \ + vpsrlq zmm21, zmm21, 24; \ + \ + /* ; zmm2 contains the 2 highest bits of the powers of R */ \ + vporq zmm21, zmm21, zmm2; \ + \ + /* ; Broadcast 44-bit limbs of R^4 */ \ + mov T0, A0; \ + and T0, [.Lmask_44 ADD_RIP]; /* ;; First limb (R^4[43:0]) */ \ + vpbroadcastq zmm22, T0; \ + \ + mov T0, A1; \ + shrd A0, T0, 44; \ + and A0, [.Lmask_44 ADD_RIP]; /* ;; Second limb (R^4[87:44]) */ \ + vpbroadcastq zmm23, A0; \ + \ + shrd A1, A2, 24; \ + and A1, [.Lmask_42 ADD_RIP]; /* ;; Third limb (R^4[129:88]) */ \ + vpbroadcastq zmm24, A1; \ + \ + /* ; Generate 4*5*R^4 */ \ + vpsllq zmm25, zmm23, 2; \ + vpsllq zmm26, zmm24, 2; \ + \ + /* ; 5*R^4 */ \ + vpaddq zmm25, zmm25, zmm23; \ + vpaddq zmm26, zmm26, zmm24; \ + \ + /* ; 4*5*R^4 */ \ + vpsllq zmm25, zmm25, 2; \ + vpsllq zmm26, zmm26, 2; \ + \ + vpslldq zmm29, zmm19, 8; \ + vpslldq zmm30, zmm20, 8; \ + vpslldq zmm31, zmm21, 8; \ + \ + /* ; Calculate R^8-R^5 */ \ + POLY1305_MUL_REDUCE_VEC(zmm19, zmm20, zmm21, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ; Interleave powers of R: R^8 R^4 R^7 R^3 R^6 R^2 R^5 R */ \ + vporq zmm19, zmm19, zmm29; \ + vporq zmm20, zmm20, zmm30; \ + vporq zmm21, zmm21, zmm31; \ + \ + /* ; Broadcast R^8 */ \ + vpbroadcastq zmm22, xmm19; \ + vpbroadcastq zmm23, xmm20; \ + vpbroadcastq zmm24, xmm21; \ + \ + /* ; Generate 4*5*R^8 */ \ + vpsllq zmm25, zmm23, 2; \ + vpsllq zmm26, zmm24, 2; \ + \ + /* ; 5*R^8 */ \ + vpaddq zmm25, zmm25, zmm23; \ + vpaddq zmm26, zmm26, zmm24; \ + \ + /* ; 4*5*R^8 */ \ + vpsllq zmm25, zmm25, 2; \ + vpsllq zmm26, zmm26, 2; \ + \ + cmp LEN, POLY1305_BLOCK_SIZE*32; \ + jb .L_len_256_511; \ + \ + /* ; Store R^8-R for later use */ \ + vmovdqa64 [rsp + STACK_r_save], zmm19; \ + vmovdqa64 [rsp + STACK_r_save + 64], zmm20; \ + vmovdqa64 [rsp + STACK_r_save + 64*2], zmm21; \ + \ + /* ; Calculate R^16-R^9 */ \ + POLY1305_MUL_REDUCE_VEC(zmm19, zmm20, zmm21, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ; Store R^16-R^9 for later use */ \ + vmovdqa64 [rsp + STACK_r_save + 64*3], zmm19; \ + vmovdqa64 [rsp + STACK_r_save + 64*4], zmm20; \ + vmovdqa64 [rsp + STACK_r_save + 64*5], zmm21; \ + \ + /* ; Broadcast R^16 */ \ + vpbroadcastq zmm22, xmm19; \ + vpbroadcastq zmm23, xmm20; \ + vpbroadcastq zmm24, xmm21; \ + \ + /* ; Generate 4*5*R^16 */ \ + vpsllq zmm25, zmm23, 2; \ + vpsllq zmm26, zmm24, 2; \ + \ + /* ; 5*R^16 */ \ + vpaddq zmm25, zmm25, zmm23; \ + vpaddq zmm26, zmm26, zmm24; \ + \ + /* ; 4*5*R^16 */ \ + vpsllq zmm25, zmm25, 2; \ + vpsllq zmm26, zmm26, 2; \ + \ + mov T0, LEN; \ + and T0, 0xffffffffffffff00; /* ; multiple of 256 bytes */ \ + \ +.L_poly1305_blocks_loop: \ + cmp T0, POLY1305_BLOCK_SIZE*16; \ + jbe .L_poly1305_blocks_loop_end; \ + \ + /* ; zmm13-zmm18 contain the 16 blocks of message 
plus the previous accumulator */ \ + /* ; zmm22-24 contain the 5x44-bit limbs of the powers of R */ \ + /* ; zmm25-26 contain the 5x44-bit limbs of the powers of R' (5*4*R) */ \ + POLY1305_MSG_MUL_REDUCE_VEC16(zmm13, zmm14, zmm15, zmm16, zmm17, zmm18, \ + zmm22, zmm23, zmm24, zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm19, zmm20, zmm21, zmm27, zmm28, zmm29, \ + zmm30, zmm31, zmm11, zmm0, zmm1, \ + zmm2, zmm3, zmm4, zmm12, MSG, T0); \ + \ + jmp .L_poly1305_blocks_loop; \ + \ +.L_poly1305_blocks_loop_end: \ + \ + /* ;; Need to multiply by r^16, r^15, r^14... r */ \ + \ + /* ; First multiply by r^16-r^9 */ \ + \ + /* ; Read R^16-R^9 */ \ + vmovdqa64 zmm19, [rsp + STACK_r_save + 64*3]; \ + vmovdqa64 zmm20, [rsp + STACK_r_save + 64*4]; \ + vmovdqa64 zmm21, [rsp + STACK_r_save + 64*5]; \ + /* ; Read R^8-R */ \ + vmovdqa64 zmm22, [rsp + STACK_r_save]; \ + vmovdqa64 zmm23, [rsp + STACK_r_save + 64]; \ + vmovdqa64 zmm24, [rsp + STACK_r_save + 64*2]; \ + \ + /* ; zmm27 to have bits 87-44 of all 9-16th powers of R' in 8 qwords */ \ + /* ; zmm28 to have bits 129-88 of all 9-16th powers of R' in 8 qwords */ \ + vpsllq zmm0, zmm20, 2; \ + vpaddq zmm27, zmm20, zmm0; /* ; R1' (R1*5) */ \ + vpsllq zmm1, zmm21, 2; \ + vpaddq zmm28, zmm21, zmm1; /* ; R2' (R2*5) */ \ + \ + /* ; 4*5*R */ \ + vpsllq zmm27, zmm27, 2; \ + vpsllq zmm28, zmm28, 2; \ + \ + /* ; Then multiply by r^8-r */ \ + \ + /* ; zmm25 to have bits 87-44 of all 1-8th powers of R' in 8 qwords */ \ + /* ; zmm26 to have bits 129-88 of all 1-8th powers of R' in 8 qwords */ \ + vpsllq zmm2, zmm23, 2; \ + vpaddq zmm25, zmm23, zmm2; /* ; R1' (R1*5) */ \ + vpsllq zmm3, zmm24, 2; \ + vpaddq zmm26, zmm24, zmm3; /* ; R2' (R2*5) */ \ + \ + /* ; 4*5*R */ \ + vpsllq zmm25, zmm25, 2; \ + vpsllq zmm26, zmm26, 2; \ + \ + POLY1305_MUL_REDUCE_VEC16(zmm13, zmm14, zmm15, zmm16, zmm17, zmm18, \ + zmm19, zmm20, zmm21, zmm27, zmm28, \ + zmm22, zmm23, zmm24, zmm25, zmm26, \ + zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, \ + zmm7, zmm8, zmm9, zmm10, zmm11, zmm12, zmm29); \ + \ + /* ;; Add all blocks (horizontally) */ \ + vpaddq zmm13, zmm13, zmm16; \ + vpaddq zmm14, zmm14, zmm17; \ + vpaddq zmm15, zmm15, zmm18; \ + \ + vextracti64x4 ymm0, zmm13, 1; \ + vextracti64x4 ymm1, zmm14, 1; \ + vextracti64x4 ymm2, zmm15, 1; \ + \ + vpaddq ymm13, ymm13, ymm0; \ + vpaddq ymm14, ymm14, ymm1; \ + vpaddq ymm15, ymm15, ymm2; \ + \ + vextracti32x4 xmm10, ymm13, 1; \ + vextracti32x4 xmm11, ymm14, 1; \ + vextracti32x4 xmm12, ymm15, 1; \ + \ + vpaddq xmm13, xmm13, xmm10; \ + vpaddq xmm14, xmm14, xmm11; \ + vpaddq xmm15, xmm15, xmm12; \ + \ + vpsrldq xmm10, xmm13, 8; \ + vpsrldq xmm11, xmm14, 8; \ + vpsrldq xmm12, xmm15, 8; \ + \ + /* ; Finish folding and clear second qword */ \ + mov T0, 0xfd; \ + kmovq k1, T0; \ + vpaddq xmm13{k1}{z}, xmm13, xmm10; \ + vpaddq xmm14{k1}{z}, xmm14, xmm11; \ + vpaddq xmm15{k1}{z}, xmm15, xmm12; \ + \ + add MSG, POLY1305_BLOCK_SIZE*16; \ + \ + and LEN, (POLY1305_BLOCK_SIZE*16 - 1); /* ; Get remaining lengths (LEN < 256 bytes) */ \ + \ +.L_less_than_256: \ + \ + cmp LEN, POLY1305_BLOCK_SIZE*8; \ + jb .L_less_than_128; \ + \ + /* ; Read next 128 bytes */ \ + /* ; Load first block of data (128 bytes) */ \ + vmovdqu64 zmm0, [MSG]; \ + vmovdqu64 zmm1, [MSG + 64]; \ + \ + /* ; Interleave the data to form 44-bit limbs */ \ + /* ; */ \ + /* ; zmm13 to have bits 0-43 of all 8 blocks in 8 qwords */ \ + /* ; zmm14 to have bits 87-44 of all 8 blocks in 8 qwords */ \ + /* ; zmm15 to have bits 127-88 of all 8 blocks in 8 qwords */ \ + vpunpckhqdq zmm5, zmm0, 
zmm1; \ + vpunpcklqdq zmm3, zmm0, zmm1; \ + \ + vpsrlq zmm4, zmm3, 44; \ + vpsllq zmm8, zmm5, 20; \ + vpternlogq zmm4, zmm8, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + vpandq zmm3, zmm3, [.Lmask_44 ADD_RIP]; \ + vpsrlq zmm5, zmm5, 24; \ + \ + /* ; Add 2^128 to all 8 final qwords of the message */ \ + vporq zmm5, zmm5, [.Lhigh_bit ADD_RIP]; \ + \ + vpaddq zmm13, zmm13, zmm3; \ + vpaddq zmm14, zmm14, zmm4; \ + vpaddq zmm15, zmm15, zmm5; \ + \ + add MSG, POLY1305_BLOCK_SIZE*8; \ + sub LEN, POLY1305_BLOCK_SIZE*8; \ + \ + POLY1305_MUL_REDUCE_VEC(zmm13, zmm14, zmm15, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ;; Add all blocks (horizontally) */ \ + vextracti64x4 ymm0, zmm13, 1; \ + vextracti64x4 ymm1, zmm14, 1; \ + vextracti64x4 ymm2, zmm15, 1; \ + \ + vpaddq ymm13, ymm13, ymm0; \ + vpaddq ymm14, ymm14, ymm1; \ + vpaddq ymm15, ymm15, ymm2; \ + \ + vextracti32x4 xmm10, ymm13, 1; \ + vextracti32x4 xmm11, ymm14, 1; \ + vextracti32x4 xmm12, ymm15, 1; \ + \ + vpaddq xmm13, xmm13, xmm10; \ + vpaddq xmm14, xmm14, xmm11; \ + vpaddq xmm15, xmm15, xmm12; \ + \ + vpsrldq xmm10, xmm13, 8; \ + vpsrldq xmm11, xmm14, 8; \ + vpsrldq xmm12, xmm15, 8; \ + \ + /* ; Finish folding and clear second qword */ \ + mov T0, 0xfd; \ + kmovq k1, T0; \ + vpaddq xmm13{k1}{z}, xmm13, xmm10; \ + vpaddq xmm14{k1}{z}, xmm14, xmm11; \ + vpaddq xmm15{k1}{z}, xmm15, xmm12; \ + \ +.L_less_than_128: \ + cmp LEN, 32; /* ; If remaining bytes is <= 32, perform last blocks in scalar */ \ + jbe .L_simd_to_gp; \ + \ + mov T0, LEN; \ + and T0, 0x3f; \ + lea T1, [.Lbyte64_len_to_mask_table ADD_RIP]; \ + mov T1, [T1 + 8*T0]; \ + \ + /* ; Load default byte masks */ \ + mov T2, 0xffffffffffffffff; \ + xor T3, T3; \ + \ + cmp LEN, 64; \ + cmovb T2, T1; /* ; Load mask for first 64 bytes */ \ + cmovg T3, T1; /* ; Load mask for second 64 bytes */ \ + \ + kmovq k1, T2; \ + kmovq k2, T3; \ + vmovdqu8 zmm0{k1}{z}, [MSG]; \ + vmovdqu8 zmm1{k2}{z}, [MSG + 64]; \ + \ + /* ; Pad last block message, if partial */ \ + mov T0, LEN; \ + and T0, 0x70; /* ; Multiple of 16 bytes */ \ + /* ; Load last block of data (up to 112 bytes) */ \ + shr T0, 3; /* ; Get number of full qwords */ \ + \ + /* ; Interleave the data to form 44-bit limbs */ \ + /* ; */ \ + /* ; zmm13 to have bits 0-43 of all 8 blocks in 8 qwords */ \ + /* ; zmm14 to have bits 87-44 of all 8 blocks in 8 qwords */ \ + /* ; zmm15 to have bits 127-88 of all 8 blocks in 8 qwords */ \ + vpunpckhqdq zmm4, zmm0, zmm1; \ + vpunpcklqdq zmm2, zmm0, zmm1; \ + \ + vpsrlq zmm3, zmm2, 44; \ + vpsllq zmm28, zmm4, 20; \ + vpternlogq zmm3, zmm28, [.Lmask_44 ADD_RIP], 0xA8; /* ; (A OR B AND C) */ \ + \ + vpandq zmm2, zmm2, [.Lmask_44 ADD_RIP]; \ + vpsrlq zmm4, zmm4, 24; \ + \ + lea T1, [.Lqword_high_bit_mask ADD_RIP]; \ + kmovb k1, [T1 + T0]; \ + /* ; Add 2^128 to final qwords of the message (all full blocks and partial block, */ \ + /* ; if "pad_to_16" is selected) */ \ + vporq zmm4{k1}, zmm4, [.Lhigh_bit ADD_RIP]; \ + \ + vpaddq zmm13, zmm13, zmm2; \ + vpaddq zmm14, zmm14, zmm3; \ + vpaddq zmm15, zmm15, zmm4; \ + \ + mov T0, LEN; \ + add T0, 15; \ + shr T0, 4; /* ; Get number of 16-byte blocks (including partial blocks) */ \ + xor LEN, LEN; /* ; All length will be consumed */ \ + \ + /* ; No need to shuffle data blocks (data is in the right order) */ \ + cmp T0, 8; \ + je .L_end_shuffle; \ + \ + cmp T0, 4; \ + je .L_shuffle_blocks_4; \ + jb .L_shuffle_blocks_3; \ + \ + /* ; Number of 16-byte blocks > 4 */ \ + cmp T0, 6; \ + je 
.L_shuffle_blocks_6; \ + ja .L_shuffle_blocks_7; \ + jmp .L_shuffle_blocks_5; \ + \ +.L_shuffle_blocks_3: \ + SHUFFLE_DATA_BLOCKS_3(zmm13, zmm14, zmm15, T1); \ + jmp .L_end_shuffle; \ +.L_shuffle_blocks_4: \ + SHUFFLE_DATA_BLOCKS_4(zmm13, zmm14, zmm15, T1); \ + jmp .L_end_shuffle; \ +.L_shuffle_blocks_5: \ + SHUFFLE_DATA_BLOCKS_5(zmm13, zmm14, zmm15, T1); \ + jmp .L_end_shuffle; \ +.L_shuffle_blocks_6: \ + SHUFFLE_DATA_BLOCKS_6(zmm13, zmm14, zmm15, T1); \ + jmp .L_end_shuffle; \ +.L_shuffle_blocks_7: \ + SHUFFLE_DATA_BLOCKS_7(zmm13, zmm14, zmm15, T1); \ + \ +.L_end_shuffle: \ + \ + /* ; zmm13-zmm15 contain the 8 blocks of message plus the previous accumulator */ \ + /* ; zmm22-24 contain the 3x44-bit limbs of the powers of R */ \ + /* ; zmm25-26 contain the 3x44-bit limbs of the powers of R' (5*4*R) */ \ + POLY1305_MUL_REDUCE_VEC(zmm13, zmm14, zmm15, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ;; Add all blocks (horizontally) */ \ + vextracti64x4 ymm0, zmm13, 1; \ + vextracti64x4 ymm1, zmm14, 1; \ + vextracti64x4 ymm2, zmm15, 1; \ + \ + vpaddq ymm13, ymm13, ymm0; \ + vpaddq ymm14, ymm14, ymm1; \ + vpaddq ymm15, ymm15, ymm2; \ + \ + vextracti32x4 xmm10, ymm13, 1; \ + vextracti32x4 xmm11, ymm14, 1; \ + vextracti32x4 xmm12, ymm15, 1; \ + \ + vpaddq xmm13, xmm13, xmm10; \ + vpaddq xmm14, xmm14, xmm11; \ + vpaddq xmm15, xmm15, xmm12; \ + \ + vpsrldq xmm10, xmm13, 8; \ + vpsrldq xmm11, xmm14, 8; \ + vpsrldq xmm12, xmm15, 8; \ + \ + vpaddq xmm13, xmm13, xmm10; \ + vpaddq xmm14, xmm14, xmm11; \ + vpaddq xmm15, xmm15, xmm12; \ + \ +.L_simd_to_gp: \ + /* ; Carry propagation */ \ + vpsrlq xmm0, xmm13, 44; \ + vpandq xmm13, xmm13, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq xmm14, xmm14, xmm0; \ + vpsrlq xmm0, xmm14, 44; \ + vpandq xmm14, xmm14, [.Lmask_44 ADD_RIP]; /* ; Clear top 20 bits */ \ + vpaddq xmm15, xmm15, xmm0; \ + vpsrlq xmm0, xmm15, 42; \ + vpandq xmm15, xmm15, [.Lmask_42 ADD_RIP]; /* ; Clear top 22 bits */ \ + vpsllq xmm1, xmm0, 2; \ + vpaddq xmm0, xmm0, xmm1; \ + vpaddq xmm13, xmm13, xmm0; \ + \ + /* ; Put together A */ \ + vmovq A0, xmm13; \ + \ + vmovq T0, xmm14; \ + mov T1, T0; \ + shl T1, 44; \ + or A0, T1; \ + \ + shr T0, 20; \ + vmovq A2, xmm15; \ + mov A1, A2; \ + shl A1, 24; \ + or A1, T0; \ + shr A2, 40; \ + \ + /* ; Clear powers of R */ \ + vpxorq zmm0, zmm0, zmm0; \ + vmovdqa64 [rsp + STACK_r_save], zmm0; \ + vmovdqa64 [rsp + STACK_r_save + 64], zmm0; \ + vmovdqa64 [rsp + STACK_r_save + 64*2], zmm0; \ + vmovdqa64 [rsp + STACK_r_save + 64*3], zmm0; \ + vmovdqa64 [rsp + STACK_r_save + 64*4], zmm0; \ + vmovdqa64 [rsp + STACK_r_save + 64*5], zmm0; \ + \ + vzeroall; \ + clear_zmm(xmm16); clear_zmm(xmm20); clear_zmm(xmm24); clear_zmm(xmm28); \ + clear_zmm(xmm17); clear_zmm(xmm21); clear_zmm(xmm25); clear_zmm(xmm29); \ + clear_zmm(xmm18); clear_zmm(xmm22); clear_zmm(xmm26); clear_zmm(xmm30); \ + clear_zmm(xmm19); clear_zmm(xmm23); clear_zmm(xmm27); clear_zmm(xmm31); \ + \ +.L_final_loop: \ + cmp LEN, POLY1305_BLOCK_SIZE; \ + jb .L_poly1305_blocks_exit; \ + \ + /* ;; A += MSG[i] */ \ + add A0, [MSG + 0]; \ + adc A1, [MSG + 8]; \ + adc A2, 1; /* ;; no padding bit */ \ + \ + mov T0, R1; \ + shr T0, 2; \ + add T0, R1; /* ;; T0 = R1 + (R1 >> 2) */ \ + \ + POLY1305_MUL_REDUCE(A0, A1, A2, R0, R1, \ + T0, T1, T2, T3, GP_RAX, GP_RDX, A2_NOT_ZERO); \ + \ + add MSG, POLY1305_BLOCK_SIZE; \ + sub LEN, POLY1305_BLOCK_SIZE; \ + \ + jmp .L_final_loop; \ + \ +.L_len_256_511: \ + \ + /* ; zmm13-zmm15 contain the 8 blocks of 
message plus the previous accumulator */ \ + /* ; zmm22-24 contain the 3x44-bit limbs of the powers of R */ \ + /* ; zmm25-26 contain the 3x44-bit limbs of the powers of R' (5*4*R) */ \ + POLY1305_MUL_REDUCE_VEC(zmm13, zmm14, zmm15, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ; Then multiply by r^8-r */ \ + \ + /* ; zmm19-zmm21 contains R^8-R, need to move it to zmm22-24, */ \ + /* ; as it might be used in other part of the code */ \ + vmovdqa64 zmm22, zmm19; \ + vmovdqa64 zmm23, zmm20; \ + vmovdqa64 zmm24, zmm21; \ + \ + /* ; zmm25 to have bits 87-44 of all 8 powers of R' in 8 qwords */ \ + /* ; zmm26 to have bits 129-88 of all 8 powers of R' in 8 qwords */ \ + vpsllq zmm0, zmm23, 2; \ + vpaddq zmm25, zmm23, zmm0; /* ; R1' (R1*5) */ \ + vpsllq zmm1, zmm24, 2; \ + vpaddq zmm26, zmm24, zmm1; /* ; R2' (R2*5) */ \ + \ + /* ; 4*5*R^8 */ \ + vpsllq zmm25, zmm25, 2; \ + vpsllq zmm26, zmm26, 2; \ + \ + vpaddq zmm13, zmm13, zmm16; \ + vpaddq zmm14, zmm14, zmm17; \ + vpaddq zmm15, zmm15, zmm18; \ + \ + /* ; zmm13-zmm15 contain the 8 blocks of message plus the previous accumulator */ \ + /* ; zmm22-24 contain the 3x44-bit limbs of the powers of R */ \ + /* ; zmm25-26 contain the 3x44-bit limbs of the powers of R' (5*4*R) */ \ + POLY1305_MUL_REDUCE_VEC(zmm13, zmm14, zmm15, \ + zmm22, zmm23, zmm24, \ + zmm25, zmm26, \ + zmm5, zmm6, zmm7, zmm8, zmm9, zmm10, \ + zmm11); \ + \ + /* ;; Add all blocks (horizontally) */ \ + vextracti64x4 ymm0, zmm13, 1; \ + vextracti64x4 ymm1, zmm14, 1; \ + vextracti64x4 ymm2, zmm15, 1; \ + \ + vpaddq ymm13, ymm13, ymm0; \ + vpaddq ymm14, ymm14, ymm1; \ + vpaddq ymm15, ymm15, ymm2; \ + \ + vextracti32x4 xmm10, ymm13, 1; \ + vextracti32x4 xmm11, ymm14, 1; \ + vextracti32x4 xmm12, ymm15, 1; \ + \ + vpaddq xmm13, xmm13, xmm10; \ + vpaddq xmm14, xmm14, xmm11; \ + vpaddq xmm15, xmm15, xmm12; \ + \ + vpsrldq xmm10, xmm13, 8; \ + vpsrldq xmm11, xmm14, 8; \ + vpsrldq xmm12, xmm15, 8; \ + \ + /* ; Finish folding and clear second qword */ \ + mov T0, 0xfd; \ + kmovq k1, T0; \ + vpaddq xmm13{k1}{z}, xmm13, xmm10; \ + vpaddq xmm14{k1}{z}, xmm14, xmm11; \ + vpaddq xmm15{k1}{z}, xmm15, xmm12; \ + \ + add MSG, POLY1305_BLOCK_SIZE*16; \ + sub LEN, POLY1305_BLOCK_SIZE*16; \ + \ + jmp .L_less_than_256; \ +.L_poly1305_blocks_exit: \ + +/* +;; ============================================================================= +;; ============================================================================= +;; Creates stack frame and saves registers +;; ============================================================================= +*/ +#define FUNC_ENTRY() \ + mov rax, rsp; \ + CFI_DEF_CFA_REGISTER(rax); \ + sub rsp, STACK_SIZE; \ + and rsp, -64; \ + \ + mov [rsp + STACK_gpr_save + 8*0], rbx; \ + mov [rsp + STACK_gpr_save + 8*1], rbp; \ + mov [rsp + STACK_gpr_save + 8*2], r12; \ + mov [rsp + STACK_gpr_save + 8*3], r13; \ + mov [rsp + STACK_gpr_save + 8*4], r14; \ + mov [rsp + STACK_gpr_save + 8*5], r15; \ + mov [rsp + STACK_rsp_save], rax; \ + CFI_CFA_ON_STACK(STACK_rsp_save, 0) + +/* +;; ============================================================================= +;; ============================================================================= +;; Restores registers and removes the stack frame +;; ============================================================================= +*/ +#define FUNC_EXIT() \ + mov rbx, [rsp + STACK_gpr_save + 8*0]; \ + mov rbp, [rsp + STACK_gpr_save + 8*1]; \ + mov r12, [rsp + STACK_gpr_save + 8*2]; \ + mov 
r13, [rsp + STACK_gpr_save + 8*3]; \ + mov r14, [rsp + STACK_gpr_save + 8*4]; \ + mov r15, [rsp + STACK_gpr_save + 8*5]; \ + mov rsp, [rsp + STACK_rsp_save]; \ + CFI_DEF_CFA_REGISTER(rsp) + +/* +;; ============================================================================= +;; ============================================================================= +;; void poly1305_aead_update_fma_avx512(const void *msg, const uint64_t msg_len, +;; void *hash, const void *key) +;; arg1 - Input message +;; arg2 - Message length +;; arg3 - Input/output hash +;; arg4 - Poly1305 key +*/ +.align 32 +.globl _gcry_poly1305_amd64_avx512_blocks +ELF(.type _gcry_poly1305_amd64_avx512_blocks, at function;) +_gcry_poly1305_amd64_avx512_blocks: + CFI_STARTPROC() + vpxord xmm16, xmm16, xmm16; + vpopcntb zmm16, zmm16; /* spec stop for old AVX512 CPUs */ + FUNC_ENTRY() + +#define _a0 gp3 +#define _a0 gp3 +#define _a1 gp4 +#define _a2 gp5 +#define _r0 gp6 +#define _r1 gp7 +#define _len arg2 +#define _arg3 arg4 /* ; use rcx, arg3 = rdx */ + + /* ;; load R */ + mov _r0, [arg4 + 0 * 8] + mov _r1, [arg4 + 1 * 8] + + /* ;; load accumulator / current hash value */ + /* ;; note: arg4 can't be used beyond this point */ + mov _arg3, arg3 /* ; note: _arg3 = arg4 (linux) */ + mov _a0, [_arg3 + 0 * 8] + mov _a1, [_arg3 + 1 * 8] + mov DWORD(_a2), [_arg3 + 2 * 8] /* ; note: _a2 = arg4 (win) */ + + POLY1305_BLOCKS(arg1, _len, _a0, _a1, _a2, _r0, _r1, + gp10, gp11, gp8, gp9, rax, rdx) + + /* ;; save accumulator back */ + mov [_arg3 + 0 * 8], _a0 + mov [_arg3 + 1 * 8], _a1 + mov [_arg3 + 2 * 8], DWORD(_a2) + + FUNC_EXIT() + xor eax, eax + kmovw k1, eax + kmovw k2, eax + ret_spec_stop + CFI_ENDPROC() +ELF(.size _gcry_poly1305_amd64_avx512_blocks, + .-_gcry_poly1305_amd64_avx512_blocks;) + +#endif +#endif diff --git a/cipher/poly1305-internal.h b/cipher/poly1305-internal.h index 19cee5f6..9e01df46 100644 --- a/cipher/poly1305-internal.h +++ b/cipher/poly1305-internal.h @@ -34,6 +34,16 @@ #define POLY1305_BLOCKSIZE 16 +/* POLY1305_USE_AVX512 indicates whether to compile with Intel AVX512 code. */ +#undef POLY1305_USE_AVX512 +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX512) && \ + defined(HAVE_INTEL_SYNTAX_PLATFORM_AS) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define POLY1305_USE_AVX512 1 +#endif + + typedef struct { u32 k[4]; @@ -46,6 +56,9 @@ typedef struct poly1305_context_s POLY1305_STATE state; byte buffer[POLY1305_BLOCKSIZE]; unsigned int leftover; +#ifdef POLY1305_USE_AVX512 + unsigned int use_avx512:1; +#endif } poly1305_context_t; diff --git a/cipher/poly1305.c b/cipher/poly1305.c index e57e64f3..5482fc6a 100644 --- a/cipher/poly1305.c +++ b/cipher/poly1305.c @@ -60,6 +60,19 @@ static const char *selftest (void); #endif +/* AMD64 Assembly implementations use SystemV ABI, ABI conversion and + * additional stack to store XMM6-XMM15 needed on Win64. 
*/ +#undef ASM_FUNC_ABI +#undef ASM_FUNC_WRAPPER_ATTR +#if defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS) +# define ASM_FUNC_ABI __attribute__((sysv_abi)) +# define ASM_FUNC_WRAPPER_ATTR __attribute__((noinline)) +#else +# define ASM_FUNC_ABI +# define ASM_FUNC_WRAPPER_ATTR +#endif + + #ifdef USE_S390X_ASM #define HAVE_ASM_POLY1305_BLOCKS 1 @@ -78,11 +91,32 @@ poly1305_blocks (poly1305_context_t *ctx, const byte *buf, size_t len, #endif /* USE_S390X_ASM */ +#ifdef POLY1305_USE_AVX512 + +extern unsigned int +_gcry_poly1305_amd64_avx512_blocks(const void *msg, const u64 msg_len, + void *hash, const void *key) ASM_FUNC_ABI; + +ASM_FUNC_WRAPPER_ATTR static unsigned int +poly1305_amd64_avx512_blocks(poly1305_context_t *ctx, const byte *buf, + size_t len) +{ + POLY1305_STATE *st = &ctx->state; + return _gcry_poly1305_amd64_avx512_blocks(buf, len, st->h, st->r); +} + +#endif /* POLY1305_USE_AVX512 */ + + static void poly1305_init (poly1305_context_t *ctx, const byte key[POLY1305_KEYLEN]) { POLY1305_STATE *st = &ctx->state; +#ifdef POLY1305_USE_AVX512 + ctx->use_avx512 = (_gcry_get_hw_features () & HWF_INTEL_AVX512) != 0; +#endif + ctx->leftover = 0; st->h[0] = 0; @@ -181,8 +215,8 @@ static void poly1305_init (poly1305_context_t *ctx, #ifndef HAVE_ASM_POLY1305_BLOCKS static unsigned int -poly1305_blocks (poly1305_context_t *ctx, const byte *buf, size_t len, - byte high_pad) +poly1305_blocks_generic (poly1305_context_t *ctx, const byte *buf, size_t len, + byte high_pad) { POLY1305_STATE *st = &ctx->state; u64 r0, r1, r1_mult5; @@ -235,6 +269,18 @@ poly1305_blocks (poly1305_context_t *ctx, const byte *buf, size_t len, return 6 * sizeof (void *) + 18 * sizeof (u64); } +static unsigned int +poly1305_blocks (poly1305_context_t *ctx, const byte *buf, size_t len, + byte high_pad) +{ +#ifdef POLY1305_USE_AVX512 + if ((high_pad & ctx->use_avx512) != 0) + return poly1305_amd64_avx512_blocks(ctx, buf, len); +#endif + + return poly1305_blocks_generic(ctx, buf, len, high_pad); +} + #endif /* !HAVE_ASM_POLY1305_BLOCKS */ static unsigned int poly1305_final (poly1305_context_t *ctx, diff --git a/configure.ac b/configure.ac index e214082b..778dc633 100644 --- a/configure.ac +++ b/configure.ac @@ -3106,6 +3106,9 @@ case "${host}" in s390x-*-*) GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS poly1305-s390x.lo" ;; + x86_64-*-*) + GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS poly1305-amd64-avx512.lo" + ;; esac LIST_MEMBER(scrypt, $enabled_kdfs) -- 2.32.0 From jussi.kivilinna at iki.fi Mon Apr 4 21:56:40 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 4 Apr 2022 22:56:40 +0300 Subject: [PATCH] Add SM3 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220401091736.30541-1-tianjia.zhang@linux.alibaba.com> References: <20220401091736.30541-1-tianjia.zhang@linux.alibaba.com> Message-ID: <531fccec-4d22-36c1-5c40-4da92af8fd3b@iki.fi> Hello, Applied to master. Thanks. -Jussi On 1.4.2022 12.17, Tianjia Zhang via Gcrypt-devel wrote: > * cipher/Makefile.am: Add 'sm3-armv8-aarch64-ce.S'. > * cipher/sm3-armv8-aarch64-ce.S: New. > * cipher/sm3.c (USE_ARM_CE): New. > [USE_ARM_CE] (_gcry_sm3_transform_armv8_ce) > (do_sm3_transform_armv8_ce): New. > (sm3_init) [USE_ARM_CE]: New. > * configure.ac: Add 'sm3-armv8-aarch64-ce.lo'. 
> -- > > Benchmark on T-Head Yitian-710 2.75 GHz: > > Before: > | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > SM3 | 2.84 ns/B 335.3 MiB/s 7.82 c/B 2749 > > After (~55% faster): > | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > SM3 | 1.84 ns/B 518.1 MiB/s 5.06 c/B 2749 > > Signed-off-by: Tianjia Zhang > --- > cipher/Makefile.am | 2 +- > cipher/sm3-armv8-aarch64-ce.S | 218 ++++++++++++++++++++++++++++++++++ > cipher/sm3.c | 28 +++++ > configure.ac | 1 + > 4 files changed, 248 insertions(+), 1 deletion(-) > create mode 100644 cipher/sm3-armv8-aarch64-ce.S > > diff --git a/cipher/Makefile.am b/cipher/Makefile.am > index 1ac1923b7ce5..30be9f982883 100644 > --- a/cipher/Makefile.am > +++ b/cipher/Makefile.am > @@ -130,7 +130,7 @@ EXTRA_libcipher_la_SOURCES = \ > sha512-avx2-bmi2-amd64.S sha512-avx512-amd64.S \ > sha512-armv7-neon.S sha512-arm.S \ > sha512-ppc.c sha512-ssse3-i386.c \ > - sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S \ > + sm3.c sm3-avx-bmi2-amd64.S sm3-aarch64.S sm3-armv8-aarch64-ce.S \ > keccak.c keccak_permute_32.h keccak_permute_64.h keccak-armv7-neon.S \ > stribog.c \ > tiger.c \ > diff --git a/cipher/sm3-armv8-aarch64-ce.S b/cipher/sm3-armv8-aarch64-ce.S > new file mode 100644 > index 000000000000..0900b84fe2bf > --- /dev/null > +++ b/cipher/sm3-armv8-aarch64-ce.S > @@ -0,0 +1,218 @@ > +/* sm3-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM3 cipher > + * > + * Copyright (C) 2022 Alibaba Group. > + * Copyright (C) 2022 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . 
> + */ > + > +#include "asm-common-aarch64.h" > + > +#if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ > + defined(USE_SM3) > + > +.cpu generic+simd+crypto > + > +/* Must be consistent with register macros */ > +#define vecnum_v0 0 > +#define vecnum_v1 1 > +#define vecnum_v2 2 > +#define vecnum_v3 3 > +#define vecnum_v4 4 > +#define vecnum_CTX1 16 > +#define vecnum_CTX2 17 > +#define vecnum_SS1 18 > +#define vecnum_WT 19 > +#define vecnum_K0 20 > +#define vecnum_K1 21 > +#define vecnum_K2 22 > +#define vecnum_K3 23 > +#define vecnum_RTMP0 24 > +#define vecnum_RTMP1 25 > + > +#define sm3partw1(vd, vn, vm) \ > + .inst (0xce60c000 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3partw2(vd, vn, vm) \ > + .inst (0xce60c400 | (vecnum_##vm << 16) | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3ss1(vd, vn, vm, va) \ > + .inst (0xce400000 | (vecnum_##vm << 16) | (vecnum_##va << 10) \ > + | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3tt1a(vd, vn, vm, imm2) \ > + .inst (0xce408000 | (vecnum_##vm << 16) | imm2 << 12 \ > + | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3tt1b(vd, vn, vm, imm2) \ > + .inst (0xce408400 | (vecnum_##vm << 16) | imm2 << 12 \ > + | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3tt2a(vd, vn, vm, imm2) \ > + .inst (0xce408800 | (vecnum_##vm << 16) | imm2 << 12 \ > + | (vecnum_##vn << 5) | vecnum_##vd) > + > +#define sm3tt2b(vd, vn, vm, imm2) \ > + .inst (0xce408c00 | (vecnum_##vm << 16) | imm2 << 12 \ > + | (vecnum_##vn << 5) | vecnum_##vd) > + > +/* Constants */ > + > +.text > +.align 4 > +ELF(.type _gcry_sm3_armv8_ce_consts, at object) > +_gcry_sm3_armv8_ce_consts: > +.Lsm3_Ktable: > + .long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb > + .long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc > + .long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce > + .long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6 > + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c > + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce > + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec > + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 > + .long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53 > + .long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d > + .long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4 > + .long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43 > + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c > + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce > + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec > + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 > +ELF(.size _gcry_sm3_armv8_ce_consts,.-_gcry_sm3_armv8_ce_consts) > + > +/* Register macros */ > + > +/* Must be consistent with vecnum_ macros */ > +#define CTX1 v16 > +#define CTX2 v17 > +#define SS1 v18 > +#define WT v19 > + > +#define K0 v20 > +#define K1 v21 > +#define K2 v22 > +#define K3 v23 > + > +#define RTMP0 v24 > +#define RTMP1 v25 > + > +/* Helper macros. */ > + > +#define _(...) 
/*_*/ > + > +#define SCHED_W_1(s0, s1, s2, s3, s4) ext s4.16b, s1.16b, s2.16b, #12 > +#define SCHED_W_2(s0, s1, s2, s3, s4) ext RTMP0.16b, s0.16b, s1.16b, #12 > +#define SCHED_W_3(s0, s1, s2, s3, s4) ext RTMP1.16b, s2.16b, s3.16b, #8 > +#define SCHED_W_4(s0, s1, s2, s3, s4) sm3partw1(s4, s0, s3) > +#define SCHED_W_5(s0, s1, s2, s3, s4) sm3partw2(s4, RTMP1, RTMP0) > + > +#define SCHED_W(n, s0, s1, s2, s3, s4) SCHED_W_##n(s0, s1, s2, s3, s4) > + > +#define R(ab, s0, s1, s2, s3, s4, IOP) \ > + ld4 {K0.s, K1.s, K2.s, K3.s}[3], [x3], #16; \ > + eor WT.16b, s0.16b, s1.16b; \ > + \ > + sm3ss1(SS1, CTX1, CTX2, K0); \ > + IOP(1, s0, s1, s2, s3, s4); \ > + sm3tt1##ab(CTX1, SS1, WT, 0); \ > + sm3tt2##ab(CTX2, SS1, s0, 0); \ > + \ > + IOP(2, s0, s1, s2, s3, s4); \ > + sm3ss1(SS1, CTX1, CTX2, K1); \ > + IOP(3, s0, s1, s2, s3, s4); \ > + sm3tt1##ab(CTX1, SS1, WT, 1); \ > + sm3tt2##ab(CTX2, SS1, s0, 1); \ > + \ > + sm3ss1(SS1, CTX1, CTX2, K2); \ > + IOP(4, s0, s1, s2, s3, s4); \ > + sm3tt1##ab(CTX1, SS1, WT, 2); \ > + sm3tt2##ab(CTX2, SS1, s0, 2); \ > + \ > + sm3ss1(SS1, CTX1, CTX2, K3); \ > + IOP(5, s0, s1, s2, s3, s4); \ > + sm3tt1##ab(CTX1, SS1, WT, 3); \ > + sm3tt2##ab(CTX2, SS1, s0, 3); > + > +#define R1(s0, s1, s2, s3, s4, IOP) R(a, s0, s1, s2, s3, s4, IOP) > +#define R2(s0, s1, s2, s3, s4, IOP) R(b, s0, s1, s2, s3, s4, IOP) > + > +.align 3 > +.global _gcry_sm3_transform_armv8_ce > +ELF(.type _gcry_sm3_transform_armv8_ce,%function;) > +_gcry_sm3_transform_armv8_ce: > + /* input: > + * x0: CTX > + * x1: data > + * x2: nblocks > + */ > + CFI_STARTPROC(); > + > + ld1 {CTX1.4s, CTX2.4s}, [x0]; > + rev64 CTX1.4s, CTX1.4s; > + rev64 CTX2.4s, CTX2.4s; > + ext CTX1.16b, CTX1.16b, CTX1.16b, #8; > + ext CTX2.16b, CTX2.16b, CTX2.16b, #8; > + > +.Lloop: > + GET_DATA_POINTER(x3, .Lsm3_Ktable); > + ld1 {v0.16b-v3.16b}, [x1], #64; > + sub x2, x2, #1; > + > + mov v6.16b, CTX1.16b; > + mov v7.16b, CTX2.16b; > + > + rev32 v0.16b, v0.16b; > + rev32 v1.16b, v1.16b; > + rev32 v2.16b, v2.16b; > + rev32 v3.16b, v3.16b; > + > + R1(v0, v1, v2, v3, v4, SCHED_W); > + R1(v1, v2, v3, v4, v0, SCHED_W); > + R1(v2, v3, v4, v0, v1, SCHED_W); > + R1(v3, v4, v0, v1, v2, SCHED_W); > + R2(v4, v0, v1, v2, v3, SCHED_W); > + R2(v0, v1, v2, v3, v4, SCHED_W); > + R2(v1, v2, v3, v4, v0, SCHED_W); > + R2(v2, v3, v4, v0, v1, SCHED_W); > + R2(v3, v4, v0, v1, v2, SCHED_W); > + R2(v4, v0, v1, v2, v3, SCHED_W); > + R2(v0, v1, v2, v3, v4, SCHED_W); > + R2(v1, v2, v3, v4, v0, SCHED_W); > + R2(v2, v3, v4, v0, v1, SCHED_W); > + R2(v3, v4, v0, v1, v2, _); > + R2(v4, v0, v1, v2, v3, _); > + R2(v0, v1, v2, v3, v4, _); > + > + eor CTX1.16b, CTX1.16b, v6.16b; > + eor CTX2.16b, CTX2.16b, v7.16b; > + > + cbnz x2, .Lloop; > + > + /* save state */ > + rev64 CTX1.4s, CTX1.4s; > + rev64 CTX2.4s, CTX2.4s; > + ext CTX1.16b, CTX1.16b, CTX1.16b, #8; > + ext CTX2.16b, CTX2.16b, CTX2.16b, #8; > + st1 {CTX1.4s, CTX2.4s}, [x0]; > + > + ret_spec_stop; > + CFI_ENDPROC(); > +ELF(.size _gcry_sm3_transform_armv8_ce, .-_gcry_sm3_transform_armv8_ce;) > + > +#endif > diff --git a/cipher/sm3.c b/cipher/sm3.c > index 0ab5f5067edb..bfe9f4c25225 100644 > --- a/cipher/sm3.c > +++ b/cipher/sm3.c > @@ -67,6 +67,16 @@ > # endif > #endif > > +/* USE_ARM_CE indicates whether to enable ARMv8 Crypto Extension code. 
*/ > +#undef USE_ARM_CE > +#ifdef ENABLE_ARM_CRYPTO_SUPPORT > +# if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) > +# define USE_ARM_CE 1 > +# endif > +#endif > + > > typedef struct { > gcry_md_block_ctx_t bctx; > @@ -117,6 +127,20 @@ do_sm3_transform_aarch64(void *context, const unsigned char *data, size_t nblks) > } > #endif /* USE_AARCH64_SIMD */ > > +#ifdef USE_ARM_CE > +void _gcry_sm3_transform_armv8_ce(void *state, const void *input_data, > + size_t num_blks); > + > +static unsigned int > +do_sm3_transform_armv8_ce(void *context, const unsigned char *data, > + size_t nblks) > +{ > + SM3_CONTEXT *hd = context; > + _gcry_sm3_transform_armv8_ce (hd->h, data, nblks); > + return 0; > +} > +#endif /* USE_ARM_CE */ > + > > static unsigned int > transform (void *c, const unsigned char *data, size_t nblks); > @@ -153,6 +177,10 @@ sm3_init (void *context, unsigned int flags) > if (features & HWF_ARM_NEON) > hd->bctx.bwrite = do_sm3_transform_aarch64; > #endif > +#ifdef USE_ARM_CE > + if (features & HWF_ARM_SM3) > + hd->bctx.bwrite = do_sm3_transform_armv8_ce; > +#endif > > (void)features; > } > diff --git a/configure.ac b/configure.ac > index e214082b2603..fc49bb86fc2b 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -3049,6 +3049,7 @@ if test "$found" = "1" ; then > aarch64-*-*) > # Build with the assembly implementation > GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sm3-aarch64.lo" > + GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS sm3-armv8-aarch64-ce.lo" > ;; > esac > fi From tianjia.zhang at linux.alibaba.com Wed Apr 6 14:22:46 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 6 Apr 2022 20:22:46 +0800 Subject: [PATCH] build: Fix for arm crypto support Message-ID: <20220406122246.74680-1-tianjia.zhang@linux.alibaba.com> * configure.ac: Correct wrong variable names. Signed-off-by: Tianjia Zhang --- configure.ac | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index fc49bb86fc2b..74f112a81e65 100644 --- a/configure.ac +++ b/configure.ac @@ -2423,7 +2423,7 @@ fi if test x"$armcryptosupport" = xyes ; then if test "$gcry_cv_gcc_inline_asm_aarch32_crypto" != "yes" ; then if test "$gcry_cv_gcc_inline_asm_aarch64_crypto" != "yes" ; then - neonsupport="no (unsupported by compiler)" + armcryptosupport="no (unsupported by compiler)" fi fi fi -- 2.24.3 (Apple Git-128) From tianjia.zhang at linux.alibaba.com Wed Apr 6 14:23:22 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 6 Apr 2022 20:23:22 +0800 Subject: [PATCH] doc: Fix missing ARM hardware features Message-ID: <20220406122322.74792-1-tianjia.zhang@linux.alibaba.com> * doc/gcrypt.texi: Add sha3/sm3/sm4/sha512 to ARM hardware features. 
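For reference, the strings added here are not just documentation: they are the
keyword names accepted by the hwf.deny mechanism and the GCRYCTL_DISABLE_HWF
control for switching individual hardware features off.  Below is a minimal
sketch (not part of this patch) of how a test program might disable one of the
newly documented ARM features before initialization; GCRYCTL_DISABLE_HWF and
its error behaviour are assumed from the existing public API rather than from
this change:

#include <stdio.h>
#include <gcrypt.h>

int
main (void)
{
  /* The feature must be disabled before initialization completes,
   * i.e. before gcry_check_version() is called.  */
  if (gcry_control (GCRYCTL_DISABLE_HWF, "arm-sm3", NULL))
    fprintf (stderr, "note: 'arm-sm3' is not known to this build\n");

  if (!gcry_check_version (GCRYPT_VERSION))
    return 1;
  gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);

  /* Hashing with GCRY_MD_SM3 now avoids the ARMv8 CE implementation
   * even on CPUs that advertise the SM3 extension.  */
  return 0;
}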
Signed-off-by: Tianjia Zhang --- doc/gcrypt.texi | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi index b02ba442d33c..555150117f09 100644 --- a/doc/gcrypt.texi +++ b/doc/gcrypt.texi @@ -596,6 +596,10 @@ are @item arm-sha1 @item arm-sha2 @item arm-pmull + at item arm-sha3 + at item arm-sm3 + at item arm-sm4 + at item arm-sha512 @item ppc-vcrypto @item ppc-arch_3_00 @item ppc-arch_2_07 -- 2.24.3 (Apple Git-128) From cllang at redhat.com Fri Apr 8 11:45:12 2022 From: cllang at redhat.com (Clemens Lang) Date: Fri, 8 Apr 2022 11:45:12 +0200 Subject: [PATCH] build: Fix make dist after socklen.m4 removal Message-ID: <20220408094512.4385-1-cllang@redhat.com> * m4/Makefile.am: Remove socklen.m4 from EXTRA_DIST -- Signed-off-by: Clemens Lang --- m4/Makefile.am | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/m4/Makefile.am b/m4/Makefile.am index c33f1009..53800d39 100644 --- a/m4/Makefile.am +++ b/m4/Makefile.am @@ -1,2 +1,2 @@ -EXTRA_DIST = libtool.m4 socklen.m4 noexecstack.m4 +EXTRA_DIST = libtool.m4 noexecstack.m4 EXTRA_DIST += gpg-error.m4 -- 2.35.1 From gniibe at fsij.org Tue Apr 12 02:40:00 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 12 Apr 2022 09:40:00 +0900 Subject: [PATCH] build: Fix make dist after socklen.m4 removal In-Reply-To: <20220408094512.4385-1-cllang@redhat.com> References: <20220408094512.4385-1-cllang@redhat.com> Message-ID: <87pmln6snz.fsf@jumper.gniibe.org> Clemens Lang wrote: > * m4/Makefile.am: Remove socklen.m4 from EXTRA_DIST Thank you. Applied and pushed. -- From guidovranken at gmail.com Sat Apr 23 00:00:47 2022 From: guidovranken at gmail.com (Guido Vranken) Date: Sat, 23 Apr 2022 00:00:47 +0200 Subject: Old bug in gcry_mpi_invm producing wrong result Message-ID: It says that InvMod(18446744073709551615, 340282366762482138434845932244680310781) is 170141183381241069226646338154899963903 but that's not true, because 170141183381241069226646338154899963903 * 18446744073709551615 % 340282366762482138434845932244680310781 is 4294967297, not 1. It looks like this bug has been present at least since libgcrypt-1.2.0 from 2004. #include #define CF_CHECK_EQ(expr, res) if ( (expr) != (res) ) { goto end; } int main(void) { gcry_mpi_t A; gcry_mpi_t B; gcry_mpi_t C; gcry_error_t err; CF_CHECK_EQ(err = gcry_mpi_scan(&A, GCRYMPI_FMT_HEX, "ffffffffffffffff", 0, NULL), 0); CF_CHECK_EQ(err = gcry_mpi_scan(&B, GCRYMPI_FMT_HEX, "fffffffdfffffffffffffffffffffffd", 0, NULL), 0); CF_CHECK_EQ(err = gcry_mpi_scan(&C, GCRYMPI_FMT_HEX, "1", 0, NULL), 0); CF_CHECK_EQ(gcry_mpi_invm(C, A, B), 1); printf("Inverse exists\n"); end: return 0; } -------------- next part -------------- An HTML attachment was scrubbed... URL: From jussi.kivilinna at iki.fi Sun Apr 24 20:40:19 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:19 +0300 Subject: [PATCH 1/7] Add detection for HW feature "intel-gfni" Message-ID: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> * configure.ac (gfnisupport, gcry_cv_gcc_inline_asm_gfni) (ENABLE_GFNI_SUPPORT): New. * src/g10lib.h (HWF_INTEL_GFNI): New. * src/hwf-x86.c (detect_x86_gnuc): Add GFNI detection. * src/hwfeatures.c (hwflist): Add "intel-gfni". * doc/gcrypt.texi: Add "intel-gfni" to HW features list. 
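The new HWF_INTEL_GFNI bit is only the detection half; the intent is that an
implementation tests it at key-setup time in the same way the AVX2/AVX512 bits
are tested elsewhere in libgcrypt.  A hypothetical sketch of that gating
follows: the context structure and function names are invented for
illustration, while _gcry_get_hw_features() and the HWF_* flags are the
existing internal API.

/* Illustrative only, not part of this patch.  */
#include "g10lib.h"   /* _gcry_get_hw_features(), HWF_* flags */

typedef struct
{
  unsigned int use_gfni_avx2:1;   /* hypothetical context flag */
} example_context_t;

static void
example_detect_hw (example_context_t *ctx)
{
  unsigned int hwf = _gcry_get_hw_features ();

  /* GFNI instructions exist in SSE, AVX, AVX2 and AVX512 (EVEX) forms,
   * so the flag is normally combined with the matching vector-extension
   * flag before a wide code path is selected.  */
  ctx->use_gfni_avx2 = !!((hwf & HWF_INTEL_GFNI) && (hwf & HWF_INTEL_AVX2));
}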
-- Signed-off-by: Jussi Kivilinna --- configure.ac | 43 +++++++++++++++++++++++++++++++++++++++++++ doc/gcrypt.texi | 1 + src/g10lib.h | 1 + src/hwf-x86.c | 7 ++++++- src/hwfeatures.c | 1 + 5 files changed, 52 insertions(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index 3e415cea..15c92018 100644 --- a/configure.ac +++ b/configure.ac @@ -675,6 +675,14 @@ AC_ARG_ENABLE(avx512-support, avx512support=$enableval,avx512support=yes) AC_MSG_RESULT($avx512support) +# Implementation of the --disable-gfni-support switch. +AC_MSG_CHECKING([whether GFNI support is requested]) +AC_ARG_ENABLE(gfni-support, + AS_HELP_STRING([--disable-gfni-support], + [Disable support for the Intel GFNI instructions]), + gfnisupport=$enableval,gfnisupport=yes) +AC_MSG_RESULT($gfnisupport) + # Implementation of the --disable-neon-support switch. AC_MSG_CHECKING([whether NEON support is requested]) AC_ARG_ENABLE(neon-support, @@ -1305,6 +1313,7 @@ if test "$mpi_cpu_arch" != "x86" ; then avxsupport="n/a" avx2support="n/a" avx512support="n/a" + gfnisupport="n/a" padlocksupport="n/a" drngsupport="n/a" fi @@ -1606,6 +1615,30 @@ if test "$gcry_cv_gcc_inline_asm_vaes_vpclmul" = "yes" ; then fi +# +# Check whether GCC inline assembler supports GFNI instructions +# +AC_CACHE_CHECK([whether GCC inline assembler supports GFNI instructions], + [gcry_cv_gcc_inline_asm_gfni], + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then + gcry_cv_gcc_inline_asm_gfni="n/a" + else + gcry_cv_gcc_inline_asm_gfni=no + AC_LINK_IFELSE([AC_LANG_PROGRAM( + [[void a(void) { + __asm__("gf2p8affineqb \$123, %%xmm0, %%xmm0;\n\t":::"cc"); /* SSE */ + __asm__("vgf2p8affineinvqb \$234, %%ymm1, %%ymm1, %%ymm1;\n\t":::"cc"); /* AVX */ + __asm__("vgf2p8mulb (%%eax), %%zmm2, %%zmm2;\n\t":::"cc"); /* AVX512 */ + }]], [ a(); ] )], + [gcry_cv_gcc_inline_asm_gfni=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_gfni" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_GFNI,1, + [Defined if inline assembler supports GFNI instructions]) +fi + + # # Check whether GCC inline assembler supports BMI2 instructions # @@ -2411,6 +2444,11 @@ if test x"$avx512support" = xyes ; then avx512support="no (unsupported by compiler)" fi fi +if test x"$gfnisupport" = xyes ; then + if test "$gcry_cv_gcc_inline_asm_gfni" != "yes" ; then + gfnisupport="no (unsupported by compiler)" + fi +fi if test x"$neonsupport" = xyes ; then if test "$gcry_cv_gcc_inline_asm_neon" != "yes" ; then if test "$gcry_cv_gcc_inline_asm_aarch64_neon" != "yes" ; then @@ -2454,6 +2492,10 @@ if test x"$avx512support" = xyes ; then AC_DEFINE(ENABLE_AVX512_SUPPORT,1, [Enable support for Intel AVX512 instructions.]) fi +if test x"$gfnisupport" = xyes ; then + AC_DEFINE(ENABLE_GFNI_SUPPORT,1, + [Enable support for Intel GFNI instructions.]) +fi if test x"$neonsupport" = xyes ; then AC_DEFINE(ENABLE_NEON_SUPPORT,1, [Enable support for ARM NEON instructions.]) @@ -3318,6 +3360,7 @@ GCRY_MSG_SHOW([Try using DRNG (RDRAND): ],[$drngsupport]) GCRY_MSG_SHOW([Try using Intel AVX: ],[$avxsupport]) GCRY_MSG_SHOW([Try using Intel AVX2: ],[$avx2support]) GCRY_MSG_SHOW([Try using Intel AVX512: ],[$avx512support]) +GCRY_MSG_SHOW([Try using Intel GFNI: ],[$gfnisupport]) GCRY_MSG_SHOW([Try using ARM NEON: ],[$neonsupport]) GCRY_MSG_SHOW([Try using ARMv8 crypto: ],[$armcryptosupport]) GCRY_MSG_SHOW([Try using PPC crypto: ],[$ppccryptosupport]) diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi index 55515011..b82535e2 100644 --- a/doc/gcrypt.texi +++ b/doc/gcrypt.texi @@ -591,6 +591,7 @@ are @item 
intel-shaext @item intel-vaes-vpclmul @item intel-avx512 + at item intel-gfni @item arm-neon @item arm-aes @item arm-sha1 diff --git a/src/g10lib.h b/src/g10lib.h index c07ed788..a5bed002 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -238,6 +238,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_INTEL_SHAEXT (1 << 16) #define HWF_INTEL_VAES_VPCLMUL (1 << 17) #define HWF_INTEL_AVX512 (1 << 18) +#define HWF_INTEL_GFNI (1 << 19) #elif defined(HAVE_CPU_ARCH_ARM) diff --git a/src/hwf-x86.c b/src/hwf-x86.c index 33386070..20420798 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -403,7 +403,7 @@ detect_x86_gnuc (void) #if defined(ENABLE_AVX2_SUPPORT) && defined(ENABLE_AESNI_SUPPORT) && \ defined(ENABLE_PCLMUL_SUPPORT) - /* Test bit 9 for VAES and bit 10 for VPCLMULDQD */ + /* Test features2 bit 9 for VAES and features2 bit 10 for VPCLMULDQD */ if ((features2 & 0x00000200) && (features2 & 0x00000400)) result |= HWF_INTEL_VAES_VPCLMUL; #endif @@ -439,6 +439,11 @@ detect_x86_gnuc (void) && (features2 & (1 << 14))) result |= HWF_INTEL_AVX512; #endif + + /* Test features2 bit 6 for GFNI (Galois field new instructions). + * These instructions are available for SSE/AVX/AVX2/AVX512. */ + if (features2 & (1 << 6)) + result |= HWF_INTEL_GFNI; } return result; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 8e92cbdd..af5daf62 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -63,6 +63,7 @@ static struct { HWF_INTEL_SHAEXT, "intel-shaext" }, { HWF_INTEL_VAES_VPCLMUL, "intel-vaes-vpclmul" }, { HWF_INTEL_AVX512, "intel-avx512" }, + { HWF_INTEL_GFNI, "intel-gfni" }, #elif defined(HAVE_CPU_ARCH_ARM) { HWF_ARM_NEON, "arm-neon" }, { HWF_ARM_AES, "arm-aes" }, -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:24 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:24 +0300 Subject: [PATCH 6/7] camellia-avx2: add partial parallel block processing In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-6-jussi.kivilinna@iki.fi> * cipher/camellia-aesni-avx2-amd64.h: Remove unnecessary vzeroupper from function entry. (enc_blk1_32, dec_blk1_32): New. * cipher/camellia-glue.c (avx_burn_stack_depth) (avx2_burn_stack_depth): Move outside of bulk functions to deduplicate. (_gcry_camellia_aesni_avx2_enc_blk1_32) (_gcry_camellia_aesni_avx2_dec_blk1_32) (_gcry_camellia_vaes_avx2_enc_blk1_32) (_gcry_camellia_vaes_avx2_dec_blk1_32) (_gcry_camellia_gfni_avx2_enc_blk1_32) (_gcry_camellia_gfni_avx2_dec_blk1_32, camellia_encrypt_blk1_32) (camellia_decrypt_blk1_32): New. (_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec, _gcry_camellia_cfb_dec) (_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth): Use new bulk processing helpers from 'bulkhelp.h' and 'camellia_encrypt_blk1_32' and 'camellia_decrypt_blk1_32' for partial parallel processing. 
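To make the motivation clearer: the blk1_32 entry points accept any block
count from 1 to 32, choose the widest implementation only when enough blocks
are queued to amortize its setup cost (3 for GFNI, 6 for VAES/AESNI in the
hunks below), and otherwise fall back to the scalar code.  The helpers in
'bulkhelp.h' then let a mode such as CTR hand its whole tail to one callback
instead of carrying a per-mode partial-block loop.  The following is a
simplified sketch of such a CTR helper; the real bulk_ctr_enc_128 used in the
camellia-glue.c hunk below also reports, via a tmp_used argument, how much of
the temporary buffer must be wiped, so treat this only as an outline of the
idea and not as the actual interface:

#include <stddef.h>
#include <string.h>

typedef unsigned int (*bulk_crypt_fn_t) (const void *ctx, unsigned char *out,
                                         const unsigned char *in,
                                         unsigned int num_blks);

/* Big-endian increment of a 128-bit counter block.  */
static void
sketch_ctr_inc_be128 (unsigned char ctr[16])
{
  int i;

  for (i = 15; i >= 0; i--)
    if (++ctr[i])
      break;
}

/* Encrypt NBLOCKS 16-byte blocks in CTR mode, using CRYPT_BLK1_N to
 * process up to TMP_NBLOCKS counter blocks per call.  Returns the largest
 * stack burn depth reported by the callback.  */
static unsigned int
sketch_ctr_enc_128 (const void *ctx, bulk_crypt_fn_t crypt_blk1_n,
                    unsigned char *out, const unsigned char *in,
                    size_t nblocks, unsigned char *ctr,
                    unsigned char *tmp, unsigned int tmp_nblocks)
{
  unsigned int burn_depth = 0;

  while (nblocks)
    {
      unsigned int n = nblocks < tmp_nblocks ? (unsigned int)nblocks
                                             : tmp_nblocks;
      unsigned int i, nburn;

      /* Materialize N consecutive counter values as cipher input.  */
      for (i = 0; i < n; i++)
        {
          memcpy (tmp + i * 16, ctr, 16);
          sketch_ctr_inc_be128 (ctr);
        }

      /* One call encrypts all N counter blocks; the callback internally
       * selects GFNI/VAES/AESNI or the scalar path based on N.  */
      nburn = crypt_blk1_n (ctx, tmp, tmp, n);
      burn_depth = nburn > burn_depth ? nburn : burn_depth;

      /* XOR the keystream with the input to produce the output.  */
      for (i = 0; i < n * 16; i++)
        out[i] = in[i] ^ tmp[i];

      out += (size_t)n * 16;
      in += (size_t)n * 16;
      nblocks -= n;
    }

  return burn_depth;
}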
-- Signed-off-by: Jussi Kivilinna --- cipher/camellia-aesni-avx2-amd64.h | 209 +++++++++++++++++++-- cipher/camellia-glue.c | 292 ++++++++++++++++++++++------- 2 files changed, 421 insertions(+), 80 deletions(-) diff --git a/cipher/camellia-aesni-avx2-amd64.h b/cipher/camellia-aesni-avx2-amd64.h index 8cd4b1cd..9cc5621e 100644 --- a/cipher/camellia-aesni-avx2-amd64.h +++ b/cipher/camellia-aesni-avx2-amd64.h @@ -1152,8 +1152,6 @@ FUNC_NAME(ctr_enc): movq 8(%rcx), %r11; bswapq %r11; - vzeroupper; - cmpl $128, key_bitlength(CTX); movl $32, %r8d; movl $24, %eax; @@ -1347,8 +1345,6 @@ FUNC_NAME(cbc_dec): movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - vzeroupper; - movq %rcx, %r9; cmpl $128, key_bitlength(CTX); @@ -1424,8 +1420,6 @@ FUNC_NAME(cfb_dec): movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - vzeroupper; - cmpl $128, key_bitlength(CTX); movl $32, %r8d; movl $24, %eax; @@ -1510,8 +1504,6 @@ FUNC_NAME(ocb_enc): movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - vzeroupper; - subq $(16 * 32 + 4 * 8), %rsp; andq $~63, %rsp; movq %rsp, %rax; @@ -1684,8 +1676,6 @@ FUNC_NAME(ocb_dec): movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - vzeroupper; - subq $(16 * 32 + 4 * 8), %rsp; andq $~63, %rsp; movq %rsp, %rax; @@ -1880,8 +1870,6 @@ FUNC_NAME(ocb_auth): movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - vzeroupper; - subq $(16 * 32 + 4 * 8), %rsp; andq $~63, %rsp; movq %rsp, %rax; @@ -2032,4 +2020,201 @@ FUNC_NAME(ocb_auth): CFI_ENDPROC(); ELF(.size FUNC_NAME(ocb_auth),.-FUNC_NAME(ocb_auth);) +.align 8 +.globl FUNC_NAME(enc_blk1_32) +ELF(.type FUNC_NAME(enc_blk1_32), at function;) + +FUNC_NAME(enc_blk1_32): + /* input: + * %rdi: ctx, CTX + * %rsi: dst (32 blocks) + * %rdx: src (32 blocks) + * %ecx: nblocks (1 to 32) + */ + CFI_STARTPROC(); + + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + + movl %ecx, %r9d; + + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + + subq $(16 * 32), %rsp; + andq $~63, %rsp; + movq %rsp, %rax; + + cmpl $31, %ecx; + vpxor %xmm0, %xmm0, %xmm0; + ja 1f; + jb 2f; + vmovdqu 15 * 32(%rdx), %xmm0; + jmp 2f; + 1: + vmovdqu 15 * 32(%rdx), %ymm0; + 2: + vmovdqu %ymm0, (%rax); + + vpbroadcastq (key_table)(CTX), %ymm0; + vpshufb .Lpack_bswap rRIP, %ymm0, %ymm0; + +#define LOAD_INPUT(offset, ymm) \ + cmpl $(1 + 2 * (offset)), %ecx; \ + jb 2f; \ + ja 1f; \ + vmovdqu (offset) * 32(%rdx), %ymm##_x; \ + vpxor %ymm0, %ymm, %ymm; \ + jmp 2f; \ + 1: \ + vpxor (offset) * 32(%rdx), %ymm0, %ymm; + + LOAD_INPUT(0, ymm15); + LOAD_INPUT(1, ymm14); + LOAD_INPUT(2, ymm13); + LOAD_INPUT(3, ymm12); + LOAD_INPUT(4, ymm11); + LOAD_INPUT(5, ymm10); + LOAD_INPUT(6, ymm9); + LOAD_INPUT(7, ymm8); + LOAD_INPUT(8, ymm7); + LOAD_INPUT(9, ymm6); + LOAD_INPUT(10, ymm5); + LOAD_INPUT(11, ymm4); + LOAD_INPUT(12, ymm3); + LOAD_INPUT(13, ymm2); + LOAD_INPUT(14, ymm1); + vpxor (%rax), %ymm0, %ymm0; + +2: + call __camellia_enc_blk32; + +#define STORE_OUTPUT(ymm, offset) \ + cmpl $(1 + 2 * (offset)), %r9d; \ + jb 2f; \ + ja 1f; \ + vmovdqu %ymm##_x, (offset) * 32(%rsi); \ + jmp 2f; \ + 1: \ + vmovdqu %ymm, (offset) * 32(%rsi); + + STORE_OUTPUT(ymm7, 0); + STORE_OUTPUT(ymm6, 1); + STORE_OUTPUT(ymm5, 2); + STORE_OUTPUT(ymm4, 3); + STORE_OUTPUT(ymm3, 4); + STORE_OUTPUT(ymm2, 5); + STORE_OUTPUT(ymm1, 6); + STORE_OUTPUT(ymm0, 7); + STORE_OUTPUT(ymm15, 8); + STORE_OUTPUT(ymm14, 9); + STORE_OUTPUT(ymm13, 10); + STORE_OUTPUT(ymm12, 11); + STORE_OUTPUT(ymm11, 12); + STORE_OUTPUT(ymm10, 13); + STORE_OUTPUT(ymm9, 14); + STORE_OUTPUT(ymm8, 15); + +2: + 
vzeroall; + + leave; + CFI_LEAVE(); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size FUNC_NAME(enc_blk1_32),.-FUNC_NAME(enc_blk1_32);) + +.align 8 +.globl FUNC_NAME(dec_blk1_32) +ELF(.type FUNC_NAME(dec_blk1_32), at function;) + +FUNC_NAME(dec_blk1_32): + /* input: + * %rdi: ctx, CTX + * %rsi: dst (32 blocks) + * %rdx: src (32 blocks) + * %ecx: nblocks (1 to 32) + */ + CFI_STARTPROC(); + + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + + movl %ecx, %r9d; + + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + + subq $(16 * 32), %rsp; + andq $~63, %rsp; + movq %rsp, %rax; + + cmpl $31, %ecx; + vpxor %xmm0, %xmm0, %xmm0; + ja 1f; + jb 2f; + vmovdqu 15 * 32(%rdx), %xmm0; + jmp 2f; + 1: + vmovdqu 15 * 32(%rdx), %ymm0; + 2: + vmovdqu %ymm0, (%rax); + + vpbroadcastq (key_table)(CTX, %r8, 8), %ymm0; + vpshufb .Lpack_bswap rRIP, %ymm0, %ymm0; + + LOAD_INPUT(0, ymm15); + LOAD_INPUT(1, ymm14); + LOAD_INPUT(2, ymm13); + LOAD_INPUT(3, ymm12); + LOAD_INPUT(4, ymm11); + LOAD_INPUT(5, ymm10); + LOAD_INPUT(6, ymm9); + LOAD_INPUT(7, ymm8); + LOAD_INPUT(8, ymm7); + LOAD_INPUT(9, ymm6); + LOAD_INPUT(10, ymm5); + LOAD_INPUT(11, ymm4); + LOAD_INPUT(12, ymm3); + LOAD_INPUT(13, ymm2); + LOAD_INPUT(14, ymm1); + vpxor (%rax), %ymm0, %ymm0; + +2: + call __camellia_dec_blk32; + + STORE_OUTPUT(ymm7, 0); + STORE_OUTPUT(ymm6, 1); + STORE_OUTPUT(ymm5, 2); + STORE_OUTPUT(ymm4, 3); + STORE_OUTPUT(ymm3, 4); + STORE_OUTPUT(ymm2, 5); + STORE_OUTPUT(ymm1, 6); + STORE_OUTPUT(ymm0, 7); + STORE_OUTPUT(ymm15, 8); + STORE_OUTPUT(ymm14, 9); + STORE_OUTPUT(ymm13, 10); + STORE_OUTPUT(ymm12, 11); + STORE_OUTPUT(ymm11, 12); + STORE_OUTPUT(ymm10, 13); + STORE_OUTPUT(ymm9, 14); + STORE_OUTPUT(ymm8, 15); + +2: + vzeroall; + + leave; + CFI_LEAVE(); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size FUNC_NAME(dec_blk1_32),.-FUNC_NAME(dec_blk1_32);) + #endif /* GCRY_CAMELLIA_AESNI_AVX2_AMD64_H */ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 7f6e92d2..20ab7f7d 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -174,6 +174,10 @@ extern void _gcry_camellia_aesni_avx_ocb_auth(CAMELLIA_context *ctx, extern void _gcry_camellia_aesni_avx_keygen(CAMELLIA_context *ctx, const unsigned char *key, unsigned int keylen) ASM_FUNC_ABI; + +static const int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + 16 + + 2 * sizeof(void *) + ASM_EXTRA_STACK; + #endif #ifdef USE_AESNI_AVX2 @@ -214,6 +218,22 @@ extern void _gcry_camellia_aesni_avx2_ocb_auth(CAMELLIA_context *ctx, unsigned char *offset, unsigned char *checksum, const u64 Ls[32]) ASM_FUNC_ABI; + +extern void _gcry_camellia_aesni_avx2_enc_blk1_32(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; + +extern void _gcry_camellia_aesni_avx2_dec_blk1_32(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; + +static const int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + 16 + + 2 * sizeof(void *) + ASM_EXTRA_STACK; + #endif #ifdef USE_VAES_AVX2 @@ -254,6 +274,18 @@ extern void _gcry_camellia_vaes_avx2_ocb_auth(CAMELLIA_context *ctx, unsigned char *offset, unsigned char *checksum, const u64 Ls[32]) ASM_FUNC_ABI; + +extern void _gcry_camellia_vaes_avx2_enc_blk1_32(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; + +extern void _gcry_camellia_vaes_avx2_dec_blk1_32(const CAMELLIA_context *ctx, + unsigned 
char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; #endif #ifdef USE_GFNI_AVX2 @@ -294,6 +326,18 @@ extern void _gcry_camellia_gfni_avx2_ocb_auth(CAMELLIA_context *ctx, unsigned char *offset, unsigned char *checksum, const u64 Ls[32]) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_enc_blk1_32(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_dec_blk1_32(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned int nblocks) + ASM_FUNC_ABI; #endif static const char *selftest(void); @@ -475,6 +519,105 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf) #endif /*!USE_ARM_ASM*/ + +static unsigned int +camellia_encrypt_blk1_32 (const void *priv, byte *outbuf, const byte *inbuf, + unsigned int num_blks) +{ + const CAMELLIA_context *ctx = priv; + unsigned int stack_burn_size = 0; + + gcry_assert (num_blks <= 32); + +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2 && num_blks >= 3) + { + /* 3 or more parallel block GFNI processing is faster than + * generic C implementation. */ + _gcry_camellia_gfni_avx2_enc_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif +#ifdef USE_VAES_AVX2 + if (ctx->use_vaes_avx2 && num_blks >= 6) + { + /* 6 or more parallel block VAES processing is faster than + * generic C implementation. */ + _gcry_camellia_vaes_avx2_enc_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2 && num_blks >= 6) + { + /* 6 or more parallel block AESNI processing is faster than + * generic C implementation. */ + _gcry_camellia_aesni_avx2_enc_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif + + while (num_blks) + { + stack_burn_size = camellia_encrypt((void *)ctx, outbuf, inbuf); + outbuf += CAMELLIA_BLOCK_SIZE; + inbuf += CAMELLIA_BLOCK_SIZE; + num_blks--; + } + + return stack_burn_size; +} + + +static unsigned int +camellia_decrypt_blk1_32 (const void *priv, byte *outbuf, const byte *inbuf, + unsigned int num_blks) +{ + const CAMELLIA_context *ctx = priv; + unsigned int stack_burn_size = 0; + + gcry_assert (num_blks <= 32); + +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2 && num_blks >= 3) + { + /* 3 or more parallel block GFNI processing is faster than + * generic C implementation. */ + _gcry_camellia_gfni_avx2_dec_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif +#ifdef USE_VAES_AVX2 + if (ctx->use_vaes_avx2 && num_blks >= 6) + { + /* 6 or more parallel block VAES processing is faster than + * generic C implementation. */ + _gcry_camellia_vaes_avx2_dec_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2 && num_blks >= 6) + { + /* 6 or more parallel block AESNI processing is faster than + * generic C implementation. */ + _gcry_camellia_aesni_avx2_dec_blk1_32 (ctx, outbuf, inbuf, num_blks); + return avx2_burn_stack_depth; + } +#endif + + while (num_blks) + { + stack_burn_size = camellia_decrypt((void *)ctx, outbuf, inbuf); + outbuf += CAMELLIA_BLOCK_SIZE; + inbuf += CAMELLIA_BLOCK_SIZE; + num_blks--; + } + + return stack_burn_size; +} + + /* Bulk encryption of complete blocks in CTR mode. This function is only intended for the bulk encryption feature of cipher.c. CTR is expected to be of size CAMELLIA_BLOCK_SIZE. 
*/ @@ -486,8 +629,7 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, CAMELLIA_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char tmpbuf[CAMELLIA_BLOCK_SIZE]; - int burn_stack_depth = CAMELLIA_encrypt_stack_burn_size; + int burn_stack_depth = 0; #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) @@ -517,9 +659,6 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, if (did_use_aesni_avx2) { - int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + 16 + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx2_burn_stack_depth) burn_stack_depth = avx2_burn_stack_depth; } @@ -547,9 +686,6 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, if (did_use_aesni_avx) { - int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx_burn_stack_depth) burn_stack_depth = avx_burn_stack_depth; } @@ -559,20 +695,23 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, } #endif - for ( ;nblocks; nblocks-- ) + /* Process remaining blocks. */ + if (nblocks) { - /* Encrypt the counter. */ - Camellia_EncryptBlock(ctx->keybitlength, ctr, ctx->keytable, tmpbuf); - /* XOR the input with the encrypted counter and store in output. */ - cipher_block_xor(outbuf, tmpbuf, inbuf, CAMELLIA_BLOCK_SIZE); - outbuf += CAMELLIA_BLOCK_SIZE; - inbuf += CAMELLIA_BLOCK_SIZE; - /* Increment the counter. */ - cipher_block_add(ctr, 1, CAMELLIA_BLOCK_SIZE); + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; + + nburn = bulk_ctr_enc_128(ctx, camellia_encrypt_blk1_32, outbuf, inbuf, + nblocks, ctr, tmpbuf, + sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); } - wipememory(tmpbuf, sizeof(tmpbuf)); - _gcry_burn_stack(burn_stack_depth); + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); } /* Bulk decryption of complete blocks in CBC mode. This function is only @@ -585,8 +724,7 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, CAMELLIA_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char savebuf[CAMELLIA_BLOCK_SIZE]; - int burn_stack_depth = CAMELLIA_decrypt_stack_burn_size; + int burn_stack_depth = 0; #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) @@ -616,9 +754,6 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, if (did_use_aesni_avx2) { - int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + 16 + - 2 * sizeof(void *) + ASM_EXTRA_STACK;; - if (burn_stack_depth < avx2_burn_stack_depth) burn_stack_depth = avx2_burn_stack_depth; } @@ -645,9 +780,6 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, if (did_use_aesni_avx) { - int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx_burn_stack_depth) burn_stack_depth = avx_burn_stack_depth; } @@ -656,20 +788,23 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, } #endif - for ( ;nblocks; nblocks-- ) + /* Process remaining blocks. */ + if (nblocks) { - /* INBUF is needed later and it may be identical to OUTBUF, so store - the intermediate result to SAVEBUF. 
*/ - Camellia_DecryptBlock(ctx->keybitlength, inbuf, ctx->keytable, savebuf); + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; - cipher_block_xor_n_copy_2(outbuf, savebuf, iv, inbuf, - CAMELLIA_BLOCK_SIZE); - inbuf += CAMELLIA_BLOCK_SIZE; - outbuf += CAMELLIA_BLOCK_SIZE; + nburn = bulk_cbc_dec_128(ctx, camellia_decrypt_blk1_32, outbuf, inbuf, + nblocks, iv, tmpbuf, + sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); } - wipememory(savebuf, sizeof(savebuf)); - _gcry_burn_stack(burn_stack_depth); + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); } /* Bulk decryption of complete blocks in CFB mode. This function is only @@ -682,7 +817,7 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, CAMELLIA_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - int burn_stack_depth = CAMELLIA_decrypt_stack_burn_size; + int burn_stack_depth = 0; #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) @@ -712,9 +847,6 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, if (did_use_aesni_avx2) { - int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + 16 + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx2_burn_stack_depth) burn_stack_depth = avx2_burn_stack_depth; } @@ -741,9 +873,6 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, if (did_use_aesni_avx) { - int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx_burn_stack_depth) burn_stack_depth = avx_burn_stack_depth; } @@ -752,15 +881,23 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, } #endif - for ( ;nblocks; nblocks-- ) + /* Process remaining blocks. */ + if (nblocks) { - Camellia_EncryptBlock(ctx->keybitlength, iv, ctx->keytable, iv); - cipher_block_xor_n_copy(outbuf, iv, inbuf, CAMELLIA_BLOCK_SIZE); - outbuf += CAMELLIA_BLOCK_SIZE; - inbuf += CAMELLIA_BLOCK_SIZE; + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; + + nburn = bulk_cfb_dec_128(ctx, camellia_encrypt_blk1_32, outbuf, inbuf, + nblocks, iv, tmpbuf, + sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); } - _gcry_burn_stack(burn_stack_depth); + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); } /* Bulk encryption/decryption of complete blocks in OCB mode. */ @@ -772,11 +909,9 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, CAMELLIA_context *ctx = (void *)&c->context.c; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - int burn_stack_depth; + int burn_stack_depth = 0; u64 blkn = c->u_mode.ocb.data_nblocks; - burn_stack_depth = encrypt ? 
CAMELLIA_encrypt_stack_burn_size : - CAMELLIA_decrypt_stack_burn_size; #else (void)c; (void)outbuf_arg; @@ -826,9 +961,6 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (did_use_aesni_avx2) { - int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx2_burn_stack_depth) burn_stack_depth = avx2_burn_stack_depth; } @@ -870,9 +1002,6 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (did_use_aesni_avx) { - int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx_burn_stack_depth) burn_stack_depth = avx_burn_stack_depth; } @@ -882,6 +1011,24 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, #endif #if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) + /* Process remaining blocks. */ + if (nblocks) + { + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; + + nburn = bulk_ocb_crypt_128 (c, ctx, encrypt ? camellia_encrypt_blk1_32 + : camellia_decrypt_blk1_32, + outbuf, inbuf, nblocks, &blkn, encrypt, + tmpbuf, sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); + nblocks = 0; + } + c->u_mode.ocb.data_nblocks = blkn; if (burn_stack_depth) @@ -899,10 +1046,8 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, #if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) CAMELLIA_context *ctx = (void *)&c->context.c; const unsigned char *abuf = abuf_arg; - int burn_stack_depth; + int burn_stack_depth = 0; u64 blkn = c->u_mode.ocb.aad_nblocks; - - burn_stack_depth = CAMELLIA_encrypt_stack_burn_size; #else (void)c; (void)abuf_arg; @@ -948,9 +1093,6 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, if (did_use_aesni_avx2) { - int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx2_burn_stack_depth) burn_stack_depth = avx2_burn_stack_depth; } @@ -988,9 +1130,6 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, if (did_use_aesni_avx) { - int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + - 2 * sizeof(void *) + ASM_EXTRA_STACK; - if (burn_stack_depth < avx_burn_stack_depth) burn_stack_depth = avx_burn_stack_depth; } @@ -1000,6 +1139,23 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, #endif #if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) + /* Process remaining blocks. */ + if (nblocks) + { + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; + + nburn = bulk_ocb_auth_128 (c, ctx, camellia_encrypt_blk1_32, + abuf, nblocks, &blkn, tmpbuf, + sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); + nblocks = 0; + } + c->u_mode.ocb.aad_nblocks = blkn; if (burn_stack_depth) -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:21 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:21 +0300 Subject: [PATCH 3/7] sm4: deduplicate bulk processing function selection In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-3-jussi.kivilinna@iki.fi> * cipher/sm4.c (crypt_blk1_8_fn_t): New. 
(sm4_aesni_avx_crypt_blk1_8, sm4_aarch64_crypt_blk1_8) (sm4_armv8_ce_crypt_blk1_8, sm4_crypt_blocks): Change first parameter to void pointer type. (sm4_get_crypt_blk1_8_fn): New. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Use sm4_get_crypt_blk1_8_fn for selecting crypt_blk1_8. -- Signed-off-by: Jussi Kivilinna --- cipher/sm4.c | 190 ++++++++++++--------------------------------------- 1 file changed, 45 insertions(+), 145 deletions(-) diff --git a/cipher/sm4.c b/cipher/sm4.c index 79e6dbf1..d36d9ceb 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -120,6 +120,10 @@ typedef struct #endif } SM4_context; +typedef unsigned int (*crypt_blk1_8_fn_t) (const void *ctx, byte *out, + const byte *in, + unsigned int num_blks); + static const u32 fk[4] = { 0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc @@ -223,7 +227,7 @@ _gcry_sm4_aesni_avx_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, unsigned int num_blks) ASM_FUNC_ABI; static inline unsigned int -sm4_aesni_avx_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, +sm4_aesni_avx_crypt_blk1_8(const void *rk, byte *out, const byte *in, unsigned int num_blks) { return _gcry_sm4_aesni_avx_crypt_blk1_8(rk, out, in, num_blks); @@ -290,7 +294,7 @@ extern void _gcry_sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, size_t num_blocks); static inline unsigned int -sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, +sm4_aarch64_crypt_blk1_8(const void *rk, byte *out, const byte *in, unsigned int num_blks) { _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, (size_t)num_blks); @@ -327,8 +331,8 @@ extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, size_t num_blocks); static inline unsigned int -sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks) +sm4_armv8_ce_crypt_blk1_8(const void *rk, byte *out, const byte *in, + unsigned int num_blks) { _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks); return 0; @@ -600,9 +604,10 @@ sm4_do_crypt_blks2 (const u32 *rk, byte *out, const byte *in) } static unsigned int -sm4_crypt_blocks (const u32 *rk, byte *out, const byte *in, +sm4_crypt_blocks (const void *ctx, byte *out, const byte *in, unsigned int num_blks) { + const u32 *rk = ctx; unsigned int burn_depth = 0; unsigned int nburn; @@ -629,6 +634,36 @@ sm4_crypt_blocks (const u32 *rk, byte *out, const byte *in, return burn_depth; } +static inline crypt_blk1_8_fn_t +sm4_get_crypt_blk1_8_fn(SM4_context *ctx) +{ + if (0) + ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + return &sm4_aesni_avx_crypt_blk1_8; + } +#endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + return &sm4_armv8_ce_crypt_blk1_8; + } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + return &sm4_aarch64_crypt_blk1_8; + } +#endif + else + { + prefetch_sbox_table (); + return &sm4_crypt_blocks; + } +} + /* Bulk encryption of complete blocks in CTR mode. This function is only intended for the bulk encryption feature of cipher.c. CTR is expected to be of size 16. */ @@ -709,37 +744,10 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, /* Process remaining blocks. 
*/ if (nblocks) { - unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks); + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); byte tmpbuf[16 * 8]; unsigned int tmp_used = 16; - if (0) - ; -#ifdef USE_AESNI_AVX - else if (ctx->use_aesni_avx) - { - crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; - } -#endif -#ifdef USE_ARM_CE - else if (ctx->use_arm_ce) - { - crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; - } -#endif -#ifdef USE_AARCH64_SIMD - else if (ctx->use_aarch64_simd) - { - crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; - } -#endif - else - { - prefetch_sbox_table (); - crypt_blk1_8 = sm4_crypt_blocks; - } - /* Process remaining blocks. */ while (nblocks) { @@ -856,37 +864,10 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, /* Process remaining blocks. */ if (nblocks) { - unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks); + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); unsigned char savebuf[16 * 8]; unsigned int tmp_used = 16; - if (0) - ; -#ifdef USE_AESNI_AVX - else if (ctx->use_aesni_avx) - { - crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; - } -#endif -#ifdef USE_ARM_CE - else if (ctx->use_arm_ce) - { - crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; - } -#endif -#ifdef USE_AARCH64_SIMD - else if (ctx->use_aarch64_simd) - { - crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; - } -#endif - else - { - prefetch_sbox_table (); - crypt_blk1_8 = sm4_crypt_blocks; - } - /* Process remaining blocks. */ while (nblocks) { @@ -996,37 +977,10 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, /* Process remaining blocks. */ if (nblocks) { - unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks); + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); unsigned char ivbuf[16 * 8]; unsigned int tmp_used = 16; - if (0) - ; -#ifdef USE_AESNI_AVX - else if (ctx->use_aesni_avx) - { - crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; - } -#endif -#ifdef USE_ARM_CE - else if (ctx->use_arm_ce) - { - crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; - } -#endif -#ifdef USE_AARCH64_SIMD - else if (ctx->use_aarch64_simd) - { - crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; - } -#endif - else - { - prefetch_sbox_table (); - crypt_blk1_8 = sm4_crypt_blocks; - } - /* Process remaining blocks. */ while (nblocks) { @@ -1163,38 +1117,11 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (nblocks) { - unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks); + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); const u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; - if (0) - ; -#ifdef USE_AESNI_AVX - else if (ctx->use_aesni_avx) - { - crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; - } -#endif -#ifdef USE_ARM_CE - else if (ctx->use_arm_ce) - { - crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; - } -#endif -#ifdef USE_AARCH64_SIMD - else if (ctx->use_aarch64_simd) - { - crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; - } -#endif - else - { - prefetch_sbox_table (); - crypt_blk1_8 = sm4_crypt_blocks; - } - while (nblocks) { size_t curr_blks = nblocks > 8 ? 
8 : nblocks; @@ -1336,37 +1263,10 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) if (nblocks) { - unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, - unsigned int num_blks); + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; - if (0) - ; -#ifdef USE_AESNI_AVX - else if (ctx->use_aesni_avx) - { - crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; - } -#endif -#ifdef USE_ARM_CE - else if (ctx->use_arm_ce) - { - crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; - } -#endif -#ifdef USE_AARCH64_SIMD - else if (ctx->use_aarch64_simd) - { - crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; - } -#endif - else - { - prefetch_sbox_table (); - crypt_blk1_8 = sm4_crypt_blocks; - } - while (nblocks) { size_t curr_blks = nblocks > 8 ? 8 : nblocks; -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:22 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:22 +0300 Subject: [PATCH 4/7] Move bulk OCB L pointer array setup code to common header In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-4-jussi.kivilinna@iki.fi> * cipher/bulkhelp.h: New. * cipher/camellia-glue.c (_gcry_camellia_ocb_crypt) (_gcry_camellia_ocb_crypt): Use new `bulk_ocb_prepare_L_pointers_array_blkXX` function for OCB L pointer array setup. * cipher/serpent.c (_gcry_serpent_ocb_crypt) (_gcry_serpent_ocb_auth): Likewise. * cipher/sm4.c (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Likewise. * cipher/twofish.c (_gcry_twofish_ocb_crypt) (_gcry_twofish_ocb_auth): Likewise. -- Signed-off-by: Jussi Kivilinna --- cipher/bulkhelp.h | 103 +++++++++++++++++++++++++++++++++++++++++ cipher/camellia-glue.c | 78 ++----------------------------- cipher/serpent.c | 99 +++++++-------------------------------- cipher/sm4.c | 63 ++----------------------- cipher/twofish.c | 37 ++------------- 5 files changed, 132 insertions(+), 248 deletions(-) create mode 100644 cipher/bulkhelp.h diff --git a/cipher/bulkhelp.h b/cipher/bulkhelp.h new file mode 100644 index 00000000..72668d42 --- /dev/null +++ b/cipher/bulkhelp.h @@ -0,0 +1,103 @@ +/* bulkhelp.h - Some bulk processing helpers + * Copyright (C) 2022 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ +#ifndef GCRYPT_BULKHELP_H +#define GCRYPT_BULKHELP_H + + +#include "g10lib.h" +#include "cipher-internal.h" + + +#ifdef __x86_64__ +/* Use u64 to store pointers for x32 support (assembly function assumes + * 64-bit pointers). 
*/ +typedef u64 ocb_L_uintptr_t; +#else +typedef uintptr_t ocb_L_uintptr_t; +#endif + + +static inline ocb_L_uintptr_t * +bulk_ocb_prepare_L_pointers_array_blk32 (gcry_cipher_hd_t c, + ocb_L_uintptr_t Ls[32], u64 blkn) +{ + unsigned int n = 32 - (blkn % 32); + unsigned int i; + + for (i = 0; i < 32; i += 8) + { + Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + } + + Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4]; + Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + return &Ls[(31 + n) % 32]; +} + + +static inline ocb_L_uintptr_t * +bulk_ocb_prepare_L_pointers_array_blk16 (gcry_cipher_hd_t c, + ocb_L_uintptr_t Ls[16], u64 blkn) +{ + unsigned int n = 16 - (blkn % 16); + unsigned int i; + + for (i = 0; i < 16; i += 8) + { + Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + } + + Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + return &Ls[(15 + n) % 16]; +} + + +static inline ocb_L_uintptr_t * +bulk_ocb_prepare_L_pointers_array_blk8 (gcry_cipher_hd_t c, + ocb_L_uintptr_t Ls[8], u64 blkn) +{ + unsigned int n = 8 - (blkn % 8); + + Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + + return &Ls[(7 + n) % 8]; +} + + +#endif /*GCRYPT_BULKHELP_H*/ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 7f009db4..7f6e92d2 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -65,6 +65,7 @@ #include "bufhelp.h" #include "cipher-internal.h" #include "cipher-selftest.h" +#include "bulkhelp.h" /* Helper macro to force alignment to 16 bytes. */ #ifdef HAVE_GCC_ATTRIBUTE_ALIGNED @@ -788,9 +789,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { int did_use_aesni_avx2 = 0; u64 Ls[32]; - unsigned int n = 32 - (blkn % 32); u64 *l; - int i; if (nblocks >= 32) { @@ -808,24 +807,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, bulk_ocb_fn = encrypt ? _gcry_camellia_gfni_avx2_ocb_enc : _gcry_camellia_gfni_avx2_ocb_dec; #endif - - for (i = 0; i < 32; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). 
*/ - Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4]; - Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(31 + n) % 32]; + l = bulk_ocb_prepare_L_pointers_array_blk32 (c, Ls, blkn); /* Process data in 32 block chunks. */ while (nblocks >= 32) @@ -860,27 +842,11 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { int did_use_aesni_avx = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -947,9 +913,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, { int did_use_aesni_avx2 = 0; u64 Ls[32]; - unsigned int n = 32 - (blkn % 32); u64 *l; - int i; if (nblocks >= 32) { @@ -965,23 +929,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, bulk_auth_fn = _gcry_camellia_gfni_avx2_ocb_auth; #endif - for (i = 0; i < 32; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4]; - Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(31 + n) % 32]; + l = bulk_ocb_prepare_L_pointers_array_blk32 (c, Ls, blkn); /* Process data in 32 block chunks. */ while (nblocks >= 32) @@ -1016,27 +964,11 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, { int did_use_aesni_avx = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). 
*/ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) diff --git a/cipher/serpent.c b/cipher/serpent.c index 159d889f..dfe5cc28 100644 --- a/cipher/serpent.c +++ b/cipher/serpent.c @@ -31,6 +31,7 @@ #include "bufhelp.h" #include "cipher-internal.h" #include "cipher-selftest.h" +#include "bulkhelp.h" /* USE_SSE2 indicates whether to compile with AMD64 SSE2 code. */ @@ -1272,27 +1273,11 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { int did_use_avx2 = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -1329,21 +1314,11 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { int did_use_sse2 = 0; u64 Ls[8]; - unsigned int n = 8 - (blkn % 8); u64 *l; if (nblocks >= 8) { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. */ while (nblocks >= 8) @@ -1380,33 +1355,25 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (ctx->use_neon) { int did_use_neon = 0; - const void *Ls[8]; - unsigned int n = 8 - (blkn % 8); - const void **l; + uintptr_t Ls[8]; + uintptr_t *l; if (nblocks >= 8) { - Ls[(0 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = c->u_mode.ocb.L[0]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. 
*/ while (nblocks >= 8) { blkn += 8; - *l = ocb_get_l(c, blkn - blkn % 8); + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8); if (encrypt) _gcry_serpent_neon_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); + c->u_ctr.ctr, (void **)Ls); else _gcry_serpent_neon_ocb_dec(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); + c->u_ctr.ctr, (void **)Ls); nblocks -= 8; outbuf += 8 * sizeof(serpent_block_t); @@ -1456,27 +1423,11 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, { int did_use_avx2 = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -1508,21 +1459,11 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, { int did_use_sse2 = 0; u64 Ls[8]; - unsigned int n = 8 - (blkn % 8); u64 *l; if (nblocks >= 8) { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. */ while (nblocks >= 8) @@ -1554,29 +1495,21 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, if (ctx->use_neon) { int did_use_neon = 0; - const void *Ls[8]; - unsigned int n = 8 - (blkn % 8); - const void **l; + uintptr_t Ls[8]; + uintptr_t *l; if (nblocks >= 8) { - Ls[(0 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = c->u_mode.ocb.L[0]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. */ while (nblocks >= 8) { blkn += 8; - *l = ocb_get_l(c, blkn - blkn % 8); + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8); _gcry_serpent_neon_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset, - c->u_mode.ocb.aad_sum, Ls); + c->u_mode.ocb.aad_sum, (void **)Ls); nblocks -= 8; abuf += 8 * sizeof(serpent_block_t); diff --git a/cipher/sm4.c b/cipher/sm4.c index d36d9ceb..0148365c 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -30,6 +30,7 @@ #include "bufhelp.h" #include "cipher-internal.h" #include "cipher-selftest.h" +#include "bulkhelp.h" /* Helper macro to force alignment to 64 bytes. 
*/ #ifdef HAVE_GCC_ATTRIBUTE_ALIGNED @@ -1030,27 +1031,11 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (ctx->use_aesni_avx2) { u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -1077,22 +1062,11 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (ctx->use_aesni_avx) { u64 Ls[8]; - unsigned int n = 8 - (blkn % 8); u64 *l; if (nblocks >= 8) { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. */ while (nblocks >= 8) @@ -1184,27 +1158,11 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) if (ctx->use_aesni_avx2) { u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -1227,22 +1185,11 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) if (ctx->use_aesni_avx) { u64 Ls[8]; - unsigned int n = 8 - (blkn % 8); u64 *l; if (nblocks >= 8) { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). 
*/ - Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(7 + n) % 8]; + l = bulk_ocb_prepare_L_pointers_array_blk8 (c, Ls, blkn); /* Process data in 8 block chunks. */ while (nblocks >= 8) diff --git a/cipher/twofish.c b/cipher/twofish.c index d19e0790..4ae5d5a6 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -47,6 +47,7 @@ #include "bufhelp.h" #include "cipher-internal.h" #include "cipher-selftest.h" +#include "bulkhelp.h" #define TWOFISH_BLOCKSIZE 16 @@ -1358,27 +1359,11 @@ _gcry_twofish_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { int did_use_avx2 = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) @@ -1471,27 +1456,11 @@ _gcry_twofish_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, { int did_use_avx2 = 0; u64 Ls[16]; - unsigned int n = 16 - (blkn % 16); u64 *l; - int i; if (nblocks >= 16) { - for (i = 0; i < 16; i += 8) - { - /* Use u64 to store pointers for x32 support (assembly function - * assumes 64-bit pointers). */ - Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; - Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; - Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; - } - - Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; - l = &Ls[(15 + n) % 16]; + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); /* Process data in 16 block chunks. */ while (nblocks >= 16) -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:25 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:25 +0300 Subject: [PATCH 7/7] camellia-avx2: add bulk processing for XTS mode In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-7-jussi.kivilinna@iki.fi> * cipher/bulkhelp.h (bulk_xts_crypt_128): New. * cipher/camellia-glue.c (_gcry_camellia_xts_crypt): New. (camellia_set_key) [USE_AESNI_AVX2]: Set XTS bulk function if AVX2 implementation is available. 
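The per-block tweak update in bulk_xts_crypt_128 is a branchless multiplication by x in GF(2^128), folding in the usual XTS reduction constant 0x87 whenever the top tweak bit is set. As a reading aid only, not part of the patch, the same update on a tweak held as two little-endian 64-bit halves can be sketched as:

  #include <stdint.h>

  /* Illustration: GF(2^128) doubling of the XTS tweak, matching the
   * carry/shift sequence used inside bulk_xts_crypt_128 below. */
  static void
  xts_gf128_double (uint64_t *lo, uint64_t *hi)
  {
    /* If bit 127 is set, the reduction polynomial
     * x^128 + x^7 + x^2 + x + 1 folds back in as the constant 0x87. */
    uint64_t carry = -(*hi >> 63) & 0x87;

    *hi = (*hi << 1) | (*lo >> 63);
    *lo = (*lo << 1) ^ carry;
  }

The helper keeps each block's tweak in the temporary buffer so that the second xor pass can be applied after the block function has processed the whole chunk.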
-- Benchmark on AMD Ryzen 5800X: Before: CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 3.79 ns/B 251.8 MiB/s 18.37 c/B 4850 XTS dec | 3.77 ns/B 253.2 MiB/s 18.27 c/B 4850 After (6.8x faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 0.554 ns/B 1720 MiB/s 2.69 c/B 4850 XTS dec | 0.541 ns/B 1762 MiB/s 2.63 c/B 4850 Signed-off-by: Jussi Kivilinna --- cipher/bulkhelp.h | 68 ++++++++++++++++++++++++++++++++++++++++++ cipher/camellia-glue.c | 39 ++++++++++++++++++++++++ 2 files changed, 107 insertions(+) diff --git a/cipher/bulkhelp.h b/cipher/bulkhelp.h index c9ecaba6..b1b4b2e1 100644 --- a/cipher/bulkhelp.h +++ b/cipher/bulkhelp.h @@ -325,4 +325,72 @@ bulk_ocb_auth_128 (gcry_cipher_hd_t c, void *priv, bulk_crypt_fn_t crypt_fn, } +static inline unsigned int +bulk_xts_crypt_128 (void *priv, bulk_crypt_fn_t crypt_fn, byte *outbuf, + const byte *inbuf, size_t nblocks, byte *tweak, + byte *tmpbuf, size_t tmpbuf_nblocks, + unsigned int *num_used_tmpblocks) +{ + u64 tweak_lo, tweak_hi, tweak_next_lo, tweak_next_hi, tmp_lo, tmp_hi, carry; + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + tweak_next_lo = buf_get_le64 (tweak + 0); + tweak_next_hi = buf_get_le64 (tweak + 8); + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + for (i = 0; i < curr_blks; i++) + { + tweak_lo = tweak_next_lo; + tweak_hi = tweak_next_hi; + + /* Generate next tweak. */ + carry = -(tweak_next_hi >> 63) & 0x87; + tweak_next_hi = (tweak_next_hi << 1) + (tweak_next_lo >> 63); + tweak_next_lo = (tweak_next_lo << 1) ^ carry; + + /* Xor-Encrypt/Decrypt-Xor block. */ + tmp_lo = buf_get_le64 (inbuf + i * 16 + 0) ^ tweak_lo; + tmp_hi = buf_get_le64 (inbuf + i * 16 + 8) ^ tweak_hi; + buf_put_he64 (&tmpbuf[i * 16 + 0], tweak_lo); + buf_put_he64 (&tmpbuf[i * 16 + 8], tweak_hi); + buf_put_le64 (outbuf + i * 16 + 0, tmp_lo); + buf_put_le64 (outbuf + i * 16 + 8, tmp_hi); + } + + nburn = crypt_fn (priv, outbuf, outbuf, curr_blks); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + /* Xor-Encrypt/Decrypt-Xor block. 
*/ + tweak_lo = buf_get_he64 (&tmpbuf[i * 16 + 0]); + tweak_hi = buf_get_he64 (&tmpbuf[i * 16 + 8]); + tmp_lo = buf_get_le64 (outbuf + i * 16 + 0) ^ tweak_lo; + tmp_hi = buf_get_le64 (outbuf + i * 16 + 8) ^ tweak_hi; + buf_put_le64 (outbuf + i * 16 + 0, tmp_lo); + buf_put_le64 (outbuf + i * 16 + 8, tmp_hi); + } + + inbuf += curr_blks * 16; + outbuf += curr_blks * 16; + nblocks -= curr_blks; + } + + buf_put_le64 (tweak + 0, tweak_next_lo); + buf_put_le64 (tweak + 8, tweak_next_hi); + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + #endif /*GCRYPT_BULKHELP_H*/ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 20ab7f7d..eae1d9ff 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -351,6 +351,9 @@ static void _gcry_camellia_cbc_dec (void *context, unsigned char *iv, static void _gcry_camellia_cfb_dec (void *context, unsigned char *iv, void *outbuf_arg, const void *inbuf_arg, size_t nblocks); +static void _gcry_camellia_xts_crypt (void *context, unsigned char *tweak, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int encrypt); static size_t _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt); @@ -407,6 +410,10 @@ camellia_setkey(void *c, const byte *key, unsigned keylen, bulk_ops->ctr_enc = _gcry_camellia_ctr_enc; bulk_ops->ocb_crypt = _gcry_camellia_ocb_crypt; bulk_ops->ocb_auth = _gcry_camellia_ocb_auth; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2 || ctx->use_vaes_avx2 || ctx->use_gfni_avx2) + bulk_ops->xts_crypt = _gcry_camellia_xts_crypt; +#endif if (0) { } @@ -900,6 +907,38 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, _gcry_burn_stack(burn_stack_depth); } +/* Bulk encryption/decryption of complete blocks in XTS mode. */ +static void +_gcry_camellia_xts_crypt (void *context, unsigned char *tweak, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int encrypt) +{ + CAMELLIA_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + int burn_stack_depth = 0; + + /* Process remaining blocks. */ + if (nblocks) + { + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; + size_t nburn; + + nburn = bulk_xts_crypt_128(ctx, encrypt ? camellia_encrypt_blk1_32 + : camellia_decrypt_blk1_32, + outbuf, inbuf, nblocks, tweak, tmpbuf, + sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); + } + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); +} + /* Bulk encryption/decryption of complete blocks in OCB mode. */ static size_t _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:23 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:23 +0300 Subject: [PATCH 5/7] cipher/bulkhelp: add functions for CTR/CBC/CFB/OCB bulk processing In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-5-jussi.kivilinna@iki.fi> * cipher/bulkhelp.h (bulk_crypt_fn_t, bulk_ctr_enc_128) (bulk_cbc_dec_128, bulk_cfb_dec_128, bulk_ocb_crypt_128) (bulk_ocb_auth_128): New. * cipher/sm4.c (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec) (_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Switch to use helper functions from 'bulkhelp.h'. 
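All of the new helpers take the block routine through the bulk_crypt_fn_t pointer, which has the same shape as the crypt_blk1_8_fn_t used by sm4.c: a function that processes a short run of consecutive 16-byte blocks (up to 8 for SM4, up to 32 for Camellia) and returns its stack burn depth. As an illustration of that contract only, and not code from the patch, a minimal callback might look like the following (assuming libgcrypt's byte typedef; the XOR merely stands in for a real block cipher):

  /* Illustration: the shape of a bulk_crypt_fn_t callback.  It must
   * accept 1..num_blks blocks, tolerate out == in, and return the
   * number of stack bytes to burn afterwards. */
  static unsigned int
  toy_crypt_blk1_8 (const void *priv, byte *out, const byte *in,
                    unsigned int num_blks)
  {
    unsigned int i;

    (void)priv;  /* a real cipher would use its round keys here */

    for (i = 0; i < num_blks * 16; i++)
      out[i] = in[i] ^ 0x5a;  /* toy transform, not a real cipher */

    return 0;    /* nothing sensitive left on the stack */
  }

The helpers then own the scratch-buffer handling and report back through num_used_tmpblocks how much of the temporary buffer was touched, so the caller only has to wipe that amount, as the sm4.c changes below show.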
-- Signed-off-by: Jussi Kivilinna --- cipher/bulkhelp.h | 225 ++++++++++++++++++++++++++++++++++++++++++++++ cipher/sm4.c | 184 ++++++++----------------------------- 2 files changed, 260 insertions(+), 149 deletions(-) diff --git a/cipher/bulkhelp.h b/cipher/bulkhelp.h index 72668d42..c9ecaba6 100644 --- a/cipher/bulkhelp.h +++ b/cipher/bulkhelp.h @@ -32,6 +32,10 @@ typedef u64 ocb_L_uintptr_t; typedef uintptr_t ocb_L_uintptr_t; #endif +typedef unsigned int (*bulk_crypt_fn_t) (const void *ctx, byte *out, + const byte *in, + unsigned int num_blks); + static inline ocb_L_uintptr_t * bulk_ocb_prepare_L_pointers_array_blk32 (gcry_cipher_hd_t c, @@ -100,4 +104,225 @@ bulk_ocb_prepare_L_pointers_array_blk8 (gcry_cipher_hd_t c, } +static inline unsigned int +bulk_ctr_enc_128 (void *priv, bulk_crypt_fn_t crypt_fn, byte *outbuf, + const byte *inbuf, size_t nblocks, byte *ctr, + byte *tmpbuf, size_t tmpbuf_nblocks, + unsigned int *num_used_tmpblocks) +{ + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + cipher_block_cpy (tmpbuf + 0 * 16, ctr, 16); + for (i = 1; i < curr_blks; i++) + { + cipher_block_cpy (&tmpbuf[i * 16], ctr, 16); + cipher_block_add (&tmpbuf[i * 16], i, 16); + } + cipher_block_add (ctr, curr_blks, 16); + + nburn = crypt_fn (priv, tmpbuf, tmpbuf, curr_blks); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor (outbuf, &tmpbuf[i * 16], inbuf, 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + +static inline unsigned int +bulk_cbc_dec_128 (void *priv, bulk_crypt_fn_t crypt_fn, byte *outbuf, + const byte *inbuf, size_t nblocks, byte *iv, + byte *tmpbuf, size_t tmpbuf_nblocks, + unsigned int *num_used_tmpblocks) +{ + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + nburn = crypt_fn (priv, tmpbuf, inbuf, curr_blks); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_n_copy_2(outbuf, &tmpbuf[i * 16], iv, inbuf, 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + +static inline unsigned int +bulk_cfb_dec_128 (void *priv, bulk_crypt_fn_t crypt_fn, byte *outbuf, + const byte *inbuf, size_t nblocks, byte *iv, + byte *tmpbuf, size_t tmpbuf_nblocks, + unsigned int *num_used_tmpblocks) +{ + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + cipher_block_cpy (&tmpbuf[0 * 16], iv, 16); + if (curr_blks > 1) + memcpy (&tmpbuf[1 * 16], &inbuf[(1 - 1) * 16], 16 * curr_blks - 16); + cipher_block_cpy (iv, &inbuf[(curr_blks - 1) * 16], 16); + + nburn = crypt_fn (priv, tmpbuf, tmpbuf, curr_blks); + burn_depth = nburn > burn_depth ? 
nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor (outbuf, inbuf, &tmpbuf[i * 16], 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + +static inline unsigned int +bulk_ocb_crypt_128 (gcry_cipher_hd_t c, void *priv, bulk_crypt_fn_t crypt_fn, + byte *outbuf, const byte *inbuf, size_t nblocks, u64 *blkn, + int encrypt, byte *tmpbuf, size_t tmpbuf_nblocks, + unsigned int *num_used_tmpblocks) +{ + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + for (i = 0; i < curr_blks; i++) + { + const unsigned char *l = ocb_get_l(c, ++*blkn); + + /* Checksum_i = Checksum_{i-1} xor P_i */ + if (encrypt) + cipher_block_xor_1(c->u_ctr.ctr, &inbuf[i * 16], 16); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + cipher_block_xor_2dst (&tmpbuf[i * 16], c->u_iv.iv, l, 16); + cipher_block_xor (&outbuf[i * 16], &inbuf[i * 16], + c->u_iv.iv, 16); + } + + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + nburn = crypt_fn (priv, outbuf, outbuf, curr_blks); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_1 (&outbuf[i * 16], &tmpbuf[i * 16], 16); + + /* Checksum_i = Checksum_{i-1} xor P_i */ + if (!encrypt) + cipher_block_xor_1(c->u_ctr.ctr, &outbuf[i * 16], 16); + } + + outbuf += curr_blks * 16; + inbuf += curr_blks * 16; + nblocks -= curr_blks; + } + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + +static inline unsigned int +bulk_ocb_auth_128 (gcry_cipher_hd_t c, void *priv, bulk_crypt_fn_t crypt_fn, + const byte *abuf, size_t nblocks, u64 *blkn, byte *tmpbuf, + size_t tmpbuf_nblocks, unsigned int *num_used_tmpblocks) +{ + unsigned int tmp_used = 16; + unsigned int burn_depth = 0; + unsigned int nburn; + + while (nblocks >= 1) + { + size_t curr_blks = nblocks > tmpbuf_nblocks ? tmpbuf_nblocks : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + for (i = 0; i < curr_blks; i++) + { + const unsigned char *l = ocb_get_l(c, ++*blkn); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + cipher_block_xor_2dst (&tmpbuf[i * 16], + c->u_mode.ocb.aad_offset, l, 16); + cipher_block_xor_1 (&tmpbuf[i * 16], &abuf[i * 16], 16); + } + + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + nburn = crypt_fn (priv, tmpbuf, tmpbuf, curr_blks); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_1 (c->u_mode.ocb.aad_sum, &tmpbuf[i * 16], 16); + } + + abuf += curr_blks * 16; + nblocks -= curr_blks; + } + + *num_used_tmpblocks = tmp_used; + return burn_depth; +} + + #endif /*GCRYPT_BULKHELP_H*/ diff --git a/cipher/sm4.c b/cipher/sm4.c index 0148365c..4815b184 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -748,36 +748,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); byte tmpbuf[16 * 8]; unsigned int tmp_used = 16; + size_t nburn; - /* Process remaining blocks. */ - while (nblocks) - { - size_t curr_blks = nblocks > 8 ? 
8 : nblocks; - size_t i; - - if (curr_blks * 16 > tmp_used) - tmp_used = curr_blks * 16; - - cipher_block_cpy (tmpbuf + 0 * 16, ctr, 16); - for (i = 1; i < curr_blks; i++) - { - cipher_block_cpy (&tmpbuf[i * 16], ctr, 16); - cipher_block_add (&tmpbuf[i * 16], i, 16); - } - cipher_block_add (ctr, curr_blks, 16); - - burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf, - curr_blks); - - for (i = 0; i < curr_blks; i++) - { - cipher_block_xor (outbuf, &tmpbuf[i * 16], inbuf, 16); - outbuf += 16; - inbuf += 16; - } - - nblocks -= curr_blks; - } + nburn = bulk_ctr_enc_128(ctx->rkey_enc, crypt_blk1_8, outbuf, inbuf, + nblocks, ctr, tmpbuf, sizeof(tmpbuf) / 16, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; wipememory(tmpbuf, tmp_used); } @@ -866,33 +842,16 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, if (nblocks) { crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - unsigned char savebuf[16 * 8]; + unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; + size_t nburn; - /* Process remaining blocks. */ - while (nblocks) - { - size_t curr_blks = nblocks > 8 ? 8 : nblocks; - size_t i; - - if (curr_blks * 16 > tmp_used) - tmp_used = curr_blks * 16; - - burn_stack_depth = crypt_blk1_8 (ctx->rkey_dec, savebuf, inbuf, - curr_blks); + nburn = bulk_cbc_dec_128(ctx->rkey_dec, crypt_blk1_8, outbuf, inbuf, + nblocks, iv, tmpbuf, sizeof(tmpbuf) / 16, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; - for (i = 0; i < curr_blks; i++) - { - cipher_block_xor_n_copy_2(outbuf, &savebuf[i * 16], iv, inbuf, - 16); - outbuf += 16; - inbuf += 16; - } - - nblocks -= curr_blks; - } - - wipememory(savebuf, tmp_used); + wipememory(tmpbuf, tmp_used); } if (burn_stack_depth) @@ -979,37 +938,16 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, if (nblocks) { crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - unsigned char ivbuf[16 * 8]; + unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; + size_t nburn; - /* Process remaining blocks. */ - while (nblocks) - { - size_t curr_blks = nblocks > 8 ? 8 : nblocks; - size_t i; - - if (curr_blks * 16 > tmp_used) - tmp_used = curr_blks * 16; - - cipher_block_cpy (&ivbuf[0 * 16], iv, 16); - for (i = 1; i < curr_blks; i++) - cipher_block_cpy (&ivbuf[i * 16], &inbuf[(i - 1) * 16], 16); - cipher_block_cpy (iv, &inbuf[(i - 1) * 16], 16); + nburn = bulk_cfb_dec_128(ctx->rkey_enc, crypt_blk1_8, outbuf, inbuf, + nblocks, iv, tmpbuf, sizeof(tmpbuf) / 16, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; - burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, ivbuf, ivbuf, - curr_blks); - - for (i = 0; i < curr_blks; i++) - { - cipher_block_xor (outbuf, inbuf, &ivbuf[i * 16], 16); - outbuf += 16; - inbuf += 16; - } - - nblocks -= curr_blks; - } - - wipememory(ivbuf, tmp_used); + wipememory(tmpbuf, tmp_used); } if (burn_stack_depth) @@ -1089,51 +1027,19 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, } #endif + /* Process remaining blocks. */ if (nblocks) { crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - const u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; + u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; + size_t nburn; - while (nblocks) - { - size_t curr_blks = nblocks > 8 ? 
8 : nblocks; - size_t i; - - if (curr_blks * 16 > tmp_used) - tmp_used = curr_blks * 16; - - for (i = 0; i < curr_blks; i++) - { - const unsigned char *l = ocb_get_l(c, ++blkn); - - /* Checksum_i = Checksum_{i-1} xor P_i */ - if (encrypt) - cipher_block_xor_1(c->u_ctr.ctr, &inbuf[i * 16], 16); - - /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ - cipher_block_xor_2dst (&tmpbuf[i * 16], c->u_iv.iv, l, 16); - cipher_block_xor (&outbuf[i * 16], &inbuf[i * 16], - c->u_iv.iv, 16); - } - - /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ - crypt_blk1_8 (rk, outbuf, outbuf, curr_blks); - - for (i = 0; i < curr_blks; i++) - { - cipher_block_xor_1 (&outbuf[i * 16], &tmpbuf[i * 16], 16); - - /* Checksum_i = Checksum_{i-1} xor P_i */ - if (!encrypt) - cipher_block_xor_1(c->u_ctr.ctr, &outbuf[i * 16], 16); - } - - outbuf += curr_blks * 16; - inbuf += curr_blks * 16; - nblocks -= curr_blks; - } + nburn = bulk_ocb_crypt_128 (c, rk, crypt_blk1_8, outbuf, inbuf, nblocks, + &blkn, encrypt, tmpbuf, sizeof(tmpbuf) / 16, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; wipememory(tmpbuf, tmp_used); } @@ -1153,6 +1059,7 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) SM4_context *ctx = (void *)&c->context.c; const unsigned char *abuf = abuf_arg; u64 blkn = c->u_mode.ocb.aad_nblocks; + int burn_stack_depth = 0; #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) @@ -1208,47 +1115,26 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) } #endif + /* Process remaining blocks. */ if (nblocks) { crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); unsigned char tmpbuf[16 * 8]; unsigned int tmp_used = 16; + size_t nburn; - while (nblocks) - { - size_t curr_blks = nblocks > 8 ? 8 : nblocks; - size_t i; - - if (curr_blks * 16 > tmp_used) - tmp_used = curr_blks * 16; - - for (i = 0; i < curr_blks; i++) - { - const unsigned char *l = ocb_get_l(c, ++blkn); - - /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ - cipher_block_xor_2dst (&tmpbuf[i * 16], - c->u_mode.ocb.aad_offset, l, 16); - cipher_block_xor_1 (&tmpbuf[i * 16], &abuf[i * 16], 16); - } - - /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ - crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf, curr_blks); - - for (i = 0; i < curr_blks; i++) - { - cipher_block_xor_1 (c->u_mode.ocb.aad_sum, &tmpbuf[i * 16], 16); - } - - abuf += curr_blks * 16; - nblocks -= curr_blks; - } + nburn = bulk_ocb_auth_128 (c, ctx->rkey_enc, crypt_blk1_8, abuf, nblocks, + &blkn, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; wipememory(tmpbuf, tmp_used); } c->u_mode.ocb.aad_nblocks = blkn; + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); + return 0; } -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:40:20 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:40:20 +0300 Subject: [PATCH 2/7] Add GFNI/AVX2 implementation of Camellia In-Reply-To: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> References: <20220424184025.2202396-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184025.2202396-2-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add "camellia-gfni-avx2-amd64.S". * cipher/camellia-aesni-avx2-amd64.h [CAMELLIA_GFNI_BUILD]: Add GFNI support. * cipher/camellia-gfni-avx2-amd64.S: New. * cipher/camellia-glue.c (USE_GFNI_AVX2): New. (CAMELLIA_context) [USE_AESNI_AVX2]: New member "use_gfni_avx2". 
[USE_GFNI_AVX2] (_gcry_camellia_gfni_avx2_ctr_enc) (_gcry_camellia_gfni_avx2_cbc_dec, _gcry_camellia_gfni_avx2_cfb_dec) (_gcry_camellia_gfni_avx2_ocb_enc, _gcry_camellia_gfni_avx2_ocb_dec) (_gcry_camellia_gfni_avx2_ocb_auth): New. (camellia_setkey) [USE_GFNI_AVX2]: Enable GFNI if supported by HW. (_gcry_camellia_ctr_enc) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_cbc_dec) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_cfb_dec) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_ocb_crypt) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_ocb_auth) [USE_GFNI_AVX2]: Add GFNI support. * configure.ac: Add "camellia-gfni-avx2-amd64.lo". -- Benchmark on Intel Core i3-1115G4 (tigerlake): Before (VAES/AVX2 implementation): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.579 ns/B 1646 MiB/s 2.37 c/B 4090 CFB dec | 0.579 ns/B 1648 MiB/s 2.37 c/B 4089 CTR enc | 0.586 ns/B 1628 MiB/s 2.40 c/B 4090 CTR dec | 0.587 ns/B 1626 MiB/s 2.40 c/B 4090 OCB enc | 0.607 ns/B 1570 MiB/s 2.48 c/B 4089 OCB dec | 0.611 ns/B 1561 MiB/s 2.50 c/B 4089 OCB auth | 0.602 ns/B 1585 MiB/s 2.46 c/B 4089 After (~80% faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.299 ns/B 3186 MiB/s 1.22 c/B 4090 CFB dec | 0.314 ns/B 3039 MiB/s 1.28 c/B 4089 CTR enc | 0.322 ns/B 2962 MiB/s 1.32 c/B 4090 CTR dec | 0.321 ns/B 2970 MiB/s 1.31 c/B 4090 OCB enc | 0.339 ns/B 2817 MiB/s 1.38 c/B 4089 OCB dec | 0.346 ns/B 2756 MiB/s 1.41 c/B 4089 OCB auth | 0.337 ns/B 2831 MiB/s 1.38 c/B 4089 Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 5 +- cipher/camellia-aesni-avx2-amd64.h | 249 ++++++++++++++++++++++++++++- cipher/camellia-gfni-avx2-amd64.S | 34 ++++ cipher/camellia-glue.c | 170 +++++++++++++------- configure.ac | 3 + 5 files changed, 398 insertions(+), 63 deletions(-) create mode 100644 cipher/camellia-gfni-avx2-amd64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 07e5ba26..7a429e8b 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -139,8 +139,9 @@ EXTRA_libcipher_la_SOURCES = \ twofish-avx2-amd64.S \ rfc2268.c \ camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \ - camellia-aesni-avx2-amd64.h camellia-vaes-avx2-amd64.S \ - camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S \ + camellia-aesni-avx2-amd64.h camellia-gfni-avx2-amd64.S \ + camellia-vaes-avx2-amd64.S camellia-aesni-avx2-amd64.S \ + camellia-arm.S camellia-aarch64.S \ blake2.c \ blake2b-amd64-avx2.S blake2s-amd64-avx.S diff --git a/cipher/camellia-aesni-avx2-amd64.h b/cipher/camellia-aesni-avx2-amd64.h index e93c40b8..8cd4b1cd 100644 --- a/cipher/camellia-aesni-avx2-amd64.h +++ b/cipher/camellia-aesni-avx2-amd64.h @@ -1,6 +1,6 @@ -/* camellia-aesni-avx2-amd64.h - AES-NI/VAES/AVX2 implementation of Camellia +/* camellia-aesni-avx2-amd64.h - AES-NI/VAES/GFNI/AVX2 implementation of Camellia * - * Copyright (C) 2013-2015,2020-2021 Jussi Kivilinna + * Copyright (C) 2013-2015,2020-2022 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -36,6 +36,8 @@ /********************************************************************** helper macros **********************************************************************/ + +#ifndef CAMELLIA_GFNI_BUILD #define filter_8bit(x, lo_t, hi_t, mask4bit, tmp0) \ vpand x, mask4bit, tmp0; \ vpandn x, mask4bit, x; \ @@ -44,6 +46,7 @@ vpshufb tmp0, lo_t, tmp0; \ vpshufb x, hi_t, x; \ vpxor tmp0, x, x; +#endif #define ymm0_x xmm0 #define ymm1_x xmm1 @@ -70,11 +73,61 @@ # define IF_VAES(...) 
#endif +/********************************************************************** + GFNI helper macros and constants + **********************************************************************/ + +#ifdef CAMELLIA_GFNI_BUILD + +#define BV8(a0,a1,a2,a3,a4,a5,a6,a7) \ + ( (((a0) & 1) << 0) | \ + (((a1) & 1) << 1) | \ + (((a2) & 1) << 2) | \ + (((a3) & 1) << 3) | \ + (((a4) & 1) << 4) | \ + (((a5) & 1) << 5) | \ + (((a6) & 1) << 6) | \ + (((a7) & 1) << 7) ) + +#define BM8X8(l0,l1,l2,l3,l4,l5,l6,l7) \ + ( ((l7) << (0 * 8)) | \ + ((l6) << (1 * 8)) | \ + ((l5) << (2 * 8)) | \ + ((l4) << (3 * 8)) | \ + ((l3) << (4 * 8)) | \ + ((l2) << (5 * 8)) | \ + ((l1) << (6 * 8)) | \ + ((l0) << (7 * 8)) ) + +/* Pre-filters and post-filters constants for Camellia sboxes s1, s2, s3 and s4. + * See http://urn.fi/URN:NBN:fi:oulu-201305311409, pages 43-48. + * + * Pre-filters are directly from above source, "??"/"??". Post-filters are + * combination of function "A" (AES SubBytes affine transformation) and + * "??"/"??"/"??". + */ + +/* Constant from "??(x)" and "??(x)" functions. */ +#define pre_filter_constant_s1234 BV8(1, 0, 1, 0, 0, 0, 1, 0) + +/* Constant from "??(A(x))" function: */ +#define post_filter_constant_s14 BV8(0, 1, 1, 1, 0, 1, 1, 0) + +/* Constant from "??(A(x))" function: */ +#define post_filter_constant_s2 BV8(0, 0, 1, 1, 1, 0, 1, 1) + +/* Constant from "??(A(x))" function: */ +#define post_filter_constant_s3 BV8(1, 1, 1, 0, 1, 1, 0, 0) + +#endif /* CAMELLIA_GFNI_BUILD */ + /********************************************************************** 32-way camellia **********************************************************************/ -/* +#ifdef CAMELLIA_GFNI_BUILD + +/* roundsm32 (GFNI version) * IN: * x0..x7: byte-sliced AB state * mem_cd: register pointer storing CD state @@ -82,7 +135,119 @@ * OUT: * x0..x7: new byte-sliced CD state */ +#define roundsm32(x0, x1, x2, x3, x4, x5, x6, x7, t0, t1, t2, t3, t4, t5, \ + t6, t7, mem_cd, key) \ + /* \ + * S-function with AES subbytes \ + */ \ + vpbroadcastq .Lpre_filter_bitmatrix_s123 rRIP, t5; \ + vpbroadcastq .Lpre_filter_bitmatrix_s4 rRIP, t2; \ + vpbroadcastq .Lpost_filter_bitmatrix_s14 rRIP, t4; \ + vpbroadcastq .Lpost_filter_bitmatrix_s2 rRIP, t3; \ + vpbroadcastq .Lpost_filter_bitmatrix_s3 rRIP, t6; \ + vpxor t7##_x, t7##_x, t7##_x; \ + vpbroadcastq key, t0; /* higher 64-bit duplicate ignored */ \ + \ + /* prefilter sboxes */ \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x0, x0; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x7, x7; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t2, x3, x3; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t2, x6, x6; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x2, x2; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x5, x5; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x1, x1; \ + vgf2p8affineqb $(pre_filter_constant_s1234), t5, x4, x4; \ + \ + /* sbox GF8 inverse + postfilter sboxes 1 and 4 */ \ + vgf2p8affineinvqb $(post_filter_constant_s14), t4, x0, x0; \ + vgf2p8affineinvqb $(post_filter_constant_s14), t4, x7, x7; \ + vgf2p8affineinvqb $(post_filter_constant_s14), t4, x3, x3; \ + vgf2p8affineinvqb $(post_filter_constant_s14), t4, x6, x6; \ + \ + /* sbox GF8 inverse + postfilter sbox 3 */ \ + vgf2p8affineinvqb $(post_filter_constant_s3), t6, x2, x2; \ + vgf2p8affineinvqb $(post_filter_constant_s3), t6, x5, x5; \ + \ + /* sbox GF8 inverse + postfilter sbox 2 */ \ + vgf2p8affineinvqb $(post_filter_constant_s2), t3, x1, x1; \ + vgf2p8affineinvqb $(post_filter_constant_s2), t3, x4, x4; 
\ + \ + vpsrldq $1, t0, t1; \ + vpsrldq $2, t0, t2; \ + vpshufb t7, t1, t1; \ + vpsrldq $3, t0, t3; \ + \ + /* P-function */ \ + vpxor x5, x0, x0; \ + vpxor x6, x1, x1; \ + vpxor x7, x2, x2; \ + vpxor x4, x3, x3; \ + \ + vpshufb t7, t2, t2; \ + vpsrldq $4, t0, t4; \ + vpshufb t7, t3, t3; \ + vpsrldq $5, t0, t5; \ + vpshufb t7, t4, t4; \ + \ + vpxor x2, x4, x4; \ + vpxor x3, x5, x5; \ + vpxor x0, x6, x6; \ + vpxor x1, x7, x7; \ + \ + vpsrldq $6, t0, t6; \ + vpshufb t7, t5, t5; \ + vpshufb t7, t6, t6; \ + \ + vpxor x7, x0, x0; \ + vpxor x4, x1, x1; \ + vpxor x5, x2, x2; \ + vpxor x6, x3, x3; \ + \ + vpxor x3, x4, x4; \ + vpxor x0, x5, x5; \ + vpxor x1, x6, x6; \ + vpxor x2, x7, x7; /* note: high and low parts swapped */ \ + \ + /* Add key material and result to CD (x becomes new CD) */ \ + \ + vpxor t6, x1, x1; \ + vpxor 5 * 32(mem_cd), x1, x1; \ + \ + vpsrldq $7, t0, t6; \ + vpshufb t7, t0, t0; \ + vpshufb t7, t6, t7; \ + \ + vpxor t7, x0, x0; \ + vpxor 4 * 32(mem_cd), x0, x0; \ + \ + vpxor t5, x2, x2; \ + vpxor 6 * 32(mem_cd), x2, x2; \ + \ + vpxor t4, x3, x3; \ + vpxor 7 * 32(mem_cd), x3, x3; \ + \ + vpxor t3, x4, x4; \ + vpxor 0 * 32(mem_cd), x4, x4; \ + \ + vpxor t2, x5, x5; \ + vpxor 1 * 32(mem_cd), x5, x5; \ + \ + vpxor t1, x6, x6; \ + vpxor 2 * 32(mem_cd), x6, x6; \ + \ + vpxor t0, x7, x7; \ + vpxor 3 * 32(mem_cd), x7, x7; +#else /* CAMELLIA_GFNI_BUILD */ + +/* roundsm32 (AES-NI / VAES version) + * IN: + * x0..x7: byte-sliced AB state + * mem_cd: register pointer storing CD state + * key: index for key material + * OUT: + * x0..x7: new byte-sliced CD state + */ #define roundsm32(x0, x1, x2, x3, x4, x5, x6, x7, t0, t1, t2, t3, t4, t5, \ t6, t7, mem_cd, key) \ /* \ @@ -181,7 +346,7 @@ /* postfilter sbox 2 */ \ filter_8bit(x1, t4, t5, t7, t2); \ filter_8bit(x4, t4, t5, t7, t2); \ - vpxor t7, t7, t7; \ + vpxor t7##_x, t7##_x, t7##_x; \ \ vpsrldq $1, t0, t1; \ vpsrldq $2, t0, t2; \ @@ -249,6 +414,8 @@ vpxor t0, x7, x7; \ vpxor 3 * 32(mem_cd), x7, x7; +#endif /* CAMELLIA_GFNI_BUILD */ + /* * IN/OUT: * x0..x7: byte-sliced AB state preloaded @@ -623,6 +790,9 @@ #define SHUFB_BYTES(idx) \ 0 + (idx), 4 + (idx), 8 + (idx), 12 + (idx) +FUNC_NAME(_constants): +ELF(.type FUNC_NAME(_constants), at object;) + .Lshufb_16x16b: .byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3) .byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3) @@ -635,6 +805,74 @@ .Lbswap128_mask: .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +#ifdef CAMELLIA_GFNI_BUILD + +/* Pre-filters and post-filters bit-matrixes for Camellia sboxes s1, s2, s3 + * and s4. + * See http://urn.fi/URN:NBN:fi:oulu-201305311409, pages 43-48. + * + * Pre-filters are directly from above source, "??"/"??". Post-filters are + * combination of function "A" (AES SubBytes affine transformation) and + * "??"/"??"/"??". 
+ */ + +/* Bit-matrix from "??(x)" function: */ +.Lpre_filter_bitmatrix_s123: + .quad BM8X8(BV8(1, 1, 1, 0, 1, 1, 0, 1), + BV8(0, 0, 1, 1, 0, 0, 1, 0), + BV8(1, 1, 0, 1, 0, 0, 0, 0), + BV8(1, 0, 1, 1, 0, 0, 1, 1), + BV8(0, 0, 0, 0, 1, 1, 0, 0), + BV8(1, 0, 1, 0, 0, 1, 0, 0), + BV8(0, 0, 1, 0, 1, 1, 0, 0), + BV8(1, 0, 0, 0, 0, 1, 1, 0)) + +/* Bit-matrix from "??(x)" function: */ +.Lpre_filter_bitmatrix_s4: + .quad BM8X8(BV8(1, 1, 0, 1, 1, 0, 1, 1), + BV8(0, 1, 1, 0, 0, 1, 0, 0), + BV8(1, 0, 1, 0, 0, 0, 0, 1), + BV8(0, 1, 1, 0, 0, 1, 1, 1), + BV8(0, 0, 0, 1, 1, 0, 0, 0), + BV8(0, 1, 0, 0, 1, 0, 0, 1), + BV8(0, 1, 0, 1, 1, 0, 0, 0), + BV8(0, 0, 0, 0, 1, 1, 0, 1)) + +/* Bit-matrix from "??(A(x))" function: */ +.Lpost_filter_bitmatrix_s14: + .quad BM8X8(BV8(0, 0, 0, 0, 0, 0, 0, 1), + BV8(0, 1, 1, 0, 0, 1, 1, 0), + BV8(1, 0, 1, 1, 1, 1, 1, 0), + BV8(0, 0, 0, 1, 1, 0, 1, 1), + BV8(1, 0, 0, 0, 1, 1, 1, 0), + BV8(0, 1, 0, 1, 1, 1, 1, 0), + BV8(0, 1, 1, 1, 1, 1, 1, 1), + BV8(0, 0, 0, 1, 1, 1, 0, 0)) + +/* Bit-matrix from "??(A(x))" function: */ +.Lpost_filter_bitmatrix_s2: + .quad BM8X8(BV8(0, 0, 0, 1, 1, 1, 0, 0), + BV8(0, 0, 0, 0, 0, 0, 0, 1), + BV8(0, 1, 1, 0, 0, 1, 1, 0), + BV8(1, 0, 1, 1, 1, 1, 1, 0), + BV8(0, 0, 0, 1, 1, 0, 1, 1), + BV8(1, 0, 0, 0, 1, 1, 1, 0), + BV8(0, 1, 0, 1, 1, 1, 1, 0), + BV8(0, 1, 1, 1, 1, 1, 1, 1)) + +/* Bit-matrix from "??(A(x))" function: */ +.Lpost_filter_bitmatrix_s3: + .quad BM8X8(BV8(0, 1, 1, 0, 0, 1, 1, 0), + BV8(1, 0, 1, 1, 1, 1, 1, 0), + BV8(0, 0, 0, 1, 1, 0, 1, 1), + BV8(1, 0, 0, 0, 1, 1, 1, 0), + BV8(0, 1, 0, 1, 1, 1, 1, 0), + BV8(0, 1, 1, 1, 1, 1, 1, 1), + BV8(0, 0, 0, 1, 1, 1, 0, 0), + BV8(0, 0, 0, 0, 0, 0, 0, 1)) + +#else /* CAMELLIA_GFNI_BUILD */ + /* * pre-SubByte transform * @@ -756,6 +994,9 @@ .L0f0f0f0f: .long 0x0f0f0f0f +#endif /* CAMELLIA_GFNI_BUILD */ + +ELF(.size FUNC_NAME(_constants),.-FUNC_NAME(_constants);) .align 8 ELF(.type __camellia_enc_blk32, at function;) diff --git a/cipher/camellia-gfni-avx2-amd64.S b/cipher/camellia-gfni-avx2-amd64.S new file mode 100644 index 00000000..20c9a432 --- /dev/null +++ b/cipher/camellia-gfni-avx2-amd64.S @@ -0,0 +1,34 @@ +/* camellia-vaes-avx2-amd64.S - GFNI/AVX2 implementation of Camellia cipher + * + * Copyright (C) 2022 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#ifdef __x86_64 +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) + +#define CAMELLIA_GFNI_BUILD 1 +#define FUNC_NAME(func) _gcry_camellia_gfni_avx2_ ## func + +#include "camellia-aesni-avx2-amd64.h" + +#endif /* defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) */ +#endif /* __x86_64 */ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 72c02d77..7f009db4 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -97,6 +97,12 @@ # define USE_VAES_AVX2 1 #endif +/* USE_GFNI_AVX2 inidicates whether to compile with Intel GFNI/AVX2 code. */ +#undef USE_GFNI_AVX2 +#if defined(USE_AESNI_AVX2) && defined(ENABLE_GFNI_SUPPORT) +# define USE_GFNI_AVX2 1 +#endif + typedef struct { KEY_TABLE_TYPE keytable; @@ -107,6 +113,7 @@ typedef struct #ifdef USE_AESNI_AVX2 unsigned int use_aesni_avx2:1;/* AES-NI/AVX2 implementation shall be used. */ unsigned int use_vaes_avx2:1; /* VAES/AVX2 implementation shall be used. */ + unsigned int use_gfni_avx2:1; /* GFNI/AVX2 implementation shall be used. */ #endif /*USE_AESNI_AVX2*/ } CAMELLIA_context; @@ -248,6 +255,46 @@ extern void _gcry_camellia_vaes_avx2_ocb_auth(CAMELLIA_context *ctx, const u64 Ls[32]) ASM_FUNC_ABI; #endif +#ifdef USE_GFNI_AVX2 +/* Assembler implementations of Camellia using GFNI and AVX2. Process data + in 32 block same time. + */ +extern void _gcry_camellia_gfni_avx2_ctr_enc(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *ctr) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_cbc_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_cfb_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_ocb_enc(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[32]) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_ocb_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[32]) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx2_ocb_auth(CAMELLIA_context *ctx, + const unsigned char *abuf, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[32]) ASM_FUNC_ABI; +#endif + static const char *selftest(void); static void _gcry_camellia_ctr_enc (void *context, unsigned char *ctr, @@ -272,7 +319,8 @@ camellia_setkey(void *c, const byte *key, unsigned keylen, CAMELLIA_context *ctx=c; static int initialized=0; static const char *selftest_failed=NULL; -#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) || defined(USE_VAES_AVX2) +#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) \ + || defined(USE_VAES_AVX2) || defined(USE_GFNI_AVX2) unsigned int hwf = _gcry_get_hw_features (); #endif @@ -296,10 +344,14 @@ camellia_setkey(void *c, const byte *key, unsigned keylen, #ifdef USE_AESNI_AVX2 ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); ctx->use_vaes_avx2 = 0; + ctx->use_gfni_avx2 = 0; #endif #ifdef USE_VAES_AVX2 ctx->use_vaes_avx2 = (hwf & HWF_INTEL_VAES_VPCLMUL) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_GFNI_AVX2 + ctx->use_gfni_avx2 = (hwf & HWF_INTEL_GFNI) && (hwf & HWF_INTEL_AVX2); +#endif ctx->keybitlength=keylen*8; @@ -440,20 
+492,22 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, if (ctx->use_aesni_avx2) { int did_use_aesni_avx2 = 0; + typeof (&_gcry_camellia_aesni_avx2_ctr_enc) bulk_ctr_fn = + _gcry_camellia_aesni_avx2_ctr_enc; + #ifdef USE_VAES_AVX2 - int use_vaes = ctx->use_vaes_avx2; + if (ctx->use_vaes_avx2) + bulk_ctr_fn =_gcry_camellia_vaes_avx2_ctr_enc; +#endif +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + bulk_ctr_fn =_gcry_camellia_gfni_avx2_ctr_enc; #endif /* Process data in 32 block chunks. */ while (nblocks >= 32) { -#ifdef USE_VAES_AVX2 - if (use_vaes) - _gcry_camellia_vaes_avx2_ctr_enc(ctx, outbuf, inbuf, ctr); - else -#endif - _gcry_camellia_aesni_avx2_ctr_enc(ctx, outbuf, inbuf, ctr); - + bulk_ctr_fn (ctx, outbuf, inbuf, ctr); nblocks -= 32; outbuf += 32 * CAMELLIA_BLOCK_SIZE; inbuf += 32 * CAMELLIA_BLOCK_SIZE; @@ -537,20 +591,22 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, if (ctx->use_aesni_avx2) { int did_use_aesni_avx2 = 0; + typeof (&_gcry_camellia_aesni_avx2_cbc_dec) bulk_cbc_fn = + _gcry_camellia_aesni_avx2_cbc_dec; + #ifdef USE_VAES_AVX2 - int use_vaes = ctx->use_vaes_avx2; + if (ctx->use_vaes_avx2) + bulk_cbc_fn =_gcry_camellia_vaes_avx2_cbc_dec; +#endif +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + bulk_cbc_fn =_gcry_camellia_gfni_avx2_cbc_dec; #endif /* Process data in 32 block chunks. */ while (nblocks >= 32) { -#ifdef USE_VAES_AVX2 - if (use_vaes) - _gcry_camellia_vaes_avx2_cbc_dec(ctx, outbuf, inbuf, iv); - else -#endif - _gcry_camellia_aesni_avx2_cbc_dec(ctx, outbuf, inbuf, iv); - + bulk_cbc_fn (ctx, outbuf, inbuf, iv); nblocks -= 32; outbuf += 32 * CAMELLIA_BLOCK_SIZE; inbuf += 32 * CAMELLIA_BLOCK_SIZE; @@ -631,20 +687,22 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, if (ctx->use_aesni_avx2) { int did_use_aesni_avx2 = 0; + typeof (&_gcry_camellia_aesni_avx2_cfb_dec) bulk_cfb_fn = + _gcry_camellia_aesni_avx2_cfb_dec; + #ifdef USE_VAES_AVX2 - int use_vaes = ctx->use_vaes_avx2; + if (ctx->use_vaes_avx2) + bulk_cfb_fn =_gcry_camellia_vaes_avx2_cfb_dec; +#endif +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + bulk_cfb_fn =_gcry_camellia_gfni_avx2_cfb_dec; #endif /* Process data in 32 block chunks. */ while (nblocks >= 32) { -#ifdef USE_VAES_AVX2 - if (use_vaes) - _gcry_camellia_vaes_avx2_cfb_dec(ctx, outbuf, inbuf, iv); - else -#endif - _gcry_camellia_aesni_avx2_cfb_dec(ctx, outbuf, inbuf, iv); - + bulk_cfb_fn (ctx, outbuf, inbuf, iv); nblocks -= 32; outbuf += 32 * CAMELLIA_BLOCK_SIZE; inbuf += 32 * CAMELLIA_BLOCK_SIZE; @@ -729,10 +787,6 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (ctx->use_aesni_avx2) { int did_use_aesni_avx2 = 0; -#ifdef USE_VAES_AVX2 - int encrypt_use_vaes = encrypt && ctx->use_vaes_avx2; - int decrypt_use_vaes = !encrypt && ctx->use_vaes_avx2; -#endif u64 Ls[32]; unsigned int n = 32 - (blkn % 32); u64 *l; @@ -740,6 +794,21 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (nblocks >= 32) { + typeof (&_gcry_camellia_aesni_avx2_ocb_dec) bulk_ocb_fn = + encrypt ? _gcry_camellia_aesni_avx2_ocb_enc + : _gcry_camellia_aesni_avx2_ocb_dec; + +#ifdef USE_VAES_AVX2 + if (ctx->use_vaes_avx2) + bulk_ocb_fn = encrypt ? _gcry_camellia_vaes_avx2_ocb_enc + : _gcry_camellia_vaes_avx2_ocb_dec; +#endif +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + bulk_ocb_fn = encrypt ? 
_gcry_camellia_gfni_avx2_ocb_enc + : _gcry_camellia_gfni_avx2_ocb_dec; +#endif + for (i = 0; i < 32; i += 8) { /* Use u64 to store pointers for x32 support (assembly function @@ -764,21 +833,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, blkn += 32; *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 32); - if (0) {} -#ifdef USE_VAES_AVX2 - else if (encrypt_use_vaes) - _gcry_camellia_vaes_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); - else if (decrypt_use_vaes) - _gcry_camellia_vaes_avx2_ocb_dec(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); -#endif - else if (encrypt) - _gcry_camellia_aesni_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); - else - _gcry_camellia_aesni_avx2_ocb_dec(ctx, outbuf, inbuf, c->u_iv.iv, - c->u_ctr.ctr, Ls); + bulk_ocb_fn (ctx, outbuf, inbuf, c->u_iv.iv, c->u_ctr.ctr, Ls); nblocks -= 32; outbuf += 32 * CAMELLIA_BLOCK_SIZE; @@ -891,9 +946,6 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, if (ctx->use_aesni_avx2) { int did_use_aesni_avx2 = 0; -#ifdef USE_VAES_AVX2 - int use_vaes = ctx->use_vaes_avx2; -#endif u64 Ls[32]; unsigned int n = 32 - (blkn % 32); u64 *l; @@ -901,6 +953,18 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, if (nblocks >= 32) { + typeof (&_gcry_camellia_aesni_avx2_ocb_auth) bulk_auth_fn = + _gcry_camellia_aesni_avx2_ocb_auth; + +#ifdef USE_VAES_AVX2 + if (ctx->use_vaes_avx2) + bulk_auth_fn = _gcry_camellia_vaes_avx2_ocb_auth; +#endif +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + bulk_auth_fn = _gcry_camellia_gfni_avx2_ocb_auth; +#endif + for (i = 0; i < 32; i += 8) { /* Use u64 to store pointers for x32 support (assembly function @@ -925,16 +989,8 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, blkn += 32; *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 32); -#ifdef USE_VAES_AVX2 - if (use_vaes) - _gcry_camellia_vaes_avx2_ocb_auth(ctx, abuf, - c->u_mode.ocb.aad_offset, - c->u_mode.ocb.aad_sum, Ls); - else -#endif - _gcry_camellia_aesni_avx2_ocb_auth(ctx, abuf, - c->u_mode.ocb.aad_offset, - c->u_mode.ocb.aad_sum, Ls); + bulk_auth_fn (ctx, abuf, c->u_mode.ocb.aad_offset, + c->u_mode.ocb.aad_sum, Ls); nblocks -= 32; abuf += 32 * CAMELLIA_BLOCK_SIZE; diff --git a/configure.ac b/configure.ac index 15c92018..c5d61657 100644 --- a/configure.ac +++ b/configure.ac @@ -2755,6 +2755,9 @@ if test "$found" = "1" ; then # Build with the VAES/AVX2 implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS camellia-vaes-avx2-amd64.lo" + + # Build with the GFNI/AVX2 implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS camellia-gfni-avx2-amd64.lo" fi fi fi -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:47:01 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:47:01 +0300 Subject: [PATCH 1/3] sm4: add XTS bulk processing Message-ID: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> * cipher/sm4.c (_gcry_sm4_xts_crypt): New. (sm4_setkey): Set XTS bulk function. 
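For reference, the new bulk XTS path can be exercised end-to-end through the
public API. Below is a minimal sketch, not part of this patch: the key, tweak
and data contents are arbitrary placeholders, and the 256-byte length is only
chosen so that full blocks of input reach the bulk code.

/* test-sm4-xts.c -- build with: cc test-sm4-xts.c `libgcrypt-config --cflags --libs` */
#include <stdio.h>
#include <gcrypt.h>

int
main (void)
{
  unsigned char key[32] = { 0 };    /* XTS takes a double-length key (2 x 16 for SM4).  */
  unsigned char tweak[16] = { 0 };  /* 128-bit tweak ("sector number").  */
  unsigned char buf[256] = { 0 };   /* 16 blocks of input.  */
  unsigned char out[256];
  gcry_cipher_hd_t hd;
  gcry_error_t err;

  gcry_check_version (NULL);
  gcry_control (GCRYCTL_DISABLE_SECMEM, 0);
  gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);

  err = gcry_cipher_open (&hd, GCRY_CIPHER_SM4, GCRY_CIPHER_MODE_XTS, 0);
  if (err)
    {
      fprintf (stderr, "open: %s\n", gcry_strerror (err));
      return 1;
    }

  err = gcry_cipher_setkey (hd, key, sizeof key);
  if (!err)
    err = gcry_cipher_setiv (hd, tweak, sizeof tweak);
  if (!err)
    err = gcry_cipher_encrypt (hd, out, sizeof out, buf, sizeof buf);
  if (err)
    fprintf (stderr, "SM4-XTS: %s\n", gcry_strerror (err));

  gcry_cipher_close (hd);
  return err ? 1 : 0;
}
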
-- Benchmark on Ryzen 5800X: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 7.28 ns/B 131.0 MiB/s 35.31 c/B 4850 XTS dec | 7.29 ns/B 130.9 MiB/s 35.34 c/B 4850 After (4.8x faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 1.49 ns/B 638.6 MiB/s 7.24 c/B 4850 XTS dec | 1.49 ns/B 639.3 MiB/s 7.24 c/B 4850 Signed-off-by: Jussi Kivilinna --- cipher/sm4.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/cipher/sm4.c b/cipher/sm4.c index 4815b184..600850e2 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -97,6 +97,9 @@ static void _gcry_sm4_cbc_dec (void *context, unsigned char *iv, static void _gcry_sm4_cfb_dec (void *context, unsigned char *iv, void *outbuf_arg, const void *inbuf_arg, size_t nblocks); +static void _gcry_sm4_xts_crypt (void *context, unsigned char *tweak, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int encrypt); static size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt); @@ -492,6 +495,7 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, bulk_ops->cbc_dec = _gcry_sm4_cbc_dec; bulk_ops->cfb_dec = _gcry_sm4_cfb_dec; bulk_ops->ctr_enc = _gcry_sm4_ctr_enc; + bulk_ops->xts_crypt = _gcry_sm4_xts_crypt; bulk_ops->ocb_crypt = _gcry_sm4_ocb_crypt; bulk_ops->ocb_auth = _gcry_sm4_ocb_auth; @@ -954,6 +958,37 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, _gcry_burn_stack(burn_stack_depth); } +/* Bulk encryption/decryption of complete blocks in XTS mode. */ +static void +_gcry_sm4_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, int encrypt) +{ + SM4_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + int burn_stack_depth = 0; + + /* Process remaining blocks. */ + if (nblocks) + { + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); + u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; + unsigned char tmpbuf[16 * 8]; + unsigned int tmp_used = 16; + size_t nburn; + + nburn = bulk_xts_crypt_128(rk, crypt_blk1_8, outbuf, inbuf, nblocks, + tweak, tmpbuf, sizeof(tmpbuf) / 16, + &tmp_used); + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; + + wipememory(tmpbuf, tmp_used); + } + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); +} + /* Bulk encryption/decryption of complete blocks in OCB mode. */ static size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:47:03 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:47:03 +0300 Subject: [PATCH 3/3] sm4-aesni-avx2: add generic 1 to 16 block bulk processing function In-Reply-To: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> References: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184703.2215215-3-jussi.kivilinna@iki.fi> * cipher/sm4-aesni-avx2-amd64.S: Remove unnecessary vzeroupper at function entries. (_gcry_sm4_aesni_avx2_crypt_blk1_16): New. * cipher/sm4.c (_gcry_sm4_aesni_avx2_crypt_blk1_16) (sm4_aesni_avx2_crypt_blk1_16): New. (sm4_get_crypt_blk1_16_fn) [USE_AESNI_AVX2]: Add 'sm4_aesni_avx2_crypt_blk1_16'. 
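The selection logic is perhaps easiest to see in plain C. The sketch below is
an illustration only, not code from this patch: the backend functions are
no-op stand-ins for the real assembly (a narrow path for 1..8 blocks using
128-bit registers, a wide path for up to 16 blocks using 256-bit registers),
and all names in it are made up. It shows the single 1..16 block entry point
that the generic bulk helpers call, plus the chunking loop that feeds it at
most 16 blocks per call.

/* Stand-alone illustration (not code from this patch). */
#include <stddef.h>
#include <string.h>

typedef unsigned int (*crypt_blk1_16_fn_t) (const void *rk, unsigned char *out,
                                            const unsigned char *in,
                                            unsigned int num_blks);

static unsigned int
narrow_blk1_8 (const void *rk, unsigned char *out, const unsigned char *in,
               unsigned int num_blks)
{
  (void)rk;
  memcpy (out, in, (size_t)num_blks * 16);   /* placeholder for the cipher */
  return 0;                                  /* would be stack burn depth */
}

static unsigned int
wide_blk1_16 (const void *rk, unsigned char *out, const unsigned char *in,
              unsigned int num_blks)
{
  (void)rk;
  memcpy (out, in, (size_t)num_blks * 16);   /* placeholder for the cipher */
  return 0;
}

/* Single 1..16 block entry point: short inputs avoid the wide path, the
 * same way sm4_aesni_avx2_crypt_blk1_16() falls back to the AVX code. */
static unsigned int
crypt_blk1_16 (const void *rk, unsigned char *out, const unsigned char *in,
               unsigned int num_blks)
{
  if (num_blks <= 8)
    return narrow_blk1_8 (rk, out, in, num_blks);
  return wide_blk1_16 (rk, out, in, num_blks);
}

/* What a generic bulk helper does with it: at most 16 blocks per call. */
static void
crypt_bulk (const void *rk, crypt_blk1_16_fn_t fn, unsigned char *out,
            const unsigned char *in, size_t nblocks)
{
  while (nblocks)
    {
      unsigned int curr = nblocks > 16 ? 16 : (unsigned int)nblocks;

      fn (rk, out, in, curr);
      out += curr * 16;
      in += curr * 16;
      nblocks -= curr;
    }
}

int
main (void)
{
  unsigned char in[37 * 16] = { 0 }, out[37 * 16];

  crypt_bulk (NULL, crypt_blk1_16, out, in, 37);   /* 16 + 16 + 5 blocks */
  return 0;
}
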
-- Benchmark AMD Ryzen 5800X: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 1.48 ns/B 643.2 MiB/s 7.19 c/B 4850 XTS dec | 1.48 ns/B 644.3 MiB/s 7.18 c/B 4850 After (1.37x faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 1.07 ns/B 888.7 MiB/s 5.21 c/B 4850 XTS dec | 1.07 ns/B 889.4 MiB/s 5.20 c/B 4850 Signed-off-by: Jussi Kivilinna --- cipher/sm4-aesni-avx2-amd64.S | 82 +++++++++++++++++++++++++++++------ cipher/sm4.c | 26 +++++++++++ 2 files changed, 95 insertions(+), 13 deletions(-) diff --git a/cipher/sm4-aesni-avx2-amd64.S b/cipher/sm4-aesni-avx2-amd64.S index effe590b..e09fed8f 100644 --- a/cipher/sm4-aesni-avx2-amd64.S +++ b/cipher/sm4-aesni-avx2-amd64.S @@ -1,6 +1,6 @@ /* sm4-avx2-amd64.S - AVX2 implementation of SM4 cipher * - * Copyright (C) 2020 Jussi Kivilinna + * Copyright (C) 2020, 2022 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -45,11 +45,19 @@ #define RA1 %ymm9 #define RA2 %ymm10 #define RA3 %ymm11 +#define RA0x %xmm8 +#define RA1x %xmm9 +#define RA2x %xmm10 +#define RA3x %xmm11 #define RB0 %ymm12 #define RB1 %ymm13 #define RB2 %ymm14 #define RB3 %ymm15 +#define RB0x %xmm12 +#define RB1x %xmm13 +#define RB2x %xmm14 +#define RB3x %xmm15 #define RNOT %ymm0 #define RBSWAP %ymm1 @@ -280,6 +288,66 @@ __sm4_crypt_blk16: CFI_ENDPROC(); ELF(.size __sm4_crypt_blk16,.-__sm4_crypt_blk16;) +.align 8 +.globl _gcry_sm4_aesni_avx2_crypt_blk1_16 +ELF(.type _gcry_sm4_aesni_avx2_crypt_blk1_16, at function;) +_gcry_sm4_aesni_avx2_crypt_blk1_16: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..16 blocks) + * %rdx: src (1..16 blocks) + * %rcx: num blocks (1..16) + */ + CFI_STARTPROC(); + +#define LOAD_INPUT(offset, yreg) \ + cmpq $(1 + 2 * (offset)), %rcx; \ + jb .Lblk16_load_input_done; \ + ja 1f; \ + vmovdqu (offset) * 32(%rdx), yreg##x; \ + jmp .Lblk16_load_input_done; \ + 1: \ + vmovdqu (offset) * 32(%rdx), yreg; + + LOAD_INPUT(0, RA0); + LOAD_INPUT(1, RA1); + LOAD_INPUT(2, RA2); + LOAD_INPUT(3, RA3); + LOAD_INPUT(4, RB0); + LOAD_INPUT(5, RB1); + LOAD_INPUT(6, RB2); + LOAD_INPUT(7, RB3); +#undef LOAD_INPUT + +.Lblk16_load_input_done: + call __sm4_crypt_blk16; + +#define STORE_OUTPUT(yreg, offset) \ + cmpq $(1 + 2 * (offset)), %rcx; \ + jb .Lblk16_store_output_done; \ + ja 1f; \ + vmovdqu yreg##x, (offset) * 32(%rsi); \ + jmp .Lblk16_store_output_done; \ + 1: \ + vmovdqu yreg, (offset) * 32(%rsi); + + STORE_OUTPUT(RA0, 0); + STORE_OUTPUT(RA1, 1); + STORE_OUTPUT(RA2, 2); + STORE_OUTPUT(RA3, 3); + STORE_OUTPUT(RB0, 4); + STORE_OUTPUT(RB1, 5); + STORE_OUTPUT(RB2, 6); + STORE_OUTPUT(RB3, 7); +#undef STORE_OUTPUT + +.Lblk16_store_output_done: + vzeroall; + xorl %eax, %eax; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_crypt_blk1_16,.-_gcry_sm4_aesni_avx2_crypt_blk1_16;) + #define inc_le128(x, minus_one, tmp) \ vpcmpeqq minus_one, x, tmp; \ vpsubq minus_one, x, x; \ @@ -301,8 +369,6 @@ _gcry_sm4_aesni_avx2_ctr_enc: movq 8(%rcx), %rax; bswapq %rax; - vzeroupper; - vbroadcasti128 .Lbswap128_mask rRIP, RTMP3; vpcmpeqd RNOT, RNOT, RNOT; vpsrldq $8, RNOT, RNOT; /* ab: -1:0 ; cd: -1:0 */ @@ -410,8 +476,6 @@ _gcry_sm4_aesni_avx2_cbc_dec: */ CFI_STARTPROC(); - vzeroupper; - vmovdqu (0 * 32)(%rdx), RA0; vmovdqu (1 * 32)(%rdx), RA1; vmovdqu (2 * 32)(%rdx), RA2; @@ -463,8 +527,6 @@ _gcry_sm4_aesni_avx2_cfb_dec: */ CFI_STARTPROC(); - vzeroupper; - /* Load input */ vmovdqu (%rcx), RNOTx; vinserti128 $1, (%rdx), RNOT, RA0; @@ -521,8 +583,6 @@ _gcry_sm4_aesni_avx2_ocb_enc: */ CFI_STARTPROC(); - vzeroupper; - subq $(4 
* 8), %rsp; CFI_ADJUST_CFA_OFFSET(4 * 8); @@ -635,8 +695,6 @@ _gcry_sm4_aesni_avx2_ocb_dec: */ CFI_STARTPROC(); - vzeroupper; - subq $(4 * 8), %rsp; CFI_ADJUST_CFA_OFFSET(4 * 8); @@ -758,8 +816,6 @@ _gcry_sm4_aesni_avx2_ocb_auth: */ CFI_STARTPROC(); - vzeroupper; - subq $(4 * 8), %rsp; CFI_ADJUST_CFA_OFFSET(4 * 8); diff --git a/cipher/sm4.c b/cipher/sm4.c index 9d00ee05..1f27f508 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -291,6 +291,24 @@ extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, unsigned char *offset, unsigned char *checksum, const u64 Ls[16]) ASM_FUNC_ABI; + +extern unsigned int +_gcry_sm4_aesni_avx2_crypt_blk1_16(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) ASM_FUNC_ABI; + +static inline unsigned int +sm4_aesni_avx2_crypt_blk1_16(const void *rk, byte *out, const byte *in, + unsigned int num_blks) +{ +#ifdef USE_AESNI_AVX + /* Use 128-bit register implementation for short input. */ + if (num_blks <= 8) + return _gcry_sm4_aesni_avx_crypt_blk1_8(rk, out, in, num_blks); +#endif + + return _gcry_sm4_aesni_avx2_crypt_blk1_16(rk, out, in, num_blks); +} + #endif /* USE_AESNI_AVX2 */ #ifdef USE_GFNI_AVX2 @@ -382,6 +400,7 @@ sm4_aarch64_crypt_blk1_16(const void *rk, byte *out, const byte *in, _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, num_blks); return 0; } + #endif /* USE_AARCH64_SIMD */ #ifdef USE_ARM_CE @@ -427,6 +446,7 @@ sm4_armv8_ce_crypt_blk1_16(const void *rk, byte *out, const byte *in, _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, num_blks); return 0; } + #endif /* USE_ARM_CE */ static inline void prefetch_sbox_table(void) @@ -758,6 +778,12 @@ sm4_get_crypt_blk1_16_fn(SM4_context *ctx) return &sm4_gfni_avx2_crypt_blk1_16; } #endif +#ifdef USE_AESNI_AVX2 + else if (ctx->use_aesni_avx2) + { + return &sm4_aesni_avx2_crypt_blk1_16; + } +#endif #ifdef USE_AESNI_AVX else if (ctx->use_aesni_avx) { -- 2.34.1 From jussi.kivilinna at iki.fi Sun Apr 24 20:47:02 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 24 Apr 2022 21:47:02 +0300 Subject: [PATCH 2/3] Add SM4 x86-64/GFNI/AVX2 implementation In-Reply-To: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> References: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> Message-ID: <20220424184703.2215215-2-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add 'sm4-gfni-avx2-amd64.S'. * cipher/sm4-aesni-avx2-amd64.S: New. * cipher/sm4.c (USE_GFNI_AVX2): New. (SM4_context): Add 'use_gfni_avx2'. (crypt_blk1_8_fn_t): Rename to... (crypt_blk1_16_fn_t): ...this. (sm4_aesni_avx_crypt_blk1_8): Rename to... (sm4_aesni_avx_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (_gcry_sm4_gfni_avx_expand_key, _gcry_sm4_gfni_avx2_ctr_enc) (_gcry_sm4_gfni_avx2_cbc_dec, _gcry_sm4_gfni_avx2_cfb_dec) (_gcry_sm4_gfni_avx2_ocb_enc, _gcry_sm4_gfni_avx2_ocb_dec) (_gcry_sm4_gfni_avx2_ocb_auth, _gcry_sm4_gfni_avx2_crypt_blk1_16) (sm4_gfni_avx2_crypt_blk1_16): New. (sm4_aarch64_crypt_blk1_8): Rename to... (sm4_aarch64_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (sm4_armv8_ce_crypt_blk1_8): Rename to... (sm4_armv8_ce_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (sm4_expand_key): Add GFNI/AVX2 path. (sm4_setkey): Enable GFNI/AVX2 implementation if HW features available. (sm4_encrypt) [USE_GFNI_AVX2]: New. (sm4_decrypt) [USE_GFNI_AVX2]: New. (sm4_get_crypt_blk1_8_fn): Rename to... (sm4_get_crypt_blk1_16_fn): ...this; Update to use *_blk1_16 functions; Add GFNI/AVX2 selection. 
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Add GFNI/AVX2 path; Widen generic bulk processing from 8 blocks to 16 blocks. (_gcry_sm4_xts_crypt): Widen generic bulk processing from 8 blocks to 16 blocks. -- Benchmark on Intel i3-1115G4 (tigerlake): Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 10.34 ns/B 92.21 MiB/s 42.29 c/B 4089 ECB dec | 10.34 ns/B 92.24 MiB/s 42.29 c/B 4090 CBC enc | 11.06 ns/B 86.26 MiB/s 45.21 c/B 4090 CBC dec | 1.13 ns/B 844.8 MiB/s 4.62 c/B 4090 CFB enc | 11.06 ns/B 86.27 MiB/s 45.22 c/B 4090 CFB dec | 1.13 ns/B 846.0 MiB/s 4.61 c/B 4090 CTR enc | 1.14 ns/B 834.3 MiB/s 4.67 c/B 4089 CTR dec | 1.14 ns/B 834.5 MiB/s 4.67 c/B 4089 XTS enc | 1.93 ns/B 494.1 MiB/s 7.89 c/B 4090 XTS dec | 1.94 ns/B 492.5 MiB/s 7.92 c/B 4090 OCB enc | 1.16 ns/B 823.3 MiB/s 4.74 c/B 4090 OCB dec | 1.16 ns/B 818.8 MiB/s 4.76 c/B 4089 OCB auth | 1.15 ns/B 831.0 MiB/s 4.69 c/B 4089 After: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 8.39 ns/B 113.6 MiB/s 34.33 c/B 4090 ECB dec | 8.40 ns/B 113.5 MiB/s 34.35 c/B 4090 CBC enc | 9.45 ns/B 101.0 MiB/s 38.63 c/B 4089 CBC dec | 0.650 ns/B 1468 MiB/s 2.66 c/B 4090 CFB enc | 9.44 ns/B 101.1 MiB/s 38.59 c/B 4090 CFB dec | 0.660 ns/B 1444 MiB/s 2.70 c/B 4090 CTR enc | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090 CTR dec | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090 XTS enc | 0.756 ns/B 1262 MiB/s 3.09 c/B 4090 XTS dec | 0.757 ns/B 1260 MiB/s 3.10 c/B 4090 OCB enc | 0.673 ns/B 1417 MiB/s 2.75 c/B 4090 OCB dec | 0.675 ns/B 1413 MiB/s 2.76 c/B 4090 OCB auth | 0.672 ns/B 1418 MiB/s 2.75 c/B 4090 ECB: 1.2x faster CBC-enc / CFB-enc: 1.17x faster CBC-dec / CFB-dec / CTR / OCB: 1.7x faster XTS: 2.5x faster Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 +- cipher/sm4-aesni-avx2-amd64.S | 4 +- cipher/sm4-gfni-avx2-amd64.S | 1194 +++++++++++++++++++++++++++++++++ cipher/sm4.c | 295 ++++++-- configure.ac | 1 + 5 files changed, 1454 insertions(+), 42 deletions(-) create mode 100644 cipher/sm4-gfni-avx2-amd64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 7a429e8b..55f96014 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -117,7 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ - sm4-armv8-aarch64-ce.S \ + sm4-armv8-aarch64-ce.S sm4-gfni-avx2-amd64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aesni-avx2-amd64.S b/cipher/sm4-aesni-avx2-amd64.S index 7a8b9558..effe590b 100644 --- a/cipher/sm4-aesni-avx2-amd64.S +++ b/cipher/sm4-aesni-avx2-amd64.S @@ -252,14 +252,14 @@ __sm4_crypt_blk16: leaq (32*4)(%rdi), %rax; .align 16 -.Lroundloop_blk8: +.Lroundloop_blk16: ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3); ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0); ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1); ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2); leaq (4*4)(%rdi), %rdi; cmpq %rax, %rdi; - jne .Lroundloop_blk8; + jne .Lroundloop_blk16; #undef ROUND diff --git a/cipher/sm4-gfni-avx2-amd64.S b/cipher/sm4-gfni-avx2-amd64.S new file mode 100644 index 00000000..4ec0ea39 --- /dev/null +++ b/cipher/sm4-gfni-avx2-amd64.S @@ -0,0 +1,1194 @@ +/* sm4-gfni-avx2-amd64.S - GFNI/AVX2 implementation of SM4 cipher + * + * Copyright (C) 2022 Jussi Kivilinna + * + * This file is part of Libgcrypt. 
+ * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#ifdef __x86_64 +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) + +#include "asm-common-amd64.h" + +/********************************************************************** + helper macros + **********************************************************************/ + +/* Transpose four 32-bit words between 128-bit vectors. */ +#define transpose_4x4(x0, x1, x2, x3, t1, t2) \ + vpunpckhdq x1, x0, t2; \ + vpunpckldq x1, x0, x0; \ + \ + vpunpckldq x3, x2, t1; \ + vpunpckhdq x3, x2, x2; \ + \ + vpunpckhqdq t1, x0, x1; \ + vpunpcklqdq t1, x0, x0; \ + \ + vpunpckhqdq x2, t2, x3; \ + vpunpcklqdq x2, t2, x2; + +/********************************************************************** + 4-way && 8-way SM4 with GFNI and AVX2 + **********************************************************************/ + +/* vector registers */ +#define RX0 %ymm0 +#define RX1 %ymm1 +#define RX0x %xmm0 +#define RX1x %xmm1 + +#define RTMP0 %ymm2 +#define RTMP1 %ymm3 +#define RTMP2 %ymm4 +#define RTMP3 %ymm5 +#define RTMP4 %ymm6 +#define RTMP0x %xmm2 +#define RTMP1x %xmm3 +#define RTMP2x %xmm4 +#define RTMP3x %xmm5 +#define RTMP4x %xmm6 + +#define RNOT %ymm7 +#define RNOTx %xmm7 + +#define RA0 %ymm8 +#define RA1 %ymm9 +#define RA2 %ymm10 +#define RA3 %ymm11 +#define RA0x %xmm8 +#define RA1x %xmm9 +#define RA2x %xmm10 +#define RA3x %xmm11 + +#define RB0 %ymm12 +#define RB1 %ymm13 +#define RB2 %ymm14 +#define RB3 %ymm15 +#define RB0x %xmm12 +#define RB1x %xmm13 +#define RB2x %xmm14 +#define RB3x %xmm15 + +.text +.align 32 + +/* Affine transform, SM4 field to AES field */ +.Lpre_affine_s: + .byte 0x52, 0xbc, 0x2d, 0x02, 0x9e, 0x25, 0xac, 0x34 + .byte 0x52, 0xbc, 0x2d, 0x02, 0x9e, 0x25, 0xac, 0x34 + .byte 0x52, 0xbc, 0x2d, 0x02, 0x9e, 0x25, 0xac, 0x34 + .byte 0x52, 0xbc, 0x2d, 0x02, 0x9e, 0x25, 0xac, 0x34 + +/* Affine transform, AES field to SM4 field */ +.Lpost_affine_s: + .byte 0x19, 0x8b, 0x6c, 0x1e, 0x51, 0x8e, 0x2d, 0xd7 + .byte 0x19, 0x8b, 0x6c, 0x1e, 0x51, 0x8e, 0x2d, 0xd7 + .byte 0x19, 0x8b, 0x6c, 0x1e, 0x51, 0x8e, 0x2d, 0xd7 + .byte 0x19, 0x8b, 0x6c, 0x1e, 0x51, 0x8e, 0x2d, 0xd7 + +/* Rotate left by 8 bits on 32-bit words with vpshufb */ +.Lrol_8: + .byte 0x03, 0x00, 0x01, 0x02, 0x07, 0x04, 0x05, 0x06 + .byte 0x0b, 0x08, 0x09, 0x0a, 0x0f, 0x0c, 0x0d, 0x0e + .byte 0x03, 0x00, 0x01, 0x02, 0x07, 0x04, 0x05, 0x06 + .byte 0x0b, 0x08, 0x09, 0x0a, 0x0f, 0x0c, 0x0d, 0x0e + +/* Rotate left by 16 bits on 32-bit words with vpshufb */ +.Lrol_16: + .byte 0x02, 0x03, 0x00, 0x01, 0x06, 0x07, 0x04, 0x05 + .byte 0x0a, 0x0b, 0x08, 0x09, 0x0e, 0x0f, 0x0c, 0x0d + .byte 0x02, 0x03, 0x00, 0x01, 0x06, 0x07, 0x04, 0x05 + .byte 0x0a, 0x0b, 0x08, 0x09, 0x0e, 0x0f, 0x0c, 0x0d + +/* Rotate left by 24 bits on 32-bit words with vpshufb */ +.Lrol_24: + .byte 0x01, 0x02, 0x03, 
0x00, 0x05, 0x06, 0x07, 0x04 + .byte 0x09, 0x0a, 0x0b, 0x08, 0x0d, 0x0e, 0x0f, 0x0c + .byte 0x01, 0x02, 0x03, 0x00, 0x05, 0x06, 0x07, 0x04 + .byte 0x09, 0x0a, 0x0b, 0x08, 0x0d, 0x0e, 0x0f, 0x0c + +/* For CTR-mode IV byteswap */ +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 + +/* For input word byte-swap */ +.Lbswap32_mask: + .byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 + +.align 8 +.globl _gcry_sm4_gfni_avx2_expand_key +ELF(.type _gcry_sm4_gfni_avx2_expand_key, at function;) +_gcry_sm4_gfni_avx2_expand_key: + /* input: + * %rdi: 128-bit key + * %rsi: rkey_enc + * %rdx: rkey_dec + * %rcx: fk array + * %r8: ck array + */ + CFI_STARTPROC(); + + vmovd 0*4(%rdi), RA0x; + vmovd 1*4(%rdi), RA1x; + vmovd 2*4(%rdi), RA2x; + vmovd 3*4(%rdi), RA3x; + + vmovdqa .Lbswap32_mask rRIP, RTMP2x; + vpshufb RTMP2x, RA0x, RA0x; + vpshufb RTMP2x, RA1x, RA1x; + vpshufb RTMP2x, RA2x, RA2x; + vpshufb RTMP2x, RA3x, RA3x; + + vmovd 0*4(%rcx), RB0x; + vmovd 1*4(%rcx), RB1x; + vmovd 2*4(%rcx), RB2x; + vmovd 3*4(%rcx), RB3x; + vpxor RB0x, RA0x, RA0x; + vpxor RB1x, RA1x, RA1x; + vpxor RB2x, RA2x, RA2x; + vpxor RB3x, RA3x, RA3x; + +#define ROUND(round, s0, s1, s2, s3) \ + vpbroadcastd (4*(round))(%r8), RX0x; \ + vpxor s1, RX0x, RX0x; \ + vpxor s2, RX0x, RX0x; \ + vpxor s3, RX0x, RX0x; /* s1 ^ s2 ^ s3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + vgf2p8affineqb $0x65, .Lpre_affine_s rRIP, RX0x, RX0x; \ + vgf2p8affineinvqb $0xd3, .Lpost_affine_s rRIP, RX0x, RX0x; \ + \ + /* linear part */ \ + vpxor RX0x, s0, s0; /* s0 ^ x */ \ + vpslld $13, RX0x, RTMP0x; \ + vpsrld $19, RX0x, RTMP1x; \ + vpslld $23, RX0x, RTMP2x; \ + vpsrld $9, RX0x, RTMP3x; \ + vpxor RTMP0x, RTMP1x, RTMP1x; \ + vpxor RTMP2x, RTMP3x, RTMP3x; \ + vpxor RTMP1x, s0, s0; /* s0 ^ x ^ rol(x,13) */ \ + vpxor RTMP3x, s0, s0; /* s0 ^ x ^ rol(x,13) ^ rol(x,23) */ + + leaq (32*4)(%r8), %rax; + leaq (32*4)(%rdx), %rdx; +.align 16 +.Lroundloop_expand_key: + leaq (-4*4)(%rdx), %rdx; + ROUND(0, RA0x, RA1x, RA2x, RA3x); + ROUND(1, RA1x, RA2x, RA3x, RA0x); + ROUND(2, RA2x, RA3x, RA0x, RA1x); + ROUND(3, RA3x, RA0x, RA1x, RA2x); + leaq (4*4)(%r8), %r8; + vmovd RA0x, (0*4)(%rsi); + vmovd RA1x, (1*4)(%rsi); + vmovd RA2x, (2*4)(%rsi); + vmovd RA3x, (3*4)(%rsi); + vmovd RA0x, (3*4)(%rdx); + vmovd RA1x, (2*4)(%rdx); + vmovd RA2x, (1*4)(%rdx); + vmovd RA3x, (0*4)(%rdx); + leaq (4*4)(%rsi), %rsi; + cmpq %rax, %r8; + jne .Lroundloop_expand_key; + +#undef ROUND + + vzeroall; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_expand_key,.-_gcry_sm4_gfni_avx2_expand_key;) + +.align 8 +ELF(.type sm4_gfni_avx2_crypt_blk1_4, at function;) +sm4_gfni_avx2_crypt_blk1_4: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..4 blocks) + * %rdx: src (1..4 blocks) + * %rcx: num blocks (1..4) + */ + CFI_STARTPROC(); + + vmovdqu 0*16(%rdx), RA0x; + vmovdqa RA0x, RA1x; + vmovdqa RA0x, RA2x; + vmovdqa RA0x, RA3x; + cmpq $2, %rcx; + jb .Lblk4_load_input_done; + vmovdqu 1*16(%rdx), RA1x; + je .Lblk4_load_input_done; + vmovdqu 2*16(%rdx), RA2x; + cmpq $3, %rcx; + je .Lblk4_load_input_done; + vmovdqu 3*16(%rdx), RA3x; + +.Lblk4_load_input_done: + + vmovdqa .Lbswap32_mask rRIP, RTMP2x; + vpshufb RTMP2x, RA0x, RA0x; + vpshufb RTMP2x, RA1x, RA1x; + vpshufb RTMP2x, RA2x, RA2x; + vpshufb RTMP2x, RA3x, RA3x; + + vmovdqa .Lrol_8 rRIP, RTMP2x; + vmovdqa .Lrol_16 rRIP, RTMP3x; + vmovdqa .Lrol_24 rRIP, RB3x; + transpose_4x4(RA0x, RA1x, RA2x, RA3x, RTMP0x, RTMP1x); + +#define ROUND(round, s0, s1, s2, s3) \ + vpbroadcastd (4*(round))(%rdi), RX0x; 
\ + vpxor s1, RX0x, RX0x; \ + vpxor s2, RX0x, RX0x; \ + vpxor s3, RX0x, RX0x; /* s1 ^ s2 ^ s3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + vgf2p8affineqb $0x65, .Lpre_affine_s rRIP, RX0x, RX0x; \ + vgf2p8affineinvqb $0xd3, .Lpost_affine_s rRIP, RX0x, RX0x; \ + \ + /* linear part */ \ + vpxor RX0x, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP2x, RX0x, RTMP1x; \ + vpxor RTMP1x, RX0x, RTMP0x; /* x ^ rol(x,8) */ \ + vpshufb RTMP3x, RX0x, RTMP1x; \ + vpxor RTMP1x, RTMP0x, RTMP0x; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RB3x, RX0x, RTMP1x; \ + vpxor RTMP1x, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0x, RTMP1x; \ + vpsrld $30, RTMP0x, RTMP0x; \ + vpxor RTMP0x, s0, s0; \ + vpxor RTMP1x, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk4: + ROUND(0, RA0x, RA1x, RA2x, RA3x); + ROUND(1, RA1x, RA2x, RA3x, RA0x); + ROUND(2, RA2x, RA3x, RA0x, RA1x); + ROUND(3, RA3x, RA0x, RA1x, RA2x); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk4; + +#undef ROUND + + vmovdqa .Lbswap128_mask rRIP, RTMP2x; + + transpose_4x4(RA0x, RA1x, RA2x, RA3x, RTMP0x, RTMP1x); + vpshufb RTMP2x, RA0x, RA0x; + vpshufb RTMP2x, RA1x, RA1x; + vpshufb RTMP2x, RA2x, RA2x; + vpshufb RTMP2x, RA3x, RA3x; + + vmovdqu RA0x, 0*16(%rsi); + cmpq $2, %rcx; + jb .Lblk4_store_output_done; + vmovdqu RA1x, 1*16(%rsi); + je .Lblk4_store_output_done; + vmovdqu RA2x, 2*16(%rsi); + cmpq $3, %rcx; + je .Lblk4_store_output_done; + vmovdqu RA3x, 3*16(%rsi); + +.Lblk4_store_output_done: + vzeroall; + xorl %eax, %eax; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_gfni_avx2_crypt_blk1_4,.-sm4_gfni_avx2_crypt_blk1_4;) + +.align 8 +ELF(.type __sm4_gfni_crypt_blk8, at function;) +__sm4_gfni_crypt_blk8: + /* input: + * %rdi: round key array, CTX + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel + * ciphertext blocks + * output: + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext + * blocks + */ + CFI_STARTPROC(); + + vmovdqa .Lbswap32_mask rRIP, RTMP2x; + vpshufb RTMP2x, RA0x, RA0x; + vpshufb RTMP2x, RA1x, RA1x; + vpshufb RTMP2x, RA2x, RA2x; + vpshufb RTMP2x, RA3x, RA3x; + vpshufb RTMP2x, RB0x, RB0x; + vpshufb RTMP2x, RB1x, RB1x; + vpshufb RTMP2x, RB2x, RB2x; + vpshufb RTMP2x, RB3x, RB3x; + + transpose_4x4(RA0x, RA1x, RA2x, RA3x, RTMP0x, RTMP1x); + transpose_4x4(RB0x, RB1x, RB2x, RB3x, RTMP0x, RTMP1x); + +#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \ + vpbroadcastd (4*(round))(%rdi), RX0x; \ + vmovdqa .Lpre_affine_s rRIP, RTMP2x; \ + vmovdqa .Lpost_affine_s rRIP, RTMP3x; \ + vmovdqa RX0x, RX1x; \ + vpxor s1, RX0x, RX0x; \ + vpxor s2, RX0x, RX0x; \ + vpxor s3, RX0x, RX0x; /* s1 ^ s2 ^ s3 ^ rk */ \ + vpxor r1, RX1x, RX1x; \ + vpxor r2, RX1x, RX1x; \ + vpxor r3, RX1x, RX1x; /* r1 ^ r2 ^ r3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + vmovdqa .Lrol_8 rRIP, RTMP4x; \ + vgf2p8affineqb $0x65, RTMP2x, RX0x, RX0x; \ + vgf2p8affineinvqb $0xd3, RTMP3x, RX0x, RX0x; \ + vgf2p8affineqb $0x65, RTMP2x, RX1x, RX1x; \ + vgf2p8affineinvqb $0xd3, RTMP3x, RX1x, RX1x; \ + \ + /* linear part */ \ + vpxor RX0x, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP4x, RX0x, RTMP1x; \ + vpxor RTMP1x, RX0x, RTMP0x; /* x ^ rol(x,8) */ \ + vpxor RX1x, r0, r0; /* r0 ^ x */ \ + vpshufb RTMP4x, RX1x, RTMP3x; \ + vmovdqa .Lrol_16 rRIP, RTMP4x; \ + vpxor RTMP3x, RX1x, RTMP2x; /* x ^ rol(x,8) */ \ + vpshufb RTMP4x, RX0x, RTMP1x; \ + vpxor RTMP1x, RTMP0x, RTMP0x; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4x, RX1x, RTMP3x; \ + vmovdqa .Lrol_24 rRIP, RTMP4x; \ + vpxor 
RTMP3x, RTMP2x, RTMP2x; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4x, RX0x, RTMP1x; \ + vpxor RTMP1x, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0x, RTMP1x; \ + vpsrld $30, RTMP0x, RTMP0x; \ + vpxor RTMP0x, s0, s0; \ + vpxor RTMP1x, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \ + vpshufb RTMP4x, RX1x, RTMP3x; \ + vpxor RTMP3x, r0, r0; /* r0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP2x, RTMP3x; \ + vpsrld $30, RTMP2x, RTMP2x; \ + vpxor RTMP2x, r0, r0; \ + vpxor RTMP3x, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk8: + ROUND(0, RA0x, RA1x, RA2x, RA3x, RB0x, RB1x, RB2x, RB3x); + ROUND(1, RA1x, RA2x, RA3x, RA0x, RB1x, RB2x, RB3x, RB0x); + ROUND(2, RA2x, RA3x, RA0x, RA1x, RB2x, RB3x, RB0x, RB1x); + ROUND(3, RA3x, RA0x, RA1x, RA2x, RB3x, RB0x, RB1x, RB2x); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk8; + +#undef ROUND + + vmovdqa .Lbswap128_mask rRIP, RTMP2x; + + transpose_4x4(RA0x, RA1x, RA2x, RA3x, RTMP0x, RTMP1x); + transpose_4x4(RB0x, RB1x, RB2x, RB3x, RTMP0x, RTMP1x); + vpshufb RTMP2x, RA0x, RA0x; + vpshufb RTMP2x, RA1x, RA1x; + vpshufb RTMP2x, RA2x, RA2x; + vpshufb RTMP2x, RA3x, RA3x; + vpshufb RTMP2x, RB0x, RB0x; + vpshufb RTMP2x, RB1x, RB1x; + vpshufb RTMP2x, RB2x, RB2x; + vpshufb RTMP2x, RB3x, RB3x; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size __sm4_gfni_crypt_blk8,.-__sm4_gfni_crypt_blk8;) + +.align 8 +ELF(.type _gcry_sm4_gfni_avx2_crypt_blk1_8, at function;) +_gcry_sm4_gfni_avx2_crypt_blk1_8: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..8 blocks) + * %rdx: src (1..8 blocks) + * %rcx: num blocks (1..8) + */ + CFI_STARTPROC(); + + cmpq $5, %rcx; + jb sm4_gfni_avx2_crypt_blk1_4; + vmovdqu (0 * 16)(%rdx), RA0x; + vmovdqu (1 * 16)(%rdx), RA1x; + vmovdqu (2 * 16)(%rdx), RA2x; + vmovdqu (3 * 16)(%rdx), RA3x; + vmovdqu (4 * 16)(%rdx), RB0x; + vmovdqa RB0x, RB1x; + vmovdqa RB0x, RB2x; + vmovdqa RB0x, RB3x; + je .Lblk8_load_input_done; + vmovdqu (5 * 16)(%rdx), RB1x; + cmpq $7, %rcx; + jb .Lblk8_load_input_done; + vmovdqu (6 * 16)(%rdx), RB2x; + je .Lblk8_load_input_done; + vmovdqu (7 * 16)(%rdx), RB3x; + +.Lblk8_load_input_done: + call __sm4_gfni_crypt_blk8; + + cmpq $6, %rcx; + vmovdqu RA0x, (0 * 16)(%rsi); + vmovdqu RA1x, (1 * 16)(%rsi); + vmovdqu RA2x, (2 * 16)(%rsi); + vmovdqu RA3x, (3 * 16)(%rsi); + vmovdqu RB0x, (4 * 16)(%rsi); + jb .Lblk8_store_output_done; + vmovdqu RB1x, (5 * 16)(%rsi); + je .Lblk8_store_output_done; + vmovdqu RB2x, (6 * 16)(%rsi); + cmpq $7, %rcx; + je .Lblk8_store_output_done; + vmovdqu RB3x, (7 * 16)(%rsi); + +.Lblk8_store_output_done: + vzeroall; + xorl %eax, %eax; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_crypt_blk1_8,.-_gcry_sm4_gfni_avx2_crypt_blk1_8;) + +/********************************************************************** + 16-way SM4 with GFNI and AVX2 + **********************************************************************/ + +.align 8 +ELF(.type __sm4_gfni_crypt_blk16, at function;) +__sm4_gfni_crypt_blk16: + /* input: + * %rdi: ctx, CTX + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel + * plaintext blocks + * output: + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel + * ciphertext blocks + */ + CFI_STARTPROC(); + + vbroadcasti128 .Lbswap32_mask rRIP, RTMP2; + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb 
RTMP2, RB3, RB3; + + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + +#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \ + vpbroadcastd (4*(round))(%rdi), RX0; \ + vbroadcasti128 .Lpre_affine_s rRIP, RTMP2; \ + vbroadcasti128 .Lpost_affine_s rRIP, RTMP3; \ + vmovdqa RX0, RX1; \ + vpxor s1, RX0, RX0; \ + vpxor s2, RX0, RX0; \ + vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \ + vpxor r1, RX1, RX1; \ + vpxor r2, RX1, RX1; \ + vpxor r3, RX1, RX1; /* r1 ^ r2 ^ r3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + vbroadcasti128 .Lrol_8 rRIP, RTMP4; \ + vgf2p8affineqb $0x65, RTMP2, RX0, RX0; \ + vgf2p8affineinvqb $0xd3, RTMP3, RX0, RX0; \ + vgf2p8affineqb $0x65, RTMP2, RX1, RX1; \ + vgf2p8affineinvqb $0xd3, RTMP3, RX1, RX1; \ + \ + /* linear part */ \ + vpxor RX0, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, RX0, RTMP0; /* x ^ rol(x,8) */ \ + vpxor RX1, r0, r0; /* r0 ^ x */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vbroadcasti128 .Lrol_16 rRIP, RTMP4; \ + vpxor RTMP3, RX1, RTMP2; /* x ^ rol(x,8) */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vbroadcasti128 .Lrol_24 rRIP, RTMP4; \ + vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0, RTMP1; \ + vpsrld $30, RTMP0, RTMP0; \ + vpxor RTMP0, s0, s0; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP2, RTMP3; \ + vpsrld $30, RTMP2, RTMP2; \ + vpxor RTMP2, r0, r0; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk16: + ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3); + ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0); + ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1); + ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk16; + +#undef ROUND + + vbroadcasti128 .Lbswap128_mask rRIP, RTMP2; + + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb RTMP2, RB3, RB3; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size __sm4_gfni_crypt_blk16,.-__sm4_gfni_crypt_blk16;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_crypt_blk1_16 +ELF(.type _gcry_sm4_gfni_avx2_crypt_blk1_16, at function;) +_gcry_sm4_gfni_avx2_crypt_blk1_16: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..16 blocks) + * %rdx: src (1..16 blocks) + * %rcx: num blocks (1..16) + */ + CFI_STARTPROC(); + +#define LOAD_INPUT(offset, yreg) \ + cmpq $(1 + 2 * (offset)), %rcx; \ + jb .Lblk16_load_input_done; \ + ja 1f; \ + vmovdqu (offset) * 32(%rdx), yreg##x; \ + jmp .Lblk16_load_input_done; \ + 1: \ + vmovdqu (offset) * 32(%rdx), yreg; + + cmpq $8, %rcx; + jbe _gcry_sm4_gfni_avx2_crypt_blk1_8; + vmovdqu (0 * 32)(%rdx), RA0; + vmovdqu (1 * 32)(%rdx), RA1; + vmovdqu (2 * 32)(%rdx), RA2; + vmovdqu (3 * 32)(%rdx), RA3; + LOAD_INPUT(4, RB0); + LOAD_INPUT(5, RB1); + LOAD_INPUT(6, RB2); + LOAD_INPUT(7, RB3); +#undef LOAD_INPUT + +.Lblk16_load_input_done: + call __sm4_gfni_crypt_blk16; + +#define STORE_OUTPUT(yreg, offset) \ + 
cmpq $(1 + 2 * (offset)), %rcx; \ + jb .Lblk16_store_output_done; \ + ja 1f; \ + vmovdqu yreg##x, (offset) * 32(%rsi); \ + jmp .Lblk16_store_output_done; \ + 1: \ + vmovdqu yreg, (offset) * 32(%rsi); + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + STORE_OUTPUT(RB0, 4); + STORE_OUTPUT(RB1, 5); + STORE_OUTPUT(RB2, 6); + STORE_OUTPUT(RB3, 7); +#undef STORE_OUTPUT + +.Lblk16_store_output_done: + vzeroall; + xorl %eax, %eax; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_crypt_blk1_16,.-_gcry_sm4_gfni_avx2_crypt_blk1_16;) + +#define inc_le128(x, minus_one, tmp) \ + vpcmpeqq minus_one, x, tmp; \ + vpsubq minus_one, x, x; \ + vpslldq $8, tmp, tmp; \ + vpsubq tmp, x, x; + +.align 8 +.globl _gcry_sm4_gfni_avx2_ctr_enc +ELF(.type _gcry_sm4_gfni_avx2_ctr_enc, at function;) +_gcry_sm4_gfni_avx2_ctr_enc: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv (big endian, 128bit) + */ + CFI_STARTPROC(); + + movq 8(%rcx), %rax; + bswapq %rax; + + vbroadcasti128 .Lbswap128_mask rRIP, RTMP3; + vpcmpeqd RNOT, RNOT, RNOT; + vpsrldq $8, RNOT, RNOT; /* ab: -1:0 ; cd: -1:0 */ + vpaddq RNOT, RNOT, RTMP2; /* ab: -2:0 ; cd: -2:0 */ + + /* load IV and byteswap */ + vmovdqu (%rcx), RTMP4x; + vpshufb RTMP3x, RTMP4x, RTMP4x; + vmovdqa RTMP4x, RTMP0x; + inc_le128(RTMP4x, RNOTx, RTMP1x); + vinserti128 $1, RTMP4x, RTMP0, RTMP0; + vpshufb RTMP3, RTMP0, RA0; /* +1 ; +0 */ + + /* check need for handling 64-bit overflow and carry */ + cmpq $(0xffffffffffffffff - 16), %rax; + ja .Lhandle_ctr_carry; + + /* construct IVs */ + vpsubq RTMP2, RTMP0, RTMP0; /* +3 ; +2 */ + vpshufb RTMP3, RTMP0, RA1; + vpsubq RTMP2, RTMP0, RTMP0; /* +5 ; +4 */ + vpshufb RTMP3, RTMP0, RA2; + vpsubq RTMP2, RTMP0, RTMP0; /* +7 ; +6 */ + vpshufb RTMP3, RTMP0, RA3; + vpsubq RTMP2, RTMP0, RTMP0; /* +9 ; +8 */ + vpshufb RTMP3, RTMP0, RB0; + vpsubq RTMP2, RTMP0, RTMP0; /* +11 ; +10 */ + vpshufb RTMP3, RTMP0, RB1; + vpsubq RTMP2, RTMP0, RTMP0; /* +13 ; +12 */ + vpshufb RTMP3, RTMP0, RB2; + vpsubq RTMP2, RTMP0, RTMP0; /* +15 ; +14 */ + vpshufb RTMP3, RTMP0, RB3; + vpsubq RTMP2, RTMP0, RTMP0; /* +16 */ + vpshufb RTMP3x, RTMP0x, RTMP0x; + + jmp .Lctr_carry_done; + +.Lhandle_ctr_carry: + /* construct IVs */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA1; /* +3 ; +2 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA2; /* +5 ; +4 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA3; /* +7 ; +6 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB0; /* +9 ; +8 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB1; /* +11 ; +10 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB2; /* +13 ; +12 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB3; /* +15 ; +14 */ + inc_le128(RTMP0, RNOT, RTMP1); + vextracti128 $1, RTMP0, RTMP0x; + vpshufb RTMP3x, RTMP0x, RTMP0x; /* +16 */ + +.align 4 +.Lctr_carry_done: + /* store new IV */ + vmovdqu RTMP0x, (%rcx); + + call __sm4_gfni_crypt_blk16; + + vpxor (0 * 32)(%rdx), RA0, RA0; + vpxor (1 * 32)(%rdx), RA1, RA1; + vpxor (2 * 32)(%rdx), RA2, RA2; + vpxor (3 * 32)(%rdx), RA3, RA3; + vpxor (4 * 32)(%rdx), RB0, RB0; + vpxor (5 * 32)(%rdx), RB1, RB1; + vpxor (6 * 32)(%rdx), RB2, RB2; + vpxor (7 
* 32)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_ctr_enc,.-_gcry_sm4_gfni_avx2_ctr_enc;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_cbc_dec +ELF(.type _gcry_sm4_gfni_avx2_cbc_dec, at function;) +_gcry_sm4_gfni_avx2_cbc_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + vmovdqu (0 * 32)(%rdx), RA0; + vmovdqu (1 * 32)(%rdx), RA1; + vmovdqu (2 * 32)(%rdx), RA2; + vmovdqu (3 * 32)(%rdx), RA3; + vmovdqu (4 * 32)(%rdx), RB0; + vmovdqu (5 * 32)(%rdx), RB1; + vmovdqu (6 * 32)(%rdx), RB2; + vmovdqu (7 * 32)(%rdx), RB3; + + call __sm4_gfni_crypt_blk16; + + vmovdqu (%rcx), RNOTx; + vinserti128 $1, (%rdx), RNOT, RNOT; + vpxor RNOT, RA0, RA0; + vpxor (0 * 32 + 16)(%rdx), RA1, RA1; + vpxor (1 * 32 + 16)(%rdx), RA2, RA2; + vpxor (2 * 32 + 16)(%rdx), RA3, RA3; + vpxor (3 * 32 + 16)(%rdx), RB0, RB0; + vpxor (4 * 32 + 16)(%rdx), RB1, RB1; + vpxor (5 * 32 + 16)(%rdx), RB2, RB2; + vpxor (6 * 32 + 16)(%rdx), RB3, RB3; + vmovdqu (7 * 32 + 16)(%rdx), RNOTx; + vmovdqu RNOTx, (%rcx); /* store new IV */ + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_cbc_dec,.-_gcry_sm4_gfni_avx2_cbc_dec;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_cfb_dec +ELF(.type _gcry_sm4_gfni_avx2_cfb_dec, at function;) +_gcry_sm4_gfni_avx2_cfb_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + /* Load input */ + vmovdqu (%rcx), RNOTx; + vinserti128 $1, (%rdx), RNOT, RA0; + vmovdqu (0 * 32 + 16)(%rdx), RA1; + vmovdqu (1 * 32 + 16)(%rdx), RA2; + vmovdqu (2 * 32 + 16)(%rdx), RA3; + vmovdqu (3 * 32 + 16)(%rdx), RB0; + vmovdqu (4 * 32 + 16)(%rdx), RB1; + vmovdqu (5 * 32 + 16)(%rdx), RB2; + vmovdqu (6 * 32 + 16)(%rdx), RB3; + + /* Update IV */ + vmovdqu (7 * 32 + 16)(%rdx), RNOTx; + vmovdqu RNOTx, (%rcx); + + call __sm4_gfni_crypt_blk16; + + vpxor (0 * 32)(%rdx), RA0, RA0; + vpxor (1 * 32)(%rdx), RA1, RA1; + vpxor (2 * 32)(%rdx), RA2, RA2; + vpxor (3 * 32)(%rdx), RA3, RA3; + vpxor (4 * 32)(%rdx), RB0, RB0; + vpxor (5 * 32)(%rdx), RB1, RB1; + vpxor (6 * 32)(%rdx), RB2, RB2; + vpxor (7 * 32)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_cfb_dec,.-_gcry_sm4_gfni_avx2_cfb_dec;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_ocb_enc +ELF(.type _gcry_sm4_gfni_avx2_ocb_enc, at function;) + +_gcry_sm4_gfni_avx2_ocb_enc: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: offset + * %r8 : checksum + * %r9 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq 
%r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rcx), RTMP0x; + vmovdqu (%r8), RTMP1x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rdx), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RTMP1, RTMP1; \ + vpxor yreg, RNOT, yreg; \ + vmovdqu RNOT, (n * 32)(%rsi); + + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, %r13, RA1); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r9), %r10; + movq (9 * 8)(%r9), %r11; + movq (10 * 8)(%r9), %r12; + movq (11 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r9), %r10; + movq (13 * 8)(%r9), %r11; + movq (14 * 8)(%r9), %r12; + movq (15 * 8)(%r9), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef OCB_INPUT + + vextracti128 $1, RTMP1, RNOTx; + vmovdqu RTMP0x, (%rcx); + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%r8); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_gfni_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vpxor (0 * 32)(%rsi), RA0, RA0; + vpxor (1 * 32)(%rsi), RA1, RA1; + vpxor (2 * 32)(%rsi), RA2, RA2; + vpxor (3 * 32)(%rsi), RA3, RA3; + vpxor (4 * 32)(%rsi), RB0, RB0; + vpxor (5 * 32)(%rsi), RB1, RB1; + vpxor (6 * 32)(%rsi), RB2, RB2; + vpxor (7 * 32)(%rsi), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_ocb_enc,.-_gcry_sm4_gfni_avx2_ocb_enc;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_ocb_dec +ELF(.type _gcry_sm4_gfni_avx2_ocb_dec, at function;) + +_gcry_sm4_gfni_avx2_ocb_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: offset + * %r8 : checksum + * %r9 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rcx), RTMP0x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rdx), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RNOT, yreg; \ + vmovdqu RNOT, (n * 32)(%rsi); + + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, 
%r13, RA1); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r9), %r10; + movq (9 * 8)(%r9), %r11; + movq (10 * 8)(%r9), %r12; + movq (11 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r9), %r10; + movq (13 * 8)(%r9), %r11; + movq (14 * 8)(%r9), %r12; + movq (15 * 8)(%r9), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0x, (%rcx); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_gfni_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vmovdqu (%r8), RTMP1x; + + vpxor (0 * 32)(%rsi), RA0, RA0; + vpxor (1 * 32)(%rsi), RA1, RA1; + vpxor (2 * 32)(%rsi), RA2, RA2; + vpxor (3 * 32)(%rsi), RA3, RA3; + vpxor (4 * 32)(%rsi), RB0, RB0; + vpxor (5 * 32)(%rsi), RB1, RB1; + vpxor (6 * 32)(%rsi), RB2, RB2; + vpxor (7 * 32)(%rsi), RB3, RB3; + + /* Checksum_i = Checksum_{i-1} xor P_i */ + + vmovdqu RA0, (0 * 32)(%rsi); + vpxor RA0, RTMP1, RTMP1; + vmovdqu RA1, (1 * 32)(%rsi); + vpxor RA1, RTMP1, RTMP1; + vmovdqu RA2, (2 * 32)(%rsi); + vpxor RA2, RTMP1, RTMP1; + vmovdqu RA3, (3 * 32)(%rsi); + vpxor RA3, RTMP1, RTMP1; + vmovdqu RB0, (4 * 32)(%rsi); + vpxor RB0, RTMP1, RTMP1; + vmovdqu RB1, (5 * 32)(%rsi); + vpxor RB1, RTMP1, RTMP1; + vmovdqu RB2, (6 * 32)(%rsi); + vpxor RB2, RTMP1, RTMP1; + vmovdqu RB3, (7 * 32)(%rsi); + vpxor RB3, RTMP1, RTMP1; + + vextracti128 $1, RTMP1, RNOTx; + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%r8); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_ocb_dec,.-_gcry_sm4_gfni_avx2_ocb_dec;) + +.align 8 +.globl _gcry_sm4_gfni_avx2_ocb_auth +ELF(.type _gcry_sm4_gfni_avx2_ocb_auth, at function;) + +_gcry_sm4_gfni_avx2_ocb_auth: + /* input: + * %rdi: ctx, CTX + * %rsi: abuf (16 blocks) + * %rdx: offset + * %rcx: checksum + * %r8 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rdx), RTMP0x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rsi), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RNOT, yreg; + + movq (0 * 8)(%r8), %r10; + movq (1 * 8)(%r8), %r11; + movq (2 * 8)(%r8), %r12; + movq (3 * 8)(%r8), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, %r13, RA1); + movq (4 * 8)(%r8), %r10; + movq (5 * 8)(%r8), %r11; + movq (6 * 8)(%r8), %r12; + movq (7 * 8)(%r8), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r8), %r10; + movq (9 * 8)(%r8), %r11; + movq (10 * 8)(%r8), %r12; + movq (11 * 8)(%r8), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r8), %r10; + movq (13 * 8)(%r8), %r11; + movq (14 * 8)(%r8), %r12; + movq (15 * 8)(%r8), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef 
OCB_INPUT + + vmovdqu RTMP0x, (%rdx); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_gfni_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vpxor RA0, RB0, RA0; + vpxor RA1, RB1, RA1; + vpxor RA2, RB2, RA2; + vpxor RA3, RB3, RA3; + + vpxor RA1, RA0, RA0; + vpxor RA3, RA2, RA2; + + vpxor RA2, RA0, RTMP1; + + vextracti128 $1, RTMP1, RNOTx; + vpxor (%rcx), RTMP1x, RTMP1x; + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%rcx); + + vzeroall; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_gfni_avx2_ocb_auth,.-_gcry_sm4_gfni_avx2_ocb_auth;) + +#endif /*defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT)*/ +#endif /*__x86_64*/ diff --git a/cipher/sm4.c b/cipher/sm4.c index 600850e2..9d00ee05 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -1,7 +1,7 @@ /* sm4.c - SM4 Cipher Algorithm * Copyright (C) 2020 Alibaba Group. * Copyright (C) 2020 Tianjia Zhang - * Copyright (C) 2020 Jussi Kivilinna + * Copyright (C) 2020-2022 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -48,7 +48,7 @@ # endif #endif -/* USE_AESNI_AVX inidicates whether to compile with Intel AES-NI/AVX2 code. */ +/* USE_AESNI_AVX2 inidicates whether to compile with Intel AES-NI/AVX2 code. */ #undef USE_AESNI_AVX2 #if defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) # if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ @@ -57,10 +57,19 @@ # endif #endif +/* USE_GFNI_AVX2 inidicates whether to compile with Intel GFNI/AVX2 code. */ +#undef USE_GFNI_AVX2 +#if defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) +# if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_GFNI_AVX2 1 +# endif +#endif + /* Assembly implementations use SystemV ABI, ABI conversion and additional * stack to store XMM6-XMM15 needed on Win64. 
*/ #undef ASM_FUNC_ABI -#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) +#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) || defined(USE_GFNI_AVX2) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # else @@ -116,6 +125,9 @@ typedef struct #ifdef USE_AESNI_AVX2 unsigned int use_aesni_avx2:1; #endif +#ifdef USE_GFNI_AVX2 + unsigned int use_gfni_avx2:1; +#endif #ifdef USE_AARCH64_SIMD unsigned int use_aarch64_simd:1; #endif @@ -124,9 +136,9 @@ typedef struct #endif } SM4_context; -typedef unsigned int (*crypt_blk1_8_fn_t) (const void *ctx, byte *out, - const byte *in, - unsigned int num_blks); +typedef unsigned int (*crypt_blk1_16_fn_t) (const void *ctx, byte *out, + const byte *in, + unsigned int num_blks); static const u32 fk[4] = { @@ -231,9 +243,17 @@ _gcry_sm4_aesni_avx_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, unsigned int num_blks) ASM_FUNC_ABI; static inline unsigned int -sm4_aesni_avx_crypt_blk1_8(const void *rk, byte *out, const byte *in, - unsigned int num_blks) +sm4_aesni_avx_crypt_blk1_16(const void *rk, byte *out, const byte *in, + unsigned int num_blks) { + if (num_blks > 8) + { + _gcry_sm4_aesni_avx_crypt_blk1_8(rk, out, in, 8); + in += 8 * 16; + out += 8 * 16; + num_blks -= 8; + } + return _gcry_sm4_aesni_avx_crypt_blk1_8(rk, out, in, num_blks); } @@ -273,6 +293,56 @@ extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, const u64 Ls[16]) ASM_FUNC_ABI; #endif /* USE_AESNI_AVX2 */ +#ifdef USE_GFNI_AVX2 +extern void _gcry_sm4_gfni_avx_expand_key(const byte *key, u32 *rk_enc, + u32 *rk_dec, const u32 *fk, + const u32 *ck) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_ocb_enc(const u32 *rk_enc, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_ocb_dec(const u32 *rk_dec, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; + +extern void _gcry_sm4_gfni_avx2_ocb_auth(const u32 *rk_enc, + const unsigned char *abuf, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; + +extern unsigned int +_gcry_sm4_gfni_avx2_crypt_blk1_16(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) ASM_FUNC_ABI; + +static inline unsigned int +sm4_gfni_avx2_crypt_blk1_16(const void *rk, byte *out, const byte *in, + unsigned int num_blks) +{ + return _gcry_sm4_gfni_avx2_crypt_blk1_16(rk, out, in, num_blks); +} + +#endif /* USE_GFNI_AVX2 */ + #ifdef USE_AARCH64_SIMD extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, const byte *in, @@ -298,10 +368,18 @@ extern void _gcry_sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, size_t num_blocks); static inline unsigned int -sm4_aarch64_crypt_blk1_8(const void *rk, byte *out, const byte *in, - unsigned int num_blks) +sm4_aarch64_crypt_blk1_16(const void *rk, byte *out, const byte *in, + unsigned int num_blks) { - _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, (size_t)num_blks); + if (num_blks > 8) + { + _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, 8); + in += 8 * 16; + out += 8 * 16; + 
num_blks -= 8; + } + + _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, num_blks); return 0; } #endif /* USE_AARCH64_SIMD */ @@ -335,10 +413,18 @@ extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, size_t num_blocks); static inline unsigned int -sm4_armv8_ce_crypt_blk1_8(const void *rk, byte *out, const byte *in, - unsigned int num_blks) +sm4_armv8_ce_crypt_blk1_16(const void *rk, byte *out, const byte *in, + unsigned int num_blks) { - _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks); + if (num_blks > 8) + { + _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, 8); + in += 8 * 16; + out += 8 * 16; + num_blks -= 8; + } + + _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, num_blks); return 0; } #endif /* USE_ARM_CE */ @@ -411,6 +497,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key) u32 rk[4]; int i; +#ifdef USE_GFNI_AVX + if (ctx->use_gfni_avx) + { + _gcry_sm4_gfni_avx_expand_key (key, ctx->rkey_enc, ctx->rkey_dec, + fk, ck); + return; + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { @@ -483,6 +578,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AESNI_AVX2 ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_GFNI_AVX2 + ctx->use_gfni_avx2 = (hwf & HWF_INTEL_GFNI) && (hwf & HWF_INTEL_AVX2); +#endif #ifdef USE_AARCH64_SIMD ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); #endif @@ -535,9 +633,14 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + return sm4_gfni_avx2_crypt_blk1_16(ctx->rkey_enc, outbuf, inbuf, 1); +#endif + #ifdef USE_ARM_CE if (ctx->use_arm_ce) - return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_enc, outbuf, inbuf, 1); + return sm4_armv8_ce_crypt_blk1_16(ctx->rkey_enc, outbuf, inbuf, 1); #endif prefetch_sbox_table (); @@ -550,9 +653,14 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + return sm4_gfni_avx2_crypt_blk1_16(ctx->rkey_dec, outbuf, inbuf, 1); +#endif + #ifdef USE_ARM_CE if (ctx->use_arm_ce) - return sm4_armv8_ce_crypt_blk1_8(ctx->rkey_dec, outbuf, inbuf, 1); + return sm4_armv8_ce_crypt_blk1_16(ctx->rkey_dec, outbuf, inbuf, 1); #endif prefetch_sbox_table (); @@ -639,27 +747,33 @@ sm4_crypt_blocks (const void *ctx, byte *out, const byte *in, return burn_depth; } -static inline crypt_blk1_8_fn_t -sm4_get_crypt_blk1_8_fn(SM4_context *ctx) +static inline crypt_blk1_16_fn_t +sm4_get_crypt_blk1_16_fn(SM4_context *ctx) { if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_gfni_avx2) + { + return &sm4_gfni_avx2_crypt_blk1_16; + } +#endif #ifdef USE_AESNI_AVX else if (ctx->use_aesni_avx) { - return &sm4_aesni_avx_crypt_blk1_8; + return &sm4_aesni_avx_crypt_blk1_16; } #endif #ifdef USE_ARM_CE else if (ctx->use_arm_ce) { - return &sm4_armv8_ce_crypt_blk1_8; + return &sm4_armv8_ce_crypt_blk1_16; } #endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { - return &sm4_aarch64_crypt_blk1_8; + return &sm4_aarch64_crypt_blk1_16; } #endif else @@ -682,6 +796,21 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, const byte *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + { + /* Process data in 16 block chunks. 
*/ + while (nblocks >= 16) + { + _gcry_sm4_gfni_avx2_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -749,12 +878,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - byte tmpbuf[16 * 8]; + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); + byte tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_ctr_enc_128(ctx->rkey_enc, crypt_blk1_8, outbuf, inbuf, + nburn = bulk_ctr_enc_128(ctx->rkey_enc, crypt_blk1_16, outbuf, inbuf, nblocks, ctr, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; @@ -778,6 +907,21 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + { + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + _gcry_sm4_gfni_avx2_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -845,12 +989,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - unsigned char tmpbuf[16 * 8]; + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); + unsigned char tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_cbc_dec_128(ctx->rkey_dec, crypt_blk1_8, outbuf, inbuf, + nburn = bulk_cbc_dec_128(ctx->rkey_dec, crypt_blk1_16, outbuf, inbuf, nblocks, iv, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; @@ -874,6 +1018,21 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + { + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + _gcry_sm4_gfni_avx2_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -941,12 +1100,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - unsigned char tmpbuf[16 * 8]; + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); + unsigned char tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_cfb_dec_128(ctx->rkey_enc, crypt_blk1_8, outbuf, inbuf, + nburn = bulk_cfb_dec_128(ctx->rkey_enc, crypt_blk1_16, outbuf, inbuf, nblocks, iv, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; @@ -971,13 +1130,13 @@ _gcry_sm4_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg, /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); u32 *rk = encrypt ? 
ctx->rkey_enc : ctx->rkey_dec; - unsigned char tmpbuf[16 * 8]; + unsigned char tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_xts_crypt_128(rk, crypt_blk1_8, outbuf, inbuf, nblocks, + nburn = bulk_xts_crypt_128(rk, crypt_blk1_16, outbuf, inbuf, nblocks, tweak, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; @@ -1000,6 +1159,37 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, u64 blkn = c->u_mode.ocb.data_nblocks; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + { + u64 Ls[16]; + u64 *l; + + if (nblocks >= 16) + { + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); + + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + blkn += 16; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16); + + if (encrypt) + _gcry_sm4_gfni_avx2_ocb_enc(ctx->rkey_enc, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + else + _gcry_sm4_gfni_avx2_ocb_dec(ctx->rkey_dec, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -1065,13 +1255,13 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; - unsigned char tmpbuf[16 * 8]; + unsigned char tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_ocb_crypt_128 (c, rk, crypt_blk1_8, outbuf, inbuf, nblocks, + nburn = bulk_ocb_crypt_128 (c, rk, crypt_blk1_16, outbuf, inbuf, nblocks, &blkn, encrypt, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; @@ -1096,6 +1286,33 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) u64 blkn = c->u_mode.ocb.aad_nblocks; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX2 + if (ctx->use_gfni_avx2) + { + u64 Ls[16]; + u64 *l; + + if (nblocks >= 16) + { + l = bulk_ocb_prepare_L_pointers_array_blk16 (c, Ls, blkn); + + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + blkn += 16; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16); + + _gcry_sm4_gfni_avx2_ocb_auth(ctx->rkey_enc, abuf, + c->u_mode.ocb.aad_offset, + c->u_mode.ocb.aad_sum, Ls); + + nblocks -= 16; + abuf += 16 * 16; + } + } + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -1153,12 +1370,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) /* Process remaining blocks. */ if (nblocks) { - crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); - unsigned char tmpbuf[16 * 8]; + crypt_blk1_16_fn_t crypt_blk1_16 = sm4_get_crypt_blk1_16_fn(ctx); + unsigned char tmpbuf[16 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_ocb_auth_128 (c, ctx->rkey_enc, crypt_blk1_8, abuf, nblocks, + nburn = bulk_ocb_auth_128 (c, ctx->rkey_enc, crypt_blk1_16, abuf, nblocks, &blkn, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); burn_stack_depth = nburn > burn_stack_depth ? 
nburn : burn_stack_depth; diff --git a/configure.ac b/configure.ac index c5d61657..e63a7d6d 100644 --- a/configure.ac +++ b/configure.ac @@ -2842,6 +2842,7 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx-amd64.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx2-amd64.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-gfni-avx2-amd64.lo" ;; aarch64-*-*) # Build with the assembly implementation -- 2.34.1 From tianjia.zhang at linux.alibaba.com Mon Apr 25 10:15:17 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Mon, 25 Apr 2022 16:15:17 +0800 Subject: [PATCH 1/3] sm4: add XTS bulk processing In-Reply-To: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> References: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> Message-ID: <5bd16395-95d5-3d87-99e9-683e38863a60@linux.alibaba.com> Hi Jussi, On 4/25/22 2:47 AM, Jussi Kivilinna wrote: > * cipher/sm4.c (_gcry_sm4_xts_crypt): New. > (sm4_setkey): Set XTS bulk function. > -- > > Benchmark on Ryzen 5800X: > > Before: > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 7.28 ns/B 131.0 MiB/s 35.31 c/B 4850 > XTS dec | 7.29 ns/B 130.9 MiB/s 35.34 c/B 4850 > > After (4.8x faster): > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 1.49 ns/B 638.6 MiB/s 7.24 c/B 4850 > XTS dec | 1.49 ns/B 639.3 MiB/s 7.24 c/B 4850 > > Signed-off-by: Jussi Kivilinna > --- > cipher/sm4.c | 35 +++++++++++++++++++++++++++++++++++ > 1 file changed, 35 insertions(+) > > diff --git a/cipher/sm4.c b/cipher/sm4.c > index 4815b184..600850e2 100644 > --- a/cipher/sm4.c > +++ b/cipher/sm4.c > @@ -97,6 +97,9 @@ static void _gcry_sm4_cbc_dec (void *context, unsigned char *iv, > static void _gcry_sm4_cfb_dec (void *context, unsigned char *iv, > void *outbuf_arg, const void *inbuf_arg, > size_t nblocks); > +static void _gcry_sm4_xts_crypt (void *context, unsigned char *tweak, > + void *outbuf_arg, const void *inbuf_arg, > + size_t nblocks, int encrypt); > static size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, > const void *inbuf_arg, size_t nblocks, > int encrypt); > @@ -492,6 +495,7 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, > bulk_ops->cbc_dec = _gcry_sm4_cbc_dec; > bulk_ops->cfb_dec = _gcry_sm4_cfb_dec; > bulk_ops->ctr_enc = _gcry_sm4_ctr_enc; > + bulk_ops->xts_crypt = _gcry_sm4_xts_crypt; > bulk_ops->ocb_crypt = _gcry_sm4_ocb_crypt; > bulk_ops->ocb_auth = _gcry_sm4_ocb_auth; > > @@ -954,6 +958,37 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, > _gcry_burn_stack(burn_stack_depth); > } > > +/* Bulk encryption/decryption of complete blocks in XTS mode. */ > +static void > +_gcry_sm4_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg, > + const void *inbuf_arg, size_t nblocks, int encrypt) > +{ > + SM4_context *ctx = context; > + unsigned char *outbuf = outbuf_arg; > + const unsigned char *inbuf = inbuf_arg; > + int burn_stack_depth = 0; > + > + /* Process remaining blocks. */ > + if (nblocks) > + { > + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); > + u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; > + unsigned char tmpbuf[16 * 8]; > + unsigned int tmp_used = 16; > + size_t nburn; > + > + nburn = bulk_xts_crypt_128(rk, crypt_blk1_8, outbuf, inbuf, nblocks, > + tweak, tmpbuf, sizeof(tmpbuf) / 16, > + &tmp_used); > + burn_stack_depth = nburn > burn_stack_depth ? 
nburn : burn_stack_depth; > + > + wipememory(tmpbuf, tmp_used); > + } > + > + if (burn_stack_depth) > + _gcry_burn_stack(burn_stack_depth); > +} > + > /* Bulk encryption/decryption of complete blocks in OCB mode. */ > static size_t > _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, I can't successfully apply this series to master branch, patch 1/3 is successfully applied, patch 2/3 can't be successfully applied, it seems that some code modifications are missing, and patch 1/3 compiles with errors: sm4.c: In function '_gcry_sm4_xts_crypt': sm4.c:1081:7: error: unknown type name 'crypt_blk1_8_fn_t' 1081 | crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); | ^~~~~~~~~~~~~~~~~ sm4.c:1081:40: warning: implicit declaration of function 'sm4_get_crypt_blk1_8_fn' [-Wimplicit-function-declaration] 1081 | crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); | ^~~~~~~~~~~~~~~~~~~~~~~ sm4.c:1087:15: warning: implicit declaration of function 'bulk_xts_crypt_128' [-Wimplicit-function-declaration] 1087 | nburn = bulk_xts_crypt_128(rk, crypt_blk1_8, outbuf, inbuf, nblocks, | ^~~~~~~~~~~~~~~~~~ Best regards, Tianjia From tianjia.zhang at linux.alibaba.com Tue Apr 26 10:33:56 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 26 Apr 2022 16:33:56 +0800 Subject: [PATCH 1/3] sm4: add XTS bulk processing In-Reply-To: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> References: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> Message-ID: <94eb39fd-8f7b-9f74-80b3-6e239768880e@linux.alibaba.com> Hi Jussi, On 4/25/22 2:47 AM, Jussi Kivilinna wrote: > * cipher/sm4.c (_gcry_sm4_xts_crypt): New. > (sm4_setkey): Set XTS bulk function. > -- > > Benchmark on Ryzen 5800X: > > Before: > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 7.28 ns/B 131.0 MiB/s 35.31 c/B 4850 > XTS dec | 7.29 ns/B 130.9 MiB/s 35.34 c/B 4850 > > After (4.8x faster): > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 1.49 ns/B 638.6 MiB/s 7.24 c/B 4850 > XTS dec | 1.49 ns/B 639.3 MiB/s 7.24 c/B 4850 > > Signed-off-by: Jussi Kivilinna > --- > cipher/sm4.c | 35 +++++++++++++++++++++++++++++++++++ > 1 file changed, 35 insertions(+) > > diff --git a/cipher/sm4.c b/cipher/sm4.c > index 4815b184..600850e2 100644 > --- a/cipher/sm4.c > +++ b/cipher/sm4.c > @@ -97,6 +97,9 @@ static void _gcry_sm4_cbc_dec (void *context, unsigned char *iv, > static void _gcry_sm4_cfb_dec (void *context, unsigned char *iv, > void *outbuf_arg, const void *inbuf_arg, > size_t nblocks); > +static void _gcry_sm4_xts_crypt (void *context, unsigned char *tweak, > + void *outbuf_arg, const void *inbuf_arg, > + size_t nblocks, int encrypt); > static size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, > const void *inbuf_arg, size_t nblocks, > int encrypt); > @@ -492,6 +495,7 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, > bulk_ops->cbc_dec = _gcry_sm4_cbc_dec; > bulk_ops->cfb_dec = _gcry_sm4_cfb_dec; > bulk_ops->ctr_enc = _gcry_sm4_ctr_enc; > + bulk_ops->xts_crypt = _gcry_sm4_xts_crypt; > bulk_ops->ocb_crypt = _gcry_sm4_ocb_crypt; > bulk_ops->ocb_auth = _gcry_sm4_ocb_auth; > > @@ -954,6 +958,37 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, > _gcry_burn_stack(burn_stack_depth); > } > > +/* Bulk encryption/decryption of complete blocks in XTS mode. 
*/ > +static void > +_gcry_sm4_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg, > + const void *inbuf_arg, size_t nblocks, int encrypt) > +{ > + SM4_context *ctx = context; > + unsigned char *outbuf = outbuf_arg; > + const unsigned char *inbuf = inbuf_arg; > + int burn_stack_depth = 0; > + > + /* Process remaining blocks. */ > + if (nblocks) > + { > + crypt_blk1_8_fn_t crypt_blk1_8 = sm4_get_crypt_blk1_8_fn(ctx); > + u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; > + unsigned char tmpbuf[16 * 8]; > + unsigned int tmp_used = 16; > + size_t nburn; > + > + nburn = bulk_xts_crypt_128(rk, crypt_blk1_8, outbuf, inbuf, nblocks, > + tweak, tmpbuf, sizeof(tmpbuf) / 16, > + &tmp_used); > + burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth; > + > + wipememory(tmpbuf, tmp_used); > + } > + > + if (burn_stack_depth) > + _gcry_burn_stack(burn_stack_depth); > +} > + > /* Bulk encryption/decryption of complete blocks in OCB mode. */ > static size_t > _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, Thanks for the reply, this is a great job, I did some performance tests and reviews, but unfortunately I haven't found a machine that supports GFNI features at the moment, so for patch 1/3: Benchmark on Intel i5-6200U 2.30GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 13.41 ns/B 71.10 MiB/s 37.45 c/B 2792 XTS dec | 13.43 ns/B 71.03 MiB/s 37.49 c/B 2792 After (4.54x faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 2.96 ns/B 322.7 MiB/s 8.25 c/B 2792 XTS dec | 2.96 ns/B 322.5 MiB/s 8.26 c/B 2792 Reviewed-and-tested-by: Tianjia Zhang Best regards, Tianjia From tianjia.zhang at linux.alibaba.com Tue Apr 26 10:35:15 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 26 Apr 2022 16:35:15 +0800 Subject: [PATCH 3/3] sm4-aesni-avx2: add generic 1 to 16 block bulk processing function In-Reply-To: <20220424184703.2215215-3-jussi.kivilinna@iki.fi> References: <20220424184703.2215215-1-jussi.kivilinna@iki.fi> <20220424184703.2215215-3-jussi.kivilinna@iki.fi> Message-ID: <09e04d63-9f8b-1f70-7b84-680b8555e3d3@linux.alibaba.com> Hi Jussi, On 4/25/22 2:47 AM, Jussi Kivilinna wrote: > * cipher/sm4-aesni-avx2-amd64.S: Remove unnecessary vzeroupper at > function entries. > (_gcry_sm4_aesni_avx2_crypt_blk1_16): New. > * cipher/sm4.c (_gcry_sm4_aesni_avx2_crypt_blk1_16) > (sm4_aesni_avx2_crypt_blk1_16): New. > (sm4_get_crypt_blk1_16_fn) [USE_AESNI_AVX2]: Add > 'sm4_aesni_avx2_crypt_blk1_16'. > -- > > Benchmark AMD Ryzen 5800X: > > Before: > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 1.48 ns/B 643.2 MiB/s 7.19 c/B 4850 > XTS dec | 1.48 ns/B 644.3 MiB/s 7.18 c/B 4850 > > After (1.37x faster): > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > XTS enc | 1.07 ns/B 888.7 MiB/s 5.21 c/B 4850 > XTS dec | 1.07 ns/B 889.4 MiB/s 5.20 c/B 4850 > > Signed-off-by: Jussi Kivilinna > --- Benchmark on Intel i5-6200U 2.30GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 2.95 ns/B 323.0 MiB/s 8.25 c/B 2792 XTS dec | 2.95 ns/B 323.0 MiB/s 8.24 c/B 2792 After (1.64x faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz XTS enc | 1.79 ns/B 531.4 MiB/s 5.01 c/B 2791 XTS dec | 1.79 ns/B 531.6 MiB/s 5.01 c/B 2791 Reviewed-and-tested-by: Tianjia Zhang Best regards, Tianjia
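
For readers following the thread, below is a minimal, illustrative sketch of how the SM4 XTS bulk path discussed above is reached from application code through libgcrypt's public API. It is not part of the patches; it assumes a libgcrypt build recent enough to provide SM4 and XTS mode, and the key, tweak, and buffer contents are arbitrary placeholders used only to show the call sequence:

    #include <stdio.h>
    #include <string.h>
    #include <gcrypt.h>

    int
    main (void)
    {
      gcry_cipher_hd_t hd;
      gcry_error_t err;
      /* XTS uses a double-length key: 2 x 16 bytes for SM4.  The all-zero
         key and tweak here are placeholders for illustration only.  */
      unsigned char key[32] = { 0 };
      unsigned char tweak[16] = { 0 };
      unsigned char buf[16 * 64] = { 0 };  /* 64 blocks; enough data to hit
                                              the bulk processing path.  */

      if (!gcry_check_version (NULL))
        return 1;
      gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);

      err = gcry_cipher_open (&hd, GCRY_CIPHER_SM4, GCRY_CIPHER_MODE_XTS, 0);
      if (!err)
        err = gcry_cipher_setkey (hd, key, sizeof (key));
      if (!err)
        err = gcry_cipher_setiv (hd, tweak, sizeof (tweak));  /* XTS tweak */
      if (!err)
        /* In-place encryption: inbuf NULL, inlen 0.  */
        err = gcry_cipher_encrypt (hd, buf, sizeof (buf), NULL, 0);
      if (err)
        fprintf (stderr, "sm4-xts: %s\n", gcry_strerror (err));

      gcry_cipher_close (hd);
      return err ? 1 : 0;
    }

With these patches applied, a call like the one above routes complete blocks through the new _gcry_sm4_xts_crypt bulk function, which in turn uses the widest available crypt_blk1_16 implementation (GFNI/AVX2, AESNI/AVX2, ARMv8-CE, or AArch64 SIMD, depending on the host) via bulk_xts_crypt_128 -- the code path the benchmarks quoted in this thread are measuring.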