From falko.strenzke at mtg.de Thu Aug 3 13:49:08 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Thu, 3 Aug 2023 13:49:08 +0200
Subject: secmem limits for PQC schemes
Message-ID:

We are currently working on the implementation of the "CRYSTALS" schemes Kyber and Dilithium (SPHINCS+ to follow soon) in Libgcrypt. In the course of this work we came across a problem with the secure memory [^1] management in Libgcrypt. Namely, the current hard limit for secure memory is 32 kB. That seems to be a reasonable default value, as there are apparently indeed OSs which have limits for locked memory in this range. However, the heap memory requirements for the largest parameter sets of the CRYSTALS schemes are

- Kyber: 33,376 bytes (key generation)
- Dilithium: 135,968 bytes (probably also key generation, but not determined yet)

For Kyber we could possibly increase the default pool size to a still reasonable 64 kB. But in the case of multiple threads using Kyber operations this will still not suffice.

This raises the question of how to deal with this limitation. When the secure memory pool set up with the default size is exhausted, further requests for secure memory fail even in non-FIPS mode. This is not ideal, since many modern systems provide much higher margins for lockable memory. The possibilities I see are to

- implement an allocation function for the CRYSTALS schemes that first tries to allocate secure memory and, if that fails and FIPS mode is not activated, simply allocates non-secure memory (a sketch of this follows after this message);
- possibly rework the secure memory management so that it tries to lock further memory blocks when secure memory is requested after the initially set up pool is exhausted. On my Debian 11 x86 system, for instance, I have a limit of 4 MB for locked memory, which allows the rather pessimistic default value to be exceeded by orders of magnitude.

Could the Libgcrypt core developers please let us know their thoughts regarding these issues?

[^1]: i.e. heap memory that is locked so that it cannot be swapped to disk, and that is overwritten when freed

--
*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de

*MTG Exhibitions - See you in 2023*

------------------------------------------------------------------------

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.

Data protection information: Privacy policy
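(For illustration, a minimal sketch of the fallback-allocation idea from the first option above. gcry_malloc_secure(), gcry_malloc() and gcry_fips_mode_active() are the existing public libgcrypt API; the wrapper name is hypothetical, and an in-tree implementation would use libgcrypt's internal allocators instead.)

/* Sketch, not part of the original message: prefer the locked secmem pool,
 * fall back to normal heap memory when the pool is exhausted and FIPS mode
 * is not active. */
#include <gcrypt.h>

static void *
pqc_trymalloc_secure_with_fallback (size_t n)
{
  void *p = gcry_malloc_secure (n);   /* try the locked secure memory pool first */

  if (!p && !gcry_fips_mode_active ())
    p = gcry_malloc (n);              /* pool exhausted: fall back to non-secure heap */

  return p;
}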
Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: Kryptografische S/MIME-Signatur URL: From jussi.kivilinna at iki.fi Wed Aug 9 19:56:42 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 9 Aug 2023 20:56:42 +0300 Subject: [PATCH] Avoid VPGATHER usage for most of Intel CPUs Message-ID: <20230809175642.26581-1-jussi.kivilinna@iki.fi> * cipher/blake2.c (blake2b_init_ctx): Check for fast VPGATHER for AVX512 implementation. * src/hwf-x86.c (detect_x86_gnuc): Do not enable HWF_INTEL_FAST_VPGATHER for Intel CPUs suffering from "Downfall" vulnerability. -- VPGATHER used to be fast on Intel CPU from Skylake to Tiger Lake, but instruction is now very slow on these CPUs (slower than on Haswell) because of mitigation introduced in new microcode version for "Downfall" speculative execution vulnerability. Signed-off-by: Jussi Kivilinna --- cipher/blake2.c | 3 ++- src/hwf-x86.c | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/cipher/blake2.c b/cipher/blake2.c index 45f74a56..637eebbd 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -494,7 +494,8 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, c->use_avx2 = !!(features & HWF_INTEL_AVX2); #endif #ifdef USE_AVX512 - c->use_avx512 = !!(features & HWF_INTEL_AVX512); + c->use_avx512 = (features & HWF_INTEL_AVX512) + && (features & HWF_INTEL_FAST_VPGATHER); #endif c->outlen = dbits / 8; diff --git a/src/hwf-x86.c b/src/hwf-x86.c index 5240a460..bda14d9d 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -424,6 +424,36 @@ detect_x86_gnuc (void) avoid_vpgather |= 1; break; } + + /* These Intel Core processors (skylake to tigerlake) have slow VPGATHER + * because of mitigation introduced by new microcode (2023-08-08) for + * "Downfall" speculative execution vulnerability. */ + switch (model) + { + /* Skylake, Cascade Lake, Cooper Lake */ + case 0x4E: + case 0x5E: + case 0x55: + /* Kaby Lake, Coffee Lake, Whiskey Lake, Amber Lake */ + case 0x8E: + case 0x9E: + /* Cannon Lake */ + case 0x66: + /* Comet Lake */ + case 0xA5: + case 0xA6: + /* Ice Lake */ + case 0x7E: + case 0x6A: + case 0x6C: + /* Tiger Lake */ + case 0x8C: + case 0x8D: + /* Rocket Lake */ + case 0xA7: + avoid_vpgather |= 1; + break; + } } else if (is_amd_cpu) { -- 2.39.2 From jussi.kivilinna at iki.fi Sun Aug 13 14:40:25 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 13 Aug 2023 15:40:25 +0300 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather Message-ID: <20230813124025.789901-1-jussi.kivilinna@iki.fi> * cipher/twofish-avx2-amd64.S (do_gather): New. (g16): Switch to use 'do_gather' instead of VPGATHER instruction. (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack for 'do_gather'. -- As VPGATHER is now slow on majority of CPUs (because of "Downfall"), switch twofish-avx2 implementation to use manual memory gathering instead. 
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 After (~3.1x faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4090 ECB dec | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4089 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 After (~6% faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.78 ns/B 534.7 MiB/s 8.38 c/B 4700 ECB dec | 1.79 ns/B 533.7 MiB/s 8.40 c/B 4700 Signed-off-by: Jussi Kivilinna --- cipher/twofish-avx2-amd64.S | 168 ++++++++++++++++++++++++------------ cipher/twofish.c | 6 +- 2 files changed, 113 insertions(+), 61 deletions(-) diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S index d05ec1f9..2207ac57 100644 --- a/cipher/twofish-avx2-amd64.S +++ b/cipher/twofish-avx2-amd64.S @@ -39,14 +39,20 @@ /* register macros */ #define CTX %rdi -#define RROUND %r12 -#define RROUNDd %r12d +#define RROUND %r13 +#define RROUNDd %r13d #define RS0 CTX #define RS1 %r8 #define RS2 %r9 #define RS3 %r10 #define RK %r11 -#define RW %rax +#define RW %r12 +#define RIDX0 %rax +#define RIDX0d %eax +#define RIDX1 %rbx +#define RIDX1d %ebx +#define RIDX2 %r14 +#define RIDX3 %r15 #define RA0 %ymm8 #define RB0 %ymm9 @@ -63,14 +69,14 @@ #define RX1 %ymm2 #define RY1 %ymm3 #define RT0 %ymm4 -#define RIDX %ymm5 +#define RT1 %ymm5 #define RX0x %xmm0 #define RY0x %xmm1 #define RX1x %xmm2 #define RY1x %xmm3 #define RT0x %xmm4 -#define RIDXx %xmm5 +#define RT1x %xmm5 #define RTMP0 RX0 #define RTMP0x RX0x @@ -80,8 +86,8 @@ #define RTMP2x RY0x #define RTMP3 RY1 #define RTMP3x RY1x -#define RTMP4 RIDX -#define RTMP4x RIDXx +#define RTMP4 RT1 +#define RTMP4x RT1x /* vpgatherdd mask and '-1' */ #define RNOT %ymm6 @@ -102,48 +108,42 @@ leaq s2(CTX), RS2; \ leaq s3(CTX), RS3; \ +#define do_gather(stoffs, byteoffs, rs, out) \ + movzbl (stoffs + 0*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 1*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 2*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 3*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT1x; \ + vpinsrd $1, (rs, RIDX1, 4), RT1x, RT1x; \ + vpinsrd $2, (rs, RIDX2, 4), RT1x, RT1x; \ + vpinsrd $3, (rs, RIDX3, 4), RT1x, RT1x; \ + movzbl (stoffs + 4*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 5*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 6*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 7*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT0x; \ + vpinsrd $1, (rs, RIDX1, 4), RT0x, RT0x; \ + vpinsrd $2, (rs, RIDX2, 4), RT0x, RT0x; \ + vpinsrd $3, (rs, RIDX3, 4), RT0x, RT0x; \ + vinserti128 $1, RT0x, RT1, out; + #define g16(ab, rs0, rs1, rs2, rs3, xy) \ - vpand RBYTE, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpand RBYTE, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 1; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpsrld $8, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $8, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor 
RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $16, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $16, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $24, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $24, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; + vmovdqa ab ## 0, 0(%rsp); \ + vmovdqa ab ## 1, 32(%rsp); \ + do_gather(0*32, 0, rs0, xy ## 0); \ + do_gather(1*32, 0, rs0, xy ## 1); \ + do_gather(0*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; #define g1_16(a, x) \ g16(a, RS0, RS1, RS2, RS3, x); @@ -375,8 +375,23 @@ __twofish_enc_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -400,8 +415,21 @@ __twofish_enc_blk16: outunpack_enc16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); @@ -420,8 +448,23 @@ __twofish_dec_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -444,8 +487,21 @@ __twofish_dec_blk16: outunpack_dec16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); diff --git a/cipher/twofish.c 
b/cipher/twofish.c index 74061913..11a6e251 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -767,11 +767,7 @@ twofish_setkey (void *context, const byte *key, unsigned int keylen, rc = do_twofish_setkey (ctx, key, keylen); #ifdef USE_AVX2 - ctx->use_avx2 = 0; - if ((hwfeatures & HWF_INTEL_AVX2) && (hwfeatures & HWF_INTEL_FAST_VPGATHER)) - { - ctx->use_avx2 = 1; - } + ctx->use_avx2 = (hwfeatures & HWF_INTEL_AVX2) != 0; #endif /* Setup bulk encryption routines. */ -- 2.39.2 From jcb62281 at gmail.com Mon Aug 14 04:47:44 2023 From: jcb62281 at gmail.com (Jacob Bachmeyer) Date: Sun, 13 Aug 2023 21:47:44 -0500 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather In-Reply-To: <20230813124025.789901-1-jussi.kivilinna@iki.fi> References: <20230813124025.789901-1-jussi.kivilinna@iki.fi> Message-ID: <64D995D0.2090305@gmail.com> Jussi Kivilinna wrote: > * cipher/twofish-avx2-amd64.S (do_gather): New. > (g16): Switch to use 'do_gather' instead of VPGATHER instruction. > (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack > for 'do_gather'. > -- > > As VPGATHER is now slow on majority of CPUs (because of "Downfall"), > switch twofish-avx2 implementation to use manual memory gathering > instead. > > Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated > microcode): > > Before: > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 > ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 > > After (~3.1x faster): > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4090 > ECB dec | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4089 > > Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): > > Before: > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 > ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 > > After (~6% faster): > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 1.78 ns/B 534.7 MiB/s 8.38 c/B 4700 > ECB dec | 1.79 ns/B 533.7 MiB/s 8.40 c/B 4700 > Obviously, do_gather is bouncing the data around in the cache, but the fact that this change is a performance improvement on a processor not affected by "Downfall" strongly suggests that using VPGATHER may have been suboptimal from the start. Can you do a third test on the i3-1115G4 with older microcode? Would this patch have actually improved performance in all cases? Was using VPGATHER a waste of time the whole time? Do we need to be more skeptical about new SSE/AVX/etc. opcodes in the future? -- Jacob From jussi.kivilinna at iki.fi Mon Aug 14 18:24:05 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 14 Aug 2023 19:24:05 +0300 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather In-Reply-To: <64D995D0.2090305@gmail.com> References: <20230813124025.789901-1-jussi.kivilinna@iki.fi> <64D995D0.2090305@gmail.com> Message-ID: <5935b345-2e40-073e-77c3-f92e353279af@iki.fi> On 14.8.2023 5.47, Jacob Bachmeyer wrote: > Jussi Kivilinna wrote: >> * cipher/twofish-avx2-amd64.S (do_gather): New. >> (g16): Switch to use 'do_gather' instead of VPGATHER instruction. >> (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack >> for 'do_gather'. >> -- >> >> As VPGATHER is now slow on majority of CPUs (because of "Downfall"), >> switch twofish-avx2 implementation to use manual memory gathering >> instead. 
>>
>> Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
>> microcode):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
>>         ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
>>
>> After (~3.1x faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4090
>>         ECB dec |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4089
>>
>> Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
>>         ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
>>
>> After (~6% faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.78 ns/B     534.7 MiB/s      8.38 c/B      4700
>>         ECB dec |      1.79 ns/B     533.7 MiB/s      8.40 c/B      4700
>
> Obviously, do_gather is bouncing the data around in the cache, but the fact that this change is a performance improvement on a processor not affected by "Downfall" strongly suggests that using VPGATHER may have been suboptimal from the start. Can you do a third test on the i3-1115G4 with older microcode? Would this patch have actually improved performance in all cases?
>

VPGATHER used to be faster than manual gather starting with Intel Skylake. Old results on this i3-1115G4 show ~6.5 c/B for Twofish-CTR. The interesting thing is that older Intel CPUs with AVX2 had a slower VPGATHER implementation, and those are not affected by "Downfall". For AMD CPUs, VPGATHER has been slower, getting a tiny bit faster from generation to generation. With Zen4, gather performance was finally good enough that the twofish-avx2 implementation beat the twofish-3way-asm implementation, so I enabled the HWF_INTEL_FAST_VPGATHER HW-feature for AMD Zen4+ CPUs.

> Was using VPGATHER a waste of time the whole time? Do we need to be more skeptical about new SSE/AVX/etc. opcodes in the future?
>

I don't think so; VPGATHER really was quite a bit faster on Intel Skylake+ CPUs. As for being skeptical, I think the problem is not so much with specific opcodes as with optimizations that have been, or get, baked into microarchitectures.

-Jussi

>
> -- Jacob
>

From jussi.kivilinna at iki.fi Sun Aug 20 17:31:54 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 20 Aug 2023 18:31:54 +0300
Subject: [PATCH v2] twofish-avx2-amd64: replace VPGATHER with manual gather
Message-ID: <20230820153155.382969-1-jussi.kivilinna@iki.fi>

* cipher/twofish-avx2-amd64.S (do_gather): New.
(g16): Switch to use 'do_gather' instead of VPGATHER instruction.
(__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack for 'do_gather'.
* cipher/twofish.c (twofish) [USE_AVX2]: Remove now unneeded
HWF_INTEL_FAST_VPGATHER check.
--

As VPGATHER is now slow on the majority of CPUs (because of "Downfall"), switch the twofish-avx2 implementation to use manual memory gathering instead.
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 After (~3.2x faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 2.19 ns/B 435.5 MiB/s 8.95 c/B 4089 ECB dec | 2.19 ns/B 436.2 MiB/s 8.94 c/B 4089 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 After (~9% faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.74 ns/B 547.9 MiB/s 8.18 c/B 4700 ECB dec | 1.74 ns/B 547.8 MiB/s 8.18 c/B 4700 [v2]: - reorder memory operations in do_gather for small performance increase. Signed-off-by: Jussi Kivilinna --- cipher/twofish-avx2-amd64.S | 168 ++++++++++++++++++++++++------------ cipher/twofish.c | 6 +- 2 files changed, 113 insertions(+), 61 deletions(-) diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S index d05ec1f9..3f61f87b 100644 --- a/cipher/twofish-avx2-amd64.S +++ b/cipher/twofish-avx2-amd64.S @@ -39,14 +39,20 @@ /* register macros */ #define CTX %rdi -#define RROUND %r12 -#define RROUNDd %r12d +#define RROUND %r13 +#define RROUNDd %r13d #define RS0 CTX #define RS1 %r8 #define RS2 %r9 #define RS3 %r10 #define RK %r11 -#define RW %rax +#define RW %r12 +#define RIDX0 %rax +#define RIDX0d %eax +#define RIDX1 %rbx +#define RIDX1d %ebx +#define RIDX2 %r14 +#define RIDX3 %r15 #define RA0 %ymm8 #define RB0 %ymm9 @@ -63,14 +69,14 @@ #define RX1 %ymm2 #define RY1 %ymm3 #define RT0 %ymm4 -#define RIDX %ymm5 +#define RT1 %ymm5 #define RX0x %xmm0 #define RY0x %xmm1 #define RX1x %xmm2 #define RY1x %xmm3 #define RT0x %xmm4 -#define RIDXx %xmm5 +#define RT1x %xmm5 #define RTMP0 RX0 #define RTMP0x RX0x @@ -80,8 +86,8 @@ #define RTMP2x RY0x #define RTMP3 RY1 #define RTMP3x RY1x -#define RTMP4 RIDX -#define RTMP4x RIDXx +#define RTMP4 RT1 +#define RTMP4x RT1x /* vpgatherdd mask and '-1' */ #define RNOT %ymm6 @@ -102,48 +108,42 @@ leaq s2(CTX), RS2; \ leaq s3(CTX), RS3; \ +#define do_gather(stoffs, byteoffs, rs, out) \ + movzbl (stoffs + 0*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 1*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 2*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 3*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT1x; \ + movzbl (stoffs + 4*4 + byteoffs)(%rsp), RIDX0d; \ + vmovd (rs, RIDX0, 4), RT0x; \ + vpinsrd $1, (rs, RIDX1, 4), RT1x, RT1x; \ + movzbl (stoffs + 5*4 + byteoffs)(%rsp), RIDX1d; \ + vpinsrd $1, (rs, RIDX1, 4), RT0x, RT0x; \ + vpinsrd $2, (rs, RIDX2, 4), RT1x, RT1x; \ + movzbq (stoffs + 6*4 + byteoffs)(%rsp), RIDX2; \ + vpinsrd $2, (rs, RIDX2, 4), RT0x, RT0x; \ + vpinsrd $3, (rs, RIDX3, 4), RT1x, RT1x; \ + movzbq (stoffs + 7*4 + byteoffs)(%rsp), RIDX3; \ + vpinsrd $3, (rs, RIDX3, 4), RT0x, RT0x; \ + vinserti128 $1, RT0x, RT1, out; + #define g16(ab, rs0, rs1, rs2, rs3, xy) \ - vpand RBYTE, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpand RBYTE, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 1; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpsrld $8, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $8, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - 
vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $16, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $16, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $24, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $24, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; + vmovdqa ab ## 0, 0(%rsp); \ + vmovdqa ab ## 1, 32(%rsp); \ + do_gather(0*32, 0, rs0, xy ## 0); \ + do_gather(1*32, 0, rs0, xy ## 1); \ + do_gather(0*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; #define g1_16(a, x) \ g16(a, RS0, RS1, RS2, RS3, x); @@ -375,8 +375,23 @@ __twofish_enc_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -400,8 +415,21 @@ __twofish_enc_blk16: outunpack_enc16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); @@ -420,8 +448,23 @@ __twofish_dec_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -444,8 +487,21 @@ __twofish_dec_blk16: outunpack_dec16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + 
CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); diff --git a/cipher/twofish.c b/cipher/twofish.c index 74061913..11a6e251 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -767,11 +767,7 @@ twofish_setkey (void *context, const byte *key, unsigned int keylen, rc = do_twofish_setkey (ctx, key, keylen); #ifdef USE_AVX2 - ctx->use_avx2 = 0; - if ((hwfeatures & HWF_INTEL_AVX2) && (hwfeatures & HWF_INTEL_FAST_VPGATHER)) - { - ctx->use_avx2 = 1; - } + ctx->use_avx2 = (hwfeatures & HWF_INTEL_AVX2) != 0; #endif /* Setup bulk encryption routines. */ -- 2.39.2 From jussi.kivilinna at iki.fi Sun Aug 20 17:31:55 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Aug 2023 18:31:55 +0300 Subject: [PATCH] blake2b-avx512: replace VPGATHER with manual gather In-Reply-To: <20230820153155.382969-1-jussi.kivilinna@iki.fi> References: <20230820153155.382969-1-jussi.kivilinna@iki.fi> Message-ID: <20230820153155.382969-2-jussi.kivilinna@iki.fi> * cipher/blake2.c (blake2b_init_ctx): Remove HWF_INTEL_FAST_VPGATHER check for AVX512 implementation. * cipher/blake2b-amd64-avx512.S (R16, VPINSRQ_KMASK, .Lshuf_ror16) (.Lk1_mask): New. (GEN_GMASK, RESET_KMASKS, .Lgmask*): Remove. (GATHER_MSG): Use manual gather instead of VPGATHER. (ROR_16): Use vpshufb for small speed improvement on tigerlake. (_gcry_blake2b_transform_amd64_avx512): New setup & clean-up for kmask registers; Reduce excess loop aligned from 64B to 16B. -- As VPGATHER is now slow on majority of CPUs (because of "Downfall"), switch blake2b-avx512 implementation to use manual memory gathering instead. Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Old before "Downfall" (commit 909daa700e4b45d75469df298ee564b8fc2f4b72): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.705 ns/B 1353 MiB/s 2.88 c/B 4088 Old after "Downfall" (~3.0x slower): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 2.11 ns/B 451.3 MiB/s 8.64 c/B 4089 New (same as before "Downfall"): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.705 ns/B 1353 MiB/s 2.88 c/B 4090 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Old: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.793 ns/B 1203 MiB/s 3.73 c/B 4700 New (~3% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.771 ns/B 1237 MiB/s 3.62 c/B 4700 Signed-off-by: Jussi Kivilinna --- cipher/blake2.c | 3 +- cipher/blake2b-amd64-avx512.S | 140 ++++++++++++++++------------------ 2 files changed, 65 insertions(+), 78 deletions(-) diff --git a/cipher/blake2.c b/cipher/blake2.c index 637eebbd..45f74a56 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -494,8 +494,7 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, c->use_avx2 = !!(features & HWF_INTEL_AVX2); #endif #ifdef USE_AVX512 - c->use_avx512 = (features & HWF_INTEL_AVX512) - && (features & HWF_INTEL_FAST_VPGATHER); + c->use_avx512 = !!(features & HWF_INTEL_AVX512); #endif c->outlen = dbits / 8; diff --git a/cipher/blake2b-amd64-avx512.S b/cipher/blake2b-amd64-avx512.S index fe938730..3a04818c 100644 --- a/cipher/blake2b-amd64-avx512.S +++ b/cipher/blake2b-amd64-avx512.S @@ -49,6 +49,7 @@ #define ROW4 %ymm3 #define TMP1 %ymm4 #define TMP1x %xmm4 +#define R16 %ymm13 #define MA1 %ymm5 #define MA2 %ymm6 @@ -72,64 +73,65 @@ blake2b/AVX2 **********************************************************************/ -#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, gather_masks) \ - vmovdqa gather_masks + 
(4*4) * 0 rRIP, m2x; \ - vmovdqa gather_masks + (4*4) * 1 rRIP, m3x; \ - vmovdqa gather_masks + (4*4) * 2 rRIP, m4x; \ - vmovdqa gather_masks + (4*4) * 3 rRIP, TMP1x; \ - vpgatherdq (RINBLKS, m2x), m1 {%k1}; \ - vpgatherdq (RINBLKS, m3x), m2 {%k2}; \ - vpgatherdq (RINBLKS, m4x), m3 {%k3}; \ - vpgatherdq (RINBLKS, TMP1x), m4 {%k4} - -#define GEN_GMASK(s0, s1, s2, s3, s4, s5, s6, s7, \ - s8, s9, s10, s11, s12, s13, s14, s15) \ - .long (s0)*8, (s2)*8, (s4)*8, (s6)*8, \ - (s1)*8, (s3)*8, (s5)*8, (s7)*8, \ - (s8)*8, (s10)*8, (s12)*8, (s14)*8, \ - (s9)*8, (s11)*8, (s13)*8, (s15)*8 - -#define RESET_KMASKS() \ - kmovw %k0, %k1; \ - kmovw %k0, %k2; \ - kmovw %k0, %k3; \ - kmovw %k0, %k4 +/* Load one qword value at memory location MEM to specific element in + * target register VREG. Note, KPOS needs to contain value "(1 << QPOS)". */ +#define VPINSRQ_KMASK(kpos, qpos, mem, vreg) \ + vmovdqu64 -((qpos) * 8) + mem, vreg {kpos} + +#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovq (s0)*8(RINBLKS), m1x; \ + vmovq (s1)*8(RINBLKS), m2x; \ + vmovq (s8)*8(RINBLKS), m3x; \ + vmovq (s9)*8(RINBLKS), m4x; \ + VPINSRQ_KMASK(%k1, 1, (s2)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k1, 1, (s3)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k1, 1, (s10)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k1, 1, (s11)*8(RINBLKS), m4); \ + VPINSRQ_KMASK(%k2, 2, (s4)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k2, 2, (s5)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k2, 2, (s12)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k2, 2, (s13)*8(RINBLKS), m4); \ + VPINSRQ_KMASK(%k3, 3, (s6)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k3, 3, (s7)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k3, 3, (s14)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k3, 3, (s15)*8(RINBLKS), m4); #define LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask0); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) #define LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask1); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) #define LOAD_MSG_2(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask2); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) #define LOAD_MSG_3(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask3); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) #define LOAD_MSG_4(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask4); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) #define LOAD_MSG_5(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask5); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) #define LOAD_MSG_6(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask6); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) #define LOAD_MSG_7(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, 
m3, m4, m1x, m2x, m3x, m4x, .Lgmask7); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) #define LOAD_MSG_8(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask8); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) #define LOAD_MSG_9(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask9); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) #define LOAD_MSG_10(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask0); \ - RESET_KMASKS() + LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) #define LOAD_MSG_11(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask1); + LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) #define LOAD_MSG(r, m1, m2, m3, m4) \ LOAD_MSG_##r(m1, m2, m3, m4, m1##x, m2##x, m3##x, m4##x) @@ -138,7 +140,7 @@ #define ROR_24(in, out) vprorq $24, in, out -#define ROR_16(in, out) vprorq $16, in, out +#define ROR_16(in, out) vpshufb R16, in, out #define ROR_63(in, out) vprorq $63, in, out @@ -188,26 +190,10 @@ _blake2b_avx512_data: .quad 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1 .quad 0x510e527fade682d1, 0x9b05688c2b3e6c1f .quad 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179 -.Lgmask0: - GEN_GMASK(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) -.Lgmask1: - GEN_GMASK(14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) -.Lgmask2: - GEN_GMASK(11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) -.Lgmask3: - GEN_GMASK(7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) -.Lgmask4: - GEN_GMASK(9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) -.Lgmask5: - GEN_GMASK(2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) -.Lgmask6: - GEN_GMASK(12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) -.Lgmask7: - GEN_GMASK(13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) -.Lgmask8: - GEN_GMASK(6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) -.Lgmask9: - GEN_GMASK(10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) +.Lshuf_ror16: + .byte 2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9 +.Lk1_mask: + .byte (1 << 1) .text @@ -225,14 +211,15 @@ _gcry_blake2b_transform_amd64_avx512: spec_stop_avx512; - movl $0xf, %eax; - kmovw %eax, %k0; - xorl %eax, %eax; - RESET_KMASKS(); + kmovb .Lk1_mask rRIP, %k1; + kshiftlb $1, %k1, %k2; + kshiftlb $2, %k1, %k3; addq $128, (STATE_T + 0)(RSTATE); adcq $0, (STATE_T + 8)(RSTATE); + vbroadcasti128 .Lshuf_ror16 rRIP, R16; + vmovdqa .Liv+(0 * 8) rRIP, ROW3; vmovdqa .Liv+(4 * 8) rRIP, ROW4; @@ -243,9 +230,8 @@ _gcry_blake2b_transform_amd64_avx512: LOAD_MSG(0, MA1, MA2, MA3, MA4); LOAD_MSG(1, MB1, MB2, MB3, MB4); - jmp .Loop; -.align 64, 0xcc +.align 16 .Loop: ROUND(0, MA1, MA2, MA3, MA4); LOAD_MSG(2, MA1, MA2, MA3, MA4); @@ -269,7 +255,6 @@ _gcry_blake2b_transform_amd64_avx512: LOAD_MSG(11, MB1, MB2, MB3, MB4); sub $1, RNBLKS; jz .Loop_end; - RESET_KMASKS(); lea 128(RINBLKS), RINBLKS; addq $128, (STATE_T + 0)(RSTATE); @@ -293,7 +278,7 @@ _gcry_blake2b_transform_amd64_avx512: jmp .Loop; -.align 64, 0xcc +.align 16 .Loop_end: ROUND(10, MA1, MA2, MA3, MA4); ROUND(11, MB1, MB2, MB3, MB4); @@ -304,9 +289,12 @@ _gcry_blake2b_transform_amd64_avx512: vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); - kxorw %k0, %k0, %k0; + xorl %eax, 
%eax; + kxord %k1, %k1, %k1; + kxord %k2, %k2, %k2; + kxord %k3, %k3, %k3; + vzeroall; - RESET_KMASKS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_blake2b_transform_amd64_avx512, -- 2.39.2

From falko.strenzke at mtg.de Tue Aug 22 13:49:04 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Tue, 22 Aug 2023 13:49:04 +0200
Subject: KMAC / cSHAKE in Libgcrypt
Message-ID: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de>

We are currently working on the integration of PQC algorithms in Libgcrypt based on draft-wussler-openpgp-pqc and will also add KMAC to Libgcrypt, since this algorithm is used for the key derivation inside the key combiner.

KMAC is based on cSHAKE, which is a variant of SHAKE that requires a different final bit padding than SHAKE and is currently not implemented in Libgcrypt. cSHAKE128 is defined as

  cSHAKE128(X, L, N, S):
  1. If N = "" and S = "": return SHAKE128(X, L);
  2. Else: return KECCAK[256](bytepad(encode_string(N) || encode_string(S), 168) || X || 00, L)

(cSHAKE256 is defined analogously with SHAKE256, KECCAK[512] and rate 136.)

In order to support the additional arguments N and S, I propose the following approach:

* cSHAKE is added as an XOF message digest like SHAKE.
* For the purpose of providing the additional arguments N and S we add

  typedef enum
    {
      GCRY_MD_ADDIN_CSHAKE_N = 1,
      GCRY_MD_ADDIN_CSHAKE_S = 2
    } gcry_md_add_input_t;

  gcry_error_t gcry_md_set_add_input (gcry_md_hd_t *h,
                                      gcry_md_add_input_t addin_type,
                                      const void *v, size_t v_len);

In order to invoke cSHAKE with non-empty N and S parameters, two calls to gcry_md_set_add_input() have to be made after the call to _gcry_md_open(), setting N and S in that order. If data is added without these calls having been made, the digest behaves as plain SHAKE, as required by the specification. (A usage sketch follows after this message.)

Does anyone have any thoughts on this?

- Falko

--
*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de

*MTG Exhibitions - See you in 2023*

------------------------------------------------------------------------

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.

Data protection information: Privacy policy
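(For illustration, a usage sketch of the interface proposed above. gcry_md_open(), gcry_md_write(), gcry_md_extract() and gcry_md_close() are the existing libgcrypt API; GCRY_MD_CSHAKE256, gcry_md_set_add_input() and the GCRY_MD_ADDIN_* constants are the proposed, not yet existing, additions.)

/* Sketch of how the proposed cSHAKE interface would be used.  The cSHAKE
 * algorithm identifier and the set_add_input call are hypothetical. */
#include <gcrypt.h>

static gcry_error_t
cshake256_with_customization (const void *data, size_t datalen,
                              unsigned char *out, size_t outlen)
{
  gcry_md_hd_t hd;
  gcry_error_t err;

  err = gcry_md_open (&hd, GCRY_MD_CSHAKE256, 0);   /* proposed algo id */
  if (err)
    return err;

  /* Set the function-name string N and the customization string S before
   * any data is written; leaving both unset yields plain SHAKE256. */
  err = gcry_md_set_add_input (&hd, GCRY_MD_ADDIN_CSHAKE_N, "", 0);
  if (!err)
    err = gcry_md_set_add_input (&hd, GCRY_MD_ADDIN_CSHAKE_S,
                                 "Email Signature", 15);
  if (!err)
    {
      gcry_md_write (hd, data, datalen);
      err = gcry_md_extract (hd, GCRY_MD_CSHAKE256, out, outlen);
    }

  gcry_md_close (hd);
  return err;
}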
Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: Kryptografische S/MIME-Signatur URL: From jussi.kivilinna at iki.fi Thu Aug 31 19:12:16 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 31 Aug 2023 20:12:16 +0300 Subject: KMAC / cSHAKE in Libgcrypt In-Reply-To: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de> References: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de> Message-ID: Hello, On 22.8.2023 14.49, Falko Strenzke wrote: > We are currently working on the integration of PQC algorithms in Libgcrypt based on draft-wussler-openpgp-pqc and will also add KMAC to Libgcrypt since this algorithm is used for the key derivation inside the key combiner. > > KMAC is based on cSHAKE , which is variant of SHAKE that requires a different final bit padding than SHAKE and is currently not implemented in Libgcrypt. cSHAKE is defined as > > |cSHAKE(X, L, N, S): 1. If N = "" and S = "": return SHAKE256(X, L); 2. Else: return KECCAK[256](bytepad(encode_string(N) || encode_string(S), 168) || X || 00, L) | > > In order to support the additional arguments N and S, I propose the following approach: > > * > > cSHAKE is added as an XOF message digest like SHAKE > > * > > For the purpose of providing the additional arguments N and S we add > > |typedef enum { GCRY_MD_ADDIN_CSHAKE_N = 1, GCRY_MD_ADDIN_CSHAKE_S = 2 } gcry_md_add_input_t; gcry_error_t gcry_md_set_add_input (gcry_md_hd_t *h, gcry_md_add_input_t addin_type, const void* v, size_t v_len) | > > In order to invoke cSHAKE with non-empty N and S parameters, after the call to |_gcry_md_open()|, two calls to |gcry_md_set_add_input()| have to be made to set N and S in that order. If data is added without having made these calls, then it will behave as normal SHAKE as required by the specification. > > Does anyone have any thoughts on this? I checked cSHAKE spec and think that interface is good way for passing these parameters. I first thought about having user to pass encoded N and S to gcry_md_write but that would mean that user needs to implement encode_string function from cSHAKE spec which would not work. One additional thing to consider is the _gcry_md_hash_buffers_extract internal interface. There cSHAKE could take first two IO buffers as N and S strings and following buffers as actual data. This would be similar to how HMAC work through this interface, where first IO buffer is used as HMAC key and remaining buffers as data. -Jussi > > - Falko > > -- > > *MTG AG* > Dr. Falko Strenzke > Executive System Architect > > Phone: +49 6151 8000 24 > E-Mail: falko.strenzke at mtg.de > Web: mtg.de > > > *MTG Exhibitions ? 
See you in 2023*
>
> ------------------------------------------------------------------------
>
>
> MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
> Commercial register: HRB 8901
> Register Court: Amtsgericht Darmstadt
> Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
> Chairman of the Supervisory Board: Dr. Thomas Milde
>
> This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.
>
> Data protection information: Privacy policy
>
>
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> https://lists.gnupg.org/mailman/listinfo/gcrypt-devel
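(For illustration, a sketch of the left_encode/encode_string helpers from NIST SP 800-185 that cSHAKE prepends to N and S, i.e. the encoding a caller would otherwise have to implement if pre-encoded strings were passed in, as mentioned in the reply above. The code is illustrative only and not part of Libgcrypt; lengths are restricted to what fits into a size_t.)

/* Sketch of the cSHAKE string encoding from NIST SP 800-185.
 * encode_string(S) = left_encode(bitlen(S)) || S; cSHAKE absorbs
 * bytepad(encode_string(N) || encode_string(S), rate) before the message. */
#include <stddef.h>
#include <string.h>

/* left_encode(x): one length byte n followed by x as n big-endian bytes. */
static size_t
left_encode (size_t x, unsigned char *out)
{
  size_t n = 1;

  while (n < sizeof (x) && (x >> (8 * n)) != 0)
    n++;
  out[0] = (unsigned char) n;
  for (size_t i = 0; i < n; i++)
    out[1 + i] = (unsigned char) (x >> (8 * (n - 1 - i)));
  return 1 + n;
}

/* encode_string(S): bit-length prefix followed by the string itself.
 * OUT must have room for at least slen + 1 + sizeof(size_t) bytes. */
static size_t
encode_string (const unsigned char *s, size_t slen, unsigned char *out)
{
  size_t n = left_encode (slen * 8, out);

  memcpy (out + n, s, slen);
  return n + slen;
}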