From falko.strenzke at mtg.de Thu Aug 3 13:49:08 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Thu, 3 Aug 2023 13:49:08 +0200
Subject: secmem limits for PQC schemes
Message-ID:

We are currently working on the implementation of the "CRYSTALS" schemes Kyber and Dilithium (SPHINCS+ to follow soon) in Libgcrypt. In the course of this work we came across a problem with the secure memory [^1] management in Libgcrypt. Namely, the current hard limit for secure memory is 32 kB. That seems to be a reasonable default value, as there are apparently indeed OSs which have limits for locked memory in this range. However, the heap memory requirements for the largest parameter sets of the CRYSTALS schemes are

- Kyber: 33,376 bytes (key generation)
- Dilithium: 135,968 bytes (probably also key generation, but not determined yet)

For Kyber we could possibly increase the default pool size to a still reasonable 64 kB. But in the case of multiple threads using Kyber operations this will still not suffice.

This raises the question of how to deal with this limitation. When the secure memory pool set up with the default size is exhausted, further requests for secure memory fail even in non-FIPS mode. This is not ideal, since many modern systems provide much higher margins for lockable memory. The possibilities I see are to

- implement an allocation function for the CRYSTALS schemes that first tries to allocate secure memory and, if that fails and FIPS mode is not activated, simply allocates non-secure memory (a sketch of this follows after this message);
- possibly rework the secure memory management so that it tries to lock further memory blocks when secure memory is requested after the initially set up pool is exhausted. On my Debian 11 x86 system, for instance, I have a limit of 4 MB for locked memory, which allows the rather pessimistic default value to be exceeded by orders of magnitude.

Could the Libgcrypt core developers please let us know their thoughts regarding these issues?

[^1]: i.e. heap memory that is locked so that it cannot be swapped to disk, and that is overwritten when freed

--
*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de

*MTG Exhibitions - See you in 2023*

------------------------------------------------------------------------

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.

Data protection information: Privacy policy
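(For illustration, a minimal sketch of the fallback-allocation idea from the first option above. gcry_malloc_secure(), gcry_malloc() and gcry_fips_mode_active() are the existing public libgcrypt API; the wrapper name is hypothetical, and an in-tree implementation would use libgcrypt's internal allocators instead.)

/* Sketch, not part of the original message: prefer the locked secmem pool,
 * fall back to normal heap memory when the pool is exhausted and FIPS mode
 * is not active. */
#include <gcrypt.h>

static void *
pqc_trymalloc_secure_with_fallback (size_t n)
{
  void *p = gcry_malloc_secure (n);   /* try the locked secure memory pool first */

  if (!p && !gcry_fips_mode_active ())
    p = gcry_malloc (n);              /* pool exhausted: fall back to non-secure heap */

  return p;
}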
Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: Kryptografische S/MIME-Signatur URL: From jussi.kivilinna at iki.fi Wed Aug 9 19:56:42 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 9 Aug 2023 20:56:42 +0300 Subject: [PATCH] Avoid VPGATHER usage for most of Intel CPUs Message-ID: <20230809175642.26581-1-jussi.kivilinna@iki.fi> * cipher/blake2.c (blake2b_init_ctx): Check for fast VPGATHER for AVX512 implementation. * src/hwf-x86.c (detect_x86_gnuc): Do not enable HWF_INTEL_FAST_VPGATHER for Intel CPUs suffering from "Downfall" vulnerability. -- VPGATHER used to be fast on Intel CPU from Skylake to Tiger Lake, but instruction is now very slow on these CPUs (slower than on Haswell) because of mitigation introduced in new microcode version for "Downfall" speculative execution vulnerability. Signed-off-by: Jussi Kivilinna --- cipher/blake2.c | 3 ++- src/hwf-x86.c | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/cipher/blake2.c b/cipher/blake2.c index 45f74a56..637eebbd 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -494,7 +494,8 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, c->use_avx2 = !!(features & HWF_INTEL_AVX2); #endif #ifdef USE_AVX512 - c->use_avx512 = !!(features & HWF_INTEL_AVX512); + c->use_avx512 = (features & HWF_INTEL_AVX512) + && (features & HWF_INTEL_FAST_VPGATHER); #endif c->outlen = dbits / 8; diff --git a/src/hwf-x86.c b/src/hwf-x86.c index 5240a460..bda14d9d 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -424,6 +424,36 @@ detect_x86_gnuc (void) avoid_vpgather |= 1; break; } + + /* These Intel Core processors (skylake to tigerlake) have slow VPGATHER + * because of mitigation introduced by new microcode (2023-08-08) for + * "Downfall" speculative execution vulnerability. */ + switch (model) + { + /* Skylake, Cascade Lake, Cooper Lake */ + case 0x4E: + case 0x5E: + case 0x55: + /* Kaby Lake, Coffee Lake, Whiskey Lake, Amber Lake */ + case 0x8E: + case 0x9E: + /* Cannon Lake */ + case 0x66: + /* Comet Lake */ + case 0xA5: + case 0xA6: + /* Ice Lake */ + case 0x7E: + case 0x6A: + case 0x6C: + /* Tiger Lake */ + case 0x8C: + case 0x8D: + /* Rocket Lake */ + case 0xA7: + avoid_vpgather |= 1; + break; + } } else if (is_amd_cpu) { -- 2.39.2 From jussi.kivilinna at iki.fi Sun Aug 13 14:40:25 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 13 Aug 2023 15:40:25 +0300 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather Message-ID: <20230813124025.789901-1-jussi.kivilinna@iki.fi> * cipher/twofish-avx2-amd64.S (do_gather): New. (g16): Switch to use 'do_gather' instead of VPGATHER instruction. (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack for 'do_gather'. -- As VPGATHER is now slow on majority of CPUs (because of "Downfall"), switch twofish-avx2 implementation to use manual memory gathering instead. 
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 After (~3.1x faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4090 ECB dec | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4089 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 After (~6% faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.78 ns/B 534.7 MiB/s 8.38 c/B 4700 ECB dec | 1.79 ns/B 533.7 MiB/s 8.40 c/B 4700 Signed-off-by: Jussi Kivilinna --- cipher/twofish-avx2-amd64.S | 168 ++++++++++++++++++++++++------------ cipher/twofish.c | 6 +- 2 files changed, 113 insertions(+), 61 deletions(-) diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S index d05ec1f9..2207ac57 100644 --- a/cipher/twofish-avx2-amd64.S +++ b/cipher/twofish-avx2-amd64.S @@ -39,14 +39,20 @@ /* register macros */ #define CTX %rdi -#define RROUND %r12 -#define RROUNDd %r12d +#define RROUND %r13 +#define RROUNDd %r13d #define RS0 CTX #define RS1 %r8 #define RS2 %r9 #define RS3 %r10 #define RK %r11 -#define RW %rax +#define RW %r12 +#define RIDX0 %rax +#define RIDX0d %eax +#define RIDX1 %rbx +#define RIDX1d %ebx +#define RIDX2 %r14 +#define RIDX3 %r15 #define RA0 %ymm8 #define RB0 %ymm9 @@ -63,14 +69,14 @@ #define RX1 %ymm2 #define RY1 %ymm3 #define RT0 %ymm4 -#define RIDX %ymm5 +#define RT1 %ymm5 #define RX0x %xmm0 #define RY0x %xmm1 #define RX1x %xmm2 #define RY1x %xmm3 #define RT0x %xmm4 -#define RIDXx %xmm5 +#define RT1x %xmm5 #define RTMP0 RX0 #define RTMP0x RX0x @@ -80,8 +86,8 @@ #define RTMP2x RY0x #define RTMP3 RY1 #define RTMP3x RY1x -#define RTMP4 RIDX -#define RTMP4x RIDXx +#define RTMP4 RT1 +#define RTMP4x RT1x /* vpgatherdd mask and '-1' */ #define RNOT %ymm6 @@ -102,48 +108,42 @@ leaq s2(CTX), RS2; \ leaq s3(CTX), RS3; \ +#define do_gather(stoffs, byteoffs, rs, out) \ + movzbl (stoffs + 0*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 1*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 2*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 3*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT1x; \ + vpinsrd $1, (rs, RIDX1, 4), RT1x, RT1x; \ + vpinsrd $2, (rs, RIDX2, 4), RT1x, RT1x; \ + vpinsrd $3, (rs, RIDX3, 4), RT1x, RT1x; \ + movzbl (stoffs + 4*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 5*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 6*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 7*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT0x; \ + vpinsrd $1, (rs, RIDX1, 4), RT0x, RT0x; \ + vpinsrd $2, (rs, RIDX2, 4), RT0x, RT0x; \ + vpinsrd $3, (rs, RIDX3, 4), RT0x, RT0x; \ + vinserti128 $1, RT0x, RT1, out; + #define g16(ab, rs0, rs1, rs2, rs3, xy) \ - vpand RBYTE, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpand RBYTE, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 1; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpsrld $8, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $8, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor 
RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $16, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $16, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $24, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $24, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; + vmovdqa ab ## 0, 0(%rsp); \ + vmovdqa ab ## 1, 32(%rsp); \ + do_gather(0*32, 0, rs0, xy ## 0); \ + do_gather(1*32, 0, rs0, xy ## 1); \ + do_gather(0*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; #define g1_16(a, x) \ g16(a, RS0, RS1, RS2, RS3, x); @@ -375,8 +375,23 @@ __twofish_enc_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -400,8 +415,21 @@ __twofish_enc_blk16: outunpack_enc16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); @@ -420,8 +448,23 @@ __twofish_dec_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -444,8 +487,21 @@ __twofish_dec_blk16: outunpack_dec16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); diff --git a/cipher/twofish.c 
b/cipher/twofish.c index 74061913..11a6e251 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -767,11 +767,7 @@ twofish_setkey (void *context, const byte *key, unsigned int keylen, rc = do_twofish_setkey (ctx, key, keylen); #ifdef USE_AVX2 - ctx->use_avx2 = 0; - if ((hwfeatures & HWF_INTEL_AVX2) && (hwfeatures & HWF_INTEL_FAST_VPGATHER)) - { - ctx->use_avx2 = 1; - } + ctx->use_avx2 = (hwfeatures & HWF_INTEL_AVX2) != 0; #endif /* Setup bulk encryption routines. */ -- 2.39.2 From jcb62281 at gmail.com Mon Aug 14 04:47:44 2023 From: jcb62281 at gmail.com (Jacob Bachmeyer) Date: Sun, 13 Aug 2023 21:47:44 -0500 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather In-Reply-To: <20230813124025.789901-1-jussi.kivilinna@iki.fi> References: <20230813124025.789901-1-jussi.kivilinna@iki.fi> Message-ID: <64D995D0.2090305@gmail.com> Jussi Kivilinna wrote: > * cipher/twofish-avx2-amd64.S (do_gather): New. > (g16): Switch to use 'do_gather' instead of VPGATHER instruction. > (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack > for 'do_gather'. > -- > > As VPGATHER is now slow on majority of CPUs (because of "Downfall"), > switch twofish-avx2 implementation to use manual memory gathering > instead. > > Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated > microcode): > > Before: > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 > ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 > > After (~3.1x faster): > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4090 > ECB dec | 2.20 ns/B 433.7 MiB/s 8.99 c/B 4089 > > Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): > > Before: > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 > ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 > > After (~6% faster): > TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > ECB enc | 1.78 ns/B 534.7 MiB/s 8.38 c/B 4700 > ECB dec | 1.79 ns/B 533.7 MiB/s 8.40 c/B 4700 > Obviously, do_gather is bouncing the data around in the cache, but the fact that this change is a performance improvement on a processor not affected by "Downfall" strongly suggests that using VPGATHER may have been suboptimal from the start. Can you do a third test on the i3-1115G4 with older microcode? Would this patch have actually improved performance in all cases? Was using VPGATHER a waste of time the whole time? Do we need to be more skeptical about new SSE/AVX/etc. opcodes in the future? -- Jacob From jussi.kivilinna at iki.fi Mon Aug 14 18:24:05 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 14 Aug 2023 19:24:05 +0300 Subject: [PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather In-Reply-To: <64D995D0.2090305@gmail.com> References: <20230813124025.789901-1-jussi.kivilinna@iki.fi> <64D995D0.2090305@gmail.com> Message-ID: <5935b345-2e40-073e-77c3-f92e353279af@iki.fi> On 14.8.2023 5.47, Jacob Bachmeyer wrote: > Jussi Kivilinna wrote: >> * cipher/twofish-avx2-amd64.S (do_gather): New. >> (g16): Switch to use 'do_gather' instead of VPGATHER instruction. >> (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack >> for 'do_gather'. >> -- >> >> As VPGATHER is now slow on majority of CPUs (because of "Downfall"), >> switch twofish-avx2 implementation to use manual memory gathering >> instead. 
>>
>> Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
>> microcode):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
>>         ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
>>
>> After (~3.1x faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4090
>>         ECB dec |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4089
>>
>> Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
>>         ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
>>
>> After (~6% faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.78 ns/B     534.7 MiB/s      8.38 c/B      4700
>>         ECB dec |      1.79 ns/B     533.7 MiB/s      8.40 c/B      4700
>
> Obviously, do_gather is bouncing the data around in the cache, but the fact that this change is a performance improvement on a processor not affected by "Downfall" strongly suggests that using VPGATHER may have been suboptimal from the start. Can you do a third test on the i3-1115G4 with older microcode? Would this patch have actually improved performance in all cases?
>

VPGATHER used to be faster than manual gather starting with Intel Skylake. Old results on this i3-1115G4 show ~6.5 c/B for Twofish-CTR. The interesting thing is that older Intel CPUs with AVX2 had a slower VPGATHER implementation, and those are not affected by "Downfall". For AMD CPUs, VPGATHER has been slower, getting a tiny bit faster from generation to generation. With Zen4, gather performance was finally good enough that the twofish-avx2 implementation beat the twofish-3way-asm implementation, so I enabled the HWF_INTEL_FAST_VPGATHER HW-feature for AMD Zen4+ CPUs.

> Was using VPGATHER a waste of time the whole time? Do we need to be more skeptical about new SSE/AVX/etc. opcodes in the future?
>

I don't think so; VPGATHER really was quite a bit faster on Intel Skylake+ CPUs. As for being skeptical, I think the problem is not so much with specific opcodes as with optimizations that have been, or get, baked into microarchitectures.

-Jussi

>
> -- Jacob
>

From jussi.kivilinna at iki.fi Sun Aug 20 17:31:54 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 20 Aug 2023 18:31:54 +0300
Subject: [PATCH v2] twofish-avx2-amd64: replace VPGATHER with manual gather
Message-ID: <20230820153155.382969-1-jussi.kivilinna@iki.fi>

* cipher/twofish-avx2-amd64.S (do_gather): New.
(g16): Switch to use 'do_gather' instead of VPGATHER instruction.
(__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack for 'do_gather'.
* cipher/twofish.c (twofish) [USE_AVX2]: Remove now unneeded
HWF_INTEL_FAST_VPGATHER check.
--

As VPGATHER is now slow on the majority of CPUs (because of "Downfall"), switch the twofish-avx2 implementation to use manual memory gathering instead.
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 7.00 ns/B 136.3 MiB/s 28.62 c/B 4089 ECB dec | 7.00 ns/B 136.2 MiB/s 28.64 c/B 4090 After (~3.2x faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 2.19 ns/B 435.5 MiB/s 8.95 c/B 4089 ECB dec | 2.19 ns/B 436.2 MiB/s 8.94 c/B 4089 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Before: TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.91 ns/B 499.0 MiB/s 8.98 c/B 4700 ECB dec | 1.90 ns/B 500.7 MiB/s 8.95 c/B 4700 After (~9% faster): TWOFISH | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.74 ns/B 547.9 MiB/s 8.18 c/B 4700 ECB dec | 1.74 ns/B 547.8 MiB/s 8.18 c/B 4700 [v2]: - reorder memory operations in do_gather for small performance increase. Signed-off-by: Jussi Kivilinna --- cipher/twofish-avx2-amd64.S | 168 ++++++++++++++++++++++++------------ cipher/twofish.c | 6 +- 2 files changed, 113 insertions(+), 61 deletions(-) diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S index d05ec1f9..3f61f87b 100644 --- a/cipher/twofish-avx2-amd64.S +++ b/cipher/twofish-avx2-amd64.S @@ -39,14 +39,20 @@ /* register macros */ #define CTX %rdi -#define RROUND %r12 -#define RROUNDd %r12d +#define RROUND %r13 +#define RROUNDd %r13d #define RS0 CTX #define RS1 %r8 #define RS2 %r9 #define RS3 %r10 #define RK %r11 -#define RW %rax +#define RW %r12 +#define RIDX0 %rax +#define RIDX0d %eax +#define RIDX1 %rbx +#define RIDX1d %ebx +#define RIDX2 %r14 +#define RIDX3 %r15 #define RA0 %ymm8 #define RB0 %ymm9 @@ -63,14 +69,14 @@ #define RX1 %ymm2 #define RY1 %ymm3 #define RT0 %ymm4 -#define RIDX %ymm5 +#define RT1 %ymm5 #define RX0x %xmm0 #define RY0x %xmm1 #define RX1x %xmm2 #define RY1x %xmm3 #define RT0x %xmm4 -#define RIDXx %xmm5 +#define RT1x %xmm5 #define RTMP0 RX0 #define RTMP0x RX0x @@ -80,8 +86,8 @@ #define RTMP2x RY0x #define RTMP3 RY1 #define RTMP3x RY1x -#define RTMP4 RIDX -#define RTMP4x RIDXx +#define RTMP4 RT1 +#define RTMP4x RT1x /* vpgatherdd mask and '-1' */ #define RNOT %ymm6 @@ -102,48 +108,42 @@ leaq s2(CTX), RS2; \ leaq s3(CTX), RS3; \ +#define do_gather(stoffs, byteoffs, rs, out) \ + movzbl (stoffs + 0*4 + byteoffs)(%rsp), RIDX0d; \ + movzbl (stoffs + 1*4 + byteoffs)(%rsp), RIDX1d; \ + movzbq (stoffs + 2*4 + byteoffs)(%rsp), RIDX2; \ + movzbq (stoffs + 3*4 + byteoffs)(%rsp), RIDX3; \ + vmovd (rs, RIDX0, 4), RT1x; \ + movzbl (stoffs + 4*4 + byteoffs)(%rsp), RIDX0d; \ + vmovd (rs, RIDX0, 4), RT0x; \ + vpinsrd $1, (rs, RIDX1, 4), RT1x, RT1x; \ + movzbl (stoffs + 5*4 + byteoffs)(%rsp), RIDX1d; \ + vpinsrd $1, (rs, RIDX1, 4), RT0x, RT0x; \ + vpinsrd $2, (rs, RIDX2, 4), RT1x, RT1x; \ + movzbq (stoffs + 6*4 + byteoffs)(%rsp), RIDX2; \ + vpinsrd $2, (rs, RIDX2, 4), RT0x, RT0x; \ + vpinsrd $3, (rs, RIDX3, 4), RT1x, RT1x; \ + movzbq (stoffs + 7*4 + byteoffs)(%rsp), RIDX3; \ + vpinsrd $3, (rs, RIDX3, 4), RT0x, RT0x; \ + vinserti128 $1, RT0x, RT1, out; + #define g16(ab, rs0, rs1, rs2, rs3, xy) \ - vpand RBYTE, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpand RBYTE, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs0, RIDX, 4), xy ## 1; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - \ - vpsrld $8, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $8, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - 
vpgatherdd RNOT, (rs1, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $16, ab ## 0, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $16, ab ## 1, RIDX; \ - vpand RBYTE, RIDX, RIDX; \ - vpgatherdd RNOT, (rs2, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; \ - \ - vpsrld $24, ab ## 0, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 0, xy ## 0; \ - \ - vpsrld $24, ab ## 1, RIDX; \ - vpgatherdd RNOT, (rs3, RIDX, 4), RT0; \ - vpcmpeqd RNOT, RNOT, RNOT; \ - vpxor RT0, xy ## 1, xy ## 1; + vmovdqa ab ## 0, 0(%rsp); \ + vmovdqa ab ## 1, 32(%rsp); \ + do_gather(0*32, 0, rs0, xy ## 0); \ + do_gather(1*32, 0, rs0, xy ## 1); \ + do_gather(0*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 1, rs1, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 2, rs2, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; \ + do_gather(0*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 0, xy ## 0; \ + do_gather(1*32, 3, rs3, RT1); \ + vpxor RT1, xy ## 1, xy ## 1; #define g1_16(a, x) \ g16(a, RS0, RS1, RS2, RS3, x); @@ -375,8 +375,23 @@ __twofish_enc_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -400,8 +415,21 @@ __twofish_enc_blk16: outunpack_enc16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); @@ -420,8 +448,23 @@ __twofish_dec_blk16: */ CFI_STARTPROC(); - pushq RROUND; - CFI_PUSH(RROUND); + pushq %rbp; + CFI_PUSH(%rbp); + movq %rsp, %rbp; + CFI_DEF_CFA_REGISTER(%rbp); + subq $(64 + 5 * 8), %rsp; + andq $-64, %rsp; + + movq %rbx, (64 + 0 * 8)(%rsp); + movq %r12, (64 + 1 * 8)(%rsp); + movq %r13, (64 + 2 * 8)(%rsp); + movq %r14, (64 + 3 * 8)(%rsp); + movq %r15, (64 + 4 * 8)(%rsp); + CFI_REG_ON_STACK(rbx, 64 + 0 * 8); + CFI_REG_ON_STACK(r12, 64 + 1 * 8); + CFI_REG_ON_STACK(r13, 64 + 2 * 8); + CFI_REG_ON_STACK(r14, 64 + 3 * 8); + CFI_REG_ON_STACK(r15, 64 + 4 * 8); init_round_constants(); @@ -444,8 +487,21 @@ __twofish_dec_blk16: outunpack_dec16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); - popq RROUND; - CFI_POP(RROUND); + movq (64 + 0 * 8)(%rsp), %rbx; + movq (64 + 1 * 8)(%rsp), %r12; + movq (64 + 2 * 8)(%rsp), %r13; + movq (64 + 3 * 8)(%rsp), %r14; + movq (64 + 4 * 8)(%rsp), %r15; + CFI_RESTORE(%rbx); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + CFI_RESTORE(%r14); + CFI_RESTORE(%r15); + vpxor RT0, RT0, RT0; + vmovdqa RT0, 0(%rsp); + vmovdqa RT0, 32(%rsp); + leave; + 
CFI_LEAVE(); ret_spec_stop; CFI_ENDPROC(); diff --git a/cipher/twofish.c b/cipher/twofish.c index 74061913..11a6e251 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -767,11 +767,7 @@ twofish_setkey (void *context, const byte *key, unsigned int keylen, rc = do_twofish_setkey (ctx, key, keylen); #ifdef USE_AVX2 - ctx->use_avx2 = 0; - if ((hwfeatures & HWF_INTEL_AVX2) && (hwfeatures & HWF_INTEL_FAST_VPGATHER)) - { - ctx->use_avx2 = 1; - } + ctx->use_avx2 = (hwfeatures & HWF_INTEL_AVX2) != 0; #endif /* Setup bulk encryption routines. */ -- 2.39.2 From jussi.kivilinna at iki.fi Sun Aug 20 17:31:55 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Aug 2023 18:31:55 +0300 Subject: [PATCH] blake2b-avx512: replace VPGATHER with manual gather In-Reply-To: <20230820153155.382969-1-jussi.kivilinna@iki.fi> References: <20230820153155.382969-1-jussi.kivilinna@iki.fi> Message-ID: <20230820153155.382969-2-jussi.kivilinna@iki.fi> * cipher/blake2.c (blake2b_init_ctx): Remove HWF_INTEL_FAST_VPGATHER check for AVX512 implementation. * cipher/blake2b-amd64-avx512.S (R16, VPINSRQ_KMASK, .Lshuf_ror16) (.Lk1_mask): New. (GEN_GMASK, RESET_KMASKS, .Lgmask*): Remove. (GATHER_MSG): Use manual gather instead of VPGATHER. (ROR_16): Use vpshufb for small speed improvement on tigerlake. (_gcry_blake2b_transform_amd64_avx512): New setup & clean-up for kmask registers; Reduce excess loop aligned from 64B to 16B. -- As VPGATHER is now slow on majority of CPUs (because of "Downfall"), switch blake2b-avx512 implementation to use manual memory gathering instead. Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated microcode): Old before "Downfall" (commit 909daa700e4b45d75469df298ee564b8fc2f4b72): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.705 ns/B 1353 MiB/s 2.88 c/B 4088 Old after "Downfall" (~3.0x slower): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 2.11 ns/B 451.3 MiB/s 8.64 c/B 4089 New (same as before "Downfall"): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.705 ns/B 1353 MiB/s 2.88 c/B 4090 Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"): Old: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.793 ns/B 1203 MiB/s 3.73 c/B 4700 New (~3% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz BLAKE2B_512 | 0.771 ns/B 1237 MiB/s 3.62 c/B 4700 Signed-off-by: Jussi Kivilinna --- cipher/blake2.c | 3 +- cipher/blake2b-amd64-avx512.S | 140 ++++++++++++++++------------------ 2 files changed, 65 insertions(+), 78 deletions(-) diff --git a/cipher/blake2.c b/cipher/blake2.c index 637eebbd..45f74a56 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -494,8 +494,7 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, c->use_avx2 = !!(features & HWF_INTEL_AVX2); #endif #ifdef USE_AVX512 - c->use_avx512 = (features & HWF_INTEL_AVX512) - && (features & HWF_INTEL_FAST_VPGATHER); + c->use_avx512 = !!(features & HWF_INTEL_AVX512); #endif c->outlen = dbits / 8; diff --git a/cipher/blake2b-amd64-avx512.S b/cipher/blake2b-amd64-avx512.S index fe938730..3a04818c 100644 --- a/cipher/blake2b-amd64-avx512.S +++ b/cipher/blake2b-amd64-avx512.S @@ -49,6 +49,7 @@ #define ROW4 %ymm3 #define TMP1 %ymm4 #define TMP1x %xmm4 +#define R16 %ymm13 #define MA1 %ymm5 #define MA2 %ymm6 @@ -72,64 +73,65 @@ blake2b/AVX2 **********************************************************************/ -#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, gather_masks) \ - vmovdqa gather_masks + 
(4*4) * 0 rRIP, m2x; \ - vmovdqa gather_masks + (4*4) * 1 rRIP, m3x; \ - vmovdqa gather_masks + (4*4) * 2 rRIP, m4x; \ - vmovdqa gather_masks + (4*4) * 3 rRIP, TMP1x; \ - vpgatherdq (RINBLKS, m2x), m1 {%k1}; \ - vpgatherdq (RINBLKS, m3x), m2 {%k2}; \ - vpgatherdq (RINBLKS, m4x), m3 {%k3}; \ - vpgatherdq (RINBLKS, TMP1x), m4 {%k4} - -#define GEN_GMASK(s0, s1, s2, s3, s4, s5, s6, s7, \ - s8, s9, s10, s11, s12, s13, s14, s15) \ - .long (s0)*8, (s2)*8, (s4)*8, (s6)*8, \ - (s1)*8, (s3)*8, (s5)*8, (s7)*8, \ - (s8)*8, (s10)*8, (s12)*8, (s14)*8, \ - (s9)*8, (s11)*8, (s13)*8, (s15)*8 - -#define RESET_KMASKS() \ - kmovw %k0, %k1; \ - kmovw %k0, %k2; \ - kmovw %k0, %k3; \ - kmovw %k0, %k4 +/* Load one qword value at memory location MEM to specific element in + * target register VREG. Note, KPOS needs to contain value "(1 << QPOS)". */ +#define VPINSRQ_KMASK(kpos, qpos, mem, vreg) \ + vmovdqu64 -((qpos) * 8) + mem, vreg {kpos} + +#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovq (s0)*8(RINBLKS), m1x; \ + vmovq (s1)*8(RINBLKS), m2x; \ + vmovq (s8)*8(RINBLKS), m3x; \ + vmovq (s9)*8(RINBLKS), m4x; \ + VPINSRQ_KMASK(%k1, 1, (s2)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k1, 1, (s3)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k1, 1, (s10)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k1, 1, (s11)*8(RINBLKS), m4); \ + VPINSRQ_KMASK(%k2, 2, (s4)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k2, 2, (s5)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k2, 2, (s12)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k2, 2, (s13)*8(RINBLKS), m4); \ + VPINSRQ_KMASK(%k3, 3, (s6)*8(RINBLKS), m1); \ + VPINSRQ_KMASK(%k3, 3, (s7)*8(RINBLKS), m2); \ + VPINSRQ_KMASK(%k3, 3, (s14)*8(RINBLKS), m3); \ + VPINSRQ_KMASK(%k3, 3, (s15)*8(RINBLKS), m4); #define LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask0); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) #define LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask1); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) #define LOAD_MSG_2(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask2); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) #define LOAD_MSG_3(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask3); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) #define LOAD_MSG_4(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask4); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) #define LOAD_MSG_5(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask5); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) #define LOAD_MSG_6(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask6); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) #define LOAD_MSG_7(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, 
m3, m4, m1x, m2x, m3x, m4x, .Lgmask7); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) #define LOAD_MSG_8(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask8); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) #define LOAD_MSG_9(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask9); \ - RESET_KMASKS() + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) #define LOAD_MSG_10(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask0); \ - RESET_KMASKS() + LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) #define LOAD_MSG_11(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ - GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, .Lgmask1); + LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) #define LOAD_MSG(r, m1, m2, m3, m4) \ LOAD_MSG_##r(m1, m2, m3, m4, m1##x, m2##x, m3##x, m4##x) @@ -138,7 +140,7 @@ #define ROR_24(in, out) vprorq $24, in, out -#define ROR_16(in, out) vprorq $16, in, out +#define ROR_16(in, out) vpshufb R16, in, out #define ROR_63(in, out) vprorq $63, in, out @@ -188,26 +190,10 @@ _blake2b_avx512_data: .quad 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1 .quad 0x510e527fade682d1, 0x9b05688c2b3e6c1f .quad 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179 -.Lgmask0: - GEN_GMASK(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) -.Lgmask1: - GEN_GMASK(14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) -.Lgmask2: - GEN_GMASK(11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) -.Lgmask3: - GEN_GMASK(7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) -.Lgmask4: - GEN_GMASK(9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) -.Lgmask5: - GEN_GMASK(2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) -.Lgmask6: - GEN_GMASK(12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) -.Lgmask7: - GEN_GMASK(13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) -.Lgmask8: - GEN_GMASK(6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) -.Lgmask9: - GEN_GMASK(10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) +.Lshuf_ror16: + .byte 2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9 +.Lk1_mask: + .byte (1 << 1) .text @@ -225,14 +211,15 @@ _gcry_blake2b_transform_amd64_avx512: spec_stop_avx512; - movl $0xf, %eax; - kmovw %eax, %k0; - xorl %eax, %eax; - RESET_KMASKS(); + kmovb .Lk1_mask rRIP, %k1; + kshiftlb $1, %k1, %k2; + kshiftlb $2, %k1, %k3; addq $128, (STATE_T + 0)(RSTATE); adcq $0, (STATE_T + 8)(RSTATE); + vbroadcasti128 .Lshuf_ror16 rRIP, R16; + vmovdqa .Liv+(0 * 8) rRIP, ROW3; vmovdqa .Liv+(4 * 8) rRIP, ROW4; @@ -243,9 +230,8 @@ _gcry_blake2b_transform_amd64_avx512: LOAD_MSG(0, MA1, MA2, MA3, MA4); LOAD_MSG(1, MB1, MB2, MB3, MB4); - jmp .Loop; -.align 64, 0xcc +.align 16 .Loop: ROUND(0, MA1, MA2, MA3, MA4); LOAD_MSG(2, MA1, MA2, MA3, MA4); @@ -269,7 +255,6 @@ _gcry_blake2b_transform_amd64_avx512: LOAD_MSG(11, MB1, MB2, MB3, MB4); sub $1, RNBLKS; jz .Loop_end; - RESET_KMASKS(); lea 128(RINBLKS), RINBLKS; addq $128, (STATE_T + 0)(RSTATE); @@ -293,7 +278,7 @@ _gcry_blake2b_transform_amd64_avx512: jmp .Loop; -.align 64, 0xcc +.align 16 .Loop_end: ROUND(10, MA1, MA2, MA3, MA4); ROUND(11, MB1, MB2, MB3, MB4); @@ -304,9 +289,12 @@ _gcry_blake2b_transform_amd64_avx512: vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); - kxorw %k0, %k0, %k0; + xorl %eax, 
%eax; + kxord %k1, %k1, %k1; + kxord %k2, %k2, %k2; + kxord %k3, %k3, %k3; + vzeroall; - RESET_KMASKS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_blake2b_transform_amd64_avx512, -- 2.39.2

From falko.strenzke at mtg.de Tue Aug 22 13:49:04 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Tue, 22 Aug 2023 13:49:04 +0200
Subject: KMAC / cSHAKE in Libgcrypt
Message-ID: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de>

We are currently working on the integration of PQC algorithms in Libgcrypt based on draft-wussler-openpgp-pqc and will also add KMAC to Libgcrypt, since this algorithm is used for the key derivation inside the key combiner.

KMAC is based on cSHAKE, which is a variant of SHAKE that requires a different final bit padding than SHAKE and is currently not implemented in Libgcrypt. cSHAKE128 is defined as

  cSHAKE128(X, L, N, S):
  1. If N = "" and S = "": return SHAKE128(X, L);
  2. Else: return KECCAK[256](bytepad(encode_string(N) || encode_string(S), 168) || X || 00, L)

(cSHAKE256 is defined analogously with SHAKE256, KECCAK[512] and rate 136.)

In order to support the additional arguments N and S, I propose the following approach:

* cSHAKE is added as an XOF message digest like SHAKE.
* For the purpose of providing the additional arguments N and S we add

  typedef enum
    {
      GCRY_MD_ADDIN_CSHAKE_N = 1,
      GCRY_MD_ADDIN_CSHAKE_S = 2
    } gcry_md_add_input_t;

  gcry_error_t gcry_md_set_add_input (gcry_md_hd_t *h,
                                      gcry_md_add_input_t addin_type,
                                      const void *v, size_t v_len);

In order to invoke cSHAKE with non-empty N and S parameters, two calls to gcry_md_set_add_input() have to be made after the call to _gcry_md_open(), setting N and S in that order. If data is added without these calls having been made, the digest behaves as plain SHAKE, as required by the specification. (A usage sketch follows after this message.)

Does anyone have any thoughts on this?

- Falko

--
*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de

*MTG Exhibitions - See you in 2023*

------------------------------------------------------------------------

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.

Data protection information: Privacy policy
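(For illustration, a usage sketch of the interface proposed above. gcry_md_open(), gcry_md_write(), gcry_md_extract() and gcry_md_close() are the existing libgcrypt API; GCRY_MD_CSHAKE256, gcry_md_set_add_input() and the GCRY_MD_ADDIN_* constants are the proposed, not yet existing, additions.)

/* Sketch of how the proposed cSHAKE interface would be used.  The cSHAKE
 * algorithm identifier and the set_add_input call are hypothetical. */
#include <gcrypt.h>

static gcry_error_t
cshake256_with_customization (const void *data, size_t datalen,
                              unsigned char *out, size_t outlen)
{
  gcry_md_hd_t hd;
  gcry_error_t err;

  err = gcry_md_open (&hd, GCRY_MD_CSHAKE256, 0);   /* proposed algo id */
  if (err)
    return err;

  /* Set the function-name string N and the customization string S before
   * any data is written; leaving both unset yields plain SHAKE256. */
  err = gcry_md_set_add_input (&hd, GCRY_MD_ADDIN_CSHAKE_N, "", 0);
  if (!err)
    err = gcry_md_set_add_input (&hd, GCRY_MD_ADDIN_CSHAKE_S,
                                 "Email Signature", 15);
  if (!err)
    {
      gcry_md_write (hd, data, datalen);
      err = gcry_md_extract (hd, GCRY_MD_CSHAKE256, out, outlen);
    }

  gcry_md_close (hd);
  return err;
}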
Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: Kryptografische S/MIME-Signatur URL: From jussi.kivilinna at iki.fi Thu Aug 31 19:12:16 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 31 Aug 2023 20:12:16 +0300 Subject: KMAC / cSHAKE in Libgcrypt In-Reply-To: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de> References: <16a6eb37-94b0-456c-b3fd-93bc09573b3e@mtg.de> Message-ID: Hello, On 22.8.2023 14.49, Falko Strenzke wrote: > We are currently working on the integration of PQC algorithms in Libgcrypt based on draft-wussler-openpgp-pqc and will also add KMAC to Libgcrypt since this algorithm is used for the key derivation inside the key combiner. > > KMAC is based on cSHAKE , which is variant of SHAKE that requires a different final bit padding than SHAKE and is currently not implemented in Libgcrypt. cSHAKE is defined as > > |cSHAKE(X, L, N, S): 1. If N = "" and S = "": return SHAKE256(X, L); 2. Else: return KECCAK[256](bytepad(encode_string(N) || encode_string(S), 168) || X || 00, L) | > > In order to support the additional arguments N and S, I propose the following approach: > > * > > cSHAKE is added as an XOF message digest like SHAKE > > * > > For the purpose of providing the additional arguments N and S we add > > |typedef enum { GCRY_MD_ADDIN_CSHAKE_N = 1, GCRY_MD_ADDIN_CSHAKE_S = 2 } gcry_md_add_input_t; gcry_error_t gcry_md_set_add_input (gcry_md_hd_t *h, gcry_md_add_input_t addin_type, const void* v, size_t v_len) | > > In order to invoke cSHAKE with non-empty N and S parameters, after the call to |_gcry_md_open()|, two calls to |gcry_md_set_add_input()| have to be made to set N and S in that order. If data is added without having made these calls, then it will behave as normal SHAKE as required by the specification. > > Does anyone have any thoughts on this? I checked cSHAKE spec and think that interface is good way for passing these parameters. I first thought about having user to pass encoded N and S to gcry_md_write but that would mean that user needs to implement encode_string function from cSHAKE spec which would not work. One additional thing to consider is the _gcry_md_hash_buffers_extract internal interface. There cSHAKE could take first two IO buffers as N and S strings and following buffers as actual data. This would be similar to how HMAC work through this interface, where first IO buffer is used as HMAC key and remaining buffers as data. -Jussi > > - Falko > > -- > > *MTG AG* > Dr. Falko Strenzke > Executive System Architect > > Phone: +49 6151 8000 24 > E-Mail: falko.strenzke at mtg.de > Web: mtg.de > > > *MTG Exhibitions ? 
See you in 2023*
>
> ------------------------------------------------------------------------
>
>
> MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
> Commercial register: HRB 8901
> Register Court: Amtsgericht Darmstadt
> Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
> Chairman of the Supervisory Board: Dr. Thomas Milde
>
> This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted.
>
> Data protection information: Privacy policy
>
>
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> https://lists.gnupg.org/mailman/listinfo/gcrypt-devel
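(For illustration, a sketch of the left_encode/encode_string helpers from NIST SP 800-185 that cSHAKE prepends to N and S, i.e. the encoding a caller would otherwise have to implement if pre-encoded strings were passed in, as mentioned in the reply above. The code is illustrative only and not part of Libgcrypt; lengths are restricted to what fits into a size_t.)

/* Sketch of the cSHAKE string encoding from NIST SP 800-185.
 * encode_string(S) = left_encode(bitlen(S)) || S; cSHAKE absorbs
 * bytepad(encode_string(N) || encode_string(S), rate) before the message. */
#include <stddef.h>
#include <string.h>

/* left_encode(x): one length byte n followed by x as n big-endian bytes. */
static size_t
left_encode (size_t x, unsigned char *out)
{
  size_t n = 1;

  while (n < sizeof (x) && (x >> (8 * n)) != 0)
    n++;
  out[0] = (unsigned char) n;
  for (size_t i = 0; i < n; i++)
    out[1 + i] = (unsigned char) (x >> (8 * (n - 1 - i)));
  return 1 + n;
}

/* encode_string(S): bit-length prefix followed by the string itself.
 * OUT must have room for at least slen + 1 + sizeof(size_t) bytes. */
static size_t
encode_string (const unsigned char *s, size_t slen, unsigned char *out)
{
  size_t n = left_encode (slen * 8, out);

  memcpy (out + n, s, slen);
  return n + slen;
}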