From jussi.kivilinna at iki.fi Sun Apr 2 19:22:48 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 2 Apr 2023 20:22:48 +0300 Subject: [PATCH] cipher-gcm-ppc: tweak loop structure a bit Message-ID: <20230402172248.1085628-1-jussi.kivilinna@iki.fi> * cipher/cipher-gcm-ppc.c (_gcry_ghash_ppc_vpmsum): Increment 'buf' pointer right after use; Use 'for' loop for inner 4-blocks loop to allow compiler to better optimize loop. -- Benchmark on POWER9: Before: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 0.226 ns/B 4211 MiB/s 0.521 c/B After: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 0.224 ns/B 4248 MiB/s 0.516 c/B Signed-off-by: Jussi Kivilinna --- cipher/cipher-gcm-ppc.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/cipher/cipher-gcm-ppc.c b/cipher/cipher-gcm-ppc.c index 4f75e95c..06bf5eb1 100644 --- a/cipher/cipher-gcm-ppc.c +++ b/cipher/cipher-gcm-ppc.c @@ -437,6 +437,7 @@ _gcry_ghash_ppc_vpmsum (byte *result, void *gcm_table, in1 = vec_load_he (16, buf); in2 = vec_load_he (32, buf); in3 = vec_load_he (48, buf); + buf += 64; in0 = vec_be_swap(in0, bswap_const); in1 = vec_be_swap(in1, bswap_const); in2 = vec_be_swap(in2, bswap_const); @@ -464,17 +465,13 @@ _gcry_ghash_ppc_vpmsum (byte *result, void *gcm_table, Xh3 = asm_xor (Xh3, Xh1); /* Gerald Estrin's scheme for parallel multiplication of polynomials */ - while (1) + for (; blocks_remaining >= 4; blocks_remaining -= 4) { - buf += 64; - blocks_remaining -= 4; - if (!blocks_remaining) - break; - in0 = vec_load_he (0, buf); in1 = vec_load_he (16, buf); in2 = vec_load_he (32, buf); in3 = vec_load_he (48, buf); + buf += 64; in1 = vec_be_swap(in1, bswap_const); in2 = vec_be_swap(in2, bswap_const); in3 = vec_be_swap(in3, bswap_const); -- 2.37.2 From falko.strenzke at mtg.de Mon Apr 3 05:59:10 2023 From: falko.strenzke at mtg.de (Falko Strenzke) Date: Mon, 3 Apr 2023 05:59:10 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: 
<87tty2cq2q.fsf@wheatstone.g10code.de> References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87tty2cq2q.fsf@wheatstone.g10code.de> Message-ID: <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> Hi Werner, the only API change is the addition of the following interface function: gcry_err_code_t _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t *r_shared_key, gcry_sexp_t s_pkey) This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows: gcry_pk_spec_t _gcry_pubkey_spec_kyber = { GCRY_PK_KYBER, {0, 1}, (GCRY_PK_USAGE_ENCAP), /* TODOMTG: can the key usage "encryption" remain or do we need a new KU "encap"? */ "Kyber", kyber_names, "p", "s", "a", "", "p", /* elements of pub-key, sec-key, ciphertext, signature, key-grip */ kyber_generate, kyber_check_secret_key, NULL, /* encrypt */ kyber_encap, kyber_decrypt, NULL, /* sign */ NULL, /* verify */ kyber_get_nbits, run_selftests, compute_keygrip }; For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far doesn't seem to have any implications for us, so the decision isn't urgent from my point of view. - Falko On 30.03.23 at 15:43, Werner Koch wrote: > On Wed, 29 Mar 2023 10:09, Falko Strenzke said: > >> While the integration of the signature algorithms is straightforward, the KEM >> requires a new interface function, as the KEM encapsulation cannot be modelled >> by a public-key encryption. > It would be good if we can discuss a proposed API early enough, so that > we can see how it fits into the design of Libgcrypt. Can you already > roughly describe the needs? > > > Salam-Shalom, > > Werner > -- *MTG AG* Dr. Falko Strenzke Executive System Architect Phone: +49 6151 8000 24 E-Mail: falko.strenzke at mtg.de Web: mtg.de *MTG Exhibitions – 
See you in 2023* ------------------------------------------------------------------------ MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany Commercial register: HRB 8901 Register Court: Amtsgericht Darmstadt Management Board: Jürgen Ruf (CEO), Tamer Kemeröz Chairman of the Supervisory Board: Dr. Thomas Milde This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted. Data protection information: Privacy policy From jussi.kivilinna at iki.fi Sat Apr 22 09:35:35 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:35 +0300 Subject: [PATCH 1/5] bench-slope: add MPI benchmarking Message-ID: <20230422073539.894723-1-jussi.kivilinna@iki.fi> * tests/bench-slope.c (MPI_START_SIZE, MPI_END_SIZE, MPI_STEP_SIZE) (MPI_NUM_STEPS, bench_mpi_test, mpi_test_names, bench_mpi_mode) (bench_mpi_hd, bench_mpi_init, bench_mpi_free, bench_mpi_do_bench) (mpi_ops, mpi_modes, mpi_bench_one, _mpi_bench, mpi_match_test) (mpi_bench): New. (print_help): Add mention of 'mpi'. (main): Add "mpi" tests. 
-- Patch adds MPI operation benchmarking for bench-slope: $ tests/bench-slope --cpu-mhz auto mpi MPI: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.054 ns/B 17580 MiB/s 0.298 c/B 5500 sub | 0.083 ns/B 11432 MiB/s 0.459 c/B 5500 rshift3 | 0.033 ns/B 28862 MiB/s 0.182 c/B 5499 lshift3 | 0.093 ns/B 10256 MiB/s 0.511 c/B 5500 rshift65 | 0.096 ns/B 9888 MiB/s 0.530 c/B 5500 lshift65 | 0.093 ns/B 10228 MiB/s 0.513 c/B 5500 mul4 | 0.074 ns/B 12825 MiB/s 0.409 c/B 5500 mul8 | 0.072 ns/B 13313 MiB/s 0.394 c/B 5500 mul16 | 0.148 ns/B 6450 MiB/s 0.813 c/B 5500 mul32 | 0.299 ns/B 3191 MiB/s 1.64 c/B 5500 div4 | 0.458 ns/B 2080 MiB/s 2.52 c/B 5500 div8 | 0.458 ns/B 2084 MiB/s 2.52 c/B 5500 div16 | 0.602 ns/B 1584 MiB/s 3.31 c/B 5500 div32 | 0.926 ns/B 1030 MiB/s 5.09 c/B 5500 mod4 | 0.443 ns/B 2151 MiB/s 2.44 c/B 5500 mod8 | 0.443 ns/B 2152 MiB/s 2.44 c/B 5500 mod16 | 0.600 ns/B 1590 MiB/s 3.30 c/B 5500 mod32 | 0.924 ns/B 1032 MiB/s 5.08 c/B 5500 Signed-off-by: Jussi Kivilinna --- tests/bench-slope.c | 308 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 307 insertions(+), 1 deletion(-) diff --git a/tests/bench-slope.c b/tests/bench-slope.c index f8031e5e..2a203a07 100644 --- a/tests/bench-slope.c +++ b/tests/bench-slope.c @@ -2933,13 +2933,310 @@ ecc_bench (char **argv, int argc) #endif } +/************************************************************ MPI benchmarks. 
*/ + +#define MPI_START_SIZE 64 +#define MPI_END_SIZE 1024 +#define MPI_STEP_SIZE 8 +#define MPI_NUM_STEPS (((MPI_END_SIZE - MPI_START_SIZE) / MPI_STEP_SIZE) + 1) + +enum bench_mpi_test +{ + MPI_TEST_ADD = 0, + MPI_TEST_SUB, + MPI_TEST_RSHIFT3, + MPI_TEST_LSHIFT3, + MPI_TEST_RSHIFT65, + MPI_TEST_LSHIFT65, + MPI_TEST_MUL4, + MPI_TEST_MUL8, + MPI_TEST_MUL16, + MPI_TEST_MUL32, + MPI_TEST_DIV4, + MPI_TEST_DIV8, + MPI_TEST_DIV16, + MPI_TEST_DIV32, + MPI_TEST_MOD4, + MPI_TEST_MOD8, + MPI_TEST_MOD16, + MPI_TEST_MOD32, + __MAX_MPI_TEST +}; + +static const char * const mpi_test_names[] = +{ + "add", + "sub", + "rshift3", + "lshift3", + "rshift65", + "lshift65", + "mul4", + "mul8", + "mul16", + "mul32", + "div4", + "div8", + "div16", + "div32", + "mod4", + "mod8", + "mod16", + "mod32", + NULL, +}; + +struct bench_mpi_mode +{ + const char *name; + struct bench_ops *ops; + + enum bench_mpi_test test_id; +}; + +struct bench_mpi_hd +{ + gcry_mpi_t bytes[MPI_NUM_STEPS + 1]; + gcry_mpi_t y; +}; + +static int +bench_mpi_init (struct bench_obj *obj) +{ + struct bench_mpi_mode *mode = obj->priv; + struct bench_mpi_hd *hd; + int y_bytes; + int i, j; + + (void)mode; + + obj->min_bufsize = MPI_START_SIZE; + obj->max_bufsize = MPI_END_SIZE; + obj->step_size = MPI_STEP_SIZE; + obj->num_measure_repetitions = num_measurement_repetitions; + + hd = calloc (1, sizeof(*hd)); + if (!hd) + return -1; + + /* Generate input MPIs for benchmark. 
*/ + for (i = MPI_START_SIZE, j = 0; j < DIM(hd->bytes); i += MPI_STEP_SIZE, j++) + { + hd->bytes[j] = gcry_mpi_new (i * 8); + gcry_mpi_randomize (hd->bytes[j], i * 8, GCRY_WEAK_RANDOM); + gcry_mpi_set_bit (hd->bytes[j], i * 8 - 1); + } + + switch (mode->test_id) + { + case MPI_TEST_MUL4: + case MPI_TEST_DIV4: + case MPI_TEST_MOD4: + y_bytes = 4; + break; + + case MPI_TEST_MUL8: + case MPI_TEST_DIV8: + case MPI_TEST_MOD8: + y_bytes = 8; + break; + + case MPI_TEST_MUL16: + case MPI_TEST_DIV16: + case MPI_TEST_MOD16: + y_bytes = 16; + break; + + case MPI_TEST_MUL32: + case MPI_TEST_DIV32: + case MPI_TEST_MOD32: + y_bytes = 32; + break; + + default: + y_bytes = 0; + break; + } + + hd->y = gcry_mpi_new (y_bytes * 8); + if (y_bytes) + { + gcry_mpi_randomize (hd->y, y_bytes * 8, GCRY_WEAK_RANDOM); + gcry_mpi_set_bit (hd->y, y_bytes * 8 - 1); + } + + obj->hd = hd; + return 0; +} + +static void +bench_mpi_free (struct bench_obj *obj) +{ + struct bench_mpi_hd *hd = obj->hd; + int i; + + gcry_mpi_release (hd->y); + for (i = DIM(hd->bytes) - 1; i >= 0; i--) + gcry_mpi_release (hd->bytes[i]); + + free(hd); +} + +static void +bench_mpi_do_bench (struct bench_obj *obj, void *buf, size_t buflen) +{ + struct bench_mpi_hd *hd = obj->hd; + struct bench_mpi_mode *mode = obj->priv; + int bytes_idx = (buflen - MPI_START_SIZE) / MPI_STEP_SIZE; + gcry_mpi_t x; + + (void)buf; + + x = gcry_mpi_new (2 * (MPI_END_SIZE + 1) * 8); + + switch (mode->test_id) + { + case MPI_TEST_ADD: + gcry_mpi_add (x, hd->bytes[bytes_idx], hd->bytes[bytes_idx]); + break; + + case MPI_TEST_SUB: + gcry_mpi_sub (x, hd->bytes[bytes_idx + 1], hd->bytes[bytes_idx]); + break; + + case MPI_TEST_RSHIFT3: + gcry_mpi_rshift (x, hd->bytes[bytes_idx], 3); + break; + + case MPI_TEST_LSHIFT3: + gcry_mpi_lshift (x, hd->bytes[bytes_idx], 3); + break; + + case MPI_TEST_RSHIFT65: + gcry_mpi_rshift (x, hd->bytes[bytes_idx], 65); + break; + + case MPI_TEST_LSHIFT65: + gcry_mpi_lshift (x, hd->bytes[bytes_idx], 65); + break; + + case 
MPI_TEST_MUL4: + case MPI_TEST_MUL8: + case MPI_TEST_MUL16: + case MPI_TEST_MUL32: + gcry_mpi_mul (x, hd->bytes[bytes_idx], hd->y); + break; + + case MPI_TEST_DIV4: + case MPI_TEST_DIV8: + case MPI_TEST_DIV16: + case MPI_TEST_DIV32: + gcry_mpi_div (x, NULL, hd->bytes[bytes_idx], hd->y, 0); + break; + + case MPI_TEST_MOD4: + case MPI_TEST_MOD8: + case MPI_TEST_MOD16: + case MPI_TEST_MOD32: + gcry_mpi_mod (x, hd->bytes[bytes_idx], hd->y); + break; + + default: + break; + } + + gcry_mpi_release (x); +} + +static struct bench_ops mpi_ops = { + &bench_mpi_init, + &bench_mpi_free, + &bench_mpi_do_bench +}; + + +static struct bench_mpi_mode mpi_modes[] = { + {"", &mpi_ops}, + {0}, +}; + + +static void +mpi_bench_one (int test_id, struct bench_mpi_mode *pmode) +{ + struct bench_mpi_mode mode = *pmode; + struct bench_obj obj = { 0 }; + double result; + + mode.test_id = test_id; + + if (mode.name[0] == '\0') + bench_print_algo (-18, mpi_test_names[test_id]); + else + bench_print_algo (18, mode.name); + + obj.ops = mode.ops; + obj.priv = &mode; + + result = do_slope_benchmark (&obj); + + bench_print_result (result); +} + +static void +_mpi_bench (int test_id) +{ + int i; + + for (i = 0; mpi_modes[i].name; i++) + mpi_bench_one (test_id, &mpi_modes[i]); +} + +static int +mpi_match_test(const char *name) +{ + int i; + + for (i = 0; i < __MAX_MPI_TEST; i++) + if (strcmp(name, mpi_test_names[i]) == 0) + return i; + + return -1; +} + +void +mpi_bench (char **argv, int argc) +{ + int i, test_id; + + bench_print_section ("mpi", "MPI"); + bench_print_header (18, ""); + + if (argv && argc) + { + for (i = 0; i < argc; i++) + { + test_id = mpi_match_test (argv[i]); + if (test_id >= 0) + _mpi_bench (test_id); + } + } + else + { + for (i = 0; i < __MAX_MPI_TEST; i++) + _mpi_bench (i); + } + + bench_print_footer (18); +} + /************************************************************** Main program. 
*/ void print_help (void) { static const char *help_lines[] = { - "usage: bench-slope [options] [hash|mac|cipher|kdf|ecc [algonames]]", + "usage: bench-slope [options] [hash|mac|cipher|kdf|ecc|mpi [algonames]]", "", " options:", " --cpu-mhz Set CPU speed for calculating cycles", @@ -3128,6 +3425,7 @@ main (int argc, char **argv) cipher_bench (NULL, 0); kdf_bench (NULL, 0); ecc_bench (NULL, 0); + mpi_bench (NULL, 0); } else if (!strcmp (*argv, "hash")) { @@ -3169,6 +3467,14 @@ main (int argc, char **argv) warm_up_cpu (); ecc_bench ((argc == 0) ? NULL : argv, argc); } + else if (!strcmp (*argv, "mpi")) + { + argc--; + argv++; + + warm_up_cpu (); + mpi_bench ((argc == 0) ? NULL : argv, argc); + } else { fprintf (stderr, PGM ": unknown argument: %s\n", *argv); -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:37 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:37 +0300 Subject: [PATCH 3/5] mpi/amd64: fix use of 'movd' for 64-bit register move in lshift&rshift In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-3-jussi.kivilinna@iki.fi> * mpi/amd64/mpih-lshift.S: Use 'movq' instead of 'movd' for moving value to %rax. * mpi/amd64/mpih-rshift.S: Likewise. 
-- Signed-off-by: Jussi Kivilinna --- mpi/amd64/mpih-lshift.S | 2 +- mpi/amd64/mpih-rshift.S | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mpi/amd64/mpih-lshift.S b/mpi/amd64/mpih-lshift.S index c11e808c..3fa6e4fd 100644 --- a/mpi/amd64/mpih-lshift.S +++ b/mpi/amd64/mpih-lshift.S @@ -52,7 +52,7 @@ C_SYMBOL_NAME(_gcry_mpih_lshift:) movd %eax, %xmm0 movdqa %xmm4, %xmm3 psrlq %xmm0, %xmm4 - movd %xmm4, %rax + movq %xmm4, %rax subq $2, %rdx jl .Lendo diff --git a/mpi/amd64/mpih-rshift.S b/mpi/amd64/mpih-rshift.S index 430ba4b0..4bc5db22 100644 --- a/mpi/amd64/mpih-rshift.S +++ b/mpi/amd64/mpih-rshift.S @@ -52,7 +52,7 @@ C_SYMBOL_NAME(_gcry_mpih_rshift:) movd %eax, %xmm0 movdqa %xmm4, %xmm3 psllq %xmm0, %xmm4 - movd %xmm4, %rax + movq %xmm4, %rax leaq (%rsi,%rdx,8), %rsi leaq (%rdi,%rdx,8), %rdi negq %rdx -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:36 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:36 +0300 Subject: [PATCH 2/5] mpi: avoid MPI copy at gcry_mpi_sub In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-2-jussi.kivilinna@iki.fi> * mpi/mpi-add.c (_gcry_mpi_add): Rename function... (_gcry_mpi_add_inv_sign): ... to this and add parameter for inverting sign of second operand. (_gcry_mpi_add): New. (_gcry_mpi_sub): Remove mpi_copy and instead use new '_gcry_mpi_add_inv_sign' function with inverted sign for second operand. 
-- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.052 ns/B 18301 MiB/s 0.287 c/B 5500 sub | 0.098 ns/B 9768 MiB/s 0.537 c/B 5500 After: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.030 ns/B 31771 MiB/s 0.165 c/B 5500 sub | 0.031 ns/B 31187 MiB/s 0.168 c/B 5500 Signed-off-by: Jussi Kivilinna --- mpi/mpi-add.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/mpi/mpi-add.c b/mpi/mpi-add.c index 38dd352f..2fd19e55 100644 --- a/mpi/mpi-add.c +++ b/mpi/mpi-add.c @@ -84,8 +84,8 @@ _gcry_mpi_add_ui (gcry_mpi_t w, gcry_mpi_t u, unsigned long v ) } -void -_gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) +static void +_gcry_mpi_add_inv_sign(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, int inv_v_sign) { mpi_ptr_t wp, up, vp; mpi_size_t usize, vsize, wsize; @@ -93,7 +93,7 @@ _gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) if( u->nlimbs < v->nlimbs ) { /* Swap U and V. */ usize = v->nlimbs; - usign = v->sign; + usign = v->sign ^ inv_v_sign; vsize = u->nlimbs; vsign = u->sign; wsize = usize + 1; @@ -106,7 +106,7 @@ _gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) usize = u->nlimbs; usign = u->sign; vsize = v->nlimbs; - vsign = v->sign; + vsign = v->sign ^ inv_v_sign; wsize = usize + 1; RESIZE_IF_NEEDED(w, wsize); /* These must be after realloc (u or v may be the same as w). */ @@ -211,13 +211,16 @@ _gcry_mpi_sub_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned long v ) w->sign = wsign; } +void +_gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) +{ + _gcry_mpi_add_inv_sign (w, u, v, 0); +} + void _gcry_mpi_sub(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) { - gcry_mpi_t vv = mpi_copy (v); - vv->sign = ! 
vv->sign; - mpi_add (w, u, vv); - mpi_free (vv); + _gcry_mpi_add_inv_sign (w, u, v, 1); } -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:38 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:38 +0300 Subject: [PATCH 4/5] mpi/amd64: optimize add_n and sub_n In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-4-jussi.kivilinna@iki.fi> * mpi/amd64/mpih-add1.S (_gcry_mpih_add_n): New implementation with 4x unrolled fast-path loop. * mpi/amd64/mpih-sub1.S (_gcry_mpih_sub_n): Likewise. -- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.035 ns/B 27559 MiB/s 0.163 c/B 4700 sub | 0.034 ns/B 28332 MiB/s 0.158 c/B 4700 After (~26% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.027 ns/B 35271 MiB/s 0.127 c/B 4700 sub | 0.027 ns/B 35206 MiB/s 0.127 c/B 4700 Signed-off-by: Jussi Kivilinna --- mpi/amd64/mpih-add1.S | 81 ++++++++++++++++++++++++++++++++++++------- mpi/amd64/mpih-sub1.S | 80 +++++++++++++++++++++++++++++++++++------- 2 files changed, 136 insertions(+), 25 deletions(-) diff --git a/mpi/amd64/mpih-add1.S b/mpi/amd64/mpih-add1.S index 833a43cb..f2e86237 100644 --- a/mpi/amd64/mpih-add1.S +++ b/mpi/amd64/mpih-add1.S @@ -3,6 +3,7 @@ * * Copyright (C) 1992, 1994, 1995, 1998, * 2001, 2002, 2006 Free Software Foundation, Inc. + * Copyright (C) 2023 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -39,26 +40,80 @@ * mpi_ptr_t s2_ptr, rdx * mpi_size_t size) rcx */ - TEXT ALIGN(4) .globl C_SYMBOL_NAME(_gcry_mpih_add_n) C_SYMBOL_NAME(_gcry_mpih_add_n:) FUNC_ENTRY() - leaq (%rsi,%rcx,8), %rsi - leaq (%rdi,%rcx,8), %rdi - leaq (%rdx,%rcx,8), %rdx - negq %rcx - xorl %eax, %eax /* clear cy */ + movl %ecx, %r9d + andl $3, %r9d + je .Lprehandle0 + cmpl $2, %r9d + jb .Lprehandle1 + je .Lprehandle2 + +#define FIRST_ADD() \ + movq (%rsi), %rax; \ + addq (%rdx), %rax; \ + movq %rax, (%rdi) + +#define NEXT_ADD(offset) \ + movq offset(%rsi), %rax; \ + adcq offset(%rdx), %rax; \ + movq %rax, offset(%rdi) + +.Lprehandle3: + leaq -2(%rcx), %rcx + FIRST_ADD(); + NEXT_ADD(8); + NEXT_ADD(16); + decq %rcx + je .Lend + leaq 24(%rsi), %rsi + leaq 24(%rdx), %rdx + leaq 24(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle2: + leaq -1(%rcx), %rcx + FIRST_ADD(); + NEXT_ADD(8); + decq %rcx + je .Lend + leaq 16(%rsi), %rsi + leaq 16(%rdx), %rdx + leaq 16(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle1: + FIRST_ADD(); + decq %rcx + je .Lend + leaq 8(%rsi), %rsi + leaq 8(%rdx), %rdx + leaq 8(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle0: + clc /* clear cy */ ALIGN(4) /* minimal alignment for claimed speed */ -.Loop: movq (%rsi,%rcx,8), %rax - movq (%rdx,%rcx,8), %r10 - adcq %r10, %rax - movq %rax, (%rdi,%rcx,8) - incq %rcx +.Loop: leaq -3(%rcx), %rcx + NEXT_ADD(0); + NEXT_ADD(8); + NEXT_ADD(16); + NEXT_ADD(24); + leaq 32(%rsi), %rsi + leaq 32(%rdx), %rdx + leaq 32(%rdi), %rdi + decq %rcx jne .Loop - movq %rcx, %rax /* zero %rax */ - adcq %rax, %rax + ALIGN(2) +.Lend: + movl $0, %eax /* zero %rax */ + adcl %eax, %eax FUNC_EXIT() diff --git a/mpi/amd64/mpih-sub1.S b/mpi/amd64/mpih-sub1.S index 8c61cb20..32799c86 100644 --- a/mpi/amd64/mpih-sub1.S +++ b/mpi/amd64/mpih-sub1.S @@ -3,6 +3,7 @@ * * Copyright (C) 1992, 1994, 1995, 1998, * 2001, 2002, 2006 Free Software Foundation, Inc. + * Copyright (C) 2023 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -44,20 +45,75 @@ .globl C_SYMBOL_NAME(_gcry_mpih_sub_n) C_SYMBOL_NAME(_gcry_mpih_sub_n:) FUNC_ENTRY() - leaq (%rsi,%rcx,8), %rsi - leaq (%rdi,%rcx,8), %rdi - leaq (%rdx,%rcx,8), %rdx - negq %rcx - xorl %eax, %eax /* clear cy */ + movl %ecx, %r9d + andl $3, %r9d + je .Lprehandle0 + cmpl $2, %r9d + jb .Lprehandle1 + je .Lprehandle2 + +#define FIRST_SUB() \ + movq (%rsi), %rax; \ + subq (%rdx), %rax; \ + movq %rax, (%rdi) + +#define NEXT_SUB(offset) \ + movq offset(%rsi), %rax; \ + sbbq offset(%rdx), %rax; \ + movq %rax, offset(%rdi) + +.Lprehandle3: + leaq -2(%rcx), %rcx + FIRST_SUB(); + NEXT_SUB(8); + NEXT_SUB(16); + decq %rcx + je .Lend + leaq 24(%rsi), %rsi + leaq 24(%rdx), %rdx + leaq 24(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle2: + leaq -1(%rcx), %rcx + FIRST_SUB(); + NEXT_SUB(8); + decq %rcx + je .Lend + leaq 16(%rsi), %rsi + leaq 16(%rdx), %rdx + leaq 16(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle1: + FIRST_SUB(); + decq %rcx + je .Lend + leaq 8(%rsi), %rsi + leaq 8(%rdx), %rdx + leaq 8(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle0: + clc /* clear cy */ ALIGN(4) /* minimal alignment for claimed speed */ -.Loop: movq (%rsi,%rcx,8), %rax - movq (%rdx,%rcx,8), %r10 - sbbq %r10, %rax - movq %rax, (%rdi,%rcx,8) - incq %rcx +.Loop: leaq -3(%rcx), %rcx + NEXT_SUB(0); + NEXT_SUB(8); + NEXT_SUB(16); + NEXT_SUB(24); + leaq 32(%rsi), %rsi + leaq 32(%rdx), %rdx + leaq 32(%rdi), %rdi + decq %rcx jne .Loop - movq %rcx, %rax /* zero %rax */ - adcq %rax, %rax + ALIGN(2) +.Lend: + movl $0, %eax /* zero %rax */ + adcl %eax, %eax FUNC_EXIT() -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:39 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:39 +0300 Subject: [PATCH 5/5] mpi: optimize mpi_rshift and mpi_lshift to avoid extra MPI copying In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: 
<20230422073539.894723-5-jussi.kivilinna@iki.fi> * mpi/mpi-bit.c (_gcry_mpi_rshift): Refactor so that _gcry_mpih_rshift is used to do the copying along with shifting when copying is needed and refactor so that same code-path is used for both in-place and copying operation. (_gcry_mpi_lshift): Refactor so that _gcry_mpih_lshift is used to do the copying along with shifting when copying is needed and refactor so that same code-path is used for both in-place and copying operation. -- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz rshift3 | 0.039 ns/B 24662 MiB/s 0.182 c/B 4700 lshift3 | 0.108 ns/B 8832 MiB/s 0.508 c/B 4700 rshift65 | 0.137 ns/B 6968 MiB/s 0.643 c/B 4700 lshift65 | 0.109 ns/B 8776 MiB/s 0.511 c/B 4700 After: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz rshift3 | 0.038 ns/B 25049 MiB/s 0.179 c/B 4700 lshift3 | 0.039 ns/B 24709 MiB/s 0.181 c/B 4700 rshift65 | 0.038 ns/B 24942 MiB/s 0.180 c/B 4700 lshift65 | 0.040 ns/B 23671 MiB/s 0.189 c/B 4700 Signed-off-by: Jussi Kivilinna --- mpi/mpi-bit.c | 138 +++++++++++++++++++------------------------------- 1 file changed, 51 insertions(+), 87 deletions(-) diff --git a/mpi/mpi-bit.c b/mpi/mpi-bit.c index e2170401..7313a9d4 100644 --- a/mpi/mpi-bit.c +++ b/mpi/mpi-bit.c @@ -251,10 +251,11 @@ _gcry_mpi_rshift_limbs( gcry_mpi_t a, unsigned int count ) void _gcry_mpi_rshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) { - mpi_size_t xsize; - unsigned int i; unsigned int nlimbs = (n/BITS_PER_MPI_LIMB); unsigned int nbits = (n%BITS_PER_MPI_LIMB); + unsigned int i; + mpi_size_t alimbs; + mpi_ptr_t xp, ap; if (mpi_is_immutable (x)) { @@ -262,75 +263,42 @@ _gcry_mpi_rshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) return; } - if ( x == a ) - { - /* In-place operation. 
*/ - if ( nlimbs >= x->nlimbs ) - { - x->nlimbs = 0; - return; - } + alimbs = a->nlimbs; - if (nlimbs) - { - for (i=0; i < x->nlimbs - nlimbs; i++ ) - x->d[i] = x->d[i+nlimbs]; - x->d[i] = 0; - x->nlimbs -= nlimbs; - - } - if ( x->nlimbs && nbits ) - _gcry_mpih_rshift ( x->d, x->d, x->nlimbs, nbits ); - } - else if ( nlimbs ) + if (x != a) { - /* Copy and shift by more or equal bits than in a limb. */ - xsize = a->nlimbs; + RESIZE_IF_NEEDED (x, alimbs); + x->nlimbs = alimbs; + x->flags = a->flags; x->sign = a->sign; - RESIZE_IF_NEEDED (x, xsize); - x->nlimbs = xsize; - for (i=0; i < a->nlimbs; i++ ) - x->d[i] = a->d[i]; - x->nlimbs = i; - - if ( nlimbs >= x->nlimbs ) - { - x->nlimbs = 0; - return; - } + } + + /* In-place operation. */ + if (nlimbs >= alimbs) + { + x->nlimbs = 0; + return; + } + + xp = x->d; + ap = a->d; + if (alimbs && nbits) + { + _gcry_mpih_rshift (xp, ap + nlimbs, alimbs - nlimbs, nbits); if (nlimbs) - { - for (i=0; i < x->nlimbs - nlimbs; i++ ) - x->d[i] = x->d[i+nlimbs]; - x->d[i] = 0; - x->nlimbs -= nlimbs; - } - - if ( x->nlimbs && nbits ) - _gcry_mpih_rshift ( x->d, x->d, x->nlimbs, nbits ); + xp[alimbs - nlimbs] = 0; + x->nlimbs -= nlimbs; } - else + else if (nlimbs || (x != a)) { - /* Copy and shift by less than bits in a limb. */ - xsize = a->nlimbs; - x->sign = a->sign; - RESIZE_IF_NEEDED (x, xsize); - x->nlimbs = xsize; - - if ( xsize ) - { - if (nbits ) - _gcry_mpih_rshift (x->d, a->d, x->nlimbs, nbits ); - else - { - /* The rshift helper function is not specified for - NBITS==0, thus we do a plain copy here. 
*/ - for (i=0; i < x->nlimbs; i++ ) - x->d[i] = a->d[i]; - } - } + for (i = 0; i < alimbs - nlimbs; i++ ) + xp[i] = ap[i + nlimbs]; + if (nlimbs) + xp[i] = 0; + x->nlimbs -= nlimbs; } + MPN_NORMALIZE (x->d, x->nlimbs); } @@ -368,6 +336,9 @@ _gcry_mpi_lshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) { unsigned int nlimbs = (n/BITS_PER_MPI_LIMB); unsigned int nbits = (n%BITS_PER_MPI_LIMB); + mpi_size_t alimbs; + mpi_ptr_t xp, ap; + int i; if (mpi_is_immutable (x)) { @@ -378,34 +349,27 @@ _gcry_mpi_lshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) if (x == a && !n) return; /* In-place shift with an amount of zero. */ - if ( x != a ) - { - /* Copy A to X. */ - unsigned int alimbs = a->nlimbs; - int asign = a->sign; - mpi_ptr_t xp, ap; - - RESIZE_IF_NEEDED (x, alimbs+nlimbs+1); - xp = x->d; - ap = a->d; - MPN_COPY (xp, ap, alimbs); - x->nlimbs = alimbs; - x->flags = a->flags; - x->sign = asign; - } + /* Note: might be in-place operation, so a==x or a!=x. */ + + alimbs = a->nlimbs; - if (nlimbs && !nbits) + RESIZE_IF_NEEDED (x, alimbs + nlimbs + 1); + xp = x->d; + ap = a->d; + if (nbits && alimbs) { - /* Shift a full number of limbs. */ - _gcry_mpi_lshift_limbs (x, nlimbs); + x->nlimbs = alimbs + nlimbs + 1; + xp[alimbs + nlimbs] = _gcry_mpih_lshift (xp + nlimbs, ap, alimbs, nbits); } - else if (n) + else { - /* We use a very dump approach: Shift left by the number of - limbs plus one and than fix it up by an rshift. */ - _gcry_mpi_lshift_limbs (x, nlimbs+1); - mpi_rshift (x, x, BITS_PER_MPI_LIMB - nbits); + x->nlimbs = alimbs + nlimbs; + for (i = alimbs - 1; i >= 0; i--) + xp[i + nlimbs] = ap[i]; } - + for (i = 0; i < nlimbs; i++) + xp[i] = 0; + x->flags = a->flags; + x->sign = a->sign; MPN_NORMALIZE (x->d, x->nlimbs); } -- 2.39.2