From Johannes.Schindelin at gmx.de  Wed Jun 16 10:07:11 2021
From: Johannes.Schindelin at gmx.de (Johannes Schindelin)
Date: Wed, 16 Jun 2021 10:07:11 +0200 (CEST)
Subject: [PATCH] Fix broken mlock detection
Message-ID: 

We need to be careful when casting a pointer to a `long int`: the
highest bit might be set, in which case the result is a negative
number.

In this instance, it is fatal: we now take the modulus of that negative
number with respect to the page size, and subtract it from the page
size. So what should be a number that is smaller than the page size is
now larger than the page size. As a consequence, we do not try to lock
a 4096-byte block that is at the page size boundary inside a
`malloc()`ed block, but we try to do that _outside_ the block. Which
means that we are not at all detecting whether `mlock()` is broken.

This actually happened here, in the i686 MSYS2 build of libgcrypt.

Let's be very careful to cast the pointer to an _unsigned_ value
instead.

Note: technically, we should cast the pointer to a `size_t`. But since
we only need the remainder modulo the page size (which is a power of
two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
to a 32-bit `unsigned long`. It does matter, though, whether we
mistakenly turn the remainder into a negative one.
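[Editor's illustration: a standalone sketch of the corrected alignment
step. The helper name is mine, not part of the patch; the real test
lives in acinclude.m4, shown in the diff below. Going through an
unsigned type keeps the remainder non-negative:]

```c
#include <stdint.h>

/* Advance a malloc()ed pointer to the next page boundary.  Casting
 * through the unsigned uintptr_t keeps `pool % pgsize` non-negative
 * even when the address has its highest bit set; with a signed
 * `long int` cast, the remainder can be negative, so the adjustment
 * overshoots past the intended page boundary. */
static char *
align_to_page (char *pool, uintptr_t pgsize)
{
  return pool + (pgsize - ((uintptr_t) pool % pgsize));
}
```

If `pool` is already aligned, this advances by a full page; the
configure test allocates `4096 + pgsize` bytes, so either way the
aligned pointer still has 4096 usable bytes behind it.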
Signed-off-by: Johannes Schindelin 
---
 acinclude.m4 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/acinclude.m4 b/acinclude.m4
index 3c8dfba7..4a2a83c0 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -236,7 +236,7 @@ int main()
     pool = malloc( 4096 + pgsize );
     if( !pool )
         return 2;
-    pool += (pgsize - ((long int)pool % pgsize));
+    pool += (pgsize - ((unsigned long int)pool % pgsize));
     err = mlock( pool, 4096 );
     if( !err || errno == EPERM || errno == EAGAIN)
--
2.31.1

From wk at gnupg.org  Wed Jun 16 15:58:37 2021
From: wk at gnupg.org (Werner Koch)
Date: Wed, 16 Jun 2021 15:58:37 +0200
Subject: [PATCH] Fix broken mlock detection
In-Reply-To: (Johannes Schindelin via Gcrypt-devel's message of "Wed, 16 Jun 2021 10:07:11 +0200 (CEST)")
References: 
Message-ID: <87wnqtsygi.fsf@wheatstone.g10code.de>

Hi!

On Wed, 16 Jun 2021 10:07, Johannes Schindelin said:

> Which means that we are not at all detecting whether `mlock()` is
> broken.

Thanks for your correct analysis.  I wrote this test code 23 years ago
and this is the first report.  Seems that until now this has never been
tried on systems which allocate memory above 2 GiB.

> This actually happened here, in the i686 MSYS2 build of libgcrypt.

Please take care: I would not suggest building the Windows version with
MSYS - the only supported toolchain for Windows is gcc.

> we only need the remainder modulo the page size (which is a power of
> two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
> to a 32-bit `unsigned long`.  It does matter, though, whether we

Yep.  I would anyway use size_t here to avoid questions about the
reasoning.  In fact secmem.c uses uintptr_t, but that is a bit too
complicated to use in this configure test.

Thanks,

Werner

--
Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
From Johannes.Schindelin at gmx.de  Thu Jun 17 14:13:25 2021
From: Johannes.Schindelin at gmx.de (Johannes Schindelin)
Date: Thu, 17 Jun 2021 14:13:25 +0200 (CEST)
Subject: [PATCH] Fix broken mlock detection
In-Reply-To: <87wnqtsygi.fsf@wheatstone.g10code.de>
References: <87wnqtsygi.fsf@wheatstone.g10code.de>
Message-ID: 

Hi Werner,

On Wed, 16 Jun 2021, Werner Koch wrote:

> On Wed, 16 Jun 2021 10:07, Johannes Schindelin said:
>
> > Which means that we are not at all detecting whether `mlock()` is
> > broken.
>
> Thanks for your correct analysis.  I wrote this test code 23 years ago
> and this is the first report.  Seems that until now this has never been
> tried on systems which allocate memory above 2 GiB.

Right. 23 years ago, some people still believed that nobody would ever
need more than 640 kilobytes of RAM ;-)

BTW I _suspect_ that the reason I ran into this is a recent change in
Cygwin and/or Windows 10. I cannot recall seeing `malloc()`ed blocks
above 0x80000000.

> > This actually happened here, in the i686 MSYS2 build of libgcrypt.
>
> Please take care: I would not suggest building the Windows version with
> MSYS - the only supported toolchain for Windows is gcc.

Oh, I should have been clearer: I _am_ building using GCC. MSYS2 is
based partially on Cygwin (the MSYS2 runtime is a close fork of the
Cygwin runtime, to provide a POSIX emulation layer) and partially on
Arch Linux (from which it inherits its package management system,
pacman). It is a well-tested system and an important building block of
Git for Windows: millions of users have relied on it for over five
years.

In other words: I am not worried about building and using GNU Privacy
Guard and libgcrypt in MSYS2's context ;-)

> > we only need the remainder modulo the page size (which is a power of
> > two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
> > to a 32-bit `unsigned long`.
> > It does matter, though, whether we
>
> Yep.  I would anyway use size_t here to avoid questions about the
> reasoning.  In fact secmem.c uses uintptr_t, but that is a bit too
> complicated to use in this configure test.

As long as you can be sure that the type that is used is actually
defined, I do not care which one you use. I used `unsigned long`
because there is no question that it is defined here, whereas I have
run into compile problems in the past on systems where `size_t` was
not defined.

Ciao,
Dscho

From jussi.kivilinna at iki.fi  Sat Jun 19 15:36:02 2021
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sat, 19 Jun 2021 16:36:02 +0300
Subject: [PATCH v2] mpi/ec: add fast reduction functions for NIST curves
Message-ID: <20210619133602.665603-1-jussi.kivilinna@iki.fi>

* configure.ac (ASM_DISABLED): New.
* mpi/Makefile.am: Add 'ec-nist.c' and 'ec-inline.h'.
* mpi/ec-nist.c: New.
* mpi/ec-inline.h: New.
* mpi/ec-internal.h (_gcry_mpi_ec_nist192_mod)
(_gcry_mpi_ec_nist224_mod, _gcry_mpi_ec_nist256_mod)
(_gcry_mpi_ec_nist384_mod, _gcry_mpi_ec_nist521_mod): New.
* mpi/ec.c (ec_addm, ec_subm, ec_mulm, ec_mul2): Use 'ctx->mod'.
(field_table): Add 'mod' function; Add NIST reduction functions.
(ec_p_init): Setup ctx->mod; Setup function pointers from field_table
only if pointer is not NULL; Resize ctx->a and ctx->b only if set.
* mpi/mpi-internal.h (RESIZE_AND_CLEAR_IF_NEEDED): New.
* mpi/mpiutil.c (_gcry_mpi_resize): Clear all unused limbs also in
realloc case.
* src/ec-context.h (mpi_ec_ctx_s): Add 'mod' function.
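[Editor's illustration: the `_gcry_mpi_ec_nist*_mod` functions added
here implement Solinas-style fast reduction for the NIST primes. As a
rough standalone sketch of the technique — my own helper names and a
plain `unsigned __int128` accumulator, not the patch's limb code — here
is the reduction for P-192, p = 2^192 - 2^64 - 1:]

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* p = 2^192 - 2^64 - 1, little-endian 64-bit limbs. */
static const uint64_t P192[3] = {
  0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL, 0xFFFFFFFFFFFFFFFFULL
};

static int
geq_p192 (const uint64_t t[3])
{
  int i;
  for (i = 2; i >= 0; i--)
    {
      if (t[i] > P192[i]) return 1;
      if (t[i] < P192[i]) return 0;
    }
  return 1; /* equal */
}

static void
sub_p192 (uint64_t t[3])
{
  u128 d = (u128) t[0] - P192[0];
  t[0] = (uint64_t) d;
  d = (u128) t[1] - P192[1] - ((uint64_t)(d >> 64) ? 1 : 0);
  t[1] = (uint64_t) d;
  t[2] = t[2] - P192[2] - ((uint64_t)(d >> 64) ? 1 : 0);
}

/* Reduce a 384-bit product c[0..5] modulo P-192 using the FIPS 186-4
 * Solinas terms (64-bit words):
 *   s1 = (c2,c1,c0), s2 = (0,c3,c3), s3 = (c4,c4,0), s4 = (c5,c5,c5)
 * result = s1 + s2 + s3 + s4 mod p -- a handful of adds instead of a
 * general division. */
static void
p192_mod (uint64_t r[3], const uint64_t c[6])
{
  uint64_t t[3], carry;
  u128 acc;

  acc = (u128) c[0] + c[3] + c[5];
  t[0] = (uint64_t) acc;  acc >>= 64;
  acc += (u128) c[1] + c[3] + c[4] + c[5];
  t[1] = (uint64_t) acc;  acc >>= 64;
  acc += (u128) c[2] + c[4] + c[5];
  t[2] = (uint64_t) acc;
  carry = (uint64_t) (acc >> 64);

  while (carry)  /* fold overflow back in: 2^192 == 2^64 + 1 (mod p) */
    {
      acc = (u128) t[0] + carry;
      t[0] = (uint64_t) acc;
      acc = (acc >> 64) + (u128) t[1] + carry;
      t[1] = (uint64_t) acc;
      acc = (acc >> 64) + t[2];
      t[2] = (uint64_t) acc;
      carry = (uint64_t) (acc >> 64);
    }
  while (geq_p192 (t))
    sub_p192 (t);
  r[0] = t[0]; r[1] = t[1]; r[2] = t[2];
}
```

[The patch's versions additionally run in constant time and operate on
mpi limb vectors; this sketch only shows the folding idea.]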
--

Benchmark on AMD Ryzen 7 5800X (x86_64):

Before:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz
 mult      | 283346 1369473 4833
 keygen    | 1688442 8185744 4848
 sign      | 549683 2662984 4845
 verify    | 615284 2984325 4850
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz
 mult      | 516443 2501173 4843
 keygen    | 2859746 13866802 4849
 sign      | 918472 4455043 4850
 verify    | 1057940 5131372 4850
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz
 mult      | 423536 2054040 4850
 keygen    | 2383097 11557572 4850
 sign      | 774346 3754243 4848
 verify    | 864934 4196315 4852
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz
 mult      | 929985 4511881 4852
 keygen    | 5230788 25367299 4850
 sign      | 1671432 8109726 4852
 verify    | 1902729 9228568 4850
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz
 mult      | 2123546 10300952 4851
 keygen    | 12019340 58297774 4850
 sign      | 3886988 18853054 4850
 verify    | 4507885 21864015 4850

After:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 186679 905603 4851 +51%
 keygen    | 1161423 5623822 4842 +46%
 sign      | 389531 1887557 4846 +41%
 verify    | 412936 2000461 4844 +49%
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 260621 1256327 4821 +99%
 keygen    | 1557845 7531677 4835 +84%
 sign      | 521678 2527083 4844 +76%
 verify    | 554084 2677949 4833 +92%
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 319045 1542061 4833 +33%
 keygen    | 1834822 8898950 4850 +30%
 sign      | 612866 2972630 4850 +26%
 verify    | 664821 3222597 4847 +30%
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 593894 2875260 4841 +57%
 keygen    | 3526600 17089717 4846 +48%
 sign      | 1178098 5710151 4847 +42%
 verify    | 1260185 6107449 4846 +51%
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 1160220 5621946 4846 +83%
 keygen    | 6862975 33247351 4844 +75%
 sign      | 2287366 11096711 4851 +70%
 verify    | 2455858 11888045 4841 +84%

Benchmark on AMD Ryzen 7 5800X (i386):

Before:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz
 mult      | 648039 3143236 4850
 keygen    | 3554452 17244822 4852
 sign      | 1163173 5641932 4850
 verify    | 1300076 6305673 4850
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz
 mult      | 798607 3874405 4851
 keygen    | 4657604 22589864 4850
 sign      | 1515803 7352049 4850
 verify    | 1635470 7935373 4852
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz
 mult      | 927033 4496283 4850
 keygen    | 5313601 25771983 4850
 sign      | 1735795 8418514 4850
 verify    | 1945804 9438212 4851
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz
 mult      | 2301781 11164473 4850
 keygen    | 12856001 62353242 4850
 sign      | 4161041 20180651 4850
 verify    | 4705961 22827478 4851
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz
 mult      | 6066635 29422721 4850
 keygen    | 32995868 160046407 4850
 sign      | 10503306 50945387 4850
 verify    | 12225252 59294323 4850

After:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 413605 2007498 4854 +57%
 keygen    | 2479429 12010926 4844 +44%
 sign      | 825111 3997147 4844 +41%
 verify    | 890206 4318723 4851 +46%
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 551703 2676454 4851 +45%
 keygen    | 3257022 15781844 4845 +43%
 sign      | 1085678 5258894 4844 +40%
 verify    | 1172195 5678499 4844 +40%
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 720395 3497486 4855 +29%
 keygen    | 4217758 20461257 4851 +26%
 sign      | 1404350 6814131 4852 +24%
 verify    | 1515136 7353955 4854 +28%
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 1525742 7400771 4851 +51%
 keygen    | 9046660 43877889 4850 +42%
 sign      | 2974641 14408703 4844 +40%
 verify    | 3265285 15834951 4849 +44%
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 3289348 15968678 4855 +84%
 keygen    | 19354174 93873531 4850 +70%
 sign      | 6351493 30830140 4854 +65%
 verify    | 6979292 33854215 4851 +75%

Signed-off-by: Jussi Kivilinna 
---
 configure.ac | 3 +
 mpi/Makefile.am    |    2 +-
 mpi/ec-inline.h    | 1047 ++++++++++++++++++++++++++++++++++++++++
 mpi/ec-internal.h  |   16 +
 mpi/ec-nist.c      |  795 +++++++++++++++++++++++++++++++++
 mpi/ec.c           |   90 +++-
 mpi/mpi-internal.h |    5 +
 mpi/mpiutil.c      |    2 +-
 src/ec-context.h   |    1 +
 9 files changed, 1943 insertions(+), 18 deletions(-)
 create mode 100644 mpi/ec-inline.h
 create mode 100644 mpi/ec-nist.c

diff --git a/configure.ac b/configure.ac
index 37947ecb..6fdca24a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -546,6 +546,9 @@ AC_ARG_ENABLE([asm],
               [try_asm_modules=$enableval],
               [try_asm_modules=yes])
 AC_MSG_RESULT($try_asm_modules)
+if test "$try_asm_modules" != yes ; then
+  AC_DEFINE(ASM_DISABLED,1,[Defined if --disable-asm was used to configure])
+fi
 
 # Implementation of the --enable-m-guard switch.
 AC_MSG_CHECKING([whether memory guard is requested])

diff --git a/mpi/Makefile.am b/mpi/Makefile.am
index d06594e1..adb8e6f5 100644
--- a/mpi/Makefile.am
+++ b/mpi/Makefile.am
@@ -175,5 +175,5 @@ libmpi_la_SOURCES = longlong.h \
 	mpih-mul.c \
 	mpih-const-time.c \
 	mpiutil.c \
-	ec.c ec-internal.h ec-ed25519.c
+	ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h
 EXTRA_libmpi_la_SOURCES = asm-common-aarch64.h

diff --git a/mpi/ec-inline.h b/mpi/ec-inline.h
new file mode 100644
index 00000000..25c3b40d
--- /dev/null
+++ b/mpi/ec-inline.h
@@ -0,0 +1,1047 @@
+/* ec-inline.h - EC inline addition/subtraction helpers
+ * Copyright (C) 2021 Jussi Kivilinna
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#ifndef GCRY_EC_INLINE_H +#define GCRY_EC_INLINE_H + +#include "mpi-internal.h" +#include "longlong.h" +#include "ec-context.h" +#include "../cipher/bithelp.h" +#include "../cipher/bufhelp.h" + + +#if BYTES_PER_MPI_LIMB == 8 + +/* 64-bit limb definitions for 64-bit architectures. */ + +#define LIMBS_PER_LIMB64 1 +#define LOAD64(x, pos) ((x)[pos]) +#define STORE64(x, pos, v) ((x)[pos] = (mpi_limb_t)(v)) +#define LIMB_TO64(v) ((mpi_limb_t)(v)) +#define LIMB_FROM64(v) ((mpi_limb_t)(v)) +#define HIBIT_LIMB64(v) ((mpi_limb_t)(v) >> (BITS_PER_MPI_LIMB - 1)) +#define HI32_LIMB64(v) (u32)((mpi_limb_t)(v) >> (BITS_PER_MPI_LIMB - 32)) +#define LO32_LIMB64(v) ((u32)(v)) +#define LIMB64_C(hi, lo) (((mpi_limb_t)(u32)(hi) << 32) | (u32)(lo)) +#define STORE64_COND(x, pos, mask1, val1, mask2, val2) \ + ((x)[(pos)] = ((mask1) & (val1)) | ((mask2) & (val2))) + +typedef mpi_limb_t mpi_limb64_t; + +static inline u32 +LOAD32(mpi_ptr_t x, unsigned int pos) +{ + unsigned int shr = (pos % 2) * 32; + return (x[pos / 2] >> shr); +} + +static inline mpi_limb64_t +LIMB64_HILO(u32 hi, u32 lo) +{ + mpi_limb64_t v = hi; + return (v << 32) | lo; +} + + +/* x86-64 addition/subtraction helpers. 
*/ +#if defined (__x86_64__) && defined(HAVE_CPU_ARCH_X86) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("addq %8, %2\n" \ + "adcq %7, %1\n" \ + "adcq %6, %0\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A3, A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subq %8, %2\n" \ + "sbbq %7, %1\n" \ + "sbbq %6, %0\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("addq %11, %3\n" \ + "adcq %10, %2\n" \ + "adcq %9, %1\n" \ + "adcq %8, %0\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subq %11, %3\n" \ + "sbbq %10, %2\n" \ + "sbbq %9, %1\n" \ + "sbbq %8, %0\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("addq %14, %4\n" \ + "adcq %13, %3\n" \ + "adcq %12, %2\n" \ + "adcq %11, %1\n" \ + "adcq %10, %0\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" 
(A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C4)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subq %14, %4\n" \ + "sbbq %13, %3\n" \ + "sbbq %12, %2\n" \ + "sbbq %11, %1\n" \ + "sbbq %10, %0\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C4)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __x86_64__ */ + + +/* ARM AArch64 addition/subtraction helpers. */ +#if defined (__aarch64__) && defined(HAVE_CPU_ARCH_ARM) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("adds %2, %5, %8\n" \ + "adcs %1, %4, %7\n" \ + "adc %0, %3, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subs %2, %5, %8\n" \ + "sbcs %1, %4, %7\n" \ + "sbc %0, %3, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("adds %3, %7, %11\n" \ + "adcs %2, %6, %10\n" \ + "adcs %1, %5, %9\n" \ + "adc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" 
(A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subs %3, %7, %11\n" \ + "sbcs %2, %6, %10\n" \ + "sbcs %1, %5, %9\n" \ + "sbc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("adds %4, %9, %14\n" \ + "adcs %3, %8, %13\n" \ + "adcs %2, %7, %12\n" \ + "adcs %1, %6, %11\n" \ + "adc %0, %5, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subs %4, %9, %14\n" \ + "sbcs %3, %8, %13\n" \ + "sbcs %2, %7, %12\n" \ + "sbcs %1, %6, %11\n" \ + "sbc %0, %5, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __aarch64__ */ + + +/* PowerPC64 addition/subtraction helpers. 
*/ +#if defined (__powerpc__) && defined(HAVE_CPU_ARCH_PPC) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("addc %2, %8, %5\n" \ + "adde %1, %7, %4\n" \ + "adde %0, %6, %3\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc", "r0") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subfc %2, %8, %5\n" \ + "subfe %1, %7, %4\n" \ + "subfe %0, %6, %3\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc", "r0") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("addc %3, %11, %7\n" \ + "adde %2, %10, %6\n" \ + "adde %1, %9, %5\n" \ + "adde %0, %8, %4\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subfc %3, %11, %7\n" \ + "subfe %2, %10, %6\n" \ + "subfe %1, %9, %5\n" \ + "subfe %0, %8, %4\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("addc %4, %14, %9\n" \ + "adde %3, %13, %8\n" \ + "adde %2, %12, %7\n" \ + "adde %1, %11, %6\n" \ + "adde 
%0, %10, %5\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subfc %4, %14, %9\n" \ + "subfe %3, %13, %8\n" \ + "subfe %2, %12, %7\n" \ + "subfe %1, %11, %6\n" \ + "subfe %0, %10, %5\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __powerpc__ */ + + +/* s390x/zSeries addition/subtraction helpers. 
*/ +#if defined (__s390x__) && defined(HAVE_CPU_ARCH_S390X) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("algr %2, %8\n" \ + "alcgr %1, %7\n" \ + "alcgr %0, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("slgr %2, %8\n" \ + "slbgr %1, %7\n" \ + "slbgr %0, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("algr %3, %11\n" \ + "alcgr %2, %10\n" \ + "alcgr %1, %9\n" \ + "alcgr %0, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("slgr %3, %11\n" \ + "slbgr %2, %10\n" \ + "slbgr %1, %9\n" \ + "slbgr %0, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("algr %4, %14\n" \ + "alcgr %3, %13\n" \ + "alcgr %2, %12\n" \ + "alcgr %1, %11\n" \ + "alcgr %0, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r"
(A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("slgr %4, %14\n" \ + "slbgr %3, %13\n" \ + "slbgr %2, %12\n" \ + "slbgr %1, %11\n" \ + "slbgr %0, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __s390x__ */ + + +/* Common 64-bit arch addition/subtraction macros. */ + +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) \ + add_ssaaaa(A1, A0, B1, B0, C1, C0) + +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) \ + sub_ddmmss(A1, A0, B1, B0, C1, C0) + +#endif /* BYTES_PER_MPI_LIMB == 8 */ + + +#if BYTES_PER_MPI_LIMB == 4 + +/* 64-bit limb definitions for 32-bit architectures. 
*/ + +#define LIMBS_PER_LIMB64 2 +#define LIMB_FROM64(v) ((v).lo) +#define HIBIT_LIMB64(v) ((v).hi >> (BITS_PER_MPI_LIMB - 1)) +#define HI32_LIMB64(v) ((v).hi) +#define LO32_LIMB64(v) ((v).lo) +#define LOAD32(x, pos) ((x)[pos]) +#define LIMB64_C(hi, lo) { (lo), (hi) } + +typedef struct +{ + mpi_limb_t lo; + mpi_limb_t hi; +} mpi_limb64_t; + +static inline mpi_limb64_t +LOAD64(const mpi_ptr_t x, unsigned int pos) +{ + mpi_limb64_t v; + v.lo = x[pos * 2 + 0]; + v.hi = x[pos * 2 + 1]; + return v; +} + +static inline void +STORE64(mpi_ptr_t x, unsigned int pos, mpi_limb64_t v) +{ + x[pos * 2 + 0] = v.lo; + x[pos * 2 + 1] = v.hi; +} + +static inline void +STORE64_COND(mpi_ptr_t x, unsigned int pos, mpi_limb_t mask1, + mpi_limb64_t val1, mpi_limb_t mask2, mpi_limb64_t val2) +{ + x[pos * 2 + 0] = (mask1 & val1.lo) | (mask2 & val2.lo); + x[pos * 2 + 1] = (mask1 & val1.hi) | (mask2 & val2.hi); +} + +static inline mpi_limb64_t +LIMB_TO64(mpi_limb_t x) +{ + mpi_limb64_t v; + v.lo = x; + v.hi = 0; + return v; +} + +static inline mpi_limb64_t +LIMB64_HILO(mpi_limb_t hi, mpi_limb_t lo) +{ + mpi_limb64_t v; + v.lo = lo; + v.hi = hi; + return v; +} + + +/* i386 addition/subtraction helpers. 
*/ +#if defined (__i386__) && defined(HAVE_CPU_ARCH_X86) && __GNUC__ >= 4 + +#define ADD4_LIMB32(a3, a2, a1, a0, b3, b2, b1, b0, c3, c2, c1, c0) \ + __asm__ ("addl %11, %3\n" \ + "adcl %10, %2\n" \ + "adcl %9, %1\n" \ + "adcl %8, %0\n" \ + : "=r" (a3), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(b3)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c3)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc") + +#define ADD6_LIMB32(a5, a4, a3, a2, a1, a0, b5, b4, b3, b2, b1, b0, \ + c5, c4, c3, c2, c1, c0) do { \ + mpi_limb_t __carry6_32; \ + __asm__ ("addl %10, %3\n" \ + "adcl %9, %2\n" \ + "adcl %8, %1\n" \ + "sbbl %0, %0\n" \ + : "=r" (__carry6_32), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(0)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc"); \ + __asm__ ("addl $1, %3\n" \ + "adcl %10, %2\n" \ + "adcl %9, %1\n" \ + "adcl %8, %0\n" \ + : "=r" (a5), \ + "=&r" (a4), \ + "=&r" (a3), \ + "=&r" (__carry6_32) \ + : "0" ((mpi_limb_t)(b5)), \ + "1" ((mpi_limb_t)(b4)), \ + "2" ((mpi_limb_t)(b3)), \ + "3" ((mpi_limb_t)(__carry6_32)), \ + "g" ((mpi_limb_t)(c5)), \ + "g" ((mpi_limb_t)(c4)), \ + "g" ((mpi_limb_t)(c3)) \ + : "cc"); \ + } while (0) + +#define SUB4_LIMB32(a3, a2, a1, a0, b3, b2, b1, b0, c3, c2, c1, c0) \ + __asm__ ("subl %11, %3\n" \ + "sbbl %10, %2\n" \ + "sbbl %9, %1\n" \ + "sbbl %8, %0\n" \ + : "=r" (a3), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(b3)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c3)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc") + +#define SUB6_LIMB32(a5, a4, a3, a2, a1, a0, b5, b4, b3, b2, b1, b0, \ + c5, c4, c3, c2, c1, 
c0) do { \ + mpi_limb_t __borrow6_32; \ + __asm__ ("subl %10, %3\n" \ + "sbbl %9, %2\n" \ + "sbbl %8, %1\n" \ + "sbbl %0, %0\n" \ + : "=r" (__borrow6_32), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(0)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc"); \ + __asm__ ("addl $1, %3\n" \ + "sbbl %10, %2\n" \ + "sbbl %9, %1\n" \ + "sbbl %8, %0\n" \ + : "=r" (a5), \ + "=&r" (a4), \ + "=&r" (a3), \ + "=&r" (__borrow6_32) \ + : "0" ((mpi_limb_t)(b5)), \ + "1" ((mpi_limb_t)(b4)), \ + "2" ((mpi_limb_t)(b3)), \ + "3" ((mpi_limb_t)(__borrow6_32)), \ + "g" ((mpi_limb_t)(c5)), \ + "g" ((mpi_limb_t)(c4)), \ + "g" ((mpi_limb_t)(c3)) \ + : "cc"); \ + } while (0) + +#endif /* __i386__ */ + + +/* ARM addition/subtraction helpers. */ +#ifdef HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS + +#define ADD4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("adds %3, %7, %11\n" \ + "adcs %2, %6, %10\n" \ + "adcs %1, %5, %9\n" \ + "adc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C3)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD6_LIMB32(A5, A4, A3, A2, A1, A0, B5, B4, B3, B2, B1, B0, \ + C5, C4, C3, C2, C1, C0) do { \ + mpi_limb_t __carry6_32; \ + __asm__ ("adds %3, %7, %10\n" \ + "adcs %2, %6, %9\n" \ + "adcs %1, %5, %8\n" \ + "adc %0, %4, %4\n" \ + : "=r" (__carry6_32), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(0)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc"); \ + ADD4_LIMB32(A5, A4, A3, __carry6_32, B5, B4, B3, __carry6_32, \ + 
C5, C4, C3, 0xffffffffU); \ + } while (0) + +#define SUB4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subs %3, %7, %11\n" \ + "sbcs %2, %6, %10\n" \ + "sbcs %1, %5, %9\n" \ + "sbc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C3)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc") + + +#define SUB6_LIMB32(A5, A4, A3, A2, A1, A0, B5, B4, B3, B2, B1, B0, \ + C5, C4, C3, C2, C1, C0) do { \ + mpi_limb_t __borrow6_32; \ + __asm__ ("subs %3, %7, %10\n" \ + "sbcs %2, %6, %9\n" \ + "sbcs %1, %5, %8\n" \ + "sbc %0, %4, %4\n" \ + : "=r" (__borrow6_32), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(0)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc"); \ + SUB4_LIMB32(A5, A4, A3, __borrow6_32, B5, B4, B3, 0, \ + C5, C4, C3, -__borrow6_32); \ + } while (0) + +#endif /* HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS */ + + +/* Common 32-bit arch addition/subtraction macros. 
*/ + +#if defined(ADD4_LIMB32) +/* A[0..1] = B[0..1] + C[0..1] */ +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) \ + ADD4_LIMB32(A1.hi, A1.lo, A0.hi, A0.lo, \ + B1.hi, B1.lo, B0.hi, B0.lo, \ + C1.hi, C1.lo, C0.hi, C0.lo) +#else +/* A[0..1] = B[0..1] + C[0..1] */ +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) do { \ + mpi_limb_t __carry2_0, __carry2_1; \ + add_ssaaaa(__carry2_0, A0.lo, 0, B0.lo, 0, C0.lo); \ + add_ssaaaa(__carry2_1, A0.hi, 0, B0.hi, 0, C0.hi); \ + add_ssaaaa(__carry2_1, A0.hi, __carry2_1, A0.hi, 0, __carry2_0); \ + add_ssaaaa(A1.hi, A1.lo, B1.hi, B1.lo, C1.hi, C1.lo); \ + add_ssaaaa(A1.hi, A1.lo, A1.hi, A1.lo, 0, __carry2_1); \ + } while (0) +#endif + +#if defined(ADD6_LIMB32) +/* A[0..2] = B[0..2] + C[0..2] */ +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + ADD6_LIMB32(A2.hi, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + B2.hi, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + C2.hi, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo) +#endif + +#if defined(ADD6_LIMB32) +/* A[0..3] = B[0..3] + C[0..3] */ +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb_t __carry4; \ + ADD6_LIMB32(__carry4, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + 0, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + 0, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo); \ + ADD4_LIMB32(A3.hi, A3.lo, A2.hi, __carry4, \ + B3.hi, B3.lo, B2.hi, __carry4, \ + C3.hi, C3.lo, C2.hi, 0xffffffffU); \ + } while (0) +#endif + +#if defined(SUB4_LIMB32) +/* A[0..1] = B[0..1] - C[0..1] */ +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) \ + SUB4_LIMB32(A1.hi, A1.lo, A0.hi, A0.lo, \ + B1.hi, B1.lo, B0.hi, B0.lo, \ + C1.hi, C1.lo, C0.hi, C0.lo) +#else +/* A[0..1] = B[0..1] - C[0..1] */ +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) do { \ + mpi_limb_t __borrow2_0, __borrow2_1; \ + sub_ddmmss(__borrow2_0, A0.lo, 0, B0.lo, 0, C0.lo); \ + sub_ddmmss(__borrow2_1, A0.hi, 0, B0.hi, 0, C0.hi); \ + sub_ddmmss(__borrow2_1, A0.hi, __borrow2_1, A0.hi, 0, -__borrow2_0); \ + sub_ddmmss(A1.hi, A1.lo, B1.hi, B1.lo, C1.hi, C1.lo); \ + 
sub_ddmmss(A1.hi, A1.lo, A1.hi, A1.lo, 0, -__borrow2_1); \ + } while (0) +#endif + +#if defined(SUB6_LIMB32) +/* A[0..2] = B[0..2] - C[0..2] */ +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + SUB6_LIMB32(A2.hi, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + B2.hi, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + C2.hi, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo) +#endif + +#if defined(SUB6_LIMB32) +/* A[0..3] = B[0..3] - C[0..3] */ +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb_t __borrow4; \ + SUB6_LIMB32(__borrow4, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + 0, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + 0, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo); \ + SUB4_LIMB32(A3.hi, A3.lo, A2.hi, __borrow4, \ + B3.hi, B3.lo, B2.hi, 0, \ + C3.hi, C3.lo, C2.hi, -__borrow4); \ + } while (0) +#endif + +#endif /* BYTES_PER_MPI_LIMB == 4 */ + + +/* Common definitions. */ +#define BITS_PER_MPI_LIMB64 (BITS_PER_MPI_LIMB * LIMBS_PER_LIMB64) +#define BYTES_PER_MPI_LIMB64 (BYTES_PER_MPI_LIMB * LIMBS_PER_LIMB64) + + +/* Common addition/subtraction macros. 
*/ + +#ifndef ADD3_LIMB64 +/* A[0..2] = B[0..2] + C[0..2] */ +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) do { \ + mpi_limb64_t __carry3; \ + ADD2_LIMB64(__carry3, A0, zero, B0, zero, C0); \ + ADD2_LIMB64(A2, A1, B2, B1, C2, C1); \ + ADD2_LIMB64(A2, A1, A2, A1, zero, __carry3); \ + } while (0) +#endif + +#ifndef ADD4_LIMB64 +/* A[0..3] = B[0..3] + C[0..3] */ +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry4; \ + ADD3_LIMB64(__carry4, A1, A0, zero, B1, B0, zero, C1, C0); \ + ADD2_LIMB64(A3, A2, B3, B2, C3, C2); \ + ADD2_LIMB64(A3, A2, A3, A2, zero, __carry4); \ + } while (0) +#endif + +#ifndef ADD5_LIMB64 +/* A[0..4] = B[0..4] + C[0..4] */ +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry5; \ + ADD4_LIMB64(__carry5, A2, A1, A0, zero, B2, B1, B0, zero, C2, C1, C0); \ + ADD2_LIMB64(A4, A3, B4, B3, C4, C3); \ + ADD2_LIMB64(A4, A3, A4, A3, zero, __carry5); \ + } while (0) +#endif + +#ifndef ADD7_LIMB64 +/* A[0..6] = B[0..6] + C[0..6] */ +#define ADD7_LIMB64(A6, A5, A4, A3, A2, A1, A0, B6, B5, B4, B3, B2, B1, B0, \ + C6, C5, C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry7; \ + ADD4_LIMB64(__carry7, A2, A1, A0, zero, B2, B1, B0, \ + zero, C2, C1, C0); \ + ADD5_LIMB64(A6, A5, A4, A3, __carry7, B6, B5, B4, B3, \ + __carry7, C6, C5, C4, C3, LIMB64_HILO(-1, -1)); \ + } while (0) +#endif + +#ifndef SUB3_LIMB64 +/* A[0..2] = B[0..2] - C[0..2] */ +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) do { \ + mpi_limb64_t __borrow3; \ + SUB2_LIMB64(__borrow3, A0, zero, B0, zero, C0); \ + SUB2_LIMB64(A2, A1, B2, B1, C2, C1); \ + SUB2_LIMB64(A2, A1, A2, A1, zero, LIMB_TO64(-LIMB_FROM64(__borrow3))); \ + } while (0) +#endif + +#ifndef SUB4_LIMB64 +/* A[0..3] = B[0..3] - C[0..3] */ +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow4; \ + SUB3_LIMB64(__borrow4, A1, A0, zero, B1, B0, zero, C1, C0); \ + 
SUB2_LIMB64(A3, A2, B3, B2, C3, C2); \ + SUB2_LIMB64(A3, A2, A3, A2, zero, LIMB_TO64(-LIMB_FROM64(__borrow4))); \ + } while (0) +#endif + +#ifndef SUB5_LIMB64 +/* A[0..4] = B[0..4] - C[0..4] */ +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow5; \ + SUB4_LIMB64(__borrow5, A2, A1, A0, zero, B2, B1, B0, zero, C2, C1, C0); \ + SUB2_LIMB64(A4, A3, B4, B3, C4, C3); \ + SUB2_LIMB64(A4, A3, A4, A3, zero, LIMB_TO64(-LIMB_FROM64(__borrow5))); \ + } while (0) +#endif + +#ifndef SUB7_LIMB64 +/* A[0..6] = B[0..6] - C[0..6] */ +#define SUB7_LIMB64(A6, A5, A4, A3, A2, A1, A0, B6, B5, B4, B3, B2, B1, B0, \ + C6, C5, C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow7; \ + SUB4_LIMB64(__borrow7, A2, A1, A0, zero, B2, B1, B0, \ + zero, C2, C1, C0); \ + SUB5_LIMB64(A6, A5, A4, A3, __borrow7, B6, B5, B4, B3, zero, \ + C6, C5, C4, C3, LIMB_TO64(-LIMB_FROM64(__borrow7))); \ + } while (0) +#endif + + +#if defined(WORDS_BIGENDIAN) || (BITS_PER_MPI_LIMB64 != BITS_PER_MPI_LIMB) +#define LOAD64_UNALIGNED(x, pos) \ + LIMB64_HILO(LOAD32(x, 2 * (pos) + 2), LOAD32(x, 2 * (pos) + 1)) +#else +#define LOAD64_UNALIGNED(x, pos) \ + buf_get_le64((const byte *)(&(x)[pos]) + 4) +#endif + + +/* Helper functions. */ + +static inline int +mpi_nbits_more_than (gcry_mpi_t w, unsigned int nbits) +{ + unsigned int nbits_nlimbs; + mpi_limb_t wlimb; + unsigned int n; + + nbits_nlimbs = (nbits + BITS_PER_MPI_LIMB - 1) / BITS_PER_MPI_LIMB; + + /* Note: Assumes that 'w' is normalized. 
*/ + + if (w->nlimbs > nbits_nlimbs) + return 1; + if (w->nlimbs < nbits_nlimbs) + return 0; + if ((nbits % BITS_PER_MPI_LIMB) == 0) + return 0; + + wlimb = w->d[nbits_nlimbs - 1]; + if (wlimb == 0) + log_bug ("mpi_nbits_more_than: input mpi not normalized\n"); + + count_leading_zeros (n, wlimb); + + return (BITS_PER_MPI_LIMB - n) > (nbits % BITS_PER_MPI_LIMB); +} + +#endif /* GCRY_EC_INLINE_H */ diff --git a/mpi/ec-internal.h b/mpi/ec-internal.h index 759335aa..2296d55d 100644 --- a/mpi/ec-internal.h +++ b/mpi/ec-internal.h @@ -20,6 +20,22 @@ #ifndef GCRY_EC_INTERNAL_H #define GCRY_EC_INTERNAL_H +#include + void _gcry_mpi_ec_ed25519_mod (gcry_mpi_t a); +#ifndef ASM_DISABLED +void _gcry_mpi_ec_nist192_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist224_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist256_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist384_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx); +#else +# define _gcry_mpi_ec_nist192_mod NULL +# define _gcry_mpi_ec_nist224_mod NULL +# define _gcry_mpi_ec_nist256_mod NULL +# define _gcry_mpi_ec_nist384_mod NULL +# define _gcry_mpi_ec_nist521_mod NULL +#endif + #endif /*GCRY_EC_INTERNAL_H*/ diff --git a/mpi/ec-nist.c b/mpi/ec-nist.c new file mode 100644 index 00000000..955d2b7c --- /dev/null +++ b/mpi/ec-nist.c @@ -0,0 +1,795 @@ +/* ec-nist.c - NIST optimized elliptic curve functions + * Copyright (C) 2021 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include +#include +#include +#include + + +#ifndef ASM_DISABLED + + +#include "mpi-internal.h" +#include "longlong.h" +#include "g10lib.h" +#include "context.h" +#include "ec-context.h" +#include "ec-inline.h" + + +/* These variables are used to generate masks from conditional operation + * flag parameters. Use of volatile prevents compiler optimizations from + * converting AND-masking to conditional branches. */ +static volatile mpi_limb_t vzero = 0; +static volatile mpi_limb_t vone = 1; + + +static inline +void prefetch(const void *tab, size_t len) +{ + const volatile byte *vtab = tab; + + if (len > 0 * 64) + (void)vtab[0 * 64]; + if (len > 1 * 64) + (void)vtab[1 * 64]; + if (len > 2 * 64) + (void)vtab[2 * 64]; + if (len > 3 * 64) + (void)vtab[3 * 64]; + if (len > 4 * 64) + (void)vtab[4 * 64]; + if (len > 5 * 64) + (void)vtab[5 * 64]; + if (len > 6 * 64) + (void)vtab[6 * 64]; + if (len > 7 * 64) + (void)vtab[7 * 64]; + if (len > 8 * 64) + (void)vtab[8 * 64]; + if (len > 9 * 64) + (void)vtab[9 * 64]; + if (len > 10 * 64) + (void)vtab[10 * 64]; + (void)vtab[len - 1]; +} + + +/* Fast reduction routines for NIST curves. 
*/ + +void +_gcry_mpi_ec_nist192_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[3][4] = + { + { /* P * 1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xfffffffeU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0xffffffffU, 0xfffffffdU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0xffffffffU, 0xfffffffcU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000002U) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = 192 / BITS_PER_MPI_LIMB64; + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t o[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 192)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.1 Curve P-192". */ + + s[0] = LOAD64(wp, 3); + ADD3_LIMB64 (s[3], s[2], s[1], + zero, zero, LOAD64(wp, 3), + zero, LOAD64(wp, 4), LOAD64(wp, 4)); + + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(wp, 5), LOAD64(wp, 5), LOAD64(wp, 5)); + + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(wp, 2), LOAD64(wp, 1), LOAD64(wp, 0)); + + /* mod p: + * 's[3]' holds carry value (0..2). Subtract (carry + 1) * p. Result will be + * within the range -p...p. Handle result being negative with addition and + * conditional store.
*/ + + carry = LO32_LIMB64(s[3]); + + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + p_mult[carry][3], p_mult[carry][2], + p_mult[carry][1], p_mult[carry][0]); + + ADD4_LIMB64 (o[3], o[2], o[1], o[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + mask1 = vzero - (LO32_LIMB64(o[3]) >> 31); + mask2 = (LO32_LIMB64(o[3]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, o[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, o[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, o[2], mask1, s[2]); + + w->nlimbs = 192 / BITS_PER_MPI_LIMB; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist224_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[5][4] = + { + { /* P * -1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xffffffffU, 0x00000000U) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU) + }, + { /* P * 2 */ + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000001U, 0xffffffffU) + }, + { /* P * 3 */ + LIMB64_C(0x00000000U, 0x00000003U), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000002U, 0xffffffffU) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (224 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; + mpi_limb64_t s[wsize]; + mpi_limb64_t d[wsize]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 224)) + log_bug ("W must be less than m^2\n"); + + 
RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.2 Curve P-224". */ + + /* "S1 + S2" with 64-bit limbs: + * [0:A10]:[ A9: A8]:[ A7:0]:[0:0] + * + [0:0]:[A13:A12]:[A11:0]:[0:0] + * => s[3]:s[2]:s[1]:s[0] + */ + s[0] = zero; + ADD3_LIMB64 (s[3], s[2], s[1], + LIMB64_HILO(0, LOAD32(wp, 10)), + LOAD64(wp, 8 / 2), + LIMB64_HILO(LOAD32(wp, 7), 0), + zero, + LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 11), 0)); + + /* "T + S1 + S2" */ + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + LIMB64_HILO(0, LOAD32(wp, 6)), + LOAD64(wp, 4 / 2), + LOAD64(wp, 2 / 2), + LOAD64(wp, 0 / 2)); + + /* "D1 + D2" with 64-bit limbs: + * [0:A13]:[A12:A11]:[A10: A9]:[ A8: A7] + * + [0:0]:[ 0: 0]:[ 0:A13]:[A12:A11] + * => d[3]:d[2]:d[1]:d[0] + */ + ADD4_LIMB64 (d[3], d[2], d[1], d[0], + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2), + LOAD64_UNALIGNED(wp, 9 / 2), + LOAD64_UNALIGNED(wp, 7 / 2), + zero, + zero, + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2)); + + /* "T + S1 + S2 - D1 - D2" */ + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + d[3], d[2], d[1], d[0]); + + /* mod p: + * Upper 32 bits of 's[3]' hold carry value (-2..2). + * Subtract (carry + 1) * p. Result will be within the range -p...p. + * Handle result being negative with addition and conditional store.
*/ + + carry = HI32_LIMB64(s[3]); + + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + p_mult[carry + 2][3], p_mult[carry + 2][2], + p_mult[carry + 2][1], p_mult[carry + 2][0]); + + ADD4_LIMB64 (d[3], d[2], d[1], d[0], + s[3], s[2], s[1], s[0], + LOAD64(pp, 3), LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (HI32_LIMB64(d[3]) >> 31); + mask2 = (HI32_LIMB64(d[3]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist256_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[11][5] = + { + { /* P * -3 */ + LIMB64_C(0x00000000U, 0x00000003U), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000002U, 0xfffffffcU), + LIMB64_C(0xffffffffU, 0xfffffffdU) + }, + { /* P * -2 */ + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000001U, 0xfffffffdU), + LIMB64_C(0xffffffffU, 0xfffffffeU) + }, + { /* P * -1 */ + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xfffffffeU), + LIMB64_C(0xffffffffU, 0xffffffffU) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xffffffffU, 0x00000001U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0x00000001U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffeU, 
0x00000002U), + LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0x00000002U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffdU, 0x00000003U), + LIMB64_C(0x00000000U, 0x00000002U) + }, + { /* P * 4 */ + LIMB64_C(0xffffffffU, 0xfffffffcU), LIMB64_C(0x00000003U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffcU, 0x00000004U), + LIMB64_C(0x00000000U, 0x00000003U) + }, + { /* P * 5 */ + LIMB64_C(0xffffffffU, 0xfffffffbU), LIMB64_C(0x00000004U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffbU, 0x00000005U), + LIMB64_C(0x00000000U, 0x00000004U) + }, + { /* P * 6 */ + LIMB64_C(0xffffffffU, 0xfffffffaU), LIMB64_C(0x00000005U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffaU, 0x00000006U), + LIMB64_C(0x00000000U, 0x00000005U) + }, + { /* P * 7 */ + LIMB64_C(0xffffffffU, 0xfffffff9U), LIMB64_C(0x00000006U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffff9U, 0x00000007U), + LIMB64_C(0x00000000U, 0x00000006U) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (256 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t t[wsize + 1]; + mpi_limb64_t d[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 256)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.3 Curve P-256". 
*/ + + /* "S1 + S2" with 64-bit limbs: + * [A15:A14]:[A13:A12]:[A11:0]:[0:0] + * + [0:A15]:[A14:A13]:[A12:0]:[0:0] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + s[0] = zero; + ADD4_LIMB64 (s[4], s[3], s[2], s[1], + zero, + LOAD64(wp, 14 / 2), + LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 11), 0), + zero, + LIMB64_HILO(0, LOAD32(wp, 15)), + LOAD64_UNALIGNED(wp, 13 / 2), + LIMB64_HILO(LOAD32(wp, 12), 0)); + + /* "S3 + S4" with 64-bit limbs: + * [A15:A14]:[ 0: 0]:[ 0:A10]:[ A9:A8] + * + [A8:A13]:[A15:A14]:[A13:A11]:[A10:A9] + * => t[4]:t[3]:t[2]:t[1]:t[0] + */ + ADD5_LIMB64 (t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 14 / 2), + zero, + LIMB64_HILO(0, LOAD32(wp, 10)), + LOAD64(wp, 8 / 2), + zero, + LIMB64_HILO(LOAD32(wp, 8), LOAD32(wp, 13)), + LOAD64(wp, 14 / 2), + LIMB64_HILO(LOAD32(wp, 13), LOAD32(wp, 11)), + LOAD64_UNALIGNED(wp, 9 / 2)); + + /* "2*S1 + 2*S2" */ + ADD5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0]); + + /* "T + S3 + S4" */ + ADD5_LIMB64 (t[4], t[3], t[2], t[1], t[0], + t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 6 / 2), + LOAD64(wp, 4 / 2), + LOAD64(wp, 2 / 2), + LOAD64(wp, 0 / 2)); + + /* "2*S1 + 2*S2 - D3" with 64-bit limbs: + * s[4]: s[3]: s[2]: s[1]: s[0] + * - [A12:0]:[A10:A9]:[A8:A15]:[A14:A13] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + zero, + LIMB64_HILO(LOAD32(wp, 12), 0), + LOAD64_UNALIGNED(wp, 9 / 2), + LIMB64_HILO(LOAD32(wp, 8), LOAD32(wp, 15)), + LOAD64_UNALIGNED(wp, 13 / 2)); + + /* "T + 2*S1 + 2*S2 + S3 + S4 - D3" */ + ADD5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + t[4], t[3], t[2], t[1], t[0]); + + /* "D1 + D2" with 64-bit limbs: + * [0:A13]:[A12:A11] + [A15:A14]:[A13:A12] => d[2]:d[1]:d[0] + * [A10:A8] + [A11:A9] => d[4]:d[3] + */ + ADD3_LIMB64 (d[2], d[1], d[0], + zero, + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2), + zero, + LOAD64(wp, 14 / 2), + 
LOAD64(wp, 12 / 2)); + ADD2_LIMB64 (d[4], d[3], + zero, LIMB64_HILO(LOAD32(wp, 10), LOAD32(wp, 8)), + zero, LIMB64_HILO(LOAD32(wp, 11), LOAD32(wp, 9))); + + /* "D1 + D2 + D4" with 64-bit limbs: + * d[4]: d[3]: d[2]: d[1]: d[0] + * - [A13:0]:[A11:A10]:[A9:0]:[A15:A14] + * => d[4]:d[3]:d[2]:d[1]:d[0] + */ + ADD5_LIMB64 (d[4], d[3], d[2], d[1], d[0], + d[4], d[3], d[2], d[1], d[0], + zero, + LIMB64_HILO(LOAD32(wp, 13), 0), + LOAD64(wp, 10 / 2), + LIMB64_HILO(LOAD32(wp, 9), 0), + LOAD64(wp, 14 / 2)); + + /* "T + 2*S1 + 2*S2 + S3 + S4 - D1 - D2 - D3 - D4" */ + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + d[4], d[3], d[2], d[1], d[0]); + + /* mod p: + * 's[4]' holds carry value (-4..6). Subtract (carry + 1) * p. Result + * will be within the range -p...p. Handle result being negative with + * addition and conditional store. */ + + carry = LO32_LIMB64(s[4]); + + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + p_mult[carry + 4][4], p_mult[carry + 4][3], + p_mult[carry + 4][2], p_mult[carry + 4][1], + p_mult[carry + 4][0]); + + ADD5_LIMB64 (d[4], d[3], d[2], d[1], d[0], + s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(pp, 3), LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (LO32_LIMB64(d[4]) >> 31); + mask2 = (LO32_LIMB64(d[4]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist384_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[11][7] = + { + { /* P * -2 */ + LIMB64_C(0xfffffffeU, 0x00000002U), LIMB64_C(0x00000001U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffeU) +
}, + { /* P * -1 */ + LIMB64_C(0xffffffffU, 0x00000001U), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0x00000000U, 0xffffffffU), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0x00000001U, 0xfffffffeU), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0x00000002U, 0xfffffffdU), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffcU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000002U) + }, + { /* P * 4 */ + LIMB64_C(0x00000003U, 0xfffffffcU), LIMB64_C(0xfffffffcU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffbU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000003U) + }, + { /* P * 5 */ + LIMB64_C(0x00000004U, 0xfffffffbU), LIMB64_C(0xfffffffbU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffaU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000004U) + }, + { /* P * 6 */ + 
LIMB64_C(0x00000005U, 0xfffffffaU), LIMB64_C(0xfffffffaU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff9U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000005U) + }, + { /* P * 7 */ + LIMB64_C(0x00000006U, 0xfffffff9U), LIMB64_C(0xfffffff9U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff8U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000006U) + }, + { /* P * 8 */ + LIMB64_C(0x00000007U, 0xfffffff8U), LIMB64_C(0xfffffff8U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff7U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000007U) + }, + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (384 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; +#if (BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB) && defined(WORDS_BIGENDIAN) + mpi_limb_t wp_shr32[wsize * LIMBS_PER_LIMB64]; +#endif + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t t[wsize + 1]; + mpi_limb64_t d[wsize + 1]; + mpi_limb64_t x[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 384)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.4 Curve P-384". 
*/ + +#if BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB +# ifdef WORDS_BIGENDIAN +# define LOAD64_SHR32(idx) LOAD64(wp_shr32, ((idx) / 2 - wsize)) + _gcry_mpih_rshift (wp_shr32, wp + 384 / BITS_PER_MPI_LIMB, + wsize * LIMBS_PER_LIMB64, 32); +# else +# define LOAD64_SHR32(idx) LOAD64_UNALIGNED(wp, idx / 2) +#endif +#else +# define LOAD64_SHR32(idx) LIMB64_HILO(LOAD32(wp, (idx) + 1), LOAD32(wp, idx)) +#endif + + /* "S1 + S1" with 64-bit limbs: + * [0:A23]:[A22:A21] + * + [0:A23]:[A22:A21] + * => s[3]:s[2] + */ + ADD2_LIMB64 (s[3], s[2], + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21), + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21)); + + /* "S5 + S6" with 64-bit limbs: + * [A23:A22]:[A21:A20]:[ 0:0]:[0: 0] + * + [ 0: 0]:[A23:A22]:[A21:0]:[0:A20] + * => x[4]:x[3]:x[2]:x[1]:x[0] + */ + x[0] = LIMB64_HILO(0, LOAD32(wp, 20)); + x[1] = LIMB64_HILO(LOAD32(wp, 21), 0); + ADD3_LIMB64 (x[4], x[3], x[2], + zero, LOAD64(wp, 22 / 2), LOAD64(wp, 20 / 2), + zero, zero, LOAD64(wp, 22 / 2)); + + /* "D2 + D3" with 64-bit limbs: + * [0:A23]:[A22:A21]:[A20:0] + * + [0:A23]:[A23:0]:[0:0] + * => d[2]:d[1]:d[0] + */ + d[0] = LIMB64_HILO(LOAD32(wp, 20), 0); + ADD2_LIMB64 (d[2], d[1], + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21), + LIMB64_HILO(0, LOAD32(wp, 23)), + LIMB64_HILO(LOAD32(wp, 23), 0)); + + /* "2*S1 + S5 + S6" with 64-bit limbs: + * s[4]:s[3]:s[2]:s[1]:s[0] + * + x[4]:x[3]:x[2]:x[1]:x[0] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + s[0] = x[0]; + s[1] = x[1]; + ADD3_LIMB64(s[4], s[3], s[2], + zero, s[3], s[2], + x[4], x[3], x[2]); + + /* "T + S2" with 64-bit limbs: + * [A11:A10]:[ A9: A8]:[ A7: A6]:[ A5: A4]:[ A3: A2]:[ A1: A0] + * + [A23:A22]:[A21:A20]:[A19:A18]:[A17:A16]:[A15:A14]:[A13:A12] + * => t[6]:t[5]:t[4]:t[3]:t[2]:t[1]:t[0] + */ + ADD7_LIMB64 (t[6], t[5], t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 10 / 2), LOAD64(wp, 8 / 2), LOAD64(wp, 6 / 2), + LOAD64(wp, 4 / 2), LOAD64(wp, 2 / 2), LOAD64(wp, 0 / 2), + zero, + LOAD64(wp, 22 / 2), LOAD64(wp, 20 / 2), 
LOAD64(wp, 18 / 2), + LOAD64(wp, 16 / 2), LOAD64(wp, 14 / 2), LOAD64(wp, 12 / 2)); + + /* "2*S1 + S4 + S5 + S6" with 64-bit limbs: + * s[6]: s[5]: s[4]: s[3]: s[2]: s[1]: s[0] + * + [A19:A18]:[A17:A16]:[A15:A14]:[A13:A12]:[A20:0]:[A23:0] + * => s[6]:s[5]:s[4]:s[3]:s[2]:s[1]:s[0] + */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, zero, s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(wp, 18 / 2), LOAD64(wp, 16 / 2), + LOAD64(wp, 14 / 2), LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 20), 0), + LIMB64_HILO(LOAD32(wp, 23), 0)); + + /* "D1 + D2 + D3" with 64-bit limbs: + * d[6]: d[5]: d[4]: d[3]: d[2]: d[1]: d[0] + * + [A22:A21]:[A20:A19]:[A18:A17]:[A16:A15]:[A14:A13]:[A12:A23] + * => d[6]:d[5]:d[4]:d[3]:d[2]:d[1]:d[0] + */ + ADD7_LIMB64 (d[6], d[5], d[4], d[3], d[2], d[1], d[0], + zero, zero, zero, zero, d[2], d[1], d[0], + zero, + LOAD64_SHR32(21), + LOAD64_SHR32(19), + LOAD64_SHR32(17), + LOAD64_SHR32(15), + LOAD64_SHR32(13), + LIMB64_HILO(LOAD32(wp, 12), LOAD32(wp, 23))); + + /* "2*S1 + S3 + S4 + S5 + S6" with 64-bit limbs: + * s[6]: s[5]: s[4]: s[3]: s[2]: s[1]: s[0] + * + [A20:A19]:[A18:A17]:[A16:A15]:[A14:A13]:[A12:A23]:[A22:A21] + * => s[6]:s[5]:s[4]:s[3]:s[2]:s[1]:s[0] + */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64_SHR32(19), + LOAD64_SHR32(17), + LOAD64_SHR32(15), + LOAD64_SHR32(13), + LIMB64_HILO(LOAD32(wp, 12), LOAD32(wp, 23)), + LOAD64_SHR32(21)); + + /* "T + 2*S1 + S2 + S3 + S4 + S5 + S6" */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + t[6], t[5], t[4], t[3], t[2], t[1], t[0]); + + /* "T + 2*S1 + S2 + S3 + S4 + S5 + S6 - D1 - D2 - D3" */ + SUB7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + d[6], d[5], d[4], d[3], d[2], d[1], d[0]); + +#undef LOAD64_SHR32 + + /* mod p: + * 's[6]' holds carry value (-3..7). Subtract (carry + 1) * p. 
Result + * will be within the range -p...p. Handle result being negative with + * addition and conditional store. */ + + carry = LO32_LIMB64(s[6]); + + SUB7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + p_mult[carry + 3][6], p_mult[carry + 3][5], + p_mult[carry + 3][4], p_mult[carry + 3][3], + p_mult[carry + 3][2], p_mult[carry + 3][1], + p_mult[carry + 3][0]); + + ADD7_LIMB64 (d[6], d[5], d[4], d[3], d[2], d[1], d[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(pp, 5), LOAD64(pp, 4), + LOAD64(pp, 3), LOAD64(pp, 2), + LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (LO32_LIMB64(d[6]) >> 31); + mask2 = (LO32_LIMB64(d[6]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + STORE64_COND(wp, 4, mask2, d[4], mask1, s[4]); + STORE64_COND(wp, 5, mask2, d[5], mask1, s[5]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); + +#if (BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB) && defined(WORDS_BIGENDIAN) + wipememory(wp_shr32, sizeof(wp_shr32)); +#endif +} + +void +_gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + mpi_size_t wsize = (521 + BITS_PER_MPI_LIMB - 1) / BITS_PER_MPI_LIMB; + mpi_limb_t s[wsize]; + mpi_limb_t cy; + mpi_ptr_t wp; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 521)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2); + + wp = w->d; + + /* See "FIPS 186-4, D.2.5 Curve P-521".
*/ + + _gcry_mpih_rshift (s, wp + wsize - 1, wsize, 521 % BITS_PER_MPI_LIMB); + s[wsize - 1] &= (1 << (521 % BITS_PER_MPI_LIMB)) - 1; + wp[wsize - 1] &= (1 << (521 % BITS_PER_MPI_LIMB)) - 1; + _gcry_mpih_add_n (wp, wp, s, wsize); + + /* "mod p" */ + cy = _gcry_mpih_sub_n (wp, wp, ctx->p->d, wsize); + _gcry_mpih_add_n (s, wp, ctx->p->d, wsize); + mpih_set_cond (wp, s, wsize, (cy != 0UL)); + + w->nlimbs = wsize; + MPN_NORMALIZE (wp, w->nlimbs); +} + +#endif /* !ASM_DISABLED */ diff --git a/mpi/ec.c b/mpi/ec.c index 4fabf9b4..ae1d036a 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -285,7 +285,7 @@ static void ec_addm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) { mpi_add (w, u, v); - ec_mod (w, ctx); + ctx->mod (w, ctx); } static void @@ -294,14 +294,14 @@ ec_subm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ec) mpi_sub (w, u, v); while (w->sign) mpi_add (w, w, ec->p); - /*ec_mod (w, ec);*/ + /*ctx->mod (w, ec);*/ } static void ec_mulm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) { mpi_mul (w, u, v); - ec_mod (w, ctx); + ctx->mod (w, ctx); } /* W = 2 * U mod P. 
*/ @@ -309,7 +309,7 @@ static void ec_mul2 (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx) { mpi_lshift (w, u, 1); - ec_mod (w, ctx); + ctx->mod (w, ctx); } static void @@ -585,6 +585,7 @@ struct field_table { void (* mulm) (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx); void (* mul2) (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx); void (* pow2) (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx); + void (* mod) (gcry_mpi_t w, mpi_ec_t ctx); }; static const struct field_table field_table[] = { @@ -594,7 +595,8 @@ static const struct field_table field_table[] = { ec_subm_25519, ec_mulm_25519, ec_mul2_25519, - ec_pow2_25519 + ec_pow2_25519, + NULL }, { "0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE" @@ -603,7 +605,55 @@ static const struct field_table field_table[] = { ec_subm_448, ec_mulm_448, ec_mul2_448, - ec_pow2_448 + ec_pow2_448, + NULL + }, + { + "0xfffffffffffffffffffffffffffffffeffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist192_mod + }, + { + "0xffffffffffffffffffffffffffffffff000000000000000000000001", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist224_mod + }, + { + "0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist256_mod + }, + { + "0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe" + "ffffffff0000000000000000ffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist384_mod + }, + { + "0x01ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff" + "ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist521_mod }, { NULL, NULL, NULL, NULL, NULL, NULL }, }; @@ -757,6 +807,7 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, ctx->mulm = ec_mulm; ctx->mul2 = ec_mul2; ctx->pow2 = ec_pow2; + ctx->mod = ec_mod; for (i=0; field_table[i].p; i++) { @@ -769,18 +820,25 @@ ec_p_init (mpi_ec_t ctx, enum 
gcry_mpi_ec_models model, if (!mpi_cmp (p, f_p)) { - ctx->addm = field_table[i].addm; - ctx->subm = field_table[i].subm; - ctx->mulm = field_table[i].mulm; - ctx->mul2 = field_table[i].mul2; - ctx->pow2 = field_table[i].pow2; + ctx->addm = field_table[i].addm ? field_table[i].addm : ctx->addm; + ctx->subm = field_table[i].subm ? field_table[i].subm : ctx->subm; + ctx->mulm = field_table[i].mulm ? field_table[i].mulm : ctx->mulm; + ctx->mul2 = field_table[i].mul2 ? field_table[i].mul2 : ctx->mul2; + ctx->pow2 = field_table[i].pow2 ? field_table[i].pow2 : ctx->pow2; + ctx->mod = field_table[i].mod ? field_table[i].mod : ctx->mod; _gcry_mpi_release (f_p); - mpi_resize (ctx->a, ctx->p->nlimbs); - ctx->a->nlimbs = ctx->p->nlimbs; - - mpi_resize (ctx->b, ctx->p->nlimbs); - ctx->b->nlimbs = ctx->p->nlimbs; + if (ctx->a) + { + mpi_resize (ctx->a, ctx->p->nlimbs); + ctx->a->nlimbs = ctx->p->nlimbs; + } + + if (ctx->b) + { + mpi_resize (ctx->b, ctx->p->nlimbs); + ctx->b->nlimbs = ctx->p->nlimbs; + } for (i=0; i< DIM(ctx->t.scratch) && ctx->t.scratch[i]; i++) ctx->t.scratch[i]->nlimbs = ctx->p->nlimbs; diff --git a/mpi/mpi-internal.h b/mpi/mpi-internal.h index 8ccdeada..11fcbde4 100644 --- a/mpi/mpi-internal.h +++ b/mpi/mpi-internal.h @@ -79,6 +79,11 @@ typedef int mpi_size_t; /* (must be a signed type) */ if( (a)->alloced < (b) ) \ mpi_resize((a), (b)); \ } while(0) +#define RESIZE_AND_CLEAR_IF_NEEDED(a,b) \ + do { \ + if( (a)->nlimbs < (b) ) \ + mpi_resize((a), (b)); \ + } while(0) /* Copy N limbs from S to D. 
*/ #define MPN_COPY( d, s, n) \ diff --git a/mpi/mpiutil.c b/mpi/mpiutil.c index 5320f4d8..a5583c79 100644 --- a/mpi/mpiutil.c +++ b/mpi/mpiutil.c @@ -197,7 +197,7 @@ _gcry_mpi_resize (gcry_mpi_t a, unsigned nlimbs) if (a->d) { a->d = xrealloc (a->d, nlimbs * sizeof (mpi_limb_t)); - for (i=a->alloced; i < nlimbs; i++) + for (i=a->nlimbs; i < nlimbs; i++) a->d[i] = 0; } else diff --git a/src/ec-context.h b/src/ec-context.h index d1c64804..479862f6 100644 --- a/src/ec-context.h +++ b/src/ec-context.h @@ -74,6 +74,7 @@ struct mpi_ec_ctx_s void (* mulm) (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx); void (* pow2) (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx); void (* mul2) (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx); + void (* mod) (gcry_mpi_t w, mpi_ec_t ctx); }; -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:11 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:11 +0300 Subject: [PATCH 1/4] mpi_ec_get_affine: fast path for Z==1 case Message-ID: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> * mpi/ec.c (_gcry_mpi_ec_get_affine): Return X and Y as is if Z is 1 (for Weierstrass and Edwards curves). 
-- Signed-off-by: Jussi Kivilinna --- mpi/ec.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/mpi/ec.c b/mpi/ec.c index 29b2ce30..e25d9d8a 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -1117,6 +1117,15 @@ _gcry_mpi_ec_get_affine (gcry_mpi_t x, gcry_mpi_t y, mpi_point_t point, { gcry_mpi_t z1, z2, z3; + if (!mpi_cmp_ui (point->z, 1)) + { + if (x) + mpi_set (x, point->x); + if (y) + mpi_set (y, point->y); + return 0; + } + z1 = mpi_new (0); z2 = mpi_new (0); ec_invm (z1, point->z, ctx); /* z1 = z^(-1) mod p */ @@ -1156,6 +1165,15 @@ _gcry_mpi_ec_get_affine (gcry_mpi_t x, gcry_mpi_t y, mpi_point_t point, { gcry_mpi_t z; + if (!mpi_cmp_ui (point->z, 1)) + { + if (x) + mpi_set (x, point->x); + if (y) + mpi_set (y, point->y); + return 0; + } + z = mpi_new (0); ec_invm (z, point->z, ctx); -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:14 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:14 +0300 Subject: [PATCH 4/4] bench-slope: add X25519 and X448 scalar multiplication In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-4-jussi.kivilinna@iki.fi> * tests/bench-slope.c (ECC_ALGO_X25519, ECC_ALGO_X448): New. (ecc_algo_name, ecc_algo_curve, ecc_nbits): Add X25519 and X448. (bench_ecc_mult_do_bench): Pass Y as NULL to ec_get_affine with X25519 and X448. (cipher_ecc_one): Run only multiplication bench for X25519 and X448. 
-- Signed-off-by: Jussi Kivilinna --- tests/bench-slope.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/tests/bench-slope.c b/tests/bench-slope.c index 9b4a139a..35272094 100644 --- a/tests/bench-slope.c +++ b/tests/bench-slope.c @@ -2144,6 +2144,8 @@ enum bench_ecc_algo { ECC_ALGO_ED25519 = 0, ECC_ALGO_ED448, + ECC_ALGO_X25519, + ECC_ALGO_X448, ECC_ALGO_NIST_P192, ECC_ALGO_NIST_P224, ECC_ALGO_NIST_P256, @@ -2197,6 +2199,10 @@ ecc_algo_name (int algo) return "Ed25519"; case ECC_ALGO_ED448: return "Ed448"; + case ECC_ALGO_X25519: + return "X25519"; + case ECC_ALGO_X448: + return "X448"; case ECC_ALGO_NIST_P192: return "NIST-P192"; case ECC_ALGO_NIST_P224: @@ -2223,6 +2229,10 @@ ecc_algo_curve (int algo) return "Ed25519"; case ECC_ALGO_ED448: return "Ed448"; + case ECC_ALGO_X25519: + return "Curve25519"; + case ECC_ALGO_X448: + return "X448"; case ECC_ALGO_NIST_P192: return "NIST P-192"; case ECC_ALGO_NIST_P224: @@ -2249,6 +2259,10 @@ ecc_nbits (int algo) return 255; case ECC_ALGO_ED448: return 448; + case ECC_ALGO_X25519: + return 255; + case ECC_ALGO_X448: + return 448; case ECC_ALGO_NIST_P192: return 192; case ECC_ALGO_NIST_P224: @@ -2355,15 +2369,26 @@ bench_ecc_mult_free (struct bench_obj *obj) static void bench_ecc_mult_do_bench (struct bench_obj *obj, void *buf, size_t num_iter) { + struct bench_ecc_oper *oper = obj->priv; struct bench_ecc_mult_hd *hd = obj->hd; + gcry_mpi_t y; size_t i; (void)buf; + if (oper->algo == ECC_ALGO_X25519 || oper->algo == ECC_ALGO_X448) + { + y = NULL; + } + else + { + y = hd->y; + } + for (i = 0; i < num_iter; i++) { gcry_mpi_ec_mul (hd->Q, hd->k, hd->G, hd->ec); - if (gcry_mpi_ec_get_affine (hd->x, hd->y, hd->Q, hd->ec)) + if (gcry_mpi_ec_get_affine (hd->x, y, hd->Q, hd->ec)) { fprintf (stderr, PGM ": gcry_mpi_ec_get_affine failed\n"); exit (1); @@ -2634,7 +2659,8 @@ cipher_ecc_one (enum bench_ecc_algo algo, struct bench_ecc_oper *poper) struct bench_obj obj = { 0 }; double 
result; - if (algo == ECC_ALGO_SECP256K1 && oper.oper != ECC_OPER_MULT) + if ((algo == ECC_ALGO_X25519 || algo == ECC_ALGO_X448 || + algo == ECC_ALGO_SECP256K1) && oper.oper != ECC_OPER_MULT) return; oper.algo = algo; -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:12 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:12 +0300 Subject: [PATCH 2/4] mpi/ec: cache converted field_table MPIs In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-2-jussi.kivilinna@iki.fi> * mpi/ec.c (field_table_mpis): New. (ec_p_init): Cache converted field table MPIs. -- Signed-off-by: Jussi Kivilinna --- mpi/ec.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/mpi/ec.c b/mpi/ec.c index e25d9d8a..029099b4 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -719,7 +719,10 @@ static const struct field_table field_table[] = { }, { NULL, NULL, NULL, NULL, NULL, NULL }, }; + +static gcry_mpi_t field_table_mpis[DIM(field_table)]; + /* Force recomputation of all helper variables. */ void _gcry_mpi_ec_get_reset (mpi_ec_t ec) @@ -876,9 +879,19 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, gcry_mpi_t f_p; gpg_err_code_t rc; - rc = _gcry_mpi_scan (&f_p, GCRYMPI_FMT_HEX, field_table[i].p, 0, NULL); - if (rc) - log_fatal ("scanning ECC parameter failed: %s\n", gpg_strerror (rc)); + if (field_table_mpis[i] == NULL) + { + rc = _gcry_mpi_scan (&f_p, GCRYMPI_FMT_HEX, field_table[i].p, 0, + NULL); + if (rc) + log_fatal ("scanning ECC parameter failed: %s\n", + gpg_strerror (rc)); + field_table_mpis[i] = f_p; /* cache */ + } + else + { + f_p = field_table_mpis[i]; + } if (!mpi_cmp (p, f_p)) { @@ -888,7 +901,6 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, ctx->mul2 = field_table[i].mul2 ? field_table[i].mul2 : ctx->mul2; ctx->pow2 = field_table[i].pow2 ? 
field_table[i].pow2 : ctx->pow2; ctx->mod = field_table[i].mod ? field_table[i].mod : ctx->mod; - _gcry_mpi_release (f_p); if (ctx->a) { @@ -907,8 +919,6 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, break; } - - _gcry_mpi_release (f_p); } /* Prepare for fast reduction. */ -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:13 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:13 +0300 Subject: [PATCH 3/4] mpi: optimizations for MPI scanning and printing In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-3-jussi.kivilinna@iki.fi> * mpi/mpicoder.c (mpi_read_from_buffer): Add word-size buffer reading loop using 'buf_get_be(32|64)'. (mpi_fromstr): Use look-up tables for HEX conversion; Add fast-path loop for converting 8 hex-characters at once; Add string length parameter. (do_get_buffer): Use 'buf_put_be(32|64)' instead of byte writes; Add fast-path for reversing buffer with 'buf_get_(be64|be32|le64|le32)'. (_gcry_mpi_set_buffer): Use 'buf_get_be(32|64)' instead of byte reads. (twocompl): Use _gcry_ctz instead of open-coded if-clauses to get first bit set; Add fast-path for inverting buffer with 'buf_get_(he64|he32)'. (_gcry_mpi_scan): Use 'buf_get_be32' where possible; Provide string length to 'mpi_fromstr'. (_gcry_mpi_print): Use 'buf_put_be32' where possible; Use look-up table for HEX conversion; Add fast-path loop for converting to 8 hex-characters at once. * tests/t-convert.c (check_formats): Add new tests for larger values. 
-- Signed-off-by: Jussi Kivilinna --- mpi/mpicoder.c | 310 +++++++++++++++++--------- tests/t-convert.c | 538 ++++++++++++++++++++++++++++++---------------- 2 files changed, 561 insertions(+), 287 deletions(-) diff --git a/mpi/mpicoder.c b/mpi/mpicoder.c index f61f777f..830ee4e2 100644 --- a/mpi/mpicoder.c +++ b/mpi/mpicoder.c @@ -26,6 +26,7 @@ #include "mpi-internal.h" #include "g10lib.h" +#include "../cipher/bufhelp.h" /* The maximum length we support in the functions converting an * external representation to an MPI. This limit is used to catch @@ -51,8 +52,9 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, unsigned int nbits, nbytes, nlimbs, nread=0; mpi_limb_t a; gcry_mpi_t val = MPI_NULL; + unsigned int max_nread = *ret_nread; - if ( *ret_nread < 2 ) + if ( max_nread < 2 ) goto leave; nbits = buffer[0] << 8 | buffer[1]; if ( nbits > MAX_EXTERN_MPI_BITS ) @@ -73,9 +75,22 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, for ( ; j > 0; j-- ) { a = 0; + if (i == 0 && nread + BYTES_PER_MPI_LIMB <= max_nread) + { +#if BYTES_PER_MPI_LIMB == 4 + a = buf_get_be32 (buffer); +#elif BYTES_PER_MPI_LIMB == 8 + a = buf_get_be64 (buffer); +#else +# error please implement for this limb size. +#endif + buffer += BYTES_PER_MPI_LIMB; + nread += BYTES_PER_MPI_LIMB; + i += BYTES_PER_MPI_LIMB; + } for (; i < BYTES_PER_MPI_LIMB; i++ ) { - if ( ++nread > *ret_nread ) + if ( ++nread > max_nread ) { /* log_debug ("mpi larger than buffer"); */ mpi_free (val); @@ -99,8 +114,45 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, * Fill the mpi VAL from the hex string in STR. 
*/ static int -mpi_fromstr (gcry_mpi_t val, const char *str) +mpi_fromstr (gcry_mpi_t val, const char *str, size_t slen) { + static const int hex2int[2][256] = + { + { + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x00, 0x10, 0x20, 0x30, + 0x40, 0x50, 0x60, 0x70, 0x80, 0x90, -1, -1, -1, -1, -1, -1, -1, 0xa0, + 0xb0, 0xc0, 0xd0, 0xe0, 0xf0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xa0, + 0xb0, 0xc0, 0xd0, 0xe0, 0xf0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 + }, + { + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x00, 0x01, 0x02, 0x03, + 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, -1, -1, -1, -1, -1, -1, -1, 0x0a, + 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x0a, + 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
-1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 + } + }; int sign = 0; int prepend_zero = 0; int i, j, c, c1, c2; @@ -111,19 +163,17 @@ mpi_fromstr (gcry_mpi_t val, const char *str) { sign = 1; str++; + slen--; } /* Skip optional hex prefix. */ if ( *str == '0' && str[1] == 'x' ) - str += 2; - - nbits = strlen (str); - if (nbits > MAX_EXTERN_SCAN_BYTES) { - mpi_clear (val); - return 1; /* Error. */ + str += 2; + slen -= 2; } - nbits *= 4; + + nbits = slen * 4; if ((nbits % 8)) prepend_zero = 1; @@ -140,6 +190,44 @@ mpi_fromstr (gcry_mpi_t val, const char *str) for (; j > 0; j--) { a = 0; + + if (prepend_zero == 0 && (i & 31) == 0) + { + while (slen >= sizeof(u32) * 2) + { + u32 n, m; + u32 x, y; + + x = buf_get_le32(str); + y = buf_get_le32(str + 4); + str += 8; + slen -= 8; + + a <<= 31; /* Two step to avoid compiler warning on 32-bit. */ + a <<= 1; + + n = (hex2int[0][(x >> 0) & 0xff] + | hex2int[1][(x >> 8) & 0xff]) << 8; + m = (hex2int[0][(y >> 0) & 0xff] + | hex2int[1][(y >> 8) & 0xff]) << 8; + n |= hex2int[0][(x >> 16) & 0xff]; + n |= hex2int[1][(x >> 24) & 0xff]; + m |= hex2int[0][(y >> 16) & 0xff]; + m |= hex2int[1][(y >> 24) & 0xff]; + + a |= (n << 16) | m; + i += 32; + if ((int)(n | m) < 0) + { + /* Invalid character. */ + mpi_clear (val); + return 1; /* Error. */ + } + if (i == BITS_PER_MPI_LIMB) + break; + } + } + for (; i < BYTES_PER_MPI_LIMB; i++) { if (prepend_zero) @@ -148,7 +236,10 @@ mpi_fromstr (gcry_mpi_t val, const char *str) prepend_zero = 0; } else - c1 = *str++; + { + c1 = *str++; + slen--; + } if (!c1) { @@ -156,30 +247,15 @@ mpi_fromstr (gcry_mpi_t val, const char *str) return 1; /* Error. 
*/ } c2 = *str++; + slen--; if (!c2) { mpi_clear (val); return 1; /* Error. */ } - if ( c1 >= '0' && c1 <= '9' ) - c = c1 - '0'; - else if ( c1 >= 'a' && c1 <= 'f' ) - c = c1 - 'a' + 10; - else if ( c1 >= 'A' && c1 <= 'F' ) - c = c1 - 'A' + 10; - else - { - mpi_clear (val); - return 1; /* Error. */ - } - c <<= 4; - if ( c2 >= '0' && c2 <= '9' ) - c |= c2 - '0'; - else if( c2 >= 'a' && c2 <= 'f' ) - c |= c2 - 'a' + 10; - else if( c2 >= 'A' && c2 <= 'F' ) - c |= c2 - 'A' + 10; - else + c = hex2int[0][c1 & 0xff]; + c |= hex2int[1][c2 & 0xff]; + if (c < 0) { mpi_clear(val); return 1; /* Error. */ @@ -248,19 +324,11 @@ do_get_buffer (gcry_mpi_t a, unsigned int fill_le, int extraalloc, { alimb = a->d[i]; #if BYTES_PER_MPI_LIMB == 4 - *p++ = alimb >> 24; - *p++ = alimb >> 16; - *p++ = alimb >> 8; - *p++ = alimb ; + buf_put_be32 (p, alimb); + p += 4; #elif BYTES_PER_MPI_LIMB == 8 - *p++ = alimb >> 56; - *p++ = alimb >> 48; - *p++ = alimb >> 40; - *p++ = alimb >> 32; - *p++ = alimb >> 24; - *p++ = alimb >> 16; - *p++ = alimb >> 8; - *p++ = alimb ; + buf_put_be64 (p, alimb); + p += 8; #else # error please implement for this limb size. #endif @@ -270,7 +338,22 @@ do_get_buffer (gcry_mpi_t a, unsigned int fill_le, int extraalloc, { length = *nbytes; /* Reverse buffer and pad with zeroes. 
*/ - for (i=0; i < length/2; i++) + for (i = 0; i + 8 < length / 2; i += 8) + { + u64 head = buf_get_be64 (buffer + i); + u64 tail = buf_get_be64 (buffer + length - 8 - i); + buf_put_le64 (buffer + length - 8 - i, head); + buf_put_le64 (buffer + i, tail); + } + if (i + 4 < length / 2) + { + u32 head = buf_get_be32 (buffer + i); + u32 tail = buf_get_be32 (buffer + length - 4 - i); + buf_put_le32 (buffer + length - 4 - i, head); + buf_put_le32 (buffer + i, tail); + i += 4; + } + for (; i < length/2; i++) { tmp = buffer[i]; buffer[i] = buffer[length-1-i]; @@ -354,53 +437,33 @@ _gcry_mpi_set_buffer (gcry_mpi_t a, const void *buffer_arg, for (i=0, p = buffer+nbytes-1; p >= buffer+BYTES_PER_MPI_LIMB; ) { #if BYTES_PER_MPI_LIMB == 4 - alimb = (mpi_limb_t)*p-- ; - alimb |= (mpi_limb_t)*p-- << 8 ; - alimb |= (mpi_limb_t)*p-- << 16 ; - alimb |= (mpi_limb_t)*p-- << 24 ; + alimb = buf_get_be32(p - 4 + 1); + p -= 4; #elif BYTES_PER_MPI_LIMB == 8 - alimb = (mpi_limb_t)*p-- ; - alimb |= (mpi_limb_t)*p-- << 8 ; - alimb |= (mpi_limb_t)*p-- << 16 ; - alimb |= (mpi_limb_t)*p-- << 24 ; - alimb |= (mpi_limb_t)*p-- << 32 ; - alimb |= (mpi_limb_t)*p-- << 40 ; - alimb |= (mpi_limb_t)*p-- << 48 ; - alimb |= (mpi_limb_t)*p-- << 56 ; + alimb = buf_get_be64(p - 8 + 1); + p -= 8; #else -# error please implement for this limb size. +# error please implement for this limb size. #endif a->d[i++] = alimb; } if ( p >= buffer ) { + byte last[BYTES_PER_MPI_LIMB] = { 0 }; + unsigned int n = (p - buffer) + 1; + + n = n > BYTES_PER_MPI_LIMB ? 
BYTES_PER_MPI_LIMB : n; + memcpy (last + BYTES_PER_MPI_LIMB - n, p - n + 1, n); + p -= n; + #if BYTES_PER_MPI_LIMB == 4 - alimb = (mpi_limb_t)*p--; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 8; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 16; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 24; + alimb = buf_get_be32(last); #elif BYTES_PER_MPI_LIMB == 8 - alimb = (mpi_limb_t)*p--; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 8; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 16; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 24; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 32; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 40; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 48; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 56; + alimb = buf_get_be64(last); #else # error please implement for this limb size. #endif + a->d[i++] = alimb; } a->nlimbs = i; @@ -446,25 +509,24 @@ twocompl (unsigned char *p, unsigned int n) ; if (i >= 0) { - if ((p[i] & 0x01)) - p[i] = (((p[i] ^ 0xfe) | 0x01) & 0xff); - else if ((p[i] & 0x02)) - p[i] = (((p[i] ^ 0xfc) | 0x02) & 0xfe); - else if ((p[i] & 0x04)) - p[i] = (((p[i] ^ 0xf8) | 0x04) & 0xfc); - else if ((p[i] & 0x08)) - p[i] = (((p[i] ^ 0xf0) | 0x08) & 0xf8); - else if ((p[i] & 0x10)) - p[i] = (((p[i] ^ 0xe0) | 0x10) & 0xf0); - else if ((p[i] & 0x20)) - p[i] = (((p[i] ^ 0xc0) | 0x20) & 0xe0); - else if ((p[i] & 0x40)) - p[i] = (((p[i] ^ 0x80) | 0x40) & 0xc0); - else - p[i] = 0x80; + unsigned char pi = p[i]; + unsigned int ntz = _gcry_ctz (pi); + + p[i] = ((p[i] ^ (0xfe << ntz)) | (0x01 << ntz)) & (0xff << ntz); - for (i--; i >= 0; i--) - p[i] ^= 0xff; + for (i--; i >= 7; i -= 8) + { + buf_put_he64(&p[i-7], ~buf_get_he64(&p[i-7])); + } + if (i >= 3) + { + buf_put_he32(&p[i-3], ~buf_get_he32(&p[i-3])); + i -= 4; + } + for (; i >= 0; i--) + { + p[i] ^= 0xff; + } } } @@ -571,7 +633,7 @@ _gcry_mpi_scan (struct gcry_mpi **ret_mpi, enum gcry_mpi_format format, if (len && len < 4) return 
GPG_ERR_TOO_SHORT; - n = (s[0] << 24 | s[1] << 16 | s[2] << 8 | s[3]); + n = buf_get_be32 (s); s += 4; if (len) len -= 4; @@ -605,12 +667,19 @@ _gcry_mpi_scan (struct gcry_mpi **ret_mpi, enum gcry_mpi_format format, } else if (format == GCRYMPI_FMT_HEX) { + size_t slen; /* We can only handle C strings for now. */ if (buflen) return GPG_ERR_INV_ARG; - a = secure? mpi_alloc_secure (0) : mpi_alloc(0); - if (mpi_fromstr (a, (const char *)buffer)) + slen = strlen ((const char *)buffer); + if (slen > MAX_EXTERN_SCAN_BYTES) + return GPG_ERR_INV_OBJ; + a = secure? mpi_alloc_secure ((((slen+1)/2)+BYTES_PER_MPI_LIMB-1) + /BYTES_PER_MPI_LIMB) + : mpi_alloc((((slen+1)/2)+BYTES_PER_MPI_LIMB-1) + /BYTES_PER_MPI_LIMB); + if (mpi_fromstr (a, (const char *)buffer, slen)) { mpi_free (a); return GPG_ERR_INV_OBJ; @@ -798,10 +867,8 @@ _gcry_mpi_print (enum gcry_mpi_format format, { unsigned char *s = buffer; - *s++ = n >> 24; - *s++ = n >> 16; - *s++ = n >> 8; - *s++ = n; + buf_put_be32 (s, n); + s += 4; if (extra == 1) *s++ = 0; else if (extra) @@ -832,6 +899,11 @@ _gcry_mpi_print (enum gcry_mpi_format format, } if (buffer) { + static const u32 nibble2hex[] = + { + '0', '1', '2', '3', '4', '5', '6', '7', + '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' + }; unsigned char *s = buffer; if (negative) @@ -842,13 +914,37 @@ _gcry_mpi_print (enum gcry_mpi_format format, *s++ = '0'; } - for (i=0; i < n; i++) + for (i = 0; i + 4 < n; i += 4) + { + u32 c = buf_get_be32(tmp + i); + u32 o1, o2; + + o1 = nibble2hex[(c >> 28) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 24) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 20) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 16) & 0xF]; + + o2 = nibble2hex[(c >> 12) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 8) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 4) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 0) & 0xF]; + + buf_put_be32 (s + 0, o1); + buf_put_be32 (s + 4, o2); + s += 8; + } + for (; i < n; i++) { unsigned int c = tmp[i]; - *s++ = (c >> 
4) < 10? '0'+(c>>4) : 'A'+(c>>4)-10 ; - c &= 15; - *s++ = c < 10? '0'+c : 'A'+c-10 ; + *s++ = nibble2hex[c >> 4]; + *s++ = nibble2hex[c & 0xF]; } *s++ = 0; *nwritten = s - buffer; diff --git a/tests/t-convert.c b/tests/t-convert.c index 4450a9e3..d5d162b9 100644 --- a/tests/t-convert.c +++ b/tests/t-convert.c @@ -141,6 +141,7 @@ static void check_formats (void) { static struct { + int have_value; int value; struct { const char *hex; @@ -154,136 +155,283 @@ check_formats (void) const char *pgp; } a; } data[] = { - { 0, { "00", - 0, "", - 4, "\x00\x00\x00\x00", - 0, "", - 2, "\x00\x00"} - }, - { 1, { "01", - 1, "\x01", - 5, "\x00\x00\x00\x01\x01", - 1, "\x01", - 3, "\x00\x01\x01" } - }, - { 2, { "02", - 1, "\x02", - 5, "\x00\x00\x00\x01\x02", - 1, "\x02", - 3, "\x00\x02\x02" } - }, - { 127, { "7F", - 1, "\x7f", - 5, "\x00\x00\x00\x01\x7f", - 1, "\x7f", - 3, "\x00\x07\x7f" } - }, - { 128, { "0080", - 2, "\x00\x80", - 6, "\x00\x00\x00\x02\x00\x80", - 1, "\x80", - 3, "\x00\x08\x80" } - }, - { 129, { "0081", - 2, "\x00\x81", - 6, "\x00\x00\x00\x02\x00\x81", - 1, "\x81", - 3, "\x00\x08\x81" } - }, - { 255, { "00FF", - 2, "\x00\xff", - 6, "\x00\x00\x00\x02\x00\xff", - 1, "\xff", - 3, "\x00\x08\xff" } - }, - { 256, { "0100", - 2, "\x01\x00", - 6, "\x00\x00\x00\x02\x01\x00", - 2, "\x01\x00", - 4, "\x00\x09\x01\x00" } - }, - { 257, { "0101", - 2, "\x01\x01", - 6, "\x00\x00\x00\x02\x01\x01", - 2, "\x01\x01", - 4, "\x00\x09\x01\x01" } - }, - { -1, { "-01", - 1, "\xff", - 5, "\x00\x00\x00\x01\xff", - 1,"\x01" } - }, - { -2, { "-02", - 1, "\xfe", - 5, "\x00\x00\x00\x01\xfe", - 1, "\x02" } - }, - { -127, { "-7F", - 1, "\x81", - 5, "\x00\x00\x00\x01\x81", - 1, "\x7f" } - }, - { -128, { "-0080", - 1, "\x80", - 5, "\x00\x00\x00\x01\x80", - 1, "\x80" } - }, - { -129, { "-0081", - 2, "\xff\x7f", - 6, "\x00\x00\x00\x02\xff\x7f", - 1, "\x81" } - }, - { -255, { "-00FF", - 2, "\xff\x01", - 6, "\x00\x00\x00\x02\xff\x01", - 1, "\xff" } - }, - { -256, { "-0100", - 2, "\xff\x00", - 6, 
"\x00\x00\x00\x02\xff\x00", - 2, "\x01\x00" } - }, - { -257, { "-0101", - 2, "\xfe\xff", - 6, "\x00\x00\x00\x02\xfe\xff", - 2, "\x01\x01" } - }, - { 65535, { "00FFFF", - 3, "\x00\xff\xff", - 7, "\x00\x00\x00\x03\x00\xff\xff", - 2, "\xff\xff", - 4, "\x00\x10\xff\xff" } - }, - { 65536, { "010000", - 3, "\x01\00\x00", - 7, "\x00\x00\x00\x03\x01\x00\x00", - 3, "\x01\x00\x00", - 5, "\x00\x11\x01\x00\x00 "} - }, - { 65537, { "010001", - 3, "\x01\00\x01", - 7, "\x00\x00\x00\x03\x01\x00\x01", - 3, "\x01\x00\x01", - 5, "\x00\x11\x01\x00\x01" } - }, - { -65537, { "-010001", - 3, "\xfe\xff\xff", - 7, "\x00\x00\x00\x03\xfe\xff\xff", - 3, "\x01\x00\x01" } - }, - { -65536, { "-010000", - 3, "\xff\x00\x00", - 7, "\x00\x00\x00\x03\xff\x00\x00", - 3, "\x01\x00\x00" } - }, - { -65535, { "-00FFFF", - 3, "\xff\x00\x01", - 7, "\x00\x00\x00\x03\xff\x00\x01", - 2, "\xff\xff" } + { + 1, 0, + { "00", + 0, "", + 4, "\x00\x00\x00\x00", + 0, "", + 2, "\x00\x00" } + }, + { + 1, 1, + { "01", + 1, "\x01", + 5, "\x00\x00\x00\x01\x01", + 1, "\x01", + 3, "\x00\x01\x01" } + }, + { + 1, 2, + { "02", + 1, "\x02", + 5, "\x00\x00\x00\x01\x02", + 1, "\x02", + 3, "\x00\x02\x02" } + }, + { + 1, 127, + { "7F", + 1, "\x7f", + 5, "\x00\x00\x00\x01\x7f", + 1, "\x7f", + 3, "\x00\x07\x7f" } + }, + { + 1, 128, + { "0080", + 2, "\x00\x80", + 6, "\x00\x00\x00\x02\x00\x80", + 1, "\x80", + 3, "\x00\x08\x80" } + }, + { + 1, 129, + { "0081", + 2, "\x00\x81", + 6, "\x00\x00\x00\x02\x00\x81", + 1, "\x81", + 3, "\x00\x08\x81" } + }, + { + 1, 255, + { "00FF", + 2, "\x00\xff", + 6, "\x00\x00\x00\x02\x00\xff", + 1, "\xff", + 3, "\x00\x08\xff" } + }, + { + 1, 256, + { "0100", + 2, "\x01\x00", + 6, "\x00\x00\x00\x02\x01\x00", + 2, "\x01\x00", + 4, "\x00\x09\x01\x00" } + }, + { + 1, 257, + { "0101", + 2, "\x01\x01", + 6, "\x00\x00\x00\x02\x01\x01", + 2, "\x01\x01", + 4, "\x00\x09\x01\x01" } + }, + { + 1, -1, + { "-01", + 1, "\xff", + 5, "\x00\x00\x00\x01\xff", + 1,"\x01" } + }, + { + 1, -2, + { "-02", + 1, "\xfe", + 5, 
"\x00\x00\x00\x01\xfe", + 1, "\x02" } + }, + { + 1, -127, + { "-7F", + 1, "\x81", + 5, "\x00\x00\x00\x01\x81", + 1, "\x7f" } + }, + { + 1, -128, + { "-0080", + 1, "\x80", + 5, "\x00\x00\x00\x01\x80", + 1, "\x80" } + }, + { + 1, -129, + { "-0081", + 2, "\xff\x7f", + 6, "\x00\x00\x00\x02\xff\x7f", + 1, "\x81" } + }, + { + 1, -255, + { "-00FF", + 2, "\xff\x01", + 6, "\x00\x00\x00\x02\xff\x01", + 1, "\xff" } + }, + { + 1, -256, + { "-0100", + 2, "\xff\x00", + 6, "\x00\x00\x00\x02\xff\x00", + 2, "\x01\x00" } + }, + { + 1, -257, + { "-0101", + 2, "\xfe\xff", + 6, "\x00\x00\x00\x02\xfe\xff", + 2, "\x01\x01" } + }, + { + 1, 65535, + { "00FFFF", + 3, "\x00\xff\xff", + 7, "\x00\x00\x00\x03\x00\xff\xff", + 2, "\xff\xff", + 4, "\x00\x10\xff\xff" } + }, + { + 1, 65536, + { "010000", + 3, "\x01\00\x00", + 7, "\x00\x00\x00\x03\x01\x00\x00", + 3, "\x01\x00\x00", + 5, "\x00\x11\x01\x00\x00 "} + }, + { + 1, 65537, + { "010001", + 3, "\x01\00\x01", + 7, "\x00\x00\x00\x03\x01\x00\x01", + 3, "\x01\x00\x01", + 5, "\x00\x11\x01\x00\x01" } + }, + { + 1, -65537, + { "-010001", + 3, "\xfe\xff\xff", + 7, "\x00\x00\x00\x03\xfe\xff\xff", + 3, "\x01\x00\x01" } + }, + { + 1, -65536, + { "-010000", + 3, "\xff\x00\x00", + 7, "\x00\x00\x00\x03\xff\x00\x00", + 3, "\x01\x00\x00" } + }, + { + 1, -65535, + { "-00FFFF", + 3, "\xff\x00\x01", + 7, "\x00\x00\x00\x03\xff\x00\x01", + 2, "\xff\xff" } + }, + { + 1, 0x7fffffff, + { "7FFFFFFF", + 4, "\x7f\xff\xff\xff", + 8, "\x00\x00\x00\x04\x7f\xff\xff\xff", + 4, "\x7f\xff\xff\xff", + 6, "\x00\x1f\x7f\xff\xff\xff" } + }, + { 1, -0x7fffffff, + { "-7FFFFFFF", + 4, "\x80\x00\x00\x01", + 8, "\x00\x00\x00\x04\x80\x00\x00\x01", + 4, "\x7f\xff\xff\xff" } + }, + { + 1, (int)0x800000ffU, + { "-7FFFFF01", + 4, "\x80\x00\x00\xff", + 8, "\x00\x00\x00\x04\x80\x00\x00\xff", + 4, "\x7f\xff\xff\x01" } + }, + { + 1, (int)0x800000feU, + { "-7FFFFF02", + 4, "\x80\x00\x00\xfe", + 8, "\x00\x00\x00\x04\x80\x00\x00\xfe", + 4, "\x7f\xff\xff\x02" } + }, + { + 1, (int)0x800000fcU, + { 
"-7FFFFF04", + 4, "\x80\x00\x00\xfc", + 8, "\x00\x00\x00\x04\x80\x00\x00\xfc", + 4, "\x7f\xff\xff\x04" } + }, + { + 1, (int)0x800000f8U, + { "-7FFFFF08", + 4, "\x80\x00\x00\xf8", + 8, "\x00\x00\x00\x04\x80\x00\x00\xf8", + 4, "\x7f\xff\xff\x08" } + }, + { + 1, (int)0x800000f0U, + { "-7FFFFF10", + 4, "\x80\x00\x00\xf0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xf0", + 4, "\x7f\xff\xff\x10" } + }, + { + 1, (int)0x800000e0U, + { "-7FFFFF20", + 4, "\x80\x00\x00\xe0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xe0", + 4, "\x7f\xff\xff\x20" } + }, + { + 1, (int)0x800000c0U, + { "-7FFFFF40", + 4, "\x80\x00\x00\xc0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xc0", + 4, "\x7f\xff\xff\x40" } + }, + { + 1, (int)0x80000080U, + { "-7FFFFF80", + 4, "\x80\x00\x00\x80", + 8, "\x00\x00\x00\x04\x80\x00\x00\x80", + 4, "\x7f\xff\xff\x80" } + }, + { + 1, (int)0x80000100U, + { "-7FFFFF00", + 4, "\x80\x00\x01\x00", + 8, "\x00\x00\x00\x04\x80\x00\x01\x00", + 4, "\x7f\xff\xff\x00" } + }, + { + 0, 0, + { "076543210FEDCBA9876543210123456789ABCDEF00112233", + 24, "\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21\x01\x23" + "\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33", + 28, "\x00\x00\x00\x18\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65" + "\x43\x21\x01\x23\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33" + "\x44", + 24, "\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21\x01\x23" + "\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33", + 26, "\x00\xbb\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21" + "\x01\x23\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33" } + }, + { + 0, 0, + { "-07FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF01", + 24, "\xf8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff", + 28, "\x00\x00\x00\x18\xf8\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff", + 24, "\x07\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff" + "\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01" } } }; gpg_error_t err; gcry_mpi_t a, b; + char valuestr[128]; char *buf; 
void *bufaddr = &buf; int idx; @@ -295,24 +443,39 @@ check_formats (void) if (debug) info ("print test %d\n", data[idx].value); - if (data[idx].value < 0) - { - gcry_mpi_set_ui (a, -data[idx].value); - gcry_mpi_neg (a, a); - } + if (data[idx].have_value) + { + snprintf(valuestr, sizeof(valuestr), "%d", data[idx].value); + if (data[idx].value < 0) + { + gcry_mpi_set_ui (a, -data[idx].value); + gcry_mpi_neg (a, a); + } + else + gcry_mpi_set_ui (a, data[idx].value); + } else - gcry_mpi_set_ui (a, data[idx].value); + { + /* Use hex-format as source test vector. */ + snprintf(valuestr, sizeof(valuestr), "%s", data[idx].a.hex); + gcry_mpi_release (a); + err = gcry_mpi_scan (&a, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, + &buflen); + if (err) + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); + } err = gcry_mpi_aprint (GCRYMPI_FMT_HEX, bufaddr, NULL, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "HEX", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); else { if (strcmp (buf, data[idx].a.hex)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "HEX", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "HEX", "wrong result"); info ("expected: '%s'\n", data[idx].a.hex); info (" got: '%s'\n", buf); } @@ -321,15 +484,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_STD, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "STD", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "STD", gpg_strerror (err)); else { if (buflen != data[idx].a.stdlen || memcmp (buf, data[idx].a.std, data[idx].a.stdlen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "STD", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "STD", "wrong result"); showhex ("expected:", data[idx].a.std, data[idx].a.stdlen); 
showhex (" got:", buf, buflen); } @@ -338,15 +501,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_SSH, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "SSH", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "SSH", gpg_strerror (err)); else { if (buflen != data[idx].a.sshlen || memcmp (buf, data[idx].a.ssh, data[idx].a.sshlen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "SSH", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "SSH", "wrong result"); showhex ("expected:", data[idx].a.ssh, data[idx].a.sshlen); showhex (" got:", buf, buflen); } @@ -355,15 +518,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_USG, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "USG", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "USG", gpg_strerror (err)); else { if (buflen != data[idx].a.usglen || memcmp (buf, data[idx].a.usg, data[idx].a.usglen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "USG", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "USG", "wrong result"); showhex ("expected:", data[idx].a.usg, data[idx].a.usglen); showhex (" got:", buf, buflen); } @@ -374,19 +537,19 @@ check_formats (void) if (gcry_mpi_is_neg (a)) { if (gpg_err_code (err) != GPG_ERR_INV_ARG) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "PGP", "Expected error not returned"); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", "Expected error not returned"); } else if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "PGP", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", gpg_strerror (err)); else { if (buflen != data[idx].a.pgplen || memcmp (buf, data[idx].a.pgp, data[idx].a.pgplen)) { - fail ("error printing value %d as %s: 
%s\n", - data[idx].value, "PGP", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", "wrong result"); showhex ("expected:", data[idx].a.pgp, data[idx].a.pgplen); showhex (" got:", buf, buflen); } @@ -401,24 +564,39 @@ check_formats (void) if (debug) info ("scan test %d\n", data[idx].value); - if (data[idx].value < 0) - { - gcry_mpi_set_ui (a, -data[idx].value); - gcry_mpi_neg (a, a); - } + if (data[idx].have_value) + { + snprintf(valuestr, sizeof(valuestr), "%d", data[idx].value); + if (data[idx].value < 0) + { + gcry_mpi_set_ui (a, -data[idx].value); + gcry_mpi_neg (a, a); + } + else + gcry_mpi_set_ui (a, data[idx].value); + } else - gcry_mpi_set_ui (a, data[idx].value); + { + /* Use hex-format as source test vector. */ + snprintf(valuestr, sizeof(valuestr), "%s", data[idx].a.hex); + gcry_mpi_release (a); + err = gcry_mpi_scan (&a, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, + &buflen); + if (err) + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); + } err = gcry_mpi_scan (&b, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, &buflen); if (err) - fail ("error scanning value %d from %s: %s\n", - data[idx].value, "HEX", gpg_strerror (err)); + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b)) { - fail ("error scanning value %d from %s: %s\n", - data[idx].value, "HEX", "wrong result"); + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", "wrong result"); showmpi ("expected:", a); showmpi (" got:", b); } @@ -428,14 +606,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_STD, data[idx].a.std, data[idx].a.stdlen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "STD", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "STD", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.stdlen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", 
- data[idx].value, "STD", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "STD", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -446,14 +624,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_SSH, data[idx].a.ssh, data[idx].a.sshlen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "SSH", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "SSH", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.sshlen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", - data[idx].value, "SSH", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "SSH", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -464,16 +642,16 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_USG, data[idx].a.usg, data[idx].a.usglen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "USG", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "USG", gpg_strerror (err)); else { if (gcry_mpi_is_neg (a)) gcry_mpi_neg (b, b); if (gcry_mpi_cmp (a, b) || data[idx].a.usglen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", - data[idx].value, "USG", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "USG", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -488,14 +666,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_PGP, data[idx].a.pgp, data[idx].a.pgplen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "PGP", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "PGP", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.pgplen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", 
- data[idx].value, "PGP", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "PGP", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); -- 2.30.2 From jussi.kivilinna at iki.fi Fri Jun 25 13:02:28 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 25 Jun 2021 14:02:28 +0300 Subject: [PATCH] ec: add zSeries/s390x accelerated scalar multiplication Message-ID: <20210625110228.1175550-1-jussi.kivilinna@iki.fi> * cipher/asm-inline-s390x.h (PCC_FUNCTION_*): New. (pcc_query, pcc_scalar_multiply): New. * mpi/Makefile.am: Add 'ec-hw-s390x.c'. * mpi/ec-hw-s390x.c: New. * mpi/ec-internal.h (_gcry_s390x_ec_hw_mul_point) (mpi_ec_hw_mul_point): New. * mpi/ec.c (_gcry_mpi_ec_mul_point): Call 'mpi_ec_hw_mul_point'. * src/g10lib.h (HWF_S390X_MSA_9): New. * src/hwf-s390x.c (s390x_features): Add MSA9. * src/hwfeatures.c (hwflist): Add 's390x-msa-9'. -- Patch adds ECC scalar multiplication acceleration using s390x's PCC instruction. 
The following curves are supported:
 - Ed25519
 - Ed448
 - X25519
 - X448
 - NIST curves P-256, P-384 and P-521

Benchmark on z15 (5.2 GHz):

Before:
 Ed25519        |  nanosecs/iter   cycles/iter
        mult    |         389791       2026916
        keygen  |         572017       2974487
        sign    |         636603       3310336
        verify  |        1189097       6183305
 =
 X25519         |  nanosecs/iter   cycles/iter
        mult    |         296805       1543385
 =
 Ed448          |  nanosecs/iter   cycles/iter
        mult    |        1693373       8805541
        keygen  |        2382473      12388858
        sign    |        2609562      13569725
        verify  |        5177606      26923552
 =
 X448           |  nanosecs/iter   cycles/iter
        mult    |        1136178       5908127
 =
 NIST-P256      |  nanosecs/iter   cycles/iter
        mult    |         792620       4121625
        keygen  |        4627835      24064740
        sign    |        1528268       7946991
        verify  |        1678205       8726664
 =
 NIST-P384      |  nanosecs/iter   cycles/iter
        mult    |        1766418       9185373
        keygen  |       10158485      52824123
        sign    |        3341172      17374095
        verify  |        3694750      19212700
 =
 NIST-P521      |  nanosecs/iter   cycles/iter
        mult    |        3172566      16497346
        keygen  |       18184747      94560683
        sign    |        6039956      31407771
        verify  |        6480882      33700588

After:
 Ed25519        |  nanosecs/iter   cycles/iter   speed-up
        mult    |          25913        134746        15x
        keygen  |          44447        231124        12x
        sign    |         106928        556028         6x
        verify  |         164681        856341         7x
 =
 X25519         |  nanosecs/iter   cycles/iter   speed-up
        mult    |          17761         92358        16x
 =
 Ed448          |  nanosecs/iter   cycles/iter   speed-up
        mult    |          50808        264199        33x
        keygen  |          68644        356951        34x
        sign    |         317446       1650720         8x
        verify  |         457115       2376997        11x
 =
 X448           |  nanosecs/iter   cycles/iter   speed-up
        mult    |          35637        185313        31x
 =
 NIST-P256      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          30678        159528        25x
        keygen  |         323722       1683356        14x
        sign    |         114176        593713        13x
        verify  |         169901        883487         9x
 =
 NIST-P384      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          59966        311822        29x
        keygen  |         607778       3160445        16x
        sign    |         209832       1091128        16x
        verify  |         329506       1713431        11x
 =
 NIST-P521      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          98230        510797        32x
        keygen  |        1131686       5884765        16x
        sign    |         397777       2068442        15x
        verify  |         623076       3239998        10x

Signed-off-by: Jussi Kivilinna
---
 cipher/asm-inline-s390x.h |  48 +++++
 mpi/Makefile.am           |   3 +-
 mpi/ec-hw-s390x.c         | 412 ++++++++++++++++++++++++++++++++++++++
mpi/ec-internal.h | 8 + mpi/ec.c | 10 +- src/g10lib.h | 3 +- src/hwf-s390x.c | 1 + src/hwfeatures.c | 1 + 8 files changed, 483 insertions(+), 3 deletions(-) create mode 100644 mpi/ec-hw-s390x.c diff --git a/cipher/asm-inline-s390x.h b/cipher/asm-inline-s390x.h index bacb45fe..001cb965 100644 --- a/cipher/asm-inline-s390x.h +++ b/cipher/asm-inline-s390x.h @@ -45,6 +45,14 @@ enum kmxx_functions_e KMID_FUNCTION_SHAKE128 = 36, KMID_FUNCTION_SHAKE256 = 37, KMID_FUNCTION_GHASH = 65, + + PCC_FUNCTION_NIST_P256 = 64, + PCC_FUNCTION_NIST_P384 = 65, + PCC_FUNCTION_NIST_P521 = 66, + PCC_FUNCTION_ED25519 = 72, + PCC_FUNCTION_ED448 = 73, + PCC_FUNCTION_X25519 = 80, + PCC_FUNCTION_X448 = 81 }; enum kmxx_function_flags_e @@ -108,6 +116,26 @@ static inline u128_t klmd_query(void) return function_codes; } +static inline u128_t pcc_query(void) +{ + static u128_t function_codes = 0; + static int initialized = 0; + register unsigned long reg0 asm("0") = 0; + register void *reg1 asm("1") = &function_codes; + + if (initialized) + return function_codes; + + asm volatile ("0: .insn rre,0xb92c << 16, 0, 0\n\t" + " brc 1,0b\n\t" + : + : [reg0] "r" (reg0), [reg1] "r" (reg1) + : "cc", "memory"); + + initialized = 1; + return function_codes; +} + static ALWAYS_INLINE void kimd_execute(unsigned int func, void *param_block, const void *src, size_t src_len) @@ -154,4 +182,24 @@ klmd_shake_execute(unsigned int func, void *param_block, void *dst, : "cc", "memory"); } +static ALWAYS_INLINE unsigned int +pcc_scalar_multiply(unsigned int func, void *param_block) +{ + register unsigned long reg0 asm("0") = func; + register byte *reg1 asm("1") = param_block; + register unsigned long error = 0; + + asm volatile ("0: .insn rre,0xb92c << 16, 0, 0\n\t" + " brc 1,0b\n\t" + " brc 7,1f\n\t" + " j 2f\n\t" + "1: lhi %[error], 1\n\t" + "2:\n\t" + : [func] "+r" (reg0), [error] "+r" (error) + : [param_ptr] "r" (reg1) + : "cc", "memory"); + + return error; +} + #endif /* GCRY_ASM_INLINE_S390X_H */ diff --git 
a/mpi/Makefile.am b/mpi/Makefile.am index adb8e6f5..3604f840 100644 --- a/mpi/Makefile.am +++ b/mpi/Makefile.am @@ -175,5 +175,6 @@ libmpi_la_SOURCES = longlong.h \ mpih-mul.c \ mpih-const-time.c \ mpiutil.c \ - ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h + ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h \ + ec-hw-s390x.c EXTRA_libmpi_la_SOURCES = asm-common-aarch64.h diff --git a/mpi/ec-hw-s390x.c b/mpi/ec-hw-s390x.c new file mode 100644 index 00000000..149a061d --- /dev/null +++ b/mpi/ec-hw-s390x.c @@ -0,0 +1,412 @@ +/* ec-hw-s390x.c - zSeries ECC acceleration + * Copyright (C) 2021 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include +#include +#include +#include + +#ifdef HAVE_GCC_INLINE_ASM_S390X + +#include "mpi-internal.h" +#include "g10lib.h" +#include "context.h" +#include "ec-context.h" +#include "ec-internal.h" + +#include "../cipher/bufhelp.h" +#include "../cipher/asm-inline-s390x.h" + + +#define S390X_PCC_PARAM_BLOCK_SIZE 4096 + + +extern void reverse_buffer (unsigned char *buffer, unsigned int length); + +static int s390_mul_point_montgomery (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx, + byte *param_block_buf); + + +static int +mpi_copy_to_raw(byte *raw, unsigned int raw_nbytes, gcry_mpi_t a) +{ + unsigned int num_to_zero; + unsigned int nbytes; + int i, j; + + if (mpi_has_sign (a)) + return -1; + + if (mpi_get_flag (a, GCRYMPI_FLAG_OPAQUE)) + { + unsigned int nbits; + byte *buf; + + buf = mpi_get_opaque (a, &nbits); + nbytes = (nbits + 7) / 8; + + if (raw_nbytes < nbytes) + return -1; + + num_to_zero = raw_nbytes - nbytes; + if (num_to_zero > 0) + memset (raw, 0, num_to_zero); + if (nbytes > 0) + memcpy (raw + num_to_zero, buf, nbytes); + + return 0; + } + + nbytes = a->nlimbs * BYTES_PER_MPI_LIMB; + if (raw_nbytes < nbytes) + return -1; + + num_to_zero = raw_nbytes - nbytes; + if (num_to_zero > 0) + memset (raw, 0, num_to_zero); + + for (j = a->nlimbs - 1, i = 0; i < a->nlimbs; i++, j--) + { + buf_put_be64(raw + num_to_zero + i * BYTES_PER_MPI_LIMB, a->d[j]); + } + + return 0; +} + +int +_gcry_s390x_ec_hw_mul_point (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx) +{ + byte param_block_buf[S390X_PCC_PARAM_BLOCK_SIZE]; + byte *param_out_x = NULL; + byte *param_out_y = NULL; + byte *param_in_x = NULL; + byte *param_in_y = NULL; + byte *param_scalar = NULL; + unsigned int field_nbits; + unsigned int pcc_func; + gcry_mpi_t x, y; + gcry_mpi_t d = NULL; + int rc = -1; + + if (ctx->name == NULL) + return -1; + + if (!(_gcry_get_hw_features () & HWF_S390X_MSA_9)) + return -1; /* ECC acceleration not supported by HW. 
*/ + + if (ctx->model == MPI_EC_MONTGOMERY) + return s390_mul_point_montgomery (result, scalar, point, ctx, + param_block_buf); + + if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 256 && + strcmp (ctx->name, "NIST P-256") == 0) + { + struct pcc_param_block_nistp256_s + { + byte out_x[256 / 8]; + byte out_y[256 / 8]; + byte in_x[256 / 8]; + byte in_y[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P256; + field_nbits = 256; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 384 && + strcmp (ctx->name, "NIST P-384") == 0) + { + struct pcc_param_block_nistp384_s + { + byte out_x[384 / 8]; + byte out_y[384 / 8]; + byte in_x[384 / 8]; + byte in_y[384 / 8]; + byte scalar[384 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P384; + field_nbits = 384; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 521 && + strcmp (ctx->name, "NIST P-521") == 0) + { + struct pcc_param_block_nistp521_s + { + byte out_x[640 / 8]; /* note: first 14 bytes not modified by pcc */ + byte out_y[640 / 8]; /* note: first 14 bytes not modified by pcc */ + byte in_x[640 / 8]; + byte in_y[640 / 8]; + byte scalar[640 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 14); + memset (params->out_y, 0, 14); + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P521; + field_nbits = 640; + param_out_x = params->out_x; + 
param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_EDWARDS && ctx->nbits == 255 && + strcmp (ctx->name, "Ed25519") == 0) + { + struct pcc_param_block_ed25519_s + { + byte out_x[256 / 8]; + byte out_y[256 / 8]; + byte in_x[256 / 8]; + byte in_y[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_ED25519; + field_nbits = 256; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_EDWARDS && ctx->nbits == 448 && + strcmp (ctx->name, "Ed448") == 0) + { + struct pcc_param_block_ed448_s + { + byte out_x[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte out_y[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte in_x[512 / 8]; + byte in_y[512 / 8]; + byte scalar[512 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 8); + memset (params->out_y, 0, 8); + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_ED448; + field_nbits = 512; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + + if (param_scalar == NULL) + return -1; /* No curve match. */ + + if (!(pcc_query () & km_function_to_mask (pcc_func))) + return -1; /* HW does not support acceleration for this curve. */ + + x = mpi_new (0); + y = mpi_new (0); + + if (_gcry_mpi_ec_get_affine (x, y, point, ctx) < 0) + { + /* Point at infinity. */ + goto out; + } + + if (mpi_has_sign (scalar) || mpi_cmp (scalar, ctx->n) >= 0) + { + d = mpi_is_secure (scalar) ? 
mpi_snew (ctx->nbits) : mpi_new (ctx->nbits); + _gcry_mpi_mod (d, scalar, ctx->n); + } + else + { + d = scalar; + } + + if (mpi_copy_to_raw (param_in_x, field_nbits / 8, x) < 0) + goto out; + + if (mpi_copy_to_raw (param_in_y, field_nbits / 8, y) < 0) + goto out; + + if (mpi_copy_to_raw (param_scalar, field_nbits / 8, d) < 0) + goto out; + + if (pcc_scalar_multiply (pcc_func, param_block_buf) != 0) + goto out; + + _gcry_mpi_set_buffer (result->x, param_out_x, field_nbits / 8, 0); + _gcry_mpi_set_buffer (result->y, param_out_y, field_nbits / 8, 0); + mpi_set_ui (result->z, 1); + mpi_normalize (result->x); + mpi_normalize (result->y); + if (ctx->model == MPI_EC_EDWARDS) + mpi_point_resize (result, ctx); + + rc = 0; + +out: + if (d != scalar) + mpi_release (d); + mpi_release (y); + mpi_release (x); + wipememory (param_block_buf, S390X_PCC_PARAM_BLOCK_SIZE); + + return rc; +} + + +static int +s390_mul_point_montgomery (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx, + byte *param_block_buf) +{ + byte *param_out_x = NULL; + byte *param_in_x = NULL; + byte *param_scalar = NULL; + unsigned int field_nbits; + unsigned int pcc_func; + gcry_mpi_t x; + gcry_mpi_t d = NULL; + int rc = -1; + + if (ctx->nbits == 255 && strcmp (ctx->name, "Curve25519") == 0) + { + struct pcc_param_block_x25519_s + { + byte out_x[256 / 8]; + byte in_x[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_X25519; + field_nbits = 256; + param_out_x = params->out_x; + param_in_x = params->in_x; + param_scalar = params->scalar; + } + else if (ctx->nbits == 448 && strcmp (ctx->name, "X448") == 0) + { + struct pcc_param_block_x448_s + { + byte out_x[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte in_x[512 / 8]; + byte scalar[512 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 8); + 
memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_X448; + field_nbits = 512; + param_out_x = params->out_x; + param_in_x = params->in_x; + param_scalar = params->scalar; + } + + if (param_scalar == NULL) + return -1; /* No curve match. */ + + if (!(pcc_query () & km_function_to_mask (pcc_func))) + return -1; /* HW does not support acceleration for this curve. */ + + x = mpi_new (0); + + if (mpi_is_opaque (scalar)) + { + const unsigned int pbits = ctx->nbits; + unsigned int n; + unsigned char *raw; + + raw = _gcry_mpi_get_opaque_copy (scalar, &n); + if ((n + 7) / 8 != (pbits + 7) / 8) + log_fatal ("scalar size (%d) != prime size (%d)\n", + (n + 7) / 8, (pbits + 7) / 8); + + reverse_buffer (raw, (n + 7 ) / 8); + if ((pbits % 8)) + raw[0] &= (1 << (pbits % 8)) - 1; + raw[0] |= (1 << ((pbits + 7) % 8)); + raw[(pbits + 7) / 8 - 1] &= (256 - ctx->h); + d = mpi_is_secure (scalar) ? mpi_snew (pbits) : mpi_new (pbits); + _gcry_mpi_set_buffer (d, raw, (n + 7) / 8, 0); + xfree (raw); + } + else + { + d = scalar; + } + + if (_gcry_mpi_ec_get_affine (x, NULL, point, ctx) < 0) + { + /* Point at infinity. 
*/ + goto out; + } + + if (mpi_copy_to_raw (param_in_x, field_nbits / 8, x) < 0) + goto out; + + if (mpi_copy_to_raw (param_scalar, field_nbits / 8, d) < 0) + goto out; + + if (pcc_scalar_multiply (pcc_func, param_block_buf) != 0) + goto out; + + _gcry_mpi_set_buffer (result->x, param_out_x, field_nbits / 8, 0); + mpi_set_ui (result->z, 1); + mpi_point_resize (result, ctx); + + rc = 0; + +out: + if (d != scalar) + mpi_release (d); + mpi_release (x); + wipememory (param_block_buf, S390X_PCC_PARAM_BLOCK_SIZE); + + return rc; +} + +#endif /* HAVE_GCC_INLINE_ASM_S390X */ diff --git a/mpi/ec-internal.h b/mpi/ec-internal.h index 2296d55d..3f948aa0 100644 --- a/mpi/ec-internal.h +++ b/mpi/ec-internal.h @@ -38,4 +38,12 @@ void _gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx); # define _gcry_mpi_ec_nist521_mod NULL #endif +#ifdef HAVE_GCC_INLINE_ASM_S390X +int _gcry_s390x_ec_hw_mul_point (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx); +# define mpi_ec_hw_mul_point _gcry_s390x_ec_hw_mul_point +#else +# define mpi_ec_hw_mul_point(r,s,p,c) (-1) +#endif + #endif /*GCRY_EC_INTERNAL_H*/ diff --git a/mpi/ec.c b/mpi/ec.c index 029099b4..c24921ee 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -1769,7 +1769,7 @@ _gcry_mpi_ec_sub_points (mpi_point_t result, } -/* Scalar point multiplication - the main function for ECC. If takes +/* Scalar point multiplication - the main function for ECC. It takes an integer SCALAR and a POINT as well as the usual context CTX. RESULT will be set to the resulting point. */ void @@ -1781,6 +1781,14 @@ _gcry_mpi_ec_mul_point (mpi_point_t result, unsigned int i, loops; mpi_point_struct p1, p2, p1inv; + /* First try HW accelerated scalar multiplications. Error + is returned if acceleration is not supported or if HW + does not support acceleration of given input. 
*/ + if (mpi_ec_hw_mul_point (result, scalar, point, ctx) >= 0) + { + return; + } + if (ctx->model == MPI_EC_EDWARDS || (ctx->model == MPI_EC_WEIERSTRASS && mpi_is_secure (scalar))) diff --git a/src/g10lib.h b/src/g10lib.h index fb288a30..ed908742 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -258,7 +258,8 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_S390X_MSA (1 << 0) #define HWF_S390X_MSA_4 (1 << 1) #define HWF_S390X_MSA_8 (1 << 2) -#define HWF_S390X_VX (1 << 3) +#define HWF_S390X_MSA_9 (1 << 3) +#define HWF_S390X_VX (1 << 4) #endif diff --git a/src/hwf-s390x.c b/src/hwf-s390x.c index 25121b91..74590fc3 100644 --- a/src/hwf-s390x.c +++ b/src/hwf-s390x.c @@ -63,6 +63,7 @@ static const struct feature_map_s s390x_features[] = { 17, 0, HWF_S390X_MSA }, { 77, 0, HWF_S390X_MSA_4 }, { 146, 0, HWF_S390X_MSA_8 }, + { 155, 0, HWF_S390X_MSA_9 }, #ifdef HAVE_GCC_INLINE_ASM_S390X_VX { 129, HWCAP_S390_VXRS, HWF_S390X_VX }, #endif diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 534c271d..89f5943d 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -76,6 +76,7 @@ static struct { HWF_S390X_MSA, "s390x-msa" }, { HWF_S390X_MSA_4, "s390x-msa-4" }, { HWF_S390X_MSA_8, "s390x-msa-8" }, + { HWF_S390X_MSA_9, "s390x-msa-9" }, { HWF_S390X_VX, "s390x-vx" }, #endif }; -- 2.30.2