From Johannes.Schindelin at gmx.de  Wed Jun 16 10:07:11 2021
From: Johannes.Schindelin at gmx.de (Johannes Schindelin)
Date: Wed, 16 Jun 2021 10:07:11 +0200 (CEST)
Subject: [PATCH] Fix broken mlock detection
Message-ID: 

We need to be careful when casting a pointer to a `long int`: the
highest bit might be set, in which case the result is a negative
number.

In this instance, it is fatal: we now take the modulus of that negative
number with respect to the page size, and subtract it from the page
size. So what should be a number that is smaller than the page size is
now larger than the page size. As a consequence, we do not try to lock
a 4096-byte block that is at the page size boundary inside a
`malloc()`ed block, but we try to do that _outside_ the block. Which
means that we are not at all detecting whether `mlock()` is broken.

This actually happened here, in the i686 MSYS2 build of libgcrypt.

Let's be very careful to cast the pointer to an _unsigned_ value
instead.

Note: technically, we should cast the pointer to a `size_t`. But since
we only need the remainder modulo the page size (which is a power of
two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
to a 32-bit `unsigned long`. It does matter, though, whether we
mistakenly turn the remainder into a negative one.
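[Editor's illustration: a standalone sketch of the corrected alignment
step. The helper name is mine, not part of the patch; the real test
lives in acinclude.m4, shown in the diff below. Going through an
unsigned type keeps the remainder non-negative:]

```c
#include <stdint.h>

/* Advance a malloc()ed pointer to the next page boundary.  Casting
 * through the unsigned uintptr_t keeps `pool % pgsize` non-negative
 * even when the address has its highest bit set; with a signed
 * `long int` cast, the remainder can be negative, so the adjustment
 * overshoots past the intended page boundary. */
static char *
align_to_page (char *pool, uintptr_t pgsize)
{
  return pool + (pgsize - ((uintptr_t) pool % pgsize));
}
```

If `pool` is already aligned, this advances by a full page; the
configure test allocates `4096 + pgsize` bytes, so either way the
aligned pointer still has 4096 usable bytes behind it.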
Signed-off-by: Johannes Schindelin 
---
 acinclude.m4 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/acinclude.m4 b/acinclude.m4
index 3c8dfba7..4a2a83c0 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -236,7 +236,7 @@ int main()
     pool = malloc( 4096 + pgsize );
     if( !pool )
         return 2;
-    pool += (pgsize - ((long int)pool % pgsize));
+    pool += (pgsize - ((unsigned long int)pool % pgsize));
     err = mlock( pool, 4096 );
     if( !err || errno == EPERM || errno == EAGAIN)
--
2.31.1

From wk at gnupg.org  Wed Jun 16 15:58:37 2021
From: wk at gnupg.org (Werner Koch)
Date: Wed, 16 Jun 2021 15:58:37 +0200
Subject: [PATCH] Fix broken mlock detection
In-Reply-To: (Johannes Schindelin via Gcrypt-devel's message of "Wed, 16 Jun 2021 10:07:11 +0200 (CEST)")
References: 
Message-ID: <87wnqtsygi.fsf@wheatstone.g10code.de>

Hi!

On Wed, 16 Jun 2021 10:07, Johannes Schindelin said:

> Which means that we are not at all detecting whether `mlock()` is
> broken.

Thanks for your correct analysis.  I wrote this test code 23 years ago
and this is the first report.  Seems that until now this has never been
tried on systems which allocate memory above 2 GiB.

> This actually happened here, in the i686 MSYS2 build of libgcrypt.

Please take care: I would not suggest building the Windows version with
MSYS - the only supported toolchain for Windows is gcc.

> we only need the remainder modulo the page size (which is a power of
> two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
> to a 32-bit `unsigned long`.  It does matter, though, whether we

Yep.  I would anyway use size_t here to avoid questions about the
reasoning.  In fact secmem.c uses uintptr_t, but that is a bit too
complicated to use in this configure test.

Thanks,

Werner

--
Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
From Johannes.Schindelin at gmx.de  Thu Jun 17 14:13:25 2021
From: Johannes.Schindelin at gmx.de (Johannes Schindelin)
Date: Thu, 17 Jun 2021 14:13:25 +0200 (CEST)
Subject: [PATCH] Fix broken mlock detection
In-Reply-To: <87wnqtsygi.fsf@wheatstone.g10code.de>
References: <87wnqtsygi.fsf@wheatstone.g10code.de>
Message-ID: 

Hi Werner,

On Wed, 16 Jun 2021, Werner Koch wrote:

> On Wed, 16 Jun 2021 10:07, Johannes Schindelin said:
>
> > Which means that we are not at all detecting whether `mlock()` is
> > broken.
>
> Thanks for your correct analysis.  I wrote this test code 23 years ago
> and this is the first report.  Seems that until now this has never been
> tried on systems which allocate memory above 2 GiB.

Right. 23 years ago, some people still believed that nobody would ever
need more than 640 kilobytes of RAM ;-)

BTW I _suspect_ that the reason I ran into this is a recent change in
Cygwin and/or Windows 10. I cannot recall seeing `malloc()`ed blocks
above 0x80000000.

> > This actually happened here, in the i686 MSYS2 build of libgcrypt.
>
> Please take care: I would not suggest building the Windows version with
> MSYS - the only supported toolchain for Windows is gcc.

Oh, I should have been clearer: I _am_ building using GCC. MSYS2 is
based partially on Cygwin (the MSYS2 runtime is a close fork of the
Cygwin runtime, to provide a POSIX emulation layer) and partially on
Arch Linux (from which it inherits its package management system,
pacman). It is a well-tested system and an important building block of
Git for Windows: millions of users have relied on it for over five
years.

In other words: I am not worried about building and using GNU Privacy
Guard and libgcrypt in MSYS2's context ;-)

> > we only need the remainder modulo the page size (which is a power of
> > two) anyway, it does not matter whether we clip, say, a 64-bit `size_t`
> > to a 32-bit `unsigned long`.
> > It does matter, though, whether we
>
> Yep.  I would anyway use size_t here to avoid questions about the
> reasoning.  In fact secmem.c uses uintptr_t, but that is a bit too
> complicated to use in this configure test.

As long as you can be sure that the type that is used is actually
defined, I do not care which one you use. I used `unsigned long`
because there is no question that it is defined here, whereas I have
run into compile problems in the past on systems where `size_t` was
not defined.

Ciao,
Dscho

From jussi.kivilinna at iki.fi  Sat Jun 19 15:36:02 2021
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sat, 19 Jun 2021 16:36:02 +0300
Subject: [PATCH v2] mpi/ec: add fast reduction functions for NIST curves
Message-ID: <20210619133602.665603-1-jussi.kivilinna@iki.fi>

* configure.ac (ASM_DISABLED): New.
* mpi/Makefile.am: Add 'ec-nist.c' and 'ec-inline.h'.
* mpi/ec-nist.c: New.
* mpi/ec-inline.h: New.
* mpi/ec-internal.h (_gcry_mpi_ec_nist192_mod)
(_gcry_mpi_ec_nist224_mod, _gcry_mpi_ec_nist256_mod)
(_gcry_mpi_ec_nist384_mod, _gcry_mpi_ec_nist521_mod): New.
* mpi/ec.c (ec_addm, ec_subm, ec_mulm, ec_mul2): Use 'ctx->mod'.
(field_table): Add 'mod' function; Add NIST reduction functions.
(ec_p_init): Setup ctx->mod; Setup function pointers from field_table
only if pointer is not NULL; Resize ctx->a and ctx->b only if set.
* mpi/mpi-internal.h (RESIZE_AND_CLEAR_IF_NEEDED): New.
* mpi/mpiutil.c (_gcry_mpi_resize): Clear all unused limbs also in
realloc case.
* src/ec-context.h (mpi_ec_ctx_s): Add 'mod' function.
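[Editor's illustration: the `_gcry_mpi_ec_nist*_mod` functions added
here implement Solinas-style fast reduction for the NIST primes. As a
rough standalone sketch of the technique — my own helper names and a
plain `unsigned __int128` accumulator, not the patch's limb code — here
is the reduction for P-192, p = 2^192 - 2^64 - 1:]

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* p = 2^192 - 2^64 - 1, little-endian 64-bit limbs. */
static const uint64_t P192[3] = {
  0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL, 0xFFFFFFFFFFFFFFFFULL
};

static int
geq_p192 (const uint64_t t[3])
{
  int i;
  for (i = 2; i >= 0; i--)
    {
      if (t[i] > P192[i]) return 1;
      if (t[i] < P192[i]) return 0;
    }
  return 1; /* equal */
}

static void
sub_p192 (uint64_t t[3])
{
  u128 d = (u128) t[0] - P192[0];
  t[0] = (uint64_t) d;
  d = (u128) t[1] - P192[1] - ((uint64_t)(d >> 64) ? 1 : 0);
  t[1] = (uint64_t) d;
  t[2] = t[2] - P192[2] - ((uint64_t)(d >> 64) ? 1 : 0);
}

/* Reduce a 384-bit product c[0..5] modulo P-192 using the FIPS 186-4
 * Solinas terms (64-bit words):
 *   s1 = (c2,c1,c0), s2 = (0,c3,c3), s3 = (c4,c4,0), s4 = (c5,c5,c5)
 * result = s1 + s2 + s3 + s4 mod p -- a handful of adds instead of a
 * general division. */
static void
p192_mod (uint64_t r[3], const uint64_t c[6])
{
  uint64_t t[3], carry;
  u128 acc;

  acc = (u128) c[0] + c[3] + c[5];
  t[0] = (uint64_t) acc;  acc >>= 64;
  acc += (u128) c[1] + c[3] + c[4] + c[5];
  t[1] = (uint64_t) acc;  acc >>= 64;
  acc += (u128) c[2] + c[4] + c[5];
  t[2] = (uint64_t) acc;
  carry = (uint64_t) (acc >> 64);

  while (carry)  /* fold overflow back in: 2^192 == 2^64 + 1 (mod p) */
    {
      acc = (u128) t[0] + carry;
      t[0] = (uint64_t) acc;
      acc = (acc >> 64) + (u128) t[1] + carry;
      t[1] = (uint64_t) acc;
      acc = (acc >> 64) + t[2];
      t[2] = (uint64_t) acc;
      carry = (uint64_t) (acc >> 64);
    }
  while (geq_p192 (t))
    sub_p192 (t);
  r[0] = t[0]; r[1] = t[1]; r[2] = t[2];
}
```

[The patch's versions additionally run in constant time and operate on
mpi limb vectors; this sketch only shows the folding idea.]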
--

Benchmark on AMD Ryzen 7 5800X (x86_64):

Before:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz
 mult      | 283346 1369473 4833
 keygen    | 1688442 8185744 4848
 sign      | 549683 2662984 4845
 verify    | 615284 2984325 4850
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz
 mult      | 516443 2501173 4843
 keygen    | 2859746 13866802 4849
 sign      | 918472 4455043 4850
 verify    | 1057940 5131372 4850
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz
 mult      | 423536 2054040 4850
 keygen    | 2383097 11557572 4850
 sign      | 774346 3754243 4848
 verify    | 864934 4196315 4852
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz
 mult      | 929985 4511881 4852
 keygen    | 5230788 25367299 4850
 sign      | 1671432 8109726 4852
 verify    | 1902729 9228568 4850
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz
 mult      | 2123546 10300952 4851
 keygen    | 12019340 58297774 4850
 sign      | 3886988 18853054 4850
 verify    | 4507885 21864015 4850

After:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 186679 905603 4851 +51%
 keygen    | 1161423 5623822 4842 +46%
 sign      | 389531 1887557 4846 +41%
 verify    | 412936 2000461 4844 +49%
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 260621 1256327 4821 +99%
 keygen    | 1557845 7531677 4835 +84%
 sign      | 521678 2527083 4844 +76%
 verify    | 554084 2677949 4833 +92%
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 319045 1542061 4833 +33%
 keygen    | 1834822 8898950 4850 +30%
 sign      | 612866 2972630 4850 +26%
 verify    | 664821 3222597 4847 +30%
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 593894 2875260 4841 +57%
 keygen    | 3526600 17089717 4846 +48%
 sign      | 1178098 5710151 4847 +42%
 verify    | 1260185 6107449 4846 +51%
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 1160220 5621946 4846 +83%
 keygen    | 6862975 33247351 4844 +75%
 sign      | 2287366 11096711 4851 +70%
 verify    | 2455858 11888045 4841 +84%

Benchmark on AMD Ryzen 7 5800X (i386):

Before:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz
 mult      | 648039 3143236 4850
 keygen    | 3554452 17244822 4852
 sign      | 1163173 5641932 4850
 verify    | 1300076 6305673 4850
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz
 mult      | 798607 3874405 4851
 keygen    | 4657604 22589864 4850
 sign      | 1515803 7352049 4850
 verify    | 1635470 7935373 4852
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz
 mult      | 927033 4496283 4850
 keygen    | 5313601 25771983 4850
 sign      | 1735795 8418514 4850
 verify    | 1945804 9438212 4851
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz
 mult      | 2301781 11164473 4850
 keygen    | 12856001 62353242 4850
 sign      | 4161041 20180651 4850
 verify    | 4705961 22827478 4851
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz
 mult      | 6066635 29422721 4850
 keygen    | 32995868 160046407 4850
 sign      | 10503306 50945387 4850
 verify    | 12225252 59294323 4850

After:
 NIST-P192 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 413605 2007498 4854 +57%
 keygen    | 2479429 12010926 4844 +44%
 sign      | 825111 3997147 4844 +41%
 verify    | 890206 4318723 4851 +46%
 =
 NIST-P224 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 551703 2676454 4851 +45%
 keygen    | 3257022 15781844 4845 +43%
 sign      | 1085678 5258894 4844 +40%
 verify    | 1172195 5678499 4844 +40%
 =
 NIST-P256 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 720395 3497486 4855 +29%
 keygen    | 4217758 20461257 4851 +26%
 sign      | 1404350 6814131 4852 +24%
 verify    | 1515136 7353955 4854 +28%
 =
 NIST-P384 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 1525742 7400771 4851 +51%
 keygen    | 9046660 43877889 4850 +42%
 sign      | 2974641 14408703 4844 +40%
 verify    | 3265285 15834951 4849 +44%
 =
 NIST-P521 | nanosecs/iter cycles/iter auto Mhz speed-up
 mult      | 3289348 15968678 4855 +84%
 keygen    | 19354174 93873531 4850 +70%
 sign      | 6351493 30830140 4854 +65%
 verify    | 6979292 33854215 4851 +75%

Signed-off-by: Jussi Kivilinna 
---
 configure.ac | 3 +
 mpi/Makefile.am    |    2 +-
 mpi/ec-inline.h    | 1047 ++++++++++++++++++++++++++++++++++++++++
 mpi/ec-internal.h  |   16 +
 mpi/ec-nist.c      |  795 +++++++++++++++++++++++++++++++++
 mpi/ec.c           |   90 +++-
 mpi/mpi-internal.h |    5 +
 mpi/mpiutil.c      |    2 +-
 src/ec-context.h   |    1 +
 9 files changed, 1943 insertions(+), 18 deletions(-)
 create mode 100644 mpi/ec-inline.h
 create mode 100644 mpi/ec-nist.c

diff --git a/configure.ac b/configure.ac
index 37947ecb..6fdca24a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -546,6 +546,9 @@ AC_ARG_ENABLE([asm],
               [try_asm_modules=$enableval],
               [try_asm_modules=yes])
 AC_MSG_RESULT($try_asm_modules)
+if test "$try_asm_modules" != yes ; then
+  AC_DEFINE(ASM_DISABLED,1,[Defined if --disable-asm was used to configure])
+fi
 
 # Implementation of the --enable-m-guard switch.
 AC_MSG_CHECKING([whether memory guard is requested])

diff --git a/mpi/Makefile.am b/mpi/Makefile.am
index d06594e1..adb8e6f5 100644
--- a/mpi/Makefile.am
+++ b/mpi/Makefile.am
@@ -175,5 +175,5 @@ libmpi_la_SOURCES = longlong.h \
 	mpih-mul.c \
 	mpih-const-time.c \
 	mpiutil.c \
-	ec.c ec-internal.h ec-ed25519.c
+	ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h
 EXTRA_libmpi_la_SOURCES = asm-common-aarch64.h

diff --git a/mpi/ec-inline.h b/mpi/ec-inline.h
new file mode 100644
index 00000000..25c3b40d
--- /dev/null
+++ b/mpi/ec-inline.h
@@ -0,0 +1,1047 @@
+/* ec-inline.h - EC inline addition/subtraction helpers
+ * Copyright (C) 2021 Jussi Kivilinna
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#ifndef GCRY_EC_INLINE_H +#define GCRY_EC_INLINE_H + +#include "mpi-internal.h" +#include "longlong.h" +#include "ec-context.h" +#include "../cipher/bithelp.h" +#include "../cipher/bufhelp.h" + + +#if BYTES_PER_MPI_LIMB == 8 + +/* 64-bit limb definitions for 64-bit architectures. */ + +#define LIMBS_PER_LIMB64 1 +#define LOAD64(x, pos) ((x)[pos]) +#define STORE64(x, pos, v) ((x)[pos] = (mpi_limb_t)(v)) +#define LIMB_TO64(v) ((mpi_limb_t)(v)) +#define LIMB_FROM64(v) ((mpi_limb_t)(v)) +#define HIBIT_LIMB64(v) ((mpi_limb_t)(v) >> (BITS_PER_MPI_LIMB - 1)) +#define HI32_LIMB64(v) (u32)((mpi_limb_t)(v) >> (BITS_PER_MPI_LIMB - 32)) +#define LO32_LIMB64(v) ((u32)(v)) +#define LIMB64_C(hi, lo) (((mpi_limb_t)(u32)(hi) << 32) | (u32)(lo)) +#define STORE64_COND(x, pos, mask1, val1, mask2, val2) \ + ((x)[(pos)] = ((mask1) & (val1)) | ((mask2) & (val2))) + +typedef mpi_limb_t mpi_limb64_t; + +static inline u32 +LOAD32(mpi_ptr_t x, unsigned int pos) +{ + unsigned int shr = (pos % 2) * 32; + return (x[pos / 2] >> shr); +} + +static inline mpi_limb64_t +LIMB64_HILO(u32 hi, u32 lo) +{ + mpi_limb64_t v = hi; + return (v << 32) | lo; +} + + +/* x86-64 addition/subtraction helpers. 
*/ +#if defined (__x86_64__) && defined(HAVE_CPU_ARCH_X86) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("addq %8, %2\n" \ + "adcq %7, %1\n" \ + "adcq %6, %0\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A3, A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subq %8, %2\n" \ + "sbbq %7, %1\n" \ + "sbbq %6, %0\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("addq %11, %3\n" \ + "adcq %10, %2\n" \ + "adcq %9, %1\n" \ + "adcq %8, %0\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subq %11, %3\n" \ + "sbbq %10, %2\n" \ + "sbbq %9, %1\n" \ + "sbbq %8, %0\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("addq %14, %4\n" \ + "adcq %13, %3\n" \ + "adcq %12, %2\n" \ + "adcq %11, %1\n" \ + "adcq %10, %0\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" 
(A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C4)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subq %14, %4\n" \ + "sbbq %13, %3\n" \ + "sbbq %12, %2\n" \ + "sbbq %11, %1\n" \ + "sbbq %10, %0\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "g" ((mpi_limb_t)(C4)), \ + "g" ((mpi_limb_t)(C3)), \ + "g" ((mpi_limb_t)(C2)), \ + "g" ((mpi_limb_t)(C1)), \ + "g" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __x86_64__ */ + + +/* ARM AArch64 addition/subtraction helpers. */ +#if defined (__aarch64__) && defined(HAVE_CPU_ARCH_ARM) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("adds %2, %5, %8\n" \ + "adcs %1, %4, %7\n" \ + "adc %0, %3, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subs %2, %5, %8\n" \ + "sbcs %1, %4, %7\n" \ + "sbc %0, %3, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("adds %3, %7, %11\n" \ + "adcs %2, %6, %10\n" \ + "adcs %1, %5, %9\n" \ + "adc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" 
(A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subs %3, %7, %11\n" \ + "sbcs %2, %6, %10\n" \ + "sbcs %1, %5, %9\n" \ + "sbc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("adds %4, %9, %14\n" \ + "adcs %3, %8, %13\n" \ + "adcs %2, %7, %12\n" \ + "adcs %1, %6, %11\n" \ + "adc %0, %5, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subs %4, %9, %14\n" \ + "sbcs %3, %8, %13\n" \ + "sbcs %2, %7, %12\n" \ + "sbcs %1, %6, %11\n" \ + "sbc %0, %5, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __aarch64__ */ + + +/* PowerPC64 addition/subtraction helpers. 
*/ +#if defined (__powerpc__) && defined(HAVE_CPU_ARCH_PPC) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("addc %2, %8, %5\n" \ + "adde %1, %7, %4\n" \ + "adde %0, %6, %3\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc", "r0") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("subfc %2, %8, %5\n" \ + "subfe %1, %7, %4\n" \ + "subfe %0, %6, %3\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc", "r0") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("addc %3, %11, %7\n" \ + "adde %2, %10, %6\n" \ + "adde %1, %9, %5\n" \ + "adde %0, %8, %4\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subfc %3, %11, %7\n" \ + "subfe %2, %10, %6\n" \ + "subfe %1, %9, %5\n" \ + "subfe %0, %8, %4\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("addc %4, %14, %9\n" \ + "adde %3, %13, %8\n" \ + "adde %2, %12, %7\n" \ + "adde %1, %11, %6\n" \ + "adde 
%0, %10, %5\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("subfc %4, %14, %9\n" \ + "subfe %3, %13, %8\n" \ + "subfe %2, %12, %7\n" \ + "subfe %1, %11, %6\n" \ + "subfe %0, %10, %5\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B4)), \ + "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __powerpc__ */ + + +/* s390x/zSeries addition/subtraction helpers. 
*/ +#if defined (__s390x__) && defined(HAVE_CPU_ARCH_S390X) && __GNUC__ >= 4 + +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("algr %2, %8\n" \ + "alcgr %1, %7\n" \ + "alcgr %0, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + __asm__ ("slgr %2, %8\n" \ + "slbgr %1, %7\n" \ + "slbgr %0, %6\n" \ + : "=r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B2)), \ + "1" ((mpi_limb_t)(B1)), \ + "2" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("algr %3, %11\n" \ + "alcgr %2, %10\n" \ + "alcgr %1, %9\n" \ + "alcgr %0, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("slgr %3, %11\n" \ + "slbgr %2, %10\n" \ + "slbgr %1, %9\n" \ + "slbgr %0, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B3)), \ + "1" ((mpi_limb_t)(B2)), \ + "2" ((mpi_limb_t)(B1)), \ + "3" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("algr %4, %14\n" \ + "alcgr %3, %13\n" \ + "alcgr %2, %12\n" \ + "alcgr %1, %11\n" \ + "alcgr %0, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r"
(A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) \ + __asm__ ("slgr %4, %14\n" \ + "slbgr %3, %13\n" \ + "slbgr %2, %12\n" \ + "slbgr %1, %11\n" \ + "slbgr %0, %10\n" \ + : "=r" (A4), \ + "=&r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "0" ((mpi_limb_t)(B4)), \ + "1" ((mpi_limb_t)(B3)), \ + "2" ((mpi_limb_t)(B2)), \ + "3" ((mpi_limb_t)(B1)), \ + "4" ((mpi_limb_t)(B0)), \ + "r" ((mpi_limb_t)(C4)), \ + "r" ((mpi_limb_t)(C3)), \ + "r" ((mpi_limb_t)(C2)), \ + "r" ((mpi_limb_t)(C1)), \ + "r" ((mpi_limb_t)(C0)) \ + : "cc") + +#endif /* __s390x__ */ + + +/* Common 64-bit arch addition/subtraction macros. */ + +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) \ + add_ssaaaa(A1, A0, B1, B0, C1, C0) + +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) \ + sub_ddmmss(A1, A0, B1, B0, C1, C0) + +#endif /* BYTES_PER_MPI_LIMB == 8 */ + + +#if BYTES_PER_MPI_LIMB == 4 + +/* 64-bit limb definitions for 32-bit architectures. 
*/ + +#define LIMBS_PER_LIMB64 2 +#define LIMB_FROM64(v) ((v).lo) +#define HIBIT_LIMB64(v) ((v).hi >> (BITS_PER_MPI_LIMB - 1)) +#define HI32_LIMB64(v) ((v).hi) +#define LO32_LIMB64(v) ((v).lo) +#define LOAD32(x, pos) ((x)[pos]) +#define LIMB64_C(hi, lo) { (lo), (hi) } + +typedef struct +{ + mpi_limb_t lo; + mpi_limb_t hi; +} mpi_limb64_t; + +static inline mpi_limb64_t +LOAD64(const mpi_ptr_t x, unsigned int pos) +{ + mpi_limb64_t v; + v.lo = x[pos * 2 + 0]; + v.hi = x[pos * 2 + 1]; + return v; +} + +static inline void +STORE64(mpi_ptr_t x, unsigned int pos, mpi_limb64_t v) +{ + x[pos * 2 + 0] = v.lo; + x[pos * 2 + 1] = v.hi; +} + +static inline void +STORE64_COND(mpi_ptr_t x, unsigned int pos, mpi_limb_t mask1, + mpi_limb64_t val1, mpi_limb_t mask2, mpi_limb64_t val2) +{ + x[pos * 2 + 0] = (mask1 & val1.lo) | (mask2 & val2.lo); + x[pos * 2 + 1] = (mask1 & val1.hi) | (mask2 & val2.hi); +} + +static inline mpi_limb64_t +LIMB_TO64(mpi_limb_t x) +{ + mpi_limb64_t v; + v.lo = x; + v.hi = 0; + return v; +} + +static inline mpi_limb64_t +LIMB64_HILO(mpi_limb_t hi, mpi_limb_t lo) +{ + mpi_limb64_t v; + v.lo = lo; + v.hi = hi; + return v; +} + + +/* i386 addition/subtraction helpers. 
*/ +#if defined (__i386__) && defined(HAVE_CPU_ARCH_X86) && __GNUC__ >= 4 + +#define ADD4_LIMB32(a3, a2, a1, a0, b3, b2, b1, b0, c3, c2, c1, c0) \ + __asm__ ("addl %11, %3\n" \ + "adcl %10, %2\n" \ + "adcl %9, %1\n" \ + "adcl %8, %0\n" \ + : "=r" (a3), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(b3)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c3)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc") + +#define ADD6_LIMB32(a5, a4, a3, a2, a1, a0, b5, b4, b3, b2, b1, b0, \ + c5, c4, c3, c2, c1, c0) do { \ + mpi_limb_t __carry6_32; \ + __asm__ ("addl %10, %3\n" \ + "adcl %9, %2\n" \ + "adcl %8, %1\n" \ + "sbbl %0, %0\n" \ + : "=r" (__carry6_32), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(0)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc"); \ + __asm__ ("addl $1, %3\n" \ + "adcl %10, %2\n" \ + "adcl %9, %1\n" \ + "adcl %8, %0\n" \ + : "=r" (a5), \ + "=&r" (a4), \ + "=&r" (a3), \ + "=&r" (__carry6_32) \ + : "0" ((mpi_limb_t)(b5)), \ + "1" ((mpi_limb_t)(b4)), \ + "2" ((mpi_limb_t)(b3)), \ + "3" ((mpi_limb_t)(__carry6_32)), \ + "g" ((mpi_limb_t)(c5)), \ + "g" ((mpi_limb_t)(c4)), \ + "g" ((mpi_limb_t)(c3)) \ + : "cc"); \ + } while (0) + +#define SUB4_LIMB32(a3, a2, a1, a0, b3, b2, b1, b0, c3, c2, c1, c0) \ + __asm__ ("subl %11, %3\n" \ + "sbbl %10, %2\n" \ + "sbbl %9, %1\n" \ + "sbbl %8, %0\n" \ + : "=r" (a3), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(b3)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c3)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc") + +#define SUB6_LIMB32(a5, a4, a3, a2, a1, a0, b5, b4, b3, b2, b1, b0, \ + c5, c4, c3, c2, c1, 
c0) do { \ + mpi_limb_t __borrow6_32; \ + __asm__ ("subl %10, %3\n" \ + "sbbl %9, %2\n" \ + "sbbl %8, %1\n" \ + "sbbl %0, %0\n" \ + : "=r" (__borrow6_32), \ + "=&r" (a2), \ + "=&r" (a1), \ + "=&r" (a0) \ + : "0" ((mpi_limb_t)(0)), \ + "1" ((mpi_limb_t)(b2)), \ + "2" ((mpi_limb_t)(b1)), \ + "3" ((mpi_limb_t)(b0)), \ + "g" ((mpi_limb_t)(c2)), \ + "g" ((mpi_limb_t)(c1)), \ + "g" ((mpi_limb_t)(c0)) \ + : "cc"); \ + __asm__ ("addl $1, %3\n" \ + "sbbl %10, %2\n" \ + "sbbl %9, %1\n" \ + "sbbl %8, %0\n" \ + : "=r" (a5), \ + "=&r" (a4), \ + "=&r" (a3), \ + "=&r" (__borrow6_32) \ + : "0" ((mpi_limb_t)(b5)), \ + "1" ((mpi_limb_t)(b4)), \ + "2" ((mpi_limb_t)(b3)), \ + "3" ((mpi_limb_t)(__borrow6_32)), \ + "g" ((mpi_limb_t)(c5)), \ + "g" ((mpi_limb_t)(c4)), \ + "g" ((mpi_limb_t)(c3)) \ + : "cc"); \ + } while (0) + +#endif /* __i386__ */ + + +/* ARM addition/subtraction helpers. */ +#ifdef HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS + +#define ADD4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("adds %3, %7, %11\n" \ + "adcs %2, %6, %10\n" \ + "adcs %1, %5, %9\n" \ + "adc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C3)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc") + +#define ADD6_LIMB32(A5, A4, A3, A2, A1, A0, B5, B4, B3, B2, B1, B0, \ + C5, C4, C3, C2, C1, C0) do { \ + mpi_limb_t __carry6_32; \ + __asm__ ("adds %3, %7, %10\n" \ + "adcs %2, %6, %9\n" \ + "adcs %1, %5, %8\n" \ + "adc %0, %4, %4\n" \ + : "=r" (__carry6_32), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(0)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc"); \ + ADD4_LIMB32(A5, A4, A3, __carry6_32, B5, B4, B3, __carry6_32, \ + 
C5, C4, C3, 0xffffffffU); \ + } while (0) + +#define SUB4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \ + __asm__ ("subs %3, %7, %11\n" \ + "sbcs %2, %6, %10\n" \ + "sbcs %1, %5, %9\n" \ + "sbc %0, %4, %8\n" \ + : "=r" (A3), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(B3)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C3)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc") + + +#define SUB6_LIMB32(A5, A4, A3, A2, A1, A0, B5, B4, B3, B2, B1, B0, \ + C5, C4, C3, C2, C1, C0) do { \ + mpi_limb_t __borrow6_32; \ + __asm__ ("subs %3, %7, %10\n" \ + "sbcs %2, %6, %9\n" \ + "sbcs %1, %5, %8\n" \ + "sbc %0, %4, %4\n" \ + : "=r" (__borrow6_32), \ + "=&r" (A2), \ + "=&r" (A1), \ + "=&r" (A0) \ + : "r" ((mpi_limb_t)(0)), \ + "r" ((mpi_limb_t)(B2)), \ + "r" ((mpi_limb_t)(B1)), \ + "r" ((mpi_limb_t)(B0)), \ + "Ir" ((mpi_limb_t)(C2)), \ + "Ir" ((mpi_limb_t)(C1)), \ + "Ir" ((mpi_limb_t)(C0)) \ + : "cc"); \ + SUB4_LIMB32(A5, A4, A3, __borrow6_32, B5, B4, B3, 0, \ + C5, C4, C3, -__borrow6_32); \ + } while (0) + +#endif /* HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS */ + + +/* Common 32-bit arch addition/subtraction macros. 
*/ + +#if defined(ADD4_LIMB32) +/* A[0..1] = B[0..1] + C[0..1] */ +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) \ + ADD4_LIMB32(A1.hi, A1.lo, A0.hi, A0.lo, \ + B1.hi, B1.lo, B0.hi, B0.lo, \ + C1.hi, C1.lo, C0.hi, C0.lo) +#else +/* A[0..1] = B[0..1] + C[0..1] */ +#define ADD2_LIMB64(A1, A0, B1, B0, C1, C0) do { \ + mpi_limb_t __carry2_0, __carry2_1; \ + add_ssaaaa(__carry2_0, A0.lo, 0, B0.lo, 0, C0.lo); \ + add_ssaaaa(__carry2_1, A0.hi, 0, B0.hi, 0, C0.hi); \ + add_ssaaaa(__carry2_1, A0.hi, __carry2_1, A0.hi, 0, __carry2_0); \ + add_ssaaaa(A1.hi, A1.lo, B1.hi, B1.lo, C1.hi, C1.lo); \ + add_ssaaaa(A1.hi, A1.lo, A1.hi, A1.lo, 0, __carry2_1); \ + } while (0) +#endif + +#if defined(ADD6_LIMB32) +/* A[0..2] = B[0..2] + C[0..2] */ +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + ADD6_LIMB32(A2.hi, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + B2.hi, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + C2.hi, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo) +#endif + +#if defined(ADD6_LIMB32) +/* A[0..3] = B[0..3] + C[0..3] */ +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb_t __carry4; \ + ADD6_LIMB32(__carry4, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + 0, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + 0, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo); \ + ADD4_LIMB32(A3.hi, A3.lo, A2.hi, __carry4, \ + B3.hi, B3.lo, B2.hi, __carry4, \ + C3.hi, C3.lo, C2.hi, 0xffffffffU); \ + } while (0) +#endif + +#if defined(SUB4_LIMB32) +/* A[0..1] = B[0..1] - C[0..1] */ +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) \ + SUB4_LIMB32(A1.hi, A1.lo, A0.hi, A0.lo, \ + B1.hi, B1.lo, B0.hi, B0.lo, \ + C1.hi, C1.lo, C0.hi, C0.lo) +#else +/* A[0..1] = B[0..1] - C[0..1] */ +#define SUB2_LIMB64(A1, A0, B1, B0, C1, C0) do { \ + mpi_limb_t __borrow2_0, __borrow2_1; \ + sub_ddmmss(__borrow2_0, A0.lo, 0, B0.lo, 0, C0.lo); \ + sub_ddmmss(__borrow2_1, A0.hi, 0, B0.hi, 0, C0.hi); \ + sub_ddmmss(__borrow2_1, A0.hi, __borrow2_1, A0.hi, 0, -__borrow2_0); \ + sub_ddmmss(A1.hi, A1.lo, B1.hi, B1.lo, C1.hi, C1.lo); \ + 
sub_ddmmss(A1.hi, A1.lo, A1.hi, A1.lo, 0, -__borrow2_1); \ + } while (0) +#endif + +#if defined(SUB6_LIMB32) +/* A[0..2] = B[0..2] - C[0..2] */ +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) \ + SUB6_LIMB32(A2.hi, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + B2.hi, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + C2.hi, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo) +#endif + +#if defined(SUB6_LIMB32) +/* A[0..3] = B[0..3] - C[0..3] */ +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb_t __borrow4; \ + SUB6_LIMB32(__borrow4, A2.lo, A1.hi, A1.lo, A0.hi, A0.lo, \ + 0, B2.lo, B1.hi, B1.lo, B0.hi, B0.lo, \ + 0, C2.lo, C1.hi, C1.lo, C0.hi, C0.lo); \ + SUB4_LIMB32(A3.hi, A3.lo, A2.hi, __borrow4, \ + B3.hi, B3.lo, B2.hi, 0, \ + C3.hi, C3.lo, C2.hi, -__borrow4); \ + } while (0) +#endif + +#endif /* BYTES_PER_MPI_LIMB == 4 */ + + +/* Common definitions. */ +#define BITS_PER_MPI_LIMB64 (BITS_PER_MPI_LIMB * LIMBS_PER_LIMB64) +#define BYTES_PER_MPI_LIMB64 (BYTES_PER_MPI_LIMB * LIMBS_PER_LIMB64) + + +/* Common addition/subtraction macros. 
*/ + +#ifndef ADD3_LIMB64 +/* A[0..2] = B[0..2] + C[0..2] */ +#define ADD3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) do { \ + mpi_limb64_t __carry3; \ + ADD2_LIMB64(__carry3, A0, zero, B0, zero, C0); \ + ADD2_LIMB64(A2, A1, B2, B1, C2, C1); \ + ADD2_LIMB64(A2, A1, A2, A1, zero, __carry3); \ + } while (0) +#endif + +#ifndef ADD4_LIMB64 +/* A[0..3] = B[0..3] + C[0..3] */ +#define ADD4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry4; \ + ADD3_LIMB64(__carry4, A1, A0, zero, B1, B0, zero, C1, C0); \ + ADD2_LIMB64(A3, A2, B3, B2, C3, C2); \ + ADD2_LIMB64(A3, A2, A3, A2, zero, __carry4); \ + } while (0) +#endif + +#ifndef ADD5_LIMB64 +/* A[0..4] = B[0..4] + C[0..4] */ +#define ADD5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry5; \ + ADD4_LIMB64(__carry5, A2, A1, A0, zero, B2, B1, B0, zero, C2, C1, C0); \ + ADD2_LIMB64(A4, A3, B4, B3, C4, C3); \ + ADD2_LIMB64(A4, A3, A4, A3, zero, __carry5); \ + } while (0) +#endif + +#ifndef ADD7_LIMB64 +/* A[0..6] = B[0..6] + C[0..6] */ +#define ADD7_LIMB64(A6, A5, A4, A3, A2, A1, A0, B6, B5, B4, B3, B2, B1, B0, \ + C6, C5, C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __carry7; \ + ADD4_LIMB64(__carry7, A2, A1, A0, zero, B2, B1, B0, \ + zero, C2, C1, C0); \ + ADD5_LIMB64(A6, A5, A4, A3, __carry7, B6, B5, B4, B3, \ + __carry7, C6, C5, C4, C3, LIMB64_HILO(-1, -1)); \ + } while (0) +#endif + +#ifndef SUB3_LIMB64 +/* A[0..2] = B[0..2] - C[0..2] */ +#define SUB3_LIMB64(A2, A1, A0, B2, B1, B0, C2, C1, C0) do { \ + mpi_limb64_t __borrow3; \ + SUB2_LIMB64(__borrow3, A0, zero, B0, zero, C0); \ + SUB2_LIMB64(A2, A1, B2, B1, C2, C1); \ + SUB2_LIMB64(A2, A1, A2, A1, zero, LIMB_TO64(-LIMB_FROM64(__borrow3))); \ + } while (0) +#endif + +#ifndef SUB4_LIMB64 +/* A[0..3] = B[0..3] - C[0..3] */ +#define SUB4_LIMB64(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow4; \ + SUB3_LIMB64(__borrow4, A1, A0, zero, B1, B0, zero, C1, C0); \ + 
SUB2_LIMB64(A3, A2, B3, B2, C3, C2); \ + SUB2_LIMB64(A3, A2, A3, A2, zero, LIMB_TO64(-LIMB_FROM64(__borrow4))); \ + } while (0) +#endif + +#ifndef SUB5_LIMB64 +/* A[0..4] = B[0..4] - C[0..4] */ +#define SUB5_LIMB64(A4, A3, A2, A1, A0, B4, B3, B2, B1, B0, \ + C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow5; \ + SUB4_LIMB64(__borrow5, A2, A1, A0, zero, B2, B1, B0, zero, C2, C1, C0); \ + SUB2_LIMB64(A4, A3, B4, B3, C4, C3); \ + SUB2_LIMB64(A4, A3, A4, A3, zero, LIMB_TO64(-LIMB_FROM64(__borrow5))); \ + } while (0) +#endif + +#ifndef SUB7_LIMB64 +/* A[0..6] = B[0..6] - C[0..6] */ +#define SUB7_LIMB64(A6, A5, A4, A3, A2, A1, A0, B6, B5, B4, B3, B2, B1, B0, \ + C6, C5, C4, C3, C2, C1, C0) do { \ + mpi_limb64_t __borrow7; \ + SUB4_LIMB64(__borrow7, A2, A1, A0, zero, B2, B1, B0, \ + zero, C2, C1, C0); \ + SUB5_LIMB64(A6, A5, A4, A3, __borrow7, B6, B5, B4, B3, zero, \ + C6, C5, C4, C3, LIMB_TO64(-LIMB_FROM64(__borrow7))); \ + } while (0) +#endif + + +#if defined(WORDS_BIGENDIAN) || (BITS_PER_MPI_LIMB64 != BITS_PER_MPI_LIMB) +#define LOAD64_UNALIGNED(x, pos) \ + LIMB64_HILO(LOAD32(x, 2 * (pos) + 2), LOAD32(x, 2 * (pos) + 1)) +#else +#define LOAD64_UNALIGNED(x, pos) \ + buf_get_le64((const byte *)(&(x)[pos]) + 4) +#endif + + +/* Helper functions. */ + +static inline int +mpi_nbits_more_than (gcry_mpi_t w, unsigned int nbits) +{ + unsigned int nbits_nlimbs; + mpi_limb_t wlimb; + unsigned int n; + + nbits_nlimbs = (nbits + BITS_PER_MPI_LIMB - 1) / BITS_PER_MPI_LIMB; + + /* Note: Assumes that 'w' is normalized. 
*/ + + if (w->nlimbs > nbits_nlimbs) + return 1; + if (w->nlimbs < nbits_nlimbs) + return 0; + if ((nbits % BITS_PER_MPI_LIMB) == 0) + return 0; + + wlimb = w->d[nbits_nlimbs - 1]; + if (wlimb == 0) + log_bug ("mpi_nbits_more_than: input mpi not normalized\n"); + + count_leading_zeros (n, wlimb); + + return (BITS_PER_MPI_LIMB - n) > (nbits % BITS_PER_MPI_LIMB); +} + +#endif /* GCRY_EC_INLINE_H */ diff --git a/mpi/ec-internal.h b/mpi/ec-internal.h index 759335aa..2296d55d 100644 --- a/mpi/ec-internal.h +++ b/mpi/ec-internal.h @@ -20,6 +20,22 @@ #ifndef GCRY_EC_INTERNAL_H #define GCRY_EC_INTERNAL_H +#include + void _gcry_mpi_ec_ed25519_mod (gcry_mpi_t a); +#ifndef ASM_DISABLED +void _gcry_mpi_ec_nist192_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist224_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist256_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist384_mod (gcry_mpi_t w, mpi_ec_t ctx); +void _gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx); +#else +# define _gcry_mpi_ec_nist192_mod NULL +# define _gcry_mpi_ec_nist224_mod NULL +# define _gcry_mpi_ec_nist256_mod NULL +# define _gcry_mpi_ec_nist384_mod NULL +# define _gcry_mpi_ec_nist521_mod NULL +#endif + #endif /*GCRY_EC_INTERNAL_H*/ diff --git a/mpi/ec-nist.c b/mpi/ec-nist.c new file mode 100644 index 00000000..955d2b7c --- /dev/null +++ b/mpi/ec-nist.c @@ -0,0 +1,795 @@ +/* ec-nist.c - NIST optimized elliptic curve functions + * Copyright (C) 2021 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include +#include +#include +#include + + +#ifndef ASM_DISABLED + + +#include "mpi-internal.h" +#include "longlong.h" +#include "g10lib.h" +#include "context.h" +#include "ec-context.h" +#include "ec-inline.h" + + +/* These variables are used to generate masks from conditional operation + * flag parameters. Use of volatile prevents compiler optimizations from + * converting AND-masking to conditional branches. */ +static volatile mpi_limb_t vzero = 0; +static volatile mpi_limb_t vone = 1; + + +static inline +void prefetch(const void *tab, size_t len) +{ + const volatile byte *vtab = tab; + + if (len > 0 * 64) + (void)vtab[0 * 64]; + if (len > 1 * 64) + (void)vtab[1 * 64]; + if (len > 2 * 64) + (void)vtab[2 * 64]; + if (len > 3 * 64) + (void)vtab[3 * 64]; + if (len > 4 * 64) + (void)vtab[4 * 64]; + if (len > 5 * 64) + (void)vtab[5 * 64]; + if (len > 6 * 64) + (void)vtab[6 * 64]; + if (len > 7 * 64) + (void)vtab[7 * 64]; + if (len > 8 * 64) + (void)vtab[8 * 64]; + if (len > 9 * 64) + (void)vtab[9 * 64]; + if (len > 10 * 64) + (void)vtab[10 * 64]; + (void)vtab[len - 1]; +} + + +/* Fast reduction routines for NIST curves. 
*/ + +void +_gcry_mpi_ec_nist192_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[3][4] = + { + { /* P * 1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xfffffffeU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0xffffffffU, 0xfffffffdU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0xffffffffU, 0xfffffffcU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0x00000002U) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = 192 / BITS_PER_MPI_LIMB64; + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t o[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 192)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.1 Curve P-192". */ + + s[0] = LOAD64(wp, 3); + ADD3_LIMB64 (s[3], s[2], s[1], + zero, zero, LOAD64(wp, 3), + zero, LOAD64(wp, 4), LOAD64(wp, 4)); + + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(wp, 5), LOAD64(wp, 5), LOAD64(wp, 5)); + + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(wp, 2), LOAD64(wp, 1), LOAD64(wp, 0)); + + /* mod p: + * 's[3]' holds carry value (0..2). Subtract (carry + 1) * p. Result will be + * within the range -p...p. Handle result being negative with addition and + * conditional store.
*/ + + carry = LO32_LIMB64(s[3]); + + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + p_mult[carry][3], p_mult[carry][2], + p_mult[carry][1], p_mult[carry][0]); + + ADD4_LIMB64 (o[3], o[2], o[1], o[0], + s[3], s[2], s[1], s[0], + zero, LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + mask1 = vzero - (LO32_LIMB64(o[3]) >> 31); + mask2 = (LO32_LIMB64(o[3]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, o[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, o[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, o[2], mask1, s[2]); + + w->nlimbs = 192 / BITS_PER_MPI_LIMB; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist224_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[5][4] = + { + { /* P * -1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xffffffffU, 0x00000000U) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU) + }, + { /* P * 2 */ + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000001U, 0xffffffffU) + }, + { /* P * 3 */ + LIMB64_C(0x00000000U, 0x00000003U), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000002U, 0xffffffffU) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (224 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; + mpi_limb64_t s[wsize]; + mpi_limb64_t d[wsize]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 224)) + log_bug ("W must be less than m^2\n"); + + 
RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.2 Curve P-224". */ + + /* "S1 + S2" with 64-bit limbs: + * [0:A10]:[ A9: A8]:[ A7:0]:[0:0] + * + [0:0]:[A13:A12]:[A11:0]:[0:0] + * => s[3]:s[2]:s[1]:s[0] + */ + s[0] = zero; + ADD3_LIMB64 (s[3], s[2], s[1], + LIMB64_HILO(0, LOAD32(wp, 10)), + LOAD64(wp, 8 / 2), + LIMB64_HILO(LOAD32(wp, 7), 0), + zero, + LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 11), 0)); + + /* "T + S1 + S2" */ + ADD4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + LIMB64_HILO(0, LOAD32(wp, 6)), + LOAD64(wp, 4 / 2), + LOAD64(wp, 2 / 2), + LOAD64(wp, 0 / 2)); + + /* "D1 + D2" with 64-bit limbs: + * [0:A13]:[A12:A11]:[A10: A9]:[ A8: A7] + * + [0:0]:[ 0: 0]:[ 0:A13]:[A12:A11] + * => d[3]:d[2]:d[1]:d[0] + */ + ADD4_LIMB64 (d[3], d[2], d[1], d[0], + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2), + LOAD64_UNALIGNED(wp, 9 / 2), + LOAD64_UNALIGNED(wp, 7 / 2), + zero, + zero, + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2)); + + /* "T + S1 + S2 - D1 - D2" */ + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + d[3], d[2], d[1], d[0]); + + /* mod p: + * Upper 32 bits of 's[3]' hold carry value (-2..2). + * Subtract (carry + 1) * p. Result will be within the range -p...p. + * Handle result being negative with addition and conditional store.
*/ + + carry = HI32_LIMB64(s[3]); + + SUB4_LIMB64 (s[3], s[2], s[1], s[0], + s[3], s[2], s[1], s[0], + p_mult[carry + 2][3], p_mult[carry + 2][2], + p_mult[carry + 2][1], p_mult[carry + 2][0]); + + ADD4_LIMB64 (d[3], d[2], d[1], d[0], + s[3], s[2], s[1], s[0], + LOAD64(pp, 3), LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (HI32_LIMB64(d[3]) >> 31); + mask2 = (HI32_LIMB64(d[3]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist256_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[11][5] = + { + { /* P * -3 */ + LIMB64_C(0x00000000U, 0x00000003U), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000002U, 0xfffffffcU), + LIMB64_C(0xffffffffU, 0xfffffffdU) + }, + { /* P * -2 */ + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000001U, 0xfffffffdU), + LIMB64_C(0xffffffffU, 0xfffffffeU) + }, + { /* P * -1 */ + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xfffffffeU), + LIMB64_C(0xffffffffU, 0xffffffffU) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xffffffffU, 0x00000001U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0x00000001U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffeU, 
0x00000002U), + LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0x00000002U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffdU, 0x00000003U), + LIMB64_C(0x00000000U, 0x00000002U) + }, + { /* P * 4 */ + LIMB64_C(0xffffffffU, 0xfffffffcU), LIMB64_C(0x00000003U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffcU, 0x00000004U), + LIMB64_C(0x00000000U, 0x00000003U) + }, + { /* P * 5 */ + LIMB64_C(0xffffffffU, 0xfffffffbU), LIMB64_C(0x00000004U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffbU, 0x00000005U), + LIMB64_C(0x00000000U, 0x00000004U) + }, + { /* P * 6 */ + LIMB64_C(0xffffffffU, 0xfffffffaU), LIMB64_C(0x00000005U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffffaU, 0x00000006U), + LIMB64_C(0x00000000U, 0x00000005U) + }, + { /* P * 7 */ + LIMB64_C(0xffffffffU, 0xfffffff9U), LIMB64_C(0x00000006U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0xfffffff9U, 0x00000007U), + LIMB64_C(0x00000000U, 0x00000006U) + } + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (256 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t t[wsize + 1]; + mpi_limb64_t d[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 256)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.3 Curve P-256". 
*/ + + /* "S1 + S2" with 64-bit limbs: + * [A15:A14]:[A13:A12]:[A11:0]:[0:0] + * + [0:A15]:[A14:A13]:[A12:0]:[0:0] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + s[0] = zero; + ADD4_LIMB64 (s[4], s[3], s[2], s[1], + zero, + LOAD64(wp, 14 / 2), + LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 11), 0), + zero, + LIMB64_HILO(0, LOAD32(wp, 15)), + LOAD64_UNALIGNED(wp, 13 / 2), + LIMB64_HILO(LOAD32(wp, 12), 0)); + + /* "S3 + S4" with 64-bit limbs: + * [A15:A14]:[ 0: 0]:[ 0:A10]:[ A9:A8] + * + [A8:A13]:[A15:A14]:[A13:A11]:[A10:A9] + * => t[4]:t[3]:t[2]:t[1]:t[0] + */ + ADD5_LIMB64 (t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 14 / 2), + zero, + LIMB64_HILO(0, LOAD32(wp, 10)), + LOAD64(wp, 8 / 2), + zero, + LIMB64_HILO(LOAD32(wp, 8), LOAD32(wp, 13)), + LOAD64(wp, 14 / 2), + LIMB64_HILO(LOAD32(wp, 13), LOAD32(wp, 11)), + LOAD64_UNALIGNED(wp, 9 / 2)); + + /* "2*S1 + 2*S2" */ + ADD5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0]); + + /* "T + S3 + S4" */ + ADD5_LIMB64 (t[4], t[3], t[2], t[1], t[0], + t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 6 / 2), + LOAD64(wp, 4 / 2), + LOAD64(wp, 2 / 2), + LOAD64(wp, 0 / 2)); + + /* "2*S1 + 2*S2 - D3" with 64-bit limbs: + * s[4]: s[3]: s[2]: s[1]: s[0] + * - [A12:0]:[A10:A9]:[A8:A15]:[A14:A13] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + zero, + LIMB64_HILO(LOAD32(wp, 12), 0), + LOAD64_UNALIGNED(wp, 9 / 2), + LIMB64_HILO(LOAD32(wp, 8), LOAD32(wp, 15)), + LOAD64_UNALIGNED(wp, 13 / 2)); + + /* "T + 2*S1 + 2*S2 + S3 + S4 - D3" */ + ADD5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + t[4], t[3], t[2], t[1], t[0]); + + /* "D1 + D2" with 64-bit limbs: + * [0:A13]:[A12:A11] + [A15:A14]:[A13:A12] => d[2]:d[1]:d[0] + * [A10:A8] + [A11:A9] => d[4]:d[3] + */ + ADD3_LIMB64 (d[2], d[1], d[0], + zero, + LIMB64_HILO(0, LOAD32(wp, 13)), + LOAD64_UNALIGNED(wp, 11 / 2), + zero, + LOAD64(wp, 14 / 2), + 
LOAD64(wp, 12 / 2)); + ADD2_LIMB64 (d[4], d[3], + zero, LIMB64_HILO(LOAD32(wp, 10), LOAD32(wp, 8)), + zero, LIMB64_HILO(LOAD32(wp, 11), LOAD32(wp, 9))); + + /* "D1 + D2 + D4" with 64-bit limbs: + * d[4]: d[3]: d[2]: d[1]: d[0] + * - [A13:0]:[A11:A10]:[A9:0]:[A15:A14] + * => d[4]:d[3]:d[2]:d[1]:d[0] + */ + ADD5_LIMB64 (d[4], d[3], d[2], d[1], d[0], + d[4], d[3], d[2], d[1], d[0], + zero, + LIMB64_HILO(LOAD32(wp, 13), 0), + LOAD64(wp, 10 / 2), + LIMB64_HILO(LOAD32(wp, 9), 0), + LOAD64(wp, 14 / 2)); + + /* "T + 2*S1 + 2*S2 + S3 + S4 - D1 - D2 - D3 - D4" */ + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + d[4], d[3], d[2], d[1], d[0]); + + /* mod p: + * 's[4]' holds carry value (-4..6). Subtract (carry + 1) * p. Result + * will be within the range -p...p. Handle result being negative with + * addition and conditional store. */ + + carry = LO32_LIMB64(s[4]); + + SUB5_LIMB64 (s[4], s[3], s[2], s[1], s[0], + s[4], s[3], s[2], s[1], s[0], + p_mult[carry + 4][4], p_mult[carry + 4][3], + p_mult[carry + 4][2], p_mult[carry + 4][1], + p_mult[carry + 4][0]); + + ADD5_LIMB64 (d[4], d[3], d[2], d[1], d[0], + s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(pp, 3), LOAD64(pp, 2), LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (LO32_LIMB64(d[4]) >> 31); + mask2 = (LO32_LIMB64(d[4]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); +} + +void +_gcry_mpi_ec_nist384_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + static const mpi_limb64_t p_mult[11][7] = + { + { /* P * -2 */ + LIMB64_C(0xfffffffeU, 0x00000002U), LIMB64_C(0x00000001U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000002U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffeU) +
}, + { /* P * -1 */ + LIMB64_C(0xffffffffU, 0x00000001U), LIMB64_C(0x00000000U, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000001U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xffffffffU) + }, + { /* P * 0 */ + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U), LIMB64_C(0x00000000U, 0x00000000U), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 1 */ + LIMB64_C(0x00000000U, 0xffffffffU), LIMB64_C(0xffffffffU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffeU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000000U) + }, + { /* P * 2 */ + LIMB64_C(0x00000001U, 0xfffffffeU), LIMB64_C(0xfffffffeU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffdU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000001U) + }, + { /* P * 3 */ + LIMB64_C(0x00000002U, 0xfffffffdU), LIMB64_C(0xfffffffdU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffcU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000002U) + }, + { /* P * 4 */ + LIMB64_C(0x00000003U, 0xfffffffcU), LIMB64_C(0xfffffffcU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffbU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000003U) + }, + { /* P * 5 */ + LIMB64_C(0x00000004U, 0xfffffffbU), LIMB64_C(0xfffffffbU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffffaU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000004U) + }, + { /* P * 6 */ + 
LIMB64_C(0x00000005U, 0xfffffffaU), LIMB64_C(0xfffffffaU, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff9U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000005U) + }, + { /* P * 7 */ + LIMB64_C(0x00000006U, 0xfffffff9U), LIMB64_C(0xfffffff9U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff8U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000006U) + }, + { /* P * 8 */ + LIMB64_C(0x00000007U, 0xfffffff8U), LIMB64_C(0xfffffff8U, 0x00000000U), + LIMB64_C(0xffffffffU, 0xfffffff7U), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0xffffffffU, 0xffffffffU), LIMB64_C(0xffffffffU, 0xffffffffU), + LIMB64_C(0x00000000U, 0x00000007U) + }, + }; + const mpi_limb64_t zero = LIMB_TO64(0); + mpi_ptr_t wp; + mpi_ptr_t pp; + mpi_size_t wsize = (384 + BITS_PER_MPI_LIMB64 - 1) / BITS_PER_MPI_LIMB64; + mpi_size_t psize = ctx->p->nlimbs; +#if (BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB) && defined(WORDS_BIGENDIAN) + mpi_limb_t wp_shr32[wsize * LIMBS_PER_LIMB64]; +#endif + mpi_limb64_t s[wsize + 1]; + mpi_limb64_t t[wsize + 1]; + mpi_limb64_t d[wsize + 1]; + mpi_limb64_t x[wsize + 1]; + mpi_limb_t mask1; + mpi_limb_t mask2; + int carry; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 384)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2 * LIMBS_PER_LIMB64); + RESIZE_AND_CLEAR_IF_NEEDED (ctx->p, wsize * LIMBS_PER_LIMB64); + ctx->p->nlimbs = psize; + + pp = ctx->p->d; + wp = w->d; + + prefetch (p_mult, sizeof(p_mult)); + + /* See "FIPS 186-4, D.2.4 Curve P-384". 
*/ + +#if BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB +# ifdef WORDS_BIGENDIAN +# define LOAD64_SHR32(idx) LOAD64(wp_shr32, ((idx) / 2 - wsize)) + _gcry_mpih_rshift (wp_shr32, wp + 384 / BITS_PER_MPI_LIMB, + wsize * LIMBS_PER_LIMB64, 32); +# else +# define LOAD64_SHR32(idx) LOAD64_UNALIGNED(wp, idx / 2) +#endif +#else +# define LOAD64_SHR32(idx) LIMB64_HILO(LOAD32(wp, (idx) + 1), LOAD32(wp, idx)) +#endif + + /* "S1 + S1" with 64-bit limbs: + * [0:A23]:[A22:A21] + * + [0:A23]:[A22:A21] + * => s[3]:s[2] + */ + ADD2_LIMB64 (s[3], s[2], + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21), + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21)); + + /* "S5 + S6" with 64-bit limbs: + * [A23:A22]:[A21:A20]:[ 0:0]:[0: 0] + * + [ 0: 0]:[A23:A22]:[A21:0]:[0:A20] + * => x[4]:x[3]:x[2]:x[1]:x[0] + */ + x[0] = LIMB64_HILO(0, LOAD32(wp, 20)); + x[1] = LIMB64_HILO(LOAD32(wp, 21), 0); + ADD3_LIMB64 (x[4], x[3], x[2], + zero, LOAD64(wp, 22 / 2), LOAD64(wp, 20 / 2), + zero, zero, LOAD64(wp, 22 / 2)); + + /* "D2 + D3" with 64-bit limbs: + * [0:A23]:[A22:A21]:[A20:0] + * + [0:A23]:[A23:0]:[0:0] + * => d[2]:d[1]:d[0] + */ + d[0] = LIMB64_HILO(LOAD32(wp, 20), 0); + ADD2_LIMB64 (d[2], d[1], + LIMB64_HILO(0, LOAD32(wp, 23)), + LOAD64_SHR32(21), + LIMB64_HILO(0, LOAD32(wp, 23)), + LIMB64_HILO(LOAD32(wp, 23), 0)); + + /* "2*S1 + S5 + S6" with 64-bit limbs: + * s[4]:s[3]:s[2]:s[1]:s[0] + * + x[4]:x[3]:x[2]:x[1]:x[0] + * => s[4]:s[3]:s[2]:s[1]:s[0] + */ + s[0] = x[0]; + s[1] = x[1]; + ADD3_LIMB64(s[4], s[3], s[2], + zero, s[3], s[2], + x[4], x[3], x[2]); + + /* "T + S2" with 64-bit limbs: + * [A11:A10]:[ A9: A8]:[ A7: A6]:[ A5: A4]:[ A3: A2]:[ A1: A0] + * + [A23:A22]:[A21:A20]:[A19:A18]:[A17:A16]:[A15:A14]:[A13:A12] + * => t[6]:t[5]:t[4]:t[3]:t[2]:t[1]:t[0] + */ + ADD7_LIMB64 (t[6], t[5], t[4], t[3], t[2], t[1], t[0], + zero, + LOAD64(wp, 10 / 2), LOAD64(wp, 8 / 2), LOAD64(wp, 6 / 2), + LOAD64(wp, 4 / 2), LOAD64(wp, 2 / 2), LOAD64(wp, 0 / 2), + zero, + LOAD64(wp, 22 / 2), LOAD64(wp, 20 / 2), 
LOAD64(wp, 18 / 2), + LOAD64(wp, 16 / 2), LOAD64(wp, 14 / 2), LOAD64(wp, 12 / 2)); + + /* "2*S1 + S4 + S5 + S6" with 64-bit limbs: + * s[6]: s[5]: s[4]: s[3]: s[2]: s[1]: s[0] + * + [A19:A18]:[A17:A16]:[A15:A14]:[A13:A12]:[A20:0]:[A23:0] + * => s[6]:s[5]:s[4]:s[3]:s[2]:s[1]:s[0] + */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, zero, s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(wp, 18 / 2), LOAD64(wp, 16 / 2), + LOAD64(wp, 14 / 2), LOAD64(wp, 12 / 2), + LIMB64_HILO(LOAD32(wp, 20), 0), + LIMB64_HILO(LOAD32(wp, 23), 0)); + + /* "D1 + D2 + D3" with 64-bit limbs: + * d[6]: d[5]: d[4]: d[3]: d[2]: d[1]: d[0] + * + [A22:A21]:[A20:A19]:[A18:A17]:[A16:A15]:[A14:A13]:[A12:A23] + * => d[6]:d[5]:d[4]:d[3]:d[2]:d[1]:d[0] + */ + ADD7_LIMB64 (d[6], d[5], d[4], d[3], d[2], d[1], d[0], + zero, zero, zero, zero, d[2], d[1], d[0], + zero, + LOAD64_SHR32(21), + LOAD64_SHR32(19), + LOAD64_SHR32(17), + LOAD64_SHR32(15), + LOAD64_SHR32(13), + LIMB64_HILO(LOAD32(wp, 12), LOAD32(wp, 23))); + + /* "2*S1 + S3 + S4 + S5 + S6" with 64-bit limbs: + * s[6]: s[5]: s[4]: s[3]: s[2]: s[1]: s[0] + * + [A20:A19]:[A18:A17]:[A16:A15]:[A14:A13]:[A12:A23]:[A22:A21] + * => s[6]:s[5]:s[4]:s[3]:s[2]:s[1]:s[0] + */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64_SHR32(19), + LOAD64_SHR32(17), + LOAD64_SHR32(15), + LOAD64_SHR32(13), + LIMB64_HILO(LOAD32(wp, 12), LOAD32(wp, 23)), + LOAD64_SHR32(21)); + + /* "T + 2*S1 + S2 + S3 + S4 + S5 + S6" */ + ADD7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + t[6], t[5], t[4], t[3], t[2], t[1], t[0]); + + /* "T + 2*S1 + S2 + S3 + S4 + S5 + S6 - D1 - D2 - D3" */ + SUB7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + d[6], d[5], d[4], d[3], d[2], d[1], d[0]); + +#undef LOAD64_SHR32 + + /* mod p: + * 's[6]' holds carry value (-3..7). Subtract (carry + 1) * p. 
Result + * will be within the range -p...p. Handle result being negative with + * addition and conditional store. */ + + carry = LO32_LIMB64(s[6]); + + SUB7_LIMB64 (s[6], s[5], s[4], s[3], s[2], s[1], s[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + p_mult[carry + 3][6], p_mult[carry + 3][5], + p_mult[carry + 3][4], p_mult[carry + 3][3], + p_mult[carry + 3][2], p_mult[carry + 3][1], + p_mult[carry + 3][0]); + + ADD7_LIMB64 (d[6], d[5], d[4], d[3], d[2], d[1], d[0], + s[6], s[5], s[4], s[3], s[2], s[1], s[0], + zero, + LOAD64(pp, 5), LOAD64(pp, 4), + LOAD64(pp, 3), LOAD64(pp, 2), + LOAD64(pp, 1), LOAD64(pp, 0)); + + mask1 = vzero - (LO32_LIMB64(d[6]) >> 31); + mask2 = (LO32_LIMB64(d[6]) >> 31) - vone; + + STORE64_COND(wp, 0, mask2, d[0], mask1, s[0]); + STORE64_COND(wp, 1, mask2, d[1], mask1, s[1]); + STORE64_COND(wp, 2, mask2, d[2], mask1, s[2]); + STORE64_COND(wp, 3, mask2, d[3], mask1, s[3]); + STORE64_COND(wp, 4, mask2, d[4], mask1, s[4]); + STORE64_COND(wp, 5, mask2, d[5], mask1, s[5]); + + w->nlimbs = wsize * LIMBS_PER_LIMB64; + MPN_NORMALIZE (wp, w->nlimbs); + +#if (BITS_PER_MPI_LIMB64 == BITS_PER_MPI_LIMB) && defined(WORDS_BIGENDIAN) + wipememory(wp_shr32, sizeof(wp_shr32)); +#endif +} + +void +_gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx) +{ + mpi_size_t wsize = (521 + BITS_PER_MPI_LIMB - 1) / BITS_PER_MPI_LIMB; + mpi_limb_t s[wsize]; + mpi_limb_t cy; + mpi_ptr_t wp; + + MPN_NORMALIZE (w->d, w->nlimbs); + if (mpi_nbits_more_than (w, 2 * 521)) + log_bug ("W must be less than m^2\n"); + + RESIZE_AND_CLEAR_IF_NEEDED (w, wsize * 2); + + wp = w->d; + + /* See "FIPS 186-4, D.2.5 Curve P-521".
*/ + + _gcry_mpih_rshift (s, wp + wsize - 1, wsize, 521 % BITS_PER_MPI_LIMB); + s[wsize - 1] &= (1 << (521 % BITS_PER_MPI_LIMB)) - 1; + wp[wsize - 1] &= (1 << (521 % BITS_PER_MPI_LIMB)) - 1; + _gcry_mpih_add_n (wp, wp, s, wsize); + + /* "mod p" */ + cy = _gcry_mpih_sub_n (wp, wp, ctx->p->d, wsize); + _gcry_mpih_add_n (s, wp, ctx->p->d, wsize); + mpih_set_cond (wp, s, wsize, (cy != 0UL)); + + w->nlimbs = wsize; + MPN_NORMALIZE (wp, w->nlimbs); +} + +#endif /* !ASM_DISABLED */ diff --git a/mpi/ec.c b/mpi/ec.c index 4fabf9b4..ae1d036a 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -285,7 +285,7 @@ static void ec_addm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) { mpi_add (w, u, v); - ec_mod (w, ctx); + ctx->mod (w, ctx); } static void @@ -294,14 +294,14 @@ ec_subm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ec) mpi_sub (w, u, v); while (w->sign) mpi_add (w, w, ec->p); - /*ec_mod (w, ec);*/ + /*ctx->mod (w, ec);*/ } static void ec_mulm (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) { mpi_mul (w, u, v); - ec_mod (w, ctx); + ctx->mod (w, ctx); } /* W = 2 * U mod P. 
*/ @@ -309,7 +309,7 @@ static void ec_mul2 (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx) { mpi_lshift (w, u, 1); - ec_mod (w, ctx); + ctx->mod (w, ctx); } static void @@ -585,6 +585,7 @@ struct field_table { void (* mulm) (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx); void (* mul2) (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx); void (* pow2) (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx); + void (* mod) (gcry_mpi_t w, mpi_ec_t ctx); }; static const struct field_table field_table[] = { @@ -594,7 +595,8 @@ static const struct field_table field_table[] = { ec_subm_25519, ec_mulm_25519, ec_mul2_25519, - ec_pow2_25519 + ec_pow2_25519, + NULL }, { "0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE" @@ -603,7 +605,55 @@ static const struct field_table field_table[] = { ec_subm_448, ec_mulm_448, ec_mul2_448, - ec_pow2_448 + ec_pow2_448, + NULL + }, + { + "0xfffffffffffffffffffffffffffffffeffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist192_mod + }, + { + "0xffffffffffffffffffffffffffffffff000000000000000000000001", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist224_mod + }, + { + "0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist256_mod + }, + { + "0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe" + "ffffffff0000000000000000ffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist384_mod + }, + { + "0x01ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff" + "ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff", + NULL, + NULL, + NULL, + NULL, + NULL, + _gcry_mpi_ec_nist521_mod }, { NULL, NULL, NULL, NULL, NULL, NULL }, }; @@ -757,6 +807,7 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, ctx->mulm = ec_mulm; ctx->mul2 = ec_mul2; ctx->pow2 = ec_pow2; + ctx->mod = ec_mod; for (i=0; field_table[i].p; i++) { @@ -769,18 +820,25 @@ ec_p_init (mpi_ec_t ctx, enum 
gcry_mpi_ec_models model, if (!mpi_cmp (p, f_p)) { - ctx->addm = field_table[i].addm; - ctx->subm = field_table[i].subm; - ctx->mulm = field_table[i].mulm; - ctx->mul2 = field_table[i].mul2; - ctx->pow2 = field_table[i].pow2; + ctx->addm = field_table[i].addm ? field_table[i].addm : ctx->addm; + ctx->subm = field_table[i].subm ? field_table[i].subm : ctx->subm; + ctx->mulm = field_table[i].mulm ? field_table[i].mulm : ctx->mulm; + ctx->mul2 = field_table[i].mul2 ? field_table[i].mul2 : ctx->mul2; + ctx->pow2 = field_table[i].pow2 ? field_table[i].pow2 : ctx->pow2; + ctx->mod = field_table[i].mod ? field_table[i].mod : ctx->mod; _gcry_mpi_release (f_p); - mpi_resize (ctx->a, ctx->p->nlimbs); - ctx->a->nlimbs = ctx->p->nlimbs; - - mpi_resize (ctx->b, ctx->p->nlimbs); - ctx->b->nlimbs = ctx->p->nlimbs; + if (ctx->a) + { + mpi_resize (ctx->a, ctx->p->nlimbs); + ctx->a->nlimbs = ctx->p->nlimbs; + } + + if (ctx->b) + { + mpi_resize (ctx->b, ctx->p->nlimbs); + ctx->b->nlimbs = ctx->p->nlimbs; + } for (i=0; i< DIM(ctx->t.scratch) && ctx->t.scratch[i]; i++) ctx->t.scratch[i]->nlimbs = ctx->p->nlimbs; diff --git a/mpi/mpi-internal.h b/mpi/mpi-internal.h index 8ccdeada..11fcbde4 100644 --- a/mpi/mpi-internal.h +++ b/mpi/mpi-internal.h @@ -79,6 +79,11 @@ typedef int mpi_size_t; /* (must be a signed type) */ if( (a)->alloced < (b) ) \ mpi_resize((a), (b)); \ } while(0) +#define RESIZE_AND_CLEAR_IF_NEEDED(a,b) \ + do { \ + if( (a)->nlimbs < (b) ) \ + mpi_resize((a), (b)); \ + } while(0) /* Copy N limbs from S to D. 
*/ #define MPN_COPY( d, s, n) \ diff --git a/mpi/mpiutil.c b/mpi/mpiutil.c index 5320f4d8..a5583c79 100644 --- a/mpi/mpiutil.c +++ b/mpi/mpiutil.c @@ -197,7 +197,7 @@ _gcry_mpi_resize (gcry_mpi_t a, unsigned nlimbs) if (a->d) { a->d = xrealloc (a->d, nlimbs * sizeof (mpi_limb_t)); - for (i=a->alloced; i < nlimbs; i++) + for (i=a->nlimbs; i < nlimbs; i++) a->d[i] = 0; } else diff --git a/src/ec-context.h b/src/ec-context.h index d1c64804..479862f6 100644 --- a/src/ec-context.h +++ b/src/ec-context.h @@ -74,6 +74,7 @@ struct mpi_ec_ctx_s void (* mulm) (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx); void (* pow2) (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx); void (* mul2) (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx); + void (* mod) (gcry_mpi_t w, mpi_ec_t ctx); }; -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:11 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:11 +0300 Subject: [PATCH 1/4] mpi_ec_get_affine: fast path for Z==1 case Message-ID: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> * mpi/ec.c (_gcry_mpi_ec_get_affine): Return X and Y as is if Z is 1 (for Weierstrass and Edwards curves). 
-- Signed-off-by: Jussi Kivilinna --- mpi/ec.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/mpi/ec.c b/mpi/ec.c index 29b2ce30..e25d9d8a 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -1117,6 +1117,15 @@ _gcry_mpi_ec_get_affine (gcry_mpi_t x, gcry_mpi_t y, mpi_point_t point, { gcry_mpi_t z1, z2, z3; + if (!mpi_cmp_ui (point->z, 1)) + { + if (x) + mpi_set (x, point->x); + if (y) + mpi_set (y, point->y); + return 0; + } + z1 = mpi_new (0); z2 = mpi_new (0); ec_invm (z1, point->z, ctx); /* z1 = z^(-1) mod p */ @@ -1156,6 +1165,15 @@ _gcry_mpi_ec_get_affine (gcry_mpi_t x, gcry_mpi_t y, mpi_point_t point, { gcry_mpi_t z; + if (!mpi_cmp_ui (point->z, 1)) + { + if (x) + mpi_set (x, point->x); + if (y) + mpi_set (y, point->y); + return 0; + } + z = mpi_new (0); ec_invm (z, point->z, ctx); -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:14 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:14 +0300 Subject: [PATCH 4/4] bench-slope: add X25519 and X448 scalar multiplication In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-4-jussi.kivilinna@iki.fi> * tests/bench-slope.c (ECC_ALGO_X25519, ECC_ALGO_X448): New. (ecc_algo_name, ecc_algo_curve, ecc_nbits): Add X25519 and X448. (bench_ecc_mult_do_bench): Pass Y as NULL to ec_get_affine with X25519 and X448. (cipher_ecc_one): Run only multiplication bench for X25519 and X448. 
-- Signed-off-by: Jussi Kivilinna --- tests/bench-slope.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/tests/bench-slope.c b/tests/bench-slope.c index 9b4a139a..35272094 100644 --- a/tests/bench-slope.c +++ b/tests/bench-slope.c @@ -2144,6 +2144,8 @@ enum bench_ecc_algo { ECC_ALGO_ED25519 = 0, ECC_ALGO_ED448, + ECC_ALGO_X25519, + ECC_ALGO_X448, ECC_ALGO_NIST_P192, ECC_ALGO_NIST_P224, ECC_ALGO_NIST_P256, @@ -2197,6 +2199,10 @@ ecc_algo_name (int algo) return "Ed25519"; case ECC_ALGO_ED448: return "Ed448"; + case ECC_ALGO_X25519: + return "X25519"; + case ECC_ALGO_X448: + return "X448"; case ECC_ALGO_NIST_P192: return "NIST-P192"; case ECC_ALGO_NIST_P224: @@ -2223,6 +2229,10 @@ ecc_algo_curve (int algo) return "Ed25519"; case ECC_ALGO_ED448: return "Ed448"; + case ECC_ALGO_X25519: + return "Curve25519"; + case ECC_ALGO_X448: + return "X448"; case ECC_ALGO_NIST_P192: return "NIST P-192"; case ECC_ALGO_NIST_P224: @@ -2249,6 +2259,10 @@ ecc_nbits (int algo) return 255; case ECC_ALGO_ED448: return 448; + case ECC_ALGO_X25519: + return 255; + case ECC_ALGO_X448: + return 448; case ECC_ALGO_NIST_P192: return 192; case ECC_ALGO_NIST_P224: @@ -2355,15 +2369,26 @@ bench_ecc_mult_free (struct bench_obj *obj) static void bench_ecc_mult_do_bench (struct bench_obj *obj, void *buf, size_t num_iter) { + struct bench_ecc_oper *oper = obj->priv; struct bench_ecc_mult_hd *hd = obj->hd; + gcry_mpi_t y; size_t i; (void)buf; + if (oper->algo == ECC_ALGO_X25519 || oper->algo == ECC_ALGO_X448) + { + y = NULL; + } + else + { + y = hd->y; + } + for (i = 0; i < num_iter; i++) { gcry_mpi_ec_mul (hd->Q, hd->k, hd->G, hd->ec); - if (gcry_mpi_ec_get_affine (hd->x, hd->y, hd->Q, hd->ec)) + if (gcry_mpi_ec_get_affine (hd->x, y, hd->Q, hd->ec)) { fprintf (stderr, PGM ": gcry_mpi_ec_get_affine failed\n"); exit (1); @@ -2634,7 +2659,8 @@ cipher_ecc_one (enum bench_ecc_algo algo, struct bench_ecc_oper *poper) struct bench_obj obj = { 0 }; double 
result; - if (algo == ECC_ALGO_SECP256K1 && oper.oper != ECC_OPER_MULT) + if ((algo == ECC_ALGO_X25519 || algo == ECC_ALGO_X448 || + algo == ECC_ALGO_SECP256K1) && oper.oper != ECC_OPER_MULT) return; oper.algo = algo; -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:12 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:12 +0300 Subject: [PATCH 2/4] mpi/ec: cache converted field_table MPIs In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-2-jussi.kivilinna@iki.fi> * mpi/ec.c (field_table_mpis): New. (ec_p_init): Cache converted field table MPIs. -- Signed-off-by: Jussi Kivilinna --- mpi/ec.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/mpi/ec.c b/mpi/ec.c index e25d9d8a..029099b4 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -719,7 +719,10 @@ static const struct field_table field_table[] = { }, { NULL, NULL, NULL, NULL, NULL, NULL }, }; + +static gcry_mpi_t field_table_mpis[DIM(field_table)]; + /* Force recomputation of all helper variables. */ void _gcry_mpi_ec_get_reset (mpi_ec_t ec) @@ -876,9 +879,19 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, gcry_mpi_t f_p; gpg_err_code_t rc; - rc = _gcry_mpi_scan (&f_p, GCRYMPI_FMT_HEX, field_table[i].p, 0, NULL); - if (rc) - log_fatal ("scanning ECC parameter failed: %s\n", gpg_strerror (rc)); + if (field_table_mpis[i] == NULL) + { + rc = _gcry_mpi_scan (&f_p, GCRYMPI_FMT_HEX, field_table[i].p, 0, + NULL); + if (rc) + log_fatal ("scanning ECC parameter failed: %s\n", + gpg_strerror (rc)); + field_table_mpis[i] = f_p; /* cache */ + } + else + { + f_p = field_table_mpis[i]; + } if (!mpi_cmp (p, f_p)) { @@ -888,7 +901,6 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, ctx->mul2 = field_table[i].mul2 ? field_table[i].mul2 : ctx->mul2; ctx->pow2 = field_table[i].pow2 ? 
field_table[i].pow2 : ctx->pow2; ctx->mod = field_table[i].mod ? field_table[i].mod : ctx->mod; - _gcry_mpi_release (f_p); if (ctx->a) { @@ -907,8 +919,6 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, break; } - - _gcry_mpi_release (f_p); } /* Prepare for fast reduction. */ -- 2.30.2 From jussi.kivilinna at iki.fi Sun Jun 20 11:52:13 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 20 Jun 2021 12:52:13 +0300 Subject: [PATCH 3/4] mpi: optimizations for MPI scanning and printing In-Reply-To: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> References: <20210620095214.2352572-1-jussi.kivilinna@iki.fi> Message-ID: <20210620095214.2352572-3-jussi.kivilinna@iki.fi> * mpi/mpicoder.c (mpi_read_from_buffer): Add word-size buffer reading loop using 'buf_get_be(32|64)'. (mpi_fromstr): Use look-up tables for HEX conversion; Add fast-path loop for converting 8 hex-characters at once; Add string length parameter. (do_get_buffer): Use 'buf_put_be(32|64)' instead of byte writes; Add fast-path for reversing buffer with 'buf_get_(be64|be32|le64|le32)'. (_gcry_mpi_set_buffer): Use 'buf_get_be(32|64)' instead of byte reads. (twocompl): Use _gcry_ctz instead of open-coded if-clauses to get first bit set; Add fast-path for inverting buffer with 'buf_get_(he64|he32)'. (_gcry_mpi_scan): Use 'buf_get_be32' where possible; Provide string length to 'mpi_fromstr'. (_gcry_mpi_print): Use 'buf_put_be32' where possible; Use look-up table for HEX conversion; Add fast-path loop for converting to 8 hex-characters at once. * tests/t-convert.c (check_formats): Add new tests for larger values. 
-- Signed-off-by: Jussi Kivilinna --- mpi/mpicoder.c | 310 +++++++++++++++++--------- tests/t-convert.c | 538 ++++++++++++++++++++++++++++++---------------- 2 files changed, 561 insertions(+), 287 deletions(-) diff --git a/mpi/mpicoder.c b/mpi/mpicoder.c index f61f777f..830ee4e2 100644 --- a/mpi/mpicoder.c +++ b/mpi/mpicoder.c @@ -26,6 +26,7 @@ #include "mpi-internal.h" #include "g10lib.h" +#include "../cipher/bufhelp.h" /* The maximum length we support in the functions converting an * external representation to an MPI. This limit is used to catch @@ -51,8 +52,9 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, unsigned int nbits, nbytes, nlimbs, nread=0; mpi_limb_t a; gcry_mpi_t val = MPI_NULL; + unsigned int max_nread = *ret_nread; - if ( *ret_nread < 2 ) + if ( max_nread < 2 ) goto leave; nbits = buffer[0] << 8 | buffer[1]; if ( nbits > MAX_EXTERN_MPI_BITS ) @@ -73,9 +75,22 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, for ( ; j > 0; j-- ) { a = 0; + if (i == 0 && nread + BYTES_PER_MPI_LIMB <= max_nread) + { +#if BYTES_PER_MPI_LIMB == 4 + a = buf_get_be32 (buffer); +#elif BYTES_PER_MPI_LIMB == 8 + a = buf_get_be64 (buffer); +#else +# error please implement for this limb size. +#endif + buffer += BYTES_PER_MPI_LIMB; + nread += BYTES_PER_MPI_LIMB; + i += BYTES_PER_MPI_LIMB; + } for (; i < BYTES_PER_MPI_LIMB; i++ ) { - if ( ++nread > *ret_nread ) + if ( ++nread > max_nread ) { /* log_debug ("mpi larger than buffer"); */ mpi_free (val); @@ -99,8 +114,45 @@ mpi_read_from_buffer (const unsigned char *buffer, unsigned *ret_nread, * Fill the mpi VAL from the hex string in STR. 
*/ static int -mpi_fromstr (gcry_mpi_t val, const char *str) +mpi_fromstr (gcry_mpi_t val, const char *str, size_t slen) { + static const int hex2int[2][256] = + { + { + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x00, 0x10, 0x20, 0x30, + 0x40, 0x50, 0x60, 0x70, 0x80, 0x90, -1, -1, -1, -1, -1, -1, -1, 0xa0, + 0xb0, 0xc0, 0xd0, 0xe0, 0xf0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xa0, + 0xb0, 0xc0, 0xd0, 0xe0, 0xf0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 + }, + { + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x00, 0x01, 0x02, 0x03, + 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, -1, -1, -1, -1, -1, -1, -1, 0x0a, + 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0x0a, + 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
-1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 + } + }; int sign = 0; int prepend_zero = 0; int i, j, c, c1, c2; @@ -111,19 +163,17 @@ mpi_fromstr (gcry_mpi_t val, const char *str) { sign = 1; str++; + slen--; } /* Skip optional hex prefix. */ if ( *str == '0' && str[1] == 'x' ) - str += 2; - - nbits = strlen (str); - if (nbits > MAX_EXTERN_SCAN_BYTES) { - mpi_clear (val); - return 1; /* Error. */ + str += 2; + slen -= 2; } - nbits *= 4; + + nbits = slen * 4; if ((nbits % 8)) prepend_zero = 1; @@ -140,6 +190,44 @@ mpi_fromstr (gcry_mpi_t val, const char *str) for (; j > 0; j--) { a = 0; + + if (prepend_zero == 0 && (i & 31) == 0) + { + while (slen >= sizeof(u32) * 2) + { + u32 n, m; + u32 x, y; + + x = buf_get_le32(str); + y = buf_get_le32(str + 4); + str += 8; + slen -= 8; + + a <<= 31; /* Two step to avoid compiler warning on 32-bit. */ + a <<= 1; + + n = (hex2int[0][(x >> 0) & 0xff] + | hex2int[1][(x >> 8) & 0xff]) << 8; + m = (hex2int[0][(y >> 0) & 0xff] + | hex2int[1][(y >> 8) & 0xff]) << 8; + n |= hex2int[0][(x >> 16) & 0xff]; + n |= hex2int[1][(x >> 24) & 0xff]; + m |= hex2int[0][(y >> 16) & 0xff]; + m |= hex2int[1][(y >> 24) & 0xff]; + + a |= (n << 16) | m; + i += 32; + if ((int)(n | m) < 0) + { + /* Invalid character. */ + mpi_clear (val); + return 1; /* Error. */ + } + if (i == BITS_PER_MPI_LIMB) + break; + } + } + for (; i < BYTES_PER_MPI_LIMB; i++) { if (prepend_zero) @@ -148,7 +236,10 @@ mpi_fromstr (gcry_mpi_t val, const char *str) prepend_zero = 0; } else - c1 = *str++; + { + c1 = *str++; + slen--; + } if (!c1) { @@ -156,30 +247,15 @@ mpi_fromstr (gcry_mpi_t val, const char *str) return 1; /* Error. 
*/ } c2 = *str++; + slen--; if (!c2) { mpi_clear (val); return 1; /* Error. */ } - if ( c1 >= '0' && c1 <= '9' ) - c = c1 - '0'; - else if ( c1 >= 'a' && c1 <= 'f' ) - c = c1 - 'a' + 10; - else if ( c1 >= 'A' && c1 <= 'F' ) - c = c1 - 'A' + 10; - else - { - mpi_clear (val); - return 1; /* Error. */ - } - c <<= 4; - if ( c2 >= '0' && c2 <= '9' ) - c |= c2 - '0'; - else if( c2 >= 'a' && c2 <= 'f' ) - c |= c2 - 'a' + 10; - else if( c2 >= 'A' && c2 <= 'F' ) - c |= c2 - 'A' + 10; - else + c = hex2int[0][c1 & 0xff]; + c |= hex2int[1][c2 & 0xff]; + if (c < 0) { mpi_clear(val); return 1; /* Error. */ @@ -248,19 +324,11 @@ do_get_buffer (gcry_mpi_t a, unsigned int fill_le, int extraalloc, { alimb = a->d[i]; #if BYTES_PER_MPI_LIMB == 4 - *p++ = alimb >> 24; - *p++ = alimb >> 16; - *p++ = alimb >> 8; - *p++ = alimb ; + buf_put_be32 (p, alimb); + p += 4; #elif BYTES_PER_MPI_LIMB == 8 - *p++ = alimb >> 56; - *p++ = alimb >> 48; - *p++ = alimb >> 40; - *p++ = alimb >> 32; - *p++ = alimb >> 24; - *p++ = alimb >> 16; - *p++ = alimb >> 8; - *p++ = alimb ; + buf_put_be64 (p, alimb); + p += 8; #else # error please implement for this limb size. #endif @@ -270,7 +338,22 @@ do_get_buffer (gcry_mpi_t a, unsigned int fill_le, int extraalloc, { length = *nbytes; /* Reverse buffer and pad with zeroes. 
*/ - for (i=0; i < length/2; i++) + for (i = 0; i + 8 < length / 2; i += 8) + { + u64 head = buf_get_be64 (buffer + i); + u64 tail = buf_get_be64 (buffer + length - 8 - i); + buf_put_le64 (buffer + length - 8 - i, head); + buf_put_le64 (buffer + i, tail); + } + if (i + 4 < length / 2) + { + u32 head = buf_get_be32 (buffer + i); + u32 tail = buf_get_be32 (buffer + length - 4 - i); + buf_put_le32 (buffer + length - 4 - i, head); + buf_put_le32 (buffer + i, tail); + i += 4; + } + for (; i < length/2; i++) { tmp = buffer[i]; buffer[i] = buffer[length-1-i]; @@ -354,53 +437,33 @@ _gcry_mpi_set_buffer (gcry_mpi_t a, const void *buffer_arg, for (i=0, p = buffer+nbytes-1; p >= buffer+BYTES_PER_MPI_LIMB; ) { #if BYTES_PER_MPI_LIMB == 4 - alimb = (mpi_limb_t)*p-- ; - alimb |= (mpi_limb_t)*p-- << 8 ; - alimb |= (mpi_limb_t)*p-- << 16 ; - alimb |= (mpi_limb_t)*p-- << 24 ; + alimb = buf_get_be32(p - 4 + 1); + p -= 4; #elif BYTES_PER_MPI_LIMB == 8 - alimb = (mpi_limb_t)*p-- ; - alimb |= (mpi_limb_t)*p-- << 8 ; - alimb |= (mpi_limb_t)*p-- << 16 ; - alimb |= (mpi_limb_t)*p-- << 24 ; - alimb |= (mpi_limb_t)*p-- << 32 ; - alimb |= (mpi_limb_t)*p-- << 40 ; - alimb |= (mpi_limb_t)*p-- << 48 ; - alimb |= (mpi_limb_t)*p-- << 56 ; + alimb = buf_get_be64(p - 8 + 1); + p -= 8; #else -# error please implement for this limb size. +# error please implement for this limb size. #endif a->d[i++] = alimb; } if ( p >= buffer ) { + byte last[BYTES_PER_MPI_LIMB] = { 0 }; + unsigned int n = (p - buffer) + 1; + + n = n > BYTES_PER_MPI_LIMB ? 
BYTES_PER_MPI_LIMB : n; + memcpy (last + BYTES_PER_MPI_LIMB - n, p - n + 1, n); + p -= n; + #if BYTES_PER_MPI_LIMB == 4 - alimb = (mpi_limb_t)*p--; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 8; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 16; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 24; + alimb = buf_get_be32(last); #elif BYTES_PER_MPI_LIMB == 8 - alimb = (mpi_limb_t)*p--; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 8; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 16; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 24; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 32; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 40; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 48; - if (p >= buffer) - alimb |= (mpi_limb_t)*p-- << 56; + alimb = buf_get_be64(last); #else # error please implement for this limb size. #endif + a->d[i++] = alimb; } a->nlimbs = i; @@ -446,25 +509,24 @@ twocompl (unsigned char *p, unsigned int n) ; if (i >= 0) { - if ((p[i] & 0x01)) - p[i] = (((p[i] ^ 0xfe) | 0x01) & 0xff); - else if ((p[i] & 0x02)) - p[i] = (((p[i] ^ 0xfc) | 0x02) & 0xfe); - else if ((p[i] & 0x04)) - p[i] = (((p[i] ^ 0xf8) | 0x04) & 0xfc); - else if ((p[i] & 0x08)) - p[i] = (((p[i] ^ 0xf0) | 0x08) & 0xf8); - else if ((p[i] & 0x10)) - p[i] = (((p[i] ^ 0xe0) | 0x10) & 0xf0); - else if ((p[i] & 0x20)) - p[i] = (((p[i] ^ 0xc0) | 0x20) & 0xe0); - else if ((p[i] & 0x40)) - p[i] = (((p[i] ^ 0x80) | 0x40) & 0xc0); - else - p[i] = 0x80; + unsigned char pi = p[i]; + unsigned int ntz = _gcry_ctz (pi); + + p[i] = ((p[i] ^ (0xfe << ntz)) | (0x01 << ntz)) & (0xff << ntz); - for (i--; i >= 0; i--) - p[i] ^= 0xff; + for (i--; i >= 7; i -= 8) + { + buf_put_he64(&p[i-7], ~buf_get_he64(&p[i-7])); + } + if (i >= 3) + { + buf_put_he32(&p[i-3], ~buf_get_he32(&p[i-3])); + i -= 4; + } + for (; i >= 0; i--) + { + p[i] ^= 0xff; + } } } @@ -571,7 +633,7 @@ _gcry_mpi_scan (struct gcry_mpi **ret_mpi, enum gcry_mpi_format format, if (len && len < 4) return 
GPG_ERR_TOO_SHORT; - n = (s[0] << 24 | s[1] << 16 | s[2] << 8 | s[3]); + n = buf_get_be32 (s); s += 4; if (len) len -= 4; @@ -605,12 +667,19 @@ _gcry_mpi_scan (struct gcry_mpi **ret_mpi, enum gcry_mpi_format format, } else if (format == GCRYMPI_FMT_HEX) { + size_t slen; /* We can only handle C strings for now. */ if (buflen) return GPG_ERR_INV_ARG; - a = secure? mpi_alloc_secure (0) : mpi_alloc(0); - if (mpi_fromstr (a, (const char *)buffer)) + slen = strlen ((const char *)buffer); + if (slen > MAX_EXTERN_SCAN_BYTES) + return GPG_ERR_INV_OBJ; + a = secure? mpi_alloc_secure ((((slen+1)/2)+BYTES_PER_MPI_LIMB-1) + /BYTES_PER_MPI_LIMB) + : mpi_alloc((((slen+1)/2)+BYTES_PER_MPI_LIMB-1) + /BYTES_PER_MPI_LIMB); + if (mpi_fromstr (a, (const char *)buffer, slen)) { mpi_free (a); return GPG_ERR_INV_OBJ; @@ -798,10 +867,8 @@ _gcry_mpi_print (enum gcry_mpi_format format, { unsigned char *s = buffer; - *s++ = n >> 24; - *s++ = n >> 16; - *s++ = n >> 8; - *s++ = n; + buf_put_be32 (s, n); + s += 4; if (extra == 1) *s++ = 0; else if (extra) @@ -832,6 +899,11 @@ _gcry_mpi_print (enum gcry_mpi_format format, } if (buffer) { + static const u32 nibble2hex[] = + { + '0', '1', '2', '3', '4', '5', '6', '7', + '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' + }; unsigned char *s = buffer; if (negative) @@ -842,13 +914,37 @@ _gcry_mpi_print (enum gcry_mpi_format format, *s++ = '0'; } - for (i=0; i < n; i++) + for (i = 0; i + 4 < n; i += 4) + { + u32 c = buf_get_be32(tmp + i); + u32 o1, o2; + + o1 = nibble2hex[(c >> 28) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 24) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 20) & 0xF]; + o1 <<= 8; + o1 |= nibble2hex[(c >> 16) & 0xF]; + + o2 = nibble2hex[(c >> 12) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 8) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 4) & 0xF]; + o2 <<= 8; + o2 |= (u64)nibble2hex[(c >> 0) & 0xF]; + + buf_put_be32 (s + 0, o1); + buf_put_be32 (s + 4, o2); + s += 8; + } + for (; i < n; i++) { unsigned int c = tmp[i]; - *s++ = (c >> 
4) < 10? '0'+(c>>4) : 'A'+(c>>4)-10 ; - c &= 15; - *s++ = c < 10? '0'+c : 'A'+c-10 ; + *s++ = nibble2hex[c >> 4]; + *s++ = nibble2hex[c & 0xF]; } *s++ = 0; *nwritten = s - buffer; diff --git a/tests/t-convert.c b/tests/t-convert.c index 4450a9e3..d5d162b9 100644 --- a/tests/t-convert.c +++ b/tests/t-convert.c @@ -141,6 +141,7 @@ static void check_formats (void) { static struct { + int have_value; int value; struct { const char *hex; @@ -154,136 +155,283 @@ check_formats (void) const char *pgp; } a; } data[] = { - { 0, { "00", - 0, "", - 4, "\x00\x00\x00\x00", - 0, "", - 2, "\x00\x00"} - }, - { 1, { "01", - 1, "\x01", - 5, "\x00\x00\x00\x01\x01", - 1, "\x01", - 3, "\x00\x01\x01" } - }, - { 2, { "02", - 1, "\x02", - 5, "\x00\x00\x00\x01\x02", - 1, "\x02", - 3, "\x00\x02\x02" } - }, - { 127, { "7F", - 1, "\x7f", - 5, "\x00\x00\x00\x01\x7f", - 1, "\x7f", - 3, "\x00\x07\x7f" } - }, - { 128, { "0080", - 2, "\x00\x80", - 6, "\x00\x00\x00\x02\x00\x80", - 1, "\x80", - 3, "\x00\x08\x80" } - }, - { 129, { "0081", - 2, "\x00\x81", - 6, "\x00\x00\x00\x02\x00\x81", - 1, "\x81", - 3, "\x00\x08\x81" } - }, - { 255, { "00FF", - 2, "\x00\xff", - 6, "\x00\x00\x00\x02\x00\xff", - 1, "\xff", - 3, "\x00\x08\xff" } - }, - { 256, { "0100", - 2, "\x01\x00", - 6, "\x00\x00\x00\x02\x01\x00", - 2, "\x01\x00", - 4, "\x00\x09\x01\x00" } - }, - { 257, { "0101", - 2, "\x01\x01", - 6, "\x00\x00\x00\x02\x01\x01", - 2, "\x01\x01", - 4, "\x00\x09\x01\x01" } - }, - { -1, { "-01", - 1, "\xff", - 5, "\x00\x00\x00\x01\xff", - 1,"\x01" } - }, - { -2, { "-02", - 1, "\xfe", - 5, "\x00\x00\x00\x01\xfe", - 1, "\x02" } - }, - { -127, { "-7F", - 1, "\x81", - 5, "\x00\x00\x00\x01\x81", - 1, "\x7f" } - }, - { -128, { "-0080", - 1, "\x80", - 5, "\x00\x00\x00\x01\x80", - 1, "\x80" } - }, - { -129, { "-0081", - 2, "\xff\x7f", - 6, "\x00\x00\x00\x02\xff\x7f", - 1, "\x81" } - }, - { -255, { "-00FF", - 2, "\xff\x01", - 6, "\x00\x00\x00\x02\xff\x01", - 1, "\xff" } - }, - { -256, { "-0100", - 2, "\xff\x00", - 6, 
"\x00\x00\x00\x02\xff\x00", - 2, "\x01\x00" } - }, - { -257, { "-0101", - 2, "\xfe\xff", - 6, "\x00\x00\x00\x02\xfe\xff", - 2, "\x01\x01" } - }, - { 65535, { "00FFFF", - 3, "\x00\xff\xff", - 7, "\x00\x00\x00\x03\x00\xff\xff", - 2, "\xff\xff", - 4, "\x00\x10\xff\xff" } - }, - { 65536, { "010000", - 3, "\x01\00\x00", - 7, "\x00\x00\x00\x03\x01\x00\x00", - 3, "\x01\x00\x00", - 5, "\x00\x11\x01\x00\x00 "} - }, - { 65537, { "010001", - 3, "\x01\00\x01", - 7, "\x00\x00\x00\x03\x01\x00\x01", - 3, "\x01\x00\x01", - 5, "\x00\x11\x01\x00\x01" } - }, - { -65537, { "-010001", - 3, "\xfe\xff\xff", - 7, "\x00\x00\x00\x03\xfe\xff\xff", - 3, "\x01\x00\x01" } - }, - { -65536, { "-010000", - 3, "\xff\x00\x00", - 7, "\x00\x00\x00\x03\xff\x00\x00", - 3, "\x01\x00\x00" } - }, - { -65535, { "-00FFFF", - 3, "\xff\x00\x01", - 7, "\x00\x00\x00\x03\xff\x00\x01", - 2, "\xff\xff" } + { + 1, 0, + { "00", + 0, "", + 4, "\x00\x00\x00\x00", + 0, "", + 2, "\x00\x00" } + }, + { + 1, 1, + { "01", + 1, "\x01", + 5, "\x00\x00\x00\x01\x01", + 1, "\x01", + 3, "\x00\x01\x01" } + }, + { + 1, 2, + { "02", + 1, "\x02", + 5, "\x00\x00\x00\x01\x02", + 1, "\x02", + 3, "\x00\x02\x02" } + }, + { + 1, 127, + { "7F", + 1, "\x7f", + 5, "\x00\x00\x00\x01\x7f", + 1, "\x7f", + 3, "\x00\x07\x7f" } + }, + { + 1, 128, + { "0080", + 2, "\x00\x80", + 6, "\x00\x00\x00\x02\x00\x80", + 1, "\x80", + 3, "\x00\x08\x80" } + }, + { + 1, 129, + { "0081", + 2, "\x00\x81", + 6, "\x00\x00\x00\x02\x00\x81", + 1, "\x81", + 3, "\x00\x08\x81" } + }, + { + 1, 255, + { "00FF", + 2, "\x00\xff", + 6, "\x00\x00\x00\x02\x00\xff", + 1, "\xff", + 3, "\x00\x08\xff" } + }, + { + 1, 256, + { "0100", + 2, "\x01\x00", + 6, "\x00\x00\x00\x02\x01\x00", + 2, "\x01\x00", + 4, "\x00\x09\x01\x00" } + }, + { + 1, 257, + { "0101", + 2, "\x01\x01", + 6, "\x00\x00\x00\x02\x01\x01", + 2, "\x01\x01", + 4, "\x00\x09\x01\x01" } + }, + { + 1, -1, + { "-01", + 1, "\xff", + 5, "\x00\x00\x00\x01\xff", + 1,"\x01" } + }, + { + 1, -2, + { "-02", + 1, "\xfe", + 5, 
"\x00\x00\x00\x01\xfe", + 1, "\x02" } + }, + { + 1, -127, + { "-7F", + 1, "\x81", + 5, "\x00\x00\x00\x01\x81", + 1, "\x7f" } + }, + { + 1, -128, + { "-0080", + 1, "\x80", + 5, "\x00\x00\x00\x01\x80", + 1, "\x80" } + }, + { + 1, -129, + { "-0081", + 2, "\xff\x7f", + 6, "\x00\x00\x00\x02\xff\x7f", + 1, "\x81" } + }, + { + 1, -255, + { "-00FF", + 2, "\xff\x01", + 6, "\x00\x00\x00\x02\xff\x01", + 1, "\xff" } + }, + { + 1, -256, + { "-0100", + 2, "\xff\x00", + 6, "\x00\x00\x00\x02\xff\x00", + 2, "\x01\x00" } + }, + { + 1, -257, + { "-0101", + 2, "\xfe\xff", + 6, "\x00\x00\x00\x02\xfe\xff", + 2, "\x01\x01" } + }, + { + 1, 65535, + { "00FFFF", + 3, "\x00\xff\xff", + 7, "\x00\x00\x00\x03\x00\xff\xff", + 2, "\xff\xff", + 4, "\x00\x10\xff\xff" } + }, + { + 1, 65536, + { "010000", + 3, "\x01\00\x00", + 7, "\x00\x00\x00\x03\x01\x00\x00", + 3, "\x01\x00\x00", + 5, "\x00\x11\x01\x00\x00 "} + }, + { + 1, 65537, + { "010001", + 3, "\x01\00\x01", + 7, "\x00\x00\x00\x03\x01\x00\x01", + 3, "\x01\x00\x01", + 5, "\x00\x11\x01\x00\x01" } + }, + { + 1, -65537, + { "-010001", + 3, "\xfe\xff\xff", + 7, "\x00\x00\x00\x03\xfe\xff\xff", + 3, "\x01\x00\x01" } + }, + { + 1, -65536, + { "-010000", + 3, "\xff\x00\x00", + 7, "\x00\x00\x00\x03\xff\x00\x00", + 3, "\x01\x00\x00" } + }, + { + 1, -65535, + { "-00FFFF", + 3, "\xff\x00\x01", + 7, "\x00\x00\x00\x03\xff\x00\x01", + 2, "\xff\xff" } + }, + { + 1, 0x7fffffff, + { "7FFFFFFF", + 4, "\x7f\xff\xff\xff", + 8, "\x00\x00\x00\x04\x7f\xff\xff\xff", + 4, "\x7f\xff\xff\xff", + 6, "\x00\x1f\x7f\xff\xff\xff" } + }, + { 1, -0x7fffffff, + { "-7FFFFFFF", + 4, "\x80\x00\x00\x01", + 8, "\x00\x00\x00\x04\x80\x00\x00\x01", + 4, "\x7f\xff\xff\xff" } + }, + { + 1, (int)0x800000ffU, + { "-7FFFFF01", + 4, "\x80\x00\x00\xff", + 8, "\x00\x00\x00\x04\x80\x00\x00\xff", + 4, "\x7f\xff\xff\x01" } + }, + { + 1, (int)0x800000feU, + { "-7FFFFF02", + 4, "\x80\x00\x00\xfe", + 8, "\x00\x00\x00\x04\x80\x00\x00\xfe", + 4, "\x7f\xff\xff\x02" } + }, + { + 1, (int)0x800000fcU, + { 
"-7FFFFF04", + 4, "\x80\x00\x00\xfc", + 8, "\x00\x00\x00\x04\x80\x00\x00\xfc", + 4, "\x7f\xff\xff\x04" } + }, + { + 1, (int)0x800000f8U, + { "-7FFFFF08", + 4, "\x80\x00\x00\xf8", + 8, "\x00\x00\x00\x04\x80\x00\x00\xf8", + 4, "\x7f\xff\xff\x08" } + }, + { + 1, (int)0x800000f0U, + { "-7FFFFF10", + 4, "\x80\x00\x00\xf0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xf0", + 4, "\x7f\xff\xff\x10" } + }, + { + 1, (int)0x800000e0U, + { "-7FFFFF20", + 4, "\x80\x00\x00\xe0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xe0", + 4, "\x7f\xff\xff\x20" } + }, + { + 1, (int)0x800000c0U, + { "-7FFFFF40", + 4, "\x80\x00\x00\xc0", + 8, "\x00\x00\x00\x04\x80\x00\x00\xc0", + 4, "\x7f\xff\xff\x40" } + }, + { + 1, (int)0x80000080U, + { "-7FFFFF80", + 4, "\x80\x00\x00\x80", + 8, "\x00\x00\x00\x04\x80\x00\x00\x80", + 4, "\x7f\xff\xff\x80" } + }, + { + 1, (int)0x80000100U, + { "-7FFFFF00", + 4, "\x80\x00\x01\x00", + 8, "\x00\x00\x00\x04\x80\x00\x01\x00", + 4, "\x7f\xff\xff\x00" } + }, + { + 0, 0, + { "076543210FEDCBA9876543210123456789ABCDEF00112233", + 24, "\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21\x01\x23" + "\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33", + 28, "\x00\x00\x00\x18\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65" + "\x43\x21\x01\x23\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33" + "\x44", + 24, "\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21\x01\x23" + "\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33", + 26, "\x00\xbb\x07\x65\x43\x21\x0f\xed\xcb\xa9\x87\x65\x43\x21" + "\x01\x23\x45\x67\x89\xab\xcd\xef\x00\x11\x22\x33" } + }, + { + 0, 0, + { "-07FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF01", + 24, "\xf8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff", + 28, "\x00\x00\x00\x18\xf8\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff", + 24, "\x07\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff" + "\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01" } } }; gpg_error_t err; gcry_mpi_t a, b; + char valuestr[128]; char *buf; 
void *bufaddr = &buf; int idx; @@ -295,24 +443,39 @@ check_formats (void) if (debug) info ("print test %d\n", data[idx].value); - if (data[idx].value < 0) - { - gcry_mpi_set_ui (a, -data[idx].value); - gcry_mpi_neg (a, a); - } + if (data[idx].have_value) + { + snprintf(valuestr, sizeof(valuestr), "%d", data[idx].value); + if (data[idx].value < 0) + { + gcry_mpi_set_ui (a, -data[idx].value); + gcry_mpi_neg (a, a); + } + else + gcry_mpi_set_ui (a, data[idx].value); + } else - gcry_mpi_set_ui (a, data[idx].value); + { + /* Use hex-format as source test vector. */ + snprintf(valuestr, sizeof(valuestr), "%s", data[idx].a.hex); + gcry_mpi_release (a); + err = gcry_mpi_scan (&a, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, + &buflen); + if (err) + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); + } err = gcry_mpi_aprint (GCRYMPI_FMT_HEX, bufaddr, NULL, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "HEX", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); else { if (strcmp (buf, data[idx].a.hex)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "HEX", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "HEX", "wrong result"); info ("expected: '%s'\n", data[idx].a.hex); info (" got: '%s'\n", buf); } @@ -321,15 +484,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_STD, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "STD", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "STD", gpg_strerror (err)); else { if (buflen != data[idx].a.stdlen || memcmp (buf, data[idx].a.std, data[idx].a.stdlen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "STD", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "STD", "wrong result"); showhex ("expected:", data[idx].a.std, data[idx].a.stdlen); 
showhex (" got:", buf, buflen); } @@ -338,15 +501,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_SSH, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "SSH", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "SSH", gpg_strerror (err)); else { if (buflen != data[idx].a.sshlen || memcmp (buf, data[idx].a.ssh, data[idx].a.sshlen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "SSH", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "SSH", "wrong result"); showhex ("expected:", data[idx].a.ssh, data[idx].a.sshlen); showhex (" got:", buf, buflen); } @@ -355,15 +518,15 @@ check_formats (void) err = gcry_mpi_aprint (GCRYMPI_FMT_USG, bufaddr, &buflen, a); if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "USG", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "USG", gpg_strerror (err)); else { if (buflen != data[idx].a.usglen || memcmp (buf, data[idx].a.usg, data[idx].a.usglen)) { - fail ("error printing value %d as %s: %s\n", - data[idx].value, "USG", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "USG", "wrong result"); showhex ("expected:", data[idx].a.usg, data[idx].a.usglen); showhex (" got:", buf, buflen); } @@ -374,19 +537,19 @@ check_formats (void) if (gcry_mpi_is_neg (a)) { if (gpg_err_code (err) != GPG_ERR_INV_ARG) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "PGP", "Expected error not returned"); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", "Expected error not returned"); } else if (err) - fail ("error printing value %d as %s: %s\n", - data[idx].value, "PGP", gpg_strerror (err)); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", gpg_strerror (err)); else { if (buflen != data[idx].a.pgplen || memcmp (buf, data[idx].a.pgp, data[idx].a.pgplen)) { - fail ("error printing value %d as %s: 
%s\n", - data[idx].value, "PGP", "wrong result"); + fail ("error printing value %s as %s: %s\n", + valuestr, "PGP", "wrong result"); showhex ("expected:", data[idx].a.pgp, data[idx].a.pgplen); showhex (" got:", buf, buflen); } @@ -401,24 +564,39 @@ check_formats (void) if (debug) info ("scan test %d\n", data[idx].value); - if (data[idx].value < 0) - { - gcry_mpi_set_ui (a, -data[idx].value); - gcry_mpi_neg (a, a); - } + if (data[idx].have_value) + { + snprintf(valuestr, sizeof(valuestr), "%d", data[idx].value); + if (data[idx].value < 0) + { + gcry_mpi_set_ui (a, -data[idx].value); + gcry_mpi_neg (a, a); + } + else + gcry_mpi_set_ui (a, data[idx].value); + } else - gcry_mpi_set_ui (a, data[idx].value); + { + /* Use hex-format as source test vector. */ + snprintf(valuestr, sizeof(valuestr), "%s", data[idx].a.hex); + gcry_mpi_release (a); + err = gcry_mpi_scan (&a, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, + &buflen); + if (err) + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); + } err = gcry_mpi_scan (&b, GCRYMPI_FMT_HEX, data[idx].a.hex, 0, &buflen); if (err) - fail ("error scanning value %d from %s: %s\n", - data[idx].value, "HEX", gpg_strerror (err)); + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b)) { - fail ("error scanning value %d from %s: %s\n", - data[idx].value, "HEX", "wrong result"); + fail ("error scanning value %s from %s: %s\n", + valuestr, "HEX", "wrong result"); showmpi ("expected:", a); showmpi (" got:", b); } @@ -428,14 +606,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_STD, data[idx].a.std, data[idx].a.stdlen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "STD", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "STD", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.stdlen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", 
- data[idx].value, "STD", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "STD", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -446,14 +624,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_SSH, data[idx].a.ssh, data[idx].a.sshlen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "SSH", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "SSH", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.sshlen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", - data[idx].value, "SSH", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "SSH", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -464,16 +642,16 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_USG, data[idx].a.usg, data[idx].a.usglen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "USG", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "USG", gpg_strerror (err)); else { if (gcry_mpi_is_neg (a)) gcry_mpi_neg (b, b); if (gcry_mpi_cmp (a, b) || data[idx].a.usglen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", - data[idx].value, "USG", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "USG", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); @@ -488,14 +666,14 @@ check_formats (void) err = gcry_mpi_scan (&b, GCRYMPI_FMT_PGP, data[idx].a.pgp, data[idx].a.pgplen, &buflen); if (err) - fail ("error scanning value %d as %s: %s\n", - data[idx].value, "PGP", gpg_strerror (err)); + fail ("error scanning value %s as %s: %s\n", + valuestr, "PGP", gpg_strerror (err)); else { if (gcry_mpi_cmp (a, b) || data[idx].a.pgplen != buflen) { - fail ("error scanning value %d from %s: %s (%lu)\n", 
- data[idx].value, "PGP", "wrong result", + fail ("error scanning value %s from %s: %s (%lu)\n", + valuestr, "PGP", "wrong result", (long unsigned int)buflen); showmpi ("expected:", a); showmpi (" got:", b); -- 2.30.2 From jussi.kivilinna at iki.fi Fri Jun 25 13:02:28 2021 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 25 Jun 2021 14:02:28 +0300 Subject: [PATCH] ec: add zSeries/s390x accelerated scalar multiplication Message-ID: <20210625110228.1175550-1-jussi.kivilinna@iki.fi> * cipher/asm-inline-s390x.h (PCC_FUNCTION_*): New. (pcc_query, pcc_scalar_multiply): New. * mpi/Makefile.am: Add 'ec-hw-s390x.c'. * mpi/ec-hw-s390x.c: New. * mpi/ec-internal.h (_gcry_s390x_ec_hw_mul_point) (mpi_ec_hw_mul_point): New. * mpi/ec.c (_gcry_mpi_ec_mul_point): Call 'mpi_ec_hw_mul_point'. * src/g10lib.h (HWF_S390X_MSA_9): New. * src/hwf-s390x.c (s390x_features): Add MSA9. * src/hwfeatures.c (hwflist): Add 's390x-msa-9'. -- Patch adds ECC scalar multiplication acceleration using s390x's PCC instruction. 
The following curves are supported:
 - Ed25519
 - Ed448
 - X25519
 - X448
 - NIST curves P-256, P-384 and P-521

Benchmark on z15 (5.2 GHz):

Before:
 Ed25519        |  nanosecs/iter   cycles/iter
        mult    |         389791       2026916
        keygen  |         572017       2974487
        sign    |         636603       3310336
        verify  |        1189097       6183305
 =
 X25519         |  nanosecs/iter   cycles/iter
        mult    |         296805       1543385
 =
 Ed448          |  nanosecs/iter   cycles/iter
        mult    |        1693373       8805541
        keygen  |        2382473      12388858
        sign    |        2609562      13569725
        verify  |        5177606      26923552
 =
 X448           |  nanosecs/iter   cycles/iter
        mult    |        1136178       5908127
 =
 NIST-P256      |  nanosecs/iter   cycles/iter
        mult    |         792620       4121625
        keygen  |        4627835      24064740
        sign    |        1528268       7946991
        verify  |        1678205       8726664
 =
 NIST-P384      |  nanosecs/iter   cycles/iter
        mult    |        1766418       9185373
        keygen  |       10158485      52824123
        sign    |        3341172      17374095
        verify  |        3694750      19212700
 =
 NIST-P521      |  nanosecs/iter   cycles/iter
        mult    |        3172566      16497346
        keygen  |       18184747      94560683
        sign    |        6039956      31407771
        verify  |        6480882      33700588

After:
 Ed25519        |  nanosecs/iter   cycles/iter   speed-up
        mult    |          25913        134746        15x
        keygen  |          44447        231124        12x
        sign    |         106928        556028         6x
        verify  |         164681        856341         7x
 =
 X25519         |  nanosecs/iter   cycles/iter   speed-up
        mult    |          17761         92358        16x
 =
 Ed448          |  nanosecs/iter   cycles/iter   speed-up
        mult    |          50808        264199        33x
        keygen  |          68644        356951        34x
        sign    |         317446       1650720         8x
        verify  |         457115       2376997        11x
 =
 X448           |  nanosecs/iter   cycles/iter   speed-up
        mult    |          35637        185313        31x
 =
 NIST-P256      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          30678        159528        25x
        keygen  |         323722       1683356        14x
        sign    |         114176        593713        13x
        verify  |         169901        883487         9x
 =
 NIST-P384      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          59966        311822        29x
        keygen  |         607778       3160445        16x
        sign    |         209832       1091128        16x
        verify  |         329506       1713431        11x
 =
 NIST-P521      |  nanosecs/iter   cycles/iter   speed-up
        mult    |          98230        510797        32x
        keygen  |        1131686       5884765        16x
        sign    |         397777       2068442        15x
        verify  |         623076       3239998        10x

Signed-off-by: Jussi Kivilinna
---
 cipher/asm-inline-s390x.h |  48 +++++
 mpi/Makefile.am           |   3 +-
 mpi/ec-hw-s390x.c         | 412 ++++++++++++++++++++++++++++++++++++++
mpi/ec-internal.h | 8 + mpi/ec.c | 10 +- src/g10lib.h | 3 +- src/hwf-s390x.c | 1 + src/hwfeatures.c | 1 + 8 files changed, 483 insertions(+), 3 deletions(-) create mode 100644 mpi/ec-hw-s390x.c diff --git a/cipher/asm-inline-s390x.h b/cipher/asm-inline-s390x.h index bacb45fe..001cb965 100644 --- a/cipher/asm-inline-s390x.h +++ b/cipher/asm-inline-s390x.h @@ -45,6 +45,14 @@ enum kmxx_functions_e KMID_FUNCTION_SHAKE128 = 36, KMID_FUNCTION_SHAKE256 = 37, KMID_FUNCTION_GHASH = 65, + + PCC_FUNCTION_NIST_P256 = 64, + PCC_FUNCTION_NIST_P384 = 65, + PCC_FUNCTION_NIST_P521 = 66, + PCC_FUNCTION_ED25519 = 72, + PCC_FUNCTION_ED448 = 73, + PCC_FUNCTION_X25519 = 80, + PCC_FUNCTION_X448 = 81 }; enum kmxx_function_flags_e @@ -108,6 +116,26 @@ static inline u128_t klmd_query(void) return function_codes; } +static inline u128_t pcc_query(void) +{ + static u128_t function_codes = 0; + static int initialized = 0; + register unsigned long reg0 asm("0") = 0; + register void *reg1 asm("1") = &function_codes; + + if (initialized) + return function_codes; + + asm volatile ("0: .insn rre,0xb92c << 16, 0, 0\n\t" + " brc 1,0b\n\t" + : + : [reg0] "r" (reg0), [reg1] "r" (reg1) + : "cc", "memory"); + + initialized = 1; + return function_codes; +} + static ALWAYS_INLINE void kimd_execute(unsigned int func, void *param_block, const void *src, size_t src_len) @@ -154,4 +182,24 @@ klmd_shake_execute(unsigned int func, void *param_block, void *dst, : "cc", "memory"); } +static ALWAYS_INLINE unsigned int +pcc_scalar_multiply(unsigned int func, void *param_block) +{ + register unsigned long reg0 asm("0") = func; + register byte *reg1 asm("1") = param_block; + register unsigned long error = 0; + + asm volatile ("0: .insn rre,0xb92c << 16, 0, 0\n\t" + " brc 1,0b\n\t" + " brc 7,1f\n\t" + " j 2f\n\t" + "1: lhi %[error], 1\n\t" + "2:\n\t" + : [func] "+r" (reg0), [error] "+r" (error) + : [param_ptr] "r" (reg1) + : "cc", "memory"); + + return error; +} + #endif /* GCRY_ASM_INLINE_S390X_H */ diff --git 
a/mpi/Makefile.am b/mpi/Makefile.am index adb8e6f5..3604f840 100644 --- a/mpi/Makefile.am +++ b/mpi/Makefile.am @@ -175,5 +175,6 @@ libmpi_la_SOURCES = longlong.h \ mpih-mul.c \ mpih-const-time.c \ mpiutil.c \ - ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h + ec.c ec-internal.h ec-ed25519.c ec-nist.c ec-inline.h \ + ec-hw-s390x.c EXTRA_libmpi_la_SOURCES = asm-common-aarch64.h diff --git a/mpi/ec-hw-s390x.c b/mpi/ec-hw-s390x.c new file mode 100644 index 00000000..149a061d --- /dev/null +++ b/mpi/ec-hw-s390x.c @@ -0,0 +1,412 @@ +/* ec-hw-s390x.c - zSeries ECC acceleration + * Copyright (C) 2021 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include +#include +#include +#include + +#ifdef HAVE_GCC_INLINE_ASM_S390X + +#include "mpi-internal.h" +#include "g10lib.h" +#include "context.h" +#include "ec-context.h" +#include "ec-internal.h" + +#include "../cipher/bufhelp.h" +#include "../cipher/asm-inline-s390x.h" + + +#define S390X_PCC_PARAM_BLOCK_SIZE 4096 + + +extern void reverse_buffer (unsigned char *buffer, unsigned int length); + +static int s390_mul_point_montgomery (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx, + byte *param_block_buf); + + +static int +mpi_copy_to_raw(byte *raw, unsigned int raw_nbytes, gcry_mpi_t a) +{ + unsigned int num_to_zero; + unsigned int nbytes; + int i, j; + + if (mpi_has_sign (a)) + return -1; + + if (mpi_get_flag (a, GCRYMPI_FLAG_OPAQUE)) + { + unsigned int nbits; + byte *buf; + + buf = mpi_get_opaque (a, &nbits); + nbytes = (nbits + 7) / 8; + + if (raw_nbytes < nbytes) + return -1; + + num_to_zero = raw_nbytes - nbytes; + if (num_to_zero > 0) + memset (raw, 0, num_to_zero); + if (nbytes > 0) + memcpy (raw + num_to_zero, buf, nbytes); + + return 0; + } + + nbytes = a->nlimbs * BYTES_PER_MPI_LIMB; + if (raw_nbytes < nbytes) + return -1; + + num_to_zero = raw_nbytes - nbytes; + if (num_to_zero > 0) + memset (raw, 0, num_to_zero); + + for (j = a->nlimbs - 1, i = 0; i < a->nlimbs; i++, j--) + { + buf_put_be64(raw + num_to_zero + i * BYTES_PER_MPI_LIMB, a->d[j]); + } + + return 0; +} + +int +_gcry_s390x_ec_hw_mul_point (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx) +{ + byte param_block_buf[S390X_PCC_PARAM_BLOCK_SIZE]; + byte *param_out_x = NULL; + byte *param_out_y = NULL; + byte *param_in_x = NULL; + byte *param_in_y = NULL; + byte *param_scalar = NULL; + unsigned int field_nbits; + unsigned int pcc_func; + gcry_mpi_t x, y; + gcry_mpi_t d = NULL; + int rc = -1; + + if (ctx->name == NULL) + return -1; + + if (!(_gcry_get_hw_features () & HWF_S390X_MSA_9)) + return -1; /* ECC acceleration not supported by HW. 
*/ + + if (ctx->model == MPI_EC_MONTGOMERY) + return s390_mul_point_montgomery (result, scalar, point, ctx, + param_block_buf); + + if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 256 && + strcmp (ctx->name, "NIST P-256") == 0) + { + struct pcc_param_block_nistp256_s + { + byte out_x[256 / 8]; + byte out_y[256 / 8]; + byte in_x[256 / 8]; + byte in_y[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P256; + field_nbits = 256; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 384 && + strcmp (ctx->name, "NIST P-384") == 0) + { + struct pcc_param_block_nistp384_s + { + byte out_x[384 / 8]; + byte out_y[384 / 8]; + byte in_x[384 / 8]; + byte in_y[384 / 8]; + byte scalar[384 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P384; + field_nbits = 384; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_WEIERSTRASS && ctx->nbits == 521 && + strcmp (ctx->name, "NIST P-521") == 0) + { + struct pcc_param_block_nistp521_s + { + byte out_x[640 / 8]; /* note: first 14 bytes not modified by pcc */ + byte out_y[640 / 8]; /* note: first 14 bytes not modified by pcc */ + byte in_x[640 / 8]; + byte in_y[640 / 8]; + byte scalar[640 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 14); + memset (params->out_y, 0, 14); + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_NIST_P521; + field_nbits = 640; + param_out_x = params->out_x; + 
param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_EDWARDS && ctx->nbits == 255 && + strcmp (ctx->name, "Ed25519") == 0) + { + struct pcc_param_block_ed25519_s + { + byte out_x[256 / 8]; + byte out_y[256 / 8]; + byte in_x[256 / 8]; + byte in_y[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_ED25519; + field_nbits = 256; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + else if (ctx->model == MPI_EC_EDWARDS && ctx->nbits == 448 && + strcmp (ctx->name, "Ed448") == 0) + { + struct pcc_param_block_ed448_s + { + byte out_x[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte out_y[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte in_x[512 / 8]; + byte in_y[512 / 8]; + byte scalar[512 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 8); + memset (params->out_y, 0, 8); + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_ED448; + field_nbits = 512; + param_out_x = params->out_x; + param_out_y = params->out_y; + param_in_x = params->in_x; + param_in_y = params->in_y; + param_scalar = params->scalar; + } + + if (param_scalar == NULL) + return -1; /* No curve match. */ + + if (!(pcc_query () & km_function_to_mask (pcc_func))) + return -1; /* HW does not support acceleration for this curve. */ + + x = mpi_new (0); + y = mpi_new (0); + + if (_gcry_mpi_ec_get_affine (x, y, point, ctx) < 0) + { + /* Point at infinity. */ + goto out; + } + + if (mpi_has_sign (scalar) || mpi_cmp (scalar, ctx->n) >= 0) + { + d = mpi_is_secure (scalar) ? 
mpi_snew (ctx->nbits) : mpi_new (ctx->nbits); + _gcry_mpi_mod (d, scalar, ctx->n); + } + else + { + d = scalar; + } + + if (mpi_copy_to_raw (param_in_x, field_nbits / 8, x) < 0) + goto out; + + if (mpi_copy_to_raw (param_in_y, field_nbits / 8, y) < 0) + goto out; + + if (mpi_copy_to_raw (param_scalar, field_nbits / 8, d) < 0) + goto out; + + if (pcc_scalar_multiply (pcc_func, param_block_buf) != 0) + goto out; + + _gcry_mpi_set_buffer (result->x, param_out_x, field_nbits / 8, 0); + _gcry_mpi_set_buffer (result->y, param_out_y, field_nbits / 8, 0); + mpi_set_ui (result->z, 1); + mpi_normalize (result->x); + mpi_normalize (result->y); + if (ctx->model == MPI_EC_EDWARDS) + mpi_point_resize (result, ctx); + + rc = 0; + +out: + if (d != scalar) + mpi_release (d); + mpi_release (y); + mpi_release (x); + wipememory (param_block_buf, S390X_PCC_PARAM_BLOCK_SIZE); + + return rc; +} + + +static int +s390_mul_point_montgomery (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx, + byte *param_block_buf) +{ + byte *param_out_x = NULL; + byte *param_in_x = NULL; + byte *param_scalar = NULL; + unsigned int field_nbits; + unsigned int pcc_func; + gcry_mpi_t x; + gcry_mpi_t d = NULL; + int rc = -1; + + if (ctx->nbits == 255 && strcmp (ctx->name, "Curve25519") == 0) + { + struct pcc_param_block_x25519_s + { + byte out_x[256 / 8]; + byte in_x[256 / 8]; + byte scalar[256 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_X25519; + field_nbits = 256; + param_out_x = params->out_x; + param_in_x = params->in_x; + param_scalar = params->scalar; + } + else if (ctx->nbits == 448 && strcmp (ctx->name, "X448") == 0) + { + struct pcc_param_block_x448_s + { + byte out_x[512 / 8]; /* note: first 8 bytes not modified by pcc */ + byte in_x[512 / 8]; + byte scalar[512 / 8]; + byte c_and_ribm[64]; + } *params = (void *)param_block_buf; + + memset (params->out_x, 0, 8); + 
memset (params->c_and_ribm, 0, sizeof(params->c_and_ribm)); + + pcc_func = PCC_FUNCTION_X448; + field_nbits = 512; + param_out_x = params->out_x; + param_in_x = params->in_x; + param_scalar = params->scalar; + } + + if (param_scalar == NULL) + return -1; /* No curve match. */ + + if (!(pcc_query () & km_function_to_mask (pcc_func))) + return -1; /* HW does not support acceleration for this curve. */ + + x = mpi_new (0); + + if (mpi_is_opaque (scalar)) + { + const unsigned int pbits = ctx->nbits; + unsigned int n; + unsigned char *raw; + + raw = _gcry_mpi_get_opaque_copy (scalar, &n); + if ((n + 7) / 8 != (pbits + 7) / 8) + log_fatal ("scalar size (%d) != prime size (%d)\n", + (n + 7) / 8, (pbits + 7) / 8); + + reverse_buffer (raw, (n + 7 ) / 8); + if ((pbits % 8)) + raw[0] &= (1 << (pbits % 8)) - 1; + raw[0] |= (1 << ((pbits + 7) % 8)); + raw[(pbits + 7) / 8 - 1] &= (256 - ctx->h); + d = mpi_is_secure (scalar) ? mpi_snew (pbits) : mpi_new (pbits); + _gcry_mpi_set_buffer (d, raw, (n + 7) / 8, 0); + xfree (raw); + } + else + { + d = scalar; + } + + if (_gcry_mpi_ec_get_affine (x, NULL, point, ctx) < 0) + { + /* Point at infinity. 
*/ + goto out; + } + + if (mpi_copy_to_raw (param_in_x, field_nbits / 8, x) < 0) + goto out; + + if (mpi_copy_to_raw (param_scalar, field_nbits / 8, d) < 0) + goto out; + + if (pcc_scalar_multiply (pcc_func, param_block_buf) != 0) + goto out; + + _gcry_mpi_set_buffer (result->x, param_out_x, field_nbits / 8, 0); + mpi_set_ui (result->z, 1); + mpi_point_resize (result, ctx); + + rc = 0; + +out: + if (d != scalar) + mpi_release (d); + mpi_release (x); + wipememory (param_block_buf, S390X_PCC_PARAM_BLOCK_SIZE); + + return rc; +} + +#endif /* HAVE_GCC_INLINE_ASM_S390X */ diff --git a/mpi/ec-internal.h b/mpi/ec-internal.h index 2296d55d..3f948aa0 100644 --- a/mpi/ec-internal.h +++ b/mpi/ec-internal.h @@ -38,4 +38,12 @@ void _gcry_mpi_ec_nist521_mod (gcry_mpi_t w, mpi_ec_t ctx); # define _gcry_mpi_ec_nist521_mod NULL #endif +#ifdef HAVE_GCC_INLINE_ASM_S390X +int _gcry_s390x_ec_hw_mul_point (mpi_point_t result, gcry_mpi_t scalar, + mpi_point_t point, mpi_ec_t ctx); +# define mpi_ec_hw_mul_point _gcry_s390x_ec_hw_mul_point +#else +# define mpi_ec_hw_mul_point(r,s,p,c) (-1) +#endif + #endif /*GCRY_EC_INTERNAL_H*/ diff --git a/mpi/ec.c b/mpi/ec.c index 029099b4..c24921ee 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -1769,7 +1769,7 @@ _gcry_mpi_ec_sub_points (mpi_point_t result, } -/* Scalar point multiplication - the main function for ECC. If takes +/* Scalar point multiplication - the main function for ECC. It takes an integer SCALAR and a POINT as well as the usual context CTX. RESULT will be set to the resulting point. */ void @@ -1781,6 +1781,14 @@ _gcry_mpi_ec_mul_point (mpi_point_t result, unsigned int i, loops; mpi_point_struct p1, p2, p1inv; + /* First try HW accelerated scalar multiplications. Error + is returned if acceleration is not supported or if HW + does not support acceleration of given input. 
*/ + if (mpi_ec_hw_mul_point (result, scalar, point, ctx) >= 0) + { + return; + } + if (ctx->model == MPI_EC_EDWARDS || (ctx->model == MPI_EC_WEIERSTRASS && mpi_is_secure (scalar))) diff --git a/src/g10lib.h b/src/g10lib.h index fb288a30..ed908742 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -258,7 +258,8 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_S390X_MSA (1 << 0) #define HWF_S390X_MSA_4 (1 << 1) #define HWF_S390X_MSA_8 (1 << 2) -#define HWF_S390X_VX (1 << 3) +#define HWF_S390X_MSA_9 (1 << 3) +#define HWF_S390X_VX (1 << 4) #endif diff --git a/src/hwf-s390x.c b/src/hwf-s390x.c index 25121b91..74590fc3 100644 --- a/src/hwf-s390x.c +++ b/src/hwf-s390x.c @@ -63,6 +63,7 @@ static const struct feature_map_s s390x_features[] = { 17, 0, HWF_S390X_MSA }, { 77, 0, HWF_S390X_MSA_4 }, { 146, 0, HWF_S390X_MSA_8 }, + { 155, 0, HWF_S390X_MSA_9 }, #ifdef HAVE_GCC_INLINE_ASM_S390X_VX { 129, HWCAP_S390_VXRS, HWF_S390X_VX }, #endif diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 534c271d..89f5943d 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -76,6 +76,7 @@ static struct { HWF_S390X_MSA, "s390x-msa" }, { HWF_S390X_MSA_4, "s390x-msa-4" }, { HWF_S390X_MSA_8, "s390x-msa-8" }, + { HWF_S390X_MSA_9, "s390x-msa-9" }, { HWF_S390X_VX, "s390x-vx" }, #endif }; -- 2.30.2