From jussi.kivilinna at iki.fi Mon Feb 2 19:21:56 2026 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 2 Feb 2026 20:21:56 +0200 Subject: [PATCH] configure.ac: fix HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS on x32 targets Message-ID: <20260202182156.2115138-1-jussi.kivilinna@iki.fi> * configure.ac (gcry_cv_compiler_defines__x86_64__): New. (HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS): Enable if __x86_64__ macro is defined by compiler and size of long is 4 (x32) or 8 (amd64). -- Signed-off-by: Jussi Kivilinna --- configure.ac | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index be6b29b4..d94319cc 100644 --- a/configure.ac +++ b/configure.ac @@ -1829,6 +1829,25 @@ if test $amd64_as_feature_detection = yes; then fi +# +# Check whether compiler defines __x86_64__ macro (amd64 or x32) +# +AC_CACHE_CHECK([whether compiler defines __x86_64__ macro], + [gcry_cv_compiler_defines__x86_64__], + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then + gcry_cv_compiler_defines__x86_64__="n/a" + else + AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [[ + #ifndef __x86_64__ + # error "Architecture is not x86_64" + #endif + ]])], + [gcry_cv_compiler_defines__x86_64__=yes], + [gcry_cv_compiler_defines__x86_64__=no]) + fi]) + + # # Check whether GCC assembler supports features needed for our i386/amd64 # implementations @@ -1859,7 +1878,9 @@ if test $amd64_as_feature_detection = yes; then [gcry_cv_gcc_x86_platform_as_ok=yes]) fi]) if test "$gcry_cv_gcc_x86_platform_as_ok" = "yes" && - test "$ac_cv_sizeof_unsigned_long" = "8"; then + test "$gcry_cv_compiler_defines__x86_64__" = "yes" && + (test "$ac_cv_sizeof_unsigned_long" = "4" || + test "$ac_cv_sizeof_unsigned_long" = "8"); then AC_DEFINE(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS,1, [Defined if underlying assembler is compatible with amd64 assembly implementations]) fi -- 2.51.0 From zach.fogg at gmail.com Thu Feb 5 01:23:01 2026 From: zach.fogg 
at gmail.com (Zachary Fogg)
Date: Wed, 4 Feb 2026 19:23:01 -0500
Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure
In-Reply-To: <87pl7b7zxh.fsf@jacob.g10code.de>
References: <87pl7b7zxh.fsf@jacob.g10code.de>
Message-ID: 

Hey Werner,

I appreciate the feedback, but I have to respectfully push back here. I
don't think this is just an API inconsistency issue. I set up a minimal
test case to verify what's happening, and it's pretty clear cut:

  // Format 2: (data (flags eddsa) (hash-algo sha512) (value %b))
  err = gcry_sexp_build(&s_data, NULL,
                        "(data (flags eddsa) (hash-algo sha512) (value %b))",
                        msg_len, msg);
  err = gcry_pk_sign(&s_sig, s_data, privkey);   // OK - signing works
  err = gcry_pk_verify(s_sig, s_data, pubkey);   // FAILED - verification fails

The thing is, I'm using the exact same S-expression for both signing and
verification. If the format is valid enough for signing to succeed, it
should be valid for verification. That's the inconsistency. You can't
have signing work and verification fail with identical inputs.

I get that Ed25519 has been working well since 2014 - the simple format
(data (value %b)) works fine for me too. But when you add the eddsa flag
and the hash-algo spec (which is what GPG uses), verification breaks.
That's a real problem.

This isn't some edge case either. I'm trying to verify signatures and it
fails every time, because libgcrypt can sign but can't verify using the
same format. For my use case that's a showstopper for using the library
directly... I resorted to having the gpg binary do it in its own
process, which is far from ideal.

I'm not trying to be difficult here - I just want this fixed, or at
least documented as a known limitation, and the test suite extended,
because from what I can tell this bug exists precisely because there are
no tests for this case.
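For anyone following along, the two data formats in question boil down to the following (a tiny illustrative helper - the function name is mine, this is not a libgcrypt API; the plain format works for me in both sign and verify, the flagged one signs but fails verify):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, just to state the two data S-expression
 * templates side by side.  with_eddsa_flags selects the format GPG
 * itself uses for Ed25519. */
static const char *
eddsa_data_format (int with_eddsa_flags)
{
  return with_eddsa_flags
    ? "(data (flags eddsa) (hash-algo sha512) (value %b))"
    : "(data (value %b))";
}
```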
If EdDSA with hash-algo flags isn't supposed to be supported, the
library should reject it at sign time, not silently accept it and then
fail verification.

What do you think? Did you run my example code at least and SEE the
failures in your own terminal? It's reproducible, and I could show you
if you were sitting in front of me.

-Zachary

On Thu, Jan 15, 2026 at 8:41 AM Werner Koch wrote:
> Hi!
>
> Just a short note on your bug report. You gave a lot of examples and a
> nicely formatted report at https://github.com/zfogg/ascii-chat/issues/92
> but I can't read everything of it.
>
> On Tue, 30 Dec 2025 02:30, Zachary Fogg said:
> > **In-Reply-To:**
>
> > Your response mentioned using `(flags eddsa)` during key generation,
> > which is good practice. However, I want to clarify that **my bug
> > report concerns signature verification, not key generation**.
>
> If you look at the way GnuPG uses Libgcrypt, you will find in
> gnupg/g10/pkglue.c:pk_verify this:
>
>   if (openpgp_oid_is_ed25519 (pkey[0]))
>     fmt = "(public-key(ecc(curve %s)(flags eddsa)(q%m)))";
>   else
>     fmt = "(public-key(ecc(curve %s)(q%m)))";
>
> and this for the data:
>
>   if (openpgp_oid_is_ed25519 (pkey[0]))
>     fmt = "(data(flags eddsa)(hash-algo sha512)(value %m))";
>   else
>     fmt = "(data(value %m))";
>
> and more complicated stuff for re-formatting the signature data. It is
> a bit unfortunate that we need to have these special cases, but that's
> the drawback of having a stable API and protocol.
>
> > 1. Can you confirm this is a genuine bug in libgcrypt's verification
> > logic?
>
> No, at least not as I understand it. ed25519 signatures are working
> well and have been in active use since GnuPG 2.1 from 2014.
>
> > 2. Should I open a formal bug in the dev.gnupg.org tracker?
>
> I don't see a bug ;-)
>
> > 3. Would a patch fixing the PUBKEY_FLAG_PREHASH handling be acceptable?
>
> I do not understand exactly what you propose. A more concise
> description would be helpful.
> But note that API stability is a primary goal.
>
> BTW, on your website you wrote:
>
>   I've created a working exploit that demonstrates the severity of this
>   bug. The exploit proves that GPG agent creates EdDSA signatures that
>   cannot be verified by standard libgcrypt verification code, even with
>   the correct keys.
>
> The term "exploit" is used to describe an attack method which undermines
> the security of a system. What you describe is a claimed inconsistent
> API. That may or may not be the case; I don't see a security bug here,
> though.
>
> Salam-Shalom,
>
>    Werner
>
> p.s.
> I had a brief look at your project: In src/main.c I notice
>
>   // Set global FPS from command-line option if provided
>   extern int g_max_fps;
>
> The declaration of an external variable inside a function is not good
> coding style. Put this at the top of the file or into a header.
> A few lines above:
>
>   #ifndef NDEBUG
>   // Initialize lock debugging system after logging is fully set up
>   log_debug("Initializing lock debug system...");
>
> Never ever use NDEBUG. This is an idea from the 70s. It also disables
> the assert(3) functionality, and if you do this you won't get an
> assertion failure at all in your production code - either you know the
> code is correct or you are not sure. Never remove an assert from
> production code.
>
> I have noticed a lot of documentation inside the code - that's good.
>
> --
> The pioneers of a warless world are the youth that
> refuse military service. - A. Einstein

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gniibe at fsij.org Thu Feb 5 02:58:25 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Thu, 05 Feb 2026 10:58:25 +0900
Subject: libgcrypt 1.8.12: STRIBOG carry overflow bug
In-Reply-To: 
References: 
Message-ID: <87jywsgddq.fsf@haruna.fsij.org>

Guido Vranken wrote:
> Fix is in da6cd4f but was not backported to 1.8. 1.8 is EOL but has
> "Extended Long Term Support contract available".
Thank you. I cherry-picked the commit to the 1.8 branch.

I found that building 1.8 is getting difficult with newer toolchains and
newer libgpg-error. I pushed a minimal forward port for that, too.
--

From stu at spacehopper.org Fri Feb 6 15:41:30 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Fri, 6 Feb 2026 14:41:30 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
Message-ID: 

When building the "notmuch" email indexer, the configure script tests
that gmime can extract a session key, which it does using gcrypt.
Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).

This is not changed by applying the patch
https://lists.gnupg.org/pipermail/gcrypt-devel/2026-January/006025.html

The problem is no longer seen after neutering part of
https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=commit;h=4f56fd8c5e03f389a9f27a5e9206b9dfb49c92e3

Index: mpi/ec.c
--- mpi/ec.c.orig
+++ mpi/ec.c
@@ -305,7 +305,7 @@ ec_mod (gcry_mpi_t w, mpi_ec_t ec)
   else
     _gcry_mpi_mod (w, w, ec->p);
 
-  if ((ec->flags & GCRYECC_FLAG_LEAST_LEAK))
+  if (0 && (ec->flags & GCRYECC_FLAG_LEAST_LEAK))
     w->nlimbs = ec->p->nlimbs;
 }

The script below replicates the test setup used by notmuch (requires
gmime and gnupg to be installed). (The two #include lines were eaten by
the list archive; they are restored here as the obvious gmime and stdio
headers.)

#!/bin/sh
set -e
tmp=$(mktemp -d /tmp/notmuchtest.XXXXXXXXX)
cd $tmp

cat << EOF > _check_session_keys.c
#include <gmime/gmime.h>
#include <stdio.h>

int main ()
{
    GError *error = NULL;
    GMimeParser *parser = NULL;
    GMimeMultipartEncrypted *body = NULL;
    GMimeDecryptResult *decrypt_result = NULL;
    GMimeObject *output = NULL;

    g_mime_init ();
    parser = g_mime_parser_new ();
    g_mime_parser_init_with_stream (parser,
        g_mime_stream_file_open("basic-encrypted.eml", "r", &error));
    if (error)
        return !! fprintf (stderr, "failed to instantiate parser with basic-encrypted.eml\n");

    body = GMIME_MULTIPART_ENCRYPTED(g_mime_message_get_mime_part (g_mime_parser_construct_message (parser, NULL)));
    if (body == NULL)
        return !!
fprintf (stderr, "did not find a multipart encrypted message\n"); output = g_mime_multipart_encrypted_decrypt (body, GMIME_DECRYPT_EXPORT_SESSION_KEY, NULL, &decrypt_result, &error); if (error || output == NULL) return !! fprintf (stderr, "decryption failed\n"); if (decrypt_result == NULL) return !! fprintf (stderr, "no GMimeDecryptResult found\n"); if (decrypt_result->session_key == NULL) return !! fprintf (stderr, "GMimeDecryptResult has no session key\n"); printf ("%s\n", decrypt_result->session_key); return 0; } EOF cat << EOF > openpgp4-secret-key.asc -----BEGIN PGP PRIVATE KEY BLOCK----- lFgEYxhtlxYJKwYBBAHaRw8BAQdA0PoNKr90DaQV1dIK77wbWm4RT+JQzqBkwIjA HQM9RHYAAQDQ5wSfkOGXvKYroALWgibztISzXS5b8boGXykcHERo6w/ctDtOb3Rt dWNoIFRlc3QgU3VpdGUgKElOU0VDVVJFISkgPHRlc3Rfc3VpdGVAbm90bXVjaG1h aWwub3JnPoiQBBMWCAA4AhsDBQsJCAcCBhUKCQgLAgQWAgMBAh4BAheAFiEEmjr+ bGAGWhSP1LWKfmq+kkZFzGAFAmMYbZwACgkQfmq+kkZFzGDtrwEAjQRn3xhEomah wICjQjfi4RKNbvnRViZgosijDBANUAgA/28GrK1tPnQsXWqmuZxQ1Cd5ry4NAnj/ 4jsxD3cTbnEHnF0EYxhtlxIKKwYBBAGXVQEFAQEHQEOd3EyCD5qo4+QuHz0lruCG VM6n6RI4dtAh3cX9uHwiAwEIBwAA/1oe+p5jNjNE5lEj4yTpYjCxCeC98MolbiAy 0yY7526wECqIeAQYFggAIBYhBJo6/mxgBloUj9S1in5qvpJGRcxgBQJjGG2XAhsM AAoJEH5qvpJGRcxgBdsA/R9ZECoxai5QhOitDIAUZVCRr59Pm1VMPiJOOIla2N1p AQCNESwJ9IJOdO/06q+bR2GG4WyEkB4VoVBiA3hFx/zZAA== =uGTo -----END PGP PRIVATE KEY BLOCK----- EOF cat << EOF > basic-encrypted.eml From: test_suite at notmuchmail.org To: test_suite at notmuchmail.org Subject: Here is the password Date: Sat, 01 Jan 2000 12:00:00 +0000 Message-ID: MIME-Version: 1.0 Content-Type: multipart/encrypted; boundary="=-=-="; protocol="application/pgp-encrypted" --=-=-= Content-Type: application/pgp-encrypted Version: 1 --=-=-= Content-Type: application/octet-stream -----BEGIN PGP MESSAGE----- hF4DHXHP849rSK8SAQdAYbv9NFaU2Fbd6JbfE87h/yZNyWLJYZ2EseU0WyOz7Agw /+KTbbIqRcEYhnpQhQXBQ2wqIN5gmdRhaqrj5q0VLV2BOKNJKqXGs/W4DghXwfAu 0oMBqjTd/mMbF0nJLw3bPX+LW47RHQdZ8vUVPlPr0ALg8kqgcfy95Qqy5h796Uyq 
xs+I/UUOt7fzTDAw0B4qkRbdSangwYy80N4X43KrAfKSstBH3/7O4285XZr86YhF
rEtsBuwhoXI+DaG3uYZBBMTkzfButmBKHwB2CmWutmVpQL087A==
=lhSz
-----END PGP MESSAGE-----

--=-=-=--
EOF

cc $(pkg-config --cflags gmime-3.0) _check_session_keys.c \
  $(pkg-config --libs gmime-3.0) -o _check_session_keys

export GNUPGHOME=$tmp
gpg --batch --quiet --import < openpgp4-secret-key.asc
echo "cd $tmp; GNUPGHOME=$tmp ./_check_session_keys"
./_check_session_keys

From wk at gnupg.org Mon Feb 9 11:32:44 2026
From: wk at gnupg.org (Werner Koch)
Date: Mon, 09 Feb 2026 11:32:44 +0100
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
In-Reply-To: (Stuart Henderson via Gcrypt-devel's message of "Fri, 6 Feb 2026 14:41:30 +0000")
References: 
Message-ID: <87qzqub41f.fsf@jacob.g10code.de>

On Fri, 6 Feb 2026 14:41, Stuart Henderson said:
> When building the "notmuch" email indexer, the configure script tests
> that gmime can extract a session key, which it does using gcrypt.
> Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).

You mean Notmuch uses its own *PGP parser to "extract" the session key?

I recall that there was a discussion on speeding things up by storing
decrypted session keys in a database and then using
gpg --override-session-key. But I can't remember the details.

Salam-Shalom,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service. - A. Einstein

-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 284 bytes Desc: not available URL: From stu at spacehopper.org Mon Feb 9 12:14:52 2026 From: stu at spacehopper.org (Stuart Henderson) Date: Mon, 9 Feb 2026 11:14:52 +0000 Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386 In-Reply-To: <87qzqub41f.fsf@jacob.g10code.de> References: <87qzqub41f.fsf@jacob.g10code.de> Message-ID: On 2026/02/09 11:32, Werner Koch wrote: > On Fri, 6 Feb 2026 14:41, Stuart Henderson said: > > When building the "notmuch" email indexer, the configure script tests > > that gmime can extract a session key, which it does using gcrypt. > > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit). > > You mean Notmuch uses its own *PGP parser to "extract" the session key? > > I recall that there was a discussion on speeding up things by storing > decrypted session in a database and then use gpg --override-sssion-key. > But I can't remember the details. I didn't look into how it's used in the main program code, was just hitting the failure where it's used in autoconf when building (I'm an os package builder). It does look like it's doing something along those lines in the main code though: https://git.notmuchmail.org/git?p=notmuch;a=blob;f=util/crypto.c;h=156a6550c20afef00a6bb5eaab94e8ba435cbfbd;hb=HEAD From sam at gentoo.org Mon Feb 9 17:43:31 2026 From: sam at gentoo.org (Sam James) Date: Mon, 09 Feb 2026 16:43:31 +0000 Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386 In-Reply-To: References: Message-ID: <87a4xhkguk.fsf@gentoo.org> Stuart Henderson via Gcrypt-devel writes: > When building the "notmuch" email indexer, the configure script tests > that gmime can extract a session key, which it does using gcrypt. > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit). Do you see anything useful if you run the reproducer under Valgrind? > [...] 
sam -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 418 bytes Desc: not available URL: From stu at spacehopper.org Mon Feb 9 17:55:23 2026 From: stu at spacehopper.org (Stuart Henderson) Date: Mon, 9 Feb 2026 16:55:23 +0000 Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386 In-Reply-To: <87a4xhkguk.fsf@gentoo.org> References: <87a4xhkguk.fsf@gentoo.org> Message-ID: On 2026/02/09 16:43, Sam James wrote: > Stuart Henderson via Gcrypt-devel writes: > > > When building the "notmuch" email indexer, the configure script tests > > that gmime can extract a session key, which it does using gcrypt. > > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit). > > Do you see anything useful if you run the reproducer under Valgrind? I'm seeing this on OpenBSD, Valgrind doesn't currently work there. From sam at gentoo.org Mon Feb 9 18:14:12 2026 From: sam at gentoo.org (Sam James) Date: Mon, 09 Feb 2026 17:14:12 +0000 Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386 In-Reply-To: References: <87a4xhkguk.fsf@gentoo.org> Message-ID: <871pitkfff.fsf@gentoo.org> Stuart Henderson writes: > On 2026/02/09 16:43, Sam James wrote: >> Stuart Henderson via Gcrypt-devel writes: >> >> > When building the "notmuch" email indexer, the configure script tests >> > that gmime can extract a session key, which it does using gcrypt. >> > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit). >> >> Do you see anything useful if you run the reproducer under Valgrind? > > I'm seeing this on OpenBSD, Valgrind doesn't currently work there. Ah, sorry for the silly question then. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 418 bytes
Desc: not available
URL: 

From gniibe at fsij.org Tue Feb 10 07:50:57 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Tue, 10 Feb 2026 15:50:57 +0900
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
In-Reply-To: 
References: 
Message-ID: <87wm0lytv2.fsf@haruna.fsij.org>

Hello,

Thank you for your report.

Stuart Henderson wrote:
> The script below replicates the test setup used by notmuch (requires
> gmime and gnupg to be installed).

I built gnupg from master for i386 on a Debian machine (also building
npth, libgpg-error, libgcrypt, libassuan, libksba, and ntbtls). With
this newly built gpg, I can't replicate the failure, even running the
_check_session_keys program many times. (I made sure to use the newly
built i386 version of gnupg with libgcrypt 1.12.)

Manually decrypting the message (basic-encrypted.eml) with the newly
built gnupg also succeeds. I can't replicate any failure.

Could you please narrow down the failure? IIUC, when the
_check_session_keys program fails, invoking gpg must fail too.
If so, debug output of gpg (and/or gpg-agent) helps.
--

From gniibe at fsij.org Wed Feb 11 03:07:27 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Wed, 11 Feb 2026 11:07:27 +0900
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
In-Reply-To: <87wm0lytv2.fsf@haruna.fsij.org>
References: <87wm0lytv2.fsf@haruna.fsij.org>
Message-ID: <87wm0k3ue8.fsf@haruna.fsij.org>

NIIBE Yutaka wrote:
> Could you please narrow down the failure? IIUC, when the
> _check_session_keys program fails, invoking gpg must fail too.
> If so, debug output of gpg (and/or gpg-agent) helps.

I mean, I'd like to locate the bug.

There are two things in the notmuch/configure script.

(1) importing the secret key
(2) running the _check_session_keys program

Does the failure occur in (1) or (2)? And what versions of gpg and
gpg-agent are you running? That information will help us.
I created the ticket:

  https://dev.gnupg.org/T8094

I put the "libgcrypt" tag on it, but it may be other parts of the
program (gnupg, libgpg-error, etc.) which cause the issue.
--

From gniibe at fsij.org Sat Feb 14 04:42:54 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 12:42:54 +0900
Subject: [PATCH 2/2] mpi:ec: Use mpi_new with NBITS, instead of mpi_alloc.
In-Reply-To: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
References: <87wm0lytv2.fsf@haruna.fsij.org>
 <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
Message-ID: <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>

* mpi/ec.c (ec_get_two_inv_p): Use mpi_new with NBITS.
* cipher/ecc-ecdsa.c (_gcry_ecc_ecdsa_sign): Likewise.
(_gcry_ecc_ecdsa_verify): Likewise.
* cipher/ecc-gost.c (_gcry_ecc_gost_sign): Likewise.
(_gcry_ecc_gost_verify): Likewise.
--

Signed-off-by: NIIBE Yutaka
---
 cipher/ecc-ecdsa.c | 16 ++++++++--------
 cipher/ecc-gost.c  | 24 ++++++++++++------------
 mpi/ec.c           |  2 +-
 3 files changed, 21 insertions(+), 21 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-mpi-ec-Use-mpi_new-with-NBITS-instead-of-mpi_alloc.patch
Type: text/x-patch
Size: 2616 bytes
Desc: not available
URL: 

From gniibe at fsij.org Sat Feb 14 04:42:53 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 12:42:53 +0900
Subject: [PATCH 1/2] mpi:ec: Make sure to have MPI limbs in ECC.
In-Reply-To: <87wm0k3ue8.fsf@haruna.fsij.org>
References: <87wm0lytv2.fsf@haruna.fsij.org>
 <87wm0k3ue8.fsf@haruna.fsij.org>
Message-ID: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>

* src/mpi.h (_gcry_mpi_point_init): Add NBITS argument.
* mpi/ec.c (point_init): Follow the change.
(_gcry_mpi_point_log): Fix mpi_new with NBITS.
(_gcry_mpi_point_new): Fix _gcry_mpi_point_init with NBITS.
(_gcry_mpi_point_init): Initialize with mpi_new with NBITS. (_gcry_mpi_ec_get_affine): Fix mpi_new with NBITS. (montgomery_mul_point): Fix point_init with NBITS. (mpi_ec_mul_point_lli): Fix point_init and mpi_new with NBITS. (_gcry_mpi_ec_mul_point): Fix point_init with NBITS. (_gcry_mpi_ec_curve_point): Fix mpi_new with NBITS. * mpi/ec-hw-s390x.c (_gcry_s390x_ec_hw_mul_point): Likewise. (s390_mul_point_montgomery): Likewise. * cipher/ecc-common.h (point_init): Follow the change of _gcry_mpi_point_init. * cipher/ecc-curves.c (_gcry_ecc_get_curve): Likewise. (point_from_keyparam): Fix mpi_point_new with NBITS. (mpi_ec_get_elliptic_curve): Follow the change of _gcry_mpi_point_init. (_gcry_ecc_set_mpi): Fix mpi_point_new with NBITS. * cipher/ecc-ecdh.c (_gcry_ecc_curve_keypair) (_gcry_ecc_curve_mul_point): Fix point_init with NBITS. * cipher/ecc-ecdsa.c (_gcry_ecc_ecdsa_sign): Likewise. (_gcry_ecc_ecdsa_verify): Likewise. * cipher/ecc-eddsa.c (_gcry_ecc_eddsa_encodepoint, ecc_ed448_recover_x) (_gcry_ecc_eddsa_recover_x): Fix mpi_new with NBITS. (_gcry_ecc_eddsa_genkey): Remove unused X and Y. Fix point_init with NBITS. (_gcry_ecc_eddsa_sign): Fix mpi_new with NBITS. Fix point_init with NBITS. (_gcry_ecc_eddsa_verify): Fix point_init with NBITS. * cipher/ecc-gost.c (_gcry_ecc_gost_sign, _gcry_ecc_gost_verify): Likewise. * cipher/ecc-misc.c (_gcry_ecc_curve_copy): Follow the change of _gcry_mpi_point_init. (_gcry_mpi_ec_ec2os, _gcry_ecc_sec_decodepoint): Fix mpi_new with NBITS. (_gcry_ecc_compute_public): Fix mpi_point_new with NBITS. * cipher/ecc-sm2.c (_gcry_ecc_sm2_encrypt): Fix point_init with NBITS. Fix mpi_new with NBITS. (_gcry_ecc_sm2_decrypt, _gcry_ecc_sm2_sign, _gcry_ecc_sm2_verify): Likewise. * cipher/ecc.c (nist_generate_key): Fix point_init with NBITS. (test_keys): Likewise. (test_ecdh_only_keys): Fix point_init and mpi_new with NBITS. (check_secret_key): Likewise. (ecc_generate): Fix mpi_new with NBITS. 
(ecc_encrypt_raw): Fix mpi_new and point_init with NBITS.
(ecc_decrypt_raw): Fix point_init and mpi_new with NBITS.
(compute_keygrip): Fix mpi_new with NBITS.
--
The changes for ECC least leak assume that the limbs for an MPI are
allocated and sufficient. In the past, we had a practice of using
"mpi_new (0)" to initialize an MPI, which only allocates the
placeholder of the MPI and not the limbs. This fixes those places
in ECC.

GnuPG-bug-id: 8094
Signed-off-by: NIIBE Yutaka
---
 cipher/ecc-common.h |  2 +-
 cipher/ecc-curves.c |  8 +++----
 cipher/ecc-ecdh.c   |  6 ++---
 cipher/ecc-ecdsa.c  |  8 +++----
 cipher/ecc-eddsa.c  | 44 ++++++++++++++++-------------------
 cipher/ecc-gost.c   |  8 +++----
 cipher/ecc-misc.c   | 18 +++++++--------
 cipher/ecc-sm2.c    | 38 +++++++++++++++---------------
 cipher/ecc.c        | 55 ++++++++++++++++++++++----------------------
 mpi/ec-hw-s390x.c   |  6 ++---
 mpi/ec.c            | 56 ++++++++++++++++++++++-----------------------
 src/mpi.h           |  2 +-
 12 files changed, 123 insertions(+), 128 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-mpi-ec-Make-sure-to-have-MPI-limbs-in-ECC.patch
Type: text/x-patch
Size: 22200 bytes
Desc: not available
URL: 

From gniibe at fsij.org Sat Feb 14 08:55:38 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 16:55:38 +0900
Subject: [PATCH 2/2] mpi:ec: Use mpi_new with NBITS, instead of mpi_alloc.
In-Reply-To: <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>
References: <87wm0lytv2.fsf@haruna.fsij.org>
 <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
 <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>
Message-ID: <87ikbz92th.fsf@haruna.fsij.org>

Hello,

One more patch for this series.
--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-cipher-ecc-Fix-Weierstrass-curve-with-PUBKEY_FLAG_PA.patch
Type: text/x-diff
Size: 2908 bytes
Desc: not available
URL: 

From wk at gnupg.org Tue Feb 17 08:32:54 2026
From: wk at gnupg.org (Werner Koch)
Date: Tue, 17 Feb 2026 08:32:54 +0100
Subject: [PATCH 1/2] mpi:ec: Make sure to have MPI limbs in ECC.
In-Reply-To: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
 (NIIBE Yutaka via Gcrypt-devel's message of "Sat, 14 Feb 2026 12:42:53 +0900")
References: <87wm0lytv2.fsf@haruna.fsij.org>
 <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
Message-ID: <871pij9655.fsf@jacob.g10code.de>

Hi!

That is a pretty large change, but it seems to be okay.

Shalom-Salam,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service. - A. Einstein

-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: 

From stu at spacehopper.org Tue Feb 17 16:20:08 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Tue, 17 Feb 2026 15:20:08 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
In-Reply-To: <87wm0k3ue8.fsf@haruna.fsij.org>
References: <87wm0lytv2.fsf@haruna.fsij.org>
 <87wm0k3ue8.fsf@haruna.fsij.org>
Message-ID: 

On 2026/02/11 11:07, NIIBE Yutaka wrote:
> NIIBE Yutaka wrote:
> > Could you please narrow down the failure? IIUC, when the
> > _check_session_keys program fails, invoking gpg must fail too.
> > If so, debug output of gpg (and/or gpg-agent) helps.
>
> I mean, I'd like to locate the bug.
>
> There are two things in the notmuch/configure script.
>
> (1) importing the secret key
> (2) running the _check_session_keys program
>
> Does the failure occur in (1) or (2)? And what versions of gpg and
> gpg-agent are you running? That information will help us.

It occurs in (2) for me, never seen it in (1).
gpg, gpg-agent: 2.5.16
libgpg-error: 1.58
libassuan: 3.0.2
libksba: 1.6.7
libnpth: 1.8

The same happens if I import the secret key into gpg and then try to use
gpg --decrypt on the encrypted message from the .eml file.

> I created the ticket:
>
> https://dev.gnupg.org/T8094
>
> I put the "libgcrypt" tag on it, but it may be other parts of the
> program (gnupg, libgpg-error, etc.) which cause the issue.
> --

From dtsen at us.ibm.com Tue Feb 24 01:27:49 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:49 -0600
Subject: [PATCH 1/5] dilithium: Added optimized dilithium NTT support for ppc64le.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-2-dtsen@us.ibm.com>

Optimized dilithium (ML-DSA) NTT algorithm for ppc64le (Power 8 and above).

Signed-off-by: Danny Tsen
---
 cipher/dilithium_ntt_p8le.S | 859 ++++++++++++++++++++++++++++++++++++
 1 file changed, 859 insertions(+)
 create mode 100644 cipher/dilithium_ntt_p8le.S

diff --git a/cipher/dilithium_ntt_p8le.S b/cipher/dilithium_ntt_p8le.S
new file mode 100644
index 00000000..8932d8e8
--- /dev/null
+++ b/cipher/dilithium_ntt_p8le.S
@@ -0,0 +1,859 @@
+/*
+ * This file was modified for use by Libgcrypt.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <https://www.gnu.org/licenses/>.
+ * SPDX-License-Identifier: LGPL-2.1-or-later + * + * You can also use this file under the same licence of original code. + * SPDX-License-Identifier: CC0 OR Apache-2.0 + * + */ +/* + * + * Copyright IBM Corp. 2025, 2026 + * + * =================================================================================== + * Written by Danny Tsen + */ + +#define QINV_OFFSET 16 +#define ZETA_NTT_OFFSET 32 + +#define MLDSA_Q 8380417 +#define MLDSA_QINV 58728449 + +#define QINV 0 +#define V_Q 1 +#define V_ZETA 2 +#define V_Z0 2 +#define V_Z1 3 +#define V_Z2 4 +#define V_Z3 5 + +.machine "any" +.text + +.macro SAVE_REGS + stdu 1, -352(1) + mflr 0 + std 14, 56(1) + std 15, 64(1) + std 16, 72(1) + std 17, 80(1) + std 18, 88(1) + std 19, 96(1) + std 20, 104(1) + std 21, 112(1) + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + stxvx 32+20, 10, 1 + stxvx 32+21, 11, 1 + stxvx 32+22, 12, 1 + stxvx 32+23, 14, 1 + stxvx 32+24, 15, 1 + stxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + stxvx 32+26, 10, 1 + stxvx 32+27, 11, 1 + stxvx 32+28, 12, 1 + stxvx 32+29, 14, 1 +.endm + +.macro RESTORE_REGS + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + lxvx 32+20, 10, 1 + lxvx 32+21, 11, 1 + lxvx 32+22, 12, 1 + lxvx 32+23, 14, 1 + lxvx 32+24, 15, 1 + lxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + lxvx 32+26, 10, 1 + lxvx 32+27, 11, 1 + lxvx 32+28, 12, 1 + lxvx 32+29, 14, 1 + ld 14, 56(1) + ld 15, 64(1) + ld 16, 72(1) + ld 17, 80(1) + ld 18, 88(1) + ld 19, 96(1) + ld 20, 104(1) + ld 21, 112(1) + + mtlr 0 + addi 1, 1, 352 +.endm + +/* + * Init_Coeffs_offset: initial offset setup for the coefficient array. + * + * start: beginning of the offset to the coefficient array. + * next: Next offset. + * len: Index difference between coefficients. + * + * r7: len * 2, each coefficient component is 32 bits. 
+ *
+ * registers used for offset to coefficients, r[j] and r[j+len]
+ *   R9:  offset to r0 = j
+ *   R16: offset to r1 = r0 + next
+ *   R18: offset to r2 = r1 + next
+ *   R20: offset to r3 = r2 + next
+ *
+ *   R10: offset to r'0 = r0 + len*2
+ *   R17: offset to r'1 = r'0 + next
+ *   R19: offset to r'2 = r'1 + next
+ *   R21: offset to r'3 = r'2 + next
+ *
+ */
+.macro Init_Coeffs_offset start next
+        li 9, \start            /* first offset to j */
+        add 10, 7, 9            /* j + len*2 */
+        addi 16, 9, \next
+        addi 17, 10, \next
+        addi 18, 16, \next
+        addi 19, 17, \next
+        addi 20, 18, \next
+        addi 21, 19, \next
+.endm
+
+/*
+ * For Len=1, load 1-1-1-1 layout
+ *
+ * Load coefficients and set up vectors
+ *   rj0, rjlen1, rj2, rjlen3
+ *   rj4, rjlen5, rj6, rjlen7
+ *
+ * Each vmrgew and vmrgow will transpose vectors as,
+ *
+ *   rj vector    = (rj0, rj4, rj2, rj6)
+ *   rjlen vector = (rjlen1, rjlen5, rjlen3, rjlen7)
+ *
+ * r' = r[j+len]: V6, V7, V8, V9
+ * r  = r[j]:     V26, V27, V28, V29
+ *
+ * In order to do the coefficient computation, the zeta vector will be
+ * arranged in the proper order to match the multiplication.
+ */
+.macro Load_41Coeffs
+        lxvd2x 32+10, 0, 5
+        lxvd2x 32+11, 10, 5
+        vmrgew 6, 10, 11
+        vmrgow 26, 10, 11
+        lxvd2x 32+12, 11, 5
+        lxvd2x 32+13, 12, 5
+        vmrgew 7, 12, 13
+        vmrgow 27, 12, 13
+        lxvd2x 32+10, 15, 5
+        lxvd2x 32+11, 16, 5
+        vmrgew 8, 10, 11
+        vmrgow 28, 10, 11
+        lxvd2x 32+12, 17, 5
+        lxvd2x 32+13, 18, 5
+        vmrgew 9, 12, 13
+        vmrgow 29, 12, 13
+.endm
+
+/*
+ * For Len=2, load 2 - 2 - 2 - 2 layout
+ *
+ * Load coefficients and set up vectors for 8 coefficients in the
+ * following order,
+ *   rj0, rj1, rjlen2, rjlen3,
+ *   rj4, rj5, rjlen6, rjlen7
+ * Each xxpermdi will transpose vectors as,
+ *   r[j]     = rj0, rj1, rj4, rj5
+ *   r[j+len] = rjlen2, rjlen3, rjlen6, rjlen7
+ *
+ * r' = r[j+len]: V6, V7, V8, V9
+ * r  = r[j]:     V26, V27, V28, V29
+ *
+ * In order to do the coefficient computation, the zeta vector will be
+ * arranged in the proper order to match the multiplication.
+ */ +.macro Load_42Coeffs + lxvd2x 1, 0, 5 + lxvd2x 2, 10, 5 + xxpermdi 32+6, 1, 2, 3 + xxpermdi 32+26, 1, 2, 0 + lxvd2x 3, 11, 5 + lxvd2x 4, 12, 5 + xxpermdi 32+7, 3, 4, 3 + xxpermdi 32+27, 3, 4, 0 + lxvd2x 1, 15, 5 + lxvd2x 2, 16, 5 + xxpermdi 32+8, 1, 2, 3 + xxpermdi 32+28, 1, 2, 0 + lxvd2x 3, 17, 5 + lxvd2x 4, 18, 5 + xxpermdi 32+9, 3, 4, 3 + xxpermdi 32+29, 3, 4, 0 +.endm + +/* + * For Len=8, + * Load coefficients in 2 legs, 64 bytes apart, into + * r[j+len] (r') vectors from offset, R10, R17, R19 and R21 + * r[j+len]: V6, V7, V8, V9 + */ +.macro Load_22Coeffs start next + li 9, \start + add 10, 7, 9 + addi 16, 9, \next + addi 17, 10, \next + li 18, \start+64 + add 19, 7, 18 + addi 20, 18, \next + addi 21, 19, \next + lxvd2x 32+6, 3, 10 + lxvd2x 32+7, 3, 17 + lxvd2x 32+8, 3, 19 + lxvd2x 32+9, 3, 21 +.endm + +/* + * Load coefficients in 2 legs, len*2 bytes apart, into + * r[j+len] (r') vectors from offset, R10, R17, R19 and R21 + * r[j+len]: V6, V7, V8, V9 + */ +.macro Load_4Coeffs start next + Init_Coeffs_offset \start, \next + + lxvd2x 32+6, 3, 10 + lxvd2x 32+7, 3, 17 + lxvd2x 32+8, 3, 19 + lxvd2x 32+9, 3, 21 +.endm + +/* + * Load 4 r[j] (r) coefficient vectors: + * Load coefficients in vectors from offset, R9, R16, R18 and R20 + * r[j]: V26, V27, V28, V29 + */ +.macro Load_4Rj + lxvd2x 32+26, 3, 9 + lxvd2x 32+27, 3, 16 + lxvd2x 32+28, 3, 18 + lxvd2x 32+29, 3, 20 +.endm + +/* + * Compute the final r[j] and r[j+len] + * final r[j+len]: V18, V19, V20, V21 + * final r[j]: V22, V23, V24, V25 + */ +.macro Compute_4Coeff + vsubuwm 18, 26, 10 + vadduwm 22, 26, 10 + + vsubuwm 19, 27, 11 + vadduwm 23, 27, 11 + + vsubuwm 20, 28, 12 + vadduwm 24, 28, 12 + + vsubuwm 21, 29, 13 + vadduwm 25, 29, 13 +.endm + +.macro Write_One + stxvd2x 32+22, 3, 9 + stxvd2x 32+18, 3, 10 + stxvd2x 32+23, 3, 16 + stxvd2x 32+19, 3, 17 + stxvd2x 32+24, 3, 18 + stxvd2x 32+20, 3, 19 + stxvd2x 32+25, 3, 20 + stxvd2x 32+21, 3, 21 +.endm + +/* + * Transpose the final coefficients of 2-2-2-2 
layout to the original + coefficient array order. + */ +.macro PermWrite42 + xxpermdi 32+10, 32+22, 32+18, 0 + xxpermdi 32+14, 32+22, 32+18, 3 + xxpermdi 32+11, 32+23, 32+19, 0 + xxpermdi 32+15, 32+23, 32+19, 3 + xxpermdi 32+12, 32+24, 32+20, 0 + xxpermdi 32+16, 32+24, 32+20, 3 + xxpermdi 32+13, 32+25, 32+21, 0 + xxpermdi 32+17, 32+25, 32+21, 3 + stxvd2x 32+10, 0, 5 + stxvd2x 32+14, 10, 5 + stxvd2x 32+11, 11, 5 + stxvd2x 32+15, 12, 5 + stxvd2x 32+12, 15, 5 + stxvd2x 32+16, 16, 5 + stxvd2x 32+13, 17, 5 + stxvd2x 32+17, 18, 5 +.endm + +/* + * Transpose the final coefficients of 1-1-1-1 layout to the original + * coefficient array order. + */ +.macro PermWrite41 + vmrgew 10, 18, 22 + vmrgow 11, 18, 22 + vmrgew 12, 19, 23 + vmrgow 13, 19, 23 + vmrgew 14, 20, 24 + vmrgow 15, 20, 24 + vmrgew 16, 21, 25 + vmrgow 17, 21, 25 + stxvd2x 32+10, 0, 5 + stxvd2x 32+11, 10, 5 + stxvd2x 32+12, 11, 5 + stxvd2x 32+13, 12, 5 + stxvd2x 32+14, 15, 5 + stxvd2x 32+15, 16, 5 + stxvd2x 32+16, 17, 5 + stxvd2x 32+17, 18, 5 +.endm + +.macro Load_next_4zetas + li 10, 16 + li 11, 32 + li 12, 48 + lxvd2x 32+V_Z0, 0, 14 + lxvd2x 32+V_Z1, 10, 14 + lxvd2x 32+V_Z2, 11, 14 + lxvd2x 32+V_Z3, 12, 14 + addi 14, 14, 64 +.endm + +/* + * montgomery_reduce + * a = zeta * a[j+len] + * t = (int64_t)(int32_t)a*QINV; + * t = (a - (int64_t)t*Q) >> 32; + * + * ----------------------------------- + * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3) + */ +.macro MREDUCE_4x _vz0 _vz1 _vz2 _vz3 + /* The coefficient multiplications produce 64-bit products in + even and odd pairs */ + vmulesw 10, 6, \_vz0 + vmulosw 11, 6, \_vz0 + vmulesw 12, 7, \_vz1 + vmulosw 13, 7, \_vz1 + vmulesw 14, 8, \_vz2 + vmulosw 15, 8, \_vz2 + vmulesw 16, 9, \_vz3 + vmulosw 17, 9, \_vz3 + + /* Compute a*q^(-1) mod 2^32; the result is in the upper 32 bits of + the even pair */ + vmulosw 18, 10, QINV + vmulosw 19, 11, QINV + vmulosw 20, 12, QINV + vmulosw 21, 13, QINV + vmulosw 22, 14, QINV + vmulosw 23, 15, QINV + vmulosw 24, 16, QINV + vmulosw 25, 17, QINV + 
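+ /* + * A C-level sketch of the reduction performed here (a hedged restatement + * of the formula in the montgomery_reduce comment above, using the + * MLDSA_Q/MLDSA_QINV constants defined in this file): + * int32_t montgomery_reduce(int64_t a) + * { + * int32_t t = (int64_t)(int32_t)a * MLDSA_QINV; + * return (a - (int64_t)t * MLDSA_Q) >> 32; + * } + * The vmulosw by V_Q and vsubudm below compute a - t*Q in 64 bits; the + * final vmrgew keeps the upper 32 bits of each doubleword. + */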
+ vmulosw 18, 18, V_Q + vmulosw 19, 19, V_Q + vmulosw 20, 20, V_Q + vmulosw 21, 21, V_Q + vmulosw 22, 22, V_Q + vmulosw 23, 23, V_Q + vmulosw 24, 24, V_Q + vmulosw 25, 25, V_Q + + vsubudm 18, 10, 18 + vsubudm 19, 11, 19 + vsubudm 20, 12, 20 + vsubudm 21, 13, 21 + vsubudm 22, 14, 22 + vsubudm 23, 15, 23 + vsubudm 24, 16, 24 + vsubudm 25, 17, 25 + + vmrgew 10, 18, 19 + vmrgew 11, 20, 21 + vmrgew 12, 22, 23 + vmrgew 13, 24, 25 +.endm + +/* + * For Len=1, layer with 1-1-1-1 layout. + */ +.macro NTT_MREDUCE_41x + Load_next_4zetas + Load_41Coeffs + MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3 + Compute_4Coeff + PermWrite41 + addi 5, 5, 128 +.endm + +/* + * For Len=2, layer with 2-2-2-2 layout. + */ +.macro NTT_MREDUCE_42x + Load_next_4zetas + Load_42Coeffs + MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3 + Compute_4Coeff + PermWrite42 + addi 5, 5, 128 +.endm + +/* + * For Len=8 + */ +.macro NTT_MREDUCE_22x start next _vz0 _vz1 _vz2 _vz3 + Load_22Coeffs \start, \next + MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3 + Load_4Rj + Compute_4Coeff + Write_One +.endm + +/* + * For Len=128, 64, 32, 16 and 4. + */ +.macro NTT_MREDUCE_4x start next _vz0 _vz1 _vz2 _vz3 + Load_4Coeffs \start, \next + MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3 + Load_4Rj + Compute_4Coeff + Write_One +.endm + +/* + * mldsa_ntt_ppc(int32_t *r) + * Compute the forward NTT based on the following 8 layers - + * len = 128, 64, 32, 16, 8, 4, 2, 1. + * + * Each layer computes the coefficients on 2 legs, start and start + len*2 offsets. + * + * leg 1 leg 2 + * ----- ----- + * start start+len*2 + * start+next start+len*2+next + * start+next+next start+len*2+next+next + * start+next+next+next start+len*2+next+next+next + * + * Each computation loads 8 vectors, 4 for each leg. + * The final coefficient (t) from each vector of leg1 and leg2 then goes + * through the add/sub operations to obtain the final results. + * + * -> leg1 = leg1 + t, leg2 = leg1 - t + * + * The resulting coefficients are then stored back to each leg's offset. 
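+ * + * A C-level sketch of one forward butterfly (a hedged restatement of the + * two steps above, in the style of the C reference implementation): + * t = montgomery_reduce((int64_t)zeta * r[j + len]); + * r[j + len] = r[j] - t; + * r[j] = r[j] + t;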
+ * + * Each vector has the same corresponding zeta except len=2. + * + * len=2 has a 2-2-2-2 layout, which means every 2 32-bit coefficients have the same zeta. + * e.g. + * coeff vector a1 a2 a3 a4 a5 a6 a7 a8 + * zeta vector z1 z1 z2 z2 z3 z3 z4 z4 + * + * For len=2, each vector will get permuted to leg1 and leg2. Zeta is + * pre-arranged for leg1 and leg2. After the computation, each vector needs + * to be transposed back to its original 2-2-2-2 layout. + * + */ +.global mldsa_ntt_ppc +.align 4 +mldsa_ntt_ppc: + + SAVE_REGS + + /* load Q and Q_NEG_INV */ + addis 8,2,mldsa_consts@toc@ha + addi 8,8,mldsa_consts@toc@l + lvx V_Q, 0, 8 + li 10, QINV_OFFSET + lvx QINV, 10, 8 + + /* set zetas array */ + addi 14, 8, ZETA_NTT_OFFSET + + /* + * 1. len = 128, start = 0 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 128 + * 16 - 144 + * 32 - 160 + * ... + * 112 - 240 + * These are indexes to the 32 bits array + * + * r7 is len * 4 + */ + li 7, 512 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + + NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 448, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 2. len = 64, start = 0, 128 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 64 + * 16 - 80 + * 32 - 96 + * ... 
+ * 128 - 192 + * 144 - 208 + * 160 - 224 + * 176 - 240 + * These are indexes to the 32 bits array + */ + li 7, 256 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 576, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 640, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 704, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 3. len = 32, start = 0, 64, 128, 192 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 32 + * ... + * 64 - 96 + * ... + * 128 - 160 + * ... + * 192 - 224 + * ... + * + * These are indexes to the 32 bits array + */ + li 7, 128 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 576, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4x 832, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 4. len = 16, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 16 + * 32 - 48 + * 64 - 80 + * ... 
+ * 192 - 208 + * 224 - 240 + * + * These are indexes to the 32 bits array + */ + li 7, 64 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 640, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4x 896, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 5. len = 8, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 8 + * 32 - 40 + * 64 - 72 + * ... + * 192 - 200 + * 224 - 232 + * + * These are indexes to the 32 bits array + */ + + li 7, 32 + Load_next_4zetas + NTT_MREDUCE_22x 0, 16, V_Z0, V_Z0, V_Z1, V_Z1 + NTT_MREDUCE_22x 128, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_22x 256, 16, V_Z0, V_Z0, V_Z1, V_Z1 + NTT_MREDUCE_22x 384, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_22x 512, 16, V_Z0, V_Z0, V_Z1, V_Z1 + NTT_MREDUCE_22x 640, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_22x 768, 16, V_Z0, V_Z0, V_Z1, V_Z1 + NTT_MREDUCE_22x 896, 16, V_Z2, V_Z2, V_Z3, V_Z3 + +.align 4 + /* + * 6. len = 4, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 4 + * 32 - 36 + * 64 - 68 + * ... 
+ * 192 - 196 + * 224 - 228 + * + * These are indexes to the 32 bits array + */ + + li 7, 16 + + Load_next_4zetas + NTT_MREDUCE_4x 0, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 128, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 256, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 384, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 512, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 640, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 768, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4x 896, 32, V_Z0, V_Z1, V_Z2, V_Z3 + +.align 4 + /* + * 7. len = 2, start = 0, 4, 8, 12,...244, 248, 252 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 4 + * 8 - 12 + * 16 - 20 + * ... + * 240 - 244 + * 248 - 252 + * + * These are indexes to the 32 bits array + */ + mr 5, 3 + li 7, 8 + + li 10, 16 + li 11, 32 + li 12, 48 + li 15, 64 + li 16, 80 + li 17, 96 + li 18, 112 + + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + NTT_MREDUCE_42x + +.align 4 + /* + * 8. len = 1, start = 0, 2, 4, 6, 8, 10, 12,...254 + * + * Compute coefficients of the NTT based on the following sequences, + * 0, 1, 2, 3 + * 4, 5, 6, 7 + * 8, 9, 10, 11 + * 12, 13, 14, 15 + * ... + * 240, 241, 242, 243 + * 244, 245, 246, 247 + * 248, 249, 250, 251 + * 252, 253, 254, 255 + * + * These are indexes to the 32 bits array. Each loads 4 vectors. 
+ */ + mr 5, 3 + li 7, 4 + + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + NTT_MREDUCE_41x + + RESTORE_REGS + blr +.size mldsa_ntt_ppc,.-mldsa_ntt_ppc + +.rodata +.align 4 +mldsa_consts: +.long MLDSA_Q, MLDSA_Q, MLDSA_Q, MLDSA_Q +.long MLDSA_QINV, MLDSA_QINV, MLDSA_QINV, MLDSA_QINV + +/* zetas */ +mldsa_zetas: +.long 25847, 25847, 25847, 25847, -2608894, -2608894, -2608894, -2608894 +.long -518909, -518909, -518909, -518909, 237124, 237124, 237124, 237124 +.long -777960, -777960, -777960, -777960, -876248, -876248, -876248, -876248 +.long 466468, 466468, 466468, 466468, 1826347, 1826347, 1826347, 1826347 +.long 2353451, 2353451, 2353451, 2353451, -359251, -359251, -359251, -359251 +.long -2091905, -2091905, -2091905, -2091905, 3119733, 3119733, 3119733, 3119733 +.long -2884855, -2884855, -2884855, -2884855, 3111497, 3111497, 3111497, 3111497 +.long 2680103, 2680103, 2680103, 2680103, 2725464, 2725464, 2725464, 2725464 +.long 1024112, 1024112, 1024112, 1024112, -1079900, -1079900, -1079900, -1079900 +.long 3585928, 3585928, 3585928, 3585928, -549488, -549488, -549488, -549488 +.long -1119584, -1119584, -1119584, -1119584, 2619752, 2619752, 2619752, 2619752 +.long -2108549, -2108549, -2108549, -2108549, -2118186, -2118186, -2118186, -2118186 +.long -3859737, -3859737, -3859737, -3859737, -1399561, -1399561, -1399561, -1399561 +.long -3277672, -3277672, -3277672, -3277672, 1757237, 1757237, 1757237, 1757237 +.long -19422, -19422, -19422, -19422, 4010497, 4010497, 4010497, 4010497 +.long 280005, 280005, 280005, 280005 +/*For Len=4 */ +.long 2706023, 2706023, 2706023, 2706023, 95776, 95776, 95776, 95776 +.long 3077325, 3077325, 3077325, 3077325, 3530437, 3530437, 3530437, 3530437 +.long -1661693, -1661693, -1661693, -1661693, -3592148, -3592148, -3592148, -3592148 +.long -2537516, -2537516, -2537516, -2537516, 3915439, 3915439, 3915439, 3915439 +.long -3861115, -3861115, -3861115, -3861115, 
-3043716, -3043716, -3043716, -3043716 +.long 3574422, 3574422, 3574422, 3574422, -2867647, -2867647, -2867647, -2867647 +.long 3539968, 3539968, 3539968, 3539968, -300467, -300467, -300467, -300467 +.long 2348700, 2348700, 2348700, 2348700, -539299, -539299, -539299, -539299 +.long -1699267, -1699267, -1699267, -1699267, -1643818, -1643818, -1643818, -1643818 +.long 3505694, 3505694, 3505694, 3505694, -3821735, -3821735, -3821735, -3821735 +.long 3507263, 3507263, 3507263, 3507263, -2140649, -2140649, -2140649, -2140649 +.long -1600420, -1600420, -1600420, -1600420, 3699596, 3699596, 3699596, 3699596 +.long 811944, 811944, 811944, 811944, 531354, 531354, 531354, 531354 +.long 954230, 954230, 954230, 954230, 3881043, 3881043, 3881043, 3881043 +.long 3900724, 3900724, 3900724, 3900724, -2556880, -2556880, -2556880, -2556880 +.long 2071892, 2071892, 2071892, 2071892, -2797779, -2797779, -2797779, -2797779 +/* For Len=2 */ +.long -3930395, -3930395, -1528703, -1528703, -3677745, -3677745, -3041255, -3041255 +.long -1452451, -1452451, 3475950, 3475950, 2176455, 2176455, -1585221, -1585221 +.long -1257611, -1257611, 1939314, 1939314, -4083598, -4083598, -1000202, -1000202 +.long -3190144, -3190144, -3157330, -3157330, -3632928, -3632928, 126922, 126922 +.long 3412210, 3412210, -983419, -983419, 2147896, 2147896, 2715295, 2715295 +.long -2967645, -2967645, -3693493, -3693493, -411027, -411027, -2477047, -2477047 +.long -671102, -671102, -1228525, -1228525, -22981, -22981, -1308169, -1308169 +.long -381987, -381987, 1349076, 1349076, 1852771, 1852771, -1430430, -1430430 +.long -3343383, -3343383, 264944, 264944, 508951, 508951, 3097992, 3097992 +.long 44288, 44288, -1100098, -1100098, 904516, 904516, 3958618, 3958618 +.long -3724342, -3724342, -8578, -8578, 1653064, 1653064, -3249728, -3249728 +.long 2389356, 2389356, -210977, -210977, 759969, 759969, -1316856, -1316856 +.long 189548, 189548, -3553272, -3553272, 3159746, 3159746, -1851402, -1851402 +.long -2409325, 
-2409325, -177440, -177440, 1315589, 1315589, 1341330, 1341330 +.long 1285669, 1285669, -1584928, -1584928, -812732, -812732, -1439742, -1439742 +.long -3019102, -3019102, -3881060, -3881060, -3628969, -3628969, 3839961, 3839961 +/* Setup zetas for Len=1 as (3, 2, 1, 4) order */ +.long 2316500, 2091667, 3817976, 3407706, -2446433, -3342478, -3562462, 2244091 +.long -1235728, 266997, 3513181, 2434439, -1197226, -3520352, -3193378, -3759364 +.long 909542, 900702, 819034, 1859098, -43260, 495491, -522500, -1613174 +.long 2031748, -655327, 3207046, -3122442, -768622, -3556995, -3595838, -525098 +.long -2437823, 342297, 4108315, 286988, 1735879, 3437287, 203044, -3342277 +.long -2590150, 2842341, 1265009, 2691481, 2486353, 4055324, 1595974, 1247620 +.long 2635921, -3767016, -3548272, 1250494, 1903435, -2994039, -1050970, 1869119 +.long -3318210, -1333058, -1430225, 1237275, 3306115, -451100, -1962642, 1312455 +.long -2546312, -1279661, -1374803, 1917081, 2235880, 1500165, 3406031, 777191 +.long -1671176, -542412, -1846953, -2831860, 594136, -2584293, -3776993, -3724270 +.long 2454455, -2013608, -164721, 2432395, 185531, 1957272, -1207385, 3369112 +.long 1616392, -3183426, 3014001, 162844, -3694233, 810149, -1799107, 1652634 +.long 3866901, -3038916, 269760, 3523897, 1717735, 2213111, 472078, -975884 +.long -1803090, -426683, 1910376, 1723600, -260646, -1667432, -3833893, -1104333 +.long -420899, -2939036, -2286327, -2235985, 1612842, 183443, -3545687, -976891 +.long -48306, -554416, -1362209, 3919660, -846154, 3937738, 1976782, 1400424 -- 2.47.3 From dtsen at us.ibm.com Tue Feb 24 01:27:50 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Mon, 23 Feb 2026 18:27:50 -0600 Subject: [PATCH 2/5] dilithium: Added optimized dilithium inverse NTT support for ppc64le. 
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> References: <20260224002753.151873-1-dtsen@us.ibm.com> Message-ID: <20260224002753.151873-3-dtsen@us.ibm.com> Optimized dilithium (ML-DSA) inverse NTT algorithm for ppc64le (Power 8 and above). Signed-off-by: Danny Tsen --- cipher/dilithium_intt_p8le.S | 915 +++++++++++++++++++++++++++++++++++ 1 file changed, 915 insertions(+) create mode 100644 cipher/dilithium_intt_p8le.S diff --git a/cipher/dilithium_intt_p8le.S b/cipher/dilithium_intt_p8le.S new file mode 100644 index 00000000..b0f67979 --- /dev/null +++ b/cipher/dilithium_intt_p8le.S @@ -0,0 +1,915 @@ +/* + * This file was modified for use by Libgcrypt. + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * This file is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * SPDX-License-Identifier: LGPL-2.1-or-later + * + * You can also use this file under the same licence of original code. + * SPDX-License-Identifier: CC0 OR Apache-2.0 + * + */ +/* + * + * Copyright IBM Corp. 
2025, 2026 + * + * =================================================================================== + * Written by Danny Tsen + */ + +#define QINV_OFFSET 16 +#define FCONST_OFFSET 32 +#define ZETA_INTT_OFFSET 48 + +#define MLDSA_Q 8380417 +#define MLDSA_QINV 58728449 +#define FCONST 41978 + +#define QINV 0 +#define V_Q 1 +#define V_F 2 +#define V_ZETA 2 +#define V_Z0 2 +#define V_Z1 3 +#define V_Z2 4 +#define V_Z3 5 + +.machine "any" +.text + +.macro SAVE_REGS + stdu 1, -352(1) + mflr 0 + std 14, 56(1) + std 15, 64(1) + std 16, 72(1) + std 17, 80(1) + std 18, 88(1) + std 19, 96(1) + std 20, 104(1) + std 21, 112(1) + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + stxvx 32+20, 10, 1 + stxvx 32+21, 11, 1 + stxvx 32+22, 12, 1 + stxvx 32+23, 14, 1 + stxvx 32+24, 15, 1 + stxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + stxvx 32+26, 10, 1 + stxvx 32+27, 11, 1 + stxvx 32+28, 12, 1 + stxvx 32+29, 14, 1 +.endm + +.macro RESTORE_REGS + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + lxvx 32+20, 10, 1 + lxvx 32+21, 11, 1 + lxvx 32+22, 12, 1 + lxvx 32+23, 14, 1 + lxvx 32+24, 15, 1 + lxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + lxvx 32+26, 10, 1 + lxvx 32+27, 11, 1 + lxvx 32+28, 12, 1 + lxvx 32+29, 14, 1 + ld 14, 56(1) + ld 15, 64(1) + ld 16, 72(1) + ld 17, 80(1) + ld 18, 88(1) + ld 19, 96(1) + ld 20, 104(1) + ld 21, 112(1) + + mtlr 0 + addi 1, 1, 352 +.endm + +/* + * Init_Coeffs_offset: initial offset setup for the coefficient array. + * + * start: beginning of the offset to the coefficient array. + * next: Next offset. + * len: Index difference between coefficients. + * + * r7: len * 2, each coefficient component is 32 bits. 
+ * + * registers used for offset to coefficients, r[j] and r[j+len] + * R9: offset to r0 = j + * R16: offset to r1 = r0 + next + * R18: offset to r2 = r1 + next + * R20: offset to r3 = r2 + next + * + * R10: offset to r'0 = r0 + len*2 + * R17: offset to r'1 = r'0 + next + * R19: offset to r'2 = r'1 + next + * R21: offset to r'3 = r'2 + next + * + */ +.macro Init_Coeffs_offset start next + li 9, \start /* first offset to j */ + add 10, 7, 9 /* J + len*2 */ + addi 16, 9, \next + addi 17, 10, \next + addi 18, 16, \next + addi 19, 17, \next + addi 20, 18, \next + addi 21, 19, \next +.endm + +/* + * For Len=1, load 1-1-1-1 layout + * + * Load coefficients and set up vectors + * rj0, rjlen1, rj2, rjlen3 + * rj4, rjlen5, rj6, rjlen7 + * + * Each vmrgew and vmrgow will transpose vectors as, + * + * rj vector = (rj0, rj4, rj2, rj6) + * rjlen vector = (rjlen1, rjlen5, rjlen3, rjlen7) + * + * r' = r[j+len]: V18, V19, V20, V21 + * r = r[j]: V14, V15, V16, V17 + * + * To do the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication. + */ +.macro Load_41Coeffs + lxvd2x 32+10, 0, 5 + lxvd2x 32+11, 10, 5 + vmrgew 18, 10, 11 + vmrgow 14, 10, 11 + lxvd2x 32+12, 11, 5 + lxvd2x 32+13, 12, 5 + vmrgew 19, 12, 13 + vmrgow 15, 12, 13 + lxvd2x 32+10, 15, 5 + lxvd2x 32+11, 16, 5 + vmrgew 20, 10, 11 + vmrgow 16, 10, 11 + lxvd2x 32+12, 17, 5 + lxvd2x 32+13, 18, 5 + vmrgew 21, 12, 13 + vmrgow 17, 12, 13 +.endm + +/* + * For Len=2, Load 2 - 2 - 2 - 2 layout + * + * Load coefficients and set up vectors for 8 coefficients in the + * following order, + * rj0, rj1, rjlen2, rjlen3, + * rj4, rj5, rjlen6, rjlen7 + * Each xxpermdi will transpose vectors as, + * r[j]= rj0, rj1, rj4, rj5 + * r[j+len]= rjlen2, rjlen3, rjlen6, rjlen7 + * + * r' = r[j+len]: V18, V19, V20, V21 + * r = r[j]: V14, V15, V16, V17 + * + * To do the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication. 
+ */ +.macro Load_42Coeffs + lxvd2x 1, 0, 5 + lxvd2x 2, 10, 5 + xxpermdi 32+18, 1, 2, 3 + xxpermdi 32+14, 1, 2, 0 + lxvd2x 3, 11, 5 + lxvd2x 4, 12, 5 + xxpermdi 32+19, 3, 4, 3 + xxpermdi 32+15, 3, 4, 0 + lxvd2x 1, 15, 5 + lxvd2x 2, 16, 5 + xxpermdi 32+20, 1, 2, 3 + xxpermdi 32+16, 1, 2, 0 + lxvd2x 3, 17, 5 + lxvd2x 4, 18, 5 + xxpermdi 32+21, 3, 4, 3 + xxpermdi 32+17, 3, 4, 0 +.endm + +/* + * For Len=8, + * Load coefficients in 2 legs, 64 bytes apart, into + * r[j+len] (r') vectors from offset, R10, R17, R19 and R21 + * r[j] (r) vectors from offset, R9, R16, R18 and R20 + * r[j+len]: V18, V19, V20, V21 + * r = r[j]: V14, V15, V16, V17 + */ +.macro Load_22Coeffs start next + li 9, \start + add 10, 7, 9 + addi 16, 9, \next + addi 17, 10, \next + li 18, \start+64 + add 19, 7, 18 + addi 20, 18, \next + addi 21, 19, \next + lxvd2x 32+18, 3, 10 + lxvd2x 32+19, 3, 17 + lxvd2x 32+20, 3, 19 + lxvd2x 32+21, 3, 21 + + lxvd2x 32+14, 3, 9 + lxvd2x 32+15, 3, 16 + lxvd2x 32+16, 3, 18 + lxvd2x 32+17, 3, 20 +.endm + +/* + * Load coefficients in 2 legs, len*2 bytes apart, into + * r[j+len] (r') vectors from offset, R10, R17, R19 and R21 + * r[j] (r) vectors from offset, R9, R16, R18 and R20 + * r[j+len]: V18, V19, V20, V21 + * r = r[j]: V14, V15, V16, V17 + */ +.macro Load_4Coeffs start next + Init_Coeffs_offset \start, \next + + lxvd2x 32+18, 3, 10 + lxvd2x 32+19, 3, 17 + lxvd2x 32+20, 3, 19 + lxvd2x 32+21, 3, 21 + + lxvd2x 32+14, 3, 9 + lxvd2x 32+15, 3, 16 + lxvd2x 32+16, 3, 18 + lxvd2x 32+17, 3, 20 +.endm + +/* + * Compute the final r[j] and r[j+len] + * final r[j]: V26, V27, V28, V29 + * final r[j+len]: V6, V7, V8, V9 + */ +.macro Compute_4Coeff + vadduwm 26, 14, 18 + vsubuwm 6, 14, 18 + + vadduwm 27, 15, 19 + vsubuwm 7, 15, 19 + + vadduwm 28, 16, 20 + vsubuwm 8, 16, 20 + + vadduwm 29, 17, 21 + vsubuwm 9, 17, 21 +.endm + +.macro Write_One + stxvd2x 32+26, 3, 9 + stxvd2x 32+10, 3, 10 + stxvd2x 32+27, 3, 16 + stxvd2x 32+11, 3, 17 + stxvd2x 32+28, 3, 18 + stxvd2x 32+12, 3, 19 
+ stxvd2x 32+29, 3, 20 + stxvd2x 32+13, 3, 21 +.endm + +/* + * For Len=2 + * Transpose the final coefficients of 2-2-2-2 layout to the original + * coefficient array order. + */ +.macro PermWrite42 + xxpermdi 32+14, 32+26, 32+10, 0 + xxpermdi 32+15, 32+26, 32+10, 3 + xxpermdi 32+16, 32+27, 32+11, 0 + xxpermdi 32+17, 32+27, 32+11, 3 + xxpermdi 32+18, 32+28, 32+12, 0 + xxpermdi 32+19, 32+28, 32+12, 3 + xxpermdi 32+20, 32+29, 32+13, 0 + xxpermdi 32+21, 32+29, 32+13, 3 + stxvd2x 32+14, 0, 5 + stxvd2x 32+15, 10, 5 + stxvd2x 32+16, 11, 5 + stxvd2x 32+17, 12, 5 + stxvd2x 32+18, 15, 5 + stxvd2x 32+19, 16, 5 + stxvd2x 32+20, 17, 5 + stxvd2x 32+21, 18, 5 +.endm + +/* + * For Len=1 + * Transpose the final coefficients of 1-1-1-1 layout to the original + * coefficient array order. + */ +.macro PermWrite41 + vmrgew 14, 10, 26 + vmrgow 15, 10, 26 + vmrgew 16, 11, 27 + vmrgow 17, 11, 27 + vmrgew 18, 12, 28 + vmrgow 19, 12, 28 + vmrgew 20, 13, 29 + vmrgow 21, 13, 29 + stxvd2x 32+14, 0, 5 + stxvd2x 32+15, 10, 5 + stxvd2x 32+16, 11, 5 + stxvd2x 32+17, 12, 5 + stxvd2x 32+18, 15, 5 + stxvd2x 32+19, 16, 5 + stxvd2x 32+20, 17, 5 + stxvd2x 32+21, 18, 5 +.endm + +.macro Load_next_4zetas + li 10, 16 + li 11, 32 + li 12, 48 + lxvd2x 32+V_Z0, 0, 14 + lxvd2x 32+V_Z1, 10, 14 + lxvd2x 32+V_Z2, 11, 14 + lxvd2x 32+V_Z3, 12, 14 + addi 14, 14, 64 +.endm + +/* + * montgomery_reduce + * montgomery_reduce((int64_t)zeta * a[j + len]) + * a = zeta * a[j+len] + * t = (int64_t)(int32_t)a*QINV; + * t = (a - (int64_t)t*Q) >> 32; + * + * Or + * montgomery_reduce((int64_t)f * a[j]) + * + * ----------------------------------- + * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3) + */ +.macro MREDUCE_4x _vz0 _vz1 _vz2 _vz3 + /* The coefficient multiplications produce 64-bit products in + even and odd pairs */ + vmulesw 10, 6, \_vz0 + vmulosw 11, 6, \_vz0 + vmulesw 12, 7, \_vz1 + vmulosw 13, 7, \_vz1 + vmulesw 14, 8, \_vz2 + vmulosw 15, 8, \_vz2 + vmulesw 16, 9, \_vz3 + vmulosw 17, 9, \_vz3 + + /* Compute a*q^(-1) mod 2^32 
and results in the upper 32 bits of + even pair */ + vmulosw 18, 10, QINV + vmulosw 19, 11, QINV + vmulosw 20, 12, QINV + vmulosw 21, 13, QINV + vmulosw 22, 14, QINV + vmulosw 23, 15, QINV + vmulosw 24, 16, QINV + vmulosw 25, 17, QINV + + vmulosw 18, 18, V_Q + vmulosw 19, 19, V_Q + vmulosw 20, 20, V_Q + vmulosw 21, 21, V_Q + vmulosw 22, 22, V_Q + vmulosw 23, 23, V_Q + vmulosw 24, 24, V_Q + vmulosw 25, 25, V_Q + + vsubudm 18, 10, 18 + vsubudm 19, 11, 19 + vsubudm 20, 12, 20 + vsubudm 21, 13, 21 + vsubudm 22, 14, 22 + vsubudm 23, 15, 23 + vsubudm 24, 16, 24 + vsubudm 25, 17, 25 + + vmrgew 10, 18, 19 + vmrgew 11, 20, 21 + vmrgew 12, 22, 23 + vmrgew 13, 24, 25 +.endm + +/* + * For Len=1, layer with 1-1-1-1 layout. + */ +.macro iNTT_MREDUCE_41x + Load_next_4zetas + Load_41Coeffs + Compute_4Coeff + MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3 + PermWrite41 + addi 5, 5, 128 +.endm + +/* + * For Len=2, layer with 2-2-2-2 layout. + */ +.macro iNTT_MREDUCE_42x + Load_next_4zetas + Load_42Coeffs + Compute_4Coeff + MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3 + PermWrite42 + addi 5, 5, 128 +.endm + +/* + * For Len=8 + */ +.macro iNTT_MREDUCE_22x start next _vz0 _vz1 _vz2 _vz3 + Load_22Coeffs \start, \next + Compute_4Coeff + MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3 + Write_One +.endm + +/* + * For Len=128, 64, 32, 16 and 4. 
+ */ +.macro iNTT_MREDUCE_4x start next _vz0 _vz1 _vz2 _vz3 + Load_4Coeffs \start, \next + Compute_4Coeff + MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3 + Write_One +.endm + +.macro Reload_4coeffs + lxvd2x 32+6, 0, 6 + lxvd2x 32+7, 10, 6 + lxvd2x 32+8, 11, 6 + lxvd2x 32+9, 12, 6 +.endm + +.macro Write_F + stxvd2x 32+10, 0, 6 + stxvd2x 32+11, 10, 6 + stxvd2x 32+12, 11, 6 + stxvd2x 32+13, 12, 6 + addi 6, 6, 64 +.endm + +.macro POLY_Mont_Reduce_4x + Reload_4coeffs + MREDUCE_4x V_F, V_F, V_F, V_F + Write_F + + Reload_4coeffs + MREDUCE_4x V_F, V_F, V_F, V_F + Write_F + + Reload_4coeffs + MREDUCE_4x V_F, V_F, V_F, V_F + Write_F + + Reload_4coeffs + MREDUCE_4x V_F, V_F, V_F, V_F + Write_F +.endm + +/* + * mldsa_intt_ppc(int32_t *r) + * + * Compute the inverse NTT based on the following 8 layers - + * len = 1, 2, 4, 8, 16, 32, 64, 128. + * + * Each layer computes the coefficients on 2 legs, start and start + len*2 offsets. + * + * leg 1 leg 2 + * ----- ----- + * start start+len*2 + * start+next start+len*2+next + * start+next+next start+len*2+next+next + * start+next+next+next start+len*2+next+next+next + * + * Each computation loads 8 vectors, 4 for each leg. + * The coefficients from each vector of leg1 and leg2 first go through + * the add/sub operations; the subtracted leg2 results are then + * Montgomery-multiplied by the zetas to obtain the final results. + * + * -> leg1 = leg1 + leg2, leg2 = (leg1 - leg2) * zeta + * + * The resulting coefficients are then stored back to each leg's offset. + * + * Each vector has the same corresponding zeta except len=2. + * + * len=2 has a 2-2-2-2 layout, which means every 2 32-bit coefficients have the same zeta. + * e.g. + * coeff vector a1 a2 a3 a4 a5 a6 a7 a8 + * zeta vector z1 z1 z2 z2 z3 z3 z4 z4 + * + * For len=2, each vector will get permuted to leg1 and leg2. Zeta is + * pre-arranged for leg1 and leg2. After the computation, each vector needs + * to be transposed back to its original 2-2-2-2 layout. 
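+ * + * A C-level sketch of one inverse butterfly (a hedged restatement of + * Compute_4Coeff followed by MREDUCE_4x, in the style of the C reference + * implementation): + * t = r[j]; + * r[j] = t + r[j + len]; + * r[j + len] = t - r[j + len]; + * r[j + len] = montgomery_reduce((int64_t)zeta * r[j + len]);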
+ * + */ +.global mldsa_intt_ppc +.align 4 +mldsa_intt_ppc: + + SAVE_REGS + + /* load Q and Q_NEG_INV */ + addis 8,2,mldsa_consts@toc@ha + addi 8,8,mldsa_consts@toc@l + lvx V_Q, 0, 8 + li 10, QINV_OFFSET + lvx QINV, 10, 8 + + /* set zetas array */ + addi 14, 8, ZETA_INTT_OFFSET + +.align 4 + /* + * 1. len = 1, start = 0, 2, 4, 6, 8, 10, 12,...254 + * + * Compute coefficients of the inverse NTT based on the following sequences, + * 0, 1, 2, 3 + * 4, 5, 6, 7 + * 8, 9, 10, 11 + * 12, 13, 14, 15 + * ... + * 240, 241, 242, 243 + * 244, 245, 246, 247 + * 248, 249, 250, 251 + * 252, 253, 254, 255 + * + * These are indexes to the 32 bits array. Each loads 4 vectors. + */ + mr 5, 3 + li 7, 4 + + li 10, 16 + li 11, 32 + li 12, 48 + li 15, 64 + li 16, 80 + li 17, 96 + li 18, 112 + + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + iNTT_MREDUCE_41x + +.align 4 + /* + * 2. len = 2, start = 0, 4, 8, 12,...244, 248, 252 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 4 + * 8 - 12 + * 16 - 20 + * ... + * 240 - 244 + * 248 - 252 + * + * These are indexes to the 32 bits array + */ + mr 5, 3 + li 7, 8 + + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + iNTT_MREDUCE_42x + +.align 4 + /* + * 3. len = 4, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 4 + * 32 - 36 + * 64 - 68 + * ... 
+ * 192 - 196 + * 224 - 228 + * + * These are indexes to the 32 bits array + */ + + li 7, 16 + + Load_next_4zetas + iNTT_MREDUCE_4x 0, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*2, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*3, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*4, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*5, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*6, 32, V_Z0, V_Z1, V_Z2, V_Z3 + Load_next_4zetas + iNTT_MREDUCE_4x 128*7, 32, V_Z0, V_Z1, V_Z2, V_Z3 + +.align 4 + /* + * 4. len = 8, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 8 + * 32 - 40 + * 64 - 72 + * ... + * 192 - 200 + * 224 - 232 + * + * These are indexes to the 32 bits array + */ + + li 7, 32 + Load_next_4zetas + iNTT_MREDUCE_22x 0, 16, V_Z0, V_Z0, V_Z1, V_Z1 + iNTT_MREDUCE_22x 128, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + iNTT_MREDUCE_22x 128*2, 16, V_Z0, V_Z0, V_Z1, V_Z1 + iNTT_MREDUCE_22x 128*3, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + iNTT_MREDUCE_22x 128*4, 16, V_Z0, V_Z0, V_Z1, V_Z1 + iNTT_MREDUCE_22x 128*5, 16, V_Z2, V_Z2, V_Z3, V_Z3 + + Load_next_4zetas + iNTT_MREDUCE_22x 128*6, 16, V_Z0, V_Z0, V_Z1, V_Z1 + iNTT_MREDUCE_22x 128*7, 16, V_Z2, V_Z2, V_Z3, V_Z3 + +.align 4 + /* + * 5. len = 16, start = 0, 32, 64, 96, 128, 160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 16 + * 32 - 48 + * 64 - 80 + * ... 
+ * 192 - 208 + * 224 - 240 + * + * These are indexes to the 32 bits array + */ + li 7, 64 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*2, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*3, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*4, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*5, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*6, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 128*7, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 6. len = 32, start = 0, 64, 128, 192 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 32 + * ... + * 64 - 96 + * ... + * 128 - 160 + * ... + * 192 - 224 + * ... + * + * These are indexes to the 32 bits array + */ + li 7, 128 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 256+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 512+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 768+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 7. len = 64, start = 0, 128 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 64 + * 16 - 80 + * 32 - 96 + * ... 
+ * 128 - 192 + * 144 - 208 + * 160 - 224 + * 176 - 240 + * These are indexes to the 32 bits array + */ + li 7, 256 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + iNTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 512+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 512+128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 512+192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + /* + * 8. len = 128, start = 0 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 128 + * 16 - 144 + * 32 - 160 + * ... + * 112 - 240 + * These are indexes to the 32 bits array + */ + li 7, 512 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + + iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*2, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*3, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*4, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*5, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*6, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + iNTT_MREDUCE_4x 64*7, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + /* + * Montgomery reduce loops with constant f=41978 (mont^2/256) + * + * a[j] = montgomery_reduce((int64_t)f * a[j]) + */ + addi 10, 8, FCONST_OFFSET + lvx V_F, 0, 10 + + li 10, 16 + li 11, 32 + li 12, 48 + + mr 6, 3 + + POLY_Mont_Reduce_4x + POLY_Mont_Reduce_4x + POLY_Mont_Reduce_4x + POLY_Mont_Reduce_4x + + RESTORE_REGS + blr +.size mldsa_intt_ppc,.-mldsa_intt_ppc + +.rodata +.align 4 +mldsa_consts: +.long MLDSA_Q, MLDSA_Q, MLDSA_Q, MLDSA_Q +.long MLDSA_QINV, MLDSA_QINV, MLDSA_QINV, MLDSA_QINV +/* Constant for INTT, f=mont^2/256 */ +.long FCONST, FCONST, FCONST, FCONST + +/* zetas for qinv */ 
+mldsa_zetas: +/* Zetas for Lane=1: setup as (3, 2, 1, 4) order */ +.long -1400424, -1976782, -3937738, 846154, -3919660, 1362209, 554416, 48306 +.long 976891, 3545687, -183443, -1612842, 2235985, 2286327, 2939036, 420899 +.long 1104333, 3833893, 1667432, 260646, -1723600, -1910376, 426683, 1803090 +.long 975884, -472078, -2213111, -1717735, -3523897, -269760, 3038916, -3866901 +.long -1652634, 1799107, -810149, 3694233, -162844, -3014001, 3183426, -1616392 +.long -3369112, 1207385, -1957272, -185531, -2432395, 164721, 2013608, -2454455 +.long 3724270, 3776993, 2584293, -594136, 2831860, 1846953, 542412, 1671176 +.long -777191, -3406031, -1500165, -2235880, -1917081, 1374803, 1279661, 2546312 +.long -1312455, 1962642, 451100, -3306115, -1237275, 1430225, 1333058, 3318210 +.long -1869119, 1050970, 2994039, -1903435, -1250494, 3548272, 3767016, -2635921 +.long -1247620, -1595974, -4055324, -2486353, -2691481, -1265009, -2842341, 2590150 +.long 3342277, -203044, -3437287, -1735879, -286988, -4108315, -342297, 2437823 +.long 525098, 3595838, 3556995, 768622, 3122442, -3207046, 655327, -2031748 +.long 1613174, 522500, -495491, 43260, -1859098, -819034, -900702, -909542 +.long 3759364, 3193378, 3520352, 1197226, -2434439, -3513181, -266997, 1235728 +.long -2244091, 3562462, 3342478, 2446433, -3407706, -3817976, -2091667, -2316500 +/* For Len=2 */ +.long -3839961, -3839961, 3628969, 3628969, 3881060, 3881060, 3019102, 3019102 +.long 1439742, 1439742, 812732, 812732, 1584928, 1584928, -1285669, -1285669 +.long -1341330, -1341330, -1315589, -1315589, 177440, 177440, 2409325, 2409325 +.long 1851402, 1851402, -3159746, -3159746, 3553272, 3553272, -189548, -189548 +.long 1316856, 1316856, -759969, -759969, 210977, 210977, -2389356, -2389356 +.long 3249728, 3249728, -1653064, -1653064, 8578, 8578, 3724342, 3724342 +.long -3958618, -3958618, -904516, -904516, 1100098, 1100098, -44288, -44288 +.long -3097992, -3097992, -508951, -508951, -264944, -264944, 3343383, 3343383 +.long 
1430430, 1430430, -1852771, -1852771, -1349076, -1349076, 381987, 381987 +.long 1308169, 1308169, 22981, 22981, 1228525, 1228525, 671102, 671102 +.long 2477047, 2477047, 411027, 411027, 3693493, 3693493, 2967645, 2967645 +.long -2715295, -2715295, -2147896, -2147896, 983419, 983419, -3412210, -3412210 +.long -126922, -126922, 3632928, 3632928, 3157330, 3157330, 3190144, 3190144 +.long 1000202, 1000202, 4083598, 4083598, -1939314, -1939314, 1257611, 1257611 +.long 1585221, 1585221, -2176455, -2176455, -3475950, -3475950, 1452451, 1452451 +.long 3041255, 3041255, 3677745, 3677745, 1528703, 1528703, 3930395, 3930395 +/* For Lane=4 */ +.long 2797779, 2797779, 2797779, 2797779, -2071892, -2071892, -2071892, -2071892 +.long 2556880, 2556880, 2556880, 2556880, -3900724, -3900724, -3900724, -3900724 +.long -3881043, -3881043, -3881043, -3881043, -954230, -954230, -954230, -954230 +.long -531354, -531354, -531354, -531354, -811944, -811944, -811944, -811944 +.long -3699596, -3699596, -3699596, -3699596, 1600420, 1600420, 1600420, 1600420 +.long 2140649, 2140649, 2140649, 2140649, -3507263, -3507263, -3507263, -3507263 +.long 3821735, 3821735, 3821735, 3821735, -3505694, -3505694, -3505694, -3505694 +.long 1643818, 1643818, 1643818, 1643818, 1699267, 1699267, 1699267, 1699267 +.long 539299, 539299, 539299, 539299, -2348700, -2348700, -2348700, -2348700 +.long 300467, 300467, 300467, 300467, -3539968, -3539968, -3539968, -3539968 +.long 2867647, 2867647, 2867647, 2867647, -3574422, -3574422, -3574422, -3574422 +.long 3043716, 3043716, 3043716, 3043716, 3861115, 3861115, 3861115, 3861115 +.long -3915439, -3915439, -3915439, -3915439, 2537516, 2537516, 2537516, 2537516 +.long 3592148, 3592148, 3592148, 3592148, 1661693, 1661693, 1661693, 1661693 +.long -3530437, -3530437, -3530437, -3530437, -3077325, -3077325, -3077325, -3077325 +.long -95776, -95776, -95776, -95776, -2706023, -2706023, -2706023, -2706023 +/* zetas for other len */ +.long -280005, -280005, -280005, -280005, 
-4010497, -4010497, -4010497, -4010497 +.long 19422, 19422, 19422, 19422, -1757237, -1757237, -1757237, -1757237 +.long 3277672, 3277672, 3277672, 3277672, 1399561, 1399561, 1399561, 1399561 +.long 3859737, 3859737, 3859737, 3859737, 2118186, 2118186, 2118186, 2118186 +.long 2108549, 2108549, 2108549, 2108549, -2619752, -2619752, -2619752, -2619752 +.long 1119584, 1119584, 1119584, 1119584, 549488, 549488, 549488, 549488 +.long -3585928, -3585928, -3585928, -3585928, 1079900, 1079900, 1079900, 1079900 +.long -1024112, -1024112, -1024112, -1024112, -2725464, -2725464, -2725464, -2725464 +.long -2680103, -2680103, -2680103, -2680103, -3111497, -3111497, -3111497, -3111497 +.long 2884855, 2884855, 2884855, 2884855, -3119733, -3119733, -3119733, -3119733 +.long 2091905, 2091905, 2091905, 2091905, 359251, 359251, 359251, 359251 +.long -2353451, -2353451, -2353451, -2353451, -1826347, -1826347, -1826347, -1826347 +.long -466468, -466468, -466468, -466468, 876248, 876248, 876248, 876248 +.long 777960, 777960, 777960, 777960, -237124, -237124, -237124, -237124 +.long 518909, 518909, 518909, 518909, 2608894, 2608894, 2608894, 2608894 +.long -25847, -25847, -25847, -25847 -- 2.47.3 From dtsen at us.ibm.com Tue Feb 24 01:27:48 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Mon, 23 Feb 2026 18:27:48 -0600 Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for Message-ID: <20260224002753.151873-1-dtsen@us.ibm.com> Added optimized (i)NTT algorithm support for ppc64le (Power 8 and above). Defined ENABLE_PPC_DILITHIUM and ENABLE_PPC_KYBER for dilithium (ML-DSA) and kyber (ML-KEM) NTT and inverse NTT. Danny Tsen (5): dilithium: Added optimized dilithium NTT support for ppc64le. dilithium: Added optimized dilithium inverse NTT support for ppc64le. kyber: Added optimized kyber NTT support for ppc64le. kyber: Added optimized kyber inverse NTT support for ppc64le. dilithium-kyber: Added ppc64le dilithium and kyber (i)NTT support. 
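The core primitive these (i)NTT kernels vectorize is Montgomery reduction. As background, a scalar sketch in C is below; the constants match the ML-KEM values used in the patches' .rodata tables (q = 3329, QINV = -3327 = q^-1 mod 2^16), while the function names (montgomery_reduce, fqmul) follow the ML-KEM reference code and are not defined by this series:

```c
#include <stdint.h>

#define MLKEM_Q    3329   /* ML-KEM modulus q */
#define MLKEM_QINV -3327  /* q^-1 mod 2^16 (62209 as unsigned) */

/* Montgomery reduction: for a in (-2^15 * q, 2^15 * q), returns a value
 * congruent to a * 2^-16 mod q, bounded by q in absolute value.  Relies
 * on arithmetic right shift of negative values, as the reference
 * implementation does. */
static int16_t montgomery_reduce(int32_t a)
{
    int16_t t = (int16_t)a * MLKEM_QINV;   /* low 16 bits of a * QINV */
    t = (a - (int32_t)t * MLKEM_Q) >> 16;  /* (a - t*q) is a multiple of 2^16 */
    return t;
}

/* fqmul: multiply then reduce -- the per-coefficient butterfly step that
 * the vector macros perform on eight 16-bit lanes across four vectors. */
static int16_t fqmul(int16_t a, int16_t b)
{
    return montgomery_reduce((int32_t)a * b);
}
```

For instance, montgomery_reduce(65536) yields 1 (since 2^16 * 2^-16 = 1 mod q) and montgomery_reduce(1) yields 169, the representative of 2^-16 mod 3329.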
cipher/dilithium-common.c | 13 + cipher/dilithium_intt_p8le.S | 915 +++++++++++++++++++++++++++++++++++ cipher/dilithium_ntt_p8le.S | 859 ++++++++++++++++++++++++++++++++ cipher/kyber-common.c | 13 + cipher/kyber_intt_p8le.S | 878 +++++++++++++++++++++++++++++++++ cipher/kyber_ntt_p8le.S | 716 +++++++++++++++++++++++++++ configure.ac | 20 + 7 files changed, 3414 insertions(+) create mode 100644 cipher/dilithium_intt_p8le.S create mode 100644 cipher/dilithium_ntt_p8le.S create mode 100644 cipher/kyber_intt_p8le.S create mode 100644 cipher/kyber_ntt_p8le.S -- 2.47.3 From dtsen at us.ibm.com Tue Feb 24 01:27:51 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Mon, 23 Feb 2026 18:27:51 -0600 Subject: [PATCH 3/5] kyber: Added optimized kyber NTT support for ppc64le. In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> References: <20260224002753.151873-1-dtsen@us.ibm.com> Message-ID: <20260224002753.151873-4-dtsen@us.ibm.com> Optimized kyber (ML-KEM) NTT algorithm for ppc64le (Power 8 and above). Signed-off-by: Danny Tsen --- cipher/kyber_ntt_p8le.S | 716 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 716 insertions(+) create mode 100644 cipher/kyber_ntt_p8le.S diff --git a/cipher/kyber_ntt_p8le.S b/cipher/kyber_ntt_p8le.S new file mode 100644 index 00000000..401598f2 --- /dev/null +++ b/cipher/kyber_ntt_p8le.S @@ -0,0 +1,716 @@ +/* + * This file was modified for use by Libgcrypt. + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * This file is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * SPDX-License-Identifier: LGPL-2.1-or-later + * + * You can also use this file under the same licence of original code. + * SPDX-License-Identifier: CC0 OR Apache-2.0 + * + */ +/* + * Copyright IBM Corp. 2025, 2026 + * + * =================================================================================== + * Written by Danny Tsen + */ + +#define QINV_OFFSET 16 +#define ZETA_NTT_OFFSET 32 + +#define V_QINV 2 +#define V_NMKQ 5 +#define V_Z0 7 +#define V_Z1 8 +#define V_Z2 9 +#define V_Z3 10 +#define V_ZETA 10 + +.machine "any" +.text + +.macro SAVE_REGS + stdu 1, -352(1) + mflr 0 + std 14, 56(1) + std 15, 64(1) + std 16, 72(1) + std 17, 80(1) + std 18, 88(1) + std 19, 96(1) + std 20, 104(1) + std 21, 112(1) + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + stxvx 32+20, 10, 1 + stxvx 32+21, 11, 1 + stxvx 32+22, 12, 1 + stxvx 32+23, 14, 1 + stxvx 32+24, 15, 1 + stxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + li 15, 288 + li 16, 304 + stxvx 32+26, 10, 1 + stxvx 32+27, 11, 1 + stxvx 32+28, 12, 1 + stxvx 32+29, 14, 1 + stxvx 32+30, 15, 1 + stxvx 32+31, 16, 1 +.endm + +.macro RESTORE_REGS + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + lxvx 32+20, 10, 1 + lxvx 32+21, 11, 1 + lxvx 32+22, 12, 1 + lxvx 32+23, 14, 1 + lxvx 32+24, 15, 1 + lxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + li 15, 288 + li 16, 304 + lxvx 32+26, 10, 1 + lxvx 32+27, 11, 1 + lxvx 32+28, 12, 1 + lxvx 32+29, 14, 1 + lxvx 32+30, 15, 1 + lxvx 32+31, 16, 1 + ld 14, 56(1) + ld 15, 64(1) + ld 16, 72(1) + ld 17, 80(1) + ld 18, 88(1) + ld 19, 96(1) + ld 20, 104(1) + ld 21, 112(1) + + mtlr 0 + addi 1, 1, 352 +.endm + +/* + * Init_Coeffs_offset: initial offset setup for the coefficient array. + * + * start: beginning of the offset to the coefficient array. + * next: Next offset. 
+ * len: Index difference between coefficients. + * + * r7: len * 2, each coefficient component is 2 bytes. + * + * registers used for offset to coefficients, r[j] and r[j+len] + * R9: offset to r0 = j + * R16: offset to r1 = r0 + next + * R18: offset to r2 = r1 + next + * R20: offset to r3 = r2 + next + * + * R10: offset to r'0 = r0 + len*2 + * R17: offset to r'1 = r'0 + step + * R19: offset to r'2 = r'1 + step + * R21: offset to r'3 = r'2 + step + * + */ +.macro Init_Coeffs_offset start next + li 9, \start /* first offset to j */ + add 10, 7, 9 /* j + len*2 */ + addi 16, 9, \next + addi 17, 10, \next + addi 18, 16, \next + addi 19, 17, \next + addi 20, 18, \next + addi 21, 19, \next +.endm + +/* + * Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21 + * r[j+len]: V13, V18, V23, V28 + */ +.macro Load_4Rjp + lxvd2x 32+13, 3, 10 /* V13: vector for r'0 */ + lxvd2x 32+18, 3, 17 /* V18: vector for r'1 */ + lxvd2x 32+23, 3, 19 /* V23: vector for r'2 */ + lxvd2x 32+28, 3, 21 /* V28: vector for r'3 */ +.endm + +/* + * Load Coefficients and set up vectors for 8 coefficients in the + * following order, + * rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7 + */ +.macro Load_4Coeffs start next + Init_Coeffs_offset \start \next + Load_4Rjp +.endm + +/* + * Load 2 - 2 - 2 - 2 layout + * + * Load Coefficients and set up vectors for 8 coefficients in the + * following order, + * rj0, rj1, rjlen2, rjlen3, rj4, rj5, rjlen6, rjlen7 + * rj8, rj9, rjlen10, rjlen11, rj12, rj13, rjlen14, rjlen15 + * Each vmrgew and vmrgow will transpose vectors as, + * r[j]= rj0, rj1, rj8, rj9, rj4, rj5, rj12, rj13 + * r[j+len]= rjlen2, rjlen3, rjlen10, rjlen11, rjlen6, rjlen7, rjlen14, rjlen15 + * + * r[j+len]: V13, V18, V23, V28 + * r[j]: V12, V17, V22, V27 + * + * For the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication.
+ */ +.macro Load_L24Coeffs + lxvd2x 32+25, 0, 5 + lxvd2x 32+26, 10, 5 + vmrgew 13, 25, 26 + vmrgow 12, 25, 26 + lxvd2x 32+25, 11, 5 + lxvd2x 32+26, 12, 5 + vmrgew 18, 25, 26 + vmrgow 17, 25, 26 + lxvd2x 32+25, 15, 5 + lxvd2x 32+26, 16, 5 + vmrgew 23, 25, 26 + vmrgow 22, 25, 26 + lxvd2x 32+25, 17, 5 + lxvd2x 32+26, 18, 5 + vmrgew 28, 25, 26 + vmrgow 27, 25, 26 +.endm + +/* + * Load 4 - 4 layout + * + * Load Coefficients and set up vectors for 8 coefficients in the + * following order, + * rj0, rj1, rj2, rj3, rjlen4, rjlen5, rjlen6, rjlen7 + * rj8, rj9, rj10, rj11, rjlen12, rjlen13, rjlen14, rjlen15 + * + * Each xxpermdi will transpose vectors as, + * rjlen4, rjlen5, rjlen6, rjlen7, rjlen12, rjlen13, rjlen14, rjlen15 + * rj0, rj1, rj2, rj3, rj8, rj9, rj10, rj11 + * + * For the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication. + */ +.macro Load_L44Coeffs + lxvd2x 1, 0, 5 + lxvd2x 2, 10, 5 + xxpermdi 32+13, 2, 1, 3 + xxpermdi 32+12, 2, 1, 0 + lxvd2x 3, 11, 5 + lxvd2x 4, 12, 5 + xxpermdi 32+18, 4, 3, 3 + xxpermdi 32+17, 4, 3, 0 + lxvd2x 1, 15, 5 + lxvd2x 2, 16, 5 + xxpermdi 32+23, 2, 1, 3 + xxpermdi 32+22, 2, 1, 0 + lxvd2x 3, 17, 5 + lxvd2x 4, 18, 5 + xxpermdi 32+28, 4, 3, 3 + xxpermdi 32+27, 4, 3, 0 +.endm + +/* + * montgomery_reduce + * t = a * QINV + * t = (a - (int32_t)t*_MLKEM_Q) >> 16 + * + * ----------------------------------- + * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3) + */ +.macro MREDUCE_4X _vz0 _vz1 _vz2 _vz3 + /* fqmul = zeta * coefficient + Modular multiplication bounded by 2^16 * q in abs value */ + vmladduhm 15, 13, \_vz0, 3 + vmladduhm 20, 18, \_vz1, 3 + vmladduhm 25, 23, \_vz2, 3 + vmladduhm 30, 28, \_vz3, 3 + + /* Signed multiply-high-round; outputs are bounded by 2^15 * q in abs value */ + vmhraddshs 14, 13, \_vz0, 3 + vmhraddshs 19, 18, \_vz1, 3 + vmhraddshs 24, 23, \_vz2, 3 + vmhraddshs 29, 28, \_vz3, 3 + + vmladduhm 15, 15, V_QINV, 3 + vmladduhm 20, 20, V_QINV, 3 + vmladduhm 25, 25, V_QINV, 3 +
vmladduhm 30, 30, V_QINV, 3 + + vmhraddshs 15, 15, V_NMKQ, 14 + vmhraddshs 20, 20, V_NMKQ, 19 + vmhraddshs 25, 25, V_NMKQ, 24 + vmhraddshs 30, 30, V_NMKQ, 29 + + /* Shift right 1 bit */ + vsrah 13, 15, 4 + vsrah 18, 20, 4 + vsrah 23, 25, 4 + vsrah 28, 30, 4 +.endm + +/* + * Load 4 r[j] (r) coefficient vectors: + * Load coefficient in vectors from offset, R9, R16, R18 and R20 + * r[j]: V12, V17, V22, V27 + */ +.macro Load_4Rj + lxvd2x 32+12, 3, 9 /* V12: vector r0 */ + lxvd2x 32+17, 3, 16 /* V17: vector r1 */ + lxvd2x 32+22, 3, 18 /* V22: vector r2 */ + lxvd2x 32+27, 3, 20 /* V27: vector r3 */ +.endm + +/* + * Compute the final r[j] and r[j+len] + * final r[j+len]: V16, V21, V26, V31 + * final r[j]: V15, V20, V25, V30 + */ +.macro Compute_4Coeffs + /* The result of the Montgomery multiplication is bounded + by q in absolute value. + Complete the final update of the results with add/sub: + r[j] = r[j] + t + r[j+len] = r[j] - t + */ + vsubuhm 16, 12, 13 + vadduhm 15, 13, 12 + vsubuhm 21, 17, 18 + vadduhm 20, 18, 17 + vsubuhm 26, 22, 23 + vadduhm 25, 23, 22 + vsubuhm 31, 27, 28 + vadduhm 30, 28, 27 +.endm + +.macro Write_One + stxvd2x 32+15, 3, 9 + stxvd2x 32+16, 3, 10 + stxvd2x 32+20, 3, 16 + stxvd2x 32+21, 3, 17 + stxvd2x 32+25, 3, 18 + stxvd2x 32+26, 3, 19 + stxvd2x 32+30, 3, 20 + stxvd2x 32+31, 3, 21 +.endm + +/* + * Transpose the final coefficients of 4-4 layout to the original + * coefficient array order. + */ +.macro PermWriteL44 + Compute_4Coeffs + xxpermdi 0, 32+15, 32+16, 3 + xxpermdi 1, 32+15, 32+16, 0 + xxpermdi 2, 32+20, 32+21, 3 + xxpermdi 3, 32+20, 32+21, 0 + xxpermdi 4, 32+25, 32+26, 3 + xxpermdi 5, 32+25, 32+26, 0 + xxpermdi 6, 32+30, 32+31, 3 + xxpermdi 7, 32+30, 32+31, 0 + stxvd2x 0, 0, 5 + stxvd2x 1, 10, 5 + stxvd2x 2, 11, 5 + stxvd2x 3, 12, 5 + stxvd2x 4, 15, 5 + stxvd2x 5, 16, 5 + stxvd2x 6, 17, 5 + stxvd2x 7, 18, 5 +.endm + +/* + * Transpose the final coefficients of 2-2-2-2 layout to the original + * coefficient array order.
+ */ +.macro PermWriteL24 + Compute_4Coeffs + vmrgew 10, 16, 15 + vmrgow 11, 16, 15 + vmrgew 12, 21, 20 + vmrgow 13, 21, 20 + vmrgew 14, 26, 25 + vmrgow 15, 26, 25 + vmrgew 16, 31, 30 + vmrgow 17, 31, 30 + stxvd2x 32+10, 0, 5 + stxvd2x 32+11, 10, 5 + stxvd2x 32+12, 11, 5 + stxvd2x 32+13, 12, 5 + stxvd2x 32+14, 15, 5 + stxvd2x 32+15, 16, 5 + stxvd2x 32+16, 17, 5 + stxvd2x 32+17, 18, 5 +.endm + +.macro Load_next_4zetas + li 10, 16 + li 11, 32 + li 12, 48 + lxvd2x 32+V_Z0, 0, 14 + lxvd2x 32+V_Z1, 10, 14 + lxvd2x 32+V_Z2, 11, 14 + lxvd2x 32+V_Z3, 12, 14 + addi 14, 14, 64 +.endm + +/* + * Re-ordering of the 4-4 layout zetas. + * Swap double-words. + */ +.macro Perm_4zetas + xxpermdi 32+V_Z0, 32+V_Z0, 32+V_Z0, 2 + xxpermdi 32+V_Z1, 32+V_Z1, 32+V_Z1, 2 + xxpermdi 32+V_Z2, 32+V_Z2, 32+V_Z2, 2 + xxpermdi 32+V_Z3, 32+V_Z3, 32+V_Z3, 2 +.endm + +/* + * NTT layer Len=2. + */ +.macro NTT_REDUCE_L24 + Load_next_4zetas + Load_L24Coeffs + MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3 + PermWriteL24 + addi 5, 5, 128 +.endm + +/* + * NTT layer Len=4. + */ +.macro NTT_REDUCE_L44 + Load_next_4zetas + Perm_4zetas + Load_L44Coeffs + MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3 + PermWriteL44 + addi 5, 5, 128 +.endm + +/* + * NTT other layers. + */ +.macro NTT_MREDUCE_4X start next _vz0 _vz1 _vz2 _vz3 + Load_4Coeffs \start, \next + MREDUCE_4X \_vz0, \_vz1, \_vz2, \_vz3 + Load_4Rj + Compute_4Coeffs + Write_One +.endm + +/* + * ntt_ppc(int16_t *r) + * Compute forward NTT based on the following 7 layers - + * len = 128, 64, 32, 16, 8, 4, 2. + * + * Each layer computes the coefficients on 2 legs, start and start + len*2 offsets. + * + * leg 1 leg 2 + * ----- ----- + * start start+len*2 + * start+next start+len*2+next + * start+next+next start+len*2+next+next + * start+next+next+next start+len*2+next+next+next + * + * Each computation loads 8 vectors, 4 for each leg. + * The final coefficients (t) from each vector of leg1 and leg2 then go + * through the add/sub operations to obtain the final results.
+ * + * -> leg1 = leg1 + t, leg2 = leg1 - t + * + * The resulting coefficients are then stored back to each leg's offset. + * + * Each vector has the same corresponding zeta except len=4 and len=2. + * + * len=4 has a 4-4 layout, which means every 4 16-bit coefficients share the same zeta, + * and len=2 has a 2-2-2-2 layout, which means every 2 16-bit coefficients share the same zeta. + * e.g. + * coeff vector a1 a2 a3 a4 a5 a6 a7 a8 + * zeta vector z1 z1 z2 z2 z3 z3 z4 z4 + * + * For len=4 and len=2, each vector will get permuted to leg1 and leg2. Zeta is + * pre-arranged for leg1 and leg2. After the computation, each vector needs + * to be transposed back to its original 4-4 or 2-2-2-2 layout. + * + */ +.global ntt_ppc +.align 4 +ntt_ppc: +.localentry ntt_ppc,.-ntt_ppc + + SAVE_REGS + + addis 8,2,mlkem_consts@toc@ha + addi 8,8,mlkem_consts@toc@l + lvx V_NMKQ,0,8 + + addi 14, 8, ZETA_NTT_OFFSET + + vxor 3, 3, 3 + vspltish 4, 1 + + li 10, QINV_OFFSET + lvx V_QINV, 10, 8 + +.align 4 + /* + * 1. len = 128, start = 0 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 128 + * 32 - 160 + * 64 - 192 + * 96 - 224 + * + * These are indexes to the 16 bits array + */ + li 7, 256 /* len * 2 */ + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + + NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4X 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4X 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4X 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 2.
len = 64, start = 0, 128 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 64 + * 32 - 96 + * 128 - 192 + * 160 - 224 + * + * These are indexes to the 16 bits array + */ + li 7, 128 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4X 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + NTT_MREDUCE_4X 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 3. len = 32, start = 0, 64, 128, 192 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 32 + * 64 - 96 + * 128 - 160 + * 192 - 224 + * + * These are indexes to the 16 bits array + */ + li 7, 64 + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + + lvx V_ZETA, 0, 14 + addi 14, 14, 16 + NTT_MREDUCE_4X 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA + +.align 4 + /* + * 4. len = 16, start = 0, 8, 128, 136 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 16 + * 8 - 24 + * 128 - 144 + * 136 - 152 + * + * These are indexes to the 16 bits array + */ + li 7, 32 + Load_next_4zetas + NTT_MREDUCE_4X 0, 64, V_Z0, V_Z1, V_Z2, V_Z3 + NTT_MREDUCE_4X 16, 64, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4X 256, 64, V_Z0, V_Z1, V_Z2, V_Z3 + NTT_MREDUCE_4X 272, 64, V_Z0, V_Z1, V_Z2, V_Z3 + +.align 4 + /* + * 5. 
len = 8, start = 0, 64, 128, 192 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 8 + * 64 - 72 + * 128 - 136 + * 192 - 200 + * + * These are indexes to the 16 bits array + */ + li 7, 16 + Load_next_4zetas + NTT_MREDUCE_4X 0, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4X 128, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4X 256, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + Load_next_4zetas + NTT_MREDUCE_4X 384, 32, V_Z0, V_Z1, V_Z2, V_Z3 + + /* + * 6. len = 4, start = 0, 8, 16, 24,...232, 240, 248 + * Load zeta vectors in 4-4 layout + * + * Compute coefficients of the NTT based on the following sequences, + * 0, 1, 2, 3, 4, 5, 6, 7 + * 8, 9, 10, 11, 12, 13, 14, 15 + * ... + * 240, 241, 242, 243, 244, 245, 246, 247 + * 248, 249, 250, 251, 252, 253, 254, 255 + * + * These are indexes to the 16 bits array. Each loads 4 vectors. + */ + mr 5, 3 /* Let r5 points to coefficient array */ + li 7, 8 + + li 10, 16 + li 11, 32 + li 12, 48 + li 15, 64 + li 16, 80 + li 17, 96 + li 18, 112 + +.align 4 + NTT_REDUCE_L44 + NTT_REDUCE_L44 + NTT_REDUCE_L44 + NTT_REDUCE_L44 + + /* + * 7. len = 2, start = 0, 4, 8, 12,...244, 248, 252 + * Load zeta vectors in 2-2-2-2 layout + * + * Compute coefficients of the NTT based on the following sequences, + * 0, 1, 2, 3, 4, 5, 6, 7 + * 8, 9, 10, 11, 12, 13, 14, 15 + * ... + * 240, 241, 242, 243, 244, 245, 246, 247 + * 248, 249, 250, 251, 252, 253, 254, 255 + * + * These are indexes to the 16 bits array. Each loads 4 vectors. 
+ */ + mr 5, 3 /* Let r5 points to coefficient array */ + li 7, 4 + +.align 4 + NTT_REDUCE_L24 + NTT_REDUCE_L24 + NTT_REDUCE_L24 + NTT_REDUCE_L24 + + RESTORE_REGS + blr +.size ntt_ppc,.-ntt_ppc + +.rodata +.align 4 +mlkem_consts: +/* -Q */ +.short -3329, -3329, -3329, -3329, -3329, -3329, -3329, -3329 +/* QINV */ +.short -3327, -3327, -3327, -3327, -3327, -3327, -3327, -3327 + +/* zetas */ +mlkem_zetas: +/* For ntt Len=128, offset 96 */ +.short -758, -758, -758, -758, -758, -758, -758, -758, -359, -359, -359, -359 +.short -359, -359, -359, -359, -1517, -1517, -1517, -1517, -1517, -1517, -1517 +.short -1517, 1493, 1493, 1493, 1493, 1493, 1493, 1493, 1493, 1422, 1422, 1422 +.short 1422, 1422, 1422, 1422, 1422, 287, 287, 287, 287, 287, 287, 287, 287, 202 +.short 202, 202, 202, 202, 202, 202, 202, -171, -171, -171, -171, -171, -171, -171 +.short -171, 622, 622, 622, 622, 622, 622, 622, 622, 1577, 1577, 1577, 1577, 1577 +.short 1577, 1577, 1577, 182, 182, 182, 182, 182, 182, 182, 182, 962, 962, 962 +.short 962, 962, 962, 962, 962, -1202, -1202, -1202, -1202, -1202, -1202, -1202 +.short -1202, -1474, -1474, -1474, -1474, -1474, -1474, -1474, -1474, 1468, 1468 +.short 1468, 1468, 1468, 1468, 1468, 1468, 573, 573, 573, 573, 573, 573, 573, 573 +.short -1325, -1325, -1325, -1325, -1325, -1325, -1325, -1325, 264, 264, 264, 264 +.short 264, 264, 264, 264, 383, 383, 383, 383, 383, 383, 383, 383, -829, -829 +.short -829, -829, -829, -829, -829, -829, 1458, 1458, 1458, 1458, 1458, 1458 +.short 1458, 1458, -1602, -1602, -1602, -1602, -1602, -1602, -1602, -1602, -130 +.short -130, -130, -130, -130, -130, -130, -130, -681, -681, -681, -681, -681 +.short -681, -681, -681, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 732, 732 +.short 732, 732, 732, 732, 732, 732, 608, 608, 608, 608, 608, 608, 608, 608, -1542 +.short -1542, -1542, -1542, -1542, -1542, -1542, -1542, 411, 411, 411, 411, 411 +.short 411, 411, 411, -205, -205, -205, -205, -205, -205, -205, -205, -1571, -1571 +.short 
-1571, -1571, -1571, -1571, -1571, -1571 +/* For Len=4 */ +.short 1223, 1223, 1223, 1223, 652, 652, 652, 652, -552, -552, -552, -552, 1015 +.short 1015, 1015, 1015, -1293, -1293, -1293, -1293, 1491, 1491, 1491, 1491, -282 +.short -282, -282, -282, -1544, -1544, -1544, -1544, 516, 516, 516, 516, -8, -8 +.short -8, -8, -320, -320, -320, -320, -666, -666, -666, -666, -1618, -1618, -1618 +.short -1618, -1162, -1162, -1162, -1162, 126, 126, 126, 126, 1469, 1469, 1469 +.short 1469, -853, -853, -853, -853, -90, -90, -90, -90, -271, -271, -271, -271 +.short 830, 830, 830, 830, 107, 107, 107, 107, -1421, -1421, -1421, -1421, -247 +.short -247, -247, -247, -951, -951, -951, -951, -398, -398, -398, -398, 961, 961 +.short 961, 961, -1508, -1508, -1508, -1508, -725, -725, -725, -725, 448, 448, 448 +.short 448, -1065, -1065, -1065, -1065, 677, 677, 677, 677, -1275, -1275, -1275 +.short -1275 +/* + * For ntt Len=2 + * reorder zeta array, (1, 2, 3, 4) -> (3, 1, 4, 2) + * Transpose z[0], z[1], z[2], z[3] + * -> z[3], z[3], z[1], z[1], z[4], z[4], z[2], z[2] + */ +.short 555, 555, -1103, -1103, 843, 843, 430, 430, 1550, 1550, -1251, -1251, 105 +.short 105, 871, 871, 177, 177, 422, 422, -235, -235, 587, 587, 1574, 1574, -291 +.short -291, 1653, 1653, -460, -460, 1159, 1159, -246, -246, -147, -147, 778, 778 +.short -602, -602, -777, -777, 1119, 1119, 1483, 1483, -872, -872, -1590, -1590 +.short 349, 349, 644, 644, -156, -156, 418, 418, -75, -75, 329, 329, 603, 603, 817 +.short 817, 610, 610, 1097, 1097, -1465, -1465, 1322, 1322, 384, 384, -1285, -1285 +.short 1218, 1218, -1215, -1215, -1335, -1335, -136, -136, -1187, -1187, -874 +.short -874, -1659, -1659, 220, 220, -1278, -1278, -1185, -1185, 794, 794, -1530 +.short -1530, -870, -870, -1510, -1510, 478, 478, -854, -854, 996, 996, -108, -108 +.short 991, 991, -308, -308, 1522, 1522, 958, 958, 1628, 1628, -1460, -1460 -- 2.47.3 From dtsen at us.ibm.com Tue Feb 24 01:27:52 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Mon, 23 Feb 
2026 18:27:52 -0600 Subject: [PATCH 4/5] kyber: Added optimized kyber inverse NTT support for ppc64le. In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> References: <20260224002753.151873-1-dtsen@us.ibm.com> Message-ID: <20260224002753.151873-5-dtsen@us.ibm.com> Optimized kyber (ML-KEM) inverse NTT algorithm for ppc64le (Power 8 and above). Signed-off-by: Danny Tsen --- cipher/kyber_intt_p8le.S | 878 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 878 insertions(+) create mode 100644 cipher/kyber_intt_p8le.S diff --git a/cipher/kyber_intt_p8le.S b/cipher/kyber_intt_p8le.S new file mode 100644 index 00000000..c46412aa --- /dev/null +++ b/cipher/kyber_intt_p8le.S @@ -0,0 +1,878 @@ +/* + * This file was modified for use by Libgcrypt. + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * This file is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * SPDX-License-Identifier: LGPL-2.1-or-later + * + * You can also use this file under the same licence of original code. + * SPDX-License-Identifier: CC0 OR Apache-2.0 + * + */ +/* + * Copyright IBM Corp. 
2025, 2026 + * + * =================================================================================== + * Written by Danny Tsen + */ + +.machine "any" +.text + +#define QINV_OFFSET 16 +#define Q_OFFSET 32 +#define C20159_OFFSET 48 +#define C1441_OFFSET 64 +#define ZETA_INTT_OFFSET 80 + +/* Barrett reduce constants */ +#define V20159 0 +#define V_25 1 +#define V_26 2 +#define V_MKQ 3 + +/* Montgomery reduce constants */ +#define V_QINV 2 +#define V_NMKQ 5 +#define V_Z0 7 +#define V_Z1 8 +#define V_Z2 9 +#define V_Z3 10 +#define V_ZETA 10 +#define V1441 10 + +.macro SAVE_REGS + stdu 1, -352(1) + mflr 0 + std 14, 56(1) + std 15, 64(1) + std 16, 72(1) + std 17, 80(1) + std 18, 88(1) + std 19, 96(1) + std 20, 104(1) + std 21, 112(1) + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + stxvx 32+20, 10, 1 + stxvx 32+21, 11, 1 + stxvx 32+22, 12, 1 + stxvx 32+23, 14, 1 + stxvx 32+24, 15, 1 + stxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + li 15, 288 + li 16, 304 + stxvx 32+26, 10, 1 + stxvx 32+27, 11, 1 + stxvx 32+28, 12, 1 + stxvx 32+29, 14, 1 + stxvx 32+30, 15, 1 + stxvx 32+31, 16, 1 +.endm + +.macro RESTORE_REGS + li 10, 128 + li 11, 144 + li 12, 160 + li 14, 176 + li 15, 192 + li 16, 208 + lxvx 32+20, 10, 1 + lxvx 32+21, 11, 1 + lxvx 32+22, 12, 1 + lxvx 32+23, 14, 1 + lxvx 32+24, 15, 1 + lxvx 32+25, 16, 1 + li 10, 224 + li 11, 240 + li 12, 256 + li 14, 272 + li 15, 288 + li 16, 304 + lxvx 32+26, 10, 1 + lxvx 32+27, 11, 1 + lxvx 32+28, 12, 1 + lxvx 32+29, 14, 1 + lxvx 32+30, 15, 1 + lxvx 32+31, 16, 1 + ld 14, 56(1) + ld 15, 64(1) + ld 16, 72(1) + ld 17, 80(1) + ld 18, 88(1) + ld 19, 96(1) + ld 20, 104(1) + ld 21, 112(1) + + mtlr 0 + addi 1, 1, 352 +.endm + +/* + * Compute r[j] and r[j+len] from computed coefficients + * r[j] + r[j+len] : V8, V12, V16, V20 (data for Barrett reduce) + * r[j+len] - r[j]: V25, V26, V30, V31 (data for Montgomery reduce) + */ +.macro Compute_4Coeffs + vsubuhm 25, 8, 21 + vsubuhm 26, 12, 22
+ vsubuhm 30, 16, 23 + vsubuhm 31, 20, 24 + vadduhm 8, 8, 21 + vadduhm 12, 12, 22 + vadduhm 16, 16, 23 + vadduhm 20, 20, 24 +.endm + +/* + * Init_Coeffs_offset: initial offset setup for the coefficient array. + * + * start: beginning of the offset to the coefficient array. + * next: Next offset. + * len: Index difference between coefficients. + * + * r7: len * 2, each coefficient component is 2 bytes. + * + * register used for offset to coefficients, r[j] and r[j+len] + * R9: offset to r0 = j + * R16: offset to r1 = r0 + next + * R18: offset to r2 = r1 + next + * R20: offset to r3 = r2 + next + * + * R10: offset to r'0 = r0 + len*2 + * R17: offset to r'1 = r'0 + step + * R19: offset to r'2 = r'1 + step + * R21: offset to r'3 = r'2 + step + * + */ +.macro Init_Coeffs_offset start next + li 9, \start /* first offset to j */ + add 10, 7, 9 /* J + len*2 */ + addi 16, 9, \next + addi 17, 10, \next + addi 18, 16, \next + addi 19, 17, \next + addi 20, 18, \next + addi 21, 19, \next +.endm + +/* + * Load coefficient vectors for r[j] (r) and r[j+len] (r'): + * Load coefficient in r' vectors from offset, R10, R17, R19 and R21 + * Load coefficient in r vectors from offset, R9, R16, R18 and R20 + * + * r[j+len]: V8, V12, V16, V20 + * r[j]: V21, V22, V23, V24 + */ +.macro Load_4Rjp + lxvd2x 32+8, 3, 10 /* V8: vector r'0 */ + lxvd2x 32+12, 3, 17 /* V12: vector for r'1 */ + lxvd2x 32+16, 3, 19 /* V16: vector for r'2 */ + lxvd2x 32+20, 3, 21 /* V20: vector for r'3 */ + + lxvd2x 32+21, 3, 9 /* V21: vector r0 */ + lxvd2x 32+22, 3, 16 /* V22: vector r1 */ + lxvd2x 32+23, 3, 18 /* V23: vector r2 */ + lxvd2x 32+24, 3, 20 /* V24: vector r3 */ +.endm + +/* + * Load Coefficients and setup vectors for 8 coefficients in the + * following order, + * rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7 + */ +.macro Load_4Coeffs start next + Init_Coeffs_offset \start \next + Load_4Rjp + Compute_4Coeffs +.endm + +/* + * Load 2 - 2 - 2 - 2 layout + * + * Load Coefficients and setup 
vectors for 8 coefficients in the + * following order, + * rj0, rj1, rjlen2, rjlen3, rj4, rj5, rjlen6, rjlen7 + * rj8, rj9, rjlen10, rjlen11, rj12, rj13, rjlen14, rjlen15 + * Each vmrgew and vmrgow will transpose vectors as, + * r[j]= rj0, rj1, rj8, rj9, rj4, rj5, rj12, rj13 + * r[j+len]= rjlen2, rjlen3, rjlen10, rjlen11, rjlen6, rjlen7, rjlen14, rjlen15 + * + * r[j+len]: V8, V12, V16, V20 + * r[j]: V21, V22, V23, V24 + * + * In order to do the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication. + */ +.macro Load_L24Coeffs + lxvd2x 32+25, 0, 5 + lxvd2x 32+26, 10, 5 + vmrgew 8, 25, 26 + vmrgow 21, 25, 26 + lxvd2x 32+25, 11, 5 + lxvd2x 32+26, 12, 5 + vmrgew 12, 25, 26 + vmrgow 22, 25, 26 + lxvd2x 32+25, 15, 5 + lxvd2x 32+26, 16, 5 + vmrgew 16, 25, 26 + vmrgow 23, 25, 26 + lxvd2x 32+25, 17, 5 + lxvd2x 32+26, 18, 5 + vmrgew 20, 25, 26 + vmrgow 24, 25, 26 +.endm + +/* + * Load 4 - 4 layout + * + * Load Coefficients and setup vectors for 8 coefficients in the + * following order, + * rj0, rj1, rj2, rj3, rjlen4, rjlen5, rjlen6, rjlen7 + * rj8, rj9, rj10, rj11, rjlen12, rjlen13, rjlen14, rjlen15 + * + * Each xxpermdi will transpose vectors as, + * rjlen4, rjlen5, rjlen6, rjlen7, rjlen12, rjlen13, rjlen14, rjlen15 + * rj0, rj1, rj2, rj3, rj8, rj9, rj10, rj11 + * + * In order to do the coefficient computation, the zeta vector is arranged + * in the proper order to match the multiplication.
+ */ +.macro Load_L44Coeffs + lxvd2x 10, 0, 5 + lxvd2x 11, 10, 5 + xxpermdi 32+8, 11, 10, 3 + xxpermdi 32+21, 11, 10, 0 + lxvd2x 10, 11, 5 + lxvd2x 11, 12, 5 + xxpermdi 32+12, 11, 10, 3 + xxpermdi 32+22, 11, 10, 0 + lxvd2x 10, 15, 5 + lxvd2x 11, 16, 5 + xxpermdi 32+16, 11, 10, 3 + xxpermdi 32+23, 11, 10, 0 + lxvd2x 10, 17, 5 + lxvd2x 11, 18, 5 + xxpermdi 32+20, 11, 10, 3 + xxpermdi 32+24, 11, 10, 0 +.endm + +.macro BREDUCE_4X _v0 _v1 _v2 _v3 + /* Restore constant vectors + V_MKQ, V_25 and V_26 */ + vxor 7, 7, 7 + xxlor 32+3, 6, 6 + xxlor 32+1, 7, 7 + xxlor 32+2, 8, 8 + /* Multiply odd/even signed halfwords; + result words bounded by 2^32 in abs value. */ + vmulosh 6, 8, V20159 + vmulesh 5, 8, V20159 + vmulosh 11, 12, V20159 + vmulesh 10, 12, V20159 + vmulosh 15, 16, V20159 + vmulesh 14, 16, V20159 + vmulosh 19, 20, V20159 + vmulesh 18, 20, V20159 + xxmrglw 32+4, 32+5, 32+6 + xxmrghw 32+5, 32+5, 32+6 + xxmrglw 32+9, 32+10, 32+11 + xxmrghw 32+10, 32+10, 32+11 + xxmrglw 32+13, 32+14, 32+15 + xxmrghw 32+14, 32+14, 32+15 + xxmrglw 32+17, 32+18, 32+19 + xxmrghw 32+18, 32+18, 32+19 + vadduwm 4, 4, V_25 + vadduwm 5, 5, V_25 + vadduwm 9, 9, V_25 + vadduwm 10, 10, V_25 + vadduwm 13, 13, V_25 + vadduwm 14, 14, V_25 + vadduwm 17, 17, V_25 + vadduwm 18, 18, V_25 + /* Right shift and pack lower halfword; + results bounded by 2^16 in abs value */ + vsraw 4, 4, V_26 + vsraw 5, 5, V_26 + vsraw 9, 9, V_26 + vsraw 10, 10, V_26 + vsraw 13, 13, V_26 + vsraw 14, 14, V_26 + vsraw 17, 17, V_26 + vsraw 18, 18, V_26 + vpkuwum 4, 5, 4 + vsubuhm 4, 7, 4 + vpkuwum 9, 10, 9 + vsubuhm 9, 7, 9 + vpkuwum 13, 14, 13 + vsubuhm 13, 7, 13 + vpkuwum 17, 18, 17 + vsubuhm 17, 7, 17 + /* Modulo multiply-low unsigned halfword; + results bounded by 2^16 * q in abs value.
*/ + vmladduhm \_v0, 4, V_MKQ, 8 + vmladduhm \_v1, 9, V_MKQ, 12 + vmladduhm \_v2, 13, V_MKQ, 16 + vmladduhm \_v3, 17, V_MKQ, 20 +.endm + +/* + * ----------------------------------- + * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3, _vo0, _vo1, _vo2, _vo3) + */ +.macro MREDUCE_4X _vz0 _vz1 _vz2 _vz3 _vo0 _vo1 _vo2 _vo3 + /* Modular multiplication bounded by 2^16 * q in abs value */ + vmladduhm 15, 25, \_vz0, 3 + vmladduhm 20, 26, \_vz1, 3 + vmladduhm 27, 30, \_vz2, 3 + vmladduhm 28, 31, \_vz3, 3 + + /* Signed multiply-high-round; outputs are bounded by 2^15 * q in abs value */ + vmhraddshs 14, 25, \_vz0, 3 + vmhraddshs 19, 26, \_vz1, 3 + vmhraddshs 24, 30, \_vz2, 3 + vmhraddshs 29, 31, \_vz3, 3 + + vmladduhm 15, 15, V_QINV, 3 + vmladduhm 20, 20, V_QINV, 3 + vmladduhm 25, 27, V_QINV, 3 + vmladduhm 30, 28, V_QINV, 3 + + vmhraddshs 15, 15, V_NMKQ, 14 + vmhraddshs 20, 20, V_NMKQ, 19 + vmhraddshs 25, 25, V_NMKQ, 24 + vmhraddshs 30, 30, V_NMKQ, 29 + + /* Shift right 1 bit */ + vsrah \_vo0, 15, 4 + vsrah \_vo1, 20, 4 + vsrah \_vo2, 25, 4 + vsrah \_vo3, 30, 4 +.endm + +/* + * Set up constant vectors for Montgomery multiplication: + * V_NMKQ, V_QINV, zero vector, one vector + */ +.macro Set_mont_consts + xxlor 32+5, 0, 0 /* V_NMKQ */ + xxlor 32+2, 2, 2 /* V_QINV */ + xxlor 32+3, 3, 3 /* all 0 */ + xxlor 32+4, 4, 4 /* all 1 */ +.endm + +.macro Load_next_4zetas + li 8, 16 + li 11, 32 + li 12, 48 + lxvd2x 32+V_Z0, 0, 14 + lxvd2x 32+V_Z1, 8, 14 + lxvd2x 32+V_Z2, 11, 14 + lxvd2x 32+V_Z3, 12, 14 + addi 14, 14, 64 +.endm + +/* + * Re-ordering of the 4-4 layout zetas. + * Swap double-words.
+ */ +.macro Perm_4zetas + xxpermdi 32+V_Z0, 32+V_Z0, 32+V_Z0, 2 + xxpermdi 32+V_Z1, 32+V_Z1, 32+V_Z1, 2 + xxpermdi 32+V_Z2, 32+V_Z2, 32+V_Z2, 2 + xxpermdi 32+V_Z3, 32+V_Z3, 32+V_Z3, 2 +.endm + +.macro Write_B4C _vs0 _vs1 _vs2 _vs3 + stxvd2x \_vs0, 3, 9 + stxvd2x \_vs1, 3, 16 + stxvd2x \_vs2, 3, 18 + stxvd2x \_vs3, 3, 20 +.endm + +.macro Write_M4C _vs0 _vs1 _vs2 _vs3 + stxvd2x \_vs0, 3, 10 + stxvd2x \_vs1, 3, 17 + stxvd2x \_vs2, 3, 19 + stxvd2x \_vs3, 3, 21 +.endm + +.macro Reload_4coeffs + lxvd2x 32+25, 0, 3 + lxvd2x 32+26, 10, 3 + lxvd2x 32+30, 11, 3 + lxvd2x 32+31, 12, 3 + addi 3, 3, 64 +.endm + +.macro MWrite_8X _vs0 _vs1 _vs2 _vs3 _vs4 _vs5 _vs6 _vs7 + addi 3, 3, -128 + stxvd2x \_vs0, 0, 3 + stxvd2x \_vs1, 10, 3 + stxvd2x \_vs2, 11, 3 + stxvd2x \_vs3, 12, 3 + stxvd2x \_vs4, 15, 3 + stxvd2x \_vs5, 16, 3 + stxvd2x \_vs6, 17, 3 + stxvd2x \_vs7, 18, 3 + addi 3, 3, 128 +.endm + +/* + * Transpose the final coefficients of the 4-4 layout back to the original + * coefficient array order. + */ +.macro PermWriteL44 + xxlor 32+14, 10, 10 + xxlor 32+19, 11, 11 + xxlor 32+24, 12, 12 + xxlor 32+29, 13, 13 + xxpermdi 32+10, 32+14, 32+13, 3 + xxpermdi 32+11, 32+14, 32+13, 0 + xxpermdi 32+12, 32+19, 32+18, 3 + xxpermdi 32+13, 32+19, 32+18, 0 + xxpermdi 32+14, 32+24, 32+23, 3 + xxpermdi 32+15, 32+24, 32+23, 0 + xxpermdi 32+16, 32+29, 32+28, 3 + xxpermdi 32+17, 32+29, 32+28, 0 + stxvd2x 32+10, 0, 5 + stxvd2x 32+11, 10, 5 + stxvd2x 32+12, 11, 5 + stxvd2x 32+13, 12, 5 + stxvd2x 32+14, 15, 5 + stxvd2x 32+15, 16, 5 + stxvd2x 32+16, 17, 5 + stxvd2x 32+17, 18, 5 +.endm + +/* + * Transpose the final coefficients of the 2-2-2-2 layout back to the original + * coefficient array order.
+ */ +.macro PermWriteL24 + xxlor 32+14, 10, 10 + xxlor 32+19, 11, 11 + xxlor 32+24, 12, 12 + xxlor 32+29, 13, 13 + vmrgew 10, 13, 14 + vmrgow 11, 13, 14 + vmrgew 12, 18, 19 + vmrgow 13, 18, 19 + vmrgew 14, 23, 24 + vmrgow 15, 23, 24 + vmrgew 16, 28, 29 + vmrgow 17, 28, 29 + stxvd2x 32+10, 0, 5 + stxvd2x 32+11, 10, 5 + stxvd2x 32+12, 11, 5 + stxvd2x 32+13, 12, 5 + stxvd2x 32+14, 15, 5 + stxvd2x 32+15, 16, 5 + stxvd2x 32+16, 17, 5 + stxvd2x 32+17, 18, 5 +.endm + +/* + * INTT layer Len=2. + */ +.macro INTT_REDUCE_L24 + Load_L24Coeffs + Compute_4Coeffs + BREDUCE_4X 4, 9, 13, 17 + xxlor 10, 32+4, 32+4 + xxlor 11, 32+9, 32+9 + xxlor 12, 32+13, 32+13 + xxlor 13, 32+17, 32+17 + Set_mont_consts + Load_next_4zetas + MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28 + PermWriteL24 +.endm + +/* + * INTT layer Len=4. + */ +.macro INTT_REDUCE_L44 + Load_L44Coeffs + Compute_4Coeffs + BREDUCE_4X 4, 9, 13, 17 + xxlor 10, 32+4, 32+4 + xxlor 11, 32+9, 32+9 + xxlor 12, 32+13, 32+13 + xxlor 13, 32+17, 32+17 + Set_mont_consts + Load_next_4zetas + Perm_4zetas + MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28 + PermWriteL44 +.endm + +/* + * INTT layers Len=8 and 16. + */ +.macro INTT_REDUCE_4X start next + Load_4Coeffs \start, \next + BREDUCE_4X 4, 9, 13, 17 + Write_B4C 32+4, 32+9, 32+13, 32+17 + Set_mont_consts + Load_next_4zetas + MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28 + Write_M4C 32+13, 32+18, 32+23, 32+28 +.endm + +/* + * INTT layers Len=32, 64 and 128. + */ +.macro INTT_REDUCE_L567 start next + Load_4Coeffs \start, \next + BREDUCE_4X 4, 9, 13, 17 + Write_B4C 32+4, 32+9, 32+13, 32+17 + Set_mont_consts + lvx V_ZETA, 0, 14 + MREDUCE_4X V_ZETA, V_ZETA, V_ZETA, V_ZETA, 13, 18, 23, 28 + Write_M4C 32+13, 32+18, 32+23, 32+28 +.endm + +/* + * intt_ppc(int16_t *r) + * Compute the inverse NTT based on the following 7 layers - + * len = 2, 4, 8, 16, 32, 64, 128 + * + * Each layer computes the coefficients on 2 legs, at offsets start and start + len*2.
+ * + * leg 1 leg 2 + * ----- ----- + * start start+len*2 + * start+next start+len*2+next + * start+next+next start+len*2+next+next + * start+next+next+next start+len*2+next+next+next + * + * Each computation loads 8 vectors, 4 for each leg. + * The final coefficients (t) from each vector of leg1 and leg2 then go through the + * add/sub operations to obtain the final results. + * + * -> leg1 = leg1 + t, leg2 = leg1 - t + * + * The resulting coefficients are then stored back to each leg's offset. + * + * Each vector has the same corresponding zeta except for len=4 and len=2. + * + * len=4 has a 4-4 layout, which means every 4 16-bit coefficients share the same zeta, + * and len=2 has a 2-2-2-2 layout, which means every 2 16-bit coefficients share the same zeta. + * e.g. + * coeff vector a1 a2 a3 a4 a5 a6 a7 a8 + * zeta vector z1 z1 z2 z2 z3 z3 z4 z4 + * + * For len=4 and len=2, each vector is permuted into leg1 and leg2. Zeta is + * pre-arranged for leg1 and leg2. After the computation, each vector needs + * to be transposed back to its original 4-4 or 2-2-2-2 layout.
+ */ +.global intt_ppc +.align 4 +intt_ppc: +.localentry intt_ppc,.-intt_ppc + + SAVE_REGS + + /* init vectors and constants + Setup for Montgomery reduce */ + addis 8,2,mlkem_consts@toc@ha + addi 8,8,mlkem_consts@toc@l + lxvx 0, 0, 8 /* V_NMKQ */ + + li 10, QINV_OFFSET + lxvx 32+V_QINV, 10, 8 + xxlxor 32+3, 32+3, 32+3 + vspltish 4, 1 + xxlor 2, 32+2, 32+2 /* QINV */ + xxlor 3, 32+3, 32+3 /* 0 vector */ + xxlor 4, 32+4, 32+4 /* 1 vector */ + + /* Setup for Barrett reduce */ + li 10, Q_OFFSET + li 11, C20159_OFFSET + lxvx 6, 10, 8 /* V_MKQ */ + lxvx 32+V20159, 11, 8 /* V20159 */ + + vspltisw 8, 13 + vadduwm 8, 8, 8 + xxlor 8, 32+8, 32+8 /* V_26 stored at vs8 */ + + vspltisw 9, 1 + vsubuwm 10, 8, 9 /* value 25 */ + vslw 9, 9, 10 + xxlor 7, 32+9, 32+9 /* V_25 stored at vs7 */ + + li 10, 16 + li 11, 32 + li 12, 48 + li 15, 64 + li 16, 80 + li 17, 96 + li 18, 112 + + /* + * Montgomery reduce loops with constant 1441 + */ + addi 14, 8, C1441_OFFSET + lvx V1441, 0, 14 + li 7, 4 + mtctr 7 + + Set_mont_consts +intt_ppc__Loopf: + Reload_4coeffs + MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9 + Reload_4coeffs + MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28 + MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28 + bdnz intt_ppc__Loopf + + addi 3, 3, -512 + +.align 4 + /* + * 1. len = 2, start = 0, 4, 8, 12,...244, 248, 252 + * Update zeta vectors, each vector has 2 zetas + * Load zeta vectors in the 2-2-2-2 layout + * + * Compute coefficients of the NTT based on the following sequences, + * 0, 1, 2, 3, 4, 5, 6, 7 + * 8, 9, 10, 11, 12, 13, 14, 15 + * ... + * 240, 241, 242, 243, 244, 245, 246, 247 + * 248, 249, 250, 251, 252, 253, 254, 255 + * + * These are indices into the 16-bit array. Each iteration loads 4 vectors. + */ + addi 14, 8, ZETA_INTT_OFFSET + li 7, 4 /* len * 2 */ + mr 5, 3 + + INTT_REDUCE_L24 + addi 5, 5, 128 + INTT_REDUCE_L24 + addi 5, 5, 128 + INTT_REDUCE_L24 + addi 5, 5, 128 + INTT_REDUCE_L24 + addi 5, 5, 128 + +.align 4 + /* + * 2.
len = 4, start = 0, 8, 16, 24,...232, 240, 248 + * Load zeta vectors in the 4-4 layout + * + * Compute coefficients of the NTT based on the following sequences, + * 0, 1, 2, 3, 4, 5, 6, 7 + * 8, 9, 10, 11, 12, 13, 14, 15 + * ... + * 240, 241, 242, 243, 244, 245, 246, 247 + * 248, 249, 250, 251, 252, 253, 254, 255 + * + * These are indices into the 16-bit array. Each iteration loads 4 vectors. + */ + mr 5, 3 + li 7, 8 + + INTT_REDUCE_L44 + addi 5, 5, 128 + INTT_REDUCE_L44 + addi 5, 5, 128 + INTT_REDUCE_L44 + addi 5, 5, 128 + INTT_REDUCE_L44 + addi 5, 5, 128 + +.align 4 + /* + * 3. len = 8, start = 0, 16, 32, 48,...208, 224, 240 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 8 + * 64 - 72 + * 128 - 136 + * 192 - 200 + * + * These are indices into the 16-bit array + */ + li 7, 16 + + INTT_REDUCE_4X 0, 32 + INTT_REDUCE_4X 128, 32 + INTT_REDUCE_4X 256, 32 + INTT_REDUCE_4X 384, 32 + +.align 4 + /* + * 4. len = 16, start = 0, 32, 64,...160, 192, 224 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 16 + * 8 - 24 + * 128 - 144 + * 136 - 152 + * + * These are indices into the 16-bit array + */ + li 7, 32 + + INTT_REDUCE_4X 0, 64 + + addi 14, 14, -64 + INTT_REDUCE_4X 16, 64 + + INTT_REDUCE_4X 256, 64 + + addi 14, 14, -64 + INTT_REDUCE_4X 272, 64 + +.align 4 + /* + * 5. len = 32, start = 0, 64, 128, 192 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 32 + * 64 - 96 + * 128 - 160 + * 192 - 224 + * + * These are indices into the 16-bit array + */ + li 7, 64 + + INTT_REDUCE_L567 0, 16 + addi 14, 14, 16 + INTT_REDUCE_L567 128, 16 + addi 14, 14, 16 + INTT_REDUCE_L567 256, 16 + addi 14, 14, 16 + INTT_REDUCE_L567 384, 16 + addi 14, 14, 16 + +.align 4 + /* + * 6.
len = 64, start = 0, 128 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 64 + * 32 - 96 + * 128 - 192 + * 160 - 224 + * + * These are indexes to the 16 bits array + */ + li 7, 128 + + INTT_REDUCE_L567 0, 16 + INTT_REDUCE_L567 64, 16 + addi 14, 14, 16 + INTT_REDUCE_L567 256, 16 + INTT_REDUCE_L567 320, 16 + addi 14, 14, 16 + +.align 4 + /* + * 7. len = 128, start = 0 + * + * Compute coefficients of the NTT based on 2 legs, + * 0 - 128 + * 32 - 160 + * 64 - 192 + * 96 - 224 + * + * These are indexes to the 16 bits array + */ + li 7, 256 /* len*2 */ + + INTT_REDUCE_L567 0, 16 + INTT_REDUCE_L567 64, 16 + INTT_REDUCE_L567 128, 16 + INTT_REDUCE_L567 192, 16 + + RESTORE_REGS + blr +.size intt_ppc,.-intt_ppc + +.rodata +.align 4 +mlkem_consts: +/* -Q */ +.short -3329, -3329, -3329, -3329, -3329, -3329, -3329, -3329 +/* QINV */ +.short -3327, -3327, -3327, -3327, -3327, -3327, -3327, -3327 +/* Q */ +.short 3329, 3329, 3329, 3329, 3329, 3329, 3329, 3329 +/* const 20159 for reduce.S and intt */ +.short 20159, 20159, 20159, 20159, 20159, 20159, 20159, 20159 +/* const 1441 for intt */ +.short 1441, 1441, 1441, 1441, 1441, 1441, 1441, 1441 + +mlkem_zetas: +/* + * For intt Len=2, offset IZETA_NTT_OFFSET127 + * reorder zeta array, (1, 2, 3, 4) -> (3, 1, 4, 2) + * Transpose z[0], z[1], z[2], z[3] + * -> z[3], z[3], z[1], z[1], z[4], z[4], z[2], z[2] + */ +.short -1460, -1460, 1628, 1628, 958, 958, 1522, 1522, -308, -308, 991, 991, -108 +.short -108, 996, 996, -854, -854, 478, 478, -1510, -1510, -870, -870, -1530 +.short -1530, 794, 794, -1185, -1185, -1278, -1278, 220, 220, -1659, -1659, -874 +.short -874, -1187, -1187, -136, -136, -1335, -1335, -1215, -1215, 1218, 1218 +.short -1285, -1285, 384, 384, 1322, 1322, -1465, -1465, 1097, 1097, 610, 610, 817 +.short 817, 603, 603, 329, 329, -75, -75, 418, 418, -156, -156, 644, 644, 349, 349 +.short -1590, -1590, -872, -872, 1483, 1483, 1119, 1119, -777, -777, -602, -602 +.short 778, 778, -147, -147, -246, -246, 1159, 
1159, -460, -460, 1653, 1653, -291 +.short -291, 1574, 1574, 587, 587, -235, -235, 422, 422, 177, 177, 871, 871, 105 +.short 105, -1251, -1251, 1550, 1550, 430, 430, 843, 843, -1103, -1103, 555, 555 +/* For intt Len=4 */ +.short -1275, -1275, -1275, -1275, 677, 677, 677, 677, -1065, -1065, -1065, -1065 +.short 448, 448, 448, 448, -725, -725, -725, -725, -1508, -1508, -1508, -1508, 961 +.short 961, 961, 961, -398, -398, -398, -398, -951, -951, -951, -951, -247, -247 +.short -247, -247, -1421, -1421, -1421, -1421, 107, 107, 107, 107, 830, 830, 830 +.short 830, -271, -271, -271, -271, -90, -90, -90, -90, -853, -853, -853, -853 +.short 1469, 1469, 1469, 1469, 126, 126, 126, 126, -1162, -1162, -1162, -1162 +.short -1618, -1618, -1618, -1618, -666, -666, -666, -666, -320, -320, -320, -320 +.short -8, -8, -8, -8, 516, 516, 516, 516, -1544, -1544, -1544, -1544, -282, -282 +.short -282, -282, 1491, 1491, 1491, 1491, -1293, -1293, -1293, -1293, 1015, 1015 +.short 1015, 1015, -552, -552, -552, -552, 652, 652, 652, 652, 1223, 1223, 1223 +.short 1223 +/* For intt Len=8 and others */ +.short -1571, -1571, -1571, -1571, -1571, -1571, -1571, -1571, -205, -205, -205 +.short -205, -205, -205, -205, -205, 411, 411, 411, 411, 411, 411, 411, 411, -1542 +.short -1542, -1542, -1542, -1542, -1542, -1542, -1542, 608, 608, 608, 608, 608 +.short 608, 608, 608, 732, 732, 732, 732, 732, 732, 732, 732, 1017, 1017, 1017 +.short 1017, 1017, 1017, 1017, 1017, -681, -681, -681, -681, -681, -681, -681 +.short -681, -130, -130, -130, -130, -130, -130, -130, -130, -1602, -1602, -1602 +.short -1602, -1602, -1602, -1602, -1602, 1458, 1458, 1458, 1458, 1458, 1458, 1458 +.short 1458, -829, -829, -829, -829, -829, -829, -829, -829, 383, 383, 383, 383 +.short 383, 383, 383, 383, 264, 264, 264, 264, 264, 264, 264, 264, -1325, -1325 +.short -1325, -1325, -1325, -1325, -1325, -1325, 573, 573, 573, 573, 573, 573, 573 +.short 573, 1468, 1468, 1468, 1468, 1468, 1468, 1468, 1468, -1474, -1474, -1474 +.short -1474, 
-1474, -1474, -1474, -1474, -1202, -1202, -1202, -1202, -1202, -1202 +.short -1202, -1202, 962, 962, 962, 962, 962, 962, 962, 962, 182, 182, 182, 182 +.short 182, 182, 182, 182, 1577, 1577, 1577, 1577, 1577, 1577, 1577, 1577, 622 +.short 622, 622, 622, 622, 622, 622, 622, -171, -171, -171, -171, -171, -171, -171 +.short -171, 202, 202, 202, 202, 202, 202, 202, 202, 287, 287, 287, 287, 287, 287 +.short 287, 287, 1422, 1422, 1422, 1422, 1422, 1422, 1422, 1422, 1493, 1493, 1493 +.short 1493, 1493, 1493, 1493, 1493, -1517, -1517, -1517, -1517, -1517, -1517 +.short -1517, -1517, -359, -359, -359, -359, -359, -359, -359, -359, -758, -758 +.short -758, -758, -758, -758, -758, -758 -- 2.47.3 From dtsen at us.ibm.com Tue Feb 24 01:27:53 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Mon, 23 Feb 2026 18:27:53 -0600 Subject: [PATCH 5/5] dilithium-kyber: Added ppc64le dilithium and kyber (i)NTT support. In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> References: <20260224002753.151873-1-dtsen@us.ibm.com> Message-ID: <20260224002753.151873-6-dtsen@us.ibm.com> Updated the following files to ENABLE_PPC_DILITHIUM and ENABLE_PPC_KYBER, dilithium-common.c, kyber-common.c and configure.ac Signed-off-by: Danny Tsen --- cipher/dilithium-common.c | 13 +++++++++++++ cipher/kyber-common.c | 13 +++++++++++++ configure.ac | 20 ++++++++++++++++++++ 3 files changed, 46 insertions(+) diff --git a/cipher/dilithium-common.c b/cipher/dilithium-common.c index d16f22f7..0f3d2d96 100644 --- a/cipher/dilithium-common.c +++ b/cipher/dilithium-common.c @@ -50,6 +50,18 @@ static void invntt_tomont(int32_t a[N]); /*************** dilithium/ref/ntt.c */ +#ifdef ENABLE_PPC_DILITHIUM +extern void mldsa_ntt_ppc(int32_t a[N]); +extern void mldsa_intt_ppc(int32_t a[N]); + +void ntt(int32_t a[N]) { + mldsa_ntt_ppc(a); +} + +void invntt_tomont(int32_t a[N]) { + mldsa_intt_ppc(a); +} +#else static const int32_t zetas[N] = { 0, 25847, -2608894, -518909, 237124, -777960, -876248, 466468, 1826347, 
2353451, -359251, -2091905, 3119733, -2884855, 3111497, 2680103, @@ -143,6 +155,7 @@ void invntt_tomont(int32_t a[N]) { a[j] = montgomery_reduce((int64_t)f * a[j]); } } +#endif /*************** dilithium/ref/rounding.h */ #if !defined(DILITHIUM_MODE) || DILITHIUM_MODE == 2 static int32_t decompose_88(int32_t *a0, int32_t a); diff --git a/cipher/kyber-common.c b/cipher/kyber-common.c index 54377788..278d0b0b 100644 --- a/cipher/kyber-common.c +++ b/cipher/kyber-common.c @@ -273,6 +273,18 @@ static int16_t fqmul(int16_t a, int16_t b) { return montgomery_reduce((int32_t)a*b); } +#ifdef ENABLE_PPC_KYBER +extern void ntt_ppc(int16_t r[256]); +extern void intt_ppc(int16_t r[256]); + +void ntt(int16_t r[256]) { + ntt_ppc(r); +} + +void invntt(int16_t r[256]) { + intt_ppc(r); +} +#else /************************************************* * Name: ntt * @@ -328,6 +340,7 @@ void invntt(int16_t r[256]) { for(j = 0; j < 256; j++) r[j] = fqmul(r[j], f); } +#endif /************************************************* * Name: basemul diff --git a/configure.ac b/configure.ac index 00572b45..49a094fe 100644 --- a/configure.ac +++ b/configure.ac @@ -3828,6 +3828,16 @@ if test "$found" = "1" ; then GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS \ kyber.lo" AC_DEFINE(USE_KYBER, 1, [Defined if this module should be included]) + + case "${host}" in + powerpc64le-*-*) + if test "$gcry_cv_gcc_inline_asm_ppc_altivec" = "yes" ; then + AC_DEFINE(ENABLE_PPC_KYBER, 1, [Enable support for PPC optimized kyber.]) + GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS kyber_ntt_p8le.lo" + GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS kyber_intt_p8le.lo" + fi + ;; + esac fi LIST_MEMBER(dilithium, $enabled_pubkey_ciphers) @@ -3836,6 +3846,16 @@ if test "$found" = "1" ; then GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS \ dilithium.lo pubkey-dilithium.lo" AC_DEFINE(USE_DILITHIUM, 1, [Defined if this module should be included]) + + case "${host}" in + powerpc64le-*-*) + if test "$gcry_cv_gcc_inline_asm_ppc_altivec" 
= "yes" ; then + AC_DEFINE(ENABLE_PPC_DILITHIUM, 1, [Enable support for PPC optimized dilithium.]) + GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dilithium_ntt_p8le.lo" + GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dilithium_intt_p8le.lo" + fi + ;; + esac fi LIST_MEMBER(crc, $enabled_digests) -- 2.47.3 From wk at gnupg.org Tue Feb 24 10:30:38 2026 From: wk at gnupg.org (Werner Koch) Date: Tue, 24 Feb 2026 10:30:38 +0100 Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> (Danny Tsen via Gcrypt-devel's message of "Mon, 23 Feb 2026 18:27:48 -0600") References: <20260224002753.151873-1-dtsen@us.ibm.com> Message-ID: <87bjhetrnl.fsf@jacob.g10code.de> Hi! Thanks for working on this. Do you have a benchmark? Shalom-Salam, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 284 bytes Desc: not available URL: From dtsen at us.ibm.com Thu Feb 26 11:23:17 2026 From: dtsen at us.ibm.com (Danny Tsen) Date: Thu, 26 Feb 2026 10:23:17 +0000 Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for In-Reply-To: <87bjhetrnl.fsf@jacob.g10code.de> References: <20260224002753.151873-1-dtsen@us.ibm.com> <87bjhetrnl.fsf@jacob.g10code.de> Message-ID: Hi Werner, I don't have benchmark for libgcrypt. I do have my own testing performance number on NTT operation. That probably not what you are looking for. Thanks. -Danny ________________________________ From: Werner Koch Sent: Tuesday, February 24, 2026 5:30 PM To: Danny Tsen via Gcrypt-devel Cc: Danny Tsen Subject: [EXTERNAL] Re: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for Hi! Thanks for working on this. Do you have a benchmark? Shalom-Salam, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. 
Einstein -------------- next part -------------- An HTML attachment was scrubbed... URL: From wk at gnupg.org Thu Feb 26 14:47:06 2026 From: wk at gnupg.org (Werner Koch) Date: Thu, 26 Feb 2026 14:47:06 +0100 Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for In-Reply-To: (Danny Tsen via Gcrypt-devel's message of "Thu, 26 Feb 2026 10:23:17 +0000") References: <20260224002753.151873-1-dtsen@us.ibm.com> <87bjhetrnl.fsf@jacob.g10code.de> Message-ID: <87h5r3r50l.fsf@jacob.g10code.de> On Thu, 26 Feb 2026 10:23, Danny Tsen said: > I don't have benchmark for libgcrypt. I do have my own testing > performance number on NTT operation. That probably not what you are I just noticed that we do have support for MLKEM and MLDSA in our ./bench-slope . We should change that to make it easier torun benchmarks. I was actually looking only for a rough figure on how much performance you gain with your patches. Salam-Shalom, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 284 bytes Desc: not available URL: