From jussi.kivilinna at iki.fi  Mon Feb  2 19:21:56 2026
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon,  2 Feb 2026 20:21:56 +0200
Subject: [PATCH] configure.ac: fix HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS on
 x32 targets
Message-ID: <20260202182156.2115138-1-jussi.kivilinna@iki.fi>

* configure.ac (gcry_cv_compiler_defines__x86_64__): New.
(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS): Enable if __x86_64__ macro is
defined by compiler and size of long is 4 (x32) or 8 (amd64).
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 configure.ac | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index be6b29b4..d94319cc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1829,6 +1829,25 @@ if test $amd64_as_feature_detection = yes; then
 fi
 
 
+#
+# Check whether compiler defines __x86_64__ macro (amd64 or x32)
+#
+AC_CACHE_CHECK([whether compiler defines __x86_64__ macro],
+  [gcry_cv_compiler_defines__x86_64__],
+  [if test "$mpi_cpu_arch" != "x86" ||
+      test "$try_asm_modules" != "yes" ; then
+    gcry_cv_compiler_defines__x86_64__="n/a"
+  else
+    AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [[
+      #ifndef __x86_64__
+      # error "Architecture is not x86_64"
+      #endif
+    ]])],
+    [gcry_cv_compiler_defines__x86_64__=yes],
+    [gcry_cv_compiler_defines__x86_64__=no])
+  fi])
+
+
 #
 # Check whether GCC assembler supports features needed for our i386/amd64
 # implementations
@@ -1859,7 +1878,9 @@ if test $amd64_as_feature_detection = yes; then
           [gcry_cv_gcc_x86_platform_as_ok=yes])
         fi])
   if test "$gcry_cv_gcc_x86_platform_as_ok" = "yes" &&
-     test "$ac_cv_sizeof_unsigned_long" = "8"; then
+     test "$gcry_cv_compiler_defines__x86_64__" = "yes" &&
+     (test "$ac_cv_sizeof_unsigned_long" = "4" ||
+      test "$ac_cv_sizeof_unsigned_long" = "8"); then
     AC_DEFINE(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS,1,
               [Defined if underlying assembler is compatible with amd64 assembly implementations])
   fi
-- 
2.51.0


From zach.fogg at gmail.com  Thu Feb  5 01:23:01 2026
From: zach.fogg at gmail.com (Zachary Fogg)
Date: Wed, 4 Feb 2026 19:23:01 -0500
Subject: EdDSA Verification Bug - Clarification on Format 2 Verification
 Failure
In-Reply-To: <87pl7b7zxh.fsf@jacob.g10code.de>
References: <CAC7bEvkeKKTmHaQNQM-BVs3cORWbmML6R8DZGcR0ETpQWBTTPg@mail.gmail.com>
 <87pl7b7zxh.fsf@jacob.g10code.de>
Message-ID: <CAC7bEvmD9K+JOFEA_eptYN4gwxULZFyYkFdZnAcymz++b-WHwA@mail.gmail.com>

Hey Werner,

I appreciate the feedback, but I have to respectfully push back here. I
don't think this is just an API inconsistency issue.

I set up a minimal test case to verify what's happening, and it's pretty
clear cut:

// Format 2: (data (flags eddsa) (hash-algo sha512) (value %b))
err = gcry_sexp_build(&s_data, NULL,
"(data (flags eddsa) (hash-algo sha512) (value %b))",
msg_len, msg);

err = gcry_pk_sign(&s_sig, s_data, privkey);  // OK - signing works
err = gcry_pk_verify(s_sig, s_data, pubkey);  // FAILED - verification fails

The thing is, I'm using the exact same S-expression for both signing and
verification. If the format is valid enough for signing to succeed, it
should be valid for verification. That's the inconsistency. You can't have
signing work and verification fail with identical inputs.

I get that Ed25519 has been working well since 2014 - the simple format
(data (value %b)) works fine for me too. But when you add the EdDSA flags
and hash-algo spec (which is what GPG uses), the verification breaks.
That's a real problem.

This isn't some edge case either. I'm trying to verify signatures and it
fails every time because libgcrypt can sign but can't verify using the same
format. For my use case, that's a showstopper for using the lib directly...
I resorted to having the gpg binary do it in its own process which is not
ideal code at all.

I'm not trying to be difficult here - I just want this fixed or at least
documented as a known limitation, and to have tests updated because in my
research this bug exists because there are no tests for specifically this
case. If EdDSA with hash-algo flags isn't supposed to be supported, the
library should reject it at sign time, not silently accept it and then fail
verification.

What do you think? Did you run my example code at least and SEE the
failures in your own terminal? It's reproducible, and I could show you if
you were sitting in front of me.

-Zachary

On Thu, Jan 15, 2026 at 8:41?AM Werner Koch <wk at gnupg.org> wrote:

> Hi!
>
> Just a short note on your bug report.  You gave a lot of examples and a
> nicely formated report at https://github.com/zfogg/ascii-chat/issues/92
> but I can't read everything of it.
>
> On Tue, 30 Dec 2025 02:30, Zachary Fogg said:
> > **In-Reply-To:** <response from NIIBE Yutaka on Oct 22, 2025>
>
> > Your response mentioned using `(flags eddsa)` during key generation,
> which
> > is good practice. However, I want to clarify that **my bug report
> concerns
> > signature verification, not key generation**.
>
> If you look at the way GnuPG uses Libgcrypt will find in
> gnupg/g10/pkglue.c:pk_verify this:
>
>           if (openpgp_oid_is_ed25519 (pkey[0]))
>             fmt = "(public-key(ecc(curve %s)(flags eddsa)(q%m)))";
>           else
>             fmt = "(public-key(ecc(curve %s)(q%m)))";
>
> and this for the data:
>
>       if (openpgp_oid_is_ed25519 (pkey[0]))
>         fmt = "(data(flags eddsa)(hash-algo sha512)(value %m))";
>       else
>         fmt = "(data(value %m))";
>
> and more complicated stuff for re-formatting the signature data.  It is
> a bit unfortunate that we need to have these special cases but that's
> the drawback of a having a stable API and protocol.
>
> > 1. Can you confirm this is a genuine bug in libgcrypt's verification
> logic?
>
> No, at least not as I understand it.  ed25519 signatures are working
> well and are in active use since GnuPG 2.1 from 2014.
>
> > 2. Should I open a formal bug in the dev.gnupg.org tracker?
>
> I don't see a bug ;-)
>
> > 3. Would a patch fixing the PUBKEY_FLAG_PREHASH handling be acceptable?
>
> I do not understand exactly what you propose.  A more concise
> description would be helpful.  But note that API stability is a primary
> goal.
>
> BTW on your website your wrote:
>
>   I've created a working exploit that demonstrates the severity of this
>   bug. The exploit proves that GPG agent creates EdDSA signatures that
>   cannot be verified by standard libgcrypt verification code, even with
>   the correct keys.
>
> The term "exploit" is used to describe an attack method which undermines
> the security of a system.  What you describe is a claimed inconsistent
> API.  That may or may not be the case; I don't see a security bug here,
> though.
>
>
>
> Salam-Shalom,
>
>    Werner
>
>
> p.s.
> I had a brief look at your project:  In src/main.c I notice
>
>   // Set global FPS from command-line option if provided
>   extern int g_max_fps;
>
> The declaration of an external variable inside a function is a not a
> good coding style.  Put this at the top of the file or into a header.
> A few lines above:
>
>   #ifndef NDEBUG
>     // Initialize lock debugging system after logging is fully set up
>     log_debug("Initializing lock debug system...");
>
> Never ever use NDEBUG.  This is an idea of the 70ies.  This also
> disables the assert(3) functionality and if you do this you won't get an
> assertion failure at all in your production code - either you know the
> code is correct or you are not sure.  Never remove an assert from
> production code.
>
> I have noticed a lot of documentation inside the code - that's good.
>
> --
> The pioneers of a warless world are the youth that
> refuse military service.             - A. Einstein
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260204/187c0a3f/attachment.html>

From gniibe at fsij.org  Thu Feb  5 02:58:25 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Thu, 05 Feb 2026 10:58:25 +0900
Subject: libgcrypt 1.8.12: STRIBOG carry overflow bug
In-Reply-To: <CAO5O-ELhyUsE7VvqYO2Fov5mR+0Fr8ZSd4-RtfTj4MyuaFhfFA@mail.gmail.com>
References: <CAO5O-ELhyUsE7VvqYO2Fov5mR+0Fr8ZSd4-RtfTj4MyuaFhfFA@mail.gmail.com>
Message-ID: <87jywsgddq.fsf@haruna.fsij.org>

Guido Vranken wrote:
> Fix is in da6cd4f but was not backported to 1.8. 1.8 is EOL but has
> "Extended Long Term Support contract available".

Thank you.

I cherry picked the commit to 1.8 branch.

I found that building 1.8 is getting difficult with newer toolchain and
newer libgpg-error.  I push a minimal change of such a forward port,
too.
-- 


From stu at spacehopper.org  Fri Feb  6 15:41:30 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Fri, 6 Feb 2026 14:41:30 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on i386
Message-ID: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>

When building the "notmuch" email indexer, the configure script tests
that gmime can extract a session key, which it does using gcrypt.
Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).

This is not changed by applying the patch
https://lists.gnupg.org/pipermail/gcrypt-devel/2026-January/006025.html

The problem is no longer seen after neutering part of
https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=commit;h=4f56fd8c5e03f389a9f27a5e9206b9dfb49c92e3

Index: mpi/ec.c
--- mpi/ec.c.orig
+++ mpi/ec.c
@@ -305,7 +305,7 @@ ec_mod (gcry_mpi_t w, mpi_ec_t ec)
   else
     _gcry_mpi_mod (w, w, ec->p);
 
-  if ((ec->flags & GCRYECC_FLAG_LEAST_LEAK))
+  if (0 && (ec->flags & GCRYECC_FLAG_LEAST_LEAK))
     w->nlimbs = ec->p->nlimbs;
 }
 

The script below replicates the test setup used by notmuch (requires
gmime and gnupg to be installed).

#!/bin/sh
set -e

tmp=$(mktemp -d /tmp/notmuchtest.XXXXXXXXX)
cd $tmp

cat << EOF > _check_session_keys.c
#include <gmime/gmime.h>
#include <stdio.h>

int main () {
    GError *error = NULL;
    GMimeParser *parser = NULL;
    GMimeMultipartEncrypted *body = NULL;
    GMimeDecryptResult *decrypt_result = NULL;
    GMimeObject *output = NULL;

    g_mime_init ();
    parser = g_mime_parser_new ();
    g_mime_parser_init_with_stream (parser, g_mime_stream_file_open("basic-encrypted.eml", "r", &error));
    if (error) return !! fprintf (stderr, "failed to instantiate parser with basic-encrypted.eml\n");

    body = GMIME_MULTIPART_ENCRYPTED(g_mime_message_get_mime_part (g_mime_parser_construct_message (parser, NULL)));
    if (body == NULL) return !! fprintf (stderr, "did not find a multipart encrypted message\n");

    output = g_mime_multipart_encrypted_decrypt (body, GMIME_DECRYPT_EXPORT_SESSION_KEY, NULL, &decrypt_result, &error);
    if (error || output == NULL) return !! fprintf (stderr, "decryption failed\n");

    if (decrypt_result == NULL) return !! fprintf (stderr, "no GMimeDecryptResult found\n");
    if (decrypt_result->session_key == NULL) return !! fprintf (stderr, "GMimeDecryptResult has no session key\n");

    printf ("%s\n", decrypt_result->session_key);
    return 0;
}
EOF

cat << EOF > openpgp4-secret-key.asc
-----BEGIN PGP PRIVATE KEY BLOCK-----

lFgEYxhtlxYJKwYBBAHaRw8BAQdA0PoNKr90DaQV1dIK77wbWm4RT+JQzqBkwIjA
HQM9RHYAAQDQ5wSfkOGXvKYroALWgibztISzXS5b8boGXykcHERo6w/ctDtOb3Rt
dWNoIFRlc3QgU3VpdGUgKElOU0VDVVJFISkgPHRlc3Rfc3VpdGVAbm90bXVjaG1h
aWwub3JnPoiQBBMWCAA4AhsDBQsJCAcCBhUKCQgLAgQWAgMBAh4BAheAFiEEmjr+
bGAGWhSP1LWKfmq+kkZFzGAFAmMYbZwACgkQfmq+kkZFzGDtrwEAjQRn3xhEomah
wICjQjfi4RKNbvnRViZgosijDBANUAgA/28GrK1tPnQsXWqmuZxQ1Cd5ry4NAnj/
4jsxD3cTbnEHnF0EYxhtlxIKKwYBBAGXVQEFAQEHQEOd3EyCD5qo4+QuHz0lruCG
VM6n6RI4dtAh3cX9uHwiAwEIBwAA/1oe+p5jNjNE5lEj4yTpYjCxCeC98MolbiAy
0yY7526wECqIeAQYFggAIBYhBJo6/mxgBloUj9S1in5qvpJGRcxgBQJjGG2XAhsM
AAoJEH5qvpJGRcxgBdsA/R9ZECoxai5QhOitDIAUZVCRr59Pm1VMPiJOOIla2N1p
AQCNESwJ9IJOdO/06q+bR2GG4WyEkB4VoVBiA3hFx/zZAA==
=uGTo
-----END PGP PRIVATE KEY BLOCK-----
EOF

cat << EOF > basic-encrypted.eml
From: test_suite at notmuchmail.org
To: test_suite at notmuchmail.org
Subject: Here is the password
Date: Sat, 01 Jan 2000 12:00:00 +0000
Message-ID: <basic-encrypted at crypto.notmuchmail.org>
MIME-Version: 1.0
Content-Type: multipart/encrypted; boundary="=-=-=";
        protocol="application/pgp-encrypted"

--=-=-=
Content-Type: application/pgp-encrypted

Version: 1

--=-=-=
Content-Type: application/octet-stream

-----BEGIN PGP MESSAGE-----

hF4DHXHP849rSK8SAQdAYbv9NFaU2Fbd6JbfE87h/yZNyWLJYZ2EseU0WyOz7Agw
/+KTbbIqRcEYhnpQhQXBQ2wqIN5gmdRhaqrj5q0VLV2BOKNJKqXGs/W4DghXwfAu
0oMBqjTd/mMbF0nJLw3bPX+LW47RHQdZ8vUVPlPr0ALg8kqgcfy95Qqy5h796Uyq
xs+I/UUOt7fzTDAw0B4qkRbdSangwYy80N4X43KrAfKSstBH3/7O4285XZr86YhF
rEtsBuwhoXI+DaG3uYZBBMTkzfButmBKHwB2CmWutmVpQL087A==
=lhSz
-----END PGP MESSAGE-----
--=-=-=--
EOF

cc $(pkg-config --cflags gmime-3.0) _check_session_keys.c \
   $(pkg-config --libs gmime-3.0) -o _check_session_keys

export GNUPGHOME=$tmp
gpg --batch --quiet --import < openpgp4-secret-key.asc
echo "cd $tmp; GNUPGHOME=$tmp ./_check_session_keys"
./_check_session_keys


From wk at gnupg.org  Mon Feb  9 11:32:44 2026
From: wk at gnupg.org (Werner Koch)
Date: Mon, 09 Feb 2026 11:32:44 +0100
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <aYX9mimjccCPyGMI@symphytum.spacehopper.org> (Stuart Henderson
 via Gcrypt-devel's message of "Fri, 6 Feb 2026 14:41:30 +0000")
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
Message-ID: <87qzqub41f.fsf@jacob.g10code.de>

On Fri,  6 Feb 2026 14:41, Stuart Henderson said:
> When building the "notmuch" email indexer, the configure script tests
> that gmime can extract a session key, which it does using gcrypt.
> Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).

You mean Notmuch uses its own *PGP parser to "extract" the session key?

I recall that there was a discussion on speeding up things by storing
decrypted session in a database and then use gpg --override-sssion-key.
But I can't remember the details.


Salam-Shalom,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260209/b021a514/attachment.sig>

From stu at spacehopper.org  Mon Feb  9 12:14:52 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Mon, 9 Feb 2026 11:14:52 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <87qzqub41f.fsf@jacob.g10code.de>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87qzqub41f.fsf@jacob.g10code.de>
Message-ID: <aYnBrN3CKPxO_jGY@symphytum.spacehopper.org>

On 2026/02/09 11:32, Werner Koch wrote:
> On Fri,  6 Feb 2026 14:41, Stuart Henderson said:
> > When building the "notmuch" email indexer, the configure script tests
> > that gmime can extract a session key, which it does using gcrypt.
> > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).
> 
> You mean Notmuch uses its own *PGP parser to "extract" the session key?
> 
> I recall that there was a discussion on speeding up things by storing
> decrypted session in a database and then use gpg --override-sssion-key.
> But I can't remember the details.

I didn't look into how it's used in the main program code, was just
hitting the failure where it's used in autoconf when building (I'm an os
package builder).

It does look like it's doing something along those lines in the main
code though:
https://git.notmuchmail.org/git?p=notmuch;a=blob;f=util/crypto.c;h=156a6550c20afef00a6bb5eaab94e8ba435cbfbd;hb=HEAD


From sam at gentoo.org  Mon Feb  9 17:43:31 2026
From: sam at gentoo.org (Sam James)
Date: Mon, 09 Feb 2026 16:43:31 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
Message-ID: <87a4xhkguk.fsf@gentoo.org>

Stuart Henderson via Gcrypt-devel <gcrypt-devel at gnupg.org> writes:

> When building the "notmuch" email indexer, the configure script tests
> that gmime can extract a session key, which it does using gcrypt.
> Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).

Do you see anything useful if you run the reproducer under Valgrind?

> [...]

sam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 418 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260209/982b1976/attachment.sig>

From stu at spacehopper.org  Mon Feb  9 17:55:23 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Mon, 9 Feb 2026 16:55:23 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <87a4xhkguk.fsf@gentoo.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87a4xhkguk.fsf@gentoo.org>
Message-ID: <aYoRe3NpUrln1g_c@symphytum.spacehopper.org>

On 2026/02/09 16:43, Sam James wrote:
> Stuart Henderson via Gcrypt-devel <gcrypt-devel at gnupg.org> writes:
> 
> > When building the "notmuch" email indexer, the configure script tests
> > that gmime can extract a session key, which it does using gcrypt.
> > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).
> 
> Do you see anything useful if you run the reproducer under Valgrind?

I'm seeing this on OpenBSD, Valgrind doesn't currently work there.


From sam at gentoo.org  Mon Feb  9 18:14:12 2026
From: sam at gentoo.org (Sam James)
Date: Mon, 09 Feb 2026 17:14:12 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <aYoRe3NpUrln1g_c@symphytum.spacehopper.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87a4xhkguk.fsf@gentoo.org>
 <aYoRe3NpUrln1g_c@symphytum.spacehopper.org>
Message-ID: <871pitkfff.fsf@gentoo.org>

Stuart Henderson <stu at spacehopper.org> writes:

> On 2026/02/09 16:43, Sam James wrote:
>> Stuart Henderson via Gcrypt-devel <gcrypt-devel at gnupg.org> writes:
>> 
>> > When building the "notmuch" email indexer, the configure script tests
>> > that gmime can extract a session key, which it does using gcrypt.
>> > Since 1.12.0 this frequently, though not always, fails on i386 (32-bit).
>> 
>> Do you see anything useful if you run the reproducer under Valgrind?
>
> I'm seeing this on OpenBSD, Valgrind doesn't currently work there.

Ah, sorry for the silly question then.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 418 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260209/2fbd90a5/attachment.sig>

From gniibe at fsij.org  Tue Feb 10 07:50:57 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Tue, 10 Feb 2026 15:50:57 +0900
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
Message-ID: <87wm0lytv2.fsf@haruna.fsij.org>

Hello,

Thank you for your report.

Stuart Henderson wrote:
> The script below replicates the test setup used by notmuch (requires
> gmime and gnupg to be installed).

I build gnupg from master for i386 on a Debian machine (also build npth,
libgpg-error, libgcrypt, libassuan, libksba, and ntbtls).

With this newly built gpg, I can't replicate the failure by running the
_check_session_keys program many times.  (I make sure using the newly
built gnupg i386 version with libgcrypt 1.12.)

Manually decrypting the message (basic-encrypted.eml), by the newly
built gnupg, also gets success.  I can't replicate any failure.

Could you please narrow down the failure?  IIUC, when the
_check_session_keys program fails, invoking gpg must fail too.
If so, debug output of gpg (and/or gpg-agent) helps.
-- 


From gniibe at fsij.org  Wed Feb 11 03:07:27 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Wed, 11 Feb 2026 11:07:27 +0900
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <87wm0lytv2.fsf@haruna.fsij.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org>
Message-ID: <87wm0k3ue8.fsf@haruna.fsij.org>

NIIBE Yutaka <gniibe at fsij.org> wrote:
> Could you please narrow down the failure?  IIUC, when the
> _check_session_keys program fails, invoking gpg must fail too.
> If so, debug output of gpg (and/or gpg-agent) helps.

I mean, I'd like to locate the bug.

There are two things in the notmuch/configure script.

(1) importing the secret key
(2) running the _check_session_keys program

Does the failure occur in (1) or (2)?  And what version of gpg and
gpg-agent you are running? Those information helps us.

I created the ticket:

    https://dev.gnupg.org/T8094

I put the tag of "libgcrypt" for this, but it may be other
parts of program (gnupg, libgpg-error, etc.) which cause
the issue.
-- 


From gniibe at fsij.org  Sat Feb 14 04:42:54 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 12:42:54 +0900
Subject: [PATCH 2/2] mpi:ec: Use mpi_new with NBITS, instead of mpi_alloc.
In-Reply-To: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org> <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
Message-ID: <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>


* mpi/ec.c (ec_get_two_inv_p): Use mpi_new with NBITS.
* cipher/ecc-ecdsa.c (_gcry_ecc_ecdsa_sign): Likewise.
(_gcry_ecc_ecdsa_verify): Likewise.
* cipher/ecc-gost.c (_gcry_ecc_gost_sign): Likewise.
(_gcry_ecc_gost_verify): Likewise.

--

GnuPG-bug-id: 8094
Signed-off-by: NIIBE Yutaka <gniibe at fsij.org>
---
 cipher/ecc-ecdsa.c | 16 ++++++++--------
 cipher/ecc-gost.c  | 24 ++++++++++++------------
 mpi/ec.c           |  2 +-
 3 files changed, 21 insertions(+), 21 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-mpi-ec-Use-mpi_new-with-NBITS-instead-of-mpi_alloc.patch
Type: text/x-patch
Size: 2616 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260214/20792bfc/attachment.bin>

From gniibe at fsij.org  Sat Feb 14 04:42:53 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 12:42:53 +0900
Subject: [PATCH 1/2] mpi:ec: Make sure to have MPI limbs in ECC.
In-Reply-To: <87wm0k3ue8.fsf@haruna.fsij.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org> <87wm0k3ue8.fsf@haruna.fsij.org>
Message-ID: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>


* src/mpi.h (_gcry_mpi_point_init): Add NBITS argument.
* mpi/ec.c (point_init): Follow the change.
(_gcry_mpi_point_log): Fix mpi_new with NBITS.
(_gcry_mpi_point_new): Fix _gcry_mpi_point_init with NBITS.
(_gcry_mpi_point_init): Initialize with mpi_new with NBITS.
(_gcry_mpi_ec_get_affine): Fix mpi_new with NBITS.
(montgomery_mul_point): Fix point_init with NBITS.
(mpi_ec_mul_point_lli): Fix point_init and mpi_new with NBITS.
(_gcry_mpi_ec_mul_point): Fix point_init with NBITS.
(_gcry_mpi_ec_curve_point): Fix mpi_new with NBITS.
* mpi/ec-hw-s390x.c (_gcry_s390x_ec_hw_mul_point): Likewise.
(s390_mul_point_montgomery): Likewise.
* cipher/ecc-common.h (point_init): Follow the change of
_gcry_mpi_point_init.
* cipher/ecc-curves.c (_gcry_ecc_get_curve): Likewise.
(point_from_keyparam): Fix mpi_point_new with NBITS.
(mpi_ec_get_elliptic_curve): Follow the change of
_gcry_mpi_point_init.
(_gcry_ecc_set_mpi): Fix mpi_point_new with NBITS.
* cipher/ecc-ecdh.c (_gcry_ecc_curve_keypair)
(_gcry_ecc_curve_mul_point): Fix point_init with NBITS.
* cipher/ecc-ecdsa.c (_gcry_ecc_ecdsa_sign): Likewise.
(_gcry_ecc_ecdsa_verify): Likewise.
* cipher/ecc-eddsa.c (_gcry_ecc_eddsa_encodepoint, ecc_ed448_recover_x)
(_gcry_ecc_eddsa_recover_x): Fix mpi_new with NBITS.
(_gcry_ecc_eddsa_genkey): Remove unused X and Y.  Fix point_init with
NBITS.
(_gcry_ecc_eddsa_sign): Fix mpi_new with NBITS.  Fix point_init with
NBITS.
(_gcry_ecc_eddsa_verify): Fix point_init with NBITS.
* cipher/ecc-gost.c (_gcry_ecc_gost_sign, _gcry_ecc_gost_verify):
Likewise.
* cipher/ecc-misc.c (_gcry_ecc_curve_copy): Follow the change of
_gcry_mpi_point_init.
(_gcry_mpi_ec_ec2os, _gcry_ecc_sec_decodepoint): Fix mpi_new with
NBITS.
(_gcry_ecc_compute_public): Fix mpi_point_new with NBITS.
* cipher/ecc-sm2.c (_gcry_ecc_sm2_encrypt): Fix point_init with NBITS.
Fix mpi_new with NBITS.
(_gcry_ecc_sm2_decrypt, _gcry_ecc_sm2_sign, _gcry_ecc_sm2_verify):
Likewise.
* cipher/ecc.c (nist_generate_key): Fix point_init with NBITS.
(test_keys): Likewise.
(test_ecdh_only_keys): Fix point_init and mpi_new with NBITS.
(check_secret_key): Likewise.
(ecc_generate): Fix mpi_new with NBITS.
(ecc_encrypt_raw): Fix mpi_new and point_init with NBITS.
(ecc_decrypt_raw): Fix point_init and mpi_new with NBITS.
(compute_keygrip): Fix mpi_new with NBITS.

--

The changes for ECC least leak assume that the limbs for MPI are
allocated and enough.  In the past, we had a practice to use
"mpi_new (0)" to initialize an MPI, which only allocates the
placeholder of MPI and not the limbs.  This is the fix of those places
in ECC.

GnuPG-bug-id: 8094
Signed-off-by: NIIBE Yutaka <gniibe at fsij.org>
---
 cipher/ecc-common.h |  2 +-
 cipher/ecc-curves.c |  8 +++----
 cipher/ecc-ecdh.c   |  6 ++---
 cipher/ecc-ecdsa.c  |  8 +++----
 cipher/ecc-eddsa.c  | 44 ++++++++++++++++-------------------
 cipher/ecc-gost.c   |  8 +++----
 cipher/ecc-misc.c   | 18 +++++++--------
 cipher/ecc-sm2.c    | 38 +++++++++++++++---------------
 cipher/ecc.c        | 55 ++++++++++++++++++++++----------------------
 mpi/ec-hw-s390x.c   |  6 ++---
 mpi/ec.c            | 56 ++++++++++++++++++++++-----------------------
 src/mpi.h           |  2 +-
 12 files changed, 123 insertions(+), 128 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-mpi-ec-Make-sure-to-have-MPI-limbs-in-ECC.patch
Type: text/x-patch
Size: 22200 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260214/6e6d7ccd/attachment-0001.bin>

From gniibe at fsij.org  Sat Feb 14 08:55:38 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Sat, 14 Feb 2026 16:55:38 +0900
Subject: [PATCH 2/2] mpi:ec: Use mpi_new with NBITS, instead of mpi_alloc.
In-Reply-To: <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org> <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
 <44bfa11c45cc7b4903467d822a93db7bbc18283d.1771039985.git.gniibe@fsij.org>
Message-ID: <87ikbz92th.fsf@haruna.fsij.org>

Hello,

One more patch for this series.
-- 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-cipher-ecc-Fix-Weierstrass-curve-with-PUBKEY_FLAG_PA.patch
Type: text/x-diff
Size: 2908 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260214/2e5de241/attachment.patch>

From wk at gnupg.org  Tue Feb 17 08:32:54 2026
From: wk at gnupg.org (Werner Koch)
Date: Tue, 17 Feb 2026 08:32:54 +0100
Subject: [PATCH 1/2] mpi:ec: Make sure to have MPI limbs in ECC.
In-Reply-To: <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
 (NIIBE Yutaka via Gcrypt-devel's message of "Sat, 14 Feb 2026 12:42:53
 +0900")
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org> <87wm0k3ue8.fsf@haruna.fsij.org>
 <43b648f0465fb449471944a84bb40f45996f6de3.1771039985.git.gniibe@fsij.org>
Message-ID: <871pij9655.fsf@jacob.g10code.de>

Hi!

That is a pretty large chnage but it seems to be okay.


Shalom-Salam,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260217/782784d8/attachment.sig>

From stu at spacehopper.org  Tue Feb 17 16:20:08 2026
From: stu at spacehopper.org (Stuart Henderson)
Date: Tue, 17 Feb 2026 15:20:08 +0000
Subject: libgcrypt 1.12.0: g_mime_multipart_encrypted_decrypt failing on
 i386
In-Reply-To: <87wm0k3ue8.fsf@haruna.fsij.org>
References: <aYX9mimjccCPyGMI@symphytum.spacehopper.org>
 <87wm0lytv2.fsf@haruna.fsij.org> <87wm0k3ue8.fsf@haruna.fsij.org>
Message-ID: <aZSHKFXDYbLtKTdE@symphytum.spacehopper.org>

On 2026/02/11 11:07, NIIBE Yutaka wrote:
> NIIBE Yutaka <gniibe at fsij.org> wrote:
> > Could you please narrow down the failure?  IIUC, when the
> > _check_session_keys program fails, invoking gpg must fail too.
> > If so, debug output of gpg (and/or gpg-agent) helps.
> 
> I mean, I'd like to locate the bug.
> 
> There are two things in the notmuch/configure script.
> 
> (1) importing the secret key
> (2) running the _check_session_keys program
> 
> Does the failure occur in (1) or (2)?  And what version of gpg and
> gpg-agent you are running? Those information helps us.

It occurs in (2) for me, never seen it in (1).

gpg, gpg-agent: 2.5.16
libgpg-error: 1.58
libassuan: 3.0.2
libksba: 1.6.7
libnpth: 1.8

Same happens if I import the secret key into gpg and then try to use
gpg --decrypt on the encrypted message from the .eml file.

> I created the ticket:
> 
>     https://dev.gnupg.org/T8094
> 
> I put the tag of "libgcrypt" for this, but it may be other
> parts of program (gnupg, libgpg-error, etc.) which cause
> the issue.
> -- 


From dtsen at us.ibm.com  Tue Feb 24 01:27:49 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:49 -0600
Subject: [PATCH 1/5] dilithium: Added optimized dilithium NTT support for
 ppc64le.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-2-dtsen@us.ibm.com>

Optimized dilithium (ML-DSA) NTT algorithm for ppc64le (Power 8 and
above).

Signed-off-by: Danny Tsen <dtsen at us.ibm.com>
---
 cipher/dilithium_ntt_p8le.S | 859 ++++++++++++++++++++++++++++++++++++
 1 file changed, 859 insertions(+)
 create mode 100644 cipher/dilithium_ntt_p8le.S

diff --git a/cipher/dilithium_ntt_p8le.S b/cipher/dilithium_ntt_p8le.S
new file mode 100644
index 00000000..8932d8e8
--- /dev/null
+++ b/cipher/dilithium_ntt_p8le.S
@@ -0,0 +1,859 @@
+/*
+ * This file was modified for use by Libgcrypt.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <https://www.gnu.org/licenses/>.
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * You can also use this file under the same licence of original code.
+ * SPDX-License-Identifier: CC0 OR Apache-2.0
+ *
+ */
+/*
+ *
+ * Copyright IBM Corp. 2025, 2026
+ *
+ * ===================================================================================
+ * Written by Danny Tsen <dtsen at us.ibm.com>
+ */
+
+#define QINV_OFFSET 16
+#define ZETA_NTT_OFFSET 32
+
+#define MLDSA_Q    8380417
+#define MLDSA_QINV 58728449
+
+#define QINV    0
+#define V_Q     1
+#define V_ZETA  2
+#define V_Z0    2
+#define V_Z1    3
+#define V_Z2    4
+#define V_Z3    5
+
+.machine "any"
+.text
+
+.macro SAVE_REGS
+        stdu    1, -352(1)
+        mflr    0
+        std     14, 56(1)
+        std     15, 64(1)
+        std     16, 72(1)
+        std     17, 80(1)
+        std     18, 88(1)
+        std     19, 96(1)
+        std     20, 104(1)
+        std     21, 112(1)
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        stxvx   32+20, 10, 1
+        stxvx   32+21, 11, 1
+        stxvx   32+22, 12, 1
+        stxvx   32+23, 14, 1
+        stxvx   32+24, 15, 1
+        stxvx   32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        stxvx   32+26, 10, 1
+        stxvx   32+27, 11, 1
+        stxvx   32+28, 12, 1
+        stxvx   32+29, 14, 1
+.endm
+
+.macro RESTORE_REGS
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        lxvx    32+20, 10, 1
+        lxvx    32+21, 11, 1
+        lxvx    32+22, 12, 1
+        lxvx    32+23, 14, 1
+        lxvx    32+24, 15, 1
+        lxvx    32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        lxvx    32+26, 10, 1
+        lxvx    32+27, 11, 1
+        lxvx    32+28, 12, 1
+        lxvx    32+29, 14, 1
+        ld      14, 56(1)
+        ld      15, 64(1)
+        ld      16, 72(1)
+        ld      17, 80(1)
+        ld      18, 88(1)
+        ld      19, 96(1)
+        ld      20, 104(1)
+        ld      21, 112(1)
+
+        mtlr    0
+        addi    1, 1, 352
+.endm
+
+/*
+ * Init_Coeffs_offset: initial offset setup for the coefficient array.
+ *
+ * start: beginning of the offset to the coefficient array.
+ * next: Next offset.
+ * len: Index difference between coefficients.
+ *
+ * r7: len * 2, each coefficient component is 32 bits.
+ *
+ * registers used for offset to coefficients, r[j] and r[j+len]
+ * R9: offset to r0 = j
+ * R16: offset to r1 = r0 + next
+ * R18: offset to r2 = r1 + next
+ * R20: offset to r3 = r2 + next
+ *
+ * R10: offset to r'0 = r0 + len*2
+ * R17: offset to r'1 = r'0 + next
+ * R19: offset to r'2 = r'1 + next
+ * R21: offset to r'3 = r'2 + next
+ *
+ */
+.macro Init_Coeffs_offset start next
+        li      9, \start       /* first offset to j */
+        add     10, 7, 9        /* J + len*2 */
+        addi    16, 9, \next
+        addi    17, 10, \next
+        addi    18, 16, \next
+        addi    19, 17, \next
+        addi    20, 18, \next
+        addi    21, 19, \next
+.endm
+
+/*
+ * For Len=1, load 1-1-1-1 layout
+ *
+ * Load Coefficients and setup vectors
+ *    rj0, rjlen1, rj2, rjlen3
+ *    rj4, rjlen5, rj6, rjlen7
+ *
+ *  Each vmrgew and vmrgow will transpose vectors as,
+ *
+ *   rj vector = (rj0, rj4, rj2, rj6)
+ *   rjlen vector = (rjlen1, rjlen5, rjlen3, rjlen7)
+ *
+ *  r' =r[j+len]: V6, V7, V8, V9
+ *  r = r[j]: V26, V27, V28, V29
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_41Coeffs
+        lxvd2x     32+10, 0, 5
+        lxvd2x     32+11, 10, 5
+        vmrgew 6, 10, 11
+        vmrgow 26, 10, 11
+        lxvd2x     32+12, 11, 5
+        lxvd2x     32+13, 12, 5
+        vmrgew 7, 12, 13
+        vmrgow 27, 12, 13
+        lxvd2x     32+10, 15, 5
+        lxvd2x     32+11, 16, 5
+        vmrgew 8, 10, 11
+        vmrgow 28, 10, 11
+        lxvd2x     32+12, 17, 5
+        lxvd2x     32+13, 18, 5
+        vmrgew 9, 12, 13
+        vmrgow 29, 12, 13
+.endm
+
+/*
+ * For Len=2, Load 2 - 2 - 2 - 2 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *    rj0, rj1, rjlen2, rjlen3,
+ *    rj4, rj5, rjlen6, arlen7
+ *  Each xxpermdi will transpose vectors as,
+ *  r[j]=      rj0, rj1, rj4, rj5
+ *  r[j+len]=  rjlen2, rjlen3, rjlen6, arlen7
+ *
+ *  r' = r[j+len]: V6, V7, V8, V9
+ *  r = r[j]: V26, V27, V28, V29
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_42Coeffs
+        lxvd2x     1, 0, 5
+        lxvd2x     2, 10, 5
+        xxpermdi 32+6, 1, 2, 3
+        xxpermdi 32+26, 1, 2, 0
+        lxvd2x     3, 11, 5
+        lxvd2x     4, 12, 5
+        xxpermdi 32+7, 3, 4, 3
+        xxpermdi 32+27, 3, 4, 0
+        lxvd2x     1, 15, 5
+        lxvd2x     2, 16, 5
+        xxpermdi 32+8, 1, 2, 3
+        xxpermdi 32+28, 1, 2, 0
+        lxvd2x     3, 17, 5
+        lxvd2x     4, 18, 5
+        xxpermdi 32+9, 3, 4, 3
+        xxpermdi 32+29, 3, 4, 0
+.endm
+
+/*
+ * For Len=8,
+ * Load coefficient with 2 legs with 64  bytes apart in
+ *  r[j+len] (r') vectors from offset, R10, R17, R19 and R21
+ *  r[j+len]: V6, V7, V8, V9
+ */
+.macro Load_22Coeffs start next
+        li      9, \start
+        add     10, 7, 9
+        addi    16, 9, \next
+        addi    17, 10, \next
+        li      18, \start+64
+        add     19, 7, 18
+        addi    20, 18, \next
+        addi    21, 19, \next
+        lxvd2x  32+6, 3, 10
+        lxvd2x  32+7, 3, 17
+        lxvd2x  32+8, 3, 19
+        lxvd2x  32+9, 3, 21
+.endm
+
+/*
+ * Load coefficient with 2 legs with len*2 bytes apart in
+ *  r[j+len] (r') vectors from offset, R10, R17, R19 and R21
+ *  r[j+len]: V6, V7, V8, V9
+ */
+.macro Load_4Coeffs start next
+        Init_Coeffs_offset \start, \next
+
+        lxvd2x  32+6, 3, 10
+        lxvd2x  32+7, 3, 17
+        lxvd2x  32+8, 3, 19
+        lxvd2x  32+9, 3, 21
+.endm
+
+/*
+ * Load 4 r[j] (r) coefficient vectors:
+ *   Load coefficient in vectors from offset, R9, R16, R18 and R20
+ *  r[j]: V26, V27, V28, V29
+ */
+.macro Load_4Rj
+        lxvd2x  32+26, 3, 9
+        lxvd2x  32+27, 3, 16
+        lxvd2x  32+28, 3, 18
+        lxvd2x  32+29, 3, 20
+.endm
+
+/*
+ * Compute final final r[j] and r[j+len]
+ *  final r[j+len]: V18, V19, V20, V21
+ *  final r[j]: V22, V23, V24, V25
+ */
+.macro Compute_4Coeff
+        vsubuwm 18, 26, 10
+        vadduwm 22, 26, 10
+
+        vsubuwm 19, 27, 11
+        vadduwm 23, 27, 11
+
+        vsubuwm 20, 28, 12
+        vadduwm 24, 28, 12
+
+        vsubuwm 21, 29, 13
+        vadduwm 25, 29, 13
+.endm
+
+.macro Write_One
+        stxvd2x 32+22, 3, 9
+        stxvd2x 32+18, 3, 10
+        stxvd2x 32+23, 3, 16
+        stxvd2x 32+19, 3, 17
+        stxvd2x 32+24, 3, 18
+        stxvd2x 32+20, 3, 19
+        stxvd2x 32+25, 3, 20
+        stxvd2x 32+21, 3, 21
+.endm
+
+/*
+ * Transpose the final coefficients of 2-2-2-2 layout to the original
+ * coefficient array order.
+ */
+.macro PermWrite42
+        xxpermdi 32+10, 32+22, 32+18, 0
+        xxpermdi 32+14, 32+22, 32+18, 3
+        xxpermdi 32+11, 32+23, 32+19, 0
+        xxpermdi 32+15, 32+23, 32+19, 3
+        xxpermdi 32+12, 32+24, 32+20, 0
+        xxpermdi 32+16, 32+24, 32+20, 3
+        xxpermdi 32+13, 32+25, 32+21, 0
+        xxpermdi 32+17, 32+25, 32+21, 3
+        stxvd2x    32+10, 0, 5
+        stxvd2x    32+14, 10, 5
+        stxvd2x    32+11, 11, 5
+        stxvd2x    32+15, 12, 5
+        stxvd2x    32+12, 15, 5
+        stxvd2x    32+16, 16, 5
+        stxvd2x    32+13, 17, 5
+        stxvd2x    32+17, 18, 5
+.endm
+
+/*
+ * Transpose the final coefficients of 1-1-1-1 layout to the original
+ * coefficient array order.
+ */
+.macro PermWrite41
+        vmrgew 10, 18, 22
+        vmrgow 11, 18, 22
+        vmrgew 12, 19, 23
+        vmrgow 13, 19, 23
+        vmrgew 14, 20, 24
+        vmrgow 15, 20, 24
+        vmrgew 16, 21, 25
+        vmrgow 17, 21, 25
+        stxvd2x 32+10, 0, 5
+        stxvd2x 32+11, 10, 5
+        stxvd2x 32+12, 11, 5
+        stxvd2x 32+13, 12, 5
+        stxvd2x 32+14, 15, 5
+        stxvd2x 32+15, 16, 5
+        stxvd2x 32+16, 17, 5
+        stxvd2x 32+17, 18, 5
+.endm
+
+.macro Load_next_4zetas
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        lxvd2x  32+V_Z0, 0, 14
+        lxvd2x  32+V_Z1, 10, 14
+        lxvd2x  32+V_Z2, 11, 14
+        lxvd2x  32+V_Z3, 12, 14
+        addi    14, 14, 64
+.endm
+
+/*
+ * montgomery_reduce
+ *  a = zeta * a[j+len]
+ *  t = (int64_t)(int32_t)a*QINV;
+ *  t = (a - (int64_t)t*Q) >> 32;
+ *
+ * -----------------------------------
+ * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3)
+ */
+.macro MREDUCE_4x  _vz0 _vz1 _vz2 _vz3
+        /* Coefficients computation results in abosulte value of 2^64 in
+           even and odd pairs */
+        vmulesw 10, 6, \_vz0
+        vmulosw 11, 6, \_vz0
+        vmulesw 12, 7, \_vz1
+        vmulosw 13, 7, \_vz1
+        vmulesw 14, 8, \_vz2
+        vmulosw 15, 8, \_vz2
+        vmulesw 16, 9, \_vz3
+        vmulosw 17, 9, \_vz3
+
+        /* Compute a*q^(-1) mod 2^32 and results in the upper 32 bits of
+           even pair */
+        vmulosw 18, 10, QINV
+        vmulosw 19, 11, QINV
+        vmulosw 20, 12, QINV
+        vmulosw 21, 13, QINV
+        vmulosw 22, 14, QINV
+        vmulosw 23, 15, QINV
+        vmulosw 24, 16, QINV
+        vmulosw 25, 17, QINV
+
+        vmulosw 18, 18, V_Q
+        vmulosw 19, 19, V_Q
+        vmulosw 20, 20, V_Q
+        vmulosw 21, 21, V_Q
+        vmulosw 22, 22, V_Q
+        vmulosw 23, 23, V_Q
+        vmulosw 24, 24, V_Q
+        vmulosw 25, 25, V_Q
+
+        vsubudm 18, 10, 18
+        vsubudm 19, 11, 19
+        vsubudm 20, 12, 20
+        vsubudm 21, 13, 21
+        vsubudm 22, 14, 22
+        vsubudm 23, 15, 23
+        vsubudm 24, 16, 24
+        vsubudm 25, 17, 25
+
+        vmrgew  10, 18, 19
+        vmrgew  11, 20, 21
+        vmrgew  12, 22, 23
+        vmrgew  13, 24, 25
+.endm
+
+/*
+ * For Len=1, layer with 1-1-1-1 layout.
+ */
+.macro NTT_MREDUCE_41x
+        Load_next_4zetas
+        Load_41Coeffs
+        MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3
+        Compute_4Coeff
+        PermWrite41
+        addi    5, 5, 128
+.endm
+
+/*
+ * For Len=2, layer with 2-2-2-2 layout.
+ */
+.macro NTT_MREDUCE_42x
+        Load_next_4zetas
+        Load_42Coeffs
+        MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3
+        Compute_4Coeff
+        PermWrite42
+        addi    5, 5, 128
+.endm
+
+/*
+ * For Len=8
+ */
+.macro NTT_MREDUCE_22x  start next _vz0 _vz1 _vz2 _vz3
+        Load_22Coeffs \start, \next
+        MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3
+        Load_4Rj
+        Compute_4Coeff
+        Write_One
+.endm
+
+/*
+ * For Len=128, 64, 32, 16 and 4.
+ */
+.macro NTT_MREDUCE_4x  start next _vz0 _vz1 _vz2 _vz3
+        Load_4Coeffs \start, \next
+        MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3
+        Load_4Rj
+        Compute_4Coeff
+        Write_One
+.endm
+
+/*
+ * mldsa_ntt_ppc(int32_t *r)
+ *   Compute forward NTT based on the following 8 layers -
+ *     len = 128, 64, 32, 16, 8, 4, 2, 1.
+ *
+ *   Each layer compute the coefficients on 2 legs, start and start + len*2 offsets.
+ *
+ *   leg 1                        leg 2
+ *   -----                        -----
+ *   start                        start+len*2
+ *   start+next                   start+len*2+next
+ *   start+next+next              start+len*2+next+next
+ *   start+next+next+next         start+len*2+next+next+next
+ *
+ *   Each computation loads 8 vectors, 4 for each leg.
+ *   The final coefficient (t) from each vector of leg1 and leg2 then do the
+ *   add/sub operations to obtain the final results.
+ *
+ *   -> leg1 = leg1 + t, leg2 = leg1 - t
+ *
+ *   The resulting coefficients then store back to each leg's offset.
+ *
+ *   Each vector has the same corresponding zeta except len=2.
+ *
+ *   len=2 has 2-2-2-2 layout which means every 2 32-bit coefficients has the same zeta.
+ *   e.g.
+ *         coeff vector    a1   a2   a3  a4  a5  a6  a7  a8
+ *         zeta  vector    z1   z1   z2  z2  z3  z3  z4  z4
+ *
+ *   For len=2, each vector will get permuted to leg1 and leg2. Zeta is
+ *   pre-arranged for the leg1 and leg2.  After the computation, each vector needs
+ *   to transpose back to its original 2-2-2-2 layout.
+ *
+ */
+.global mldsa_ntt_ppc
+.align 4
+mldsa_ntt_ppc:
+
+        SAVE_REGS
+
+        /* load Q and Q_NEG_INV */
+        addis   8,2,mldsa_consts at toc@ha
+        addi    8,8,mldsa_consts at toc@l
+        lvx     V_Q, 0, 8
+        li      10, QINV_OFFSET
+        lvx     QINV, 10, 8
+
+        /* set zetas array */
+        addi      14, 8, ZETA_NTT_OFFSET
+
+        /*
+         * 1. len = 128, start = 0
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        128
+         *        16        -        144
+         *          32        -        160
+         *                    ...
+         *            112        -        240
+         *     These are indexes to the 32 bits array
+         *
+         * r7 is len * 4
+         */
+        li      7, 512
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+
+        NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 448, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 2. len = 64, start = 0, 128
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        64
+         *        16        -        80
+         *          32        -        96
+         *                    ...
+         *      128        -        192
+         *        144        -        208
+         *          160        -        224
+         *            176        -        240
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 256
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 576, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 640, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 704, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 3. len = 32, start = 0, 64, 128, 192
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        32
+         *               ...
+         *      64        -        96
+         *               ...
+         *      128        -        160
+         *                ...
+         *      192        -        224
+         *                ...
+         *
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 128
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 576, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4x 832, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 4. len = 16, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        16
+         *        32        -        48
+         *          64        -        80
+         *                    ...
+         *            192        -        208
+         *              224        -        240
+         *
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 64
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 640, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4x 896, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 5. len = 8, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        8
+         *        32        -        40
+         *          64        -        72
+         *                    ...
+         *            192        -        200
+         *              224        -        232
+         *
+         *     These are indexes to the 32 bits array
+         */
+
+        li      7, 32
+        Load_next_4zetas
+        NTT_MREDUCE_22x 0, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        NTT_MREDUCE_22x 128, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_22x 256, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        NTT_MREDUCE_22x 384, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_22x 512, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        NTT_MREDUCE_22x 640, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_22x 768, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        NTT_MREDUCE_22x 896, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+.align 4
+        /*
+         * 6. len = 4, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        4
+         *        32        -        36
+         *          64        -        68
+         *                    ...
+         *            192        -        196
+         *              224        -        228
+         *
+         *     These are indexes to the 32 bits array
+         */
+
+        li      7, 16
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 0, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 128, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 256, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 384, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 512, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 640, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 768, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4x 896, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+.align 4
+        /*
+         * 7. len = 2, start = 0, 4, 8, 12,...244, 248, 252
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        4
+         *        8        -        12
+         *          16        -        20
+         *                    ...
+         *            240        -        244
+         *              248        -        252
+         *
+         *     These are indexes to the 32 bits array
+         */
+        mr      5, 3
+        li      7, 8
+
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        li      15, 64
+        li      16, 80
+        li      17, 96
+        li      18, 112
+
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+        NTT_MREDUCE_42x
+
+.align 4
+        /*
+         * 8. len = 1, start = 0, 2, 4, 6, 8, 10, 12,...254
+         *
+         *    Compute coefficients of the NTT based on the following sequences,
+         *      0, 1, 2, 3
+         *      4, 5, 6, 7
+         *      8, 9, 10, 11
+         *      12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243
+         *      244, 245, 246, 247
+         *      248, 249, 250, 251
+         *      252, 253, 254, 255
+         *
+         *     These are indexes to the 32 bits array.  Each loads 4 vectors.
+         */
+        mr      5, 3
+        li      7, 4
+
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+        NTT_MREDUCE_41x
+
+        RESTORE_REGS
+        blr
+.size     mldsa_ntt_ppc,.-mldsa_ntt_ppc
+
+.rodata
+.align 4
+mldsa_consts:
+.long  MLDSA_Q, MLDSA_Q, MLDSA_Q, MLDSA_Q
+.long  MLDSA_QINV, MLDSA_QINV, MLDSA_QINV, MLDSA_QINV
+
+/* zetas */
+mldsa_zetas:
+.long   25847, 25847, 25847, 25847, -2608894, -2608894, -2608894, -2608894
+.long   -518909, -518909, -518909, -518909, 237124, 237124, 237124, 237124
+.long   -777960, -777960, -777960, -777960, -876248, -876248, -876248, -876248
+.long   466468, 466468, 466468, 466468, 1826347, 1826347, 1826347, 1826347
+.long   2353451, 2353451, 2353451, 2353451, -359251, -359251, -359251, -359251
+.long   -2091905, -2091905, -2091905, -2091905, 3119733, 3119733, 3119733, 3119733
+.long   -2884855, -2884855, -2884855, -2884855, 3111497, 3111497, 3111497, 3111497
+.long   2680103, 2680103, 2680103, 2680103, 2725464, 2725464, 2725464, 2725464
+.long   1024112, 1024112, 1024112, 1024112, -1079900, -1079900, -1079900, -1079900
+.long   3585928, 3585928, 3585928, 3585928, -549488, -549488, -549488, -549488
+.long   -1119584, -1119584, -1119584, -1119584, 2619752, 2619752, 2619752, 2619752
+.long   -2108549, -2108549, -2108549, -2108549, -2118186, -2118186, -2118186, -2118186
+.long   -3859737, -3859737, -3859737, -3859737, -1399561, -1399561, -1399561, -1399561
+.long   -3277672, -3277672, -3277672, -3277672, 1757237, 1757237, 1757237, 1757237
+.long   -19422, -19422, -19422, -19422, 4010497, 4010497, 4010497, 4010497
+.long   280005, 280005, 280005, 280005
+/*For Len=4 */
+.long   2706023, 2706023, 2706023, 2706023, 95776, 95776, 95776, 95776
+.long   3077325, 3077325, 3077325, 3077325, 3530437, 3530437, 3530437, 3530437
+.long   -1661693, -1661693, -1661693, -1661693, -3592148, -3592148, -3592148, -3592148
+.long   -2537516, -2537516, -2537516, -2537516, 3915439, 3915439, 3915439, 3915439
+.long   -3861115, -3861115, -3861115, -3861115, -3043716, -3043716, -3043716, -3043716
+.long   3574422, 3574422, 3574422, 3574422, -2867647, -2867647, -2867647, -2867647
+.long   3539968, 3539968, 3539968, 3539968, -300467, -300467, -300467, -300467
+.long   2348700, 2348700, 2348700, 2348700, -539299, -539299, -539299, -539299
+.long   -1699267, -1699267, -1699267, -1699267, -1643818, -1643818, -1643818, -1643818
+.long   3505694, 3505694, 3505694, 3505694, -3821735, -3821735, -3821735, -3821735
+.long   3507263, 3507263, 3507263, 3507263, -2140649, -2140649, -2140649, -2140649
+.long   -1600420, -1600420, -1600420, -1600420, 3699596, 3699596, 3699596, 3699596
+.long   811944, 811944, 811944, 811944, 531354, 531354, 531354, 531354
+.long   954230, 954230, 954230, 954230, 3881043, 3881043, 3881043, 3881043
+.long   3900724, 3900724, 3900724, 3900724, -2556880, -2556880, -2556880, -2556880
+.long   2071892, 2071892, 2071892, 2071892, -2797779, -2797779, -2797779, -2797779
+/* For Len=2 */
+.long   -3930395, -3930395, -1528703, -1528703, -3677745, -3677745, -3041255, -3041255
+.long   -1452451, -1452451, 3475950, 3475950, 2176455, 2176455, -1585221, -1585221
+.long   -1257611, -1257611, 1939314, 1939314, -4083598, -4083598, -1000202, -1000202
+.long   -3190144, -3190144, -3157330, -3157330, -3632928, -3632928, 126922, 126922
+.long   3412210, 3412210, -983419, -983419, 2147896, 2147896, 2715295, 2715295
+.long   -2967645, -2967645, -3693493, -3693493, -411027, -411027, -2477047, -2477047
+.long   -671102, -671102, -1228525, -1228525, -22981, -22981, -1308169, -1308169
+.long   -381987, -381987, 1349076, 1349076, 1852771, 1852771, -1430430, -1430430
+.long   -3343383, -3343383, 264944, 264944, 508951, 508951, 3097992, 3097992
+.long   44288, 44288, -1100098, -1100098, 904516, 904516, 3958618, 3958618
+.long   -3724342, -3724342, -8578, -8578, 1653064, 1653064, -3249728, -3249728
+.long   2389356, 2389356, -210977, -210977, 759969, 759969, -1316856, -1316856
+.long   189548, 189548, -3553272, -3553272, 3159746, 3159746, -1851402, -1851402
+.long   -2409325, -2409325, -177440, -177440, 1315589, 1315589, 1341330, 1341330
+.long   1285669, 1285669, -1584928, -1584928, -812732, -812732, -1439742, -1439742
+.long   -3019102, -3019102, -3881060, -3881060, -3628969, -3628969, 3839961, 3839961
+/* Setup zetas for Len=1 as (3, 2, 1, 4) order */
+.long   2316500, 2091667, 3817976, 3407706, -2446433, -3342478, -3562462, 2244091
+.long   -1235728, 266997, 3513181, 2434439, -1197226, -3520352, -3193378, -3759364
+.long   909542, 900702, 819034, 1859098, -43260, 495491, -522500, -1613174
+.long   2031748, -655327, 3207046, -3122442, -768622, -3556995, -3595838, -525098
+.long   -2437823, 342297, 4108315, 286988, 1735879, 3437287, 203044, -3342277
+.long   -2590150, 2842341, 1265009, 2691481, 2486353, 4055324, 1595974, 1247620
+.long   2635921, -3767016, -3548272, 1250494, 1903435, -2994039, -1050970, 1869119
+.long   -3318210, -1333058, -1430225, 1237275, 3306115, -451100, -1962642, 1312455
+.long   -2546312, -1279661, -1374803, 1917081, 2235880, 1500165, 3406031, 777191
+.long   -1671176, -542412, -1846953, -2831860, 594136, -2584293, -3776993, -3724270
+.long   2454455, -2013608, -164721, 2432395, 185531, 1957272, -1207385, 3369112
+.long   1616392, -3183426, 3014001, 162844, -3694233, 810149, -1799107, 1652634
+.long   3866901, -3038916, 269760, 3523897, 1717735, 2213111, 472078, -975884
+.long   -1803090, -426683, 1910376, 1723600, -260646, -1667432, -3833893, -1104333
+.long   -420899, -2939036, -2286327, -2235985, 1612842, 183443, -3545687, -976891
+.long   -48306, -554416, -1362209, 3919660, -846154, 3937738, 1976782, 1400424
-- 
2.47.3


From dtsen at us.ibm.com  Tue Feb 24 01:27:50 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:50 -0600
Subject: [PATCH 2/5] dilithium: Added optimized dilithium inverse NTT support
 for ppc64le.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-3-dtsen@us.ibm.com>

Optimized dilithium (ML-DSA) inverse NTT algorithm for ppc64le
(Power 8 and above).

Signed-off-by: Danny Tsen <dtsen at us.ibm.com>
---
 cipher/dilithium_intt_p8le.S | 915 +++++++++++++++++++++++++++++++++++
 1 file changed, 915 insertions(+)
 create mode 100644 cipher/dilithium_intt_p8le.S

diff --git a/cipher/dilithium_intt_p8le.S b/cipher/dilithium_intt_p8le.S
new file mode 100644
index 00000000..b0f67979
--- /dev/null
+++ b/cipher/dilithium_intt_p8le.S
@@ -0,0 +1,915 @@
+/*
+ * This file was modified for use by Libgcrypt.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <https://www.gnu.org/licenses/>.
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * You can also use this file under the same licence of original code.
+ * SPDX-License-Identifier: CC0 OR Apache-2.0
+ *
+ */
+/*
+ *
+ * Copyright IBM Corp. 2025, 2026
+ *
+ * ===================================================================================
+ * Written by Danny Tsen <dtsen at us.ibm.com>
+ */
+
+#define QINV_OFFSET 16
+#define FCONST_OFFSET 32
+#define ZETA_INTT_OFFSET 48
+
+#define MLDSA_Q    8380417
+#define MLDSA_QINV 58728449
+#define FCONST     41978
+
+#define QINV    0
+#define V_Q     1
+#define V_F     2
+#define V_ZETA  2
+#define V_Z0    2
+#define V_Z1    3
+#define V_Z2    4
+#define V_Z3    5
+
+.machine "any"
+.text
+
+.macro SAVE_REGS
+        stdu    1, -352(1)
+        mflr    0
+        std     14, 56(1)
+        std     15, 64(1)
+        std     16, 72(1)
+        std     17, 80(1)
+        std     18, 88(1)
+        std     19, 96(1)
+        std     20, 104(1)
+        std     21, 112(1)
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        stxvx   32+20, 10, 1
+        stxvx   32+21, 11, 1
+        stxvx   32+22, 12, 1
+        stxvx   32+23, 14, 1
+        stxvx   32+24, 15, 1
+        stxvx   32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        stxvx   32+26, 10, 1
+        stxvx   32+27, 11, 1
+        stxvx   32+28, 12, 1
+        stxvx   32+29, 14, 1
+.endm
+
+.macro RESTORE_REGS
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        lxvx    32+20, 10, 1
+        lxvx    32+21, 11, 1
+        lxvx    32+22, 12, 1
+        lxvx    32+23, 14, 1
+        lxvx    32+24, 15, 1
+        lxvx    32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        lxvx    32+26, 10, 1
+        lxvx    32+27, 11, 1
+        lxvx    32+28, 12, 1
+        lxvx    32+29, 14, 1
+        ld      14, 56(1)
+        ld      15, 64(1)
+        ld      16, 72(1)
+        ld      17, 80(1)
+        ld      18, 88(1)
+        ld      19, 96(1)
+        ld      20, 104(1)
+        ld      21, 112(1)
+
+        mtlr    0
+        addi    1, 1, 352
+.endm
+
+/*
+ * Init_Coeffs_offset: initial offset setup for the coefficient array.
+ *
+ * start: beginning of the offset to the coefficient array.
+ * next: Next offset.
+ * len: Index difference between coefficients.
+ *
+ * r7: len * 2, each coefficient component is 32 bits.
+ *
+ * registers used for offset to coefficients, r[j] and r[j+len]
+ * R9: offset to r0 = j
+ * R16: offset to r1 = r0 + next
+ * R18: offset to r2 = r1 + next
+ * R20: offset to r3 = r2 + next
+ *
+ * R10: offset to r'0 = r0 + len*2
+ * R17: offset to r'1 = r'0 + next
+ * R19: offset to r'2 = r'1 + next
+ * R21: offset to r'3 = r'2 + next
+ *
+ */
+.macro Init_Coeffs_offset start next
+        li      9, \start       /* first offset to j */
+        add     10, 7, 9        /* J + len*2 */
+        addi    16, 9, \next
+        addi    17, 10, \next
+        addi    18, 16, \next
+        addi    19, 17, \next
+        addi    20, 18, \next
+        addi    21, 19, \next
+.endm
+
+/*
+ * For Len=1, load 1-1-1-1 layout
+ *
+ * Load Coefficients and setup vectors
+ *    rj0, rjlen1, rj2, rjlen3
+ *    rj4, rjlen5, rj6, rjlen7
+ *
+ *  Each vmrgew and vmrgow will transpose vectors as,
+ *
+ *   rj vector = (rj0, rj4, rj2, rj6)
+ *   rjlen vector = (rjlen1, rjlen5, rjlen3, rjlen7)
+ *
+ *  r' =r[j+len]: V18, V19, V20, V21
+ *  r = r[j]: V14, V15, V16, V17
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_41Coeffs
+        lxvd2x     32+10, 0, 5
+        lxvd2x     32+11, 10, 5
+        vmrgew 18, 10, 11
+        vmrgow 14, 10, 11
+        lxvd2x     32+12, 11, 5
+        lxvd2x     32+13, 12, 5
+        vmrgew 19, 12, 13
+        vmrgow 15, 12, 13
+        lxvd2x     32+10, 15, 5
+        lxvd2x     32+11, 16, 5
+        vmrgew 20, 10, 11
+        vmrgow 16, 10, 11
+        lxvd2x     32+12, 17, 5
+        lxvd2x     32+13, 18, 5
+        vmrgew 21, 12, 13
+        vmrgow 17, 12, 13
+.endm
+
+/*
+ * For Len=2, Load 2 - 2 - 2 - 2 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *    rj0, rj1, rjlen2, rjlen3,
+ *    rj4, rj5, rjlen6, arlen7
+ *  Each xxpermdi will transpose vectors as,
+ *  r[j]=      rj0, rj1, rj4, rj5
+ *  r[j+len]=  rjlen2, rjlen3, rjlen6, arlen7
+ *
+ *  r' =r[j+len]: V18, V19, V20, V21
+ *  r = r[j]: V14, V15, V16, V17
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_42Coeffs
+        lxvd2x     1, 0, 5
+        lxvd2x     2, 10, 5
+        xxpermdi 32+18, 1, 2, 3
+        xxpermdi 32+14, 1, 2, 0
+        lxvd2x     3, 11, 5
+        lxvd2x     4, 12, 5
+        xxpermdi 32+19, 3, 4, 3
+        xxpermdi 32+15, 3, 4, 0
+        lxvd2x     1, 15, 5
+        lxvd2x     2, 16, 5
+        xxpermdi 32+20, 1, 2, 3
+        xxpermdi 32+16, 1, 2, 0
+        lxvd2x     3, 17, 5
+        lxvd2x     4, 18, 5
+        xxpermdi 32+21, 3, 4, 3
+        xxpermdi 32+17, 3, 4, 0
+.endm
+
+/*
+ * For Len=8,
+ * Load coefficient with 2 legs with 64  bytes apart in
+ *  r[j+len] (r') vectors from offset, R10, R17, R19 and R21
+ *  r[j] (r) vectors from offset, R9, R16, R18 and R20
+ *  r[j+len]: V18, V19, V20, V21
+ *  r = r[j]: V14, V15, V16, V17
+ */
+.macro Load_22Coeffs start next
+        li      9, \start
+        add     10, 7, 9
+        addi    16, 9, \next
+        addi    17, 10, \next
+        li      18, \start+64
+        add     19, 7, 18
+        addi    20, 18, \next
+        addi    21, 19, \next
+        lxvd2x  32+18, 3, 10
+        lxvd2x  32+19, 3, 17
+        lxvd2x  32+20, 3, 19
+        lxvd2x  32+21, 3, 21
+
+        lxvd2x  32+14, 3, 9
+        lxvd2x  32+15, 3, 16
+        lxvd2x  32+16, 3, 18
+        lxvd2x  32+17, 3, 20
+.endm
+
+/*
+ * Load coefficient with 2 legs with len*2 bytes apart in
+ *  r[j+len] (r') vectors from offset, R10, R17, R19 and R21
+ *  r[j] (r) vectors from offset, R9, R16, R18 and R20
+ *  r[j+len]: V18, V19, V20, V21
+ *  r = r[j]: V14, V15, V16, V17
+ */
+.macro Load_4Coeffs start next
+        Init_Coeffs_offset \start, \next
+
+        lxvd2x  32+18, 3, 10
+        lxvd2x  32+19, 3, 17
+        lxvd2x  32+20, 3, 19
+        lxvd2x  32+21, 3, 21
+
+        lxvd2x  32+14, 3, 9
+        lxvd2x  32+15, 3, 16
+        lxvd2x  32+16, 3, 18
+        lxvd2x  32+17, 3, 20
+.endm
+
+/*
+ * Compute final final r[j] and r[j+len]
+ *  final r[j]: V26, V27, V28, V29
+ *  final r[j+len]: V6, V7, V8, V9
+ */
+.macro Compute_4Coeff
+        vadduwm 26, 14, 18
+        vsubuwm 6, 14, 18
+
+        vadduwm 27, 15, 19
+        vsubuwm 7, 15, 19
+
+        vadduwm 28, 16, 20
+        vsubuwm 8, 16, 20
+
+        vadduwm 29, 17, 21
+        vsubuwm 9, 17, 21
+.endm
+
+.macro Write_One
+        stxvd2x 32+26, 3, 9
+        stxvd2x 32+10, 3, 10
+        stxvd2x 32+27, 3, 16
+        stxvd2x 32+11, 3, 17
+        stxvd2x 32+28, 3, 18
+        stxvd2x 32+12, 3, 19
+        stxvd2x 32+29, 3, 20
+        stxvd2x 32+13, 3, 21
+.endm
+
+/*
+ * For Len=2
+ * Transpose the final coefficients of 2-2-2-2 layout to the original
+ * coefficient array order.
+ */
+.macro PermWrite42
+        xxpermdi 32+14, 32+26, 32+10, 0
+        xxpermdi 32+15, 32+26, 32+10, 3
+        xxpermdi 32+16, 32+27, 32+11, 0
+        xxpermdi 32+17, 32+27, 32+11, 3
+        xxpermdi 32+18, 32+28, 32+12, 0
+        xxpermdi 32+19, 32+28, 32+12, 3
+        xxpermdi 32+20, 32+29, 32+13, 0
+        xxpermdi 32+21, 32+29, 32+13, 3
+        stxvd2x    32+14, 0, 5
+        stxvd2x    32+15, 10, 5
+        stxvd2x    32+16, 11, 5
+        stxvd2x    32+17, 12, 5
+        stxvd2x    32+18, 15, 5
+        stxvd2x    32+19, 16, 5
+        stxvd2x    32+20, 17, 5
+        stxvd2x    32+21, 18, 5
+.endm
+
+/*
+ * For Len=1
+ * Transpose the final coefficients of 1-1-1-1 layout to the original
+ * coefficient array order.
+ */
+.macro PermWrite41
+        vmrgew 14, 10, 26
+        vmrgow 15, 10, 26
+        vmrgew 16, 11, 27
+        vmrgow 17, 11, 27
+        vmrgew 18, 12, 28
+        vmrgow 19, 12, 28
+        vmrgew 20, 13, 29
+        vmrgow 21, 13, 29
+        stxvd2x    32+14, 0, 5
+        stxvd2x    32+15, 10, 5
+        stxvd2x    32+16, 11, 5
+        stxvd2x    32+17, 12, 5
+        stxvd2x    32+18, 15, 5
+        stxvd2x    32+19, 16, 5
+        stxvd2x    32+20, 17, 5
+        stxvd2x    32+21, 18, 5
+.endm
+
+.macro Load_next_4zetas
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        lxvd2x  32+V_Z0, 0, 14
+        lxvd2x  32+V_Z1, 10, 14
+        lxvd2x  32+V_Z2, 11, 14
+        lxvd2x  32+V_Z3, 12, 14
+        addi    14, 14, 64
+.endm
+
+/*
+ * montgomery_reduce
+ *  montgomery_reduce((int64_t)zeta * a[j + len])
+ *    a = zeta * a[j+len]
+ *    t = (int64_t)(int32_t)a*QINV;
+ *    t = (a - (int64_t)t*Q) >> 32;
+ *
+ * Or
+ *  montgomery_reduce((int64_t)f * a[j])
+ *
+ * -----------------------------------
+ * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3)
+ */
+.macro MREDUCE_4x  _vz0 _vz1 _vz2 _vz3
+        /* Coefficients computation results in abosulte value of 2^64 in
+           even and odd pairs */
+        vmulesw 10, 6, \_vz0
+        vmulosw 11, 6, \_vz0
+        vmulesw 12, 7, \_vz1
+        vmulosw 13, 7, \_vz1
+        vmulesw 14, 8, \_vz2
+        vmulosw 15, 8, \_vz2
+        vmulesw 16, 9, \_vz3
+        vmulosw 17, 9, \_vz3
+
+        /* Compute a*q^(-1) mod 2^32 and results in the upper 32 bits of
+           even pair */
+        vmulosw 18, 10, QINV
+        vmulosw 19, 11, QINV
+        vmulosw 20, 12, QINV
+        vmulosw 21, 13, QINV
+        vmulosw 22, 14, QINV
+        vmulosw 23, 15, QINV
+        vmulosw 24, 16, QINV
+        vmulosw 25, 17, QINV
+
+        vmulosw 18, 18, V_Q
+        vmulosw 19, 19, V_Q
+        vmulosw 20, 20, V_Q
+        vmulosw 21, 21, V_Q
+        vmulosw 22, 22, V_Q
+        vmulosw 23, 23, V_Q
+        vmulosw 24, 24, V_Q
+        vmulosw 25, 25, V_Q
+
+        vsubudm 18, 10, 18
+        vsubudm 19, 11, 19
+        vsubudm 20, 12, 20
+        vsubudm 21, 13, 21
+        vsubudm 22, 14, 22
+        vsubudm 23, 15, 23
+        vsubudm 24, 16, 24
+        vsubudm 25, 17, 25
+
+        vmrgew  10, 18, 19
+        vmrgew  11, 20, 21
+        vmrgew  12, 22, 23
+        vmrgew  13, 24, 25
+.endm
+
+/*
+ * For Len=1, layer with 1-1-1-1 layout.
+ */
+.macro iNTT_MREDUCE_41x
+        Load_next_4zetas
+        Load_41Coeffs
+        Compute_4Coeff
+        MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3
+        PermWrite41
+        addi    5, 5, 128
+.endm
+
+/*
+ * For Len=2, layer with 2-2-2-2 layout.
+ */
+.macro iNTT_MREDUCE_42x
+        Load_next_4zetas
+        Load_42Coeffs
+        Compute_4Coeff
+        MREDUCE_4x V_Z0, V_Z1, V_Z2, V_Z3
+        PermWrite42
+        addi    5, 5, 128
+.endm
+
+/*
+ * For Len=8
+ */
+.macro iNTT_MREDUCE_22x  start next _vz0 _vz1 _vz2 _vz3
+        Load_22Coeffs \start, \next
+        Compute_4Coeff
+        MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3
+        Write_One
+.endm
+
+/*
+ * For Len=128, 64, 32, 16 and 4.
+ */
+.macro iNTT_MREDUCE_4x  start next _vz0 _vz1 _vz2 _vz3
+        Load_4Coeffs \start, \next
+        Compute_4Coeff
+        MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3
+        Write_One
+.endm
+
+.macro Reload_4coeffs
+        lxvd2x  32+6, 0, 6
+        lxvd2x  32+7, 10, 6
+        lxvd2x  32+8, 11, 6
+        lxvd2x  32+9, 12, 6
+.endm
+
+.macro Write_F
+        stxvd2x 32+10, 0, 6
+        stxvd2x 32+11, 10, 6
+        stxvd2x 32+12, 11, 6
+        stxvd2x 32+13, 12, 6
+        addi    6, 6, 64
+.endm
+
+.macro POLY_Mont_Reduce_4x
+        Reload_4coeffs
+        MREDUCE_4x V_F, V_F, V_F, V_F
+        Write_F
+
+        Reload_4coeffs
+        MREDUCE_4x V_F, V_F, V_F, V_F
+        Write_F
+
+        Reload_4coeffs
+        MREDUCE_4x V_F, V_F, V_F, V_F
+        Write_F
+
+        Reload_4coeffs
+        MREDUCE_4x V_F, V_F, V_F, V_F
+        Write_F
+.endm
+
+/*
+ * mldsa_intt_ppc(int32_t *r)
+ *
+ *   Compute Inverse NTT based on the following 8 layers -
+ *     len = 1, 2, 4, 8, 16, 32, 64, 128.
+ *
+ *   Each layer compute the coefficients on 2 legs, start and start + len*2 offsets.
+ *
+ *   leg 1                        leg 2
+ *   -----                        -----
+ *   start                        start+len*2
+ *   start+next                   start+len*2+next
+ *   start+next+next              start+len*2+next+next
+ *   start+next+next+next         start+len*2+next+next+next
+ *
+ *   Each computation loads 8 vectors, 4 for each leg.
+ *   The final coefficient (t) from each vector of leg1 and leg2 then do the
+ *   add/sub operations to obtain the final results.
+ *
+ *   -> leg1 = leg1 + t, leg2 = leg1 - t
+ *
+ *   The resulting coefficients then store back to each leg's offset.
+ *
+ *   Each vector has the same corresponding zeta except len=2.
+ *
+ *   len=2 has 2-2-2-2 layout which means every 2 32-bit coefficients has the same zeta.
+ *   e.g.
+ *         coeff vector    a1   a2   a3  a4  a5  a6  a7  a8
+ *         zeta  vector    z1   z1   z2  z2  z3  z3  z4  z4
+ *
+ *   For len=2, each vector will get permuted to leg1 and leg2. Zeta is
+ *   pre-arranged for the leg1 and leg2.  After the computation, each vector needs
+ *   to transpose back to its original 2-2-2-2 layout.
+ *
+ */
+.global mldsa_intt_ppc
+.align 4
+mldsa_intt_ppc:
+
+        SAVE_REGS
+
+        /* load Q and Q_NEG_INV */
+        addis   8,2,mldsa_consts at toc@ha
+        addi    8,8,mldsa_consts at toc@l
+        lvx     V_Q, 0, 8
+        li      10, QINV_OFFSET
+        lvx     QINV, 10, 8
+
+        /* set zetas array */
+        addi      14, 8, ZETA_INTT_OFFSET
+
+.align 4
+        /*
+         * 1. len = 1, start = 0, 2, 4, 6, 8, 10, 12,...254
+         *
+         *    Compute coefficients of the inverse NTT based on the following sequences,
+         *      0, 1, 2, 3
+         *      4, 5, 6, 7
+         *      8, 9, 10, 11
+         *      12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243
+         *      244, 245, 246, 247
+         *      248, 249, 250, 251
+         *      252, 253, 254, 255
+         *
+         *     These are indexes to the 32 bits array.  Each loads 4 vectors.
+         */
+        mr      5, 3
+        li      7, 4
+
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        li      15, 64
+        li      16, 80
+        li      17, 96
+        li      18, 112
+
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+        iNTT_MREDUCE_41x
+
+.align 4
+        /*
+         * 2. len = 2, start = 0, 4, 8, 12,...244, 248, 252
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        4
+         *        8        -        12
+         *          16        -        20
+         *                    ...
+         *            240        -        244
+         *              248        -        252
+         *
+         *     These are indexes to the 32 bits array
+         */
+        mr      5, 3
+        li      7, 8
+
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+        iNTT_MREDUCE_42x
+
+.align 4
+        /*
+         * 3. len = 4, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        4
+         *        32        -        36
+         *          64        -        68
+         *                    ...
+         *            192        -        196
+         *              224        -        228
+         *
+         *     These are indexes to the 32 bits array
+         */
+
+        li      7, 16
+
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 0, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*2, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*3, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*4, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*5, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*6, 32, V_Z0, V_Z1, V_Z2, V_Z3
+        Load_next_4zetas
+        iNTT_MREDUCE_4x 128*7, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+.align 4
+        /*
+         * 4. len = 8, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        8
+         *        32        -        40
+         *          64        -        72
+         *                    ...
+         *            192        -        200
+         *              224        -        232
+         *
+         *     These are indexes to the 32 bits array
+         */
+
+        li      7, 32
+        Load_next_4zetas
+        iNTT_MREDUCE_22x 0, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        iNTT_MREDUCE_22x 128, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        iNTT_MREDUCE_22x 128*2, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        iNTT_MREDUCE_22x 128*3, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        iNTT_MREDUCE_22x 128*4, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        iNTT_MREDUCE_22x 128*5, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+        Load_next_4zetas
+        iNTT_MREDUCE_22x 128*6, 16, V_Z0, V_Z0, V_Z1, V_Z1
+        iNTT_MREDUCE_22x 128*7, 16, V_Z2, V_Z2, V_Z3, V_Z3
+
+.align 4
+        /*
+         * 5. len = 16, start = 0, 32, 64, 96, 128, 160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        16
+         *        32        -        48
+         *          64        -        80
+         *                    ...
+         *            192        -        208
+         *              224        -        240
+         *
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 64
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*2, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*3, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*4, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*5, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*6, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 128*7, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 6. len = 32, start = 0, 64, 128, 192
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        32
+         *               ...
+         *      64        -        96
+         *               ...
+         *      128        -        160
+         *                ...
+         *      192        -        224
+         *                ...
+         *
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 128
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 256+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 512+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 768, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 768+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 7. len = 64, start = 0, 128
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        64
+         *        16        -        80
+         *          32        -        96
+         *                    ...
+         *      128        -        192
+         *        144        -        208
+         *          160        -        224
+         *            176        -        240
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 256
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        iNTT_MREDUCE_4x 512, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 512+64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 512+128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 512+192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        /*
+         * 8. len = 128, start = 0
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        128
+         *        16        -        144
+         *          32        -        160
+         *                    ...
+         *            112        -        240
+         *     These are indexes to the 32 bits array
+         */
+        li      7, 512
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+
+        iNTT_MREDUCE_4x 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*2, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*3, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*4, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*5, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*6, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        iNTT_MREDUCE_4x 64*7, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        /*
+         * Montgomery reduce loops with constant f=41978 (mont^2/256)
+         *
+         *  a[j] = montgomery_reduce((int64_t)f * a[j])
+         */
+	addi    10, 8, FCONST_OFFSET
+        lvx     V_F, 0, 10
+
+        li      10, 16
+        li      11, 32
+        li      12, 48
+
+	mr      6, 3
+
+        POLY_Mont_Reduce_4x
+        POLY_Mont_Reduce_4x
+        POLY_Mont_Reduce_4x
+        POLY_Mont_Reduce_4x
+
+        RESTORE_REGS
+        blr
+.size     mldsa_intt_ppc,.-mldsa_intt_ppc
+
+.rodata
+.align 4
+mldsa_consts:
+.long  MLDSA_Q, MLDSA_Q, MLDSA_Q, MLDSA_Q
+.long  MLDSA_QINV, MLDSA_QINV, MLDSA_QINV, MLDSA_QINV
+/* Constant for INTT, f=mont^2/256 */
+.long  FCONST, FCONST, FCONST, FCONST
+
+/* zetas  for qinv */
+mldsa_zetas:
+/* Zetas for Lane=1: setup as (3, 2, 1, 4) order */
+.long  -1400424, -1976782, -3937738, 846154, -3919660, 1362209, 554416, 48306
+.long  976891, 3545687, -183443, -1612842, 2235985, 2286327, 2939036, 420899
+.long  1104333, 3833893, 1667432, 260646, -1723600, -1910376, 426683, 1803090
+.long  975884, -472078, -2213111, -1717735, -3523897, -269760, 3038916, -3866901
+.long  -1652634, 1799107, -810149, 3694233, -162844, -3014001, 3183426, -1616392
+.long  -3369112, 1207385, -1957272, -185531, -2432395, 164721, 2013608, -2454455
+.long  3724270, 3776993, 2584293, -594136, 2831860, 1846953, 542412, 1671176
+.long  -777191, -3406031, -1500165, -2235880, -1917081, 1374803, 1279661, 2546312
+.long  -1312455, 1962642, 451100, -3306115, -1237275, 1430225, 1333058, 3318210
+.long  -1869119, 1050970, 2994039, -1903435, -1250494, 3548272, 3767016, -2635921
+.long  -1247620, -1595974, -4055324, -2486353, -2691481, -1265009, -2842341, 2590150
+.long  3342277, -203044, -3437287, -1735879, -286988, -4108315, -342297, 2437823
+.long  525098, 3595838, 3556995, 768622, 3122442, -3207046, 655327, -2031748
+.long  1613174, 522500, -495491, 43260, -1859098, -819034, -900702, -909542
+.long  3759364, 3193378, 3520352, 1197226, -2434439, -3513181, -266997, 1235728
+.long  -2244091, 3562462, 3342478, 2446433, -3407706, -3817976, -2091667, -2316500
+/* For Len=2 */
+.long  -3839961, -3839961, 3628969, 3628969, 3881060, 3881060, 3019102, 3019102
+.long  1439742, 1439742, 812732, 812732, 1584928, 1584928, -1285669, -1285669
+.long  -1341330, -1341330, -1315589, -1315589, 177440, 177440, 2409325, 2409325
+.long  1851402, 1851402, -3159746, -3159746, 3553272, 3553272, -189548, -189548
+.long  1316856, 1316856, -759969, -759969, 210977, 210977, -2389356, -2389356
+.long  3249728, 3249728, -1653064, -1653064, 8578, 8578, 3724342, 3724342
+.long  -3958618, -3958618, -904516, -904516, 1100098, 1100098, -44288, -44288
+.long  -3097992, -3097992, -508951, -508951, -264944, -264944, 3343383, 3343383
+.long  1430430, 1430430, -1852771, -1852771, -1349076, -1349076, 381987, 381987
+.long  1308169, 1308169, 22981, 22981, 1228525, 1228525, 671102, 671102
+.long  2477047, 2477047, 411027, 411027, 3693493, 3693493, 2967645, 2967645
+.long  -2715295, -2715295, -2147896, -2147896, 983419, 983419, -3412210, -3412210
+.long  -126922, -126922, 3632928, 3632928, 3157330, 3157330, 3190144, 3190144
+.long  1000202, 1000202, 4083598, 4083598, -1939314, -1939314, 1257611, 1257611
+.long  1585221, 1585221, -2176455, -2176455, -3475950, -3475950, 1452451, 1452451
+.long  3041255, 3041255, 3677745, 3677745, 1528703, 1528703, 3930395, 3930395
+/*  For Lane=4 */
+.long  2797779, 2797779, 2797779, 2797779, -2071892, -2071892, -2071892, -2071892
+.long  2556880, 2556880, 2556880, 2556880, -3900724, -3900724, -3900724, -3900724
+.long  -3881043, -3881043, -3881043, -3881043, -954230, -954230, -954230, -954230
+.long  -531354, -531354, -531354, -531354, -811944, -811944, -811944, -811944
+.long  -3699596, -3699596, -3699596, -3699596, 1600420, 1600420, 1600420, 1600420
+.long  2140649, 2140649, 2140649, 2140649, -3507263, -3507263, -3507263, -3507263
+.long  3821735, 3821735, 3821735, 3821735, -3505694, -3505694, -3505694, -3505694
+.long  1643818, 1643818, 1643818, 1643818, 1699267, 1699267, 1699267, 1699267
+.long  539299, 539299, 539299, 539299, -2348700, -2348700, -2348700, -2348700
+.long  300467, 300467, 300467, 300467, -3539968, -3539968, -3539968, -3539968
+.long  2867647, 2867647, 2867647, 2867647, -3574422, -3574422, -3574422, -3574422
+.long  3043716, 3043716, 3043716, 3043716, 3861115, 3861115, 3861115, 3861115
+.long  -3915439, -3915439, -3915439, -3915439, 2537516, 2537516, 2537516, 2537516
+.long  3592148, 3592148, 3592148, 3592148, 1661693, 1661693, 1661693, 1661693
+.long  -3530437, -3530437, -3530437, -3530437, -3077325, -3077325, -3077325, -3077325
+.long  -95776, -95776, -95776, -95776, -2706023, -2706023, -2706023, -2706023
+/* zetas for other len */
+.long  -280005, -280005, -280005, -280005, -4010497, -4010497, -4010497, -4010497
+.long  19422, 19422, 19422, 19422, -1757237, -1757237, -1757237, -1757237
+.long  3277672, 3277672, 3277672, 3277672, 1399561, 1399561, 1399561, 1399561
+.long  3859737, 3859737, 3859737, 3859737, 2118186, 2118186, 2118186, 2118186
+.long  2108549, 2108549, 2108549, 2108549, -2619752, -2619752, -2619752, -2619752
+.long  1119584, 1119584, 1119584, 1119584, 549488, 549488, 549488, 549488
+.long  -3585928, -3585928, -3585928, -3585928, 1079900, 1079900, 1079900, 1079900
+.long  -1024112, -1024112, -1024112, -1024112, -2725464, -2725464, -2725464, -2725464
+.long  -2680103, -2680103, -2680103, -2680103, -3111497, -3111497, -3111497, -3111497
+.long  2884855, 2884855, 2884855, 2884855, -3119733, -3119733, -3119733, -3119733
+.long  2091905, 2091905, 2091905, 2091905, 359251, 359251, 359251, 359251
+.long  -2353451, -2353451, -2353451, -2353451, -1826347, -1826347, -1826347, -1826347
+.long  -466468, -466468, -466468, -466468, 876248, 876248, 876248, 876248
+.long  777960, 777960, 777960, 777960, -237124, -237124, -237124, -237124
+.long  518909, 518909, 518909, 518909, 2608894, 2608894, 2608894, 2608894
+.long  -25847, -25847, -25847, -25847
-- 
2.47.3


From dtsen at us.ibm.com  Tue Feb 24 01:27:48 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:48 -0600
Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for
Message-ID: <20260224002753.151873-1-dtsen@us.ibm.com>

Added optimized (i)NTT algorithm support for ppc64le (Power 8 and
above).  Defined ENABLE_PPC_DILITHIUM and ENABLE_PPC_KYBER for
dilithium (ML-DSA) and kyber (ML-KEM) NTT and inverse NTT.

Danny Tsen (5):
  dilithium: Added optimized dilithium NTT support for ppc64le.
  dilithium: Added optimized dilithium inverse NTT support for ppc64le.
  kyber: Added optimized kyber NTT support for ppc64le.
  kyber: Added optimized kyber inverse NTT support for ppc64le.
  dilithium-kyber: Added ppc64le dilithium and kyber (i)NTT support.

 cipher/dilithium-common.c    |  13 +
 cipher/dilithium_intt_p8le.S | 915 +++++++++++++++++++++++++++++++++++
 cipher/dilithium_ntt_p8le.S  | 859 ++++++++++++++++++++++++++++++++
 cipher/kyber-common.c        |  13 +
 cipher/kyber_intt_p8le.S     | 878 +++++++++++++++++++++++++++++++++
 cipher/kyber_ntt_p8le.S      | 716 +++++++++++++++++++++++++++
 configure.ac                 |  20 +
 7 files changed, 3414 insertions(+)
 create mode 100644 cipher/dilithium_intt_p8le.S
 create mode 100644 cipher/dilithium_ntt_p8le.S
 create mode 100644 cipher/kyber_intt_p8le.S
 create mode 100644 cipher/kyber_ntt_p8le.S

-- 
2.47.3


From dtsen at us.ibm.com  Tue Feb 24 01:27:51 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:51 -0600
Subject: [PATCH 3/5] kyber: Added optimized kyber NTT support for ppc64le.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-4-dtsen@us.ibm.com>

Optimized kyber (ML-KEM) NTT algorithm for ppc64le (Power 8 and above).

Signed-off-by: Danny Tsen <dtsen at us.ibm.com>
---
 cipher/kyber_ntt_p8le.S | 716 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 716 insertions(+)
 create mode 100644 cipher/kyber_ntt_p8le.S

diff --git a/cipher/kyber_ntt_p8le.S b/cipher/kyber_ntt_p8le.S
new file mode 100644
index 00000000..401598f2
--- /dev/null
+++ b/cipher/kyber_ntt_p8le.S
@@ -0,0 +1,716 @@
+/*
+ * This file was modified for use by Libgcrypt.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <https://www.gnu.org/licenses/>.
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * You can also use this file under the same licence of original code.
+ * SPDX-License-Identifier: CC0 OR Apache-2.0
+ *
+ */
+/*
+ * Copyright IBM Corp. 2025, 2026
+ *
+ * ===================================================================================
+ * Written by Danny Tsen <dtsen at us.ibm.com>
+ */
+
+#define QINV_OFFSET 16
+#define ZETA_NTT_OFFSET 32
+
+#define V_QINV  2
+#define V_NMKQ  5
+#define V_Z0    7
+#define V_Z1    8
+#define V_Z2    9
+#define V_Z3    10
+#define V_ZETA  10
+
+.machine "any"
+.text
+
+.macro SAVE_REGS
+        stdu    1, -352(1)
+        mflr    0
+        std     14, 56(1)
+        std     15, 64(1)
+        std     16, 72(1)
+        std     17, 80(1)
+        std     18, 88(1)
+        std     19, 96(1)
+        std     20, 104(1)
+        std     21, 112(1)
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        stxvx   32+20, 10, 1
+        stxvx   32+21, 11, 1
+        stxvx   32+22, 12, 1
+        stxvx   32+23, 14, 1
+        stxvx   32+24, 15, 1
+        stxvx   32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        li      15, 288
+        li      16, 304
+        stxvx   32+26, 10, 1
+        stxvx   32+27, 11, 1
+        stxvx   32+28, 12, 1
+        stxvx   32+29, 14, 1
+        stxvx   32+30, 15, 1
+        stxvx   32+31, 16, 1
+.endm
+
+.macro RESTORE_REGS
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        lxvx    32+20, 10, 1
+        lxvx    32+21, 11, 1
+        lxvx    32+22, 12, 1
+        lxvx    32+23, 14, 1
+        lxvx    32+24, 15, 1
+        lxvx    32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        li      15, 288
+        li      16, 304
+        lxvx    32+26, 10, 1
+        lxvx    32+27, 11, 1
+        lxvx    32+28, 12, 1
+        lxvx    32+29, 14, 1
+        lxvx    32+30, 15, 1
+        lxvx    32+31, 16, 1
+        ld      14, 56(1)
+        ld      15, 64(1)
+        ld      16, 72(1)
+        ld      17, 80(1)
+        ld      18, 88(1)
+        ld      19, 96(1)
+        ld      20, 104(1)
+        ld      21, 112(1)
+
+        mtlr    0
+        addi    1, 1, 352
+.endm
+
+/*
+ * Init_Coeffs_offset: initial offset setup for the coefficient array.
+ *
+ * start: beginning of the offset to the coefficient array.
+ * next: Next offset.
+ * len: Index difference between coefficients.
+ *
+ * r7: len * 2, each coefficient component is 2 bytes.
+ *
+ * registers used for offset to coefficients, r[j] and r[j+len]
+ * R9: offset to r0 = j
+ * R16: offset to r1 = r0 + next
+ * R18: offset to r2 = r1 + next
+ * R20: offset to r3 = r2 + next
+ *
+ * R10: offset to r'0 = r0 + len*2
+ * R17: offset to r'1 = r'0 + step
+ * R19: offset to r'2 = r'1 + step
+ * R21: offset to r'3 = r'2 + step
+ *
+ */
+.macro Init_Coeffs_offset start next
+        li      9, \start       /* first offset to j */
+        add     10, 7, 9        /* J + len*2 */
+        addi    16, 9, \next
+        addi    17, 10, \next
+        addi    18, 16, \next
+        addi    19, 17, \next
+        addi    20, 18, \next
+        addi    21, 19, \next
+.endm
+
+/*
+ * Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21
+ *  r[j+len]: V13, V18, V23, V28
+ */
+.macro Load_4Rjp
+        lxvd2x  32+13, 3, 10    /* V13: vector r'0 */
+        lxvd2x  32+18, 3, 17    /* V18: vector for r'1 */
+        lxvd2x  32+23, 3, 19    /* V23: vector for r'2 */
+        lxvd2x  32+28, 3, 21    /* V28: vector for r'3 */
+.endm
+
+/*
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *  rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7
+ */
+.macro Load_4Coeffs start next
+        Init_Coeffs_offset \start \next
+        Load_4Rjp
+.endm
+
+/*
+ * Load 2 - 2 - 2 - 2 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *    rj0, rj1, rjlen2, rjlen3, rj4, rj5, rjlen6, arlen7
+ *    rj8, rj9, rjlen10, rjlen11, rj12, rj13, rjlen14, rjlen15
+ *  Each vmrgew and vmrgow will transpose vectors as,
+ *  r[j]=      rj0, rj1, rj8, rj9, rj4, rj5, rj12, rj13
+ *  r[j+len]=  rjlen2, rjlen3, rjlen10, rjlen11, rjlen6, arlen7, rjlen14, rjlen15
+ *
+ *  r[j+len]: V13, V18, V23, V28
+ *  r[j]: V12, V17, V22, V27
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_L24Coeffs
+        lxvd2x     32+25, 0, 5
+        lxvd2x     32+26, 10, 5
+        vmrgew 13, 25, 26
+        vmrgow 12, 25, 26
+        lxvd2x     32+25, 11, 5
+        lxvd2x     32+26, 12, 5
+        vmrgew 18, 25, 26
+        vmrgow 17, 25, 26
+        lxvd2x     32+25, 15, 5
+        lxvd2x     32+26, 16, 5
+        vmrgew 23, 25, 26
+        vmrgow 22, 25, 26
+        lxvd2x     32+25, 17, 5
+        lxvd2x     32+26, 18, 5
+        vmrgew 28, 25, 26
+        vmrgow 27, 25, 26
+.endm
+
+/*
+ * Load 4 - 4 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *  rj0, rj1, rj2, rj3, rjlen4, rjlen5, rjlen6, rjlen7
+ *  rj8, rj9, rj10, rj11, rjlen12, rjlen13, rjlen14, rjlen15
+ *
+ *  Each xxpermdi will transpose vectors as,
+ *  rjlen4, rjlen5, rjlen6, rjlen7, rjlen12, rjlen13, rjlen14, rjlen15
+ *  rj0, rj1, rj2, rj3, rj8, rj9, rj10, rj11
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_L44Coeffs
+        lxvd2x     1, 0, 5
+        lxvd2x     2, 10, 5
+        xxpermdi 32+13, 2, 1, 3
+        xxpermdi 32+12, 2, 1, 0
+        lxvd2x     3, 11, 5
+        lxvd2x     4, 12, 5
+        xxpermdi 32+18, 4, 3, 3
+        xxpermdi 32+17, 4, 3, 0
+        lxvd2x     1, 15, 5
+        lxvd2x     2, 16, 5
+        xxpermdi 32+23, 2, 1, 3
+        xxpermdi 32+22, 2, 1, 0
+        lxvd2x     3, 17, 5
+        lxvd2x     4, 18, 5
+        xxpermdi 32+28, 4, 3, 3
+        xxpermdi 32+27, 4, 3, 0
+.endm
+
+/*
+ * montgomery_reduce
+ * t = a * QINV
+ * t = (a - (int32_t)t*_MLKEM_Q) >> 16
+ *
+ * -----------------------------------
+ * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3)
+ */
+.macro MREDUCE_4X _vz0 _vz1 _vz2 _vz3
+        /* fqmul = zeta * coefficient
+           Modular multification bond by 2^16 * q in abs value */
+        vmladduhm 15, 13, \_vz0, 3
+        vmladduhm 20, 18, \_vz1, 3
+        vmladduhm 25, 23, \_vz2, 3
+        vmladduhm 30, 28, \_vz3, 3
+
+        /* Signed multiply-high-round; outputs are bound by 2^15 * q in abs value */
+        vmhraddshs 14, 13, \_vz0, 3
+        vmhraddshs 19, 18, \_vz1, 3
+        vmhraddshs 24, 23, \_vz2, 3
+        vmhraddshs 29, 28, \_vz3, 3
+
+        vmladduhm 15, 15, V_QINV, 3
+        vmladduhm 20, 20, V_QINV, 3
+        vmladduhm 25, 25, V_QINV, 3
+        vmladduhm 30, 30, V_QINV, 3
+
+        vmhraddshs 15, 15, V_NMKQ, 14
+        vmhraddshs 20, 20, V_NMKQ, 19
+        vmhraddshs 25, 25, V_NMKQ, 24
+        vmhraddshs 30, 30, V_NMKQ, 29
+
+        /* Shift right 1 bit */
+        vsrah 13, 15, 4
+        vsrah 18, 20, 4
+        vsrah 23, 25, 4
+        vsrah 28, 30, 4
+.endm
+
+/*
+ * Load 4 r[j] (r) coefficient vectors:
+ *   Load coefficient in vectors from offset, R9, R16, R18 and R20
+ *  r[j]: V12, V17, V22, V27
+ */
+.macro Load_4Rj
+        lxvd2x  32+12, 3, 9     /* V12: vector r0 */
+        lxvd2x  32+17, 3, 16    /* V17: vector r1 */
+        lxvd2x  32+22, 3, 18    /* V22: vector r2 */
+        lxvd2x  32+27, 3, 20    /* V27: vector r3 */
+.endm
+
+/*
+ * Compute final final r[j] and r[j+len]
+ *  final r[j+len]: V16, V21, V26, V31
+ *  final r[j]: V15, V20, V25, V30
+ */
+.macro Compute_4Coeffs
+        /* Since the result of the Montgomery multiplication is bounded
+           by q in absolute value.
+           Finally to complete the final update of the results with add/sub
+           r[j] = r[j] + t.
+           r[j+len] = r[j] - t
+         */
+        vsubuhm 16, 12, 13
+        vadduhm 15, 13, 12
+        vsubuhm 21, 17, 18
+        vadduhm 20, 18, 17
+        vsubuhm 26, 22, 23
+        vadduhm 25, 23, 22
+        vsubuhm 31, 27, 28
+        vadduhm 30, 28, 27
+.endm
+
+.macro Write_One
+        stxvd2x 32+15, 3, 9
+        stxvd2x 32+16, 3, 10
+        stxvd2x 32+20, 3, 16
+        stxvd2x 32+21, 3, 17
+        stxvd2x 32+25, 3, 18
+        stxvd2x 32+26, 3, 19
+        stxvd2x 32+30, 3, 20
+        stxvd2x 32+31, 3, 21
+.endm
+
+/*
+ * Transpose the final coefficients of 4-4 layout to the orginal
+ * coefficient array order.
+ */
+.macro PermWriteL44
+        Compute_4Coeffs
+        xxpermdi 0, 32+15, 32+16, 3
+        xxpermdi 1, 32+15, 32+16, 0
+        xxpermdi 2, 32+20, 32+21, 3
+        xxpermdi 3, 32+20, 32+21, 0
+        xxpermdi 4, 32+25, 32+26, 3
+        xxpermdi 5, 32+25, 32+26, 0
+        xxpermdi 6, 32+30, 32+31, 3
+        xxpermdi 7, 32+30, 32+31, 0
+        stxvd2x 0, 0, 5
+        stxvd2x 1, 10, 5
+        stxvd2x 2, 11, 5
+        stxvd2x 3, 12, 5
+        stxvd2x 4, 15, 5
+        stxvd2x 5, 16, 5
+        stxvd2x 6, 17, 5
+        stxvd2x 7, 18, 5
+.endm
+
+/*
+ * Transpose the final coefficients of 2-2-2-2 layout to the orginal
+ * coefficient array order.
+ */
+.macro PermWriteL24
+        Compute_4Coeffs
+        vmrgew 10, 16, 15
+        vmrgow 11, 16, 15
+        vmrgew 12, 21, 20
+        vmrgow 13, 21, 20
+        vmrgew 14, 26, 25
+        vmrgow 15, 26, 25
+        vmrgew 16, 31, 30
+        vmrgow 17, 31, 30
+        stxvd2x 32+10, 0, 5
+        stxvd2x 32+11, 10, 5
+        stxvd2x 32+12, 11, 5
+        stxvd2x 32+13, 12, 5
+        stxvd2x 32+14, 15, 5
+        stxvd2x 32+15, 16, 5
+        stxvd2x 32+16, 17, 5
+        stxvd2x 32+17, 18, 5
+.endm
+
+.macro Load_next_4zetas
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        lxvd2x  32+V_Z0, 0, 14
+        lxvd2x  32+V_Z1, 10, 14
+        lxvd2x  32+V_Z2, 11, 14
+        lxvd2x  32+V_Z3, 12, 14
+        addi    14, 14, 64
+.endm
+
+/*
+ * Re-ordering of the 4-4 layout zetas.
+ * Swap double-words.
+ */
+.macro Perm_4zetas
+        xxpermdi 32+V_Z0, 32+V_Z0, 32+V_Z0, 2
+        xxpermdi 32+V_Z1, 32+V_Z1, 32+V_Z1, 2
+        xxpermdi 32+V_Z2, 32+V_Z2, 32+V_Z2, 2
+        xxpermdi 32+V_Z3, 32+V_Z3, 32+V_Z3, 2
+.endm
+
+/*
+ * NTT layer Len=2.
+ */
+.macro NTT_REDUCE_L24
+        Load_next_4zetas
+        Load_L24Coeffs
+        MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3
+        PermWriteL24
+        addi    5, 5, 128
+.endm
+
+/*
+ * NTT layer Len=4.
+ */
+.macro NTT_REDUCE_L44
+        Load_next_4zetas
+        Perm_4zetas
+        Load_L44Coeffs
+        MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3
+        PermWriteL44
+        addi    5, 5, 128
+.endm
+
+/*
+ * NTT other layers.
+ */
+.macro NTT_MREDUCE_4X start next _vz0 _vz1 _vz2 _vz3
+        Load_4Coeffs \start, \next
+        MREDUCE_4x \_vz0, \_vz1, \_vz2, \_vz3
+        Load_4Rj
+        Compute_4Coeffs
+        Write_One
+.endm
+
+/*
+ * ntt_ppc(int16_t *r)
+ *   Compute forward NTT based on the following 7 layers -
+ *     len = 128, 64, 32, 16, 8, 4, 2.
+ *
+ *   Each layer compute the coeffients on 2 legs, start and start + len*2 offsets.
+ *
+ *   leg 1                        leg 2
+ *   -----                        -----
+ *   start                        start+len*2
+ *   start+next                   start+len*2+next
+ *   start+next+next              start+len*2+next+next
+ *   start+next+next+next         start+len*2+next+next+next
+ *
+ *   Each computation loads 8 vectors, 4 for each leg.
+ *   The final coefficient (t) from each vector of leg1 and leg2 then do the
+ *   add/sub operations to obtain the final results.
+ *
+ *   -> leg1 = leg1 + t, leg2 = leg1 - t
+ *
+ *   The resulting coeffients then store back to each leg's offset.
+ *
+ *   Each vector has the same corresponding zeta except len=4 and len=2.
+ *
+ *   len=4 has 4-4 layout which means every 4 16-bit coeffients has the same zeta.
+ *   and len=2 has 2-2-2-2 layout which means every 2 16-bit coeffients has the same zeta.
+ *   e.g.
+ *         coeff vector    a1   a2   a3  a4  a5  a6  a7  a8
+ *         zeta  vector    z1   z1   z2  z2  z3  z3  z4  z4
+ *
+ *   For len=4 and len=2, each vector will get permuted to leg1 and leg2. Zeta is
+ *   pre-arranged for the leg1 and leg2.  After the computation, each vector needs
+ *   to transpose back to its original 4-4 or 2-2-2-2 layout.
+ *
+ */
+.global ntt_ppc
+.align 4
+ntt_ppc:
+.localentry     ntt_ppc,.-ntt_ppc
+
+        SAVE_REGS
+
+        addis   8,2,mlkem_consts at toc@ha
+        addi    8,8,mlkem_consts at toc@l
+        lvx     V_NMKQ,0,8
+
+        addi    14, 8, ZETA_NTT_OFFSET
+
+        vxor    3, 3, 3
+        vspltish 4, 1
+
+        li      10, QINV_OFFSET
+        lvx     V_QINV, 10, 8
+
+.align 4
+        /*
+         * 1. len = 128, start = 0
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        128
+         *        32        -        160
+         *          64        -        192
+         *            96        -        224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 256          /* len * 2 */
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+
+        NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4X 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4X 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4X 192, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 2. len = 64, start = 0, 128
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        64
+         *        32        -       96
+         *          128        -      192
+         *            160        -      224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 128
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4X 64, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+        NTT_MREDUCE_4X 320, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 3. len = 32, start = 0, 64, 128, 192
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        32
+         *        64        -       96
+         *          128        -      160
+         *            192        -      224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 64
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 0, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 128, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 256, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+        lvx     V_ZETA, 0, 14
+        addi    14, 14, 16
+        NTT_MREDUCE_4X 384, 16, V_ZETA, V_ZETA, V_ZETA, V_ZETA
+
+.align 4
+        /*
+         * 4. len = 16, start = 0, 8, 128, 136
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        16
+         *        8        -       24
+         *          128        -     144
+         *            136        -    152
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 32
+        Load_next_4zetas
+        NTT_MREDUCE_4X 0, 64, V_Z0, V_Z1, V_Z2, V_Z3
+        NTT_MREDUCE_4X 16, 64, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4X 256, 64, V_Z0, V_Z1, V_Z2, V_Z3
+        NTT_MREDUCE_4X  272, 64, V_Z0, V_Z1, V_Z2, V_Z3
+
+.align 4
+        /*
+         * 5. len = 8, start = 0, 64, 128, 192
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        8
+         *       64        -       72
+         *         128        -      136
+         *            192        -     200
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 16
+        Load_next_4zetas
+        NTT_MREDUCE_4X 0, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4X 128, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4X 256, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        Load_next_4zetas
+        NTT_MREDUCE_4X 384, 32, V_Z0, V_Z1, V_Z2, V_Z3
+
+        /*
+         * 6. len = 4, start = 0, 8, 16, 24,...232, 240, 248
+         *    Load zeta vectors in 4-4 layout
+         *
+         *    Compute coefficients of the NTT based on the following sequences,
+         *      0, 1, 2, 3, 4, 5, 6, 7
+         *      8, 9, 10, 11, 12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243, 244, 245, 246, 247
+         *      248, 249, 250, 251, 252, 253, 254, 255
+         *
+         *     These are indexes to the 16 bits array.  Each loads 4 vectors.
+         */
+        mr      5, 3                 /* Let r5 points to coefficient array */
+        li      7, 8
+
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        li      15, 64
+        li      16, 80
+        li      17, 96
+        li      18, 112
+
+.align 4
+        NTT_REDUCE_L44
+        NTT_REDUCE_L44
+        NTT_REDUCE_L44
+        NTT_REDUCE_L44
+
+        /*
+         * 7. len = 2, start = 0, 4, 8, 12,...244, 248, 252
+         *    Load zeta vectors in 2-2-2-2 layout
+         *
+         *    Compute coefficients of the NTT based on the following sequences,
+         *      0, 1, 2, 3, 4, 5, 6, 7
+         *      8, 9, 10, 11, 12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243, 244, 245, 246, 247
+         *      248, 249, 250, 251, 252, 253, 254, 255
+         *
+         *     These are indexes to the 16 bits array.  Each loads 4 vectors.
+         */
+        mr      5, 3                  /* Let r5 points to coefficient array */
+        li      7, 4
+
+.align 4
+        NTT_REDUCE_L24
+        NTT_REDUCE_L24
+        NTT_REDUCE_L24
+        NTT_REDUCE_L24
+
+        RESTORE_REGS
+        blr
+.size     ntt_ppc,.-ntt_ppc
+
+.rodata
+.align 4
+mlkem_consts:
+/* -Q */
+.short  -3329, -3329, -3329, -3329, -3329, -3329, -3329, -3329
+/* QINV */
+.short  -3327, -3327, -3327, -3327, -3327, -3327, -3327, -3327
+
+/* zetas */
+mlkem_zetas:
+/* For ntt Len=128, offset 96 */
+.short  -758, -758, -758, -758, -758, -758, -758, -758, -359, -359, -359, -359
+.short  -359, -359, -359, -359, -1517, -1517, -1517, -1517, -1517, -1517, -1517
+.short  -1517, 1493, 1493, 1493, 1493, 1493, 1493, 1493, 1493, 1422, 1422, 1422
+.short  1422, 1422, 1422, 1422, 1422, 287, 287, 287, 287, 287, 287, 287, 287, 202
+.short  202, 202, 202, 202, 202, 202, 202, -171, -171, -171, -171, -171, -171, -171
+.short  -171, 622, 622, 622, 622, 622, 622, 622, 622, 1577, 1577, 1577, 1577, 1577
+.short  1577, 1577, 1577, 182, 182, 182, 182, 182, 182, 182, 182, 962, 962, 962
+.short  962, 962, 962, 962, 962, -1202, -1202, -1202, -1202, -1202, -1202, -1202
+.short  -1202, -1474, -1474, -1474, -1474, -1474, -1474, -1474, -1474, 1468, 1468
+.short  1468, 1468, 1468, 1468, 1468, 1468, 573, 573, 573, 573, 573, 573, 573, 573
+.short  -1325, -1325, -1325, -1325, -1325, -1325, -1325, -1325, 264, 264, 264, 264
+.short  264, 264, 264, 264, 383, 383, 383, 383, 383, 383, 383, 383, -829, -829
+.short  -829, -829, -829, -829, -829, -829, 1458, 1458, 1458, 1458, 1458, 1458
+.short  1458, 1458, -1602, -1602, -1602, -1602, -1602, -1602, -1602, -1602, -130
+.short  -130, -130, -130, -130, -130, -130, -130, -681, -681, -681, -681, -681
+.short  -681, -681, -681, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 732, 732
+.short  732, 732, 732, 732, 732, 732, 608, 608, 608, 608, 608, 608, 608, 608, -1542
+.short  -1542, -1542, -1542, -1542, -1542, -1542, -1542, 411, 411, 411, 411, 411
+.short  411, 411, 411, -205, -205, -205, -205, -205, -205, -205, -205, -1571, -1571
+.short  -1571, -1571, -1571, -1571, -1571, -1571
+/* For Len=4 */
+.short  1223, 1223, 1223, 1223, 652, 652, 652, 652, -552, -552, -552, -552, 1015
+.short  1015, 1015, 1015, -1293, -1293, -1293, -1293, 1491, 1491, 1491, 1491, -282
+.short  -282, -282, -282, -1544, -1544, -1544, -1544, 516, 516, 516, 516, -8, -8
+.short  -8, -8, -320, -320, -320, -320, -666, -666, -666, -666, -1618, -1618, -1618
+.short  -1618, -1162, -1162, -1162, -1162, 126, 126, 126, 126, 1469, 1469, 1469
+.short  1469, -853, -853, -853, -853, -90, -90, -90, -90, -271, -271, -271, -271
+.short  830, 830, 830, 830, 107, 107, 107, 107, -1421, -1421, -1421, -1421, -247
+.short  -247, -247, -247, -951, -951, -951, -951, -398, -398, -398, -398, 961, 961
+.short  961, 961, -1508, -1508, -1508, -1508, -725, -725, -725, -725, 448, 448, 448
+.short  448, -1065, -1065, -1065, -1065, 677, 677, 677, 677, -1275, -1275, -1275
+.short  -1275
+/*
+ * For ntt Len=2
+ * reorder zeta array, (1, 2, 3, 4) -> (3, 1, 4, 2)
+ * Transpose z[0], z[1], z[2], z[3]
+ *    -> z[3], z[3], z[1], z[1], z[4], z[4], z[2], z[2]
+ */
+.short  555, 555, -1103, -1103, 843, 843, 430, 430, 1550, 1550, -1251, -1251, 105
+.short  105, 871, 871, 177, 177, 422, 422, -235, -235, 587, 587, 1574, 1574, -291
+.short  -291, 1653, 1653, -460, -460, 1159, 1159, -246, -246, -147, -147, 778, 778
+.short  -602, -602, -777, -777, 1119, 1119, 1483, 1483, -872, -872, -1590, -1590
+.short  349, 349, 644, 644, -156, -156, 418, 418, -75, -75, 329, 329, 603, 603, 817
+.short  817, 610, 610, 1097, 1097, -1465, -1465, 1322, 1322, 384, 384, -1285, -1285
+.short  1218, 1218, -1215, -1215, -1335, -1335, -136, -136, -1187, -1187, -874
+.short  -874, -1659, -1659, 220, 220, -1278, -1278, -1185, -1185, 794, 794, -1530
+.short  -1530, -870, -870, -1510, -1510, 478, 478, -854, -854, 996, 996, -108, -108
+.short  991, 991, -308, -308, 1522, 1522, 958, 958, 1628, 1628, -1460, -1460
-- 
2.47.3


From dtsen at us.ibm.com  Tue Feb 24 01:27:52 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:52 -0600
Subject: [PATCH 4/5] kyber: Added optimized kyber inverse NTT support for
 ppc64le.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-5-dtsen@us.ibm.com>

Optimized kyber (ML-KEM) inverse NTT algorithm for ppc64le (Power 8
and above).

Signed-off-by: Danny Tsen <dtsen at us.ibm.com>
---
 cipher/kyber_intt_p8le.S | 878 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 878 insertions(+)
 create mode 100644 cipher/kyber_intt_p8le.S

diff --git a/cipher/kyber_intt_p8le.S b/cipher/kyber_intt_p8le.S
new file mode 100644
index 00000000..c46412aa
--- /dev/null
+++ b/cipher/kyber_intt_p8le.S
@@ -0,0 +1,878 @@
+/*
+ * This file was modified for use by Libgcrypt.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <https://www.gnu.org/licenses/>.
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * You can also use this file under the same licence of original code.
+ * SPDX-License-Identifier: CC0 OR Apache-2.0
+ *
+ */
+/*
+ * Copyright IBM Corp. 2025, 2026
+ *
+ * ===================================================================================
+ * Written by Danny Tsen <dtsen at us.ibm.com>
+ */
+
+.machine "any"
+.text
+
+#define QINV_OFFSET 16
+#define Q_OFFSET 32
+#define C20159_OFFSET 48
+#define C1441_OFFSET 64
+#define ZETA_INTT_OFFSET 80
+
+/* Barrett reduce constatnts */
+#define V20159  0
+#define V_25    1
+#define V_26    2
+#define V_MKQ   3
+
+/* Montgomery reduce constatnts */
+#define V_QINV  2
+#define V_NMKQ  5
+#define V_Z0    7
+#define V_Z1    8
+#define V_Z2    9
+#define V_Z3    10
+#define V_ZETA  10
+#define V1441   10
+
+.macro SAVE_REGS
+        stdu    1, -352(1)
+        mflr    0
+        std     14, 56(1)
+        std     15, 64(1)
+        std     16, 72(1)
+        std     17, 80(1)
+        std     18, 88(1)
+        std     19, 96(1)
+        std     20, 104(1)
+        std     21, 112(1)
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        stxvx   32+20, 10, 1
+        stxvx   32+21, 11, 1
+        stxvx   32+22, 12, 1
+        stxvx   32+23, 14, 1
+        stxvx   32+24, 15, 1
+        stxvx   32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        li      15, 288
+        li      16, 304
+        stxvx   32+26, 10, 1
+        stxvx   32+27, 11, 1
+        stxvx   32+28, 12, 1
+        stxvx   32+29, 14, 1
+        stxvx   32+30, 15, 1
+        stxvx   32+31, 16, 1
+.endm
+
+.macro RESTORE_REGS
+        li      10, 128
+        li      11, 144
+        li      12, 160
+        li      14, 176
+        li      15, 192
+        li      16, 208
+        lxvx    32+20, 10, 1
+        lxvx    32+21, 11, 1
+        lxvx    32+22, 12, 1
+        lxvx    32+23, 14, 1
+        lxvx    32+24, 15, 1
+        lxvx    32+25, 16, 1
+        li      10, 224
+        li      11, 240
+        li      12, 256
+        li      14, 272
+        li      15, 288
+        li      16, 304
+        lxvx    32+26, 10, 1
+        lxvx    32+27, 11, 1
+        lxvx    32+28, 12, 1
+        lxvx    32+29, 14, 1
+        lxvx    32+30, 15, 1
+        lxvx    32+31, 16, 1
+        ld      14, 56(1)
+        ld      15, 64(1)
+        ld      16, 72(1)
+        ld      17, 80(1)
+        ld      18, 88(1)
+        ld      19, 96(1)
+        ld      20, 104(1)
+        ld      21, 112(1)
+
+        mtlr    0
+        addi    1, 1, 352
+.endm
+
+/*
+ * Compute r[j] and r[j+len] from computed coefficients
+ *  r[j] + r[j+len] : V8, V12, V16, V20 (data for Barett reduce)
+ *  r[j+len] - r[j]: V25, V26, V30, V31 (data for Montgomery reduce)
+ */
+.macro Compute_4Coeffs
+        vsubuhm 25, 8, 21
+        vsubuhm 26, 12, 22
+        vsubuhm 30, 16, 23
+        vsubuhm 31, 20, 24
+        vadduhm 8, 8, 21
+        vadduhm 12, 12, 22
+        vadduhm 16, 16, 23
+        vadduhm 20, 20, 24
+.endm
+
+/*
+ * Init_Coeffs_offset: initial offset setup for the coefficient array.
+ *
+ * start: beginning of the offset to the coefficient array.
+ * next: Next offset.
+ * len: Index difference between coefficients.
+ *
+ * r7: len * 2, each coefficient component is 2 bytes.
+ *
+ * register used for offset to coefficients, r[j] and r[j+len]
+ * R9: offset to r0 = j
+ * R16: offset to r1 = r0 + next
+ * R18: offset to r2 = r1 + next
+ * R20: offset to r3 = r2 + next
+ *
+ * R10: offset to r'0 = r0 + len*2
+ * R17: offset to r'1 = r'0 + step
+ * R19: offset to r'2 = r'1 + step
+ * R21: offset to r'3 = r'2 + step
+ *
+ */
+.macro Init_Coeffs_offset start next
+        li      9, \start       /* first offset to j */
+        add     10, 7, 9        /* J + len*2 */
+        addi    16, 9, \next
+        addi    17, 10, \next
+        addi    18, 16, \next
+        addi    19, 17, \next
+        addi    20, 18, \next
+        addi    21, 19, \next
+.endm
+
+/*
+ * Load coefficient vectors for r[j] (r) and r[j+len] (r'):
+ *   Load coefficient in r' vectors from offset, R10, R17, R19 and R21
+ *   Load coefficient in r vectors from offset, R9, R16, R18 and R20
+ *
+ *  r[j+len]: V8, V12, V16, V20
+ *  r[j]: V21, V22, V23, V24
+ */
+.macro Load_4Rjp
+        lxvd2x  32+8, 3, 10     /* V8: vector r'0 */
+        lxvd2x  32+12, 3, 17    /* V12: vector for r'1 */
+        lxvd2x  32+16, 3, 19    /* V16: vector for r'2 */
+        lxvd2x  32+20, 3, 21    /* V20: vector for r'3 */
+
+        lxvd2x  32+21, 3, 9     /* V21: vector r0 */
+        lxvd2x  32+22, 3, 16    /* V22: vector r1 */
+        lxvd2x  32+23, 3, 18    /* V23: vector r2 */
+        lxvd2x  32+24, 3, 20    /* V24: vector r3 */
+.endm
+
+/*
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *  rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7
+ */
+.macro Load_4Coeffs start next
+        Init_Coeffs_offset \start \next
+        Load_4Rjp
+        Compute_4Coeffs
+.endm
+
+/*
+ * Load 2 - 2 - 2 - 2 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *    rj0, rj1, rjlen2, rjlen3, rj4, rj5, rjlen6, arlen7
+ *    rj8, rj9, rjlen10, rjlen11, rj12, rj13, rjlen14, rjlen15
+ *  Each vmrgew and vmrgow will transpose vectors as,
+ *  r[j]=      rj0, rj1, rj8, rj9, rj4, rj5, rj12, rj13
+ *  r[j+len]=  rjlen2, rjlen3, rjlen10, rjlen11, rjlen6, arlen7, rjlen14, rjlen15
+ *
+ *  r[j+len]: V8, V12, V16, V20
+ *  r[j]: V21, V22, V23, V24
+ *
+ * In order to do the coefficient computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_L24Coeffs
+        lxvd2x     32+25, 0, 5
+        lxvd2x     32+26, 10, 5
+        vmrgew 8, 25, 26
+        vmrgow 21, 25, 26
+        lxvd2x     32+25, 11, 5
+        lxvd2x     32+26, 12, 5
+        vmrgew 12, 25, 26
+        vmrgow 22, 25, 26
+        lxvd2x     32+25, 15, 5
+        lxvd2x     32+26, 16, 5
+        vmrgew 16, 25, 26
+        vmrgow 23, 25, 26
+        lxvd2x     32+25, 17, 5
+        lxvd2x     32+26, 18, 5
+        vmrgew 20, 25, 26
+        vmrgow 24, 25, 26
+.endm
+
+/*
+ * Load 4 - 4 layout
+ *
+ * Load Coefficients and setup vectors for 8 coefficients in the
+ * following order,
+ *  rj0, rj1, rj2, rj3, rjlen4, rjlen5, rjlen6, rjlen7
+ *  rj8, rj9, rj10, rj11, rjlen12, rjlen13, rjlen14, rjlen15
+ *
+ *  Each xxpermdi will transpose vectors as,
+ *  rjlen4, rjlen5, rjlen6, rjlen7, rjlen12, rjlen13, rjlen14, rjlen15
+ *  rj0, rj1, rj2, rj3, rj8, rj9, rj10, rj11
+ *
+ * In order to do the coefficients computation, zeta vector will arrange
+ * in the proper order to match the multiplication.
+ */
+.macro Load_L44Coeffs
+        lxvd2x     10, 0, 5
+        lxvd2x     11, 10, 5
+        xxpermdi 32+8, 11, 10, 3
+        xxpermdi 32+21, 11, 10, 0
+        lxvd2x     10, 11, 5
+        lxvd2x     11, 12, 5
+        xxpermdi 32+12, 11, 10, 3
+        xxpermdi 32+22, 11, 10, 0
+        lxvd2x     10, 15, 5
+        lxvd2x     11, 16, 5
+        xxpermdi 32+16, 11, 10, 3
+        xxpermdi 32+23, 11, 10, 0
+        lxvd2x     10, 17, 5
+        lxvd2x     11, 18, 5
+        xxpermdi 32+20, 11, 10, 3
+        xxpermdi 32+24, 11, 10, 0
+.endm
+
+.macro BREDUCE_4X _v0 _v1 _v2 _v3
+        /* Restore constant vectors
+           V_MKQ, V_25 and V_26 */
+        vxor    7, 7, 7
+        xxlor   32+3, 6, 6
+        xxlor   32+1, 7, 7
+        xxlor   32+2, 8, 8
+        /* Multify Odd/Even signed halfword;
+           Results word bound by 2^32 in abs value. */
+        vmulosh 6, 8, V20159
+        vmulesh 5, 8, V20159
+        vmulosh 11, 12, V20159
+        vmulesh 10, 12, V20159
+        vmulosh 15, 16, V20159
+        vmulesh 14, 16, V20159
+        vmulosh 19, 20, V20159
+        vmulesh 18, 20, V20159
+        xxmrglw 32+4, 32+5, 32+6
+        xxmrghw 32+5, 32+5, 32+6
+        xxmrglw 32+9, 32+10, 32+11
+        xxmrghw 32+10, 32+10, 32+11
+        xxmrglw 32+13, 32+14, 32+15
+        xxmrghw 32+14, 32+14, 32+15
+        xxmrglw 32+17, 32+18, 32+19
+        xxmrghw 32+18, 32+18, 32+19
+        vadduwm 4, 4, V_25
+        vadduwm 5, 5, V_25
+        vadduwm 9, 9, V_25
+        vadduwm 10, 10, V_25
+        vadduwm 13, 13, V_25
+        vadduwm 14, 14, V_25
+        vadduwm 17, 17, V_25
+        vadduwm 18, 18, V_25
+        /* Right shift and pack lower halfword,
+           results bond to 2^16 in abs value */
+        vsraw   4, 4, V_26
+        vsraw   5, 5, V_26
+        vsraw   9, 9, V_26
+        vsraw   10, 10, V_26
+        vsraw   13, 13, V_26
+        vsraw   14, 14, V_26
+        vsraw   17, 17, V_26
+        vsraw   18, 18, V_26
+        vpkuwum 4, 5, 4
+        vsubuhm 4, 7, 4
+        vpkuwum 9, 10, 9
+        vsubuhm 9, 7, 9
+        vpkuwum 13, 14, 13
+        vsubuhm 13, 7, 13
+        vpkuwum 17, 18, 17
+        vsubuhm 17, 7, 17
+        /* Modulo multify-Low unsigned halfword;
+           results bond to 2^16 * q in abs value. */
+        vmladduhm \_v0, 4, V_MKQ, 8
+        vmladduhm \_v1, 9, V_MKQ, 12
+        vmladduhm \_v2, 13, V_MKQ, 16
+        vmladduhm \_v3, 17, V_MKQ, 20
+.endm
+
+/*
+ * -----------------------------------
+ * MREDUCE_4X(_vz0, _vz1, _vz2, _vz3, _vo0, _vo1, _vo2, _vo3)
+ */
+.macro MREDUCE_4X _vz0 _vz1 _vz2 _vz3 _vo0 _vo1 _vo2 _vo3
+        /* Modular multification bond by 2^16 * q in abs value */
+        vmladduhm 15, 25, \_vz0, 3
+        vmladduhm 20, 26, \_vz1, 3
+        vmladduhm 27, 30, \_vz2, 3
+        vmladduhm 28, 31, \_vz3, 3
+
+        /* Signed multiply-high-round; outputs are bound by 2^15 * q in abs value */
+        vmhraddshs 14, 25, \_vz0, 3
+        vmhraddshs 19, 26, \_vz1, 3
+        vmhraddshs 24, 30, \_vz2, 3
+        vmhraddshs 29, 31, \_vz3, 3
+
+        vmladduhm 15, 15, V_QINV, 3
+        vmladduhm 20, 20, V_QINV, 3
+        vmladduhm 25, 27, V_QINV, 3
+        vmladduhm 30, 28, V_QINV, 3
+
+        vmhraddshs 15, 15, V_NMKQ, 14
+        vmhraddshs 20, 20, V_NMKQ, 19
+        vmhraddshs 25, 25, V_NMKQ, 24
+        vmhraddshs 30, 30, V_NMKQ, 29
+
+        /* Shift right 1 bit */
+        vsrah \_vo0, 15, 4
+        vsrah \_vo1, 20, 4
+        vsrah \_vo2, 25, 4
+        vsrah \_vo3, 30, 4
+.endm
+
+/*
+ * setup constant vectors for Montgmery multiplication
+ * V_NMKQ, V_QINV, Zero vector, One vector
+ */
+.macro Set_mont_consts
+        xxlor   32+5, 0, 0    /* V_NMKQ */
+        xxlor   32+2, 2, 2    /* V_QINV */
+        xxlor   32+3, 3, 3    /* all 0 */
+        xxlor   32+4, 4, 4    /* all 1 */
+.endm
+
+.macro Load_next_4zetas
+        li      8, 16
+        li      11, 32
+        li      12, 48
+        lxvd2x    32+V_Z0, 0, 14
+        lxvd2x    32+V_Z1, 8, 14
+        lxvd2x    32+V_Z2, 11, 14
+        lxvd2x    32+V_Z3, 12, 14
+        addi    14, 14, 64
+.endm
+
+/*
+ * Re-ordering of the 4-4 layout zetas.
+ * Swap double-words.
+ */
+.macro Perm_4zetas
+        xxpermdi 32+V_Z0, 32+V_Z0, 32+V_Z0, 2
+        xxpermdi 32+V_Z1, 32+V_Z1, 32+V_Z1, 2
+        xxpermdi 32+V_Z2, 32+V_Z2, 32+V_Z2, 2
+        xxpermdi 32+V_Z3, 32+V_Z3, 32+V_Z3, 2
+.endm
+
+.macro Write_B4C _vs0 _vs1 _vs2 _vs3
+        stxvd2x \_vs0, 3, 9
+        stxvd2x \_vs1, 3, 16
+        stxvd2x \_vs2, 3, 18
+        stxvd2x \_vs3, 3, 20
+.endm
+
+.macro Write_M4C _vs0 _vs1 _vs2 _vs3
+        stxvd2x \_vs0, 3, 10
+        stxvd2x \_vs1, 3, 17
+        stxvd2x \_vs2, 3, 19
+        stxvd2x \_vs3, 3, 21
+.endm
+
+.macro Reload_4coeffs
+        lxvd2x  32+25, 0, 3
+        lxvd2x  32+26, 10, 3
+        lxvd2x  32+30, 11, 3
+        lxvd2x  32+31, 12, 3
+        addi    3, 3, 64
+.endm
+
+.macro MWrite_8X _vs0 _vs1 _vs2 _vs3 _vs4 _vs5 _vs6 _vs7
+        addi    3, 3, -128
+        stxvd2x \_vs0, 0, 3
+        stxvd2x \_vs1, 10, 3
+        stxvd2x \_vs2, 11, 3
+        stxvd2x \_vs3, 12, 3
+        stxvd2x \_vs4, 15, 3
+        stxvd2x \_vs5, 16, 3
+        stxvd2x \_vs6, 17, 3
+        stxvd2x \_vs7, 18, 3
+        addi    3, 3, 128
+.endm
+
+/*
+ * Transpose the final coefficients of 4-4 layout to the orginal
+ * coefficient array order.
+ */
+.macro PermWriteL44
+        xxlor   32+14, 10, 10
+        xxlor   32+19, 11, 11
+        xxlor   32+24, 12, 12
+        xxlor   32+29, 13, 13
+        xxpermdi 32+10, 32+14, 32+13, 3
+        xxpermdi 32+11, 32+14, 32+13, 0
+        xxpermdi 32+12, 32+19, 32+18, 3
+        xxpermdi 32+13, 32+19, 32+18, 0
+        xxpermdi 32+14, 32+24, 32+23, 3
+        xxpermdi 32+15, 32+24, 32+23, 0
+        xxpermdi 32+16, 32+29, 32+28, 3
+        xxpermdi 32+17, 32+29, 32+28, 0
+        stxvd2x    32+10, 0, 5
+        stxvd2x    32+11, 10, 5
+        stxvd2x    32+12, 11, 5
+        stxvd2x    32+13, 12, 5
+        stxvd2x    32+14, 15, 5
+        stxvd2x    32+15, 16, 5
+        stxvd2x    32+16, 17, 5
+        stxvd2x    32+17, 18, 5
+.endm
+
+/*
+ * Transpose the final coefficients of 2-2-2-2 layout to the orginal
+ * coefficient array order.
+ */
+.macro PermWriteL24
+        xxlor   32+14, 10, 10
+        xxlor   32+19, 11, 11
+        xxlor   32+24, 12, 12
+        xxlor   32+29, 13, 13
+        vmrgew 10, 13, 14
+        vmrgow 11, 13, 14
+        vmrgew 12, 18, 19
+        vmrgow 13, 18, 19
+        vmrgew 14, 23, 24
+        vmrgow 15, 23, 24
+        vmrgew 16, 28, 29
+        vmrgow 17, 28, 29
+        stxvd2x    32+10, 0, 5
+        stxvd2x    32+11, 10, 5
+        stxvd2x    32+12, 11, 5
+        stxvd2x    32+13, 12, 5
+        stxvd2x    32+14, 15, 5
+        stxvd2x    32+15, 16, 5
+        stxvd2x    32+16, 17, 5
+        stxvd2x    32+17, 18, 5
+.endm
+
+/*
+ * INTT layer Len=2.
+ */
+.macro INTT_REDUCE_L24
+        Load_L24Coeffs
+        Compute_4Coeffs
+        BREDUCE_4X 4, 9, 13, 17
+        xxlor   10, 32+4, 32+4
+        xxlor   11, 32+9, 32+9
+        xxlor   12, 32+13, 32+13
+        xxlor   13, 32+17, 32+17
+        Set_mont_consts
+        Load_next_4zetas
+        MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28
+        PermWriteL24
+.endm
+
+/*
+ * INTT layer Len=4.
+ */
+.macro INTT_REDUCE_L44
+        Load_L44Coeffs
+        Compute_4Coeffs
+        BREDUCE_4X 4, 9, 13, 17
+        xxlor   10, 32+4, 32+4
+        xxlor   11, 32+9, 32+9
+        xxlor   12, 32+13, 32+13
+        xxlor   13, 32+17, 32+17
+        Set_mont_consts
+        Load_next_4zetas
+        Perm_4zetas
+        MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28
+        PermWriteL44
+.endm
+
+/*
+ * INTT layer Len=8 and 16.
+ */
+.macro INTT_REDUCE_4X start next
+        Load_4Coeffs \start, \next
+        BREDUCE_4X 4, 9, 13, 17
+        Write_B4C 32+4, 32+9, 32+13, 32+17
+        Set_mont_consts
+        Load_next_4zetas
+        MREDUCE_4X V_Z0, V_Z1, V_Z2, V_Z3, 13, 18, 23, 28
+        Write_M4C 32+13, 32+18, 32+23, 32+28
+.endm
+
+/*
+ * INTT layer Len=32, 64 and 128.
+ */
+.macro INTT_REDUCE_L567 start next
+        Load_4Coeffs \start, \next
+        BREDUCE_4X 4, 9, 13, 17
+        Write_B4C 32+4, 32+9, 32+13, 32+17
+        Set_mont_consts
+        lvx     V_ZETA, 0, 14
+        MREDUCE_4X V_ZETA, V_ZETA, V_ZETA, V_ZETA, 13, 18, 23, 28
+        Write_M4C 32+13, 32+18, 32+23, 32+28
+.endm
+
+/*
+ * intt_ppc(int16_t *r)
+ *   Compute inverse NTT based on the following 7 layers -
+ *     len = 2, 4, 8, 16, 32, 64, 128
+ *
+ *   Each layer compute the coeffients on 2 legs, start and start + len*2 offsets.
+ *
+ *   leg 1                        leg 2
+ *   -----                        -----
+ *   start                        start+len*2
+ *   start+next                   start+len*2+next
+ *   start+next+next              start+len*2+next+next
+ *   start+next+next+next         start+len*2+next+next+next
+ *
+ *   Each computation loads 8 vectors, 4 for each leg.
+ *   The final coefficient (t) from each vector of leg1 and leg2 then do the
+ *   add/sub operations to obtain the final results.
+ *
+ *   -> leg1 = leg1 + t, leg2 = leg1 - t
+ *
+ *   The resulting coeffients then store back to each leg's offset.
+ *
+ *   Each vector has the same corresponding zeta except len=4 and len=2.
+ *
+ *   len=4 has 4-4 layout which means every 4 16-bit coeffients has the same zeta.
+ *   and len=2 has 2-2-2-2 layout which means every 2 16-bit coeffients has the same zeta.
+ *   e.g.
+ *         coeff vector    a1   a2   a3  a4  a5  a6  a7  a8
+ *         zeta  vector    z1   z1   z2  z2  z3  z3  z4  z4
+ *
+ *   For len=4 and len=2, each vector will get permuted to leg1 and leg2. Zeta is
+ *   pre-arranged for the leg1 and leg2.  After the computation, each vector needs
+ *   to transpose back to its original 4-4 or 2-2-2-2 layout.
+ */
+.global intt_ppc
+.align 4
+intt_ppc:
+.localentry     intt_ppc,.-intt_ppc
+
+        SAVE_REGS
+
+        /* init vectors and constants
+           Setup for Montgomery reduce */
+        addis   8,2,mlkem_consts at toc@ha
+        addi    8,8,mlkem_consts at toc@l
+        lxvx    0, 0, 8         /* V_NMKQ */
+
+        li      10, QINV_OFFSET
+        lxvx    32+V_QINV, 10, 8
+        xxlxor  32+3, 32+3, 32+3
+        vspltish 4, 1
+        xxlor   2, 32+2, 32+2        /* QINV */
+        xxlor   3, 32+3, 32+3        /* 0 vector */
+        xxlor   4, 32+4, 32+4        /* 1 vector */
+
+        /*  Setup for Barrett reduce */
+        li      10, Q_OFFSET
+        li      11, C20159_OFFSET
+        lxvx    6, 10, 8             /* V_MKQ */
+        lxvx    32+V20159, 11, 8     /* V20159 */
+
+        vspltisw 8, 13
+        vadduwm  8, 8, 8
+        xxlor   8, 32+8, 32+8   /* V_26 store at vs8 */
+
+        vspltisw 9, 1
+        vsubuwm 10, 8, 9        /* value 25 */
+        vslw    9, 9, 10
+        xxlor   7, 32+9, 32+9   /* V_25 syore at vs7 */
+
+        li      10, 16
+        li      11, 32
+        li      12, 48
+        li      15, 64
+        li      16, 80
+        li      17, 96
+        li      18, 112
+
+        /*
+         * Montgomery reduce loops with constant 1441
+         */
+        addi    14, 8, C1441_OFFSET
+        lvx     V1441, 0, 14
+        li      7, 4
+        mtctr   7
+
+        Set_mont_consts
+intt_ppc__Loopf:
+        Reload_4coeffs
+        MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
+        Reload_4coeffs
+        MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
+        MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28
+        bdnz    intt_ppc__Loopf
+
+        addi    3, 3, -512
+
+.align 4
+        /*
+         * 1. len = 2, start = 0, 4, 8, 12,...244, 248, 252
+         *    Update zetas vectors, each vector has 2 zetas
+         *    Load zeta vectors in 2-2-2-2 layout
+         *
+         *    Compute coefficients of the NTT based on the following sequences,
+         *      0, 1, 2, 3, 4, 5, 6, 7
+         *      8, 9, 10, 11, 12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243, 244, 245, 246, 247
+         *      248, 249, 250, 251, 252, 253, 254, 255
+         *
+         *     These are indexes to the 16 bits array.  Each loads 4 vectors.
+         */
+        addi    14, 8, ZETA_INTT_OFFSET
+        li      7, 4        /* len * 2 */
+        mr      5, 3
+
+        INTT_REDUCE_L24
+        addi    5, 5, 128
+        INTT_REDUCE_L24
+        addi    5, 5, 128
+        INTT_REDUCE_L24
+        addi    5, 5, 128
+        INTT_REDUCE_L24
+        addi    5, 5, 128
+
+.align 4
+        /*
+         * 2. len = 4, start = 0, 8, 16, 24,...232, 240, 248
+         *    Load zeta vectors in 4-4 layout
+         *
+         *    Compute coefficients of the NTT based on the following sequences,
+         *      0, 1, 2, 3, 4, 5, 6, 7
+         *      8, 9, 10, 11, 12, 13, 14, 15
+         *            ...
+         *      240, 241, 242, 243, 244, 245, 246, 247
+         *      248, 249, 250, 251, 252, 253, 254, 255
+         *
+         *     These are indexes to the 16 bits array.  Each loads 4 vectors.
+         */
+        mr      5, 3
+        li      7, 8
+
+        INTT_REDUCE_L44
+        addi    5, 5, 128
+        INTT_REDUCE_L44
+        addi    5, 5, 128
+        INTT_REDUCE_L44
+        addi    5, 5, 128
+        INTT_REDUCE_L44
+        addi    5, 5, 128
+
+.align 4
+        /*
+         * 3. len = 8, start = 0, 16, 32, 48,...208, 224, 240
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        8
+         *       64        -       72
+         *         128        -      136
+         *            192        -     200
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 16
+
+        INTT_REDUCE_4X 0, 32
+        INTT_REDUCE_4X 128, 32
+        INTT_REDUCE_4X 256, 32
+        INTT_REDUCE_4X 384, 32
+
+.align 4
+        /*
+         * 4. len = 16, start = 0, 32, 64,,...160, 192, 224
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        16
+         *        8        -       24
+         *          128        -     144
+         *            136        -    152
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 32
+
+        INTT_REDUCE_4X 0, 64
+
+        addi    14, 14, -64
+        INTT_REDUCE_4X 16, 64
+
+        INTT_REDUCE_4X 256, 64
+
+        addi    14, 14, -64
+        INTT_REDUCE_4X 272, 64
+
+.align 4
+        /*
+         * 5. len = 32, start = 0, 64, 128, 192
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        32
+         *        64        -       96
+         *          128        -      160
+         *            192        -      224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 64
+
+        INTT_REDUCE_L567 0, 16
+        addi    14, 14, 16
+        INTT_REDUCE_L567 128, 16
+        addi    14, 14, 16
+        INTT_REDUCE_L567 256, 16
+        addi    14, 14, 16
+        INTT_REDUCE_L567 384, 16
+        addi    14, 14, 16
+
+.align 4
+        /*
+         * 6. len = 64, start = 0, 128
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        64
+         *        32        -       96
+         *          128        -      192
+         *            160        -      224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 128
+
+        INTT_REDUCE_L567 0, 16
+        INTT_REDUCE_L567 64, 16
+        addi    14, 14, 16
+        INTT_REDUCE_L567 256, 16
+        INTT_REDUCE_L567 320, 16
+        addi    14, 14, 16
+
+.align 4
+        /*
+         * 7. len = 128, start = 0
+         *
+         *    Compute coefficients of the NTT based on 2 legs,
+         *      0        -        128
+         *        32        -        160
+         *          64        -        192
+         *            96        -        224
+         *
+         *     These are indexes to the 16 bits array
+         */
+        li      7, 256          /* len*2 */
+
+        INTT_REDUCE_L567 0, 16
+        INTT_REDUCE_L567 64, 16
+        INTT_REDUCE_L567 128, 16
+        INTT_REDUCE_L567 192, 16
+
+        RESTORE_REGS
+        blr
+.size     intt_ppc,.-intt_ppc
+
+.rodata
+.align 4
+mlkem_consts:
+/* -Q */
+.short  -3329, -3329, -3329, -3329, -3329, -3329, -3329, -3329
+/* QINV */
+.short  -3327, -3327, -3327, -3327, -3327, -3327, -3327, -3327
+/* Q */
+.short  3329, 3329, 3329, 3329, 3329, 3329, 3329, 3329
+/* const 20159 for reduce.S and intt */
+.short  20159, 20159, 20159, 20159, 20159, 20159, 20159, 20159
+/* const 1441 for intt */
+.short  1441, 1441, 1441, 1441, 1441, 1441, 1441, 1441
+
+mlkem_zetas:
+/*
+ * For intt Len=2, offset IZETA_NTT_OFFSET127
+ * reorder zeta array, (1, 2, 3, 4) -> (3, 1, 4, 2)
+ * Transpose z[0], z[1], z[2], z[3]
+ *    -> z[3], z[3], z[1], z[1], z[4], z[4], z[2], z[2]
+ */
+.short  -1460, -1460, 1628, 1628, 958, 958, 1522, 1522, -308, -308, 991, 991, -108
+.short  -108, 996, 996, -854, -854, 478, 478, -1510, -1510, -870, -870, -1530
+.short  -1530, 794, 794, -1185, -1185, -1278, -1278, 220, 220, -1659, -1659, -874
+.short  -874, -1187, -1187, -136, -136, -1335, -1335, -1215, -1215, 1218, 1218
+.short  -1285, -1285, 384, 384, 1322, 1322, -1465, -1465, 1097, 1097, 610, 610, 817
+.short  817, 603, 603, 329, 329, -75, -75, 418, 418, -156, -156, 644, 644, 349, 349
+.short  -1590, -1590, -872, -872, 1483, 1483, 1119, 1119, -777, -777, -602, -602
+.short  778, 778, -147, -147, -246, -246, 1159, 1159, -460, -460, 1653, 1653, -291
+.short  -291, 1574, 1574, 587, 587, -235, -235, 422, 422, 177, 177, 871, 871, 105
+.short  105, -1251, -1251, 1550, 1550, 430, 430, 843, 843, -1103, -1103, 555, 555
+/* For intt Len=4 */
+.short  -1275, -1275, -1275, -1275, 677, 677, 677, 677, -1065, -1065, -1065, -1065
+.short  448, 448, 448, 448, -725, -725, -725, -725, -1508, -1508, -1508, -1508, 961
+.short  961, 961, 961, -398, -398, -398, -398, -951, -951, -951, -951, -247, -247
+.short  -247, -247, -1421, -1421, -1421, -1421, 107, 107, 107, 107, 830, 830, 830
+.short  830, -271, -271, -271, -271, -90, -90, -90, -90, -853, -853, -853, -853
+.short  1469, 1469, 1469, 1469, 126, 126, 126, 126, -1162, -1162, -1162, -1162
+.short  -1618, -1618, -1618, -1618, -666, -666, -666, -666, -320, -320, -320, -320
+.short  -8, -8, -8, -8, 516, 516, 516, 516, -1544, -1544, -1544, -1544, -282, -282
+.short  -282, -282, 1491, 1491, 1491, 1491, -1293, -1293, -1293, -1293, 1015, 1015
+.short  1015, 1015, -552, -552, -552, -552, 652, 652, 652, 652, 1223, 1223, 1223
+.short  1223
+/* For intt Len=8 and others */
+.short  -1571, -1571, -1571, -1571, -1571, -1571, -1571, -1571, -205, -205, -205
+.short  -205, -205, -205, -205, -205, 411, 411, 411, 411, 411, 411, 411, 411, -1542
+.short  -1542, -1542, -1542, -1542, -1542, -1542, -1542, 608, 608, 608, 608, 608
+.short  608, 608, 608, 732, 732, 732, 732, 732, 732, 732, 732, 1017, 1017, 1017
+.short  1017, 1017, 1017, 1017, 1017, -681, -681, -681, -681, -681, -681, -681
+.short  -681, -130, -130, -130, -130, -130, -130, -130, -130, -1602, -1602, -1602
+.short  -1602, -1602, -1602, -1602, -1602, 1458, 1458, 1458, 1458, 1458, 1458, 1458
+.short  1458, -829, -829, -829, -829, -829, -829, -829, -829, 383, 383, 383, 383
+.short  383, 383, 383, 383, 264, 264, 264, 264, 264, 264, 264, 264, -1325, -1325
+.short  -1325, -1325, -1325, -1325, -1325, -1325, 573, 573, 573, 573, 573, 573, 573
+.short  573, 1468, 1468, 1468, 1468, 1468, 1468, 1468, 1468, -1474, -1474, -1474
+.short  -1474, -1474, -1474, -1474, -1474, -1202, -1202, -1202, -1202, -1202, -1202
+.short  -1202, -1202, 962, 962, 962, 962, 962, 962, 962, 962, 182, 182, 182, 182
+.short  182, 182, 182, 182, 1577, 1577, 1577, 1577, 1577, 1577, 1577, 1577, 622
+.short  622, 622, 622, 622, 622, 622, 622, -171, -171, -171, -171, -171, -171, -171
+.short  -171, 202, 202, 202, 202, 202, 202, 202, 202, 287, 287, 287, 287, 287, 287
+.short  287, 287, 1422, 1422, 1422, 1422, 1422, 1422, 1422, 1422, 1493, 1493, 1493
+.short  1493, 1493, 1493, 1493, 1493, -1517, -1517, -1517, -1517, -1517, -1517
+.short  -1517, -1517, -359, -359, -359, -359, -359, -359, -359, -359, -758, -758
+.short  -758, -758, -758, -758, -758, -758
-- 
2.47.3


From dtsen at us.ibm.com  Tue Feb 24 01:27:53 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Mon, 23 Feb 2026 18:27:53 -0600
Subject: [PATCH 5/5] dilithium-kyber: Added ppc64le dilithium and kyber (i)NTT
 support.
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <20260224002753.151873-6-dtsen@us.ibm.com>

Updated the following files to ENABLE_PPC_DILITHIUM and
ENABLE_PPC_KYBER, dilithium-common.c, kyber-common.c and
configure.ac

Signed-off-by: Danny Tsen <dtsen at us.ibm.com>
---
 cipher/dilithium-common.c | 13 +++++++++++++
 cipher/kyber-common.c     | 13 +++++++++++++
 configure.ac              | 20 ++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/cipher/dilithium-common.c b/cipher/dilithium-common.c
index d16f22f7..0f3d2d96 100644
--- a/cipher/dilithium-common.c
+++ b/cipher/dilithium-common.c
@@ -50,6 +50,18 @@ static void invntt_tomont(int32_t a[N]);
 
 /*************** dilithium/ref/ntt.c */
 
+#ifdef ENABLE_PPC_DILITHIUM
+extern void mldsa_ntt_ppc(int32_t a[N]);
+extern void mldsa_intt_ppc(int32_t a[N]);
+
+void ntt(int32_t a[N]) {
+  mldsa_ntt_ppc(a);
+}
+
+void invntt_tomont(int32_t a[N]) {
+  mldsa_intt_ppc(a);
+}
+#else
 static const int32_t zetas[N] = {
          0,    25847, -2608894,  -518909,   237124,  -777960,  -876248,   466468,
    1826347,  2353451,  -359251, -2091905,  3119733, -2884855,  3111497,  2680103,
@@ -143,6 +155,7 @@ void invntt_tomont(int32_t a[N]) {
     a[j] = montgomery_reduce((int64_t)f * a[j]);
   }
 }
+#endif
 /*************** dilithium/ref/rounding.h */
 #if !defined(DILITHIUM_MODE) || DILITHIUM_MODE == 2
 static int32_t decompose_88(int32_t *a0, int32_t a);
diff --git a/cipher/kyber-common.c b/cipher/kyber-common.c
index 54377788..278d0b0b 100644
--- a/cipher/kyber-common.c
+++ b/cipher/kyber-common.c
@@ -273,6 +273,18 @@ static int16_t fqmul(int16_t a, int16_t b) {
   return montgomery_reduce((int32_t)a*b);
 }
 
+#ifdef ENABLE_PPC_KYBER
+extern void ntt_ppc(int16_t r[256]);
+extern void intt_ppc(int16_t r[256]);
+
+void ntt(int16_t r[256]) {
+    ntt_ppc(r);
+}
+
+void invntt(int16_t r[256]) {
+    intt_ppc(r);
+}
+#else
 /*************************************************
 * Name:        ntt
 *
@@ -328,6 +340,7 @@ void invntt(int16_t r[256]) {
   for(j = 0; j < 256; j++)
     r[j] = fqmul(r[j], f);
 }
+#endif
 
 /*************************************************
 * Name:        basemul
diff --git a/configure.ac b/configure.ac
index 00572b45..49a094fe 100644
--- a/configure.ac
+++ b/configure.ac
@@ -3828,6 +3828,16 @@ if test "$found" = "1" ; then
    GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS \
                           kyber.lo"
    AC_DEFINE(USE_KYBER, 1, [Defined if this module should be included])
+
+   case "${host}" in
+      powerpc64le-*-*)
+         if test "$gcry_cv_gcc_inline_asm_ppc_altivec" = "yes" ; then
+             AC_DEFINE(ENABLE_PPC_KYBER, 1, [Enable support for PPC optimized kyber.])
+             GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS kyber_ntt_p8le.lo"
+             GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS kyber_intt_p8le.lo"
+         fi
+      ;;
+   esac
 fi
 
 LIST_MEMBER(dilithium, $enabled_pubkey_ciphers)
@@ -3836,6 +3846,16 @@ if test "$found" = "1" ; then
    GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS \
                           dilithium.lo pubkey-dilithium.lo"
    AC_DEFINE(USE_DILITHIUM, 1, [Defined if this module should be included])
+
+   case "${host}" in
+      powerpc64le-*-*)
+         if test "$gcry_cv_gcc_inline_asm_ppc_altivec" = "yes" ; then
+             AC_DEFINE(ENABLE_PPC_DILITHIUM, 1, [Enable support for PPC optimized dilithium.])
+             GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dilithium_ntt_p8le.lo"
+             GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dilithium_intt_p8le.lo"
+         fi
+      ;;
+   esac
 fi
 
 LIST_MEMBER(crc, $enabled_digests)
-- 
2.47.3


From wk at gnupg.org  Tue Feb 24 10:30:38 2026
From: wk at gnupg.org (Werner Koch)
Date: Tue, 24 Feb 2026 10:30:38 +0100
Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for
In-Reply-To: <20260224002753.151873-1-dtsen@us.ibm.com> (Danny Tsen via
 Gcrypt-devel's message of "Mon, 23 Feb 2026 18:27:48 -0600")
References: <20260224002753.151873-1-dtsen@us.ibm.com>
Message-ID: <87bjhetrnl.fsf@jacob.g10code.de>

Hi!

Thanks for working on this.  Do you have a benchmark?


Shalom-Salam,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260224/b6915bb6/attachment.sig>

From dtsen at us.ibm.com  Thu Feb 26 11:23:17 2026
From: dtsen at us.ibm.com (Danny Tsen)
Date: Thu, 26 Feb 2026 10:23:17 +0000
Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for
In-Reply-To: <87bjhetrnl.fsf@jacob.g10code.de>
References: <20260224002753.151873-1-dtsen@us.ibm.com>
 <87bjhetrnl.fsf@jacob.g10code.de>
Message-ID: <PH7PR15MB542613EC90F48C3EEE036ED99372A@PH7PR15MB5426.namprd15.prod.outlook.com>

Hi Werner,

I don't have benchmark for libgcrypt.  I do have my own testing performance number on NTT operation. That probably not what you are looking for.

Thanks.
-Danny


________________________________
From: Werner Koch
Sent: Tuesday, February 24, 2026 5:30 PM
To: Danny Tsen via Gcrypt-devel
Cc: Danny Tsen
Subject: [EXTERNAL] Re: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for

Hi!

Thanks for working on this.  Do you have a benchmark?


Shalom-Salam,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260226/d0436431/attachment.html>

From wk at gnupg.org  Thu Feb 26 14:47:06 2026
From: wk at gnupg.org (Werner Koch)
Date: Thu, 26 Feb 2026 14:47:06 +0100
Subject: [PATCH 0/5] dilithium-kyber: Optimized (i)NTT support for
In-Reply-To: <PH7PR15MB542613EC90F48C3EEE036ED99372A@PH7PR15MB5426.namprd15.prod.outlook.com>
 (Danny Tsen via Gcrypt-devel's message of "Thu, 26 Feb 2026 10:23:17
 +0000")
References: <20260224002753.151873-1-dtsen@us.ibm.com>
 <87bjhetrnl.fsf@jacob.g10code.de>
 <PH7PR15MB542613EC90F48C3EEE036ED99372A@PH7PR15MB5426.namprd15.prod.outlook.com>
Message-ID: <87h5r3r50l.fsf@jacob.g10code.de>

On Thu, 26 Feb 2026 10:23, Danny Tsen said:

> I don't have benchmark for libgcrypt.  I do have my own testing
> performance number on NTT operation. That probably not what you are

I just noticed that we do have support for MLKEM and MLDSA in our
./bench-slope .  We should change that to make it easier torun
benchmarks.

I was actually looking only for a rough figure on how much performance
you gain with your patches.


Salam-Shalom,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20260226/bf8773ab/attachment.sig>