From simon at josefsson.org  Mon May 15 16:20:17 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Mon, 15 May 2023 16:20:17 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> (Falko Strenzke's
 message of "Mon, 3 Apr 2023 05:59:10 +0200")
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <87tty2cq2q.fsf@wheatstone.g10code.de>
 <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de>
Message-ID: <87lehp1xsu.fsf@kaka.sjd.se>

Hi

I noticed this thread just after submitting sntrup761 [1] patches.

My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it
uses S-expressions and MPI data types.  I believe the crypto world
rightly has moved away from those abstraction, towards byte-oriented
designs instead, for simplicity and safety.  Compare gcry_ecc_mul_point
for X25519 and gcry_kdf_derive for KDF's.  Could you implement Kyber as
a KEM using the API that I suggested?  I think that would be
significantly simpler, and would help validate the KEM API as supporting
more than one KEM.  I would strongly support having a KEM API that is
not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition
to it, for different use-cases.

/Simon

[1] https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761

Falko Strenzke <falko.strenzke at mtg.de> writes:

> Hi Werner,
>
> the only API change is the addition of the following interface function:
>
> gcry_err_code_t
> _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey)
>
> This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows:
>
> gcry_pk_spec_t _gcry_pubkey_spec_kyber = {
>   GCRY_PK_KYBER, {0, 1},
>   (GCRY_PK_USAGE_ENCAP),        // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"?
>   "Kyber", kyber_names,
>   "p", "s", "a", "", "p",       // elements of pub-key, sec-key, ciphertext, signature, key-grip
>   kyber_generate,
>   kyber_check_secret_key,
>   NULL,                         // encrypt
>   kyber_encap,
>   kyber_decrypt,
>   NULL,                         // sign,
>   NULL,                         // verify,
>   kyber_get_nbits,
>   run_selftests,
>   compute_keygrip
> };
>
> For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far
> doesn't seem to have any implications for us so the decision isn't urgent from my point of view.
>
> - Falko    
>
> Am 30.03.23 um 15:43 schrieb Werner Koch:
>
>  On Wed, 29 Mar 2023 10:09, Falko Strenzke said:
>
>  While the integration of the signature algorithms is straightforward, the KEM
> requires a new interface function, as the KEM encapsulation cannot be modelled
> by a public-key encryption.
>
>
> It would be good if we can discuss a proposed API early enough, so that
> we can see how it fits into the design of Libgcrypt.  Can you already
> roughly describes the needs?
>
>
> Salam-Shalom,
>
>    Werner
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/b0a33af9/attachment.sig>

From falko.strenzke at mtg.de  Mon May 15 17:07:15 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Mon, 15 May 2023 17:07:15 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <87lehp1xsu.fsf@kaka.sjd.se>
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <87tty2cq2q.fsf@wheatstone.g10code.de>
 <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> <87lehp1xsu.fsf@kaka.sjd.se>
Message-ID: <a7b42e90-4af4-2b10-2d0a-ffe08316c2a3@mtg.de>

Hi Simon,

indeed, there is considerable overhead in our implementation of the 
S-Expressions interface for the extraction of values and MPI <-> byte 
array conversions even though each Kyber "token" is merely an opaque 
byte array. However, we don't consider it our call to divert from the 
existing API as we can't gauge the implication of that for the client 
code, e.g. GnuPG. So we basically consider this the maintainer's decision.

I looked through your API. It is indeed much simpler. I have the 
following points, however:

1. I don't fully understand the design logic regarding the 
gcry_kem_hd_t. I understand that it makes sense to use it for the 
encryption and decryption to instantiate a particular key.? But for the 
key generation I don't per se see why it needs a handle. Is it required 
for precomputations in the case of NTRU Prime? (or anticipated that this 
is the case for other KEMs?)

2. "open" / "close" are in my opinion not the best names for the 
function to create and destroy such a handle. These terms rather suggest 
the handling of a file or a pipe. I know these terms are also used in 
the hash API, but I think more appropriate names would "create" / 
"destroy" or something similar. Maybe it makes sense to make the move to 
a new terminology here.

3. While the previous two points are rather minor or even cosmetic, this 
one is really important in my opinion: we need an API that allows for 
derandomized key generation and encapsulation to support KAT tests for 
all operations. The Kyber reference implementation already supports such 
KAT tests. I would anyway have raised the question here how to realize 
that. Signature functions in libgcrypt already support a 
"random-override" parameter, but so far I don't really understand how it 
works and whether it would be suitable to use it for the KEM API as 
well. Ideally, I think, the new API would allow to provide an RNG object 
and to set it to a specific seed before any operation (possibly via the 
KEM handle). However, it would probably be better if this functionality 
is only supported by an internal test-API and not available to normal 
clients. But I am not sure how to realize that in the current design of 
libgcrypt.

- Falko

Am 15.05.23 um 16:20 schrieb Simon Josefsson:
> Hi
>
> I noticed this thread just after submitting sntrup761 [1] patches.
>
> My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it
> uses S-expressions and MPI data types.  I believe the crypto world
> rightly has moved away from those abstraction, towards byte-oriented
> designs instead, for simplicity and safety.  Compare gcry_ecc_mul_point
> for X25519 and gcry_kdf_derive for KDF's.  Could you implement Kyber as
> a KEM using the API that I suggested?  I think that would be
> significantly simpler, and would help validate the KEM API as supporting
> more than one KEM.  I would strongly support having a KEM API that is
> not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition
> to it, for different use-cases.
>
> /Simon
>
> [1]https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761
>
> Falko Strenzke<falko.strenzke at mtg.de>  writes:
>
>> Hi Werner,
>>
>> the only API change is the addition of the following interface function:
>>
>> gcry_err_code_t
>> _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey)
>>
>> This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows:
>>
>> gcry_pk_spec_t _gcry_pubkey_spec_kyber = {
>>    GCRY_PK_KYBER, {0, 1},
>>    (GCRY_PK_USAGE_ENCAP),        // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"?
>>    "Kyber", kyber_names,
>>    "p", "s", "a", "", "p",       // elements of pub-key, sec-key, ciphertext, signature, key-grip
>>    kyber_generate,
>>    kyber_check_secret_key,
>>    NULL,                         // encrypt
>>    kyber_encap,
>>    kyber_decrypt,
>>    NULL,                         // sign,
>>    NULL,                         // verify,
>>    kyber_get_nbits,
>>    run_selftests,
>>    compute_keygrip
>> };
>>
>> For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far
>> doesn't seem to have any implications for us so the decision isn't urgent from my point of view.
>>
>> - Falko
>>
>> Am 30.03.23 um 15:43 schrieb Werner Koch:
>>
>>   On Wed, 29 Mar 2023 10:09, Falko Strenzke said:
>>
>>   While the integration of the signature algorithms is straightforward, the KEM
>> requires a new interface function, as the KEM encapsulation cannot be modelled
>> by a public-key encryption.
>>
>>
>> It would be good if we can discuss a proposed API early enough, so that
>> we can see how it fits into the design of Libgcrypt.  Can you already
>> roughly describes the needs?
>>
>>
>> Salam-Shalom,
>>
>>     Werner
-- 

*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de <https://www.mtg.de>


*MTG Exhibitions ? See you in 2023*

------------------------------------------------------------------------
<https://community.e-world-essen.com/institutions/allExhibitors?query=true&keywords=mtg> 
<https://www.itsa365.de/de-de/companies/m/mtg-ag>

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: J?rgen Ruf (CEO), Tamer Kemer?z
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If 
you are not the correct recipient or have received this email in error,
please inform the sender immediately and delete this email. Unauthorised 
copying or distribution of this email is not permitted.

Data protection information: Privacy policy 
<https://www.mtg.de/en/privacy-policy>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/55a247dd/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Y0kNcAEqvMyKELQS.png
Type: image/png
Size: 5256 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/55a247dd/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eTvco00Nz4KhPB6y.png
Type: image/png
Size: 4906 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/55a247dd/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4764 bytes
Desc: S/MIME Cryptographic Signature
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/55a247dd/attachment-0001.bin>

From simon at josefsson.org  Mon May 15 17:39:23 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Mon, 15 May 2023 17:39:23 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <a7b42e90-4af4-2b10-2d0a-ffe08316c2a3@mtg.de> (Falko Strenzke's
 message of "Mon, 15 May 2023 17:07:15 +0200")
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <87tty2cq2q.fsf@wheatstone.g10code.de>
 <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de>
 <87lehp1xsu.fsf@kaka.sjd.se>
 <a7b42e90-4af4-2b10-2d0a-ffe08316c2a3@mtg.de>
Message-ID: <87h6sd1u50.fsf@kaka.sjd.se>

Hi

Thanks for feedback.  Generally I'm not sure we should consider KEM's a
subset of public-key encryption/decryption from an API point of view.
Compare KDF's relationship to MAC/hashes.  Sometimes separate APIs make
sense.

Re your point about test vectors, I agree.  Libgcrypt supports some
"selftest" code (used for FIPS mode) that seems relevant here, maybe it
is sufficient to add selftest code that hard-code the RNG stuff and
compare test vectors?  Compare how kdf.c looks.  Werner, any thoughts on
this approach?  It seems simple, and avoids exposing potentially
insecure APIs to users.  I was considering this, but didn't want to
modify any FIPS-related code.

Re API, I think *_open is fairly established in libgcrypt, so it is good
idiom to re-use.  Let's have some alternatives, right now I proposed
this:

enum gcry_kem_algos
  {
    GCRY_KEM_SNTRUP761 = 761,
  };
#define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
#define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
#define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
#define GCRY_KEM_SNTRUP761_SIZE 32
typedef struct gcry_kem_handle *gcry_kem_hd_t;
gcry_error_t gcry_kem_open (gcry_kem_hd_t *hd, int algo);
void gcry_kem_close (gcry_kem_hd_t h);
gcry_error_t gcry_kem_keypair (gcry_kem_hd_t hd,
			       size_t pklen, void *pubkey,
			       size_t sklen, void *seckey);
gcry_error_t gcry_kem_enc (gcry_kem_hd_t hd,
			   size_t pklen, const void *pubkey,
			   size_t ctlen, void *ciphertext,
			   size_t keylen, void *key);
gcry_error_t gcry_kem_dec (gcry_kem_hd_t hd,
			   size_t ctlen, const void *ciphertext,
			   size_t sklen, const void *seckey,
			   size_t keylen, void *key);

Here is minimal approach similar to KDF interface:

enum gcry_kem_algos
  {
    GCRY_KEM_SNTRUP761 = 761,
  };
#define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
#define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
#define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
#define GCRY_KEM_SNTRUP761_SIZE 32
gcry_error_t gcry_kem_keypair (int algo,
			       size_t pklen, void *pubkey,
			       size_t sklen, void *seckey);
gcry_error_t gcry_kem_enc (int algo,
			   size_t pklen, const void *pubkey,
			   size_t ctlen, void *ciphertext,
			   size_t keylen, void *key);
gcry_error_t gcry_kem_dec (int algo,
			   size_t ctlen, const void *ciphertext,
			   size_t sklen, const void *seckey,
			   size_t keylen, void *key);

Here is a more complex variant that may be more consistent with existing
APIs but has some disadvantages (more APIs are harder to analyze, makes
static allocation much harder if not impossible):

enum gcry_kem_algos
  {
    GCRY_KEM_SNTRUP761 = 761,
  };
typedef struct gcry_kem_handle *gcry_kem_hd_t;
gcry_error_t gcry_kem_open (gcry_kem_hd_t *hd, int algo);
void gcry_kem_close (gcry_kem_hd_t h);
size_t gcry_kem_pubkey_size (gcry_kem_hd_t hd);
size_t gcry_kem_seckey_size (gcry_kem_hd_t hd);
size_t gcry_kem_ciphertext_size (gcry_kem_hd_t hd);
size_t gcry_kem_output_size (gcry_kem_hd_t hd);
gcry_error_t gcry_kem_keypair (gcry_kem_hd_t hd);
void *gcry_kem_get_seckey (gcry_kem_hd_t hd);
void *gcry_kem_get_pubkey (gcry_kem_hd_t hd);
gcry_error_t gcry_kem_enc (gcry_kem_hd_t hd,
			   size_t ctlen, void *ciphertext,
			   size_t keylen, void *key);
gcry_error_t gcry_kem_dec (gcry_kem_hd_t hd,
			   size_t ctlen, const void *ciphertext,
			   size_t keylen, void *key);

Other ideas?

Does kyber have any requirements on the API that wouldn't work well with
any of these?

/Simon

Falko Strenzke <falko.strenzke at mtg.de> writes:

> Hi Simon,
>
> indeed, there is considerable overhead in our implementation of the
> S-Expressions interface for the extraction of values and MPI <-> byte
> array conversions even though each Kyber "token" is merely an opaque
> byte array. However, we don't consider it our call to divert from the
> existing API as we can't gauge the implication of that for the client
> code, e.g. GnuPG. So we basically consider this the maintainer's
> decision.
>
> I looked through your API. It is indeed much simpler. I have the
> following points, however:
>
> 1. I don't fully understand the design logic regarding the
> gcry_kem_hd_t. I understand that it makes sense to use it for the
> encryption and decryption to instantiate a particular key.  But for
> the key generation I don't per se see why it needs a handle. Is it
> required for precomputations in the case of NTRU Prime? (or
> anticipated that this is the case for other KEMs?)
>
> 2. "open" / "close" are in my opinion not the best names for the
> function to create and destroy such a handle. These terms rather
> suggest the handling of a file or a pipe. I know these terms are also
> used in the hash API, but I think more appropriate names would
> "create" / "destroy" or something similar. Maybe it makes sense to
> make the move to a new terminology here.
>
> 3. While the previous two points are rather minor or even cosmetic,
> this one is really important in my opinion: we need an API that allows
> for derandomized key generation and encapsulation to support KAT tests
> for all operations. The Kyber reference implementation already
> supports such KAT tests. I would anyway have raised the question here
> how to realize that. Signature functions in libgcrypt already support
> a "random-override" parameter, but so far I don't really understand
> how it works and whether it would be suitable to use it for the KEM
> API as well.  Ideally, I think, the new API would allow to provide an
> RNG object and to set it to a specific seed before any operation
> (possibly via the KEM handle). However, it would probably be better if
> this functionality is only supported by an internal test-API and not
> available to normal clients. But I am not sure how to realize that in
> the current design of libgcrypt.
>
> - Falko
>
> Am 15.05.23 um 16:20 schrieb Simon Josefsson:
>
>  Hi
>
> I noticed this thread just after submitting sntrup761 [1] patches.
>
> My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it
> uses S-expressions and MPI data types.  I believe the crypto world
> rightly has moved away from those abstraction, towards byte-oriented
> designs instead, for simplicity and safety.  Compare gcry_ecc_mul_point
> for X25519 and gcry_kdf_derive for KDF's.  Could you implement Kyber as
> a KEM using the API that I suggested?  I think that would be
> significantly simpler, and would help validate the KEM API as supporting
> more than one KEM.  I would strongly support having a KEM API that is
> not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition
> to it, for different use-cases.
>
> /Simon
>
> [1] https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761
>
> Falko Strenzke <falko.strenzke at mtg.de> writes:
>
>  Hi Werner,
>
> the only API change is the addition of the following interface function:
>
> gcry_err_code_t
> _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey)
>
> This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows:
>
> gcry_pk_spec_t _gcry_pubkey_spec_kyber = {
>   GCRY_PK_KYBER, {0, 1},
>   (GCRY_PK_USAGE_ENCAP),        // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"?
>   "Kyber", kyber_names,
>   "p", "s", "a", "", "p",       // elements of pub-key, sec-key, ciphertext, signature, key-grip
>   kyber_generate,
>   kyber_check_secret_key,
>   NULL,                         // encrypt
>   kyber_encap,
>   kyber_decrypt,
>   NULL,                         // sign,
>   NULL,                         // verify,
>   kyber_get_nbits,
>   run_selftests,
>   compute_keygrip
> };
>
> For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far
> doesn't seem to have any implications for us so the decision isn't urgent from my point of view.
>
> - Falko    
>
> Am 30.03.23 um 15:43 schrieb Werner Koch:
>
>  On Wed, 29 Mar 2023 10:09, Falko Strenzke said:
>
>  While the integration of the signature algorithms is straightforward, the KEM
> requires a new interface function, as the KEM encapsulation cannot be modelled
> by a public-key encryption.
>
>
> It would be good if we can discuss a proposed API early enough, so that
> we can see how it fits into the design of Libgcrypt.  Can you already
> roughly describes the needs?
>
>
> Salam-Shalom,
>
>    Werner
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/6b5d816b/attachment.sig>

From smueller at chronox.de  Mon May 15 17:51:12 2023
From: smueller at chronox.de (Stephan Mueller)
Date: Mon, 15 May 2023 17:51:12 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <87h6sd1u50.fsf@kaka.sjd.se>
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <a7b42e90-4af4-2b10-2d0a-ffe08316c2a3@mtg.de> <87h6sd1u50.fsf@kaka.sjd.se>
Message-ID: <3414868.jSmoDxJbhA@tauon.chronox.de>

Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt-
devel:

Hi Simon,

> Does kyber have any requirements on the API that wouldn't work well with
> any of these?

I am experimenting with Kyber in [1]. For KEM, your API would work.

There you see that I use an additional parameter, an RNG context. This allows 
me to also derive Kyber keys straight from a KDF (which is accessed like an 
RNG context). But that is not really needed.

However, how do you propose to handle the KEX scenario? See [2] for the full 
Kyber KEX exchange and the API. I think the KEX is much more important than 
the KEM, as the KEX is conceptually what is DH today. Kyber KEM can be used in 
an integrated encryption schema as suggested in [3].

Unfortunately, the Kyber KEX cannot be acting as a direct replacement for DH. 
Due to its 7 total steps. However, it is possible to coalescing all of them 
into 2 handshake network exchanges and one final data blob that is sent along 
with the already encrypted first payload.

[1] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/
lc_kyber.h#L121

[2] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/
lc_kyber.h#L294

[3] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/
lc_kyber.h#L425

Ciao
Stephan


From simon at josefsson.org  Mon May 15 16:02:39 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Mon, 15 May 2023 16:02:39 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
Message-ID: <87wn191ym8.fsf@kaka.sjd.se>

Hi

See attached patch that adds sntrup761.  What do you think?

My use case is to enable implementation of OpenSSH's
sntrup761x25519-sha512 in libssh/libssh2.

Specific open issues:

- Documentation

- Benchmarking self-test

- Self-tests that validate test vectors

   Not trivial because the algorithm is randomized, so we would have to
   use deterministic randomness somehow -- and to use an deterministic
   algorithm for which there exists sntrup761 test vectors (DRBG-CTR is
   the only one I am aware of, but far from ideal).

- API design

   - Are gcry_kem_open/gcry_kem_close useful?  They complicate
     implementation for no gain for sntrup761, but could be useful for
     other KEM's, OTOH they may just complicate it for all KEM's since I
     believe the KEM APIs are fairly established these days.

   - The pubkey parameter for KEM-Enc could be stored in the handle, as
     could the seckey parameter for KEM-Dec.  This would make the
     gcry_kem_open/gcry_kem_close more useful, however it would mean
     more memory zeroization issues.

   - The #define's for output lengths could be functions, similar to
     other libgcrypt APIs.  This makes it harder to use statically
     allocated buffers, so I think the current #define's are useful.

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-Streamlined-NTRU-Prime-sntrup761.patch
Type: text/x-diff
Size: 41181 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/5e4b75fb/attachment-0001.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230515/5e4b75fb/attachment-0001.sig>

From simon at josefsson.org  Tue May 16 08:09:23 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Tue, 16 May 2023 08:09:23 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <3414868.jSmoDxJbhA@tauon.chronox.de> (Stephan Mueller's message
 of "Mon, 15 May 2023 17:51:12 +0200")
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <a7b42e90-4af4-2b10-2d0a-ffe08316c2a3@mtg.de>
 <87h6sd1u50.fsf@kaka.sjd.se> <3414868.jSmoDxJbhA@tauon.chronox.de>
Message-ID: <87353w24fg.fsf@kaka.sjd.se>

Stephan Mueller <smueller at chronox.de> writes:

> Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt-
> devel:
>
> Hi Simon,
>
>> Does kyber have any requirements on the API that wouldn't work well with
>> any of these?
>
> I am experimenting with Kyber in [1]. For KEM, your API would work.

Thanks for confirming this!  Looking at the code, it seems Kyber KEM has
exactly the same API as sntrup761, which probably was a NIST PQCS
requirement, and we should expect that other KEM's follow a similar
approach.

I think that sntrup761 can be added to libgcrypt now since it has been
stable since 2017, but I'm less sure about Kyber since it is stuck in
the NIST process -- aren't there some risk that NIST will modify the
parameters again?

> There you see that I use an additional parameter, an RNG context. This allows 
> me to also derive Kyber keys straight from a KDF (which is accessed like an 
> RNG context). But that is not really needed.

Right, I use the RNG context internally in sntrup761.c as well, but I
don't think it should be exposed to libgcrypt callers.  The internal RNG
context will be useful for self-testing.  This is especially true since
I think test vectors for KEM's are implementation-specific: if you
optimize the implementation to re-order RNG calls, the test vectors will
no longer work.  Thus, you can't really do black-box testing with KEM
KATs.  The libgcrypt selftest() approach is perfectly suited for doing a
whitebox test internally though.

> However, how do you propose to handle the KEX scenario? See [2] for the full 
> Kyber KEX exchange and the API. I think the KEX is much more important than 
> the KEM, as the KEX is conceptually what is DH today. Kyber KEM can be used in 
> an integrated encryption schema as suggested in [3].
>
> Unfortunately, the Kyber KEX cannot be acting as a direct replacement for DH. 
> Due to its 7 total steps. However, it is possible to coalescing all of them 
> into 2 handshake network exchanges and one final data blob that is sent along 
> with the already encrypted first payload.

I think this should be through a completely different API than for KEM
or public-key encrypt/decrypt, and an API that is customized for the KEX
functionality.  The properties are different from existing APIs, similar
to how AEAD ciphers differs from ECB ciphers, and how KDF differs from
MAC/hashes.  Also compare how libgcrypt contains an API for X25519/X448
curve operations.

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/e02a3217/attachment.sig>

From simon at josefsson.org  Tue May 16 08:56:08 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Tue, 16 May 2023 08:56:08 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <87wn191ym8.fsf@kaka.sjd.se> (Simon Josefsson via Gcrypt-devel's
 message of "Mon, 15 May 2023 16:02:39 +0200")
References: <87wn191ym8.fsf@kaka.sjd.se>
Message-ID: <87y1lozrw7.fsf@kaka.sjd.se>

Hi

Attached is a second version of the sntrup761 patch, this time using a
minimal API that would work for Kyber too (please confirm).  Unless we
know complexity is required, I prefer to keep things minimal.

I've pushed it to:
https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2

Below is the added API.  Thoughts?

enum gcry_kem_algos
  {
    GCRY_KEM_SNTRUP761 = 761,
  };

#define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
#define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
#define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
#define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32

gcry_error_t gcry_kem_keypair (int algo,
			       void *pubkey,
			       void *seckey);

gcry_error_t gcry_kem_enc (int algo,
			   const void *pubkey,
			   void *ciphertext,
			   void *ss);

gcry_error_t gcry_kem_dec (int algo,
			   const void *ciphertext,
			   const void *seckey,
			   void *ss);

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-Streamlined-NTRU-Prime-sntrup761.patch
Type: text/x-diff
Size: 38454 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/f4610818/attachment-0001.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/f4610818/attachment-0001.sig>

From falko.strenzke at mtg.de  Tue May 16 09:16:13 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Tue, 16 May 2023 09:16:13 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <87y1lozrw7.fsf@kaka.sjd.se>
References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se>
Message-ID: <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de>

Hi Simon,

Am 16.05.23 um 08:56 schrieb Simon Josefsson via Gcrypt-devel:
> Hi
>
> Attached is a second version of the sntrup761 patch, this time using a
> minimal API that would work for Kyber too (please confirm).  Unless we
> know complexity is required, I prefer to keep things minimal.
>
> I've pushed it to:
> https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2
>
> Below is the added API.  Thoughts?
>
> enum gcry_kem_algos
>    {
>      GCRY_KEM_SNTRUP761 = 761,
>    };
>
> #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
> #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
> #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
> #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32
>
> gcry_error_t gcry_kem_keypair (int algo,
> 			       void *pubkey,
> 			       void *seckey);
>
> gcry_error_t gcry_kem_enc (int algo,
> 			   const void *pubkey,
> 			   void *ciphertext,
> 			   void *ss);
>
> gcry_error_t gcry_kem_dec (int algo,
> 			   const void *ciphertext,
> 			   const void *seckey,
> 			   void *ss);

I think this is already going into the right direction. However, I have 
some proposals:

1. I would prefer a more type safe API: distinct public and private key 
objects instead of void pointers, i.e gcry_kem_public_key_t and 
gcry_kem_private_key_t. From your proposed API it does not become clear 
if pubkey and seckey are objects or just byte arrays. Since 
instantiating a key from a byte array may involve some precomputations 
(imagine for instance instantiating a private key from a PRNG seed), for 
efficiency reasons it is in my view necessary to have public and private 
key objects.

2. Also the enum should by typedef'd and used with its type in the 
function signature.

3. There is no need to provide algo again in the enc/dec functions. A 
key object will know it's algorithm. (Probably this is due to key void 
pointers meant as byte arrays)

4. All input/output byte arrays should be typed as uint8_t* and be 
passed in with their lengths. If without lengths, client code will be 
prone to memory access errors.

5. Then we will also need extra functions for serialization and 
deserialization of keys.

- Falko

>
> /Simon
>
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> https://lists.gnupg.org/mailman/listinfo/gcrypt-devel
-- 

*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de <https://www.mtg.de>


*MTG Exhibitions ? See you in 2023*

------------------------------------------------------------------------
<https://community.e-world-essen.com/institutions/allExhibitors?query=true&keywords=mtg> 
<https://www.itsa365.de/de-de/companies/m/mtg-ag>

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: J?rgen Ruf (CEO), Tamer Kemer?z
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If 
you are not the correct recipient or have received this email in error,
please inform the sender immediately and delete this email. Unauthorised 
copying or distribution of this email is not permitted.

Data protection information: Privacy policy 
<https://www.mtg.de/en/privacy-policy>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/cc4ff2fa/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bl8h4Ks4fuYl1XEo.png
Type: image/png
Size: 5256 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/cc4ff2fa/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xsNU0Yk5v78SRIzb.png
Type: image/png
Size: 4906 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/cc4ff2fa/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4764 bytes
Desc: S/MIME Cryptographic Signature
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/cc4ff2fa/attachment-0001.bin>

From smueller at chronox.de  Tue May 16 09:58:40 2023
From: smueller at chronox.de (Stephan Mueller)
Date: Tue, 16 May 2023 09:58:40 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <87y1lozrw7.fsf@kaka.sjd.se>
References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se>
Message-ID: <4957427.z1Mmn1aDQA@tauon.chronox.de>

Am Dienstag, 16. Mai 2023, 08:56:08 CEST schrieb Simon Josefsson via Gcrypt-
devel:

Hi Simon,

> Hi
> 
> Attached is a second version of the sntrup761 patch, this time using a
> minimal API that would work for Kyber too (please confirm).  Unless we
> know complexity is required, I prefer to keep things minimal.
> 
> I've pushed it to:
> https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2
> 
> Below is the added API.  Thoughts?
> 
> enum gcry_kem_algos
>   {
>     GCRY_KEM_SNTRUP761 = 761,
>   };
> 
> #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
> #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
> #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
> #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32
> 
> gcry_error_t gcry_kem_keypair (int algo,
> 			       void *pubkey,
> 			       void *seckey);
> 
> gcry_error_t gcry_kem_enc (int algo,
> 			   const void *pubkey,
> 			   void *ciphertext,
> 			   void *ss);

May I suggest to add another parameter: size_t ss_len which shall specify the 
caller-requested size of ss?
> 
> gcry_error_t gcry_kem_dec (int algo,
> 			   const void *ciphertext,
> 			   const void *seckey,
> 			   void *ss);

Same here.

Kyber uses a KDF as the last step. I am aware of the fact that the Kyber 
reference implementation returns 32 bytes statically. However, considering the 
use of a true KDF which has the property of a pseudorandom behavior (either 
SHAKE256 or AES-CTR is used), the KDF can produce arbitrary amounts of data. 
By specifying an ss_len parameter, the caller can directly request the data 
that may be needed as a key/IV/mac Key or similar for subsequent cipher 
operations.

In [1] and [2], I use such an ss_len parameter which in turn serves me well 
for production use cases.

[1] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/
lc_kyber.h#L149

[2] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/
lc_kyber.h#L167

Thanks a lot
Stephan


From smueller at chronox.de  Tue May 16 10:07:30 2023
From: smueller at chronox.de (Stephan Mueller)
Date: Tue, 16 May 2023 10:07:30 +0200
Subject: Implementation of PQC Algorithms in libgcrypt
In-Reply-To: <87353w24fg.fsf@kaka.sjd.se>
References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de>
 <3414868.jSmoDxJbhA@tauon.chronox.de> <87353w24fg.fsf@kaka.sjd.se>
Message-ID: <3134498.TlHfbrAK3V@tauon.chronox.de>

Am Dienstag, 16. Mai 2023, 08:09:23 CEST schrieb Simon Josefsson:

Hi Simon,

> Stephan Mueller <smueller at chronox.de> writes:
> > Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt-
> > devel:
> > 
> > Hi Simon,
> > 
> >> Does kyber have any requirements on the API that wouldn't work well with
> >> any of these?
> > 
> > I am experimenting with Kyber in [1]. For KEM, your API would work.
> 
> Thanks for confirming this!  Looking at the code, it seems Kyber KEM has
> exactly the same API as sntrup761, which probably was a NIST PQCS
> requirement, and we should expect that other KEM's follow a similar
> approach.
> 
> I think that sntrup761 can be added to libgcrypt now since it has been
> stable since 2017, but I'm less sure about Kyber since it is stuck in
> the NIST process -- aren't there some risk that NIST will modify the
> parameters again?

I have no insight into the process. I expect, though, that only Kyber/
Dilithium with security strength of 256 bits will be allowed. I would not 
expect that internal parameters would change, though.

However, NIAP / NSA now starts mandating Kyber / Dilithium with 256 bits 
strength as a replacement for *all* general-purpose asymmetric algorithms by 
2035. There are no options for other algorithms! This is now spawning 
discussions especially around the network protocols. Especially Kyber KEX is 
no direct-fit replacement for DH which implies that all network protocols must 
change.

I.e. RSA, (EC)DSA, (EC)DH shall be completely replaced by Kyber and Dilithium.

This is also an interesting catch-22 for NIST's competition. NIST did not 
decide yet, but it may be hard for them to ignore the new ruling my NSA.

> 
> > There you see that I use an additional parameter, an RNG context. This
> > allows me to also derive Kyber keys straight from a KDF (which is
> > accessed like an RNG context). But that is not really needed.
> 
> Right, I use the RNG context internally in sntrup761.c as well, but I
> don't think it should be exposed to libgcrypt callers. 

I can live with that, no doubts. But it makes life (at least for keygen) 
significantly easier :-)

Anyhow, considering that libgcrypt also wants to comply with FIPS rules, it is 
not permissible to allow the user to specify the rng context. So, your 
approach fits even the FIPS considerations (whereas mine does not).

> The internal RNG
> context will be useful for self-testing.  This is especially true since
> I think test vectors for KEM's are implementation-specific: if you
> optimize the implementation to re-order RNG calls, the test vectors will
> no longer work.  Thus, you can't really do black-box testing with KEM
> KATs.  The libgcrypt selftest() approach is perfectly suited for doing a
> whitebox test internally though.

Agreed.
> 
> > However, how do you propose to handle the KEX scenario? See [2] for the
> > full Kyber KEX exchange and the API. I think the KEX is much more
> > important than the KEM, as the KEX is conceptually what is DH today.
> > Kyber KEM can be used in an integrated encryption schema as suggested in
> > [3].
> > 
> > Unfortunately, the Kyber KEX cannot be acting as a direct replacement for
> > DH. Due to its 7 total steps. However, it is possible to coalescing all
> > of them into 2 handshake network exchanges and one final data blob that
> > is sent along with the already encrypted first payload.
> 
> I think this should be through a completely different API than for KEM
> or public-key encrypt/decrypt, and an API that is customized for the KEX
> functionality.  The properties are different from existing APIs, similar
> to how AEAD ciphers differs from ECB ciphers, and how KDF differs from
> MAC/hashes.  Also compare how libgcrypt contains an API for X25519/X448
> curve operations.

Sounds good from my side.

Ciao
Stephan


From falko.strenzke at mtg.de  Tue May 16 15:00:23 2023
From: falko.strenzke at mtg.de (Falko Strenzke)
Date: Tue, 16 May 2023 15:00:23 +0200
Subject: Code formatting for libgcrypt
Message-ID: <a61df418-8a8a-5718-cf25-e2fdff4e2dfb@mtg.de>

Is there any official code formatting style for any formatting tool 
available for libgcrypt? If no official one, an unofficial one that 
matches the existing code quite well?

- Falko

-- 

*MTG AG*
Dr. Falko Strenzke
Executive System Architect

Phone: +49 6151 8000 24
E-Mail: falko.strenzke at mtg.de
Web: mtg.de <https://www.mtg.de>


*MTG Exhibitions ? See you in 2023*

------------------------------------------------------------------------
<https://community.e-world-essen.com/institutions/allExhibitors?query=true&keywords=mtg> 
<https://www.itsa365.de/de-de/companies/m/mtg-ag>

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: J?rgen Ruf (CEO), Tamer Kemer?z
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If 
you are not the correct recipient or have received this email in error,
please inform the sender immediately and delete this email. Unauthorised 
copying or distribution of this email is not permitted.

Data protection information: Privacy policy 
<https://www.mtg.de/en/privacy-policy>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/01d04514/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eYWhdZgr2HrMZnnH.png
Type: image/png
Size: 5256 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/01d04514/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BkASwHg58wey7eBI.png
Type: image/png
Size: 4906 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/01d04514/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4764 bytes
Desc: S/MIME Cryptographic Signature
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/01d04514/attachment-0001.bin>

From wk at gnupg.org  Tue May 16 16:38:42 2023
From: wk at gnupg.org (Werner Koch)
Date: Tue, 16 May 2023 16:38:42 +0200
Subject: Code formatting for libgcrypt
In-Reply-To: <a61df418-8a8a-5718-cf25-e2fdff4e2dfb@mtg.de> (Falko Strenzke's
 message of "Tue, 16 May 2023 15:00:23 +0200")
References: <a61df418-8a8a-5718-cf25-e2fdff4e2dfb@mtg.de>
Message-ID: <87h6scqr2l.fsf@wheatstone.g10code.de>

On Tue, 16 May 2023 15:00, Falko Strenzke said:
> Is there any official code formatting style for any formatting tool available
> for libgcrypt? If no official one, an unofficial one that matches the existing
> code quite well?

Please see: gnupg/doc/HACKING


Shalom-Salam,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/2f9f649a/attachment.sig>

From wk at gnupg.org  Tue May 16 17:52:53 2023
From: wk at gnupg.org (Werner Koch)
Date: Tue, 16 May 2023 17:52:53 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <87wn191ym8.fsf@kaka.sjd.se> (Simon Josefsson via Gcrypt-devel's
 message of "Mon, 15 May 2023 16:02:39 +0200")
References: <87wn191ym8.fsf@kaka.sjd.se>
Message-ID: <871qjgqnmy.fsf@wheatstone.g10code.de>

Hi!

> My use case is to enable implementation of OpenSSH's
> sntrup761x25519-sha512 in libssh/libssh2.

Given that OpenSSH starts to move into that direction, it is a good idea
to add support to Libgcrypt.  After all we want that gpg-agent can also
work with that algorithms.

>    - Are gcry_kem_open/gcry_kem_close useful?  They complicate
>      implementation for no gain for sntrup761, but could be useful for
>      other KEM's, OTOH they may just complicate it for all KEM's since I
>      believe the KEM APIs are fairly established these days.

I have not yet anaylyzed your needs but I think that this new API is not
needed because we have KEM functions already implemented in the pubkey
API.

Instead of a new separate API it should be sufficient to make use of the
general idea of gcry_ctx_t.  Right now we use such a context only for
KATs and to implement custom EC functions.

The context object was actually implemented to add state to the public
key functions and to allow the provisioning of larger parameters by
associating them with an s-expression.  A context is also a way to
implement n-way processing within Libgcrypt.


Salam-Shalom,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230516/e15b3d58/attachment.sig>

From simon at josefsson.org  Fri May 19 23:37:31 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Fri, 19 May 2023 23:37:31 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <871qjgqnmy.fsf@wheatstone.g10code.de> (Werner Koch via
 Gcrypt-devel's message of "Tue, 16 May 2023 17:52:53 +0200")
References: <87wn191ym8.fsf@kaka.sjd.se> <871qjgqnmy.fsf@wheatstone.g10code.de>
Message-ID: <87v8goxask.fsf@kaka.sjd.se>

Werner Koch via Gcrypt-devel <gcrypt-devel at gnupg.org> writes:

> I have not yet anaylyzed your needs but I think that this new API is not
> needed because we have KEM functions already implemented in the pubkey
> API.

Do you mean these?

 -- Function: gcry_error_t gcry_pk_genkey (gcry_sexp_t *R_KEY,
          gcry_sexp_t PARMS)
 -- Function: gcry_error_t gcry_pk_encrypt (gcry_sexp_t *R_CIPH,
          gcry_sexp_t DATA, gcry_sexp_t PKEY)
 -- Function: gcry_error_t gcry_pk_decrypt (gcry_sexp_t *R_PLAIN,
          gcry_sexp_t DATA, gcry_sexp_t SKEY)

I think these are poorly suited for modern KEM's like sntrup761.  They
are all now byte-oriented, not MPI/sexp.  KEM's use of public/private
keys are ephemeral, like diffie-hellman, so they are different than
long-term keys.  I think this is comparable to the separate APIs
introduced for X25519:

 -- Function: gpg_error_t gcry_ecc_mul_point (int CURVEID,
          unsigned char *RESULT, const unsigned char *SCALAR,
          const unsigned char *POINT)

Using MPI's to store byte-values lead to a security concern in RFC 8731,
since MPI's encode different byte-values in different length depending
on the content.  I haven't checked if libgcrypt would be vulnerable to
the same problem, but type-overloading is not safe.

Maybe you could take a second look on the API I proposed below?  It
matches the API that several modern KEM's uses.  Yes this would make
KEM's a special animal that is not compatible with other
public/private-key stuff in libgcrypt, but I think that is actually a
good thing.

enum gcry_kem_algos
  {
    GCRY_KEM_SNTRUP761 = 761,
  };

#define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763
#define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158
#define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039
#define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32

gcry_error_t gcry_kem_keypair (int algo,
			       void *pubkey,
			       void *seckey);
gcry_error_t gcry_kem_enc (int algo,
			   const void *pubkey,
			   void *ciphertext,
			   void *ss);
gcry_error_t gcry_kem_dec (int algo,
			   const void *ciphertext,
			   const void *seckey,
			   void *ss);

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230519/563d8ecd/attachment.sig>

From simon at josefsson.org  Fri May 19 23:47:39 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Fri, 19 May 2023 23:47:39 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de> (Falko Strenzke's
 message of "Tue, 16 May 2023 09:16:13 +0200")
References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se>
 <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de>
Message-ID: <87o7mgxabo.fsf@kaka.sjd.se>

Falko Strenzke <falko.strenzke at mtg.de> writes:

> I think this is already going into the right direction. However, I
> have some proposals:

Thank you for feedback!

> 1. I would prefer a more type safe API: distinct public and private
> key objects instead of void pointers, i.e gcry_kem_public_key_t and
> gcry_kem_private_key_t. From your proposed API it does not become
> clear if pubkey and seckey are objects or just byte arrays. Since
> instantiating a key from a byte array may involve some precomputations
> (imagine for instance instantiating a private key from a PRNG seed),
> for efficiency reasons it is in my view necessary to have public and
> private key objects.

This is a trade-off, and my rationale was that I prefer doing
byte-oriented APIs since that seems to what all modern KEM's are using
(including Kyber?).  And for some reason byte-strings are passed as
'void*' in libgcrypt, so I followed that style.  There should be
documentation explaining this.

I think the core decision should be to use 1) byte-oriented API or 2)
some higher-level representation like MPI/sexp.  The API types follow
from that decision.  I agree with you that 'void*' is not nice, but it
seems like the libgcrypt idiom.

However you make me believe we could use uint8_t here?  My KEM API is
not similar to other parts of libgcrypt anyway, so we don't have to
repeat using 'void*' for data.

> 2. Also the enum should by typedef'd and used with its type in the
> function signature.

My use was modeled in existing uses like 'enum gcry_cipher_algos', 'enum
gcry_pk_algos' etc.  I do agree with you, but I think consistency is
also important.

> 3. There is no need to provide algo again in the enc/dec functions. A
> key object will know it's algorithm. (Probably this is due to key void
> pointers meant as byte arrays)

I think either a context handle or algorithm identifier is needed.  The
key parameter is just a opaque byte array, it doesn't know its
algorithm.

> 4. All input/output byte arrays should be typed as uint8_t* and be
> passed in with their lengths. If without lengths, client code will be
> prone to memory access errors.

That was my first version of the API.  It felt useless to have all these
size_t lengths and checks for them since they were all fixed strings
anyway.  Let's see where all other issues end up, if this is still
relevant.

> 5. Then we will also need extra functions for serialization and
> deserialization of keys.

My approach uses raw keys directly, so this is included.

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230519/0550ba14/attachment.sig>

From simon at josefsson.org  Fri May 19 23:52:00 2023
From: simon at josefsson.org (Simon Josefsson)
Date: Fri, 19 May 2023 23:52:00 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <4957427.z1Mmn1aDQA@tauon.chronox.de> (Stephan Mueller's message
 of "Tue, 16 May 2023 09:58:40 +0200")
References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se>
 <4957427.z1Mmn1aDQA@tauon.chronox.de>
Message-ID: <87jzx4xa4f.fsf@kaka.sjd.se>

Stephan Mueller <smueller at chronox.de> writes:

>> gcry_error_t gcry_kem_enc (int algo,
>> 			   const void *pubkey,
>> 			   void *ciphertext,
>> 			   void *ss);
>
> May I suggest to add another parameter: size_t ss_len which shall specify the 
> caller-requested size of ss?

Is that to support variable-length outputs?  Or just to indicate the
buffer size?  Does kyber or some other popular KEM supports
variable-length outputs?

>> gcry_error_t gcry_kem_dec (int algo,
>> 			   const void *ciphertext,
>> 			   const void *seckey,
>> 			   void *ss);
>
> Same here.
>
> Kyber uses a KDF as the last step. I am aware of the fact that the Kyber 
> reference implementation returns 32 bytes statically. However, considering the 
> use of a true KDF which has the property of a pseudorandom behavior (either 
> SHAKE256 or AES-CTR is used), the KDF can produce arbitrary amounts of data. 
> By specifying an ss_len parameter, the caller can directly request the data 
> that may be needed as a key/IV/mac Key or similar for subsequent cipher 
> operations.

What does the specification says?  Is kyber specified as a
variable-length output, or output of 32 bytes?

One approach is to have another API for that use-case:

gcry_error_t gcry_kem_enc_kdf (int algo,
			      const void *pubkey,
			      void *ciphertext,
			      size_t sslen, void *ss);
gcry_error_t gcry_kem_dec_kdf (int algo,
			       const void *ciphertext,
			       const void *seckey,
			       size_t sslen, void *ss);

/Simon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 255 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230519/98a7be83/attachment.sig>

From smueller at chronox.de  Sun May 21 16:30:24 2023
From: smueller at chronox.de (Stephan =?ISO-8859-1?Q?M=FCller?=)
Date: Sun, 21 May 2023 16:30:24 +0200
Subject: [PATCH] Add Streamlined NTRU Prime sntrup761.
In-Reply-To: <87jzx4xa4f.fsf@kaka.sjd.se>
References: <87wn191ym8.fsf@kaka.sjd.se> <4957427.z1Mmn1aDQA@tauon.chronox.de>
 <87jzx4xa4f.fsf@kaka.sjd.se>
Message-ID: <4836093.31r3eYUQgx@positron.chronox.de>

Am Freitag, 19. Mai 2023, 23:52:00 CEST schrieb Simon Josefsson:

Hi Simon,

> Stephan Mueller <smueller at chronox.de> writes:
> >> gcry_error_t gcry_kem_enc (int algo,
> >> 
> >> 			   const void *pubkey,
> >> 			   void *ciphertext,
> >> 			   void *ss);
> > 
> > May I suggest to add another parameter: size_t ss_len which shall specify
> > the caller-requested size of ss?
> 
> Is that to support variable-length outputs?  Or just to indicate the
> buffer size?  Does kyber or some other popular KEM supports
> variable-length outputs?

Short answer: this is to indicate variable length outputs.

Long answer: Kyber KEM defines as its last step a KDF using either SHAKE256. 
Both KDFs allow the caller to request an arbitrary output size. Yet, the 
sample source code generates hard-coded 32 bytes.

To avoid waste of CPU cycles and considering that both KDF operations are 
defined as pseudorandom operations (see SP800-185 for SHAKE), I personally 
think that this KDF should be asked to generate exakt those number of bytes 
that are needed. This implies that I added the ss_len parameter to my API set 
to advertise that this helps preventing the waste of precious CPU cycles.
> 
> >> gcry_error_t gcry_kem_dec (int algo,
> >> 
> >> 			   const void *ciphertext,
> >> 			   const void *seckey,
> >> 			   void *ss);
> > 
> > Same here.
> > 
> > Kyber uses a KDF as the last step. I am aware of the fact that the Kyber
> > reference implementation returns 32 bytes statically. However, considering
> > the use of a true KDF which has the property of a pseudorandom behavior
> > (either SHAKE256 or AES-CTR is used), the KDF can produce arbitrary
> > amounts of data. By specifying an ss_len parameter, the caller can
> > directly request the data that may be needed as a key/IV/mac Key or
> > similar for subsequent cipher operations.
> 
> What does the specification says?  Is kyber specified as a
> variable-length output, or output of 32 bytes?

The specification contains the following words regarding the KDF:

"""
As a modification in round-2, we decided to derive the final key using 
SHAKE-256 instead of SHA3-256.
This is an advantage for protocols that need keys of more than 256 bits. 
Instead of first requesting a 256-bit
key from Kyber and then expanding it, they can pass an additional key-length 
parameter to Kyber and
obtain a key of the desired length. This feature is not supported by the NIST 
API, so in our implementations
we set the keylength to a fixed length of 32 bytes in api.h.
"""

So, the authors deem it as acceptable to specify an ss_len.
> 
> One approach is to have another API for that use-case:
> 
> gcry_error_t gcry_kem_enc_kdf (int algo,
> 			      const void *pubkey,
> 			      void *ciphertext,
> 			      size_t sslen, void *ss);
> gcry_error_t gcry_kem_dec_kdf (int algo,
> 			       const void *ciphertext,
> 			       const void *seckey,
> 			       size_t sslen, void *ss);

Fine with me, too
> 
> /Simon


Ciao
Stephan


From canadamax at proton.me  Fri May 26 20:30:51 2023
From: canadamax at proton.me (Max Blanco)
Date: Fri, 26 May 2023 18:30:51 +0000
Subject: bug report: trouble compiling libgcrypt with libgpg-error-1.47
Message-ID: <gXmH6SZ6ljy9QY0eTXs-AoYU38xfhBjxy21VsymxWGkvFquM1403zSzReYLi6FJEUSnc2K-0v5nNq8hVDI7kMG5qeZkaHwJFE1ms1U7fXBw=@proton.me>

Hello,

I have trouble compiling libgcrypt on top of libgpg-error-1.47.


If I try?libgcrypt-1.8.10 the configure script tells me I need libgpg-error > 1.25 (!!).


If I try?libgcrypt-1.10.1 the computer throws an error in the test regime at ec-nist.c in function '_gcry_mpi_ec_nist192_mod'.


The system is a legacy Intel Core Duo running Mac OS X 10.6.8.

This behaviour is puzzling. Any suggestions are welcome.


Sent with Proton Mail secure email.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230526/94d6d6fb/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: publickey - canadamax at proton.me - 0xD17D3B5F.asc
Type: application/pgp-keys
Size: 653 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230526/94d6d6fb/attachment.key>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 249 bytes
Desc: OpenPGP digital signature
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230526/94d6d6fb/attachment.sig>

From jussi.kivilinna at iki.fi  Sun May 28 16:53:55 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 28 May 2023 17:53:55 +0300
Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption
Message-ID: <20230528145355.532424-1-jussi.kivilinna@iki.fi>

* cipher/rijndael-aesni.c (aesni_ocb_checksum): Remove.
(aesni_ocb_dec): Add inline checksumming.
--

Inline checksumming is far faster on Ryzen processors on i386
builds than two-pass checksumming.

Benchmark on AMD Ryzen 9 7900X (i386):

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        OCB dec |     0.180 ns/B      5292 MiB/s     0.847 c/B      4700

After (~2x faster):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        OCB dec |     0.091 ns/B     10491 MiB/s     0.427 c/B      4700

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/rijndael-aesni.c | 220 ++++++++--------------------------------
 1 file changed, 43 insertions(+), 177 deletions(-)

diff --git a/cipher/rijndael-aesni.c b/cipher/rijndael-aesni.c
index 906737a6..b33ef7ed 100644
--- a/cipher/rijndael-aesni.c
+++ b/cipher/rijndael-aesni.c
@@ -2710,174 +2710,6 @@ _gcry_aes_aesni_cbc_dec (RIJNDAEL_context *ctx, unsigned char *iv,
 }
 
 
-static ASM_FUNC_ATTR_INLINE void
-aesni_ocb_checksum (gcry_cipher_hd_t c, const unsigned char *plaintext,
-		    size_t nblocks)
-{
-  RIJNDAEL_context *ctx = (void *)&c->context.c;
-
-  /* Calculate checksum */
-  asm volatile ("movdqu %[checksum], %%xmm6\n\t"
-                "pxor %%xmm1, %%xmm1\n\t"
-                "pxor %%xmm2, %%xmm2\n\t"
-                "pxor %%xmm3, %%xmm3\n\t"
-                :
-                :[checksum] "m" (*c->u_ctr.ctr)
-                : "memory" );
-
-  if (0) {}
-#if defined(HAVE_GCC_INLINE_ASM_AVX2)
-  else if (nblocks >= 16 && ctx->use_avx2)
-    {
-      /* Use wider 256-bit registers for fast xoring of plaintext. */
-      asm volatile ("vzeroupper\n\t"
-		    "vpxor %%xmm0, %%xmm0, %%xmm0\n\t"
-		    "vpxor %%xmm4, %%xmm4, %%xmm4\n\t"
-		    "vpxor %%xmm5, %%xmm5, %%xmm5\n\t"
-		    "vpxor %%xmm7, %%xmm7, %%xmm7\n\t"
-                    :
-                    :
-                    : "memory");
-
-      for (;nblocks >= 16; nblocks -= 16)
-	{
-	  asm volatile ("vpxor %[ptr0], %%ymm6, %%ymm6\n\t"
-			"vpxor %[ptr1], %%ymm1, %%ymm1\n\t"
-			"vpxor %[ptr2], %%ymm2, %%ymm2\n\t"
-			"vpxor %[ptr3], %%ymm3, %%ymm3\n\t"
-			:
-			: [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE * 2)),
-			  [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE * 2)),
-			  [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE * 2)),
-			  [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE * 2))
-			: "memory" );
-	  asm volatile ("vpxor %[ptr4], %%ymm0, %%ymm0\n\t"
-			"vpxor %[ptr5], %%ymm4, %%ymm4\n\t"
-			"vpxor %[ptr6], %%ymm5, %%ymm5\n\t"
-			"vpxor %[ptr7], %%ymm7, %%ymm7\n\t"
-			:
-			: [ptr4] "m" (*(plaintext + 4 * BLOCKSIZE * 2)),
-			  [ptr5] "m" (*(plaintext + 5 * BLOCKSIZE * 2)),
-			  [ptr6] "m" (*(plaintext + 6 * BLOCKSIZE * 2)),
-			  [ptr7] "m" (*(plaintext + 7 * BLOCKSIZE * 2))
-			: "memory" );
-	  plaintext += BLOCKSIZE * 16;
-	}
-
-      asm volatile ("vpxor %%ymm0, %%ymm6, %%ymm6\n\t"
-		    "vpxor %%ymm4, %%ymm1, %%ymm1\n\t"
-		    "vpxor %%ymm5, %%ymm2, %%ymm2\n\t"
-		    "vpxor %%ymm7, %%ymm3, %%ymm3\n\t"
-		    "vextracti128 $1, %%ymm6, %%xmm0\n\t"
-		    "vextracti128 $1, %%ymm1, %%xmm4\n\t"
-		    "vextracti128 $1, %%ymm2, %%xmm5\n\t"
-		    "vextracti128 $1, %%ymm3, %%xmm7\n\t"
-		    "vpxor %%xmm0, %%xmm6, %%xmm6\n\t"
-		    "vpxor %%xmm4, %%xmm1, %%xmm1\n\t"
-		    "vpxor %%xmm5, %%xmm2, %%xmm2\n\t"
-		    "vpxor %%xmm7, %%xmm3, %%xmm3\n\t"
-		    "vzeroupper\n\t"
-		    :
-		    :
-		    : "memory" );
-    }
-#endif
-#if defined(HAVE_GCC_INLINE_ASM_AVX)
-  else if (nblocks >= 16 && ctx->use_avx)
-    {
-      /* Same as AVX2, except using 256-bit floating point instructions. */
-      asm volatile ("vzeroupper\n\t"
-		    "vxorpd %%xmm0, %%xmm0, %%xmm0\n\t"
-		    "vxorpd %%xmm4, %%xmm4, %%xmm4\n\t"
-		    "vxorpd %%xmm5, %%xmm5, %%xmm5\n\t"
-		    "vxorpd %%xmm7, %%xmm7, %%xmm7\n\t"
-                    :
-                    :
-                    : "memory");
-
-      for (;nblocks >= 16; nblocks -= 16)
-	{
-	  asm volatile ("vxorpd %[ptr0], %%ymm6, %%ymm6\n\t"
-			"vxorpd %[ptr1], %%ymm1, %%ymm1\n\t"
-			"vxorpd %[ptr2], %%ymm2, %%ymm2\n\t"
-			"vxorpd %[ptr3], %%ymm3, %%ymm3\n\t"
-			:
-			: [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE * 2)),
-			  [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE * 2)),
-			  [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE * 2)),
-			  [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE * 2))
-			: "memory" );
-	  asm volatile ("vxorpd %[ptr4], %%ymm0, %%ymm0\n\t"
-			"vxorpd %[ptr5], %%ymm4, %%ymm4\n\t"
-			"vxorpd %[ptr6], %%ymm5, %%ymm5\n\t"
-			"vxorpd %[ptr7], %%ymm7, %%ymm7\n\t"
-			:
-			: [ptr4] "m" (*(plaintext + 4 * BLOCKSIZE * 2)),
-			  [ptr5] "m" (*(plaintext + 5 * BLOCKSIZE * 2)),
-			  [ptr6] "m" (*(plaintext + 6 * BLOCKSIZE * 2)),
-			  [ptr7] "m" (*(plaintext + 7 * BLOCKSIZE * 2))
-			: "memory" );
-	  plaintext += BLOCKSIZE * 16;
-	}
-
-      asm volatile ("vxorpd %%ymm0, %%ymm6, %%ymm6\n\t"
-		    "vxorpd %%ymm4, %%ymm1, %%ymm1\n\t"
-		    "vxorpd %%ymm5, %%ymm2, %%ymm2\n\t"
-		    "vxorpd %%ymm7, %%ymm3, %%ymm3\n\t"
-		    "vextractf128 $1, %%ymm6, %%xmm0\n\t"
-		    "vextractf128 $1, %%ymm1, %%xmm4\n\t"
-		    "vextractf128 $1, %%ymm2, %%xmm5\n\t"
-		    "vextractf128 $1, %%ymm3, %%xmm7\n\t"
-		    "vxorpd %%xmm0, %%xmm6, %%xmm6\n\t"
-		    "vxorpd %%xmm4, %%xmm1, %%xmm1\n\t"
-		    "vxorpd %%xmm5, %%xmm2, %%xmm2\n\t"
-		    "vxorpd %%xmm7, %%xmm3, %%xmm3\n\t"
-		    "vzeroupper\n\t"
-		    :
-		    :
-		    : "memory" );
-    }
-#endif
-
-  for (;nblocks >= 4; nblocks -= 4)
-    {
-      asm volatile ("movdqu %[ptr0], %%xmm0\n\t"
-		    "movdqu %[ptr1], %%xmm4\n\t"
-		    "movdqu %[ptr2], %%xmm5\n\t"
-		    "movdqu %[ptr3], %%xmm7\n\t"
-		    "pxor %%xmm0, %%xmm6\n\t"
-		    "pxor %%xmm4, %%xmm1\n\t"
-		    "pxor %%xmm5, %%xmm2\n\t"
-		    "pxor %%xmm7, %%xmm3\n\t"
-		    :
-		    : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE)),
-		      [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE)),
-		      [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE)),
-		      [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE))
-		    : "memory" );
-      plaintext += BLOCKSIZE * 4;
-    }
-
-  for (;nblocks >= 1; nblocks -= 1)
-    {
-      asm volatile ("movdqu %[ptr0], %%xmm0\n\t"
-		    "pxor %%xmm0, %%xmm6\n\t"
-		    :
-		    : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE))
-		    : "memory" );
-      plaintext += BLOCKSIZE;
-    }
-
-  asm volatile ("pxor %%xmm1, %%xmm6\n\t"
-		"pxor %%xmm2, %%xmm6\n\t"
-		"pxor %%xmm3, %%xmm6\n\t"
-		"movdqu %%xmm6, %[checksum]\n\t"
-		: [checksum] "=m" (*c->u_ctr.ctr)
-		:
-		: "memory" );
-}
-
-
 static unsigned int ASM_FUNC_ATTR_NOINLINE
 aesni_ocb_enc (gcry_cipher_hd_t c, void *outbuf_arg,
                const void *inbuf_arg, size_t nblocks)
@@ -3401,9 +3233,11 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
   /* Preload Offset */
   asm volatile ("movdqu %[iv], %%xmm5\n\t"
-                : /* No output */
-                : [iv] "m" (*c->u_iv.iv)
-                : "memory" );
+		"movdqu %[ctr], %%xmm7\n\t"
+		: /* No output */
+		: [iv] "m" (*c->u_iv.iv),
+		  [ctr] "m" (*c->u_ctr.ctr)
+		: "memory" );
 
   for ( ;nblocks && n % 4; nblocks-- )
     {
@@ -3424,6 +3258,7 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
       asm volatile ("pxor   %%xmm5, %%xmm0\n\t"
                     "movdqu %%xmm0, %[outbuf]\n\t"
+		    "pxor   %%xmm0, %%xmm7\n\t"
                     : [outbuf] "=m" (*outbuf)
                     :
                     : "memory" );
@@ -3452,6 +3287,15 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 		    "pxor %[first_key], %%xmm5\n\t"
 		    "pxor %[first_key], %%xmm0\n\t"
 		    "movdqa %%xmm0, %[lxfkey]\n\t"
+		    /* Clear plaintext blocks */
+		    "pxor   %%xmm1,    %%xmm1\n\t"
+		    "pxor   %%xmm2,    %%xmm2\n\t"
+		    "pxor   %%xmm3,    %%xmm3\n\t"
+		    "pxor   %%xmm4,    %%xmm4\n\t"
+		    "pxor   %%xmm8,    %%xmm8\n\t"
+		    "pxor   %%xmm9,    %%xmm9\n\t"
+		    "pxor   %%xmm10,   %%xmm10\n\t"
+		    "pxor   %%xmm11,   %%xmm11\n\t"
 		    : [lxfkey] "=m" (*lxf_key)
 		    : [l0] "m" (*c->u_mode.ocb.L[0]),
 		      [last_key] "m" (ctx->keyschdec[ctx->rounds][0][0]),
@@ -3463,7 +3307,9 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 	  n += 4;
 	  l = aes_ocb_get_l(c, n);
 
-	  asm volatile ("movdqu %[l0l1],   %%xmm10\n\t"
+	  asm volatile ("pxor   %%xmm10,   %%xmm1\n\t"
+			"pxor   %%xmm11,   %%xmm2\n\t"
+			"movdqu %[l0l1],   %%xmm10\n\t"
 			"movdqu %[l1],     %%xmm11\n\t"
 			"movdqu %[l3],     %%xmm15\n\t"
 			:
@@ -3477,7 +3323,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
 	  /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
 	  /* P_i = Offset_i xor ENCIPHER(K, C_i xor Offset_i)  */
-	  asm volatile ("movdqu %[inbuf0], %%xmm1\n\t"
+	  asm volatile ("pxor   %%xmm1,    %%xmm4\n\t"
+			"pxor   %%xmm2,    %%xmm8\n\t"
+			"pxor   %%xmm3,    %%xmm9\n\t"
+			"movdqu %[inbuf0], %%xmm1\n\t"
 			"movdqu %[inbuf1], %%xmm2\n\t"
 			"movdqu %[inbuf2], %%xmm3\n\t"
 			:
@@ -3485,8 +3334,11 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 			  [inbuf1] "m" (*(inbuf + 1 * BLOCKSIZE)),
 			  [inbuf2] "m" (*(inbuf + 2 * BLOCKSIZE))
 			: "memory" );
-	  asm volatile ("movdqu %[inbuf3], %%xmm4\n\t"
+	  asm volatile ("pxor   %%xmm4,    %%xmm7\n\t"
+			"movdqu %[inbuf3], %%xmm4\n\t"
+			"pxor   %%xmm8,    %%xmm7\n\t"
 			"movdqu %[inbuf4], %%xmm8\n\t"
+			"pxor   %%xmm9,    %%xmm7\n\t"
 			"movdqu %[inbuf5], %%xmm9\n\t"
 			:
 			: [inbuf3] "m" (*(inbuf + 3 * BLOCKSIZE)),
@@ -3722,6 +3574,15 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
       asm volatile ("pxor %[first_key], %%xmm5\n\t"
 		    "pxor %%xmm0, %%xmm0\n\t"
 		    "movdqu %%xmm0, %[lxfkey]\n\t"
+		    /* Add plaintext blocks to checksum */
+		    "pxor   %%xmm1,    %%xmm2\n\t"
+		    "pxor   %%xmm3,    %%xmm4\n\t"
+		    "pxor   %%xmm9,    %%xmm8\n\t"
+		    "pxor   %%xmm11,   %%xmm10\n\t"
+		    "pxor   %%xmm2,    %%xmm4\n\t"
+		    "pxor   %%xmm8,    %%xmm10\n\t"
+		    "pxor   %%xmm4,    %%xmm7\n\t"
+		    "pxor   %%xmm10,   %%xmm7\n\t"
 		    : [lxfkey] "=m" (*lxf_key)
 		    : [first_key] "m" (ctx->keyschdec[0][0][0])
 		    : "memory" );
@@ -3782,8 +3643,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
       asm volatile ("pxor   %[tmpbuf0],%%xmm1\n\t"
 		    "movdqu %%xmm1,    %[outbuf0]\n\t"
+		    "pxor   %%xmm1,    %%xmm7\n\t"
 		    "pxor   %[tmpbuf1],%%xmm2\n\t"
 		    "movdqu %%xmm2,    %[outbuf1]\n\t"
+		    "pxor   %%xmm2,    %%xmm7\n\t"
 		    : [outbuf0] "=m" (*(outbuf + 0 * BLOCKSIZE)),
 		      [outbuf1] "=m" (*(outbuf + 1 * BLOCKSIZE))
 		    : [tmpbuf0] "m" (*(tmpbuf + 0 * BLOCKSIZE)),
@@ -3791,8 +3654,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 		    : "memory" );
       asm volatile ("pxor   %[tmpbuf2],%%xmm3\n\t"
 		    "movdqu %%xmm3,    %[outbuf2]\n\t"
+		    "pxor   %%xmm3,    %%xmm7\n\t"
 		    "pxor   %%xmm5,    %%xmm4\n\t"
 		    "movdqu %%xmm4,    %[outbuf3]\n\t"
+		    "pxor   %%xmm4,    %%xmm7\n\t"
 		    : [outbuf2] "=m" (*(outbuf + 2 * BLOCKSIZE)),
 		      [outbuf3] "=m" (*(outbuf + 3 * BLOCKSIZE))
 		    : [tmpbuf2] "m" (*(tmpbuf + 2 * BLOCKSIZE))
@@ -3822,6 +3687,7 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
       asm volatile ("pxor   %%xmm5, %%xmm0\n\t"
                     "movdqu %%xmm0, %[outbuf]\n\t"
+		    "pxor   %%xmm0, %%xmm7\n\t"
                     : [outbuf] "=m" (*outbuf)
                     :
                     : "memory" );
@@ -3832,7 +3698,9 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
 
   c->u_mode.ocb.data_nblocks = n;
   asm volatile ("movdqu %%xmm5, %[iv]\n\t"
-                : [iv] "=m" (*c->u_iv.iv)
+                "movdqu %%xmm7, %[ctr]\n\t"
+		: [iv] "=m" (*c->u_iv.iv),
+		  [ctr] "=m" (*c->u_ctr.ctr)
                 :
                 : "memory" );
 
@@ -3846,8 +3714,6 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg,
                 :
                 : "memory" );
 
-  aesni_ocb_checksum (c, outbuf_arg, nblocks_arg);
-
   aesni_cleanup ();
   aesni_cleanup_2_7 ();
 
-- 
2.39.2


From jussi.kivilinna at iki.fi  Sun May 28 16:54:04 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 28 May 2023 17:54:04 +0300
Subject: [PATCH] serpent: add x86/AVX512 implementation
Message-ID: <20230528145404.532462-1-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add `serpent-avx512-x86.c`; Add extra CFLAG
handling for `serpent-avx512-x86.o` and `serpent-avx512-x86.lo`.
* cipher/serpent-avx512-x86.c: New.
* cipher/serpent.c (USE_AVX512): New.
(serpent_context_t): Add `use_avx512`.
[USE_AVX512] (_gcry_serpent_avx512_cbc_dec)
(_gcry_serpent_avx512_cfb_dec, _gcry_serpent_avx512_ctr_enc)
(_gcry_serpent_avx512_ocb_crypt, _gcry_serpent_avx512_blk32): New.
(serpent_setkey_internal) [USE_AVX512]: Set `use_avx512` is
AVX512 HW available.
(_gcry_serpent_ctr_enc) [USE_AVX512]: New.
(_gcry_serpent_cbc_dec) [USE_AVX512]: New.
(_gcry_serpent_cfb_dec) [USE_AVX512]: New.
(_gcry_serpent_ocb_crypt) [USE_AVX512]: New.
(serpent_crypt_blk1_16): Rename to...
(serpent_crypt_blk1_32): ... this; Add AVX512 code-path; Adjust for
increase from max 16 blocks to max 32 blocks.
(serpent_encrypt_blk1_16): Rename to ...
(serpent_encrypt_blk1_32): ... this.
(serpent_decrypt_blk1_16): Rename to ...
(serpent_decrypt_blk1_32): ... this.
(_gcry_serpent_xts_crypt, _gcry_serpent_ecb_crypt): Increase bulk
block count from 16 to 32.
* configure.ac (gcry_cv_cc_x86_avx512_intrinsics)
(ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS): New.
(GCRYPT_ASM_CIPHERS): Add `serpent-avx512-x86.lo`.
--

Benchmark on AMD Ryzen 9 7900X:

Before:
Cipher:
 SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      1.52 ns/B     626.2 MiB/s      8.26 c/B      5425
        ECB dec |      1.48 ns/B     645.5 MiB/s      8.01 c/B      5425
        CBC enc |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        CBC dec |     0.722 ns/B      1322 MiB/s      3.91 c/B      5425
        CFB enc |      5.88 ns/B     162.3 MiB/s     32.31 c/B      5500
        CFB dec |     0.735 ns/B      1297 MiB/s      3.99 c/B      5424
        OFB enc |      5.77 ns/B     165.3 MiB/s     31.72 c/B      5500
        OFB dec |      5.77 ns/B     165.4 MiB/s     31.72 c/B      5500
        CTR enc |     0.756 ns/B      1262 MiB/s      4.10 c/B      5425
        CTR dec |     0.776 ns/B      1228 MiB/s      4.21 c/B      5424
        XTS enc |      1.68 ns/B     568.3 MiB/s      9.10 c/B      5424
        XTS dec |      1.58 ns/B     604.2 MiB/s      8.56 c/B      5425
        CCM enc |      6.60 ns/B     144.5 MiB/s     36.30 c/B      5500
        CCM dec |      6.60 ns/B     144.5 MiB/s     36.30 c/B      5500
       CCM auth |      5.86 ns/B     162.6 MiB/s     32.25 c/B      5500
        EAX enc |      6.54 ns/B     145.8 MiB/s     35.98 c/B      5500
        EAX dec |      6.54 ns/B     145.8 MiB/s     35.98 c/B      5500
       EAX auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        GCM enc |     0.787 ns/B      1212 MiB/s      4.27 c/B      5425
        GCM dec |     0.788 ns/B      1211 MiB/s      4.27 c/B      5425
       GCM auth |     0.038 ns/B     24932 MiB/s     0.210 c/B      5500
        OCB enc |     0.750 ns/B      1272 MiB/s      4.07 c/B      5424
        OCB dec |     0.743 ns/B      1284 MiB/s      4.03 c/B      5425
       OCB auth |     0.749 ns/B      1274 MiB/s      4.06 c/B      5425
        SIV enc |      6.54 ns/B     145.8 MiB/s     35.99 c/B      5500
        SIV dec |      6.55 ns/B     145.7 MiB/s     36.01 c/B      5500
       SIV auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
    GCM-SIV enc |      5.63 ns/B     169.4 MiB/s     30.97 c/B      5500
    GCM-SIV dec |      5.64 ns/B     169.2 MiB/s     31.00 c/B      5500
   GCM-SIV auth |     0.038 ns/B     25201 MiB/s     0.208 c/B      5500

After:
 SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.578 ns/B      1649 MiB/s      3.14 c/B      5425
        ECB dec |     0.505 ns/B      1889 MiB/s      2.74 c/B      5424
        CBC enc |      5.81 ns/B     164.1 MiB/s     31.96 c/B      5500
        CBC dec |     0.527 ns/B      1810 MiB/s      2.86 c/B      5424
        CFB enc |      5.88 ns/B     162.3 MiB/s     32.31 c/B      5500
        CFB dec |     0.471 ns/B      2026 MiB/s      2.55 c/B      5425
        OFB enc |      5.77 ns/B     165.3 MiB/s     31.72 c/B      5500
        OFB dec |      5.77 ns/B     165.3 MiB/s     31.73 c/B      5501
        CTR enc |     0.464 ns/B      2053 MiB/s      2.52 c/B      5425
        CTR dec |     0.464 ns/B      2057 MiB/s      2.51 c/B      5425
        XTS enc |     0.551 ns/B      1732 MiB/s      2.99 c/B      5424
        XTS dec |     0.527 ns/B      1809 MiB/s      2.86 c/B      5424
        CCM enc |      6.32 ns/B     150.8 MiB/s     34.78 c/B      5501
        CCM dec |      6.32 ns/B     150.9 MiB/s     34.77 c/B      5500
       CCM auth |      5.86 ns/B     162.6 MiB/s     32.25 c/B      5500
        EAX enc |      6.26 ns/B     152.2 MiB/s     34.46 c/B      5500
        EAX dec |      6.27 ns/B     152.2 MiB/s     34.46 c/B      5500
       EAX auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        GCM enc |     0.497 ns/B      1917 MiB/s      2.70 c/B      5425
        GCM dec |     0.499 ns/B      1913 MiB/s      2.70 c/B      5425
       GCM auth |     0.031 ns/B     30709 MiB/s     0.171 c/B      5500
        OCB enc |     0.482 ns/B      1979 MiB/s      2.61 c/B      5424
        OCB dec |     0.475 ns/B      2007 MiB/s      2.58 c/B      5424
       OCB auth |     0.748 ns/B      1274 MiB/s      4.06 c/B      5424
        SIV enc |      6.27 ns/B     152.0 MiB/s     34.50 c/B      5500
        SIV dec |      6.27 ns/B     152.1 MiB/s     34.48 c/B      5500
       SIV auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
    GCM-SIV enc |      5.63 ns/B     169.5 MiB/s     30.95 c/B      5500
    GCM-SIV dec |      5.63 ns/B     169.3 MiB/s     30.98 c/B      5500
   GCM-SIV auth |     0.034 ns/B     28060 MiB/s     0.187 c/B      5500

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am          |  17 +-
 cipher/serpent-avx2-amd64.S |   4 +-
 cipher/serpent-avx512-x86.c | 994 ++++++++++++++++++++++++++++++++++++
 cipher/serpent.c            | 218 +++++++-
 configure.ac                |  45 ++
 5 files changed, 1257 insertions(+), 21 deletions(-)
 create mode 100644 cipher/serpent-avx512-x86.c

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index e67b1ee2..8c7ec095 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -119,12 +119,12 @@ EXTRA_libcipher_la_SOURCES = \
 	salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \
 	scrypt.c \
 	seed.c \
-	serpent.c serpent-sse2-amd64.S \
+	serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \
+	serpent-avx512-x86.c serpent-armv7-neon.S \
 	sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \
 	sm4-gfni-avx2-amd64.S sm4-gfni-avx512-amd64.S \
 	sm4-aarch64.S sm4-armv8-aarch64-ce.S sm4-armv9-aarch64-sve-ce.S \
 	sm4-ppc.c \
-	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
 	sha1-armv8-aarch64-ce.S sha1-intel-shaext.c \
@@ -316,3 +316,16 @@ sm4-ppc.o: $(srcdir)/sm4-ppc.c Makefile
 
 sm4-ppc.lo: $(srcdir)/sm4-ppc.c Makefile
 	`echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+
+if ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS
+avx512f_cflags = -mavx512f
+else
+avx512f_cflags =
+endif
+
+serpent-avx512-x86.o: $(srcdir)/serpent-avx512-x86.c Makefile
+	`echo $(COMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) `
+
+serpent-avx512-x86.lo: $(srcdir)/serpent-avx512-x86.c Makefile
+	`echo $(LTCOMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) `
diff --git a/cipher/serpent-avx2-amd64.S b/cipher/serpent-avx2-amd64.S
index e25e7d3b..7aba235f 100644
--- a/cipher/serpent-avx2-amd64.S
+++ b/cipher/serpent-avx2-amd64.S
@@ -589,8 +589,8 @@ ELF(.type   _gcry_serpent_avx2_blk16, at function;)
 _gcry_serpent_avx2_blk16:
 	/* input:
 	 *	%rdi: ctx, CTX
-	 *	%rsi: dst (8 blocks)
-	 *	%rdx: src (8 blocks)
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
 	 *	%ecx: encrypt
 	 */
 	CFI_STARTPROC();
diff --git a/cipher/serpent-avx512-x86.c b/cipher/serpent-avx512-x86.c
new file mode 100644
index 00000000..762c09e1
--- /dev/null
+++ b/cipher/serpent-avx512-x86.c
@@ -0,0 +1,994 @@
+/* serpent-avx512-x86.c  -  AVX512 implementation of Serpent cipher
+ *
+ * Copyright (C) 2023 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(__x86_64) || defined(__i386)
+#if defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS) && \
+    defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT)
+
+#include <immintrin.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "g10lib.h"
+#include "types.h"
+#include "cipher.h"
+#include "bithelp.h"
+#include "bufhelp.h"
+#include "cipher-internal.h"
+#include "bulkhelp.h"
+
+#define ALWAYS_INLINE inline __attribute__((always_inline))
+#define NO_INLINE __attribute__((noinline))
+
+/* Number of rounds per Serpent encrypt/decrypt operation.  */
+#define ROUNDS 32
+
+/* Serpent works on 128 bit blocks.  */
+typedef unsigned int serpent_block_t[4];
+
+/* The key schedule consists of 33 128 bit subkeys.  */
+typedef unsigned int serpent_subkeys_t[ROUNDS + 1][4];
+
+#define vpunpckhdq(a, b, o)  ((o) = _mm512_unpackhi_epi32((b), (a)))
+#define vpunpckldq(a, b, o)  ((o) = _mm512_unpacklo_epi32((b), (a)))
+#define vpunpckhqdq(a, b, o) ((o) = _mm512_unpackhi_epi64((b), (a)))
+#define vpunpcklqdq(a, b, o) ((o) = _mm512_unpacklo_epi64((b), (a)))
+
+#define vpbroadcastd(v) _mm512_set1_epi32(v)
+
+#define vrol(x, s) _mm512_rol_epi32((x), (s))
+#define vror(x, s) _mm512_ror_epi32((x), (s))
+#define vshl(x, s) _mm512_slli_epi32((x), (s))
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
+	vpunpckhdq(x1, x0, t2); \
+	vpunpckldq(x1, x0, x0); \
+	\
+	vpunpckldq(x3, x2, t1); \
+	vpunpckhdq(x3, x2, x2); \
+	\
+	vpunpckhqdq(t1, x0, x1); \
+	vpunpcklqdq(t1, x0, x0); \
+	\
+	vpunpckhqdq(x2, t2, x3); \
+	vpunpcklqdq(x2, t2, x2);
+
+/*
+ * These are the S-Boxes of Serpent from following research paper.
+ *
+ *  D. A. Osvik, ?Speeding up Serpent,? in Third AES Candidate Conference,
+ *   (New York, New York, USA), p. 317?329, National Institute of Standards and
+ *   Technology, 2000.
+ *
+ * Paper is also available at: http://www.ii.uib.no/~osvik/pub/aes3.pdf
+ *
+ * --
+ *
+ * Following logic gets heavily optimized by compiler to use AVX512F
+ * 'vpternlogq' instruction. This gives higher performance increase than
+ * would be expected from simple wideing of vectors from AVX2/256bit to
+ * AVX512/512bit.
+ *
+ */
+
+#define SBOX0(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r3 ^= r0; r4 =  r1; \
+    r1 &= r3; r4 ^= r2; \
+    r1 ^= r0; r0 |= r3; \
+    r0 ^= r4; r4 ^= r3; \
+    r3 ^= r2; r2 |= r1; \
+    r2 ^= r4; r4 = ~r4; \
+    r4 |= r1; r1 ^= r3; \
+    r1 ^= r4; r3 |= r0; \
+    r1 ^= r3; r4 ^= r3; \
+    \
+    w = r1; x = r4; y = r2; z = r0; \
+  }
+
+#define SBOX0_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 = ~r2; r4 =  r1; \
+    r1 |= r0; r4 = ~r4; \
+    r1 ^= r2; r2 |= r4; \
+    r1 ^= r3; r0 ^= r4; \
+    r2 ^= r0; r0 &= r3; \
+    r4 ^= r0; r0 |= r1; \
+    r0 ^= r2; r3 ^= r4; \
+    r2 ^= r1; r3 ^= r0; \
+    r3 ^= r1; \
+    r2 &= r3; \
+    r4 ^= r2; \
+    \
+    w = r0; x = r4; y = r1; z = r3; \
+  }
+
+#define SBOX1(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 = ~r0; r2 = ~r2; \
+    r4 =  r0; r0 &= r1; \
+    r2 ^= r0; r0 |= r3; \
+    r3 ^= r2; r1 ^= r0; \
+    r0 ^= r4; r4 |= r1; \
+    r1 ^= r3; r2 |= r0; \
+    r2 &= r4; r0 ^= r1; \
+    r1 &= r2; \
+    r1 ^= r0; r0 &= r2; \
+    r0 ^= r4; \
+    \
+    w = r2; x = r0; y = r3; z = r1; \
+  }
+
+#define SBOX1_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r1; r1 ^= r3; \
+    r3 &= r1; r4 ^= r2; \
+    r3 ^= r0; r0 |= r1; \
+    r2 ^= r3; r0 ^= r4; \
+    r0 |= r2; r1 ^= r3; \
+    r0 ^= r1; r1 |= r3; \
+    r1 ^= r0; r4 = ~r4; \
+    r4 ^= r1; r1 |= r0; \
+    r1 ^= r0; \
+    r1 |= r4; \
+    r3 ^= r1; \
+    \
+    w = r4; x = r0; y = r3; z = r2; \
+  }
+
+#define SBOX2(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r0; r0 &= r2; \
+    r0 ^= r3; r2 ^= r1; \
+    r2 ^= r0; r3 |= r4; \
+    r3 ^= r1; r4 ^= r2; \
+    r1 =  r3; r3 |= r4; \
+    r3 ^= r0; r0 &= r1; \
+    r4 ^= r0; r1 ^= r3; \
+    r1 ^= r4; r4 = ~r4; \
+    \
+    w = r2; x = r3; y = r1; z = r4; \
+  }
+
+#define SBOX2_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 ^= r3; r3 ^= r0; \
+    r4 =  r3; r3 &= r2; \
+    r3 ^= r1; r1 |= r2; \
+    r1 ^= r4; r4 &= r3; \
+    r2 ^= r3; r4 &= r0; \
+    r4 ^= r2; r2 &= r1; \
+    r2 |= r0; r3 = ~r3; \
+    r2 ^= r3; r0 ^= r3; \
+    r0 &= r1; r3 ^= r4; \
+    r3 ^= r0; \
+    \
+    w = r1; x = r4; y = r2; z = r3; \
+  }
+
+#define SBOX3(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r0; r0 |= r3; \
+    r3 ^= r1; r1 &= r4; \
+    r4 ^= r2; r2 ^= r3; \
+    r3 &= r0; r4 |= r1; \
+    r3 ^= r4; r0 ^= r1; \
+    r4 &= r0; r1 ^= r3; \
+    r4 ^= r2; r1 |= r0; \
+    r1 ^= r2; r0 ^= r3; \
+    r2 =  r1; r1 |= r3; \
+    r1 ^= r0; \
+    \
+    w = r1; x = r2; y = r3; z = r4; \
+  }
+
+#define SBOX3_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 ^= r1; \
+    r0 ^= r2; r4 &= r2; \
+    r4 ^= r0; r0 &= r1; \
+    r1 ^= r3; r3 |= r4; \
+    r2 ^= r3; r0 ^= r3; \
+    r1 ^= r4; r3 &= r2; \
+    r3 ^= r1; r1 ^= r0; \
+    r1 |= r2; r0 ^= r3; \
+    r1 ^= r4; \
+    r0 ^= r1; \
+    \
+    w = r2; x = r1; y = r3; z = r0; \
+  }
+
+#define SBOX4(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r1 ^= r3; r3 = ~r3; \
+    r2 ^= r3; r3 ^= r0; \
+    r4 =  r1; r1 &= r3; \
+    r1 ^= r2; r4 ^= r3; \
+    r0 ^= r4; r2 &= r4; \
+    r2 ^= r0; r0 &= r1; \
+    r3 ^= r0; r4 |= r1; \
+    r4 ^= r0; r0 |= r3; \
+    r0 ^= r2; r2 &= r3; \
+    r0 = ~r0; r4 ^= r2; \
+    \
+    w = r1; x = r4; y = r0; z = r3; \
+  }
+
+#define SBOX4_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 &= r3; \
+    r2 ^= r1; r1 |= r3; \
+    r1 &= r0; r4 ^= r2; \
+    r4 ^= r1; r1 &= r2; \
+    r0 = ~r0; r3 ^= r4; \
+    r1 ^= r3; r3 &= r0; \
+    r3 ^= r2; r0 ^= r1; \
+    r2 &= r0; r3 ^= r0; \
+    r2 ^= r4; \
+    r2 |= r3; r3 ^= r0; \
+    r2 ^= r1; \
+    \
+    w = r0; x = r3; y = r2; z = r4; \
+  }
+
+#define SBOX5(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 ^= r1; r1 ^= r3; \
+    r3 = ~r3; r4 =  r1; \
+    r1 &= r0; r2 ^= r3; \
+    r1 ^= r2; r2 |= r4; \
+    r4 ^= r3; r3 &= r1; \
+    r3 ^= r0; r4 ^= r1; \
+    r4 ^= r2; r2 ^= r0; \
+    r0 &= r3; r2 = ~r2; \
+    r0 ^= r4; r4 |= r3; \
+    r2 ^= r4; \
+    \
+    w = r1; x = r3; y = r0; z = r2; \
+  }
+
+#define SBOX5_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r1 = ~r1; r4 =  r3; \
+    r2 ^= r1; r3 |= r0; \
+    r3 ^= r2; r2 |= r1; \
+    r2 &= r0; r4 ^= r3; \
+    r2 ^= r4; r4 |= r0; \
+    r4 ^= r1; r1 &= r2; \
+    r1 ^= r3; r4 ^= r2; \
+    r3 &= r4; r4 ^= r1; \
+    r3 ^= r4; r4 = ~r4; \
+    r3 ^= r0; \
+    \
+    w = r1; x = r4; y = r3; z = r2; \
+  }
+
+#define SBOX6(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 = ~r2; r4 =  r3; \
+    r3 &= r0; r0 ^= r4; \
+    r3 ^= r2; r2 |= r4; \
+    r1 ^= r3; r2 ^= r0; \
+    r0 |= r1; r2 ^= r1; \
+    r4 ^= r0; r0 |= r3; \
+    r0 ^= r2; r4 ^= r3; \
+    r4 ^= r0; r3 = ~r3; \
+    r2 &= r4; \
+    r2 ^= r3; \
+    \
+    w = r0; x = r1; y = r4; z = r2; \
+  }
+
+#define SBOX6_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 ^= r2; r4 =  r2; \
+    r2 &= r0; r4 ^= r3; \
+    r2 = ~r2; r3 ^= r1; \
+    r2 ^= r3; r4 |= r0; \
+    r0 ^= r2; r3 ^= r4; \
+    r4 ^= r1; r1 &= r3; \
+    r1 ^= r0; r0 ^= r3; \
+    r0 |= r2; r3 ^= r1; \
+    r4 ^= r0; \
+    \
+    w = r1; x = r2; y = r4; z = r3; \
+  }
+
+#define SBOX7(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r1; r1 |= r2; \
+    r1 ^= r3; r4 ^= r2; \
+    r2 ^= r1; r3 |= r4; \
+    r3 &= r0; r4 ^= r2; \
+    r3 ^= r1; r1 |= r4; \
+    r1 ^= r0; r0 |= r4; \
+    r0 ^= r2; r1 ^= r4; \
+    r2 ^= r1; r1 &= r0; \
+    r1 ^= r4; r2 = ~r2; \
+    r2 |= r0; \
+    r4 ^= r2; \
+    \
+    w = r4; x = r3; y = r1; z = r0; \
+  }
+
+#define SBOX7_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 ^= r0; \
+    r0 &= r3; r4 |= r3; \
+    r2 = ~r2; r3 ^= r1; \
+    r1 |= r0; r0 ^= r2; \
+    r2 &= r4; r3 &= r4; \
+    r1 ^= r2; r2 ^= r0; \
+    r0 |= r2; r4 ^= r1; \
+    r0 ^= r3; r3 ^= r4; \
+    r4 |= r0; r3 ^= r2; \
+    r4 ^= r2; \
+    \
+    w = r3; x = r0; y = r1; z = r4; \
+  }
+
+/* XOR BLOCK1 into BLOCK0.  */
+#define BLOCK_XOR_KEY(block0, rkey)     \
+  {                                     \
+    block0[0] ^= vpbroadcastd(rkey[0]); \
+    block0[1] ^= vpbroadcastd(rkey[1]); \
+    block0[2] ^= vpbroadcastd(rkey[2]); \
+    block0[3] ^= vpbroadcastd(rkey[3]); \
+  }
+
+/* Copy BLOCK_SRC to BLOCK_DST.  */
+#define BLOCK_COPY(block_dst, block_src) \
+  {                                      \
+    block_dst[0] = block_src[0];         \
+    block_dst[1] = block_src[1];         \
+    block_dst[2] = block_src[2];         \
+    block_dst[3] = block_src[3];         \
+  }
+
+/* Apply SBOX number WHICH to to the block found in ARRAY0, writing
+   the output to the block found in ARRAY1.  */
+#define SBOX(which, array0, array1)                         \
+  SBOX##which (array0[0], array0[1], array0[2], array0[3],  \
+               array1[0], array1[1], array1[2], array1[3]);
+
+/* Apply inverse SBOX number WHICH to to the block found in ARRAY0, writing
+   the output to the block found in ARRAY1.  */
+#define SBOX_INVERSE(which, array0, array1)                           \
+  SBOX##which##_INVERSE (array0[0], array0[1], array0[2], array0[3],  \
+                         array1[0], array1[1], array1[2], array1[3]);
+
+/* Apply the linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION(block)                    \
+  {                                                     \
+    block[0] = vrol (block[0], 13);                     \
+    block[2] = vrol (block[2], 3);                      \
+    block[1] = block[1] ^ block[0] ^ block[2];          \
+    block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \
+    block[1] = vrol (block[1], 1);                      \
+    block[3] = vrol (block[3], 7);                      \
+    block[0] = block[0] ^ block[1] ^ block[3];          \
+    block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \
+    block[0] = vrol (block[0], 5);                      \
+    block[2] = vrol (block[2], 22);                     \
+  }
+
+/* Apply the inverse linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION_INVERSE(block)            \
+  {                                                     \
+    block[2] = vror (block[2], 22);                     \
+    block[0] = vror (block[0] , 5);                     \
+    block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \
+    block[0] = block[0] ^ block[1] ^ block[3];          \
+    block[3] = vror (block[3], 7);                      \
+    block[1] = vror (block[1], 1);                      \
+    block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \
+    block[1] = block[1] ^ block[0] ^ block[2];          \
+    block[2] = vror (block[2], 3);                      \
+    block[0] = vror (block[0], 13);                     \
+  }
+
+/* Apply a Serpent round to BLOCK, using the SBOX number WHICH and the
+   subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary storage.
+   This macro increments `round'.  */
+#define ROUND(which, subkeys, block, block_tmp) \
+  {                                             \
+    BLOCK_XOR_KEY (block, subkeys[round]);      \
+    SBOX (which, block, block_tmp);             \
+    LINEAR_TRANSFORMATION (block_tmp);          \
+    BLOCK_COPY (block, block_tmp);              \
+  }
+
+/* Apply the last Serpent round to BLOCK, using the SBOX number WHICH
+   and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary
+   storage.  The result will be stored in BLOCK_TMP.  This macro
+   increments `round'.  */
+#define ROUND_LAST(which, subkeys, block, block_tmp) \
+  {                                                  \
+    BLOCK_XOR_KEY (block, subkeys[round]);           \
+    SBOX (which, block, block_tmp);                  \
+    BLOCK_XOR_KEY (block_tmp, subkeys[round+1]);     \
+  }
+
+/* Apply an inverse Serpent round to BLOCK, using the SBOX number
+   WHICH and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as
+   temporary storage.  This macro increments `round'.  */
+#define ROUND_INVERSE(which, subkey, block, block_tmp) \
+  {                                                    \
+    LINEAR_TRANSFORMATION_INVERSE (block);             \
+    SBOX_INVERSE (which, block, block_tmp);            \
+    BLOCK_XOR_KEY (block_tmp, subkey[round]);          \
+    BLOCK_COPY (block, block_tmp);                     \
+  }
+
+/* Apply the first Serpent round to BLOCK, using the SBOX number WHICH
+   and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary
+   storage.  The result will be stored in BLOCK_TMP.  This macro
+   increments `round'.  */
+#define ROUND_FIRST_INVERSE(which, subkeys, block, block_tmp) \
+  {                                                           \
+    BLOCK_XOR_KEY (block, subkeys[round]);                    \
+    SBOX_INVERSE (which, block, block_tmp);                   \
+    BLOCK_XOR_KEY (block_tmp, subkeys[round-1]);              \
+  }
+
+static ALWAYS_INLINE void
+serpent_encrypt_internal_avx512 (const serpent_subkeys_t keys,
+				 const __m512i vin[8], __m512i vout[8])
+{
+  __m512i b[4];
+  __m512i c[4];
+  __m512i b_next[4];
+  __m512i c_next[4];
+  int round = 0;
+
+  b_next[0] = vin[0];
+  b_next[1] = vin[1];
+  b_next[2] = vin[2];
+  b_next[3] = vin[3];
+  c_next[0] = vin[4];
+  c_next[1] = vin[5];
+  c_next[2] = vin[6];
+  c_next[3] = vin[7];
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+
+  b[0] = b_next[0];
+  b[1] = b_next[1];
+  b[2] = b_next[2];
+  b[3] = b_next[3];
+  c[0] = c_next[0];
+  c[1] = c_next[1];
+  c[2] = c_next[2];
+  c[3] = c_next[3];
+
+  while (1)
+    {
+      ROUND (0, keys, b, b_next); ROUND (0, keys, c, c_next); round++;
+      ROUND (1, keys, b, b_next); ROUND (1, keys, c, c_next); round++;
+      ROUND (2, keys, b, b_next); ROUND (2, keys, c, c_next); round++;
+      ROUND (3, keys, b, b_next); ROUND (3, keys, c, c_next); round++;
+      ROUND (4, keys, b, b_next); ROUND (4, keys, c, c_next); round++;
+      ROUND (5, keys, b, b_next); ROUND (5, keys, c, c_next); round++;
+      ROUND (6, keys, b, b_next); ROUND (6, keys, c, c_next); round++;
+      if (round >= ROUNDS - 1)
+	break;
+      ROUND (7, keys, b, b_next); ROUND (7, keys, c, c_next); round++;
+    }
+
+  ROUND_LAST (7, keys, b, b_next); ROUND_LAST (7, keys, c, c_next);
+
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+  vout[0] = b_next[0];
+  vout[1] = b_next[1];
+  vout[2] = b_next[2];
+  vout[3] = b_next[3];
+  vout[4] = c_next[0];
+  vout[5] = c_next[1];
+  vout[6] = c_next[2];
+  vout[7] = c_next[3];
+}
+
+static ALWAYS_INLINE void
+serpent_decrypt_internal_avx512 (const serpent_subkeys_t keys,
+				 const __m512i vin[8], __m512i vout[8])
+{
+  __m512i b[4];
+  __m512i c[4];
+  __m512i b_next[4];
+  __m512i c_next[4];
+  int round = ROUNDS;
+
+  b_next[0] = vin[0];
+  b_next[1] = vin[1];
+  b_next[2] = vin[2];
+  b_next[3] = vin[3];
+  c_next[0] = vin[4];
+  c_next[1] = vin[5];
+  c_next[2] = vin[6];
+  c_next[3] = vin[7];
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+
+  ROUND_FIRST_INVERSE (7, keys, b_next, b); ROUND_FIRST_INVERSE (7, keys, c_next, c);
+  round -= 2;
+
+  while (1)
+    {
+      ROUND_INVERSE (6, keys, b, b_next); ROUND_INVERSE (6, keys, c, c_next); round--;
+      ROUND_INVERSE (5, keys, b, b_next); ROUND_INVERSE (5, keys, c, c_next); round--;
+      ROUND_INVERSE (4, keys, b, b_next); ROUND_INVERSE (4, keys, c, c_next); round--;
+      ROUND_INVERSE (3, keys, b, b_next); ROUND_INVERSE (3, keys, c, c_next); round--;
+      ROUND_INVERSE (2, keys, b, b_next); ROUND_INVERSE (2, keys, c, c_next); round--;
+      ROUND_INVERSE (1, keys, b, b_next); ROUND_INVERSE (1, keys, c, c_next); round--;
+      ROUND_INVERSE (0, keys, b, b_next); ROUND_INVERSE (0, keys, c, c_next); round--;
+      if (round <= 0)
+	break;
+      ROUND_INVERSE (7, keys, b, b_next); ROUND_INVERSE (7, keys, c, c_next); round--;
+    }
+
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+  vout[0] = b_next[0];
+  vout[1] = b_next[1];
+  vout[2] = b_next[2];
+  vout[3] = b_next[3];
+  vout[4] = c_next[0];
+  vout[5] = c_next[1];
+  vout[6] = c_next[2];
+  vout[7] = c_next[3];
+}
+
+enum crypt_mode_e
+{
+  ECB_ENC = 0,
+  ECB_DEC,
+  CBC_DEC,
+  CFB_DEC,
+  CTR_ENC,
+  OCB_ENC,
+  OCB_DEC
+};
+
+static ALWAYS_INLINE void
+ctr_generate(unsigned char *ctr, __m512i vin[8])
+{
+  const unsigned int blocksize = 16;
+  unsigned char ctr_low = ctr[15];
+
+  if (ctr_low + 32 <= 256)
+    {
+      const __m512i add0123 = _mm512_set_epi64(3LL << 56, 0,
+					       2LL << 56, 0,
+					       1LL << 56, 0,
+					       0LL << 56, 0);
+      const __m512i add4444 = _mm512_set_epi64(4LL << 56, 0,
+					       4LL << 56, 0,
+					       4LL << 56, 0,
+					       4LL << 56, 0);
+      const __m512i add4567 = _mm512_add_epi32(add0123, add4444);
+      const __m512i add8888 = _mm512_add_epi32(add4444, add4444);
+
+      // Fast path without carry handling.
+      __m512i vctr =
+	_mm512_broadcast_i32x4(_mm_loadu_si128((const void *)ctr));
+
+      cipher_block_add(ctr, 32, blocksize);
+      vin[0] = _mm512_add_epi32(vctr, add0123);
+      vin[1] = _mm512_add_epi32(vctr, add4567);
+      vin[2] = _mm512_add_epi32(vin[0], add8888);
+      vin[3] = _mm512_add_epi32(vin[1], add8888);
+      vin[4] = _mm512_add_epi32(vin[2], add8888);
+      vin[5] = _mm512_add_epi32(vin[3], add8888);
+      vin[6] = _mm512_add_epi32(vin[4], add8888);
+      vin[7] = _mm512_add_epi32(vin[5], add8888);
+    }
+  else
+    {
+      // Slow path.
+      u32 blocks[4][blocksize / sizeof(u32)];
+
+      cipher_block_cpy(blocks[0], ctr, blocksize);
+      cipher_block_cpy(blocks[1], ctr, blocksize);
+      cipher_block_cpy(blocks[2], ctr, blocksize);
+      cipher_block_cpy(blocks[3], ctr, blocksize);
+      cipher_block_add(ctr, 32, blocksize);
+      cipher_block_add(blocks[1], 1, blocksize);
+      cipher_block_add(blocks[2], 2, blocksize);
+      cipher_block_add(blocks[3], 3, blocksize);
+      vin[0] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[1] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[2] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[3] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[4] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[5] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[6] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[7] = _mm512_loadu_epi32 (blocks);
+
+      wipememory(blocks, sizeof(blocks));
+    }
+}
+
+static ALWAYS_INLINE __m512i
+ocb_input(__m512i *vchecksum, __m128i *voffset, const unsigned char *input,
+	  unsigned char *output, const ocb_L_uintptr_t L[4])
+{
+  __m128i L0 = _mm_loadu_si128((const void *)(uintptr_t)L[0]);
+  __m128i L1 = _mm_loadu_si128((const void *)(uintptr_t)L[1]);
+  __m128i L2 = _mm_loadu_si128((const void *)(uintptr_t)L[2]);
+  __m128i L3 = _mm_loadu_si128((const void *)(uintptr_t)L[3]);
+  __m512i vin = _mm512_loadu_epi32 (input);
+  __m512i voffsets;
+
+  /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+  /* Checksum_i = Checksum_{i-1} xor P_i  */
+  /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+
+  if (vchecksum)
+    *vchecksum ^= _mm512_loadu_epi32 (input);
+
+  *voffset ^= L0;
+  voffsets = _mm512_castsi128_si512(*voffset);
+  *voffset ^= L1;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 1);
+  *voffset ^= L2;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 2);
+  *voffset ^= L3;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 3);
+  _mm512_storeu_epi32 (output, voffsets);
+
+  return vin ^ voffsets;
+}
+
+static NO_INLINE void
+serpent_avx512_blk32(const void *c, unsigned char *output,
+		     const unsigned char *input, int mode,
+		     unsigned char *iv, unsigned char *checksum,
+		     const ocb_L_uintptr_t Ls[32])
+{
+  __m512i vin[8];
+  __m512i vout[8];
+  int encrypt = 1;
+
+  asm volatile ("vpxor %%ymm0, %%ymm0, %%ymm0;\n\t"
+		"vpopcntb %%zmm0, %%zmm6;\n\t" /* spec stop for old AVX512 CPUs */
+		"vpxor %%ymm6, %%ymm6, %%ymm6;\n\t"
+		:
+		: "m"(*input), "m"(*output)
+		: "xmm6", "xmm0", "memory", "cc");
+
+  // Input handling
+  switch (mode)
+    {
+      default:
+      case CBC_DEC:
+      case ECB_DEC:
+	encrypt = 0;
+	/* fall through */
+      case ECB_ENC:
+	vin[0] = _mm512_loadu_epi32 (input + 0 * 64);
+	vin[1] = _mm512_loadu_epi32 (input + 1 * 64);
+	vin[2] = _mm512_loadu_epi32 (input + 2 * 64);
+	vin[3] = _mm512_loadu_epi32 (input + 3 * 64);
+	vin[4] = _mm512_loadu_epi32 (input + 4 * 64);
+	vin[5] = _mm512_loadu_epi32 (input + 5 * 64);
+	vin[6] = _mm512_loadu_epi32 (input + 6 * 64);
+	vin[7] = _mm512_loadu_epi32 (input + 7 * 64);
+	break;
+
+      case CFB_DEC:
+      {
+	__m128i viv = _mm_loadu_si128((const void *)iv);
+	vin[0] = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0),
+					  input - 1 * 64 + 48)
+		  ^ _mm512_castsi128_si512(viv);
+	vin[1] = _mm512_loadu_epi32(input + 0 * 64 + 48);
+	vin[2] = _mm512_loadu_epi32(input + 1 * 64 + 48);
+	vin[3] = _mm512_loadu_epi32(input + 2 * 64 + 48);
+	vin[4] = _mm512_loadu_epi32(input + 3 * 64 + 48);
+	vin[5] = _mm512_loadu_epi32(input + 4 * 64 + 48);
+	vin[6] = _mm512_loadu_epi32(input + 5 * 64 + 48);
+	vin[7] = _mm512_loadu_epi32(input + 6 * 64 + 48);
+	viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48));
+	_mm_storeu_si128((void *)iv, viv);
+	break;
+      }
+
+      case CTR_ENC:
+	ctr_generate(iv, vin);
+	break;
+
+      case OCB_ENC:
+      {
+	const ocb_L_uintptr_t *L = Ls;
+	__m512i vchecksum = _mm512_setzero_epi32();
+	__m128i vchecksum128 = _mm_loadu_si128((const void *)checksum);
+	__m128i voffset = _mm_loadu_si128((const void *)iv);
+	vin[0] = ocb_input(&vchecksum, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4;
+	vin[1] = ocb_input(&vchecksum, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4;
+	vin[2] = ocb_input(&vchecksum, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4;
+	vin[3] = ocb_input(&vchecksum, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4;
+	vin[4] = ocb_input(&vchecksum, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4;
+	vin[5] = ocb_input(&vchecksum, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4;
+	vin[6] = ocb_input(&vchecksum, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4;
+	vin[7] = ocb_input(&vchecksum, &voffset, input + 7 * 64, output + 7 * 64, L);
+	vchecksum128 ^= _mm512_extracti32x4_epi32(vchecksum, 0)
+			^ _mm512_extracti32x4_epi32(vchecksum, 1)
+			^ _mm512_extracti32x4_epi32(vchecksum, 2)
+			^ _mm512_extracti32x4_epi32(vchecksum, 3);
+	_mm_storeu_si128((void *)checksum, vchecksum128);
+	_mm_storeu_si128((void *)iv, voffset);
+	break;
+      }
+
+      case OCB_DEC:
+      {
+	const ocb_L_uintptr_t *L = Ls;
+	__m128i voffset = _mm_loadu_si128((const void *)iv);
+	encrypt = 0;
+	vin[0] = ocb_input(NULL, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4;
+	vin[1] = ocb_input(NULL, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4;
+	vin[2] = ocb_input(NULL, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4;
+	vin[3] = ocb_input(NULL, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4;
+	vin[4] = ocb_input(NULL, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4;
+	vin[5] = ocb_input(NULL, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4;
+	vin[6] = ocb_input(NULL, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4;
+	vin[7] = ocb_input(NULL, &voffset, input + 7 * 64, output + 7 * 64, L);
+	_mm_storeu_si128((void *)iv, voffset);
+	break;
+      }
+    }
+
+  if (encrypt)
+    serpent_encrypt_internal_avx512(c, vin, vout);
+  else
+    serpent_decrypt_internal_avx512(c, vin, vout);
+
+  switch (mode)
+    {
+      case CTR_ENC:
+      case CFB_DEC:
+	vout[0] ^= _mm512_loadu_epi32 (input + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (input + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (input + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (input + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (input + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (input + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (input + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (input + 7 * 64);
+	/* fall through */
+      default:
+      case ECB_DEC:
+      case ECB_ENC:
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+
+      case CBC_DEC:
+      {
+	__m128i viv = _mm_loadu_si128((const void *)iv);
+	vout[0] ^= _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0),
+					    input - 1 * 64 + 48)
+		    ^ _mm512_castsi128_si512(viv);
+	vout[1] ^= _mm512_loadu_epi32(input + 0 * 64 + 48);
+	vout[2] ^= _mm512_loadu_epi32(input + 1 * 64 + 48);
+	vout[3] ^= _mm512_loadu_epi32(input + 2 * 64 + 48);
+	vout[4] ^= _mm512_loadu_epi32(input + 3 * 64 + 48);
+	vout[5] ^= _mm512_loadu_epi32(input + 4 * 64 + 48);
+	vout[6] ^= _mm512_loadu_epi32(input + 5 * 64 + 48);
+	vout[7] ^= _mm512_loadu_epi32(input + 6 * 64 + 48);
+	viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48));
+	_mm_storeu_si128((void *)iv, viv);
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+      }
+
+      case OCB_ENC:
+	vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (output + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64);
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+
+      case OCB_DEC:
+      {
+	__m512i vchecksum = _mm512_setzero_epi32();
+	__m128i vchecksum128 = _mm_loadu_si128((const void *)checksum);
+	vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (output + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64);
+	vchecksum ^= vout[0];
+	vchecksum ^= vout[1];
+	vchecksum ^= vout[2];
+	vchecksum ^= vout[3];
+	vchecksum ^= vout[4];
+	vchecksum ^= vout[5];
+	vchecksum ^= vout[6];
+	vchecksum ^= vout[7];
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	vchecksum128 ^= _mm512_extracti32x4_epi32(vchecksum, 0)
+			^ _mm512_extracti32x4_epi32(vchecksum, 1)
+			^ _mm512_extracti32x4_epi32(vchecksum, 2)
+			^ _mm512_extracti32x4_epi32(vchecksum, 3);
+	_mm_storeu_si128((void *)checksum, vchecksum128);
+	break;
+      }
+    }
+
+  _mm256_zeroall();
+#ifdef __x86_64__
+  asm volatile (
+#define CLEAR(mm) "vpxord %%" #mm ", %%" #mm ", %%" #mm ";\n\t"
+		CLEAR(ymm16) CLEAR(ymm17) CLEAR(ymm18) CLEAR(ymm19)
+		CLEAR(ymm20) CLEAR(ymm21) CLEAR(ymm22) CLEAR(ymm23)
+		CLEAR(ymm24) CLEAR(ymm25) CLEAR(ymm26) CLEAR(ymm27)
+		CLEAR(ymm28) CLEAR(ymm29) CLEAR(ymm30) CLEAR(ymm31)
+#undef CLEAR
+		:
+		: "m"(*input), "m"(*output)
+		: "xmm16", "xmm17", "xmm18", "xmm19",
+		  "xmm20", "xmm21", "xmm22", "xmm23",
+		  "xmm24", "xmm25", "xmm26", "xmm27",
+		  "xmm28", "xmm29", "xmm30", "xmm31",
+		  "memory", "cc");
+#endif
+}
+
+void
+_gcry_serpent_avx512_blk32(const void *ctx, unsigned char *out,
+			   const unsigned char *in, int encrypt)
+{
+  serpent_avx512_blk32 (ctx, out, in, encrypt ? ECB_ENC : ECB_DEC,
+			NULL, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_cbc_dec(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CBC_DEC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_cfb_dec(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CFB_DEC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_ctr_enc(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CTR_ENC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_ocb_crypt(const void *ctx, unsigned char *out,
+			       const unsigned char *in, unsigned char *offset,
+			       unsigned char *checksum,
+			       const ocb_L_uintptr_t Ls[32], int encrypt)
+{
+  serpent_avx512_blk32 (ctx, out, in, encrypt ? OCB_ENC : OCB_DEC, offset,
+			checksum, Ls);
+}
+
+#endif /*defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT)*/
+#endif /*__x86_64 || __i386*/
diff --git a/cipher/serpent.c b/cipher/serpent.c
index 908523c2..2b951aba 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -32,14 +32,14 @@
 #include "bulkhelp.h"
 
 
-/* USE_SSE2 indicates whether to compile with AMD64 SSE2 code. */
+/* USE_SSE2 indicates whether to compile with x86-64 SSE2 code. */
 #undef USE_SSE2
 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
 # define USE_SSE2 1
 #endif
 
-/* USE_AVX2 indicates whether to compile with AMD64 AVX2 code. */
+/* USE_AVX2 indicates whether to compile with x86-64 AVX2 code. */
 #undef USE_AVX2
 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
@@ -48,6 +48,15 @@
 # endif
 #endif
 
+/* USE_AVX512 indicates whether to compile with x86 AVX512 code. */
+#undef USE_AVX512
+#if (defined(__x86_64) || defined(__i386)) && \
+    defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS)
+# if defined(ENABLE_AVX512_SUPPORT)
+#  define USE_AVX512 1
+# endif
+#endif
+
 /* USE_NEON indicates whether to enable ARM NEON assembly code. */
 #undef USE_NEON
 #ifdef ENABLE_NEON_SUPPORT
@@ -82,6 +91,9 @@ typedef struct serpent_context
 #ifdef USE_AVX2
   int use_avx2;
 #endif
+#ifdef USE_AVX512
+  int use_avx512;
+#endif
 #ifdef USE_NEON
   int use_neon;
 #endif
@@ -186,6 +198,38 @@ extern void _gcry_serpent_avx2_blk16(const serpent_context_t *c, byte *out,
 				     const byte *in, int encrypt) ASM_FUNC_ABI;
 #endif
 
+#ifdef USE_AVX512
+/* Assembler implementations of Serpent using AVX512.  Processing 32 blocks in
+   parallel.
+ */
+extern void _gcry_serpent_avx512_cbc_dec(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *iv);
+
+extern void _gcry_serpent_avx512_cfb_dec(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *iv);
+
+extern void _gcry_serpent_avx512_ctr_enc(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *ctr);
+
+extern void _gcry_serpent_avx512_ocb_crypt(const void *ctx,
+					   unsigned char *out,
+					   const unsigned char *in,
+					   unsigned char *offset,
+					   unsigned char *checksum,
+					   const ocb_L_uintptr_t Ls[32],
+					   int encrypt);
+
+extern void _gcry_serpent_avx512_blk32(const void *c, byte *out,
+				       const byte *in,
+				       int encrypt);
+#endif
+
 #ifdef USE_NEON
 /* Assembler implementations of Serpent using ARM NEON.  Process 8 block in
    parallel.
@@ -758,6 +802,14 @@ serpent_setkey_internal (serpent_context_t *context,
   serpent_key_prepare (key, key_length, key_prepared);
   serpent_subkeys_generate (key_prepared, context->keys);
 
+#ifdef USE_AVX512
+  context->use_avx512 = 0;
+  if ((_gcry_get_hw_features () & HWF_INTEL_AVX512))
+    {
+      context->use_avx512 = 1;
+    }
+#endif
+
 #ifdef USE_AVX2
   context->use_avx2 = 0;
   if ((_gcry_get_hw_features () & HWF_INTEL_AVX2))
@@ -954,6 +1006,34 @@ _gcry_serpent_ctr_enc(void *context, unsigned char *ctr,
   unsigned char tmpbuf[sizeof(serpent_block_t)];
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_ctr_enc(ctx, outbuf, inbuf, ctr);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+      /* TODO: use caching instead? */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1066,6 +1146,33 @@ _gcry_serpent_cbc_dec(void *context, unsigned char *iv,
   unsigned char savebuf[sizeof(serpent_block_t)];
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_cbc_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1174,6 +1281,33 @@ _gcry_serpent_cfb_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_cfb_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1270,7 +1404,8 @@ static size_t
 _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 			const void *inbuf_arg, size_t nblocks, int encrypt)
 {
-#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON)
+#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \
+    || defined(USE_NEON)
   serpent_context_t *ctx = (void *)&c->context.c;
   unsigned char *outbuf = outbuf_arg;
   const unsigned char *inbuf = inbuf_arg;
@@ -1283,6 +1418,44 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
   (void)encrypt;
 #endif
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+      ocb_L_uintptr_t Ls[32];
+      ocb_L_uintptr_t *l;
+
+      if (nblocks >= 32)
+	{
+          l = bulk_ocb_prepare_L_pointers_array_blk32 (c, Ls, blkn);
+
+	  /* Process data in 32 block chunks. */
+	  while (nblocks >= 32)
+	    {
+	      blkn += 32;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 32);
+
+	      _gcry_serpent_avx512_ocb_crypt(ctx, outbuf, inbuf, c->u_iv.iv,
+					     c->u_ctr.ctr, Ls, encrypt);
+
+	      nblocks -= 32;
+	      outbuf += 32 * sizeof(serpent_block_t);
+	      inbuf  += 32 * sizeof(serpent_block_t);
+	      did_use_avx512 = 1;
+	    }
+	}
+
+      if (did_use_avx512)
+	{
+	  /* serpent-avx512 code does not use stack */
+	  if (nblocks == 0)
+	    burn_stack_depth = 0;
+	}
+
+      /* Use generic code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1408,7 +1581,8 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
     }
 #endif
 
-#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON)
+#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \
+    || defined(USE_NEON)
   c->u_mode.ocb.data_nblocks = blkn;
 
   if (burn_stack_depth)
@@ -1556,17 +1730,27 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
 
 
 static unsigned int
-serpent_crypt_blk1_16(void *context, byte *out, const byte *in,
+serpent_crypt_blk1_32(void *context, byte *out, const byte *in,
 		      size_t num_blks, int encrypt)
 {
   serpent_context_t *ctx = context;
   unsigned int burn, burn_stack_depth = 0;
 
+#ifdef USE_AVX512
+  if (num_blks == 32 && ctx->use_avx512)
+    {
+      _gcry_serpent_avx512_blk32 (ctx, out, in, encrypt);
+      return 0;
+    }
+#endif
+
 #ifdef USE_AVX2
-  if (num_blks == 16 && ctx->use_avx2)
+  while (num_blks == 16 && ctx->use_avx2)
     {
       _gcry_serpent_avx2_blk16 (ctx, out, in, encrypt);
-      return 0;
+      out += 16 * sizeof(serpent_block_t);
+      in += 16 * sizeof(serpent_block_t);
+      num_blks -= 16;
     }
 #endif
 
@@ -1611,17 +1795,17 @@ serpent_crypt_blk1_16(void *context, byte *out, const byte *in,
 }
 
 static unsigned int
-serpent_encrypt_blk1_16(void *ctx, byte *out, const byte *in,
+serpent_encrypt_blk1_32(void *ctx, byte *out, const byte *in,
 			size_t num_blks)
 {
-  return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 1);
+  return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 1);
 }
 
 static unsigned int
-serpent_decrypt_blk1_16(void *ctx, byte *out, const byte *in,
+serpent_decrypt_blk1_32(void *ctx, byte *out, const byte *in,
 			size_t num_blks)
 {
-  return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 0);
+  return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 0);
 }
 
 
@@ -1638,12 +1822,12 @@ _gcry_serpent_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg,
   /* Process remaining blocks. */
   if (nblocks)
     {
-      unsigned char tmpbuf[16 * 16];
+      unsigned char tmpbuf[32 * 16];
       unsigned int tmp_used = 16;
       size_t nburn;
 
-      nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16
-                                              : serpent_decrypt_blk1_16,
+      nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32
+                                              : serpent_decrypt_blk1_32,
                                  outbuf, inbuf, nblocks,
                                  tweak, tmpbuf, sizeof(tmpbuf) / 16,
                                  &tmp_used);
@@ -1672,9 +1856,9 @@ _gcry_serpent_ecb_crypt (void *context, void *outbuf_arg, const void *inbuf_arg,
     {
       size_t nburn;
 
-      nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16
-                                              : serpent_decrypt_blk1_16,
-                                 outbuf, inbuf, nblocks, 16);
+      nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32
+                                              : serpent_decrypt_blk1_32,
+                                 outbuf, inbuf, nblocks, 32);
       burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth;
     }
 
diff --git a/configure.ac b/configure.ac
index 60fb1f75..572fe279 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1704,6 +1704,46 @@ if test "$gcry_cv_gcc_inline_asm_bmi2" = "yes" ; then
 fi
 
 
+#
+# Check whether compiler supports x86/AVX512 intrinsics
+#
+_gcc_cflags_save=$CFLAGS
+CFLAGS="$CFLAGS -mavx512f"
+
+AC_CACHE_CHECK([whether compiler supports x86/AVX512 intrinsics],
+      [gcry_cv_cc_x86_avx512_intrinsics],
+      [if test "$mpi_cpu_arch" != "x86" ||
+	  test "$try_asm_modules" != "yes" ; then
+	gcry_cv_cc_x86_avx512_intrinsics="n/a"
+      else
+	gcry_cv_cc_x86_avx512_intrinsics=no
+	AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+	[[#include <immintrin.h>
+	  __m512i fn(void *in, __m128i y)
+	  {
+	    __m512i x;
+	    x = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0), in)
+		  ^ _mm512_castsi128_si512(y);
+	    asm volatile ("vinserti32x4 \$3, %0, %%zmm6, %%zmm6;\n\t"
+			  "vpxord %%zmm6, %%zmm6, %%zmm6"
+			  ::"x"(y),"r"(in):"memory","xmm6");
+	    return x;
+	  }
+	  ]])],
+	[gcry_cv_cc_x86_avx512_intrinsics=yes])
+      fi])
+if test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes" ; then
+    AC_DEFINE(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS,1,
+	    [Defined if underlying compiler supports x86/AVX512 intrinsics])
+fi
+
+AM_CONDITIONAL(ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS,
+	       test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes")
+
+# Restore flags.
+CFLAGS=$_gcc_cflags_save;
+
+
 #
 # Check whether GCC assembler needs "-Wa,--divide" to correctly handle
 # constant division
@@ -3034,6 +3074,11 @@ if test "$found" = "1" ; then
       GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx2-amd64.lo"
    fi
 
+   if test x"$avx512support" = xyes ; then
+      # Build with the AVX512 implementation
+      GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx512-x86.lo"
+   fi
+
    if test x"$neonsupport" = xyes ; then
       # Build with the NEON implementation
       GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-armv7-neon.lo"
-- 
2.39.2


From jussi.kivilinna at iki.fi  Mon May 29 20:52:57 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon, 29 May 2023 21:52:57 +0300
Subject: [PATCH] twofish-avx2: de-unroll round function
Message-ID: <20230529185257.789223-1-jussi.kivilinna@iki.fi>

* cipher/twofish-avx2-amd64.S (__twofish_enc_blk16)
(__twofish_dec_blk16): Use loop structure instead of unrolling.
--

De-unrolling reduces code-size significantly and gives
small (<1%) increase in speed (tested on zen4, tiger-lake).

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/twofish-avx2-amd64.S | 115 +++++++++++++++---------------------
 1 file changed, 49 insertions(+), 66 deletions(-)

diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S
index 8a6aae19..d05ec1f9 100644
--- a/cipher/twofish-avx2-amd64.S
+++ b/cipher/twofish-avx2-amd64.S
@@ -39,8 +39,8 @@
 /* register macros */
 #define CTX	%rdi
 
-#define RROUND  %rbp
-#define RROUNDd %ebp
+#define RROUND  %r12
+#define RROUNDd %r12d
 #define RS0	CTX
 #define RS1	%r8
 #define RS2	%r9
@@ -154,9 +154,9 @@
 #define encrypt_round_end16(a, b, c, d, nk, r) \
 	vpaddd RY0, RX0, RX0; \
 	vpaddd RX0, RY0, RY0; \
-	vpbroadcastd ((nk)+((r)*8))(RK), RT0; \
+	vpbroadcastd ((nk))(RK,r), RT0; \
 	vpaddd RT0, RX0, RX0; \
-	vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \
+	vpbroadcastd 4+((nk))(RK,r), RT0; \
 	vpaddd RT0, RY0, RY0; \
 	\
 	vpxor RY0, d ## 0, d ## 0; \
@@ -168,9 +168,9 @@
 	\
 		vpaddd RY1, RX1, RX1; \
 		vpaddd RX1, RY1, RY1; \
-		vpbroadcastd ((nk)+((r)*8))(RK), RT0; \
+		vpbroadcastd ((nk))(RK,r), RT0; \
 		vpaddd RT0, RX1, RX1; \
-		vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \
+		vpbroadcastd 4+((nk))(RK,r), RT0; \
 		vpaddd RT0, RY1, RY1; \
 		\
 		vpxor RY1, d ## 1, d ## 1; \
@@ -216,9 +216,9 @@
 #define decrypt_round_end16(a, b, c, d, nk, r) \
 	vpaddd RY0, RX0, RX0; \
 	vpaddd RX0, RY0, RY0; \
-	vpbroadcastd ((nk)+((r)*8))(RK), RT0; \
+	vpbroadcastd ((nk))(RK,r), RT0; \
 	vpaddd RT0, RX0, RX0; \
-	vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \
+	vpbroadcastd 4+((nk))(RK,r), RT0; \
 	vpaddd RT0, RY0, RY0; \
 	\
 	vpxor RX0, c ## 0, c ## 0; \
@@ -230,9 +230,9 @@
 	\
 		vpaddd RY1, RX1, RX1; \
 		vpaddd RX1, RY1, RY1; \
-		vpbroadcastd ((nk)+((r)*8))(RK), RT0; \
+		vpbroadcastd ((nk))(RK,r), RT0; \
 		vpaddd RT0, RX1, RX1; \
-		vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \
+		vpbroadcastd 4+((nk))(RK,r), RT0; \
 		vpaddd RT0, RY1, RY1; \
 		\
 		vpxor RX1, c ## 1, c ## 1; \
@@ -275,30 +275,6 @@
 	\
 	decrypt_round_end16(a, b, c, d, nk, r);
 
-#define encrypt_cycle16(r) \
-	encrypt_round16(RA, RB, RC, RD, 0, r); \
-	encrypt_round16(RC, RD, RA, RB, 8, r);
-
-#define encrypt_cycle_first16(r) \
-	encrypt_round_first16(RA, RB, RC, RD, 0, r); \
-	encrypt_round16(RC, RD, RA, RB, 8, r);
-
-#define encrypt_cycle_last16(r) \
-	encrypt_round16(RA, RB, RC, RD, 0, r); \
-	encrypt_round_last16(RC, RD, RA, RB, 8, r);
-
-#define decrypt_cycle16(r) \
-	decrypt_round16(RC, RD, RA, RB, 8, r); \
-	decrypt_round16(RA, RB, RC, RD, 0, r);
-
-#define decrypt_cycle_first16(r) \
-	decrypt_round_first16(RC, RD, RA, RB, 8, r); \
-	decrypt_round16(RA, RB, RC, RD, 0, r);
-
-#define decrypt_cycle_last16(r) \
-	decrypt_round16(RC, RD, RA, RB, 8, r); \
-	decrypt_round_last16(RA, RB, RC, RD, 0, r);
-
 #define transpose_4x4(x0,x1,x2,x3,t1,t2) \
 	vpunpckhdq x1, x0, t2; \
 	vpunpckldq x1, x0, x0; \
@@ -312,22 +288,6 @@
 	vpunpckhqdq x2, t2, x3; \
 	vpunpcklqdq x2,	t2, x2;
 
-#define read_blocks8(offs,a,b,c,d) \
-	vmovdqu 16*offs(RIO), a; \
-	vmovdqu 16*offs+32(RIO), b; \
-	vmovdqu 16*offs+64(RIO), c; \
-	vmovdqu 16*offs+96(RIO), d; \
-	\
-	transpose_4x4(a, b, c, d, RX0, RY0);
-
-#define write_blocks8(offs,a,b,c,d) \
-	transpose_4x4(a, b, c, d, RX0, RY0); \
-	\
-	vmovdqu a, 16*offs(RIO); \
-	vmovdqu b, 16*offs+32(RIO); \
-	vmovdqu c, 16*offs+64(RIO); \
-	vmovdqu d, 16*offs+96(RIO);
-
 #define inpack_enc8(a,b,c,d) \
 	vpbroadcastd 4*0(RW), RT0; \
 	vpxor RT0, a, a; \
@@ -414,23 +374,35 @@ __twofish_enc_blk16:
 	 *						ciphertext blocks
 	 */
 	CFI_STARTPROC();
+
+	pushq RROUND;
+	CFI_PUSH(RROUND);
+
 	init_round_constants();
 
 	transpose4x4_16(RA, RB, RC, RD);
 	inpack_enc16(RA, RB, RC, RD);
 
-	encrypt_cycle_first16(0);
-	encrypt_cycle16(2);
-	encrypt_cycle16(4);
-	encrypt_cycle16(6);
-	encrypt_cycle16(8);
-	encrypt_cycle16(10);
-	encrypt_cycle16(12);
-	encrypt_cycle_last16(14);
+	xorl RROUNDd, RROUNDd;
+
+	encrypt_round_first16(RA, RB, RC, RD, 0, RROUND);
+
+.align 16
+.Loop_enc16:
+	encrypt_round16(RC, RD, RA, RB, 8, RROUND);
+	encrypt_round16(RA, RB, RC, RD, 16, RROUND);
+	leal 16(RROUNDd), RROUNDd;
+	cmpl $8*14, RROUNDd;
+	jb .Loop_enc16;
+
+	encrypt_round_last16(RC, RD, RA, RB, 8, RROUND);
 
 	outunpack_enc16(RA, RB, RC, RD);
 	transpose4x4_16(RA, RB, RC, RD);
 
+	popq RROUND;
+	CFI_POP(RROUND);
+
 	ret_spec_stop;
 	CFI_ENDPROC();
 ELF(.size __twofish_enc_blk16,.-__twofish_enc_blk16;)
@@ -447,23 +419,34 @@ __twofish_dec_blk16:
 	 *						ciphertext blocks
 	 */
 	CFI_STARTPROC();
+
+	pushq RROUND;
+	CFI_PUSH(RROUND);
+
 	init_round_constants();
 
 	transpose4x4_16(RA, RB, RC, RD);
 	inpack_dec16(RA, RB, RC, RD);
 
-	decrypt_cycle_first16(14);
-	decrypt_cycle16(12);
-	decrypt_cycle16(10);
-	decrypt_cycle16(8);
-	decrypt_cycle16(6);
-	decrypt_cycle16(4);
-	decrypt_cycle16(2);
-	decrypt_cycle_last16(0);
+	movl $14*8, RROUNDd;
+
+	decrypt_round_first16(RC, RD, RA, RB, 8, RROUND);
+
+.align 16
+.Loop_dec16:
+	decrypt_round16(RA, RB, RC, RD, 0, RROUND);
+	decrypt_round16(RC, RD, RA, RB, -8, RROUND);
+	subl $16, RROUNDd;
+	jnz .Loop_dec16;
+
+	decrypt_round_last16(RA, RB, RC, RD, 0, RROUND);
 
 	outunpack_dec16(RA, RB, RC, RD);
 	transpose4x4_16(RA, RB, RC, RD);
 
+	popq RROUND;
+	CFI_POP(RROUND);
+
 	ret_spec_stop;
 	CFI_ENDPROC();
 ELF(.size __twofish_dec_blk16,.-__twofish_dec_blk16;)
-- 
2.39.2


From wk at gnupg.org  Tue May 30 12:32:46 2023
From: wk at gnupg.org (Werner Koch)
Date: Tue, 30 May 2023 12:32:46 +0200
Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption
In-Reply-To: <20230528145355.532424-1-jussi.kivilinna@iki.fi> (Jussi
 Kivilinna's message of "Sun, 28 May 2023 17:53:55 +0300")
References: <20230528145355.532424-1-jussi.kivilinna@iki.fi>
Message-ID: <87ilcajei9.fsf@wheatstone.g10code.de>

On Sun, 28 May 2023 17:53, Jussi Kivilinna said:

> Inline checksumming is far faster on Ryzen processors on i386
> builds than two-pass checksumming.

That is indeed a large performance boost.  Did you had a chance to
benchmark it on some common Intel CPU?


Shalom-Salam,

   Werner

-- 
The pioneers of a warless world are the youth that
refuse military service.             - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20230530/f06702ab/attachment.sig>

From jussi.kivilinna at iki.fi  Wed May 31 06:48:39 2023
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Wed, 31 May 2023 07:48:39 +0300
Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption
In-Reply-To: <87ilcajei9.fsf@wheatstone.g10code.de>
References: <20230528145355.532424-1-jussi.kivilinna@iki.fi>
 <87ilcajei9.fsf@wheatstone.g10code.de>
Message-ID: <918b7158-7385-c845-c102-ffca049d8f41@iki.fi>

On 30.5.2023 13.32, Werner Koch via Gcrypt-devel wrote:
> On Sun, 28 May 2023 17:53, Jussi Kivilinna said:
> 
>> Inline checksumming is far faster on Ryzen processors on i386
>> builds than two-pass checksumming.
> 
> That is indeed a large performance boost.  Did you had a chance to
> benchmark it on some common Intel CPU?
> 

I tested now with Intel tigerlake, performance dropped by 9% which is
unexpectedly large change. I'll try few different things to see if
I can avoid such drop.

-Jussi

> 
> Shalom-Salam,
> 
>     Werner
> 
> 
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> https://lists.gnupg.org/mailman/listinfo/gcrypt-devel