From simon at josefsson.org Mon May 15 16:20:17 2023 From: simon at josefsson.org (Simon Josefsson) Date: Mon, 15 May 2023 16:20:17 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> (Falko Strenzke's message of "Mon, 3 Apr 2023 05:59:10 +0200") References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87tty2cq2q.fsf@wheatstone.g10code.de> <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> Message-ID: <87lehp1xsu.fsf@kaka.sjd.se> Hi I noticed this thread just after submitting sntrup761 [1] patches. My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it uses S-expressions and MPI data types. I believe the crypto world rightly has moved away from those abstraction, towards byte-oriented designs instead, for simplicity and safety. Compare gcry_ecc_mul_point for X25519 and gcry_kdf_derive for KDF's. Could you implement Kyber as a KEM using the API that I suggested? I think that would be significantly simpler, and would help validate the KEM API as supporting more than one KEM. I would strongly support having a KEM API that is not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition to it, for different use-cases. /Simon [1] https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761 Falko Strenzke writes: > Hi Werner, > > the only API change is the addition of the following interface function: > > gcry_err_code_t > _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey) > > This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows: > > gcry_pk_spec_t _gcry_pubkey_spec_kyber = { > GCRY_PK_KYBER, {0, 1}, > (GCRY_PK_USAGE_ENCAP), // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"? > "Kyber", kyber_names, > "p", "s", "a", "", "p", // elements of pub-key, sec-key, ciphertext, signature, key-grip > kyber_generate, > kyber_check_secret_key, > NULL, // encrypt > kyber_encap, > kyber_decrypt, > NULL, // sign, > NULL, // verify, > kyber_get_nbits, > run_selftests, > compute_keygrip > }; > > For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far > doesn't seem to have any implications for us so the decision isn't urgent from my point of view. > > - Falko > > Am 30.03.23 um 15:43 schrieb Werner Koch: > > On Wed, 29 Mar 2023 10:09, Falko Strenzke said: > > While the integration of the signature algorithms is straightforward, the KEM > requires a new interface function, as the KEM encapsulation cannot be modelled > by a public-key encryption. > > > It would be good if we can discuss a proposed API early enough, so that > we can see how it fits into the design of Libgcrypt. Can you already > roughly describes the needs? > > > Salam-Shalom, > > Werner -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From falko.strenzke at mtg.de Mon May 15 17:07:15 2023 From: falko.strenzke at mtg.de (Falko Strenzke) Date: Mon, 15 May 2023 17:07:15 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: <87lehp1xsu.fsf@kaka.sjd.se> References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87tty2cq2q.fsf@wheatstone.g10code.de> <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> <87lehp1xsu.fsf@kaka.sjd.se> Message-ID: Hi Simon, indeed, there is considerable overhead in our implementation of the S-Expressions interface for the extraction of values and MPI <-> byte array conversions even though each Kyber "token" is merely an opaque byte array. However, we don't consider it our call to divert from the existing API as we can't gauge the implication of that for the client code, e.g. GnuPG. So we basically consider this the maintainer's decision. I looked through your API. It is indeed much simpler. I have the following points, however: 1. I don't fully understand the design logic regarding the gcry_kem_hd_t. I understand that it makes sense to use it for the encryption and decryption to instantiate a particular key.? But for the key generation I don't per se see why it needs a handle. Is it required for precomputations in the case of NTRU Prime? (or anticipated that this is the case for other KEMs?) 2. "open" / "close" are in my opinion not the best names for the function to create and destroy such a handle. These terms rather suggest the handling of a file or a pipe. I know these terms are also used in the hash API, but I think more appropriate names would "create" / "destroy" or something similar. Maybe it makes sense to make the move to a new terminology here. 3. While the previous two points are rather minor or even cosmetic, this one is really important in my opinion: we need an API that allows for derandomized key generation and encapsulation to support KAT tests for all operations. The Kyber reference implementation already supports such KAT tests. I would anyway have raised the question here how to realize that. Signature functions in libgcrypt already support a "random-override" parameter, but so far I don't really understand how it works and whether it would be suitable to use it for the KEM API as well. Ideally, I think, the new API would allow to provide an RNG object and to set it to a specific seed before any operation (possibly via the KEM handle). However, it would probably be better if this functionality is only supported by an internal test-API and not available to normal clients. But I am not sure how to realize that in the current design of libgcrypt. - Falko Am 15.05.23 um 16:20 schrieb Simon Josefsson: > Hi > > I noticed this thread just after submitting sntrup761 [1] patches. > > My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it > uses S-expressions and MPI data types. I believe the crypto world > rightly has moved away from those abstraction, towards byte-oriented > designs instead, for simplicity and safety. Compare gcry_ecc_mul_point > for X25519 and gcry_kdf_derive for KDF's. Could you implement Kyber as > a KEM using the API that I suggested? I think that would be > significantly simpler, and would help validate the KEM API as supporting > more than one KEM. I would strongly support having a KEM API that is > not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition > to it, for different use-cases. 
> > /Simon > > [1]https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761 > > Falko Strenzke writes: > >> Hi Werner, >> >> the only API change is the addition of the following interface function: >> >> gcry_err_code_t >> _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey) >> >> This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows: >> >> gcry_pk_spec_t _gcry_pubkey_spec_kyber = { >> GCRY_PK_KYBER, {0, 1}, >> (GCRY_PK_USAGE_ENCAP), // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"? >> "Kyber", kyber_names, >> "p", "s", "a", "", "p", // elements of pub-key, sec-key, ciphertext, signature, key-grip >> kyber_generate, >> kyber_check_secret_key, >> NULL, // encrypt >> kyber_encap, >> kyber_decrypt, >> NULL, // sign, >> NULL, // verify, >> kyber_get_nbits, >> run_selftests, >> compute_keygrip >> }; >> >> For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far >> doesn't seem to have any implications for us so the decision isn't urgent from my point of view. >> >> - Falko >> >> Am 30.03.23 um 15:43 schrieb Werner Koch: >> >> On Wed, 29 Mar 2023 10:09, Falko Strenzke said: >> >> While the integration of the signature algorithms is straightforward, the KEM >> requires a new interface function, as the KEM encapsulation cannot be modelled >> by a public-key encryption. >> >> >> It would be good if we can discuss a proposed API early enough, so that >> we can see how it fits into the design of Libgcrypt. Can you already >> roughly describes the needs? >> >> >> Salam-Shalom, >> >> Werner -- *MTG AG* Dr. Falko Strenzke Executive System Architect Phone: +49 6151 8000 24 E-Mail: falko.strenzke at mtg.de Web: mtg.de *MTG Exhibitions ? See you in 2023* ------------------------------------------------------------------------ MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany Commercial register: HRB 8901 Register Court: Amtsgericht Darmstadt Management Board: J?rgen Ruf (CEO), Tamer Kemer?z Chairman of the Supervisory Board: Dr. Thomas Milde This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted. Data protection information: Privacy policy -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Y0kNcAEqvMyKELQS.png Type: image/png Size: 5256 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eTvco00Nz4KhPB6y.png Type: image/png Size: 4906 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: S/MIME Cryptographic Signature URL: From simon at josefsson.org Mon May 15 17:39:23 2023 From: simon at josefsson.org (Simon Josefsson) Date: Mon, 15 May 2023 17:39:23 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: (Falko Strenzke's message of "Mon, 15 May 2023 17:07:15 +0200") References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87tty2cq2q.fsf@wheatstone.g10code.de> <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> <87lehp1xsu.fsf@kaka.sjd.se> Message-ID: <87h6sd1u50.fsf@kaka.sjd.se> Hi Thanks for feedback. Generally I'm not sure we should consider KEM's a subset of public-key encryption/decryption from an API point of view. Compare KDF's relationship to MAC/hashes. Sometimes separate APIs make sense. Re your point about test vectors, I agree. Libgcrypt supports some "selftest" code (used for FIPS mode) that seems relevant here, maybe it is sufficient to add selftest code that hard-code the RNG stuff and compare test vectors? Compare how kdf.c looks. Werner, any thoughts on this approach? It seems simple, and avoids exposing potentially insecure APIs to users. I was considering this, but didn't want to modify any FIPS-related code. Re API, I think *_open is fairly established in libgcrypt, so it is good idiom to re-use. Let's have some alternatives, right now I proposed this: enum gcry_kem_algos { GCRY_KEM_SNTRUP761 = 761, }; #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 #define GCRY_KEM_SNTRUP761_SIZE 32 typedef struct gcry_kem_handle *gcry_kem_hd_t; gcry_error_t gcry_kem_open (gcry_kem_hd_t *hd, int algo); void gcry_kem_close (gcry_kem_hd_t h); gcry_error_t gcry_kem_keypair (gcry_kem_hd_t hd, size_t pklen, void *pubkey, size_t sklen, void *seckey); gcry_error_t gcry_kem_enc (gcry_kem_hd_t hd, size_t pklen, const void *pubkey, size_t ctlen, void *ciphertext, size_t keylen, void *key); gcry_error_t gcry_kem_dec (gcry_kem_hd_t hd, size_t ctlen, const void *ciphertext, size_t sklen, const void *seckey, size_t keylen, void *key); Here is minimal approach similar to KDF interface: enum gcry_kem_algos { GCRY_KEM_SNTRUP761 = 761, }; #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 #define GCRY_KEM_SNTRUP761_SIZE 32 gcry_error_t gcry_kem_keypair (int algo, size_t pklen, void *pubkey, size_t sklen, void *seckey); gcry_error_t gcry_kem_enc (int algo, size_t pklen, const void *pubkey, size_t ctlen, void *ciphertext, size_t keylen, void *key); gcry_error_t gcry_kem_dec (int algo, size_t ctlen, const void *ciphertext, size_t sklen, const void *seckey, size_t keylen, void *key); Here is a more complex variant that may be more consistent with existing APIs but has some disadvantages (more APIs are harder to analyze, makes static allocation much harder if not impossible): enum gcry_kem_algos { GCRY_KEM_SNTRUP761 = 761, }; typedef struct gcry_kem_handle *gcry_kem_hd_t; gcry_error_t gcry_kem_open (gcry_kem_hd_t *hd, int algo); void gcry_kem_close (gcry_kem_hd_t h); size_t gcry_kem_pubkey_size (gcry_kem_hd_t hd); size_t gcry_kem_seckey_size (gcry_kem_hd_t hd); size_t gcry_kem_ciphertext_size (gcry_kem_hd_t hd); size_t gcry_kem_output_size (gcry_kem_hd_t hd); gcry_error_t gcry_kem_keypair (gcry_kem_hd_t hd); void *gcry_kem_get_seckey (gcry_kem_hd_t hd); void *gcry_kem_get_pubkey (gcry_kem_hd_t hd); gcry_error_t 
gcry_kem_enc (gcry_kem_hd_t hd, size_t ctlen, void *ciphertext, size_t keylen, void *key); gcry_error_t gcry_kem_dec (gcry_kem_hd_t hd, size_t ctlen, const void *ciphertext, size_t keylen, void *key); Other ideas? Does kyber have any requirements on the API that wouldn't work well with any of these? /Simon Falko Strenzke writes: > Hi Simon, > > indeed, there is considerable overhead in our implementation of the > S-Expressions interface for the extraction of values and MPI <-> byte > array conversions even though each Kyber "token" is merely an opaque > byte array. However, we don't consider it our call to divert from the > existing API as we can't gauge the implication of that for the client > code, e.g. GnuPG. So we basically consider this the maintainer's > decision. > > I looked through your API. It is indeed much simpler. I have the > following points, however: > > 1. I don't fully understand the design logic regarding the > gcry_kem_hd_t. I understand that it makes sense to use it for the > encryption and decryption to instantiate a particular key. But for > the key generation I don't per se see why it needs a handle. Is it > required for precomputations in the case of NTRU Prime? (or > anticipated that this is the case for other KEMs?) > > 2. "open" / "close" are in my opinion not the best names for the > function to create and destroy such a handle. These terms rather > suggest the handling of a file or a pipe. I know these terms are also > used in the hash API, but I think more appropriate names would > "create" / "destroy" or something similar. Maybe it makes sense to > make the move to a new terminology here. > > 3. While the previous two points are rather minor or even cosmetic, > this one is really important in my opinion: we need an API that allows > for derandomized key generation and encapsulation to support KAT tests > for all operations. The Kyber reference implementation already > supports such KAT tests. I would anyway have raised the question here > how to realize that. Signature functions in libgcrypt already support > a "random-override" parameter, but so far I don't really understand > how it works and whether it would be suitable to use it for the KEM > API as well. Ideally, I think, the new API would allow to provide an > RNG object and to set it to a specific seed before any operation > (possibly via the KEM handle). However, it would probably be better if > this functionality is only supported by an internal test-API and not > available to normal clients. But I am not sure how to realize that in > the current design of libgcrypt. > > - Falko > > Am 15.05.23 um 16:20 schrieb Simon Josefsson: > > Hi > > I noticed this thread just after submitting sntrup761 [1] patches. > > My opinion is that libgcrypt's public-key API is a bad fit for KEM's: it > uses S-expressions and MPI data types. I believe the crypto world > rightly has moved away from those abstraction, towards byte-oriented > designs instead, for simplicity and safety. Compare gcry_ecc_mul_point > for X25519 and gcry_kdf_derive for KDF's. Could you implement Kyber as > a KEM using the API that I suggested? I think that would be > significantly simpler, and would help validate the KEM API as supporting > more than one KEM. I would strongly support having a KEM API that is > not using sexp/mpi, but I wouldn't object to a sexp/mpi API in addition > to it, for different use-cases. 
> > /Simon > > [1] https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761 > > Falko Strenzke writes: > > Hi Werner, > > the only API change is the addition of the following interface function: > > gcry_err_code_t > _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t* r_shared_key, gcry_sexp_t s_pkey) > > This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows: > > gcry_pk_spec_t _gcry_pubkey_spec_kyber = { > GCRY_PK_KYBER, {0, 1}, > (GCRY_PK_USAGE_ENCAP), // TODOMTG: can the key usage "encryption" remain or do we need new KU "encap"? > "Kyber", kyber_names, > "p", "s", "a", "", "p", // elements of pub-key, sec-key, ciphertext, signature, key-grip > kyber_generate, > kyber_check_secret_key, > NULL, // encrypt > kyber_encap, > kyber_decrypt, > NULL, // sign, > NULL, // verify, > kyber_get_nbits, > run_selftests, > compute_keygrip > }; > > For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far > doesn't seem to have any implications for us so the decision isn't urgent from my point of view. > > - Falko > > Am 30.03.23 um 15:43 schrieb Werner Koch: > > On Wed, 29 Mar 2023 10:09, Falko Strenzke said: > > While the integration of the signature algorithms is straightforward, the KEM > requires a new interface function, as the KEM encapsulation cannot be modelled > by a public-key encryption. > > > It would be good if we can discuss a proposed API early enough, so that > we can see how it fits into the design of Libgcrypt. Can you already > roughly describes the needs? > > > Salam-Shalom, > > Werner -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From smueller at chronox.de Mon May 15 17:51:12 2023 From: smueller at chronox.de (Stephan Mueller) Date: Mon, 15 May 2023 17:51:12 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: <87h6sd1u50.fsf@kaka.sjd.se> References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87h6sd1u50.fsf@kaka.sjd.se> Message-ID: <3414868.jSmoDxJbhA@tauon.chronox.de> Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt- devel: Hi Simon, > Does kyber have any requirements on the API that wouldn't work well with > any of these? I am experimenting with Kyber in [1]. For KEM, your API would work. There you see that I use an additional parameter, an RNG context. This allows me to also derive Kyber keys straight from a KDF (which is accessed like an RNG context). But that is not really needed. However, how do you propose to handle the KEX scenario? See [2] for the full Kyber KEX exchange and the API. I think the KEX is much more important than the KEM, as the KEX is conceptually what is DH today. Kyber KEM can be used in an integrated encryption schema as suggested in [3]. Unfortunately, the Kyber KEX cannot be acting as a direct replacement for DH. Due to its 7 total steps. However, it is possible to coalescing all of them into 2 handshake network exchanges and one final data blob that is sent along with the already encrypted first payload. 
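For concreteness, the basic two-flight KEM agreement that the gcry_kem_* proposal earlier in this thread covers would look roughly like the sketch below. The function names, constants and buffer sizes are taken from Simon's proposed minimal variant (the one with explicit lengths); none of them exist in a released libgcrypt, so this is only an illustration of the proposal, here instantiated with sntrup761 (a Kyber algorithm id would be used the same way).

/* Sketch against the gcry_kem_* API *proposed* in this thread (Simon's
 * minimal variant with explicit lengths); these functions and constants
 * do not exist in any released libgcrypt. */
#include <gcrypt.h>

static gcry_error_t
ephemeral_kem_agreement (void)
{
  unsigned char pk[GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE];   /* 1158 bytes */
  unsigned char sk[GCRY_KEM_SNTRUP761_SECRETKEY_SIZE];   /* 1763 bytes */
  unsigned char ct[GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE];  /* 1039 bytes */
  unsigned char ss_a[GCRY_KEM_SNTRUP761_SIZE];           /* initiator's secret */
  unsigned char ss_b[GCRY_KEM_SNTRUP761_SIZE];           /* responder's secret */
  gcry_error_t err;

  /* Initiator: generate an ephemeral key pair; pk goes out as message 1. */
  err = gcry_kem_keypair (GCRY_KEM_SNTRUP761, sizeof pk, pk, sizeof sk, sk);
  if (err)
    return err;

  /* Responder: encapsulate to the received pk; ct goes back as message 2
     and ss_b is the responder's copy of the shared secret. */
  err = gcry_kem_enc (GCRY_KEM_SNTRUP761, sizeof pk, pk,
                      sizeof ct, ct, sizeof ss_b, ss_b);
  if (err)
    return err;

  /* Initiator: decapsulate; ss_a now equals ss_b, and both sides feed it
     into a KDF exactly as they would a DH shared secret today. */
  return gcry_kem_dec (GCRY_KEM_SNTRUP761, sizeof ct, ct,
                       sizeof sk, sk, sizeof ss_a, ss_a);
}

The mutually authenticated KEX flows Stephan refers to need more round trips than this basic shape; his leancrypto references follow.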
[1] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/ lc_kyber.h#L121 [2] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/ lc_kyber.h#L294 [3] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/ lc_kyber.h#L425 Ciao Stephan From simon at josefsson.org Mon May 15 16:02:39 2023 From: simon at josefsson.org (Simon Josefsson) Date: Mon, 15 May 2023 16:02:39 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. Message-ID: <87wn191ym8.fsf@kaka.sjd.se> Hi See attached patch that adds sntrup761. What do you think? My use case is to enable implementation of OpenSSH's sntrup761x25519-sha512 in libssh/libssh2. Specific open issues: - Documentation - Benchmarking self-test - Self-tests that validate test vectors Not trivial because the algorithm is randomized, so we would have to use deterministic randomness somehow -- and to use an deterministic algorithm for which there exists sntrup761 test vectors (DRBG-CTR is the only one I am aware of, but far from ideal). - API design - Are gcry_kem_open/gcry_kem_close useful? They complicate implementation for no gain for sntrup761, but could be useful for other KEM's, OTOH they may just complicate it for all KEM's since I believe the KEM APIs are fairly established these days. - The pubkey parameter for KEM-Enc could be stored in the handle, as could the seckey parameter for KEM-Dec. This would make the gcry_kem_open/gcry_kem_close more useful, however it would mean more memory zeroization issues. - The #define's for output lengths could be functions, similar to other libgcrypt APIs. This makes it harder to use statically allocated buffers, so I think the current #define's are useful. /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Add-Streamlined-NTRU-Prime-sntrup761.patch Type: text/x-diff Size: 41181 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From simon at josefsson.org Tue May 16 08:09:23 2023 From: simon at josefsson.org (Simon Josefsson) Date: Tue, 16 May 2023 08:09:23 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: <3414868.jSmoDxJbhA@tauon.chronox.de> (Stephan Mueller's message of "Mon, 15 May 2023 17:51:12 +0200") References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87h6sd1u50.fsf@kaka.sjd.se> <3414868.jSmoDxJbhA@tauon.chronox.de> Message-ID: <87353w24fg.fsf@kaka.sjd.se> Stephan Mueller writes: > Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt- > devel: > > Hi Simon, > >> Does kyber have any requirements on the API that wouldn't work well with >> any of these? > > I am experimenting with Kyber in [1]. For KEM, your API would work. Thanks for confirming this! Looking at the code, it seems Kyber KEM has exactly the same API as sntrup761, which probably was a NIST PQCS requirement, and we should expect that other KEM's follow a similar approach. I think that sntrup761 can be added to libgcrypt now since it has been stable since 2017, but I'm less sure about Kyber since it is stuck in the NIST process -- aren't there some risk that NIST will modify the parameters again? > There you see that I use an additional parameter, an RNG context. This allows > me to also derive Kyber keys straight from a KDF (which is accessed like an > RNG context). But that is not really needed. 
Right, I use the RNG context internally in sntrup761.c as well, but I don't think it should be exposed to libgcrypt callers. The internal RNG context will be useful for self-testing. This is especially true since I think test vectors for KEM's are implementation-specific: if you optimize the implementation to re-order RNG calls, the test vectors will no longer work. Thus, you can't really do black-box testing with KEM KATs. The libgcrypt selftest() approach is perfectly suited for doing a whitebox test internally though. > However, how do you propose to handle the KEX scenario? See [2] for the full > Kyber KEX exchange and the API. I think the KEX is much more important than > the KEM, as the KEX is conceptually what is DH today. Kyber KEM can be used in > an integrated encryption schema as suggested in [3]. > > Unfortunately, the Kyber KEX cannot be acting as a direct replacement for DH. > Due to its 7 total steps. However, it is possible to coalescing all of them > into 2 handshake network exchanges and one final data blob that is sent along > with the already encrypted first payload. I think this should be through a completely different API than for KEM or public-key encrypt/decrypt, and an API that is customized for the KEX functionality. The properties are different from existing APIs, similar to how AEAD ciphers differs from ECB ciphers, and how KDF differs from MAC/hashes. Also compare how libgcrypt contains an API for X25519/X448 curve operations. /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From simon at josefsson.org Tue May 16 08:56:08 2023 From: simon at josefsson.org (Simon Josefsson) Date: Tue, 16 May 2023 08:56:08 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. In-Reply-To: <87wn191ym8.fsf@kaka.sjd.se> (Simon Josefsson via Gcrypt-devel's message of "Mon, 15 May 2023 16:02:39 +0200") References: <87wn191ym8.fsf@kaka.sjd.se> Message-ID: <87y1lozrw7.fsf@kaka.sjd.se> Hi Attached is a second version of the sntrup761 patch, this time using a minimal API that would work for Kyber too (please confirm). Unless we know complexity is required, I prefer to keep things minimal. I've pushed it to: https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2 Below is the added API. Thoughts? enum gcry_kem_algos { GCRY_KEM_SNTRUP761 = 761, }; #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32 gcry_error_t gcry_kem_keypair (int algo, void *pubkey, void *seckey); gcry_error_t gcry_kem_enc (int algo, const void *pubkey, void *ciphertext, void *ss); gcry_error_t gcry_kem_dec (int algo, const void *ciphertext, const void *seckey, void *ss); /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Add-Streamlined-NTRU-Prime-sntrup761.patch Type: text/x-diff Size: 38454 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From falko.strenzke at mtg.de Tue May 16 09:16:13 2023 From: falko.strenzke at mtg.de (Falko Strenzke) Date: Tue, 16 May 2023 09:16:13 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. 
In-Reply-To: <87y1lozrw7.fsf@kaka.sjd.se> References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se> Message-ID: <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de> Hi Simon, Am 16.05.23 um 08:56 schrieb Simon Josefsson via Gcrypt-devel: > Hi > > Attached is a second version of the sntrup761 patch, this time using a > minimal API that would work for Kyber too (please confirm). Unless we > know complexity is required, I prefer to keep things minimal. > > I've pushed it to: > https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2 > > Below is the added API. Thoughts? > > enum gcry_kem_algos > { > GCRY_KEM_SNTRUP761 = 761, > }; > > #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 > #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 > #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 > #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32 > > gcry_error_t gcry_kem_keypair (int algo, > void *pubkey, > void *seckey); > > gcry_error_t gcry_kem_enc (int algo, > const void *pubkey, > void *ciphertext, > void *ss); > > gcry_error_t gcry_kem_dec (int algo, > const void *ciphertext, > const void *seckey, > void *ss); I think this is already going into the right direction. However, I have some proposals: 1. I would prefer a more type safe API: distinct public and private key objects instead of void pointers, i.e gcry_kem_public_key_t and gcry_kem_private_key_t. From your proposed API it does not become clear if pubkey and seckey are objects or just byte arrays. Since instantiating a key from a byte array may involve some precomputations (imagine for instance instantiating a private key from a PRNG seed), for efficiency reasons it is in my view necessary to have public and private key objects. 2. Also the enum should by typedef'd and used with its type in the function signature. 3. There is no need to provide algo again in the enc/dec functions. A key object will know it's algorithm. (Probably this is due to key void pointers meant as byte arrays) 4. All input/output byte arrays should be typed as uint8_t* and be passed in with their lengths. If without lengths, client code will be prone to memory access errors. 5. Then we will also need extra functions for serialization and deserialization of keys. - Falko > > /Simon > > _______________________________________________ > Gcrypt-devel mailing list > Gcrypt-devel at gnupg.org > https://lists.gnupg.org/mailman/listinfo/gcrypt-devel -- *MTG AG* Dr. Falko Strenzke Executive System Architect Phone: +49 6151 8000 24 E-Mail: falko.strenzke at mtg.de Web: mtg.de *MTG Exhibitions ? See you in 2023* ------------------------------------------------------------------------ MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany Commercial register: HRB 8901 Register Court: Amtsgericht Darmstadt Management Board: J?rgen Ruf (CEO), Tamer Kemer?z Chairman of the Supervisory Board: Dr. Thomas Milde This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted. Data protection information: Privacy policy -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bl8h4Ks4fuYl1XEo.png Type: image/png Size: 5256 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: xsNU0Yk5v78SRIzb.png Type: image/png Size: 4906 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: S/MIME Cryptographic Signature URL: From smueller at chronox.de Tue May 16 09:58:40 2023 From: smueller at chronox.de (Stephan Mueller) Date: Tue, 16 May 2023 09:58:40 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. In-Reply-To: <87y1lozrw7.fsf@kaka.sjd.se> References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se> Message-ID: <4957427.z1Mmn1aDQA@tauon.chronox.de> Am Dienstag, 16. Mai 2023, 08:56:08 CEST schrieb Simon Josefsson via Gcrypt- devel: Hi Simon, > Hi > > Attached is a second version of the sntrup761 patch, this time using a > minimal API that would work for Kyber too (please confirm). Unless we > know complexity is required, I prefer to keep things minimal. > > I've pushed it to: > https://gitlab.com/jas/libgcrypt/-/commits/jas/sntrup761v2 > > Below is the added API. Thoughts? > > enum gcry_kem_algos > { > GCRY_KEM_SNTRUP761 = 761, > }; > > #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 > #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 > #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 > #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32 > > gcry_error_t gcry_kem_keypair (int algo, > void *pubkey, > void *seckey); > > gcry_error_t gcry_kem_enc (int algo, > const void *pubkey, > void *ciphertext, > void *ss); May I suggest to add another parameter: size_t ss_len which shall specify the caller-requested size of ss? > > gcry_error_t gcry_kem_dec (int algo, > const void *ciphertext, > const void *seckey, > void *ss); Same here. Kyber uses a KDF as the last step. I am aware of the fact that the Kyber reference implementation returns 32 bytes statically. However, considering the use of a true KDF which has the property of a pseudorandom behavior (either SHAKE256 or AES-CTR is used), the KDF can produce arbitrary amounts of data. By specifying an ss_len parameter, the caller can directly request the data that may be needed as a key/IV/mac Key or similar for subsequent cipher operations. In [1] and [2], I use such an ss_len parameter which in turn serves me well for production use cases. [1] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/ lc_kyber.h#L149 [2] https://github.com/smuellerDD/leancrypto/blob/master/kem/api/ lc_kyber.h#L167 Thanks a lot Stephan From smueller at chronox.de Tue May 16 10:07:30 2023 From: smueller at chronox.de (Stephan Mueller) Date: Tue, 16 May 2023 10:07:30 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: <87353w24fg.fsf@kaka.sjd.se> References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <3414868.jSmoDxJbhA@tauon.chronox.de> <87353w24fg.fsf@kaka.sjd.se> Message-ID: <3134498.TlHfbrAK3V@tauon.chronox.de> Am Dienstag, 16. Mai 2023, 08:09:23 CEST schrieb Simon Josefsson: Hi Simon, > Stephan Mueller writes: > > Am Montag, 15. Mai 2023, 17:39:23 CEST schrieb Simon Josefsson via Gcrypt- > > devel: > > > > Hi Simon, > > > >> Does kyber have any requirements on the API that wouldn't work well with > >> any of these? > > > > I am experimenting with Kyber in [1]. For KEM, your API would work. > > Thanks for confirming this! Looking at the code, it seems Kyber KEM has > exactly the same API as sntrup761, which probably was a NIST PQCS > requirement, and we should expect that other KEM's follow a similar > approach. 
> > I think that sntrup761 can be added to libgcrypt now since it has been > stable since 2017, but I'm less sure about Kyber since it is stuck in > the NIST process -- aren't there some risk that NIST will modify the > parameters again? I have no insight into the process. I expect, though, that only Kyber/ Dilithium with security strength of 256 bits will be allowed. I would not expect that internal parameters would change, though. However, NIAP / NSA now starts mandating Kyber / Dilithium with 256 bits strength as a replacement for *all* general-purpose asymmetric algorithms by 2035. There are no options for other algorithms! This is now spawning discussions especially around the network protocols. Especially Kyber KEX is no direct-fit replacement for DH which implies that all network protocols must change. I.e. RSA, (EC)DSA, (EC)DH shall be completely replaced by Kyber and Dilithium. This is also an interesting catch-22 for NIST's competition. NIST did not decide yet, but it may be hard for them to ignore the new ruling my NSA. > > > There you see that I use an additional parameter, an RNG context. This > > allows me to also derive Kyber keys straight from a KDF (which is > > accessed like an RNG context). But that is not really needed. > > Right, I use the RNG context internally in sntrup761.c as well, but I > don't think it should be exposed to libgcrypt callers. I can live with that, no doubts. But it makes life (at least for keygen) significantly easier :-) Anyhow, considering that libgcrypt also wants to comply with FIPS rules, it is not permissible to allow the user to specify the rng context. So, your approach fits even the FIPS considerations (whereas mine does not). > The internal RNG > context will be useful for self-testing. This is especially true since > I think test vectors for KEM's are implementation-specific: if you > optimize the implementation to re-order RNG calls, the test vectors will > no longer work. Thus, you can't really do black-box testing with KEM > KATs. The libgcrypt selftest() approach is perfectly suited for doing a > whitebox test internally though. Agreed. > > > However, how do you propose to handle the KEX scenario? See [2] for the > > full Kyber KEX exchange and the API. I think the KEX is much more > > important than the KEM, as the KEX is conceptually what is DH today. > > Kyber KEM can be used in an integrated encryption schema as suggested in > > [3]. > > > > Unfortunately, the Kyber KEX cannot be acting as a direct replacement for > > DH. Due to its 7 total steps. However, it is possible to coalescing all > > of them into 2 handshake network exchanges and one final data blob that > > is sent along with the already encrypted first payload. > > I think this should be through a completely different API than for KEM > or public-key encrypt/decrypt, and an API that is customized for the KEX > functionality. The properties are different from existing APIs, similar > to how AEAD ciphers differs from ECB ciphers, and how KDF differs from > MAC/hashes. Also compare how libgcrypt contains an API for X25519/X448 > curve operations. Sounds good from my side. Ciao Stephan From falko.strenzke at mtg.de Tue May 16 15:00:23 2023 From: falko.strenzke at mtg.de (Falko Strenzke) Date: Tue, 16 May 2023 15:00:23 +0200 Subject: Code formatting for libgcrypt Message-ID: Is there any official code formatting style for any formatting tool available for libgcrypt? If no official one, an unofficial one that matches the existing code quite well? - Falko -- *MTG AG* Dr. 
Falko Strenzke Executive System Architect Phone: +49 6151 8000 24 E-Mail: falko.strenzke at mtg.de Web: mtg.de *MTG Exhibitions – See you in 2023* ------------------------------------------------------------------------ MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany Commercial register: HRB 8901 Register Court: Amtsgericht Darmstadt Management Board: Jürgen Ruf (CEO), Tamer Kemeröz Chairman of the Supervisory Board: Dr. Thomas Milde This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted. Data protection information: Privacy policy -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: eYWhdZgr2HrMZnnH.png Type: image/png Size: 5256 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: BkASwHg58wey7eBI.png Type: image/png Size: 4906 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4764 bytes Desc: S/MIME Cryptographic Signature URL: From wk at gnupg.org Tue May 16 16:38:42 2023 From: wk at gnupg.org (Werner Koch) Date: Tue, 16 May 2023 16:38:42 +0200 Subject: Code formatting for libgcrypt In-Reply-To: (Falko Strenzke's message of "Tue, 16 May 2023 15:00:23 +0200") References: Message-ID: <87h6scqr2l.fsf@wheatstone.g10code.de> On Tue, 16 May 2023 15:00, Falko Strenzke said: > Is there any official code formatting style for any formatting tool available > for libgcrypt? If no official one, an unofficial one that matches the existing > code quite well? Please see: gnupg/doc/HACKING Shalom-Salam, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From wk at gnupg.org Tue May 16 17:52:53 2023 From: wk at gnupg.org (Werner Koch) Date: Tue, 16 May 2023 17:52:53 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. In-Reply-To: <87wn191ym8.fsf@kaka.sjd.se> (Simon Josefsson via Gcrypt-devel's message of "Mon, 15 May 2023 16:02:39 +0200") References: <87wn191ym8.fsf@kaka.sjd.se> Message-ID: <871qjgqnmy.fsf@wheatstone.g10code.de> Hi! > My use case is to enable implementation of OpenSSH's > sntrup761x25519-sha512 in libssh/libssh2. Given that OpenSSH starts to move in that direction, it is a good idea to add support to Libgcrypt. After all, we want gpg-agent to be able to work with those algorithms as well. > - Are gcry_kem_open/gcry_kem_close useful? They complicate > implementation for no gain for sntrup761, but could be useful for > other KEM's, OTOH they may just complicate it for all KEM's since I > believe the KEM APIs are fairly established these days. I have not yet analyzed your needs, but I think that this new API is not needed because we have KEM functions already implemented in the pubkey API. Instead of a new separate API it should be sufficient to make use of the general idea of gcry_ctx_t. Right now we use such a context only for KATs and to implement custom EC functions.
The context object was actually implemented to add state to the public key functions and to allow the provisioning of larger parameters by associating them with an s-expression. A context is also a way to implement n-way processing within Libgcrypt. Salam-Shalom, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From simon at josefsson.org Fri May 19 23:37:31 2023 From: simon at josefsson.org (Simon Josefsson) Date: Fri, 19 May 2023 23:37:31 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. In-Reply-To: <871qjgqnmy.fsf@wheatstone.g10code.de> (Werner Koch via Gcrypt-devel's message of "Tue, 16 May 2023 17:52:53 +0200") References: <87wn191ym8.fsf@kaka.sjd.se> <871qjgqnmy.fsf@wheatstone.g10code.de> Message-ID: <87v8goxask.fsf@kaka.sjd.se> Werner Koch via Gcrypt-devel writes: > I have not yet anaylyzed your needs but I think that this new API is not > needed because we have KEM functions already implemented in the pubkey > API. Do you mean these? -- Function: gcry_error_t gcry_pk_genkey (gcry_sexp_t *R_KEY, gcry_sexp_t PARMS) -- Function: gcry_error_t gcry_pk_encrypt (gcry_sexp_t *R_CIPH, gcry_sexp_t DATA, gcry_sexp_t PKEY) -- Function: gcry_error_t gcry_pk_decrypt (gcry_sexp_t *R_PLAIN, gcry_sexp_t DATA, gcry_sexp_t SKEY) I think these are poorly suited for modern KEM's like sntrup761. They are all now byte-oriented, not MPI/sexp. KEM's use of public/private keys are ephemeral, like diffie-hellman, so they are different than long-term keys. I think this is comparable to the separate APIs introduced for X25519: -- Function: gpg_error_t gcry_ecc_mul_point (int CURVEID, unsigned char *RESULT, const unsigned char *SCALAR, const unsigned char *POINT) Using MPI's to store byte-values lead to a security concern in RFC 8731, since MPI's encode different byte-values in different length depending on the content. I haven't checked if libgcrypt would be vulnerable to the same problem, but type-overloading is not safe. Maybe you could take a second look on the API I proposed below? It matches the API that several modern KEM's uses. Yes this would make KEM's a special animal that is not compatible with other public/private-key stuff in libgcrypt, but I think that is actually a good thing. enum gcry_kem_algos { GCRY_KEM_SNTRUP761 = 761, }; #define GCRY_KEM_SNTRUP761_SECRETKEY_SIZE 1763 #define GCRY_KEM_SNTRUP761_PUBLICKEY_SIZE 1158 #define GCRY_KEM_SNTRUP761_CIPHERTEXT_SIZE 1039 #define GCRY_KEM_SNTRUP761_SHAREDSECRET_SIZE 32 gcry_error_t gcry_kem_keypair (int algo, void *pubkey, void *seckey); gcry_error_t gcry_kem_enc (int algo, const void *pubkey, void *ciphertext, void *ss); gcry_error_t gcry_kem_dec (int algo, const void *ciphertext, const void *seckey, void *ss); /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From simon at josefsson.org Fri May 19 23:47:39 2023 From: simon at josefsson.org (Simon Josefsson) Date: Fri, 19 May 2023 23:47:39 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. 
In-Reply-To: <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de> (Falko Strenzke's message of "Tue, 16 May 2023 09:16:13 +0200") References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se> <2b3979dc-9a89-4755-9e4c-727263b27bb5@mtg.de> Message-ID: <87o7mgxabo.fsf@kaka.sjd.se> Falko Strenzke writes: > I think this is already going into the right direction. However, I > have some proposals: Thank you for feedback! > 1. I would prefer a more type safe API: distinct public and private > key objects instead of void pointers, i.e gcry_kem_public_key_t and > gcry_kem_private_key_t. From your proposed API it does not become > clear if pubkey and seckey are objects or just byte arrays. Since > instantiating a key from a byte array may involve some precomputations > (imagine for instance instantiating a private key from a PRNG seed), > for efficiency reasons it is in my view necessary to have public and > private key objects. This is a trade-off, and my rationale was that I prefer doing byte-oriented APIs since that seems to what all modern KEM's are using (including Kyber?). And for some reason byte-strings are passed as 'void*' in libgcrypt, so I followed that style. There should be documentation explaining this. I think the core decision should be to use 1) byte-oriented API or 2) some higher-level representation like MPI/sexp. The API types follow from that decision. I agree with you that 'void*' is not nice, but it seems like the libgcrypt idiom. However you make me believe we could use uint8_t here? My KEM API is not similar to other parts of libgcrypt anyway, so we don't have to repeat using 'void*' for data. > 2. Also the enum should by typedef'd and used with its type in the > function signature. My use was modeled in existing uses like 'enum gcry_cipher_algos', 'enum gcry_pk_algos' etc. I do agree with you, but I think consistency is also important. > 3. There is no need to provide algo again in the enc/dec functions. A > key object will know it's algorithm. (Probably this is due to key void > pointers meant as byte arrays) I think either a context handle or algorithm identifier is needed. The key parameter is just a opaque byte array, it doesn't know its algorithm. > 4. All input/output byte arrays should be typed as uint8_t* and be > passed in with their lengths. If without lengths, client code will be > prone to memory access errors. That was my first version of the API. It felt useless to have all these size_t lengths and checks for them since they were all fixed strings anyway. Let's see where all other issues end up, if this is still relevant. > 5. Then we will also need extra functions for serialization and > deserialization of keys. My approach uses raw keys directly, so this is included. /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From simon at josefsson.org Fri May 19 23:52:00 2023 From: simon at josefsson.org (Simon Josefsson) Date: Fri, 19 May 2023 23:52:00 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. 
In-Reply-To: <4957427.z1Mmn1aDQA@tauon.chronox.de> (Stephan Mueller's message of "Tue, 16 May 2023 09:58:40 +0200") References: <87wn191ym8.fsf@kaka.sjd.se> <87y1lozrw7.fsf@kaka.sjd.se> <4957427.z1Mmn1aDQA@tauon.chronox.de> Message-ID: <87jzx4xa4f.fsf@kaka.sjd.se> Stephan Mueller writes: >> gcry_error_t gcry_kem_enc (int algo, >> const void *pubkey, >> void *ciphertext, >> void *ss); > > May I suggest to add another parameter: size_t ss_len which shall specify the > caller-requested size of ss? Is that to support variable-length outputs? Or just to indicate the buffer size? Does kyber or some other popular KEM supports variable-length outputs? >> gcry_error_t gcry_kem_dec (int algo, >> const void *ciphertext, >> const void *seckey, >> void *ss); > > Same here. > > Kyber uses a KDF as the last step. I am aware of the fact that the Kyber > reference implementation returns 32 bytes statically. However, considering the > use of a true KDF which has the property of a pseudorandom behavior (either > SHAKE256 or AES-CTR is used), the KDF can produce arbitrary amounts of data. > By specifying an ss_len parameter, the caller can directly request the data > that may be needed as a key/IV/mac Key or similar for subsequent cipher > operations. What does the specification says? Is kyber specified as a variable-length output, or output of 32 bytes? One approach is to have another API for that use-case: gcry_error_t gcry_kem_enc_kdf (int algo, const void *pubkey, void *ciphertext, size_t sslen, void *ss); gcry_error_t gcry_kem_dec_kdf (int algo, const void *ciphertext, const void *seckey, size_t sslen, void *ss); /Simon -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 255 bytes Desc: not available URL: From smueller at chronox.de Sun May 21 16:30:24 2023 From: smueller at chronox.de (Stephan =?ISO-8859-1?Q?M=FCller?=) Date: Sun, 21 May 2023 16:30:24 +0200 Subject: [PATCH] Add Streamlined NTRU Prime sntrup761. In-Reply-To: <87jzx4xa4f.fsf@kaka.sjd.se> References: <87wn191ym8.fsf@kaka.sjd.se> <4957427.z1Mmn1aDQA@tauon.chronox.de> <87jzx4xa4f.fsf@kaka.sjd.se> Message-ID: <4836093.31r3eYUQgx@positron.chronox.de> Am Freitag, 19. Mai 2023, 23:52:00 CEST schrieb Simon Josefsson: Hi Simon, > Stephan Mueller writes: > >> gcry_error_t gcry_kem_enc (int algo, > >> > >> const void *pubkey, > >> void *ciphertext, > >> void *ss); > > > > May I suggest to add another parameter: size_t ss_len which shall specify > > the caller-requested size of ss? > > Is that to support variable-length outputs? Or just to indicate the > buffer size? Does kyber or some other popular KEM supports > variable-length outputs? Short answer: this is to indicate variable length outputs. Long answer: Kyber KEM defines as its last step a KDF using either SHAKE256. Both KDFs allow the caller to request an arbitrary output size. Yet, the sample source code generates hard-coded 32 bytes. To avoid waste of CPU cycles and considering that both KDF operations are defined as pseudorandom operations (see SP800-185 for SHAKE), I personally think that this KDF should be asked to generate exakt those number of bytes that are needed. This implies that I added the ss_len parameter to my API set to advertise that this helps preventing the waste of precious CPU cycles. > > >> gcry_error_t gcry_kem_dec (int algo, > >> > >> const void *ciphertext, > >> const void *seckey, > >> void *ss); > > > > Same here. > > > > Kyber uses a KDF as the last step. 
I am aware of the fact that the Kyber > > reference implementation returns 32 bytes statically. However, considering > > the use of a true KDF which has the property of a pseudorandom behavior > > (either SHAKE256 or AES-CTR is used), the KDF can produce arbitrary > > amounts of data. By specifying an ss_len parameter, the caller can > > directly request the data that may be needed as a key/IV/mac Key or > > similar for subsequent cipher operations. > > What does the specification says? Is kyber specified as a > variable-length output, or output of 32 bytes? The specification contains the following words regarding the KDF: """ As a modification in round-2, we decided to derive the final key using SHAKE-256 instead of SHA3-256. This is an advantage for protocols that need keys of more than 256 bits. Instead of first requesting a 256-bit key from Kyber and then expanding it, they can pass an additional key-length parameter to Kyber and obtain a key of the desired length. This feature is not supported by the NIST API, so in our implementations we set the keylength to a fixed length of 32 bytes in api.h. """ So, the authors deem it acceptable to specify an ss_len. > > One approach is to have another API for that use-case: > > gcry_error_t gcry_kem_enc_kdf (int algo, > const void *pubkey, > void *ciphertext, > size_t sslen, void *ss); > gcry_error_t gcry_kem_dec_kdf (int algo, > const void *ciphertext, > const void *seckey, > size_t sslen, void *ss); Fine with me, too > > /Simon Ciao Stephan From canadamax at proton.me Fri May 26 20:30:51 2023 From: canadamax at proton.me (Max Blanco) Date: Fri, 26 May 2023 18:30:51 +0000 Subject: bug report: trouble compiling libgcrypt with libgpg-error-1.47 Message-ID: Hello, I have trouble compiling libgcrypt on top of libgpg-error-1.47. If I try libgcrypt-1.8.10 the configure script tells me I need libgpg-error > 1.25 (!!). If I try libgcrypt-1.10.1 the computer throws an error in the test regime at ec-nist.c in function '_gcry_mpi_ec_nist192_mod'. The system is a legacy Intel Core Duo running Mac OS X 10.6.8. This behaviour is puzzling. Any suggestions are welcome. Sent with Proton Mail secure email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: publickey - canadamax at proton.me - 0xD17D3B5F.asc Type: application/pgp-keys Size: 653 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 249 bytes Desc: OpenPGP digital signature URL: From jussi.kivilinna at iki.fi Sun May 28 16:53:55 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 28 May 2023 17:53:55 +0300 Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption Message-ID: <20230528145355.532424-1-jussi.kivilinna@iki.fi> * cipher/rijndael-aesni.c (aesni_ocb_checksum): Remove. (aesni_ocb_dec): Add inline checksumming. -- Inline checksumming is far faster on Ryzen processors on i386 builds than two-pass checksumming.
Benchmark on AMD Ryzen 9 7900X (i386): Before: AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz OCB dec | 0.180 ns/B 5292 MiB/s 0.847 c/B 4700 After (~2x faster): AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz OCB dec | 0.091 ns/B 10491 MiB/s 0.427 c/B 4700 Signed-off-by: Jussi Kivilinna --- cipher/rijndael-aesni.c | 220 ++++++++-------------------------------- 1 file changed, 43 insertions(+), 177 deletions(-) diff --git a/cipher/rijndael-aesni.c b/cipher/rijndael-aesni.c index 906737a6..b33ef7ed 100644 --- a/cipher/rijndael-aesni.c +++ b/cipher/rijndael-aesni.c @@ -2710,174 +2710,6 @@ _gcry_aes_aesni_cbc_dec (RIJNDAEL_context *ctx, unsigned char *iv, } -static ASM_FUNC_ATTR_INLINE void -aesni_ocb_checksum (gcry_cipher_hd_t c, const unsigned char *plaintext, - size_t nblocks) -{ - RIJNDAEL_context *ctx = (void *)&c->context.c; - - /* Calculate checksum */ - asm volatile ("movdqu %[checksum], %%xmm6\n\t" - "pxor %%xmm1, %%xmm1\n\t" - "pxor %%xmm2, %%xmm2\n\t" - "pxor %%xmm3, %%xmm3\n\t" - : - :[checksum] "m" (*c->u_ctr.ctr) - : "memory" ); - - if (0) {} -#if defined(HAVE_GCC_INLINE_ASM_AVX2) - else if (nblocks >= 16 && ctx->use_avx2) - { - /* Use wider 256-bit registers for fast xoring of plaintext. */ - asm volatile ("vzeroupper\n\t" - "vpxor %%xmm0, %%xmm0, %%xmm0\n\t" - "vpxor %%xmm4, %%xmm4, %%xmm4\n\t" - "vpxor %%xmm5, %%xmm5, %%xmm5\n\t" - "vpxor %%xmm7, %%xmm7, %%xmm7\n\t" - : - : - : "memory"); - - for (;nblocks >= 16; nblocks -= 16) - { - asm volatile ("vpxor %[ptr0], %%ymm6, %%ymm6\n\t" - "vpxor %[ptr1], %%ymm1, %%ymm1\n\t" - "vpxor %[ptr2], %%ymm2, %%ymm2\n\t" - "vpxor %[ptr3], %%ymm3, %%ymm3\n\t" - : - : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE * 2)), - [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE * 2)), - [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE * 2)), - [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE * 2)) - : "memory" ); - asm volatile ("vpxor %[ptr4], %%ymm0, %%ymm0\n\t" - "vpxor %[ptr5], %%ymm4, %%ymm4\n\t" - "vpxor %[ptr6], %%ymm5, %%ymm5\n\t" - "vpxor %[ptr7], %%ymm7, %%ymm7\n\t" - : - : [ptr4] "m" (*(plaintext + 4 * BLOCKSIZE * 2)), - [ptr5] "m" (*(plaintext + 5 * BLOCKSIZE * 2)), - [ptr6] "m" (*(plaintext + 6 * BLOCKSIZE * 2)), - [ptr7] "m" (*(plaintext + 7 * BLOCKSIZE * 2)) - : "memory" ); - plaintext += BLOCKSIZE * 16; - } - - asm volatile ("vpxor %%ymm0, %%ymm6, %%ymm6\n\t" - "vpxor %%ymm4, %%ymm1, %%ymm1\n\t" - "vpxor %%ymm5, %%ymm2, %%ymm2\n\t" - "vpxor %%ymm7, %%ymm3, %%ymm3\n\t" - "vextracti128 $1, %%ymm6, %%xmm0\n\t" - "vextracti128 $1, %%ymm1, %%xmm4\n\t" - "vextracti128 $1, %%ymm2, %%xmm5\n\t" - "vextracti128 $1, %%ymm3, %%xmm7\n\t" - "vpxor %%xmm0, %%xmm6, %%xmm6\n\t" - "vpxor %%xmm4, %%xmm1, %%xmm1\n\t" - "vpxor %%xmm5, %%xmm2, %%xmm2\n\t" - "vpxor %%xmm7, %%xmm3, %%xmm3\n\t" - "vzeroupper\n\t" - : - : - : "memory" ); - } -#endif -#if defined(HAVE_GCC_INLINE_ASM_AVX) - else if (nblocks >= 16 && ctx->use_avx) - { - /* Same as AVX2, except using 256-bit floating point instructions. 
*/ - asm volatile ("vzeroupper\n\t" - "vxorpd %%xmm0, %%xmm0, %%xmm0\n\t" - "vxorpd %%xmm4, %%xmm4, %%xmm4\n\t" - "vxorpd %%xmm5, %%xmm5, %%xmm5\n\t" - "vxorpd %%xmm7, %%xmm7, %%xmm7\n\t" - : - : - : "memory"); - - for (;nblocks >= 16; nblocks -= 16) - { - asm volatile ("vxorpd %[ptr0], %%ymm6, %%ymm6\n\t" - "vxorpd %[ptr1], %%ymm1, %%ymm1\n\t" - "vxorpd %[ptr2], %%ymm2, %%ymm2\n\t" - "vxorpd %[ptr3], %%ymm3, %%ymm3\n\t" - : - : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE * 2)), - [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE * 2)), - [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE * 2)), - [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE * 2)) - : "memory" ); - asm volatile ("vxorpd %[ptr4], %%ymm0, %%ymm0\n\t" - "vxorpd %[ptr5], %%ymm4, %%ymm4\n\t" - "vxorpd %[ptr6], %%ymm5, %%ymm5\n\t" - "vxorpd %[ptr7], %%ymm7, %%ymm7\n\t" - : - : [ptr4] "m" (*(plaintext + 4 * BLOCKSIZE * 2)), - [ptr5] "m" (*(plaintext + 5 * BLOCKSIZE * 2)), - [ptr6] "m" (*(plaintext + 6 * BLOCKSIZE * 2)), - [ptr7] "m" (*(plaintext + 7 * BLOCKSIZE * 2)) - : "memory" ); - plaintext += BLOCKSIZE * 16; - } - - asm volatile ("vxorpd %%ymm0, %%ymm6, %%ymm6\n\t" - "vxorpd %%ymm4, %%ymm1, %%ymm1\n\t" - "vxorpd %%ymm5, %%ymm2, %%ymm2\n\t" - "vxorpd %%ymm7, %%ymm3, %%ymm3\n\t" - "vextractf128 $1, %%ymm6, %%xmm0\n\t" - "vextractf128 $1, %%ymm1, %%xmm4\n\t" - "vextractf128 $1, %%ymm2, %%xmm5\n\t" - "vextractf128 $1, %%ymm3, %%xmm7\n\t" - "vxorpd %%xmm0, %%xmm6, %%xmm6\n\t" - "vxorpd %%xmm4, %%xmm1, %%xmm1\n\t" - "vxorpd %%xmm5, %%xmm2, %%xmm2\n\t" - "vxorpd %%xmm7, %%xmm3, %%xmm3\n\t" - "vzeroupper\n\t" - : - : - : "memory" ); - } -#endif - - for (;nblocks >= 4; nblocks -= 4) - { - asm volatile ("movdqu %[ptr0], %%xmm0\n\t" - "movdqu %[ptr1], %%xmm4\n\t" - "movdqu %[ptr2], %%xmm5\n\t" - "movdqu %[ptr3], %%xmm7\n\t" - "pxor %%xmm0, %%xmm6\n\t" - "pxor %%xmm4, %%xmm1\n\t" - "pxor %%xmm5, %%xmm2\n\t" - "pxor %%xmm7, %%xmm3\n\t" - : - : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE)), - [ptr1] "m" (*(plaintext + 1 * BLOCKSIZE)), - [ptr2] "m" (*(plaintext + 2 * BLOCKSIZE)), - [ptr3] "m" (*(plaintext + 3 * BLOCKSIZE)) - : "memory" ); - plaintext += BLOCKSIZE * 4; - } - - for (;nblocks >= 1; nblocks -= 1) - { - asm volatile ("movdqu %[ptr0], %%xmm0\n\t" - "pxor %%xmm0, %%xmm6\n\t" - : - : [ptr0] "m" (*(plaintext + 0 * BLOCKSIZE)) - : "memory" ); - plaintext += BLOCKSIZE; - } - - asm volatile ("pxor %%xmm1, %%xmm6\n\t" - "pxor %%xmm2, %%xmm6\n\t" - "pxor %%xmm3, %%xmm6\n\t" - "movdqu %%xmm6, %[checksum]\n\t" - : [checksum] "=m" (*c->u_ctr.ctr) - : - : "memory" ); -} - - static unsigned int ASM_FUNC_ATTR_NOINLINE aesni_ocb_enc (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks) @@ -3401,9 +3233,11 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, /* Preload Offset */ asm volatile ("movdqu %[iv], %%xmm5\n\t" - : /* No output */ - : [iv] "m" (*c->u_iv.iv) - : "memory" ); + "movdqu %[ctr], %%xmm7\n\t" + : /* No output */ + : [iv] "m" (*c->u_iv.iv), + [ctr] "m" (*c->u_ctr.ctr) + : "memory" ); for ( ;nblocks && n % 4; nblocks-- ) { @@ -3424,6 +3258,7 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, asm volatile ("pxor %%xmm5, %%xmm0\n\t" "movdqu %%xmm0, %[outbuf]\n\t" + "pxor %%xmm0, %%xmm7\n\t" : [outbuf] "=m" (*outbuf) : : "memory" ); @@ -3452,6 +3287,15 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, "pxor %[first_key], %%xmm5\n\t" "pxor %[first_key], %%xmm0\n\t" "movdqa %%xmm0, %[lxfkey]\n\t" + /* Clear plaintext blocks */ + "pxor %%xmm1, %%xmm1\n\t" + "pxor %%xmm2, %%xmm2\n\t" + "pxor %%xmm3, %%xmm3\n\t" + "pxor %%xmm4, 
%%xmm4\n\t" + "pxor %%xmm8, %%xmm8\n\t" + "pxor %%xmm9, %%xmm9\n\t" + "pxor %%xmm10, %%xmm10\n\t" + "pxor %%xmm11, %%xmm11\n\t" : [lxfkey] "=m" (*lxf_key) : [l0] "m" (*c->u_mode.ocb.L[0]), [last_key] "m" (ctx->keyschdec[ctx->rounds][0][0]), @@ -3463,7 +3307,9 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, n += 4; l = aes_ocb_get_l(c, n); - asm volatile ("movdqu %[l0l1], %%xmm10\n\t" + asm volatile ("pxor %%xmm10, %%xmm1\n\t" + "pxor %%xmm11, %%xmm2\n\t" + "movdqu %[l0l1], %%xmm10\n\t" "movdqu %[l1], %%xmm11\n\t" "movdqu %[l3], %%xmm15\n\t" : @@ -3477,7 +3323,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ /* P_i = Offset_i xor ENCIPHER(K, C_i xor Offset_i) */ - asm volatile ("movdqu %[inbuf0], %%xmm1\n\t" + asm volatile ("pxor %%xmm1, %%xmm4\n\t" + "pxor %%xmm2, %%xmm8\n\t" + "pxor %%xmm3, %%xmm9\n\t" + "movdqu %[inbuf0], %%xmm1\n\t" "movdqu %[inbuf1], %%xmm2\n\t" "movdqu %[inbuf2], %%xmm3\n\t" : @@ -3485,8 +3334,11 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, [inbuf1] "m" (*(inbuf + 1 * BLOCKSIZE)), [inbuf2] "m" (*(inbuf + 2 * BLOCKSIZE)) : "memory" ); - asm volatile ("movdqu %[inbuf3], %%xmm4\n\t" + asm volatile ("pxor %%xmm4, %%xmm7\n\t" + "movdqu %[inbuf3], %%xmm4\n\t" + "pxor %%xmm8, %%xmm7\n\t" "movdqu %[inbuf4], %%xmm8\n\t" + "pxor %%xmm9, %%xmm7\n\t" "movdqu %[inbuf5], %%xmm9\n\t" : : [inbuf3] "m" (*(inbuf + 3 * BLOCKSIZE)), @@ -3722,6 +3574,15 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, asm volatile ("pxor %[first_key], %%xmm5\n\t" "pxor %%xmm0, %%xmm0\n\t" "movdqu %%xmm0, %[lxfkey]\n\t" + /* Add plaintext blocks to checksum */ + "pxor %%xmm1, %%xmm2\n\t" + "pxor %%xmm3, %%xmm4\n\t" + "pxor %%xmm9, %%xmm8\n\t" + "pxor %%xmm11, %%xmm10\n\t" + "pxor %%xmm2, %%xmm4\n\t" + "pxor %%xmm8, %%xmm10\n\t" + "pxor %%xmm4, %%xmm7\n\t" + "pxor %%xmm10, %%xmm7\n\t" : [lxfkey] "=m" (*lxf_key) : [first_key] "m" (ctx->keyschdec[0][0][0]) : "memory" ); @@ -3782,8 +3643,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, asm volatile ("pxor %[tmpbuf0],%%xmm1\n\t" "movdqu %%xmm1, %[outbuf0]\n\t" + "pxor %%xmm1, %%xmm7\n\t" "pxor %[tmpbuf1],%%xmm2\n\t" "movdqu %%xmm2, %[outbuf1]\n\t" + "pxor %%xmm2, %%xmm7\n\t" : [outbuf0] "=m" (*(outbuf + 0 * BLOCKSIZE)), [outbuf1] "=m" (*(outbuf + 1 * BLOCKSIZE)) : [tmpbuf0] "m" (*(tmpbuf + 0 * BLOCKSIZE)), @@ -3791,8 +3654,10 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, : "memory" ); asm volatile ("pxor %[tmpbuf2],%%xmm3\n\t" "movdqu %%xmm3, %[outbuf2]\n\t" + "pxor %%xmm3, %%xmm7\n\t" "pxor %%xmm5, %%xmm4\n\t" "movdqu %%xmm4, %[outbuf3]\n\t" + "pxor %%xmm4, %%xmm7\n\t" : [outbuf2] "=m" (*(outbuf + 2 * BLOCKSIZE)), [outbuf3] "=m" (*(outbuf + 3 * BLOCKSIZE)) : [tmpbuf2] "m" (*(tmpbuf + 2 * BLOCKSIZE)) @@ -3822,6 +3687,7 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, asm volatile ("pxor %%xmm5, %%xmm0\n\t" "movdqu %%xmm0, %[outbuf]\n\t" + "pxor %%xmm0, %%xmm7\n\t" : [outbuf] "=m" (*outbuf) : : "memory" ); @@ -3832,7 +3698,9 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, c->u_mode.ocb.data_nblocks = n; asm volatile ("movdqu %%xmm5, %[iv]\n\t" - : [iv] "=m" (*c->u_iv.iv) + "movdqu %%xmm7, %[ctr]\n\t" + : [iv] "=m" (*c->u_iv.iv), + [ctr] "=m" (*c->u_ctr.ctr) : : "memory" ); @@ -3846,8 +3714,6 @@ aesni_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, : : "memory" ); - aesni_ocb_checksum (c, outbuf_arg, nblocks_arg); - aesni_cleanup (); aesni_cleanup_2_7 (); -- 2.39.2 From jussi.kivilinna at iki.fi Sun May 28 16:54:04 2023 From: jussi.kivilinna at 
iki.fi (Jussi Kivilinna) Date: Sun, 28 May 2023 17:54:04 +0300 Subject: [PATCH] serpent: add x86/AVX512 implementation Message-ID: <20230528145404.532462-1-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add `serpent-avx512-x86.c`; Add extra CFLAG handling for `serpent-avx512-x86.o` and `serpent-avx512-x86.lo`. * cipher/serpent-avx512-x86.c: New. * cipher/serpent.c (USE_AVX512): New. (serpent_context_t): Add `use_avx512`. [USE_AVX512] (_gcry_serpent_avx512_cbc_dec) (_gcry_serpent_avx512_cfb_dec, _gcry_serpent_avx512_ctr_enc) (_gcry_serpent_avx512_ocb_crypt, _gcry_serpent_avx512_blk32): New. (serpent_setkey_internal) [USE_AVX512]: Set `use_avx512` is AVX512 HW available. (_gcry_serpent_ctr_enc) [USE_AVX512]: New. (_gcry_serpent_cbc_dec) [USE_AVX512]: New. (_gcry_serpent_cfb_dec) [USE_AVX512]: New. (_gcry_serpent_ocb_crypt) [USE_AVX512]: New. (serpent_crypt_blk1_16): Rename to... (serpent_crypt_blk1_32): ... this; Add AVX512 code-path; Adjust for increase from max 16 blocks to max 32 blocks. (serpent_encrypt_blk1_16): Rename to ... (serpent_encrypt_blk1_32): ... this. (serpent_decrypt_blk1_16): Rename to ... (serpent_decrypt_blk1_32): ... this. (_gcry_serpent_xts_crypt, _gcry_serpent_ecb_crypt): Increase bulk block count from 16 to 32. * configure.ac (gcry_cv_cc_x86_avx512_intrinsics) (ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS): New. (GCRYPT_ASM_CIPHERS): Add `serpent-avx512-x86.lo`. -- Benchmark on AMD Ryzen 9 7900X: Before: Cipher: SERPENT128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.52 ns/B 626.2 MiB/s 8.26 c/B 5425 ECB dec | 1.48 ns/B 645.5 MiB/s 8.01 c/B 5425 CBC enc | 5.81 ns/B 164.2 MiB/s 31.94 c/B 5500 CBC dec | 0.722 ns/B 1322 MiB/s 3.91 c/B 5425 CFB enc | 5.88 ns/B 162.3 MiB/s 32.31 c/B 5500 CFB dec | 0.735 ns/B 1297 MiB/s 3.99 c/B 5424 OFB enc | 5.77 ns/B 165.3 MiB/s 31.72 c/B 5500 OFB dec | 5.77 ns/B 165.4 MiB/s 31.72 c/B 5500 CTR enc | 0.756 ns/B 1262 MiB/s 4.10 c/B 5425 CTR dec | 0.776 ns/B 1228 MiB/s 4.21 c/B 5424 XTS enc | 1.68 ns/B 568.3 MiB/s 9.10 c/B 5424 XTS dec | 1.58 ns/B 604.2 MiB/s 8.56 c/B 5425 CCM enc | 6.60 ns/B 144.5 MiB/s 36.30 c/B 5500 CCM dec | 6.60 ns/B 144.5 MiB/s 36.30 c/B 5500 CCM auth | 5.86 ns/B 162.6 MiB/s 32.25 c/B 5500 EAX enc | 6.54 ns/B 145.8 MiB/s 35.98 c/B 5500 EAX dec | 6.54 ns/B 145.8 MiB/s 35.98 c/B 5500 EAX auth | 5.81 ns/B 164.2 MiB/s 31.94 c/B 5500 GCM enc | 0.787 ns/B 1212 MiB/s 4.27 c/B 5425 GCM dec | 0.788 ns/B 1211 MiB/s 4.27 c/B 5425 GCM auth | 0.038 ns/B 24932 MiB/s 0.210 c/B 5500 OCB enc | 0.750 ns/B 1272 MiB/s 4.07 c/B 5424 OCB dec | 0.743 ns/B 1284 MiB/s 4.03 c/B 5425 OCB auth | 0.749 ns/B 1274 MiB/s 4.06 c/B 5425 SIV enc | 6.54 ns/B 145.8 MiB/s 35.99 c/B 5500 SIV dec | 6.55 ns/B 145.7 MiB/s 36.01 c/B 5500 SIV auth | 5.81 ns/B 164.2 MiB/s 31.94 c/B 5500 GCM-SIV enc | 5.63 ns/B 169.4 MiB/s 30.97 c/B 5500 GCM-SIV dec | 5.64 ns/B 169.2 MiB/s 31.00 c/B 5500 GCM-SIV auth | 0.038 ns/B 25201 MiB/s 0.208 c/B 5500 After: SERPENT128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.578 ns/B 1649 MiB/s 3.14 c/B 5425 ECB dec | 0.505 ns/B 1889 MiB/s 2.74 c/B 5424 CBC enc | 5.81 ns/B 164.1 MiB/s 31.96 c/B 5500 CBC dec | 0.527 ns/B 1810 MiB/s 2.86 c/B 5424 CFB enc | 5.88 ns/B 162.3 MiB/s 32.31 c/B 5500 CFB dec | 0.471 ns/B 2026 MiB/s 2.55 c/B 5425 OFB enc | 5.77 ns/B 165.3 MiB/s 31.72 c/B 5500 OFB dec | 5.77 ns/B 165.3 MiB/s 31.73 c/B 5501 CTR enc | 0.464 ns/B 2053 MiB/s 2.52 c/B 5425 CTR dec | 0.464 ns/B 2057 MiB/s 2.51 c/B 5425 XTS enc | 0.551 ns/B 1732 MiB/s 2.99 c/B 5424 XTS dec | 0.527 ns/B 1809 MiB/s 2.86 c/B 
5424 CCM enc | 6.32 ns/B 150.8 MiB/s 34.78 c/B 5501 CCM dec | 6.32 ns/B 150.9 MiB/s 34.77 c/B 5500 CCM auth | 5.86 ns/B 162.6 MiB/s 32.25 c/B 5500 EAX enc | 6.26 ns/B 152.2 MiB/s 34.46 c/B 5500 EAX dec | 6.27 ns/B 152.2 MiB/s 34.46 c/B 5500 EAX auth | 5.81 ns/B 164.2 MiB/s 31.94 c/B 5500 GCM enc | 0.497 ns/B 1917 MiB/s 2.70 c/B 5425 GCM dec | 0.499 ns/B 1913 MiB/s 2.70 c/B 5425 GCM auth | 0.031 ns/B 30709 MiB/s 0.171 c/B 5500 OCB enc | 0.482 ns/B 1979 MiB/s 2.61 c/B 5424 OCB dec | 0.475 ns/B 2007 MiB/s 2.58 c/B 5424 OCB auth | 0.748 ns/B 1274 MiB/s 4.06 c/B 5424 SIV enc | 6.27 ns/B 152.0 MiB/s 34.50 c/B 5500 SIV dec | 6.27 ns/B 152.1 MiB/s 34.48 c/B 5500 SIV auth | 5.81 ns/B 164.2 MiB/s 31.94 c/B 5500 GCM-SIV enc | 5.63 ns/B 169.5 MiB/s 30.95 c/B 5500 GCM-SIV dec | 5.63 ns/B 169.3 MiB/s 30.98 c/B 5500 GCM-SIV auth | 0.034 ns/B 28060 MiB/s 0.187 c/B 5500 Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 17 +- cipher/serpent-avx2-amd64.S | 4 +- cipher/serpent-avx512-x86.c | 994 ++++++++++++++++++++++++++++++++++++ cipher/serpent.c | 218 +++++++- configure.ac | 45 ++ 5 files changed, 1257 insertions(+), 21 deletions(-) create mode 100644 cipher/serpent-avx512-x86.c diff --git a/cipher/Makefile.am b/cipher/Makefile.am index e67b1ee2..8c7ec095 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -119,12 +119,12 @@ EXTRA_libcipher_la_SOURCES = \ salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \ scrypt.c \ seed.c \ - serpent.c serpent-sse2-amd64.S \ + serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \ + serpent-avx512-x86.c serpent-armv7-neon.S \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ sm4-gfni-avx2-amd64.S sm4-gfni-avx512-amd64.S \ sm4-aarch64.S sm4-armv8-aarch64-ce.S sm4-armv9-aarch64-sve-ce.S \ sm4-ppc.c \ - serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ sha1-armv8-aarch64-ce.S sha1-intel-shaext.c \ @@ -316,3 +316,16 @@ sm4-ppc.o: $(srcdir)/sm4-ppc.c Makefile sm4-ppc.lo: $(srcdir)/sm4-ppc.c Makefile `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + + +if ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS +avx512f_cflags = -mavx512f +else +avx512f_cflags = +endif + +serpent-avx512-x86.o: $(srcdir)/serpent-avx512-x86.c Makefile + `echo $(COMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) ` + +serpent-avx512-x86.lo: $(srcdir)/serpent-avx512-x86.c Makefile + `echo $(LTCOMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) ` diff --git a/cipher/serpent-avx2-amd64.S b/cipher/serpent-avx2-amd64.S index e25e7d3b..7aba235f 100644 --- a/cipher/serpent-avx2-amd64.S +++ b/cipher/serpent-avx2-amd64.S @@ -589,8 +589,8 @@ ELF(.type _gcry_serpent_avx2_blk16, at function;) _gcry_serpent_avx2_blk16: /* input: * %rdi: ctx, CTX - * %rsi: dst (8 blocks) - * %rdx: src (8 blocks) + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) * %ecx: encrypt */ CFI_STARTPROC(); diff --git a/cipher/serpent-avx512-x86.c b/cipher/serpent-avx512-x86.c new file mode 100644 index 00000000..762c09e1 --- /dev/null +++ b/cipher/serpent-avx512-x86.c @@ -0,0 +1,994 @@ +/* serpent-avx512-x86.c - AVX512 implementation of Serpent cipher + * + * Copyright (C) 2023 Jussi Kivilinna + * + * This file is part of Libgcrypt. 
+ * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#if defined(__x86_64) || defined(__i386) +#if defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS) && \ + defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT) + +#include +#include +#include + +#include "g10lib.h" +#include "types.h" +#include "cipher.h" +#include "bithelp.h" +#include "bufhelp.h" +#include "cipher-internal.h" +#include "bulkhelp.h" + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) + +/* Number of rounds per Serpent encrypt/decrypt operation. */ +#define ROUNDS 32 + +/* Serpent works on 128 bit blocks. */ +typedef unsigned int serpent_block_t[4]; + +/* The key schedule consists of 33 128 bit subkeys. */ +typedef unsigned int serpent_subkeys_t[ROUNDS + 1][4]; + +#define vpunpckhdq(a, b, o) ((o) = _mm512_unpackhi_epi32((b), (a))) +#define vpunpckldq(a, b, o) ((o) = _mm512_unpacklo_epi32((b), (a))) +#define vpunpckhqdq(a, b, o) ((o) = _mm512_unpackhi_epi64((b), (a))) +#define vpunpcklqdq(a, b, o) ((o) = _mm512_unpacklo_epi64((b), (a))) + +#define vpbroadcastd(v) _mm512_set1_epi32(v) + +#define vrol(x, s) _mm512_rol_epi32((x), (s)) +#define vror(x, s) _mm512_ror_epi32((x), (s)) +#define vshl(x, s) _mm512_slli_epi32((x), (s)) + +/* 4x4 32-bit integer matrix transpose */ +#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \ + vpunpckhdq(x1, x0, t2); \ + vpunpckldq(x1, x0, x0); \ + \ + vpunpckldq(x3, x2, t1); \ + vpunpckhdq(x3, x2, x2); \ + \ + vpunpckhqdq(t1, x0, x1); \ + vpunpcklqdq(t1, x0, x0); \ + \ + vpunpckhqdq(x2, t2, x3); \ + vpunpcklqdq(x2, t2, x2); + +/* + * These are the S-Boxes of Serpent from following research paper. + * + * D. A. Osvik, ?Speeding up Serpent,? in Third AES Candidate Conference, + * (New York, New York, USA), p. 317?329, National Institute of Standards and + * Technology, 2000. + * + * Paper is also available at: http://www.ii.uib.no/~osvik/pub/aes3.pdf + * + * -- + * + * Following logic gets heavily optimized by compiler to use AVX512F + * 'vpternlogq' instruction. This gives higher performance increase than + * would be expected from simple wideing of vectors from AVX2/256bit to + * AVX512/512bit. 
+ * + */ + +#define SBOX0(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r3 ^= r0; r4 = r1; \ + r1 &= r3; r4 ^= r2; \ + r1 ^= r0; r0 |= r3; \ + r0 ^= r4; r4 ^= r3; \ + r3 ^= r2; r2 |= r1; \ + r2 ^= r4; r4 = ~r4; \ + r4 |= r1; r1 ^= r3; \ + r1 ^= r4; r3 |= r0; \ + r1 ^= r3; r4 ^= r3; \ + \ + w = r1; x = r4; y = r2; z = r0; \ + } + +#define SBOX0_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r2 = ~r2; r4 = r1; \ + r1 |= r0; r4 = ~r4; \ + r1 ^= r2; r2 |= r4; \ + r1 ^= r3; r0 ^= r4; \ + r2 ^= r0; r0 &= r3; \ + r4 ^= r0; r0 |= r1; \ + r0 ^= r2; r3 ^= r4; \ + r2 ^= r1; r3 ^= r0; \ + r3 ^= r1; \ + r2 &= r3; \ + r4 ^= r2; \ + \ + w = r0; x = r4; y = r1; z = r3; \ + } + +#define SBOX1(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r0 = ~r0; r2 = ~r2; \ + r4 = r0; r0 &= r1; \ + r2 ^= r0; r0 |= r3; \ + r3 ^= r2; r1 ^= r0; \ + r0 ^= r4; r4 |= r1; \ + r1 ^= r3; r2 |= r0; \ + r2 &= r4; r0 ^= r1; \ + r1 &= r2; \ + r1 ^= r0; r0 &= r2; \ + r0 ^= r4; \ + \ + w = r2; x = r0; y = r3; z = r1; \ + } + +#define SBOX1_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r1; r1 ^= r3; \ + r3 &= r1; r4 ^= r2; \ + r3 ^= r0; r0 |= r1; \ + r2 ^= r3; r0 ^= r4; \ + r0 |= r2; r1 ^= r3; \ + r0 ^= r1; r1 |= r3; \ + r1 ^= r0; r4 = ~r4; \ + r4 ^= r1; r1 |= r0; \ + r1 ^= r0; \ + r1 |= r4; \ + r3 ^= r1; \ + \ + w = r4; x = r0; y = r3; z = r2; \ + } + +#define SBOX2(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r0; r0 &= r2; \ + r0 ^= r3; r2 ^= r1; \ + r2 ^= r0; r3 |= r4; \ + r3 ^= r1; r4 ^= r2; \ + r1 = r3; r3 |= r4; \ + r3 ^= r0; r0 &= r1; \ + r4 ^= r0; r1 ^= r3; \ + r1 ^= r4; r4 = ~r4; \ + \ + w = r2; x = r3; y = r1; z = r4; \ + } + +#define SBOX2_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r2 ^= r3; r3 ^= r0; \ + r4 = r3; r3 &= r2; \ + r3 ^= r1; r1 |= r2; \ + r1 ^= r4; r4 &= r3; \ + r2 ^= r3; r4 &= r0; \ + r4 ^= r2; r2 &= r1; \ + r2 |= r0; r3 = ~r3; \ + r2 ^= r3; r0 ^= r3; \ + r0 &= r1; r3 ^= r4; \ + r3 ^= r0; \ + \ + w = r1; x = r4; y = r2; z = r3; \ + } + +#define SBOX3(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r0; r0 |= r3; \ + r3 ^= r1; r1 &= r4; \ + r4 ^= r2; r2 ^= r3; \ + r3 &= r0; r4 |= r1; \ + r3 ^= r4; r0 ^= r1; \ + r4 &= r0; r1 ^= r3; \ + r4 ^= r2; r1 |= r0; \ + r1 ^= r2; r0 ^= r3; \ + r2 = r1; r1 |= r3; \ + r1 ^= r0; \ + \ + w = r1; x = r2; y = r3; z = r4; \ + } + +#define SBOX3_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r2; r2 ^= r1; \ + r0 ^= r2; r4 &= r2; \ + r4 ^= r0; r0 &= r1; \ + r1 ^= r3; r3 |= r4; \ + r2 ^= r3; r0 ^= r3; \ + r1 ^= r4; r3 &= r2; \ + r3 ^= r1; r1 ^= r0; \ + r1 |= r2; r0 ^= r3; \ + r1 ^= r4; \ + r0 ^= r1; \ + \ + w = r2; x = r1; y = r3; z = r0; \ + } + +#define SBOX4(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r1 ^= r3; r3 = ~r3; \ + r2 ^= r3; r3 ^= r0; \ + r4 = r1; r1 &= r3; \ + r1 ^= r2; r4 ^= r3; \ + r0 ^= r4; r2 &= r4; \ + r2 ^= r0; r0 &= r1; \ + r3 ^= r0; r4 |= r1; \ + r4 ^= r0; r0 |= r3; \ + r0 ^= r2; r2 &= r3; \ + r0 = ~r0; r4 ^= r2; \ + \ + w = r1; x = r4; y = r0; z = r3; \ + } + +#define SBOX4_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r2; r2 &= r3; \ + r2 ^= r1; r1 |= r3; \ + r1 &= r0; r4 ^= r2; \ + r4 ^= r1; r1 &= r2; \ + r0 = ~r0; r3 ^= r4; \ + r1 ^= r3; r3 &= r0; \ + r3 ^= r2; r0 ^= r1; \ + r2 &= r0; r3 ^= r0; \ + r2 ^= r4; \ + r2 |= r3; r3 ^= r0; \ + r2 ^= r1; \ + \ + w = r0; x = r3; y = r2; z = r4; \ + } + +#define SBOX5(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r0 ^= r1; r1 
^= r3; \ + r3 = ~r3; r4 = r1; \ + r1 &= r0; r2 ^= r3; \ + r1 ^= r2; r2 |= r4; \ + r4 ^= r3; r3 &= r1; \ + r3 ^= r0; r4 ^= r1; \ + r4 ^= r2; r2 ^= r0; \ + r0 &= r3; r2 = ~r2; \ + r0 ^= r4; r4 |= r3; \ + r2 ^= r4; \ + \ + w = r1; x = r3; y = r0; z = r2; \ + } + +#define SBOX5_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r1 = ~r1; r4 = r3; \ + r2 ^= r1; r3 |= r0; \ + r3 ^= r2; r2 |= r1; \ + r2 &= r0; r4 ^= r3; \ + r2 ^= r4; r4 |= r0; \ + r4 ^= r1; r1 &= r2; \ + r1 ^= r3; r4 ^= r2; \ + r3 &= r4; r4 ^= r1; \ + r3 ^= r4; r4 = ~r4; \ + r3 ^= r0; \ + \ + w = r1; x = r4; y = r3; z = r2; \ + } + +#define SBOX6(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r2 = ~r2; r4 = r3; \ + r3 &= r0; r0 ^= r4; \ + r3 ^= r2; r2 |= r4; \ + r1 ^= r3; r2 ^= r0; \ + r0 |= r1; r2 ^= r1; \ + r4 ^= r0; r0 |= r3; \ + r0 ^= r2; r4 ^= r3; \ + r4 ^= r0; r3 = ~r3; \ + r2 &= r4; \ + r2 ^= r3; \ + \ + w = r0; x = r1; y = r4; z = r2; \ + } + +#define SBOX6_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r0 ^= r2; r4 = r2; \ + r2 &= r0; r4 ^= r3; \ + r2 = ~r2; r3 ^= r1; \ + r2 ^= r3; r4 |= r0; \ + r0 ^= r2; r3 ^= r4; \ + r4 ^= r1; r1 &= r3; \ + r1 ^= r0; r0 ^= r3; \ + r0 |= r2; r3 ^= r1; \ + r4 ^= r0; \ + \ + w = r1; x = r2; y = r4; z = r3; \ + } + +#define SBOX7(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r1; r1 |= r2; \ + r1 ^= r3; r4 ^= r2; \ + r2 ^= r1; r3 |= r4; \ + r3 &= r0; r4 ^= r2; \ + r3 ^= r1; r1 |= r4; \ + r1 ^= r0; r0 |= r4; \ + r0 ^= r2; r1 ^= r4; \ + r2 ^= r1; r1 &= r0; \ + r1 ^= r4; r2 = ~r2; \ + r2 |= r0; \ + r4 ^= r2; \ + \ + w = r4; x = r3; y = r1; z = r0; \ + } + +#define SBOX7_INVERSE(r0, r1, r2, r3, w, x, y, z) \ + { \ + __m512i r4; \ + \ + r4 = r2; r2 ^= r0; \ + r0 &= r3; r4 |= r3; \ + r2 = ~r2; r3 ^= r1; \ + r1 |= r0; r0 ^= r2; \ + r2 &= r4; r3 &= r4; \ + r1 ^= r2; r2 ^= r0; \ + r0 |= r2; r4 ^= r1; \ + r0 ^= r3; r3 ^= r4; \ + r4 |= r0; r3 ^= r2; \ + r4 ^= r2; \ + \ + w = r3; x = r0; y = r1; z = r4; \ + } + +/* XOR BLOCK1 into BLOCK0. */ +#define BLOCK_XOR_KEY(block0, rkey) \ + { \ + block0[0] ^= vpbroadcastd(rkey[0]); \ + block0[1] ^= vpbroadcastd(rkey[1]); \ + block0[2] ^= vpbroadcastd(rkey[2]); \ + block0[3] ^= vpbroadcastd(rkey[3]); \ + } + +/* Copy BLOCK_SRC to BLOCK_DST. */ +#define BLOCK_COPY(block_dst, block_src) \ + { \ + block_dst[0] = block_src[0]; \ + block_dst[1] = block_src[1]; \ + block_dst[2] = block_src[2]; \ + block_dst[3] = block_src[3]; \ + } + +/* Apply SBOX number WHICH to to the block found in ARRAY0, writing + the output to the block found in ARRAY1. */ +#define SBOX(which, array0, array1) \ + SBOX##which (array0[0], array0[1], array0[2], array0[3], \ + array1[0], array1[1], array1[2], array1[3]); + +/* Apply inverse SBOX number WHICH to to the block found in ARRAY0, writing + the output to the block found in ARRAY1. */ +#define SBOX_INVERSE(which, array0, array1) \ + SBOX##which##_INVERSE (array0[0], array0[1], array0[2], array0[3], \ + array1[0], array1[1], array1[2], array1[3]); + +/* Apply the linear transformation to BLOCK. 
*/ +#define LINEAR_TRANSFORMATION(block) \ + { \ + block[0] = vrol (block[0], 13); \ + block[2] = vrol (block[2], 3); \ + block[1] = block[1] ^ block[0] ^ block[2]; \ + block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \ + block[1] = vrol (block[1], 1); \ + block[3] = vrol (block[3], 7); \ + block[0] = block[0] ^ block[1] ^ block[3]; \ + block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \ + block[0] = vrol (block[0], 5); \ + block[2] = vrol (block[2], 22); \ + } + +/* Apply the inverse linear transformation to BLOCK. */ +#define LINEAR_TRANSFORMATION_INVERSE(block) \ + { \ + block[2] = vror (block[2], 22); \ + block[0] = vror (block[0] , 5); \ + block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \ + block[0] = block[0] ^ block[1] ^ block[3]; \ + block[3] = vror (block[3], 7); \ + block[1] = vror (block[1], 1); \ + block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \ + block[1] = block[1] ^ block[0] ^ block[2]; \ + block[2] = vror (block[2], 3); \ + block[0] = vror (block[0], 13); \ + } + +/* Apply a Serpent round to BLOCK, using the SBOX number WHICH and the + subkeys contained in SUBKEYS. Use BLOCK_TMP as temporary storage. + This macro increments `round'. */ +#define ROUND(which, subkeys, block, block_tmp) \ + { \ + BLOCK_XOR_KEY (block, subkeys[round]); \ + SBOX (which, block, block_tmp); \ + LINEAR_TRANSFORMATION (block_tmp); \ + BLOCK_COPY (block, block_tmp); \ + } + +/* Apply the last Serpent round to BLOCK, using the SBOX number WHICH + and the subkeys contained in SUBKEYS. Use BLOCK_TMP as temporary + storage. The result will be stored in BLOCK_TMP. This macro + increments `round'. */ +#define ROUND_LAST(which, subkeys, block, block_tmp) \ + { \ + BLOCK_XOR_KEY (block, subkeys[round]); \ + SBOX (which, block, block_tmp); \ + BLOCK_XOR_KEY (block_tmp, subkeys[round+1]); \ + } + +/* Apply an inverse Serpent round to BLOCK, using the SBOX number + WHICH and the subkeys contained in SUBKEYS. Use BLOCK_TMP as + temporary storage. This macro increments `round'. */ +#define ROUND_INVERSE(which, subkey, block, block_tmp) \ + { \ + LINEAR_TRANSFORMATION_INVERSE (block); \ + SBOX_INVERSE (which, block, block_tmp); \ + BLOCK_XOR_KEY (block_tmp, subkey[round]); \ + BLOCK_COPY (block, block_tmp); \ + } + +/* Apply the first Serpent round to BLOCK, using the SBOX number WHICH + and the subkeys contained in SUBKEYS. Use BLOCK_TMP as temporary + storage. The result will be stored in BLOCK_TMP. This macro + increments `round'. 
*/ +#define ROUND_FIRST_INVERSE(which, subkeys, block, block_tmp) \ + { \ + BLOCK_XOR_KEY (block, subkeys[round]); \ + SBOX_INVERSE (which, block, block_tmp); \ + BLOCK_XOR_KEY (block_tmp, subkeys[round-1]); \ + } + +static ALWAYS_INLINE void +serpent_encrypt_internal_avx512 (const serpent_subkeys_t keys, + const __m512i vin[8], __m512i vout[8]) +{ + __m512i b[4]; + __m512i c[4]; + __m512i b_next[4]; + __m512i c_next[4]; + int round = 0; + + b_next[0] = vin[0]; + b_next[1] = vin[1]; + b_next[2] = vin[2]; + b_next[3] = vin[3]; + c_next[0] = vin[4]; + c_next[1] = vin[5]; + c_next[2] = vin[6]; + c_next[3] = vin[7]; + transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]); + transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]); + + b[0] = b_next[0]; + b[1] = b_next[1]; + b[2] = b_next[2]; + b[3] = b_next[3]; + c[0] = c_next[0]; + c[1] = c_next[1]; + c[2] = c_next[2]; + c[3] = c_next[3]; + + while (1) + { + ROUND (0, keys, b, b_next); ROUND (0, keys, c, c_next); round++; + ROUND (1, keys, b, b_next); ROUND (1, keys, c, c_next); round++; + ROUND (2, keys, b, b_next); ROUND (2, keys, c, c_next); round++; + ROUND (3, keys, b, b_next); ROUND (3, keys, c, c_next); round++; + ROUND (4, keys, b, b_next); ROUND (4, keys, c, c_next); round++; + ROUND (5, keys, b, b_next); ROUND (5, keys, c, c_next); round++; + ROUND (6, keys, b, b_next); ROUND (6, keys, c, c_next); round++; + if (round >= ROUNDS - 1) + break; + ROUND (7, keys, b, b_next); ROUND (7, keys, c, c_next); round++; + } + + ROUND_LAST (7, keys, b, b_next); ROUND_LAST (7, keys, c, c_next); + + transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]); + transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]); + vout[0] = b_next[0]; + vout[1] = b_next[1]; + vout[2] = b_next[2]; + vout[3] = b_next[3]; + vout[4] = c_next[0]; + vout[5] = c_next[1]; + vout[6] = c_next[2]; + vout[7] = c_next[3]; +} + +static ALWAYS_INLINE void +serpent_decrypt_internal_avx512 (const serpent_subkeys_t keys, + const __m512i vin[8], __m512i vout[8]) +{ + __m512i b[4]; + __m512i c[4]; + __m512i b_next[4]; + __m512i c_next[4]; + int round = ROUNDS; + + b_next[0] = vin[0]; + b_next[1] = vin[1]; + b_next[2] = vin[2]; + b_next[3] = vin[3]; + c_next[0] = vin[4]; + c_next[1] = vin[5]; + c_next[2] = vin[6]; + c_next[3] = vin[7]; + transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]); + transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]); + + ROUND_FIRST_INVERSE (7, keys, b_next, b); ROUND_FIRST_INVERSE (7, keys, c_next, c); + round -= 2; + + while (1) + { + ROUND_INVERSE (6, keys, b, b_next); ROUND_INVERSE (6, keys, c, c_next); round--; + ROUND_INVERSE (5, keys, b, b_next); ROUND_INVERSE (5, keys, c, c_next); round--; + ROUND_INVERSE (4, keys, b, b_next); ROUND_INVERSE (4, keys, c, c_next); round--; + ROUND_INVERSE (3, keys, b, b_next); ROUND_INVERSE (3, keys, c, c_next); round--; + ROUND_INVERSE (2, keys, b, b_next); ROUND_INVERSE (2, keys, c, c_next); round--; + ROUND_INVERSE (1, keys, b, b_next); ROUND_INVERSE (1, keys, c, c_next); round--; + ROUND_INVERSE (0, keys, b, b_next); ROUND_INVERSE (0, keys, c, c_next); round--; + if (round <= 0) + break; + ROUND_INVERSE (7, keys, b, b_next); ROUND_INVERSE (7, keys, c, c_next); round--; + } + + transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]); + transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]); + vout[0] = b_next[0]; + vout[1] = b_next[1]; + 
vout[2] = b_next[2]; + vout[3] = b_next[3]; + vout[4] = c_next[0]; + vout[5] = c_next[1]; + vout[6] = c_next[2]; + vout[7] = c_next[3]; +} + +enum crypt_mode_e +{ + ECB_ENC = 0, + ECB_DEC, + CBC_DEC, + CFB_DEC, + CTR_ENC, + OCB_ENC, + OCB_DEC +}; + +static ALWAYS_INLINE void +ctr_generate(unsigned char *ctr, __m512i vin[8]) +{ + const unsigned int blocksize = 16; + unsigned char ctr_low = ctr[15]; + + if (ctr_low + 32 <= 256) + { + const __m512i add0123 = _mm512_set_epi64(3LL << 56, 0, + 2LL << 56, 0, + 1LL << 56, 0, + 0LL << 56, 0); + const __m512i add4444 = _mm512_set_epi64(4LL << 56, 0, + 4LL << 56, 0, + 4LL << 56, 0, + 4LL << 56, 0); + const __m512i add4567 = _mm512_add_epi32(add0123, add4444); + const __m512i add8888 = _mm512_add_epi32(add4444, add4444); + + // Fast path without carry handling. + __m512i vctr = + _mm512_broadcast_i32x4(_mm_loadu_si128((const void *)ctr)); + + cipher_block_add(ctr, 32, blocksize); + vin[0] = _mm512_add_epi32(vctr, add0123); + vin[1] = _mm512_add_epi32(vctr, add4567); + vin[2] = _mm512_add_epi32(vin[0], add8888); + vin[3] = _mm512_add_epi32(vin[1], add8888); + vin[4] = _mm512_add_epi32(vin[2], add8888); + vin[5] = _mm512_add_epi32(vin[3], add8888); + vin[6] = _mm512_add_epi32(vin[4], add8888); + vin[7] = _mm512_add_epi32(vin[5], add8888); + } + else + { + // Slow path. + u32 blocks[4][blocksize / sizeof(u32)]; + + cipher_block_cpy(blocks[0], ctr, blocksize); + cipher_block_cpy(blocks[1], ctr, blocksize); + cipher_block_cpy(blocks[2], ctr, blocksize); + cipher_block_cpy(blocks[3], ctr, blocksize); + cipher_block_add(ctr, 32, blocksize); + cipher_block_add(blocks[1], 1, blocksize); + cipher_block_add(blocks[2], 2, blocksize); + cipher_block_add(blocks[3], 3, blocksize); + vin[0] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[1] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[2] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[3] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[4] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[5] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[6] = _mm512_loadu_epi32 (blocks); + cipher_block_add(blocks[0], 4, blocksize); + cipher_block_add(blocks[1], 4, blocksize); + cipher_block_add(blocks[2], 4, blocksize); + cipher_block_add(blocks[3], 4, blocksize); + vin[7] = _mm512_loadu_epi32 (blocks); + + wipememory(blocks, sizeof(blocks)); + } +} + +static ALWAYS_INLINE __m512i +ocb_input(__m512i *vchecksum, __m128i *voffset, const unsigned char *input, + unsigned char *output, const ocb_L_uintptr_t L[4]) +{ + __m128i L0 = 
_mm_loadu_si128((const void *)(uintptr_t)L[0]); + __m128i L1 = _mm_loadu_si128((const void *)(uintptr_t)L[1]); + __m128i L2 = _mm_loadu_si128((const void *)(uintptr_t)L[2]); + __m128i L3 = _mm_loadu_si128((const void *)(uintptr_t)L[3]); + __m512i vin = _mm512_loadu_epi32 (input); + __m512i voffsets; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + + if (vchecksum) + *vchecksum ^= _mm512_loadu_epi32 (input); + + *voffset ^= L0; + voffsets = _mm512_castsi128_si512(*voffset); + *voffset ^= L1; + voffsets = _mm512_inserti32x4(voffsets, *voffset, 1); + *voffset ^= L2; + voffsets = _mm512_inserti32x4(voffsets, *voffset, 2); + *voffset ^= L3; + voffsets = _mm512_inserti32x4(voffsets, *voffset, 3); + _mm512_storeu_epi32 (output, voffsets); + + return vin ^ voffsets; +} + +static NO_INLINE void +serpent_avx512_blk32(const void *c, unsigned char *output, + const unsigned char *input, int mode, + unsigned char *iv, unsigned char *checksum, + const ocb_L_uintptr_t Ls[32]) +{ + __m512i vin[8]; + __m512i vout[8]; + int encrypt = 1; + + asm volatile ("vpxor %%ymm0, %%ymm0, %%ymm0;\n\t" + "vpopcntb %%zmm0, %%zmm6;\n\t" /* spec stop for old AVX512 CPUs */ + "vpxor %%ymm6, %%ymm6, %%ymm6;\n\t" + : + : "m"(*input), "m"(*output) + : "xmm6", "xmm0", "memory", "cc"); + + // Input handling + switch (mode) + { + default: + case CBC_DEC: + case ECB_DEC: + encrypt = 0; + /* fall through */ + case ECB_ENC: + vin[0] = _mm512_loadu_epi32 (input + 0 * 64); + vin[1] = _mm512_loadu_epi32 (input + 1 * 64); + vin[2] = _mm512_loadu_epi32 (input + 2 * 64); + vin[3] = _mm512_loadu_epi32 (input + 3 * 64); + vin[4] = _mm512_loadu_epi32 (input + 4 * 64); + vin[5] = _mm512_loadu_epi32 (input + 5 * 64); + vin[6] = _mm512_loadu_epi32 (input + 6 * 64); + vin[7] = _mm512_loadu_epi32 (input + 7 * 64); + break; + + case CFB_DEC: + { + __m128i viv = _mm_loadu_si128((const void *)iv); + vin[0] = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0), + input - 1 * 64 + 48) + ^ _mm512_castsi128_si512(viv); + vin[1] = _mm512_loadu_epi32(input + 0 * 64 + 48); + vin[2] = _mm512_loadu_epi32(input + 1 * 64 + 48); + vin[3] = _mm512_loadu_epi32(input + 2 * 64 + 48); + vin[4] = _mm512_loadu_epi32(input + 3 * 64 + 48); + vin[5] = _mm512_loadu_epi32(input + 4 * 64 + 48); + vin[6] = _mm512_loadu_epi32(input + 5 * 64 + 48); + vin[7] = _mm512_loadu_epi32(input + 6 * 64 + 48); + viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48)); + _mm_storeu_si128((void *)iv, viv); + break; + } + + case CTR_ENC: + ctr_generate(iv, vin); + break; + + case OCB_ENC: + { + const ocb_L_uintptr_t *L = Ls; + __m512i vchecksum = _mm512_setzero_epi32(); + __m128i vchecksum128 = _mm_loadu_si128((const void *)checksum); + __m128i voffset = _mm_loadu_si128((const void *)iv); + vin[0] = ocb_input(&vchecksum, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4; + vin[1] = ocb_input(&vchecksum, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4; + vin[2] = ocb_input(&vchecksum, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4; + vin[3] = ocb_input(&vchecksum, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4; + vin[4] = ocb_input(&vchecksum, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4; + vin[5] = ocb_input(&vchecksum, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4; + vin[6] = ocb_input(&vchecksum, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4; + vin[7] = ocb_input(&vchecksum, &voffset, input + 7 * 64, output + 7 * 64, L); + vchecksum128 ^= 
_mm512_extracti32x4_epi32(vchecksum, 0) + ^ _mm512_extracti32x4_epi32(vchecksum, 1) + ^ _mm512_extracti32x4_epi32(vchecksum, 2) + ^ _mm512_extracti32x4_epi32(vchecksum, 3); + _mm_storeu_si128((void *)checksum, vchecksum128); + _mm_storeu_si128((void *)iv, voffset); + break; + } + + case OCB_DEC: + { + const ocb_L_uintptr_t *L = Ls; + __m128i voffset = _mm_loadu_si128((const void *)iv); + encrypt = 0; + vin[0] = ocb_input(NULL, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4; + vin[1] = ocb_input(NULL, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4; + vin[2] = ocb_input(NULL, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4; + vin[3] = ocb_input(NULL, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4; + vin[4] = ocb_input(NULL, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4; + vin[5] = ocb_input(NULL, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4; + vin[6] = ocb_input(NULL, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4; + vin[7] = ocb_input(NULL, &voffset, input + 7 * 64, output + 7 * 64, L); + _mm_storeu_si128((void *)iv, voffset); + break; + } + } + + if (encrypt) + serpent_encrypt_internal_avx512(c, vin, vout); + else + serpent_decrypt_internal_avx512(c, vin, vout); + + switch (mode) + { + case CTR_ENC: + case CFB_DEC: + vout[0] ^= _mm512_loadu_epi32 (input + 0 * 64); + vout[1] ^= _mm512_loadu_epi32 (input + 1 * 64); + vout[2] ^= _mm512_loadu_epi32 (input + 2 * 64); + vout[3] ^= _mm512_loadu_epi32 (input + 3 * 64); + vout[4] ^= _mm512_loadu_epi32 (input + 4 * 64); + vout[5] ^= _mm512_loadu_epi32 (input + 5 * 64); + vout[6] ^= _mm512_loadu_epi32 (input + 6 * 64); + vout[7] ^= _mm512_loadu_epi32 (input + 7 * 64); + /* fall through */ + default: + case ECB_DEC: + case ECB_ENC: + _mm512_storeu_epi32 (output + 0 * 64, vout[0]); + _mm512_storeu_epi32 (output + 1 * 64, vout[1]); + _mm512_storeu_epi32 (output + 2 * 64, vout[2]); + _mm512_storeu_epi32 (output + 3 * 64, vout[3]); + _mm512_storeu_epi32 (output + 4 * 64, vout[4]); + _mm512_storeu_epi32 (output + 5 * 64, vout[5]); + _mm512_storeu_epi32 (output + 6 * 64, vout[6]); + _mm512_storeu_epi32 (output + 7 * 64, vout[7]); + break; + + case CBC_DEC: + { + __m128i viv = _mm_loadu_si128((const void *)iv); + vout[0] ^= _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0), + input - 1 * 64 + 48) + ^ _mm512_castsi128_si512(viv); + vout[1] ^= _mm512_loadu_epi32(input + 0 * 64 + 48); + vout[2] ^= _mm512_loadu_epi32(input + 1 * 64 + 48); + vout[3] ^= _mm512_loadu_epi32(input + 2 * 64 + 48); + vout[4] ^= _mm512_loadu_epi32(input + 3 * 64 + 48); + vout[5] ^= _mm512_loadu_epi32(input + 4 * 64 + 48); + vout[6] ^= _mm512_loadu_epi32(input + 5 * 64 + 48); + vout[7] ^= _mm512_loadu_epi32(input + 6 * 64 + 48); + viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48)); + _mm_storeu_si128((void *)iv, viv); + _mm512_storeu_epi32 (output + 0 * 64, vout[0]); + _mm512_storeu_epi32 (output + 1 * 64, vout[1]); + _mm512_storeu_epi32 (output + 2 * 64, vout[2]); + _mm512_storeu_epi32 (output + 3 * 64, vout[3]); + _mm512_storeu_epi32 (output + 4 * 64, vout[4]); + _mm512_storeu_epi32 (output + 5 * 64, vout[5]); + _mm512_storeu_epi32 (output + 6 * 64, vout[6]); + _mm512_storeu_epi32 (output + 7 * 64, vout[7]); + break; + } + + case OCB_ENC: + vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64); + vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64); + vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64); + vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64); + vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64); + vout[5] ^= _mm512_loadu_epi32 (output + 
5 * 64); + vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64); + vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64); + _mm512_storeu_epi32 (output + 0 * 64, vout[0]); + _mm512_storeu_epi32 (output + 1 * 64, vout[1]); + _mm512_storeu_epi32 (output + 2 * 64, vout[2]); + _mm512_storeu_epi32 (output + 3 * 64, vout[3]); + _mm512_storeu_epi32 (output + 4 * 64, vout[4]); + _mm512_storeu_epi32 (output + 5 * 64, vout[5]); + _mm512_storeu_epi32 (output + 6 * 64, vout[6]); + _mm512_storeu_epi32 (output + 7 * 64, vout[7]); + break; + + case OCB_DEC: + { + __m512i vchecksum = _mm512_setzero_epi32(); + __m128i vchecksum128 = _mm_loadu_si128((const void *)checksum); + vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64); + vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64); + vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64); + vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64); + vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64); + vout[5] ^= _mm512_loadu_epi32 (output + 5 * 64); + vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64); + vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64); + vchecksum ^= vout[0]; + vchecksum ^= vout[1]; + vchecksum ^= vout[2]; + vchecksum ^= vout[3]; + vchecksum ^= vout[4]; + vchecksum ^= vout[5]; + vchecksum ^= vout[6]; + vchecksum ^= vout[7]; + _mm512_storeu_epi32 (output + 0 * 64, vout[0]); + _mm512_storeu_epi32 (output + 1 * 64, vout[1]); + _mm512_storeu_epi32 (output + 2 * 64, vout[2]); + _mm512_storeu_epi32 (output + 3 * 64, vout[3]); + _mm512_storeu_epi32 (output + 4 * 64, vout[4]); + _mm512_storeu_epi32 (output + 5 * 64, vout[5]); + _mm512_storeu_epi32 (output + 6 * 64, vout[6]); + _mm512_storeu_epi32 (output + 7 * 64, vout[7]); + vchecksum128 ^= _mm512_extracti32x4_epi32(vchecksum, 0) + ^ _mm512_extracti32x4_epi32(vchecksum, 1) + ^ _mm512_extracti32x4_epi32(vchecksum, 2) + ^ _mm512_extracti32x4_epi32(vchecksum, 3); + _mm_storeu_si128((void *)checksum, vchecksum128); + break; + } + } + + _mm256_zeroall(); +#ifdef __x86_64__ + asm volatile ( +#define CLEAR(mm) "vpxord %%" #mm ", %%" #mm ", %%" #mm ";\n\t" + CLEAR(ymm16) CLEAR(ymm17) CLEAR(ymm18) CLEAR(ymm19) + CLEAR(ymm20) CLEAR(ymm21) CLEAR(ymm22) CLEAR(ymm23) + CLEAR(ymm24) CLEAR(ymm25) CLEAR(ymm26) CLEAR(ymm27) + CLEAR(ymm28) CLEAR(ymm29) CLEAR(ymm30) CLEAR(ymm31) +#undef CLEAR + : + : "m"(*input), "m"(*output) + : "xmm16", "xmm17", "xmm18", "xmm19", + "xmm20", "xmm21", "xmm22", "xmm23", + "xmm24", "xmm25", "xmm26", "xmm27", + "xmm28", "xmm29", "xmm30", "xmm31", + "memory", "cc"); +#endif +} + +void +_gcry_serpent_avx512_blk32(const void *ctx, unsigned char *out, + const unsigned char *in, int encrypt) +{ + serpent_avx512_blk32 (ctx, out, in, encrypt ? ECB_ENC : ECB_DEC, + NULL, NULL, NULL); +} + +void +_gcry_serpent_avx512_cbc_dec(const void *ctx, unsigned char *out, + const unsigned char *in, unsigned char *iv) +{ + serpent_avx512_blk32 (ctx, out, in, CBC_DEC, iv, NULL, NULL); +} + +void +_gcry_serpent_avx512_cfb_dec(const void *ctx, unsigned char *out, + const unsigned char *in, unsigned char *iv) +{ + serpent_avx512_blk32 (ctx, out, in, CFB_DEC, iv, NULL, NULL); +} + +void +_gcry_serpent_avx512_ctr_enc(const void *ctx, unsigned char *out, + const unsigned char *in, unsigned char *iv) +{ + serpent_avx512_blk32 (ctx, out, in, CTR_ENC, iv, NULL, NULL); +} + +void +_gcry_serpent_avx512_ocb_crypt(const void *ctx, unsigned char *out, + const unsigned char *in, unsigned char *offset, + unsigned char *checksum, + const ocb_L_uintptr_t Ls[32], int encrypt) +{ + serpent_avx512_blk32 (ctx, out, in, encrypt ? 
OCB_ENC : OCB_DEC, offset, + checksum, Ls); +} + +#endif /*defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT)*/ +#endif /*__x86_64 || __i386*/ diff --git a/cipher/serpent.c b/cipher/serpent.c index 908523c2..2b951aba 100644 --- a/cipher/serpent.c +++ b/cipher/serpent.c @@ -32,14 +32,14 @@ #include "bulkhelp.h" -/* USE_SSE2 indicates whether to compile with AMD64 SSE2 code. */ +/* USE_SSE2 indicates whether to compile with x86-64 SSE2 code. */ #undef USE_SSE2 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) # define USE_SSE2 1 #endif -/* USE_AVX2 indicates whether to compile with AMD64 AVX2 code. */ +/* USE_AVX2 indicates whether to compile with x86-64 AVX2 code. */ #undef USE_AVX2 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) @@ -48,6 +48,15 @@ # endif #endif +/* USE_AVX512 indicates whether to compile with x86 AVX512 code. */ +#undef USE_AVX512 +#if (defined(__x86_64) || defined(__i386)) && \ + defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS) +# if defined(ENABLE_AVX512_SUPPORT) +# define USE_AVX512 1 +# endif +#endif + /* USE_NEON indicates whether to enable ARM NEON assembly code. */ #undef USE_NEON #ifdef ENABLE_NEON_SUPPORT @@ -82,6 +91,9 @@ typedef struct serpent_context #ifdef USE_AVX2 int use_avx2; #endif +#ifdef USE_AVX512 + int use_avx512; +#endif #ifdef USE_NEON int use_neon; #endif @@ -186,6 +198,38 @@ extern void _gcry_serpent_avx2_blk16(const serpent_context_t *c, byte *out, const byte *in, int encrypt) ASM_FUNC_ABI; #endif +#ifdef USE_AVX512 +/* Assembler implementations of Serpent using AVX512. Processing 32 blocks in + parallel. + */ +extern void _gcry_serpent_avx512_cbc_dec(const void *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv); + +extern void _gcry_serpent_avx512_cfb_dec(const void *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv); + +extern void _gcry_serpent_avx512_ctr_enc(const void *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *ctr); + +extern void _gcry_serpent_avx512_ocb_crypt(const void *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const ocb_L_uintptr_t Ls[32], + int encrypt); + +extern void _gcry_serpent_avx512_blk32(const void *c, byte *out, + const byte *in, + int encrypt); +#endif + #ifdef USE_NEON /* Assembler implementations of Serpent using ARM NEON. Process 8 block in parallel. @@ -758,6 +802,14 @@ serpent_setkey_internal (serpent_context_t *context, serpent_key_prepare (key, key_length, key_prepared); serpent_subkeys_generate (key_prepared, context->keys); +#ifdef USE_AVX512 + context->use_avx512 = 0; + if ((_gcry_get_hw_features () & HWF_INTEL_AVX512)) + { + context->use_avx512 = 1; + } +#endif + #ifdef USE_AVX2 context->use_avx2 = 0; if ((_gcry_get_hw_features () & HWF_INTEL_AVX2)) @@ -954,6 +1006,34 @@ _gcry_serpent_ctr_enc(void *context, unsigned char *ctr, unsigned char tmpbuf[sizeof(serpent_block_t)]; int burn_stack_depth = 2 * sizeof (serpent_block_t); +#ifdef USE_AVX512 + if (ctx->use_avx512) + { + int did_use_avx512 = 0; + + /* Process data in 32 block chunks. 
*/ + while (nblocks >= 32) + { + _gcry_serpent_avx512_ctr_enc(ctx, outbuf, inbuf, ctr); + + nblocks -= 32; + outbuf += 32 * sizeof(serpent_block_t); + inbuf += 32 * sizeof(serpent_block_t); + did_use_avx512 = 1; + } + + if (did_use_avx512) + { + /* serpent-avx512 code does not use stack */ + if (nblocks == 0) + burn_stack_depth = 0; + } + + /* Use generic/avx2/sse2 code to handle smaller chunks... */ + /* TODO: use caching instead? */ + } +#endif + #ifdef USE_AVX2 if (ctx->use_avx2) { @@ -1066,6 +1146,33 @@ _gcry_serpent_cbc_dec(void *context, unsigned char *iv, unsigned char savebuf[sizeof(serpent_block_t)]; int burn_stack_depth = 2 * sizeof (serpent_block_t); +#ifdef USE_AVX512 + if (ctx->use_avx512) + { + int did_use_avx512 = 0; + + /* Process data in 32 block chunks. */ + while (nblocks >= 32) + { + _gcry_serpent_avx512_cbc_dec(ctx, outbuf, inbuf, iv); + + nblocks -= 32; + outbuf += 32 * sizeof(serpent_block_t); + inbuf += 32 * sizeof(serpent_block_t); + did_use_avx512 = 1; + } + + if (did_use_avx512) + { + /* serpent-avx512 code does not use stack */ + if (nblocks == 0) + burn_stack_depth = 0; + } + + /* Use generic/avx2/sse2 code to handle smaller chunks... */ + } +#endif + #ifdef USE_AVX2 if (ctx->use_avx2) { @@ -1174,6 +1281,33 @@ _gcry_serpent_cfb_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 2 * sizeof (serpent_block_t); +#ifdef USE_AVX512 + if (ctx->use_avx512) + { + int did_use_avx512 = 0; + + /* Process data in 32 block chunks. */ + while (nblocks >= 32) + { + _gcry_serpent_avx512_cfb_dec(ctx, outbuf, inbuf, iv); + + nblocks -= 32; + outbuf += 32 * sizeof(serpent_block_t); + inbuf += 32 * sizeof(serpent_block_t); + did_use_avx512 = 1; + } + + if (did_use_avx512) + { + /* serpent-avx512 code does not use stack */ + if (nblocks == 0) + burn_stack_depth = 0; + } + + /* Use generic/avx2/sse2 code to handle smaller chunks... */ + } +#endif + #ifdef USE_AVX2 if (ctx->use_avx2) { @@ -1270,7 +1404,8 @@ static size_t _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt) { -#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON) +#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \ + || defined(USE_NEON) serpent_context_t *ctx = (void *)&c->context.c; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; @@ -1283,6 +1418,44 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, (void)encrypt; #endif +#ifdef USE_AVX512 + if (ctx->use_avx512) + { + int did_use_avx512 = 0; + ocb_L_uintptr_t Ls[32]; + ocb_L_uintptr_t *l; + + if (nblocks >= 32) + { + l = bulk_ocb_prepare_L_pointers_array_blk32 (c, Ls, blkn); + + /* Process data in 32 block chunks. */ + while (nblocks >= 32) + { + blkn += 32; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 32); + + _gcry_serpent_avx512_ocb_crypt(ctx, outbuf, inbuf, c->u_iv.iv, + c->u_ctr.ctr, Ls, encrypt); + + nblocks -= 32; + outbuf += 32 * sizeof(serpent_block_t); + inbuf += 32 * sizeof(serpent_block_t); + did_use_avx512 = 1; + } + } + + if (did_use_avx512) + { + /* serpent-avx512 code does not use stack */ + if (nblocks == 0) + burn_stack_depth = 0; + } + + /* Use generic code to handle smaller chunks... 
*/ + } +#endif + #ifdef USE_AVX2 if (ctx->use_avx2) { @@ -1408,7 +1581,8 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, } #endif -#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON) +#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \ + || defined(USE_NEON) c->u_mode.ocb.data_nblocks = blkn; if (burn_stack_depth) @@ -1556,17 +1730,27 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, static unsigned int -serpent_crypt_blk1_16(void *context, byte *out, const byte *in, +serpent_crypt_blk1_32(void *context, byte *out, const byte *in, size_t num_blks, int encrypt) { serpent_context_t *ctx = context; unsigned int burn, burn_stack_depth = 0; +#ifdef USE_AVX512 + if (num_blks == 32 && ctx->use_avx512) + { + _gcry_serpent_avx512_blk32 (ctx, out, in, encrypt); + return 0; + } +#endif + #ifdef USE_AVX2 - if (num_blks == 16 && ctx->use_avx2) + while (num_blks == 16 && ctx->use_avx2) { _gcry_serpent_avx2_blk16 (ctx, out, in, encrypt); - return 0; + out += 16 * sizeof(serpent_block_t); + in += 16 * sizeof(serpent_block_t); + num_blks -= 16; } #endif @@ -1611,17 +1795,17 @@ serpent_crypt_blk1_16(void *context, byte *out, const byte *in, } static unsigned int -serpent_encrypt_blk1_16(void *ctx, byte *out, const byte *in, +serpent_encrypt_blk1_32(void *ctx, byte *out, const byte *in, size_t num_blks) { - return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 1); + return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 1); } static unsigned int -serpent_decrypt_blk1_16(void *ctx, byte *out, const byte *in, +serpent_decrypt_blk1_32(void *ctx, byte *out, const byte *in, size_t num_blks) { - return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 0); + return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 0); } @@ -1638,12 +1822,12 @@ _gcry_serpent_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg, /* Process remaining blocks. */ if (nblocks) { - unsigned char tmpbuf[16 * 16]; + unsigned char tmpbuf[32 * 16]; unsigned int tmp_used = 16; size_t nburn; - nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16 - : serpent_decrypt_blk1_16, + nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32 + : serpent_decrypt_blk1_32, outbuf, inbuf, nblocks, tweak, tmpbuf, sizeof(tmpbuf) / 16, &tmp_used); @@ -1672,9 +1856,9 @@ _gcry_serpent_ecb_crypt (void *context, void *outbuf_arg, const void *inbuf_arg, { size_t nburn; - nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16 - : serpent_decrypt_blk1_16, - outbuf, inbuf, nblocks, 16); + nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32 + : serpent_decrypt_blk1_32, + outbuf, inbuf, nblocks, 32); burn_stack_depth = nburn > burn_stack_depth ? 
nburn : burn_stack_depth; } diff --git a/configure.ac b/configure.ac index 60fb1f75..572fe279 100644 --- a/configure.ac +++ b/configure.ac @@ -1704,6 +1704,46 @@ if test "$gcry_cv_gcc_inline_asm_bmi2" = "yes" ; then fi +# +# Check whether compiler supports x86/AVX512 intrinsics +# +_gcc_cflags_save=$CFLAGS +CFLAGS="$CFLAGS -mavx512f" + +AC_CACHE_CHECK([whether compiler supports x86/AVX512 intrinsics], + [gcry_cv_cc_x86_avx512_intrinsics], + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then + gcry_cv_cc_x86_avx512_intrinsics="n/a" + else + gcry_cv_cc_x86_avx512_intrinsics=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[#include + __m512i fn(void *in, __m128i y) + { + __m512i x; + x = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0), in) + ^ _mm512_castsi128_si512(y); + asm volatile ("vinserti32x4 \$3, %0, %%zmm6, %%zmm6;\n\t" + "vpxord %%zmm6, %%zmm6, %%zmm6" + ::"x"(y),"r"(in):"memory","xmm6"); + return x; + } + ]])], + [gcry_cv_cc_x86_avx512_intrinsics=yes]) + fi]) +if test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes" ; then + AC_DEFINE(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS,1, + [Defined if underlying compiler supports x86/AVX512 intrinsics]) +fi + +AM_CONDITIONAL(ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS, + test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes") + +# Restore flags. +CFLAGS=$_gcc_cflags_save; + + # # Check whether GCC assembler needs "-Wa,--divide" to correctly handle # constant division @@ -3034,6 +3074,11 @@ if test "$found" = "1" ; then GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx2-amd64.lo" fi + if test x"$avx512support" = xyes ; then + # Build with the AVX512 implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx512-x86.lo" + fi + if test x"$neonsupport" = xyes ; then # Build with the NEON implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-armv7-neon.lo" -- 2.39.2 From jussi.kivilinna at iki.fi Mon May 29 20:52:57 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 29 May 2023 21:52:57 +0300 Subject: [PATCH] twofish-avx2: de-unroll round function Message-ID: <20230529185257.789223-1-jussi.kivilinna@iki.fi> * cipher/twofish-avx2-amd64.S (__twofish_enc_blk16) (__twofish_dec_blk16): Use loop structure instead of unrolling. -- De-unrolling reduces code-size significantly and gives small (<1%) increase in speed (tested on zen4, tiger-lake). 
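To illustrate the structural change in plain C (a sketch only, not the actual AVX2 assembly: `mix' is a hypothetical stand-in for the encrypt_round_first16/encrypt_round16/encrypt_round_last16 macro family, and `rk' stands for the RK subkey pointer; only the loop shape is meant to match):

    #include <stdint.h>

    /* `r' plays the role of the RROUND register: a byte offset into the
       round-subkey array, advanced by 16 per loop iteration instead of
       being a distinct compile-time constant for every unrolled cycle.
       rk must provide the 32 round-subkey words (byte offsets 0..120). */
    static void mix (uint32_t st[4], const uint32_t *rk, unsigned nk, unsigned r)
    {
      const uint32_t *k = rk + (nk + r) / 4;   /* byte offset -> word index */
      st[0] += k[0];
      st[1] += k[1];                           /* placeholder for the real round body */
    }

    static void enc_blk_looped (uint32_t st[4], const uint32_t *rk)
    {
      unsigned r = 0;

      mix (st, rk, 0, r);                      /* like encrypt_round_first16 */
      do
        {
          mix (st, rk, 8, r);                  /* like encrypt_round16(..., 8, RROUND) */
          mix (st, rk, 16, r);                 /* like encrypt_round16(..., 16, RROUND) */
          r += 16;
        }
      while (r < 8 * 14);
      mix (st, rk, 8, r);                      /* like encrypt_round_last16 */
    }

The unrolled version emitted all sixteen round bodies with distinct immediate key offsets; keeping the offset in a register (RROUND) adds an add and a compare/branch per iteration but collapses the code to a single loop body, which is where the code-size saving comes from.
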
Signed-off-by: Jussi Kivilinna --- cipher/twofish-avx2-amd64.S | 115 +++++++++++++++--------------------- 1 file changed, 49 insertions(+), 66 deletions(-) diff --git a/cipher/twofish-avx2-amd64.S b/cipher/twofish-avx2-amd64.S index 8a6aae19..d05ec1f9 100644 --- a/cipher/twofish-avx2-amd64.S +++ b/cipher/twofish-avx2-amd64.S @@ -39,8 +39,8 @@ /* register macros */ #define CTX %rdi -#define RROUND %rbp -#define RROUNDd %ebp +#define RROUND %r12 +#define RROUNDd %r12d #define RS0 CTX #define RS1 %r8 #define RS2 %r9 @@ -154,9 +154,9 @@ #define encrypt_round_end16(a, b, c, d, nk, r) \ vpaddd RY0, RX0, RX0; \ vpaddd RX0, RY0, RY0; \ - vpbroadcastd ((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd ((nk))(RK,r), RT0; \ vpaddd RT0, RX0, RX0; \ - vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd 4+((nk))(RK,r), RT0; \ vpaddd RT0, RY0, RY0; \ \ vpxor RY0, d ## 0, d ## 0; \ @@ -168,9 +168,9 @@ \ vpaddd RY1, RX1, RX1; \ vpaddd RX1, RY1, RY1; \ - vpbroadcastd ((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd ((nk))(RK,r), RT0; \ vpaddd RT0, RX1, RX1; \ - vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd 4+((nk))(RK,r), RT0; \ vpaddd RT0, RY1, RY1; \ \ vpxor RY1, d ## 1, d ## 1; \ @@ -216,9 +216,9 @@ #define decrypt_round_end16(a, b, c, d, nk, r) \ vpaddd RY0, RX0, RX0; \ vpaddd RX0, RY0, RY0; \ - vpbroadcastd ((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd ((nk))(RK,r), RT0; \ vpaddd RT0, RX0, RX0; \ - vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd 4+((nk))(RK,r), RT0; \ vpaddd RT0, RY0, RY0; \ \ vpxor RX0, c ## 0, c ## 0; \ @@ -230,9 +230,9 @@ \ vpaddd RY1, RX1, RX1; \ vpaddd RX1, RY1, RY1; \ - vpbroadcastd ((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd ((nk))(RK,r), RT0; \ vpaddd RT0, RX1, RX1; \ - vpbroadcastd 4+((nk)+((r)*8))(RK), RT0; \ + vpbroadcastd 4+((nk))(RK,r), RT0; \ vpaddd RT0, RY1, RY1; \ \ vpxor RX1, c ## 1, c ## 1; \ @@ -275,30 +275,6 @@ \ decrypt_round_end16(a, b, c, d, nk, r); -#define encrypt_cycle16(r) \ - encrypt_round16(RA, RB, RC, RD, 0, r); \ - encrypt_round16(RC, RD, RA, RB, 8, r); - -#define encrypt_cycle_first16(r) \ - encrypt_round_first16(RA, RB, RC, RD, 0, r); \ - encrypt_round16(RC, RD, RA, RB, 8, r); - -#define encrypt_cycle_last16(r) \ - encrypt_round16(RA, RB, RC, RD, 0, r); \ - encrypt_round_last16(RC, RD, RA, RB, 8, r); - -#define decrypt_cycle16(r) \ - decrypt_round16(RC, RD, RA, RB, 8, r); \ - decrypt_round16(RA, RB, RC, RD, 0, r); - -#define decrypt_cycle_first16(r) \ - decrypt_round_first16(RC, RD, RA, RB, 8, r); \ - decrypt_round16(RA, RB, RC, RD, 0, r); - -#define decrypt_cycle_last16(r) \ - decrypt_round16(RC, RD, RA, RB, 8, r); \ - decrypt_round_last16(RA, RB, RC, RD, 0, r); - #define transpose_4x4(x0,x1,x2,x3,t1,t2) \ vpunpckhdq x1, x0, t2; \ vpunpckldq x1, x0, x0; \ @@ -312,22 +288,6 @@ vpunpckhqdq x2, t2, x3; \ vpunpcklqdq x2, t2, x2; -#define read_blocks8(offs,a,b,c,d) \ - vmovdqu 16*offs(RIO), a; \ - vmovdqu 16*offs+32(RIO), b; \ - vmovdqu 16*offs+64(RIO), c; \ - vmovdqu 16*offs+96(RIO), d; \ - \ - transpose_4x4(a, b, c, d, RX0, RY0); - -#define write_blocks8(offs,a,b,c,d) \ - transpose_4x4(a, b, c, d, RX0, RY0); \ - \ - vmovdqu a, 16*offs(RIO); \ - vmovdqu b, 16*offs+32(RIO); \ - vmovdqu c, 16*offs+64(RIO); \ - vmovdqu d, 16*offs+96(RIO); - #define inpack_enc8(a,b,c,d) \ vpbroadcastd 4*0(RW), RT0; \ vpxor RT0, a, a; \ @@ -414,23 +374,35 @@ __twofish_enc_blk16: * ciphertext blocks */ CFI_STARTPROC(); + + pushq RROUND; + CFI_PUSH(RROUND); + init_round_constants(); transpose4x4_16(RA, RB, RC, RD); inpack_enc16(RA, RB, RC, RD); - encrypt_cycle_first16(0); - 
encrypt_cycle16(2); - encrypt_cycle16(4); - encrypt_cycle16(6); - encrypt_cycle16(8); - encrypt_cycle16(10); - encrypt_cycle16(12); - encrypt_cycle_last16(14); + xorl RROUNDd, RROUNDd; + + encrypt_round_first16(RA, RB, RC, RD, 0, RROUND); + +.align 16 +.Loop_enc16: + encrypt_round16(RC, RD, RA, RB, 8, RROUND); + encrypt_round16(RA, RB, RC, RD, 16, RROUND); + leal 16(RROUNDd), RROUNDd; + cmpl $8*14, RROUNDd; + jb .Loop_enc16; + + encrypt_round_last16(RC, RD, RA, RB, 8, RROUND); outunpack_enc16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); + popq RROUND; + CFI_POP(RROUND); + ret_spec_stop; CFI_ENDPROC(); ELF(.size __twofish_enc_blk16,.-__twofish_enc_blk16;) @@ -447,23 +419,34 @@ __twofish_dec_blk16: * ciphertext blocks */ CFI_STARTPROC(); + + pushq RROUND; + CFI_PUSH(RROUND); + init_round_constants(); transpose4x4_16(RA, RB, RC, RD); inpack_dec16(RA, RB, RC, RD); - decrypt_cycle_first16(14); - decrypt_cycle16(12); - decrypt_cycle16(10); - decrypt_cycle16(8); - decrypt_cycle16(6); - decrypt_cycle16(4); - decrypt_cycle16(2); - decrypt_cycle_last16(0); + movl $14*8, RROUNDd; + + decrypt_round_first16(RC, RD, RA, RB, 8, RROUND); + +.align 16 +.Loop_dec16: + decrypt_round16(RA, RB, RC, RD, 0, RROUND); + decrypt_round16(RC, RD, RA, RB, -8, RROUND); + subl $16, RROUNDd; + jnz .Loop_dec16; + + decrypt_round_last16(RA, RB, RC, RD, 0, RROUND); outunpack_dec16(RA, RB, RC, RD); transpose4x4_16(RA, RB, RC, RD); + popq RROUND; + CFI_POP(RROUND); + ret_spec_stop; CFI_ENDPROC(); ELF(.size __twofish_dec_blk16,.-__twofish_dec_blk16;) -- 2.39.2 From wk at gnupg.org Tue May 30 12:32:46 2023 From: wk at gnupg.org (Werner Koch) Date: Tue, 30 May 2023 12:32:46 +0200 Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption In-Reply-To: <20230528145355.532424-1-jussi.kivilinna@iki.fi> (Jussi Kivilinna's message of "Sun, 28 May 2023 17:53:55 +0300") References: <20230528145355.532424-1-jussi.kivilinna@iki.fi> Message-ID: <87ilcajei9.fsf@wheatstone.g10code.de> On Sun, 28 May 2023 17:53, Jussi Kivilinna said: > Inline checksumming is far faster on Ryzen processors on i386 > builds than two-pass checksumming. That is indeed a large performance boost. Did you had a chance to benchmark it on some common Intel CPU? Shalom-Salam, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From jussi.kivilinna at iki.fi Wed May 31 06:48:39 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 31 May 2023 07:48:39 +0300 Subject: [PATCH] rijndael-aesni: use inline checksumming for OCB decryption In-Reply-To: <87ilcajei9.fsf@wheatstone.g10code.de> References: <20230528145355.532424-1-jussi.kivilinna@iki.fi> <87ilcajei9.fsf@wheatstone.g10code.de> Message-ID: <918b7158-7385-c845-c102-ffca049d8f41@iki.fi> On 30.5.2023 13.32, Werner Koch via Gcrypt-devel wrote: > On Sun, 28 May 2023 17:53, Jussi Kivilinna said: > >> Inline checksumming is far faster on Ryzen processors on i386 >> builds than two-pass checksumming. > > That is indeed a large performance boost. Did you had a chance to > benchmark it on some common Intel CPU? > I tested now with Intel tigerlake, performance dropped by 9% which is unexpectedly large change. I'll try few different things to see if I can avoid such drop. 
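For reference, the difference between the two approaches, as a plain C sketch (hypothetical block_t / xor_block / decrypt_block helpers; the real code does all of this in xmm registers inside the AES-NI assembly, and the Offset_i = Offset_{i-1} xor L_{ntz(i)} update is omitted here):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint8_t b[16]; } block_t;

    static void xor_block (block_t *dst, const block_t *src)
    {
      size_t i;
      for (i = 0; i < 16; i++)
        dst->b[i] ^= src->b[i];
    }

    /* Inline variant: fold each plaintext block into Checksum as soon as
       it is produced, inside the main decryption loop. */
    static void ocb_dec_inline (block_t *out, const block_t *in, size_t n,
                                block_t *offset, block_t *checksum,
                                void (*decrypt_block) (block_t *, const block_t *))
    {
      size_t i;
      for (i = 0; i < n; i++)
        {
          block_t tmp = in[i];
          xor_block (&tmp, offset);        /* C_i xor Offset_i */
          decrypt_block (&out[i], &tmp);
          xor_block (&out[i], offset);     /* P_i = Offset_i xor DECIPHER(K, ...) */
          xor_block (checksum, &out[i]);   /* Checksum_i = Checksum_{i-1} xor P_i */
        }
    }

    /* Two-pass variant: decrypt first, then sweep the plaintext again just
       to accumulate the checksum. */
    static void ocb_checksum_pass (block_t *checksum, const block_t *pt, size_t n)
    {
      size_t i;
      for (i = 0; i < n; i++)
        xor_block (checksum, &pt[i]);
    }

The two-pass variant (roughly what the dropped aesni_ocb_checksum() pass did, in vectorised form) reads the whole plaintext buffer a second time; the inline variant avoids that second sweep at the cost of extra xors and some register pressure inside the hot loop.
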
-Jussi > > Shalom-Salam, > > Werner