From jussi.kivilinna at iki.fi Sun Apr 2 19:22:48 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 2 Apr 2023 20:22:48 +0300 Subject: [PATCH] cipher-gcm-ppc: tweak loop structure a bit Message-ID: <20230402172248.1085628-1-jussi.kivilinna@iki.fi> * cipher/cipher-gcm-ppc.c (_gcry_ghash_ppc_vpmsum): Increment 'buf' pointer right after use; Use 'for' loop for inner 4-blocks loop to allow compiler to better optimize loop. -- Benchmark on POWER9: Before: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 0.226 ns/B 4211 MiB/s 0.521 c/B After: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 0.224 ns/B 4248 MiB/s 0.516 c/B Signed-off-by: Jussi Kivilinna --- cipher/cipher-gcm-ppc.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/cipher/cipher-gcm-ppc.c b/cipher/cipher-gcm-ppc.c index 4f75e95c..06bf5eb1 100644 --- a/cipher/cipher-gcm-ppc.c +++ b/cipher/cipher-gcm-ppc.c @@ -437,6 +437,7 @@ _gcry_ghash_ppc_vpmsum (byte *result, void *gcm_table, in1 = vec_load_he (16, buf); in2 = vec_load_he (32, buf); in3 = vec_load_he (48, buf); + buf += 64; in0 = vec_be_swap(in0, bswap_const); in1 = vec_be_swap(in1, bswap_const); in2 = vec_be_swap(in2, bswap_const); @@ -464,17 +465,13 @@ _gcry_ghash_ppc_vpmsum (byte *result, void *gcm_table, Xh3 = asm_xor (Xh3, Xh1); /* Gerald Estrin's scheme for parallel multiplication of polynomials */ - while (1) + for (; blocks_remaining >= 4; blocks_remaining -= 4) { - buf += 64; - blocks_remaining -= 4; - if (!blocks_remaining) - break; - in0 = vec_load_he (0, buf); in1 = vec_load_he (16, buf); in2 = vec_load_he (32, buf); in3 = vec_load_he (48, buf); + buf += 64; in1 = vec_be_swap(in1, bswap_const); in2 = vec_be_swap(in2, bswap_const); in3 = vec_be_swap(in3, bswap_const); -- 2.37.2 From falko.strenzke at mtg.de Mon Apr 3 05:59:10 2023 From: falko.strenzke at mtg.de (Falko Strenzke) Date: Mon, 3 Apr 2023 05:59:10 +0200 Subject: Implementation of PQC Algorithms in libgcrypt In-Reply-To: 
<87tty2cq2q.fsf@wheatstone.g10code.de> References: <958d689a-5f76-5fbe-f3ef-140bc1b2d132@mtg.de> <87tty2cq2q.fsf@wheatstone.g10code.de> Message-ID: <85f0fb8c-1587-262c-187c-a5a6bc590145@mtg.de> Hi Werner, the only API change is the addition of the following interface function: gcry_err_code_t _gcry_pk_encap(gcry_sexp_t *r_ciph, gcry_sexp_t *r_shared_key, gcry_sexp_t s_pkey) This also means that the public key spec needs to contain this additional function. For Kyber our public key spec currently looks as follows: gcry_pk_spec_t _gcry_pubkey_spec_kyber = { GCRY_PK_KYBER, {0, 1}, (GCRY_PK_USAGE_ENCAP), /* TODOMTG: can the key usage "encryption" remain or do we need a new KU "encap"? */ "Kyber", kyber_names, "p", "s", "a", "", "p", /* elements of pub-key, sec-key, ciphertext, signature, key-grip */ kyber_generate, kyber_check_secret_key, NULL, /* encrypt */ kyber_encap, kyber_decrypt, NULL, /* sign */ NULL, /* verify */ kyber_get_nbits, run_selftests, compute_keygrip }; For the PKEs the encapsulation function would of course be NULL. Regarding the TODO on the key usage marked in the code above, this so far doesn't seem to have any implications for us, so the decision isn't urgent from my point of view. - Falko On 30.03.23 at 15:43, Werner Koch wrote: > On Wed, 29 Mar 2023 10:09, Falko Strenzke said: > >> While the integration of the signature algorithms is straightforward, the KEM >> requires a new interface function, as the KEM encapsulation cannot be modelled >> by a public-key encryption. > It would be good if we can discuss a proposed API early enough, so that > we can see how it fits into the design of Libgcrypt. Can you already > roughly describe the needs? > > > Salam-Shalom, > > Werner > -- *MTG AG* Dr. Falko Strenzke Executive System Architect Phone: +49 6151 8000 24 E-Mail: falko.strenzke at mtg.de Web: mtg.de *MTG Exhibitions – 
See you in 2023* ------------------------------------------------------------------------ MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany Commercial register: HRB 8901 Register Court: Amtsgericht Darmstadt Management Board: Jürgen Ruf (CEO), Tamer Kemeröz Chairman of the Supervisory Board: Dr. Thomas Milde This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error, please inform the sender immediately and delete this email. Unauthorised copying or distribution of this email is not permitted. Data protection information: Privacy policy From jussi.kivilinna at iki.fi Sat Apr 22 09:35:35 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:35 +0300 Subject: [PATCH 1/5] bench-slope: add MPI benchmarking Message-ID: <20230422073539.894723-1-jussi.kivilinna@iki.fi> * tests/bench-slope.c (MPI_START_SIZE, MPI_END_SIZE, MPI_STEP_SIZE) (MPI_NUM_STEPS, bench_mpi_test, mpi_test_names, bench_mpi_mode) (bench_mpi_hd, bench_mpi_init, bench_mpi_free, bench_mpi_do_bench) (mpi_ops, mpi_modes, mpi_bench_one, _mpi_bench, mpi_match_test) (mpi_bench): New. (print_help): Add mention of 'mpi'. (main): Add "mpi" tests. 
-- Patch adds MPI operation benchmarking for bench-slope: $ tests/bench-slope --cpu-mhz auto mpi MPI: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.054 ns/B 17580 MiB/s 0.298 c/B 5500 sub | 0.083 ns/B 11432 MiB/s 0.459 c/B 5500 rshift3 | 0.033 ns/B 28862 MiB/s 0.182 c/B 5499 lshift3 | 0.093 ns/B 10256 MiB/s 0.511 c/B 5500 rshift65 | 0.096 ns/B 9888 MiB/s 0.530 c/B 5500 lshift65 | 0.093 ns/B 10228 MiB/s 0.513 c/B 5500 mul4 | 0.074 ns/B 12825 MiB/s 0.409 c/B 5500 mul8 | 0.072 ns/B 13313 MiB/s 0.394 c/B 5500 mul16 | 0.148 ns/B 6450 MiB/s 0.813 c/B 5500 mul32 | 0.299 ns/B 3191 MiB/s 1.64 c/B 5500 div4 | 0.458 ns/B 2080 MiB/s 2.52 c/B 5500 div8 | 0.458 ns/B 2084 MiB/s 2.52 c/B 5500 div16 | 0.602 ns/B 1584 MiB/s 3.31 c/B 5500 div32 | 0.926 ns/B 1030 MiB/s 5.09 c/B 5500 mod4 | 0.443 ns/B 2151 MiB/s 2.44 c/B 5500 mod8 | 0.443 ns/B 2152 MiB/s 2.44 c/B 5500 mod16 | 0.600 ns/B 1590 MiB/s 3.30 c/B 5500 mod32 | 0.924 ns/B 1032 MiB/s 5.08 c/B 5500 Signed-off-by: Jussi Kivilinna --- tests/bench-slope.c | 308 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 307 insertions(+), 1 deletion(-) diff --git a/tests/bench-slope.c b/tests/bench-slope.c index f8031e5e..2a203a07 100644 --- a/tests/bench-slope.c +++ b/tests/bench-slope.c @@ -2933,13 +2933,310 @@ ecc_bench (char **argv, int argc) #endif } +/************************************************************ MPI benchmarks. 
*/ + +#define MPI_START_SIZE 64 +#define MPI_END_SIZE 1024 +#define MPI_STEP_SIZE 8 +#define MPI_NUM_STEPS (((MPI_END_SIZE - MPI_START_SIZE) / MPI_STEP_SIZE) + 1) + +enum bench_mpi_test +{ + MPI_TEST_ADD = 0, + MPI_TEST_SUB, + MPI_TEST_RSHIFT3, + MPI_TEST_LSHIFT3, + MPI_TEST_RSHIFT65, + MPI_TEST_LSHIFT65, + MPI_TEST_MUL4, + MPI_TEST_MUL8, + MPI_TEST_MUL16, + MPI_TEST_MUL32, + MPI_TEST_DIV4, + MPI_TEST_DIV8, + MPI_TEST_DIV16, + MPI_TEST_DIV32, + MPI_TEST_MOD4, + MPI_TEST_MOD8, + MPI_TEST_MOD16, + MPI_TEST_MOD32, + __MAX_MPI_TEST +}; + +static const char * const mpi_test_names[] = +{ + "add", + "sub", + "rshift3", + "lshift3", + "rshift65", + "lshift65", + "mul4", + "mul8", + "mul16", + "mul32", + "div4", + "div8", + "div16", + "div32", + "mod4", + "mod8", + "mod16", + "mod32", + NULL, +}; + +struct bench_mpi_mode +{ + const char *name; + struct bench_ops *ops; + + enum bench_mpi_test test_id; +}; + +struct bench_mpi_hd +{ + gcry_mpi_t bytes[MPI_NUM_STEPS + 1]; + gcry_mpi_t y; +}; + +static int +bench_mpi_init (struct bench_obj *obj) +{ + struct bench_mpi_mode *mode = obj->priv; + struct bench_mpi_hd *hd; + int y_bytes; + int i, j; + + (void)mode; + + obj->min_bufsize = MPI_START_SIZE; + obj->max_bufsize = MPI_END_SIZE; + obj->step_size = MPI_STEP_SIZE; + obj->num_measure_repetitions = num_measurement_repetitions; + + hd = calloc (1, sizeof(*hd)); + if (!hd) + return -1; + + /* Generate input MPIs for benchmark. 
*/ + for (i = MPI_START_SIZE, j = 0; j < DIM(hd->bytes); i += MPI_STEP_SIZE, j++) + { + hd->bytes[j] = gcry_mpi_new (i * 8); + gcry_mpi_randomize (hd->bytes[j], i * 8, GCRY_WEAK_RANDOM); + gcry_mpi_set_bit (hd->bytes[j], i * 8 - 1); + } + + switch (mode->test_id) + { + case MPI_TEST_MUL4: + case MPI_TEST_DIV4: + case MPI_TEST_MOD4: + y_bytes = 4; + break; + + case MPI_TEST_MUL8: + case MPI_TEST_DIV8: + case MPI_TEST_MOD8: + y_bytes = 8; + break; + + case MPI_TEST_MUL16: + case MPI_TEST_DIV16: + case MPI_TEST_MOD16: + y_bytes = 16; + break; + + case MPI_TEST_MUL32: + case MPI_TEST_DIV32: + case MPI_TEST_MOD32: + y_bytes = 32; + break; + + default: + y_bytes = 0; + break; + } + + hd->y = gcry_mpi_new (y_bytes * 8); + if (y_bytes) + { + gcry_mpi_randomize (hd->y, y_bytes * 8, GCRY_WEAK_RANDOM); + gcry_mpi_set_bit (hd->y, y_bytes * 8 - 1); + } + + obj->hd = hd; + return 0; +} + +static void +bench_mpi_free (struct bench_obj *obj) +{ + struct bench_mpi_hd *hd = obj->hd; + int i; + + gcry_mpi_release (hd->y); + for (i = DIM(hd->bytes) - 1; i >= 0; i--) + gcry_mpi_release (hd->bytes[i]); + + free(hd); +} + +static void +bench_mpi_do_bench (struct bench_obj *obj, void *buf, size_t buflen) +{ + struct bench_mpi_hd *hd = obj->hd; + struct bench_mpi_mode *mode = obj->priv; + int bytes_idx = (buflen - MPI_START_SIZE) / MPI_STEP_SIZE; + gcry_mpi_t x; + + (void)buf; + + x = gcry_mpi_new (2 * (MPI_END_SIZE + 1) * 8); + + switch (mode->test_id) + { + case MPI_TEST_ADD: + gcry_mpi_add (x, hd->bytes[bytes_idx], hd->bytes[bytes_idx]); + break; + + case MPI_TEST_SUB: + gcry_mpi_sub (x, hd->bytes[bytes_idx + 1], hd->bytes[bytes_idx]); + break; + + case MPI_TEST_RSHIFT3: + gcry_mpi_rshift (x, hd->bytes[bytes_idx], 3); + break; + + case MPI_TEST_LSHIFT3: + gcry_mpi_lshift (x, hd->bytes[bytes_idx], 3); + break; + + case MPI_TEST_RSHIFT65: + gcry_mpi_rshift (x, hd->bytes[bytes_idx], 65); + break; + + case MPI_TEST_LSHIFT65: + gcry_mpi_lshift (x, hd->bytes[bytes_idx], 65); + break; + + case 
MPI_TEST_MUL4: + case MPI_TEST_MUL8: + case MPI_TEST_MUL16: + case MPI_TEST_MUL32: + gcry_mpi_mul (x, hd->bytes[bytes_idx], hd->y); + break; + + case MPI_TEST_DIV4: + case MPI_TEST_DIV8: + case MPI_TEST_DIV16: + case MPI_TEST_DIV32: + gcry_mpi_div (x, NULL, hd->bytes[bytes_idx], hd->y, 0); + break; + + case MPI_TEST_MOD4: + case MPI_TEST_MOD8: + case MPI_TEST_MOD16: + case MPI_TEST_MOD32: + gcry_mpi_mod (x, hd->bytes[bytes_idx], hd->y); + break; + + default: + break; + } + + gcry_mpi_release (x); +} + +static struct bench_ops mpi_ops = { + &bench_mpi_init, + &bench_mpi_free, + &bench_mpi_do_bench +}; + + +static struct bench_mpi_mode mpi_modes[] = { + {"", &mpi_ops}, + {0}, +}; + + +static void +mpi_bench_one (int test_id, struct bench_mpi_mode *pmode) +{ + struct bench_mpi_mode mode = *pmode; + struct bench_obj obj = { 0 }; + double result; + + mode.test_id = test_id; + + if (mode.name[0] == '\0') + bench_print_algo (-18, mpi_test_names[test_id]); + else + bench_print_algo (18, mode.name); + + obj.ops = mode.ops; + obj.priv = &mode; + + result = do_slope_benchmark (&obj); + + bench_print_result (result); +} + +static void +_mpi_bench (int test_id) +{ + int i; + + for (i = 0; mpi_modes[i].name; i++) + mpi_bench_one (test_id, &mpi_modes[i]); +} + +static int +mpi_match_test(const char *name) +{ + int i; + + for (i = 0; i < __MAX_MPI_TEST; i++) + if (strcmp(name, mpi_test_names[i]) == 0) + return i; + + return -1; +} + +void +mpi_bench (char **argv, int argc) +{ + int i, test_id; + + bench_print_section ("mpi", "MPI"); + bench_print_header (18, ""); + + if (argv && argc) + { + for (i = 0; i < argc; i++) + { + test_id = mpi_match_test (argv[i]); + if (test_id >= 0) + _mpi_bench (test_id); + } + } + else + { + for (i = 0; i < __MAX_MPI_TEST; i++) + _mpi_bench (i); + } + + bench_print_footer (18); +} + /************************************************************** Main program. 
*/ void print_help (void) { static const char *help_lines[] = { - "usage: bench-slope [options] [hash|mac|cipher|kdf|ecc [algonames]]", + "usage: bench-slope [options] [hash|mac|cipher|kdf|ecc|mpi [algonames]]", "", " options:", " --cpu-mhz Set CPU speed for calculating cycles", @@ -3128,6 +3425,7 @@ main (int argc, char **argv) cipher_bench (NULL, 0); kdf_bench (NULL, 0); ecc_bench (NULL, 0); + mpi_bench (NULL, 0); } else if (!strcmp (*argv, "hash")) { @@ -3169,6 +3467,14 @@ main (int argc, char **argv) warm_up_cpu (); ecc_bench ((argc == 0) ? NULL : argv, argc); } + else if (!strcmp (*argv, "mpi")) + { + argc--; + argv++; + + warm_up_cpu (); + mpi_bench ((argc == 0) ? NULL : argv, argc); + } else { fprintf (stderr, PGM ": unknown argument: %s\n", *argv); -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:37 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:37 +0300 Subject: [PATCH 3/5] mpi/amd64: fix use of 'movd' for 64-bit register move in lshift&rshift In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-3-jussi.kivilinna@iki.fi> * mpi/amd64/mpih-lshift.S: Use 'movq' instead of 'movd' for moving value to %rax. * mpi/amd64/mpih-rshift.S: Likewise. 
-- Signed-off-by: Jussi Kivilinna --- mpi/amd64/mpih-lshift.S | 2 +- mpi/amd64/mpih-rshift.S | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mpi/amd64/mpih-lshift.S b/mpi/amd64/mpih-lshift.S index c11e808c..3fa6e4fd 100644 --- a/mpi/amd64/mpih-lshift.S +++ b/mpi/amd64/mpih-lshift.S @@ -52,7 +52,7 @@ C_SYMBOL_NAME(_gcry_mpih_lshift:) movd %eax, %xmm0 movdqa %xmm4, %xmm3 psrlq %xmm0, %xmm4 - movd %xmm4, %rax + movq %xmm4, %rax subq $2, %rdx jl .Lendo diff --git a/mpi/amd64/mpih-rshift.S b/mpi/amd64/mpih-rshift.S index 430ba4b0..4bc5db22 100644 --- a/mpi/amd64/mpih-rshift.S +++ b/mpi/amd64/mpih-rshift.S @@ -52,7 +52,7 @@ C_SYMBOL_NAME(_gcry_mpih_rshift:) movd %eax, %xmm0 movdqa %xmm4, %xmm3 psllq %xmm0, %xmm4 - movd %xmm4, %rax + movq %xmm4, %rax leaq (%rsi,%rdx,8), %rsi leaq (%rdi,%rdx,8), %rdi negq %rdx -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:36 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:36 +0300 Subject: [PATCH 2/5] mpi: avoid MPI copy at gcry_mpi_sub In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-2-jussi.kivilinna@iki.fi> * mpi/mpi-add.c (_gcry_mpi_add): Rename function... (_gcry_mpi_add_inv_sign): ... to this and add parameter for inverting sign of second operand. (_gcry_mpi_add): New. (_gcry_mpi_sub): Remove mpi_copy and instead use new '_gcry_mpi_add_inv_sign' function with inverted sign for second operand. 
-- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.052 ns/B 18301 MiB/s 0.287 c/B 5500 sub | 0.098 ns/B 9768 MiB/s 0.537 c/B 5500 After: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.030 ns/B 31771 MiB/s 0.165 c/B 5500 sub | 0.031 ns/B 31187 MiB/s 0.168 c/B 5500 Signed-off-by: Jussi Kivilinna --- mpi/mpi-add.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/mpi/mpi-add.c b/mpi/mpi-add.c index 38dd352f..2fd19e55 100644 --- a/mpi/mpi-add.c +++ b/mpi/mpi-add.c @@ -84,8 +84,8 @@ _gcry_mpi_add_ui (gcry_mpi_t w, gcry_mpi_t u, unsigned long v ) } -void -_gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) +static void +_gcry_mpi_add_inv_sign(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, int inv_v_sign) { mpi_ptr_t wp, up, vp; mpi_size_t usize, vsize, wsize; @@ -93,7 +93,7 @@ _gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) if( u->nlimbs < v->nlimbs ) { /* Swap U and V. */ usize = v->nlimbs; - usign = v->sign; + usign = v->sign ^ inv_v_sign; vsize = u->nlimbs; vsign = u->sign; wsize = usize + 1; @@ -106,7 +106,7 @@ _gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) usize = u->nlimbs; usign = u->sign; vsize = v->nlimbs; - vsign = v->sign; + vsign = v->sign ^ inv_v_sign; wsize = usize + 1; RESIZE_IF_NEEDED(w, wsize); /* These must be after realloc (u or v may be the same as w). */ @@ -211,13 +211,16 @@ _gcry_mpi_sub_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned long v ) w->sign = wsign; } +void +_gcry_mpi_add(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) +{ + _gcry_mpi_add_inv_sign (w, u, v, 0); +} + void _gcry_mpi_sub(gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v) { - gcry_mpi_t vv = mpi_copy (v); - vv->sign = ! 
vv->sign; - mpi_add (w, u, vv); - mpi_free (vv); + _gcry_mpi_add_inv_sign (w, u, v, 1); } -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:38 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:38 +0300 Subject: [PATCH 4/5] mpi/amd64: optimize add_n and sub_n In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: <20230422073539.894723-4-jussi.kivilinna@iki.fi> * mpi/amd64/mpih-add1.S (_gcry_mpih_add_n): New implementation with 4x unrolled fast-path loop. * mpi/amd64/mpih-sub1.S (_gcry_mpih_sub_n): Likewise. -- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.035 ns/B 27559 MiB/s 0.163 c/B 4700 sub | 0.034 ns/B 28332 MiB/s 0.158 c/B 4700 After (~26% faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz add | 0.027 ns/B 35271 MiB/s 0.127 c/B 4700 sub | 0.027 ns/B 35206 MiB/s 0.127 c/B 4700 Signed-off-by: Jussi Kivilinna --- mpi/amd64/mpih-add1.S | 81 ++++++++++++++++++++++++++++++++++++------- mpi/amd64/mpih-sub1.S | 80 +++++++++++++++++++++++++++++++++++------- 2 files changed, 136 insertions(+), 25 deletions(-) diff --git a/mpi/amd64/mpih-add1.S b/mpi/amd64/mpih-add1.S index 833a43cb..f2e86237 100644 --- a/mpi/amd64/mpih-add1.S +++ b/mpi/amd64/mpih-add1.S @@ -3,6 +3,7 @@ * * Copyright (C) 1992, 1994, 1995, 1998, * 2001, 2002, 2006 Free Software Foundation, Inc. + * Copyright (C) 2023 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -39,26 +40,80 @@ * mpi_ptr_t s2_ptr, rdx * mpi_size_t size) rcx */ - TEXT ALIGN(4) .globl C_SYMBOL_NAME(_gcry_mpih_add_n) C_SYMBOL_NAME(_gcry_mpih_add_n:) FUNC_ENTRY() - leaq (%rsi,%rcx,8), %rsi - leaq (%rdi,%rcx,8), %rdi - leaq (%rdx,%rcx,8), %rdx - negq %rcx - xorl %eax, %eax /* clear cy */ + movl %ecx, %r9d + andl $3, %r9d + je .Lprehandle0 + cmpl $2, %r9d + jb .Lprehandle1 + je .Lprehandle2 + +#define FIRST_ADD() \ + movq (%rsi), %rax; \ + addq (%rdx), %rax; \ + movq %rax, (%rdi) + +#define NEXT_ADD(offset) \ + movq offset(%rsi), %rax; \ + adcq offset(%rdx), %rax; \ + movq %rax, offset(%rdi) + +.Lprehandle3: + leaq -2(%rcx), %rcx + FIRST_ADD(); + NEXT_ADD(8); + NEXT_ADD(16); + decq %rcx + je .Lend + leaq 24(%rsi), %rsi + leaq 24(%rdx), %rdx + leaq 24(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle2: + leaq -1(%rcx), %rcx + FIRST_ADD(); + NEXT_ADD(8); + decq %rcx + je .Lend + leaq 16(%rsi), %rsi + leaq 16(%rdx), %rdx + leaq 16(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle1: + FIRST_ADD(); + decq %rcx + je .Lend + leaq 8(%rsi), %rsi + leaq 8(%rdx), %rdx + leaq 8(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle0: + clc /* clear cy */ ALIGN(4) /* minimal alignment for claimed speed */ -.Loop: movq (%rsi,%rcx,8), %rax - movq (%rdx,%rcx,8), %r10 - adcq %r10, %rax - movq %rax, (%rdi,%rcx,8) - incq %rcx +.Loop: leaq -3(%rcx), %rcx + NEXT_ADD(0); + NEXT_ADD(8); + NEXT_ADD(16); + NEXT_ADD(24); + leaq 32(%rsi), %rsi + leaq 32(%rdx), %rdx + leaq 32(%rdi), %rdi + decq %rcx jne .Loop - movq %rcx, %rax /* zero %rax */ - adcq %rax, %rax + ALIGN(2) +.Lend: + movl $0, %eax /* zero %rax */ + adcl %eax, %eax FUNC_EXIT() diff --git a/mpi/amd64/mpih-sub1.S b/mpi/amd64/mpih-sub1.S index 8c61cb20..32799c86 100644 --- a/mpi/amd64/mpih-sub1.S +++ b/mpi/amd64/mpih-sub1.S @@ -3,6 +3,7 @@ * * Copyright (C) 1992, 1994, 1995, 1998, * 2001, 2002, 2006 Free Software Foundation, Inc. + * Copyright (C) 2023 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -44,20 +45,75 @@ .globl C_SYMBOL_NAME(_gcry_mpih_sub_n) C_SYMBOL_NAME(_gcry_mpih_sub_n:) FUNC_ENTRY() - leaq (%rsi,%rcx,8), %rsi - leaq (%rdi,%rcx,8), %rdi - leaq (%rdx,%rcx,8), %rdx - negq %rcx - xorl %eax, %eax /* clear cy */ + movl %ecx, %r9d + andl $3, %r9d + je .Lprehandle0 + cmpl $2, %r9d + jb .Lprehandle1 + je .Lprehandle2 + +#define FIRST_SUB() \ + movq (%rsi), %rax; \ + subq (%rdx), %rax; \ + movq %rax, (%rdi) + +#define NEXT_SUB(offset) \ + movq offset(%rsi), %rax; \ + sbbq offset(%rdx), %rax; \ + movq %rax, offset(%rdi) + +.Lprehandle3: + leaq -2(%rcx), %rcx + FIRST_SUB(); + NEXT_SUB(8); + NEXT_SUB(16); + decq %rcx + je .Lend + leaq 24(%rsi), %rsi + leaq 24(%rdx), %rdx + leaq 24(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle2: + leaq -1(%rcx), %rcx + FIRST_SUB(); + NEXT_SUB(8); + decq %rcx + je .Lend + leaq 16(%rsi), %rsi + leaq 16(%rdx), %rdx + leaq 16(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle1: + FIRST_SUB(); + decq %rcx + je .Lend + leaq 8(%rsi), %rsi + leaq 8(%rdx), %rdx + leaq 8(%rdi), %rdi + jmp .Loop + + ALIGN(3) +.Lprehandle0: + clc /* clear cy */ ALIGN(4) /* minimal alignment for claimed speed */ -.Loop: movq (%rsi,%rcx,8), %rax - movq (%rdx,%rcx,8), %r10 - sbbq %r10, %rax - movq %rax, (%rdi,%rcx,8) - incq %rcx +.Loop: leaq -3(%rcx), %rcx + NEXT_SUB(0); + NEXT_SUB(8); + NEXT_SUB(16); + NEXT_SUB(24); + leaq 32(%rsi), %rsi + leaq 32(%rdx), %rdx + leaq 32(%rdi), %rdi + decq %rcx jne .Loop - movq %rcx, %rax /* zero %rax */ - adcq %rax, %rax + ALIGN(2) +.Lend: + movl $0, %eax /* zero %rax */ + adcl %eax, %eax FUNC_EXIT() -- 2.39.2 From jussi.kivilinna at iki.fi Sat Apr 22 09:35:39 2023 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 22 Apr 2023 10:35:39 +0300 Subject: [PATCH 5/5] mpi: optimize mpi_rshift and mpi_lshift to avoid extra MPI copying In-Reply-To: <20230422073539.894723-1-jussi.kivilinna@iki.fi> References: <20230422073539.894723-1-jussi.kivilinna@iki.fi> Message-ID: 
<20230422073539.894723-5-jussi.kivilinna@iki.fi> * mpi/mpi-bit.c (_gcry_mpi_rshift): Refactor so that _gcry_mpih_rshift is used to do the copying along with shifting when copying is needed and refactor so that same code-path is used for both in-place and copying operation. (_gcry_mpi_lshift): Refactor so that _gcry_mpih_lshift is used to do the copying along with shifting when copying is needed and refactor so that same code-path is used for both in-place and copying operation. -- Benchmark on AMD Ryzen 9 7900X: Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz rshift3 | 0.039 ns/B 24662 MiB/s 0.182 c/B 4700 lshift3 | 0.108 ns/B 8832 MiB/s 0.508 c/B 4700 rshift65 | 0.137 ns/B 6968 MiB/s 0.643 c/B 4700 lshift65 | 0.109 ns/B 8776 MiB/s 0.511 c/B 4700 After: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz rshift3 | 0.038 ns/B 25049 MiB/s 0.179 c/B 4700 lshift3 | 0.039 ns/B 24709 MiB/s 0.181 c/B 4700 rshift65 | 0.038 ns/B 24942 MiB/s 0.180 c/B 4700 lshift65 | 0.040 ns/B 23671 MiB/s 0.189 c/B 4700 Signed-off-by: Jussi Kivilinna --- mpi/mpi-bit.c | 138 +++++++++++++++++++------------------------------- 1 file changed, 51 insertions(+), 87 deletions(-) diff --git a/mpi/mpi-bit.c b/mpi/mpi-bit.c index e2170401..7313a9d4 100644 --- a/mpi/mpi-bit.c +++ b/mpi/mpi-bit.c @@ -251,10 +251,11 @@ _gcry_mpi_rshift_limbs( gcry_mpi_t a, unsigned int count ) void _gcry_mpi_rshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) { - mpi_size_t xsize; - unsigned int i; unsigned int nlimbs = (n/BITS_PER_MPI_LIMB); unsigned int nbits = (n%BITS_PER_MPI_LIMB); + unsigned int i; + mpi_size_t alimbs; + mpi_ptr_t xp, ap; if (mpi_is_immutable (x)) { @@ -262,75 +263,42 @@ _gcry_mpi_rshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) return; } - if ( x == a ) - { - /* In-place operation. 
*/ - if ( nlimbs >= x->nlimbs ) - { - x->nlimbs = 0; - return; - } + alimbs = a->nlimbs; - if (nlimbs) - { - for (i=0; i < x->nlimbs - nlimbs; i++ ) - x->d[i] = x->d[i+nlimbs]; - x->d[i] = 0; - x->nlimbs -= nlimbs; - - } - if ( x->nlimbs && nbits ) - _gcry_mpih_rshift ( x->d, x->d, x->nlimbs, nbits ); - } - else if ( nlimbs ) + if (x != a) { - /* Copy and shift by more or equal bits than in a limb. */ - xsize = a->nlimbs; + RESIZE_IF_NEEDED (x, alimbs); + x->nlimbs = alimbs; + x->flags = a->flags; x->sign = a->sign; - RESIZE_IF_NEEDED (x, xsize); - x->nlimbs = xsize; - for (i=0; i < a->nlimbs; i++ ) - x->d[i] = a->d[i]; - x->nlimbs = i; - - if ( nlimbs >= x->nlimbs ) - { - x->nlimbs = 0; - return; - } + } + + /* In-place operation. */ + if (nlimbs >= alimbs) + { + x->nlimbs = 0; + return; + } + + xp = x->d; + ap = a->d; + if (alimbs && nbits) + { + _gcry_mpih_rshift (xp, ap + nlimbs, alimbs - nlimbs, nbits); if (nlimbs) - { - for (i=0; i < x->nlimbs - nlimbs; i++ ) - x->d[i] = x->d[i+nlimbs]; - x->d[i] = 0; - x->nlimbs -= nlimbs; - } - - if ( x->nlimbs && nbits ) - _gcry_mpih_rshift ( x->d, x->d, x->nlimbs, nbits ); + xp[alimbs - nlimbs] = 0; + x->nlimbs -= nlimbs; } - else + else if (nlimbs || (x != a)) { - /* Copy and shift by less than bits in a limb. */ - xsize = a->nlimbs; - x->sign = a->sign; - RESIZE_IF_NEEDED (x, xsize); - x->nlimbs = xsize; - - if ( xsize ) - { - if (nbits ) - _gcry_mpih_rshift (x->d, a->d, x->nlimbs, nbits ); - else - { - /* The rshift helper function is not specified for - NBITS==0, thus we do a plain copy here. 
*/ - for (i=0; i < x->nlimbs; i++ ) - x->d[i] = a->d[i]; - } - } + for (i = 0; i < alimbs - nlimbs; i++ ) + xp[i] = ap[i + nlimbs]; + if (nlimbs) + xp[i] = 0; + x->nlimbs -= nlimbs; } + MPN_NORMALIZE (x->d, x->nlimbs); } @@ -368,6 +336,9 @@ _gcry_mpi_lshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) { unsigned int nlimbs = (n/BITS_PER_MPI_LIMB); unsigned int nbits = (n%BITS_PER_MPI_LIMB); + mpi_size_t alimbs; + mpi_ptr_t xp, ap; + int i; if (mpi_is_immutable (x)) { @@ -378,34 +349,27 @@ _gcry_mpi_lshift ( gcry_mpi_t x, gcry_mpi_t a, unsigned int n ) if (x == a && !n) return; /* In-place shift with an amount of zero. */ - if ( x != a ) - { - /* Copy A to X. */ - unsigned int alimbs = a->nlimbs; - int asign = a->sign; - mpi_ptr_t xp, ap; - - RESIZE_IF_NEEDED (x, alimbs+nlimbs+1); - xp = x->d; - ap = a->d; - MPN_COPY (xp, ap, alimbs); - x->nlimbs = alimbs; - x->flags = a->flags; - x->sign = asign; - } + /* Note: might be in-place operation, so a==x or a!=x. */ + + alimbs = a->nlimbs; - if (nlimbs && !nbits) + RESIZE_IF_NEEDED (x, alimbs + nlimbs + 1); + xp = x->d; + ap = a->d; + if (nbits && alimbs) { - /* Shift a full number of limbs. */ - _gcry_mpi_lshift_limbs (x, nlimbs); + x->nlimbs = alimbs + nlimbs + 1; + xp[alimbs + nlimbs] = _gcry_mpih_lshift (xp + nlimbs, ap, alimbs, nbits); } - else if (n) + else { - /* We use a very dump approach: Shift left by the number of - limbs plus one and than fix it up by an rshift. */ - _gcry_mpi_lshift_limbs (x, nlimbs+1); - mpi_rshift (x, x, BITS_PER_MPI_LIMB - nbits); + x->nlimbs = alimbs + nlimbs; + for (i = alimbs - 1; i >= 0; i--) + xp[i + nlimbs] = ap[i]; } - + for (i = 0; i < nlimbs; i++) + xp[i] = 0; + x->flags = a->flags; + x->sign = a->sign; MPN_NORMALIZE (x->d, x->nlimbs); } -- 2.39.2