From gniibe at fsij.org Tue Feb 1 03:52:40 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 01 Feb 2022 11:52:40 +0900 Subject: PATCH random/rndgetentropy.c: fix build failure on macOS In-Reply-To: <20220124173855.GD23126@irregular-apocalypse.k.bsd.de> References: <20220124173855.GD23126@irregular-apocalypse.k.bsd.de> Message-ID: <8735l3s3gn.fsf@akagi.fsij.org> Hello, Christoph Badura wrote: > libgcrypt fails in rndgetentropy.c because the prototype for getentropy() > is missing. The prototype is provided by sys/random.h per the man pages. Thank you for your report. Using HAVE_SYS_RANDOM_H, fixed in the commit of: 3d353782d84b9720262d7b05adfae3aef7ff843b -- From jussi.kivilinna at iki.fi Tue Feb 1 19:07:32 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 1 Feb 2022 20:07:32 +0200 Subject: [PATCH] hwf-arm: add detection of ARMv8 crypto extension by toolchain config Message-ID: <20220201180732.4157374-1-jussi.kivilinna@iki.fi> * src/hwf-arm.c (detect_arm_hwf_by_toolchain): New. (_gcry_hwf_detect_arm): Move __ARM_NEON check to 'detect_arm_hwf_by_toolchain' and add call to the new function. -- This allows use of HW accelerated AES/SHA1/SHA256/GCM implementations on macOS/aarch64 target. Signed-off-by: Jussi Kivilinna --- src/hwf-arm.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 66 insertions(+), 3 deletions(-) diff --git a/src/hwf-arm.c b/src/hwf-arm.c index 41188583..60107f36 100644 --- a/src/hwf-arm.c +++ b/src/hwf-arm.c @@ -369,6 +369,71 @@ detect_arm_proc_cpuinfo(unsigned int *broken_hwfs) #endif /* __linux__ */ +static unsigned int +detect_arm_hwf_by_toolchain (void) +{ + unsigned int ret = 0; + + /* Detect CPU features required by toolchain. + * This allows detection of ARMv8 crypto extension support, + * for example, on macOS/aarch64. + */ + +#if __GNUC__ >= 4 + +#if defined(__ARM_NEON) && defined(ENABLE_NEON_SUPPORT) + ret |= HWF_ARM_NEON; + +#ifdef HAVE_GCC_INLINE_ASM_NEON + /* Early test for NEON instruction to detect faulty toolchain + * configuration. */ + asm volatile ("veor q15, q15, q15":::"q15"); +#endif + +#ifdef HAVE_GCC_INLINE_ASM_AARCH64_NEON + /* Early test for NEON instruction to detect faulty toolchain + * configuration. */ + asm volatile ("eor v31.16b, v31.16b, v31.16b":::"v31"); +#endif + +#endif /* __ARM_NEON */ + +#if defined(__ARM_FEATURE_CRYPTO) + /* ARMv8 crypto extensions include support for PMULL, AES, SHA1 and SHA2 + * instructions. */ + ret |= HWF_ARM_PMULL; + ret |= HWF_ARM_AES; + ret |= HWF_ARM_SHA1; + ret |= HWF_ARM_SHA2; + +#ifdef HAVE_GCC_INLINE_ASM_AARCH32_CRYPTO + /* Early test for CE instructions to detect faulty toolchain + * configuration. */ + asm volatile ("vmull.p64 q0, d0, d0;\n\t" + "aesimc.8 q7, q0;\n\t" + "sha1su1.32 q0, q0;\n\t" + "sha256su1.32 q0, q7, q15;\n\t" + ::: + "q0", "q7", "q15"); +#endif + +#ifdef HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO + /* Early test for CE instructions to detect faulty toolchain + * configuration. 
*/ + asm volatile ("pmull2 v0.1q, v0.2d, v31.2d;\n\t" + "aesimc v15.16b, v0.16b;\n\t" + "sha1su1 v0.4s, v0.4s;\n\t" + "sha256su1 v0.4s, v15.4s, v31.4s;\n\t" + ::: + "v0", "v15", "v31"); +#endif +#endif + +#endif + + return ret; +} + unsigned int _gcry_hwf_detect_arm (void) { @@ -383,9 +448,7 @@ _gcry_hwf_detect_arm (void) ret |= detect_arm_proc_cpuinfo (&broken_hwfs); #endif -#if defined(__ARM_NEON) && defined(ENABLE_NEON_SUPPORT) - ret |= HWF_ARM_NEON; -#endif + ret |= detect_arm_hwf_by_toolchain (); ret &= ~broken_hwfs; -- 2.32.0 From heirecka at exherbo.org Thu Feb 3 23:46:41 2022 From: heirecka at exherbo.org (heirecka at exherbo.org) Date: Thu, 3 Feb 2022 22:46:41 +0000 Subject: [PATCH] jitterentropy: Include and Message-ID: <20220203224641.45468-1-heirecka@exherbo.org> From: Heiko Becker * random/jitterentropy-base-user.h: Include for O_RDONLY * random/jitterentropy-base-user.h: Include for LONG_MAX -- Fixes the build with musl libc. Signed-off-by: Heiko Becker --- random/jitterentropy-base-user.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/random/jitterentropy-base-user.h b/random/jitterentropy-base-user.h index 326dfbed..389106ff 100644 --- a/random/jitterentropy-base-user.h +++ b/random/jitterentropy-base-user.h @@ -39,6 +39,9 @@ * DAMAGE. */ +#include +#include + #ifndef GCRYPT_JITTERENTROPY_BASE_USER_H #define GCRYPT_JITTERENTROPY_BASE_USER_H -- 2.35.1 From cllang at redhat.com Thu Feb 10 14:38:43 2022 From: cllang at redhat.com (Clemens Lang) Date: Thu, 10 Feb 2022 14:38:43 +0100 Subject: DCO Message-ID: <20220210133844.46581-1-cllang@redhat.com> Hi, since the following patch is my first contribution, please find my signed DCO below. The corresponding key is available from keys.openpgp.org [1]. -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Libgcrypt Developer's Certificate of Origin. Version 1.0 ========================================================= By making a contribution to the Libgcrypt project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the free software license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate free software license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same free software license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the free software license(s) involved. 
Signed-off-by: Clemens Lang -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEE62B89mhY4EukQLrg4mytd0xva5kFAmIFAswACgkQ4mytd0xv a5lH6g/+OCip+4Y+cx5jBXPX2su3jICsIqB5njpBeckQbAZzU96mCuqbRur2uO+l Hx9j9e5RdGy6hPMSGHQ0guYHZgAbl1Ktr3Dg8XjAQiNjQSAQFpQdUPOIi7+G69xh +Yg6vPB1pZAi20kBIOspMwtax8w5Jm4QHYP8CtufIkkEIDTCy+AklpPcLZzTkzRs tBkUNLfaUwpxL5lVPLof0HyWonqEzXkF03ofiMcBFyDwx6OY4afG7Ch2Cqiwua4R YdJULf7Ap8B/cvO8JD56sJHrPzVtFhW2jUPgmbDdyCPh+A5XKKHBvn1N9+P8+jcQ Af6xEVWTRwIDQQAiOrYDjvKy1aHVA/gE/S8tvdHV6Irn2ADOE+MLw+Tznp++jjfo mAnXu2W2DTf+aX6urzvIjmeLTST8OBfwaHmSAMC5BZcpr1kykyraNGupFo956Foc 6NMPfj92H3Km4QDQPlEug9OjH85FmtqrOPu1m3fds5pEgJFw7DBDzoq8aMGPtNfb wuQRPKKzmB4eEa9Pq52bwRVqtwor6Q+poOPRdE4LdbmsbOFjuprk6oxh8kIYRnOB a94NAxjyCrRQOpj+/aQtFR3A/wgB807Rw/OuQPl5uBWX7muiGWzhvUSmnKR6fX1g +EXrCHV/Fdb3t20/wj3Ic90juSXoSZAbZsCtsOAFPgYLJrYUTU4= =j/+h -----END PGP SIGNATURE----- [1]: https://keys.openpgp.org/search?q=cllang%40redhat.com From cllang at redhat.com Thu Feb 10 14:38:44 2022 From: cllang at redhat.com (Clemens Lang) Date: Thu, 10 Feb 2022 14:38:44 +0100 Subject: [PATCH] tests: Fix undefined reference to 'pthread_create' In-Reply-To: <20220210133844.46581-1-cllang@redhat.com> References: <20220210133844.46581-1-cllang@redhat.com> Message-ID: <20220210133844.46581-2-cllang@redhat.com> * configure.ac (HAVE_PTHREAD): Expose as AM_CONDITIONAL for use in Makefile.am * tests/Makefile.am: Link against pthread for tests that use it -- Compilation on CentOS 8 Stream failed without this. Signed-off-by: Clemens Lang --- configure.ac | 1 + tests/Makefile.am | 5 +++++ 2 files changed, 6 insertions(+) diff --git a/configure.ac b/configure.ac index a9350c9c..9034194a 100644 --- a/configure.ac +++ b/configure.ac @@ -812,6 +812,7 @@ if test "$have_w32_system" != yes; then AC_DEFINE(HAVE_PTHREAD, 1 ,[Define if we have pthread.]) fi fi +AM_CONDITIONAL(HAVE_PTHREAD, test "$have_w32_system" != yes && test "$have_pthread" = yes) # Solaris needs -lsocket and -lnsl. Unisys system includes diff --git a/tests/Makefile.am b/tests/Makefile.am index e6953fd3..a4ad267f 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -98,6 +98,11 @@ t_kdf_LDADD = $(standard_ldadd) $(GPG_ERROR_MT_LIBS) @LDADD_FOR_TESTS_KLUDGE@ t_kdf_CFLAGS = $(GPG_ERROR_MT_CFLAGS) endif +if HAVE_PTHREAD +t_lock_LDADD += -lpthread +t_kdf_LDADD += -lpthread +endif + # xcheck uses our new testdrv instead of the automake test runner. .PHONY: xcheck xtestsuite xcheck: testdrv$(EXEEXT) -- 2.34.1 From cllang at redhat.com Thu Feb 10 17:46:56 2022 From: cllang at redhat.com (Clemens Lang) Date: Thu, 10 Feb 2022 17:46:56 +0100 Subject: 'make check' with --enable-hmac-binary-check and GNU gold Message-ID: <20220210164712.126672-1-cllang@redhat.com> Hi, in my testing, I've tried to make ./configure --enable-hmac-binary-check make LIBGCRYPT_FORCE_FIPS_MODE=1 make check work out of the box, but for some reason, this failed with an HMAC mismatch. I've taken inspiration from what Jakub did downstream [1] to compute the HMAC from a copy of the library with the .rodata1 section set to 32 0-bytes and ported that into src/Makefile.am, which fixed the test execution. Further investigation shows that any use of objcopy on the libgcrypt library before HMAC calculation fixes the mismatch, even invocations that are not supposed to modify the binary, such as objcopy --dump-section .rodata1=/dev/stdout src/.libs/libgcrypt.so This happens because objcopy drops entries from the .strtab section of the binary, presumably because it removes unused values. 
Since this modifies the binary, the computed checksum is no longer valid. Adding the additional objcopy invocation ensures that this cleanup has happened before the checksum is computed, such that the only modification to the library afterwards is in the .rodata1 section, which we expect and handle during the selftest. This only happens when GNU gold is used as linker, specifically GNU gold (version 2.37-10.fc35) 1.16. GNU ld does not show the same problem. [1]: https://gitlab.com/redhat/centos-stream/rpms/libgcrypt/-/blob/c9s/libgcrypt.spec#L92-95 From cllang at redhat.com Thu Feb 10 17:46:57 2022 From: cllang at redhat.com (Clemens Lang) Date: Thu, 10 Feb 2022 17:46:57 +0100 Subject: [PATCH] build: Fix 'make check' with HMAC check In-Reply-To: <20220210164712.126672-1-cllang@redhat.com> References: <20220210164712.126672-1-cllang@redhat.com> Message-ID: <20220210164712.126672-2-cllang@redhat.com> * src/Makefile.am: Generate HMAC from a copy of libgcrypt.so with rodata1 set to zero. This fixes test execution after configuring with --enable-hmac-binary-check with LIBGCRYPT_FORCE_FIPS_MODE=1 in the environment. -- Signed-off-by: Clemens Lang --- src/Makefile.am | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/Makefile.am b/src/Makefile.am index 018d5761..b0a196a3 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -149,7 +149,11 @@ libgcrypt.la.done: libgcrypt.so.hmac @touch libgcrypt.la.done libgcrypt.so.hmac: hmac256 libgcrypt.la - ./hmac256 --stdkey --binary < .libs/libgcrypt.so > $@ + dd if=/dev/zero of=libgcrypt.so.hmac.empty bs=32 count=1 + $(OBJCOPY) --update-section .rodata1=libgcrypt.so.hmac.empty \ + .libs/libgcrypt.so .libs/libgcrypt.so.empty-hmac + ./hmac256 --stdkey --binary .libs/libgcrypt.so.empty-hmac > $@ + $(RM) libgcrypt.so.hmac.empty .libs/libgcrypt.so.empty-hmac else !USE_HMAC_BINARY_CHECK libgcrypt.la.done: libgcrypt.la @touch libgcrypt.la.done -- 2.34.1 From cllang at redhat.com Fri Feb 11 16:55:02 2022 From: cllang at redhat.com (Clemens Lang) Date: Fri, 11 Feb 2022 16:55:02 +0100 Subject: [PATCH] hmac: Fix memory leak Message-ID: <20220211155502.86403-1-cllang@redhat.com> * src/hmac.c: Release HMAC256 context -- LeakSanitizer marks the allocation of this context as leaked. Since the hmac binary is used during the build with --enable-hmac-binary-check, this fails the build with AddressSanitizer/LeakSanitizer. 
Signed-off-by: Clemens Lang --- src/hmac256.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/hmac256.c b/src/hmac256.c index bd089b79..899e6d15 100644 --- a/src/hmac256.c +++ b/src/hmac256.c @@ -780,6 +780,7 @@ main (int argc, char **argv) pgm, strerror (errno)); exit (1); } + _gcry_hmac256_release (hd); if (use_stdin) break; } -- 2.34.1 From cllang at redhat.com Fri Feb 11 16:55:24 2022 From: cllang at redhat.com (Clemens Lang) Date: Fri, 11 Feb 2022 16:55:24 +0100 Subject: [PATCH] fips: Fix memory leaks in FIPS mode Message-ID: <20220211155524.86434-1-cllang@redhat.com> * cipher/pubkey.c (_gcry_pk_sign_md): Fix memory leak in FIPS mode when used with SHA1 * tests/basic.c (check_one_cipher_core): Add missing free in error code triggered in FIPS mode * tests/dsa-rfc6979.c (check_dsa_rfc6979): Likewise * tests/pubkey.c (check_x931_derived_key): Likewise -- Signed-off-by: Clemens Lang --- cipher/pubkey.c | 5 ++++- tests/basic.c | 1 + tests/dsa-rfc6979.c | 2 ++ tests/pubkey.c | 1 + 4 files changed, 8 insertions(+), 1 deletion(-) diff --git a/cipher/pubkey.c b/cipher/pubkey.c index 7fdb7771..8deeced6 100644 --- a/cipher/pubkey.c +++ b/cipher/pubkey.c @@ -516,7 +516,10 @@ _gcry_pk_sign_md (gcry_sexp_t *r_sig, const char *tmpl, gcry_md_hd_t hd_orig, algo = _gcry_md_get_algo (hd); if (fips_mode () && algo == GCRY_MD_SHA1) - return GPG_ERR_DIGEST_ALGO; + { + _gcry_md_close (hd); + return GPG_ERR_DIGEST_ALGO; + } digest = _gcry_md_read (hd, 0); } diff --git a/tests/basic.c b/tests/basic.c index 32be7c2f..a0ad33eb 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -11047,6 +11047,7 @@ check_one_cipher_core (int algo, int mode, int flags, if (!err) fail ("pass %d, algo %d, mode %d, gcry_cipher_encrypt is expected to " "fail in FIPS mode: %s\n", pass, algo, mode, gpg_strerror (err)); + gcry_cipher_close (hd); goto err_out_free; } if (err) diff --git a/tests/dsa-rfc6979.c b/tests/dsa-rfc6979.c index cd68cd25..79b25c3d 100644 --- a/tests/dsa-rfc6979.c +++ b/tests/dsa-rfc6979.c @@ -943,6 +943,8 @@ check_dsa_rfc6979 (void) { if (!err) fail ("signing should not work in FIPS mode: %s\n", gpg_strerror (err)); + gcry_sexp_release (data); + gcry_sexp_release (seckey); continue; } if (err) diff --git a/tests/pubkey.c b/tests/pubkey.c index c5510d05..b352490b 100644 --- a/tests/pubkey.c +++ b/tests/pubkey.c @@ -1035,6 +1035,7 @@ check_x931_derived_key (int what) if (in_fips_mode && nbits < 2048) { info("RSA key test with %d bits skipped in fips mode\n", nbits); + gcry_sexp_release (key_spec); goto leave; } } -- 2.34.1 From cllang at redhat.com Fri Feb 11 16:57:22 2022 From: cllang at redhat.com (Clemens Lang) Date: Fri, 11 Feb 2022 16:57:22 +0100 Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section Message-ID: <20220211155723.86516-1-cllang@redhat.com> * src/fips.c [ENABLE_HMAC_BINARY_CHECK] (hmac256_check): Use ELF headers to locate the binary section with the HMAC rather than information from the loader -- The previous method of locating the offset of the .rodata1 section in the ELF file on disk used information obtained from the loader. This computed the address of the section at runtime (sh_addr in the ElfN_Shdr struct), but the offset in the file can be different (sh_offset from the ElfN_Shdr struct). So far, GCC did include .rodata1 close enough to the beginning of the section table so that sh_addr and sh_offset were the same. 
If there were short sections followed by sections with larger alignment requirements in the binary, this would have caused sh_addr and sh_offset to differ, and thus the selfcheck to fail. To allow clang to pass the FIPS selftest, a follow-up commit will mark the HMAC as volatile. This causes GCC to move the section closer to the end of the section table, where sh_addr and sh_offset differ. Switch to determining the location of the .rodata1 section from the ELF headers in the file on disk to be robust against future section re-ordering changes in compilers. Signed-off-by: Clemens Lang --- README | 3 +- src/fips.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 146 insertions(+), 10 deletions(-) diff --git a/README b/README index 3b465c1b..4d7697dd 100644 --- a/README +++ b/README @@ -157,7 +157,8 @@ --enable-hmac-binary-check Include support to check the binary at runtime against a HMAC checksum. This works only in FIPS - mode and on systems providing the dladdr function. + mode on systems providing the dladdr function and using + the ELF binary format. --with-fips-module-version=version Specify a string used as a module version for FIPS diff --git a/src/fips.c b/src/fips.c index 391b94f1..193af36b 100644 --- a/src/fips.c +++ b/src/fips.c @@ -24,8 +24,9 @@ #include #include #ifdef ENABLE_HMAC_BINARY_CHECK +# include # include -# include +# include #endif #ifdef HAVE_SYSLOG # include @@ -594,23 +595,159 @@ run_random_selftests (void) static const unsigned char __attribute__ ((section (".rodata1"))) hmac_for_the_implementation[HMAC_LEN]; +/** + * Obtain the ElfN_Shdr.sh_offset value for the section with the given name in + * the ELF file opened as fp and return it in offset. Rewinds fp to the + * beginning on success. + */ static gpg_error_t -hmac256_check (const char *filename, const char *key, struct link_map *lm) +get_section_offset (FILE *fp, const char *section, unsigned long *offset) +{ + unsigned char e_ident[EI_NIDENT]; +#if __WORDSIZE == 64 + Elf64_Ehdr ehdr; + Elf64_Shdr shdr; +#define ELFCLASS_NATIVE ELFCLASS64 +#else + Elf32_Ehdr ehdr; + Elf32_Shdr shdr; +#define ELFCLASS_NATIVE ELFCLASS32 +#endif + char *shstrtab; + uint16_t e_shnum; + uint16_t e_shstrndx; + uint16_t e_shidx; + + // verify binary word size + if (1 != fread (e_ident, EI_NIDENT, 1, fp)) + return gpg_error_from_syserror (); + + if (ELFCLASS_NATIVE != e_ident[EI_CLASS]) + return gpg_error (GPG_ERR_INV_OBJ); + + // read the ELF header + if (0 != fseek (fp, 0, SEEK_SET)) + return gpg_error_from_syserror (); + if (1 != fread (&ehdr, sizeof (ehdr), 1, fp)) + return gpg_error_from_syserror (); + + // the section header entry size should match the size of the shdr struct + if (ehdr.e_shentsize != sizeof (shdr)) + return gpg_error (GPG_ERR_INV_OBJ); + + /* elf(5): "If the file has no section name string table, this member holds + * the value SHN_UNDEF." Without a section name string table, we can not + * locate a named section, so error out. 
*/ + if (ehdr.e_shstrndx == SHN_UNDEF) + return gpg_error (GPG_ERR_INV_OBJ); + + // jump to and read the first section header + if (0 != fseek (fp, ehdr.e_shoff, SEEK_SET)) + return gpg_error_from_syserror (); + if (1 != fread (&shdr, sizeof (shdr), 1, fp)) + return gpg_error_from_syserror (); + + /* Number of entries in the section header table + * + * If the number of entries in the section header table is larger than or + * equal to SHN_LORESERVE, e_shnum holds the value zero and the real number + * of entries in the section header table is held in the sh_size member of + * the initial entry in section header table. + */ + e_shnum = ehdr.e_shnum == 0 ? shdr.sh_size : ehdr.e_shnum; + /* Section header table index of the section name string table + * + * If the index of section name string table section is larger than or equal + * to SHN_LORESERVE, this member holds SHN_XINDEX and the real index of the + * section name string table section is held in the sh_link member of the + * initial entry in section header table. + */ + e_shstrndx = ehdr.e_shstrndx == SHN_XINDEX ? shdr.sh_link : ehdr.e_shstrndx; + + // seek to the section header for the section name string table and read it + if (0 != fseek (fp, ehdr.e_shoff + e_shstrndx * sizeof (shdr), SEEK_SET)) + return gpg_error_from_syserror (); + if (1 != fread (&shdr, sizeof (shdr), 1, fp)) + return gpg_error_from_syserror (); + + // this is the section name string table and should have a type of SHT_STRTAB + if (shdr.sh_type != SHT_STRTAB) + return gpg_error (GPG_ERR_INV_OBJ); + + // read the string table + shstrtab = xtrymalloc (shdr.sh_size); + if (!shstrtab) + return gpg_error_from_syserror (); + + if (0 != fseek (fp, shdr.sh_offset, SEEK_SET)) + { + gpg_error_t err = gpg_error_from_syserror (); + xfree (shstrtab); + return err; + } + if (1 != fread (shstrtab, shdr.sh_size, 1, fp)) + { + gpg_error_t err = gpg_error_from_syserror (); + xfree (shstrtab); + return err; + } + + // iterate over the sections, compare their names, and if the section name + // matches the expected name, return the offset. We already read section 0, + // which is always an empty section. 
+ if (0 != fseek (fp, ehdr.e_shoff + 1 * sizeof (shdr), SEEK_SET)) + { + gpg_error_t err = gpg_error_from_syserror (); + xfree (shstrtab); + return err; + } + for (e_shidx = 1; e_shidx < e_shnum; e_shidx++) + { + if (1 != fread (&shdr, sizeof (shdr), 1, fp)) + { + gpg_error_t err = gpg_error_from_syserror (); + xfree (shstrtab); + return err; + } + if (0 == strcmp (shstrtab + shdr.sh_name, section)) + { + // found section, return the offset + *offset = shdr.sh_offset; + xfree (shstrtab); + + if (0 != fseek (fp, 0, SEEK_SET)) + return gpg_error_from_syserror (); + return 0; + } + } + + // section not found in the file + xfree (shstrtab); + return gpg_error (GPG_ERR_INV_OBJ); +} + +static gpg_error_t +hmac256_check (const char *filename, const char *key) { gpg_error_t err; FILE *fp; gcry_md_hd_t hd; size_t buffer_size, nread; char *buffer; - unsigned long paddr; + unsigned long paddr = 0; unsigned long off = 0; - paddr = (unsigned long)hmac_for_the_implementation - lm->l_addr; - fp = fopen (filename, "rb"); if (!fp) return gpg_error (GPG_ERR_INV_OBJ); + err = get_section_offset (fp, ".rodata1", &paddr); + if (err) + { + fclose (fp); + return err; + } + err = _gcry_md_open (&hd, GCRY_MD_SHA256, GCRY_MD_FLAG_HMAC); if (err) { @@ -692,13 +829,11 @@ check_binary_integrity (void) gpg_error_t err; Dl_info info; const char *key = KEY_FOR_BINARY_CHECK; - void *extra_info; - if (!dladdr1 (hmac_for_the_implementation, - &info, &extra_info, RTLD_DL_LINKMAP)) + if (!dladdr (hmac_for_the_implementation, &info)) err = gpg_error_from_syserror (); else - err = hmac256_check (info.dli_fname, key, extra_info); + err = hmac256_check (info.dli_fname, key); reporter ("binary", 0, NULL, err? gpg_strerror (err):NULL); #ifdef HAVE_SYSLOG -- 2.34.1 From cllang at redhat.com Fri Feb 11 16:57:23 2022 From: cllang at redhat.com (Clemens Lang) Date: Fri, 11 Feb 2022 16:57:23 +0100 Subject: [PATCH 2/2] fips: Fix self-check compiled with clang In-Reply-To: <20220211155723.86516-1-cllang@redhat.com> References: <20220211155723.86516-1-cllang@redhat.com> Message-ID: <20220211155723.86516-2-cllang@redhat.com> * src/fips.c [ENABLE_HMAC_BINARY_CHECK] (hmac256_check): Prevent constant propagation of a 0 HMAC value. -- clang 13 assumes that the static const unsigned char[32] hmac_for_the_implementation is zero, propagates this constant into its use in hmac256_check and replaces with invocation of memcmp(3) with assembly instructions that compare the computed digest with 0. clang is able to make this assumption, because a 0-initialized static const variable should never change its value, but this assumption is invalid as soon as objcpy(1) changes the HMAC in the binary. Add a volatile specifier to prevent this optimization. Note that this requires casting away the volatile keyword in some invocations where function signature does not support it. Signed-off-by: Clemens Lang --- src/fips.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/src/fips.c b/src/fips.c index 193af36b..fabc5158 100644 --- a/src/fips.c +++ b/src/fips.c @@ -592,7 +592,10 @@ run_random_selftests (void) # endif #define HMAC_LEN 32 -static const unsigned char __attribute__ ((section (".rodata1"))) +/* Compilers can and will constant-propagate this as 0 when reading if it is + * not declared volatile. Since this value will be changed using objcopy(1) + * after compilation, this can cause the HMAC verification to fail. 
*/ +static const volatile unsigned char __attribute__ ((section (".rodata1"))) hmac_for_the_implementation[HMAC_LEN]; /** @@ -805,10 +808,16 @@ hmac256_check (const char *filename, const char *key) err = gpg_error (GPG_ERR_INV_HANDLE); else { + unsigned char hmac[HMAC_LEN]; unsigned char *digest; + size_t idx; + + // memcpy(3) does not accept volatile pointers + for (idx = 0; idx < HMAC_LEN; idx++) + hmac[idx] = hmac_for_the_implementation[idx]; digest = _gcry_md_read (hd, 0); - if (!memcmp (digest, hmac_for_the_implementation, HMAC_LEN)) + if (!memcmp (digest, hmac, HMAC_LEN)) /* Success. */ err = 0; else @@ -830,7 +839,7 @@ check_binary_integrity (void) Dl_info info; const char *key = KEY_FOR_BINARY_CHECK; - if (!dladdr (hmac_for_the_implementation, &info)) + if (!dladdr ((const char *)hmac_for_the_implementation, &info)) err = gpg_error_from_syserror (); else err = hmac256_check (info.dli_fname, key); -- 2.34.1 From fweimer at redhat.com Fri Feb 11 17:09:50 2022 From: fweimer at redhat.com (Florian Weimer) Date: Fri, 11 Feb 2022 17:09:50 +0100 Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section In-Reply-To: <20220211155723.86516-1-cllang@redhat.com> (Clemens Lang via Gcrypt-devel's message of "Fri, 11 Feb 2022 16:57:22 +0100") References: <20220211155723.86516-1-cllang@redhat.com> Message-ID: <87pmnttmep.fsf@oldenburg.str.redhat.com> * Clemens Lang via Gcrypt-devel: > diff --git a/src/fips.c b/src/fips.c > index 193af36b..fabc5158 100644 > --- a/src/fips.c > +++ b/src/fips.c > @@ -592,7 +592,10 @@ run_random_selftests (void) > # endif > #define HMAC_LEN 32 > > -static const unsigned char __attribute__ ((section (".rodata1"))) > +/* Compilers can and will constant-propagate this as 0 when reading if it is > + * not declared volatile. Since this value will be changed using objcopy(1) > + * after compilation, this can cause the HMAC verification to fail. */ > +static const volatile unsigned char __attribute__ ((section (".rodata1"))) > hmac_for_the_implementation[HMAC_LEN]; volatile causes GCC to emit a writable section, and the link editor will make .rodata1 (and typically .text) writable as a result. This is a fairly significant loss of security hardening. This bug is relevant here: various services trigger { execmem } denials in FIPS mode > +/** > + * Obtain the ElfN_Shdr.sh_offset value for the section with the given name in > + * the ELF file opened as fp and return it in offset. Rewinds fp to the > + * beginning on success. > + */ > static gpg_error_t > -hmac256_check (const char *filename, const char *key, struct link_map *lm) > +get_section_offset (FILE *fp, const char *section, unsigned long *offset) > +{ > + unsigned char e_ident[EI_NIDENT]; > +#if __WORDSIZE == 64 > + Elf64_Ehdr ehdr; > + Elf64_Shdr shdr; > +#define ELFCLASS_NATIVE ELFCLASS64 __WORDSIZE is an internal glibc macro, not to be used outside of glibc. glibc's defines ElfW as an official macro, and you could use ElfW(Ehdr) and ElfW(Shdr) here. The code looks at section headers. These can be stripped. Furthermore, the .rodata1 section is not really reserved for application use. I haven't reviewed Dmitry's OpenSSL changes (which I probably should do), but I'd suggest to use the same approach. 
8-) Thanks, Florian From dbelyavs at redhat.com Fri Feb 11 17:15:04 2022 From: dbelyavs at redhat.com (Dmitry Belyavskiy) Date: Fri, 11 Feb 2022 17:15:04 +0100 Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section In-Reply-To: <87pmnttmep.fsf@oldenburg.str.redhat.com> References: <20220211155723.86516-1-cllang@redhat.com> <87pmnttmep.fsf@oldenburg.str.redhat.com> Message-ID: Dear Florian, On Fri, Feb 11, 2022 at 5:09 PM Florian Weimer wrote: > * Clemens Lang via Gcrypt-devel: > > > diff --git a/src/fips.c b/src/fips.c > > index 193af36b..fabc5158 100644 > > --- a/src/fips.c > > +++ b/src/fips.c > > @@ -592,7 +592,10 @@ run_random_selftests (void) > > # endif > > #define HMAC_LEN 32 > > > > -static const unsigned char __attribute__ ((section (".rodata1"))) > > +/* Compilers can and will constant-propagate this as 0 when reading if > it is > > + * not declared volatile. Since this value will be changed using > objcopy(1) > > + * after compilation, this can cause the HMAC verification to fail. */ > > +static const volatile unsigned char __attribute__ ((section > (".rodata1"))) > > hmac_for_the_implementation[HMAC_LEN]; > > volatile causes GCC to emit a writable section, and the link editor will > make .rodata1 (and typically .text) writable as a result. This is a > fairly significant loss of security hardening. > > This bug is relevant here: > > various services trigger { execmem } denials in FIPS mode > > > > +/** > > + * Obtain the ElfN_Shdr.sh_offset value for the section with the given > name in > > + * the ELF file opened as fp and return it in offset. Rewinds fp to the > > + * beginning on success. > > + */ > > static gpg_error_t > > -hmac256_check (const char *filename, const char *key, struct link_map > *lm) > > +get_section_offset (FILE *fp, const char *section, unsigned long > *offset) > > +{ > > + unsigned char e_ident[EI_NIDENT]; > > +#if __WORDSIZE == 64 > > + Elf64_Ehdr ehdr; > > + Elf64_Shdr shdr; > > +#define ELFCLASS_NATIVE ELFCLASS64 > > __WORDSIZE is an internal glibc macro, not to be used outside of glibc. > glibc's defines ElfW as an official macro, and you could use > ElfW(Ehdr) and ElfW(Shdr) here. > > The code looks at section headers. These can be stripped. Furthermore, > the .rodata1 section is not really reserved for application use. > > I haven't reviewed Dmitry's OpenSSL changes (which I probably should > do), but I'd suggest to use the same approach. 8-) > Yes, I used the same approach. But the situation is a bit more strange. I had to add a `volatile` modifier to the HMAC variable because the .section attribute was ignored otherwise. After the issue you refer to was raised, I removed this modifier - and the section was preserved. -- Dmitry Belyavskiy -------------- next part -------------- An HTML attachment was scrubbed... URL: From gniibe at fsij.org Mon Feb 14 06:57:52 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Mon, 14 Feb 2022 14:57:52 +0900 Subject: [PATCH] tests: Fix undefined reference to 'pthread_create' In-Reply-To: <20220210133844.46581-2-cllang@redhat.com> References: <20220210133844.46581-1-cllang@redhat.com> <20220210133844.46581-2-cllang@redhat.com> Message-ID: <87h792m1lr.fsf@jumper.gniibe.org> Hello, Thank you for your report. Clemens Lang wrote: > * configure.ac (HAVE_PTHREAD): Expose as AM_CONDITIONAL for use in > Makefile.am > * tests/Makefile.am: Link against pthread for tests that use it > > -- > > Compilation on CentOS 8 Stream failed without this. Could you please show us the build log (on that OS) for libgpg-error? 
When -lpthread is needed, GPG_ERROR_MT_LIBS should have that. (Or else, it may also fail in other places like compiling GnuPG.) For the build, libgcrypt depends on gpg-error.pc of libgpg-error, and uses GPG_ERROR_MT_CFLAGS and GPG_ERROR_MT_LIBS. It is generated by the build of libgpg-error using Gnulib's M4 script m4/threadlib.m4. If something goes wrong, we need to fix libgpg-error. -- From gniibe at fsij.org Mon Feb 14 07:21:26 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Mon, 14 Feb 2022 15:21:26 +0900 Subject: [PATCH] fips: Fix memory leaks in FIPS mode In-Reply-To: <20220211155524.86434-1-cllang@redhat.com> References: <20220211155524.86434-1-cllang@redhat.com> Message-ID: <87czjqm0ih.fsf@jumper.gniibe.org> Clemens Lang wrote: > * cipher/pubkey.c (_gcry_pk_sign_md): Fix memory leak in FIPS mode when > used with SHA1 > * tests/basic.c (check_one_cipher_core): Add missing free in error code > triggered in FIPS mode > * tests/dsa-rfc6979.c (check_dsa_rfc6979): Likewise > * tests/pubkey.c (check_x931_derived_key): Likewise Thank you. Applied to master and 1.10 branch. -- From gniibe at fsij.org Mon Feb 14 07:22:16 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Mon, 14 Feb 2022 15:22:16 +0900 Subject: [PATCH] hmac: Fix memory leak In-Reply-To: <20220211155502.86403-1-cllang@redhat.com> References: <20220211155502.86403-1-cllang@redhat.com> Message-ID: <87a6eum0h3.fsf@jumper.gniibe.org> Clemens Lang wrote: > * src/hmac.c: Release HMAC256 context Thank you. Applied master and 1.10 branch. -- From cllang at redhat.com Mon Feb 14 13:46:19 2022 From: cllang at redhat.com (Clemens Lang) Date: Mon, 14 Feb 2022 13:46:19 +0100 Subject: [PATCH 1/2] fips: Use ELF header to find .rodata1 section In-Reply-To: <87pmnttmep.fsf@oldenburg.str.redhat.com> References: <20220211155723.86516-1-cllang@redhat.com> <87pmnttmep.fsf@oldenburg.str.redhat.com> Message-ID: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> Hi Florian, > On 11. Feb 2022, at 17:09, Florian Weimer wrote: > > __WORDSIZE is an internal glibc macro, not to be used outside of glibc. > glibc's defines ElfW as an official macro, and you could use > ElfW(Ehdr) and ElfW(Shdr) here. Thanks, I?ll fix that. > The code looks at section headers. These can be stripped. Furthermore, > the .rodata1 section is not really reserved for application use. > > I haven't reviewed Dmitry's OpenSSL changes (which I probably should > do), but I'd suggest to use the same approach. 8-) >From what I can see, it currently uses the same approach, and probably has the same issue where the compiler could assume that the HMAC is 0 and constant-propagate that. Again, this currently works just fine with GCC, but I don?t think it?s a good idea to rely on GCC?s unwillingness to replace a memcmp(3) with a few assembly instructions. Adding volatile was the simplest method I could think of to prevent that, but you make good points that it might not be the best approach here. I?ll try your suggestion from [1], which I guess would work because the variable isn?t const. The currently merged state assumes the offset in the file matches the address at runtime. This is probably not a good assumption to make. How would you determine the offset of a symbol in a file given its runtime address? Find the matching program header entry that must have loaded it and subtracting the difference between p_vaddr and p_offset? 
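Roughly something like the following untested sketch, I suppose (the function and parameter names are illustrative only, not code I would merge as-is): take the address relative to the load base, find the PT_LOAD program header that maps it, and add the difference to that segment's p_offset.

/* Illustrative sketch: translate an address relative to the load base
 * (runtime address minus the link map's l_addr) into an offset in the
 * ELF file, using the PT_LOAD program header that maps it.  */
#include <elf.h>
#include <link.h>

static int
vaddr_to_file_offset (const ElfW(Phdr) *phdr, size_t phnum,
                      unsigned long reladdr, unsigned long *offset)
{
  size_t i;

  for (i = 0; i < phnum; i++)
    if (phdr[i].p_type == PT_LOAD
        && reladdr >= phdr[i].p_vaddr
        && reladdr < phdr[i].p_vaddr + phdr[i].p_filesz)
      {
        /* file offset = segment file offset + offset within the segment */
        *offset = phdr[i].p_offset + (reladdr - phdr[i].p_vaddr);
        return 0;
      }

  /* Address is not backed by any loadable segment in the file. */
  return -1;
}

The relative address fed into this would be the address reported by dladdr minus l_addr from the link map, as the current hmac256_check already computes.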
As for stripping the section headers, GNU strip 2.37 does not seem to do that in the default configuration, so we could just expect users that want the FIPS selftest to not manually strip them. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2034320 Thanks, Clemens -- Clemens Lang RHEL Crypto Team Red Hat From cllang at redhat.com Mon Feb 14 14:00:21 2022 From: cllang at redhat.com (Clemens Lang) Date: Mon, 14 Feb 2022 14:00:21 +0100 Subject: [PATCH] tests: Fix undefined reference to 'pthread_create' In-Reply-To: <87h792m1lr.fsf@jumper.gniibe.org> References: <20220210133844.46581-1-cllang@redhat.com> <20220210133844.46581-2-cllang@redhat.com> <87h792m1lr.fsf@jumper.gniibe.org> Message-ID: Hello, > On 14. Feb 2022, at 06:57, NIIBE Yutaka wrote: > > Could you please show us the build log (on that OS) for libgpg-error? Please see https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/jobs/2067433059. > When -lpthread is needed, GPG_ERROR_MT_LIBS should have that. > (Or else, it may also fail in other places like compiling GnuPG.) > > For the build, libgcrypt depends on gpg-error.pc of libgpg-error, and > uses GPG_ERROR_MT_CFLAGS and GPG_ERROR_MT_LIBS. It is generated by the > build of libgpg-error using Gnulib's M4 script m4/threadlib.m4. > > If something goes wrong, we need to fix libgpg-error. I think the problem here is that this used libgpg-error 1.31, which does not ship a pkgconfig file. What is the minimum version of libgpg-error required by libgcrypt? -- Clemens Lang RHEL Crypto Team Red Hat From cllang at redhat.com Mon Feb 14 18:49:59 2022 From: cllang at redhat.com (Clemens Lang) Date: Mon, 14 Feb 2022 18:49:59 +0100 Subject: [PATCH v2 1/2] fips: Use ELF header to find hmac file offset In-Reply-To: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> References: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> Message-ID: <20220214175000.221372-1-cllang@redhat.com> * src/fips.c [ENABLE_HMAC_BINARY_CHECK] (hmac256_check): Use ELF headers to locate the file offset for the HMAC in addition to information from the loader -- The previous method of locating the offset of the .rodata1 section in the ELF file on disk used information obtained from the loader. This computed the address of the value in memory at runtime, but the offset in the file can be different. Specifically, the old code computed a value relative to ElfW(Phdr).p_vaddr, but the offset in the file is relative to ElfW(Phdr).p_offset. These values can differ, so the computed address at runtime must be translated into a file offset relative to p_offset. This is largely cosmetic, since the text section that should contain the HMAC usually has both p_vaddr and p_offset set to 0. Signed-off-by: Clemens Lang --- README | 3 ++- src/fips.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 69 insertions(+), 7 deletions(-) diff --git a/README b/README index 3b465c1b..4d7697dd 100644 --- a/README +++ b/README @@ -157,7 +157,8 @@ --enable-hmac-binary-check Include support to check the binary at runtime against a HMAC checksum. This works only in FIPS - mode and on systems providing the dladdr function. + mode on systems providing the dladdr function and using + the ELF binary format. 
--with-fips-module-version=version Specify a string used as a module version for FIPS diff --git a/src/fips.c b/src/fips.c index 391b94f1..c40274d9 100644 --- a/src/fips.c +++ b/src/fips.c @@ -25,6 +25,8 @@ #include #ifdef ENABLE_HMAC_BINARY_CHECK # include +# include +# include # include #endif #ifdef HAVE_SYSLOG @@ -594,6 +596,57 @@ run_random_selftests (void) static const unsigned char __attribute__ ((section (".rodata1"))) hmac_for_the_implementation[HMAC_LEN]; +/** + * Determine the offset of the given virtual address in the ELF file opened as + * fp and return it in offset. Rewinds fp to the beginning on success. + */ +static gpg_error_t +get_file_offset (FILE *fp, unsigned long paddr, unsigned long *offset) +{ + ElfW (Ehdr) ehdr; + ElfW (Phdr) phdr; + uint16_t e_phidx; + + // read the ELF header + if (0 != fseek (fp, 0, SEEK_SET)) + return gpg_error_from_syserror (); + if (1 != fread (&ehdr, sizeof (ehdr), 1, fp)) + return gpg_error_from_syserror (); + + // the section header entry size should match the size of the shdr struct + if (ehdr.e_phentsize != sizeof (phdr)) + return gpg_error (GPG_ERR_INV_OBJ); + if (ehdr.e_phoff == 0) + return gpg_error (GPG_ERR_INV_OBJ); + + // jump to the first program header + if (0 != fseek (fp, ehdr.e_phoff, SEEK_SET)) + return gpg_error_from_syserror (); + + // iterate over the program headers, compare their virtual addresses with the + // address we are looking for, and if the program header matches, calculate + // the offset of the given paddr in the file using the program header's + // p_offset field. + for (e_phidx = 0; e_phidx < ehdr.e_phnum; e_phidx++) + { + if (1 != fread (&phdr, sizeof (phdr), 1, fp)) + return gpg_error_from_syserror (); + if (phdr.p_type == PT_LOAD && phdr.p_vaddr <= paddr + && phdr.p_vaddr + phdr.p_memsz > paddr) + { + // found section, compute the offset of paddr in the file + *offset = phdr.p_offset + (paddr - phdr.p_vaddr); + + if (0 != fseek (fp, 0, SEEK_SET)) + return gpg_error_from_syserror (); + return 0; + } + } + + // section not found in the file + return gpg_error (GPG_ERR_INV_OBJ); +} + static gpg_error_t hmac256_check (const char *filename, const char *key, struct link_map *lm) { @@ -603,6 +656,7 @@ hmac256_check (const char *filename, const char *key, struct link_map *lm) size_t buffer_size, nread; char *buffer; unsigned long paddr; + unsigned long offset = 0; unsigned long off = 0; paddr = (unsigned long)hmac_for_the_implementation - lm->l_addr; @@ -611,6 +665,13 @@ hmac256_check (const char *filename, const char *key, struct link_map *lm) if (!fp) return gpg_error (GPG_ERR_INV_OBJ); + err = get_file_offset (fp, paddr, &offset); + if (err) + { + fclose (fp); + return err; + } + err = _gcry_md_open (&hd, GCRY_MD_SHA256, GCRY_MD_FLAG_HMAC); if (err) { @@ -651,14 +712,14 @@ hmac256_check (const char *filename, const char *key, struct link_map *lm) nread = fread (buffer+HMAC_LEN, 1, buffer_size, fp); if (nread < buffer_size) { - if (off - HMAC_LEN <= paddr && paddr <= off + nread) - memset (buffer + HMAC_LEN + paddr - off, 0, HMAC_LEN); + if (off - HMAC_LEN <= offset && offset <= off + nread) + memset (buffer + HMAC_LEN + offset - off, 0, HMAC_LEN); _gcry_md_write (hd, buffer, nread+HMAC_LEN); break; } - if (off - HMAC_LEN <= paddr && paddr <= off + nread) - memset (buffer + HMAC_LEN + paddr - off, 0, HMAC_LEN); + if (off - HMAC_LEN <= offset && offset <= off + nread) + memset (buffer + HMAC_LEN + offset - off, 0, HMAC_LEN); _gcry_md_write (hd, buffer, nread); memcpy (buffer, buffer+buffer_size, HMAC_LEN); off 
+= nread; @@ -694,8 +755,8 @@ check_binary_integrity (void) const char *key = KEY_FOR_BINARY_CHECK; void *extra_info; - if (!dladdr1 (hmac_for_the_implementation, - &info, &extra_info, RTLD_DL_LINKMAP)) + if (!dladdr1 (hmac_for_the_implementation, &info, &extra_info, + RTLD_DL_LINKMAP)) err = gpg_error_from_syserror (); else err = hmac256_check (info.dli_fname, key, extra_info); -- 2.35.1 From cllang at redhat.com Mon Feb 14 18:50:00 2022 From: cllang at redhat.com (Clemens Lang) Date: Mon, 14 Feb 2022 18:50:00 +0100 Subject: [PATCH v2 2/2] fips: Fix self-check compiled with clang In-Reply-To: <20220214175000.221372-1-cllang@redhat.com> References: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> <20220214175000.221372-1-cllang@redhat.com> Message-ID: <20220214175000.221372-2-cllang@redhat.com> * src/fips.c [ENABLE_HMAC_BINARY_CHECK] (hmac256_check): Prevent constant propagation of a 0 HMAC value. -- clang 13 assumes that the static const unsigned char[32] hmac_for_the_implementation is zero, propagates this constant into its use in hmac256_check and replaces with invocation of memcmp(3) with assembly instructions that compare the computed digest with 0. clang is able to make this assumption, because a 0-initialized static const variable should never change its value, but this assumption is invalid as soon as objcpy(1) changes the HMAC in the binary. Apply a suggestion from [1] to prevent this optimization. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034320#c22 Signed-off-by: Clemens Lang --- src/Makefile.am | 4 ++-- src/fips.c | 11 ++++++++++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/src/Makefile.am b/src/Makefile.am index b0a196a3..b799e52d 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -143,14 +143,14 @@ if USE_HMAC_BINARY_CHECK CLEANFILES += libgcrypt.so.hmac libgcrypt.la.done: libgcrypt.so.hmac - $(OBJCOPY) --update-section .rodata1=libgcrypt.so.hmac \ + $(OBJCOPY) --update-section .rodata_fips_hmac=libgcrypt.so.hmac \ .libs/libgcrypt.so .libs/libgcrypt.so.new mv -f .libs/libgcrypt.so.new .libs/libgcrypt.so.*.* @touch libgcrypt.la.done libgcrypt.so.hmac: hmac256 libgcrypt.la dd if=/dev/zero of=libgcrypt.so.hmac.empty bs=32 count=1 - $(OBJCOPY) --update-section .rodata1=libgcrypt.so.hmac.empty \ + $(OBJCOPY) --update-section .rodata_fips_hmac=libgcrypt.so.hmac.empty \ .libs/libgcrypt.so .libs/libgcrypt.so.empty-hmac ./hmac256 --stdkey --binary .libs/libgcrypt.so.empty-hmac > $@ $(RM) libgcrypt.so.hmac.empty .libs/libgcrypt.so.empty-hmac diff --git a/src/fips.c b/src/fips.c index c40274d9..35546b8c 100644 --- a/src/fips.c +++ b/src/fips.c @@ -593,9 +593,18 @@ run_random_selftests (void) # endif #define HMAC_LEN 32 -static const unsigned char __attribute__ ((section (".rodata1"))) +extern const unsigned char __attribute__ ((visibility ("hidden"))) hmac_for_the_implementation[HMAC_LEN]; +__asm (".hidden hmac_for_the_implementation\n\t" + ".type hmac_for_the_implementation, %object\n\t" + ".size hmac_for_the_implementation, 32\n\t" + ".section .rodata_fips_hmac,\"a\"\n\t" + ".balign 16\n" + "hmac_for_the_implementation:\n\t" + ".zero 32\n\t" + ".previous"); + /** * Determine the offset of the given virtual address in the ELF file opened as * fp and return it in offset. Rewinds fp to the beginning on success. 
-- 2.35.1 From wk at gnupg.org Tue Feb 15 08:52:23 2022 From: wk at gnupg.org (Werner Koch) Date: Tue, 15 Feb 2022 08:52:23 +0100 Subject: [PATCH v2 1/2] fips: Use ELF header to find hmac file offset In-Reply-To: <20220214175000.221372-1-cllang@redhat.com> (Clemens Lang via Gcrypt-devel's message of "Mon, 14 Feb 2022 18:49:59 +0100") References: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> <20220214175000.221372-1-cllang@redhat.com> Message-ID: <874k50bm88.fsf@wheatstone.g10code.de> On Mon, 14 Feb 2022 18:49, Clemens Lang said: > + // iterate over the program headers, compare their virtual addresses with the > + // address we are looking for, and if the program header matches, calculate > + // the offset of the given paddr in the file using the program header's Please don't use C++ comments - we stick to standard C90 comments. > + if (1 != fread (&phdr, sizeof (phdr), 1, fp)) That is hard to read. Given the state of current compilers this old trick is not needed. We are reading code left to righta nd partially reversing the direction is an uncommon pattern. So please don't do. > + unsigned long offset = 0; > unsigned long off = 0; The names are too similar - better use off2 or tmpoff instead of offset. Sorry for nitpicky - thanks for your work. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From gniibe at fsij.org Tue Feb 15 09:20:24 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 15 Feb 2022 17:20:24 +0900 Subject: [PATCH] tests: Fix undefined reference to 'pthread_create' In-Reply-To: References: <20220210133844.46581-1-cllang@redhat.com> <20220210133844.46581-2-cllang@redhat.com> <87h792m1lr.fsf@jumper.gniibe.org> Message-ID: <87mtis4k3b.fsf@jumper.gniibe.org> Hello, I think I located the cause. It is a bug in m4/gpg-error.m4 (from libgpg-error). I'm going to push a fix to master (of libgcrypt). Then, I'll fix libgpg-error. Clemens Lang wrote: > I think the problem here is that this used libgpg-error 1.31, which does not ship a pkgconfig file. > > What is the minimum version of libgpg-error required by libgcrypt? It is 1.27. It doesn't provide gpg-error.pc, but it supports gpg-error-config script. (Unified config script gpgrt-config and gpg-error.pc is since version 1.33.) libgpg-error 1.27 has old version of gpgrt-config, which is not unified version. This older gpgrt-config confuses the m4 script of m4/gpg-error.m4. -- From gniibe at fsij.org Tue Feb 15 09:35:50 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 15 Feb 2022 17:35:50 +0900 Subject: [PATCH v2 1/2] fips: Use ELF header to find hmac file offset In-Reply-To: <874k50bm88.fsf@wheatstone.g10code.de> References: <8AEF6972-91BF-4B9A-B335-1EA482BF0DA5@redhat.com> <20220214175000.221372-1-cllang@redhat.com> <874k50bm88.fsf@wheatstone.g10code.de> Message-ID: <87k0dw4jdl.fsf@jumper.gniibe.org> Hello, I'm going to apply this change of Clemens' (v2 1/2), with modification of coding style. On top of the change, I'll apply change to the hashing content. It is better to hash from the start to the last loadable segment, so that we can allow change of the binary by strip. 
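A rough, untested sketch of that idea (names are illustrative): the upper bound of the hashed region would be the end of the last loadable segment, so changes made by strip, which typically only rewrite non-loaded data after that point such as the symbol table and section headers, do not invalidate the HMAC.

/* Illustrative sketch: return the number of leading file bytes covered
 * by PT_LOAD segments; only this prefix of the file would be hashed.  */
#include <elf.h>
#include <link.h>

static unsigned long
loadable_image_end (const ElfW(Phdr) *phdr, size_t phnum)
{
  unsigned long end = 0;
  size_t i;

  for (i = 0; i < phnum; i++)
    if (phdr[i].p_type == PT_LOAD
        && phdr[i].p_offset + phdr[i].p_filesz > end)
      end = phdr[i].p_offset + phdr[i].p_filesz;

  return end;  /* hash file bytes [0, end) */
}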
It is tracked by: https://dev.gnupg.org/T5835 -- From tianjia.zhang at linux.alibaba.com Wed Feb 16 14:12:23 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 16 Feb 2022 21:12:23 +0800 Subject: [PATCH] Add SM4 ARMv8/AArch64 assembly implementation Message-ID: <20220216131223.10166-1-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-aarch64.S'. * cipher/sm4-aarch64.S: New. * cipher/sm4.c (USE_AARCH64_SIMD): New. (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) (_gcry_sm4_aarch64_cbc_dec, _gcry_sm4_aarch64_cfb_dec) (_gcry_sm4_aarch64_ctr_dec): New. (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec) (_gcry_sm4_cfb_dec) [USE_AESNI_AVX2]: Add ARMv8/AArch64 bulk functions. * configure.ac: Add ''sm4-aarch64.lo'. -- Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 2 +- cipher/sm4-aarch64.S | 390 +++++++++++++++++++++++++++++++++++++++++++ cipher/sm4.c | 87 ++++++++++ configure.ac | 3 + 4 files changed, 481 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4-aarch64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 264b3d30..6c1c7693 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -116,7 +116,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ - sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S new file mode 100644 index 00000000..f9c828be --- /dev/null +++ b/cipher/sm4-aarch64.S @@ -0,0 +1,390 @@ +/* sm4-aarch64.S - ARMv8/AArch64 accelerated SM4 cipher + * + * Copyright (C) 2021 Alibaba Group. + * Copyright (C) 2021 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) && \ + defined(USE_SM4) + +.cpu generic+simd + +/* Constants */ + +.text +.align 4 +ELF(.type _gcry_sm4_aarch64_consts, at object) +_gcry_sm4_aarch64_consts: +.Lsm4_sbox: + .byte 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7 + .byte 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05 + .byte 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3 + .byte 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99 + .byte 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a + .byte 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62 + .byte 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95 + .byte 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6 + .byte 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba + .byte 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8 + .byte 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b + .byte 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35 + .byte 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2 + .byte 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87 + .byte 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52 + .byte 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e + .byte 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5 + .byte 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1 + .byte 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55 + .byte 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3 + .byte 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60 + .byte 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f + .byte 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f + .byte 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51 + .byte 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f + .byte 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8 + .byte 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd + .byte 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0 + .byte 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e + .byte 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84 + .byte 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20 + .byte 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 +ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) + +/* Register macros */ + +#define RTMP0 v8 +#define RTMP1 v9 +#define RTMP2 v10 +#define RTMP3 v11 +#define RTMP4 v12 + +#define RX0 v13 +#define RKEY v14 +#define RIDX v15 + +/* Helper macros. 
*/ + +#define preload_sbox(ptr) \ + GET_DATA_POINTER(ptr, .Lsm4_sbox); \ + ld1 {v16.16b-v19.16b}, [ptr], #64; \ + ld1 {v20.16b-v23.16b}, [ptr], #64; \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; \ + movi RIDX.16b, #64; /* sizeof(sbox) / 4 */ + +#define transpose_4x4(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s0.4s, s1.4s; \ + zip1 RTMP1.4s, s2.4s, s3.4s; \ + zip2 RTMP2.4s, s0.4s, s1.4s; \ + zip2 RTMP3.4s, s2.4s, s3.4s; \ + zip1 s0.2d, RTMP0.2d, RTMP1.2d; \ + zip2 s1.2d, RTMP0.2d, RTMP1.2d; \ + zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ + zip2 s3.2d, RTMP2.2d, RTMP3.2d; + +#define rotate_clockwise_90(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s1.4s, s0.4s; \ + zip2 RTMP1.4s, s1.4s, s0.4s; \ + zip1 RTMP2.4s, s3.4s, s2.4s; \ + zip2 RTMP3.4s, s3.4s, s2.4s; \ + zip1 s0.2d, RTMP2.2d, RTMP0.2d; \ + zip2 s1.2d, RTMP2.2d, RTMP0.2d; \ + zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ + zip2 s3.2d, RTMP3.2d, RTMP1.2d; + +#define ROUND(round, s0, s1, s2, s3) \ + dup RX0.4s, RKEY.s[round]; \ + /* rk ^ s1 ^ s2 ^ s3 */ \ + eor RTMP1.16b, s2.16b, s3.16b; \ + eor RX0.16b, RX0.16b, s1.16b; \ + eor RX0.16b, RX0.16b, RTMP1.16b; \ + \ + /* sbox, non-linear part */ \ + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RIDX.16b; \ + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RIDX.16b; \ + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RIDX.16b; \ + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ + \ + /* linear part */ \ + shl RTMP1.4s, RTMP0.4s, #8; \ + shl RTMP2.4s, RTMP0.4s, #16; \ + shl RTMP3.4s, RTMP0.4s, #24; \ + sri RTMP1.4s, RTMP0.4s, #(32-8); \ + sri RTMP2.4s, RTMP0.4s, #(32-16); \ + sri RTMP3.4s, RTMP0.4s, #(32-24); \ + /* RTMP1 = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ + eor RTMP1.16b, RTMP1.16b, RTMP0.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP2.16b; \ + /* RTMP3 = x ^ rol32(x, 24) ^ rol32(RTMP1, 2) */ \ + eor RTMP3.16b, RTMP3.16b, RTMP0.16b; \ + shl RTMP2.4s, RTMP1.4s, 2; \ + sri RTMP2.4s, RTMP1.4s, #(32-2); \ + eor RTMP3.16b, RTMP3.16b, RTMP2.16b; \ + /* s0 ^= RTMP3 */ \ + eor s0.16b, s0.16b, RTMP3.16b; + + +ELF(.type __sm4_crypt_blk4,%function;) +__sm4_crypt_blk4: + /* input: + * x0: round key array, CTX + * v0 v1 v2 v3: four parallel plaintext blocks + * output: + * v0 v1 v2 v3: four parallel ciphertext blocks + */ + CFI_STARTPROC(); + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + transpose_4x4(v0, v1, v2, v3); + + mov x6, 8; +.Lroundloop: + ld1 {RKEY.4s}, [x0], #16; + ROUND(0, v0, v1, v2, v3); + ROUND(1, v1, v2, v3, v0); + ROUND(2, v2, v3, v0, v1); + ROUND(3, v3, v0, v1, v2); + + subs x6, x6, #1; + bne .Lroundloop; + + rotate_clockwise_90(v0, v1, v2, v3); + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + sub x0, x0, #128; /* repoint to rkey */ + ret; + CFI_ENDPROC(); +ELF(.size __sm4_crypt_blk4,.-__sm4_crypt_blk4;) + +.global _gcry_sm4_aarch64_crypt +ELF(.type _gcry_sm4_aarch64_crypt,%function;) +_gcry_sm4_aarch64_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 4) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + + preload_sbox(x5); + +.Lcrypt_loop_blk4: + subs x3, x3, #4; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + bl __sm4_crypt_blk4; + st1 {v0.16b-v3.16b}, [x1], #64; + b .Lcrypt_loop_blk4; + +.Lcrypt_end: + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + 
CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_crypt,.-_gcry_sm4_aarch64_crypt;) + +.global _gcry_sm4_aarch64_cbc_dec +ELF(.type _gcry_sm4_aarch64_cbc_dec,%function;) +_gcry_sm4_aarch64_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 4) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + + preload_sbox(x5); + ld1 {RTMP4.16b}, [x3]; + +.Lcbc_loop_blk4: + subs x4, x4, #4; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2]; + + bl __sm4_crypt_blk4; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP4.16b; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + + st1 {v0.16b-v3.16b}, [x1], #64; + mov RTMP4.16b, RTMP3.16b; + + b .Lcbc_loop_blk4; + +.Lcbc_end: + /* store new IV */ + st1 {RTMP4.16b}, [x3]; + + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cbc_dec,.-_gcry_sm4_aarch64_cbc_dec;) + +.global _gcry_sm4_aarch64_cfb_dec +ELF(.type _gcry_sm4_aarch64_cfb_dec,%function;) +_gcry_sm4_aarch64_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 4) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + + preload_sbox(x5); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk4: + subs x4, x4, #4; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2]; + + bl __sm4_crypt_blk4; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + + st1 {v0.16b-v3.16b}, [x1], #64; + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk4; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cfb_dec,.-_gcry_sm4_aarch64_cfb_dec;) + +.global _gcry_sm4_aarch64_ctr_enc +ELF(.type _gcry_sm4_aarch64_ctr_enc,%function;) +_gcry_sm4_aarch64_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 4) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + + preload_sbox(x5); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk4: + subs x4, x4, #4; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + + bl __sm4_crypt_blk4; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + b .Lctr_loop_blk4; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size 
_gcry_sm4_aarch64_ctr_enc,.-_gcry_sm4_aarch64_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index 81662988..afcfd61b 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -67,6 +67,15 @@ # endif #endif +#undef USE_AARCH64_SIMD +#ifdef ENABLE_NEON_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) +# define USE_AARCH64_SIMD 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -94,6 +103,9 @@ typedef struct #ifdef USE_AESNI_AVX2 unsigned int use_aesni_avx2:1; #endif +#ifdef USE_AARCH64_SIMD + unsigned int use_aarch64_simd:1; +#endif } SM4_context; static const u32 fk[4] = @@ -241,6 +253,27 @@ extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, const u64 Ls[16]) ASM_FUNC_ABI; #endif /* USE_AESNI_AVX2 */ +#ifdef USE_AARCH64_SIMD +extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, + const byte *in, + int nblocks); + +extern void _gcry_sm4_aarch64_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + int nblocks); + +extern void _gcry_sm4_aarch64_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + int nblocks); + +extern void _gcry_sm4_aarch64_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + int nblocks); +#endif /* USE_AARCH64_SIMD */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -372,6 +405,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AESNI_AVX2 ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_AARCH64_SIMD + ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -553,6 +589,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 4 blocks at a time. */ + if (nblocks >= 4) + { + size_t nblks = nblocks & ~(4 - 1); + + _gcry_sm4_aarch64_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -654,6 +707,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 4 blocks at a time. */ + if (nblocks >= 4) + { + size_t nblks = nblocks & ~(4 - 1); + + _gcry_sm4_aarch64_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -748,6 +818,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 4 blocks at a time. */ + if (nblocks >= 4) + { + size_t nblks = nblocks & ~(4 - 1); + + _gcry_sm4_aarch64_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. 
*/ if (nblocks) { diff --git a/configure.ac b/configure.ac index ea01f5a6..89df9434 100644 --- a/configure.ac +++ b/configure.ac @@ -2740,6 +2740,9 @@ if test "$found" = "1" ; then GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx-amd64.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx2-amd64.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" esac fi -- 2.34.1 From jussi.kivilinna at iki.fi Thu Feb 17 19:12:22 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 17 Feb 2022 20:12:22 +0200 Subject: [PATCH] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <20220216131223.10166-1-tianjia.zhang@linux.alibaba.com> References: <20220216131223.10166-1-tianjia.zhang@linux.alibaba.com> Message-ID: Hello, Looks good, just few comments below... On 16.2.2022 15.12, Tianjia Zhang wrote: > * cipher/Makefile.am: Add 'sm4-aarch64.S'. > * cipher/sm4-aarch64.S: New. > * cipher/sm4.c (USE_AARCH64_SIMD): New. > (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. > [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) > (_gcry_sm4_aarch64_cbc_dec, _gcry_sm4_aarch64_cfb_dec) > (_gcry_sm4_aarch64_ctr_dec): New. > (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. > (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec) > (_gcry_sm4_cfb_dec) [USE_AESNI_AVX2]: Add ARMv8/AArch64 bulk functions. USE_AARCH64_SIMD here. > * configure.ac: Add ''sm4-aarch64.lo'. > -- > > Signed-off-by: Tianjia Zhang > --- > cipher/Makefile.am | 2 +- > cipher/sm4-aarch64.S | 390 +++++++++++++++++++++++++++++++++++++++++++ > cipher/sm4.c | 87 ++++++++++ > configure.ac | 3 + > 4 files changed, 481 insertions(+), 1 deletion(-) > create mode 100644 cipher/sm4-aarch64.S > > diff --git a/cipher/Makefile.am b/cipher/Makefile.am > index 264b3d30..6c1c7693 100644 > --- a/cipher/Makefile.am > +++ b/cipher/Makefile.am > @@ -116,7 +116,7 @@ EXTRA_libcipher_la_SOURCES = \ > scrypt.c \ > seed.c \ > serpent.c serpent-sse2-amd64.S \ > - sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ > + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ > serpent-avx2-amd64.S serpent-armv7-neon.S \ > sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ > sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ > diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S > new file mode 100644 > index 00000000..f9c828be > --- /dev/null > +++ b/cipher/sm4-aarch64.S > @@ -0,0 +1,390 @@ > +/* sm4-aarch64.S - ARMv8/AArch64 accelerated SM4 cipher > + * > + * Copyright (C) 2021 Alibaba Group. > + * Copyright (C) 2021 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . 
> + */ > + > +#include "asm-common-aarch64.h" > + > +#if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) && \ > + defined(USE_SM4) > + > +.cpu generic+simd > + > +/* Constants */ > + > +.text > +.align 4 > +ELF(.type _gcry_sm4_aarch64_consts, at object) > +_gcry_sm4_aarch64_consts: > +.Lsm4_sbox: > + .byte 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7 > + .byte 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05 > + .byte 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3 > + .byte 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99 > + .byte 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a > + .byte 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62 > + .byte 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95 > + .byte 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6 > + .byte 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba > + .byte 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8 > + .byte 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b > + .byte 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35 > + .byte 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2 > + .byte 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87 > + .byte 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52 > + .byte 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e > + .byte 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5 > + .byte 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1 > + .byte 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55 > + .byte 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3 > + .byte 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60 > + .byte 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f > + .byte 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f > + .byte 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51 > + .byte 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f > + .byte 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8 > + .byte 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd > + .byte 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0 > + .byte 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e > + .byte 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84 > + .byte 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20 > + .byte 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 > +ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) > + > +/* Register macros */ > + > +#define RTMP0 v8 > +#define RTMP1 v9 > +#define RTMP2 v10 > +#define RTMP3 v11 > +#define RTMP4 v12 > + > +#define RX0 v13 > +#define RKEY v14 > +#define RIDX v15 Vectors registers v8 to v15 are being used, so functions need to store and restore d8-d15 registers as they are ABI callee saved. Check "VPUSH_ABI" and "VPOP_API" macros in "cipher-gcm-armv8-aarch64-ce.S". Those could be moved to "asm-common-aarch64.h" so that macros can be shared between different files. > + > +/* Helper macros. 
*/ > + > +#define preload_sbox(ptr) \ > + GET_DATA_POINTER(ptr, .Lsm4_sbox); \ > + ld1 {v16.16b-v19.16b}, [ptr], #64; \ > + ld1 {v20.16b-v23.16b}, [ptr], #64; \ > + ld1 {v24.16b-v27.16b}, [ptr], #64; \ > + ld1 {v28.16b-v31.16b}, [ptr]; \ > + movi RIDX.16b, #64; /* sizeof(sbox) / 4 */ > + > +#define transpose_4x4(s0, s1, s2, s3) \ > + zip1 RTMP0.4s, s0.4s, s1.4s; \ > + zip1 RTMP1.4s, s2.4s, s3.4s; \ > + zip2 RTMP2.4s, s0.4s, s1.4s; \ > + zip2 RTMP3.4s, s2.4s, s3.4s; \ > + zip1 s0.2d, RTMP0.2d, RTMP1.2d; \ > + zip2 s1.2d, RTMP0.2d, RTMP1.2d; \ > + zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ > + zip2 s3.2d, RTMP2.2d, RTMP3.2d; > + > +#define rotate_clockwise_90(s0, s1, s2, s3) \ > + zip1 RTMP0.4s, s1.4s, s0.4s; \ > + zip2 RTMP1.4s, s1.4s, s0.4s; \ > + zip1 RTMP2.4s, s3.4s, s2.4s; \ > + zip2 RTMP3.4s, s3.4s, s2.4s; \ > + zip1 s0.2d, RTMP2.2d, RTMP0.2d; \ > + zip2 s1.2d, RTMP2.2d, RTMP0.2d; \ > + zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ > + zip2 s3.2d, RTMP3.2d, RTMP1.2d; > + > +#define ROUND(round, s0, s1, s2, s3) \ > + dup RX0.4s, RKEY.s[round]; \ > + /* rk ^ s1 ^ s2 ^ s3 */ \ > + eor RTMP1.16b, s2.16b, s3.16b; \ > + eor RX0.16b, RX0.16b, s1.16b; \ > + eor RX0.16b, RX0.16b, RTMP1.16b; \ > + \ > + /* sbox, non-linear part */ \ > + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ > + sub RX0.16b, RX0.16b, RIDX.16b; \ > + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ > + sub RX0.16b, RX0.16b, RIDX.16b; \ > + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ > + sub RX0.16b, RX0.16b, RIDX.16b; \ > + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ > + \ > + /* linear part */ \ > + shl RTMP1.4s, RTMP0.4s, #8; \ > + shl RTMP2.4s, RTMP0.4s, #16; \ > + shl RTMP3.4s, RTMP0.4s, #24; \ > + sri RTMP1.4s, RTMP0.4s, #(32-8); \ > + sri RTMP2.4s, RTMP0.4s, #(32-16); \ > + sri RTMP3.4s, RTMP0.4s, #(32-24); \ > + /* RTMP1 = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ > + eor RTMP1.16b, RTMP1.16b, RTMP0.16b; \ > + eor RTMP1.16b, RTMP1.16b, RTMP2.16b; \ > + /* RTMP3 = x ^ rol32(x, 24) ^ rol32(RTMP1, 2) */ \ > + eor RTMP3.16b, RTMP3.16b, RTMP0.16b; \ > + shl RTMP2.4s, RTMP1.4s, 2; \ > + sri RTMP2.4s, RTMP1.4s, #(32-2); \ > + eor RTMP3.16b, RTMP3.16b, RTMP2.16b; \ > + /* s0 ^= RTMP3 */ \ > + eor s0.16b, s0.16b, RTMP3.16b; > + > + > +ELF(.type __sm4_crypt_blk4,%function;) > +__sm4_crypt_blk4: > + /* input: > + * x0: round key array, CTX > + * v0 v1 v2 v3: four parallel plaintext blocks > + * output: > + * v0 v1 v2 v3: four parallel ciphertext blocks > + */ > + CFI_STARTPROC(); > + > + rev32 v0.16b, v0.16b; > + rev32 v1.16b, v1.16b; > + rev32 v2.16b, v2.16b; > + rev32 v3.16b, v3.16b; > + > + transpose_4x4(v0, v1, v2, v3); > + > + mov x6, 8; > +.Lroundloop: > + ld1 {RKEY.4s}, [x0], #16; > + ROUND(0, v0, v1, v2, v3); > + ROUND(1, v1, v2, v3, v0); > + ROUND(2, v2, v3, v0, v1); > + ROUND(3, v3, v0, v1, v2); > + > + subs x6, x6, #1; Bit of micro-optimization, but this could be moved after "ld1 {RKEY.4s}" above. 
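For reference, a minimal sketch of what the reordered loop could look like (illustrative only, not taken from the posted patch; it reuses the ROUND macro, RKEY and the x6 round counter defined above):

    mov x6, 8;
.Lroundloop:
    ld1 {RKEY.4s}, [x0], #16;
    subs x6, x6, #1;    /* decrement right after the key load; the vector
                         * instructions inside ROUND do not touch the flags,
                         * so they are still valid at the 'bne' below. */

    ROUND(0, v0, v1, v2, v3);
    ROUND(1, v1, v2, v3, v0);
    ROUND(2, v2, v3, v0, v1);
    ROUND(3, v3, v0, v1, v2);

    bne .Lroundloop;

That way the counter update is issued early instead of sitting between the last ROUND and the branch.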
> + bne .Lroundloop; > + > + rotate_clockwise_90(v0, v1, v2, v3); > + rev32 v0.16b, v0.16b; > + rev32 v1.16b, v1.16b; > + rev32 v2.16b, v2.16b; > + rev32 v3.16b, v3.16b; > + > + sub x0, x0, #128; /* repoint to rkey */ > + ret; > + CFI_ENDPROC(); > +ELF(.size __sm4_crypt_blk4,.-__sm4_crypt_blk4;) > + > +.global _gcry_sm4_aarch64_crypt > +ELF(.type _gcry_sm4_aarch64_crypt,%function;) > +_gcry_sm4_aarch64_crypt: > + /* input: > + * x0: round key array, CTX > + * x1: dst > + * x2: src > + * x3: nblocks (multiples of 4) > + */ > + CFI_STARTPROC(); > + > + stp x29, x30, [sp, #-16]!; > + CFI_ADJUST_CFA_OFFSET(16); > + CFI_REG_ON_STACK(29, 0); > + CFI_REG_ON_STACK(30, 8); > + > + preload_sbox(x5); > + > +.Lcrypt_loop_blk4: > + subs x3, x3, #4; > + bmi .Lcrypt_end; > + > + ld1 {v0.16b-v3.16b}, [x2], #64; > + bl __sm4_crypt_blk4; > + st1 {v0.16b-v3.16b}, [x1], #64; > + b .Lcrypt_loop_blk4; > + > +.Lcrypt_end: > + ldp x29, x30, [sp], #16; > + CFI_ADJUST_CFA_OFFSET(-16); > + CFI_RESTORE(x29); > + CFI_RESTORE(x30); > + ret_spec_stop; > + CFI_ENDPROC(); > +ELF(.size _gcry_sm4_aarch64_crypt,.-_gcry_sm4_aarch64_crypt;) > + > +.global _gcry_sm4_aarch64_cbc_dec > +ELF(.type _gcry_sm4_aarch64_cbc_dec,%function;) > +_gcry_sm4_aarch64_cbc_dec: > + /* input: > + * x0: round key array, CTX > + * x1: dst > + * x2: src > + * x3: iv (big endian, 128 bit) > + * x4: nblocks (multiples of 4) > + */ > + CFI_STARTPROC(); > + > + stp x29, x30, [sp, #-16]!; > + CFI_ADJUST_CFA_OFFSET(16); > + CFI_REG_ON_STACK(29, 0); > + CFI_REG_ON_STACK(30, 8); > + > + preload_sbox(x5); > + ld1 {RTMP4.16b}, [x3]; > + > +.Lcbc_loop_blk4: > + subs x4, x4, #4; > + bmi .Lcbc_end; > + > + ld1 {v0.16b-v3.16b}, [x2]; > + > + bl __sm4_crypt_blk4; > + > + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; > + eor v0.16b, v0.16b, RTMP4.16b; > + eor v1.16b, v1.16b, RTMP0.16b; > + eor v2.16b, v2.16b, RTMP1.16b; > + eor v3.16b, v3.16b, RTMP2.16b; > + > + st1 {v0.16b-v3.16b}, [x1], #64; > + mov RTMP4.16b, RTMP3.16b; > + > + b .Lcbc_loop_blk4; > + > +.Lcbc_end: > + /* store new IV */ > + st1 {RTMP4.16b}, [x3]; > + > + ldp x29, x30, [sp], #16; > + CFI_ADJUST_CFA_OFFSET(-16); > + CFI_RESTORE(x29); > + CFI_RESTORE(x30); > + ret_spec_stop; > + CFI_ENDPROC(); > +ELF(.size _gcry_sm4_aarch64_cbc_dec,.-_gcry_sm4_aarch64_cbc_dec;) > + > +.global _gcry_sm4_aarch64_cfb_dec > +ELF(.type _gcry_sm4_aarch64_cfb_dec,%function;) > +_gcry_sm4_aarch64_cfb_dec: > + /* input: > + * x0: round key array, CTX > + * x1: dst > + * x2: src > + * x3: iv (big endian, 128 bit) > + * x4: nblocks (multiples of 4) > + */ > + CFI_STARTPROC(); > + > + stp x29, x30, [sp, #-16]!; > + CFI_ADJUST_CFA_OFFSET(16); > + CFI_REG_ON_STACK(29, 0); > + CFI_REG_ON_STACK(30, 8); > + > + preload_sbox(x5); > + ld1 {v0.16b}, [x3]; > + > +.Lcfb_loop_blk4: > + subs x4, x4, #4; > + bmi .Lcfb_end; > + > + ld1 {v1.16b, v2.16b, v3.16b}, [x2]; > + > + bl __sm4_crypt_blk4; > + > + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; > + eor v0.16b, v0.16b, RTMP0.16b; > + eor v1.16b, v1.16b, RTMP1.16b; > + eor v2.16b, v2.16b, RTMP2.16b; > + eor v3.16b, v3.16b, RTMP3.16b; > + > + st1 {v0.16b-v3.16b}, [x1], #64; > + mov v0.16b, RTMP3.16b; > + > + b .Lcfb_loop_blk4; > + > +.Lcfb_end: > + /* store new IV */ > + st1 {v0.16b}, [x3]; > + > + ldp x29, x30, [sp], #16; > + CFI_ADJUST_CFA_OFFSET(-16); > + CFI_RESTORE(x29); > + CFI_RESTORE(x30); > + ret_spec_stop; > + CFI_ENDPROC(); > +ELF(.size _gcry_sm4_aarch64_cfb_dec,.-_gcry_sm4_aarch64_cfb_dec;) > + > +.global _gcry_sm4_aarch64_ctr_enc > +ELF(.type _gcry_sm4_aarch64_ctr_enc,%function;) > 
+_gcry_sm4_aarch64_ctr_enc: > + /* input: > + * x0: round key array, CTX > + * x1: dst > + * x2: src > + * x3: ctr (big endian, 128 bit) > + * x4: nblocks (multiples of 4) > + */ > + CFI_STARTPROC(); > + > + stp x29, x30, [sp, #-16]!; > + CFI_ADJUST_CFA_OFFSET(16); > + CFI_REG_ON_STACK(29, 0); > + CFI_REG_ON_STACK(30, 8); > + > + preload_sbox(x5); > + > + ldp x7, x8, [x3]; > + rev x7, x7; > + rev x8, x8; > + > +.Lctr_loop_blk4: > + subs x4, x4, #4; > + bmi .Lctr_end; > + > +#define inc_le128(vctr) \ > + mov vctr.d[1], x8; \ > + mov vctr.d[0], x7; \ > + adds x8, x8, #1; \ > + adc x7, x7, xzr; \ > + rev64 vctr.16b, vctr.16b; > + > + /* construct CTRs */ > + inc_le128(v0); /* +0 */ > + inc_le128(v1); /* +1 */ > + inc_le128(v2); /* +2 */ > + inc_le128(v3); /* +3 */ > + > + bl __sm4_crypt_blk4; > + > + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; > + eor v0.16b, v0.16b, RTMP0.16b; > + eor v1.16b, v1.16b, RTMP1.16b; > + eor v2.16b, v2.16b, RTMP2.16b; > + eor v3.16b, v3.16b, RTMP3.16b; > + st1 {v0.16b-v3.16b}, [x1], #64; > + b .Lctr_loop_blk4; > + > +.Lctr_end: > + /* store new CTR */ > + rev x7, x7; > + rev x8, x8; > + stp x7, x8, [x3]; > + > + ldp x29, x30, [sp], #16; > + CFI_ADJUST_CFA_OFFSET(-16); > + CFI_RESTORE(x29); > + CFI_RESTORE(x30); > + ret_spec_stop; > + CFI_ENDPROC(); > +ELF(.size _gcry_sm4_aarch64_ctr_enc,.-_gcry_sm4_aarch64_ctr_enc;) > + > +#endif > diff --git a/cipher/sm4.c b/cipher/sm4.c > index 81662988..afcfd61b 100644 > --- a/cipher/sm4.c > +++ b/cipher/sm4.c > @@ -67,6 +67,15 @@ > # endif > #endif > > +#undef USE_AARCH64_SIMD > +#ifdef ENABLE_NEON_SUPPORT > +# if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) > +# define USE_AARCH64_SIMD 1 > +# endif > +#endif > + > static const char *sm4_selftest (void); > > static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, > @@ -94,6 +103,9 @@ typedef struct > #ifdef USE_AESNI_AVX2 > unsigned int use_aesni_avx2:1; > #endif > +#ifdef USE_AARCH64_SIMD > + unsigned int use_aarch64_simd:1; > +#endif > } SM4_context; > > static const u32 fk[4] = > @@ -241,6 +253,27 @@ extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, > const u64 Ls[16]) ASM_FUNC_ABI; > #endif /* USE_AESNI_AVX2 */ > > +#ifdef USE_AARCH64_SIMD > +extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, > + const byte *in, > + int nblocks);> + > +extern void _gcry_sm4_aarch64_cbc_dec(const u32 *rk_dec, byte *out, > + const byte *in, > + byte *iv, > + int nblocks); > + > +extern void _gcry_sm4_aarch64_cfb_dec(const u32 *rk_enc, byte *out, > + const byte *in, > + byte *iv, > + int nblocks); > + > +extern void _gcry_sm4_aarch64_ctr_enc(const u32 *rk_enc, byte *out, > + const byte *in, > + byte *ctr, > + int nblocks); Use 'size_t' for nblocks. Clang can make assumption that 'int' means that target function uses only low 32-bit of 'nblocks' (assumes that target function accesses only through W3 register) and leave garbage values in upper 32-bit of X3 register here. -Jussi From tianjia.zhang at linux.alibaba.com Fri Feb 18 13:00:48 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Fri, 18 Feb 2022 20:00:48 +0800 Subject: [PATCH] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: References: <20220216131223.10166-1-tianjia.zhang@linux.alibaba.com> Message-ID: Hi Jussi, On 2/18/22 2:12 AM, Jussi Kivilinna wrote: > Hello, > > Looks good, just few comments below... > > On 16.2.2022 15.12, Tianjia Zhang wrote: >> * cipher/Makefile.am: Add 'sm4-aarch64.S'. 
>> * cipher/sm4-aarch64.S: New. >> * cipher/sm4.c (USE_AARCH64_SIMD): New. >> (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. >> [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) >> (_gcry_sm4_aarch64_cbc_dec, _gcry_sm4_aarch64_cfb_dec) >> (_gcry_sm4_aarch64_ctr_dec): New. >> (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. >> (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec) >> (_gcry_sm4_cfb_dec) [USE_AESNI_AVX2]: Add ARMv8/AArch64 bulk functions. > > USE_AARCH64_SIMD here. > >> * configure.ac: Add ''sm4-aarch64.lo'. >> +/* Register macros */ >> + >> +#define RTMP0 v8 >> +#define RTMP1 v9 >> +#define RTMP2 v10 >> +#define RTMP3 v11 >> +#define RTMP4 v12 >> + >> +#define RX0 v13 >> +#define RKEY v14 >> +#define RIDX v15 > > Vectors registers v8 to v15 are being used, so functions need > to store and restore d8-d15 registers as they are ABI callee > saved. Check "VPUSH_ABI" and "VPOP_API" macros in > "cipher-gcm-armv8-aarch64-ce.S". Those could be moved to > "asm-common-aarch64.h" so that macros can be shared between > different files. > >> + >> + mov x6, 8; >> +.Lroundloop: >> + ld1 {RKEY.4s}, [x0], #16; >> + ROUND(0, v0, v1, v2, v3); >> + ROUND(1, v1, v2, v3, v0); >> + ROUND(2, v2, v3, v0, v1); >> + ROUND(3, v3, v0, v1, v2); >> + >> + subs x6, x6, #1; > > Bit of micro-optimization, but this could be moved after > "ld1 {RKEY.4s}" above. > >> + bne .Lroundloop; >> + >> + rotate_clockwise_90(v0, v1, v2, v3); >> + rev32 v0.16b, v0.16b; >> + rev32 v1.16b, v1.16b; >> + rev32 v2.16b, v2.16b; >> + rev32 v3.16b, v3.16b; >> #endif /* USE_AESNI_AVX2 */ >> +#ifdef USE_AARCH64_SIMD >> +extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, >> + const byte *in, >> + int nblocks); >> + >> +extern void _gcry_sm4_aarch64_cbc_dec(const u32 *rk_dec, byte *out, >> + const byte *in, >> + byte *iv, >> + int nblocks); >> + >> +extern void _gcry_sm4_aarch64_cfb_dec(const u32 *rk_enc, byte *out, >> + const byte *in, >> + byte *iv, >> + int nblocks); >> + >> +extern void _gcry_sm4_aarch64_ctr_enc(const u32 *rk_enc, byte *out, >> + const byte *in, >> + byte *ctr, >> + int nblocks); > > Use 'size_t' for nblocks. Clang can make assumption that 'int' > means that target function uses only low 32-bit of 'nblocks' > (assumes that target function accesses only through W3 > register) and leave garbage values in upper 32-bit of X3 register > here. > > -Jussi Thanks for your suggestions. I will fix the issues you mentioned, introduce 8-way acceleration support in the v2 patch, and add a new patch to move VPUSH_ABI/VPOP_ABI into the asm-common-aarch64.h header file. Best regards, Tianjia From jussi.kivilinna at iki.fi Sat Feb 19 12:52:09 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 19 Feb 2022 13:52:09 +0200 Subject: [PATCH] Perform AEAD input 24KiB splitting only when input larger than 32KiB Message-ID: <20220219115209.1019382-1-jussi.kivilinna@iki.fi> * cipher/chacha20.c (_gcry_chacha20_poly1305_encrypt) (_gcry_chacha20_poly1305_decrypt): Process in 24KiB chunks if input larger than 32KiB. * cipher/cipher-ccm.c (_gcry_cipher_ccm_encrypt) (_gcry_cipher_ccm_decrypt): Likewise. * cipher/cipher-eax.c (_gcry_cipher_eax_encrypt) (_gcry_cipher_eax_decrypt): Likewise. * cipher/cipher-gcm.c (gcm_crypt_inner): Likewise.
* cipher/cipher-ocb.c (ocb_crypt): Likewise. * cipher/cipher-poly2305.c (_gcry_cipher_poly1305_encrypt) (_gcry_cipher_poly1305_decrypt): Likewise. -- Splitting input which length is just above 24KiB is not benefical. Instead perform splitting if input is longer than 32KiB to ensure that last chunk is also a large buffer. Signed-off-by: Jussi Kivilinna --- cipher/chacha20.c | 12 ++++++++---- cipher/cipher-ccm.c | 12 ++++++++---- cipher/cipher-eax.c | 12 ++++++++---- cipher/cipher-gcm.c | 5 +++-- cipher/cipher-ocb.c | 7 ++++--- cipher/cipher-poly1305.c | 12 ++++++++---- 6 files changed, 39 insertions(+), 21 deletions(-) diff --git a/cipher/chacha20.c b/cipher/chacha20.c index 497594a0..870cfa18 100644 --- a/cipher/chacha20.c +++ b/cipher/chacha20.c @@ -969,8 +969,10 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, size_t currlen = length; /* Since checksumming is done after encryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for checksumming. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for checksumming. However + * only do splitting if input is large enough so that last chunks does + * not end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; nburn = do_chacha20_encrypt_stream_tail (ctx, outbuf, inbuf, currlen); @@ -1157,8 +1159,10 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, size_t currlen = length; /* Since checksumming is done before decryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for decryption. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for decryption. However only + * do splitting if input is large enough so that last chunks does not + * end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; nburn = _gcry_poly1305_update_burn (&c->u_mode.poly1305.ctx, inbuf, diff --git a/cipher/cipher-ccm.c b/cipher/cipher-ccm.c index dcb268d0..3e2a767a 100644 --- a/cipher/cipher-ccm.c +++ b/cipher/cipher-ccm.c @@ -345,8 +345,10 @@ _gcry_cipher_ccm_encrypt (gcry_cipher_hd_t c, unsigned char *outbuf, size_t currlen = inbuflen; /* Since checksumming is done before encryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for encryption. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for encryption. However only + * do splitting if input is large enough so that last chunks does not + * end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; c->u_mode.ccm.encryptlen -= currlen; @@ -391,8 +393,10 @@ _gcry_cipher_ccm_decrypt (gcry_cipher_hd_t c, unsigned char *outbuf, size_t currlen = inbuflen; /* Since checksumming is done after decryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for checksumming. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for checksumming. However + * only do splitting if input is large enough so that last chunks + * does not end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; err = _gcry_cipher_ctr_encrypt (c, outbuf, outbuflen, inbuf, currlen); diff --git a/cipher/cipher-eax.c b/cipher/cipher-eax.c index 08f815a9..0c5cf84e 100644 --- a/cipher/cipher-eax.c +++ b/cipher/cipher-eax.c @@ -53,8 +53,10 @@ _gcry_cipher_eax_encrypt (gcry_cipher_hd_t c, size_t currlen = inbuflen; /* Since checksumming is done after encryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for checksumming. 
*/ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for checksumming. However + * only do splitting if input is large enough so that last chunks does + * not end up being short.*/ + if (currlen > 32 * 1024) currlen = 24 * 1024; err = _gcry_cipher_ctr_encrypt (c, outbuf, outbuflen, inbuf, currlen); @@ -100,8 +102,10 @@ _gcry_cipher_eax_decrypt (gcry_cipher_hd_t c, size_t currlen = inbuflen; /* Since checksumming is done before decryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for decryption. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for decryption. However only + * do splitting if input is large enough so that last chunks does not + * end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; err = _gcry_cmac_write (c, &c->u_mode.eax.cmac_ciphertext, inbuf, diff --git a/cipher/cipher-gcm.c b/cipher/cipher-gcm.c index fc79986e..69ff0de6 100644 --- a/cipher/cipher-gcm.c +++ b/cipher/cipher-gcm.c @@ -888,8 +888,9 @@ gcm_crypt_inner (gcry_cipher_hd_t c, byte *outbuf, size_t outbuflen, /* Since checksumming is done after/before encryption/decryption, * process input in 24KiB chunks to keep data loaded in L1 cache for - * checksumming/decryption. */ - if (currlen > 24 * 1024) + * checksumming/decryption. However only do splitting if input is + * large enough so that last chunks does not end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; if (!encrypt) diff --git a/cipher/cipher-ocb.c b/cipher/cipher-ocb.c index bfafa4c8..7a4cfbe1 100644 --- a/cipher/cipher-ocb.c +++ b/cipher/cipher-ocb.c @@ -548,9 +548,10 @@ ocb_crypt (gcry_cipher_hd_t c, int encrypt, nblks = nblks < nmaxblks ? nblks : nmaxblks; /* Since checksum xoring is done before/after encryption/decryption, - process input in 24KiB chunks to keep data loaded in L1 cache for - checksumming. */ - if (nblks > 24 * 1024 / OCB_BLOCK_LEN) + process input in 24KiB chunks to keep data loaded in L1 cache for + checksumming. However only do splitting if input is large enough + so that last chunks does not end up being short. */ + if (nblks > 32 * 1024 / OCB_BLOCK_LEN) nblks = 24 * 1024 / OCB_BLOCK_LEN; /* Use a bulk method if available. */ diff --git a/cipher/cipher-poly1305.c b/cipher/cipher-poly1305.c index bb475236..5cd3561b 100644 --- a/cipher/cipher-poly1305.c +++ b/cipher/cipher-poly1305.c @@ -174,8 +174,10 @@ _gcry_cipher_poly1305_encrypt (gcry_cipher_hd_t c, size_t currlen = inbuflen; /* Since checksumming is done after encryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for checksumming. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for checksumming. However + * only do splitting if input is large enough so that last chunks does + * not end up being short. */ + if (currlen > 32 * 1024) currlen = 24 * 1024; c->spec->stencrypt(&c->context.c, outbuf, (byte*)inbuf, currlen); @@ -232,8 +234,10 @@ _gcry_cipher_poly1305_decrypt (gcry_cipher_hd_t c, size_t currlen = inbuflen; /* Since checksumming is done before decryption, process input in 24KiB - * chunks to keep data loaded in L1 cache for decryption. */ - if (currlen > 24 * 1024) + * chunks to keep data loaded in L1 cache for decryption. However only + * do splitting if input is large enough so that last chunks does not + * end up being short. 
*/ + if (currlen > 32 * 1024) currlen = 24 * 1024; _gcry_poly1305_update (&c->u_mode.poly1305.ctx, inbuf, currlen); -- 2.32.0 From tianjia.zhang at linux.alibaba.com Tue Feb 22 13:18:28 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 22 Feb 2022 20:18:28 +0800 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <20220222121828.4752-1-tianjia.zhang@linux.alibaba.com> References: <20220222121828.4752-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20220222121828.4752-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-aarch64.S'. * cipher/sm4-aarch64.S: New. * cipher/sm4.c (USE_AARCH64_SIMD): New. (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) (_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec) (_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8) (sm4_aarch64_crypt_blk1_8): New. (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]: Add ARMv8/AArch64 bulk functions. * configure.ac: Add 'sm4-aarch64.lo'. -- This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk functions process eight blocks in parallel. Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.81 MiB/s 33.28 c/B 2750 CBC dec | 7.19 ns/B 132.6 MiB/s 19.77 c/B 2750 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 7.24 ns/B 131.8 MiB/s 19.90 c/B 2750 CTR enc | 7.24 ns/B 131.7 MiB/s 19.90 c/B 2750 CTR dec | 7.24 ns/B 131.7 MiB/s 19.91 c/B 2750 GCM enc | 9.49 ns/B 100.4 MiB/s 26.11 c/B 2750 GCM dec | 9.49 ns/B 100.5 MiB/s 26.10 c/B 2750 GCM auth | 2.25 ns/B 423.1 MiB/s 6.20 c/B 2750 OCB enc | 7.35 ns/B 129.8 MiB/s 20.20 c/B 2750 OCB dec | 7.36 ns/B 129.6 MiB/s 20.23 c/B 2750 OCB auth | 7.29 ns/B 130.8 MiB/s 20.04 c/B 2749 After (~55% faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 2 +- cipher/sm4-aarch64.S | 642 +++++++++++++++++++++++++++++++++++++++++++ cipher/sm4.c | 129 +++++++++ configure.ac | 3 + 4 files changed, 775 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4-aarch64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 264b3d30..6c1c7693 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -116,7 +116,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ - sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S new file mode 100644 index 00000000..8d29be37 --- /dev/null +++ b/cipher/sm4-aarch64.S @@ 
-0,0 +1,642 @@ +/* sm4-aarch64.S - ARMv8/AArch64 accelerated SM4 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) && \ + defined(USE_SM4) + +.cpu generic+simd + +/* Constants */ + +.text +.align 16 +ELF(.type _gcry_sm4_aarch64_consts, at object) +_gcry_sm4_aarch64_consts: +.Lsm4_sbox: + .byte 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7 + .byte 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05 + .byte 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3 + .byte 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99 + .byte 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a + .byte 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62 + .byte 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95 + .byte 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6 + .byte 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba + .byte 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8 + .byte 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b + .byte 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35 + .byte 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2 + .byte 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87 + .byte 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52 + .byte 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e + .byte 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5 + .byte 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1 + .byte 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55 + .byte 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3 + .byte 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60 + .byte 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f + .byte 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f + .byte 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51 + .byte 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f + .byte 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8 + .byte 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd + .byte 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0 + .byte 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e + .byte 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84 + .byte 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20 + .byte 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 +ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) + +/* Register macros */ + +#define RTMP0 v8 +#define RTMP1 v9 +#define RTMP2 v10 +#define RTMP3 v11 + +#define RX0 v12 +#define RX1 v13 +#define RKEY v14 +#define RIV v15 + +/* Helper macros. 
*/ + +#define preload_sbox(ptr) \ + GET_DATA_POINTER(ptr, .Lsm4_sbox); \ + ld1 {v16.16b-v19.16b}, [ptr], #64; \ + ld1 {v20.16b-v23.16b}, [ptr], #64; \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define transpose_4x4(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s0.4s, s1.4s; \ + zip1 RTMP1.4s, s2.4s, s3.4s; \ + zip2 RTMP2.4s, s0.4s, s1.4s; \ + zip2 RTMP3.4s, s2.4s, s3.4s; \ + zip1 s0.2d, RTMP0.2d, RTMP1.2d; \ + zip2 s1.2d, RTMP0.2d, RTMP1.2d; \ + zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ + zip2 s3.2d, RTMP2.2d, RTMP3.2d; + +#define rotate_clockwise_90(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s1.4s, s0.4s; \ + zip2 RTMP1.4s, s1.4s, s0.4s; \ + zip1 RTMP2.4s, s3.4s, s2.4s; \ + zip2 RTMP3.4s, s3.4s, s2.4s; \ + zip1 s0.2d, RTMP2.2d, RTMP0.2d; \ + zip2 s1.2d, RTMP2.2d, RTMP0.2d; \ + zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ + zip2 s3.2d, RTMP3.2d, RTMP1.2d; + + +.align 3 +ELF(.type sm4_aarch64_crypt_blk1_4,%function;) +sm4_aarch64_crypt_blk1_4: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..4) + */ + CFI_STARTPROC(); + VPUSH_ABI; + + preload_sbox(x5); + + ld1 {v0.16b}, [x2], #16; + mov v1.16b, v0.16b; + mov v2.16b, v0.16b; + mov v3.16b, v0.16b; + cmp x3, #2; + blt .Lblk4_load_input_done; + ld1 {v1.16b}, [x2], #16; + beq .Lblk4_load_input_done; + ld1 {v2.16b}, [x2], #16; + cmp x3, #3; + beq .Lblk4_load_input_done; + ld1 {v3.16b}, [x2]; + +.Lblk4_load_input_done: + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + transpose_4x4(v0, v1, v2, v3); + +#define ROUND(round, s0, s1, s2, s3) \ + dup RX0.4s, RKEY.s[round]; \ + /* rk ^ s1 ^ s2 ^ s3 */ \ + eor RTMP1.16b, s2.16b, s3.16b; \ + eor RX0.16b, RX0.16b, s1.16b; \ + eor RX0.16b, RX0.16b, RTMP1.16b; \ + \ + /* sbox, non-linear part */ \ + movi RTMP3.16b, #64; /* sizeof(sbox) / 4 */ \ + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ + \ + /* linear part */ \ + shl RTMP1.4s, RTMP0.4s, #8; \ + shl RTMP2.4s, RTMP0.4s, #16; \ + shl RTMP3.4s, RTMP0.4s, #24; \ + sri RTMP1.4s, RTMP0.4s, #(32-8); \ + sri RTMP2.4s, RTMP0.4s, #(32-16); \ + sri RTMP3.4s, RTMP0.4s, #(32-24); \ + /* RTMP1 = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ + eor RTMP1.16b, RTMP1.16b, RTMP0.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP2.16b; \ + /* RTMP3 = x ^ rol32(x, 24) ^ rol32(RTMP1, 2) */ \ + eor RTMP3.16b, RTMP3.16b, RTMP0.16b; \ + shl RTMP2.4s, RTMP1.4s, 2; \ + sri RTMP2.4s, RTMP1.4s, #(32-2); \ + eor RTMP3.16b, RTMP3.16b, RTMP2.16b; \ + /* s0 ^= RTMP3 */ \ + eor s0.16b, s0.16b, RTMP3.16b; + + mov x6, 8; +.Lroundloop4: + ld1 {RKEY.4s}, [x0], #16; + subs x6, x6, #1; + + ROUND(0, v0, v1, v2, v3); + ROUND(1, v1, v2, v3, v0); + ROUND(2, v2, v3, v0, v1); + ROUND(3, v3, v0, v1, v2); + + bne .Lroundloop4; + +#undef ROUND + + rotate_clockwise_90(v0, v1, v2, v3); + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + st1 {v0.16b}, [x1], #16; + cmp x3, #2; + blt .Lblk4_store_output_done; + st1 {v1.16b}, [x1], #16; + beq .Lblk4_store_output_done; + st1 {v2.16b}, [x1], #16; + cmp x3, #3; + beq .Lblk4_store_output_done; + st1 {v3.16b}, [x1]; + +.Lblk4_store_output_done: + VPOP_ABI; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_aarch64_crypt_blk1_4,.-sm4_aarch64_crypt_blk1_4;) + +.align 3 +ELF(.type __sm4_crypt_blk8,%function;) 
+__sm4_crypt_blk8: + /* input: + * x0: round key array, CTX + * v16-v31: fill with sbox + * v0, v1, v2, v3, v4, v5, v6, v7: eight parallel plaintext blocks + * output: + * v0, v1, v2, v3, v4, v5, v6, v7: eight parallel ciphertext blocks + */ + CFI_STARTPROC(); + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + rev32 v4.16b, v4.16b; + rev32 v5.16b, v5.16b; + rev32 v6.16b, v6.16b; + rev32 v7.16b, v7.16b; + + transpose_4x4(v0, v1, v2, v3); + transpose_4x4(v4, v5, v6, v7); + +#define ROUND(round, s0, s1, s2, s3, t0, t1, t2, t3) \ + /* rk ^ s1 ^ s2 ^ s3 */ \ + dup RX0.4s, RKEY.s[round]; \ + eor RTMP0.16b, s2.16b, s3.16b; \ + mov RX1.16b, RX0.16b; \ + eor RTMP1.16b, t2.16b, t3.16b; \ + eor RX0.16b, RX0.16b, s1.16b; \ + eor RX1.16b, RX1.16b, t1.16b; \ + eor RX0.16b, RX0.16b, RTMP0.16b; \ + eor RX1.16b, RX1.16b, RTMP1.16b; \ + \ + /* sbox, non-linear part */ \ + movi RTMP3.16b, #64; /* sizeof(sbox) / 4 */ \ + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ + tbl RTMP1.16b, {v16.16b-v19.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ + tbx RTMP1.16b, {v20.16b-v23.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ + tbx RTMP1.16b, {v24.16b-v27.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ + tbx RTMP1.16b, {v28.16b-v31.16b}, RX1.16b; \ + \ + /* linear part */ \ + shl RX0.4s, RTMP0.4s, #8; \ + shl RX1.4s, RTMP1.4s, #8; \ + shl RTMP2.4s, RTMP0.4s, #16; \ + shl RTMP3.4s, RTMP1.4s, #16; \ + sri RX0.4s, RTMP0.4s, #(32 - 8); \ + sri RX1.4s, RTMP1.4s, #(32 - 8); \ + sri RTMP2.4s, RTMP0.4s, #(32 - 16); \ + sri RTMP3.4s, RTMP1.4s, #(32 - 16); \ + /* RX = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ + eor RX0.16b, RX0.16b, RTMP0.16b; \ + eor RX1.16b, RX1.16b, RTMP1.16b; \ + eor RX0.16b, RX0.16b, RTMP2.16b; \ + eor RX1.16b, RX1.16b, RTMP3.16b; \ + /* RTMP0/1 ^= x ^ rol32(x, 24) ^ rol32(RX, 2) */ \ + shl RTMP2.4s, RTMP0.4s, #24; \ + shl RTMP3.4s, RTMP1.4s, #24; \ + sri RTMP2.4s, RTMP0.4s, #(32 - 24); \ + sri RTMP3.4s, RTMP1.4s, #(32 - 24); \ + eor RTMP0.16b, RTMP0.16b, RTMP2.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP3.16b; \ + shl RTMP2.4s, RX0.4s, #2; \ + shl RTMP3.4s, RX1.4s, #2; \ + sri RTMP2.4s, RX0.4s, #(32 - 2); \ + sri RTMP3.4s, RX1.4s, #(32 - 2); \ + eor RTMP0.16b, RTMP0.16b, RTMP2.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP3.16b; \ + /* s0/t0 ^= RTMP0/1 */ \ + eor s0.16b, s0.16b, RTMP0.16b; \ + eor t0.16b, t0.16b, RTMP1.16b; + + mov x6, 8; +.Lroundloop8: + ld1 {RKEY.4s}, [x0], #16; + subs x6, x6, #1; + + ROUND(0, v0, v1, v2, v3, v4, v5, v6, v7); + ROUND(1, v1, v2, v3, v0, v5, v6, v7, v4); + ROUND(2, v2, v3, v0, v1, v6, v7, v4, v5); + ROUND(3, v3, v0, v1, v2, v7, v4, v5, v6); + + bne .Lroundloop8; + +#undef ROUND + + rotate_clockwise_90(v0, v1, v2, v3); + rotate_clockwise_90(v4, v5, v6, v7); + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + rev32 v4.16b, v4.16b; + rev32 v5.16b, v5.16b; + rev32 v6.16b, v6.16b; + rev32 v7.16b, v7.16b; + + sub x0, x0, #128; /* repoint to rkey */ + ret; + CFI_ENDPROC(); +ELF(.size __sm4_crypt_blk8,.-__sm4_crypt_blk8;) + +.align 3 +.global _gcry_sm4_aarch64_crypt_blk1_8 +ELF(.type _gcry_sm4_aarch64_crypt_blk1_8,%function;) +_gcry_sm4_aarch64_crypt_blk1_8: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num 
blocks (1..8) + */ + CFI_STARTPROC(); + + cmp x3, #5; + blt sm4_aarch64_crypt_blk1_4; + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b}, [x2], #16; + mov v5.16b, v4.16b; + mov v6.16b, v4.16b; + mov v7.16b, v4.16b; + beq .Lblk8_load_input_done; + ld1 {v5.16b}, [x2], #16; + cmp x3, #7; + blt .Lblk8_load_input_done; + ld1 {v6.16b}, [x2], #16; + beq .Lblk8_load_input_done; + ld1 {v7.16b}, [x2]; + +.Lblk8_load_input_done: + bl __sm4_crypt_blk8; + + cmp x3, #6; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b}, [x1], #16; + blt .Lblk8_store_output_done; + st1 {v5.16b}, [x1], #16; + beq .Lblk8_store_output_done; + st1 {v6.16b}, [x1], #16; + cmp x3, #7; + beq .Lblk8_store_output_done; + st1 {v7.16b}, [x1]; + +.Lblk8_store_output_done: + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_crypt_blk1_8,.-_gcry_sm4_aarch64_crypt_blk1_8;) + + +.align 3 +.global _gcry_sm4_aarch64_crypt +ELF(.type _gcry_sm4_aarch64_crypt,%function;) +_gcry_sm4_aarch64_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + +.Lcrypt_loop_blk: + subs x3, x3, #8; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2], #64; + bl __sm4_crypt_blk8; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1], #64; + b .Lcrypt_loop_blk; + +.Lcrypt_end: + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_crypt,.-_gcry_sm4_aarch64_crypt;) + + +.align 3 +.global _gcry_sm4_aarch64_cbc_dec +ELF(.type _gcry_sm4_aarch64_cbc_dec,%function;) +_gcry_sm4_aarch64_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + ld1 {RIV.16b}, [x3]; + +.Lcbc_loop_blk: + subs x4, x4, #8; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2]; + + bl __sm4_crypt_blk8; + + sub x2, x2, #64; + eor v0.16b, v0.16b, RIV.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + eor v4.16b, v4.16b, RTMP3.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v5.16b, v5.16b, RTMP0.16b; + eor v6.16b, v6.16b, RTMP1.16b; + eor v7.16b, v7.16b, RTMP2.16b; + + mov RIV.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcbc_loop_blk; + +.Lcbc_end: + /* store new IV */ + st1 {RIV.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cbc_dec,.-_gcry_sm4_aarch64_cbc_dec;) + +.align 3 +.global _gcry_sm4_aarch64_cfb_dec +ELF(.type _gcry_sm4_aarch64_cfb_dec,%function;) +_gcry_sm4_aarch64_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks 
(multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk: + subs x4, x4, #8; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; + ld1 {v4.16b-v7.16b}, [x2]; + + bl __sm4_crypt_blk8; + + sub x2, x2, #48; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cfb_dec,.-_gcry_sm4_aarch64_cfb_dec;) + +.align 3 +.global _gcry_sm4_aarch64_ctr_enc +ELF(.type _gcry_sm4_aarch64_ctr_enc,%function;) +_gcry_sm4_aarch64_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk: + subs x4, x4, #8; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + inc_le128(v4); /* +4 */ + inc_le128(v5); /* +5 */ + inc_le128(v6); /* +6 */ + inc_le128(v7); /* +7 */ + + bl __sm4_crypt_blk8; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lctr_loop_blk; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_ctr_enc,.-_gcry_sm4_aarch64_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index 81662988..ec2281b6 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -67,6 +67,15 @@ # endif #endif +#undef USE_AARCH64_SIMD +#ifdef ENABLE_NEON_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) +# define USE_AARCH64_SIMD 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -94,6 +103,9 @@ typedef struct #ifdef USE_AESNI_AVX2 unsigned int use_aesni_avx2:1; #endif +#ifdef USE_AARCH64_SIMD + unsigned int use_aarch64_simd:1; +#endif } SM4_context; static const u32 fk[4] = @@ -241,6 +253,39 @@ extern void 
_gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, const u64 Ls[16]) ASM_FUNC_ABI; #endif /* USE_AESNI_AVX2 */ +#ifdef USE_AARCH64_SIMD +extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +extern void _gcry_sm4_aarch64_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + size_t nblocks); + +extern void _gcry_sm4_aarch64_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_aarch64_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +static inline unsigned int +sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) +{ + _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, (size_t)num_blks); + return 0; +} +#endif /* USE_AARCH64_SIMD */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -372,6 +417,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AESNI_AVX2 ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_AARCH64_SIMD + ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -553,6 +601,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -568,6 +633,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -654,6 +725,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -669,6 +757,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -748,6 +842,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. 
*/ if (nblocks) { @@ -763,6 +874,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -919,6 +1036,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -1079,6 +1202,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { diff --git a/configure.ac b/configure.ac index ea01f5a6..89df9434 100644 --- a/configure.ac +++ b/configure.ac @@ -2740,6 +2740,9 @@ if test "$found" = "1" ; then GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx-amd64.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx2-amd64.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" esac fi -- 2.34.1 From tianjia.zhang at linux.alibaba.com Tue Feb 22 13:18:27 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 22 Feb 2022 20:18:27 +0800 Subject: [PATCH v2 1/2] Move VPUSH_API/VPOP_API macros to common header Message-ID: <20220222121828.4752-1-tianjia.zhang@linux.alibaba.com> * cipher/asm-common-aarch64.h: Add VPUSH_API/VPOP_API/CLEAR_REG macros. * cipher/cipher-gcm-armv8-aarch64-ce.S: Remove common macros. -- Signed-off-by: Tianjia Zhang --- cipher/asm-common-aarch64.h | 22 ++++++++++++++++++++++ cipher/cipher-gcm-armv8-aarch64-ce.S | 22 ---------------------- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h index 451539e8..d3f7801c 100644 --- a/cipher/asm-common-aarch64.h +++ b/cipher/asm-common-aarch64.h @@ -105,4 +105,26 @@ #define ret_spec_stop \ ret; dsb sy; isb; +#define CLEAR_REG(reg) movi reg.16b, #0; + +#define VPUSH_ABI \ + stp d8, d9, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d10, d11, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d12, d13, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d14, d15, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); + +#define VPOP_ABI \ + ldp d14, d15, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d12, d13, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d10, d11, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d8, d9, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); + #endif /* GCRY_ASM_COMMON_AARCH64_H */ diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S index e6714249..687fabe3 100644 --- a/cipher/cipher-gcm-armv8-aarch64-ce.S +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S @@ -149,28 +149,6 @@ gcry_gcm_reduction_constant: #define _(...) 
__VA_ARGS__ #define __ _() -#define CLEAR_REG(reg) movi reg.16b, #0; - -#define VPUSH_ABI \ - stp d8, d9, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d10, d11, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d12, d13, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d14, d15, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); - -#define VPOP_ABI \ - ldp d14, d15, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d12, d13, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d10, d11, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d8, d9, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); - /* * unsigned int _gcry_ghash_armv8_ce_pmull (void *gcm_key, byte *result, * const byte *buf, size_t nblocks, -- 2.34.1 From cllang at redhat.com Tue Feb 22 14:30:19 2022 From: cllang at redhat.com (Clemens Lang) Date: Tue, 22 Feb 2022 14:30:19 +0100 Subject: libgcrypt GitLab CI mirror Message-ID: Hi everyone, Over the last weeks, I built a CI system for libgcrypt on a mirror of the libgcrypt repository at https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/ I hope that this can help to get the master branch tested in different environments and with different compilers before distributions do it and thus shorten the feedback cycle between upstream and downstream. The mirror updates every day at 4 AM UTC, with builds run on CentOS Stream 8 and 9, Fedora, and Ubuntu, with both GCC and clang. Some builds use AddressSanitizer, LeakSanitizer, and UndefinedBehaviorSanitizer. The latest coverage results based on x86_64 get pushed to https://redhat-crypto.gitlab.io/libgcrypt/libgcrypt-mirror/ For more information and links to the builds results, see the README.md in the repository: https://gitlab.com/redhat-crypto/libgcrypt/libgcrypt-mirror/-/blob/master/README.md HTH, Clemens -- Clemens Lang RHEL Crypto Team Red Hat From jussi.kivilinna at iki.fi Tue Feb 22 19:22:45 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 22 Feb 2022 20:22:45 +0200 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <20220222121828.4752-2-tianjia.zhang@linux.alibaba.com> References: <20220222121828.4752-1-tianjia.zhang@linux.alibaba.com> <20220222121828.4752-2-tianjia.zhang@linux.alibaba.com> Message-ID: <4ebbfaf2-c375-55c5-2766-c740378c6837@iki.fi> On 22.2.2022 14.18, Tianjia Zhang wrote: > * cipher/Makefile.am: Add 'sm4-aarch64.S'. > * cipher/sm4-aarch64.S: New. > * cipher/sm4.c (USE_AARCH64_SIMD): New. > (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. > [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) > (_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec) > (_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8) > (sm4_aarch64_crypt_blk1_8): New. > (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. > (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) > (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]: > Add ARMv8/AArch64 bulk functions. > * configure.ac: Add 'sm4-aarch64.lo'. > -- > > This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk > functions process eight blocks in parallel. 
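The bulk code is reached through the ordinary libgcrypt cipher API; no new
public interface is involved. A minimal sketch of driving SM4-CTR over a
buffer large enough to hit the eight-block path, where the key, counter and
buffer contents are placeholders rather than values taken from the patch:

  #include <stdio.h>
  #include <string.h>
  #include <gcrypt.h>

  int
  main (void)
  {
    gcry_cipher_hd_t hd;
    gcry_error_t err;
    unsigned char key[16] = { 0 };     /* placeholder key */
    unsigned char ctr[16] = { 0 };     /* placeholder counter */
    unsigned char buf[64 * 16];        /* 64 blocks, well past the 8-block cutoff */

    if (!gcry_check_version (NULL))
      return 1;
    gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);

    memset (buf, 0xa5, sizeof buf);

    err = gcry_cipher_open (&hd, GCRY_CIPHER_SM4, GCRY_CIPHER_MODE_CTR, 0);
    if (!err)
      err = gcry_cipher_setkey (hd, key, sizeof key);
    if (!err)
      err = gcry_cipher_setctr (hd, ctr, sizeof ctr);
    if (!err)
      err = gcry_cipher_encrypt (hd, buf, sizeof buf, NULL, 0);  /* in-place */
    if (err)
      fprintf (stderr, "SM4-CTR: %s\n", gcry_strerror (err));
    gcry_cipher_close (hd);
    return err ? 1 : 0;
  }
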
> > Benchmark on T-Head Yitian-710 2.75 GHz: > > Before: > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > CBC enc | 12.10 ns/B 78.81 MiB/s 33.28 c/B 2750 > CBC dec | 7.19 ns/B 132.6 MiB/s 19.77 c/B 2750 > CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 > CFB dec | 7.24 ns/B 131.8 MiB/s 19.90 c/B 2750 > CTR enc | 7.24 ns/B 131.7 MiB/s 19.90 c/B 2750 > CTR dec | 7.24 ns/B 131.7 MiB/s 19.91 c/B 2750 > GCM enc | 9.49 ns/B 100.4 MiB/s 26.11 c/B 2750 > GCM dec | 9.49 ns/B 100.5 MiB/s 26.10 c/B 2750 > GCM auth | 2.25 ns/B 423.1 MiB/s 6.20 c/B 2750 > OCB enc | 7.35 ns/B 129.8 MiB/s 20.20 c/B 2750 > OCB dec | 7.36 ns/B 129.6 MiB/s 20.23 c/B 2750 > OCB auth | 7.29 ns/B 130.8 MiB/s 20.04 c/B 2749 > > After (~55% faster): > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 > CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 > CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 > CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 > CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 > CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 > GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 > GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 > GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 > OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 > OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 > OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 > > Signed-off-by: Tianjia Zhang > --- > cipher/Makefile.am | 2 +- > cipher/sm4-aarch64.S | 642 +++++++++++++++++++++++++++++++++++++++++++ > cipher/sm4.c | 129 +++++++++ > configure.ac | 3 + > 4 files changed, 775 insertions(+), 1 deletion(-) > create mode 100644 cipher/sm4-aarch64.S > > diff --git a/cipher/Makefile.am b/cipher/Makefile.am > index 264b3d30..6c1c7693 100644 > --- a/cipher/Makefile.am > +++ b/cipher/Makefile.am > @@ -116,7 +116,7 @@ EXTRA_libcipher_la_SOURCES = \ > scrypt.c \ > seed.c \ > serpent.c serpent-sse2-amd64.S \ > - sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ > + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ > serpent-avx2-amd64.S serpent-armv7-neon.S \ > sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ > sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ > diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S > new file mode 100644 > index 00000000..8d29be37 > --- /dev/null > +++ b/cipher/sm4-aarch64.S > @@ -0,0 +1,642 @@ > +/* sm4-aarch64.S - ARMv8/AArch64 accelerated SM4 cipher > + * > + * Copyright (C) 2022 Alibaba Group. > + * Copyright (C) 2022 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . 
> + */ > + > +#include "asm-common-aarch64.h" > + > +#if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) && \ > + defined(USE_SM4) > + > +.cpu generic+simd > + > +/* Constants */ > + > +.text > +.align 16 Alignment to 65536 bytes seems excessive. Did you mean to use 16-byte aligment here (".align 4" or ".balign 16")? Otherwise patches look good. On Cortex-A53, ~36% performance improvement seen for CFB/CBC/CTR/OCB. -Jussi From tianjia.zhang at linux.alibaba.com Wed Feb 23 04:33:34 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 23 Feb 2022 11:33:34 +0800 Subject: [PATCH v2 2/2] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <4ebbfaf2-c375-55c5-2766-c740378c6837@iki.fi> References: <20220222121828.4752-1-tianjia.zhang@linux.alibaba.com> <20220222121828.4752-2-tianjia.zhang@linux.alibaba.com> <4ebbfaf2-c375-55c5-2766-c740378c6837@iki.fi> Message-ID: Hi Jussi, On 2/23/22 2:22 AM, Jussi Kivilinna wrote: > On 22.2.2022 14.18, Tianjia Zhang wrote: >> * cipher/Makefile.am: Add 'sm4-aarch64.S'. >> + >> +/* Constants */ >> + >> +.text >> +.align 16 > > Alignment to 65536 bytes seems excessive. Did you mean to use > 16-byte aligment here (".align 4" or ".balign 16")? Thanks for pointing it out, you're right, I meant for 16 byte alignment, I'll fix it in v3 patch. > > Otherwise patches look good. On Cortex-A53, ~36% performance > improvement seen for CFB/CBC/CTR/OCB. Great job, thanks for the test data. Kind regards, Tianjia From tianjia.zhang at linux.alibaba.com Wed Feb 23 05:23:58 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 23 Feb 2022 12:23:58 +0800 Subject: [PATCH v3 1/2] Move VPUSH_API/VPOP_API macros to common header Message-ID: <20220223042359.7765-1-tianjia.zhang@linux.alibaba.com> * cipher/asm-common-aarch64.h: Add VPUSH_API/VPOP_API/CLEAR_REG macros. * cipher/cipher-gcm-armv8-aarch64-ce.S: Remove common macros. -- Signed-off-by: Tianjia Zhang --- cipher/asm-common-aarch64.h | 22 ++++++++++++++++++++++ cipher/cipher-gcm-armv8-aarch64-ce.S | 22 ---------------------- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h index 451539e8..d3f7801c 100644 --- a/cipher/asm-common-aarch64.h +++ b/cipher/asm-common-aarch64.h @@ -105,4 +105,26 @@ #define ret_spec_stop \ ret; dsb sy; isb; +#define CLEAR_REG(reg) movi reg.16b, #0; + +#define VPUSH_ABI \ + stp d8, d9, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d10, d11, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d12, d13, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + stp d14, d15, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); + +#define VPOP_ABI \ + ldp d14, d15, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d12, d13, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d10, d11, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + ldp d8, d9, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); + #endif /* GCRY_ASM_COMMON_AARCH64_H */ diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S index e6714249..687fabe3 100644 --- a/cipher/cipher-gcm-armv8-aarch64-ce.S +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S @@ -149,28 +149,6 @@ gcry_gcm_reduction_constant: #define _(...) 
__VA_ARGS__ #define __ _() -#define CLEAR_REG(reg) movi reg.16b, #0; - -#define VPUSH_ABI \ - stp d8, d9, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d10, d11, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d12, d13, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); \ - stp d14, d15, [sp, #-16]!; \ - CFI_ADJUST_CFA_OFFSET(16); - -#define VPOP_ABI \ - ldp d14, d15, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d12, d13, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d10, d11, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); \ - ldp d8, d9, [sp], #16; \ - CFI_ADJUST_CFA_OFFSET(-16); - /* * unsigned int _gcry_ghash_armv8_ce_pmull (void *gcm_key, byte *result, * const byte *buf, size_t nblocks, -- 2.34.1 From tianjia.zhang at linux.alibaba.com Wed Feb 23 05:23:59 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Wed, 23 Feb 2022 12:23:59 +0800 Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <20220223042359.7765-1-tianjia.zhang@linux.alibaba.com> References: <20220223042359.7765-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20220223042359.7765-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-aarch64.S'. * cipher/sm4-aarch64.S: New. * cipher/sm4.c (USE_AARCH64_SIMD): New. (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) (_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec) (_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8) (sm4_aarch64_crypt_blk1_8): New. (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]: Add ARMv8/AArch64 bulk functions. * configure.ac: Add 'sm4-aarch64.lo'. -- This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk functions process eight blocks in parallel. 
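The split mirrors the existing AES-NI/AVX paths in sm4.c: whole multiples of
eight blocks go to the assembly routine, and the trailing 1..7 blocks fall
back to the crypt_blk1_8 helper. A condensed sketch of that split for the
CTR case follows; ctr_enc_split and bulk_ctr_enc are stand-in names (the
latter for _gcry_sm4_aarch64_ctr_enc), and the tail handling via
sm4_aarch64_crypt_blk1_8 is only indicated in a comment:

  #include <stddef.h>
  #include <stdint.h>

  typedef uint8_t byte;

  /* Stand-in for the assembly routine _gcry_sm4_aarch64_ctr_enc().  */
  extern void bulk_ctr_enc (const uint32_t *rk, byte *out, const byte *in,
                            byte *ctr, size_t nblocks);

  void
  ctr_enc_split (const uint32_t *rk, byte *out, const byte *in,
                 byte *ctr, size_t nblocks)
  {
    if (nblocks >= 8)
      {
        size_t nblks = nblocks & ~(size_t)(8 - 1);  /* largest multiple of 8 */

        bulk_ctr_enc (rk, out, in, ctr, nblks);

        out += nblks * 16;        /* SM4 block size is 16 bytes */
        in  += nblks * 16;
        nblocks -= nblks;
      }

    /* 1..7 blocks may remain; sm4.c generates those keystream blocks
       through the crypt_blk1_8 callback and XORs them in.  */
  }
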
Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.81 MiB/s 33.28 c/B 2750 CBC dec | 7.19 ns/B 132.6 MiB/s 19.77 c/B 2750 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 7.24 ns/B 131.8 MiB/s 19.90 c/B 2750 CTR enc | 7.24 ns/B 131.7 MiB/s 19.90 c/B 2750 CTR dec | 7.24 ns/B 131.7 MiB/s 19.91 c/B 2750 GCM enc | 9.49 ns/B 100.4 MiB/s 26.11 c/B 2750 GCM dec | 9.49 ns/B 100.5 MiB/s 26.10 c/B 2750 GCM auth | 2.25 ns/B 423.1 MiB/s 6.20 c/B 2750 OCB enc | 7.35 ns/B 129.8 MiB/s 20.20 c/B 2750 OCB dec | 7.36 ns/B 129.6 MiB/s 20.23 c/B 2750 OCB auth | 7.29 ns/B 130.8 MiB/s 20.04 c/B 2749 After (~55% faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 2 +- cipher/sm4-aarch64.S | 642 +++++++++++++++++++++++++++++++++++++++++++ cipher/sm4.c | 129 +++++++++ configure.ac | 3 + 4 files changed, 775 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4-aarch64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 264b3d30..6c1c7693 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -116,7 +116,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ - sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S new file mode 100644 index 00000000..306b425e --- /dev/null +++ b/cipher/sm4-aarch64.S @@ -0,0 +1,642 @@ +/* sm4-aarch64.S - ARMv8/AArch64 accelerated SM4 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) && \ + defined(USE_SM4) + +.cpu generic+simd + +/* Constants */ + +.text +.align 4 +ELF(.type _gcry_sm4_aarch64_consts, at object) +_gcry_sm4_aarch64_consts: +.Lsm4_sbox: + .byte 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7 + .byte 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05 + .byte 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3 + .byte 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99 + .byte 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a + .byte 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62 + .byte 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95 + .byte 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6 + .byte 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba + .byte 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8 + .byte 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b + .byte 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35 + .byte 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2 + .byte 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87 + .byte 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52 + .byte 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e + .byte 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5 + .byte 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1 + .byte 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55 + .byte 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3 + .byte 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60 + .byte 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f + .byte 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f + .byte 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51 + .byte 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f + .byte 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8 + .byte 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd + .byte 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0 + .byte 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e + .byte 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84 + .byte 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20 + .byte 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 +ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) + +/* Register macros */ + +#define RTMP0 v8 +#define RTMP1 v9 +#define RTMP2 v10 +#define RTMP3 v11 + +#define RX0 v12 +#define RX1 v13 +#define RKEY v14 +#define RIV v15 + +/* Helper macros. 
*/ + +#define preload_sbox(ptr) \ + GET_DATA_POINTER(ptr, .Lsm4_sbox); \ + ld1 {v16.16b-v19.16b}, [ptr], #64; \ + ld1 {v20.16b-v23.16b}, [ptr], #64; \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define transpose_4x4(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s0.4s, s1.4s; \ + zip1 RTMP1.4s, s2.4s, s3.4s; \ + zip2 RTMP2.4s, s0.4s, s1.4s; \ + zip2 RTMP3.4s, s2.4s, s3.4s; \ + zip1 s0.2d, RTMP0.2d, RTMP1.2d; \ + zip2 s1.2d, RTMP0.2d, RTMP1.2d; \ + zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ + zip2 s3.2d, RTMP2.2d, RTMP3.2d; + +#define rotate_clockwise_90(s0, s1, s2, s3) \ + zip1 RTMP0.4s, s1.4s, s0.4s; \ + zip2 RTMP1.4s, s1.4s, s0.4s; \ + zip1 RTMP2.4s, s3.4s, s2.4s; \ + zip2 RTMP3.4s, s3.4s, s2.4s; \ + zip1 s0.2d, RTMP2.2d, RTMP0.2d; \ + zip2 s1.2d, RTMP2.2d, RTMP0.2d; \ + zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ + zip2 s3.2d, RTMP3.2d, RTMP1.2d; + + +.align 3 +ELF(.type sm4_aarch64_crypt_blk1_4,%function;) +sm4_aarch64_crypt_blk1_4: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..4) + */ + CFI_STARTPROC(); + VPUSH_ABI; + + preload_sbox(x5); + + ld1 {v0.16b}, [x2], #16; + mov v1.16b, v0.16b; + mov v2.16b, v0.16b; + mov v3.16b, v0.16b; + cmp x3, #2; + blt .Lblk4_load_input_done; + ld1 {v1.16b}, [x2], #16; + beq .Lblk4_load_input_done; + ld1 {v2.16b}, [x2], #16; + cmp x3, #3; + beq .Lblk4_load_input_done; + ld1 {v3.16b}, [x2]; + +.Lblk4_load_input_done: + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + transpose_4x4(v0, v1, v2, v3); + +#define ROUND(round, s0, s1, s2, s3) \ + dup RX0.4s, RKEY.s[round]; \ + /* rk ^ s1 ^ s2 ^ s3 */ \ + eor RTMP1.16b, s2.16b, s3.16b; \ + eor RX0.16b, RX0.16b, s1.16b; \ + eor RX0.16b, RX0.16b, RTMP1.16b; \ + \ + /* sbox, non-linear part */ \ + movi RTMP3.16b, #64; /* sizeof(sbox) / 4 */ \ + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ + \ + /* linear part */ \ + shl RTMP1.4s, RTMP0.4s, #8; \ + shl RTMP2.4s, RTMP0.4s, #16; \ + shl RTMP3.4s, RTMP0.4s, #24; \ + sri RTMP1.4s, RTMP0.4s, #(32-8); \ + sri RTMP2.4s, RTMP0.4s, #(32-16); \ + sri RTMP3.4s, RTMP0.4s, #(32-24); \ + /* RTMP1 = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ + eor RTMP1.16b, RTMP1.16b, RTMP0.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP2.16b; \ + /* RTMP3 = x ^ rol32(x, 24) ^ rol32(RTMP1, 2) */ \ + eor RTMP3.16b, RTMP3.16b, RTMP0.16b; \ + shl RTMP2.4s, RTMP1.4s, 2; \ + sri RTMP2.4s, RTMP1.4s, #(32-2); \ + eor RTMP3.16b, RTMP3.16b, RTMP2.16b; \ + /* s0 ^= RTMP3 */ \ + eor s0.16b, s0.16b, RTMP3.16b; + + mov x6, 8; +.Lroundloop4: + ld1 {RKEY.4s}, [x0], #16; + subs x6, x6, #1; + + ROUND(0, v0, v1, v2, v3); + ROUND(1, v1, v2, v3, v0); + ROUND(2, v2, v3, v0, v1); + ROUND(3, v3, v0, v1, v2); + + bne .Lroundloop4; + +#undef ROUND + + rotate_clockwise_90(v0, v1, v2, v3); + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + + st1 {v0.16b}, [x1], #16; + cmp x3, #2; + blt .Lblk4_store_output_done; + st1 {v1.16b}, [x1], #16; + beq .Lblk4_store_output_done; + st1 {v2.16b}, [x1], #16; + cmp x3, #3; + beq .Lblk4_store_output_done; + st1 {v3.16b}, [x1]; + +.Lblk4_store_output_done: + VPOP_ABI; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_aarch64_crypt_blk1_4,.-sm4_aarch64_crypt_blk1_4;) + +.align 3 +ELF(.type __sm4_crypt_blk8,%function;) 
+__sm4_crypt_blk8: + /* input: + * x0: round key array, CTX + * v16-v31: fill with sbox + * v0, v1, v2, v3, v4, v5, v6, v7: eight parallel plaintext blocks + * output: + * v0, v1, v2, v3, v4, v5, v6, v7: eight parallel ciphertext blocks + */ + CFI_STARTPROC(); + + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + rev32 v4.16b, v4.16b; + rev32 v5.16b, v5.16b; + rev32 v6.16b, v6.16b; + rev32 v7.16b, v7.16b; + + transpose_4x4(v0, v1, v2, v3); + transpose_4x4(v4, v5, v6, v7); + +#define ROUND(round, s0, s1, s2, s3, t0, t1, t2, t3) \ + /* rk ^ s1 ^ s2 ^ s3 */ \ + dup RX0.4s, RKEY.s[round]; \ + eor RTMP0.16b, s2.16b, s3.16b; \ + mov RX1.16b, RX0.16b; \ + eor RTMP1.16b, t2.16b, t3.16b; \ + eor RX0.16b, RX0.16b, s1.16b; \ + eor RX1.16b, RX1.16b, t1.16b; \ + eor RX0.16b, RX0.16b, RTMP0.16b; \ + eor RX1.16b, RX1.16b, RTMP1.16b; \ + \ + /* sbox, non-linear part */ \ + movi RTMP3.16b, #64; /* sizeof(sbox) / 4 */ \ + tbl RTMP0.16b, {v16.16b-v19.16b}, RX0.16b; \ + tbl RTMP1.16b, {v16.16b-v19.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v20.16b-v23.16b}, RX0.16b; \ + tbx RTMP1.16b, {v20.16b-v23.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v24.16b-v27.16b}, RX0.16b; \ + tbx RTMP1.16b, {v24.16b-v27.16b}, RX1.16b; \ + sub RX0.16b, RX0.16b, RTMP3.16b; \ + sub RX1.16b, RX1.16b, RTMP3.16b; \ + tbx RTMP0.16b, {v28.16b-v31.16b}, RX0.16b; \ + tbx RTMP1.16b, {v28.16b-v31.16b}, RX1.16b; \ + \ + /* linear part */ \ + shl RX0.4s, RTMP0.4s, #8; \ + shl RX1.4s, RTMP1.4s, #8; \ + shl RTMP2.4s, RTMP0.4s, #16; \ + shl RTMP3.4s, RTMP1.4s, #16; \ + sri RX0.4s, RTMP0.4s, #(32 - 8); \ + sri RX1.4s, RTMP1.4s, #(32 - 8); \ + sri RTMP2.4s, RTMP0.4s, #(32 - 16); \ + sri RTMP3.4s, RTMP1.4s, #(32 - 16); \ + /* RX = x ^ rol32(x, 8) ^ rol32(x, 16) */ \ + eor RX0.16b, RX0.16b, RTMP0.16b; \ + eor RX1.16b, RX1.16b, RTMP1.16b; \ + eor RX0.16b, RX0.16b, RTMP2.16b; \ + eor RX1.16b, RX1.16b, RTMP3.16b; \ + /* RTMP0/1 ^= x ^ rol32(x, 24) ^ rol32(RX, 2) */ \ + shl RTMP2.4s, RTMP0.4s, #24; \ + shl RTMP3.4s, RTMP1.4s, #24; \ + sri RTMP2.4s, RTMP0.4s, #(32 - 24); \ + sri RTMP3.4s, RTMP1.4s, #(32 - 24); \ + eor RTMP0.16b, RTMP0.16b, RTMP2.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP3.16b; \ + shl RTMP2.4s, RX0.4s, #2; \ + shl RTMP3.4s, RX1.4s, #2; \ + sri RTMP2.4s, RX0.4s, #(32 - 2); \ + sri RTMP3.4s, RX1.4s, #(32 - 2); \ + eor RTMP0.16b, RTMP0.16b, RTMP2.16b; \ + eor RTMP1.16b, RTMP1.16b, RTMP3.16b; \ + /* s0/t0 ^= RTMP0/1 */ \ + eor s0.16b, s0.16b, RTMP0.16b; \ + eor t0.16b, t0.16b, RTMP1.16b; + + mov x6, 8; +.Lroundloop8: + ld1 {RKEY.4s}, [x0], #16; + subs x6, x6, #1; + + ROUND(0, v0, v1, v2, v3, v4, v5, v6, v7); + ROUND(1, v1, v2, v3, v0, v5, v6, v7, v4); + ROUND(2, v2, v3, v0, v1, v6, v7, v4, v5); + ROUND(3, v3, v0, v1, v2, v7, v4, v5, v6); + + bne .Lroundloop8; + +#undef ROUND + + rotate_clockwise_90(v0, v1, v2, v3); + rotate_clockwise_90(v4, v5, v6, v7); + rev32 v0.16b, v0.16b; + rev32 v1.16b, v1.16b; + rev32 v2.16b, v2.16b; + rev32 v3.16b, v3.16b; + rev32 v4.16b, v4.16b; + rev32 v5.16b, v5.16b; + rev32 v6.16b, v6.16b; + rev32 v7.16b, v7.16b; + + sub x0, x0, #128; /* repoint to rkey */ + ret; + CFI_ENDPROC(); +ELF(.size __sm4_crypt_blk8,.-__sm4_crypt_blk8;) + +.align 3 +.global _gcry_sm4_aarch64_crypt_blk1_8 +ELF(.type _gcry_sm4_aarch64_crypt_blk1_8,%function;) +_gcry_sm4_aarch64_crypt_blk1_8: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num 
blocks (1..8) + */ + CFI_STARTPROC(); + + cmp x3, #5; + blt sm4_aarch64_crypt_blk1_4; + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b}, [x2], #16; + mov v5.16b, v4.16b; + mov v6.16b, v4.16b; + mov v7.16b, v4.16b; + beq .Lblk8_load_input_done; + ld1 {v5.16b}, [x2], #16; + cmp x3, #7; + blt .Lblk8_load_input_done; + ld1 {v6.16b}, [x2], #16; + beq .Lblk8_load_input_done; + ld1 {v7.16b}, [x2]; + +.Lblk8_load_input_done: + bl __sm4_crypt_blk8; + + cmp x3, #6; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b}, [x1], #16; + blt .Lblk8_store_output_done; + st1 {v5.16b}, [x1], #16; + beq .Lblk8_store_output_done; + st1 {v6.16b}, [x1], #16; + cmp x3, #7; + beq .Lblk8_store_output_done; + st1 {v7.16b}, [x1]; + +.Lblk8_store_output_done: + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_crypt_blk1_8,.-_gcry_sm4_aarch64_crypt_blk1_8;) + + +.align 3 +.global _gcry_sm4_aarch64_crypt +ELF(.type _gcry_sm4_aarch64_crypt,%function;) +_gcry_sm4_aarch64_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + +.Lcrypt_loop_blk: + subs x3, x3, #8; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2], #64; + bl __sm4_crypt_blk8; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1], #64; + b .Lcrypt_loop_blk; + +.Lcrypt_end: + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_crypt,.-_gcry_sm4_aarch64_crypt;) + + +.align 3 +.global _gcry_sm4_aarch64_cbc_dec +ELF(.type _gcry_sm4_aarch64_cbc_dec,%function;) +_gcry_sm4_aarch64_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + ld1 {RIV.16b}, [x3]; + +.Lcbc_loop_blk: + subs x4, x4, #8; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2]; + + bl __sm4_crypt_blk8; + + sub x2, x2, #64; + eor v0.16b, v0.16b, RIV.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + eor v4.16b, v4.16b, RTMP3.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v5.16b, v5.16b, RTMP0.16b; + eor v6.16b, v6.16b, RTMP1.16b; + eor v7.16b, v7.16b, RTMP2.16b; + + mov RIV.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcbc_loop_blk; + +.Lcbc_end: + /* store new IV */ + st1 {RIV.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cbc_dec,.-_gcry_sm4_aarch64_cbc_dec;) + +.align 3 +.global _gcry_sm4_aarch64_cfb_dec +ELF(.type _gcry_sm4_aarch64_cfb_dec,%function;) +_gcry_sm4_aarch64_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks 
(multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk: + subs x4, x4, #8; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; + ld1 {v4.16b-v7.16b}, [x2]; + + bl __sm4_crypt_blk8; + + sub x2, x2, #48; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_cfb_dec,.-_gcry_sm4_aarch64_cfb_dec;) + +.align 3 +.global _gcry_sm4_aarch64_ctr_enc +ELF(.type _gcry_sm4_aarch64_ctr_enc,%function;) +_gcry_sm4_aarch64_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + preload_sbox(x5); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk: + subs x4, x4, #8; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + inc_le128(v4); /* +4 */ + inc_le128(v5); /* +5 */ + inc_le128(v6); /* +6 */ + inc_le128(v7); /* +7 */ + + bl __sm4_crypt_blk8; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lctr_loop_blk; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aarch64_ctr_enc,.-_gcry_sm4_aarch64_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index 81662988..ec2281b6 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -67,6 +67,15 @@ # endif #endif +#undef USE_AARCH64_SIMD +#ifdef ENABLE_NEON_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_NEON) +# define USE_AARCH64_SIMD 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -94,6 +103,9 @@ typedef struct #ifdef USE_AESNI_AVX2 unsigned int use_aesni_avx2:1; #endif +#ifdef USE_AARCH64_SIMD + unsigned int use_aarch64_simd:1; +#endif } SM4_context; static const u32 fk[4] = @@ -241,6 +253,39 @@ extern void 
_gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, const u64 Ls[16]) ASM_FUNC_ABI; #endif /* USE_AESNI_AVX2 */ +#ifdef USE_AARCH64_SIMD +extern void _gcry_sm4_aarch64_crypt(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +extern void _gcry_sm4_aarch64_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + size_t nblocks); + +extern void _gcry_sm4_aarch64_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_aarch64_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +static inline unsigned int +sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) +{ + _gcry_sm4_aarch64_crypt_blk1_8(rk, out, in, (size_t)num_blks); + return 0; +} +#endif /* USE_AARCH64_SIMD */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -372,6 +417,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AESNI_AVX2 ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_AARCH64_SIMD + ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -553,6 +601,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -568,6 +633,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -654,6 +725,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -669,6 +757,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -748,6 +842,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_aarch64_simd) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_aarch64_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + /* Process remaining blocks. 
*/ if (nblocks) { @@ -763,6 +874,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -919,6 +1036,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { @@ -1079,6 +1202,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) { crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } +#endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_aarch64_simd) + { + crypt_blk1_8 = sm4_aarch64_crypt_blk1_8; + } #endif else { diff --git a/configure.ac b/configure.ac index ea01f5a6..89df9434 100644 --- a/configure.ac +++ b/configure.ac @@ -2740,6 +2740,9 @@ if test "$found" = "1" ; then GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx-amd64.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aesni-avx2-amd64.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" esac fi -- 2.34.1 From jussi.kivilinna at iki.fi Wed Feb 23 19:13:06 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 23 Feb 2022 20:13:06 +0200 Subject: [PATCH v3 2/2] Add SM4 ARMv8/AArch64 assembly implementation In-Reply-To: <20220223042359.7765-2-tianjia.zhang@linux.alibaba.com> References: <20220223042359.7765-1-tianjia.zhang@linux.alibaba.com> <20220223042359.7765-2-tianjia.zhang@linux.alibaba.com> Message-ID: <80c38e17-34da-6d0a-325b-43a287bd1fe1@iki.fi> Hello, Applied to master. Thanks. -Jussi On 23.2.2022 6.23, Tianjia Zhang wrote: > * cipher/Makefile.am: Add 'sm4-aarch64.S'. > * cipher/sm4-aarch64.S: New. > * cipher/sm4.c (USE_AARCH64_SIMD): New. > (SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'. > [USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt) > (_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec) > (_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8) > (sm4_aarch64_crypt_blk1_8): New. > (sm4_setkey): Enable ARMv8/AArch64 if supported by HW. > (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) > (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]: > Add ARMv8/AArch64 bulk functions. > * configure.ac: Add 'sm4-aarch64.lo'. > -- > > This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk > functions process eight blocks in parallel. > From tianjia.zhang at linux.alibaba.com Fri Feb 25 08:41:19 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Fri, 25 Feb 2022 15:41:19 +0800 Subject: [PATCH 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220225074119.31182-1-tianjia.zhang@linux.alibaba.com> References: <20220225074119.31182-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20220225074119.31182-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. * cipher/sm4-armv8-aarch64-ce.S: New. * cipher/sm4.c (USE_ARM_CE): New. (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. 
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add ARMv8/AArch64/CE bulk functions. * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. -- This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk functions process eight blocks in parallel. Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 After (16x - 19x faster than ARMv8/AArch64 impl): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.81 MiB/s 33.27 c/B 2750 CBC dec | 0.243 ns/B 3921 MiB/s 0.669 c/B 2750 CFB enc | 12.14 ns/B 78.52 MiB/s 33.39 c/B 2750 CFB dec | 0.241 ns/B 3963 MiB/s 0.662 c/B 2750 CTR enc | 0.298 ns/B 3201 MiB/s 0.819 c/B 2749 CTR dec | 0.298 ns/B 3197 MiB/s 0.820 c/B 2750 GCM enc | 0.488 ns/B 1956 MiB/s 1.34 c/B 2749 GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 GCM auth | 0.189 ns/B 5049 MiB/s 0.519 c/B 2749 OCB enc | 0.461 ns/B 2069 MiB/s 1.27 c/B 2750 OCB dec | 0.495 ns/B 1928 MiB/s 1.36 c/B 2750 OCB auth | 0.385 ns/B 2479 MiB/s 1.06 c/B 2750 Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 1 + cipher/sm4-armv8-aarch64-ce.S | 614 ++++++++++++++++++++++++++++++++++ cipher/sm4.c | 142 ++++++++ configure.ac | 1 + 4 files changed, 758 insertions(+) create mode 100644 cipher/sm4-armv8-aarch64-ce.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index a7cbf3fc..3339c463 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ + sm4-armv8-aarch64-ce.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S new file mode 100644 index 00000000..943f0143 --- /dev/null +++ b/cipher/sm4-armv8-aarch64-ce.S @@ -0,0 +1,614 @@ +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher + * + * Copyright (C) 2022 Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include "asm-common-aarch64.h" + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ + defined(USE_SM4) + +.cpu generic+simd+crypto + +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lv\b\().4s, \b +.endr + +.macro sm4e, vd, vn + .inst 0xcec08400 | (.L\vn << 5) | .L\vd +.endm + +.macro sm4ekey, vd, vn, vm + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd +.endm + +.text + +/* Register macros */ + +#define RTMP0 v16 +#define RTMP1 v17 +#define RTMP2 v18 +#define RTMP3 v19 + +#define RIV v20 + +/* Helper macros. */ + +#define load_rkey(ptr) \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define crypt_blk4(b0, b1, b2, b3) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; + +#define crypt_blk8(b0, b1, b2, b3, b4, b5, b6, b7) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b4.4s, v24.4s; \ + sm4e b5.4s, v24.4s; \ + sm4e b6.4s, v24.4s; \ + sm4e b7.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b4.4s, v25.4s; \ + sm4e b5.4s, v25.4s; \ + sm4e b6.4s, v25.4s; \ + sm4e b7.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b4.4s, v26.4s; \ + sm4e b5.4s, v26.4s; \ + sm4e b6.4s, v26.4s; \ + sm4e b7.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b4.4s, v27.4s; \ + sm4e b5.4s, v27.4s; \ + sm4e b6.4s, v27.4s; \ + sm4e b7.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b4.4s, v28.4s; \ + sm4e b5.4s, v28.4s; \ + sm4e b6.4s, v28.4s; \ + sm4e b7.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b4.4s, v29.4s; \ + sm4e b5.4s, v29.4s; \ + sm4e b6.4s, v29.4s; \ + sm4e b7.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b4.4s, v30.4s; \ + sm4e b5.4s, 
v30.4s; \ + sm4e b6.4s, v30.4s; \ + sm4e b7.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + sm4e b4.4s, v31.4s; \ + sm4e b5.4s, v31.4s; \ + sm4e b6.4s, v31.4s; \ + sm4e b7.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + rev64 b4.4s, b4.4s; \ + rev64 b5.4s, b5.4s; \ + rev64 b6.4s, b6.4s; \ + rev64 b7.4s, b7.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + ext b4.16b, b4.16b, b4.16b, #8; \ + ext b5.16b, b5.16b, b5.16b, #8; \ + ext b6.16b, b6.16b, b6.16b, #8; \ + ext b7.16b, b7.16b, b7.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; + + +.align 3 +.global _gcry_sm4_armv8_ce_expand_key +ELF(.type _gcry_sm4_armv8_ce_expand_key,%function;) +_gcry_sm4_armv8_ce_expand_key: + /* input: + * x0: 128-bit key + * x1: rkey_enc + * x2: rkey_dec + * x3: fk array + * x4: ck array + */ + CFI_STARTPROC(); + + ld1 {v0.16b}, [x0]; + rev32 v0.16b, v0.16b; + ld1 {v1.16b}, [x3]; + load_rkey(x4); + + /* input ^ fk */ + eor v0.16b, v0.16b, v1.16b; + + sm4ekey v0.4s, v0.4s, v24.4s; + sm4ekey v1.4s, v0.4s, v25.4s; + sm4ekey v2.4s, v1.4s, v26.4s; + sm4ekey v3.4s, v2.4s, v27.4s; + sm4ekey v4.4s, v3.4s, v28.4s; + sm4ekey v5.4s, v4.4s, v29.4s; + sm4ekey v6.4s, v5.4s, v30.4s; + sm4ekey v7.4s, v6.4s, v31.4s; + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1]; + rev64 v7.4s, v7.4s; + rev64 v6.4s, v6.4s; + rev64 v5.4s, v5.4s; + rev64 v4.4s, v4.4s; + rev64 v3.4s, v3.4s; + rev64 v2.4s, v2.4s; + rev64 v1.4s, v1.4s; + rev64 v0.4s, v0.4s; + ext v7.16b, v7.16b, v7.16b, #8; + ext v6.16b, v6.16b, v6.16b, #8; + ext v5.16b, v5.16b, v5.16b, #8; + ext v4.16b, v4.16b, v4.16b, #8; + ext v3.16b, v3.16b, v3.16b, #8; + ext v2.16b, v2.16b, v2.16b, #8; + ext v1.16b, v1.16b, v1.16b, #8; + ext v0.16b, v0.16b, v0.16b, #8; + st1 {v7.16b}, [x2], #16; + st1 {v6.16b}, [x2], #16; + st1 {v5.16b}, [x2], #16; + st1 {v4.16b}, [x2], #16; + st1 {v3.16b}, [x2], #16; + st1 {v2.16b}, [x2], #16; + st1 {v1.16b}, [x2], #16; + st1 {v0.16b}, [x2]; + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;) + +.align 3 +ELF(.type sm4_armv8_ce_crypt_blk1_4,%function;) +sm4_armv8_ce_crypt_blk1_4: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..4) + */ + CFI_STARTPROC(); + VPUSH_ABI; + + load_rkey(x0); + + ld1 {v0.16b}, [x2], #16; + mov v1.16b, v0.16b; + mov v2.16b, v0.16b; + mov v3.16b, v0.16b; + cmp x3, #2; + blt .Lblk4_load_input_done; + ld1 {v1.16b}, [x2], #16; + beq .Lblk4_load_input_done; + ld1 {v2.16b}, [x2], #16; + cmp x3, #3; + beq .Lblk4_load_input_done; + ld1 {v3.16b}, [x2]; + +.Lblk4_load_input_done: + crypt_blk4(v0, v1, v2, v3); + + st1 {v0.16b}, [x1], #16; + cmp x3, #2; + blt .Lblk4_store_output_done; + st1 {v1.16b}, [x1], #16; + beq .Lblk4_store_output_done; + st1 {v2.16b}, [x1], #16; + cmp x3, #3; + beq .Lblk4_store_output_done; + st1 {v3.16b}, [x1]; + +.Lblk4_store_output_done: + VPOP_ABI; + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size sm4_armv8_ce_crypt_blk1_4,.-sm4_armv8_ce_crypt_blk1_4;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt_blk1_8 +ELF(.type _gcry_sm4_armv8_ce_crypt_blk1_8,%function;) +_gcry_sm4_armv8_ce_crypt_blk1_8: + /* input: + * x0: round key 
array, CTX + * x1: dst + * x2: src + * x3: num blocks (1..8) + */ + CFI_STARTPROC(); + + cmp x3, #5; + blt sm4_armv8_ce_crypt_blk1_4; + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + load_rkey(x0); + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b}, [x2], #16; + mov v5.16b, v4.16b; + mov v6.16b, v4.16b; + mov v7.16b, v4.16b; + beq .Lblk8_load_input_done; + ld1 {v5.16b}, [x2], #16; + cmp x3, #7; + blt .Lblk8_load_input_done; + ld1 {v6.16b}, [x2], #16; + beq .Lblk8_load_input_done; + ld1 {v7.16b}, [x2]; + +.Lblk8_load_input_done: + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + cmp x3, #6; + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b}, [x1], #16; + blt .Lblk8_store_output_done; + st1 {v5.16b}, [x1], #16; + beq .Lblk8_store_output_done; + st1 {v6.16b}, [x1], #16; + cmp x3, #7; + beq .Lblk8_store_output_done; + st1 {v7.16b}, [x1]; + +.Lblk8_store_output_done: + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;) + +.align 3 +.global _gcry_sm4_armv8_ce_crypt +ELF(.type _gcry_sm4_armv8_ce_crypt,%function;) +_gcry_sm4_armv8_ce_crypt: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + load_rkey(x0); + +.Lcrypt_loop_blk: + subs x3, x3, #8; + bmi .Lcrypt_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2], #64; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + st1 {v0.16b-v3.16b}, [x1], #64; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcrypt_loop_blk; + +.Lcrypt_end: + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;) + +.align 3 +.global _gcry_sm4_armv8_ce_cbc_dec +ELF(.type _gcry_sm4_armv8_ce_cbc_dec,%function;) +_gcry_sm4_armv8_ce_cbc_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + load_rkey(x0); + ld1 {RIV.16b}, [x3]; + +.Lcbc_loop_blk: + subs x4, x4, #8; + bmi .Lcbc_end; + + ld1 {v0.16b-v3.16b}, [x2], #64; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #64; + eor v0.16b, v0.16b, RIV.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v1.16b, v1.16b, RTMP0.16b; + eor v2.16b, v2.16b, RTMP1.16b; + eor v3.16b, v3.16b, RTMP2.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + eor v4.16b, v4.16b, RTMP3.16b; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v5.16b, v5.16b, RTMP0.16b; + eor v6.16b, v6.16b, RTMP1.16b; + eor v7.16b, v7.16b, RTMP2.16b; + + mov RIV.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lcbc_loop_blk; + +.Lcbc_end: + /* store new IV */ + st1 {RIV.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_cfb_dec +ELF(.type _gcry_sm4_armv8_ce_cfb_dec,%function;) +_gcry_sm4_armv8_ce_cfb_dec: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + 
CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + load_rkey(x0); + ld1 {v0.16b}, [x3]; + +.Lcfb_loop_blk: + subs x4, x4, #8; + bmi .Lcfb_end; + + ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; + ld1 {v4.16b-v7.16b}, [x2]; + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + sub x2, x2, #48; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + mov v0.16b, RTMP3.16b; + + b .Lcfb_loop_blk; + +.Lcfb_end: + /* store new IV */ + st1 {v0.16b}, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;) + +.align 3 +.global _gcry_sm4_armv8_ce_ctr_enc +ELF(.type _gcry_sm4_armv8_ce_ctr_enc,%function;) +_gcry_sm4_armv8_ce_ctr_enc: + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * x4: nblocks (multiples of 8) + */ + CFI_STARTPROC(); + + stp x29, x30, [sp, #-16]!; + CFI_ADJUST_CFA_OFFSET(16); + CFI_REG_ON_STACK(29, 0); + CFI_REG_ON_STACK(30, 8); + VPUSH_ABI; + + load_rkey(x0); + + ldp x7, x8, [x3]; + rev x7, x7; + rev x8, x8; + +.Lctr_loop_blk: + subs x4, x4, #8; + bmi .Lctr_end; + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + adc x7, x7, xzr; \ + rev64 vctr.16b, vctr.16b; + + /* construct CTRs */ + inc_le128(v0); /* +0 */ + inc_le128(v1); /* +1 */ + inc_le128(v2); /* +2 */ + inc_le128(v3); /* +3 */ + inc_le128(v4); /* +4 */ + inc_le128(v5); /* +5 */ + inc_le128(v6); /* +6 */ + inc_le128(v7); /* +7 */ + + crypt_blk8(v0, v1, v2, v3, v4, v5, v6, v7); + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v0.16b, v0.16b, RTMP0.16b; + eor v1.16b, v1.16b, RTMP1.16b; + eor v2.16b, v2.16b, RTMP2.16b; + eor v3.16b, v3.16b, RTMP3.16b; + st1 {v0.16b-v3.16b}, [x1], #64; + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; + eor v4.16b, v4.16b, RTMP0.16b; + eor v5.16b, v5.16b, RTMP1.16b; + eor v6.16b, v6.16b, RTMP2.16b; + eor v7.16b, v7.16b, RTMP3.16b; + st1 {v4.16b-v7.16b}, [x1], #64; + + b .Lctr_loop_blk; + +.Lctr_end: + /* store new CTR */ + rev x7, x7; + rev x8, x8; + stp x7, x8, [x3]; + + VPOP_ABI; + ldp x29, x30, [sp], #16; + CFI_ADJUST_CFA_OFFSET(-16); + CFI_RESTORE(x29); + CFI_RESTORE(x30); + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;) + +#endif diff --git a/cipher/sm4.c b/cipher/sm4.c index ec2281b6..37b9e210 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -76,6 +76,15 @@ # endif #endif +#undef USE_ARM_CE +#ifdef ENABLE_ARM_CRYPTO_SUPPORT +# if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 +# endif +#endif + static const char *sm4_selftest (void); static void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, @@ -106,6 +115,9 @@ typedef struct #ifdef USE_AARCH64_SIMD unsigned int use_aarch64_simd:1; #endif +#ifdef USE_ARM_CE + unsigned int use_arm_ce:1; +#endif } SM4_context; static const u32 fk[4] = @@ -286,6 +298,43 @@ sm4_aarch64_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, } #endif /* USE_AARCH64_SIMD */ +#ifdef 
USE_ARM_CE +extern void _gcry_sm4_armv8_ce_expand_key(const byte *key, + u32 *rkey_enc, u32 *rkey_dec, + const u32 *fk, const u32 *ck); + +extern void _gcry_sm4_armv8_ce_crypt(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +extern void _gcry_sm4_armv8_ce_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv, + size_t nblocks); + +extern void _gcry_sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, + const byte *in, + size_t num_blocks); + +static inline unsigned int +sm4_armv8_ce_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) +{ + _gcry_sm4_armv8_ce_crypt_blk1_8(rk, out, in, (size_t)num_blks); + return 0; +} +#endif /* USE_ARM_CE */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -363,6 +412,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key) } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + _gcry_sm4_armv8_ce_expand_key (key, ctx->rkey_enc, ctx->rkey_dec, + fk, ck); + return; + } +#endif + rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0]; rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1]; rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2]; @@ -420,6 +478,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AARCH64_SIMD ctx->use_aarch64_simd = !!(hwf & HWF_ARM_NEON); #endif +#ifdef USE_ARM_CE + ctx->use_arm_ce = !!(hwf & HWF_ARM_SM4); +#endif /* Setup bulk encryption routines. */ memset (bulk_ops, 0, sizeof(*bulk_ops)); @@ -601,6 +662,23 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -634,6 +712,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -725,6 +809,23 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. */ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -758,6 +859,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -842,6 +949,23 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, } #endif +#ifdef USE_ARM_CE + if (ctx->use_arm_ce) + { + /* Process multiples of 8 blocks at a time. 
*/ + if (nblocks >= 8) + { + size_t nblks = nblocks & ~(8 - 1); + + _gcry_sm4_armv8_ce_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv, nblks); + + nblocks -= nblks; + outbuf += nblks * 16; + inbuf += nblks * 16; + } + } +#endif + #ifdef USE_AARCH64_SIMD if (ctx->use_aarch64_simd) { @@ -875,6 +999,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1037,6 +1167,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { @@ -1203,6 +1339,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) crypt_blk1_8 = sm4_aesni_avx_crypt_blk1_8; } #endif +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + crypt_blk1_8 = sm4_armv8_ce_crypt_blk1_8; + } +#endif #ifdef USE_AARCH64_SIMD else if (ctx->use_aarch64_simd) { diff --git a/configure.ac b/configure.ac index f5363f22..e20f9d13 100644 --- a/configure.ac +++ b/configure.ac @@ -2755,6 +2755,7 @@ if test "$found" = "1" ; then aarch64-*-*) # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-aarch64.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS sm4-armv8-aarch64-ce.lo" esac fi -- 2.34.1 From tianjia.zhang at linux.alibaba.com Fri Feb 25 08:41:18 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Fri, 25 Feb 2022 15:41:18 +0800 Subject: [PATCH 1/2] hwf-arm: add ARMv8.2 optional crypto extension HW features Message-ID: <20220225074119.31182-1-tianjia.zhang@linux.alibaba.com> * src/g10lib.h (HWF_ARM_SHA3, HWF_ARM_SM3, HWF_ARM_SM4) (HWF_ARM_SHA512): New. * src/hwf-arm.c (arm_features): Add sha3, sm3, sm4, sha512 HW features. * src/hwfeatures.c (hwflist): Add sha3, sm3, sm4, sha512 HW features. 
-- Signed-off-by: Tianjia Zhang --- src/g10lib.h | 4 ++++ src/hwf-arm.c | 16 ++++++++++++++++ src/hwfeatures.c | 4 ++++ 3 files changed, 24 insertions(+) diff --git a/src/g10lib.h b/src/g10lib.h index 22c0f0c2..985e75c6 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -245,6 +245,10 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_ARM_SHA1 (1 << 2) #define HWF_ARM_SHA2 (1 << 3) #define HWF_ARM_PMULL (1 << 4) +#define HWF_ARM_SHA3 (1 << 5) +#define HWF_ARM_SM3 (1 << 6) +#define HWF_ARM_SM4 (1 << 7) +#define HWF_ARM_SHA512 (1 << 8) #elif defined(HAVE_CPU_ARCH_PPC) diff --git a/src/hwf-arm.c b/src/hwf-arm.c index 60107f36..70d375b2 100644 --- a/src/hwf-arm.c +++ b/src/hwf-arm.c @@ -137,6 +137,18 @@ static const struct feature_map_s arm_features[] = #ifndef HWCAP_SHA2 # define HWCAP_SHA2 64 #endif +#ifndef HWCAP_SHA3 +# define HWCAP_SHA3 (1 << 17) +#endif +#ifndef HWCAP_SM3 +# define HWCAP_SM3 (1 << 18) +#endif +#ifndef HWCAP_SM4 +# define HWCAP_SM4 (1 << 19) +#endif +#ifndef HWCAP_SHA512 +# define HWCAP_SHA512 (1 << 21) +#endif static const struct feature_map_s arm_features[] = { @@ -148,6 +160,10 @@ static const struct feature_map_s arm_features[] = { HWCAP_SHA1, 0, " sha1", HWF_ARM_SHA1 }, { HWCAP_SHA2, 0, " sha2", HWF_ARM_SHA2 }, { HWCAP_PMULL, 0, " pmull", HWF_ARM_PMULL }, + { HWCAP_SHA3, 0, " sha3", HWF_ARM_SHA3 }, + { HWCAP_SM3, 0, " sm3", HWF_ARM_SM3 }, + { HWCAP_SM4, 0, " sm4", HWF_ARM_SM4 }, + { HWCAP_SHA512, 0, " sha512", HWF_ARM_SHA512 }, #endif }; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 97e67b3c..7060d995 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -68,6 +68,10 @@ static struct { HWF_ARM_SHA1, "arm-sha1" }, { HWF_ARM_SHA2, "arm-sha2" }, { HWF_ARM_PMULL, "arm-pmull" }, + { HWF_ARM_SHA3, "arm-sha3" }, + { HWF_ARM_SM3, "arm-sm3" }, + { HWF_ARM_SM4, "arm-sm4" }, + { HWF_ARM_SHA512, "arm-sha512" }, #elif defined(HAVE_CPU_ARCH_PPC) { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, -- 2.34.1 From jussi.kivilinna at iki.fi Sat Feb 26 08:10:55 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 26 Feb 2022 09:10:55 +0200 Subject: [PATCH 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: <20220225074119.31182-2-tianjia.zhang@linux.alibaba.com> References: <20220225074119.31182-1-tianjia.zhang@linux.alibaba.com> <20220225074119.31182-2-tianjia.zhang@linux.alibaba.com> Message-ID: On 25.2.2022 9.41, Tianjia Zhang wrote: > * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. > * cipher/sm4-armv8-aarch64-ce.S: New. > * cipher/sm4.c (USE_ARM_CE): New. > (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. > [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) > (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) > (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) > (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. > (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. > (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. > (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) > (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: > Add ARMv8/AArch64/CE bulk functions. > * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. > -- > > This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk > functions process eight blocks in parallel. 
> > Benchmark on T-Head Yitian-710 2.75 GHz: > > Before: > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 > CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 > CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 > CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 > CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 > CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 > GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 > GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 > GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 > OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 > OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 > OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 > > After (16x - 19x faster than ARMv8/AArch64 impl): > SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz > CBC enc | 12.10 ns/B 78.81 MiB/s 33.27 c/B 2750 > CBC dec | 0.243 ns/B 3921 MiB/s 0.669 c/B 2750 This implementation is actually so much faster than generic C, that `_gcry_sm4_armv8_ce_crypt_blk1_8` could be used in `sm4_encrypt` and `sm4_decrypt` to speed up single block operations (CBC encryption, etc) ... static unsigned int sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; #ifdef USE_ARM_CE if (ctx->use_arm_ce) return sm4_armv8_ce_crypt_blk1_8 (ctx->rkey_enc, outbuf, inbuf, 1); #endif ... > CFB enc | 12.14 ns/B 78.52 MiB/s 33.39 c/B 2750 > CFB dec | 0.241 ns/B 3963 MiB/s 0.662 c/B 2750 > CTR enc | 0.298 ns/B 3201 MiB/s 0.819 c/B 2749 > CTR dec | 0.298 ns/B 3197 MiB/s 0.820 c/B 2750 > GCM enc | 0.488 ns/B 1956 MiB/s 1.34 c/B 2749 > GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 > GCM auth | 0.189 ns/B 5049 MiB/s 0.519 c/B 2749 > OCB enc | 0.461 ns/B 2069 MiB/s 1.27 c/B 2750 > OCB dec | 0.495 ns/B 1928 MiB/s 1.36 c/B 2750 > OCB auth | 0.385 ns/B 2479 MiB/s 1.06 c/B 2750 > > Signed-off-by: Tianjia Zhang > --- > cipher/Makefile.am | 1 + > cipher/sm4-armv8-aarch64-ce.S | 614 ++++++++++++++++++++++++++++++++++ > cipher/sm4.c | 142 ++++++++ > configure.ac | 1 + > 4 files changed, 758 insertions(+) > create mode 100644 cipher/sm4-armv8-aarch64-ce.S > > diff --git a/cipher/Makefile.am b/cipher/Makefile.am > index a7cbf3fc..3339c463 100644 > --- a/cipher/Makefile.am > +++ b/cipher/Makefile.am > @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ > seed.c \ > serpent.c serpent-sse2-amd64.S \ > sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ > + sm4-armv8-aarch64-ce.S \ > serpent-avx2-amd64.S serpent-armv7-neon.S \ > sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ > sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ > diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S > new file mode 100644 > index 00000000..943f0143 > --- /dev/null > +++ b/cipher/sm4-armv8-aarch64-ce.S > @@ -0,0 +1,614 @@ > +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher > + * > + * Copyright (C) 2022 Alibaba Group. > + * Copyright (C) 2022 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . > + */ > + > +#include "asm-common-aarch64.h" > + > +#if defined(__AARCH64EL__) && \ > + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ > + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ > + defined(USE_SM4) > + > +.cpu generic+simd+crypto > + > +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 > + .set .Lv\b\().4s, \b > +.endr > + > +.macro sm4e, vd, vn > + .inst 0xcec08400 | (.L\vn << 5) | .L\vd > +.endm > + > +.macro sm4ekey, vd, vn, vm > + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd > +.endm We have target architectures where assembler does not support these macros (MacOSX for example). It's better to detect if these instructions are supported with new check in `configure.ac`. For example, see how this is done for `HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO`. -Jussi From tianjia.zhang at linux.alibaba.com Mon Feb 28 13:25:36 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Mon, 28 Feb 2022 20:25:36 +0800 Subject: [PATCH 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation In-Reply-To: References: <20220225074119.31182-1-tianjia.zhang@linux.alibaba.com> <20220225074119.31182-2-tianjia.zhang@linux.alibaba.com> Message-ID: <33c45d66-f80d-5555-2d67-ae105fb2b533@linux.alibaba.com> Hi Jussi, On 2/26/22 3:10 PM, Jussi Kivilinna wrote: > On 25.2.2022 9.41, Tianjia Zhang wrote: >> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. >> * cipher/sm4-armv8-aarch64-ce.S: New. >> * cipher/sm4.c (USE_ARM_CE): New. >> (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. >> [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) >> (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) >> (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) >> (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. >> (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. >> (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. >> (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) >> (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: >> Add ARMv8/AArch64/CE bulk functions. >> * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. >> -- >> >> This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk >> functions process eight blocks in parallel. >> >> Benchmark on T-Head Yitian-710 2.75 GHz: >> >> Before: >> ? SM4??????????? |? nanosecs/byte?? mebibytes/sec?? cycles/byte? auto Mhz >> ???????? CBC enc |???? 12.10 ns/B???? 78.79 MiB/s???? 33.28 c/B????? 2750 >> ???????? CBC dec |????? 4.63 ns/B???? 205.9 MiB/s???? 12.74 c/B????? 2749 >> ???????? CFB enc |???? 12.14 ns/B???? 78.58 MiB/s???? 33.37 c/B????? 2750 >> ???????? CFB dec |????? 4.64 ns/B???? 205.5 MiB/s???? 12.76 c/B????? 2750 >> ???????? CTR enc |????? 4.69 ns/B???? 203.3 MiB/s???? 12.90 c/B????? 2750 >> ???????? CTR dec |????? 4.69 ns/B???? 203.3 MiB/s???? 12.90 c/B????? 2750 >> ???????? GCM enc |????? 4.88 ns/B???? 195.4 MiB/s???? 13.42 c/B????? 2750 >> ???????? GCM dec |????? 4.88 ns/B???? 195.5 MiB/s???? 13.42 c/B????? 2750 >> ??????? GCM auth |???? 0.189 ns/B????? 5048 MiB/s???? 0.520 c/B????? 2750 >> ???????? OCB enc |????? 4.86 ns/B???? 196.0 MiB/s???? 13.38 c/B????? 2750 >> ???????? OCB dec |????? 4.90 ns/B???? 194.7 MiB/s???? 13.47 c/B????? 2750 >> ??????? OCB auth |????? 4.79 ns/B???? 199.0 MiB/s???? 13.18 c/B????? 2750 >> >> After (16x - 19x faster than ARMv8/AArch64 impl): >> ? 
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz >> CBC enc | 12.10 ns/B 78.81 MiB/s 33.27 c/B 2750 >> CBC dec | 0.243 ns/B 3921 MiB/s 0.669 c/B 2750 > > This implementation is actually so much faster than generic C, that > `_gcry_sm4_armv8_ce_crypt_blk1_8` could be used in `sm4_encrypt` and > `sm4_decrypt` to speed up single block operations (CBC encryption, etc) ... > > static unsigned int > sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) > { > SM4_context *ctx = context; > > #ifdef USE_ARM_CE > if (ctx->use_arm_ce) > return sm4_armv8_ce_crypt_blk1_8 (ctx->rkey_enc, outbuf, inbuf, 1); > #endif > ... > Great suggestion, I will do. >> CFB enc | 12.14 ns/B 78.52 MiB/s 33.39 c/B 2750 >> CFB dec | 0.241 ns/B 3963 MiB/s 0.662 c/B 2750 >> CTR enc | 0.298 ns/B 3201 MiB/s 0.819 c/B 2749 >> CTR dec | 0.298 ns/B 3197 MiB/s 0.820 c/B 2750 >> GCM enc | 0.488 ns/B 1956 MiB/s 1.34 c/B 2749 >> GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 >> GCM auth | 0.189 ns/B 5049 MiB/s 0.519 c/B 2749 >> OCB enc | 0.461 ns/B 2069 MiB/s 1.27 c/B 2750 >> OCB dec | 0.495 ns/B 1928 MiB/s 1.36 c/B 2750 >> OCB auth | 0.385 ns/B 2479 MiB/s 1.06 c/B 2750 >> >> Signed-off-by: Tianjia Zhang >> --- >> cipher/Makefile.am | 1 + >> cipher/sm4-armv8-aarch64-ce.S | 614 ++++++++++++++++++++++++++++++++++ >> cipher/sm4.c | 142 ++++++++ >> configure.ac | 1 + >> 4 files changed, 758 insertions(+) >> create mode 100644 cipher/sm4-armv8-aarch64-ce.S >> >> diff --git a/cipher/Makefile.am b/cipher/Makefile.am >> index a7cbf3fc..3339c463 100644 >> --- a/cipher/Makefile.am >> +++ b/cipher/Makefile.am >> @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \ >> seed.c \ >> serpent.c serpent-sse2-amd64.S \ >> sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \ >> + sm4-armv8-aarch64-ce.S \ >> serpent-avx2-amd64.S serpent-armv7-neon.S \ >> sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ >> sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ >> diff --git a/cipher/sm4-armv8-aarch64-ce.S >> b/cipher/sm4-armv8-aarch64-ce.S >> new file mode 100644 >> index 00000000..943f0143 >> --- /dev/null >> +++ b/cipher/sm4-armv8-aarch64-ce.S >> @@ -0,0 +1,614 @@ >> +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher >> + * >> + * Copyright (C) 2022 Alibaba Group. >> + * Copyright (C) 2022 Tianjia Zhang >> + * >> + * This file is part of Libgcrypt. >> + * >> + * Libgcrypt is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU Lesser General Public License as >> + * published by the Free Software Foundation; either version 2.1 of >> + * the License, or (at your option) any later version. >> + * >> + * Libgcrypt is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU Lesser General Public License for more details.
>> + * >> + * You should have received a copy of the GNU Lesser General Public >> + * License along with this program; if not, see >> . >> + */ >> + >> +#include "asm-common-aarch64.h" >> + >> +#if defined(__AARCH64EL__) && \ >> + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ >> + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \ >> + defined(USE_SM4) >> + >> +.cpu generic+simd+crypto >> + >> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31 >> + .set .Lv\b\().4s, \b >> +.endr >> + >> +.macro sm4e, vd, vn >> + .inst 0xcec08400 | (.L\vn << 5) | .L\vd >> +.endm >> + >> +.macro sm4ekey, vd, vn, vm >> + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd >> +.endm > > We have target architectures where assembler does not support these > macros (MacOSX for example). It's better to detect if these instructions > are supported with new check in `configure.ac`. For example, see how > this is done for `HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO`. > > -Jussi The SM crypto extensions are an optional part of ARMv8, so for various reasons most current mainstream ARM CPUs do not implement them. In the next patch I will add a check to detect whether the SM3/SM4 instructions are supported. Best regards, Tianjia
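As a rough sketch of the kind of configure.ac probe discussed above, modeled on the existing HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO check: the cache variable and macro names used here (gcry_cv_gcc_inline_asm_aarch64_sm_crypto, HAVE_GCC_INLINE_ASM_AARCH64_SM_CRYPTO) and the exact probe body are illustrative assumptions, not the code that was eventually committed.

# Sketch only: probe whether the toolchain accepts the AArch64 SM3/SM4 mnemonics.
# The names below are placeholders, not necessarily what libgcrypt ships.
AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 SM3/SM4 instructions],
       [gcry_cv_gcc_inline_asm_aarch64_sm_crypto],
       [if test "$mpi_cpu_arch" != "aarch64" ||
           test "$try_asm_modules" != "yes" ; then
          gcry_cv_gcc_inline_asm_aarch64_sm_crypto="n/a"
        else
          gcry_cv_gcc_inline_asm_aarch64_sm_crypto=no
          AC_LINK_IFELSE([AC_LANG_PROGRAM(
          [[__asm__(
                /* Enable the SM3/SM4 extension and emit one of each
                 * instruction; if the assembler rejects the mnemonics,
                 * the link test fails and the macro stays undefined. */
                ".cpu generic+simd+crypto+sm4\n\t"
                ".text\n\t"
                "testfn:\n\t"
                "sm4ekey v0.4s, v1.4s, v2.4s;\n\t"
                "sm4e v0.4s, v1.4s;\n\t"
                "ret;\n\t"
            );
            void testfn(void);
            ]], [ testfn(); ])],
          [gcry_cv_gcc_inline_asm_aarch64_sm_crypto=yes])
        fi])
if test "$gcry_cv_gcc_inline_asm_aarch64_sm_crypto" = "yes" ; then
   AC_DEFINE(HAVE_GCC_INLINE_ASM_AARCH64_SM_CRYPTO,1,
     [Defined if inline assembler supports AArch64 SM3/SM4 instructions])
fi

A compile-time gate of this kind keeps the new .S module out of builds whose assembler rejects the mnemonics (the macOS toolchain mentioned above), while the HWF_ARM_SM4 runtime flag added in the first patch still decides whether the accelerated path is taken on a given CPU.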