[PATCH 2/2] Add SM4 ARMv8/AArch64/CE assembly implementation
Jussi Kivilinna
jussi.kivilinna at iki.fi
Sat Feb 26 08:10:55 CET 2022
On 25.2.2022 9.41, Tianjia Zhang wrote:
> * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'.
> * cipher/sm4-armv8-aarch64-ce.S: New.
> * cipher/sm4.c (USE_ARM_CE): New.
> (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'.
> [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key)
> (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc)
> (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec)
> (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New.
> (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup.
> (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW.
> (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
> (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]:
> Add ARMv8/AArch64/CE bulk functions.
> * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'.
> --
>
> This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk
> functions process eight blocks in parallel.
>
> Benchmark on T-Head Yitian-710 2.75 GHz:
>
> Before:
> SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
> CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750
> CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749
> CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750
> CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750
> CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
> CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
> GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750
> GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750
> GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750
> OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750
> OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750
> OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750
>
> After (16x - 19x faster than ARMv8/AArch64 impl):
> SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
> CBC enc | 12.10 ns/B 78.81 MiB/s 33.27 c/B 2750
> CBC dec | 0.243 ns/B 3921 MiB/s 0.669 c/B 2750
This implementation is actually so much faster than the generic C implementation that `_gcry_sm4_armv8_ce_crypt_blk1_8` could be used in `sm4_encrypt` and `sm4_decrypt` to speed up single-block operations (CBC encryption, etc.) ...
static unsigned int
sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
{
  SM4_context *ctx = context;

#ifdef USE_ARM_CE
  if (ctx->use_arm_ce)
    return sm4_armv8_ce_crypt_blk1_8 (ctx->rkey_enc, outbuf, inbuf, 1);
#endif
  ...
> CFB enc | 12.14 ns/B 78.52 MiB/s 33.39 c/B 2750
> CFB dec | 0.241 ns/B 3963 MiB/s 0.662 c/B 2750
> CTR enc | 0.298 ns/B 3201 MiB/s 0.819 c/B 2749
> CTR dec | 0.298 ns/B 3197 MiB/s 0.820 c/B 2750
> GCM enc | 0.488 ns/B 1956 MiB/s 1.34 c/B 2749
> GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750
> GCM auth | 0.189 ns/B 5049 MiB/s 0.519 c/B 2749
> OCB enc | 0.461 ns/B 2069 MiB/s 1.27 c/B 2750
> OCB dec | 0.495 ns/B 1928 MiB/s 1.36 c/B 2750
> OCB auth | 0.385 ns/B 2479 MiB/s 1.06 c/B 2750
>
> Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
> ---
> cipher/Makefile.am | 1 +
> cipher/sm4-armv8-aarch64-ce.S | 614 ++++++++++++++++++++++++++++++++++
> cipher/sm4.c | 142 ++++++++
> configure.ac | 1 +
> 4 files changed, 758 insertions(+)
> create mode 100644 cipher/sm4-armv8-aarch64-ce.S
>
> diff --git a/cipher/Makefile.am b/cipher/Makefile.am
> index a7cbf3fc..3339c463 100644
> --- a/cipher/Makefile.am
> +++ b/cipher/Makefile.am
> @@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \
> seed.c \
> serpent.c serpent-sse2-amd64.S \
> sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S sm4-aarch64.S \
> + sm4-armv8-aarch64-ce.S \
> serpent-avx2-amd64.S serpent-armv7-neon.S \
> sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
> sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
> diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S
> new file mode 100644
> index 00000000..943f0143
> --- /dev/null
> +++ b/cipher/sm4-armv8-aarch64-ce.S
> @@ -0,0 +1,614 @@
> +/* sm4-armv8-aarch64-ce.S - ARMv8/AArch64/CE accelerated SM4 cipher
> + *
> + * Copyright (C) 2022 Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
> + *
> + * This file is part of Libgcrypt.
> + *
> + * Libgcrypt is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as
> + * published by the Free Software Foundation; either version 2.1 of
> + * the License, or (at your option) any later version.
> + *
> + * Libgcrypt is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "asm-common-aarch64.h"
> +
> +#if defined(__AARCH64EL__) && \
> + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \
> + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && \
> + defined(USE_SM4)
> +
> +.cpu generic+simd+crypto
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 24, 25, 26, 27, 28, 29, 30, 31
> + .set .Lv\b\().4s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> + .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4ekey, vd, vn, vm
> + .inst 0xce60c800 | (.L\vm << 16) | (.L\vn << 5) | .L\vd
> +.endm
We have target platforms where the assembler does not support these macro constructs (macOS, for example). It would be better to detect whether these instructions are supported with a new check in `configure.ac`; for example, see how this is done for `HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO`.
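Such a check might look roughly like this, modeled on the existing `HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO` test in `configure.ac` (a sketch only; the cache-variable and macro names `gcry_cv_gcc_inline_asm_aarch64_sm4` / `HAVE_GCC_INLINE_ASM_AARCH64_SM4` are hypothetical):

```
AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 SM4 instructions],
       [gcry_cv_gcc_inline_asm_aarch64_sm4],
       [if test "$mpi_cpu_arch" != "aarch64" ||
           test "$try_asm_modules" != "yes" ; then
          gcry_cv_gcc_inline_asm_aarch64_sm4="n/a"
        else
          gcry_cv_gcc_inline_asm_aarch64_sm4=no
          AC_COMPILE_IFELSE([AC_LANG_SOURCE(
          [[__asm__(
                ".cpu generic+simd+crypto+sm4\n\t"
                "sm4e v0.4s, v1.4s\n\t"
                "sm4ekey v2.4s, v3.4s, v4.4s\n\t"
                );
            ]])],
          [gcry_cv_gcc_inline_asm_aarch64_sm4=yes])
        fi])
if test "$gcry_cv_gcc_inline_asm_aarch64_sm4" = "yes" ; then
   AC_DEFINE(HAVE_GCC_INLINE_ASM_AARCH64_SM4,1,
             [Defined if inline assembler supports AArch64 SM4 instructions])
fi
```

With such a check in place, the `.inst`-based macros would only be needed as a fallback, or could be dropped in favor of the mnemonics when the assembler is new enough.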
-Jussi