[PATCH 1/5] Add Aarch64 assembly implementation of AES
Jussi Kivilinna
jussi.kivilinna@iki.fi
Sun Sep 4 12:43:56 CEST 2016
* cipher/Makefile.am: Add 'rijndael-aarch64.S'.
* cipher/rijndael-aarch64.S: New.
* cipher/rijndael-internal.h: Enable USE_ARM_ASM if __AARCH64EL__ and
HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined.
* configure.ac (gcry_cv_gcc_aarch64_platform_as_ok): New check.
[host=aarch64]: Add 'rijndael-aarch64.lo'.
--
This patch adds an ARMv8/Aarch64 assembly implementation of AES.
Benchmark on Cortex-A53 (1536 MHz):
Before:
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 19.37 ns/B 49.22 MiB/s 29.76 c/B
ECB dec | 19.85 ns/B 48.03 MiB/s 30.50 c/B
CBC enc | 16.84 ns/B 56.62 MiB/s 25.87 c/B
CBC dec | 16.81 ns/B 56.74 MiB/s 25.82 c/B
CFB enc | 16.80 ns/B 56.75 MiB/s 25.81 c/B
CFB dec | 16.81 ns/B 56.75 MiB/s 25.81 c/B
OFB enc | 20.02 ns/B 47.64 MiB/s 30.75 c/B
OFB dec | 20.02 ns/B 47.64 MiB/s 30.75 c/B
CTR enc | 17.06 ns/B 55.91 MiB/s 26.20 c/B
CTR dec | 17.06 ns/B 55.92 MiB/s 26.20 c/B
CCM enc | 33.94 ns/B 28.10 MiB/s 52.13 c/B
CCM dec | 33.94 ns/B 28.10 MiB/s 52.14 c/B
CCM auth | 16.97 ns/B 56.18 MiB/s 26.07 c/B
GCM enc | 28.70 ns/B 33.23 MiB/s 44.09 c/B
GCM dec | 28.70 ns/B 33.23 MiB/s 44.09 c/B
GCM auth | 11.66 ns/B 81.81 MiB/s 17.90 c/B
OCB enc | 17.66 ns/B 53.99 MiB/s 27.13 c/B
OCB dec | 17.61 ns/B 54.16 MiB/s 27.05 c/B
OCB auth | 17.44 ns/B 54.69 MiB/s 26.78 c/B
=
AES192 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 21.82 ns/B 43.71 MiB/s 33.51 c/B
ECB dec | 22.55 ns/B 42.30 MiB/s 34.63 c/B
CBC enc | 19.33 ns/B 49.33 MiB/s 29.70 c/B
CBC dec | 19.50 ns/B 48.91 MiB/s 29.95 c/B
CFB enc | 19.29 ns/B 49.44 MiB/s 29.63 c/B
CFB dec | 19.28 ns/B 49.46 MiB/s 29.61 c/B
OFB enc | 22.49 ns/B 42.40 MiB/s 34.55 c/B
OFB dec | 22.50 ns/B 42.38 MiB/s 34.56 c/B
CTR enc | 19.53 ns/B 48.83 MiB/s 30.00 c/B
CTR dec | 19.54 ns/B 48.80 MiB/s 30.02 c/B
CCM enc | 38.91 ns/B 24.51 MiB/s 59.77 c/B
CCM dec | 38.90 ns/B 24.51 MiB/s 59.76 c/B
CCM auth | 19.45 ns/B 49.02 MiB/s 29.88 c/B
GCM enc | 31.13 ns/B 30.63 MiB/s 47.82 c/B
GCM dec | 31.14 ns/B 30.63 MiB/s 47.82 c/B
GCM auth | 11.66 ns/B 81.80 MiB/s 17.91 c/B
OCB enc | 20.15 ns/B 47.33 MiB/s 30.95 c/B
OCB dec | 20.30 ns/B 46.98 MiB/s 31.18 c/B
OCB auth | 19.92 ns/B 47.88 MiB/s 30.59 c/B
=
AES256 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 24.33 ns/B 39.19 MiB/s 37.38 c/B
ECB dec | 25.23 ns/B 37.80 MiB/s 38.76 c/B
CBC enc | 21.82 ns/B 43.71 MiB/s 33.51 c/B
CBC dec | 22.18 ns/B 42.99 MiB/s 34.07 c/B
CFB enc | 21.77 ns/B 43.80 MiB/s 33.44 c/B
CFB dec | 21.77 ns/B 43.81 MiB/s 33.44 c/B
OFB enc | 24.99 ns/B 38.16 MiB/s 38.39 c/B
OFB dec | 24.99 ns/B 38.17 MiB/s 38.38 c/B
CTR enc | 22.02 ns/B 43.32 MiB/s 33.82 c/B
CTR dec | 22.02 ns/B 43.31 MiB/s 33.82 c/B
CCM enc | 43.86 ns/B 21.74 MiB/s 67.38 c/B
CCM dec | 43.87 ns/B 21.74 MiB/s 67.39 c/B
CCM auth | 21.94 ns/B 43.48 MiB/s 33.69 c/B
GCM enc | 33.66 ns/B 28.33 MiB/s 51.71 c/B
GCM dec | 33.66 ns/B 28.33 MiB/s 51.70 c/B
GCM auth | 11.69 ns/B 81.59 MiB/s 17.95 c/B
OCB enc | 22.90 ns/B 41.65 MiB/s 35.17 c/B
OCB dec | 23.25 ns/B 41.02 MiB/s 35.71 c/B
OCB auth | 22.69 ns/B 42.03 MiB/s 34.85 c/B
=
After (~1.2x faster):
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 16.40 ns/B 58.16 MiB/s 25.19 c/B
ECB dec | 17.01 ns/B 56.07 MiB/s 26.13 c/B
CBC enc | 13.99 ns/B 68.15 MiB/s 21.49 c/B
CBC dec | 14.04 ns/B 67.94 MiB/s 21.56 c/B
CFB enc | 13.96 ns/B 68.32 MiB/s 21.44 c/B
CFB dec | 13.95 ns/B 68.34 MiB/s 21.43 c/B
OFB enc | 17.14 ns/B 55.65 MiB/s 26.32 c/B
OFB dec | 17.13 ns/B 55.67 MiB/s 26.31 c/B
CTR enc | 14.17 ns/B 67.31 MiB/s 21.76 c/B
CTR dec | 14.17 ns/B 67.29 MiB/s 21.77 c/B
CCM enc | 28.16 ns/B 33.86 MiB/s 43.26 c/B
CCM dec | 28.16 ns/B 33.87 MiB/s 43.26 c/B
CCM auth | 14.08 ns/B 67.71 MiB/s 21.63 c/B
GCM enc | 25.82 ns/B 36.94 MiB/s 39.66 c/B
GCM dec | 25.82 ns/B 36.94 MiB/s 39.65 c/B
GCM auth | 11.67 ns/B 81.74 MiB/s 17.92 c/B
OCB enc | 14.78 ns/B 64.55 MiB/s 22.69 c/B
OCB dec | 14.80 ns/B 64.43 MiB/s 22.74 c/B
OCB auth | 14.59 ns/B 65.36 MiB/s 22.41 c/B
=
AES192 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 19.05 ns/B 50.07 MiB/s 29.25 c/B
ECB dec | 19.62 ns/B 48.62 MiB/s 30.13 c/B
CBC enc | 16.56 ns/B 57.59 MiB/s 25.44 c/B
CBC dec | 16.69 ns/B 57.14 MiB/s 25.64 c/B
CFB enc | 16.52 ns/B 57.71 MiB/s 25.38 c/B
CFB dec | 16.52 ns/B 57.73 MiB/s 25.37 c/B
OFB enc | 19.70 ns/B 48.41 MiB/s 30.26 c/B
OFB dec | 19.69 ns/B 48.43 MiB/s 30.24 c/B
CTR enc | 16.73 ns/B 57.00 MiB/s 25.70 c/B
CTR dec | 16.73 ns/B 57.01 MiB/s 25.70 c/B
CCM enc | 33.29 ns/B 28.65 MiB/s 51.13 c/B
CCM dec | 33.29 ns/B 28.65 MiB/s 51.13 c/B
CCM auth | 16.65 ns/B 57.29 MiB/s 25.57 c/B
GCM enc | 28.39 ns/B 33.60 MiB/s 43.60 c/B
GCM dec | 28.39 ns/B 33.59 MiB/s 43.60 c/B
GCM auth | 11.64 ns/B 81.92 MiB/s 17.88 c/B
OCB enc | 17.33 ns/B 55.03 MiB/s 26.62 c/B
OCB dec | 17.40 ns/B 54.82 MiB/s 26.72 c/B
OCB auth | 17.16 ns/B 55.59 MiB/s 26.35 c/B
=
AES256 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 21.56 ns/B 44.23 MiB/s 33.12 c/B
ECB dec | 22.09 ns/B 43.17 MiB/s 33.93 c/B
CBC enc | 19.09 ns/B 49.97 MiB/s 29.31 c/B
CBC dec | 19.13 ns/B 49.86 MiB/s 29.38 c/B
CFB enc | 19.04 ns/B 50.09 MiB/s 29.24 c/B
CFB dec | 19.04 ns/B 50.08 MiB/s 29.25 c/B
OFB enc | 22.22 ns/B 42.93 MiB/s 34.13 c/B
OFB dec | 22.22 ns/B 42.92 MiB/s 34.13 c/B
CTR enc | 19.25 ns/B 49.53 MiB/s 29.57 c/B
CTR dec | 19.25 ns/B 49.55 MiB/s 29.57 c/B
CCM enc | 38.33 ns/B 24.88 MiB/s 58.88 c/B
CCM dec | 38.34 ns/B 24.88 MiB/s 58.88 c/B
CCM auth | 19.17 ns/B 49.76 MiB/s 29.44 c/B
GCM enc | 30.91 ns/B 30.86 MiB/s 47.47 c/B
GCM dec | 30.91 ns/B 30.85 MiB/s 47.48 c/B
GCM auth | 11.71 ns/B 81.47 MiB/s 17.98 c/B
OCB enc | 19.85 ns/B 48.04 MiB/s 30.49 c/B
OCB dec | 19.89 ns/B 47.95 MiB/s 30.55 c/B
OCB auth | 19.67 ns/B 48.48 MiB/s 30.22 c/B
=
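(For reference: the cycles/byte column above is just nanosecs/byte
scaled by the 1536 MHz clock, e.g. 16.40 ns/B * 1.536 GHz ~= 25.19 c/B,
and the quoted ~1.2x follows from e.g. ECB enc: 29.76 c/B / 25.19 c/B
~= 1.18.)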
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
---
diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index de619fe..c555f81 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -82,6 +82,7 @@ poly1305-sse2-amd64.S poly1305-avx2-amd64.S poly1305-armv7-neon.S \
rijndael.c rijndael-internal.h rijndael-tables.h rijndael-aesni.c \
rijndael-padlock.c rijndael-amd64.S rijndael-arm.S rijndael-ssse3-amd64.c \
rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \
+ rijndael-aarch64.S \
rmd160.c \
rsa.c \
salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \
diff --git a/cipher/rijndael-aarch64.S b/cipher/rijndael-aarch64.S
new file mode 100644
index 0000000..2f91a1d
--- /dev/null
+++ b/cipher/rijndael-aarch64.S
@@ -0,0 +1,510 @@
+/* rijndael-aarch64.S - ARMv8/Aarch64 assembly implementation of AES cipher
+ *
+ * Copyright (C) 2016 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(__AARCH64EL__)
+#ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS
+
+.text
+
+/* register macros */
+#define CTX x0
+#define RDST x1
+#define RSRC x2
+#define NROUNDS w3
+#define RTAB x4
+#define RMASK w5
+
+#define RA w8
+#define RB w9
+#define RC w10
+#define RD w11
+
+#define RNA w12
+#define RNB w13
+#define RNC w14
+#define RND w15
+
+#define RT0 w6
+#define RT1 w7
+#define RT2 w16
+#define xRT0 x6
+#define xRT1 x7
+#define xRT2 x16
+
+#define xw8 x8
+#define xw9 x9
+#define xw10 x10
+#define xw11 x11
+
+#define xw12 x12
+#define xw13 x13
+#define xw14 x14
+#define xw15 x15
+
+/***********************************************************************
+ * ARMv8/Aarch64 assembly implementation of the AES cipher
+ ***********************************************************************/
+#define preload_first_key(round, ra) \
+ ldr ra, [CTX, #(((round) * 16) + 0 * 4)];
+
+#define dummy(round, ra) /* nothing */
+
+#define addroundkey(ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \
+ ldp rna, rnb, [CTX]; \
+ ldp rnc, rnd, [CTX, #8]; \
+ eor ra, ra, rna; \
+ eor rb, rb, rnb; \
+ eor rc, rc, rnc; \
+ preload_key(1, rna); \
+ eor rd, rd, rnd;
+
+#define do_encround(next_r, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \
+ ldr rnb, [CTX, #(((next_r) * 16) + 1 * 4)]; \
+ \
+ and RT0, RMASK, ra, lsl#2; \
+ ldr rnc, [CTX, #(((next_r) * 16) + 2 * 4)]; \
+ and RT1, RMASK, ra, lsr#(8 - 2); \
+ ldr rnd, [CTX, #(((next_r) * 16) + 3 * 4)]; \
+ and RT2, RMASK, ra, lsr#(16 - 2); \
+ ldr RT0, [RTAB, xRT0]; \
+ and ra, RMASK, ra, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rna, rna, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rd, lsl#2; \
+ ldr ra, [RTAB, x##ra]; \
+ \
+ eor rnd, rnd, RT1, ror #24; \
+ and RT1, RMASK, rd, lsr#(8 - 2); \
+ eor rnc, rnc, RT2, ror #16; \
+ and RT2, RMASK, rd, lsr#(16 - 2); \
+ eor rnb, rnb, ra, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rd, RMASK, rd, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnd, rnd, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rc, lsl#2; \
+ ldr rd, [RTAB, x##rd]; \
+ \
+ eor rnc, rnc, RT1, ror #24; \
+ and RT1, RMASK, rc, lsr#(8 - 2); \
+ eor rnb, rnb, RT2, ror #16; \
+ and RT2, RMASK, rc, lsr#(16 - 2); \
+ eor rna, rna, rd, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rc, RMASK, rc, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnc, rnc, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rb, lsl#2; \
+ ldr rc, [RTAB, x##rc]; \
+ \
+ eor rnb, rnb, RT1, ror #24; \
+ and RT1, RMASK, rb, lsr#(8 - 2); \
+ eor rna, rna, RT2, ror #16; \
+ and RT2, RMASK, rb, lsr#(16 - 2); \
+ eor rnd, rnd, rc, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rb, RMASK, rb, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnb, rnb, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ eor rna, rna, RT1, ror #24; \
+ ldr rb, [RTAB, x##rb]; \
+ \
+ eor rnd, rnd, RT2, ror #16; \
+ preload_key((next_r) + 1, ra); \
+ eor rnc, rnc, rb, ror #8;
+
+#define do_lastencround(ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ and RT0, RMASK, ra, lsl#2; \
+ and RT1, RMASK, ra, lsr#(8 - 2); \
+ and RT2, RMASK, ra, lsr#(16 - 2); \
+ ldrb rna, [RTAB, xRT0]; \
+ and ra, RMASK, ra, lsr#(24 - 2); \
+ ldrb rnd, [RTAB, xRT1]; \
+ and RT0, RMASK, rd, lsl#2; \
+ ldrb rnc, [RTAB, xRT2]; \
+ ror rnd, rnd, #24; \
+ ldrb rnb, [RTAB, x##ra]; \
+ and RT1, RMASK, rd, lsr#(8 - 2); \
+ ror rnc, rnc, #16; \
+ and RT2, RMASK, rd, lsr#(16 - 2); \
+ ror rnb, rnb, #8; \
+ ldrb RT0, [RTAB, xRT0]; \
+ and rd, RMASK, rd, lsr#(24 - 2); \
+ ldrb RT1, [RTAB, xRT1]; \
+ \
+ orr rnd, rnd, RT0; \
+ ldrb RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rc, lsl#2; \
+ ldrb rd, [RTAB, x##rd]; \
+ orr rnc, rnc, RT1, ror #24; \
+ and RT1, RMASK, rc, lsr#(8 - 2); \
+ orr rnb, rnb, RT2, ror #16; \
+ and RT2, RMASK, rc, lsr#(16 - 2); \
+ orr rna, rna, rd, ror #8; \
+ ldrb RT0, [RTAB, xRT0]; \
+ and rc, RMASK, rc, lsr#(24 - 2); \
+ ldrb RT1, [RTAB, xRT1]; \
+ \
+ orr rnc, rnc, RT0; \
+ ldrb RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rb, lsl#2; \
+ ldrb rc, [RTAB, x##rc]; \
+ orr rnb, rnb, RT1, ror #24; \
+ and RT1, RMASK, rb, lsr#(8 - 2); \
+ orr rna, rna, RT2, ror #16; \
+ ldrb RT0, [RTAB, xRT0]; \
+ and RT2, RMASK, rb, lsr#(16 - 2); \
+ ldrb RT1, [RTAB, xRT1]; \
+ orr rnd, rnd, rc, ror #8; \
+ ldrb RT2, [RTAB, xRT2]; \
+ and rb, RMASK, rb, lsr#(24 - 2); \
+ ldrb rb, [RTAB, x##rb]; \
+ \
+ orr rnb, rnb, RT0; \
+ orr rna, rna, RT1, ror #24; \
+ orr rnd, rnd, RT2, ror #16; \
+ orr rnc, rnc, rb, ror #8;
+
+#define firstencround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ addroundkey(ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key); \
+ do_encround((round) + 1, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key);
+
+#define encround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \
+ do_encround((round) + 1, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key);
+
+#define lastencround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ add CTX, CTX, #(((round) + 1) * 16); \
+ add RTAB, RTAB, #1; \
+ do_lastencround(ra, rb, rc, rd, rna, rnb, rnc, rnd); \
+ addroundkey(rna, rnb, rnc, rnd, ra, rb, rc, rd, dummy);
+
+.globl _gcry_aes_arm_encrypt_block
+.type _gcry_aes_arm_encrypt_block,%function;
+
+_gcry_aes_arm_encrypt_block:
+ /* input:
+ * %x0: keysched, CTX
+ * %x1: dst
+ * %x2: src
+ * %w3: number of rounds.. 10, 12 or 14
+ * %x4: encryption table
+ */
+
+ /* read input block */
+
+ /* aligned load */
+ ldp RA, RB, [RSRC];
+ ldp RC, RD, [RSRC, #8];
+#ifndef __AARCH64EL__
+ rev RA, RA;
+ rev RB, RB;
+ rev RC, RC;
+ rev RD, RD;
+#endif
+
+ mov RMASK, #(0xff<<2);
+
+ firstencround(0, RA, RB, RC, RD, RNA, RNB, RNC, RND);
+ encround(1, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(2, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(3, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(4, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(5, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(6, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(7, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+
+ cmp NROUNDS, #12;
+ bge .Lenc_not_128;
+
+ encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy);
+ lastencround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD);
+
+.Lenc_done:
+
+ /* store output block */
+
+ /* aligned store */
+#ifndef __AARCH64EL__
+ rev RA, RA;
+ rev RB, RB;
+ rev RC, RC;
+ rev RD, RD;
+#endif
+ /* write output block */
+ stp RA, RB, [RDST];
+ stp RC, RD, [RDST, #8];
+
+ mov x0, #(0);
+ ret;
+
+.ltorg
+.Lenc_not_128:
+ beq .Lenc_192
+
+ encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(10, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(11, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(12, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy);
+ lastencround(13, RNA, RNB, RNC, RND, RA, RB, RC, RD);
+
+ b .Lenc_done;
+
+.ltorg
+.Lenc_192:
+ encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ encround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ encround(10, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy);
+ lastencround(11, RNA, RNB, RNC, RND, RA, RB, RC, RD);
+
+ b .Lenc_done;
+.size _gcry_aes_arm_encrypt_block,.-_gcry_aes_arm_encrypt_block;
+
+#define addroundkey_dec(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ ldr rna, [CTX, #(((round) * 16) + 0 * 4)]; \
+ ldr rnb, [CTX, #(((round) * 16) + 1 * 4)]; \
+ eor ra, ra, rna; \
+ ldr rnc, [CTX, #(((round) * 16) + 2 * 4)]; \
+ eor rb, rb, rnb; \
+ ldr rnd, [CTX, #(((round) * 16) + 3 * 4)]; \
+ eor rc, rc, rnc; \
+ preload_first_key((round) - 1, rna); \
+ eor rd, rd, rnd;
+
+#define do_decround(next_r, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \
+ ldr rnb, [CTX, #(((next_r) * 16) + 1 * 4)]; \
+ \
+ and RT0, RMASK, ra, lsl#2; \
+ ldr rnc, [CTX, #(((next_r) * 16) + 2 * 4)]; \
+ and RT1, RMASK, ra, lsr#(8 - 2); \
+ ldr rnd, [CTX, #(((next_r) * 16) + 3 * 4)]; \
+ and RT2, RMASK, ra, lsr#(16 - 2); \
+ ldr RT0, [RTAB, xRT0]; \
+ and ra, RMASK, ra, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rna, rna, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rb, lsl#2; \
+ ldr ra, [RTAB, x##ra]; \
+ \
+ eor rnb, rnb, RT1, ror #24; \
+ and RT1, RMASK, rb, lsr#(8 - 2); \
+ eor rnc, rnc, RT2, ror #16; \
+ and RT2, RMASK, rb, lsr#(16 - 2); \
+ eor rnd, rnd, ra, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rb, RMASK, rb, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnb, rnb, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rc, lsl#2; \
+ ldr rb, [RTAB, x##rb]; \
+ \
+ eor rnc, rnc, RT1, ror #24; \
+ and RT1, RMASK, rc, lsr#(8 - 2); \
+ eor rnd, rnd, RT2, ror #16; \
+ and RT2, RMASK, rc, lsr#(16 - 2); \
+ eor rna, rna, rb, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rc, RMASK, rc, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnc, rnc, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rd, lsl#2; \
+ ldr rc, [RTAB, x##rc]; \
+ \
+ eor rnd, rnd, RT1, ror #24; \
+ and RT1, RMASK, rd, lsr#(8 - 2); \
+ eor rna, rna, RT2, ror #16; \
+ and RT2, RMASK, rd, lsr#(16 - 2); \
+ eor rnb, rnb, rc, ror #8; \
+ ldr RT0, [RTAB, xRT0]; \
+ and rd, RMASK, rd, lsr#(24 - 2); \
+ \
+ ldr RT1, [RTAB, xRT1]; \
+ eor rnd, rnd, RT0; \
+ ldr RT2, [RTAB, xRT2]; \
+ eor rna, rna, RT1, ror #24; \
+ ldr rd, [RTAB, x##rd]; \
+ \
+ eor rnb, rnb, RT2, ror #16; \
+ preload_key((next_r) - 1, ra); \
+ eor rnc, rnc, rd, ror #8;
+
+#define do_lastdecround(ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ and RT0, RMASK, ra; \
+ and RT1, RMASK, ra, lsr#8; \
+ and RT2, RMASK, ra, lsr#16; \
+ ldrb rna, [RTAB, xRT0]; \
+ lsr ra, ra, #24; \
+ ldrb rnb, [RTAB, xRT1]; \
+ and RT0, RMASK, rb; \
+ ldrb rnc, [RTAB, xRT2]; \
+ ror rnb, rnb, #24; \
+ ldrb rnd, [RTAB, x##ra]; \
+ and RT1, RMASK, rb, lsr#8; \
+ ror rnc, rnc, #16; \
+ and RT2, RMASK, rb, lsr#16; \
+ ror rnd, rnd, #8; \
+ ldrb RT0, [RTAB, xRT0]; \
+ lsr rb, rb, #24; \
+ ldrb RT1, [RTAB, xRT1]; \
+ \
+ orr rnb, rnb, RT0; \
+ ldrb RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rc; \
+ ldrb rb, [RTAB, x##rb]; \
+ orr rnc, rnc, RT1, ror #24; \
+ and RT1, RMASK, rc, lsr#8; \
+ orr rnd, rnd, RT2, ror #16; \
+ and RT2, RMASK, rc, lsr#16; \
+ orr rna, rna, rb, ror #8; \
+ ldrb RT0, [RTAB, xRT0]; \
+ lsr rc, rc, #24; \
+ ldrb RT1, [RTAB, xRT1]; \
+ \
+ orr rnc, rnc, RT0; \
+ ldrb RT2, [RTAB, xRT2]; \
+ and RT0, RMASK, rd; \
+ ldrb rc, [RTAB, x##rc]; \
+ orr rnd, rnd, RT1, ror #24; \
+ and RT1, RMASK, rd, lsr#8; \
+ orr rna, rna, RT2, ror #16; \
+ ldrb RT0, [RTAB, xRT0]; \
+ and RT2, RMASK, rd, lsr#16; \
+ ldrb RT1, [RTAB, xRT1]; \
+ orr rnb, rnb, rc, ror #8; \
+ ldrb RT2, [RTAB, xRT2]; \
+ lsr rd, rd, #24; \
+ ldrb rd, [RTAB, x##rd]; \
+ \
+ orr rnd, rnd, RT0; \
+ orr rna, rna, RT1, ror #24; \
+ orr rnb, rnb, RT2, ror #16; \
+ orr rnc, rnc, rd, ror #8;
+
+#define firstdecround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ addroundkey_dec(((round) + 1), ra, rb, rc, rd, rna, rnb, rnc, rnd); \
+ do_decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key);
+
+#define decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \
+ do_decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key);
+
+#define set_last_round_rmask(_, __) \
+ mov RMASK, #0xff;
+
+#define lastdecround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \
+ add RTAB, RTAB, #(4 * 256); \
+ do_lastdecround(ra, rb, rc, rd, rna, rnb, rnc, rnd); \
+ addroundkey(rna, rnb, rnc, rnd, ra, rb, rc, rd, dummy);
+
+.globl _gcry_aes_arm_decrypt_block
+.type _gcry_aes_arm_decrypt_block,%function;
+
+_gcry_aes_arm_decrypt_block:
+ /* input:
+ * %x0: keysched, CTX
+ * %x1: dst
+ * %x2: src
+ * %w3: number of rounds.. 10, 12 or 14
+ * %x4: decryption table
+ */
+
+ /* read input block */
+
+ /* aligned load */
+ ldp RA, RB, [RSRC];
+ ldp RC, RD, [RSRC, #8];
+#ifndef __AARCH64EL__
+ rev RA, RA;
+ rev RB, RB;
+ rev RC, RC;
+ rev RD, RD;
+#endif
+
+ mov RMASK, #(0xff << 2);
+
+ cmp NROUNDS, #12;
+ bge .Ldec_256;
+
+ firstdecround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND);
+.Ldec_tail:
+ decround(8, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(7, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ decround(6, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(5, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ decround(4, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(3, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ decround(2, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(1, RA, RB, RC, RD, RNA, RNB, RNC, RND, set_last_round_rmask);
+ lastdecround(0, RNA, RNB, RNC, RND, RA, RB, RC, RD);
+
+ /* store output block */
+
+ /* aligned store */
+#ifndef __AARCH64EL__
+ rev RA, RA;
+ rev RB, RB;
+ rev RC, RC;
+ rev RD, RD;
+#endif
+ /* write output block */
+ stp RA, RB, [RDST];
+ stp RC, RD, [RDST, #8];
+
+ mov x0, #(0);
+ ret;
+
+.ltorg
+.Ldec_256:
+ beq .Ldec_192;
+
+ firstdecround(13, RA, RB, RC, RD, RNA, RNB, RNC, RND);
+ decround(12, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(11, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+ decround(10, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+
+ b .Ldec_tail;
+
+.ltorg
+.Ldec_192:
+ firstdecround(11, RA, RB, RC, RD, RNA, RNB, RNC, RND);
+ decround(10, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key);
+ decround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key);
+
+ b .Ldec_tail;
+.size _gcry_aes_arm_decrypt_block,.-_gcry_aes_arm_decrypt_block;
+
+#endif /*HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS*/
+#endif /*__AARCH64EL__ */
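
For readers not familiar with the technique: the do_encround/do_decround
macros above implement the classic scalar T-table formulation of AES,
using a single 256-entry 32-bit table plus 'ror #24/#16/#8' to derive
the three rotated tables, and a pre-scaled byte mask (RMASK = 0xff << 2
together with the lsl #2 / lsr #(n - 2) operands) so that each extracted
state byte is already a byte offset into the 32-bit table. A rough C
model of one encryption round is sketched below; the names are
hypothetical and it is meant only to mirror the data flow of the
assembly, not to reproduce libgcrypt's actual code:

#include <stdint.h>

/* Hypothetical C model of one do_encround step (illustration only).
 * T stands for the 256-entry 32-bit encryption table passed in x4;
 * the rotr32() calls stand in for the 'ror #24/#16/#8' operands that
 * derive the T1..T3 tables from T0 in the assembly. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
  return (x >> n) | (x << (32 - n));
}

static void enc_round_model(uint32_t s[4], const uint32_t rk[4],
                            const uint32_t T[256])
{
  uint32_t a = s[0], b = s[1], c = s[2], d = s[3];

  s[0] = rk[0] ^ T[a & 0xff] ^ rotr32(T[(b >> 8) & 0xff], 24)
               ^ rotr32(T[(c >> 16) & 0xff], 16) ^ rotr32(T[d >> 24], 8);
  s[1] = rk[1] ^ T[b & 0xff] ^ rotr32(T[(c >> 8) & 0xff], 24)
               ^ rotr32(T[(d >> 16) & 0xff], 16) ^ rotr32(T[a >> 24], 8);
  s[2] = rk[2] ^ T[c & 0xff] ^ rotr32(T[(d >> 8) & 0xff], 24)
               ^ rotr32(T[(a >> 16) & 0xff], 16) ^ rotr32(T[b >> 24], 8);
  s[3] = rk[3] ^ T[d & 0xff] ^ rotr32(T[(a >> 8) & 0xff], 24)
               ^ rotr32(T[(b >> 16) & 0xff], 16) ^ rotr32(T[c >> 24], 8);
}

The assembly additionally interleaves the independent table loads and
eor/and operations across the four state words, which helps hide the
load-use latency of the in-order Cortex-A53.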
diff --git a/cipher/rijndael-arm.S b/cipher/rijndael-arm.S
index e3a91c2..e680c81 100644
--- a/cipher/rijndael-arm.S
+++ b/cipher/rijndael-arm.S
@@ -577,5 +577,5 @@ _gcry_aes_arm_decrypt_block:
b .Ldec_tail;
.size _gcry_aes_arm_encrypt_block,.-_gcry_aes_arm_encrypt_block;
-#endif /*HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS*/
+#endif /*HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS*/
#endif /*__ARMEL__ */
diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h
index 7544fa0..340dbc0 100644
--- a/cipher/rijndael-internal.h
+++ b/cipher/rijndael-internal.h
@@ -58,6 +58,11 @@
# define USE_ARM_ASM 1
# endif
#endif
+#if defined(__AARCH64EL__)
+# ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS
+# define USE_ARM_ASM 1
+# endif
+#endif
/* USE_PADLOCK indicates whether to compile the padlock specific
code. */
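
The glue in rijndael.c is unchanged: with USE_ARM_ASM now defined for
this target, the existing prototypes and call sites are reused. For
reference, the entry points are expected to look roughly like the
following (inferred from the register assignments documented in the
assembly: x0 = key schedule, x1 = dst, x2 = src, w3 = rounds, x4 =
table; the value returned in x0 is treated as the burn-stack depth):

extern unsigned int _gcry_aes_arm_encrypt_block(const void *keysched,
                                                unsigned char *out,
                                                const unsigned char *in,
                                                int rounds,
                                                const void *encT);
extern unsigned int _gcry_aes_arm_decrypt_block(const void *keysched,
                                                unsigned char *out,
                                                const unsigned char *in,
                                                int rounds,
                                                const void *decT);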
diff --git a/configure.ac b/configure.ac
index 7f415bf..a530f77 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1073,6 +1073,32 @@ fi
#
+# Check whether GCC assembler supports features needed for our ARMv8/Aarch64
+# implementations. This needs to be done before setting up the
+# assembler stuff.
+#
+AC_CACHE_CHECK([whether GCC assembler is compatible for ARMv8/Aarch64 assembly implementations],
+ [gcry_cv_gcc_aarch64_platform_as_ok],
+ [gcry_cv_gcc_aarch64_platform_as_ok=no
+ AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+ [[__asm__(
+ "asmfunc:\n\t"
+ "eor x0, x0, x30, ror #12;\n\t"
+ "add x0, x0, x30, asr #12;\n\t"
+ "eor v0.16b, v0.16b, v31.16b;\n\t"
+
+ /* Test if '.type' and '.size' are supported. */
+ ".size asmfunc,.-asmfunc;\n\t"
+ ".type asmfunc, at function;\n\t"
+ );]])],
+ [gcry_cv_gcc_aarch64_platform_as_ok=yes])])
+if test "$gcry_cv_gcc_aarch64_platform_as_ok" = "yes" ; then
+ AC_DEFINE(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS,1,
+ [Defined if underlying assembler is compatible with ARMv8/Aarch64 assembly implementations])
+fi
+
+
+#
# Check whether underscores in symbols are required. This needs to be
# done before setting up the assembler stuff.
#
@@ -2014,6 +2040,10 @@ if test "$found" = "1" ; then
GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-ce.lo"
GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-aarch32-ce.lo"
;;
+ aarch64-*-*)
+ # Build with the assembly implementation
+ GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-aarch64.lo"
+ ;;
esac
case "$mpi_cpu_arch" in