From burdges at gnunet.org Thu Sep 1 23:19:42 2016
From: burdges at gnunet.org (Jeff Burdges)
Date: Thu, 01 Sep 2016 23:19:42 +0200
Subject: Fault attacks on RSA in libgcrypt
In-Reply-To: <878tvmmh7z.fsf@wheatstone.g10code.de>
References: <1471887762.11550.159.camel@gnunet.org>
	<878tvmmh7z.fsf@wheatstone.g10code.de>
Message-ID: <1472764782.16025.3.camel@gnunet.org>

It appears someone just improved Rowhammer:
http://arstechnica.com/security/2016/08/new-attack-steals-private-crypto-keys-by-corrupting-data-in-computer-memory/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: 

From gniibe at fsij.org Fri Sep 2 02:34:21 2016
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Fri, 2 Sep 2016 09:34:21 +0900
Subject: Fault attacks on RSA in libgcrypt
In-Reply-To: <1472764782.16025.3.camel@gnunet.org>
References: <1471887762.11550.159.camel@gnunet.org>
	<878tvmmh7z.fsf@wheatstone.g10code.de>
	<1472764782.16025.3.camel@gnunet.org>
Message-ID: <78f2af22-d8a3-d8ab-1865-97805da6ece9@fsij.org>

Hello,

On 09/02/2016 06:19 AM, Jeff Burdges wrote:
> It appears someone just improved Rowhammer:
> http://arstechnica.com/security/2016/08/new-attack-steals-private-crypto-keys-by-corrupting-data-in-computer-memory/

This is a bit different.  The attack does not recover the RSA private
key; it flips a bit in the RSA *public* key and thereby cheats the
verification process.  Newer gpgv in GnuPG has a tweak, so this
particular attack scenario is no longer valid.

Still, given hardware that lets us flip a (rather arbitrary) bit, it
would be possible to achieve some privilege escalation and gain more
control of a system.  So I think the idea behind this attack is valid,
and in general we have no way to solve it in software (though we may
find mitigations for a given scenario).

For the original discussion:

> "Making RSA-PSS Provably Secure Against Non-Random Faults" by Gilles
> Barthe, François Dupressoir, Pierre-Alain Fouque, Benjamin Grégoire,
> Mehdi Tibouchi and Jean-Christophe Zapalowicz.
> https://eprint.iacr.org/2014/252

I read it briefly.  IIUC, it is mostly relevant to smartcards and
"secure chips".  On a general-purpose computer, if such multi-fault
attacks can be mounted (by Rowhammer, laser, or power glitching), it
would be easier for an attacker to achieve some other privilege
escalation and take control of the system, and so obtain the private
key directly.  That's my current opinion.
-- 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 455 bytes
Desc: OpenPGP digital signature
URL: 

From burdges at gnunet.org Fri Sep 2 05:27:59 2016
From: burdges at gnunet.org (Jeff Burdges)
Date: Fri, 02 Sep 2016 05:27:59 +0200
Subject: Fault attacks on RSA in libgcrypt
In-Reply-To: <78f2af22-d8a3-d8ab-1865-97805da6ece9@fsij.org>
References: <1471887762.11550.159.camel@gnunet.org>
	<878tvmmh7z.fsf@wheatstone.g10code.de>
	<1472764782.16025.3.camel@gnunet.org>
	<78f2af22-d8a3-d8ab-1865-97805da6ece9@fsij.org>
Message-ID: <1472786879.16025.22.camel@gnunet.org>

On Fri, 2016-09-02 at 09:34 +0900, NIIBE Yutaka wrote:
> So I think the idea behind this attack is valid, and in general we
> have no way to solve it in software (though we may find mitigations
> for a given scenario).

As I said before, I now think the patch I submitted upthread is
useless.  We should instead look toward approaches resembling:
http://dl.acm.org/citation.cfm?doid=1873548.1873556

In this article, there is considerably more randomization throughout
the signing algorithm.  Indeed, one could imagine extending it to two
layers of randomization, so that the actual key exists only briefly
when loaded from disk before being randomized for the session, and
each decryption operation gets its own randomization as well.
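To make that concrete, here is a minimal sketch of what per-operation
randomization looks like for RSA: fresh exponent blinding on every
private-key operation, plus a verify-after-sign check as a fault
backstop.  This is plain GMP over raw, unpadded RSA with illustrative
names; it is a sketch of the general technique, not the construction
from the paper above and not libgcrypt code:

    #include <gmp.h>

    /* Sign with a freshly blinded exponent d + r*phi(n), then refuse
     * to release the signature unless it verifies under the public
     * exponent.  Raw RSA for brevity; 'rng' must be a seeded GMP
     * random state.  */
    int
    blinded_rsa_sign (mpz_t sig, const mpz_t msg, const mpz_t n,
                      const mpz_t e, const mpz_t d, const mpz_t phi,
                      gmp_randstate_t rng)
    {
      mpz_t r, d_blind, check;
      int ok;

      mpz_inits (r, d_blind, check, NULL);

      /* The exponent actually used differs on every call but is
       * equivalent modulo phi(n), so timing and fault leakage no
       * longer target the long-term d directly.  */
      mpz_urandomb (r, rng, 64);
      mpz_mul (d_blind, r, phi);
      mpz_add (d_blind, d_blind, d);

      mpz_powm (sig, msg, d_blind, n);

      /* Fault backstop: re-check the result with the public key, so
       * a faulted signature is never released.  */
      mpz_powm (check, sig, e, n);
      ok = (mpz_cmp (check, msg) == 0);
      if (!ok)
        mpz_set_ui (sig, 0);

      mpz_clears (r, d_blind, check, NULL);
      return ok;
    }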
There are good odds that a more thoroughly randomized approach like
this can be justified purely by the added protection against timing
attacks, whereas my now-retracted patch is obviously useless for that.
The paper does not make that case, though.

Would anyone here who understands the existing protections against
timing attacks care to glance over this article?

Jeff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: 

From jussi.kivilinna at iki.fi Sun Sep 4 12:43:56 2016
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 04 Sep 2016 13:43:56 +0300
Subject: [PATCH 1/5] Add Aarch64 assembly implementation of AES
Message-ID: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6>

* cipher/Makefile.am: Add 'rijndael-aarch64.S'.
* cipher/rijndael-aarch64.S: New.
* cipher/rijndael-internal.h: Enable USE_ARM_ASM if __AARCH64EL__ and
  HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined.
* configure.ac (gcry_cv_gcc_aarch64_platform_as_ok): New check.
  [host=aarch64]: Add 'rijndael-aarch64.lo'.

--

This patch adds an ARMv8/AArch64 assembly implementation of AES.
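For reviewers who have not worked with table-based AES: the round
macros in rijndael-aarch64.S use a single 1 KiB lookup table and
synthesize the three other classic T-tables with rotations (hence the
ror #24/#16/#8 operands and the pre-scaled RMASK = 0xff<<2 byte
masks).  The per-word dataflow of one inner encryption round
corresponds to the following C model (illustrative only, not the
actual libgcrypt source; byte order simplified):

    #include <stdint.h>

    static inline uint32_t
    rotr32 (uint32_t x, unsigned int n)
    {
      return (x >> n) | (x << (32 - n));
    }

    /* One inner round: each output word combines four table lookups,
     * one per input byte, with ShiftRows expressed through the
     * (i+1), (i+2), (i+3) word indices and MixColumns folded into
     * the table T[].  The assembly folds the *4 scaling for 32-bit
     * entries into the byte extraction via RMASK.  */
    static void
    aes_encround_model (uint32_t out[4], const uint32_t in[4],
                        const uint32_t rk[4], const uint32_t T[256])
    {
      int i;
      for (i = 0; i < 4; i++)
        out[i] = rk[i]
                 ^ T[(in[i] >> 0) & 0xff]
                 ^ rotr32 (T[(in[(i + 1) & 3] >>  8) & 0xff], 24)
                 ^ rotr32 (T[(in[(i + 2) & 3] >> 16) & 0xff], 16)
                 ^ rotr32 (T[(in[(i + 3) & 3] >> 24) & 0xff],  8);
    }

The last round has no MixColumns, so the macros switch to byte-wide
ldrb loads from the S-box bytes within the same table instead.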
Benchmark on Cortex-A53 (1536 MHz):

Before:
 AES | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 19.37 ns/B 49.22 MiB/s 29.76 c/B
 ECB dec | 19.85 ns/B 48.03 MiB/s 30.50 c/B
 CBC enc | 16.84 ns/B 56.62 MiB/s 25.87 c/B
 CBC dec | 16.81 ns/B 56.74 MiB/s 25.82 c/B
 CFB enc | 16.80 ns/B 56.75 MiB/s 25.81 c/B
 CFB dec | 16.81 ns/B 56.75 MiB/s 25.81 c/B
 OFB enc | 20.02 ns/B 47.64 MiB/s 30.75 c/B
 OFB dec | 20.02 ns/B 47.64 MiB/s 30.75 c/B
 CTR enc | 17.06 ns/B 55.91 MiB/s 26.20 c/B
 CTR dec | 17.06 ns/B 55.92 MiB/s 26.20 c/B
 CCM enc | 33.94 ns/B 28.10 MiB/s 52.13 c/B
 CCM dec | 33.94 ns/B 28.10 MiB/s 52.14 c/B
 CCM auth | 16.97 ns/B 56.18 MiB/s 26.07 c/B
 GCM enc | 28.70 ns/B 33.23 MiB/s 44.09 c/B
 GCM dec | 28.70 ns/B 33.23 MiB/s 44.09 c/B
 GCM auth | 11.66 ns/B 81.81 MiB/s 17.90 c/B
 OCB enc | 17.66 ns/B 53.99 MiB/s 27.13 c/B
 OCB dec | 17.61 ns/B 54.16 MiB/s 27.05 c/B
 OCB auth | 17.44 ns/B 54.69 MiB/s 26.78 c/B
 =
 AES192 | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 21.82 ns/B 43.71 MiB/s 33.51 c/B
 ECB dec | 22.55 ns/B 42.30 MiB/s 34.63 c/B
 CBC enc | 19.33 ns/B 49.33 MiB/s 29.70 c/B
 CBC dec | 19.50 ns/B 48.91 MiB/s 29.95 c/B
 CFB enc | 19.29 ns/B 49.44 MiB/s 29.63 c/B
 CFB dec | 19.28 ns/B 49.46 MiB/s 29.61 c/B
 OFB enc | 22.49 ns/B 42.40 MiB/s 34.55 c/B
 OFB dec | 22.50 ns/B 42.38 MiB/s 34.56 c/B
 CTR enc | 19.53 ns/B 48.83 MiB/s 30.00 c/B
 CTR dec | 19.54 ns/B 48.80 MiB/s 30.02 c/B
 CCM enc | 38.91 ns/B 24.51 MiB/s 59.77 c/B
 CCM dec | 38.90 ns/B 24.51 MiB/s 59.76 c/B
 CCM auth | 19.45 ns/B 49.02 MiB/s 29.88 c/B
 GCM enc | 31.13 ns/B 30.63 MiB/s 47.82 c/B
 GCM dec | 31.14 ns/B 30.63 MiB/s 47.82 c/B
 GCM auth | 11.66 ns/B 81.80 MiB/s 17.91 c/B
 OCB enc | 20.15 ns/B 47.33 MiB/s 30.95 c/B
 OCB dec | 20.30 ns/B 46.98 MiB/s 31.18 c/B
 OCB auth | 19.92 ns/B 47.88 MiB/s 30.59 c/B
 =
 AES256 | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 24.33 ns/B 39.19 MiB/s 37.38 c/B
 ECB dec | 25.23 ns/B 37.80 MiB/s 38.76 c/B
 CBC enc | 21.82 ns/B 43.71 MiB/s 33.51 c/B
 CBC dec | 22.18 ns/B 42.99 MiB/s 34.07 c/B
 CFB enc | 21.77 ns/B 43.80 MiB/s 33.44 c/B
 CFB dec | 21.77 ns/B 43.81 MiB/s 33.44 c/B
 OFB enc | 24.99 ns/B 38.16 MiB/s 38.39 c/B
 OFB dec | 24.99 ns/B 38.17 MiB/s 38.38 c/B
 CTR enc | 22.02 ns/B 43.32 MiB/s 33.82 c/B
 CTR dec | 22.02 ns/B 43.31 MiB/s 33.82 c/B
 CCM enc | 43.86 ns/B 21.74 MiB/s 67.38 c/B
 CCM dec | 43.87 ns/B 21.74 MiB/s 67.39 c/B
 CCM auth | 21.94 ns/B 43.48 MiB/s 33.69 c/B
 GCM enc | 33.66 ns/B 28.33 MiB/s 51.71 c/B
 GCM dec | 33.66 ns/B 28.33 MiB/s 51.70 c/B
 GCM auth | 11.69 ns/B 81.59 MiB/s 17.95 c/B
 OCB enc | 22.90 ns/B 41.65 MiB/s 35.17 c/B
 OCB dec | 23.25 ns/B 41.02 MiB/s 35.71 c/B
 OCB auth | 22.69 ns/B 42.03 MiB/s 34.85 c/B
 =

After (~1.2x faster):
 AES | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 16.40 ns/B 58.16 MiB/s 25.19 c/B
 ECB dec | 17.01 ns/B 56.07 MiB/s 26.13 c/B
 CBC enc | 13.99 ns/B 68.15 MiB/s 21.49 c/B
 CBC dec | 14.04 ns/B 67.94 MiB/s 21.56 c/B
 CFB enc | 13.96 ns/B 68.32 MiB/s 21.44 c/B
 CFB dec | 13.95 ns/B 68.34 MiB/s 21.43 c/B
 OFB enc | 17.14 ns/B 55.65 MiB/s 26.32 c/B
 OFB dec | 17.13 ns/B 55.67 MiB/s 26.31 c/B
 CTR enc | 14.17 ns/B 67.31 MiB/s 21.76 c/B
 CTR dec | 14.17 ns/B 67.29 MiB/s 21.77 c/B
 CCM enc | 28.16 ns/B 33.86 MiB/s 43.26 c/B
 CCM dec | 28.16 ns/B 33.87 MiB/s 43.26 c/B
 CCM auth | 14.08 ns/B 67.71 MiB/s 21.63 c/B
 GCM enc | 25.82 ns/B 36.94 MiB/s 39.66 c/B
 GCM dec | 25.82 ns/B 36.94 MiB/s 39.65 c/B
 GCM auth | 11.67 ns/B 81.74 MiB/s 17.92 c/B
 OCB enc | 14.78 ns/B 64.55 MiB/s 22.69 c/B
 OCB dec | 14.80 ns/B 64.43 MiB/s 22.74 c/B
 OCB auth | 14.59 ns/B 65.36 MiB/s 22.41 c/B
 =
 AES192 | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 19.05 ns/B 50.07 MiB/s 29.25 c/B
 ECB dec | 19.62 ns/B 48.62 MiB/s 30.13 c/B
 CBC enc | 16.56 ns/B 57.59 MiB/s 25.44 c/B
 CBC dec | 16.69 ns/B 57.14 MiB/s 25.64 c/B
 CFB enc | 16.52 ns/B 57.71 MiB/s 25.38 c/B
 CFB dec | 16.52 ns/B 57.73 MiB/s 25.37 c/B
 OFB enc | 19.70 ns/B 48.41 MiB/s 30.26 c/B
 OFB dec | 19.69 ns/B 48.43 MiB/s 30.24 c/B
 CTR enc | 16.73 ns/B 57.00 MiB/s 25.70 c/B
 CTR dec | 16.73 ns/B 57.01 MiB/s 25.70 c/B
 CCM enc | 33.29 ns/B 28.65 MiB/s 51.13 c/B
 CCM dec | 33.29 ns/B 28.65 MiB/s 51.13 c/B
 CCM auth | 16.65 ns/B 57.29 MiB/s 25.57 c/B
 GCM enc | 28.39 ns/B 33.60 MiB/s 43.60 c/B
 GCM dec | 28.39 ns/B 33.59 MiB/s 43.60 c/B
 GCM auth | 11.64 ns/B 81.92 MiB/s 17.88 c/B
 OCB enc | 17.33 ns/B 55.03 MiB/s 26.62 c/B
 OCB dec | 17.40 ns/B 54.82 MiB/s 26.72 c/B
 OCB auth | 17.16 ns/B 55.59 MiB/s 26.35 c/B
 =
 AES256 | nanosecs/byte mebibytes/sec cycles/byte
 ECB enc | 21.56 ns/B 44.23 MiB/s 33.12 c/B
 ECB dec | 22.09 ns/B 43.17 MiB/s 33.93 c/B
 CBC enc | 19.09 ns/B 49.97 MiB/s 29.31 c/B
 CBC dec | 19.13 ns/B 49.86 MiB/s 29.38 c/B
 CFB enc | 19.04 ns/B 50.09 MiB/s 29.24 c/B
 CFB dec | 19.04 ns/B 50.08 MiB/s 29.25 c/B
 OFB enc | 22.22 ns/B 42.93 MiB/s 34.13 c/B
 OFB dec | 22.22 ns/B 42.92 MiB/s 34.13 c/B
 CTR enc | 19.25 ns/B 49.53 MiB/s 29.57 c/B
 CTR dec | 19.25 ns/B 49.55 MiB/s 29.57 c/B
 CCM enc | 38.33 ns/B 24.88 MiB/s 58.88 c/B
 CCM dec | 38.34 ns/B 24.88 MiB/s 58.88 c/B
 CCM auth | 19.17 ns/B 49.76 MiB/s 29.44 c/B
 GCM enc | 30.91 ns/B 30.86 MiB/s 47.47 c/B
 GCM dec | 30.91 ns/B 30.85 MiB/s 47.48 c/B
 GCM auth | 11.71 ns/B 81.47 MiB/s 17.98 c/B
 OCB enc | 19.85 ns/B 48.04 MiB/s 30.49 c/B
 OCB dec | 19.89 ns/B 47.95 MiB/s 30.55 c/B
 OCB auth | 19.67 ns/B 48.48 MiB/s 30.22 c/B
 =

Signed-off-by: Jussi Kivilinna
---
 0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index de619fe..c555f81 100644
--- a/cipher/Makefile.am
+++
b/cipher/Makefile.am @@ -82,6 +82,7 @@ poly1305-sse2-amd64.S poly1305-avx2-amd64.S poly1305-armv7-neon.S \ rijndael.c rijndael-internal.h rijndael-tables.h rijndael-aesni.c \ rijndael-padlock.c rijndael-amd64.S rijndael-arm.S rijndael-ssse3-amd64.c \ rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \ + rijndael-aarch64.S \ rmd160.c \ rsa.c \ salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \ diff --git a/cipher/rijndael-aarch64.S b/cipher/rijndael-aarch64.S new file mode 100644 index 0000000..2f91a1d --- /dev/null +++ b/cipher/rijndael-aarch64.S @@ -0,0 +1,510 @@ +/* rijndael-aarch64.S - ARMv8/Aarch64 assembly implementation of AES cipher + * + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#if defined(__AARCH64EL__) +#ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS + +.text + +/* register macros */ +#define CTX x0 +#define RDST x1 +#define RSRC x2 +#define NROUNDS w3 +#define RTAB x4 +#define RMASK w5 + +#define RA w8 +#define RB w9 +#define RC w10 +#define RD w11 + +#define RNA w12 +#define RNB w13 +#define RNC w14 +#define RND w15 + +#define RT0 w6 +#define RT1 w7 +#define RT2 w16 +#define xRT0 x6 +#define xRT1 x7 +#define xRT2 x16 + +#define xw8 x8 +#define xw9 x9 +#define xw10 x10 +#define xw11 x11 + +#define xw12 x12 +#define xw13 x13 +#define xw14 x14 +#define xw15 x15 + +/*********************************************************************** + * ARMv8/Aarch64 assembly implementation of the AES cipher + ***********************************************************************/ +#define preload_first_key(round, ra) \ + ldr ra, [CTX, #(((round) * 16) + 0 * 4)]; + +#define dummy(round, ra) /* nothing */ + +#define addroundkey(ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \ + ldp rna, rnb, [CTX]; \ + ldp rnc, rnd, [CTX, #8]; \ + eor ra, ra, rna; \ + eor rb, rb, rnb; \ + eor rc, rc, rnc; \ + preload_key(1, rna); \ + eor rd, rd, rnd; + +#define do_encround(next_r, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \ + ldr rnb, [CTX, #(((next_r) * 16) + 1 * 4)]; \ + \ + and RT0, RMASK, ra, lsl#2; \ + ldr rnc, [CTX, #(((next_r) * 16) + 2 * 4)]; \ + and RT1, RMASK, ra, lsr#(8 - 2); \ + ldr rnd, [CTX, #(((next_r) * 16) + 3 * 4)]; \ + and RT2, RMASK, ra, lsr#(16 - 2); \ + ldr RT0, [RTAB, xRT0]; \ + and ra, RMASK, ra, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rna, rna, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rd, lsl#2; \ + ldr ra, [RTAB, x##ra]; \ + \ + eor rnd, rnd, RT1, ror #24; \ + and RT1, RMASK, rd, lsr#(8 - 2); \ + eor rnc, rnc, RT2, ror #16; \ + and RT2, RMASK, rd, lsr#(16 - 2); \ + eor rnb, rnb, ra, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rd, RMASK, rd, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnd, rnd, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rc, lsl#2; \ + ldr rd, [RTAB, x##rd]; \ + \ + eor rnc, rnc, RT1, ror #24; \ + and RT1, RMASK, rc, lsr#(8 - 2); \ + eor 
rnb, rnb, RT2, ror #16; \ + and RT2, RMASK, rc, lsr#(16 - 2); \ + eor rna, rna, rd, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rc, RMASK, rc, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnc, rnc, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rb, lsl#2; \ + ldr rc, [RTAB, x##rc]; \ + \ + eor rnb, rnb, RT1, ror #24; \ + and RT1, RMASK, rb, lsr#(8 - 2); \ + eor rna, rna, RT2, ror #16; \ + and RT2, RMASK, rb, lsr#(16 - 2); \ + eor rnd, rnd, rc, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rb, RMASK, rb, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnb, rnb, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + eor rna, rna, RT1, ror #24; \ + ldr rb, [RTAB, x##rb]; \ + \ + eor rnd, rnd, RT2, ror #16; \ + preload_key((next_r) + 1, ra); \ + eor rnc, rnc, rb, ror #8; + +#define do_lastencround(ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + and RT0, RMASK, ra, lsl#2; \ + and RT1, RMASK, ra, lsr#(8 - 2); \ + and RT2, RMASK, ra, lsr#(16 - 2); \ + ldrb rna, [RTAB, xRT0]; \ + and ra, RMASK, ra, lsr#(24 - 2); \ + ldrb rnd, [RTAB, xRT1]; \ + and RT0, RMASK, rd, lsl#2; \ + ldrb rnc, [RTAB, xRT2]; \ + ror rnd, rnd, #24; \ + ldrb rnb, [RTAB, x##ra]; \ + and RT1, RMASK, rd, lsr#(8 - 2); \ + ror rnc, rnc, #16; \ + and RT2, RMASK, rd, lsr#(16 - 2); \ + ror rnb, rnb, #8; \ + ldrb RT0, [RTAB, xRT0]; \ + and rd, RMASK, rd, lsr#(24 - 2); \ + ldrb RT1, [RTAB, xRT1]; \ + \ + orr rnd, rnd, RT0; \ + ldrb RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rc, lsl#2; \ + ldrb rd, [RTAB, x##rd]; \ + orr rnc, rnc, RT1, ror #24; \ + and RT1, RMASK, rc, lsr#(8 - 2); \ + orr rnb, rnb, RT2, ror #16; \ + and RT2, RMASK, rc, lsr#(16 - 2); \ + orr rna, rna, rd, ror #8; \ + ldrb RT0, [RTAB, xRT0]; \ + and rc, RMASK, rc, lsr#(24 - 2); \ + ldrb RT1, [RTAB, xRT1]; \ + \ + orr rnc, rnc, RT0; \ + ldrb RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rb, lsl#2; \ + ldrb rc, [RTAB, x##rc]; \ + orr rnb, rnb, RT1, ror #24; \ + and RT1, RMASK, rb, lsr#(8 - 2); \ + orr rna, rna, RT2, ror #16; \ + ldrb RT0, [RTAB, xRT0]; \ + and RT2, RMASK, rb, lsr#(16 - 2); \ + ldrb RT1, [RTAB, xRT1]; \ + orr rnd, rnd, rc, ror #8; \ + ldrb RT2, [RTAB, xRT2]; \ + and rb, RMASK, rb, lsr#(24 - 2); \ + ldrb rb, [RTAB, x##rb]; \ + \ + orr rnb, rnb, RT0; \ + orr rna, rna, RT1, ror #24; \ + orr rnd, rnd, RT2, ror #16; \ + orr rnc, rnc, rb, ror #8; + +#define firstencround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + addroundkey(ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key); \ + do_encround((round) + 1, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key); + +#define encround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \ + do_encround((round) + 1, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key); + +#define lastencround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + add CTX, CTX, #(((round) + 1) * 16); \ + add RTAB, RTAB, #1; \ + do_lastencround(ra, rb, rc, rd, rna, rnb, rnc, rnd); \ + addroundkey(rna, rnb, rnc, rnd, ra, rb, rc, rd, dummy); + +.globl _gcry_aes_arm_encrypt_block +.type _gcry_aes_arm_encrypt_block,%function; + +_gcry_aes_arm_encrypt_block: + /* input: + * %x0: keysched, CTX + * %x1: dst + * %x2: src + * %w3: number of rounds.. 
10, 12 or 14 + * %x4: encryption table + */ + + /* read input block */ + + /* aligned load */ + ldp RA, RB, [RSRC]; + ldp RC, RD, [RSRC, #8]; +#ifndef __AARCH64EL__ + rev RA, RA; + rev RB, RB; + rev RC, RC; + rev RD, RD; +#endif + + mov RMASK, #(0xff<<2); + + firstencround(0, RA, RB, RC, RD, RNA, RNB, RNC, RND); + encround(1, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(2, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(3, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(4, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(5, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(6, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(7, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + + cmp NROUNDS, #12; + bge .Lenc_not_128; + + encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy); + lastencround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD); + +.Lenc_done: + + /* store output block */ + + /* aligned store */ +#ifndef __AARCH64EL__ + rev RA, RA; + rev RB, RB; + rev RC, RC; + rev RD, RD; +#endif + /* write output block */ + stp RA, RB, [RDST]; + stp RC, RD, [RDST, #8]; + + mov x0, #(0); + ret; + +.ltorg +.Lenc_not_128: + beq .Lenc_192 + + encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(10, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(11, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(12, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy); + lastencround(13, RNA, RNB, RNC, RND, RA, RB, RC, RD); + + b .Lenc_done; + +.ltorg +.Lenc_192: + encround(8, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + encround(9, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + encround(10, RA, RB, RC, RD, RNA, RNB, RNC, RND, dummy); + lastencround(11, RNA, RNB, RNC, RND, RA, RB, RC, RD); + + b .Lenc_done; +.size _gcry_aes_arm_encrypt_block,.-_gcry_aes_arm_encrypt_block; + +#define addroundkey_dec(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + ldr rna, [CTX, #(((round) * 16) + 0 * 4)]; \ + ldr rnb, [CTX, #(((round) * 16) + 1 * 4)]; \ + eor ra, ra, rna; \ + ldr rnc, [CTX, #(((round) * 16) + 2 * 4)]; \ + eor rb, rb, rnb; \ + ldr rnd, [CTX, #(((round) * 16) + 3 * 4)]; \ + eor rc, rc, rnc; \ + preload_first_key((round) - 1, rna); \ + eor rd, rd, rnd; + +#define do_decround(next_r, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \ + ldr rnb, [CTX, #(((next_r) * 16) + 1 * 4)]; \ + \ + and RT0, RMASK, ra, lsl#2; \ + ldr rnc, [CTX, #(((next_r) * 16) + 2 * 4)]; \ + and RT1, RMASK, ra, lsr#(8 - 2); \ + ldr rnd, [CTX, #(((next_r) * 16) + 3 * 4)]; \ + and RT2, RMASK, ra, lsr#(16 - 2); \ + ldr RT0, [RTAB, xRT0]; \ + and ra, RMASK, ra, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rna, rna, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rb, lsl#2; \ + ldr ra, [RTAB, x##ra]; \ + \ + eor rnb, rnb, RT1, ror #24; \ + and RT1, RMASK, rb, lsr#(8 - 2); \ + eor rnc, rnc, RT2, ror #16; \ + and RT2, RMASK, rb, lsr#(16 - 2); \ + eor rnd, rnd, ra, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rb, RMASK, rb, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnb, rnb, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rc, lsl#2; \ + ldr rb, [RTAB, x##rb]; \ + \ + eor rnc, rnc, RT1, ror #24; \ + and RT1, RMASK, rc, lsr#(8 - 2); \ + eor rnd, rnd, RT2, ror #16; \ + and RT2, RMASK, rc, lsr#(16 - 2); \ + eor rna, rna, rb, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rc, RMASK, 
rc, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnc, rnc, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rd, lsl#2; \ + ldr rc, [RTAB, x##rc]; \ + \ + eor rnd, rnd, RT1, ror #24; \ + and RT1, RMASK, rd, lsr#(8 - 2); \ + eor rna, rna, RT2, ror #16; \ + and RT2, RMASK, rd, lsr#(16 - 2); \ + eor rnb, rnb, rc, ror #8; \ + ldr RT0, [RTAB, xRT0]; \ + and rd, RMASK, rd, lsr#(24 - 2); \ + \ + ldr RT1, [RTAB, xRT1]; \ + eor rnd, rnd, RT0; \ + ldr RT2, [RTAB, xRT2]; \ + eor rna, rna, RT1, ror #24; \ + ldr rd, [RTAB, x##rd]; \ + \ + eor rnb, rnb, RT2, ror #16; \ + preload_key((next_r) - 1, ra); \ + eor rnc, rnc, rd, ror #8; + +#define do_lastdecround(ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + and RT0, RMASK, ra; \ + and RT1, RMASK, ra, lsr#8; \ + and RT2, RMASK, ra, lsr#16; \ + ldrb rna, [RTAB, xRT0]; \ + lsr ra, ra, #24; \ + ldrb rnb, [RTAB, xRT1]; \ + and RT0, RMASK, rb; \ + ldrb rnc, [RTAB, xRT2]; \ + ror rnb, rnb, #24; \ + ldrb rnd, [RTAB, x##ra]; \ + and RT1, RMASK, rb, lsr#8; \ + ror rnc, rnc, #16; \ + and RT2, RMASK, rb, lsr#16; \ + ror rnd, rnd, #8; \ + ldrb RT0, [RTAB, xRT0]; \ + lsr rb, rb, #24; \ + ldrb RT1, [RTAB, xRT1]; \ + \ + orr rnb, rnb, RT0; \ + ldrb RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rc; \ + ldrb rb, [RTAB, x##rb]; \ + orr rnc, rnc, RT1, ror #24; \ + and RT1, RMASK, rc, lsr#8; \ + orr rnd, rnd, RT2, ror #16; \ + and RT2, RMASK, rc, lsr#16; \ + orr rna, rna, rb, ror #8; \ + ldrb RT0, [RTAB, xRT0]; \ + lsr rc, rc, #24; \ + ldrb RT1, [RTAB, xRT1]; \ + \ + orr rnc, rnc, RT0; \ + ldrb RT2, [RTAB, xRT2]; \ + and RT0, RMASK, rd; \ + ldrb rc, [RTAB, x##rc]; \ + orr rnd, rnd, RT1, ror #24; \ + and RT1, RMASK, rd, lsr#8; \ + orr rna, rna, RT2, ror #16; \ + ldrb RT0, [RTAB, xRT0]; \ + and RT2, RMASK, rd, lsr#16; \ + ldrb RT1, [RTAB, xRT1]; \ + orr rnb, rnb, rc, ror #8; \ + ldrb RT2, [RTAB, xRT2]; \ + lsr rd, rd, #24; \ + ldrb rd, [RTAB, x##rd]; \ + \ + orr rnd, rnd, RT0; \ + orr rna, rna, RT1, ror #24; \ + orr rnb, rnb, RT2, ror #16; \ + orr rnc, rnc, rd, ror #8; + +#define firstdecround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + addroundkey_dec(((round) + 1), ra, rb, rc, rd, rna, rnb, rnc, rnd); \ + do_decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_first_key); + +#define decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key) \ + do_decround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd, preload_key); + +#define set_last_round_rmask(_, __) \ + mov RMASK, #0xff; + +#define lastdecround(round, ra, rb, rc, rd, rna, rnb, rnc, rnd) \ + add RTAB, RTAB, #(4 * 256); \ + do_lastdecround(ra, rb, rc, rd, rna, rnb, rnc, rnd); \ + addroundkey(rna, rnb, rnc, rnd, ra, rb, rc, rd, dummy); + +.globl _gcry_aes_arm_decrypt_block +.type _gcry_aes_arm_decrypt_block,%function; + +_gcry_aes_arm_decrypt_block: + /* input: + * %x0: keysched, CTX + * %x1: dst + * %x2: src + * %w3: number of rounds.. 
10, 12 or 14 + * %x4: decryption table + */ + + /* read input block */ + + /* aligned load */ + ldp RA, RB, [RSRC]; + ldp RC, RD, [RSRC, #8]; +#ifndef __AARCH64EL__ + rev RA, RA; + rev RB, RB; + rev RC, RC; + rev RD, RD; +#endif + + mov RMASK, #(0xff << 2); + + cmp NROUNDS, #12; + bge .Ldec_256; + + firstdecround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND); +.Ldec_tail: + decround(8, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(7, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + decround(6, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(5, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + decround(4, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(3, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + decround(2, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(1, RA, RB, RC, RD, RNA, RNB, RNC, RND, set_last_round_rmask); + lastdecround(0, RNA, RNB, RNC, RND, RA, RB, RC, RD); + + /* store output block */ + + /* aligned store */ +#ifndef __AARCH64EL__ + rev RA, RA; + rev RB, RB; + rev RC, RC; + rev RD, RD; +#endif + /* write output block */ + stp RA, RB, [RDST]; + stp RC, RD, [RDST, #8]; + + mov x0, #(0); + ret; + +.ltorg +.Ldec_256: + beq .Ldec_192; + + firstdecround(13, RA, RB, RC, RD, RNA, RNB, RNC, RND); + decround(12, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(11, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + decround(10, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + + b .Ldec_tail; + +.ltorg +.Ldec_192: + firstdecround(11, RA, RB, RC, RD, RNA, RNB, RNC, RND); + decround(10, RNA, RNB, RNC, RND, RA, RB, RC, RD, preload_first_key); + decround(9, RA, RB, RC, RD, RNA, RNB, RNC, RND, preload_first_key); + + b .Ldec_tail; +.size _gcry_aes_arm_decrypt_block,.-_gcry_aes_arm_decrypt_block; + +#endif /*HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS*/ +#endif /*__AARCH64EL__ */ diff --git a/cipher/rijndael-arm.S b/cipher/rijndael-arm.S index e3a91c2..e680c81 100644 --- a/cipher/rijndael-arm.S +++ b/cipher/rijndael-arm.S @@ -577,5 +577,5 @@ _gcry_aes_arm_decrypt_block: b .Ldec_tail; .size _gcry_aes_arm_encrypt_block,.-_gcry_aes_arm_encrypt_block; -#endif /*HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS*/ +#endif /*HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS*/ #endif /*__ARMEL__ */ diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h index 7544fa0..340dbc0 100644 --- a/cipher/rijndael-internal.h +++ b/cipher/rijndael-internal.h @@ -58,6 +58,11 @@ # define USE_ARM_ASM 1 # endif #endif +#if defined(__AARCH64EL__) +# ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS +# define USE_ARM_ASM 1 +# endif +#endif /* USE_PADLOCK indicates whether to compile the padlock specific code. */ diff --git a/configure.ac b/configure.ac index 7f415bf..a530f77 100644 --- a/configure.ac +++ b/configure.ac @@ -1073,6 +1073,32 @@ fi # +# Check whether GCC assembler supports features needed for our ARMv8/Aarch64 +# implementations. This needs to be done before setting up the +# assembler stuff. +# +AC_CACHE_CHECK([whether GCC assembler is compatible for ARMv8/Aarch64 assembly implementations], + [gcry_cv_gcc_aarch64_platform_as_ok], + [gcry_cv_gcc_aarch64_platform_as_ok=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[__asm__( + "asmfunc:\n\t" + "eor x0, x0, x30, ror #12;\n\t" + "add x0, x0, x30, asr #12;\n\t" + "eor v0.16b, v0.16b, v31.16b;\n\t" + + /* Test if '.type' and '.size' are supported. 
*/ + ".size asmfunc,.-asmfunc;\n\t" + ".type asmfunc, at function;\n\t" + );]])], + [gcry_cv_gcc_aarch64_platform_as_ok=yes])]) +if test "$gcry_cv_gcc_aarch64_platform_as_ok" = "yes" ; then + AC_DEFINE(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS,1, + [Defined if underlying assembler is compatible with ARMv8/Aarch64 assembly implementations]) +fi + + +# # Check whether underscores in symbols are required. This needs to be # done before setting up the assembler stuff. # @@ -2014,6 +2040,10 @@ if test "$found" = "1" ; then GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-ce.lo" GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-aarch32-ce.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-aarch64.lo" + ;; esac case "$mpi_cpu_arch" in From jussi.kivilinna at iki.fi Sun Sep 4 12:44:11 2016 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 04 Sep 2016 13:44:11 +0300 Subject: [PATCH 4/5] Add ARMv8/AArch64 Crypto Extension implementation of GCM In-Reply-To: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> References: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> Message-ID: <147298585133.4156.14562931309655682394.stgit@localhost6.localdomain6> * cipher/Makefile.am: Add 'cipher-gcm-armv8-aarch64-ce.S'. * cipher/cipher-gcm-armv8-aarch64-ce.S: New. * cipher/cipher-internal.h (GCM_USE_ARM_PMULL): Enable on ARMv8/AArch64. -- Benchmark on Cortex-A53 (1152 Mhz): Before: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 15.54 ns/B 61.36 MiB/s 17.91 c/B After (11.9x faster): | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 1.30 ns/B 731.5 MiB/s 1.50 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index ae9fbca..c31b233 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -43,7 +43,7 @@ libcipher_la_SOURCES = \ cipher.c cipher-internal.h \ cipher-cbc.c cipher-cfb.c cipher-ofb.c cipher-ctr.c cipher-aeswrap.c \ cipher-ccm.c cipher-cmac.c cipher-gcm.c cipher-gcm-intel-pclmul.c \ - cipher-gcm-armv8-aarch32-ce.S \ + cipher-gcm-armv8-aarch32-ce.S cipher-gcm-armv8-aarch64-ce.S \ cipher-poly1305.c cipher-ocb.c \ cipher-selftest.c cipher-selftest.h \ pubkey.c pubkey-internal.h pubkey-util.c \ diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S new file mode 100644 index 0000000..51d67b7 --- /dev/null +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S @@ -0,0 +1,180 @@ +/* cipher-gcm-armv8-aarch64-ce.S - ARM/CE accelerated GHASH + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) + +.arch armv8-a+crypto + +.text + +#define GET_DATA_POINTER(reg, name) \ + adrp reg, :got:name ; \ + ldr reg, [reg, #:got_lo12:name] ; + + +/* Constants */ + +.align 4 +gcry_gcm_reduction_constant: +.Lrconst: + .quad 0x87 + + +/* Register macros */ + +#define rhash v0 +#define rbuf v1 +#define rh0 v2 +#define rr0 v3 +#define rr1 v4 +#define rrconst v5 +#define vT0 v16 +#define vT1 v17 +#define vZZ v18 + +/* GHASH macros */ + +/* See "Gouv?a, C. P. L. & L?pez, J. Implementing GCM on ARMv8. Topics in + * Cryptology ? CT-RSA 2015" for details. + */ + +/* Input: 'a' and 'b', Output: 'r0:r1' (low 128-bits in r0, high in r1) */ +#define PMUL_128x128(r0, r1, a, b, interleave_op) \ + ext vT0.16b, b.16b, b.16b, #8; \ + pmull r0.1q, a.1d, b.1d; \ + pmull2 r1.1q, a.2d, b.2d; \ + pmull vT1.1q, a.1d, vT0.1d; \ + pmull2 vT0.1q, a.2d, vT0.2d; \ + interleave_op(); \ + eor vT0.16b, vT0.16b, vT1.16b; \ + ext vT1.16b, vZZ.16b, vT0.16b, #8; \ + ext vT0.16b, vT0.16b, vZZ.16b, #8; \ + eor r0.16b, r0.16b, vT1.16b; \ + eor r1.16b, r1.16b, vT0.16b; + +/* Input: 'r0:r1', Output: 'a' */ +#define REDUCTION(a, r0, r1, rconst, interleave_op) \ + pmull2 vT0.1q, r1.2d, rconst.2d; \ + interleave_op(); \ + ext vT1.16b, vT0.16b, vZZ.16b, #8; \ + ext vT0.16b, vZZ.16b, vT0.16b, #8; \ + eor r1.16b, r1.16b, vT1.16b; \ + eor r0.16b, r0.16b, vT0.16b; \ + pmull vT0.1q, r1.1d, rconst.1d; \ + eor a.16b, r0.16b, vT0.16b; + +#define _(...) /*_*/ +#define ld1_rbuf() ld1 {rbuf.16b}, [x2], #16; +#define rbit_rbuf() rbit rbuf.16b, rbuf.16b; + +/* Other functional macros */ + +#define CLEAR_REG(reg) eor reg.16b, reg.16b, reg.16b; + + +/* + * unsigned int _gcry_ghash_armv8_ce_pmull (void *gcm_key, byte *result, + * const byte *buf, size_t nblocks, + * void *gcm_table); + */ +.align 3 +.globl _gcry_ghash_armv8_ce_pmull +.type _gcry_ghash_armv8_ce_pmull,%function; +_gcry_ghash_armv8_ce_pmull: + /* input: + * x0: gcm_key + * x1: result/hash + * x2: buf + * x3: nblocks + * x4: gcm_table + */ + cbz x3, .Ldo_nothing; + + GET_DATA_POINTER(x5, .Lrconst) + + sub x3, x3, #1 + + eor vZZ.16b, vZZ.16b, vZZ.16b + ld1 {rhash.16b}, [x1] + ld1 {rh0.16b}, [x0] + + rbit rhash.16b, rhash.16b /* bit-swap */ + ld1r {rrconst.2d}, [x5] + + ld1 {rbuf.16b}, [x2], #16 + + rbit rbuf.16b, rbuf.16b /* bit-swap */ + + eor rhash.16b, rhash.16b, rbuf.16b + + cbz x3, .Lend + +.Loop: + PMUL_128x128(rr0, rr1, rh0, rhash, ld1_rbuf) + sub x3, x3, #1 + REDUCTION(rhash, rr0, rr1, rrconst, rbit_rbuf) + eor rhash.16b, rhash.16b, rbuf.16b + + cbnz x3, .Loop + +.Lend: + PMUL_128x128(rr0, rr1, rh0, rhash, _) + REDUCTION(rhash, rr0, rr1, rrconst, _) + + CLEAR_REG(rr1) + CLEAR_REG(rr0) + rbit rhash.16b, rhash.16b /* bit-swap */ + CLEAR_REG(rbuf) + CLEAR_REG(vT0) + CLEAR_REG(vT1) + CLEAR_REG(rh0) + + st1 {rhash.2d}, [x1] + CLEAR_REG(rhash) + +.Ldo_nothing: + mov x0, #0 + ret +.size _gcry_ghash_armv8_ce_pmull,.-_gcry_ghash_armv8_ce_pmull; + + +/* + * void _gcry_ghash_setup_armv8_ce_pmull (void *gcm_key, void *gcm_table); + */ +.align 3 +.globl _gcry_ghash_setup_armv8_ce_pmull +.type _gcry_ghash_setup_armv8_ce_pmull,%function; +_gcry_ghash_setup_armv8_ce_pmull: + /* input: + * x0: gcm_key + * x1: gcm_table + */ + + ld1 {vT0.16b}, [x0] + rbit vT0.16b, vT0.16b + st1 {vT0.16b}, [x0] + + ret +.size _gcry_ghash_setup_armv8_ce_pmull,.-_gcry_ghash_setup_armv8_ce_pmull; + +#endif diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h 
index 52504f6..01352f3 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -79,6 +79,10 @@ && defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) \ && defined(HAVE_GCC_INLINE_ASM_AARCH32_CRYPTO) # define GCM_USE_ARM_PMULL 1 +# elif defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define GCM_USE_ARM_PMULL 1 # endif #endif /* GCM_USE_ARM_PMULL */ From jussi.kivilinna at iki.fi Sun Sep 4 12:44:06 2016 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 04 Sep 2016 13:44:06 +0300 Subject: [PATCH 3/5] Add ARMv8/AArch64 Crypto Extension implementation of SHA-256 In-Reply-To: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> References: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> Message-ID: <147298584631.4156.9621810164456541675.stgit@localhost6.localdomain6> * cipher/Makefile.am: Add 'sha256-armv8-aarch64-ce.S'. * cipher/sha256-armv8-aarch64-ce.S: New. * cipher/sha256-armv8-aarch32-ce.S: Move round macros to correct section. * cipher/sha256.c (USE_ARM_CE): Enable on ARMv8/AArch64. * configure.ac: Add 'sha256-armv8-aarch64-ce.lo'; Swap places for 'sha512-arm.lo' and 'sha256-armv8-aarch32-ce.lo'. -- Benchmark on Cortex-A53 (1152 Mhz): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 13.34 ns/B 71.51 MiB/s 15.36 c/B After (7.2x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 1.85 ns/B 516.3 MiB/s 2.13 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index d49c01f..ae9fbca 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -92,7 +92,7 @@ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ - sha256-armv8-aarch32-ce.S \ + sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ keccak.c keccak_permute_32.h keccak_permute_64.h keccak-armv7-neon.S \ diff --git a/cipher/sha256-armv8-aarch32-ce.S b/cipher/sha256-armv8-aarch32-ce.S index a0dbcea..2041a23 100644 --- a/cipher/sha256-armv8-aarch32-ce.S +++ b/cipher/sha256-armv8-aarch32-ce.S @@ -93,6 +93,19 @@ gcry_sha256_aarch32_ce_K: #define _(...) 
/*_*/ +#define do_loadk(nk0, nk1) vld1.32 {nk0-nk1},[lr]!; +#define do_add(a, b) vadd.u32 a, a, b; +#define do_sha256su0(w0, w1) sha256su0.32 w0, w1; +#define do_sha256su1(w0, w2, w3) sha256su1.32 w0, w2, w3; + +#define do_rounds(k, nk0, nk1, w0, w1, w2, w3, loadk_fn, add_fn, su0_fn, su1_fn) \ + loadk_fn( nk0, nk1 ); \ + su0_fn( w0, w1 ); \ + vmov qABCD1, qABCD0; \ + sha256h.32 qABCD0, qEFGH, k; \ + sha256h2.32 qEFGH, qABCD1, k; \ + add_fn( nk0, w2 ); \ + su1_fn( w0, w2, w3 ); /* Other functional macros */ @@ -126,20 +139,6 @@ _gcry_sha256_transform_armv8_ce: vld1.32 {qH0123-qH4567}, [r0] /* load state */ -#define do_loadk(nk0, nk1) vld1.32 {nk0-nk1},[lr]!; -#define do_add(a, b) vadd.u32 a, a, b; -#define do_sha256su0(w0, w1) sha256su0.32 w0, w1; -#define do_sha256su1(w0, w2, w3) sha256su1.32 w0, w2, w3; - -#define do_rounds(k, nk0, nk1, w0, w1, w2, w3, loadk_fn, add_fn, su0_fn, su1_fn) \ - loadk_fn( nk0, nk1 ); \ - su0_fn( w0, w1 ); \ - vmov qABCD1, qABCD0; \ - sha256h.32 qABCD0, qEFGH, k; \ - sha256h2.32 qEFGH, qABCD1, k; \ - add_fn( nk0, w2 ); \ - su1_fn( w0, w2, w3 ); \ - vld1.8 {qW0-qW1}, [r1]! do_loadk(qK0, qK1) vld1.8 {qW2-qW3}, [r1]! diff --git a/cipher/sha256-armv8-aarch64-ce.S b/cipher/sha256-armv8-aarch64-ce.S new file mode 100644 index 0000000..257204f --- /dev/null +++ b/cipher/sha256-armv8-aarch64-ce.S @@ -0,0 +1,218 @@ +/* sha256-armv8-aarch64-ce.S - ARM/CE accelerated SHA-256 transform function + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && defined(USE_SHA256) + +.arch armv8-a+crypto + +.text + + +#define GET_DATA_POINTER(reg, name) \ + adrp reg, :got:name ; \ + ldr reg, [reg, #:got_lo12:name] ; + + +/* Constants */ + +.align 4 +gcry_sha256_aarch64_ce_K: +.LK: + .long 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5 + .long 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5 + .long 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3 + .long 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174 + .long 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc + .long 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da + .long 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7 + .long 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967 + .long 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13 + .long 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85 + .long 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3 + .long 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070 + .long 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5 + .long 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3 + .long 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208 + .long 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 + + +/* Register macros */ + +#define vH0123 v0 +#define vH4567 v1 + +#define vABCD0 v2 +#define qABCD0 q2 +#define vABCD1 v3 +#define qABCD1 q3 +#define vEFGH v4 +#define qEFGH q4 + +#define vT0 v5 +#define vT1 v6 + +#define vW0 v16 +#define vW1 v17 +#define vW2 v18 +#define vW3 v19 + +#define vK0 v20 +#define vK1 v21 +#define vK2 v22 +#define vK3 v23 + + +/* Round macros */ + +#define _(...) /*_*/ + +#define do_loadk(nk0, nk1) ld1 {nk0.16b-nk1.16b},[x3],#32; +#define do_add(a, b) add a.4s, a.4s, b.4s; +#define do_sha256su0(w0, w1) sha256su0 w0.4s, w1.4s; +#define do_sha256su1(w0, w2, w3) sha256su1 w0.4s, w2.4s, w3.4s; + +#define do_rounds(k, nk0, nk1, w0, w1, w2, w3, loadk_fn, add_fn, su0_fn, su1_fn) \ + loadk_fn( v##nk0, v##nk1 ); \ + su0_fn( v##w0, v##w1 ); \ + mov vABCD1.16b, vABCD0.16b; \ + sha256h qABCD0, qEFGH, v##k.4s; \ + sha256h2 qEFGH, qABCD1, v##k.4s; \ + add_fn( v##nk0, v##w2 ); \ + su1_fn( v##w0, v##w2, v##w3 ); + + +/* Other functional macros */ + +#define CLEAR_REG(reg) eor reg.16b, reg.16b, reg.16b; + + +/* + * unsigned int + * _gcry_sha256_transform_armv8_ce (u32 state[8], const void *input_data, + * size_t num_blks) + */ +.align 3 +.globl _gcry_sha256_transform_armv8_ce +.type _gcry_sha256_transform_armv8_ce,%function; +_gcry_sha256_transform_armv8_ce: + /* input: + * r0: ctx, CTX + * r1: data (64*nblks bytes) + * r2: nblks + */ + + cbz x2, .Ldo_nothing; + + GET_DATA_POINTER(x3, .LK); + mov x4, x3 + + ld1 {vH0123.4s-vH4567.4s}, [x0] /* load state */ + + ld1 {vW0.16b-vW1.16b}, [x1], #32 + do_loadk(vK0, vK1) + ld1 {vW2.16b-vW3.16b}, [x1], #32 + mov vABCD0.16b, vH0123.16b + mov vEFGH.16b, vH4567.16b + + rev32 vW0.16b, vW0.16b + rev32 vW1.16b, vW1.16b + rev32 vW2.16b, vW2.16b + do_add(vK0, vW0) + rev32 vW3.16b, vW3.16b + do_add(vK1, vW1) + +.Loop: + do_rounds(K0, K2, K3, W0, W1, W2, W3, do_loadk, do_add, do_sha256su0, do_sha256su1) + sub x2,x2,#1 + do_rounds(K1, K3, _ , W1, W2, W3, W0, _ , do_add, do_sha256su0, do_sha256su1) + do_rounds(K2, K0, K1, W2, W3, W0, W1, do_loadk, do_add, do_sha256su0, do_sha256su1) + do_rounds(K3, K1, _ , W3, W0, W1, W2, _ , do_add, do_sha256su0, do_sha256su1) + + do_rounds(K0, K2, K3, W0, W1, W2, W3, do_loadk, do_add, do_sha256su0, do_sha256su1) + do_rounds(K1, K3, _ , W1, W2, W3, W0, _ , do_add, 
do_sha256su0, do_sha256su1) + do_rounds(K2, K0, K1, W2, W3, W0, W1, do_loadk, do_add, do_sha256su0, do_sha256su1) + do_rounds(K3, K1, _ , W3, W0, W1, W2, _ , do_add, do_sha256su0, do_sha256su1) + + do_rounds(K0, K2, K3, W0, W1, W2, W3, do_loadk, do_add, do_sha256su0, do_sha256su1) + do_rounds(K1, K3, _ , W1, W2, W3, W0, _ , do_add, do_sha256su0, do_sha256su1) + do_rounds(K2, K0, K1, W2, W3, W0, W1, do_loadk, do_add, do_sha256su0, do_sha256su1) + do_rounds(K3, K1, _ , W3, W0, W1, W2, _ , do_add, do_sha256su0, do_sha256su1) + + cbz x2, .Lend + + do_rounds(K0, K2, K3, W0, _ , W2, W3, do_loadk, do_add, _, _) + ld1 {vW0.16b}, [x1], #16 + mov x3, x4 + do_rounds(K1, K3, _ , W1, _ , W3, _ , _ , do_add, _, _) + ld1 {vW1.16b}, [x1], #16 + rev32 vW0.16b, vW0.16b + do_rounds(K2, K0, K1, W2, _ , W0, _ , do_loadk, do_add, _, _) + rev32 vW1.16b, vW1.16b + ld1 {vW2.16b}, [x1], #16 + do_rounds(K3, K1, _ , W3, _ , W1, _ , _ , do_add, _, _) + ld1 {vW3.16b}, [x1], #16 + + do_add(vH0123, vABCD0) + do_add(vH4567, vEFGH) + + rev32 vW2.16b, vW2.16b + mov vABCD0.16b, vH0123.16b + rev32 vW3.16b, vW3.16b + mov vEFGH.16b, vH4567.16b + + b .Loop + +.Lend: + + do_rounds(K0, K2, K3, W0, _ , W2, W3, do_loadk, do_add, _, _) + do_rounds(K1, K3, _ , W1, _ , W3, _ , _ , do_add, _, _) + do_rounds(K2, _ , _ , W2, _ , _ , _ , _ , _, _, _) + do_rounds(K3, _ , _ , W3, _ , _ , _ , _ , _, _, _) + + CLEAR_REG(vW0) + CLEAR_REG(vW1) + CLEAR_REG(vW2) + CLEAR_REG(vW3) + CLEAR_REG(vK0) + CLEAR_REG(vK1) + CLEAR_REG(vK2) + CLEAR_REG(vK3) + + do_add(vH0123, vABCD0) + do_add(vH4567, vEFGH) + + CLEAR_REG(vABCD0) + CLEAR_REG(vABCD1) + CLEAR_REG(vEFGH) + + st1 {vH0123.4s-vH4567.4s}, [x0] /* store state */ + + CLEAR_REG(vH0123) + CLEAR_REG(vH4567) + +.Ldo_nothing: + mov x0, #0 + ret +.size _gcry_sha256_transform_armv8_ce,.-_gcry_sha256_transform_armv8_ce; + +#endif diff --git a/cipher/sha256.c b/cipher/sha256.c index 72818ce..b450a12 100644 --- a/cipher/sha256.c +++ b/cipher/sha256.c @@ -83,6 +83,10 @@ && defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) \ && defined(HAVE_GCC_INLINE_ASM_AARCH32_CRYPTO) # define USE_ARM_CE 1 +# elif defined(__AARCH64EL__) \ + && defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) \ + && defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 # endif #endif diff --git a/configure.ac b/configure.ac index a617992..9049db7 100644 --- a/configure.ac +++ b/configure.ac @@ -2292,7 +2292,11 @@ if test "$found" = "1" ; then ;; arm*-*-*) # Build with the assembly implementation - GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha512-arm.lo" + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch32-ce.lo" + ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch64-ce.lo" ;; esac fi @@ -2311,7 +2315,7 @@ if test "$found" = "1" ; then ;; arm*-*-*) # Build with the assembly implementation - GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch32-ce.lo" + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha512-arm.lo" ;; esac From jussi.kivilinna at iki.fi Sun Sep 4 12:44:01 2016 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 04 Sep 2016 13:44:01 +0300 Subject: [PATCH 2/5] Add ARMv8/AArch64 Crypto Extension implementation of SHA-1 In-Reply-To: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> References: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> Message-ID: <147298584128.4156.5269432424425610696.stgit@localhost6.localdomain6> * cipher/Makefile.am: Add 'sha1-armv8-aarch64-ce.S'. * cipher/sha1-armv8-aarch64-ce.S: New. 
* cipher/sha1.c (USE_ARM_CE): Enable on ARMv8/AArch64. * configure.ac: Add 'sha1-armv8-aarch64-ce.lo'. -- Benchmark on Cortex-A53 (1152 Mhz): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 7.54 ns/B 126.4 MiB/s 8.69 c/B After (4.3x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 1.72 ns/B 553.0 MiB/s 1.99 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index c555f81..d49c01f 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -90,7 +90,7 @@ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ - sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ + sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ sha256-armv8-aarch32-ce.S \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ diff --git a/cipher/sha1-armv8-aarch64-ce.S b/cipher/sha1-armv8-aarch64-ce.S new file mode 100644 index 0000000..dcc33a3 --- /dev/null +++ b/cipher/sha1-armv8-aarch64-ce.S @@ -0,0 +1,204 @@ +/* sha1-armv8-aarch64-ce.S - ARM/CE accelerated SHA-1 transform function + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) && defined(USE_SHA1) + +.arch armv8-a+crypto + +.text + + +#define GET_DATA_POINTER(reg, name) \ + adrp reg, :got:name ; \ + ldr reg, [reg, #:got_lo12:name] ; + + +/* Constants */ + +#define K1 0x5A827999 +#define K2 0x6ED9EBA1 +#define K3 0x8F1BBCDC +#define K4 0xCA62C1D6 +.align 4 +gcry_sha1_aarch64_ce_K_VEC: +.LK_VEC: +.LK1: .long K1, K1, K1, K1 +.LK2: .long K2, K2, K2, K2 +.LK3: .long K3, K3, K3, K3 +.LK4: .long K4, K4, K4, K4 + + +/* Register macros */ + +#define sH4 s0 +#define vH4 v0 +#define vH0123 v1 + +#define qABCD q2 +#define sABCD s2 +#define vABCD v2 +#define sE0 s3 +#define vE0 v3 +#define sE1 s4 +#define vE1 v4 + +#define vT0 v5 +#define vT1 v6 + +#define vW0 v16 +#define vW1 v17 +#define vW2 v18 +#define vW3 v19 + +#define vK1 v20 +#define vK2 v21 +#define vK3 v22 +#define vK4 v23 + + +/* Round macros */ + +#define _(...) 
/*_*/ +#define do_add(dst, src0, src1) add dst.4s, src0.4s, src1.4s; +#define do_sha1su0(w0,w1,w2) sha1su0 w0.4s,w1.4s,w2.4s; +#define do_sha1su1(w0,w3) sha1su1 w0.4s,w3.4s; + +#define do_rounds(f, e0, e1, t, k, w0, w1, w2, w3, add_fn, sha1su0_fn, sha1su1_fn) \ + sha1su1_fn( v##w3, v##w2 ); \ + sha1h e0, sABCD; \ + sha1##f qABCD, e1, v##t.4s; \ + add_fn( v##t, v##w2, v##k ); \ + sha1su0_fn( v##w0, v##w1, v##w2 ); + + +/* Other functional macros */ + +#define CLEAR_REG(reg) eor reg.16b, reg.16b, reg.16b; + + +/* + * unsigned int + * _gcry_sha1_transform_armv8_ce (void *ctx, const unsigned char *data, + * size_t nblks) + */ +.align 3 +.globl _gcry_sha1_transform_armv8_ce +.type _gcry_sha1_transform_armv8_ce,%function; +_gcry_sha1_transform_armv8_ce: + /* input: + * x0: ctx, CTX + * x1: data (64*nblks bytes) + * x2: nblks + */ + + cbz x2, .Ldo_nothing; + + GET_DATA_POINTER(x4, .LK_VEC); + + ld1 {vH0123.4s}, [x0] /* load h0,h1,h2,h3 */ + ld1 {vK1.4s-vK4.4s}, [x4] /* load K1,K2,K3,K4 */ + ldr sH4, [x0, #16] /* load h4 */ + + ld1 {vW0.16b-vW3.16b}, [x1], #64 + mov vABCD.16b, vH0123.16b + + rev32 vW0.16b, vW0.16b + rev32 vW1.16b, vW1.16b + rev32 vW2.16b, vW2.16b + do_add(vT0, vW0, vK1) + rev32 vW3.16b, vW3.16b + do_add(vT1, vW1, vK1) + +.Loop: + do_rounds(c, sE1, sH4, T0, K1, W0, W1, W2, W3, do_add, do_sha1su0, _) + sub x2, x2, #1 + do_rounds(c, sE0, sE1, T1, K1, W1, W2, W3, W0, do_add, do_sha1su0, do_sha1su1) + do_rounds(c, sE1, sE0, T0, K1, W2, W3, W0, W1, do_add, do_sha1su0, do_sha1su1) + do_rounds(c, sE0, sE1, T1, K2, W3, W0, W1, W2, do_add, do_sha1su0, do_sha1su1) + do_rounds(c, sE1, sE0, T0, K2, W0, W1, W2, W3, do_add, do_sha1su0, do_sha1su1) + + do_rounds(p, sE0, sE1, T1, K2, W1, W2, W3, W0, do_add, do_sha1su0, do_sha1su1) + do_rounds(p, sE1, sE0, T0, K2, W2, W3, W0, W1, do_add, do_sha1su0, do_sha1su1) + do_rounds(p, sE0, sE1, T1, K2, W3, W0, W1, W2, do_add, do_sha1su0, do_sha1su1) + do_rounds(p, sE1, sE0, T0, K3, W0, W1, W2, W3, do_add, do_sha1su0, do_sha1su1) + do_rounds(p, sE0, sE1, T1, K3, W1, W2, W3, W0, do_add, do_sha1su0, do_sha1su1) + + do_rounds(m, sE1, sE0, T0, K3, W2, W3, W0, W1, do_add, do_sha1su0, do_sha1su1) + do_rounds(m, sE0, sE1, T1, K3, W3, W0, W1, W2, do_add, do_sha1su0, do_sha1su1) + do_rounds(m, sE1, sE0, T0, K3, W0, W1, W2, W3, do_add, do_sha1su0, do_sha1su1) + do_rounds(m, sE0, sE1, T1, K4, W1, W2, W3, W0, do_add, do_sha1su0, do_sha1su1) + do_rounds(m, sE1, sE0, T0, K4, W2, W3, W0, W1, do_add, do_sha1su0, do_sha1su1) + + do_rounds(p, sE0, sE1, T1, K4, W3, W0, W1, W2, do_add, do_sha1su0, do_sha1su1) + cbz x2, .Lend + + ld1 {vW0.16b-vW1.16b}, [x1], #32 /* preload */ + do_rounds(p, sE1, sE0, T0, K4, _ , _ , W2, W3, do_add, _, do_sha1su1) + rev32 vW0.16b, vW0.16b + ld1 {vW2.16b}, [x1], #16 + rev32 vW1.16b, vW1.16b + do_rounds(p, sE0, sE1, T1, K4, _ , _ , W3, _ , do_add, _, _) + ld1 {vW3.16b}, [x1], #16 + rev32 vW2.16b, vW2.16b + do_rounds(p, sE1, sE0, T0, _, _, _, _, _, _, _, _) + rev32 vW3.16b, vW3.16b + do_rounds(p, sE0, sE1, T1, _, _, _, _, _, _, _, _) + + do_add(vT0, vW0, vK1) + add vH4.2s, vH4.2s, vE0.2s + add vABCD.4s, vABCD.4s, vH0123.4s + do_add(vT1, vW1, vK1) + + mov vH0123.16b, vABCD.16b + + b .Loop + +.Lend: + do_rounds(p, sE1, sE0, T0, K4, _ , _ , W2, W3, do_add, _, do_sha1su1) + do_rounds(p, sE0, sE1, T1, K4, _ , _ , W3, _ , do_add, _, _) + do_rounds(p, sE1, sE0, T0, _, _, _, _, _, _, _, _) + do_rounds(p, sE0, sE1, T1, _, _, _, _, _, _, _, _) + + add vH4.2s, vH4.2s, vE0.2s + add vH0123.4s, vH0123.4s, vABCD.4s + + CLEAR_REG(vW0) + CLEAR_REG(vW1) + 
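/* The CLEAR_REG invocations in this epilogue zeroize, via
   eor reg,reg,reg, every SIMD register that held message- or
   state-derived data, so nothing sensitive survives the call. */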
CLEAR_REG(vW2) + CLEAR_REG(vW3) + CLEAR_REG(vABCD) + CLEAR_REG(vE1) + CLEAR_REG(vE0) + + str sH4, [x0, #16] /* store h4 */ + st1 {vH0123.4s}, [x0] /* store h0,h1,h2,h3 */ + + CLEAR_REG(vH0123) + CLEAR_REG(vH4) + +.Ldo_nothing: + mov x0, #0 + ret +.size _gcry_sha1_transform_armv8_ce,.-_gcry_sha1_transform_armv8_ce; + +#endif diff --git a/cipher/sha1.c b/cipher/sha1.c index e0b68b2..78b172f 100644 --- a/cipher/sha1.c +++ b/cipher/sha1.c @@ -86,6 +86,10 @@ && defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) \ && defined(HAVE_GCC_INLINE_ASM_AARCH32_CRYPTO) # define USE_ARM_CE 1 +# elif defined(__AARCH64EL__) \ + && defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) \ + && defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 # endif #endif diff --git a/configure.ac b/configure.ac index a530f77..a617992 100644 --- a/configure.ac +++ b/configure.ac @@ -2375,6 +2375,10 @@ case "${host}" in GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha1-armv7-neon.lo" GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha1-armv8-aarch32-ce.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha1-armv8-aarch64-ce.lo" + ;; esac LIST_MEMBER(scrypt, $enabled_kdfs) From jussi.kivilinna at iki.fi Sun Sep 4 12:44:16 2016 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 04 Sep 2016 13:44:16 +0300 Subject: [PATCH 5/5] Add ARMv8/AArch64 Crypto Extension implementation of AES In-Reply-To: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> References: <147298583621.4156.8421131540663910552.stgit@localhost6.localdomain6> Message-ID: <147298585636.4156.9979305355179327914.stgit@localhost6.localdomain6> * cipher/Makefile.am: Add 'rijndael-armv-aarch64-ce.S'. * cipher/rijndael-armv8-aarch64-ce.S: New. * cipher/rijndael-internal.h (USE_ARM_CE): Enable for ARMv8/AArch64. * configure.ac: Add 'rijndael-armv-aarch64-ce.lo' and 'rijndael-armv8-ce.lo' for ARMv8/AArch64. 
--

Improvement vs AArch64 assembly on Cortex-A53:

           AES-128  AES-192  AES-256
 CBC enc:   13.19x   13.53x   13.76x
 CBC dec:   20.53x   21.91x   22.60x
 CFB enc:   14.29x   14.50x   14.63x
 CFB dec:   20.42x   21.69x   22.50x
 CTR:       18.29x   19.61x   20.53x
 OCB enc:   15.21x   16.32x   17.12x
 OCB dec:   14.95x   16.11x   16.88x
 OCB auth:  16.73x   17.93x   18.66x
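As context for the numbers below: the Crypto Extensions execute a full
AES round in two instructions, AESE (AddRoundKey, SubBytes and
ShiftRows) and AESMC (MixColumns), which is what the patch's
do_aes_one128 macro unrolls.  A rough C-intrinsics equivalent (a
sketch using ACLE intrinsics, not the patch's code; assumes the 11
expanded round keys are already in rk[]):

    #include <arm_neon.h>

    /* AES-128 single-block encryption via the ARMv8 Crypto
     * Extensions; compile with -march=armv8-a+crypto.  */
    static uint8x16_t
    aes128_encrypt_block_ce (const uint8x16_t rk[11], uint8x16_t b)
    {
      int i;
      for (i = 0; i < 9; i++)
        {
          b = vaeseq_u8 (b, rk[i]);  /* AddRoundKey+SubBytes+ShiftRows */
          b = vaesmcq_u8 (b);        /* MixColumns */
        }
      b = vaeseq_u8 (b, rk[9]);      /* last round omits MixColumns */
      return veorq_u8 (b, rk[10]);   /* final AddRoundKey */
    }

The modes with the biggest gains (CBC decryption, CFB decryption, CTR)
are exactly those that can pipeline several independent blocks at
once; CBC and CFB encryption remain serial across blocks, and OFB
cannot use the parallel path at all.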
After:
 Cipher:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |   4.83 ns/B  197.5 MiB/s   5.56 c/B
        ECB dec |   4.99 ns/B  191.1 MiB/s   5.75 c/B
        CBC enc |   1.41 ns/B  675.5 MiB/s   1.63 c/B
        CBC dec |  0.911 ns/B 1046.9 MiB/s   1.05 c/B
        CFB enc |   1.30 ns/B  732.2 MiB/s   1.50 c/B
        CFB dec |  0.911 ns/B 1046.7 MiB/s   1.05 c/B
        OFB enc |   5.81 ns/B  164.3 MiB/s   6.69 c/B
        OFB dec |   5.81 ns/B  164.3 MiB/s   6.69 c/B
        CTR enc |   1.03 ns/B  924.0 MiB/s   1.19 c/B
        CTR dec |   1.03 ns/B  924.1 MiB/s   1.19 c/B
        CCM enc |   2.50 ns/B  381.8 MiB/s   2.88 c/B
        CCM dec |   2.50 ns/B  381.7 MiB/s   2.88 c/B
       CCM auth |   1.57 ns/B  606.1 MiB/s   1.81 c/B
        GCM enc |   2.33 ns/B  408.5 MiB/s   2.69 c/B
        GCM dec |   2.34 ns/B  408.4 MiB/s   2.69 c/B
       GCM auth |   1.30 ns/B  732.1 MiB/s   1.50 c/B
        OCB enc |   1.29 ns/B  736.6 MiB/s   1.49 c/B
        OCB dec |   1.32 ns/B  724.4 MiB/s   1.52 c/B
       OCB auth |   1.16 ns/B  819.6 MiB/s   1.34 c/B
                =
 AES192         |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |   5.48 ns/B  174.0 MiB/s   6.31 c/B
        ECB dec |   5.64 ns/B  169.0 MiB/s   6.50 c/B
        CBC enc |   1.63 ns/B  585.8 MiB/s   1.88 c/B
        CBC dec |   1.02 ns/B  935.8 MiB/s   1.17 c/B
        CFB enc |   1.52 ns/B  627.7 MiB/s   1.75 c/B
        CFB dec |   1.02 ns/B  935.9 MiB/s   1.17 c/B
        OFB enc |   6.46 ns/B  147.7 MiB/s   7.44 c/B
        OFB dec |   6.46 ns/B  147.7 MiB/s   7.44 c/B
        CTR enc |   1.14 ns/B  836.1 MiB/s   1.31 c/B
        CTR dec |   1.14 ns/B  835.9 MiB/s   1.31 c/B
        CCM enc |   2.83 ns/B  337.6 MiB/s   3.25 c/B
        CCM dec |   2.82 ns/B  338.0 MiB/s   3.25 c/B
       CCM auth |   1.79 ns/B  532.7 MiB/s   2.06 c/B
        GCM enc |   2.44 ns/B  390.3 MiB/s   2.82 c/B
        GCM dec |   2.44 ns/B  390.2 MiB/s   2.82 c/B
       GCM auth |   1.30 ns/B  731.9 MiB/s   1.50 c/B
        OCB enc |   1.41 ns/B  674.7 MiB/s   1.63 c/B
        OCB dec |   1.44 ns/B  662.0 MiB/s   1.66 c/B
       OCB auth |   1.28 ns/B  746.1 MiB/s   1.47 c/B
                =
 AES256         |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |   6.13 ns/B  155.5 MiB/s   7.06 c/B
        ECB dec |   6.29 ns/B  151.5 MiB/s   7.25 c/B
        CBC enc |   1.85 ns/B  516.8 MiB/s   2.13 c/B
        CBC dec |   1.13 ns/B  845.6 MiB/s   1.30 c/B
        CFB enc |   1.74 ns/B  549.5 MiB/s   2.00 c/B
        CFB dec |   1.13 ns/B  846.1 MiB/s   1.30 c/B
        OFB enc |   7.11 ns/B  134.2 MiB/s   8.19 c/B
        OFB dec |   7.11 ns/B  134.2 MiB/s   8.19 c/B
        CTR enc |   1.25 ns/B  763.5 MiB/s   1.44 c/B
        CTR dec |   1.25 ns/B  763.4 MiB/s   1.44 c/B
        CCM enc |   3.15 ns/B  302.9 MiB/s   3.63 c/B
        CCM dec |   3.15 ns/B  302.9 MiB/s   3.63 c/B
       CCM auth |   2.01 ns/B  474.2 MiB/s   2.32 c/B
        GCM enc |   2.55 ns/B  374.2 MiB/s   2.94 c/B
        GCM dec |   2.55 ns/B  373.7 MiB/s   2.94 c/B
       GCM auth |   1.30 ns/B  732.2 MiB/s   1.50 c/B
        OCB enc |   1.54 ns/B  617.6 MiB/s   1.78 c/B
        OCB dec |   1.57 ns/B  606.8 MiB/s   1.81 c/B
       OCB auth |   1.40 ns/B  679.8 MiB/s   1.62 c/B
                =

Signed-off-by: Jussi Kivilinna
---
0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am index c31b233..db606ca 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -81,7 +81,7 @@ md5.c \ poly1305-sse2-amd64.S poly1305-avx2-amd64.S poly1305-armv7-neon.S \ rijndael.c rijndael-internal.h rijndael-tables.h rijndael-aesni.c \ rijndael-padlock.c rijndael-amd64.S rijndael-arm.S rijndael-ssse3-amd64.c \ - rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \ + rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S rijndael-armv8-aarch64-ce.S \ rijndael-aarch64.S \ rmd160.c \ rsa.c \ diff --git a/cipher/rijndael-armv8-aarch32-ce.S b/cipher/rijndael-armv8-aarch32-ce.S index f3b5400..bf68f20 100644 --- a/cipher/rijndael-armv8-aarch32-ce.S +++ b/cipher/rijndael-armv8-aarch32-ce.S @@ -1,4 +1,4 @@ -/* ARMv8 CE accelerated AES +/* rijndael-armv8-aarch32-ce.S - ARMv8/CE accelerated AES * Copyright (C) 2016 Jussi Kivilinna * * This file is part of Libgcrypt. diff --git a/cipher/rijndael-armv8-aarch64-ce.S b/cipher/rijndael-armv8-aarch64-ce.S new file mode 100644 index 0000000..21d0aec --- /dev/null +++ b/cipher/rijndael-armv8-aarch64-ce.S @@ -0,0 +1,1265 @@ +/* rijndael-armv8-aarch64-ce.S - ARMv8/CE accelerated AES + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. 
+ * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#if defined(__AARCH64EL__) && \ + defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) && \ + defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) + +.arch armv8-a+crypto + +.text + + +#if (SIZEOF_VOID_P == 4) + #define ptr8 w8 + #define ptr9 w9 + #define ptr10 w10 + #define ptr11 w11 + #define ptr_sz 4 +#elif (SIZEOF_VOID_P == 8) + #define ptr8 x8 + #define ptr9 x9 + #define ptr10 x10 + #define ptr11 x11 + #define ptr_sz 8 +#else + #error "missing SIZEOF_VOID_P" +#endif + + +#define GET_DATA_POINTER(reg, name) \ + adrp reg, :got:name ; \ + ldr reg, [reg, #:got_lo12:name] ; + + +/* Register macros */ + +#define vk0 v17 +#define vk1 v18 +#define vk2 v19 +#define vk3 v20 +#define vk4 v21 +#define vk5 v22 +#define vk6 v23 +#define vk7 v24 +#define vk8 v25 +#define vk9 v26 +#define vk10 v27 +#define vk11 v28 +#define vk12 v29 +#define vk13 v30 +#define vk14 v31 + + +/* AES macros */ + +#define aes_preload_keys(keysched, nrounds) \ + cmp nrounds, #12; \ + ld1 {vk0.16b-vk3.16b}, [keysched], #64; \ + ld1 {vk4.16b-vk7.16b}, [keysched], #64; \ + ld1 {vk8.16b-vk10.16b}, [keysched], #48; \ + b.lo 1f; \ + ld1 {vk11.16b-vk12.16b}, [keysched], #32; \ + b.eq 1f; \ + ld1 {vk13.16b-vk14.16b}, [keysched]; \ +1: ; + +#define do_aes_one128(ed, mcimc, vo, vb) \ + aes##ed vb.16b, vk0.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk1.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk2.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk3.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk4.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk5.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk6.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk7.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk8.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk9.16b; \ + eor vo.16b, vb.16b, vk10.16b; + +#define do_aes_one192(ed, mcimc, vo, vb) \ + aes##ed vb.16b, vk0.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk1.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk2.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk3.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk4.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk5.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk6.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk7.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk8.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk9.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk10.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk11.16b; \ + eor vo.16b, vb.16b, vk12.16b; + +#define do_aes_one256(ed, mcimc, vo, vb) \ + aes##ed vb.16b, vk0.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk1.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk2.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk3.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk4.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk5.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk6.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk7.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed 
vb.16b, vk8.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk9.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk10.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk11.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk12.16b; \ + aes##mcimc vb.16b, vb.16b; \ + aes##ed vb.16b, vk13.16b; \ + eor vo.16b, vb.16b, vk14.16b; + +#define aes_round_4(ed, mcimc, b0, b1, b2, b3, key) \ + aes##ed b0.16b, key.16b; \ + aes##mcimc b0.16b, b0.16b; \ + aes##ed b1.16b, key.16b; \ + aes##mcimc b1.16b, b1.16b; \ + aes##ed b2.16b, key.16b; \ + aes##mcimc b2.16b, b2.16b; \ + aes##ed b3.16b, key.16b; \ + aes##mcimc b3.16b, b3.16b; + +#define aes_lastround_4(ed, b0, b1, b2, b3, key1, key2) \ + aes##ed b0.16b, key1.16b; \ + eor b0.16b, b0.16b, key2.16b; \ + aes##ed b1.16b, key1.16b; \ + eor b1.16b, b1.16b, key2.16b; \ + aes##ed b2.16b, key1.16b; \ + eor b2.16b, b2.16b, key2.16b; \ + aes##ed b3.16b, key1.16b; \ + eor b3.16b, b3.16b, key2.16b; + +#define do_aes_4_128(ed, mcimc, b0, b1, b2, b3) \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk0); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk1); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk2); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk3); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk4); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk5); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk6); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk7); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk8); \ + aes_lastround_4(ed, b0, b1, b2, b3, vk9, vk10); + +#define do_aes_4_192(ed, mcimc, b0, b1, b2, b3) \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk0); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk1); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk2); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk3); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk4); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk5); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk6); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk7); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk8); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk9); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk10); \ + aes_lastround_4(ed, b0, b1, b2, b3, vk11, vk12); + +#define do_aes_4_256(ed, mcimc, b0, b1, b2, b3) \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk0); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk1); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk2); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk3); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk4); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk5); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk6); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk7); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk8); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk9); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk10); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk11); \ + aes_round_4(ed, mcimc, b0, b1, b2, b3, vk12); \ + aes_lastround_4(ed, b0, b1, b2, b3, vk13, vk14); + + +/* Other functional macros */ + +#define CLEAR_REG(reg) eor reg.16b, reg.16b, reg.16b; + +#define aes_clear_keys(nrounds) \ + cmp nrounds, #12; \ + CLEAR_REG(vk0); \ + CLEAR_REG(vk1); \ + CLEAR_REG(vk2); \ + CLEAR_REG(vk3); \ + CLEAR_REG(vk4); \ + CLEAR_REG(vk5); \ + CLEAR_REG(vk6); \ + CLEAR_REG(vk7); \ + CLEAR_REG(vk9); \ + CLEAR_REG(vk8); \ + CLEAR_REG(vk10); \ + b.lo 1f; \ + CLEAR_REG(vk11); \ + CLEAR_REG(vk12); \ + b.eq 1f; \ + CLEAR_REG(vk13); \ + CLEAR_REG(vk14); \ +1: ; + + +/* + * unsigned int _gcry_aes_enc_armv8_ce(void *keysched, byte *dst, + * const byte *src, + * unsigned int nrounds); + */ +.align 3 +.globl _gcry_aes_enc_armv8_ce 
+.type _gcry_aes_enc_armv8_ce,%function; +_gcry_aes_enc_armv8_ce: + /* input: + * x0: keysched + * x1: dst + * x2: src + * w3: nrounds + */ + + aes_preload_keys(x0, w3); + + ld1 {v0.16b}, [x2] + + b.hi .Lenc1_256 + b.eq .Lenc1_192 + +.Lenc1_128: + do_aes_one128(e, mc, v0, v0); + +.Lenc1_tail: + CLEAR_REG(vk0) + CLEAR_REG(vk1) + CLEAR_REG(vk2) + CLEAR_REG(vk3) + CLEAR_REG(vk4) + CLEAR_REG(vk5) + CLEAR_REG(vk6) + CLEAR_REG(vk7) + CLEAR_REG(vk8) + CLEAR_REG(vk9) + CLEAR_REG(vk10) + st1 {v0.16b}, [x1] + CLEAR_REG(v0) + + mov x0, #0 + ret + +.Lenc1_192: + do_aes_one192(e, mc, v0, v0); + + CLEAR_REG(vk11) + CLEAR_REG(vk12) + b .Lenc1_tail + +.Lenc1_256: + do_aes_one256(e, mc, v0, v0); + + CLEAR_REG(vk11) + CLEAR_REG(vk12) + CLEAR_REG(vk13) + CLEAR_REG(vk14) + b .Lenc1_tail +.size _gcry_aes_enc_armv8_ce,.-_gcry_aes_enc_armv8_ce; + + +/* + * unsigned int _gcry_aes_dec_armv8_ce(void *keysched, byte *dst, + * const byte *src, + * unsigned int nrounds); + */ +.align 3 +.globl _gcry_aes_dec_armv8_ce +.type _gcry_aes_dec_armv8_ce,%function; +_gcry_aes_dec_armv8_ce: + /* input: + * x0: keysched + * x1: dst + * x2: src + * w3: nrounds + */ + + aes_preload_keys(x0, w3); + + ld1 {v0.16b}, [x2] + + b.hi .Ldec1_256 + b.eq .Ldec1_192 + +.Ldec1_128: + do_aes_one128(d, imc, v0, v0); + +.Ldec1_tail: + CLEAR_REG(vk0) + CLEAR_REG(vk1) + CLEAR_REG(vk2) + CLEAR_REG(vk3) + CLEAR_REG(vk4) + CLEAR_REG(vk5) + CLEAR_REG(vk6) + CLEAR_REG(vk7) + CLEAR_REG(vk8) + CLEAR_REG(vk9) + CLEAR_REG(vk10) + st1 {v0.16b}, [x1] + CLEAR_REG(v0) + + mov x0, #0 + ret + +.Ldec1_192: + do_aes_one192(d, imc, v0, v0); + + CLEAR_REG(vk11) + CLEAR_REG(vk12) + b .Ldec1_tail + +.Ldec1_256: + do_aes_one256(d, imc, v0, v0); + + CLEAR_REG(vk11) + CLEAR_REG(vk12) + CLEAR_REG(vk13) + CLEAR_REG(vk14) + b .Ldec1_tail +.size _gcry_aes_dec_armv8_ce,.-_gcry_aes_dec_armv8_ce; + + +/* + * void _gcry_aes_cbc_enc_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *iv, size_t nblocks, + * int cbc_mac, unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_cbc_enc_armv8_ce +.type _gcry_aes_cbc_enc_armv8_ce,%function; +_gcry_aes_cbc_enc_armv8_ce: + /* input: + * x0: keysched + * x1: outbuf + * x2: inbuf + * x3: iv + * x4: nblocks + * w5: cbc_mac + * w6: nrounds + */ + + cbz x4, .Lcbc_enc_skip + + cmp w5, #0 + ld1 {v1.16b}, [x3] /* load IV */ + cset x5, eq + + aes_preload_keys(x0, w6); + lsl x5, x5, #4 + + b.eq .Lcbc_enc_loop192 + b.hi .Lcbc_enc_loop256 + +#define CBC_ENC(bits) \ + .Lcbc_enc_loop##bits: \ + ld1 {v0.16b}, [x2], #16; /* load plaintext */ \ + eor v1.16b, v0.16b, v1.16b; \ + sub x4, x4, #1; \ + \ + do_aes_one##bits(e, mc, v1, v1); \ + \ + st1 {v1.16b}, [x1], x5; /* store ciphertext */ \ + \ + cbnz x4, .Lcbc_enc_loop##bits; \ + b .Lcbc_enc_done; + + CBC_ENC(128) + CBC_ENC(192) + CBC_ENC(256) + +#undef CBC_ENC + +.Lcbc_enc_done: + aes_clear_keys(w6) + + st1 {v1.16b}, [x3] /* store IV */ + + CLEAR_REG(v1) + CLEAR_REG(v0) + +.Lcbc_enc_skip: + ret +.size _gcry_aes_cbc_enc_armv8_ce,.-_gcry_aes_cbc_enc_armv8_ce; + +/* + * void _gcry_aes_cbc_dec_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *iv, unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_cbc_dec_armv8_ce +.type _gcry_aes_cbc_dec_armv8_ce,%function; +_gcry_aes_cbc_dec_armv8_ce: + /* input: + * x0: keysched + * x1: outbuf + * x2: inbuf + * x3: iv + * x4: nblocks + * w5: nrounds + */ + + cbz x4, .Lcbc_dec_skip + + ld1 {v0.16b}, [x3] /* load IV */ + + aes_preload_keys(x0, w5); 
+ + b.eq .Lcbc_dec_entry_192 + b.hi .Lcbc_dec_entry_256 + +#define CBC_DEC(bits) \ + .Lcbc_dec_entry_##bits: \ + cmp x4, #4; \ + b.lo .Lcbc_dec_loop_##bits; \ + \ + .Lcbc_dec_loop4_##bits: \ + \ + ld1 {v1.16b-v4.16b}, [x2], #64; /* load ciphertext */ \ + sub x4, x4, #4; \ + mov v5.16b, v1.16b; \ + mov v6.16b, v2.16b; \ + mov v7.16b, v3.16b; \ + mov v16.16b, v4.16b; \ + cmp x4, #4; \ + \ + do_aes_4_##bits(d, imc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v0.16b; \ + eor v2.16b, v2.16b, v5.16b; \ + st1 {v1.16b-v2.16b}, [x1], #32; /* store plaintext */ \ + eor v3.16b, v3.16b, v6.16b; \ + eor v4.16b, v4.16b, v7.16b; \ + mov v0.16b, v16.16b; /* next IV */ \ + st1 {v3.16b-v4.16b}, [x1], #32; /* store plaintext */ \ + \ + b.hs .Lcbc_dec_loop4_##bits; \ + CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + CLEAR_REG(v16); \ + cbz x4, .Lcbc_dec_done; \ + \ + .Lcbc_dec_loop_##bits: \ + ld1 {v1.16b}, [x2], #16; /* load ciphertext */ \ + sub x4, x4, #1; \ + mov v2.16b, v1.16b; \ + \ + do_aes_one##bits(d, imc, v1, v1); \ + \ + eor v1.16b, v1.16b, v0.16b; \ + mov v0.16b, v2.16b; \ + st1 {v1.16b}, [x1], #16; /* store plaintext */ \ + \ + cbnz x4, .Lcbc_dec_loop_##bits; \ + b .Lcbc_dec_done; + + CBC_DEC(128) + CBC_DEC(192) + CBC_DEC(256) + +#undef CBC_DEC + +.Lcbc_dec_done: + aes_clear_keys(w5) + + st1 {v0.16b}, [x3] /* store IV */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + +.Lcbc_dec_skip: + ret +.size _gcry_aes_cbc_dec_armv8_ce,.-_gcry_aes_cbc_dec_armv8_ce; + + +/* + * void _gcry_aes_ctr_enc_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *iv, unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_ctr_enc_armv8_ce +.type _gcry_aes_ctr_enc_armv8_ce,%function; +_gcry_aes_ctr_enc_armv8_ce: + /* input: + * r0: keysched + * r1: outbuf + * r2: inbuf + * r3: iv + * x4: nblocks + * w5: nrounds + */ + + cbz x4, .Lctr_enc_skip + + mov x6, #1 + movi v16.16b, #0 + mov v16.D[1], x6 + + /* load IV */ + ldp x9, x10, [x3] + ld1 {v0.16b}, [x3] + rev x9, x9 + rev x10, x10 + + aes_preload_keys(x0, w5); + + b.eq .Lctr_enc_entry_192 + b.hi .Lctr_enc_entry_256 + +#define CTR_ENC(bits) \ + .Lctr_enc_entry_##bits: \ + cmp x4, #4; \ + b.lo .Lctr_enc_loop_##bits; \ + \ + .Lctr_enc_loop4_##bits: \ + cmp x10, #0xfffffffffffffffc; \ + sub x4, x4, #4; \ + b.lo .Lctr_enc_loop4_##bits##_nocarry; \ + \ + adds x10, x10, #1; \ + mov v1.16b, v0.16b; \ + adc x9, x9, xzr; \ + mov v2.D[1], x10; \ + mov v2.D[0], x9; \ + \ + adds x10, x10, #1; \ + rev64 v2.16b, v2.16b; \ + adc x9, x9, xzr; \ + mov v3.D[1], x10; \ + mov v3.D[0], x9; \ + \ + adds x10, x10, #1; \ + rev64 v3.16b, v3.16b; \ + adc x9, x9, xzr; \ + mov v4.D[1], x10; \ + mov v4.D[0], x9; \ + \ + adds x10, x10, #1; \ + rev64 v4.16b, v4.16b; \ + adc x9, x9, xzr; \ + mov v0.D[1], x10; \ + mov v0.D[0], x9; \ + rev64 v0.16b, v0.16b; \ + \ + b .Lctr_enc_loop4_##bits##_store_ctr; \ + \ + .Lctr_enc_loop4_##bits##_nocarry: \ + \ + add v3.2d, v16.2d, v16.2d; /* 2 */ \ + rev64 v6.16b, v0.16b; \ + add x10, x10, #4; \ + add v4.2d, v3.2d, v16.2d; /* 3 */ \ + add v0.2d, v3.2d, v3.2d; /* 4 */ \ + rev64 v1.16b, v6.16b; \ + add v2.2d, v6.2d, v16.2d; \ + add v3.2d, v6.2d, v3.2d; \ + add v4.2d, v6.2d, v4.2d; \ + add v0.2d, v6.2d, v0.2d; \ + rev64 v2.16b, v2.16b; \ + rev64 v3.16b, v3.16b; \ + rev64 v0.16b, v0.16b; \ + rev64 v4.16b, v4.16b; \ + \ + .Lctr_enc_loop4_##bits##_store_ctr: \ + \ + st1 {v0.16b}, [x3]; \ + cmp x4, #4; \ + ld1 {v5.16b-v7.16b}, [x2], #48; /* preload ciphertext */ \ 
+ \ + do_aes_4_##bits(e, mc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v5.16b; \ + ld1 {v5.16b}, [x2], #16; /* load ciphertext */ \ + eor v2.16b, v2.16b, v6.16b; \ + eor v3.16b, v3.16b, v7.16b; \ + eor v4.16b, v4.16b, v5.16b; \ + st1 {v1.16b-v4.16b}, [x1], #64; /* store plaintext */ \ + \ + b.hs .Lctr_enc_loop4_##bits; \ + CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + cbz x4, .Lctr_enc_done; \ + \ + .Lctr_enc_loop_##bits: \ + \ + adds x10, x10, #1; \ + mov v1.16b, v0.16b; \ + adc x9, x9, xzr; \ + mov v0.D[1], x10; \ + mov v0.D[0], x9; \ + sub x4, x4, #1; \ + ld1 {v2.16b}, [x2], #16; /* load ciphertext */ \ + rev64 v0.16b, v0.16b; \ + \ + do_aes_one##bits(e, mc, v1, v1); \ + \ + eor v1.16b, v2.16b, v1.16b; \ + st1 {v1.16b}, [x1], #16; /* store plaintext */ \ + \ + cbnz x4, .Lctr_enc_loop_##bits; \ + b .Lctr_enc_done; + + CTR_ENC(128) + CTR_ENC(192) + CTR_ENC(256) + +#undef CTR_ENC + +.Lctr_enc_done: + aes_clear_keys(w5) + + st1 {v0.16b}, [x3] /* store IV */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + +.Lctr_enc_skip: + ret + +.size _gcry_aes_ctr_enc_armv8_ce,.-_gcry_aes_ctr_enc_armv8_ce; + + +/* + * void _gcry_aes_cfb_enc_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *iv, unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_cfb_enc_armv8_ce +.type _gcry_aes_cfb_enc_armv8_ce,%function; +_gcry_aes_cfb_enc_armv8_ce: + /* input: + * r0: keysched + * r1: outbuf + * r2: inbuf + * r3: iv + * x4: nblocks + * w5: nrounds + */ + + cbz x4, .Lcfb_enc_skip + + /* load IV */ + ld1 {v0.16b}, [x3] + + aes_preload_keys(x0, w5); + + b.eq .Lcfb_enc_entry_192 + b.hi .Lcfb_enc_entry_256 + +#define CFB_ENC(bits) \ + .Lcfb_enc_entry_##bits: \ + .Lcfb_enc_loop_##bits: \ + ld1 {v1.16b}, [x2], #16; /* load plaintext */ \ + sub x4, x4, #1; \ + \ + do_aes_one##bits(e, mc, v0, v0); \ + \ + eor v0.16b, v1.16b, v0.16b; \ + st1 {v0.16b}, [x1], #16; /* store ciphertext */ \ + \ + cbnz x4, .Lcfb_enc_loop_##bits; \ + b .Lcfb_enc_done; + + CFB_ENC(128) + CFB_ENC(192) + CFB_ENC(256) + +#undef CFB_ENC + +.Lcfb_enc_done: + aes_clear_keys(w5) + + st1 {v0.16b}, [x3] /* store IV */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + +.Lcfb_enc_skip: + ret +.size _gcry_aes_cfb_enc_armv8_ce,.-_gcry_aes_cfb_enc_armv8_ce; + + +/* + * void _gcry_aes_cfb_dec_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *iv, unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_cfb_dec_armv8_ce +.type _gcry_aes_cfb_dec_armv8_ce,%function; +_gcry_aes_cfb_dec_armv8_ce: + /* input: + * r0: keysched + * r1: outbuf + * r2: inbuf + * r3: iv + * x4: nblocks + * w5: nrounds + */ + + cbz x4, .Lcfb_dec_skip + + /* load IV */ + ld1 {v0.16b}, [x3] + + aes_preload_keys(x0, w5); + + b.eq .Lcfb_dec_entry_192 + b.hi .Lcfb_dec_entry_256 + +#define CFB_DEC(bits) \ + .Lcfb_dec_entry_##bits: \ + cmp x4, #4; \ + b.lo .Lcfb_dec_loop_##bits; \ + \ + .Lcfb_dec_loop4_##bits: \ + \ + ld1 {v2.16b-v4.16b}, [x2], #48; /* load ciphertext */ \ + mov v1.16b, v0.16b; \ + sub x4, x4, #4; \ + cmp x4, #4; \ + mov v5.16b, v2.16b; \ + mov v6.16b, v3.16b; \ + mov v7.16b, v4.16b; \ + ld1 {v0.16b}, [x2], #16; /* load next IV / ciphertext */ \ + \ + do_aes_4_##bits(e, mc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v5.16b; \ + eor v2.16b, v2.16b, v6.16b; \ + eor v3.16b, v3.16b, v7.16b; \ + eor v4.16b, v4.16b, v0.16b; \ + st1 {v1.16b-v4.16b}, [x1], #64; /* store plaintext */ \ + \ + b.hs .Lcfb_dec_loop4_##bits; \ + 
CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + cbz x4, .Lcfb_dec_done; \ + \ + .Lcfb_dec_loop_##bits: \ + \ + ld1 {v1.16b}, [x2], #16; /* load ciphertext */ \ + \ + sub x4, x4, #1; \ + \ + do_aes_one##bits(e, mc, v0, v0); \ + \ + eor v2.16b, v1.16b, v0.16b; \ + mov v0.16b, v1.16b; \ + st1 {v2.16b}, [x1], #16; /* store plaintext */ \ + \ + cbnz x4, .Lcfb_dec_loop_##bits; \ + b .Lcfb_dec_done; + + CFB_DEC(128) + CFB_DEC(192) + CFB_DEC(256) + +#undef CFB_DEC + +.Lcfb_dec_done: + aes_clear_keys(w5) + + st1 {v0.16b}, [x3] /* store IV */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + +.Lcfb_dec_skip: + ret +.size _gcry_aes_cfb_dec_armv8_ce,.-_gcry_aes_cfb_dec_armv8_ce; + + +/* + * void _gcry_aes_ocb_enc_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *offset, + * unsigned char *checksum, + * void **Ls, + * size_t nblocks, + * unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_ocb_enc_armv8_ce +.type _gcry_aes_ocb_enc_armv8_ce,%function; +_gcry_aes_ocb_enc_armv8_ce: + /* input: + * x0: keysched + * x1: outbuf + * x2: inbuf + * x3: offset + * x4: checksum + * x5: Ls + * x6: nblocks (0 < nblocks <= 32) + * w7: nrounds + */ + + ld1 {v0.16b}, [x3] /* load offset */ + ld1 {v16.16b}, [x4] /* load checksum */ + + aes_preload_keys(x0, w7); + + b.eq .Locb_enc_entry_192 + b.hi .Locb_enc_entry_256 + +#define OCB_ENC(bits, ...) \ + .Locb_enc_entry_##bits: \ + cmp x6, #4; \ + b.lo .Locb_enc_loop_##bits; \ + \ + .Locb_enc_loop4_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* Checksum_i = Checksum_{i-1} xor P_i */ \ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ \ + \ + ldp ptr8, ptr9, [x5], #(ptr_sz*2); \ + \ + ld1 {v1.16b-v4.16b}, [x2], #64; /* load P_i+<0-3> */ \ + ldp ptr10, ptr11, [x5], #(ptr_sz*2); \ + sub x6, x6, #4; \ + \ + ld1 {v5.16b}, [x8]; /* load L_{ntz(i+0)} */ \ + eor v16.16b, v16.16b, v1.16b; /* Checksum_i+0 */ \ + ld1 {v6.16b}, [x9]; /* load L_{ntz(i+1)} */ \ + eor v16.16b, v16.16b, v2.16b; /* Checksum_i+1 */ \ + ld1 {v7.16b}, [x10]; /* load L_{ntz(i+2)} */ \ + eor v16.16b, v16.16b, v3.16b; /* Checksum_i+2 */ \ + eor v5.16b, v5.16b, v0.16b; /* Offset_i+0 */ \ + ld1 {v0.16b}, [x11]; /* load L_{ntz(i+3)} */ \ + eor v16.16b, v16.16b, v4.16b; /* Checksum_i+3 */ \ + eor v6.16b, v6.16b, v5.16b; /* Offset_i+1 */ \ + eor v1.16b, v1.16b, v5.16b; /* P_i+0 xor Offset_i+0 */ \ + eor v7.16b, v7.16b, v6.16b; /* Offset_i+2 */ \ + eor v2.16b, v2.16b, v6.16b; /* P_i+1 xor Offset_i+1 */ \ + eor v0.16b, v0.16b, v7.16b; /* Offset_i+3 */ \ + cmp x6, #4; \ + eor v3.16b, v3.16b, v7.16b; /* P_i+2 xor Offset_i+2 */ \ + eor v4.16b, v4.16b, v0.16b; /* P_i+3 xor Offset_i+3 */ \ + \ + do_aes_4_##bits(e, mc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v5.16b; /* xor Offset_i+0 */ \ + eor v2.16b, v2.16b, v6.16b; /* xor Offset_i+1 */ \ + eor v3.16b, v3.16b, v7.16b; /* xor Offset_i+2 */ \ + eor v4.16b, v4.16b, v0.16b; /* xor Offset_i+3 */ \ + st1 {v1.16b-v4.16b}, [x1], #64; \ + \ + b.hs .Locb_enc_loop4_##bits; \ + CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + cbz x6, .Locb_enc_done; \ + \ + .Locb_enc_loop_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* Checksum_i = Checksum_{i-1} xor P_i */ \ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ \ + \ + ldr ptr8, [x5], #(ptr_sz); \ + ld1 {v1.16b}, [x2], #16; /* load plaintext */ \ + ld1 {v2.16b}, [x8]; /* load L_{ntz(i)} */ \ + sub x6, x6, #1; \ + eor 
v0.16b, v0.16b, v2.16b; \ + eor v16.16b, v16.16b, v1.16b; \ + eor v1.16b, v1.16b, v0.16b; \ + \ + do_aes_one##bits(e, mc, v1, v1); \ + \ + eor v1.16b, v1.16b, v0.16b; \ + st1 {v1.16b}, [x1], #16; /* store ciphertext */ \ + \ + cbnz x6, .Locb_enc_loop_##bits; \ + b .Locb_enc_done; + + OCB_ENC(128) + OCB_ENC(192) + OCB_ENC(256) + +#undef OCB_ENC + +.Locb_enc_done: + aes_clear_keys(w7) + + st1 {v16.16b}, [x4] /* store checksum */ + st1 {v0.16b}, [x3] /* store offset */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + CLEAR_REG(v16) + + ret +.size _gcry_aes_ocb_enc_armv8_ce,.-_gcry_aes_ocb_enc_armv8_ce; + + +/* + * void _gcry_aes_ocb_dec_armv8_ce (const void *keysched, + * unsigned char *outbuf, + * const unsigned char *inbuf, + * unsigned char *offset, + * unsigned char *checksum, + * void **Ls, + * size_t nblocks, + * unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_ocb_dec_armv8_ce +.type _gcry_aes_ocb_dec_armv8_ce,%function; +_gcry_aes_ocb_dec_armv8_ce: + /* input: + * x0: keysched + * x1: outbuf + * x2: inbuf + * x3: offset + * x4: checksum + * x5: Ls + * x6: nblocks (0 < nblocks <= 32) + * w7: nrounds + */ + + ld1 {v0.16b}, [x3] /* load offset */ + ld1 {v16.16b}, [x4] /* load checksum */ + + aes_preload_keys(x0, w7); + + b.eq .Locb_dec_entry_192 + b.hi .Locb_dec_entry_256 + +#define OCB_DEC(bits) \ + .Locb_dec_entry_##bits: \ + cmp x6, #4; \ + b.lo .Locb_dec_loop_##bits; \ + \ + .Locb_dec_loop4_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ \ + /* Checksum_i = Checksum_{i-1} xor P_i */ \ + \ + ldp ptr8, ptr9, [x5], #(ptr_sz*2); \ + \ + ld1 {v1.16b-v4.16b}, [x2], #64; /* load C_i+<0-3> */ \ + ldp ptr10, ptr11, [x5], #(ptr_sz*2); \ + sub x6, x6, #4; \ + \ + ld1 {v5.16b}, [x8]; /* load L_{ntz(i+0)} */ \ + ld1 {v6.16b}, [x9]; /* load L_{ntz(i+1)} */ \ + ld1 {v7.16b}, [x10]; /* load L_{ntz(i+2)} */ \ + eor v5.16b, v5.16b, v0.16b; /* Offset_i+0 */ \ + ld1 {v0.16b}, [x11]; /* load L_{ntz(i+3)} */ \ + eor v6.16b, v6.16b, v5.16b; /* Offset_i+1 */ \ + eor v1.16b, v1.16b, v5.16b; /* C_i+0 xor Offset_i+0 */ \ + eor v7.16b, v7.16b, v6.16b; /* Offset_i+2 */ \ + eor v2.16b, v2.16b, v6.16b; /* C_i+1 xor Offset_i+1 */ \ + eor v0.16b, v0.16b, v7.16b; /* Offset_i+3 */ \ + cmp x6, #4; \ + eor v3.16b, v3.16b, v7.16b; /* C_i+2 xor Offset_i+2 */ \ + eor v4.16b, v4.16b, v0.16b; /* C_i+3 xor Offset_i+3 */ \ + \ + do_aes_4_##bits(d, imc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v5.16b; /* xor Offset_i+0 */ \ + eor v2.16b, v2.16b, v6.16b; /* xor Offset_i+1 */ \ + eor v16.16b, v16.16b, v1.16b; /* Checksum_i+0 */ \ + eor v3.16b, v3.16b, v7.16b; /* xor Offset_i+2 */ \ + eor v16.16b, v16.16b, v2.16b; /* Checksum_i+1 */ \ + eor v4.16b, v4.16b, v0.16b; /* xor Offset_i+3 */ \ + eor v16.16b, v16.16b, v3.16b; /* Checksum_i+2 */ \ + eor v16.16b, v16.16b, v4.16b; /* Checksum_i+3 */ \ + st1 {v1.16b-v4.16b}, [x1], #64; \ + \ + b.hs .Locb_dec_loop4_##bits; \ + CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + cbz x6, .Locb_dec_done; \ + \ + .Locb_dec_loop_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ \ + /* Checksum_i = Checksum_{i-1} xor P_i */ \ + \ + ldr ptr8, [x5], #(ptr_sz); \ + ld1 {v1.16b}, [x2], #16; /* load ciphertext */ \ + ld1 {v2.16b}, [x8]; /* load L_{ntz(i)} */ \ + sub x6, x6, #1; \ + eor v0.16b, v0.16b, v2.16b; \ + eor v1.16b, v1.16b, v0.16b; \ + \ + do_aes_one##bits(d, imc, v1, v1) \ + \ + eor v1.16b, 
v1.16b, v0.16b; \ + st1 {v1.16b}, [x1], #16; /* store plaintext */ \ + eor v16.16b, v16.16b, v1.16b; \ + \ + cbnz x6, .Locb_dec_loop_##bits; \ + b .Locb_dec_done; + + OCB_DEC(128) + OCB_DEC(192) + OCB_DEC(256) + +#undef OCB_DEC + +.Locb_dec_done: + aes_clear_keys(w7) + + st1 {v16.16b}, [x4] /* store checksum */ + st1 {v0.16b}, [x3] /* store offset */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + CLEAR_REG(v16) + + ret +.size _gcry_aes_ocb_dec_armv8_ce,.-_gcry_aes_ocb_dec_armv8_ce; + + +/* + * void _gcry_aes_ocb_auth_armv8_ce (const void *keysched, + * const unsigned char *abuf, + * unsigned char *offset, + * unsigned char *checksum, + * void **Ls, + * size_t nblocks, + * unsigned int nrounds); + */ + +.align 3 +.globl _gcry_aes_ocb_auth_armv8_ce +.type _gcry_aes_ocb_auth_armv8_ce,%function; +_gcry_aes_ocb_auth_armv8_ce: + /* input: + * x0: keysched + * x1: abuf + * x2: offset => x3 + * x3: checksum => x4 + * x4: Ls => x5 + * x5: nblocks => x6 (0 < nblocks <= 32) + * w6: nrounds => w7 + */ + mov x7, x6 + mov x6, x5 + mov x5, x4 + mov x4, x3 + mov x3, x2 + + aes_preload_keys(x0, w7); + + ld1 {v0.16b}, [x3] /* load offset */ + ld1 {v16.16b}, [x4] /* load checksum */ + + beq .Locb_auth_entry_192 + bhi .Locb_auth_entry_256 + +#define OCB_AUTH(bits) \ + .Locb_auth_entry_##bits: \ + cmp x6, #4; \ + b.lo .Locb_auth_loop_##bits; \ + \ + .Locb_auth_loop4_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ \ + \ + ldp ptr8, ptr9, [x5], #(ptr_sz*2); \ + \ + ld1 {v1.16b-v4.16b}, [x1], #64; /* load A_i+<0-3> */ \ + ldp ptr10, ptr11, [x5], #(ptr_sz*2); \ + sub x6, x6, #4; \ + \ + ld1 {v5.16b}, [x8]; /* load L_{ntz(i+0)} */ \ + ld1 {v6.16b}, [x9]; /* load L_{ntz(i+1)} */ \ + ld1 {v7.16b}, [x10]; /* load L_{ntz(i+2)} */ \ + eor v5.16b, v5.16b, v0.16b; /* Offset_i+0 */ \ + ld1 {v0.16b}, [x11]; /* load L_{ntz(i+3)} */ \ + eor v6.16b, v6.16b, v5.16b; /* Offset_i+1 */ \ + eor v1.16b, v1.16b, v5.16b; /* A_i+0 xor Offset_i+0 */ \ + eor v7.16b, v7.16b, v6.16b; /* Offset_i+2 */ \ + eor v2.16b, v2.16b, v6.16b; /* A_i+1 xor Offset_i+1 */ \ + eor v0.16b, v0.16b, v7.16b; /* Offset_i+3 */ \ + cmp x6, #4; \ + eor v3.16b, v3.16b, v7.16b; /* A_i+2 xor Offset_i+2 */ \ + eor v4.16b, v4.16b, v0.16b; /* A_i+3 xor Offset_i+3 */ \ + \ + do_aes_4_##bits(e, mc, v1, v2, v3, v4); \ + \ + eor v1.16b, v1.16b, v2.16b; \ + eor v16.16b, v16.16b, v3.16b; \ + eor v1.16b, v1.16b, v4.16b; \ + eor v16.16b, v16.16b, v1.16b; \ + \ + b.hs .Locb_auth_loop4_##bits; \ + CLEAR_REG(v3); \ + CLEAR_REG(v4); \ + CLEAR_REG(v5); \ + CLEAR_REG(v6); \ + CLEAR_REG(v7); \ + cbz x6, .Locb_auth_done; \ + \ + .Locb_auth_loop_##bits: \ + \ + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ \ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ \ + \ + ldr ptr8, [x5], #(ptr_sz); \ + ld1 {v1.16b}, [x1], #16; /* load aadtext */ \ + ld1 {v2.16b}, [x8]; /* load L_{ntz(i)} */ \ + sub x6, x6, #1; \ + eor v0.16b, v0.16b, v2.16b; \ + eor v1.16b, v1.16b, v0.16b; \ + \ + do_aes_one##bits(e, mc, v1, v1) \ + \ + eor v16.16b, v16.16b, v1.16b; \ + \ + cbnz x6, .Locb_auth_loop_##bits; \ + b .Locb_auth_done; + + OCB_AUTH(128) + OCB_AUTH(192) + OCB_AUTH(256) + +#undef OCB_AUTH + +.Locb_auth_done: + aes_clear_keys(w7) + + st1 {v16.16b}, [x4] /* store checksum */ + st1 {v0.16b}, [x3] /* store offset */ + + CLEAR_REG(v0) + CLEAR_REG(v1) + CLEAR_REG(v2) + CLEAR_REG(v16) + + ret +.size _gcry_aes_ocb_auth_armv8_ce,.-_gcry_aes_ocb_auth_armv8_ce; + + +/* + * u32 _gcry_aes_sbox4_armv8_ce(u32 in4b); + */ 
+.align 3 +.globl _gcry_aes_sbox4_armv8_ce +.type _gcry_aes_sbox4_armv8_ce,%function; +_gcry_aes_sbox4_armv8_ce: + /* See "Gouvêa, C. P. L. & López, J. Implementing GCM on ARMv8. Topics in + * Cryptology – CT-RSA 2015" for details. + */ + movi v0.16b, #0x52 + movi v1.16b, #0 + mov v0.S[0], w0 + aese v0.16b, v1.16b + addv s0, v0.4s + mov w0, v0.S[0] + CLEAR_REG(v0) + ret +.size _gcry_aes_sbox4_armv8_ce,.-_gcry_aes_sbox4_armv8_ce; + + +/* + * void _gcry_aes_invmixcol_armv8_ce(void *dst, const void *src); + */ +.align 3 +.globl _gcry_aes_invmixcol_armv8_ce +.type _gcry_aes_invmixcol_armv8_ce,%function; +_gcry_aes_invmixcol_armv8_ce: + ld1 {v0.16b}, [x1] + aesimc v0.16b, v0.16b + st1 {v0.16b}, [x0] + CLEAR_REG(v0) + ret +.size _gcry_aes_invmixcol_armv8_ce,.-_gcry_aes_invmixcol_armv8_ce; + +#endif diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h index 340dbc0..160fb8c 100644 --- a/cipher/rijndael-internal.h +++ b/cipher/rijndael-internal.h @@ -95,6 +95,10 @@ && defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) \ && defined(HAVE_GCC_INLINE_ASM_AARCH32_CRYPTO) # define USE_ARM_CE 1 +# elif defined(__AARCH64EL__) \ + && defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) \ + && defined(HAVE_GCC_INLINE_ASM_AARCH64_CRYPTO) +# define USE_ARM_CE 1 # endif #endif /* ENABLE_ARM_CRYPTO_SUPPORT */ diff --git a/configure.ac b/configure.ac index 9049db7..ca82af9 100644 --- a/configure.ac +++ b/configure.ac @@ -2043,6 +2043,10 @@ if test "$found" = "1" ; then aarch64-*-*) # Build with the assembly implementation GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-aarch64.lo" + + # Build with the ARMv8/AArch64 CE implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-ce.lo" + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-aarch64-ce.lo" ;; esac

From wangshuai901 at gmail.com Mon Sep 5 19:32:10 2016
From: wangshuai901 at gmail.com (Shuai Wang)
Date: Mon, 5 Sep 2016 13:32:10 -0400
Subject: [gcrypt-devel] gcry_pk_genkey function is extremely slow in libgcrypt
Message-ID: 

I am a newbie to libgcrypt (version 1.6.1), and right now I am trying to
produce a public/private key pair for the RSA algorithm.

I list the code I am using below. Where I get stuck is the
gcry_pk_genkey function, which can take over 1.5 hours and still never
return.

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s \n", argv[0]);
        xerr1("Invalid arguments.");
    }

    gcrypt_init();

    gcry_error_t err = 0;
    gcry_sexp_t rsa_parms;
    gcry_sexp_t rsa_keypair;

    err &= gcry_sexp_build(&rsa_parms, NULL, "(genkey (rsa (nbits 4:2048)))");
    if (err) {
        xerr1("gcrypt: failed to create rsa params");
    }

    err &= gcry_pk_genkey(&rsa_keypair, rsa_parms);  <------- This function call
    if (err) {
        xerr1("gcrypt: failed to create rsa key pair");
    }

    char* fname = argv[1];
    err = gcrypt_sexp_to_file(fname, rsa_keypair, 1 << 16);

    printf("i am here3\n");
    gcry_sexp_release(rsa_keypair);
    gcry_sexp_release(rsa_parms);

    return err;
}

I am aware that this function can take a few minutes. Your computer
needs to gather random entropy. However, I can hardly believe it could
take almost 2 hours without returning or throwing an exception...

I am using a 32-bit Ubuntu 14.04, inside a VirtualBox VM instance. Am I
doing anything wrong here?
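For completeness, a minimal error-reporting variant of the same call (a
sketch only: gcrypt_init() and xerr1() are the helpers from my code
above; gcry_strsource/gcry_strerror are standard libgcrypt). Note that
accumulating status with "err &= ..." keeps err at 0 forever, so a
failure would go unnoticed; "|=" or a plain assignment is the usual
idiom:

    #include <stdio.h>
    #include <gcrypt.h>

    static void genkey_with_report (void)
    {
      gcry_error_t err;
      gcry_sexp_t rsa_parms, rsa_keypair;

      err = gcry_sexp_build (&rsa_parms, NULL,
                             "(genkey (rsa (nbits 4:2048)))");
      if (!err)
        err = gcry_pk_genkey (&rsa_keypair, rsa_parms);
      if (err)
        fprintf (stderr, "gcrypt: %s/%s\n",
                 gcry_strsource (err), gcry_strerror (err));
    }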
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wangshuai901 at gmail.com Tue Sep 6 03:23:09 2016
From: wangshuai901 at gmail.com (Shuai Wang)
Date: Mon, 5 Sep 2016 21:23:09 -0400
Subject: [gcrypt-devel] gcry_pk_genkey function is extremely slow in libgcrypt
In-Reply-To: 
References: 
Message-ID: 

Hello Karl,

Thank you for your reply. Yes, I have double-checked */dev/random* and
it is extremely slow.

So currently I produce the key pair on my host machine (OS X) and then
switch to the VM for some tests. It works!

Sincerely,
Shuai

On Mon, Sep 5, 2016 at 9:20 PM, Karl Magdsick wrote:

> Virtual machines may gather entropy incredibly slowly. While you're
> stuck, in another terminal try
>
> prompt> time dd bs=128 count=1 if=/dev/random | uuencode -
>
> My guess is this alone will take several minutes for your VM.
>
> Cheers,
> -Karl
>
> On Sep 6, 2016 2:51 AM, "Shuai Wang" wrote:
>
> I am a newbie to libgcrypt (version 1.6.1), and right now I am trying to
> produce a public/private key pair for the RSA algorithm.
>
> I list the code I am using below. Where I get stuck is the
> gcry_pk_genkey function, which can take over 1.5 hours and still never
> return.
>
> int main(int argc, char** argv)
> {
>     if (argc != 2) {
>         fprintf(stderr, "Usage: %s \n", argv[0]);
>         xerr1("Invalid arguments.");
>     }
>
>     gcrypt_init();
>
>     gcry_error_t err = 0;
>     gcry_sexp_t rsa_parms;
>     gcry_sexp_t rsa_keypair;
>
>     err &= gcry_sexp_build(&rsa_parms, NULL, "(genkey (rsa (nbits 4:2048)))");
>     if (err) {
>         xerr1("gcrypt: failed to create rsa params");
>     }
>
>     err &= gcry_pk_genkey(&rsa_keypair, rsa_parms);  <------- This function call
>     if (err) {
>         xerr1("gcrypt: failed to create rsa key pair");
>     }
>
>     char* fname = argv[1];
>     err = gcrypt_sexp_to_file(fname, rsa_keypair, 1 << 16);
>
>     printf("i am here3\n");
>     gcry_sexp_release(rsa_keypair);
>     gcry_sexp_release(rsa_parms);
>
>     return err;
> }
>
> I am aware that this function can take a few minutes. Your computer
> needs to gather random entropy. However, I can hardly believe it could
> take almost 2 hours without returning or throwing an exception...
>
> I am using a 32-bit Ubuntu 14.04, inside a VirtualBox VM instance. Am I
> doing anything wrong here?
>
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From smueller at chronox.de Tue Sep 6 03:57:07 2016
From: smueller at chronox.de (Stephan Mueller)
Date: Tue, 06 Sep 2016 03:57:07 +0200
Subject: [gcrypt-devel] gcry_pk_genkey function is extremely slow in libgcrypt
In-Reply-To: 
References: 
Message-ID: <1981903.HmEVDCvZXS@positron.chronox.de>

On Monday, 5 September 2016, 21:23:09 CEST, Shuai Wang wrote:

Hi Shuai,

> Hello Karl,
>
> Thank you for your reply. Yes, I have double-checked */dev/random* and
> it is extremely slow.
>
> So currently I produce the key pair on my host machine (OS X) and then
> switch to the VM for some tests. It works!

Maybe you want to consider an entropy harvesting daemon like maxwell,
the Jitter RNG or haveged.

Ciao
Stephan
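On the application side, if the generated key is only for testing,
libgcrypt's key generator also accepts a transient-key token that
permits a faster random level; a sketch, assuming the S-expression
syntax described in the libgcrypt manual:

    /* Ask for a short-lived test key: with (transient-key), libgcrypt
     * may draw from GCRY_STRONG_RANDOM instead of the blocking
     * very-strong pool used for long-term keys.  */
    gcry_sexp_t parms;
    gcry_error_t err = gcry_sexp_build (&parms, NULL,
        "(genkey (rsa (nbits 4:2048) (transient-key)))");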
From kmagnum at gmail.com Tue Sep 6 03:20:47 2016
From: kmagnum at gmail.com (Karl Magdsick)
Date: Tue, 6 Sep 2016 01:20:47 +0000
Subject: [gcrypt-devel] gcry_pk_genkey function is extremely slow in libgcrypt
In-Reply-To: 
References: 
Message-ID: 

Virtual machines may gather entropy incredibly slowly. While you're
stuck, in another terminal try

prompt> time dd bs=128 count=1 if=/dev/random | uuencode -

My guess is this alone will take several minutes for your VM.

Cheers,
-Karl

On Sep 6, 2016 2:51 AM, "Shuai Wang" wrote:

I am a newbie to libgcrypt (version 1.6.1), and right now I am trying to
produce a public/private key pair for the RSA algorithm.

I list the code I am using below. Where I get stuck is the
gcry_pk_genkey function, which can take over 1.5 hours and still never
return.

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s \n", argv[0]);
        xerr1("Invalid arguments.");
    }

    gcrypt_init();

    gcry_error_t err = 0;
    gcry_sexp_t rsa_parms;
    gcry_sexp_t rsa_keypair;

    err &= gcry_sexp_build(&rsa_parms, NULL, "(genkey (rsa (nbits 4:2048)))");
    if (err) {
        xerr1("gcrypt: failed to create rsa params");
    }

    err &= gcry_pk_genkey(&rsa_keypair, rsa_parms);  <------- This function call
    if (err) {
        xerr1("gcrypt: failed to create rsa key pair");
    }

    char* fname = argv[1];
    err = gcrypt_sexp_to_file(fname, rsa_keypair, 1 << 16);

    printf("i am here3\n");
    gcry_sexp_release(rsa_keypair);
    gcry_sexp_release(rsa_parms);

    return err;
}

I am aware that this function can take a few minutes. Your computer
needs to gather random entropy. However, I can hardly believe it could
take almost 2 hours without returning or throwing an exception...

I am using a 32-bit Ubuntu 14.04, inside a VirtualBox VM instance. Am I
doing anything wrong here?

_______________________________________________
Gcrypt-devel mailing list
Gcrypt-devel at gnupg.org
http://lists.gnupg.org/mailman/listinfo/gcrypt-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kmagnum at gmail.com Tue Sep 6 09:23:02 2016
From: kmagnum at gmail.com (Karl Magdsick)
Date: Tue, 6 Sep 2016 07:23:02 +0000
Subject: [gcrypt-devel] gcry_pk_genkey function is extremely slow in libgcrypt
In-Reply-To: <1981903.HmEVDCvZXS@positron.chronox.de>
References: <1981903.HmEVDCvZXS@positron.chronox.de>
Message-ID: 

Note that FreeBSD and OS X both have /dev/random implementations that
stop blocking once the system hits a high water mark for entropy. If
FreeBSD is an option for you, it may behave better in a VM.

Do any of you know offhand the latest Linux kernel's /dev/random
behavior when RDRAND/RDSEED is providing the vast majority of the
entropy? I'm aware that hardware entropy instructions (where available)
are treated (wisely) as one of many sources, but I'm not sure if there's
any mechanism capping the estimated entropy contributions of any single
source. (Of course, as the Fortuna designers point out, putting much
faith in entropy estimates is misguided.)

On Sep 6, 2016 9:57 AM, "Stephan Mueller" wrote:

> On Monday, 5 September 2016, 21:23:09 CEST, Shuai Wang wrote:
>
> Hi Shuai,
>
> > Hello Karl,
> >
> > Thank you for your reply. Yes, I have double-checked */dev/random* and
> > it is extremely slow.
> >
> > So currently I produce the key pair on my host machine (OS X) and then
> > switch to the VM for some tests. It works!
>
> Maybe you want to consider an entropy harvesting daemon like maxwell,
> the Jitter RNG or haveged.
>
> Ciao
> Stephan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
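For Linux guests, the kernel's running entropy estimate that this thread
keeps bumping into can be read from procfs; a minimal sketch (the procfs
path is the standard Linux one):

    #include <stdio.h>

    int main (void)
    {
      /* Current input-pool entropy estimate, in bits. */
      FILE *f = fopen ("/proc/sys/kernel/random/entropy_avail", "r");
      int bits;

      if (f && fscanf (f, "%d", &bits) == 1)
        printf ("entropy_avail: %d bits\n", bits);
      if (f)
        fclose (f);
      return 0;
    }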
From jussi.kivilinna at iki.fi Sun Sep 11 16:39:19 2016
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 11 Sep 2016 17:39:19 +0300
Subject: [PATCH 1/2] Add Aarch64 assembly implementation of Camellia
Message-ID: <147360475926.22786.5098848227904913348.stgit@localhost6.localdomain6>

* cipher/Makefile.am: Add 'camellia-aarch64.S'.
* cipher/camellia-aarch64.S: New.
* cipher/camellia-glue.c [USE_ARM_ASM][__aarch64__]: Set stack burn
size to zero.
* cipher/camellia.h: Enable USE_ARM_ASM if __AARCH64EL__ and
HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined.
* configure.ac [host=aarch64]: Add 'camellia-aarch64.lo'.
--

Patch adds ARMv8/Aarch64 implementation of Camellia.

Benchmark on Cortex-A53 (1152 MHz):

Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |  39.71 ns/B  24.01 MiB/s  45.75 c/B
        ECB dec |  39.72 ns/B  24.01 MiB/s  45.75 c/B
        CBC enc |  40.80 ns/B  23.38 MiB/s  47.00 c/B
        CBC dec |  39.66 ns/B  24.05 MiB/s  45.69 c/B
        CFB enc |  40.69 ns/B  23.44 MiB/s  46.88 c/B
        CFB dec |  39.66 ns/B  24.05 MiB/s  45.69 c/B
        OFB enc |  40.69 ns/B  23.44 MiB/s  46.88 c/B
        OFB dec |  40.69 ns/B  23.44 MiB/s  46.88 c/B
        CTR enc |  39.88 ns/B  23.91 MiB/s  45.94 c/B
        CTR dec |  39.88 ns/B  23.91 MiB/s  45.94 c/B
        CCM enc |  79.97 ns/B  11.92 MiB/s  92.13 c/B
        CCM dec |  79.97 ns/B  11.93 MiB/s  92.13 c/B
       CCM auth |  40.20 ns/B  23.72 MiB/s  46.31 c/B
        GCM enc |  41.18 ns/B  23.16 MiB/s  47.44 c/B
        GCM dec |  41.18 ns/B  23.16 MiB/s  47.44 c/B
       GCM auth |   1.30 ns/B  732.7 MiB/s   1.50 c/B
        OCB enc |  42.04 ns/B  22.69 MiB/s  48.43 c/B
        OCB dec |  42.03 ns/B  22.69 MiB/s  48.42 c/B
       OCB auth |  41.38 ns/B  23.05 MiB/s  47.67 c/B
                =
 CAMELLIA256    |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |  52.36 ns/B  18.22 MiB/s  60.31 c/B
        ECB dec |  52.36 ns/B  18.22 MiB/s  60.31 c/B
        CBC enc |  53.39 ns/B  17.86 MiB/s  61.50 c/B
        CBC dec |  52.14 ns/B  18.29 MiB/s  60.06 c/B
        CFB enc |  53.28 ns/B  17.90 MiB/s  61.38 c/B
        CFB dec |  52.14 ns/B  18.29 MiB/s  60.06 c/B
        OFB enc |  53.17 ns/B  17.94 MiB/s  61.25 c/B
        OFB dec |  53.17 ns/B  17.94 MiB/s  61.25 c/B
        CTR enc |  52.36 ns/B  18.21 MiB/s  60.32 c/B
        CTR dec |  52.36 ns/B  18.21 MiB/s  60.32 c/B
        CCM enc |  105.0 ns/B   9.08 MiB/s  120.9 c/B
        CCM dec |  105.0 ns/B   9.08 MiB/s  120.9 c/B
       CCM auth |  52.74 ns/B  18.08 MiB/s  60.75 c/B
        GCM enc |  53.66 ns/B  17.77 MiB/s  61.81 c/B
        GCM dec |  53.66 ns/B  17.77 MiB/s  61.82 c/B
       GCM auth |   1.30 ns/B  732.3 MiB/s   1.50 c/B
        OCB enc |  54.54 ns/B  17.49 MiB/s  62.83 c/B
        OCB dec |  54.48 ns/B  17.50 MiB/s  62.77 c/B
       OCB auth |  53.89 ns/B  17.70 MiB/s  62.09 c/B
                =
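(The ~1.7x figure below falls straight out of these numbers; for
example, CAMELLIA128 ECB enc: 39.71 ns/B / 22.25 ns/B ~= 1.78x, and
CBC enc: 40.80 / 23.27 ~= 1.75x.)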
After (~1.7x faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |  22.25 ns/B  42.87 MiB/s  25.63 c/B
        ECB dec |  22.25 ns/B  42.87 MiB/s  25.63 c/B
        CBC enc |  23.27 ns/B  40.97 MiB/s  26.81 c/B
        CBC dec |  22.14 ns/B  43.08 MiB/s  25.50 c/B
        CFB enc |  23.17 ns/B  41.17 MiB/s  26.69 c/B
        CFB dec |  22.14 ns/B  43.08 MiB/s  25.50 c/B
        OFB enc |  23.11 ns/B  41.26 MiB/s  26.63 c/B
        OFB dec |  23.11 ns/B  41.26 MiB/s  26.63 c/B
        CTR enc |  22.36 ns/B  42.65 MiB/s  25.76 c/B
        CTR dec |  22.36 ns/B  42.65 MiB/s  25.76 c/B
        CCM enc |  44.87 ns/B  21.26 MiB/s  51.69 c/B
        CCM dec |  44.87 ns/B  21.25 MiB/s  51.69 c/B
       CCM auth |  22.62 ns/B  42.15 MiB/s  26.06 c/B
        GCM enc |  23.66 ns/B  40.31 MiB/s  27.25 c/B
        GCM dec |  23.66 ns/B  40.31 MiB/s  27.25 c/B
       GCM auth |   1.30 ns/B  732.0 MiB/s   1.50 c/B
        OCB enc |  24.32 ns/B  39.21 MiB/s  28.02 c/B
        OCB dec |  24.32 ns/B  39.21 MiB/s  28.02 c/B
       OCB auth |  23.75 ns/B  40.15 MiB/s  27.36 c/B
                =
 CAMELLIA256    |  nanosecs/byte   mebibytes/sec   cycles/byte
        ECB enc |  29.08 ns/B  32.79 MiB/s  33.50 c/B
        ECB dec |  29.19 ns/B  32.67 MiB/s  33.63 c/B
        CBC enc |  30.11 ns/B  31.67 MiB/s  34.69 c/B
        CBC dec |  29.05 ns/B  32.83 MiB/s  33.47 c/B
        CFB enc |  30.00 ns/B  31.79 MiB/s  34.56 c/B
        CFB dec |  28.97 ns/B  32.91 MiB/s  33.38 c/B
        OFB enc |  29.95 ns/B  31.84 MiB/s  34.50 c/B
        OFB dec |  29.95 ns/B  31.84 MiB/s  34.50 c/B
        CTR enc |  29.19 ns/B  32.67 MiB/s  33.63 c/B
        CTR dec |  29.19 ns/B  32.67 MiB/s  33.63 c/B
        CCM enc |  58.54 ns/B  16.29 MiB/s  67.43 c/B
        CCM dec |  58.54 ns/B  16.29 MiB/s  67.44 c/B
       CCM auth |  29.46 ns/B  32.37 MiB/s  33.94 c/B
        GCM enc |  30.49 ns/B  31.28 MiB/s  35.12 c/B
        GCM dec |  30.49 ns/B  31.27 MiB/s  35.13 c/B
       GCM auth |   1.30 ns/B  731.6 MiB/s   1.50 c/B
        OCB enc |  31.16 ns/B  30.61 MiB/s  35.90 c/B
        OCB dec |  31.22 ns/B  30.55 MiB/s  35.96 c/B
       OCB auth |  30.59 ns/B  31.18 MiB/s  35.24 c/B
                =

Signed-off-by: Jussi Kivilinna
---
0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am index db606ca..305a3b9 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -102,7 +102,7 @@ whirlpool.c whirlpool-sse2-amd64.S \ twofish.c twofish-amd64.S twofish-arm.S \ rfc2268.c \ camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \ - camellia-aesni-avx2-amd64.S camellia-arm.S + camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S gost28147.lo: gost-sb.h gost-sb.h: gost-s-box diff --git a/cipher/camellia-aarch64.S b/cipher/camellia-aarch64.S new file mode 100644 index 0000000..440f69f --- /dev/null +++ b/cipher/camellia-aarch64.S @@ -0,0 +1,557 @@ +/* camellia-aarch64.S - ARMv8/AArch64 assembly implementation of Camellia + * cipher + * + * Copyright (C) 2016 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#if defined(__AARCH64EL__) +#ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS + +.text + +/* struct camellia_ctx: */ +#define key_table 0 + +/* register macros */ +#define CTX x0 +#define RDST x1 +#define RSRC x2 +#define RKEYBITS x3 + +#define RTAB1 x4 +#define RTAB2 x5 +#define RTAB3 x6 +#define RTAB4 x7 +#define RMASK w8 + +#define IL w9 +#define IR w10 + +#define xIL x9 +#define xIR x10 + +#define XL w11 +#define XR w12 +#define YL w13 +#define YR w14 + +#define RT0 w15 +#define RT1 w16 +#define RT2 w17 +#define RT3 w18 + +#define xRT0 x15 +#define xRT1 x16 +#define xRT2 x17 +#define xRT3 x18 + +#ifdef __AARCH64EL__ + #define host_to_be(reg, rtmp) \ + rev reg, reg; + #define be_to_host(reg, rtmp) \ + rev reg, reg; +#else + /* nop on big-endian */ + #define host_to_be(reg, rtmp) /*_*/ + #define be_to_host(reg, rtmp) /*_*/ +#endif + +#define ldr_input_aligned_be(rin, a, b, c, d, rtmp) \ + ldr a, [rin, #0]; \ + ldr b, [rin, #4]; \ + be_to_host(a, rtmp); \ + ldr c, [rin, #8]; \ + be_to_host(b, rtmp); \ + ldr d, [rin, #12]; \ + be_to_host(c, rtmp); \ + be_to_host(d, rtmp); + +#define str_output_aligned_be(rout, a, b, c, d, rtmp) \ + be_to_host(a, rtmp); \ + be_to_host(b, rtmp); \ + str a, [rout, #0]; \ + be_to_host(c, rtmp); \ + str b, [rout, #4]; \ + be_to_host(d, rtmp); \ + str c, [rout, #8]; \ + str d, [rout, #12]; + +/* unaligned word reads/writes allowed */ +#define ldr_input_be(rin, ra, rb, rc, rd, rtmp) \ + ldr_input_aligned_be(rin, ra, rb, rc, rd, rtmp) + +#define str_output_be(rout, ra, rb, rc, rd, rtmp0, rtmp1) \ + str_output_aligned_be(rout, ra, rb, rc, rd, rtmp0) + +/********************************************************************** + 1-way camellia + **********************************************************************/ +#define roundsm(xl, xr, kl, kr, yl, yr) \ + ldr RT2, [CTX, #(key_table + ((kl) * 4))]; \ + and IR, RMASK, xr, lsl#(4); /*sp1110*/ \ + ldr RT3, [CTX, #(key_table + ((kr) * 4))]; \ + and IL, RMASK, xl, lsr#(24 - 4); /*sp1110*/ \ + and RT0, RMASK, xr, lsr#(16 - 4); /*sp3033*/ \ + ldr IR, [RTAB1, xIR]; \ + and RT1, RMASK, xl, lsr#(8 - 4); /*sp3033*/ \ + eor yl, yl, RT2; \ + ldr IL, [RTAB1, xIL]; \ + eor yr, yr, RT3; \ + \ + ldr RT0, [RTAB3, xRT0]; \ + ldr RT1, [RTAB3, xRT1]; \ + \ + and RT2, RMASK, xr, lsr#(24 - 4); /*sp0222*/ \ + and RT3, RMASK, xl, lsr#(16 - 4); /*sp0222*/ \ + \ + eor IR, IR, RT0; \ + eor IL, IL, RT1; \ + \ + ldr RT2, [RTAB2, xRT2]; \ + and RT0, RMASK, xr, lsr#(8 - 4); /*sp4404*/ \ + ldr RT3, [RTAB2, xRT3]; \ + and RT1, RMASK, xl, lsl#(4); /*sp4404*/ \ + \ + ldr RT0, [RTAB4, xRT0]; \ + ldr RT1, [RTAB4, xRT1]; \ + \ + eor IR, IR, RT2; \ + eor IL, IL, RT3; \ + eor IR, IR, RT0; \ + eor IL, IL, RT1; \ + \ + eor IR, IR, IL; \ + eor yr, yr, IL, ror#8; \ + eor yl, yl, IR; \ + eor yr, yr, IR; + +#define enc_rounds(n) \ + roundsm(XL, XR, ((n) + 2) * 2 + 0, ((n) + 2) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 3) * 2 + 0, ((n) + 3) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 4) * 2 + 0, ((n) + 4) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 5) * 2 + 0, ((n) + 5) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 6) * 2 + 0, ((n) + 6) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 7) * 2 + 0, ((n) + 7) * 2 + 1, XL, XR); + +#define dec_rounds(n) \ + roundsm(XL, XR, ((n) + 7) * 2 + 0, ((n) + 7) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 6) * 2 + 0, ((n) + 6) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 5) * 2 + 0, ((n) + 5) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 4) * 2 + 0, ((n) + 4) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 3) * 2 
+ 0, ((n) + 3) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 2) * 2 + 0, ((n) + 2) * 2 + 1, XL, XR); + +/* perform FL and FL?? */ +#define fls(ll, lr, rl, rr, kll, klr, krl, krr) \ + ldr RT0, [CTX, #(key_table + ((kll) * 4))]; \ + ldr RT2, [CTX, #(key_table + ((krr) * 4))]; \ + and RT0, RT0, ll; \ + ldr RT3, [CTX, #(key_table + ((krl) * 4))]; \ + orr RT2, RT2, rr; \ + ldr RT1, [CTX, #(key_table + ((klr) * 4))]; \ + eor rl, rl, RT2; \ + eor lr, lr, RT0, ror#31; \ + and RT3, RT3, rl; \ + orr RT1, RT1, lr; \ + eor ll, ll, RT1; \ + eor rr, rr, RT3, ror#31; + +#define enc_fls(n) \ + fls(XL, XR, YL, YR, \ + (n) * 2 + 0, (n) * 2 + 1, \ + (n) * 2 + 2, (n) * 2 + 3); + +#define dec_fls(n) \ + fls(XL, XR, YL, YR, \ + (n) * 2 + 2, (n) * 2 + 3, \ + (n) * 2 + 0, (n) * 2 + 1); + +#define inpack(n) \ + ldr_input_be(RSRC, XL, XR, YL, YR, RT0); \ + ldr RT0, [CTX, #(key_table + ((n) * 8) + 0)]; \ + ldr RT1, [CTX, #(key_table + ((n) * 8) + 4)]; \ + eor XL, XL, RT0; \ + eor XR, XR, RT1; + +#define outunpack(n) \ + ldr RT0, [CTX, #(key_table + ((n) * 8) + 0)]; \ + ldr RT1, [CTX, #(key_table + ((n) * 8) + 4)]; \ + eor YL, YL, RT0; \ + eor YR, YR, RT1; \ + str_output_be(RDST, YL, YR, XL, XR, RT0, RT1); + +.globl _gcry_camellia_arm_encrypt_block +.type _gcry_camellia_arm_encrypt_block, at function; + +_gcry_camellia_arm_encrypt_block: + /* input: + * x0: keytable + * x1: dst + * x2: src + * x3: keybitlen + */ + + adr RTAB1, _gcry_camellia_arm_tables; + mov RMASK, #(0xff<<4); /* byte mask */ + add RTAB2, RTAB1, #(1 * 4); + add RTAB3, RTAB1, #(2 * 4); + add RTAB4, RTAB1, #(3 * 4); + + inpack(0); + + enc_rounds(0); + enc_fls(8); + enc_rounds(8); + enc_fls(16); + enc_rounds(16); + + cmp RKEYBITS, #(16 * 8); + bne .Lenc_256; + + outunpack(24); + + ret; +.ltorg + +.Lenc_256: + enc_fls(24); + enc_rounds(24); + + outunpack(32); + + ret; +.ltorg +.size _gcry_camellia_arm_encrypt_block,.-_gcry_camellia_arm_encrypt_block; + +.globl _gcry_camellia_arm_decrypt_block +.type _gcry_camellia_arm_decrypt_block, at function; + +_gcry_camellia_arm_decrypt_block: + /* input: + * x0: keytable + * x1: dst + * x2: src + * x3: keybitlen + */ + + adr RTAB1, _gcry_camellia_arm_tables; + mov RMASK, #(0xff<<4); /* byte mask */ + add RTAB2, RTAB1, #(1 * 4); + add RTAB3, RTAB1, #(2 * 4); + add RTAB4, RTAB1, #(3 * 4); + + cmp RKEYBITS, #(16 * 8); + bne .Ldec_256; + + inpack(24); + +.Ldec_128: + dec_rounds(16); + dec_fls(16); + dec_rounds(8); + dec_fls(8); + dec_rounds(0); + + outunpack(0); + + ret; +.ltorg + +.Ldec_256: + inpack(32); + dec_rounds(24); + dec_fls(24); + + b .Ldec_128; +.ltorg +.size _gcry_camellia_arm_decrypt_block,.-_gcry_camellia_arm_decrypt_block; + +/* Encryption/Decryption tables */ +.globl _gcry_camellia_arm_tables +.type _gcry_camellia_arm_tables, at object; +.balign 32 +_gcry_camellia_arm_tables: +.Lcamellia_sp1110: +.long 0x70707000 +.Lcamellia_sp0222: + .long 0x00e0e0e0 +.Lcamellia_sp3033: + .long 0x38003838 +.Lcamellia_sp4404: + .long 0x70700070 +.long 0x82828200, 0x00050505, 0x41004141, 0x2c2c002c +.long 0x2c2c2c00, 0x00585858, 0x16001616, 0xb3b300b3 +.long 0xececec00, 0x00d9d9d9, 0x76007676, 0xc0c000c0 +.long 0xb3b3b300, 0x00676767, 0xd900d9d9, 0xe4e400e4 +.long 0x27272700, 0x004e4e4e, 0x93009393, 0x57570057 +.long 0xc0c0c000, 0x00818181, 0x60006060, 0xeaea00ea +.long 0xe5e5e500, 0x00cbcbcb, 0xf200f2f2, 0xaeae00ae +.long 0xe4e4e400, 0x00c9c9c9, 0x72007272, 0x23230023 +.long 0x85858500, 0x000b0b0b, 0xc200c2c2, 0x6b6b006b +.long 0x57575700, 0x00aeaeae, 0xab00abab, 0x45450045 +.long 0x35353500, 0x006a6a6a, 0x9a009a9a, 
0xa5a500a5 +.long 0xeaeaea00, 0x00d5d5d5, 0x75007575, 0xeded00ed +.long 0x0c0c0c00, 0x00181818, 0x06000606, 0x4f4f004f +.long 0xaeaeae00, 0x005d5d5d, 0x57005757, 0x1d1d001d +.long 0x41414100, 0x00828282, 0xa000a0a0, 0x92920092 +.long 0x23232300, 0x00464646, 0x91009191, 0x86860086 +.long 0xefefef00, 0x00dfdfdf, 0xf700f7f7, 0xafaf00af +.long 0x6b6b6b00, 0x00d6d6d6, 0xb500b5b5, 0x7c7c007c +.long 0x93939300, 0x00272727, 0xc900c9c9, 0x1f1f001f +.long 0x45454500, 0x008a8a8a, 0xa200a2a2, 0x3e3e003e +.long 0x19191900, 0x00323232, 0x8c008c8c, 0xdcdc00dc +.long 0xa5a5a500, 0x004b4b4b, 0xd200d2d2, 0x5e5e005e +.long 0x21212100, 0x00424242, 0x90009090, 0x0b0b000b +.long 0xededed00, 0x00dbdbdb, 0xf600f6f6, 0xa6a600a6 +.long 0x0e0e0e00, 0x001c1c1c, 0x07000707, 0x39390039 +.long 0x4f4f4f00, 0x009e9e9e, 0xa700a7a7, 0xd5d500d5 +.long 0x4e4e4e00, 0x009c9c9c, 0x27002727, 0x5d5d005d +.long 0x1d1d1d00, 0x003a3a3a, 0x8e008e8e, 0xd9d900d9 +.long 0x65656500, 0x00cacaca, 0xb200b2b2, 0x5a5a005a +.long 0x92929200, 0x00252525, 0x49004949, 0x51510051 +.long 0xbdbdbd00, 0x007b7b7b, 0xde00dede, 0x6c6c006c +.long 0x86868600, 0x000d0d0d, 0x43004343, 0x8b8b008b +.long 0xb8b8b800, 0x00717171, 0x5c005c5c, 0x9a9a009a +.long 0xafafaf00, 0x005f5f5f, 0xd700d7d7, 0xfbfb00fb +.long 0x8f8f8f00, 0x001f1f1f, 0xc700c7c7, 0xb0b000b0 +.long 0x7c7c7c00, 0x00f8f8f8, 0x3e003e3e, 0x74740074 +.long 0xebebeb00, 0x00d7d7d7, 0xf500f5f5, 0x2b2b002b +.long 0x1f1f1f00, 0x003e3e3e, 0x8f008f8f, 0xf0f000f0 +.long 0xcecece00, 0x009d9d9d, 0x67006767, 0x84840084 +.long 0x3e3e3e00, 0x007c7c7c, 0x1f001f1f, 0xdfdf00df +.long 0x30303000, 0x00606060, 0x18001818, 0xcbcb00cb +.long 0xdcdcdc00, 0x00b9b9b9, 0x6e006e6e, 0x34340034 +.long 0x5f5f5f00, 0x00bebebe, 0xaf00afaf, 0x76760076 +.long 0x5e5e5e00, 0x00bcbcbc, 0x2f002f2f, 0x6d6d006d +.long 0xc5c5c500, 0x008b8b8b, 0xe200e2e2, 0xa9a900a9 +.long 0x0b0b0b00, 0x00161616, 0x85008585, 0xd1d100d1 +.long 0x1a1a1a00, 0x00343434, 0x0d000d0d, 0x04040004 +.long 0xa6a6a600, 0x004d4d4d, 0x53005353, 0x14140014 +.long 0xe1e1e100, 0x00c3c3c3, 0xf000f0f0, 0x3a3a003a +.long 0x39393900, 0x00727272, 0x9c009c9c, 0xdede00de +.long 0xcacaca00, 0x00959595, 0x65006565, 0x11110011 +.long 0xd5d5d500, 0x00ababab, 0xea00eaea, 0x32320032 +.long 0x47474700, 0x008e8e8e, 0xa300a3a3, 0x9c9c009c +.long 0x5d5d5d00, 0x00bababa, 0xae00aeae, 0x53530053 +.long 0x3d3d3d00, 0x007a7a7a, 0x9e009e9e, 0xf2f200f2 +.long 0xd9d9d900, 0x00b3b3b3, 0xec00ecec, 0xfefe00fe +.long 0x01010100, 0x00020202, 0x80008080, 0xcfcf00cf +.long 0x5a5a5a00, 0x00b4b4b4, 0x2d002d2d, 0xc3c300c3 +.long 0xd6d6d600, 0x00adadad, 0x6b006b6b, 0x7a7a007a +.long 0x51515100, 0x00a2a2a2, 0xa800a8a8, 0x24240024 +.long 0x56565600, 0x00acacac, 0x2b002b2b, 0xe8e800e8 +.long 0x6c6c6c00, 0x00d8d8d8, 0x36003636, 0x60600060 +.long 0x4d4d4d00, 0x009a9a9a, 0xa600a6a6, 0x69690069 +.long 0x8b8b8b00, 0x00171717, 0xc500c5c5, 0xaaaa00aa +.long 0x0d0d0d00, 0x001a1a1a, 0x86008686, 0xa0a000a0 +.long 0x9a9a9a00, 0x00353535, 0x4d004d4d, 0xa1a100a1 +.long 0x66666600, 0x00cccccc, 0x33003333, 0x62620062 +.long 0xfbfbfb00, 0x00f7f7f7, 0xfd00fdfd, 0x54540054 +.long 0xcccccc00, 0x00999999, 0x66006666, 0x1e1e001e +.long 0xb0b0b000, 0x00616161, 0x58005858, 0xe0e000e0 +.long 0x2d2d2d00, 0x005a5a5a, 0x96009696, 0x64640064 +.long 0x74747400, 0x00e8e8e8, 0x3a003a3a, 0x10100010 +.long 0x12121200, 0x00242424, 0x09000909, 0x00000000 +.long 0x2b2b2b00, 0x00565656, 0x95009595, 0xa3a300a3 +.long 0x20202000, 0x00404040, 0x10001010, 0x75750075 +.long 0xf0f0f000, 0x00e1e1e1, 0x78007878, 0x8a8a008a +.long 0xb1b1b100, 0x00636363, 
0xd800d8d8, 0xe6e600e6 +.long 0x84848400, 0x00090909, 0x42004242, 0x09090009 +.long 0x99999900, 0x00333333, 0xcc00cccc, 0xdddd00dd +.long 0xdfdfdf00, 0x00bfbfbf, 0xef00efef, 0x87870087 +.long 0x4c4c4c00, 0x00989898, 0x26002626, 0x83830083 +.long 0xcbcbcb00, 0x00979797, 0xe500e5e5, 0xcdcd00cd +.long 0xc2c2c200, 0x00858585, 0x61006161, 0x90900090 +.long 0x34343400, 0x00686868, 0x1a001a1a, 0x73730073 +.long 0x7e7e7e00, 0x00fcfcfc, 0x3f003f3f, 0xf6f600f6 +.long 0x76767600, 0x00ececec, 0x3b003b3b, 0x9d9d009d +.long 0x05050500, 0x000a0a0a, 0x82008282, 0xbfbf00bf +.long 0x6d6d6d00, 0x00dadada, 0xb600b6b6, 0x52520052 +.long 0xb7b7b700, 0x006f6f6f, 0xdb00dbdb, 0xd8d800d8 +.long 0xa9a9a900, 0x00535353, 0xd400d4d4, 0xc8c800c8 +.long 0x31313100, 0x00626262, 0x98009898, 0xc6c600c6 +.long 0xd1d1d100, 0x00a3a3a3, 0xe800e8e8, 0x81810081 +.long 0x17171700, 0x002e2e2e, 0x8b008b8b, 0x6f6f006f +.long 0x04040400, 0x00080808, 0x02000202, 0x13130013 +.long 0xd7d7d700, 0x00afafaf, 0xeb00ebeb, 0x63630063 +.long 0x14141400, 0x00282828, 0x0a000a0a, 0xe9e900e9 +.long 0x58585800, 0x00b0b0b0, 0x2c002c2c, 0xa7a700a7 +.long 0x3a3a3a00, 0x00747474, 0x1d001d1d, 0x9f9f009f +.long 0x61616100, 0x00c2c2c2, 0xb000b0b0, 0xbcbc00bc +.long 0xdedede00, 0x00bdbdbd, 0x6f006f6f, 0x29290029 +.long 0x1b1b1b00, 0x00363636, 0x8d008d8d, 0xf9f900f9 +.long 0x11111100, 0x00222222, 0x88008888, 0x2f2f002f +.long 0x1c1c1c00, 0x00383838, 0x0e000e0e, 0xb4b400b4 +.long 0x32323200, 0x00646464, 0x19001919, 0x78780078 +.long 0x0f0f0f00, 0x001e1e1e, 0x87008787, 0x06060006 +.long 0x9c9c9c00, 0x00393939, 0x4e004e4e, 0xe7e700e7 +.long 0x16161600, 0x002c2c2c, 0x0b000b0b, 0x71710071 +.long 0x53535300, 0x00a6a6a6, 0xa900a9a9, 0xd4d400d4 +.long 0x18181800, 0x00303030, 0x0c000c0c, 0xabab00ab +.long 0xf2f2f200, 0x00e5e5e5, 0x79007979, 0x88880088 +.long 0x22222200, 0x00444444, 0x11001111, 0x8d8d008d +.long 0xfefefe00, 0x00fdfdfd, 0x7f007f7f, 0x72720072 +.long 0x44444400, 0x00888888, 0x22002222, 0xb9b900b9 +.long 0xcfcfcf00, 0x009f9f9f, 0xe700e7e7, 0xf8f800f8 +.long 0xb2b2b200, 0x00656565, 0x59005959, 0xacac00ac +.long 0xc3c3c300, 0x00878787, 0xe100e1e1, 0x36360036 +.long 0xb5b5b500, 0x006b6b6b, 0xda00dada, 0x2a2a002a +.long 0x7a7a7a00, 0x00f4f4f4, 0x3d003d3d, 0x3c3c003c +.long 0x91919100, 0x00232323, 0xc800c8c8, 0xf1f100f1 +.long 0x24242400, 0x00484848, 0x12001212, 0x40400040 +.long 0x08080800, 0x00101010, 0x04000404, 0xd3d300d3 +.long 0xe8e8e800, 0x00d1d1d1, 0x74007474, 0xbbbb00bb +.long 0xa8a8a800, 0x00515151, 0x54005454, 0x43430043 +.long 0x60606000, 0x00c0c0c0, 0x30003030, 0x15150015 +.long 0xfcfcfc00, 0x00f9f9f9, 0x7e007e7e, 0xadad00ad +.long 0x69696900, 0x00d2d2d2, 0xb400b4b4, 0x77770077 +.long 0x50505000, 0x00a0a0a0, 0x28002828, 0x80800080 +.long 0xaaaaaa00, 0x00555555, 0x55005555, 0x82820082 +.long 0xd0d0d000, 0x00a1a1a1, 0x68006868, 0xecec00ec +.long 0xa0a0a000, 0x00414141, 0x50005050, 0x27270027 +.long 0x7d7d7d00, 0x00fafafa, 0xbe00bebe, 0xe5e500e5 +.long 0xa1a1a100, 0x00434343, 0xd000d0d0, 0x85850085 +.long 0x89898900, 0x00131313, 0xc400c4c4, 0x35350035 +.long 0x62626200, 0x00c4c4c4, 0x31003131, 0x0c0c000c +.long 0x97979700, 0x002f2f2f, 0xcb00cbcb, 0x41410041 +.long 0x54545400, 0x00a8a8a8, 0x2a002a2a, 0xefef00ef +.long 0x5b5b5b00, 0x00b6b6b6, 0xad00adad, 0x93930093 +.long 0x1e1e1e00, 0x003c3c3c, 0x0f000f0f, 0x19190019 +.long 0x95959500, 0x002b2b2b, 0xca00caca, 0x21210021 +.long 0xe0e0e000, 0x00c1c1c1, 0x70007070, 0x0e0e000e +.long 0xffffff00, 0x00ffffff, 0xff00ffff, 0x4e4e004e +.long 0x64646400, 0x00c8c8c8, 0x32003232, 0x65650065 +.long 0xd2d2d200, 
0x00a5a5a5, 0x69006969, 0xbdbd00bd +.long 0x10101000, 0x00202020, 0x08000808, 0xb8b800b8 +.long 0xc4c4c400, 0x00898989, 0x62006262, 0x8f8f008f +.long 0x00000000, 0x00000000, 0x00000000, 0xebeb00eb +.long 0x48484800, 0x00909090, 0x24002424, 0xcece00ce +.long 0xa3a3a300, 0x00474747, 0xd100d1d1, 0x30300030 +.long 0xf7f7f700, 0x00efefef, 0xfb00fbfb, 0x5f5f005f +.long 0x75757500, 0x00eaeaea, 0xba00baba, 0xc5c500c5 +.long 0xdbdbdb00, 0x00b7b7b7, 0xed00eded, 0x1a1a001a +.long 0x8a8a8a00, 0x00151515, 0x45004545, 0xe1e100e1 +.long 0x03030300, 0x00060606, 0x81008181, 0xcaca00ca +.long 0xe6e6e600, 0x00cdcdcd, 0x73007373, 0x47470047 +.long 0xdadada00, 0x00b5b5b5, 0x6d006d6d, 0x3d3d003d +.long 0x09090900, 0x00121212, 0x84008484, 0x01010001 +.long 0x3f3f3f00, 0x007e7e7e, 0x9f009f9f, 0xd6d600d6 +.long 0xdddddd00, 0x00bbbbbb, 0xee00eeee, 0x56560056 +.long 0x94949400, 0x00292929, 0x4a004a4a, 0x4d4d004d +.long 0x87878700, 0x000f0f0f, 0xc300c3c3, 0x0d0d000d +.long 0x5c5c5c00, 0x00b8b8b8, 0x2e002e2e, 0x66660066 +.long 0x83838300, 0x00070707, 0xc100c1c1, 0xcccc00cc +.long 0x02020200, 0x00040404, 0x01000101, 0x2d2d002d +.long 0xcdcdcd00, 0x009b9b9b, 0xe600e6e6, 0x12120012 +.long 0x4a4a4a00, 0x00949494, 0x25002525, 0x20200020 +.long 0x90909000, 0x00212121, 0x48004848, 0xb1b100b1 +.long 0x33333300, 0x00666666, 0x99009999, 0x99990099 +.long 0x73737300, 0x00e6e6e6, 0xb900b9b9, 0x4c4c004c +.long 0x67676700, 0x00cecece, 0xb300b3b3, 0xc2c200c2 +.long 0xf6f6f600, 0x00ededed, 0x7b007b7b, 0x7e7e007e +.long 0xf3f3f300, 0x00e7e7e7, 0xf900f9f9, 0x05050005 +.long 0x9d9d9d00, 0x003b3b3b, 0xce00cece, 0xb7b700b7 +.long 0x7f7f7f00, 0x00fefefe, 0xbf00bfbf, 0x31310031 +.long 0xbfbfbf00, 0x007f7f7f, 0xdf00dfdf, 0x17170017 +.long 0xe2e2e200, 0x00c5c5c5, 0x71007171, 0xd7d700d7 +.long 0x52525200, 0x00a4a4a4, 0x29002929, 0x58580058 +.long 0x9b9b9b00, 0x00373737, 0xcd00cdcd, 0x61610061 +.long 0xd8d8d800, 0x00b1b1b1, 0x6c006c6c, 0x1b1b001b +.long 0x26262600, 0x004c4c4c, 0x13001313, 0x1c1c001c +.long 0xc8c8c800, 0x00919191, 0x64006464, 0x0f0f000f +.long 0x37373700, 0x006e6e6e, 0x9b009b9b, 0x16160016 +.long 0xc6c6c600, 0x008d8d8d, 0x63006363, 0x18180018 +.long 0x3b3b3b00, 0x00767676, 0x9d009d9d, 0x22220022 +.long 0x81818100, 0x00030303, 0xc000c0c0, 0x44440044 +.long 0x96969600, 0x002d2d2d, 0x4b004b4b, 0xb2b200b2 +.long 0x6f6f6f00, 0x00dedede, 0xb700b7b7, 0xb5b500b5 +.long 0x4b4b4b00, 0x00969696, 0xa500a5a5, 0x91910091 +.long 0x13131300, 0x00262626, 0x89008989, 0x08080008 +.long 0xbebebe00, 0x007d7d7d, 0x5f005f5f, 0xa8a800a8 +.long 0x63636300, 0x00c6c6c6, 0xb100b1b1, 0xfcfc00fc +.long 0x2e2e2e00, 0x005c5c5c, 0x17001717, 0x50500050 +.long 0xe9e9e900, 0x00d3d3d3, 0xf400f4f4, 0xd0d000d0 +.long 0x79797900, 0x00f2f2f2, 0xbc00bcbc, 0x7d7d007d +.long 0xa7a7a700, 0x004f4f4f, 0xd300d3d3, 0x89890089 +.long 0x8c8c8c00, 0x00191919, 0x46004646, 0x97970097 +.long 0x9f9f9f00, 0x003f3f3f, 0xcf00cfcf, 0x5b5b005b +.long 0x6e6e6e00, 0x00dcdcdc, 0x37003737, 0x95950095 +.long 0xbcbcbc00, 0x00797979, 0x5e005e5e, 0xffff00ff +.long 0x8e8e8e00, 0x001d1d1d, 0x47004747, 0xd2d200d2 +.long 0x29292900, 0x00525252, 0x94009494, 0xc4c400c4 +.long 0xf5f5f500, 0x00ebebeb, 0xfa00fafa, 0x48480048 +.long 0xf9f9f900, 0x00f3f3f3, 0xfc00fcfc, 0xf7f700f7 +.long 0xb6b6b600, 0x006d6d6d, 0x5b005b5b, 0xdbdb00db +.long 0x2f2f2f00, 0x005e5e5e, 0x97009797, 0x03030003 +.long 0xfdfdfd00, 0x00fbfbfb, 0xfe00fefe, 0xdada00da +.long 0xb4b4b400, 0x00696969, 0x5a005a5a, 0x3f3f003f +.long 0x59595900, 0x00b2b2b2, 0xac00acac, 0x94940094 +.long 0x78787800, 0x00f0f0f0, 0x3c003c3c, 0x5c5c005c +.long 
0x98989800, 0x00313131, 0x4c004c4c, 0x02020002 +.long 0x06060600, 0x000c0c0c, 0x03000303, 0x4a4a004a +.long 0x6a6a6a00, 0x00d4d4d4, 0x35003535, 0x33330033 +.long 0xe7e7e700, 0x00cfcfcf, 0xf300f3f3, 0x67670067 +.long 0x46464600, 0x008c8c8c, 0x23002323, 0xf3f300f3 +.long 0x71717100, 0x00e2e2e2, 0xb800b8b8, 0x7f7f007f +.long 0xbababa00, 0x00757575, 0x5d005d5d, 0xe2e200e2 +.long 0xd4d4d400, 0x00a9a9a9, 0x6a006a6a, 0x9b9b009b +.long 0x25252500, 0x004a4a4a, 0x92009292, 0x26260026 +.long 0xababab00, 0x00575757, 0xd500d5d5, 0x37370037 +.long 0x42424200, 0x00848484, 0x21002121, 0x3b3b003b +.long 0x88888800, 0x00111111, 0x44004444, 0x96960096 +.long 0xa2a2a200, 0x00454545, 0x51005151, 0x4b4b004b +.long 0x8d8d8d00, 0x001b1b1b, 0xc600c6c6, 0xbebe00be +.long 0xfafafa00, 0x00f5f5f5, 0x7d007d7d, 0x2e2e002e +.long 0x72727200, 0x00e4e4e4, 0x39003939, 0x79790079 +.long 0x07070700, 0x000e0e0e, 0x83008383, 0x8c8c008c +.long 0xb9b9b900, 0x00737373, 0xdc00dcdc, 0x6e6e006e +.long 0x55555500, 0x00aaaaaa, 0xaa00aaaa, 0x8e8e008e +.long 0xf8f8f800, 0x00f1f1f1, 0x7c007c7c, 0xf5f500f5 +.long 0xeeeeee00, 0x00dddddd, 0x77007777, 0xb6b600b6 +.long 0xacacac00, 0x00595959, 0x56005656, 0xfdfd00fd +.long 0x0a0a0a00, 0x00141414, 0x05000505, 0x59590059 +.long 0x36363600, 0x006c6c6c, 0x1b001b1b, 0x98980098 +.long 0x49494900, 0x00929292, 0xa400a4a4, 0x6a6a006a +.long 0x2a2a2a00, 0x00545454, 0x15001515, 0x46460046 +.long 0x68686800, 0x00d0d0d0, 0x34003434, 0xbaba00ba +.long 0x3c3c3c00, 0x00787878, 0x1e001e1e, 0x25250025 +.long 0x38383800, 0x00707070, 0x1c001c1c, 0x42420042 +.long 0xf1f1f100, 0x00e3e3e3, 0xf800f8f8, 0xa2a200a2 +.long 0xa4a4a400, 0x00494949, 0x52005252, 0xfafa00fa +.long 0x40404000, 0x00808080, 0x20002020, 0x07070007 +.long 0x28282800, 0x00505050, 0x14001414, 0x55550055 +.long 0xd3d3d300, 0x00a7a7a7, 0xe900e9e9, 0xeeee00ee +.long 0x7b7b7b00, 0x00f6f6f6, 0xbd00bdbd, 0x0a0a000a +.long 0xbbbbbb00, 0x00777777, 0xdd00dddd, 0x49490049 +.long 0xc9c9c900, 0x00939393, 0xe400e4e4, 0x68680068 +.long 0x43434300, 0x00868686, 0xa100a1a1, 0x38380038 +.long 0xc1c1c100, 0x00838383, 0xe000e0e0, 0xa4a400a4 +.long 0x15151500, 0x002a2a2a, 0x8a008a8a, 0x28280028 +.long 0xe3e3e300, 0x00c7c7c7, 0xf100f1f1, 0x7b7b007b +.long 0xadadad00, 0x005b5b5b, 0xd600d6d6, 0xc9c900c9 +.long 0xf4f4f400, 0x00e9e9e9, 0x7a007a7a, 0xc1c100c1 +.long 0x77777700, 0x00eeeeee, 0xbb00bbbb, 0xe3e300e3 +.long 0xc7c7c700, 0x008f8f8f, 0xe300e3e3, 0xf4f400f4 +.long 0x80808000, 0x00010101, 0x40004040, 0xc7c700c7 +.long 0x9e9e9e00, 0x003d3d3d, 0x4f004f4f, 0x9e9e009e +.size _gcry_camellia_arm_tables,.-_gcry_camellia_arm_tables; + +#endif /*HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS*/ +#endif /*__AARCH64EL__*/ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index dfddb4a..1be35c9 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -285,12 +285,19 @@ static void Camellia_DecryptBlock(const int keyBitLength, keyBitLength); } +#ifdef __aarch64__ +# define CAMELLIA_encrypt_stack_burn_size (0) +# define CAMELLIA_decrypt_stack_burn_size (0) +#else +# define CAMELLIA_encrypt_stack_burn_size (15*4) +# define CAMELLIA_decrypt_stack_burn_size (15*4) +#endif + static unsigned int camellia_encrypt(void *c, byte *outbuf, const byte *inbuf) { CAMELLIA_context *ctx = c; Camellia_EncryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); -#define CAMELLIA_encrypt_stack_burn_size (15*4) return /*burn_stack*/ (CAMELLIA_encrypt_stack_burn_size); } @@ -299,7 +306,6 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf) { CAMELLIA_context *ctx=c; 
Camellia_DecryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); -#define CAMELLIA_decrypt_stack_burn_size (15*4) return /*burn_stack*/ (CAMELLIA_decrypt_stack_burn_size); } diff --git a/cipher/camellia.h b/cipher/camellia.h index d0e3c18..d7a1e6f 100644 --- a/cipher/camellia.h +++ b/cipher/camellia.h @@ -37,6 +37,11 @@ # define USE_ARM_ASM 1 # endif # endif +# if defined(__AARCH64EL__) +# ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS +# define USE_ARM_ASM 1 +# endif +# endif #endif #ifdef CAMELLIA_EXT_SYM_PREFIX #define CAMELLIA_PREFIX1(x,y) x ## y @@ -80,7 +85,7 @@ void Camellia_DecryptBlock(const int keyBitLength, const unsigned char *cipherText, const KEY_TABLE_TYPE keyTable, unsigned char *plaintext); -#endif /*!USE_ARMV6_ASM*/ +#endif /*!USE_ARM_ASM*/ #ifdef __cplusplus diff --git a/configure.ac b/configure.ac index ca82af9..3e926a5 100644 --- a/configure.ac +++ b/configure.ac @@ -2123,6 +2123,10 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_CIPHERS="$GCRYPT_CIPHERS camellia-arm.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS camellia-aarch64.lo" + ;; esac if test x"$avxsupport" = xyes ; then From jussi.kivilinna at iki.fi Sun Sep 11 16:39:24 2016 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 11 Sep 2016 17:39:24 +0300 Subject: [PATCH 2/2] Add Aarch64 assembly implementation of Twofish In-Reply-To: <147360475926.22786.5098848227904913348.stgit@localhost6.localdomain6> References: <147360475926.22786.5098848227904913348.stgit@localhost6.localdomain6> Message-ID: <147360476431.22786.3983620669858245715.stgit@localhost6.localdomain6> * cipher/Makefile.am: Add 'twofish-aarch64.S'. * cipher/twofish-aarch64.S: New. * cipher/twofish.c: Enable USE_ARM_ASM if __AARCH64EL__ and HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined. * configure.ac [host=aarch64]: Add 'twofish-aarch64.lo'. -- Patch adds ARMv8/Aarch64 implementation of Twofish. 
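The glue side of this change is worth seeing once: with USE_ARM_ASM defined, twofish.c calls the assembly entry points instead of the portable C round code. A minimal sketch of that dispatch, assuming the exported symbol names from the .S file below; the context typedef and the burn-stack constant here are placeholders for illustration, not the exact upstream definitions:

    /* Sketch: dispatch from C glue to the AArch64 assembly
       (hypothetical simplification of cipher/twofish.c). */
    typedef struct twofish_context TWOFISH_context;  /* opaque here */

    /* Entry points provided by twofish-aarch64.S / twofish-arm.S. */
    extern void _gcry_twofish_arm_encrypt_block(const TWOFISH_context *ctx,
                                                unsigned char *out,
                                                const unsigned char *in);

    static unsigned int
    twofish_encrypt (void *context, unsigned char *outbuf,
                     const unsigned char *inbuf)
    {
      _gcry_twofish_arm_encrypt_block (context, outbuf, inbuf);
      return 4 * sizeof (void *);   /* burn_stack hint; placeholder value */
    }

The returned value feeds libgcrypt's stack-burning convention, discussed with the Camellia commit further down.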
Benchmark on Cortex-A53 (1152 Mhz):

Before:
 TWOFISH  | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  27.51 ns/B  34.67 MiB/s  31.69 c/B
 ECB dec  |  26.37 ns/B  36.17 MiB/s  30.38 c/B
 CBC enc  |  28.64 ns/B  33.29 MiB/s  33.00 c/B
 CBC dec  |  26.21 ns/B  36.39 MiB/s  30.19 c/B
 CFB enc  |  28.54 ns/B  33.42 MiB/s  32.88 c/B
 CFB dec  |  27.40 ns/B  34.81 MiB/s  31.56 c/B
 OFB enc  |  28.38 ns/B  33.61 MiB/s  32.69 c/B
 OFB dec  |  28.37 ns/B  33.61 MiB/s  32.69 c/B
 CTR enc  |  27.57 ns/B  34.60 MiB/s  31.76 c/B
 CTR dec  |  27.57 ns/B  34.60 MiB/s  31.76 c/B
 CCM enc  |  55.28 ns/B  17.25 MiB/s  63.69 c/B
 CCM dec  |  55.29 ns/B  17.25 MiB/s  63.70 c/B
 CCM auth |  27.83 ns/B  34.27 MiB/s  32.06 c/B
 GCM enc  |  28.86 ns/B  33.04 MiB/s  33.25 c/B
 GCM dec  |  28.87 ns/B  33.04 MiB/s  33.25 c/B
 GCM auth |   1.30 ns/B  731.9 MiB/s   1.50 c/B
 OCB enc  |  29.69 ns/B  32.12 MiB/s  34.20 c/B
 OCB dec  |  28.50 ns/B  33.47 MiB/s  32.83 c/B
 OCB auth |  29.04 ns/B  32.84 MiB/s  33.45 c/B
 =
After (~1.3x faster):
 TWOFISH  | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  19.97 ns/B  47.77 MiB/s  23.00 c/B
 ECB dec  |  18.29 ns/B  52.16 MiB/s  21.06 c/B
 CBC enc  |  20.94 ns/B  45.54 MiB/s  24.13 c/B
 CBC dec  |  18.34 ns/B  52.00 MiB/s  21.13 c/B
 CFB enc  |  20.83 ns/B  45.77 MiB/s  24.00 c/B
 CFB dec  |  19.97 ns/B  47.76 MiB/s  23.00 c/B
 OFB enc  |  20.94 ns/B  45.54 MiB/s  24.13 c/B
 OFB dec  |  20.94 ns/B  45.54 MiB/s  24.13 c/B
 CTR enc  |  20.19 ns/B  47.24 MiB/s  23.26 c/B
 CTR dec  |  20.19 ns/B  47.24 MiB/s  23.26 c/B
 CCM enc  |  40.53 ns/B  23.53 MiB/s  46.69 c/B
 CCM dec  |  40.53 ns/B  23.53 MiB/s  46.69 c/B
 CCM auth |  20.40 ns/B  46.74 MiB/s  23.50 c/B
 GCM enc  |  21.49 ns/B  44.39 MiB/s  24.75 c/B
 GCM dec  |  21.48 ns/B  44.39 MiB/s  24.75 c/B
 GCM auth |   1.30 ns/B  731.8 MiB/s   1.50 c/B
 OCB enc  |  22.15 ns/B  43.05 MiB/s  25.52 c/B
 OCB dec  |  20.47 ns/B  46.58 MiB/s  23.59 c/B
 OCB auth |  21.64 ns/B  44.07 MiB/s  24.93 c/B
 =
Signed-off-by: Jussi Kivilinna
---
0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 305a3b9..ac0ec58 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -99,7 +99,7 @@ keccak.c keccak_permute_32.h keccak_permute_64.h keccak-armv7-neon.S \
 stribog.c \
 tiger.c \
 whirlpool.c whirlpool-sse2-amd64.S \
-twofish.c twofish-amd64.S twofish-arm.S \
+twofish.c twofish-amd64.S twofish-arm.S twofish-aarch64.S \
 rfc2268.c \
 camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \
 camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S
diff --git a/cipher/twofish-aarch64.S b/cipher/twofish-aarch64.S
new file mode 100644
index 0000000..99c4675
--- /dev/null
+++ b/cipher/twofish-aarch64.S
@@ -0,0 +1,317 @@
+/* twofish-aarch64.S - ARMv8/AArch64 assembly implementation of Twofish cipher
+ *
+ * Copyright (C) 2016 Jussi Kivilinna
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see .
+ */ + +#include + +#if defined(__AARCH64EL__) +#ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS + +.text + +/* structure of TWOFISH_context: */ +#define s0 0 +#define s1 ((s0) + 4 * 256) +#define s2 ((s1) + 4 * 256) +#define s3 ((s2) + 4 * 256) +#define w ((s3) + 4 * 256) +#define k ((w) + 4 * 8) + +/* register macros */ +#define CTX x0 +#define RDST x1 +#define RSRC x2 +#define CTXs0 CTX +#define CTXs1 x3 +#define CTXs2 x4 +#define CTXs3 x5 +#define CTXw x17 + +#define RA w6 +#define RB w7 +#define RC w8 +#define RD w9 + +#define RX w10 +#define RY w11 + +#define xRX x10 +#define xRY x11 + +#define RMASK w12 + +#define RT0 w13 +#define RT1 w14 +#define RT2 w15 +#define RT3 w16 + +#define xRT0 x13 +#define xRT1 x14 +#define xRT2 x15 +#define xRT3 x16 + +/* helper macros */ +#ifndef __AARCH64EL__ + /* bswap on big-endian */ + #define host_to_le(reg) \ + rev reg, reg; + #define le_to_host(reg) \ + rev reg, reg; +#else + /* nop on little-endian */ + #define host_to_le(reg) /*_*/ + #define le_to_host(reg) /*_*/ +#endif + +#define ldr_input_aligned_le(rin, a, b, c, d) \ + ldr a, [rin, #0]; \ + ldr b, [rin, #4]; \ + le_to_host(a); \ + ldr c, [rin, #8]; \ + le_to_host(b); \ + ldr d, [rin, #12]; \ + le_to_host(c); \ + le_to_host(d); + +#define str_output_aligned_le(rout, a, b, c, d) \ + le_to_host(a); \ + le_to_host(b); \ + str a, [rout, #0]; \ + le_to_host(c); \ + str b, [rout, #4]; \ + le_to_host(d); \ + str c, [rout, #8]; \ + str d, [rout, #12]; + +/* unaligned word reads/writes allowed */ +#define ldr_input_le(rin, ra, rb, rc, rd, rtmp) \ + ldr_input_aligned_le(rin, ra, rb, rc, rd) + +#define str_output_le(rout, ra, rb, rc, rd, rtmp0, rtmp1) \ + str_output_aligned_le(rout, ra, rb, rc, rd) + +/********************************************************************** + 1-way twofish + **********************************************************************/ +#define encrypt_round(a, b, rc, rd, n, ror_a, adj_a) \ + and RT0, RMASK, b, lsr#(8 - 2); \ + and RY, RMASK, b, lsr#(16 - 2); \ + and RT1, RMASK, b, lsr#(24 - 2); \ + ldr RY, [CTXs3, xRY]; \ + and RT2, RMASK, b, lsl#(2); \ + ldr RT0, [CTXs2, xRT0]; \ + and RT3, RMASK, a, lsr#(16 - 2 + (adj_a)); \ + ldr RT1, [CTXs0, xRT1]; \ + and RX, RMASK, a, lsr#(8 - 2 + (adj_a)); \ + ldr RT2, [CTXs1, xRT2]; \ + ldr RX, [CTXs1, xRX]; \ + ror_a(a); \ + \ + eor RY, RY, RT0; \ + ldr RT3, [CTXs2, xRT3]; \ + and RT0, RMASK, a, lsl#(2); \ + eor RY, RY, RT1; \ + and RT1, RMASK, a, lsr#(24 - 2); \ + eor RY, RY, RT2; \ + ldr RT0, [CTXs0, xRT0]; \ + eor RX, RX, RT3; \ + ldr RT1, [CTXs3, xRT1]; \ + eor RX, RX, RT0; \ + \ + ldr RT3, [CTXs3, #(k - s3 + 8 * (n) + 4)]; \ + eor RX, RX, RT1; \ + ldr RT2, [CTXs3, #(k - s3 + 8 * (n))]; \ + \ + add RT0, RX, RY, lsl #1; \ + add RX, RX, RY; \ + add RT0, RT0, RT3; \ + add RX, RX, RT2; \ + eor rd, RT0, rd, ror #31; \ + eor rc, rc, RX; + +#define dummy(x) /*_*/ + +#define ror1(r) \ + ror r, r, #1; + +#define decrypt_round(a, b, rc, rd, n, ror_b, adj_b) \ + and RT3, RMASK, b, lsl#(2 - (adj_b)); \ + and RT1, RMASK, b, lsr#(8 - 2 + (adj_b)); \ + ror_b(b); \ + and RT2, RMASK, a, lsl#(2); \ + and RT0, RMASK, a, lsr#(8 - 2); \ + \ + ldr RY, [CTXs1, xRT3]; \ + ldr RX, [CTXs0, xRT2]; \ + and RT3, RMASK, b, lsr#(16 - 2); \ + ldr RT1, [CTXs2, xRT1]; \ + and RT2, RMASK, a, lsr#(16 - 2); \ + ldr RT0, [CTXs1, xRT0]; \ + \ + ldr RT3, [CTXs3, xRT3]; \ + eor RY, RY, RT1; \ + \ + and RT1, RMASK, b, lsr#(24 - 2); \ + eor RX, RX, RT0; \ + ldr RT2, [CTXs2, xRT2]; \ + and RT0, RMASK, a, lsr#(24 - 2); \ + \ + ldr RT1, [CTXs0, xRT1]; \ + \ + eor RY, RY, RT3; \ 
+ ldr RT0, [CTXs3, xRT0]; \
+ eor RX, RX, RT2; \
+ eor RY, RY, RT1; \
+ \
+ ldr RT1, [CTXs3, #(k - s3 + 8 * (n) + 4)]; \
+ eor RX, RX, RT0; \
+ ldr RT2, [CTXs3, #(k - s3 + 8 * (n))]; \
+ \
+ add RT0, RX, RY, lsl #1; \
+ add RX, RX, RY; \
+ add RT0, RT0, RT1; \
+ add RX, RX, RT2; \
+ eor rd, rd, RT0; \
+ eor rc, RX, rc, ror #31;
+
+#define first_encrypt_cycle(nc) \
+ encrypt_round(RA, RB, RC, RD, (nc) * 2, dummy, 0); \
+ encrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, ror1, 1);
+
+#define encrypt_cycle(nc) \
+ encrypt_round(RA, RB, RC, RD, (nc) * 2, ror1, 1); \
+ encrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, ror1, 1);
+
+#define last_encrypt_cycle(nc) \
+ encrypt_round(RA, RB, RC, RD, (nc) * 2, ror1, 1); \
+ encrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, ror1, 1); \
+ ror1(RA);
+
+#define first_decrypt_cycle(nc) \
+ decrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, dummy, 0); \
+ decrypt_round(RA, RB, RC, RD, (nc) * 2, ror1, 1);
+
+#define decrypt_cycle(nc) \
+ decrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, ror1, 1); \
+ decrypt_round(RA, RB, RC, RD, (nc) * 2, ror1, 1);
+
+#define last_decrypt_cycle(nc) \
+ decrypt_round(RC, RD, RA, RB, (nc) * 2 + 1, ror1, 1); \
+ decrypt_round(RA, RB, RC, RD, (nc) * 2, ror1, 1); \
+ ror1(RD);
+
+.globl _gcry_twofish_arm_encrypt_block
+.type _gcry_twofish_arm_encrypt_block,%function;
+
+_gcry_twofish_arm_encrypt_block:
+ /* input:
+ * x0: ctx
+ * x1: dst
+ * x2: src
+ */
+
+ add CTXw, CTX, #(w);
+
+ ldr_input_le(RSRC, RA, RB, RC, RD, RT0);
+
+ /* Input whitening */
+ ldp RT0, RT1, [CTXw, #(0*8)];
+ ldp RT2, RT3, [CTXw, #(1*8)];
+ add CTXs3, CTX, #(s3);
+ add CTXs2, CTX, #(s2);
+ add CTXs1, CTX, #(s1);
+ mov RMASK, #(0xff << 2);
+ eor RA, RA, RT0;
+ eor RB, RB, RT1;
+ eor RC, RC, RT2;
+ eor RD, RD, RT3;
+
+ first_encrypt_cycle(0);
+ encrypt_cycle(1);
+ encrypt_cycle(2);
+ encrypt_cycle(3);
+ encrypt_cycle(4);
+ encrypt_cycle(5);
+ encrypt_cycle(6);
+ last_encrypt_cycle(7);
+
+ /* Output whitening */
+ ldp RT0, RT1, [CTXw, #(2*8)];
+ ldp RT2, RT3, [CTXw, #(3*8)];
+ eor RC, RC, RT0;
+ eor RD, RD, RT1;
+ eor RA, RA, RT2;
+ eor RB, RB, RT3;
+
+ str_output_le(RDST, RC, RD, RA, RB, RT0, RT1);
+
+ ret;
+.ltorg
+.size _gcry_twofish_arm_encrypt_block,.-_gcry_twofish_arm_encrypt_block;
+
+.globl _gcry_twofish_arm_decrypt_block
+.type _gcry_twofish_arm_decrypt_block,%function;
+
+_gcry_twofish_arm_decrypt_block:
+ /* input:
+ * x0: ctx
+ * x1: dst
+ * x2: src
+ */
+
+ add CTXw, CTX, #(w);
+
+ ldr_input_le(RSRC, RC, RD, RA, RB, RT0);
+
+ /* Input whitening */
+ ldp RT0, RT1, [CTXw, #(2*8)];
+ ldp RT2, RT3, [CTXw, #(3*8)];
+ add CTXs3, CTX, #(s3);
+ add CTXs2, CTX, #(s2);
+ add CTXs1, CTX, #(s1);
+ mov RMASK, #(0xff << 2);
+ eor RC, RC, RT0;
+ eor RD, RD, RT1;
+ eor RA, RA, RT2;
+ eor RB, RB, RT3;
+
+ first_decrypt_cycle(7);
+ decrypt_cycle(6);
+ decrypt_cycle(5);
+ decrypt_cycle(4);
+ decrypt_cycle(3);
+ decrypt_cycle(2);
+ decrypt_cycle(1);
+ last_decrypt_cycle(0);
+
+ /* Output whitening */
+ ldp RT0, RT1, [CTXw, #(0*8)];
+ ldp RT2, RT3, [CTXw, #(1*8)];
+ eor RA, RA, RT0;
+ eor RB, RB, RT1;
+ eor RC, RC, RT2;
+ eor RD, RD, RT3;
+
+ str_output_le(RDST, RA, RB, RC, RD, RT0, RT1);
+
+ ret;
+.size _gcry_twofish_arm_decrypt_block,.-_gcry_twofish_arm_decrypt_block;
+
+#endif /*HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS*/
+#endif /*__AARCH64EL__*/
diff --git a/cipher/twofish.c b/cipher/twofish.c
index f6ecd67..7a4d26a 100644
--- a/cipher/twofish.c
+++ b/cipher/twofish.c
@@ -66,6 +66,11 @@
 # define USE_ARM_ASM 1
 # endif
 #endif
+# if defined(__AARCH64EL__)
+# ifdef
HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS +# define USE_ARM_ASM 1 +# endif +# endif /* Prototype for the self-test function. */ diff --git a/configure.ac b/configure.ac index 3e926a5..7bbf4bd 100644 --- a/configure.ac +++ b/configure.ac @@ -2075,6 +2075,10 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_CIPHERS="$GCRYPT_CIPHERS twofish-arm.lo" ;; + aarch64-*-*) + # Build with the assembly implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS twofish-aarch64.lo" + ;; esac fi From cvs at cvs.gnupg.org Sun Sep 11 17:31:25 2016 From: cvs at cvs.gnupg.org (by Jussi Kivilinna) Date: Sun, 11 Sep 2016 17:31:25 +0200 Subject: [git] GCRYPT - branch, master, updated. libgcrypt-1.7.3-8-g5418d9c Message-ID: This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "The GNU crypto library". The branch, master has been updated via 5418d9ca4c0e087fd6872ad350a996fe74880d86 (commit) via de73a2e7237ba7c34ce48bb5fb671aa3993de832 (commit) from 4cd8d40d698564d24ece2af24546e34c58bf2961 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- commit 5418d9ca4c0e087fd6872ad350a996fe74880d86 Author: Jussi Kivilinna Date: Wed Apr 27 18:18:54 2016 +0300 Add Aarch64 assembly implementation of Twofish * cipher/Makefile.am: Add 'twofish-aarch64.S'. * cipher/twofish-aarch64.S: New. * cipher/twofish.c: Enable USE_ARM_ASM if __AARCH64EL__ and HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined. * configure.ac [host=aarch64]: Add 'twofish-aarch64.lo'. -- Patch adds ARMv8/Aarch64 implementation of Twofish. 
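One implementation detail shared by both AArch64 files in this push: s-box indices are computed with a pre-scaled mask rather than an extract-then-shift pair. RMASK is set to (0xff << 2), and each extraction under-shifts by 2, so a single AND directly yields a byte offset into a table of 32-bit words. In C terms (a sketch; s2 stands in for the context's second s-box table):

    #include <stdint.h>

    static const uint32_t s2[256];  /* stand-in for ctx->s2 */

    static uint32_t lookup_byte1 (uint32_t b)
    {
      /* Same as s2[(b >> 8) & 0xff]: the mask 0xff<<2 and the reduced
         shift (8 - 2) fold the *4 word stride into the AND, mirroring
         "and RT0, RMASK, b, lsr#(8 - 2)" + "ldr RT0, [CTXs2, xRT0]". */
      uint32_t offset = (b >> (8 - 2)) & (0xff << 2);
      return *(const uint32_t *)((const uint8_t *)s2 + offset);
    }

The Camellia file plays the same trick with (0xff << 4), since its four tables are interleaved at 4-byte offsets with a 16-byte row stride.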
Signed-off-by: Jussi Kivilinna

commit de73a2e7237ba7c34ce48bb5fb671aa3993de832
Author: Jussi Kivilinna
Date: Wed Apr 27 18:18:54 2016 +0300

Add Aarch64 assembly implementation of Camellia

* cipher/Makefile.am: Add 'camellia-aarch64.S'.
* cipher/camellia-aarch64.S: New.
* cipher/camellia-glue.c [USE_ARM_ASM][__aarch64__]: Set stack burn size to zero.
* cipher/camellia.h: Enable USE_ARM_ASM if __AARCH64EL__ and HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS defined.
* configure.ac [host=aarch64]: Add 'camellia-aarch64.lo'.
--
Patch adds ARMv8/Aarch64 implementation of Camellia.

Benchmark on Cortex-A53 (1152 Mhz):

Before:
 CAMELLIA128 | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  39.71 ns/B  24.01 MiB/s  45.75 c/B
 ECB dec  |  39.72 ns/B  24.01 MiB/s  45.75 c/B
 CBC enc  |  40.80 ns/B  23.38 MiB/s  47.00 c/B
 CBC dec  |  39.66 ns/B  24.05 MiB/s  45.69 c/B
 CFB enc  |  40.69 ns/B  23.44 MiB/s  46.88 c/B
 CFB dec  |  39.66 ns/B  24.05 MiB/s  45.69 c/B
 OFB enc  |  40.69 ns/B  23.44 MiB/s  46.88 c/B
 OFB dec  |  40.69 ns/B  23.44 MiB/s  46.88 c/B
 CTR enc  |  39.88 ns/B  23.91 MiB/s  45.94 c/B
 CTR dec  |  39.88 ns/B  23.91 MiB/s  45.94 c/B
 CCM enc  |  79.97 ns/B  11.92 MiB/s  92.13 c/B
 CCM dec  |  79.97 ns/B  11.93 MiB/s  92.13 c/B
 CCM auth |  40.20 ns/B  23.72 MiB/s  46.31 c/B
 GCM enc  |  41.18 ns/B  23.16 MiB/s  47.44 c/B
 GCM dec  |  41.18 ns/B  23.16 MiB/s  47.44 c/B
 GCM auth |   1.30 ns/B  732.7 MiB/s   1.50 c/B
 OCB enc  |  42.04 ns/B  22.69 MiB/s  48.43 c/B
 OCB dec  |  42.03 ns/B  22.69 MiB/s  48.42 c/B
 OCB auth |  41.38 ns/B  23.05 MiB/s  47.67 c/B
 =
 CAMELLIA256 | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  52.36 ns/B  18.22 MiB/s  60.31 c/B
 ECB dec  |  52.36 ns/B  18.22 MiB/s  60.31 c/B
 CBC enc  |  53.39 ns/B  17.86 MiB/s  61.50 c/B
 CBC dec  |  52.14 ns/B  18.29 MiB/s  60.06 c/B
 CFB enc  |  53.28 ns/B  17.90 MiB/s  61.38 c/B
 CFB dec  |  52.14 ns/B  18.29 MiB/s  60.06 c/B
 OFB enc  |  53.17 ns/B  17.94 MiB/s  61.25 c/B
 OFB dec  |  53.17 ns/B  17.94 MiB/s  61.25 c/B
 CTR enc  |  52.36 ns/B  18.21 MiB/s  60.32 c/B
 CTR dec  |  52.36 ns/B  18.21 MiB/s  60.32 c/B
 CCM enc  |  105.0 ns/B   9.08 MiB/s  120.9 c/B
 CCM dec  |  105.0 ns/B   9.08 MiB/s  120.9 c/B
 CCM auth |  52.74 ns/B  18.08 MiB/s  60.75 c/B
 GCM enc  |  53.66 ns/B  17.77 MiB/s  61.81 c/B
 GCM dec  |  53.66 ns/B  17.77 MiB/s  61.82 c/B
 GCM auth |   1.30 ns/B  732.3 MiB/s   1.50 c/B
 OCB enc  |  54.54 ns/B  17.49 MiB/s  62.83 c/B
 OCB dec  |  54.48 ns/B  17.50 MiB/s  62.77 c/B
 OCB auth |  53.89 ns/B  17.70 MiB/s  62.09 c/B
 =
After (~1.7x faster):
 CAMELLIA128 | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  22.25 ns/B  42.87 MiB/s  25.63 c/B
 ECB dec  |  22.25 ns/B  42.87 MiB/s  25.63 c/B
 CBC enc  |  23.27 ns/B  40.97 MiB/s  26.81 c/B
 CBC dec  |  22.14 ns/B  43.08 MiB/s  25.50 c/B
 CFB enc  |  23.17 ns/B  41.17 MiB/s  26.69 c/B
 CFB dec  |  22.14 ns/B  43.08 MiB/s  25.50 c/B
 OFB enc  |  23.11 ns/B  41.26 MiB/s  26.63 c/B
 OFB dec  |  23.11 ns/B  41.26 MiB/s  26.63 c/B
 CTR enc  |  22.36 ns/B  42.65 MiB/s  25.76 c/B
 CTR dec  |  22.36 ns/B  42.65 MiB/s  25.76 c/B
 CCM enc  |  44.87 ns/B  21.26 MiB/s  51.69 c/B
 CCM dec  |  44.87 ns/B  21.25 MiB/s  51.69 c/B
 CCM auth |  22.62 ns/B  42.15 MiB/s  26.06 c/B
 GCM enc  |  23.66 ns/B  40.31 MiB/s  27.25 c/B
 GCM dec  |  23.66 ns/B  40.31 MiB/s  27.25 c/B
 GCM auth |   1.30 ns/B  732.0 MiB/s   1.50 c/B
 OCB enc  |  24.32 ns/B  39.21 MiB/s  28.02 c/B
 OCB dec  |  24.32 ns/B  39.21 MiB/s  28.02 c/B
 OCB auth |  23.75 ns/B  40.15 MiB/s  27.36 c/B
 =
 CAMELLIA256 | nanosecs/byte  mebibytes/sec  cycles/byte
 ECB enc  |  29.08 ns/B  32.79 MiB/s  33.50 c/B
 ECB dec  |  29.19 ns/B  32.67 MiB/s  33.63 c/B
 CBC enc  |  30.11 ns/B  31.67 MiB/s  34.69 c/B
 CBC dec  |  29.05 ns/B  32.83 MiB/s  33.47 c/B
 CFB enc  |  30.00 ns/B  31.79 MiB/s  34.56 c/B
 CFB dec  |  28.97 ns/B  32.91 MiB/s  33.38 c/B
 OFB enc  |  29.95 ns/B  31.84 MiB/s  34.50 c/B
 OFB dec  |  29.95 ns/B  31.84 MiB/s  34.50 c/B
 CTR enc  |  29.19 ns/B  32.67 MiB/s  33.63 c/B
 CTR dec  |  29.19 ns/B  32.67 MiB/s  33.63 c/B
 CCM enc  |  58.54 ns/B  16.29 MiB/s  67.43 c/B
 CCM dec  |  58.54 ns/B  16.29 MiB/s  67.44 c/B
 CCM auth |  29.46 ns/B  32.37 MiB/s  33.94 c/B
 GCM enc  |  30.49 ns/B  31.28 MiB/s  35.12 c/B
 GCM dec  |  30.49 ns/B  31.27 MiB/s  35.13 c/B
 GCM auth |   1.30 ns/B  731.6 MiB/s   1.50 c/B
 OCB enc  |  31.16 ns/B  30.61 MiB/s  35.90 c/B
 OCB dec  |  31.22 ns/B  30.55 MiB/s  35.96 c/B
 OCB auth |  30.59 ns/B  31.18 MiB/s  35.24 c/B
 =
Signed-off-by: Jussi Kivilinna
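The camellia-glue.c hunk in this commit is the reason for the "Set stack burn size to zero" changelog line: the portable C code may spill sensitive round state onto the stack (hence the 15*4 estimate), while the AArch64 assembly keeps everything in registers, leaving nothing to wipe. A rough sketch of the convention, with burn_the_stack as a hypothetical stand-in for libgcrypt's internal stack-wiping primitive:

    #include <string.h>

    /* Overwrite roughly 'bytes' bytes of (dead) stack below the caller.
       The real primitive uses a wipe the compiler cannot optimize away. */
    static void burn_the_stack (unsigned int bytes)
    {
      volatile char buf[64];
      memset ((void *)buf, 0, sizeof buf);
      if (bytes > sizeof buf)
        burn_the_stack (bytes - sizeof buf);  /* recurse to go deeper */
    }

    /* Callers burn whatever the cipher primitive reports it used:
       burn = camellia_encrypt (ctx, out, in);
       if (burn) burn_the_stack (burn);                              */

Returning 0 from the glue functions therefore turns the post-encryption wipe into a no-op on AArch64.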
+ */ + +#include + +#if defined(__AARCH64EL__) +#ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS + +.text + +/* struct camellia_ctx: */ +#define key_table 0 + +/* register macros */ +#define CTX x0 +#define RDST x1 +#define RSRC x2 +#define RKEYBITS x3 + +#define RTAB1 x4 +#define RTAB2 x5 +#define RTAB3 x6 +#define RTAB4 x7 +#define RMASK w8 + +#define IL w9 +#define IR w10 + +#define xIL x9 +#define xIR x10 + +#define XL w11 +#define XR w12 +#define YL w13 +#define YR w14 + +#define RT0 w15 +#define RT1 w16 +#define RT2 w17 +#define RT3 w18 + +#define xRT0 x15 +#define xRT1 x16 +#define xRT2 x17 +#define xRT3 x18 + +#ifdef __AARCH64EL__ + #define host_to_be(reg, rtmp) \ + rev reg, reg; + #define be_to_host(reg, rtmp) \ + rev reg, reg; +#else + /* nop on big-endian */ + #define host_to_be(reg, rtmp) /*_*/ + #define be_to_host(reg, rtmp) /*_*/ +#endif + +#define ldr_input_aligned_be(rin, a, b, c, d, rtmp) \ + ldr a, [rin, #0]; \ + ldr b, [rin, #4]; \ + be_to_host(a, rtmp); \ + ldr c, [rin, #8]; \ + be_to_host(b, rtmp); \ + ldr d, [rin, #12]; \ + be_to_host(c, rtmp); \ + be_to_host(d, rtmp); + +#define str_output_aligned_be(rout, a, b, c, d, rtmp) \ + be_to_host(a, rtmp); \ + be_to_host(b, rtmp); \ + str a, [rout, #0]; \ + be_to_host(c, rtmp); \ + str b, [rout, #4]; \ + be_to_host(d, rtmp); \ + str c, [rout, #8]; \ + str d, [rout, #12]; + +/* unaligned word reads/writes allowed */ +#define ldr_input_be(rin, ra, rb, rc, rd, rtmp) \ + ldr_input_aligned_be(rin, ra, rb, rc, rd, rtmp) + +#define str_output_be(rout, ra, rb, rc, rd, rtmp0, rtmp1) \ + str_output_aligned_be(rout, ra, rb, rc, rd, rtmp0) + +/********************************************************************** + 1-way camellia + **********************************************************************/ +#define roundsm(xl, xr, kl, kr, yl, yr) \ + ldr RT2, [CTX, #(key_table + ((kl) * 4))]; \ + and IR, RMASK, xr, lsl#(4); /*sp1110*/ \ + ldr RT3, [CTX, #(key_table + ((kr) * 4))]; \ + and IL, RMASK, xl, lsr#(24 - 4); /*sp1110*/ \ + and RT0, RMASK, xr, lsr#(16 - 4); /*sp3033*/ \ + ldr IR, [RTAB1, xIR]; \ + and RT1, RMASK, xl, lsr#(8 - 4); /*sp3033*/ \ + eor yl, yl, RT2; \ + ldr IL, [RTAB1, xIL]; \ + eor yr, yr, RT3; \ + \ + ldr RT0, [RTAB3, xRT0]; \ + ldr RT1, [RTAB3, xRT1]; \ + \ + and RT2, RMASK, xr, lsr#(24 - 4); /*sp0222*/ \ + and RT3, RMASK, xl, lsr#(16 - 4); /*sp0222*/ \ + \ + eor IR, IR, RT0; \ + eor IL, IL, RT1; \ + \ + ldr RT2, [RTAB2, xRT2]; \ + and RT0, RMASK, xr, lsr#(8 - 4); /*sp4404*/ \ + ldr RT3, [RTAB2, xRT3]; \ + and RT1, RMASK, xl, lsl#(4); /*sp4404*/ \ + \ + ldr RT0, [RTAB4, xRT0]; \ + ldr RT1, [RTAB4, xRT1]; \ + \ + eor IR, IR, RT2; \ + eor IL, IL, RT3; \ + eor IR, IR, RT0; \ + eor IL, IL, RT1; \ + \ + eor IR, IR, IL; \ + eor yr, yr, IL, ror#8; \ + eor yl, yl, IR; \ + eor yr, yr, IR; + +#define enc_rounds(n) \ + roundsm(XL, XR, ((n) + 2) * 2 + 0, ((n) + 2) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 3) * 2 + 0, ((n) + 3) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 4) * 2 + 0, ((n) + 4) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 5) * 2 + 0, ((n) + 5) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 6) * 2 + 0, ((n) + 6) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 7) * 2 + 0, ((n) + 7) * 2 + 1, XL, XR); + +#define dec_rounds(n) \ + roundsm(XL, XR, ((n) + 7) * 2 + 0, ((n) + 7) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 6) * 2 + 0, ((n) + 6) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 5) * 2 + 0, ((n) + 5) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 4) * 2 + 0, ((n) + 4) * 2 + 1, XL, XR); \ + roundsm(XL, XR, ((n) + 3) * 2 
+ 0, ((n) + 3) * 2 + 1, YL, YR); \ + roundsm(YL, YR, ((n) + 2) * 2 + 0, ((n) + 2) * 2 + 1, XL, XR); + +/* perform FL and FL?? */ +#define fls(ll, lr, rl, rr, kll, klr, krl, krr) \ + ldr RT0, [CTX, #(key_table + ((kll) * 4))]; \ + ldr RT2, [CTX, #(key_table + ((krr) * 4))]; \ + and RT0, RT0, ll; \ + ldr RT3, [CTX, #(key_table + ((krl) * 4))]; \ + orr RT2, RT2, rr; \ + ldr RT1, [CTX, #(key_table + ((klr) * 4))]; \ + eor rl, rl, RT2; \ + eor lr, lr, RT0, ror#31; \ + and RT3, RT3, rl; \ + orr RT1, RT1, lr; \ + eor ll, ll, RT1; \ + eor rr, rr, RT3, ror#31; + +#define enc_fls(n) \ + fls(XL, XR, YL, YR, \ + (n) * 2 + 0, (n) * 2 + 1, \ + (n) * 2 + 2, (n) * 2 + 3); + +#define dec_fls(n) \ + fls(XL, XR, YL, YR, \ + (n) * 2 + 2, (n) * 2 + 3, \ + (n) * 2 + 0, (n) * 2 + 1); + +#define inpack(n) \ + ldr_input_be(RSRC, XL, XR, YL, YR, RT0); \ + ldr RT0, [CTX, #(key_table + ((n) * 8) + 0)]; \ + ldr RT1, [CTX, #(key_table + ((n) * 8) + 4)]; \ + eor XL, XL, RT0; \ + eor XR, XR, RT1; + +#define outunpack(n) \ + ldr RT0, [CTX, #(key_table + ((n) * 8) + 0)]; \ + ldr RT1, [CTX, #(key_table + ((n) * 8) + 4)]; \ + eor YL, YL, RT0; \ + eor YR, YR, RT1; \ + str_output_be(RDST, YL, YR, XL, XR, RT0, RT1); + +.globl _gcry_camellia_arm_encrypt_block +.type _gcry_camellia_arm_encrypt_block, at function; + +_gcry_camellia_arm_encrypt_block: + /* input: + * x0: keytable + * x1: dst + * x2: src + * x3: keybitlen + */ + + adr RTAB1, _gcry_camellia_arm_tables; + mov RMASK, #(0xff<<4); /* byte mask */ + add RTAB2, RTAB1, #(1 * 4); + add RTAB3, RTAB1, #(2 * 4); + add RTAB4, RTAB1, #(3 * 4); + + inpack(0); + + enc_rounds(0); + enc_fls(8); + enc_rounds(8); + enc_fls(16); + enc_rounds(16); + + cmp RKEYBITS, #(16 * 8); + bne .Lenc_256; + + outunpack(24); + + ret; +.ltorg + +.Lenc_256: + enc_fls(24); + enc_rounds(24); + + outunpack(32); + + ret; +.ltorg +.size _gcry_camellia_arm_encrypt_block,.-_gcry_camellia_arm_encrypt_block; + +.globl _gcry_camellia_arm_decrypt_block +.type _gcry_camellia_arm_decrypt_block, at function; + +_gcry_camellia_arm_decrypt_block: + /* input: + * x0: keytable + * x1: dst + * x2: src + * x3: keybitlen + */ + + adr RTAB1, _gcry_camellia_arm_tables; + mov RMASK, #(0xff<<4); /* byte mask */ + add RTAB2, RTAB1, #(1 * 4); + add RTAB3, RTAB1, #(2 * 4); + add RTAB4, RTAB1, #(3 * 4); + + cmp RKEYBITS, #(16 * 8); + bne .Ldec_256; + + inpack(24); + +.Ldec_128: + dec_rounds(16); + dec_fls(16); + dec_rounds(8); + dec_fls(8); + dec_rounds(0); + + outunpack(0); + + ret; +.ltorg + +.Ldec_256: + inpack(32); + dec_rounds(24); + dec_fls(24); + + b .Ldec_128; +.ltorg +.size _gcry_camellia_arm_decrypt_block,.-_gcry_camellia_arm_decrypt_block; + +/* Encryption/Decryption tables */ +.globl _gcry_camellia_arm_tables +.type _gcry_camellia_arm_tables, at object; +.balign 32 +_gcry_camellia_arm_tables: +.Lcamellia_sp1110: +.long 0x70707000 +.Lcamellia_sp0222: + .long 0x00e0e0e0 +.Lcamellia_sp3033: + .long 0x38003838 +.Lcamellia_sp4404: + .long 0x70700070 +.long 0x82828200, 0x00050505, 0x41004141, 0x2c2c002c +.long 0x2c2c2c00, 0x00585858, 0x16001616, 0xb3b300b3 +.long 0xececec00, 0x00d9d9d9, 0x76007676, 0xc0c000c0 +.long 0xb3b3b300, 0x00676767, 0xd900d9d9, 0xe4e400e4 +.long 0x27272700, 0x004e4e4e, 0x93009393, 0x57570057 +.long 0xc0c0c000, 0x00818181, 0x60006060, 0xeaea00ea +.long 0xe5e5e500, 0x00cbcbcb, 0xf200f2f2, 0xaeae00ae +.long 0xe4e4e400, 0x00c9c9c9, 0x72007272, 0x23230023 +.long 0x85858500, 0x000b0b0b, 0xc200c2c2, 0x6b6b006b +.long 0x57575700, 0x00aeaeae, 0xab00abab, 0x45450045 +.long 0x35353500, 0x006a6a6a, 0x9a009a9a, 
0xa5a500a5 +.long 0xeaeaea00, 0x00d5d5d5, 0x75007575, 0xeded00ed +.long 0x0c0c0c00, 0x00181818, 0x06000606, 0x4f4f004f +.long 0xaeaeae00, 0x005d5d5d, 0x57005757, 0x1d1d001d +.long 0x41414100, 0x00828282, 0xa000a0a0, 0x92920092 +.long 0x23232300, 0x00464646, 0x91009191, 0x86860086 +.long 0xefefef00, 0x00dfdfdf, 0xf700f7f7, 0xafaf00af +.long 0x6b6b6b00, 0x00d6d6d6, 0xb500b5b5, 0x7c7c007c +.long 0x93939300, 0x00272727, 0xc900c9c9, 0x1f1f001f +.long 0x45454500, 0x008a8a8a, 0xa200a2a2, 0x3e3e003e +.long 0x19191900, 0x00323232, 0x8c008c8c, 0xdcdc00dc +.long 0xa5a5a500, 0x004b4b4b, 0xd200d2d2, 0x5e5e005e +.long 0x21212100, 0x00424242, 0x90009090, 0x0b0b000b +.long 0xededed00, 0x00dbdbdb, 0xf600f6f6, 0xa6a600a6 +.long 0x0e0e0e00, 0x001c1c1c, 0x07000707, 0x39390039 +.long 0x4f4f4f00, 0x009e9e9e, 0xa700a7a7, 0xd5d500d5 +.long 0x4e4e4e00, 0x009c9c9c, 0x27002727, 0x5d5d005d +.long 0x1d1d1d00, 0x003a3a3a, 0x8e008e8e, 0xd9d900d9 +.long 0x65656500, 0x00cacaca, 0xb200b2b2, 0x5a5a005a +.long 0x92929200, 0x00252525, 0x49004949, 0x51510051 +.long 0xbdbdbd00, 0x007b7b7b, 0xde00dede, 0x6c6c006c +.long 0x86868600, 0x000d0d0d, 0x43004343, 0x8b8b008b +.long 0xb8b8b800, 0x00717171, 0x5c005c5c, 0x9a9a009a +.long 0xafafaf00, 0x005f5f5f, 0xd700d7d7, 0xfbfb00fb +.long 0x8f8f8f00, 0x001f1f1f, 0xc700c7c7, 0xb0b000b0 +.long 0x7c7c7c00, 0x00f8f8f8, 0x3e003e3e, 0x74740074 +.long 0xebebeb00, 0x00d7d7d7, 0xf500f5f5, 0x2b2b002b +.long 0x1f1f1f00, 0x003e3e3e, 0x8f008f8f, 0xf0f000f0 +.long 0xcecece00, 0x009d9d9d, 0x67006767, 0x84840084 +.long 0x3e3e3e00, 0x007c7c7c, 0x1f001f1f, 0xdfdf00df +.long 0x30303000, 0x00606060, 0x18001818, 0xcbcb00cb +.long 0xdcdcdc00, 0x00b9b9b9, 0x6e006e6e, 0x34340034 +.long 0x5f5f5f00, 0x00bebebe, 0xaf00afaf, 0x76760076 +.long 0x5e5e5e00, 0x00bcbcbc, 0x2f002f2f, 0x6d6d006d +.long 0xc5c5c500, 0x008b8b8b, 0xe200e2e2, 0xa9a900a9 +.long 0x0b0b0b00, 0x00161616, 0x85008585, 0xd1d100d1 +.long 0x1a1a1a00, 0x00343434, 0x0d000d0d, 0x04040004 +.long 0xa6a6a600, 0x004d4d4d, 0x53005353, 0x14140014 +.long 0xe1e1e100, 0x00c3c3c3, 0xf000f0f0, 0x3a3a003a +.long 0x39393900, 0x00727272, 0x9c009c9c, 0xdede00de +.long 0xcacaca00, 0x00959595, 0x65006565, 0x11110011 +.long 0xd5d5d500, 0x00ababab, 0xea00eaea, 0x32320032 +.long 0x47474700, 0x008e8e8e, 0xa300a3a3, 0x9c9c009c +.long 0x5d5d5d00, 0x00bababa, 0xae00aeae, 0x53530053 +.long 0x3d3d3d00, 0x007a7a7a, 0x9e009e9e, 0xf2f200f2 +.long 0xd9d9d900, 0x00b3b3b3, 0xec00ecec, 0xfefe00fe +.long 0x01010100, 0x00020202, 0x80008080, 0xcfcf00cf +.long 0x5a5a5a00, 0x00b4b4b4, 0x2d002d2d, 0xc3c300c3 +.long 0xd6d6d600, 0x00adadad, 0x6b006b6b, 0x7a7a007a +.long 0x51515100, 0x00a2a2a2, 0xa800a8a8, 0x24240024 +.long 0x56565600, 0x00acacac, 0x2b002b2b, 0xe8e800e8 +.long 0x6c6c6c00, 0x00d8d8d8, 0x36003636, 0x60600060 +.long 0x4d4d4d00, 0x009a9a9a, 0xa600a6a6, 0x69690069 +.long 0x8b8b8b00, 0x00171717, 0xc500c5c5, 0xaaaa00aa +.long 0x0d0d0d00, 0x001a1a1a, 0x86008686, 0xa0a000a0 +.long 0x9a9a9a00, 0x00353535, 0x4d004d4d, 0xa1a100a1 +.long 0x66666600, 0x00cccccc, 0x33003333, 0x62620062 +.long 0xfbfbfb00, 0x00f7f7f7, 0xfd00fdfd, 0x54540054 +.long 0xcccccc00, 0x00999999, 0x66006666, 0x1e1e001e +.long 0xb0b0b000, 0x00616161, 0x58005858, 0xe0e000e0 +.long 0x2d2d2d00, 0x005a5a5a, 0x96009696, 0x64640064 +.long 0x74747400, 0x00e8e8e8, 0x3a003a3a, 0x10100010 +.long 0x12121200, 0x00242424, 0x09000909, 0x00000000 +.long 0x2b2b2b00, 0x00565656, 0x95009595, 0xa3a300a3 +.long 0x20202000, 0x00404040, 0x10001010, 0x75750075 +.long 0xf0f0f000, 0x00e1e1e1, 0x78007878, 0x8a8a008a +.long 0xb1b1b100, 0x00636363, 
0xd800d8d8, 0xe6e600e6 +.long 0x84848400, 0x00090909, 0x42004242, 0x09090009 +.long 0x99999900, 0x00333333, 0xcc00cccc, 0xdddd00dd +.long 0xdfdfdf00, 0x00bfbfbf, 0xef00efef, 0x87870087 +.long 0x4c4c4c00, 0x00989898, 0x26002626, 0x83830083 +.long 0xcbcbcb00, 0x00979797, 0xe500e5e5, 0xcdcd00cd +.long 0xc2c2c200, 0x00858585, 0x61006161, 0x90900090 +.long 0x34343400, 0x00686868, 0x1a001a1a, 0x73730073 +.long 0x7e7e7e00, 0x00fcfcfc, 0x3f003f3f, 0xf6f600f6 +.long 0x76767600, 0x00ececec, 0x3b003b3b, 0x9d9d009d +.long 0x05050500, 0x000a0a0a, 0x82008282, 0xbfbf00bf +.long 0x6d6d6d00, 0x00dadada, 0xb600b6b6, 0x52520052 +.long 0xb7b7b700, 0x006f6f6f, 0xdb00dbdb, 0xd8d800d8 +.long 0xa9a9a900, 0x00535353, 0xd400d4d4, 0xc8c800c8 +.long 0x31313100, 0x00626262, 0x98009898, 0xc6c600c6 +.long 0xd1d1d100, 0x00a3a3a3, 0xe800e8e8, 0x81810081 +.long 0x17171700, 0x002e2e2e, 0x8b008b8b, 0x6f6f006f +.long 0x04040400, 0x00080808, 0x02000202, 0x13130013 +.long 0xd7d7d700, 0x00afafaf, 0xeb00ebeb, 0x63630063 +.long 0x14141400, 0x00282828, 0x0a000a0a, 0xe9e900e9 +.long 0x58585800, 0x00b0b0b0, 0x2c002c2c, 0xa7a700a7 +.long 0x3a3a3a00, 0x00747474, 0x1d001d1d, 0x9f9f009f +.long 0x61616100, 0x00c2c2c2, 0xb000b0b0, 0xbcbc00bc +.long 0xdedede00, 0x00bdbdbd, 0x6f006f6f, 0x29290029 +.long 0x1b1b1b00, 0x00363636, 0x8d008d8d, 0xf9f900f9 +.long 0x11111100, 0x00222222, 0x88008888, 0x2f2f002f +.long 0x1c1c1c00, 0x00383838, 0x0e000e0e, 0xb4b400b4 +.long 0x32323200, 0x00646464, 0x19001919, 0x78780078 +.long 0x0f0f0f00, 0x001e1e1e, 0x87008787, 0x06060006 +.long 0x9c9c9c00, 0x00393939, 0x4e004e4e, 0xe7e700e7 +.long 0x16161600, 0x002c2c2c, 0x0b000b0b, 0x71710071 +.long 0x53535300, 0x00a6a6a6, 0xa900a9a9, 0xd4d400d4 +.long 0x18181800, 0x00303030, 0x0c000c0c, 0xabab00ab +.long 0xf2f2f200, 0x00e5e5e5, 0x79007979, 0x88880088 +.long 0x22222200, 0x00444444, 0x11001111, 0x8d8d008d +.long 0xfefefe00, 0x00fdfdfd, 0x7f007f7f, 0x72720072 +.long 0x44444400, 0x00888888, 0x22002222, 0xb9b900b9 +.long 0xcfcfcf00, 0x009f9f9f, 0xe700e7e7, 0xf8f800f8 +.long 0xb2b2b200, 0x00656565, 0x59005959, 0xacac00ac +.long 0xc3c3c300, 0x00878787, 0xe100e1e1, 0x36360036 +.long 0xb5b5b500, 0x006b6b6b, 0xda00dada, 0x2a2a002a +.long 0x7a7a7a00, 0x00f4f4f4, 0x3d003d3d, 0x3c3c003c +.long 0x91919100, 0x00232323, 0xc800c8c8, 0xf1f100f1 +.long 0x24242400, 0x00484848, 0x12001212, 0x40400040 +.long 0x08080800, 0x00101010, 0x04000404, 0xd3d300d3 +.long 0xe8e8e800, 0x00d1d1d1, 0x74007474, 0xbbbb00bb +.long 0xa8a8a800, 0x00515151, 0x54005454, 0x43430043 +.long 0x60606000, 0x00c0c0c0, 0x30003030, 0x15150015 +.long 0xfcfcfc00, 0x00f9f9f9, 0x7e007e7e, 0xadad00ad +.long 0x69696900, 0x00d2d2d2, 0xb400b4b4, 0x77770077 +.long 0x50505000, 0x00a0a0a0, 0x28002828, 0x80800080 +.long 0xaaaaaa00, 0x00555555, 0x55005555, 0x82820082 +.long 0xd0d0d000, 0x00a1a1a1, 0x68006868, 0xecec00ec +.long 0xa0a0a000, 0x00414141, 0x50005050, 0x27270027 +.long 0x7d7d7d00, 0x00fafafa, 0xbe00bebe, 0xe5e500e5 +.long 0xa1a1a100, 0x00434343, 0xd000d0d0, 0x85850085 +.long 0x89898900, 0x00131313, 0xc400c4c4, 0x35350035 +.long 0x62626200, 0x00c4c4c4, 0x31003131, 0x0c0c000c +.long 0x97979700, 0x002f2f2f, 0xcb00cbcb, 0x41410041 +.long 0x54545400, 0x00a8a8a8, 0x2a002a2a, 0xefef00ef +.long 0x5b5b5b00, 0x00b6b6b6, 0xad00adad, 0x93930093 +.long 0x1e1e1e00, 0x003c3c3c, 0x0f000f0f, 0x19190019 +.long 0x95959500, 0x002b2b2b, 0xca00caca, 0x21210021 +.long 0xe0e0e000, 0x00c1c1c1, 0x70007070, 0x0e0e000e +.long 0xffffff00, 0x00ffffff, 0xff00ffff, 0x4e4e004e +.long 0x64646400, 0x00c8c8c8, 0x32003232, 0x65650065 +.long 0xd2d2d200, 
0x00a5a5a5, 0x69006969, 0xbdbd00bd +.long 0x10101000, 0x00202020, 0x08000808, 0xb8b800b8 +.long 0xc4c4c400, 0x00898989, 0x62006262, 0x8f8f008f +.long 0x00000000, 0x00000000, 0x00000000, 0xebeb00eb +.long 0x48484800, 0x00909090, 0x24002424, 0xcece00ce +.long 0xa3a3a300, 0x00474747, 0xd100d1d1, 0x30300030 +.long 0xf7f7f700, 0x00efefef, 0xfb00fbfb, 0x5f5f005f +.long 0x75757500, 0x00eaeaea, 0xba00baba, 0xc5c500c5 +.long 0xdbdbdb00, 0x00b7b7b7, 0xed00eded, 0x1a1a001a +.long 0x8a8a8a00, 0x00151515, 0x45004545, 0xe1e100e1 +.long 0x03030300, 0x00060606, 0x81008181, 0xcaca00ca +.long 0xe6e6e600, 0x00cdcdcd, 0x73007373, 0x47470047 +.long 0xdadada00, 0x00b5b5b5, 0x6d006d6d, 0x3d3d003d +.long 0x09090900, 0x00121212, 0x84008484, 0x01010001 +.long 0x3f3f3f00, 0x007e7e7e, 0x9f009f9f, 0xd6d600d6 +.long 0xdddddd00, 0x00bbbbbb, 0xee00eeee, 0x56560056 +.long 0x94949400, 0x00292929, 0x4a004a4a, 0x4d4d004d +.long 0x87878700, 0x000f0f0f, 0xc300c3c3, 0x0d0d000d +.long 0x5c5c5c00, 0x00b8b8b8, 0x2e002e2e, 0x66660066 +.long 0x83838300, 0x00070707, 0xc100c1c1, 0xcccc00cc +.long 0x02020200, 0x00040404, 0x01000101, 0x2d2d002d +.long 0xcdcdcd00, 0x009b9b9b, 0xe600e6e6, 0x12120012 +.long 0x4a4a4a00, 0x00949494, 0x25002525, 0x20200020 +.long 0x90909000, 0x00212121, 0x48004848, 0xb1b100b1 +.long 0x33333300, 0x00666666, 0x99009999, 0x99990099 +.long 0x73737300, 0x00e6e6e6, 0xb900b9b9, 0x4c4c004c +.long 0x67676700, 0x00cecece, 0xb300b3b3, 0xc2c200c2 +.long 0xf6f6f600, 0x00ededed, 0x7b007b7b, 0x7e7e007e +.long 0xf3f3f300, 0x00e7e7e7, 0xf900f9f9, 0x05050005 +.long 0x9d9d9d00, 0x003b3b3b, 0xce00cece, 0xb7b700b7 +.long 0x7f7f7f00, 0x00fefefe, 0xbf00bfbf, 0x31310031 +.long 0xbfbfbf00, 0x007f7f7f, 0xdf00dfdf, 0x17170017 +.long 0xe2e2e200, 0x00c5c5c5, 0x71007171, 0xd7d700d7 +.long 0x52525200, 0x00a4a4a4, 0x29002929, 0x58580058 +.long 0x9b9b9b00, 0x00373737, 0xcd00cdcd, 0x61610061 +.long 0xd8d8d800, 0x00b1b1b1, 0x6c006c6c, 0x1b1b001b +.long 0x26262600, 0x004c4c4c, 0x13001313, 0x1c1c001c +.long 0xc8c8c800, 0x00919191, 0x64006464, 0x0f0f000f +.long 0x37373700, 0x006e6e6e, 0x9b009b9b, 0x16160016 +.long 0xc6c6c600, 0x008d8d8d, 0x63006363, 0x18180018 +.long 0x3b3b3b00, 0x00767676, 0x9d009d9d, 0x22220022 +.long 0x81818100, 0x00030303, 0xc000c0c0, 0x44440044 +.long 0x96969600, 0x002d2d2d, 0x4b004b4b, 0xb2b200b2 +.long 0x6f6f6f00, 0x00dedede, 0xb700b7b7, 0xb5b500b5 +.long 0x4b4b4b00, 0x00969696, 0xa500a5a5, 0x91910091 +.long 0x13131300, 0x00262626, 0x89008989, 0x08080008 +.long 0xbebebe00, 0x007d7d7d, 0x5f005f5f, 0xa8a800a8 +.long 0x63636300, 0x00c6c6c6, 0xb100b1b1, 0xfcfc00fc +.long 0x2e2e2e00, 0x005c5c5c, 0x17001717, 0x50500050 +.long 0xe9e9e900, 0x00d3d3d3, 0xf400f4f4, 0xd0d000d0 +.long 0x79797900, 0x00f2f2f2, 0xbc00bcbc, 0x7d7d007d +.long 0xa7a7a700, 0x004f4f4f, 0xd300d3d3, 0x89890089 +.long 0x8c8c8c00, 0x00191919, 0x46004646, 0x97970097 +.long 0x9f9f9f00, 0x003f3f3f, 0xcf00cfcf, 0x5b5b005b +.long 0x6e6e6e00, 0x00dcdcdc, 0x37003737, 0x95950095 +.long 0xbcbcbc00, 0x00797979, 0x5e005e5e, 0xffff00ff +.long 0x8e8e8e00, 0x001d1d1d, 0x47004747, 0xd2d200d2 +.long 0x29292900, 0x00525252, 0x94009494, 0xc4c400c4 +.long 0xf5f5f500, 0x00ebebeb, 0xfa00fafa, 0x48480048 +.long 0xf9f9f900, 0x00f3f3f3, 0xfc00fcfc, 0xf7f700f7 +.long 0xb6b6b600, 0x006d6d6d, 0x5b005b5b, 0xdbdb00db +.long 0x2f2f2f00, 0x005e5e5e, 0x97009797, 0x03030003 +.long 0xfdfdfd00, 0x00fbfbfb, 0xfe00fefe, 0xdada00da +.long 0xb4b4b400, 0x00696969, 0x5a005a5a, 0x3f3f003f +.long 0x59595900, 0x00b2b2b2, 0xac00acac, 0x94940094 +.long 0x78787800, 0x00f0f0f0, 0x3c003c3c, 0x5c5c005c +.long 
+.long 0x06060600, 0x000c0c0c, 0x03000303, 0x4a4a004a
+.long 0x6a6a6a00, 0x00d4d4d4, 0x35003535, 0x33330033
+.long 0xe7e7e700, 0x00cfcfcf, 0xf300f3f3, 0x67670067
+.long 0x46464600, 0x008c8c8c, 0x23002323, 0xf3f300f3
+.long 0x71717100, 0x00e2e2e2, 0xb800b8b8, 0x7f7f007f
+.long 0xbababa00, 0x00757575, 0x5d005d5d, 0xe2e200e2
+.long 0xd4d4d400, 0x00a9a9a9, 0x6a006a6a, 0x9b9b009b
+.long 0x25252500, 0x004a4a4a, 0x92009292, 0x26260026
+.long 0xababab00, 0x00575757, 0xd500d5d5, 0x37370037
+.long 0x42424200, 0x00848484, 0x21002121, 0x3b3b003b
+.long 0x88888800, 0x00111111, 0x44004444, 0x96960096
+.long 0xa2a2a200, 0x00454545, 0x51005151, 0x4b4b004b
+.long 0x8d8d8d00, 0x001b1b1b, 0xc600c6c6, 0xbebe00be
+.long 0xfafafa00, 0x00f5f5f5, 0x7d007d7d, 0x2e2e002e
+.long 0x72727200, 0x00e4e4e4, 0x39003939, 0x79790079
+.long 0x07070700, 0x000e0e0e, 0x83008383, 0x8c8c008c
+.long 0xb9b9b900, 0x00737373, 0xdc00dcdc, 0x6e6e006e
+.long 0x55555500, 0x00aaaaaa, 0xaa00aaaa, 0x8e8e008e
+.long 0xf8f8f800, 0x00f1f1f1, 0x7c007c7c, 0xf5f500f5
+.long 0xeeeeee00, 0x00dddddd, 0x77007777, 0xb6b600b6
+.long 0xacacac00, 0x00595959, 0x56005656, 0xfdfd00fd
+.long 0x0a0a0a00, 0x00141414, 0x05000505, 0x59590059
+.long 0x36363600, 0x006c6c6c, 0x1b001b1b, 0x98980098
+.long 0x49494900, 0x00929292, 0xa400a4a4, 0x6a6a006a
+.long 0x2a2a2a00, 0x00545454, 0x15001515, 0x46460046
+.long 0x68686800, 0x00d0d0d0, 0x34003434, 0xbaba00ba
+.long 0x3c3c3c00, 0x00787878, 0x1e001e1e, 0x25250025
+.long 0x38383800, 0x00707070, 0x1c001c1c, 0x42420042
+.long 0xf1f1f100, 0x00e3e3e3, 0xf800f8f8, 0xa2a200a2
+.long 0xa4a4a400, 0x00494949, 0x52005252, 0xfafa00fa
+.long 0x40404000, 0x00808080, 0x20002020, 0x07070007
+.long 0x28282800, 0x00505050, 0x14001414, 0x55550055
+.long 0xd3d3d300, 0x00a7a7a7, 0xe900e9e9, 0xeeee00ee
+.long 0x7b7b7b00, 0x00f6f6f6, 0xbd00bdbd, 0x0a0a000a
+.long 0xbbbbbb00, 0x00777777, 0xdd00dddd, 0x49490049
+.long 0xc9c9c900, 0x00939393, 0xe400e4e4, 0x68680068
+.long 0x43434300, 0x00868686, 0xa100a1a1, 0x38380038
+.long 0xc1c1c100, 0x00838383, 0xe000e0e0, 0xa4a400a4
+.long 0x15151500, 0x002a2a2a, 0x8a008a8a, 0x28280028
+.long 0xe3e3e300, 0x00c7c7c7, 0xf100f1f1, 0x7b7b007b
+.long 0xadadad00, 0x005b5b5b, 0xd600d6d6, 0xc9c900c9
+.long 0xf4f4f400, 0x00e9e9e9, 0x7a007a7a, 0xc1c100c1
+.long 0x77777700, 0x00eeeeee, 0xbb00bbbb, 0xe3e300e3
+.long 0xc7c7c700, 0x008f8f8f, 0xe300e3e3, 0xf4f400f4
+.long 0x80808000, 0x00010101, 0x40004040, 0xc7c700c7
+.long 0x9e9e9e00, 0x003d3d3d, 0x4f004f4f, 0x9e9e009e
+.size _gcry_camellia_arm_tables,.-_gcry_camellia_arm_tables;
+
+#endif /*HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS*/
+#endif /*__AARCH64EL__*/

diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c
index dfddb4a..1be35c9 100644
--- a/cipher/camellia-glue.c
+++ b/cipher/camellia-glue.c
@@ -285,12 +285,19 @@ static void Camellia_DecryptBlock(const int keyBitLength,
                        keyBitLength);
 }
 
+#ifdef __aarch64__
+# define CAMELLIA_encrypt_stack_burn_size (0)
+# define CAMELLIA_decrypt_stack_burn_size (0)
+#else
+# define CAMELLIA_encrypt_stack_burn_size (15*4)
+# define CAMELLIA_decrypt_stack_burn_size (15*4)
+#endif
+
 static unsigned int
 camellia_encrypt(void *c, byte *outbuf, const byte *inbuf)
 {
   CAMELLIA_context *ctx = c;
 
   Camellia_EncryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf);
-#define CAMELLIA_encrypt_stack_burn_size (15*4)
 
   return /*burn_stack*/ (CAMELLIA_encrypt_stack_burn_size);
 }
 
@@ -299,7 +306,6 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf)
 {
   CAMELLIA_context *ctx=c;
 
   Camellia_DecryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf);
-#define CAMELLIA_decrypt_stack_burn_size (15*4)
 
   return /*burn_stack*/ (CAMELLIA_decrypt_stack_burn_size);
 }
 
diff --git a/cipher/camellia.h b/cipher/camellia.h
index d0e3c18..d7a1e6f 100644
--- a/cipher/camellia.h
+++ b/cipher/camellia.h
@@ -37,6 +37,11 @@
 #   define USE_ARM_ASM 1
 #  endif
 # endif
+# if defined(__AARCH64EL__)
+#  ifdef HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS
+#   define USE_ARM_ASM 1
+#  endif
+# endif
 #endif
 #ifdef CAMELLIA_EXT_SYM_PREFIX
 #define CAMELLIA_PREFIX1(x,y) x ## y
@@ -80,7 +85,7 @@ void Camellia_DecryptBlock(const int keyBitLength,
                            const unsigned char *cipherText,
                            const KEY_TABLE_TYPE keyTable,
                            unsigned char *plaintext);
-#endif /*!USE_ARMV6_ASM*/
+#endif /*!USE_ARM_ASM*/
 
 
 #ifdef __cplusplus
diff --git a/configure.ac b/configure.ac
index ca82af9..3e926a5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2123,6 +2123,10 @@ if test "$found" = "1" ; then
          # Build with the assembly implementation
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS camellia-arm.lo"
        ;;
+      aarch64-*-*)
+         # Build with the assembly implementation
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS camellia-aarch64.lo"
+      ;;
    esac
 
    if test x"$avxsupport" = xyes ; then

-----------------------------------------------------------------------

Summary of changes:
 cipher/Makefile.am                            |   4 +-
 cipher/{camellia-arm.S => camellia-aarch64.S} | 291 ++++++++++----------------
 cipher/camellia-glue.c                        |  10 +-
 cipher/camellia.h                             |   7 +-
 cipher/{twofish-arm.S => twofish-aarch64.S}   | 214 ++++++++-----------
 cipher/twofish.c                              |   5 +
 configure.ac                                  |   8 +
 7 files changed, 224 insertions(+), 315 deletions(-)
 copy cipher/{camellia-arm.S => camellia-aarch64.S} (80%)
 copy cipher/{twofish-arm.S => twofish-aarch64.S} (60%)

hooks/post-receive
-- 
The GNU crypto library
http://git.gnupg.org

_______________________________________________
Gnupg-commits mailing list
Gnupg-commits at gnupg.org
http://lists.gnupg.org/mailman/listinfo/gnupg-commits