[git] GCRYPT - branch, master, updated. libgcrypt-1.5.0-344-g1faa618
by Jussi Kivilinna
cvs at cvs.gnupg.org
Mon Oct 28 16:22:06 CET 2013
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".
The branch, master has been updated
via 1faa61845f180bd47e037e400dde2d864ee83c89 (commit)
via 2cb6e1f323d24359b1c5b113be5c2f79a2a4cded (commit)
from 3ff9d2571c18cd7a34359f9c60a10d3b0f932b23 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
commit 1faa61845f180bd47e037e400dde2d864ee83c89
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Mon Oct 28 17:11:21 2013 +0200
Fix typos in documentation
* doc/gcrypt.texi: Fix some typos.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi
index 91fe399..6dcb4b1 100644
--- a/doc/gcrypt.texi
+++ b/doc/gcrypt.texi
@@ -390,7 +390,7 @@ and freed memory, you need to initialize Libgcrypt this way:
@example
/* Version check should be the very first call because it
- makes sure that important subsystems are intialized. */
+ makes sure that important subsystems are initialized. */
if (!gcry_check_version (GCRYPT_VERSION))
@{
fputs ("libgcrypt version mismatch\n", stderr);
@@ -405,7 +405,7 @@ and freed memory, you need to initialize Libgcrypt this way:
/* ... If required, other initialization goes here. Note that the
process might still be running with increased privileges and that
- the secure memory has not been intialized. */
+ the secure memory has not been initialized. */
/* Allocate a pool of 16k secure memory. This make the secure memory
available and also drops privileges where needed. */
@@ -642,9 +642,9 @@ callbacks.
@item GCRYCTL_ENABLE_QUICK_RANDOM; Arguments: none
This command inhibits the use the very secure random quality level
(@code{GCRY_VERY_STRONG_RANDOM}) and degrades all request down to
- at code{GCRY_STRONG_RANDOM}. In general this is not recommened. However,
+ at code{GCRY_STRONG_RANDOM}. In general this is not recommended. However,
for some applications the extra quality random Libgcrypt tries to create
-is not justified and this option may help to get better performace.
+is not justified and this option may help to get better performance.
Please check with a crypto expert whether this option can be used for
your application.
@@ -652,19 +652,19 @@ This option can only be used at initialization time.
@item GCRYCTL_DUMP_RANDOM_STATS; Arguments: none
-This command dumps randum number generator related statistics to the
+This command dumps random number generator related statistics to the
library's logging stream.
@item GCRYCTL_DUMP_MEMORY_STATS; Arguments: none
-This command dumps memory managment related statistics to the library's
+This command dumps memory management related statistics to the library's
logging stream.
@item GCRYCTL_DUMP_SECMEM_STATS; Arguments: none
-This command dumps secure memory manamgent related statistics to the
+This command dumps secure memory management related statistics to the
library's logging stream.
@item GCRYCTL_DROP_PRIVS; Arguments: none
-This command disables the use of secure memory and drops the priviliges
+This command disables the use of secure memory and drops the privileges
of the current process. This command has not much use; the suggested way
to disable secure memory is to use @code{GCRYCTL_DISABLE_SECMEM} right
after initialization.
@@ -758,7 +758,7 @@ these different instances is correlated to some extent. In a perfect
attack scenario, the attacker can control (or at least guess) the PID
and clock of the application, and drain the system's entropy pool to
reduce the "up to 16 bytes" above to 0. Then the dependencies of the
-inital states of the pools are completely known. Note that this is not
+initial states of the pools are completely known. Note that this is not
an issue if random of @code{GCRY_VERY_STRONG_RANDOM} quality is
requested as in this case enough extra entropy gets mixed. It is also
not an issue when using Linux (rndlinux driver), because this one
@@ -795,11 +795,11 @@ This command does nothing. It exists only for backward compatibility.
This command returns true if the library has been basically initialized.
Such a basic initialization happens implicitly with many commands to get
certain internal subsystems running. The common and suggested way to
-do this basic intialization is by calling gcry_check_version.
+do this basic initialization is by calling gcry_check_version.
@item GCRYCTL_INITIALIZATION_FINISHED; Arguments: none
This command tells the library that the application has finished the
-intialization.
+initialization.
@item GCRYCTL_INITIALIZATION_FINISHED_P; Arguments: none
This command returns true if the command@*
@@ -825,7 +825,7 @@ proper random device.
@item GCRYCTL_PRINT_CONFIG; Arguments: FILE *stream
This command dumps information pertaining to the configuration of the
library to the given stream. If NULL is given for @var{stream}, the log
-system is used. This command may be used before the intialization has
+system is used. This command may be used before the initialization has
been finished but not before a @code{gcry_check_version}.
@item GCRYCTL_OPERATIONAL_P; Arguments: none
@@ -833,12 +833,12 @@ This command returns true if the library is in an operational state.
This information makes only sense in FIPS mode. In contrast to other
functions, this is a pure test function and won't put the library into
FIPS mode or change the internal state. This command may be used before
-the intialization has been finished but not before a @code{gcry_check_version}.
+the initialization has been finished but not before a @code{gcry_check_version}.
@item GCRYCTL_FIPS_MODE_P; Arguments: none
This command returns true if the library is in FIPS mode. Note, that
this is no indication about the current state of the library. This
-command may be used before the intialization has been finished but not
+command may be used before the initialization has been finished but not
before a @code{gcry_check_version}. An application may use this command or
the convenience macro below to check whether FIPS mode is actually
active.
@@ -857,7 +857,7 @@ already in FIPS mode, a self-test is triggered and thus the library will
be put into operational state. This command may be used before a call
to @code{gcry_check_version} and that is actually the recommended way to let an
application switch the library into FIPS mode. Note that Libgcrypt will
-reject an attempt to switch to fips mode during or after the intialization.
+reject an attempt to switch to fips mode during or after the initialization.
@item GCRYCTL_SET_ENFORCED_FIPS_FLAG; Arguments: none
Running this command sets the internal flag that puts the library into
@@ -866,7 +866,7 @@ does not affect the library if the library is not put into the FIPS mode and
it must be used before any other libgcrypt library calls that initialize
the library such as @code{gcry_check_version}. Note that Libgcrypt will
reject an attempt to switch to the enforced fips mode during or after
-the intialization.
+the initialization.
@item GCRYCTL_SET_PREFERRED_RNG_TYPE; Arguments: int
These are advisory commands to select a certain random number
@@ -875,7 +875,7 @@ an application actually wants or vice versa. Thus Libgcrypt employs a
priority check to select the actually used RNG. If an applications
selects a lower priority RNG but a library requests a higher priority
RNG Libgcrypt will switch to the higher priority RNG. Applications
-and libaries should use these control codes before
+and libraries should use these control codes before
@code{gcry_check_version}. The available generators are:
@table @code
@item GCRY_RNG_TYPE_STANDARD
@@ -907,8 +907,8 @@ success or an error code on failure.
@item GCRYCTL_DISABLE_HWF; Arguments: const char *name
Libgcrypt detects certain features of the CPU at startup time. For
-performace tests it is sometimes required not to use such a feature.
-This option may be used to disabale a certain feature; i.e. Libgcrypt
+performance tests it is sometimes required not to use such a feature.
+This option may be used to disable a certain feature; i.e. Libgcrypt
behaves as if this feature has not been detected. Note that the
detection code might be run if the feature has been disabled. This
command must be used at initialization time; i.e. before calling
@@ -1929,7 +1929,7 @@ checking.
@deftypefun size_t gcry_cipher_get_algo_blklen (int @var{algo})
-This functions returns the blocklength of the algorithm @var{algo}
+This functions returns the block-length of the algorithm @var{algo}
counted in octets. On error @code{0} is returned.
This is a convenience functions which should be preferred over
@@ -2292,7 +2292,7 @@ will be changed to implement 186-3.
@item use-fips186-2
@cindex FIPS 186-2
Force the use of the FIPS 186-2 key generation algorithm instead of
-the default algorithm. This algorithm is slighlty different from
+the default algorithm. This algorithm is slightly different from
FIPS 186-3 and allows only 1024 bit keys. This flag is only meaningful
for DSA and only required for FIPS testing backward compatibility.
@@ -4547,7 +4547,7 @@ Convenience function to release the @var{factors} array.
@deftypefun gcry_error_t gcry_prime_check (gcry_mpi_t @var{p}, unsigned int @var{flags})
-Check wether the number @var{p} is prime. Returns zero in case @var{p}
+Check whether the number @var{p} is prime. Returns zero in case @var{p}
is indeed a prime, returns @code{GPG_ERR_NO_PRIME} in case @var{p} is
not a prime and a different error code in case something went horribly
wrong.
@@ -4988,7 +4988,7 @@ checking function is exported as well.
The generation of random prime numbers is based on the Lim and Lee
algorithm to create practically save primes. at footnote{Chae Hoon Lim
-and Pil Joong Lee. A key recovery attack on discrete log-based shemes
+and Pil Joong Lee. A key recovery attack on discrete log-based schemes
using a prime order subgroup. In Burton S. Kaliski Jr., editor,
Advances in Cryptology: Crypto '97, pages 249-263, Berlin /
Heidelberg / New York, 1997. Springer-Verlag. Described on page 260.}
@@ -5147,7 +5147,7 @@ output blocks.
On Unix like systems the @code{GCRY_VERY_STRONG_RANDOM} and
@code{GCRY_STRONG_RANDOM} generators are keyed and seeded using the
-rndlinux module with the @file{/dev/radnom} device. Thus these
+rndlinux module with the @file{/dev/random} device. Thus these
generators may block until the OS kernel has collected enough entropy.
When used with Microsoft Windows the rndw32 module is used instead.
@@ -5162,7 +5162,7 @@ entropy for use by the ``real'' random generators.
A self-test facility uses a separate context to check the
functionality of the core X9.31 functions using a known answers test.
During runtime each output block is compared to the previous one to
-detect a stucked generator.
+detect a stuck generator.
The DT value for the generator is made up of the current time down to
microseconds (if available) and a free running 64 bit counter. When
@@ -5188,7 +5188,7 @@ incremented on each use.
@c them. To use an S-expression with Libgcrypt it needs first be
@c converted into the internal representation used by Libgcrypt (the type
@c @code{gcry_sexp_t}). The conversion functions support a large subset
- at c of the S-expression specification and further fature a printf like
+ at c of the S-expression specification and further feature a printf like
@c function to convert a list of big integers or other binary data into
@c an S-expression.
@c
@@ -5357,8 +5357,8 @@ The result is verified using the public key against the original data
and against modified data. (@code{cipher/@/rsa.c:@/selftest_sign_1024})
@item
A 1000 bit random value is encrypted and checked that it does not
-match the orginal random value. The encrtypted result is then
-decrypted and checked that it macthes the original random value.
+match the original random value. The encrypted result is then
+decrypted and checked that it matches the original random value.
(@code{cipher/@/rsa.c:@/selftest_encr_1024})
@end enumerate
@@ -5401,7 +5401,7 @@ keys. The table itself is protected using a SHA-1 hash.
@c --------------------------------
@section Conditional Tests
-The conditional tests are performed if a certain contidion is met.
+The conditional tests are performed if a certain condition is met.
This may occur at any time; the library does not necessary enter the
``Self-Test'' state to run these tests but will transit to the
``Error'' state if a test failed.
@@ -5696,7 +5696,7 @@ documentation only.
@item Power-On
Libgcrypt is loaded into memory and API calls may be made. Compiler
-introducted constructor functions may be run. Note that Libgcrypt does
+introduced constructor functions may be run. Note that Libgcrypt does
not implement any arbitrary constructor functions to be called by the
operating system
@@ -5721,7 +5721,7 @@ will automatically transit into the Shutdown state.
@item Shutdown
Libgcrypt is about to be terminated and removed from the memory. The
-application may at this point still runing cleanup handlers.
+application may at this point still running cleanup handlers.
@end table
@end float
@@ -5738,18 +5738,18 @@ a shared library and having it linked to an application.
@item 2
Power-On to Init is triggered by the application calling the
-Libgcrypt intialization function @code{gcry_check_version}.
+Libgcrypt initialization function @code{gcry_check_version}.
@item 3
-Init to Self-Test is either triggred by a dedicated API call or implicit
-by invoking a libgrypt service conrolled by the FSM.
+Init to Self-Test is either triggered by a dedicated API call or implicit
+by invoking a libgrypt service controlled by the FSM.
@item 4
Self-Test to Operational is triggered after all self-tests passed
successfully.
@item 5
-Operational to Shutdown is an artifical state without any direct action
+Operational to Shutdown is an artificial state without any direct action
in Libgcrypt. When reaching the Shutdown state the library is
deinitialized and can't return to any other state again.
@@ -5770,7 +5770,7 @@ Error to Shutdown is similar to the Operational to Shutdown transition
(5).
@item 9
-Error to Fatal-Error is triggred if Libgrypt detects an fatal error
+Error to Fatal-Error is triggered if Libgrypt detects an fatal error
while already being in Error state.
@item 10
@@ -5778,26 +5778,26 @@ Fatal-Error to Shutdown is automatically entered by Libgcrypt
after having reported the error.
@item 11
-Power-On to Shutdown is an artifical state to document that Libgcrypt
-has not ye been initializaed but the process is about to terminate.
+Power-On to Shutdown is an artificial state to document that Libgcrypt
+has not ye been initialized but the process is about to terminate.
@item 12
-Power-On to Fatal-Error will be triggerd if certain Libgcrypt functions
+Power-On to Fatal-Error will be triggered if certain Libgcrypt functions
are used without having reached the Init state.
@item 13
-Self-Test to Fatal-Error is triggred by severe errors in Libgcrypt while
+Self-Test to Fatal-Error is triggered by severe errors in Libgcrypt while
running self-tests.
@item 14
-Self-Test to Error is triggred by a failed self-test.
+Self-Test to Error is triggered by a failed self-test.
@item 15
Operational to Fatal-Error is triggered if Libcrypt encountered a
non-recoverable error.
@item 16
-Operational to Self-Test is triggred if the application requested to run
+Operational to Self-Test is triggered if the application requested to run
the self-tests again.
@item 17
@@ -5868,7 +5868,7 @@ memory and thus also the encryption contexts with these keys.
GCRYCTL_SET_RANDOM_DAEMON_SOCKET
GCRYCTL_USE_RANDOM_DAEMON
-The random damon is still a bit experimental, thus we do not document
+The random daemon is still a bit experimental, thus we do not document
them. Note that they should be used during initialization and that
these functions are not really thread safe.
commit 2cb6e1f323d24359b1c5b113be5c2f79a2a4cded
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Sun Oct 27 14:07:59 2013 +0200
Add ARM NEON assembly implementation of Serpent
* cipher/Makefile.am: Add 'serpent-armv7-neon.S'.
* cipher/serpent-armv7-neon.S: New.
* cipher/serpent.c (USE_NEON): New macro.
(serpent_context_t) [USE_NEON]: Add 'use_neon'.
[USE_NEON] (_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec)
(_gcry_serpent_neon_cbc_dec): New prototypes.
(serpent_setkey_internal) [USE_NEON]: Detect NEON support.
(_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec)
(_gcry_serpent_neon_cbc_dec) [USE_NEON]: Use NEON implementations
to process eight blocks in parallel.
* configure.ac [neonsupport]: Add 'serpent-armv7-neon.lo'.
--
Patch adds ARM NEON optimized implementation of Serpent cipher
to speed up parallelizable bulk operations.
Benchmarks on ARM Cortex-A8 (armhf, 1008 Mhz):
Old:
SERPENT128 | nanosecs/byte mebibytes/sec cycles/byte
CBC dec | 43.53 ns/B 21.91 MiB/s 43.88 c/B
CFB dec | 44.77 ns/B 21.30 MiB/s 45.13 c/B
CTR enc | 45.21 ns/B 21.10 MiB/s 45.57 c/B
CTR dec | 45.21 ns/B 21.09 MiB/s 45.57 c/B
New:
SERPENT128 | nanosecs/byte mebibytes/sec cycles/byte
CBC dec | 26.26 ns/B 36.32 MiB/s 26.47 c/B
CFB dec | 26.21 ns/B 36.38 MiB/s 26.42 c/B
CTR enc | 26.20 ns/B 36.40 MiB/s 26.41 c/B
CTR dec | 26.20 ns/B 36.40 MiB/s 26.41 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/cipher/salsa20-armv7-neon.S b/cipher/salsa20-armv7-neon.S
index 5b51301..7d31e9f 100644
--- a/cipher/salsa20-armv7-neon.S
+++ b/cipher/salsa20-armv7-neon.S
@@ -36,7 +36,7 @@
.text
.align 2
-.global _gcry_arm_neon_salsa20_encrypt
+.globl _gcry_arm_neon_salsa20_encrypt
.type _gcry_arm_neon_salsa20_encrypt,%function;
_gcry_arm_neon_salsa20_encrypt:
/* Modifications:
diff --git a/cipher/serpent-armv7-neon.S b/cipher/serpent-armv7-neon.S
new file mode 100644
index 0000000..92e95a0
--- /dev/null
+++ b/cipher/serpent-armv7-neon.S
@@ -0,0 +1,869 @@
+/* serpent-armv7-neon.S - ARM/NEON assembly implementation of Serpent cipher
+ *
+ * Copyright © 2013 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__) && \
+ defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+ defined(HAVE_GCC_INLINE_ASM_NEON)
+
+.text
+
+.syntax unified
+.fpu neon
+.arm
+
+/* ARM registers */
+#define RROUND r0
+
+/* NEON vector registers */
+#define RA0 q0
+#define RA1 q1
+#define RA2 q2
+#define RA3 q3
+#define RA4 q4
+#define RB0 q5
+#define RB1 q6
+#define RB2 q7
+#define RB3 q8
+#define RB4 q9
+
+#define RT0 q10
+#define RT1 q11
+#define RT2 q12
+#define RT3 q13
+
+#define RA0d0 d0
+#define RA0d1 d1
+#define RA1d0 d2
+#define RA1d1 d3
+#define RA2d0 d4
+#define RA2d1 d5
+#define RA3d0 d6
+#define RA3d1 d7
+#define RA4d0 d8
+#define RA4d1 d9
+#define RB0d0 d10
+#define RB0d1 d11
+#define RB1d0 d12
+#define RB1d1 d13
+#define RB2d0 d14
+#define RB2d1 d15
+#define RB3d0 d16
+#define RB3d1 d17
+#define RB4d0 d18
+#define RB4d1 d19
+#define RT0d0 d20
+#define RT0d1 d21
+#define RT1d0 d22
+#define RT1d1 d23
+#define RT2d0 d24
+#define RT2d1 d25
+
+/**********************************************************************
+ helper macros
+ **********************************************************************/
+
+#define transpose_4x4(_q0, _q1, _q2, _q3) \
+ vtrn.32 _q0, _q1; \
+ vtrn.32 _q2, _q3; \
+ vswp _q0##d1, _q2##d0; \
+ vswp _q1##d1, _q3##d0;
+
+/**********************************************************************
+ 8-way serpent
+ **********************************************************************/
+
+/*
+ * These are the S-Boxes of Serpent from following research paper.
+ *
+ * D. A. Osvik, “Speeding up Serpent,” in Third AES Candidate Conference,
+ * (New York, New York, USA), p. 317–329, National Institute of Standards and
+ * Technology, 2000.
+ *
+ * Paper is also available at: http://www.ii.uib.no/~osvik/pub/aes3.pdf
+ *
+ */
+#define SBOX0(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ veor a3, a3, a0; veor b3, b3, b0; vmov a4, a1; vmov b4, b1; \
+ vand a1, a1, a3; vand b1, b1, b3; veor a4, a4, a2; veor b4, b4, b2; \
+ veor a1, a1, a0; veor b1, b1, b0; vorr a0, a0, a3; vorr b0, b0, b3; \
+ veor a0, a0, a4; veor b0, b0, b4; veor a4, a4, a3; veor b4, b4, b3; \
+ veor a3, a3, a2; veor b3, b3, b2; vorr a2, a2, a1; vorr b2, b2, b1; \
+ veor a2, a2, a4; veor b2, b2, b4; vmvn a4, a4; vmvn b4, b4; \
+ vorr a4, a4, a1; vorr b4, b4, b1; veor a1, a1, a3; veor b1, b1, b3; \
+ veor a1, a1, a4; veor b1, b1, b4; vorr a3, a3, a0; vorr b3, b3, b0; \
+ veor a1, a1, a3; veor b1, b1, b3; veor a4, a3; veor b4, b3;
+
+#define SBOX0_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmvn a2, a2; vmvn b2, b2; vmov a4, a1; vmov b4, b1; \
+ vorr a1, a1, a0; vorr b1, b1, b0; vmvn a4, a4; vmvn b4, b4; \
+ veor a1, a1, a2; veor b1, b1, b2; vorr a2, a2, a4; vorr b2, b2, b4; \
+ veor a1, a1, a3; veor b1, b1, b3; veor a0, a0, a4; veor b0, b0, b4; \
+ veor a2, a2, a0; veor b2, b2, b0; vand a0, a0, a3; vand b0, b0, b3; \
+ veor a4, a4, a0; veor b4, b4, b0; vorr a0, a0, a1; vorr b0, b0, b1; \
+ veor a0, a0, a2; veor b0, b0, b2; veor a3, a3, a4; veor b3, b3, b4; \
+ veor a2, a2, a1; veor b2, b2, b1; veor a3, a3, a0; veor b3, b3, b0; \
+ veor a3, a3, a1; veor b3, b3, b1;\
+ vand a2, a2, a3; vand b2, b2, b3;\
+ veor a4, a2; veor b4, b2;
+
+#define SBOX1(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmvn a0, a0; vmvn b0, b0; vmvn a2, a2; vmvn b2, b2; \
+ vmov a4, a0; vmov b4, b0; vand a0, a0, a1; vand b0, b0, b1; \
+ veor a2, a2, a0; veor b2, b2, b0; vorr a0, a0, a3; vorr b0, b0, b3; \
+ veor a3, a3, a2; veor b3, b3, b2; veor a1, a1, a0; veor b1, b1, b0; \
+ veor a0, a0, a4; veor b0, b0, b4; vorr a4, a4, a1; vorr b4, b4, b1; \
+ veor a1, a1, a3; veor b1, b1, b3; vorr a2, a2, a0; vorr b2, b2, b0; \
+ vand a2, a2, a4; vand b2, b2, b4; veor a0, a0, a1; veor b0, b0, b1; \
+ vand a1, a1, a2; vand b1, b1, b2;\
+ veor a1, a1, a0; veor b1, b1, b0; vand a0, a0, a2; vand b0, b0, b2; \
+ veor a0, a4; veor b0, b4;
+
+#define SBOX1_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a1; vmov b4, b1; veor a1, a1, a3; veor b1, b1, b3; \
+ vand a3, a3, a1; vand b3, b3, b1; veor a4, a4, a2; veor b4, b4, b2; \
+ veor a3, a3, a0; veor b3, b3, b0; vorr a0, a0, a1; vorr b0, b0, b1; \
+ veor a2, a2, a3; veor b2, b2, b3; veor a0, a0, a4; veor b0, b0, b4; \
+ vorr a0, a0, a2; vorr b0, b0, b2; veor a1, a1, a3; veor b1, b1, b3; \
+ veor a0, a0, a1; veor b0, b0, b1; vorr a1, a1, a3; vorr b1, b1, b3; \
+ veor a1, a1, a0; veor b1, b1, b0; vmvn a4, a4; vmvn b4, b4; \
+ veor a4, a4, a1; veor b4, b4, b1; vorr a1, a1, a0; vorr b1, b1, b0; \
+ veor a1, a1, a0; veor b1, b1, b0;\
+ vorr a1, a1, a4; vorr b1, b1, b4;\
+ veor a3, a1; veor b3, b1;
+
+#define SBOX2(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a0; vmov b4, b0; vand a0, a0, a2; vand b0, b0, b2; \
+ veor a0, a0, a3; veor b0, b0, b3; veor a2, a2, a1; veor b2, b2, b1; \
+ veor a2, a2, a0; veor b2, b2, b0; vorr a3, a3, a4; vorr b3, b3, b4; \
+ veor a3, a3, a1; veor b3, b3, b1; veor a4, a4, a2; veor b4, b4, b2; \
+ vmov a1, a3; vmov b1, b3; vorr a3, a3, a4; vorr b3, b3, b4; \
+ veor a3, a3, a0; veor b3, b3, b0; vand a0, a0, a1; vand b0, b0, b1; \
+ veor a4, a4, a0; veor b4, b4, b0; veor a1, a1, a3; veor b1, b1, b3; \
+ veor a1, a1, a4; veor b1, b1, b4; vmvn a4, a4; vmvn b4, b4;
+
+#define SBOX2_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ veor a2, a2, a3; veor b2, b2, b3; veor a3, a3, a0; veor b3, b3, b0; \
+ vmov a4, a3; vmov b4, b3; vand a3, a3, a2; vand b3, b3, b2; \
+ veor a3, a3, a1; veor b3, b3, b1; vorr a1, a1, a2; vorr b1, b1, b2; \
+ veor a1, a1, a4; veor b1, b1, b4; vand a4, a4, a3; vand b4, b4, b3; \
+ veor a2, a2, a3; veor b2, b2, b3; vand a4, a4, a0; vand b4, b4, b0; \
+ veor a4, a4, a2; veor b4, b4, b2; vand a2, a2, a1; vand b2, b2, b1; \
+ vorr a2, a2, a0; vorr b2, b2, b0; vmvn a3, a3; vmvn b3, b3; \
+ veor a2, a2, a3; veor b2, b2, b3; veor a0, a0, a3; veor b0, b0, b3; \
+ vand a0, a0, a1; vand b0, b0, b1; veor a3, a3, a4; veor b3, b3, b4; \
+ veor a3, a0; veor b3, b0;
+
+#define SBOX3(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a0; vmov b4, b0; vorr a0, a0, a3; vorr b0, b0, b3; \
+ veor a3, a3, a1; veor b3, b3, b1; vand a1, a1, a4; vand b1, b1, b4; \
+ veor a4, a4, a2; veor b4, b4, b2; veor a2, a2, a3; veor b2, b2, b3; \
+ vand a3, a3, a0; vand b3, b3, b0; vorr a4, a4, a1; vorr b4, b4, b1; \
+ veor a3, a3, a4; veor b3, b3, b4; veor a0, a0, a1; veor b0, b0, b1; \
+ vand a4, a4, a0; vand b4, b4, b0; veor a1, a1, a3; veor b1, b1, b3; \
+ veor a4, a4, a2; veor b4, b4, b2; vorr a1, a1, a0; vorr b1, b1, b0; \
+ veor a1, a1, a2; veor b1, b1, b2; veor a0, a0, a3; veor b0, b0, b3; \
+ vmov a2, a1; vmov b2, b1; vorr a1, a1, a3; vorr b1, b1, b3; \
+ veor a1, a0; veor b1, b0;
+
+#define SBOX3_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a2; vmov b4, b2; veor a2, a2, a1; veor b2, b2, b1; \
+ veor a0, a0, a2; veor b0, b0, b2; vand a4, a4, a2; vand b4, b4, b2; \
+ veor a4, a4, a0; veor b4, b4, b0; vand a0, a0, a1; vand b0, b0, b1; \
+ veor a1, a1, a3; veor b1, b1, b3; vorr a3, a3, a4; vorr b3, b3, b4; \
+ veor a2, a2, a3; veor b2, b2, b3; veor a0, a0, a3; veor b0, b0, b3; \
+ veor a1, a1, a4; veor b1, b1, b4; vand a3, a3, a2; vand b3, b3, b2; \
+ veor a3, a3, a1; veor b3, b3, b1; veor a1, a1, a0; veor b1, b1, b0; \
+ vorr a1, a1, a2; vorr b1, b1, b2; veor a0, a0, a3; veor b0, b0, b3; \
+ veor a1, a1, a4; veor b1, b1, b4;\
+ veor a0, a1; veor b0, b1;
+
+#define SBOX4(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ veor a1, a1, a3; veor b1, b1, b3; vmvn a3, a3; vmvn b3, b3; \
+ veor a2, a2, a3; veor b2, b2, b3; veor a3, a3, a0; veor b3, b3, b0; \
+ vmov a4, a1; vmov b4, b1; vand a1, a1, a3; vand b1, b1, b3; \
+ veor a1, a1, a2; veor b1, b1, b2; veor a4, a4, a3; veor b4, b4, b3; \
+ veor a0, a0, a4; veor b0, b0, b4; vand a2, a2, a4; vand b2, b2, b4; \
+ veor a2, a2, a0; veor b2, b2, b0; vand a0, a0, a1; vand b0, b0, b1; \
+ veor a3, a3, a0; veor b3, b3, b0; vorr a4, a4, a1; vorr b4, b4, b1; \
+ veor a4, a4, a0; veor b4, b4, b0; vorr a0, a0, a3; vorr b0, b0, b3; \
+ veor a0, a0, a2; veor b0, b0, b2; vand a2, a2, a3; vand b2, b2, b3; \
+ vmvn a0, a0; vmvn b0, b0; veor a4, a2; veor b4, b2;
+
+#define SBOX4_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a2; vmov b4, b2; vand a2, a2, a3; vand b2, b2, b3; \
+ veor a2, a2, a1; veor b2, b2, b1; vorr a1, a1, a3; vorr b1, b1, b3; \
+ vand a1, a1, a0; vand b1, b1, b0; veor a4, a4, a2; veor b4, b4, b2; \
+ veor a4, a4, a1; veor b4, b4, b1; vand a1, a1, a2; vand b1, b1, b2; \
+ vmvn a0, a0; vmvn b0, b0; veor a3, a3, a4; veor b3, b3, b4; \
+ veor a1, a1, a3; veor b1, b1, b3; vand a3, a3, a0; vand b3, b3, b0; \
+ veor a3, a3, a2; veor b3, b3, b2; veor a0, a0, a1; veor b0, b0, b1; \
+ vand a2, a2, a0; vand b2, b2, b0; veor a3, a3, a0; veor b3, b3, b0; \
+ veor a2, a2, a4; veor b2, b2, b4;\
+ vorr a2, a2, a3; vorr b2, b2, b3; veor a3, a3, a0; veor b3, b3, b0; \
+ veor a2, a1; veor b2, b1;
+
+#define SBOX5(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ veor a0, a0, a1; veor b0, b0, b1; veor a1, a1, a3; veor b1, b1, b3; \
+ vmvn a3, a3; vmvn b3, b3; vmov a4, a1; vmov b4, b1; \
+ vand a1, a1, a0; vand b1, b1, b0; veor a2, a2, a3; veor b2, b2, b3; \
+ veor a1, a1, a2; veor b1, b1, b2; vorr a2, a2, a4; vorr b2, b2, b4; \
+ veor a4, a4, a3; veor b4, b4, b3; vand a3, a3, a1; vand b3, b3, b1; \
+ veor a3, a3, a0; veor b3, b3, b0; veor a4, a4, a1; veor b4, b4, b1; \
+ veor a4, a4, a2; veor b4, b4, b2; veor a2, a2, a0; veor b2, b2, b0; \
+ vand a0, a0, a3; vand b0, b0, b3; vmvn a2, a2; vmvn b2, b2; \
+ veor a0, a0, a4; veor b0, b0, b4; vorr a4, a4, a3; vorr b4, b4, b3; \
+ veor a2, a4; veor b2, b4;
+
+#define SBOX5_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmvn a1, a1; vmvn b1, b1; vmov a4, a3; vmov b4, b3; \
+ veor a2, a2, a1; veor b2, b2, b1; vorr a3, a3, a0; vorr b3, b3, b0; \
+ veor a3, a3, a2; veor b3, b3, b2; vorr a2, a2, a1; vorr b2, b2, b1; \
+ vand a2, a2, a0; vand b2, b2, b0; veor a4, a4, a3; veor b4, b4, b3; \
+ veor a2, a2, a4; veor b2, b2, b4; vorr a4, a4, a0; vorr b4, b4, b0; \
+ veor a4, a4, a1; veor b4, b4, b1; vand a1, a1, a2; vand b1, b1, b2; \
+ veor a1, a1, a3; veor b1, b1, b3; veor a4, a4, a2; veor b4, b4, b2; \
+ vand a3, a3, a4; vand b3, b3, b4; veor a4, a4, a1; veor b4, b4, b1; \
+ veor a3, a3, a4; veor b3, b3, b4; vmvn a4, a4; vmvn b4, b4; \
+ veor a3, a0; veor b3, b0;
+
+#define SBOX6(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmvn a2, a2; vmvn b2, b2; vmov a4, a3; vmov b4, b3; \
+ vand a3, a3, a0; vand b3, b3, b0; veor a0, a0, a4; veor b0, b0, b4; \
+ veor a3, a3, a2; veor b3, b3, b2; vorr a2, a2, a4; vorr b2, b2, b4; \
+ veor a1, a1, a3; veor b1, b1, b3; veor a2, a2, a0; veor b2, b2, b0; \
+ vorr a0, a0, a1; vorr b0, b0, b1; veor a2, a2, a1; veor b2, b2, b1; \
+ veor a4, a4, a0; veor b4, b4, b0; vorr a0, a0, a3; vorr b0, b0, b3; \
+ veor a0, a0, a2; veor b0, b0, b2; veor a4, a4, a3; veor b4, b4, b3; \
+ veor a4, a4, a0; veor b4, b4, b0; vmvn a3, a3; vmvn b3, b3; \
+ vand a2, a2, a4; vand b2, b2, b4;\
+ veor a2, a3; veor b2, b3;
+
+#define SBOX6_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ veor a0, a0, a2; veor b0, b0, b2; vmov a4, a2; vmov b4, b2; \
+ vand a2, a2, a0; vand b2, b2, b0; veor a4, a4, a3; veor b4, b4, b3; \
+ vmvn a2, a2; vmvn b2, b2; veor a3, a3, a1; veor b3, b3, b1; \
+ veor a2, a2, a3; veor b2, b2, b3; vorr a4, a4, a0; vorr b4, b4, b0; \
+ veor a0, a0, a2; veor b0, b0, b2; veor a3, a3, a4; veor b3, b3, b4; \
+ veor a4, a4, a1; veor b4, b4, b1; vand a1, a1, a3; vand b1, b1, b3; \
+ veor a1, a1, a0; veor b1, b1, b0; veor a0, a0, a3; veor b0, b0, b3; \
+ vorr a0, a0, a2; vorr b0, b0, b2; veor a3, a3, a1; veor b3, b3, b1; \
+ veor a4, a0; veor b4, b0;
+
+#define SBOX7(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a1; vmov b4, b1; vorr a1, a1, a2; vorr b1, b1, b2; \
+ veor a1, a1, a3; veor b1, b1, b3; veor a4, a4, a2; veor b4, b4, b2; \
+ veor a2, a2, a1; veor b2, b2, b1; vorr a3, a3, a4; vorr b3, b3, b4; \
+ vand a3, a3, a0; vand b3, b3, b0; veor a4, a4, a2; veor b4, b4, b2; \
+ veor a3, a3, a1; veor b3, b3, b1; vorr a1, a1, a4; vorr b1, b1, b4; \
+ veor a1, a1, a0; veor b1, b1, b0; vorr a0, a0, a4; vorr b0, b0, b4; \
+ veor a0, a0, a2; veor b0, b0, b2; veor a1, a1, a4; veor b1, b1, b4; \
+ veor a2, a2, a1; veor b2, b2, b1; vand a1, a1, a0; vand b1, b1, b0; \
+ veor a1, a1, a4; veor b1, b1, b4; vmvn a2, a2; vmvn b2, b2; \
+ vorr a2, a2, a0; vorr b2, b2, b0;\
+ veor a4, a2; veor b4, b2;
+
+#define SBOX7_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vmov a4, a2; vmov b4, b2; veor a2, a2, a0; veor b2, b2, b0; \
+ vand a0, a0, a3; vand b0, b0, b3; vorr a4, a4, a3; vorr b4, b4, b3; \
+ vmvn a2, a2; vmvn b2, b2; veor a3, a3, a1; veor b3, b3, b1; \
+ vorr a1, a1, a0; vorr b1, b1, b0; veor a0, a0, a2; veor b0, b0, b2; \
+ vand a2, a2, a4; vand b2, b2, b4; vand a3, a3, a4; vand b3, b3, b4; \
+ veor a1, a1, a2; veor b1, b1, b2; veor a2, a2, a0; veor b2, b2, b0; \
+ vorr a0, a0, a2; vorr b0, b0, b2; veor a4, a4, a1; veor b4, b4, b1; \
+ veor a0, a0, a3; veor b0, b0, b3; veor a3, a3, a4; veor b3, b3, b4; \
+ vorr a4, a4, a0; vorr b4, b4, b0; veor a3, a3, a2; veor b3, b3, b2; \
+ veor a4, a2; veor b4, b2;
+
+/* Apply SBOX number WHICH to to the block. */
+#define SBOX(which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ SBOX##which (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4)
+
+/* Apply inverse SBOX number WHICH to to the block. */
+#define SBOX_INVERSE(which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ SBOX##which##_INVERSE (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4)
+
+/* XOR round key into block state in a0,a1,a2,a3. a4 used as temporary. */
+#define BLOCK_XOR_KEY(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vdup.32 RT3, RT0d0[0]; \
+ vdup.32 RT1, RT0d0[1]; \
+ vdup.32 RT2, RT0d1[0]; \
+ vdup.32 RT0, RT0d1[1]; \
+ veor a0, a0, RT3; veor b0, b0, RT3; \
+ veor a1, a1, RT1; veor b1, b1, RT1; \
+ veor a2, a2, RT2; veor b2, b2, RT2; \
+ veor a3, a3, RT0; veor b3, b3, RT0;
+
+#define BLOCK_LOAD_KEY_ENC() \
+ vld1.8 {RT0d0, RT0d1}, [RROUND]!;
+
+#define BLOCK_LOAD_KEY_DEC() \
+ vld1.8 {RT0d0, RT0d1}, [RROUND]; \
+ sub RROUND, RROUND, #16
+
+/* Apply the linear transformation to BLOCK. */
+#define LINEAR_TRANSFORMATION(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vshl.u32 a4, a0, #13; vshl.u32 b4, b0, #13; \
+ vshr.u32 a0, a0, #(32-13); vshr.u32 b0, b0, #(32-13); \
+ veor a0, a0, a4; veor b0, b0, b4; \
+ vshl.u32 a4, a2, #3; vshl.u32 b4, b2, #3; \
+ vshr.u32 a2, a2, #(32-3); vshr.u32 b2, b2, #(32-3); \
+ veor a2, a2, a4; veor b2, b2, b4; \
+ veor a1, a0, a1; veor b1, b0, b1; \
+ veor a1, a2, a1; veor b1, b2, b1; \
+ vshl.u32 a4, a0, #3; vshl.u32 b4, b0, #3; \
+ veor a3, a2, a3; veor b3, b2, b3; \
+ veor a3, a4, a3; veor b3, b4, b3; \
+ vshl.u32 a4, a1, #1; vshl.u32 b4, b1, #1; \
+ vshr.u32 a1, a1, #(32-1); vshr.u32 b1, b1, #(32-1); \
+ veor a1, a1, a4; veor b1, b1, b4; \
+ vshl.u32 a4, a3, #7; vshl.u32 b4, b3, #7; \
+ vshr.u32 a3, a3, #(32-7); vshr.u32 b3, b3, #(32-7); \
+ veor a3, a3, a4; veor b3, b3, b4; \
+ veor a0, a1, a0; veor b0, b1, b0; \
+ veor a0, a3, a0; veor b0, b3, b0; \
+ vshl.u32 a4, a1, #7; vshl.u32 b4, b1, #7; \
+ veor a2, a3, a2; veor b2, b3, b2; \
+ veor a2, a4, a2; veor b2, b4, b2; \
+ vshl.u32 a4, a0, #5; vshl.u32 b4, b0, #5; \
+ vshr.u32 a0, a0, #(32-5); vshr.u32 b0, b0, #(32-5); \
+ veor a0, a0, a4; veor b0, b0, b4; \
+ vshl.u32 a4, a2, #22; vshl.u32 b4, b2, #22; \
+ vshr.u32 a2, a2, #(32-22); vshr.u32 b2, b2, #(32-22); \
+ veor a2, a2, a4; veor b2, b2, b4;
+
+/* Apply the inverse linear transformation to BLOCK. */
+#define LINEAR_TRANSFORMATION_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+ vshr.u32 a4, a2, #22; vshr.u32 b4, b2, #22; \
+ vshl.u32 a2, a2, #(32-22); vshl.u32 b2, b2, #(32-22); \
+ veor a2, a2, a4; veor b2, b2, b4; \
+ vshr.u32 a4, a0, #5; vshr.u32 b4, b0, #5; \
+ vshl.u32 a0, a0, #(32-5); vshl.u32 b0, b0, #(32-5); \
+ veor a0, a0, a4; veor b0, b0, b4; \
+ vshl.u32 a4, a1, #7; vshl.u32 b4, b1, #7; \
+ veor a2, a3, a2; veor b2, b3, b2; \
+ veor a2, a4, a2; veor b2, b4, b2; \
+ veor a0, a1, a0; veor b0, b1, b0; \
+ veor a0, a3, a0; veor b0, b3, b0; \
+ vshr.u32 a4, a3, #7; vshr.u32 b4, b3, #7; \
+ vshl.u32 a3, a3, #(32-7); vshl.u32 b3, b3, #(32-7); \
+ veor a3, a3, a4; veor b3, b3, b4; \
+ vshr.u32 a4, a1, #1; vshr.u32 b4, b1, #1; \
+ vshl.u32 a1, a1, #(32-1); vshl.u32 b1, b1, #(32-1); \
+ veor a1, a1, a4; veor b1, b1, b4; \
+ vshl.u32 a4, a0, #3; vshl.u32 b4, b0, #3; \
+ veor a3, a2, a3; veor b3, b2, b3; \
+ veor a3, a4, a3; veor b3, b4, b3; \
+ veor a1, a0, a1; veor b1, b0, b1; \
+ veor a1, a2, a1; veor b1, b2, b1; \
+ vshr.u32 a4, a2, #3; vshr.u32 b4, b2, #3; \
+ vshl.u32 a2, a2, #(32-3); vshl.u32 b2, b2, #(32-3); \
+ veor a2, a2, a4; veor b2, b2, b4; \
+ vshr.u32 a4, a0, #13; vshr.u32 b4, b0, #13; \
+ vshl.u32 a0, a0, #(32-13); vshl.u32 b0, b0, #(32-13); \
+ veor a0, a0, a4; veor b0, b0, b4;
+
+/* Apply a Serpent round to eight parallel blocks. This macro increments
+ `round'. */
+#define ROUND(round, which, a0, a1, a2, a3, a4, na0, na1, na2, na3, na4, \
+ b0, b1, b2, b3, b4, nb0, nb1, nb2, nb3, nb4) \
+ BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_LOAD_KEY_ENC (); \
+ SBOX (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ LINEAR_TRANSFORMATION (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);
+
+/* Apply the last Serpent round to eight parallel blocks. This macro increments
+ `round'. */
+#define ROUND_LAST(round, which, a0, a1, a2, a3, a4, na0, na1, na2, na3, na4, \
+ b0, b1, b2, b3, b4, nb0, nb1, nb2, nb3, nb4) \
+ BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_LOAD_KEY_ENC (); \
+ SBOX (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);
+
+/* Apply an inverse Serpent round to eight parallel blocks. This macro
+ increments `round'. */
+#define ROUND_INVERSE(round, which, a0, a1, a2, a3, a4, \
+ na0, na1, na2, na3, na4, \
+ b0, b1, b2, b3, b4, \
+ nb0, nb1, nb2, nb3, nb4) \
+ LINEAR_TRANSFORMATION_INVERSE (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ SBOX_INVERSE (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4); \
+ BLOCK_LOAD_KEY_DEC ();
+
+/* Apply the first inverse Serpent round to eight parallel blocks. This macro
+ increments `round'. */
+#define ROUND_FIRST_INVERSE(round, which, a0, a1, a2, a3, a4, \
+ na0, na1, na2, na3, na4, \
+ b0, b1, b2, b3, b4, \
+ nb0, nb1, nb2, nb3, nb4) \
+ BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_LOAD_KEY_DEC (); \
+ SBOX_INVERSE (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); \
+ BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4); \
+ BLOCK_LOAD_KEY_DEC ();
+
+.align 3
+.type __serpent_enc_blk8,%function;
+__serpent_enc_blk8:
+ /* input:
+ * r0: round key pointer
+ * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext
+ * blocks
+ * output:
+ * RA4, RA1, RA2, RA0, RB4, RB1, RB2, RB0: eight parallel
+ * ciphertext blocks
+ */
+
+ transpose_4x4(RA0, RA1, RA2, RA3);
+ BLOCK_LOAD_KEY_ENC ();
+ transpose_4x4(RB0, RB1, RB2, RB3);
+
+ ROUND (0, 0, RA0, RA1, RA2, RA3, RA4, RA1, RA4, RA2, RA0, RA3,
+ RB0, RB1, RB2, RB3, RB4, RB1, RB4, RB2, RB0, RB3);
+ ROUND (1, 1, RA1, RA4, RA2, RA0, RA3, RA2, RA1, RA0, RA4, RA3,
+ RB1, RB4, RB2, RB0, RB3, RB2, RB1, RB0, RB4, RB3);
+ ROUND (2, 2, RA2, RA1, RA0, RA4, RA3, RA0, RA4, RA1, RA3, RA2,
+ RB2, RB1, RB0, RB4, RB3, RB0, RB4, RB1, RB3, RB2);
+ ROUND (3, 3, RA0, RA4, RA1, RA3, RA2, RA4, RA1, RA3, RA2, RA0,
+ RB0, RB4, RB1, RB3, RB2, RB4, RB1, RB3, RB2, RB0);
+ ROUND (4, 4, RA4, RA1, RA3, RA2, RA0, RA1, RA0, RA4, RA2, RA3,
+ RB4, RB1, RB3, RB2, RB0, RB1, RB0, RB4, RB2, RB3);
+ ROUND (5, 5, RA1, RA0, RA4, RA2, RA3, RA0, RA2, RA1, RA4, RA3,
+ RB1, RB0, RB4, RB2, RB3, RB0, RB2, RB1, RB4, RB3);
+ ROUND (6, 6, RA0, RA2, RA1, RA4, RA3, RA0, RA2, RA3, RA1, RA4,
+ RB0, RB2, RB1, RB4, RB3, RB0, RB2, RB3, RB1, RB4);
+ ROUND (7, 7, RA0, RA2, RA3, RA1, RA4, RA4, RA1, RA2, RA0, RA3,
+ RB0, RB2, RB3, RB1, RB4, RB4, RB1, RB2, RB0, RB3);
+ ROUND (8, 0, RA4, RA1, RA2, RA0, RA3, RA1, RA3, RA2, RA4, RA0,
+ RB4, RB1, RB2, RB0, RB3, RB1, RB3, RB2, RB4, RB0);
+ ROUND (9, 1, RA1, RA3, RA2, RA4, RA0, RA2, RA1, RA4, RA3, RA0,
+ RB1, RB3, RB2, RB4, RB0, RB2, RB1, RB4, RB3, RB0);
+ ROUND (10, 2, RA2, RA1, RA4, RA3, RA0, RA4, RA3, RA1, RA0, RA2,
+ RB2, RB1, RB4, RB3, RB0, RB4, RB3, RB1, RB0, RB2);
+ ROUND (11, 3, RA4, RA3, RA1, RA0, RA2, RA3, RA1, RA0, RA2, RA4,
+ RB4, RB3, RB1, RB0, RB2, RB3, RB1, RB0, RB2, RB4);
+ ROUND (12, 4, RA3, RA1, RA0, RA2, RA4, RA1, RA4, RA3, RA2, RA0,
+ RB3, RB1, RB0, RB2, RB4, RB1, RB4, RB3, RB2, RB0);
+ ROUND (13, 5, RA1, RA4, RA3, RA2, RA0, RA4, RA2, RA1, RA3, RA0,
+ RB1, RB4, RB3, RB2, RB0, RB4, RB2, RB1, RB3, RB0);
+ ROUND (14, 6, RA4, RA2, RA1, RA3, RA0, RA4, RA2, RA0, RA1, RA3,
+ RB4, RB2, RB1, RB3, RB0, RB4, RB2, RB0, RB1, RB3);
+ ROUND (15, 7, RA4, RA2, RA0, RA1, RA3, RA3, RA1, RA2, RA4, RA0,
+ RB4, RB2, RB0, RB1, RB3, RB3, RB1, RB2, RB4, RB0);
+ ROUND (16, 0, RA3, RA1, RA2, RA4, RA0, RA1, RA0, RA2, RA3, RA4,
+ RB3, RB1, RB2, RB4, RB0, RB1, RB0, RB2, RB3, RB4);
+ ROUND (17, 1, RA1, RA0, RA2, RA3, RA4, RA2, RA1, RA3, RA0, RA4,
+ RB1, RB0, RB2, RB3, RB4, RB2, RB1, RB3, RB0, RB4);
+ ROUND (18, 2, RA2, RA1, RA3, RA0, RA4, RA3, RA0, RA1, RA4, RA2,
+ RB2, RB1, RB3, RB0, RB4, RB3, RB0, RB1, RB4, RB2);
+ ROUND (19, 3, RA3, RA0, RA1, RA4, RA2, RA0, RA1, RA4, RA2, RA3,
+ RB3, RB0, RB1, RB4, RB2, RB0, RB1, RB4, RB2, RB3);
+ ROUND (20, 4, RA0, RA1, RA4, RA2, RA3, RA1, RA3, RA0, RA2, RA4,
+ RB0, RB1, RB4, RB2, RB3, RB1, RB3, RB0, RB2, RB4);
+ ROUND (21, 5, RA1, RA3, RA0, RA2, RA4, RA3, RA2, RA1, RA0, RA4,
+ RB1, RB3, RB0, RB2, RB4, RB3, RB2, RB1, RB0, RB4);
+ ROUND (22, 6, RA3, RA2, RA1, RA0, RA4, RA3, RA2, RA4, RA1, RA0,
+ RB3, RB2, RB1, RB0, RB4, RB3, RB2, RB4, RB1, RB0);
+ ROUND (23, 7, RA3, RA2, RA4, RA1, RA0, RA0, RA1, RA2, RA3, RA4,
+ RB3, RB2, RB4, RB1, RB0, RB0, RB1, RB2, RB3, RB4);
+ ROUND (24, 0, RA0, RA1, RA2, RA3, RA4, RA1, RA4, RA2, RA0, RA3,
+ RB0, RB1, RB2, RB3, RB4, RB1, RB4, RB2, RB0, RB3);
+ ROUND (25, 1, RA1, RA4, RA2, RA0, RA3, RA2, RA1, RA0, RA4, RA3,
+ RB1, RB4, RB2, RB0, RB3, RB2, RB1, RB0, RB4, RB3);
+ ROUND (26, 2, RA2, RA1, RA0, RA4, RA3, RA0, RA4, RA1, RA3, RA2,
+ RB2, RB1, RB0, RB4, RB3, RB0, RB4, RB1, RB3, RB2);
+ ROUND (27, 3, RA0, RA4, RA1, RA3, RA2, RA4, RA1, RA3, RA2, RA0,
+ RB0, RB4, RB1, RB3, RB2, RB4, RB1, RB3, RB2, RB0);
+ ROUND (28, 4, RA4, RA1, RA3, RA2, RA0, RA1, RA0, RA4, RA2, RA3,
+ RB4, RB1, RB3, RB2, RB0, RB1, RB0, RB4, RB2, RB3);
+ ROUND (29, 5, RA1, RA0, RA4, RA2, RA3, RA0, RA2, RA1, RA4, RA3,
+ RB1, RB0, RB4, RB2, RB3, RB0, RB2, RB1, RB4, RB3);
+ ROUND (30, 6, RA0, RA2, RA1, RA4, RA3, RA0, RA2, RA3, RA1, RA4,
+ RB0, RB2, RB1, RB4, RB3, RB0, RB2, RB3, RB1, RB4);
+ ROUND_LAST (31, 7, RA0, RA2, RA3, RA1, RA4, RA4, RA1, RA2, RA0, RA3,
+ RB0, RB2, RB3, RB1, RB4, RB4, RB1, RB2, RB0, RB3);
+
+ transpose_4x4(RA4, RA1, RA2, RA0);
+ transpose_4x4(RB4, RB1, RB2, RB0);
+
+ bx lr;
+.size __serpent_enc_blk8,.-__serpent_enc_blk8;
+
+.align 3
+.type __serpent_dec_blk8,%function;
+__serpent_dec_blk8:
+ /* input:
+ * r0: round key pointer
+ * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel
+ * ciphertext blocks
+ * output:
+ * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext
+ * blocks
+ */
+
+ add RROUND, RROUND, #(32*16);
+
+ transpose_4x4(RA0, RA1, RA2, RA3);
+ BLOCK_LOAD_KEY_DEC ();
+ transpose_4x4(RB0, RB1, RB2, RB3);
+
+ ROUND_FIRST_INVERSE (31, 7, RA0, RA1, RA2, RA3, RA4,
+ RA3, RA0, RA1, RA4, RA2,
+ RB0, RB1, RB2, RB3, RB4,
+ RB3, RB0, RB1, RB4, RB2);
+ ROUND_INVERSE (30, 6, RA3, RA0, RA1, RA4, RA2, RA0, RA1, RA2, RA4, RA3,
+ RB3, RB0, RB1, RB4, RB2, RB0, RB1, RB2, RB4, RB3);
+ ROUND_INVERSE (29, 5, RA0, RA1, RA2, RA4, RA3, RA1, RA3, RA4, RA2, RA0,
+ RB0, RB1, RB2, RB4, RB3, RB1, RB3, RB4, RB2, RB0);
+ ROUND_INVERSE (28, 4, RA1, RA3, RA4, RA2, RA0, RA1, RA2, RA4, RA0, RA3,
+ RB1, RB3, RB4, RB2, RB0, RB1, RB2, RB4, RB0, RB3);
+ ROUND_INVERSE (27, 3, RA1, RA2, RA4, RA0, RA3, RA4, RA2, RA0, RA1, RA3,
+ RB1, RB2, RB4, RB0, RB3, RB4, RB2, RB0, RB1, RB3);
+ ROUND_INVERSE (26, 2, RA4, RA2, RA0, RA1, RA3, RA2, RA3, RA0, RA1, RA4,
+ RB4, RB2, RB0, RB1, RB3, RB2, RB3, RB0, RB1, RB4);
+ ROUND_INVERSE (25, 1, RA2, RA3, RA0, RA1, RA4, RA4, RA2, RA1, RA0, RA3,
+ RB2, RB3, RB0, RB1, RB4, RB4, RB2, RB1, RB0, RB3);
+ ROUND_INVERSE (24, 0, RA4, RA2, RA1, RA0, RA3, RA4, RA3, RA2, RA0, RA1,
+ RB4, RB2, RB1, RB0, RB3, RB4, RB3, RB2, RB0, RB1);
+ ROUND_INVERSE (23, 7, RA4, RA3, RA2, RA0, RA1, RA0, RA4, RA3, RA1, RA2,
+ RB4, RB3, RB2, RB0, RB1, RB0, RB4, RB3, RB1, RB2);
+ ROUND_INVERSE (22, 6, RA0, RA4, RA3, RA1, RA2, RA4, RA3, RA2, RA1, RA0,
+ RB0, RB4, RB3, RB1, RB2, RB4, RB3, RB2, RB1, RB0);
+ ROUND_INVERSE (21, 5, RA4, RA3, RA2, RA1, RA0, RA3, RA0, RA1, RA2, RA4,
+ RB4, RB3, RB2, RB1, RB0, RB3, RB0, RB1, RB2, RB4);
+ ROUND_INVERSE (20, 4, RA3, RA0, RA1, RA2, RA4, RA3, RA2, RA1, RA4, RA0,
+ RB3, RB0, RB1, RB2, RB4, RB3, RB2, RB1, RB4, RB0);
+ ROUND_INVERSE (19, 3, RA3, RA2, RA1, RA4, RA0, RA1, RA2, RA4, RA3, RA0,
+ RB3, RB2, RB1, RB4, RB0, RB1, RB2, RB4, RB3, RB0);
+ ROUND_INVERSE (18, 2, RA1, RA2, RA4, RA3, RA0, RA2, RA0, RA4, RA3, RA1,
+ RB1, RB2, RB4, RB3, RB0, RB2, RB0, RB4, RB3, RB1);
+ ROUND_INVERSE (17, 1, RA2, RA0, RA4, RA3, RA1, RA1, RA2, RA3, RA4, RA0,
+ RB2, RB0, RB4, RB3, RB1, RB1, RB2, RB3, RB4, RB0);
+ ROUND_INVERSE (16, 0, RA1, RA2, RA3, RA4, RA0, RA1, RA0, RA2, RA4, RA3,
+ RB1, RB2, RB3, RB4, RB0, RB1, RB0, RB2, RB4, RB3);
+ ROUND_INVERSE (15, 7, RA1, RA0, RA2, RA4, RA3, RA4, RA1, RA0, RA3, RA2,
+ RB1, RB0, RB2, RB4, RB3, RB4, RB1, RB0, RB3, RB2);
+ ROUND_INVERSE (14, 6, RA4, RA1, RA0, RA3, RA2, RA1, RA0, RA2, RA3, RA4,
+ RB4, RB1, RB0, RB3, RB2, RB1, RB0, RB2, RB3, RB4);
+ ROUND_INVERSE (13, 5, RA1, RA0, RA2, RA3, RA4, RA0, RA4, RA3, RA2, RA1,
+ RB1, RB0, RB2, RB3, RB4, RB0, RB4, RB3, RB2, RB1);
+ ROUND_INVERSE (12, 4, RA0, RA4, RA3, RA2, RA1, RA0, RA2, RA3, RA1, RA4,
+ RB0, RB4, RB3, RB2, RB1, RB0, RB2, RB3, RB1, RB4);
+ ROUND_INVERSE (11, 3, RA0, RA2, RA3, RA1, RA4, RA3, RA2, RA1, RA0, RA4,
+ RB0, RB2, RB3, RB1, RB4, RB3, RB2, RB1, RB0, RB4);
+ ROUND_INVERSE (10, 2, RA3, RA2, RA1, RA0, RA4, RA2, RA4, RA1, RA0, RA3,
+ RB3, RB2, RB1, RB0, RB4, RB2, RB4, RB1, RB0, RB3);
+ ROUND_INVERSE (9, 1, RA2, RA4, RA1, RA0, RA3, RA3, RA2, RA0, RA1, RA4,
+ RB2, RB4, RB1, RB0, RB3, RB3, RB2, RB0, RB1, RB4);
+ ROUND_INVERSE (8, 0, RA3, RA2, RA0, RA1, RA4, RA3, RA4, RA2, RA1, RA0,
+ RB3, RB2, RB0, RB1, RB4, RB3, RB4, RB2, RB1, RB0);
+ ROUND_INVERSE (7, 7, RA3, RA4, RA2, RA1, RA0, RA1, RA3, RA4, RA0, RA2,
+ RB3, RB4, RB2, RB1, RB0, RB1, RB3, RB4, RB0, RB2);
+ ROUND_INVERSE (6, 6, RA1, RA3, RA4, RA0, RA2, RA3, RA4, RA2, RA0, RA1,
+ RB1, RB3, RB4, RB0, RB2, RB3, RB4, RB2, RB0, RB1);
+ ROUND_INVERSE (5, 5, RA3, RA4, RA2, RA0, RA1, RA4, RA1, RA0, RA2, RA3,
+ RB3, RB4, RB2, RB0, RB1, RB4, RB1, RB0, RB2, RB3);
+ ROUND_INVERSE (4, 4, RA4, RA1, RA0, RA2, RA3, RA4, RA2, RA0, RA3, RA1,
+ RB4, RB1, RB0, RB2, RB3, RB4, RB2, RB0, RB3, RB1);
+ ROUND_INVERSE (3, 3, RA4, RA2, RA0, RA3, RA1, RA0, RA2, RA3, RA4, RA1,
+ RB4, RB2, RB0, RB3, RB1, RB0, RB2, RB3, RB4, RB1);
+ ROUND_INVERSE (2, 2, RA0, RA2, RA3, RA4, RA1, RA2, RA1, RA3, RA4, RA0,
+ RB0, RB2, RB3, RB4, RB1, RB2, RB1, RB3, RB4, RB0);
+ ROUND_INVERSE (1, 1, RA2, RA1, RA3, RA4, RA0, RA0, RA2, RA4, RA3, RA1,
+ RB2, RB1, RB3, RB4, RB0, RB0, RB2, RB4, RB3, RB1);
+ ROUND_INVERSE (0, 0, RA0, RA2, RA4, RA3, RA1, RA0, RA1, RA2, RA3, RA4,
+ RB0, RB2, RB4, RB3, RB1, RB0, RB1, RB2, RB3, RB4);
+
+ transpose_4x4(RA0, RA1, RA2, RA3);
+ transpose_4x4(RB0, RB1, RB2, RB3);
+
+ bx lr;
+.size __serpent_dec_blk8,.-__serpent_dec_blk8;
+
+.align 3
+.globl _gcry_serpent_neon_ctr_enc
+.type _gcry_serpent_neon_ctr_enc,%function;
+_gcry_serpent_neon_ctr_enc:
+ /* input:
+ * r0: ctx, CTX
+ * r1: dst (8 blocks)
+ * r2: src (8 blocks)
+ * r3: iv
+ */
+
+ vmov.u8 RT1d0, #0xff; /* u64: -1 */
+ push {r4,lr};
+ vadd.u64 RT2d0, RT1d0, RT1d0; /* u64: -2 */
+ vpush {RA4-RB2};
+
+ /* load IV and byteswap */
+ vld1.8 {RA0}, [r3];
+ vrev64.u8 RT0, RA0; /* be => le */
+ ldr r4, [r3, #8];
+
+ /* construct IVs */
+ vsub.u64 RA2d1, RT0d1, RT2d0; /* +2 */
+ vsub.u64 RA1d1, RT0d1, RT1d0; /* +1 */
+ cmp r4, #-1;
+
+ vsub.u64 RB0d1, RA2d1, RT2d0; /* +4 */
+ vsub.u64 RA3d1, RA2d1, RT1d0; /* +3 */
+ ldr r4, [r3, #12];
+
+ vsub.u64 RB2d1, RB0d1, RT2d0; /* +6 */
+ vsub.u64 RB1d1, RB0d1, RT1d0; /* +5 */
+
+ vsub.u64 RT2d1, RB2d1, RT2d0; /* +8 */
+ vsub.u64 RB3d1, RB2d1, RT1d0; /* +7 */
+
+ vmov RA1d0, RT0d0;
+ vmov RA2d0, RT0d0;
+ vmov RA3d0, RT0d0;
+ vmov RB0d0, RT0d0;
+ rev r4, r4;
+ vmov RB1d0, RT0d0;
+ vmov RB2d0, RT0d0;
+ vmov RB3d0, RT0d0;
+ vmov RT2d0, RT0d0;
+
+ /* check need for handling 64-bit overflow and carry */
+ beq .Ldo_ctr_carry;
+
+.Lctr_carry_done:
+ /* le => be */
+ vrev64.u8 RA1, RA1;
+ vrev64.u8 RA2, RA2;
+ vrev64.u8 RA3, RA3;
+ vrev64.u8 RB0, RB0;
+ vrev64.u8 RT2, RT2;
+ vrev64.u8 RB1, RB1;
+ vrev64.u8 RB2, RB2;
+ vrev64.u8 RB3, RB3;
+ /* store new IV */
+ vst1.8 {RT2}, [r3];
+
+ bl __serpent_enc_blk8;
+
+ vld1.8 {RT0, RT1}, [r2]!;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RA4, RA4, RT0;
+ veor RA1, RA1, RT1;
+ vld1.8 {RT0, RT1}, [r2]!;
+ veor RA2, RA2, RT2;
+ veor RA0, RA0, RT3;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RB4, RB4, RT0;
+ veor RT0, RT0;
+ veor RB1, RB1, RT1;
+ veor RT1, RT1;
+ veor RB2, RB2, RT2;
+ veor RT2, RT2;
+ veor RB0, RB0, RT3;
+ veor RT3, RT3;
+
+ vst1.8 {RA4}, [r1]!;
+ vst1.8 {RA1}, [r1]!;
+ veor RA1, RA1;
+ vst1.8 {RA2}, [r1]!;
+ veor RA2, RA2;
+ vst1.8 {RA0}, [r1]!;
+ veor RA0, RA0;
+ vst1.8 {RB4}, [r1]!;
+ veor RB4, RB4;
+ vst1.8 {RB1}, [r1]!;
+ vst1.8 {RB2}, [r1]!;
+ vst1.8 {RB0}, [r1]!;
+
+ vpop {RA4-RB2};
+
+ /* clear the used registers */
+ veor RA3, RA3;
+ veor RB3, RB3;
+
+ pop {r4,pc};
+
+.Ldo_ctr_carry:
+ cmp r4, #-8;
+ blo .Lctr_carry_done;
+ beq .Lcarry_RT2;
+
+ cmp r4, #-6;
+ blo .Lcarry_RB3;
+ beq .Lcarry_RB2;
+
+ cmp r4, #-4;
+ blo .Lcarry_RB1;
+ beq .Lcarry_RB0;
+
+ cmp r4, #-2;
+ blo .Lcarry_RA3;
+ beq .Lcarry_RA2;
+
+ vsub.u64 RA1d0, RT1d0;
+.Lcarry_RA2:
+ vsub.u64 RA2d0, RT1d0;
+.Lcarry_RA3:
+ vsub.u64 RA3d0, RT1d0;
+.Lcarry_RB0:
+ vsub.u64 RB0d0, RT1d0;
+.Lcarry_RB1:
+ vsub.u64 RB1d0, RT1d0;
+.Lcarry_RB2:
+ vsub.u64 RB2d0, RT1d0;
+.Lcarry_RB3:
+ vsub.u64 RB3d0, RT1d0;
+.Lcarry_RT2:
+ vsub.u64 RT2d0, RT1d0;
+
+ b .Lctr_carry_done;
+.size _gcry_serpent_neon_ctr_enc,.-_gcry_serpent_neon_ctr_enc;
+
+.align 3
+.globl _gcry_serpent_neon_cfb_dec
+.type _gcry_serpent_neon_cfb_dec,%function;
+_gcry_serpent_neon_cfb_dec:
+ /* input:
+ * r0: ctx, CTX
+ * r1: dst (8 blocks)
+ * r2: src (8 blocks)
+ * r3: iv
+ */
+
+ push {lr};
+ vpush {RA4-RB2};
+
+ /* Load input */
+ vld1.8 {RA0}, [r3];
+ vld1.8 {RA1, RA2}, [r2]!;
+ vld1.8 {RA3}, [r2]!;
+ vld1.8 {RB0}, [r2]!;
+ vld1.8 {RB1, RB2}, [r2]!;
+ vld1.8 {RB3}, [r2]!;
+
+ /* Update IV */
+ vld1.8 {RT0}, [r2]!;
+ vst1.8 {RT0}, [r3];
+ mov r3, lr;
+ sub r2, r2, #(8*16);
+
+ bl __serpent_enc_blk8;
+
+ vld1.8 {RT0, RT1}, [r2]!;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RA4, RA4, RT0;
+ veor RA1, RA1, RT1;
+ vld1.8 {RT0, RT1}, [r2]!;
+ veor RA2, RA2, RT2;
+ veor RA0, RA0, RT3;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RB4, RB4, RT0;
+ veor RT0, RT0;
+ veor RB1, RB1, RT1;
+ veor RT1, RT1;
+ veor RB2, RB2, RT2;
+ veor RT2, RT2;
+ veor RB0, RB0, RT3;
+ veor RT3, RT3;
+
+ vst1.8 {RA4}, [r1]!;
+ vst1.8 {RA1}, [r1]!;
+ veor RA1, RA1;
+ vst1.8 {RA2}, [r1]!;
+ veor RA2, RA2;
+ vst1.8 {RA0}, [r1]!;
+ veor RA0, RA0;
+ vst1.8 {RB4}, [r1]!;
+ veor RB4, RB4;
+ vst1.8 {RB1}, [r1]!;
+ vst1.8 {RB2}, [r1]!;
+ vst1.8 {RB0}, [r1]!;
+
+ vpop {RA4-RB2};
+
+ /* clear the used registers */
+ veor RA3, RA3;
+ veor RB3, RB3;
+
+ pop {pc};
+.size _gcry_serpent_neon_cfb_dec,.-_gcry_serpent_neon_cfb_dec;
+
+.align 3
+.globl _gcry_serpent_neon_cbc_dec
+.type _gcry_serpent_neon_cbc_dec,%function;
+_gcry_serpent_neon_cbc_dec:
+ /* input:
+ * r0: ctx, CTX
+ * r1: dst (8 blocks)
+ * r2: src (8 blocks)
+ * r3: iv
+ */
+
+ push {lr};
+ vpush {RA4-RB2};
+
+ vld1.8 {RA0, RA1}, [r2]!;
+ vld1.8 {RA2, RA3}, [r2]!;
+ vld1.8 {RB0, RB1}, [r2]!;
+ vld1.8 {RB2, RB3}, [r2]!;
+ sub r2, r2, #(8*16);
+
+ bl __serpent_dec_blk8;
+
+ vld1.8 {RB4}, [r3];
+ vld1.8 {RT0, RT1}, [r2]!;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RA0, RA0, RB4;
+ veor RA1, RA1, RT0;
+ veor RA2, RA2, RT1;
+ vld1.8 {RT0, RT1}, [r2]!;
+ veor RA3, RA3, RT2;
+ veor RB0, RB0, RT3;
+ vld1.8 {RT2, RT3}, [r2]!;
+ veor RB1, RB1, RT0;
+ veor RT0, RT0;
+ veor RB2, RB2, RT1;
+ veor RT1, RT1;
+ veor RB3, RB3, RT2;
+ veor RT2, RT2;
+ vst1.8 {RT3}, [r3]; /* store new IV */
+ veor RT3, RT3;
+
+ vst1.8 {RA0, RA1}, [r1]!;
+ veor RA0, RA0;
+ veor RA1, RA1;
+ vst1.8 {RA2, RA3}, [r1]!;
+ veor RA2, RA2;
+ vst1.8 {RB0, RB1}, [r1]!;
+ veor RA3, RA3;
+ vst1.8 {RB2, RB3}, [r1]!;
+ veor RB3, RB3;
+
+ vpop {RA4-RB2};
+
+ /* clear the used registers */
+ veor RB4, RB4;
+
+ pop {pc};
+.size _gcry_serpent_neon_cbc_dec,.-_gcry_serpent_neon_cbc_dec;
+
+#endif
diff --git a/cipher/serpent.c b/cipher/serpent.c
index a8ee15f..cfda742 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -46,6 +46,15 @@
# endif
#endif
+/* USE_NEON indicates whether to enable ARM NEON assembly code. */
+#undef USE_NEON
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__)
+# if defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+ defined(HAVE_GCC_INLINE_ASM_NEON)
+# define USE_NEON 1
+# endif
+#endif
+
/* Number of rounds per Serpent encrypt/decrypt operation. */
#define ROUNDS 32
@@ -71,6 +80,9 @@ typedef struct serpent_context
#ifdef USE_AVX2
int use_avx2;
#endif
+#ifdef USE_NEON
+ int use_neon;
+#endif
} serpent_context_t;
@@ -114,6 +126,26 @@ extern void _gcry_serpent_avx2_cfb_dec(serpent_context_t *ctx,
unsigned char *iv);
#endif
+#ifdef USE_NEON
+/* Assembler implementations of Serpent using ARM NEON. Process 8 block in
+ parallel.
+ */
+extern void _gcry_serpent_neon_ctr_enc(serpent_context_t *ctx,
+ unsigned char *out,
+ const unsigned char *in,
+ unsigned char *ctr);
+
+extern void _gcry_serpent_neon_cbc_dec(serpent_context_t *ctx,
+ unsigned char *out,
+ const unsigned char *in,
+ unsigned char *iv);
+
+extern void _gcry_serpent_neon_cfb_dec(serpent_context_t *ctx,
+ unsigned char *out,
+ const unsigned char *in,
+ unsigned char *iv);
+#endif
+
/* A prototype. */
static const char *serpent_test (void);
@@ -634,6 +666,14 @@ serpent_setkey_internal (serpent_context_t *context,
}
#endif
+#ifdef USE_NEON
+ context->use_neon = 0;
+ if ((_gcry_get_hw_features () & HWF_ARM_NEON))
+ {
+ context->use_neon = 1;
+ }
+#endif
+
_gcry_burn_stack (272 * sizeof (u32));
}
@@ -861,6 +901,34 @@ _gcry_serpent_ctr_enc(void *context, unsigned char *ctr,
}
#endif
+#ifdef USE_NEON
+ if (ctx->use_neon)
+ {
+ int did_use_neon = 0;
+
+ /* Process data in 8 block chunks. */
+ while (nblocks >= 8)
+ {
+ _gcry_serpent_neon_ctr_enc(ctx, outbuf, inbuf, ctr);
+
+ nblocks -= 8;
+ outbuf += 8 * sizeof(serpent_block_t);
+ inbuf += 8 * sizeof(serpent_block_t);
+ did_use_neon = 1;
+ }
+
+ if (did_use_neon)
+ {
+ /* serpent-neon assembly code does not use stack */
+ if (nblocks == 0)
+ burn_stack_depth = 0;
+ }
+
+ /* Use generic code to handle smaller chunks... */
+ /* TODO: use caching instead? */
+ }
+#endif
+
for ( ;nblocks; nblocks-- )
{
/* Encrypt the counter. */
@@ -948,6 +1016,33 @@ _gcry_serpent_cbc_dec(void *context, unsigned char *iv,
}
#endif
+#ifdef USE_NEON
+ if (ctx->use_neon)
+ {
+ int did_use_neon = 0;
+
+ /* Process data in 8 block chunks. */
+ while (nblocks >= 8)
+ {
+ _gcry_serpent_neon_cbc_dec(ctx, outbuf, inbuf, iv);
+
+ nblocks -= 8;
+ outbuf += 8 * sizeof(serpent_block_t);
+ inbuf += 8 * sizeof(serpent_block_t);
+ did_use_neon = 1;
+ }
+
+ if (did_use_neon)
+ {
+ /* serpent-neon assembly code does not use stack */
+ if (nblocks == 0)
+ burn_stack_depth = 0;
+ }
+
+ /* Use generic code to handle smaller chunks... */
+ }
+#endif
+
for ( ;nblocks; nblocks-- )
{
/* INBUF is needed later and it may be identical to OUTBUF, so store
@@ -1028,6 +1123,33 @@ _gcry_serpent_cfb_dec(void *context, unsigned char *iv,
}
#endif
+#ifdef USE_NEON
+ if (ctx->use_neon)
+ {
+ int did_use_neon = 0;
+
+ /* Process data in 8 block chunks. */
+ while (nblocks >= 8)
+ {
+ _gcry_serpent_neon_cfb_dec(ctx, outbuf, inbuf, iv);
+
+ nblocks -= 8;
+ outbuf += 8 * sizeof(serpent_block_t);
+ inbuf += 8 * sizeof(serpent_block_t);
+ did_use_neon = 1;
+ }
+
+ if (did_use_neon)
+ {
+ /* serpent-neon assembly code does not use stack */
+ if (nblocks == 0)
+ burn_stack_depth = 0;
+ }
+
+ /* Use generic code to handle smaller chunks... */
+ }
+#endif
+
for ( ;nblocks; nblocks-- )
{
serpent_encrypt_internal(ctx, iv, iv);
diff --git a/configure.ac b/configure.ac
index 19c97bd..e3471d0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1502,6 +1502,11 @@ if test "$found" = "1" ; then
# Build with the AVX2 implementation
GCRYPT_CIPHERS="$GCRYPT_CIPHERS serpent-avx2-amd64.lo"
fi
+
+ if test x"$neonsupport" = xyes ; then
+ # Build with the NEON implementation
+ GCRYPT_CIPHERS="$GCRYPT_CIPHERS serpent-armv7-neon.lo"
+ fi
fi
LIST_MEMBER(rfc2268, $enabled_ciphers)
-----------------------------------------------------------------------
Summary of changes:
cipher/salsa20-armv7-neon.S | 2 +-
cipher/serpent-armv7-neon.S | 869 +++++++++++++++++++++++++++++++++++++++++++
cipher/serpent.c | 122 ++++++
configure.ac | 5 +
doc/gcrypt.texi | 86 ++---
5 files changed, 1040 insertions(+), 44 deletions(-)
create mode 100644 cipher/serpent-armv7-neon.S
hooks/post-receive
--
The GNU crypto library
http://git.gnupg.org
More information about the Gnupg-commits
mailing list