[git] GCRYPT - branch, master, updated. libgcrypt-1.5.0-344-g1faa618

by Jussi Kivilinna cvs@cvs.gnupg.org
Mon Oct 28 16:22:06 CET 2013


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".

The branch, master has been updated
       via  1faa61845f180bd47e037e400dde2d864ee83c89 (commit)
       via  2cb6e1f323d24359b1c5b113be5c2f79a2a4cded (commit)
      from  3ff9d2571c18cd7a34359f9c60a10d3b0f932b23 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 1faa61845f180bd47e037e400dde2d864ee83c89
Author: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Date:   Mon Oct 28 17:11:21 2013 +0200

    Fix typos in documentation
    
    * doc/gcrypt.texi: Fix some typos.
    --
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
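
The hunks below quote fragments of the manual's initialization example. For
reference, the sequence being corrected looks roughly like the sketch that
follows; it is reassembled from the quoted fragments plus the standard
gcry_control steps, not verbatim manual text, and the function name is
illustrative:

  #include <gcrypt.h>
  #include <stdio.h>
  #include <stdlib.h>

  static void
  init_libgcrypt (void)
  {
    /* Version check should be the very first call because it
       makes sure that important subsystems are initialized.  */
    if (!gcry_check_version (GCRYPT_VERSION))
      {
        fputs ("libgcrypt version mismatch\n", stderr);
        exit (2);
      }

    /* Suspend warnings about secure memory until options that might
       disable it have been parsed.  */
    gcry_control (GCRYCTL_SUSPEND_SECMEM_WARN);

    /* Allocate a pool of 16k secure memory.  This makes the secure
       memory available and also drops privileges where needed.  */
    gcry_control (GCRYCTL_INIT_SECMEM, 16384, 0);

    /* It is now okay to let Libgcrypt complain about secure memory
       problems.  */
    gcry_control (GCRYCTL_RESUME_SECMEM_WARN);

    /* Tell Libgcrypt that initialization has completed.  */
    gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);
  }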

diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi
index 91fe399..6dcb4b1 100644
--- a/doc/gcrypt.texi
+++ b/doc/gcrypt.texi
@@ -390,7 +390,7 @@ and freed memory, you need to initialize Libgcrypt this way:
 
 @example
   /* Version check should be the very first call because it
-     makes sure that important subsystems are intialized. */
+     makes sure that important subsystems are initialized. */
   if (!gcry_check_version (GCRYPT_VERSION))
     @{
       fputs ("libgcrypt version mismatch\n", stderr);
@@ -405,7 +405,7 @@ and freed memory, you need to initialize Libgcrypt this way:
 
   /* ... If required, other initialization goes here.  Note that the
      process might still be running with increased privileges and that
-     the secure memory has not been intialized.  */
+     the secure memory has not been initialized.  */
 
   /* Allocate a pool of 16k secure memory.  This make the secure memory
      available and also drops privileges where needed.  */
@@ -642,9 +642,9 @@ callbacks.
 @item GCRYCTL_ENABLE_QUICK_RANDOM; Arguments: none
 This command inhibits the use the very secure random quality level
 (@code{GCRY_VERY_STRONG_RANDOM}) and degrades all request down to
-@code{GCRY_STRONG_RANDOM}.  In general this is not recommened.  However,
+@code{GCRY_STRONG_RANDOM}.  In general this is not recommended.  However,
 for some applications the extra quality random Libgcrypt tries to create
-is not justified and this option may help to get better performace.
+is not justified and this option may help to get better performance.
 Please check with a crypto expert whether this option can be used for
 your application.
 
@@ -652,19 +652,19 @@ This option can only be used at initialization time.
 
 
 @item GCRYCTL_DUMP_RANDOM_STATS; Arguments: none
-This command dumps randum number generator related statistics to the
+This command dumps random number generator related statistics to the
 library's logging stream.
 
 @item GCRYCTL_DUMP_MEMORY_STATS; Arguments: none
-This command dumps memory managment related statistics to the library's
+This command dumps memory management related statistics to the library's
 logging stream.
 
 @item GCRYCTL_DUMP_SECMEM_STATS; Arguments: none
-This command dumps secure memory manamgent related statistics to the
+This command dumps secure memory management related statistics to the
 library's logging stream.
 
 @item GCRYCTL_DROP_PRIVS; Arguments: none
-This command disables the use of secure memory and drops the priviliges
+This command disables the use of secure memory and drops the privileges
 of the current process.  This command has not much use; the suggested way
 to disable secure memory is to use @code{GCRYCTL_DISABLE_SECMEM} right
 after initialization.
@@ -758,7 +758,7 @@ these different instances is correlated to some extent.  In a perfect
 attack scenario, the attacker can control (or at least guess) the PID
 and clock of the application, and drain the system's entropy pool to
 reduce the "up to 16 bytes" above to 0.  Then the dependencies of the
-inital states of the pools are completely known.  Note that this is not
+initial states of the pools are completely known.  Note that this is not
 an issue if random of @code{GCRY_VERY_STRONG_RANDOM} quality is
 requested as in this case enough extra entropy gets mixed.  It is also
 not an issue when using Linux (rndlinux driver), because this one
@@ -795,11 +795,11 @@ This command does nothing.  It exists only for backward compatibility.
 This command returns true if the library has been basically initialized.
 Such a basic initialization happens implicitly with many commands to get
 certain internal subsystems running.  The common and suggested way to
-do this basic intialization is by calling gcry_check_version.
+do this basic initialization is by calling gcry_check_version.
 
 @item GCRYCTL_INITIALIZATION_FINISHED; Arguments: none
 This command tells the library that the application has finished the
-intialization.
+initialization.
 
 @item GCRYCTL_INITIALIZATION_FINISHED_P; Arguments: none
 This command returns true if the command@*
@@ -825,7 +825,7 @@ proper random device.
 @item GCRYCTL_PRINT_CONFIG; Arguments: FILE *stream
 This command dumps information pertaining to the configuration of the
 library to the given stream.  If NULL is given for @var{stream}, the log
-system is used.  This command may be used before the intialization has
+system is used.  This command may be used before the initialization has
 been finished but not before a @code{gcry_check_version}.
 
 @item GCRYCTL_OPERATIONAL_P; Arguments: none
@@ -833,12 +833,12 @@ This command returns true if the library is in an operational state.
 This information makes only sense in FIPS mode.  In contrast to other
 functions, this is a pure test function and won't put the library into
 FIPS mode or change the internal state.  This command may be used before
-the intialization has been finished but not before a @code{gcry_check_version}.
+the initialization has been finished but not before a @code{gcry_check_version}.
 
 @item GCRYCTL_FIPS_MODE_P; Arguments: none
 This command returns true if the library is in FIPS mode.  Note, that
 this is no indication about the current state of the library.  This
-command may be used before the intialization has been finished but not
+command may be used before the initialization has been finished but not
 before a @code{gcry_check_version}.  An application may use this command or
 the convenience macro below to check whether FIPS mode is actually
 active.
@@ -857,7 +857,7 @@ already in FIPS mode, a self-test is triggered and thus the library will
 be put into operational state.  This command may be used before a call
 to @code{gcry_check_version} and that is actually the recommended way to let an
 application switch the library into FIPS mode.  Note that Libgcrypt will
-reject an attempt to switch to fips mode during or after the intialization.
+reject an attempt to switch to fips mode during or after the initialization.
 
 @item GCRYCTL_SET_ENFORCED_FIPS_FLAG; Arguments: none
 Running this command sets the internal flag that puts the library into
@@ -866,7 +866,7 @@ does not affect the library if the library is not put into the FIPS mode and
 it must be used before any other libgcrypt library calls that initialize
 the library such as @code{gcry_check_version}. Note that Libgcrypt will
 reject an attempt to switch to the enforced fips mode during or after
-the intialization.
+the initialization.
 
 @item GCRYCTL_SET_PREFERRED_RNG_TYPE; Arguments: int
 These are advisory commands to select a certain random number
@@ -875,7 +875,7 @@ an application actually wants or vice versa.  Thus Libgcrypt employs a
 priority check to select the actually used RNG.  If an applications
 selects a lower priority RNG but a library requests a higher priority
 RNG Libgcrypt will switch to the higher priority RNG.  Applications
-and libaries should use these control codes before
+and libraries should use these control codes before
 @code{gcry_check_version}.  The available generators are:
 @table @code
 @item GCRY_RNG_TYPE_STANDARD
@@ -907,8 +907,8 @@ success or an error code on failure.
 @item GCRYCTL_DISABLE_HWF; Arguments: const char *name
 
 Libgcrypt detects certain features of the CPU at startup time.  For
-performace tests it is sometimes required not to use such a feature.
-This option may be used to disabale a certain feature; i.e. Libgcrypt
+performance tests it is sometimes required not to use such a feature.
+This option may be used to disable a certain feature; i.e. Libgcrypt
 behaves as if this feature has not been detected.  Note that the
 detection code might be run if the feature has been disabled.  This
 command must be used at initialization time; i.e. before calling
@@ -1929,7 +1929,7 @@ checking.
 
 @deftypefun size_t gcry_cipher_get_algo_blklen (int @var{algo})
 
-This functions returns the blocklength of the algorithm @var{algo}
+This functions returns the block-length of the algorithm @var{algo}
 counted in octets.  On error @code{0} is returned.
 
 This is a convenience functions which should be preferred over
@@ -2292,7 +2292,7 @@ will be changed to implement 186-3.
 @item use-fips186-2
 @cindex FIPS 186-2
 Force the use of the FIPS 186-2 key generation algorithm instead of
-the default algorithm.  This algorithm is slighlty different from
+the default algorithm.  This algorithm is slightly different from
 FIPS 186-3 and allows only 1024 bit keys.  This flag is only meaningful
 for DSA and only required for FIPS testing backward compatibility.
 
@@ -4547,7 +4547,7 @@ Convenience function to release the @var{factors} array.
 
 @deftypefun gcry_error_t gcry_prime_check (gcry_mpi_t @var{p}, unsigned int @var{flags})
 
-Check wether the number @var{p} is prime.  Returns zero in case @var{p}
+Check whether the number @var{p} is prime.  Returns zero in case @var{p}
 is indeed a prime, returns @code{GPG_ERR_NO_PRIME} in case @var{p} is
 not a prime and a different error code in case something went horribly
 wrong.
@@ -4988,7 +4988,7 @@ checking function is exported as well.
 
 The generation of random prime numbers is based on the Lim and Lee
 algorithm to create practically save primes.@footnote{Chae Hoon Lim
-and Pil Joong Lee. A key recovery attack on discrete log-based shemes
+and Pil Joong Lee. A key recovery attack on discrete log-based schemes
 using a prime order subgroup. In Burton S. Kaliski Jr., editor,
 Advances in Cryptology: Crypto '97, pages 249-263, Berlin /
 Heidelberg / New York, 1997. Springer-Verlag.  Described on page 260.}
@@ -5147,7 +5147,7 @@ output blocks.
 
 On Unix like systems the @code{GCRY_VERY_STRONG_RANDOM} and
 @code{GCRY_STRONG_RANDOM} generators are keyed and seeded using the
-rndlinux module with the @file{/dev/radnom} device. Thus these
+rndlinux module with the @file{/dev/random} device. Thus these
 generators may block until the OS kernel has collected enough entropy.
 When used with Microsoft Windows the rndw32 module is used instead.
 
@@ -5162,7 +5162,7 @@ entropy for use by the ``real'' random generators.
 A self-test facility uses a separate context to check the
 functionality of the core X9.31 functions using a known answers test.
 During runtime each output block is compared to the previous one to
-detect a stucked generator.
+detect a stuck generator.
 
 The DT value for the generator is made up of the current time down to
 microseconds (if available) and a free running 64 bit counter.  When
@@ -5188,7 +5188,7 @@ incremented on each use.
 @c them.  To use an S-expression with Libgcrypt it needs first be
 @c converted into the internal representation used by Libgcrypt (the type
 @c @code{gcry_sexp_t}).  The conversion functions support a large subset
-@c of the S-expression specification and further fature a printf like
+@c of the S-expression specification and further feature a printf like
 @c function to convert a list of big integers or other binary data into
 @c an S-expression.
 @c
@@ -5357,8 +5357,8 @@ The result is verified using the public key against the original data
 and against modified data.  (@code{cipher/@/rsa.c:@/selftest_sign_1024})
 @item
 A 1000 bit random value is encrypted and checked that it does not
-match the orginal random value.  The encrtypted result is then
-decrypted and checked that it macthes the original random value.
+match the original random value.  The encrypted result is then
+decrypted and checked that it matches the original random value.
 (@code{cipher/@/rsa.c:@/selftest_encr_1024})
 @end enumerate
 
@@ -5401,7 +5401,7 @@ keys.  The table itself is protected using a SHA-1 hash.
 @c --------------------------------
 @section Conditional Tests
 
-The conditional tests are performed if a certain contidion is met.
+The conditional tests are performed if a certain condition is met.
 This may occur at any time; the library does not necessary enter the
 ``Self-Test'' state to run these tests but will transit to the
 ``Error'' state if a test failed.
@@ -5696,7 +5696,7 @@ documentation only.
 
 @item Power-On
 Libgcrypt is loaded into memory and API calls may be made.  Compiler
-introducted constructor functions may be run.  Note that Libgcrypt does
+introduced constructor functions may be run.  Note that Libgcrypt does
 not implement any arbitrary constructor functions to be called by the
 operating system
 
@@ -5721,7 +5721,7 @@ will automatically transit into the  Shutdown state.
 
 @item Shutdown
 Libgcrypt is about to be terminated and removed from the memory. The
-application may at this point still runing cleanup handlers.
+application may at this point still running cleanup handlers.
 
 @end table
 @end float
@@ -5738,18 +5738,18 @@ a shared library and having it linked to an application.
 
 @item 2
 Power-On to Init is triggered by the application calling the
-Libgcrypt intialization function @code{gcry_check_version}.
+Libgcrypt initialization function @code{gcry_check_version}.
 
 @item 3
-Init to Self-Test is either triggred by a dedicated API call or implicit
-by invoking a libgrypt service conrolled by the FSM.
+Init to Self-Test is either triggered by a dedicated API call or implicit
+by invoking a libgrypt service controlled by the FSM.
 
 @item 4
 Self-Test to Operational is triggered after all self-tests passed
 successfully.
 
 @item 5
-Operational to Shutdown is an artifical state without any direct action
+Operational to Shutdown is an artificial state without any direct action
 in Libgcrypt.  When reaching the Shutdown state the library is
 deinitialized and can't return to any other state again.
 
@@ -5770,7 +5770,7 @@ Error to Shutdown is similar to the Operational to Shutdown transition
 (5).
 
 @item 9
-Error to Fatal-Error is triggred if Libgrypt detects an fatal error
+Error to Fatal-Error is triggered if Libgrypt detects an fatal error
 while already being in Error state.
 
 @item 10
@@ -5778,26 +5778,26 @@ Fatal-Error to Shutdown is automatically entered by Libgcrypt
 after having reported the error.
 
 @item 11
-Power-On to Shutdown is an artifical state to document that Libgcrypt
-has not ye been initializaed but the process is about to terminate.
+Power-On to Shutdown is an artificial state to document that Libgcrypt
+has not ye been initialized but the process is about to terminate.
 
 @item 12
-Power-On to Fatal-Error will be triggerd if certain Libgcrypt functions
+Power-On to Fatal-Error will be triggered if certain Libgcrypt functions
 are used without having reached the Init state.
 
 @item 13
-Self-Test to Fatal-Error is triggred by severe errors in Libgcrypt while
+Self-Test to Fatal-Error is triggered by severe errors in Libgcrypt while
 running self-tests.
 
 @item 14
-Self-Test to Error is triggred by a failed self-test.
+Self-Test to Error is triggered by a failed self-test.
 
 @item 15
 Operational to Fatal-Error is triggered if Libcrypt encountered a
 non-recoverable error.
 
 @item 16
-Operational to Self-Test is triggred if the application requested to run
+Operational to Self-Test is triggered if the application requested to run
 the self-tests again.
 
 @item 17
@@ -5868,7 +5868,7 @@ memory and thus also the encryption contexts with these keys.
 
 GCRYCTL_SET_RANDOM_DAEMON_SOCKET
 GCRYCTL_USE_RANDOM_DAEMON
-The random damon is still a bit experimental, thus we do not document
+The random daemon is still a bit experimental, thus we do not document
 them.  Note that they should be used during initialization and that
 these functions are not really thread safe.
 

commit 2cb6e1f323d24359b1c5b113be5c2f79a2a4cded
Author: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Date:   Sun Oct 27 14:07:59 2013 +0200

    Add ARM NEON assembly implementation of Serpent
    
    * cipher/Makefile.am: Add 'serpent-armv7-neon.S'.
    * cipher/serpent-armv7-neon.S: New.
    * cipher/serpent.c (USE_NEON): New macro.
    (serpent_context_t) [USE_NEON]: Add 'use_neon'.
    [USE_NEON] (_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec)
    (_gcry_serpent_neon_cbc_dec): New prototypes.
    (serpent_setkey_internal) [USE_NEON]: Detect NEON support.
    (_gcry_serpent_neon_ctr_enc, _gcry_serpent_neon_cfb_dec)
    (_gcry_serpent_neon_cbc_dec) [USE_NEON]: Use NEON implementations
    to process eight blocks in parallel.
    * configure.ac [neonsupport]: Add 'serpent-armv7-neon.lo'.
    --
    
    Patch adds ARM NEON optimized implementation of Serpent cipher
    to speed up parallelizable bulk operations.
    
    Benchmarks on ARM Cortex-A8 (armhf, 1008 MHz):
    
    Old:
     SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte
            CBC dec |     43.53 ns/B     21.91 MiB/s     43.88 c/B
            CFB dec |     44.77 ns/B     21.30 MiB/s     45.13 c/B
            CTR enc |     45.21 ns/B     21.10 MiB/s     45.57 c/B
            CTR dec |     45.21 ns/B     21.09 MiB/s     45.57 c/B
    New:
     SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte
            CBC dec |     26.26 ns/B     36.32 MiB/s     26.47 c/B
            CFB dec |     26.21 ns/B     36.38 MiB/s     26.42 c/B
            CTR enc |     26.20 ns/B     36.40 MiB/s     26.41 c/B
            CTR dec |     26.20 ns/B     36.40 MiB/s     26.41 c/B
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
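
A note on the numbers: the cycles/byte column is just nanosecs/byte scaled by
the 1.008 GHz clock (26.20 ns/B x 1.008 = 26.41 c/B), so for CTR the NEON path
cuts the cost from roughly 45.6 to 26.4 cycles per byte, about a 1.7x speedup.
Applications get this transparently through the regular cipher API: whenever a
request covers at least eight blocks, the bulk functions in the patch take the
8-way path. A minimal sketch under those assumptions (error handling omitted;
the function name is illustrative):

  #include <gcrypt.h>

  /* Encrypt eight 16-byte blocks of buf in place with Serpent-128 in
     CTR mode; key and ctr are 16-byte arrays.  */
  static void
  ctr_encrypt_8_blocks (const unsigned char key[16],
                        unsigned char ctr[16], unsigned char buf[128])
  {
    gcry_cipher_hd_t hd;

    gcry_cipher_open (&hd, GCRY_CIPHER_SERPENT128, GCRY_CIPHER_MODE_CTR, 0);
    gcry_cipher_setkey (hd, key, 16);
    gcry_cipher_setctr (hd, ctr, 16);
    /* A full group of eight blocks takes the 8-way NEON path; shorter
       tails fall back to the generic C implementation.  */
    gcry_cipher_encrypt (hd, buf, 8 * 16, NULL, 0);
    gcry_cipher_close (hd);
  }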

diff --git a/cipher/salsa20-armv7-neon.S b/cipher/salsa20-armv7-neon.S
index 5b51301..7d31e9f 100644
--- a/cipher/salsa20-armv7-neon.S
+++ b/cipher/salsa20-armv7-neon.S
@@ -36,7 +36,7 @@
 .text
 
 .align 2
-.global _gcry_arm_neon_salsa20_encrypt
+.globl _gcry_arm_neon_salsa20_encrypt
 .type  _gcry_arm_neon_salsa20_encrypt,%function;
 _gcry_arm_neon_salsa20_encrypt:
 	/* Modifications:
diff --git a/cipher/serpent-armv7-neon.S b/cipher/serpent-armv7-neon.S
new file mode 100644
index 0000000..92e95a0
--- /dev/null
+++ b/cipher/serpent-armv7-neon.S
@@ -0,0 +1,869 @@
+/* serpent-armv7-neon.S  -  ARM/NEON assembly implementation of Serpent cipher
+ *
+ * Copyright © 2013 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__) && \
+    defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+    defined(HAVE_GCC_INLINE_ASM_NEON)
+
+.text
+
+.syntax unified
+.fpu neon
+.arm
+
+/* ARM registers */
+#define RROUND r0
+
+/* NEON vector registers */
+#define RA0 q0
+#define RA1 q1
+#define RA2 q2
+#define RA3 q3
+#define RA4 q4
+#define RB0 q5
+#define RB1 q6
+#define RB2 q7
+#define RB3 q8
+#define RB4 q9
+
+#define RT0 q10
+#define RT1 q11
+#define RT2 q12
+#define RT3 q13
+
+#define RA0d0 d0
+#define RA0d1 d1
+#define RA1d0 d2
+#define RA1d1 d3
+#define RA2d0 d4
+#define RA2d1 d5
+#define RA3d0 d6
+#define RA3d1 d7
+#define RA4d0 d8
+#define RA4d1 d9
+#define RB0d0 d10
+#define RB0d1 d11
+#define RB1d0 d12
+#define RB1d1 d13
+#define RB2d0 d14
+#define RB2d1 d15
+#define RB3d0 d16
+#define RB3d1 d17
+#define RB4d0 d18
+#define RB4d1 d19
+#define RT0d0 d20
+#define RT0d1 d21
+#define RT1d0 d22
+#define RT1d1 d23
+#define RT2d0 d24
+#define RT2d1 d25
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+#define transpose_4x4(_q0, _q1, _q2, _q3) \
+	vtrn.32 _q0, _q1;	\
+	vtrn.32 _q2, _q3;	\
+	vswp _q0##d1, _q2##d0;	\
+	vswp _q1##d1, _q3##d0;
+
+/**********************************************************************
+  8-way serpent
+ **********************************************************************/
+
+/*
+ * These are the S-Boxes of Serpent from following research paper.
+ *
+ *  D. A. Osvik, “Speeding up Serpent,” in Third AES Candidate Conference,
+ *   (New York, New York, USA), p. 317–329, National Institute of Standards and
+ *   Technology, 2000.
+ *
+ * Paper is also available at: http://www.ii.uib.no/~osvik/pub/aes3.pdf
+ *
+ */
+#define SBOX0(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	veor	a3, a3, a0;	veor	b3, b3, b0;	vmov	a4, a1;		vmov	b4, b1;		\
+	vand	a1, a1, a3;	vand	b1, b1, b3;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	veor	a1, a1, a0;	veor	b1, b1, b0;	vorr	a0, a0, a3;	vorr	b0, b0, b3;	\
+	veor	a0, a0, a4;	veor	b0, b0, b4;	veor	a4, a4, a3;	veor	b4, b4, b3;	\
+	veor	a3, a3, a2;	veor	b3, b3, b2;	vorr	a2, a2, a1;	vorr	b2, b2, b1;	\
+	veor	a2, a2, a4;	veor	b2, b2, b4;	vmvn	a4, a4;		vmvn	b4, b4;		\
+	vorr	a4, a4, a1;	vorr	b4, b4, b1;	veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;	vorr	a3, a3, a0;	vorr	b3, b3, b0;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	veor	a4, a3;		veor	b4, b3;
+
+#define SBOX0_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmvn	a2, a2;		vmvn	b2, b2;		vmov	a4, a1;		vmov	b4, b1;		\
+	vorr	a1, a1, a0;	vorr	b1, b1, b0;	vmvn	a4, a4;		vmvn	b4, b4;		\
+	veor	a1, a1, a2;	veor	b1, b1, b2;	vorr	a2, a2, a4;	vorr	b2, b2, b4;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	veor	a0, a0, a4;	veor	b0, b0, b4;	\
+	veor	a2, a2, a0;	veor	b2, b2, b0;	vand	a0, a0, a3;	vand	b0, b0, b3;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	vorr	a0, a0, a1;	vorr	b0, b0, b1;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	veor	a3, a3, a4;	veor	b3, b3, b4;	\
+	veor	a2, a2, a1;	veor	b2, b2, b1;	veor	a3, a3, a0;	veor	b3, b3, b0;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;\
+	vand	a2, a2, a3;	vand	b2, b2, b3;\
+	veor	a4, a2;	veor	b4, b2;
+
+#define SBOX1(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmvn	a0, a0;		vmvn	b0, b0;		vmvn	a2, a2;		vmvn	b2, b2;		\
+	vmov	a4, a0;		vmov	b4, b0;		vand	a0, a0, a1;	vand	b0, b0, b1;	\
+	veor	a2, a2, a0;	veor	b2, b2, b0;	vorr	a0, a0, a3;	vorr	b0, b0, b3;	\
+	veor	a3, a3, a2;	veor	b3, b3, b2;	veor	a1, a1, a0;	veor	b1, b1, b0;	\
+	veor	a0, a0, a4;	veor	b0, b0, b4;	vorr	a4, a4, a1;	vorr	b4, b4, b1;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	vorr	a2, a2, a0;	vorr	b2, b2, b0;	\
+	vand	a2, a2, a4;	vand	b2, b2, b4;	veor	a0, a0, a1;	veor	b0, b0, b1;	\
+	vand	a1, a1, a2;	vand	b1, b1, b2;\
+	veor	a1, a1, a0;	veor	b1, b1, b0;	vand	a0, a0, a2;	vand	b0, b0, b2;	\
+	veor	a0, a4;		veor	b0, b4;
+
+#define SBOX1_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a1;		vmov	b4, b1;		veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	vand	a3, a3, a1;	vand	b3, b3, b1;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	veor	a3, a3, a0;	veor	b3, b3, b0;	vorr	a0, a0, a1;	vorr	b0, b0, b1;	\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	veor	a0, a0, a4;	veor	b0, b0, b4;	\
+	vorr	a0, a0, a2;	vorr	b0, b0, b2;	veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	veor	a0, a0, a1;	veor	b0, b0, b1;	vorr	a1, a1, a3;	vorr	b1, b1, b3;	\
+	veor	a1, a1, a0;	veor	b1, b1, b0;	vmvn	a4, a4;		vmvn	b4, b4;		\
+	veor	a4, a4, a1;	veor	b4, b4, b1;	vorr	a1, a1, a0;	vorr	b1, b1, b0;	\
+	veor	a1, a1, a0;	veor	b1, b1, b0;\
+	vorr	a1, a1, a4;	vorr	b1, b1, b4;\
+	veor	a3, a1;		veor	b3, b1;
+
+#define SBOX2(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a0;		vmov	b4, b0;		vand	a0, a0, a2;	vand	b0, b0, b2;	\
+	veor	a0, a0, a3;	veor	b0, b0, b3;	veor	a2, a2, a1;	veor	b2, b2, b1;	\
+	veor	a2, a2, a0;	veor	b2, b2, b0;	vorr	a3, a3, a4;	vorr	b3, b3, b4;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	vmov	a1, a3;		vmov	b1, b3;		vorr	a3, a3, a4;	vorr	b3, b3, b4;	\
+	veor	a3, a3, a0;	veor	b3, b3, b0;	vand	a0, a0, a1;	vand	b0, b0, b1;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;	vmvn	a4, a4;		vmvn	b4, b4;
+
+#define SBOX2_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	veor	a2, a2, a3;	veor	b2, b2, b3;	veor	a3, a3, a0;	veor	b3, b3, b0;	\
+	vmov	a4, a3;		vmov	b4, b3;		vand	a3, a3, a2;	vand	b3, b3, b2;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;	vorr	a1, a1, a2;	vorr	b1, b1, b2;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;	vand	a4, a4, a3;	vand	b4, b4, b3;	\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	vand	a4, a4, a0;	vand	b4, b4, b0;	\
+	veor	a4, a4, a2;	veor	b4, b4, b2;	vand	a2, a2, a1;	vand	b2, b2, b1;	\
+	vorr	a2, a2, a0;	vorr	b2, b2, b0;	vmvn	a3, a3;		vmvn	b3, b3;		\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	veor	a0, a0, a3;	veor	b0, b0, b3;	\
+	vand	a0, a0, a1;	vand	b0, b0, b1;	veor	a3, a3, a4;	veor	b3, b3, b4;	\
+	veor	a3, a0;		veor	b3, b0;
+
+#define SBOX3(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a0;		vmov	b4, b0;		vorr	a0, a0, a3;	vorr	b0, b0, b3;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;	vand	a1, a1, a4;	vand	b1, b1, b4;	\
+	veor	a4, a4, a2;	veor	b4, b4, b2;	veor	a2, a2, a3;	veor	b2, b2, b3;	\
+	vand	a3, a3, a0;	vand	b3, b3, b0;	vorr	a4, a4, a1;	vorr	b4, b4, b1;	\
+	veor	a3, a3, a4;	veor	b3, b3, b4;	veor	a0, a0, a1;	veor	b0, b0, b1;	\
+	vand	a4, a4, a0;	vand	b4, b4, b0;	veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	veor	a4, a4, a2;	veor	b4, b4, b2;	vorr	a1, a1, a0;	vorr	b1, b1, b0;	\
+	veor	a1, a1, a2;	veor	b1, b1, b2;	veor	a0, a0, a3;	veor	b0, b0, b3;	\
+	vmov	a2, a1;		vmov	b2, b1;		vorr	a1, a1, a3;	vorr	b1, b1, b3;	\
+	veor	a1, a0;		veor	b1, b0;
+
+#define SBOX3_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a2;		vmov	b4, b2;		veor	a2, a2, a1;	veor	b2, b2, b1;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	vand	a4, a4, a2;	vand	b4, b4, b2;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	vand	a0, a0, a1;	vand	b0, b0, b1;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	vorr	a3, a3, a4;	vorr	b3, b3, b4;	\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	veor	a0, a0, a3;	veor	b0, b0, b3;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;	vand	a3, a3, a2;	vand	b3, b3, b2;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;	veor	a1, a1, a0;	veor	b1, b1, b0;	\
+	vorr	a1, a1, a2;	vorr	b1, b1, b2;	veor	a0, a0, a3;	veor	b0, b0, b3;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;\
+	veor	a0, a1;		veor	b0, b1;
+
+#define SBOX4(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	veor	a1, a1, a3;	veor	b1, b1, b3;	vmvn	a3, a3;		vmvn	b3, b3;		\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	veor	a3, a3, a0;	veor	b3, b3, b0;	\
+	vmov	a4, a1;		vmov	b4, b1;		vand	a1, a1, a3;	vand	b1, b1, b3;	\
+	veor	a1, a1, a2;	veor	b1, b1, b2;	veor	a4, a4, a3;	veor	b4, b4, b3;	\
+	veor	a0, a0, a4;	veor	b0, b0, b4;	vand	a2, a2, a4;	vand	b2, b2, b4;	\
+	veor	a2, a2, a0;	veor	b2, b2, b0;	vand	a0, a0, a1;	vand	b0, b0, b1;	\
+	veor	a3, a3, a0;	veor	b3, b3, b0;	vorr	a4, a4, a1;	vorr	b4, b4, b1;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	vorr	a0, a0, a3;	vorr	b0, b0, b3;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	vand	a2, a2, a3;	vand	b2, b2, b3;	\
+	vmvn	a0, a0;		vmvn	b0, b0;		veor	a4, a2;		veor	b4, b2;
+
+#define SBOX4_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a2;		vmov	b4, b2;		vand	a2, a2, a3;	vand	b2, b2, b3;	\
+	veor	a2, a2, a1;	veor	b2, b2, b1;	vorr	a1, a1, a3;	vorr	b1, b1, b3;	\
+	vand	a1, a1, a0;	vand	b1, b1, b0;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	veor	a4, a4, a1;	veor	b4, b4, b1;	vand	a1, a1, a2;	vand	b1, b1, b2;	\
+	vmvn	a0, a0;		vmvn	b0, b0;		veor	a3, a3, a4;	veor	b3, b3, b4;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	vand	a3, a3, a0;	vand	b3, b3, b0;	\
+	veor	a3, a3, a2;	veor	b3, b3, b2;	veor	a0, a0, a1;	veor	b0, b0, b1;	\
+	vand	a2, a2, a0;	vand	b2, b2, b0;	veor	a3, a3, a0;	veor	b3, b3, b0;	\
+	veor	a2, a2, a4;	veor	b2, b2, b4;\
+	vorr	a2, a2, a3;	vorr	b2, b2, b3;	veor	a3, a3, a0;	veor	b3, b3, b0;	\
+	veor	a2, a1;		veor	b2, b1;
+
+#define SBOX5(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	veor	a0, a0, a1;	veor	b0, b0, b1;	veor	a1, a1, a3;	veor	b1, b1, b3;	\
+	vmvn	a3, a3;		vmvn	b3, b3;		vmov	a4, a1;		vmov	b4, b1;		\
+	vand	a1, a1, a0;	vand	b1, b1, b0;	veor	a2, a2, a3;	veor	b2, b2, b3;	\
+	veor	a1, a1, a2;	veor	b1, b1, b2;	vorr	a2, a2, a4;	vorr	b2, b2, b4;	\
+	veor	a4, a4, a3;	veor	b4, b4, b3;	vand	a3, a3, a1;	vand	b3, b3, b1;	\
+	veor	a3, a3, a0;	veor	b3, b3, b0;	veor	a4, a4, a1;	veor	b4, b4, b1;	\
+	veor	a4, a4, a2;	veor	b4, b4, b2;	veor	a2, a2, a0;	veor	b2, b2, b0;	\
+	vand	a0, a0, a3;	vand	b0, b0, b3;	vmvn	a2, a2;		vmvn	b2, b2;		\
+	veor	a0, a0, a4;	veor	b0, b0, b4;	vorr	a4, a4, a3;	vorr	b4, b4, b3;	\
+	veor	a2, a4;		veor	b2, b4;
+
+#define SBOX5_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmvn	a1, a1;		vmvn	b1, b1;		vmov	a4, a3;		vmov	b4, b3;		\
+	veor	a2, a2, a1;	veor	b2, b2, b1;	vorr	a3, a3, a0;	vorr	b3, b3, b0;	\
+	veor	a3, a3, a2;	veor	b3, b3, b2;	vorr	a2, a2, a1;	vorr	b2, b2, b1;	\
+	vand	a2, a2, a0;	vand	b2, b2, b0;	veor	a4, a4, a3;	veor	b4, b4, b3;	\
+	veor	a2, a2, a4;	veor	b2, b2, b4;	vorr	a4, a4, a0;	vorr	b4, b4, b0;	\
+	veor	a4, a4, a1;	veor	b4, b4, b1;	vand	a1, a1, a2;	vand	b1, b1, b2;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	vand	a3, a3, a4;	vand	b3, b3, b4;	veor	a4, a4, a1;	veor	b4, b4, b1;	\
+	veor	a3, a3, a4;	veor	b3, b3, b4;	vmvn	a4, a4;		vmvn	b4, b4;		\
+	veor	a3, a0;		veor	b3, b0;
+
+#define SBOX6(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmvn	a2, a2;		vmvn	b2, b2;		vmov	a4, a3;		vmov	b4, b3;		\
+	vand	a3, a3, a0;	vand	b3, b3, b0;	veor	a0, a0, a4;	veor	b0, b0, b4;	\
+	veor	a3, a3, a2;	veor	b3, b3, b2;	vorr	a2, a2, a4;	vorr	b2, b2, b4;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	veor	a2, a2, a0;	veor	b2, b2, b0;	\
+	vorr	a0, a0, a1;	vorr	b0, b0, b1;	veor	a2, a2, a1;	veor	b2, b2, b1;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	vorr	a0, a0, a3;	vorr	b0, b0, b3;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	veor	a4, a4, a3;	veor	b4, b4, b3;	\
+	veor	a4, a4, a0;	veor	b4, b4, b0;	vmvn	a3, a3;		vmvn	b3, b3;		\
+	vand	a2, a2, a4;	vand	b2, b2, b4;\
+	veor	a2, a3;		veor	b2, b3;
+
+#define SBOX6_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	veor	a0, a0, a2;	veor	b0, b0, b2;	vmov	a4, a2;		vmov	b4, b2;		\
+	vand	a2, a2, a0;	vand	b2, b2, b0;	veor	a4, a4, a3;	veor	b4, b4, b3;	\
+	vmvn	a2, a2;		vmvn	b2, b2;		veor	a3, a3, a1;	veor	b3, b3, b1;	\
+	veor	a2, a2, a3;	veor	b2, b2, b3;	vorr	a4, a4, a0;	vorr	b4, b4, b0;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	veor	a3, a3, a4;	veor	b3, b3, b4;	\
+	veor	a4, a4, a1;	veor	b4, b4, b1;	vand	a1, a1, a3;	vand	b1, b1, b3;	\
+	veor	a1, a1, a0;	veor	b1, b1, b0;	veor	a0, a0, a3;	veor	b0, b0, b3;	\
+	vorr	a0, a0, a2;	vorr	b0, b0, b2;	veor	a3, a3, a1;	veor	b3, b3, b1;	\
+	veor	a4, a0;		veor	b4, b0;
+
+#define SBOX7(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a1;		vmov	b4, b1;		vorr	a1, a1, a2;	vorr	b1, b1, b2;	\
+	veor	a1, a1, a3;	veor	b1, b1, b3;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	veor	a2, a2, a1;	veor	b2, b2, b1;	vorr	a3, a3, a4;	vorr	b3, b3, b4;	\
+	vand	a3, a3, a0;	vand	b3, b3, b0;	veor	a4, a4, a2;	veor	b4, b4, b2;	\
+	veor	a3, a3, a1;	veor	b3, b3, b1;	vorr	a1, a1, a4;	vorr	b1, b1, b4;	\
+	veor	a1, a1, a0;	veor	b1, b1, b0;	vorr	a0, a0, a4;	vorr	b0, b0, b4;	\
+	veor	a0, a0, a2;	veor	b0, b0, b2;	veor	a1, a1, a4;	veor	b1, b1, b4;	\
+	veor	a2, a2, a1;	veor	b2, b2, b1;	vand	a1, a1, a0;	vand	b1, b1, b0;	\
+	veor	a1, a1, a4;	veor	b1, b1, b4;	vmvn	a2, a2;		vmvn	b2, b2;		\
+	vorr	a2, a2, a0;	vorr	b2, b2, b0;\
+	veor	a4, a2;		veor	b4, b2;
+
+#define SBOX7_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vmov	a4, a2;		vmov	b4, b2;		veor	a2, a2, a0;	veor	b2, b2, b0;	\
+	vand	a0, a0, a3;	vand	b0, b0, b3;	vorr	a4, a4, a3;	vorr	b4, b4, b3;	\
+	vmvn	a2, a2;		vmvn	b2, b2;		veor	a3, a3, a1;	veor	b3, b3, b1;	\
+	vorr	a1, a1, a0;	vorr	b1, b1, b0;	veor	a0, a0, a2;	veor	b0, b0, b2;	\
+	vand	a2, a2, a4;	vand	b2, b2, b4;	vand	a3, a3, a4;	vand	b3, b3, b4;	\
+	veor	a1, a1, a2;	veor	b1, b1, b2;	veor	a2, a2, a0;	veor	b2, b2, b0;	\
+	vorr	a0, a0, a2;	vorr	b0, b0, b2;	veor	a4, a4, a1;	veor	b4, b4, b1;	\
+	veor	a0, a0, a3;	veor	b0, b0, b3;	veor	a3, a3, a4;	veor	b3, b3, b4;	\
+	vorr	a4, a4, a0;	vorr	b4, b4, b0;	veor	a3, a3, a2;	veor	b3, b3, b2;	\
+	veor	a4, a2;		veor	b4, b2;
+
+/* Apply SBOX number WHICH to to the block.  */
+#define SBOX(which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	SBOX##which (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4)
+
+/* Apply inverse SBOX number WHICH to to the block.  */
+#define SBOX_INVERSE(which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	SBOX##which##_INVERSE (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4)
+
+/* XOR round key into block state in a0,a1,a2,a3. a4 used as temporary.  */
+#define BLOCK_XOR_KEY(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vdup.32 RT3, RT0d0[0]; \
+	vdup.32 RT1, RT0d0[1]; \
+	vdup.32 RT2, RT0d1[0]; \
+	vdup.32 RT0, RT0d1[1]; \
+	veor a0, a0, RT3;	veor b0, b0, RT3; \
+	veor a1, a1, RT1;	veor b1, b1, RT1; \
+	veor a2, a2, RT2;	veor b2, b2, RT2; \
+	veor a3, a3, RT0;	veor b3, b3, RT0;
+
+#define BLOCK_LOAD_KEY_ENC() \
+	vld1.8 {RT0d0, RT0d1}, [RROUND]!;
+
+#define BLOCK_LOAD_KEY_DEC() \
+	vld1.8 {RT0d0, RT0d1}, [RROUND]; \
+	sub RROUND, RROUND, #16
+
+/* Apply the linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vshl.u32	a4, a0, #13;		vshl.u32	b4, b0, #13;		\
+	vshr.u32	a0, a0, #(32-13);	vshr.u32	b0, b0, #(32-13);	\
+	veor		a0, a0, a4;		veor		b0, b0, b4;		\
+	vshl.u32	a4, a2, #3;		vshl.u32	b4, b2, #3;		\
+	vshr.u32	a2, a2, #(32-3);	vshr.u32	b2, b2, #(32-3);	\
+	veor		a2, a2, a4;		veor		b2, b2, b4;		\
+	veor		a1, a0, a1;		veor		b1, b0, b1;		\
+	veor		a1, a2, a1;		veor		b1, b2, b1;		\
+	vshl.u32	a4, a0, #3;		vshl.u32	b4, b0, #3;		\
+	veor		a3, a2, a3;		veor		b3, b2, b3;		\
+	veor		a3, a4, a3;		veor		b3, b4, b3;		\
+	vshl.u32	a4, a1, #1;		vshl.u32	b4, b1, #1;		\
+	vshr.u32	a1, a1, #(32-1);	vshr.u32	b1, b1, #(32-1);	\
+	veor		a1, a1, a4;		veor		b1, b1, b4;		\
+	vshl.u32	a4, a3, #7;		vshl.u32	b4, b3, #7;		\
+	vshr.u32	a3, a3, #(32-7);	vshr.u32	b3, b3, #(32-7);	\
+	veor		a3, a3, a4;		veor		b3, b3, b4;		\
+	veor		a0, a1, a0;		veor		b0, b1, b0;		\
+	veor		a0, a3, a0;		veor		b0, b3, b0;		\
+	vshl.u32	a4, a1, #7;		vshl.u32	b4, b1, #7;		\
+	veor		a2, a3, a2;		veor		b2, b3, b2;		\
+	veor		a2, a4, a2;		veor		b2, b4, b2;		\
+	vshl.u32	a4, a0, #5;		vshl.u32	b4, b0, #5;		\
+	vshr.u32	a0, a0, #(32-5);	vshr.u32	b0, b0, #(32-5);	\
+	veor		a0, a0, a4;		veor		b0, b0, b4;		\
+	vshl.u32	a4, a2, #22;		vshl.u32	b4, b2, #22;		\
+	vshr.u32	a2, a2, #(32-22);	vshr.u32	b2, b2, #(32-22);	\
+	veor		a2, a2, a4;		veor		b2, b2, b4;
+
+/* Apply the inverse linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION_INVERSE(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4) \
+	vshr.u32	a4, a2, #22;		vshr.u32	b4, b2, #22;		\
+	vshl.u32	a2, a2, #(32-22);	vshl.u32	b2, b2, #(32-22);	\
+	veor		a2, a2, a4;		veor		b2, b2, b4;		\
+	vshr.u32	a4, a0, #5;		vshr.u32	b4, b0, #5;		\
+	vshl.u32	a0, a0, #(32-5);	vshl.u32	b0, b0, #(32-5);	\
+	veor		a0, a0, a4;		veor		b0, b0, b4;		\
+	vshl.u32	a4, a1, #7;		vshl.u32	b4, b1, #7;		\
+	veor		a2, a3, a2;		veor		b2, b3, b2;		\
+	veor		a2, a4, a2;		veor		b2, b4, b2;		\
+	veor		a0, a1, a0;		veor		b0, b1, b0;		\
+	veor		a0, a3, a0;		veor		b0, b3, b0;		\
+	vshr.u32	a4, a3, #7;		vshr.u32	b4, b3, #7;		\
+	vshl.u32	a3, a3, #(32-7);	vshl.u32	b3, b3, #(32-7);	\
+	veor		a3, a3, a4;		veor		b3, b3, b4;		\
+	vshr.u32	a4, a1, #1;		vshr.u32	b4, b1, #1;		\
+	vshl.u32	a1, a1, #(32-1);	vshl.u32	b1, b1, #(32-1);	\
+	veor		a1, a1, a4;		veor		b1, b1, b4;		\
+	vshl.u32	a4, a0, #3;		vshl.u32	b4, b0, #3;		\
+	veor		a3, a2, a3;		veor		b3, b2, b3;		\
+	veor		a3, a4, a3;		veor		b3, b4, b3;		\
+	veor		a1, a0, a1;		veor		b1, b0, b1;		\
+	veor		a1, a2, a1;		veor		b1, b2, b1;		\
+	vshr.u32	a4, a2, #3;		vshr.u32	b4, b2, #3;		\
+	vshl.u32	a2, a2, #(32-3);	vshl.u32	b2, b2, #(32-3);	\
+	veor		a2, a2, a4;		veor		b2, b2, b4;		\
+	vshr.u32	a4, a0, #13;		vshr.u32	b4, b0, #13;		\
+	vshl.u32	a0, a0, #(32-13);	vshl.u32	b0, b0, #(32-13);	\
+	veor		a0, a0, a4;		veor		b0, b0, b4;
+
+/* Apply a Serpent round to eight parallel blocks.  This macro increments
+   `round'.  */
+#define ROUND(round, which, a0, a1, a2, a3, a4, na0, na1, na2, na3, na4, \
+			    b0, b1, b2, b3, b4, nb0, nb1, nb2, nb3, nb4) \
+	BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);		\
+	BLOCK_LOAD_KEY_ENC ();						\
+	SBOX (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);		\
+	LINEAR_TRANSFORMATION (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);
+
+/* Apply the last Serpent round to eight parallel blocks.  This macro increments
+   `round'.  */
+#define ROUND_LAST(round, which, a0, a1, a2, a3, a4, na0, na1, na2, na3, na4, \
+				 b0, b1, b2, b3, b4, nb0, nb1, nb2, nb3, nb4) \
+	BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);		\
+	BLOCK_LOAD_KEY_ENC ();						\
+	SBOX (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);		\
+	BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);
+
+/* Apply an inverse Serpent round to eight parallel blocks.  This macro
+   increments `round'.  */
+#define ROUND_INVERSE(round, which, a0, a1, a2, a3, a4, \
+				    na0, na1, na2, na3, na4, \
+				    b0, b1, b2, b3, b4, \
+				    nb0, nb1, nb2, nb3, nb4) \
+	LINEAR_TRANSFORMATION_INVERSE (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);	\
+	SBOX_INVERSE (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);		\
+	BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);	\
+	BLOCK_LOAD_KEY_DEC ();
+
+/* Apply the first inverse Serpent round to eight parallel blocks.  This macro
+   increments `round'.  */
+#define ROUND_FIRST_INVERSE(round, which, a0, a1, a2, a3, a4, \
+					  na0, na1, na2, na3, na4, \
+					  b0, b1, b2, b3, b4, \
+					  nb0, nb1, nb2, nb3, nb4) \
+	BLOCK_XOR_KEY (a0, a1, a2, a3, a4, b0, b1, b2, b3, b4);			\
+	BLOCK_LOAD_KEY_DEC ();							\
+	SBOX_INVERSE (which, a0, a1, a2, a3, a4, b0, b1, b2, b3, b4); 		\
+	BLOCK_XOR_KEY (na0, na1, na2, na3, na4, nb0, nb1, nb2, nb3, nb4);	\
+	BLOCK_LOAD_KEY_DEC ();
+
+.align 3
+.type __serpent_enc_blk8,%function;
+__serpent_enc_blk8:
+	/* input:
+	 *	r0: round key pointer
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext
+	 *						blocks
+	 * output:
+	 *	RA4, RA1, RA2, RA0, RB4, RB1, RB2, RB0: eight parallel
+	 * 						ciphertext blocks
+	 */
+
+	transpose_4x4(RA0, RA1, RA2, RA3);
+	BLOCK_LOAD_KEY_ENC ();
+	transpose_4x4(RB0, RB1, RB2, RB3);
+
+	ROUND (0, 0, RA0, RA1, RA2, RA3, RA4, RA1, RA4, RA2, RA0, RA3,
+		     RB0, RB1, RB2, RB3, RB4, RB1, RB4, RB2, RB0, RB3);
+	ROUND (1, 1, RA1, RA4, RA2, RA0, RA3, RA2, RA1, RA0, RA4, RA3,
+		     RB1, RB4, RB2, RB0, RB3, RB2, RB1, RB0, RB4, RB3);
+	ROUND (2, 2, RA2, RA1, RA0, RA4, RA3, RA0, RA4, RA1, RA3, RA2,
+		     RB2, RB1, RB0, RB4, RB3, RB0, RB4, RB1, RB3, RB2);
+	ROUND (3, 3, RA0, RA4, RA1, RA3, RA2, RA4, RA1, RA3, RA2, RA0,
+		     RB0, RB4, RB1, RB3, RB2, RB4, RB1, RB3, RB2, RB0);
+	ROUND (4, 4, RA4, RA1, RA3, RA2, RA0, RA1, RA0, RA4, RA2, RA3,
+		     RB4, RB1, RB3, RB2, RB0, RB1, RB0, RB4, RB2, RB3);
+	ROUND (5, 5, RA1, RA0, RA4, RA2, RA3, RA0, RA2, RA1, RA4, RA3,
+		     RB1, RB0, RB4, RB2, RB3, RB0, RB2, RB1, RB4, RB3);
+	ROUND (6, 6, RA0, RA2, RA1, RA4, RA3, RA0, RA2, RA3, RA1, RA4,
+		     RB0, RB2, RB1, RB4, RB3, RB0, RB2, RB3, RB1, RB4);
+	ROUND (7, 7, RA0, RA2, RA3, RA1, RA4, RA4, RA1, RA2, RA0, RA3,
+		     RB0, RB2, RB3, RB1, RB4, RB4, RB1, RB2, RB0, RB3);
+	ROUND (8, 0, RA4, RA1, RA2, RA0, RA3, RA1, RA3, RA2, RA4, RA0,
+		     RB4, RB1, RB2, RB0, RB3, RB1, RB3, RB2, RB4, RB0);
+	ROUND (9, 1, RA1, RA3, RA2, RA4, RA0, RA2, RA1, RA4, RA3, RA0,
+		     RB1, RB3, RB2, RB4, RB0, RB2, RB1, RB4, RB3, RB0);
+	ROUND (10, 2, RA2, RA1, RA4, RA3, RA0, RA4, RA3, RA1, RA0, RA2,
+		      RB2, RB1, RB4, RB3, RB0, RB4, RB3, RB1, RB0, RB2);
+	ROUND (11, 3, RA4, RA3, RA1, RA0, RA2, RA3, RA1, RA0, RA2, RA4,
+		      RB4, RB3, RB1, RB0, RB2, RB3, RB1, RB0, RB2, RB4);
+	ROUND (12, 4, RA3, RA1, RA0, RA2, RA4, RA1, RA4, RA3, RA2, RA0,
+		      RB3, RB1, RB0, RB2, RB4, RB1, RB4, RB3, RB2, RB0);
+	ROUND (13, 5, RA1, RA4, RA3, RA2, RA0, RA4, RA2, RA1, RA3, RA0,
+		      RB1, RB4, RB3, RB2, RB0, RB4, RB2, RB1, RB3, RB0);
+	ROUND (14, 6, RA4, RA2, RA1, RA3, RA0, RA4, RA2, RA0, RA1, RA3,
+		      RB4, RB2, RB1, RB3, RB0, RB4, RB2, RB0, RB1, RB3);
+	ROUND (15, 7, RA4, RA2, RA0, RA1, RA3, RA3, RA1, RA2, RA4, RA0,
+		      RB4, RB2, RB0, RB1, RB3, RB3, RB1, RB2, RB4, RB0);
+	ROUND (16, 0, RA3, RA1, RA2, RA4, RA0, RA1, RA0, RA2, RA3, RA4,
+		      RB3, RB1, RB2, RB4, RB0, RB1, RB0, RB2, RB3, RB4);
+	ROUND (17, 1, RA1, RA0, RA2, RA3, RA4, RA2, RA1, RA3, RA0, RA4,
+		      RB1, RB0, RB2, RB3, RB4, RB2, RB1, RB3, RB0, RB4);
+	ROUND (18, 2, RA2, RA1, RA3, RA0, RA4, RA3, RA0, RA1, RA4, RA2,
+		      RB2, RB1, RB3, RB0, RB4, RB3, RB0, RB1, RB4, RB2);
+	ROUND (19, 3, RA3, RA0, RA1, RA4, RA2, RA0, RA1, RA4, RA2, RA3,
+		      RB3, RB0, RB1, RB4, RB2, RB0, RB1, RB4, RB2, RB3);
+	ROUND (20, 4, RA0, RA1, RA4, RA2, RA3, RA1, RA3, RA0, RA2, RA4,
+		      RB0, RB1, RB4, RB2, RB3, RB1, RB3, RB0, RB2, RB4);
+	ROUND (21, 5, RA1, RA3, RA0, RA2, RA4, RA3, RA2, RA1, RA0, RA4,
+		      RB1, RB3, RB0, RB2, RB4, RB3, RB2, RB1, RB0, RB4);
+	ROUND (22, 6, RA3, RA2, RA1, RA0, RA4, RA3, RA2, RA4, RA1, RA0,
+		      RB3, RB2, RB1, RB0, RB4, RB3, RB2, RB4, RB1, RB0);
+	ROUND (23, 7, RA3, RA2, RA4, RA1, RA0, RA0, RA1, RA2, RA3, RA4,
+		      RB3, RB2, RB4, RB1, RB0, RB0, RB1, RB2, RB3, RB4);
+	ROUND (24, 0, RA0, RA1, RA2, RA3, RA4, RA1, RA4, RA2, RA0, RA3,
+		      RB0, RB1, RB2, RB3, RB4, RB1, RB4, RB2, RB0, RB3);
+	ROUND (25, 1, RA1, RA4, RA2, RA0, RA3, RA2, RA1, RA0, RA4, RA3,
+		      RB1, RB4, RB2, RB0, RB3, RB2, RB1, RB0, RB4, RB3);
+	ROUND (26, 2, RA2, RA1, RA0, RA4, RA3, RA0, RA4, RA1, RA3, RA2,
+		      RB2, RB1, RB0, RB4, RB3, RB0, RB4, RB1, RB3, RB2);
+	ROUND (27, 3, RA0, RA4, RA1, RA3, RA2, RA4, RA1, RA3, RA2, RA0,
+		      RB0, RB4, RB1, RB3, RB2, RB4, RB1, RB3, RB2, RB0);
+	ROUND (28, 4, RA4, RA1, RA3, RA2, RA0, RA1, RA0, RA4, RA2, RA3,
+		      RB4, RB1, RB3, RB2, RB0, RB1, RB0, RB4, RB2, RB3);
+	ROUND (29, 5, RA1, RA0, RA4, RA2, RA3, RA0, RA2, RA1, RA4, RA3,
+		      RB1, RB0, RB4, RB2, RB3, RB0, RB2, RB1, RB4, RB3);
+	ROUND (30, 6, RA0, RA2, RA1, RA4, RA3, RA0, RA2, RA3, RA1, RA4,
+		      RB0, RB2, RB1, RB4, RB3, RB0, RB2, RB3, RB1, RB4);
+	ROUND_LAST (31, 7, RA0, RA2, RA3, RA1, RA4, RA4, RA1, RA2, RA0, RA3,
+		           RB0, RB2, RB3, RB1, RB4, RB4, RB1, RB2, RB0, RB3);
+
+	transpose_4x4(RA4, RA1, RA2, RA0);
+	transpose_4x4(RB4, RB1, RB2, RB0);
+
+	bx lr;
+.size __serpent_enc_blk8,.-__serpent_enc_blk8;
+
+.align 3
+.type   __serpent_dec_blk8,%function;
+__serpent_dec_blk8:
+	/* input:
+	 *	r0: round key pointer
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel
+	 * 						ciphertext blocks
+	 * output:
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext
+	 *						blocks
+	 */
+
+	add RROUND, RROUND, #(32*16);
+
+	transpose_4x4(RA0, RA1, RA2, RA3);
+	BLOCK_LOAD_KEY_DEC ();
+	transpose_4x4(RB0, RB1, RB2, RB3);
+
+	ROUND_FIRST_INVERSE (31, 7, RA0, RA1, RA2, RA3, RA4,
+				    RA3, RA0, RA1, RA4, RA2,
+				    RB0, RB1, RB2, RB3, RB4,
+				    RB3, RB0, RB1, RB4, RB2);
+	ROUND_INVERSE (30, 6, RA3, RA0, RA1, RA4, RA2, RA0, RA1, RA2, RA4, RA3,
+		              RB3, RB0, RB1, RB4, RB2, RB0, RB1, RB2, RB4, RB3);
+	ROUND_INVERSE (29, 5, RA0, RA1, RA2, RA4, RA3, RA1, RA3, RA4, RA2, RA0,
+		              RB0, RB1, RB2, RB4, RB3, RB1, RB3, RB4, RB2, RB0);
+	ROUND_INVERSE (28, 4, RA1, RA3, RA4, RA2, RA0, RA1, RA2, RA4, RA0, RA3,
+		              RB1, RB3, RB4, RB2, RB0, RB1, RB2, RB4, RB0, RB3);
+	ROUND_INVERSE (27, 3, RA1, RA2, RA4, RA0, RA3, RA4, RA2, RA0, RA1, RA3,
+		              RB1, RB2, RB4, RB0, RB3, RB4, RB2, RB0, RB1, RB3);
+	ROUND_INVERSE (26, 2, RA4, RA2, RA0, RA1, RA3, RA2, RA3, RA0, RA1, RA4,
+		              RB4, RB2, RB0, RB1, RB3, RB2, RB3, RB0, RB1, RB4);
+	ROUND_INVERSE (25, 1, RA2, RA3, RA0, RA1, RA4, RA4, RA2, RA1, RA0, RA3,
+		              RB2, RB3, RB0, RB1, RB4, RB4, RB2, RB1, RB0, RB3);
+	ROUND_INVERSE (24, 0, RA4, RA2, RA1, RA0, RA3, RA4, RA3, RA2, RA0, RA1,
+		              RB4, RB2, RB1, RB0, RB3, RB4, RB3, RB2, RB0, RB1);
+	ROUND_INVERSE (23, 7, RA4, RA3, RA2, RA0, RA1, RA0, RA4, RA3, RA1, RA2,
+		              RB4, RB3, RB2, RB0, RB1, RB0, RB4, RB3, RB1, RB2);
+	ROUND_INVERSE (22, 6, RA0, RA4, RA3, RA1, RA2, RA4, RA3, RA2, RA1, RA0,
+		              RB0, RB4, RB3, RB1, RB2, RB4, RB3, RB2, RB1, RB0);
+	ROUND_INVERSE (21, 5, RA4, RA3, RA2, RA1, RA0, RA3, RA0, RA1, RA2, RA4,
+		              RB4, RB3, RB2, RB1, RB0, RB3, RB0, RB1, RB2, RB4);
+	ROUND_INVERSE (20, 4, RA3, RA0, RA1, RA2, RA4, RA3, RA2, RA1, RA4, RA0,
+		              RB3, RB0, RB1, RB2, RB4, RB3, RB2, RB1, RB4, RB0);
+	ROUND_INVERSE (19, 3, RA3, RA2, RA1, RA4, RA0, RA1, RA2, RA4, RA3, RA0,
+		              RB3, RB2, RB1, RB4, RB0, RB1, RB2, RB4, RB3, RB0);
+	ROUND_INVERSE (18, 2, RA1, RA2, RA4, RA3, RA0, RA2, RA0, RA4, RA3, RA1,
+		              RB1, RB2, RB4, RB3, RB0, RB2, RB0, RB4, RB3, RB1);
+	ROUND_INVERSE (17, 1, RA2, RA0, RA4, RA3, RA1, RA1, RA2, RA3, RA4, RA0,
+		              RB2, RB0, RB4, RB3, RB1, RB1, RB2, RB3, RB4, RB0);
+	ROUND_INVERSE (16, 0, RA1, RA2, RA3, RA4, RA0, RA1, RA0, RA2, RA4, RA3,
+		              RB1, RB2, RB3, RB4, RB0, RB1, RB0, RB2, RB4, RB3);
+	ROUND_INVERSE (15, 7, RA1, RA0, RA2, RA4, RA3, RA4, RA1, RA0, RA3, RA2,
+		              RB1, RB0, RB2, RB4, RB3, RB4, RB1, RB0, RB3, RB2);
+	ROUND_INVERSE (14, 6, RA4, RA1, RA0, RA3, RA2, RA1, RA0, RA2, RA3, RA4,
+		              RB4, RB1, RB0, RB3, RB2, RB1, RB0, RB2, RB3, RB4);
+	ROUND_INVERSE (13, 5, RA1, RA0, RA2, RA3, RA4, RA0, RA4, RA3, RA2, RA1,
+		              RB1, RB0, RB2, RB3, RB4, RB0, RB4, RB3, RB2, RB1);
+	ROUND_INVERSE (12, 4, RA0, RA4, RA3, RA2, RA1, RA0, RA2, RA3, RA1, RA4,
+		              RB0, RB4, RB3, RB2, RB1, RB0, RB2, RB3, RB1, RB4);
+	ROUND_INVERSE (11, 3, RA0, RA2, RA3, RA1, RA4, RA3, RA2, RA1, RA0, RA4,
+		              RB0, RB2, RB3, RB1, RB4, RB3, RB2, RB1, RB0, RB4);
+	ROUND_INVERSE (10, 2, RA3, RA2, RA1, RA0, RA4, RA2, RA4, RA1, RA0, RA3,
+		              RB3, RB2, RB1, RB0, RB4, RB2, RB4, RB1, RB0, RB3);
+	ROUND_INVERSE (9, 1, RA2, RA4, RA1, RA0, RA3, RA3, RA2, RA0, RA1, RA4,
+		             RB2, RB4, RB1, RB0, RB3, RB3, RB2, RB0, RB1, RB4);
+	ROUND_INVERSE (8, 0, RA3, RA2, RA0, RA1, RA4, RA3, RA4, RA2, RA1, RA0,
+		             RB3, RB2, RB0, RB1, RB4, RB3, RB4, RB2, RB1, RB0);
+	ROUND_INVERSE (7, 7, RA3, RA4, RA2, RA1, RA0, RA1, RA3, RA4, RA0, RA2,
+		             RB3, RB4, RB2, RB1, RB0, RB1, RB3, RB4, RB0, RB2);
+	ROUND_INVERSE (6, 6, RA1, RA3, RA4, RA0, RA2, RA3, RA4, RA2, RA0, RA1,
+		             RB1, RB3, RB4, RB0, RB2, RB3, RB4, RB2, RB0, RB1);
+	ROUND_INVERSE (5, 5, RA3, RA4, RA2, RA0, RA1, RA4, RA1, RA0, RA2, RA3,
+		             RB3, RB4, RB2, RB0, RB1, RB4, RB1, RB0, RB2, RB3);
+	ROUND_INVERSE (4, 4, RA4, RA1, RA0, RA2, RA3, RA4, RA2, RA0, RA3, RA1,
+		             RB4, RB1, RB0, RB2, RB3, RB4, RB2, RB0, RB3, RB1);
+	ROUND_INVERSE (3, 3, RA4, RA2, RA0, RA3, RA1, RA0, RA2, RA3, RA4, RA1,
+		             RB4, RB2, RB0, RB3, RB1, RB0, RB2, RB3, RB4, RB1);
+	ROUND_INVERSE (2, 2, RA0, RA2, RA3, RA4, RA1, RA2, RA1, RA3, RA4, RA0,
+		             RB0, RB2, RB3, RB4, RB1, RB2, RB1, RB3, RB4, RB0);
+	ROUND_INVERSE (1, 1, RA2, RA1, RA3, RA4, RA0, RA0, RA2, RA4, RA3, RA1,
+		             RB2, RB1, RB3, RB4, RB0, RB0, RB2, RB4, RB3, RB1);
+	ROUND_INVERSE (0, 0, RA0, RA2, RA4, RA3, RA1, RA0, RA1, RA2, RA3, RA4,
+		             RB0, RB2, RB4, RB3, RB1, RB0, RB1, RB2, RB3, RB4);
+
+	transpose_4x4(RA0, RA1, RA2, RA3);
+	transpose_4x4(RB0, RB1, RB2, RB3);
+
+	bx lr;
+.size __serpent_dec_blk8,.-__serpent_dec_blk8;
+
+.align 3
+.globl _gcry_serpent_neon_ctr_enc
+.type _gcry_serpent_neon_ctr_enc,%function;
+_gcry_serpent_neon_ctr_enc:
+	/* input:
+	 *	r0: ctx, CTX
+	 *	r1: dst (8 blocks)
+	 *	r2: src (8 blocks)
+	 *	r3: iv
+	 */
+
+	vmov.u8 RT1d0, #0xff; /* u64: -1 */
+	push {r4,lr};
+	vadd.u64 RT2d0, RT1d0, RT1d0; /* u64: -2 */
+	vpush {RA4-RB2};
+
+	/* load IV and byteswap */
+	vld1.8 {RA0}, [r3];
+	vrev64.u8 RT0, RA0; /* be => le */
+	ldr r4, [r3, #8];
+
+	/* construct IVs */
+	vsub.u64 RA2d1, RT0d1, RT2d0; /* +2 */
+	vsub.u64 RA1d1, RT0d1, RT1d0; /* +1 */
+	cmp r4, #-1;
+
+	vsub.u64 RB0d1, RA2d1, RT2d0; /* +4 */
+	vsub.u64 RA3d1, RA2d1, RT1d0; /* +3 */
+	ldr r4, [r3, #12];
+
+	vsub.u64 RB2d1, RB0d1, RT2d0; /* +6 */
+	vsub.u64 RB1d1, RB0d1, RT1d0; /* +5 */
+
+	vsub.u64 RT2d1, RB2d1, RT2d0; /* +8 */
+	vsub.u64 RB3d1, RB2d1, RT1d0; /* +7 */
+
+	vmov RA1d0, RT0d0;
+	vmov RA2d0, RT0d0;
+	vmov RA3d0, RT0d0;
+	vmov RB0d0, RT0d0;
+	rev r4, r4;
+	vmov RB1d0, RT0d0;
+	vmov RB2d0, RT0d0;
+	vmov RB3d0, RT0d0;
+	vmov RT2d0, RT0d0;
+
+	/* check need for handling 64-bit overflow and carry */
+	beq .Ldo_ctr_carry;
+
+.Lctr_carry_done:
+	/* le => be */
+	vrev64.u8 RA1, RA1;
+	vrev64.u8 RA2, RA2;
+	vrev64.u8 RA3, RA3;
+	vrev64.u8 RB0, RB0;
+	vrev64.u8 RT2, RT2;
+	vrev64.u8 RB1, RB1;
+	vrev64.u8 RB2, RB2;
+	vrev64.u8 RB3, RB3;
+	/* store new IV */
+	vst1.8 {RT2}, [r3];
+
+	bl __serpent_enc_blk8;
+
+	vld1.8 {RT0, RT1}, [r2]!;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RA4, RA4, RT0;
+	veor RA1, RA1, RT1;
+	vld1.8 {RT0, RT1}, [r2]!;
+	veor RA2, RA2, RT2;
+	veor RA0, RA0, RT3;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RB4, RB4, RT0;
+	veor RT0, RT0;
+	veor RB1, RB1, RT1;
+	veor RT1, RT1;
+	veor RB2, RB2, RT2;
+	veor RT2, RT2;
+	veor RB0, RB0, RT3;
+	veor RT3, RT3;
+
+	vst1.8 {RA4}, [r1]!;
+	vst1.8 {RA1}, [r1]!;
+	veor RA1, RA1;
+	vst1.8 {RA2}, [r1]!;
+	veor RA2, RA2;
+	vst1.8 {RA0}, [r1]!;
+	veor RA0, RA0;
+	vst1.8 {RB4}, [r1]!;
+	veor RB4, RB4;
+	vst1.8 {RB1}, [r1]!;
+	vst1.8 {RB2}, [r1]!;
+	vst1.8 {RB0}, [r1]!;
+
+	vpop {RA4-RB2};
+
+	/* clear the used registers */
+	veor RA3, RA3;
+	veor RB3, RB3;
+
+	pop {r4,pc};
+
+.Ldo_ctr_carry:
+	cmp r4, #-8;
+	blo .Lctr_carry_done;
+	beq .Lcarry_RT2;
+
+	cmp r4, #-6;
+	blo .Lcarry_RB3;
+	beq .Lcarry_RB2;
+
+	cmp r4, #-4;
+	blo .Lcarry_RB1;
+	beq .Lcarry_RB0;
+
+	cmp r4, #-2;
+	blo .Lcarry_RA3;
+	beq .Lcarry_RA2;
+
+	vsub.u64 RA1d0, RT1d0;
+.Lcarry_RA2:
+	vsub.u64 RA2d0, RT1d0;
+.Lcarry_RA3:
+	vsub.u64 RA3d0, RT1d0;
+.Lcarry_RB0:
+	vsub.u64 RB0d0, RT1d0;
+.Lcarry_RB1:
+	vsub.u64 RB1d0, RT1d0;
+.Lcarry_RB2:
+	vsub.u64 RB2d0, RT1d0;
+.Lcarry_RB3:
+	vsub.u64 RB3d0, RT1d0;
+.Lcarry_RT2:
+	vsub.u64 RT2d0, RT1d0;
+
+	b .Lctr_carry_done;
+.size _gcry_serpent_neon_ctr_enc,.-_gcry_serpent_neon_ctr_enc;
+
+.align 3
+.globl _gcry_serpent_neon_cfb_dec
+.type _gcry_serpent_neon_cfb_dec,%function;
+_gcry_serpent_neon_cfb_dec:
+	/* input:
+	 *	r0: ctx, CTX
+	 *	r1: dst (8 blocks)
+	 *	r2: src (8 blocks)
+	 *	r3: iv
+	 */
+
+	push {lr};
+	vpush {RA4-RB2};
+
+	/* Load input */
+	vld1.8 {RA0}, [r3];
+	vld1.8 {RA1, RA2}, [r2]!;
+	vld1.8 {RA3}, [r2]!;
+	vld1.8 {RB0}, [r2]!;
+	vld1.8 {RB1, RB2}, [r2]!;
+	vld1.8 {RB3}, [r2]!;
+
+	/* Update IV */
+	vld1.8 {RT0}, [r2]!;
+	vst1.8 {RT0}, [r3];
+	mov r3, lr;
+	sub r2, r2, #(8*16);
+
+	bl __serpent_enc_blk8;
+
+	vld1.8 {RT0, RT1}, [r2]!;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RA4, RA4, RT0;
+	veor RA1, RA1, RT1;
+	vld1.8 {RT0, RT1}, [r2]!;
+	veor RA2, RA2, RT2;
+	veor RA0, RA0, RT3;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RB4, RB4, RT0;
+	veor RT0, RT0;
+	veor RB1, RB1, RT1;
+	veor RT1, RT1;
+	veor RB2, RB2, RT2;
+	veor RT2, RT2;
+	veor RB0, RB0, RT3;
+	veor RT3, RT3;
+
+	vst1.8 {RA4}, [r1]!;
+	vst1.8 {RA1}, [r1]!;
+	veor RA1, RA1;
+	vst1.8 {RA2}, [r1]!;
+	veor RA2, RA2;
+	vst1.8 {RA0}, [r1]!;
+	veor RA0, RA0;
+	vst1.8 {RB4}, [r1]!;
+	veor RB4, RB4;
+	vst1.8 {RB1}, [r1]!;
+	vst1.8 {RB2}, [r1]!;
+	vst1.8 {RB0}, [r1]!;
+
+	vpop {RA4-RB2};
+
+	/* clear the used registers */
+	veor RA3, RA3;
+	veor RB3, RB3;
+
+	pop {pc};
+.size _gcry_serpent_neon_cfb_dec,.-_gcry_serpent_neon_cfb_dec;
+
+.align 3
+.globl _gcry_serpent_neon_cbc_dec
+.type _gcry_serpent_neon_cbc_dec,%function;
+_gcry_serpent_neon_cbc_dec:
+	/* input:
+	 *	r0: ctx, CTX
+	 *	r1: dst (8 blocks)
+	 *	r2: src (8 blocks)
+	 *	r3: iv
+	 */
+
+	push {lr};
+	vpush {RA4-RB2};
+
+	vld1.8 {RA0, RA1}, [r2]!;
+	vld1.8 {RA2, RA3}, [r2]!;
+	vld1.8 {RB0, RB1}, [r2]!;
+	vld1.8 {RB2, RB3}, [r2]!;
+	sub r2, r2, #(8*16);
+
+	bl __serpent_dec_blk8;
+
+	vld1.8 {RB4}, [r3];
+	vld1.8 {RT0, RT1}, [r2]!;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RA0, RA0, RB4;
+	veor RA1, RA1, RT0;
+	veor RA2, RA2, RT1;
+	vld1.8 {RT0, RT1}, [r2]!;
+	veor RA3, RA3, RT2;
+	veor RB0, RB0, RT3;
+	vld1.8 {RT2, RT3}, [r2]!;
+	veor RB1, RB1, RT0;
+	veor RT0, RT0;
+	veor RB2, RB2, RT1;
+	veor RT1, RT1;
+	veor RB3, RB3, RT2;
+	veor RT2, RT2;
+	vst1.8 {RT3}, [r3]; /* store new IV */
+	veor RT3, RT3;
+
+	vst1.8 {RA0, RA1}, [r1]!;
+	veor RA0, RA0;
+	veor RA1, RA1;
+	vst1.8 {RA2, RA3}, [r1]!;
+	veor RA2, RA2;
+	vst1.8 {RB0, RB1}, [r1]!;
+	veor RA3, RA3;
+	vst1.8 {RB2, RB3}, [r1]!;
+	veor RB3, RB3;
+
+	vpop {RA4-RB2};
+
+	/* clear the used registers */
+	veor RB4, RB4;
+
+	pop {pc};
+.size _gcry_serpent_neon_cbc_dec,.-_gcry_serpent_neon_cbc_dec;
+
+#endif
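
Two parts of the new assembly are easier to follow against a plain-C model.
The transpose_4x4 macro (vtrn.32 plus vswp) transposes four q-registers viewed
as a 4x4 matrix of 32-bit words, so that each vector ends up holding the same
word of four different blocks; and the CTR entry point byte-swaps the
big-endian IV with vrev64.u8, derives the +1 through +8 counter values by
subtracting the -1/-2 constants, and fixes up a low-64-bit overflow in
.Ldo_ctr_carry. A C sketch of the equivalent operations follows; the helper
names are illustrative, not taken from the source:

  #include <stdint.h>

  /* Transpose a 4x4 matrix of 32-bit words in place; this is what
     transpose_4x4() achieves on four q-registers.  */
  static void
  transpose_4x4_u32 (uint32_t m[4][4])
  {
    for (int i = 0; i < 4; i++)
      for (int j = i + 1; j < 4; j++)
        {
          uint32_t t = m[i][j];
          m[i][j] = m[j][i];
          m[j][i] = t;
        }
  }

  /* Add n to a 128-bit big-endian counter, carrying from the low
     64 bits into the high 64 bits; this is the arithmetic behind
     the vsub.u64 constructions and the .Ldo_ctr_carry path.  */
  static void
  ctr_add (unsigned char ctr[16], uint64_t n)
  {
    uint64_t hi = 0, lo = 0;
    int i;

    for (i = 0; i < 8; i++)
      {
        hi = (hi << 8) | ctr[i];
        lo = (lo << 8) | ctr[i + 8];
      }
    if (lo + n < lo)   /* low word wrapped: carry into high word */
      hi++;
    lo += n;
    for (i = 7; i >= 0; i--)
      {
        ctr[i] = hi & 0xff;      hi >>= 8;
        ctr[i + 8] = lo & 0xff;  lo >>= 8;
      }
  }
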
diff --git a/cipher/serpent.c b/cipher/serpent.c
index a8ee15f..cfda742 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -46,6 +46,15 @@
 # endif
 #endif
 
+/* USE_NEON indicates whether to enable ARM NEON assembly code. */
+#undef USE_NEON
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__)
+# if defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+     defined(HAVE_GCC_INLINE_ASM_NEON)
+#  define USE_NEON 1
+# endif
+#endif
+
 
 /* Number of rounds per Serpent encrypt/decrypt operation.  */
 #define ROUNDS 32
@@ -71,6 +80,9 @@ typedef struct serpent_context
 #ifdef USE_AVX2
   int use_avx2;
 #endif
+#ifdef USE_NEON
+  int use_neon;
+#endif
 } serpent_context_t;
 
 
@@ -114,6 +126,26 @@ extern void _gcry_serpent_avx2_cfb_dec(serpent_context_t *ctx,
 				       unsigned char *iv);
 #endif
 
+#ifdef USE_NEON
+/* Assembler implementations of Serpent using ARM NEON.  Process 8 block in
+   parallel.
+ */
+extern void _gcry_serpent_neon_ctr_enc(serpent_context_t *ctx,
+				       unsigned char *out,
+				       const unsigned char *in,
+				       unsigned char *ctr);
+
+extern void _gcry_serpent_neon_cbc_dec(serpent_context_t *ctx,
+				       unsigned char *out,
+				       const unsigned char *in,
+				       unsigned char *iv);
+
+extern void _gcry_serpent_neon_cfb_dec(serpent_context_t *ctx,
+				       unsigned char *out,
+				       const unsigned char *in,
+				       unsigned char *iv);
+#endif
+
 
 /* A prototype.  */
 static const char *serpent_test (void);
@@ -634,6 +666,14 @@ serpent_setkey_internal (serpent_context_t *context,
     }
 #endif
 
+#ifdef USE_NEON
+  context->use_neon = 0;
+  if ((_gcry_get_hw_features () & HWF_ARM_NEON))
+    {
+      context->use_neon = 1;
+    }
+#endif
+
   _gcry_burn_stack (272 * sizeof (u32));
 }
 
@@ -861,6 +901,34 @@ _gcry_serpent_ctr_enc(void *context, unsigned char *ctr,
   }
 #endif
 
+#ifdef USE_NEON
+  if (ctx->use_neon)
+    {
+      int did_use_neon = 0;
+
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_serpent_neon_ctr_enc(ctx, outbuf, inbuf, ctr);
+
+          nblocks -= 8;
+          outbuf += 8 * sizeof(serpent_block_t);
+          inbuf  += 8 * sizeof(serpent_block_t);
+          did_use_neon = 1;
+        }
+
+      if (did_use_neon)
+        {
+          /* serpent-neon assembly code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic code to handle smaller chunks... */
+      /* TODO: use caching instead? */
+    }
+#endif
+
   for ( ;nblocks; nblocks-- )
     {
       /* Encrypt the counter. */
@@ -948,6 +1016,33 @@ _gcry_serpent_cbc_dec(void *context, unsigned char *iv,
   }
 #endif
 
+#ifdef USE_NEON
+  if (ctx->use_neon)
+    {
+      int did_use_neon = 0;
+
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_serpent_neon_cbc_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 8;
+          outbuf += 8 * sizeof(serpent_block_t);
+          inbuf  += 8 * sizeof(serpent_block_t);
+          did_use_neon = 1;
+        }
+
+      if (did_use_neon)
+        {
+          /* serpent-neon assembly code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic code to handle smaller chunks... */
+    }
+#endif
+
   for ( ;nblocks; nblocks-- )
     {
       /* INBUF is needed later and it may be identical to OUTBUF, so store
@@ -1028,6 +1123,33 @@ _gcry_serpent_cfb_dec(void *context, unsigned char *iv,
   }
 #endif
 
+#ifdef USE_NEON
+  if (ctx->use_neon)
+    {
+      int did_use_neon = 0;
+
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_serpent_neon_cfb_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 8;
+          outbuf += 8 * sizeof(serpent_block_t);
+          inbuf  += 8 * sizeof(serpent_block_t);
+          did_use_neon = 1;
+        }
+
+      if (did_use_neon)
+        {
+          /* serpent-neon assembly code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic code to handle smaller chunks... */
+    }
+#endif
+
   for ( ;nblocks; nblocks-- )
     {
       serpent_encrypt_internal(ctx, iv, iv);
diff --git a/configure.ac b/configure.ac
index 19c97bd..e3471d0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1502,6 +1502,11 @@ if test "$found" = "1" ; then
       # Build with the AVX2 implementation
       GCRYPT_CIPHERS="$GCRYPT_CIPHERS serpent-avx2-amd64.lo"
    fi
+
+   if test x"$neonsupport" = xyes ; then
+      # Build with the NEON implementation
+      GCRYPT_CIPHERS="$GCRYPT_CIPHERS serpent-armv7-neon.lo"
+   fi
 fi
 
 LIST_MEMBER(rfc2268, $enabled_ciphers)

-----------------------------------------------------------------------

Summary of changes:
 cipher/salsa20-armv7-neon.S |    2 +-
 cipher/serpent-armv7-neon.S |  869 +++++++++++++++++++++++++++++++++++++++++++
 cipher/serpent.c            |  122 ++++++
 configure.ac                |    5 +
 doc/gcrypt.texi             |   86 ++---
 5 files changed, 1040 insertions(+), 44 deletions(-)
 create mode 100644 cipher/serpent-armv7-neon.S


hooks/post-receive
-- 
The GNU crypto library
http://git.gnupg.org


_______________________________________________
Gnupg-commits mailing list
Gnupg-commits@gnupg.org
http://lists.gnupg.org/mailman/listinfo/gnupg-commits

