[PATCH] keccak: rewrite for improved performance

Sat Oct 24 16:12:00 CEST 2015

* cipher/Makefile.am: Add 'keccak_permute_32.h' and
'keccak_permute_64.h'.
* cipher/hash-common.h [USE_SHA3] (MD_BLOCK_MAX_BLOCKSIZE): Remove.
* cipher/keccak.c (USE_64BIT, USE_32BIT, USE_64BIT_BMI2)
(USE_64BIT_SHLD, USE_32BIT_BMI2, NEED_COMMON64, NEED_COMMON32BI)
(keccak_ops_t): New.
(KECCAK_STATE): Add 'state64' and 'state32bi' members.
(KECCAK_CONTEXT): Remove 'bctx'; add 'blocksize', 'count' and 'ops'.
(rol64, keccak_f1600_state_permute): Remove.
[NEED_COMMON64] (round_consts_64bit, keccak_extract_inplace64): New.
[NEED_COMMON32BI] (round_consts_32bit, keccak_extract_inplace32bi)
(keccak_absorb_lane32bi): New.
[USE_64BIT] (ANDN64, ROL64, keccak_f1600_state_permute64)
(keccak_absorb_lanes64, keccak_generic64_ops): New.
[USE_64BIT_SHLD] (ANDN64, ROL64, keccak_f1600_state_permute64_shld)
(keccak_absorb_lanes64_shld, keccak_shld_64_ops): New.
[USE_64BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute64_bmi2)
(keccak_absorb_lanes64_bmi2, keccak_bmi2_64_ops): New.
[USE_32BIT] (ANDN32, ROL32, keccak_f1600_state_permute32bi)
(keccak_absorb_lanes32bi, keccak_generic32bi_ops): New.
[USE_32BIT_BMI2] (ANDN32, ROL32, keccak_f1600_state_permute32bi_bmi2)
(pext, pdep, keccak_absorb_lane32bi_bmi2, keccak_absorb_lanes32bi_bmi2)
(keccak_extract_inplace32bi_bmi2, keccak_bmi2_32bi_ops): New.
(keccak_write): New.
(keccak_init): Adjust to KECCAK_CONTEXT changes; add implementation
selection based on HWF features.
(keccak_final): Adjust to KECCAK_CONTEXT changes; use selected 'ops'
for state manipulation.
(keccak_read): Adjust to KECCAK_CONTEXT changes.
(_gcry_digest_spec_sha3_224, _gcry_digest_spec_sha3_256)
(_gcry_digest_spec_sha3_384, _gcry_digest_spec_sha3_512): Use
'keccak_write' instead of '_gcry_md_block_write'.
* cipher/keccak_permute_32.h: New.
* cipher/keccak_permute_64.h: New.
--

This patch adds new generic 64-bit and 32-bit implementations of
SHA3, plus optimized variants:
 - Generic 64-bit implementation based on 'simple' implementation
   from SUPERCOP package.
 - Generic 32-bit bit-interleaved (BI) implementation based on
   'simple32bi' implementation from SUPERCOP package.
 - Intel BMI2 optimized variants of 64-bit and 32-bit BI
   implementations.
 - Intel SHLD optimized variant of 64-bit implementation.
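
As a rough illustration of the bit-interleaved representation used by
the 32-bit variants (a minimal sketch, not code from this patch; the
helper names interleave64, deinterleave64, rol64_interleaved and
check_interleaved_rotate are made up for the example): a 64-bit lane is
stored as two 32-bit words holding its even and odd bits, so a 64-bit
rotate turns into two 32-bit rotates.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit lane: even-numbered bits go to bi[0], odd-numbered
 * bits go to bi[1] (assumed convention for this sketch). */
static void interleave64(uint64_t x, uint32_t bi[2])
{
  uint32_t even = 0, odd = 0;
  unsigned int i;

  for (i = 0; i < 32; i++)
    {
      even |= (uint32_t)((x >> (2 * i)) & 1) << i;
      odd |= (uint32_t)((x >> (2 * i + 1)) & 1) << i;
    }
  bi[0] = even;
  bi[1] = odd;
}

/* Inverse of interleave64. */
static uint64_t deinterleave64(const uint32_t bi[2])
{
  uint64_t x = 0;
  unsigned int i;

  for (i = 0; i < 32; i++)
    {
      x |= (uint64_t)((bi[0] >> i) & 1) << (2 * i);
      x |= (uint64_t)((bi[1] >> i) & 1) << (2 * i + 1);
    }
  return x;
}

static uint32_t rol32(uint32_t x, unsigned int n)
{
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

static uint64_t rol64(uint64_t x, unsigned int n)
{
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Rotate an interleaved lane left by n bits with 32-bit rotates only. */
static void rol64_interleaved(uint32_t bi[2], unsigned int n)
{
  uint32_t even = bi[0], odd = bi[1];

  n &= 63;
  if (n % 2 == 0)
    {
      bi[0] = rol32(even, n / 2);
      bi[1] = rol32(odd, n / 2);
    }
  else
    {
      /* Odd amount: even and odd bits swap roles. */
      bi[0] = rol32(odd, (n + 1) / 2);
      bi[1] = rol32(even, n / 2);
    }
}

/* Compare every rotation amount against a plain 64-bit rotate. */
static void check_interleaved_rotate(uint64_t x)
{
  unsigned int n;

  for (n = 0; n < 64; n++)
    {
      uint32_t bi[2];

      interleave64(x, bi);
      rol64_interleaved(bi, n);
      assert(deinterleave64(bi) == rol64(x, n));
    }
}
```

The loops above are written for clarity only; the patch itself does the
conversion with shift-and-mask steps, or with pext/pdep in the BMI2
variant.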

Patch also makes proper use of the sponge construction to avoid
the need for an additional input buffer.
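
To illustrate the point about the sponge construction (a simplified
sketch, not the code in this patch): input bytes can be XORed straight
into the state at the current offset, with the permutation run whenever
a full rate-sized block has been absorbed, so no separate block buffer
is needed. The names sponge_t, sponge_absorb and toy_permute are
hypothetical, and toy_permute is only a stand-in for Keccak-f[1600], so
the sketch does not compute SHA3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RATE_BYTES (1088 / 8)	/* rate of SHA3-256 */

typedef struct
{
  uint64_t state[25];		/* 1600-bit sponge state */
  unsigned int count;		/* bytes absorbed into current block */
} sponge_t;

/* Placeholder mixing function; NOT the real Keccak-f[1600]. */
static void toy_permute(uint64_t state[25])
{
  unsigned int i;

  for (i = 0; i < 25; i++)
    state[i] = ((state[i] << 7) | (state[i] >> 57))
	       ^ state[(i + 1) % 25] ^ i;
}

/* Absorb input directly into the state, byte by byte. */
static void sponge_absorb(sponge_t *s, const unsigned char *in, size_t len)
{
  assert(s->count < RATE_BYTES);

  while (len--)
    {
      /* XOR the byte into its little-endian position in the lane. */
      s->state[s->count / 8] ^= (uint64_t)*in++ << (8 * (s->count % 8));

      if (++s->count == RATE_BYTES)
	{
	  toy_permute(s->state);
	  s->count = 0;
	}
    }
}
```

Because the only carried-over data is the state and a byte counter,
feeding a message in one call or in arbitrary pieces leaves the sponge
in the same state. The patch follows the same idea but works on whole
8-byte lanes where possible, which is where most of the speed comes
from.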

Below are bench-slope benchmarks for the new 64-bit implementations
made on Intel Core i5-4570 (no turbo, 3.2 GHz, gcc-4.9.2).

Before (amd64):

 SHA3-224       |      3.92 ns/B     243.2 MiB/s     12.55 c/B
 SHA3-256       |      4.15 ns/B     230.0 MiB/s     13.27 c/B
 SHA3-384       |      5.40 ns/B     176.6 MiB/s     17.29 c/B
 SHA3-512       |      7.77 ns/B     122.7 MiB/s     24.87 c/B

After (generic 64-bit, amd64, 1.10x faster):

 SHA3-224       |      3.57 ns/B     267.4 MiB/s     11.42 c/B
 SHA3-256       |      3.77 ns/B     252.8 MiB/s     12.07 c/B
 SHA3-384       |      4.91 ns/B     194.1 MiB/s     15.72 c/B
 SHA3-512       |      7.06 ns/B     135.0 MiB/s     22.61 c/B

After (Intel SHLD 64-bit, amd64, 1.13x faster):

 SHA3-224       |      3.48 ns/B     273.7 MiB/s     11.15 c/B
 SHA3-256       |      3.68 ns/B     258.9 MiB/s     11.79 c/B
 SHA3-384       |      4.80 ns/B     198.7 MiB/s     15.36 c/B
 SHA3-512       |      6.89 ns/B     138.4 MiB/s     22.05 c/B

After (Intel BMI2 64-bit, amd64, 1.45x faster):

 SHA3-224       |      2.71 ns/B     352.1 MiB/s      8.67 c/B
 SHA3-256       |      2.86 ns/B     333.2 MiB/s      9.16 c/B
 SHA3-384       |      3.72 ns/B     256.2 MiB/s     11.91 c/B
 SHA3-512       |      5.34 ns/B     178.5 MiB/s     17.10 c/B

Benchmarks of the new 32-bit implementations on Intel Core i5-4570
(no turbo, 3.2 GHz, gcc-4.9.2):

Before (win32):

 SHA3-224       |     12.05 ns/B     79.16 MiB/s     38.56 c/B
 SHA3-256       |     12.75 ns/B     74.78 MiB/s     40.82 c/B
 SHA3-384       |     16.63 ns/B     57.36 MiB/s     53.22 c/B
 SHA3-512       |     23.97 ns/B     39.79 MiB/s     76.72 c/B

After (generic 32-bit BI, win32, 1.23x to 1.29x faster):

 SHA3-224       |      9.76 ns/B     97.69 MiB/s     31.25 c/B
 SHA3-256       |     10.27 ns/B     92.82 MiB/s     32.89 c/B
 SHA3-384       |     13.22 ns/B     72.16 MiB/s     42.31 c/B
 SHA3-512       |     18.65 ns/B     51.13 MiB/s     59.70 c/B

After (Intel BMI2 32-bit BI, win32, 1.66x to 1.70x faster):

 SHA3-224       |      7.26 ns/B     131.4 MiB/s     23.23 c/B
 SHA3-256       |      7.65 ns/B     124.7 MiB/s     24.47 c/B
 SHA3-384       |      9.87 ns/B     96.67 MiB/s     31.58 c/B
 SHA3-512       |     14.05 ns/B     67.85 MiB/s     44.99 c/B

Benchmarks of the new 32-bit implementation on ARM Cortex-A8
(1008 MHz, gcc-4.9.1):

Before:

 SHA3-224       |     148.6 ns/B      6.42 MiB/s     149.8 c/B
 SHA3-256       |     157.2 ns/B      6.07 MiB/s     158.4 c/B
 SHA3-384       |     205.3 ns/B      4.65 MiB/s     206.9 c/B
 SHA3-512       |     296.3 ns/B      3.22 MiB/s     298.6 c/B

After (1.56x faster):

 SHA3-224       |     96.12 ns/B      9.92 MiB/s     96.89 c/B
 SHA3-256       |     101.5 ns/B      9.40 MiB/s     102.3 c/B
 SHA3-384       |     131.4 ns/B      7.26 MiB/s     132.5 c/B
 SHA3-512       |     188.2 ns/B      5.07 MiB/s     189.7 c/B
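
For readers checking the tables: the three columns are the same
measurement in different units, and the quoted speedups are ratios of
the c/B columns. A small sketch of the arithmetic, using the amd64
SHA3-224 rows (the function names are made up for this example and are
not part of bench-slope):

```c
#include <assert.h>

/* Throughput in MiB/s from a cost in nanoseconds per byte. */
static double mib_per_s(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Cycles per byte from ns per byte at a fixed clock given in GHz. */
static double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
  return ns_per_byte * clock_ghz;
}

/* Speedup factor between two costs (lower cost is faster). */
static double speedup(double before, double after)
{
  return before / after;
}

/* Cross-check the SHA3-224 rows, allowing for table rounding. */
static void check_sha3_224_rows(void)
{
  double m = mib_per_s(3.92);			/* table: 243.2 MiB/s */
  double c = cycles_per_byte(3.92, 3.2);	/* table: 12.55 c/B */
  double s = speedup(12.55, 8.67);		/* BMI2 label: 1.45x */

  assert(m > 243.0 && m < 243.5);
  assert(c > 12.5 && c < 12.6);
  assert(s > 1.44 && s < 1.46);
}
```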

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am         |    2 
 cipher/hash-common.h       |   12 -
 cipher/keccak.c            |  807 +++++++++++++++++++++++++++++++-------------
 cipher/keccak_permute_32.h |  535 +++++++++++++++++++++++++++++
 cipher/keccak_permute_64.h |  290 ++++++++++++++++
 5 files changed, 1403 insertions(+), 243 deletions(-)
 create mode 100644 cipher/keccak_permute_32.h
 create mode 100644 cipher/keccak_permute_64.h

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index b08c9a9..be03d06 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -90,7 +90,7 @@ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \
 sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \
   sha512-armv7-neon.S \
-keccak.c \
+keccak.c keccak_permute_32.h keccak_permute_64.h \
 stribog.c \
 tiger.c \
 whirlpool.c whirlpool-sse2-amd64.S \
diff --git a/cipher/hash-common.h b/cipher/hash-common.h
index e1ae5a2..27d670d 100644
--- a/cipher/hash-common.h
+++ b/cipher/hash-common.h
@@ -33,15 +33,9 @@ typedef unsigned int (*_gcry_md_block_write_t) (void *c,
 						const unsigned char *blks,
 						size_t nblks);
 
-#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_SHA3) || \
-				  defined(USE_WHIRLPOOL))
-/* SHA-512, SHA-3 and Whirlpool needs u64. SHA-512 and SHA3 need larger
- * buffer. */
-# ifdef USE_SHA3
-#  define MD_BLOCK_MAX_BLOCKSIZE (1152 / 8)
-# else
-#  define MD_BLOCK_MAX_BLOCKSIZE 128
-# endif
+#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_WHIRLPOOL))
+/* SHA-512 and Whirlpool needs u64. SHA-512 needs larger buffer. */
+# define MD_BLOCK_MAX_BLOCKSIZE 128
 # define MD_NBLOCKS_TYPE u64
 #else
 # define MD_BLOCK_MAX_BLOCKSIZE 64
diff --git a/cipher/keccak.c b/cipher/keccak.c
index 4a9c1f2..efcd813 100644
--- a/cipher/keccak.c
+++ b/cipher/keccak.c
@@ -27,11 +27,45 @@
 #include "hash-common.h"
 
 
-/* The code is based on public-domain/CC0 "Keccak-readable-and-compact.c"
- * implementation by the Keccak, Keyak and Ketje Teams, namely, Guido Bertoni,
- * Joan Daemen, Michaël Peeters, Gilles Van Assche and Ronny Van Keer. From:
- *   https://github.com/gvanas/KeccakCodePackage
- */
+
+/* USE_64BIT indicates whether to use 64-bit generic implementation.
+ * USE_32BIT indicates whether to use 32-bit generic implementation. */
+#undef USE_64BIT
+#if defined(__x86_64__) || SIZEOF_UNSIGNED_LONG == 8
+# define USE_64BIT 1
+#else
+# define USE_32BIT 1
+#endif
+
+
+/* USE_64BIT_BMI2 indicates whether to compile with 64-bit Intel BMI2 code. */
+#undef USE_64BIT_BMI2
+#if defined(USE_64BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_64BIT_BMI2 1
+#endif
+
+
+/* USE_64BIT_SHLD indicates whether to compile with 64-bit Intel SHLD code. */
+#undef USE_64BIT_SHLD
+#if defined(USE_64BIT) && defined (__GNUC__) && defined(__x86_64__)
+# define USE_64BIT_SHLD 1
+#endif
+
+
+/* USE_32BIT_BMI2 indicates whether to compile with 32-bit Intel BMI2 code. */
+#undef USE_32BIT_BMI2
+#if defined(USE_32BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_32BIT_BMI2 1
+#endif
+
+
+#ifdef USE_64BIT
+# define NEED_COMMON64 1
+#endif
+
+#ifdef USE_32BIT
+# define NEED_COMMON32BI 1
+#endif
 
 
 #define SHA3_DELIMITED_SUFFIX 0x06
@@ -40,220 +74,527 @@
 
 typedef struct
 {
-  u64 state[5][5];
+  union {
+#ifdef NEED_COMMON64
+    u64 state64[25];
+#endif
+#ifdef NEED_COMMON32BI
+    u32 state32bi[50];
+#endif
+  } u;
 } KECCAK_STATE;
 
 
 typedef struct
 {
-  gcry_md_block_ctx_t bctx;
+  unsigned int (*permute)(KECCAK_STATE *hd);
+  unsigned int (*absorb)(KECCAK_STATE *hd, int pos, const byte *lanes,
+			 unsigned int nlanes, int blocklanes);
+  unsigned int (*extract_inplace) (KECCAK_STATE *hd, unsigned int outlen);
+} keccak_ops_t;
+
+
+typedef struct KECCAK_CONTEXT_S
+{
   KECCAK_STATE state;
   unsigned int outlen;
+  unsigned int blocksize;
+  unsigned int count;
+  const keccak_ops_t *ops;
 } KECCAK_CONTEXT;
 
 
-static inline u64
-rol64 (u64 x, unsigned int n)
+
+#ifdef NEED_COMMON64
+
+static const u64 round_consts_64bit[24] =
 {
-  return ((x << n) | (x >> (64 - n)));
-}
+  U64_C(0x0000000000000001), U64_C(0x0000000000008082),
+  U64_C(0x800000000000808A), U64_C(0x8000000080008000),
+  U64_C(0x000000000000808B), U64_C(0x0000000080000001),
+  U64_C(0x8000000080008081), U64_C(0x8000000000008009),
+  U64_C(0x000000000000008A), U64_C(0x0000000000000088),
+  U64_C(0x0000000080008009), U64_C(0x000000008000000A),
+  U64_C(0x000000008000808B), U64_C(0x800000000000008B),
+  U64_C(0x8000000000008089), U64_C(0x8000000000008003),
+  U64_C(0x8000000000008002), U64_C(0x8000000000000080),
+  U64_C(0x000000000000800A), U64_C(0x800000008000000A),
+  U64_C(0x8000000080008081), U64_C(0x8000000000008080),
+  U64_C(0x0000000080000001), U64_C(0x8000000080008008)
+};
 
-/* Function that computes the Keccak-f[1600] permutation on the given state. */
-static unsigned int keccak_f1600_state_permute(KECCAK_STATE *hd)
+static unsigned int
+keccak_extract_inplace64(KECCAK_STATE *hd, unsigned int outlen)
 {
-  static const u64 round_consts[24] =
-  {
-    U64_C(0x0000000000000001), U64_C(0x0000000000008082),
-    U64_C(0x800000000000808A), U64_C(0x8000000080008000),
-    U64_C(0x000000000000808B), U64_C(0x0000000080000001),
-    U64_C(0x8000000080008081), U64_C(0x8000000000008009),
-    U64_C(0x000000000000008A), U64_C(0x0000000000000088),
-    U64_C(0x0000000080008009), U64_C(0x000000008000000A),
-    U64_C(0x000000008000808B), U64_C(0x800000000000008B),
-    U64_C(0x8000000000008089), U64_C(0x8000000000008003),
-    U64_C(0x8000000000008002), U64_C(0x8000000000000080),
-    U64_C(0x000000000000800A), U64_C(0x800000008000000A),
-    U64_C(0x8000000080008081), U64_C(0x8000000000008080),
-    U64_C(0x0000000080000001), U64_C(0x8000000080008008)
-  };
-  unsigned int round;
+  unsigned int i;
 
-  for (round = 0; round < 24; round++)
+  for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
     {
-      {
-	/* θ step (see [Keccak Reference, Section 2.3.2]) === */
-	u64 C[5], D[5];
-
-	/* Compute the parity of the columns */
-	C[0] = hd->state[0][0] ^ hd->state[1][0] ^ hd->state[2][0]
-	      ^ hd->state[3][0] ^ hd->state[4][0];
-	C[1] = hd->state[0][1] ^ hd->state[1][1] ^ hd->state[2][1]
-	      ^ hd->state[3][1] ^ hd->state[4][1];
-	C[2] = hd->state[0][2] ^ hd->state[1][2] ^ hd->state[2][2]
-	      ^ hd->state[3][2] ^ hd->state[4][2];
-	C[3] = hd->state[0][3] ^ hd->state[1][3] ^ hd->state[2][3]
-	      ^ hd->state[3][3] ^ hd->state[4][3];
-	C[4] = hd->state[0][4] ^ hd->state[1][4] ^ hd->state[2][4]
-	      ^ hd->state[3][4] ^ hd->state[4][4];
-
-	/* Compute the θ effect for a given column */
-	D[0] = C[4] ^ rol64(C[1], 1);
-	D[1] = C[0] ^ rol64(C[2], 1);
-	D[2] = C[1] ^ rol64(C[3], 1);
-	D[3] = C[2] ^ rol64(C[4], 1);
-	D[4] = C[3] ^ rol64(C[0], 1);
-
-	/* Add the θ effect to the whole column */
-	hd->state[0][0] ^= D[0];
-	hd->state[1][0] ^= D[0];
-	hd->state[2][0] ^= D[0];
-	hd->state[3][0] ^= D[0];
-	hd->state[4][0] ^= D[0];
-
-	/* Add the θ effect to the whole column */
-	hd->state[0][1] ^= D[1];
-	hd->state[1][1] ^= D[1];
-	hd->state[2][1] ^= D[1];
-	hd->state[3][1] ^= D[1];
-	hd->state[4][1] ^= D[1];
-
-	/* Add the θ effect to the whole column */
-	hd->state[0][2] ^= D[2];
-	hd->state[1][2] ^= D[2];
-	hd->state[2][2] ^= D[2];
-	hd->state[3][2] ^= D[2];
-	hd->state[4][2] ^= D[2];
-
-	/* Add the θ effect to the whole column */
-	hd->state[0][3] ^= D[3];
-	hd->state[1][3] ^= D[3];
-	hd->state[2][3] ^= D[3];
-	hd->state[3][3] ^= D[3];
-	hd->state[4][3] ^= D[3];
-
-	/* Add the θ effect to the whole column */
-	hd->state[0][4] ^= D[4];
-	hd->state[1][4] ^= D[4];
-	hd->state[2][4] ^= D[4];
-	hd->state[3][4] ^= D[4];
-	hd->state[4][4] ^= D[4];
-      }
-
-      {
-	/* ρ and π steps (see [Keccak Reference, Sections 2.3.3 and 2.3.4]) */
-	u64 current, temp;
-
-#define do_swap_n_rol(x, y, r) \
-  temp = hd->state[y][x]; \
-  hd->state[y][x] = rol64(current, r); \
-  current = temp;
-
-	/* Start at coordinates (1 0) */
-	current = hd->state[0][1];
-
-	/* Iterate over ((0 1)(2 3))^t * (1 0) for 0 ≤ t ≤ 23 */
-	do_swap_n_rol(0, 2, 1);
-	do_swap_n_rol(2, 1, 3);
-	do_swap_n_rol(1, 2, 6);
-	do_swap_n_rol(2, 3, 10);
-	do_swap_n_rol(3, 3, 15);
-	do_swap_n_rol(3, 0, 21);
-	do_swap_n_rol(0, 1, 28);
-	do_swap_n_rol(1, 3, 36);
-	do_swap_n_rol(3, 1, 45);
-	do_swap_n_rol(1, 4, 55);
-	do_swap_n_rol(4, 4, 2);
-	do_swap_n_rol(4, 0, 14);
-	do_swap_n_rol(0, 3, 27);
-	do_swap_n_rol(3, 4, 41);
-	do_swap_n_rol(4, 3, 56);
-	do_swap_n_rol(3, 2, 8);
-	do_swap_n_rol(2, 2, 25);
-	do_swap_n_rol(2, 0, 43);
-	do_swap_n_rol(0, 4, 62);
-	do_swap_n_rol(4, 2, 18);
-	do_swap_n_rol(2, 4, 39);
-	do_swap_n_rol(4, 1, 61);
-	do_swap_n_rol(1, 1, 20);
-	do_swap_n_rol(1, 0, 44);
-
-#undef do_swap_n_rol
-      }
-
-      {
-	/* χ step (see [Keccak Reference, Section 2.3.1]) */
-	u64 temp[5];
-
-#define do_x_step_for_plane(y) \
-  /* Take a copy of the plane */ \
-  temp[0] = hd->state[y][0]; \
-  temp[1] = hd->state[y][1]; \
-  temp[2] = hd->state[y][2]; \
-  temp[3] = hd->state[y][3]; \
-  temp[4] = hd->state[y][4]; \
-  \
-  /* Compute χ on the plane */ \
-  hd->state[y][0] = temp[0] ^ ((~temp[1]) & temp[2]); \
-  hd->state[y][1] = temp[1] ^ ((~temp[2]) & temp[3]); \
-  hd->state[y][2] = temp[2] ^ ((~temp[3]) & temp[4]); \
-  hd->state[y][3] = temp[3] ^ ((~temp[4]) & temp[0]); \
-  hd->state[y][4] = temp[4] ^ ((~temp[0]) & temp[1]);
-
-	do_x_step_for_plane(0);
-	do_x_step_for_plane(1);
-	do_x_step_for_plane(2);
-	do_x_step_for_plane(3);
-	do_x_step_for_plane(4);
-
-#undef do_x_step_for_plane
-      }
-
-      {
-	/* ι step (see [Keccak Reference, Section 2.3.5]) */
-
-	hd->state[0][0] ^= round_consts[round];
-      }
+      hd->u.state64[i] = le_bswap64(hd->u.state64[i]);
     }
 
-  return sizeof(void *) * 4 + sizeof(u64) * 10;
+  return 0;
 }
 
+#endif /* NEED_COMMON64 */
+
+
+#ifdef NEED_COMMON32BI
+
+static const u32 round_consts_32bit[2 * 24] =
+{
+  0x00000001UL, 0x00000000UL, 0x00000000UL, 0x00000089UL,
+  0x00000000UL, 0x8000008bUL, 0x00000000UL, 0x80008080UL,
+  0x00000001UL, 0x0000008bUL, 0x00000001UL, 0x00008000UL,
+  0x00000001UL, 0x80008088UL, 0x00000001UL, 0x80000082UL,
+  0x00000000UL, 0x0000000bUL, 0x00000000UL, 0x0000000aUL,
+  0x00000001UL, 0x00008082UL, 0x00000000UL, 0x00008003UL,
+  0x00000001UL, 0x0000808bUL, 0x00000001UL, 0x8000000bUL,
+  0x00000001UL, 0x8000008aUL, 0x00000001UL, 0x80000081UL,
+  0x00000000UL, 0x80000081UL, 0x00000000UL, 0x80000008UL,
+  0x00000000UL, 0x00000083UL, 0x00000000UL, 0x80008003UL,
+  0x00000001UL, 0x80008088UL, 0x00000000UL, 0x80000088UL,
+  0x00000001UL, 0x00008000UL, 0x00000000UL, 0x80008082UL
+};
 
 static unsigned int
-transform_blk (void *context, const unsigned char *data)
+keccak_extract_inplace32bi(KECCAK_STATE *hd, unsigned int outlen)
 {
-  KECCAK_CONTEXT *ctx = context;
-  KECCAK_STATE *hd = &ctx->state;
-  u64 *state = (u64 *)hd->state;
-  const size_t bsize = ctx->bctx.blocksize;
   unsigned int i;
+  u32 x0;
+  u32 x1;
+  u32 t;
+
+  for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+    {
+      x0 = hd->u.state32bi[i * 2 + 0];
+      x1 = hd->u.state32bi[i * 2 + 1];
+
+      t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+      x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+      x0 = t;
+      t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+      t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+      t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+      t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+      t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+      t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+      t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+      t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+
+      hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+      hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+    }
 
-  /* Absorb input block. */
-  for (i = 0; i < bsize / 8; i++)
-    state[i] ^= buf_get_le64(data + i * 8);
+  return 0;
+}
 
-  return keccak_f1600_state_permute(hd) + 4 * sizeof(void *);
+static inline void
+keccak_absorb_lane32bi(u32 *lane, u32 x0, u32 x1)
+{
+  u32 t;
+
+  t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+  t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+  t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+  t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+  t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+  t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+  t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+  t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+  lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+  lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
 }
 
+#endif /* NEED_COMMON32BI */
+
+
+/* Construct generic 64-bit implementation. */
+#ifdef USE_64BIT
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) (((x) << ((unsigned int)n & 63)) | \
+		      ((x) >> ((64 - (unsigned int)(n)) & 63)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
 
 static unsigned int
-transform (void *context, const unsigned char *data, size_t nblks)
+keccak_absorb_lanes64(KECCAK_STATE *hd, int pos, const byte *lanes,
+		      unsigned int nlanes, int blocklanes)
 {
-  KECCAK_CONTEXT *ctx = context;
-  const size_t bsize = ctx->bctx.blocksize;
-  unsigned int burn;
+  unsigned int burn = 0;
+
+  while (nlanes)
+    {
+      hd->u.state64[pos] ^= buf_get_le64(lanes);
+      lanes += 8;
+      nlanes--;
+
+      if (++pos == blocklanes)
+	{
+	  burn = keccak_f1600_state_permute64(hd);
+	  pos = 0;
+	}
+    }
+
+  return burn;
+}
+
+static const keccak_ops_t keccak_generic64_ops =
+{
+  .permute = keccak_f1600_state_permute64,
+  .absorb = keccak_absorb_lanes64,
+  .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT */
+
+
+/* Construct 64-bit Intel SHLD implementation. */
+#ifdef USE_64BIT_SHLD
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) ({ \
+			u64 tmp = (x); \
+			asm ("shldq %1, %0, %0" \
+			     : "+r" (tmp) \
+			     : "J" ((n) & 63)); \
+			tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_shld
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_shld(KECCAK_STATE *hd, int pos, const byte *lanes,
+			   unsigned int nlanes, int blocklanes)
+{
+  unsigned int burn = 0;
+
+  while (nlanes)
+    {
+      hd->u.state64[pos] ^= buf_get_le64(lanes);
+      lanes += 8;
+      nlanes--;
+
+      if (++pos == blocklanes)
+	{
+	  burn = keccak_f1600_state_permute64_shld(hd);
+	  pos = 0;
+	}
+    }
+
+  return burn;
+}
+
+static const keccak_ops_t keccak_shld_64_ops =
+{
+  .permute = keccak_f1600_state_permute64_shld,
+  .absorb = keccak_absorb_lanes64_shld,
+  .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_SHLD */
+
+
+/* Construct 64-bit Intel BMI2 implementation. */
+#ifdef USE_64BIT_BMI2
+
+# define ANDN64(x, y) ({ \
+			u64 tmp; \
+			asm ("andnq %2, %1, %0" \
+			     : "=r" (tmp) \
+			     : "r0" (x), "rm" (y)); \
+			tmp; })
+
+# define ROL64(x, n) ({ \
+			u64 tmp; \
+			asm ("rorxq %2, %1, %0" \
+			     : "=r" (tmp) \
+			     : "rm0" (x), "J" (64 - ((n) & 63))); \
+			tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_bmi2
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+			   unsigned int nlanes, int blocklanes)
+{
+  unsigned int burn = 0;
+
+  while (nlanes)
+    {
+      hd->u.state64[pos] ^= buf_get_le64(lanes);
+      lanes += 8;
+      nlanes--;
+
+      if (++pos == blocklanes)
+	{
+	  burn = keccak_f1600_state_permute64_bmi2(hd);
+	  pos = 0;
+	}
+    }
+
+  return burn;
+}
+
+static const keccak_ops_t keccak_bmi2_64_ops =
+{
+  .permute = keccak_f1600_state_permute64_bmi2,
+  .absorb = keccak_absorb_lanes64_bmi2,
+  .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_BMI2 */
+
+
+/* Construct generic 32-bit implementation. */
+#ifdef USE_32BIT
+
+# define ANDN32(x, y) (~(x) & (y))
+# define ROL32(x, n) (((x) << ((unsigned int)n & 31)) | \
+		      ((x) >> ((32 - (unsigned int)(n)) & 31)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes32bi(KECCAK_STATE *hd, int pos, const byte *lanes,
+		        unsigned int nlanes, int blocklanes)
+{
+  unsigned int burn = 0;
 
-  /* Absorb full blocks. */
-  do
+  while (nlanes)
     {
-      burn = transform_blk (context, data);
-      data += bsize;
+      keccak_absorb_lane32bi(&hd->u.state32bi[pos * 2],
+			     buf_get_le32(lanes + 0),
+			     buf_get_le32(lanes + 4));
+      lanes += 8;
+      nlanes--;
+
+      if (++pos == blocklanes)
+	{
+	  burn = keccak_f1600_state_permute32bi(hd);
+	  pos = 0;
+	}
     }
-  while (--nblks);
 
   return burn;
 }
 
+static const keccak_ops_t keccak_generic32bi_ops =
+{
+  .permute = keccak_f1600_state_permute32bi,
+  .absorb = keccak_absorb_lanes32bi,
+  .extract_inplace = keccak_extract_inplace32bi,
+};
+
+#endif /* USE_32BIT */
+
+
+/* Construct 32-bit Intel BMI2 implementation. */
+#ifdef USE_32BIT_BMI2
+
+# define ANDN32(x, y) ({ \
+			u32 tmp; \
+			asm ("andnl %2, %1, %0" \
+			     : "=r" (tmp) \
+			     : "r0" (x), "rm" (y)); \
+			tmp; })
+
+# define ROL32(x, n) ({ \
+			u32 tmp; \
+			asm ("rorxl %2, %1, %0" \
+			     : "=r" (tmp) \
+			     : "rm0" (x), "J" (32 - ((n) & 31))); \
+			tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi_bmi2
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static inline u32 pext(u32 x, u32 mask)
+{
+  u32 tmp;
+  asm ("pextl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+  return tmp;
+}
+
+static inline u32 pdep(u32 x, u32 mask)
+{
+  u32 tmp;
+  asm ("pdepl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+  return tmp;
+}
+
+static inline void
+keccak_absorb_lane32bi_bmi2(u32 *lane, u32 x0, u32 x1)
+{
+  x0 = pdep(pext(x0, 0x55555555), 0x0000ffff) | (pext(x0, 0xaaaaaaaa) << 16);
+  x1 = pdep(pext(x1, 0x55555555), 0x0000ffff) | (pext(x1, 0xaaaaaaaa) << 16);
+
+  lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+  lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
+}
+
+static unsigned int
+keccak_absorb_lanes32bi_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+		             unsigned int nlanes, int blocklanes)
+{
+  unsigned int burn = 0;
+
+  while (nlanes)
+    {
+      keccak_absorb_lane32bi_bmi2(&hd->u.state32bi[pos * 2],
+			          buf_get_le32(lanes + 0),
+			          buf_get_le32(lanes + 4));
+      lanes += 8;
+      nlanes--;
+
+      if (++pos == blocklanes)
+	{
+	  burn = keccak_f1600_state_permute32bi_bmi2(hd);
+	  pos = 0;
+	}
+    }
+
+  return burn;
+}
+
+static unsigned int
+keccak_extract_inplace32bi_bmi2(KECCAK_STATE *hd, unsigned int outlen)
+{
+  unsigned int i;
+  u32 x0;
+  u32 x1;
+  u32 t;
+
+  for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+    {
+      x0 = hd->u.state32bi[i * 2 + 0];
+      x1 = hd->u.state32bi[i * 2 + 1];
+
+      t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+      x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+      x0 = t;
+
+      x0 = pdep(pext(x0, 0xffff0001), 0xaaaaaaab) | pdep(x0 >> 1, 0x55555554);
+      x1 = pdep(pext(x1, 0xffff0001), 0xaaaaaaab) | pdep(x1 >> 1, 0x55555554);
+
+      hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+      hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+    }
+
+  return 0;
+}
+
+static const keccak_ops_t keccak_bmi2_32bi_ops =
+{
+  .permute = keccak_f1600_state_permute32bi_bmi2,
+  .absorb = keccak_absorb_lanes32bi_bmi2,
+  .extract_inplace = keccak_extract_inplace32bi_bmi2,
+};
+
+#endif /* USE_32BIT_BMI2 */
+
+
+static void
+keccak_write (void *context, const void *inbuf_arg, size_t inlen)
+{
+  KECCAK_CONTEXT *ctx = context;
+  const size_t bsize = ctx->blocksize;
+  const size_t blocklanes = bsize / 8;
+  const byte *inbuf = inbuf_arg;
+  unsigned int nburn, burn = 0;
+  unsigned int count, i;
+  unsigned int pos, nlanes;
+
+  count = ctx->count;
+
+  if (inlen && (count % 8))
+    {
+      byte lane[8] = { 0, };
+
+      /* Complete absorbing partial input lane. */
+
+      pos = count / 8;
+
+      for (i = count % 8; inlen && i < 8; i++)
+	{
+	  lane[i] = *inbuf++;
+	  inlen--;
+	  count++;
+	}
+
+      if (count == bsize)
+	count = 0;
+
+      nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1,
+			       (count % 8) ? -1 : blocklanes);
+      burn = nburn > burn ? nburn : burn;
+    }
+
+  /* Absorb full input lanes. */
+
+  pos = count / 8;
+  nlanes = inlen / 8;
+  if (nlanes > 0)
+    {
+      nburn = ctx->ops->absorb(&ctx->state, pos, inbuf, nlanes, blocklanes);
+      burn = nburn > burn ? nburn : burn;
+      inlen -= nlanes * 8;
+      inbuf += nlanes * 8;
+      count += nlanes * 8;
+      count = count % bsize;
+    }
+
+  if (inlen)
+    {
+      byte lane[8] = { 0, };
+
+      /* Absorb remaining partial input lane. */
+
+      pos = count / 8;
+
+      for (i = count % 8; inlen && i < 8; i++)
+	{
+	  lane[i] = *inbuf++;
+	  inlen--;
+	  count++;
+	}
+
+      nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1, -1);
+      burn = nburn > burn ? nburn : burn;
+
+      gcry_assert(count < bsize);
+    }
+
+  ctx->count = count;
+
+  if (burn)
+    _gcry_burn_stack (burn);
+}
+
 
 static void
 keccak_init (int algo, void *context, unsigned int flags)
@@ -267,29 +608,48 @@ keccak_init (int algo, void *context, unsigned int flags)
 
   memset (hd, 0, sizeof *hd);
 
-  ctx->bctx.nblocks = 0;
-  ctx->bctx.nblocks_high = 0;
-  ctx->bctx.count = 0;
-  ctx->bctx.bwrite = transform;
+  ctx->count = 0;
+
+  /* Select generic implementation. */
+#ifdef USE_64BIT
+  ctx->ops = &keccak_generic64_ops;
+#elif defined USE_32BIT
+  ctx->ops = &keccak_generic32bi_ops;
+#endif
+
+  /* Select optimized implementation based on hw features. */
+  if (0) {}
+#ifdef USE_64BIT_BMI2
+  else if (features & HWF_INTEL_BMI2)
+    ctx->ops = &keccak_bmi2_64_ops;
+#endif
+#ifdef USE_32BIT_BMI2
+  else if (features & HWF_INTEL_BMI2)
+    ctx->ops = &keccak_bmi2_32bi_ops;
+#endif
+#ifdef USE_64BIT_SHLD
+  else if (features & HWF_INTEL_FAST_SHLD)
+    ctx->ops = &keccak_shld_64_ops;
+#endif
 
   /* Set input block size, in Keccak terms this is called 'rate'. */
 
   switch (algo)
     {
     case GCRY_MD_SHA3_224:
-      ctx->bctx.blocksize = 1152 / 8;
+      ctx->blocksize = 1152 / 8;
       ctx->outlen = 224 / 8;
       break;
     case GCRY_MD_SHA3_256:
-      ctx->bctx.blocksize = 1088 / 8;
+      ctx->blocksize = 1088 / 8;
       ctx->outlen = 256 / 8;
       break;
     case GCRY_MD_SHA3_384:
-      ctx->bctx.blocksize = 832 / 8;
+      ctx->blocksize = 832 / 8;
       ctx->outlen = 384 / 8;
       break;
     case GCRY_MD_SHA3_512:
-      ctx->bctx.blocksize = 576 / 8;
+      ctx->blocksize = 576 / 8;
       ctx->outlen = 512 / 8;
       break;
     default:
@@ -334,59 +694,37 @@ keccak_final (void *context)
 {
   KECCAK_CONTEXT *ctx = context;
   KECCAK_STATE *hd = &ctx->state;
-  const size_t bsize = ctx->bctx.blocksize;
+  const size_t bsize = ctx->blocksize;
   const byte suffix = SHA3_DELIMITED_SUFFIX;
-  u64 *state = (u64 *)hd->state;
-  unsigned int stack_burn_depth;
+  unsigned int nburn, burn = 0;
   unsigned int lastbytes;
-  unsigned int i;
-  byte *buf;
+  byte lane[8];
 
-  _gcry_md_block_write (context, NULL, 0); /* flush */
-
-  buf = ctx->bctx.buf;
-  lastbytes = ctx->bctx.count;
-
-  /* Absorb remaining bytes. */
-  for (i = 0; i < lastbytes / 8; i++)
-    {
-      state[i] ^= buf_get_le64(buf);
-      buf += 8;
-    }
-
-  for (i = 0; i < lastbytes % 8; i++)
-    {
-      state[lastbytes / 8] ^= (u64)*buf << (i * 8);
-      buf++;
-    }
+  lastbytes = ctx->count;
 
   /* Do the padding and switch to the squeezing phase */
 
   /* Absorb the last few bits and add the first bit of padding (which
      coincides with the delimiter in delimited suffix) */
-  state[lastbytes / 8] ^= (u64)suffix << ((lastbytes % 8) * 8);
+  buf_put_le64(lane, (u64)suffix << ((lastbytes % 8) * 8));
+  nburn = ctx->ops->absorb(&ctx->state, lastbytes / 8, lane, 1, -1);
+  burn = nburn > burn ? nburn : burn;
 
   /* Add the second bit of padding. */
-  state[(bsize - 1) / 8] ^= (u64)0x80 << (((bsize - 1) % 8) * 8);
+  buf_put_le64(lane, (u64)0x80 << (((bsize - 1) % 8) * 8));
+  nburn = ctx->ops->absorb(&ctx->state, (bsize - 1) / 8, lane, 1, -1);
+  burn = nburn > burn ? nburn : burn;
 
   /* Switch to the squeezing phase. */
-  stack_burn_depth = keccak_f1600_state_permute(hd);
+  nburn = ctx->ops->permute(hd);
+  burn = nburn > burn ? nburn : burn;
 
   /* Squeeze out all the output blocks */
   if (ctx->outlen < bsize)
     {
       /* Output SHA3 digest. */
-      buf = ctx->bctx.buf;
-      for (i = 0; i < ctx->outlen / 8; i++)
-	{
-	  buf_put_le64(buf, state[i]);
-	  buf += 8;
-	}
-      for (i = 0; i < ctx->outlen % 8; i++)
-	{
-	  *buf = state[ctx->outlen / 8] >> (i * 8);
-	  buf++;
-	}
+      nburn = ctx->ops->extract_inplace(hd, ctx->outlen);
+      burn = nburn > burn ? nburn : burn;
     }
   else
     {
@@ -394,15 +732,18 @@ keccak_final (void *context)
       BUG();
     }
 
-  _gcry_burn_stack (stack_burn_depth);
+  wipememory(lane, sizeof(lane));
+  if (burn)
+    _gcry_burn_stack (burn);
 }
 
 
 static byte *
 keccak_read (void *context)
 {
-  KECCAK_CONTEXT *hd = (KECCAK_CONTEXT *) context;
-  return hd->bctx.buf;
+  KECCAK_CONTEXT *ctx = (KECCAK_CONTEXT *) context;
+  KECCAK_STATE *hd = &ctx->state;
+  return (byte *)&hd->u;
 }
 
 
@@ -585,7 +926,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_224 =
   {
     GCRY_MD_SHA3_224, {0, 1},
     "SHA3-224", sha3_224_asn, DIM (sha3_224_asn), oid_spec_sha3_224, 28,
-    sha3_224_init, _gcry_md_block_write, keccak_final, keccak_read,
+    sha3_224_init, keccak_write, keccak_final, keccak_read,
     sizeof (KECCAK_CONTEXT),
     run_selftests
   };
@@ -593,7 +934,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_256 =
   {
     GCRY_MD_SHA3_256, {0, 1},
     "SHA3-256", sha3_256_asn, DIM (sha3_256_asn), oid_spec_sha3_256, 32,
-    sha3_256_init, _gcry_md_block_write, keccak_final, keccak_read,
+    sha3_256_init, keccak_write, keccak_final, keccak_read,
     sizeof (KECCAK_CONTEXT),
     run_selftests
   };
@@ -601,7 +942,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_384 =
   {
     GCRY_MD_SHA3_384, {0, 1},
     "SHA3-384", sha3_384_asn, DIM (sha3_384_asn), oid_spec_sha3_384, 48,
-    sha3_384_init, _gcry_md_block_write, keccak_final, keccak_read,
+    sha3_384_init, keccak_write, keccak_final, keccak_read,
     sizeof (KECCAK_CONTEXT),
     run_selftests
   };
@@ -609,7 +950,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_512 =
   {
     GCRY_MD_SHA3_512, {0, 1},
     "SHA3-512", sha3_512_asn, DIM (sha3_512_asn), oid_spec_sha3_512, 64,
-    sha3_512_init, _gcry_md_block_write, keccak_final, keccak_read,
+    sha3_512_init, keccak_write, keccak_final, keccak_read,
     sizeof (KECCAK_CONTEXT),
     run_selftests
   };
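
Note on the "second bit of padding" step in keccak_final: the closing
0x80 byte belongs in the last byte of the rate block, at byte offset
bsize - 1, and the patch builds it as a one-lane buffer that is absorbed
at lane index (bsize - 1) / 8. A minimal sketch of that index arithmetic
follows. The helper names are hypothetical and not part of the patch, and
the rates in the usage note (144, 136, 104 and 72 bytes for
SHA3-224/256/384/512) are the standard SHA-3 values, which this hunk does
not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the 64-bit lane that holds byte
 * (bsize - 1) of the rate block. */
static unsigned int
pad_lane_index (unsigned int bsize)
{
  assert (bsize >= 1 && bsize <= 200);
  return (bsize - 1) / 8;
}

/* Hypothetical helper: lane value with 0x80 placed in the byte that
 * corresponds to offset (bsize - 1), viewing the lane as little-endian. */
static uint64_t
pad_lane_value (unsigned int bsize)
{
  assert (bsize >= 1 && bsize <= 200);
  return (uint64_t)0x80 << (((bsize - 1) % 8) * 8);
}
```

With bsize = 136, pad_lane_index gives lane 16 and pad_lane_value sets
only the top byte of that lane. Since all four SHA-3 rates are multiples
of 8, the shift is always 56 for these digests.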
diff --git a/cipher/keccak_permute_32.h b/cipher/keccak_permute_32.h
new file mode 100644
index 0000000..fed9383
--- /dev/null
+++ b/cipher/keccak_permute_32.h
@@ -0,0 +1,535 @@
+/* keccak_permute_32.h - Keccak permute function (simple 32bit bit-interleaved)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple32bi/
+ * Keccak-simple32BI.c" implementation by Ronny Van Keer from SUPERCOP toolkit
+ * package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+  const u32 *round_consts = round_consts_32bit;
+  u32 Aba0, Abe0, Abi0, Abo0, Abu0;
+  u32 Aba1, Abe1, Abi1, Abo1, Abu1;
+  u32 Aga0, Age0, Agi0, Ago0, Agu0;
+  u32 Aga1, Age1, Agi1, Ago1, Agu1;
+  u32 Aka0, Ake0, Aki0, Ako0, Aku0;
+  u32 Aka1, Ake1, Aki1, Ako1, Aku1;
+  u32 Ama0, Ame0, Ami0, Amo0, Amu0;
+  u32 Ama1, Ame1, Ami1, Amo1, Amu1;
+  u32 Asa0, Ase0, Asi0, Aso0, Asu0;
+  u32 Asa1, Ase1, Asi1, Aso1, Asu1;
+  u32 BCa0, BCe0, BCi0, BCo0, BCu0;
+  u32 BCa1, BCe1, BCi1, BCo1, BCu1;
+  u32 Da0, De0, Di0, Do0, Du0;
+  u32 Da1, De1, Di1, Do1, Du1;
+  u32 Eba0, Ebe0, Ebi0, Ebo0, Ebu0;
+  u32 Eba1, Ebe1, Ebi1, Ebo1, Ebu1;
+  u32 Ega0, Ege0, Egi0, Ego0, Egu0;
+  u32 Ega1, Ege1, Egi1, Ego1, Egu1;
+  u32 Eka0, Eke0, Eki0, Eko0, Eku0;
+  u32 Eka1, Eke1, Eki1, Eko1, Eku1;
+  u32 Ema0, Eme0, Emi0, Emo0, Emu0;
+  u32 Ema1, Eme1, Emi1, Emo1, Emu1;
+  u32 Esa0, Ese0, Esi0, Eso0, Esu0;
+  u32 Esa1, Ese1, Esi1, Eso1, Esu1;
+  u32 *state = hd->u.state32bi;
+  unsigned int round;
+
+  Aba0 = state[0];
+  Aba1 = state[1];
+  Abe0 = state[2];
+  Abe1 = state[3];
+  Abi0 = state[4];
+  Abi1 = state[5];
+  Abo0 = state[6];
+  Abo1 = state[7];
+  Abu0 = state[8];
+  Abu1 = state[9];
+  Aga0 = state[10];
+  Aga1 = state[11];
+  Age0 = state[12];
+  Age1 = state[13];
+  Agi0 = state[14];
+  Agi1 = state[15];
+  Ago0 = state[16];
+  Ago1 = state[17];
+  Agu0 = state[18];
+  Agu1 = state[19];
+  Aka0 = state[20];
+  Aka1 = state[21];
+  Ake0 = state[22];
+  Ake1 = state[23];
+  Aki0 = state[24];
+  Aki1 = state[25];
+  Ako0 = state[26];
+  Ako1 = state[27];
+  Aku0 = state[28];
+  Aku1 = state[29];
+  Ama0 = state[30];
+  Ama1 = state[31];
+  Ame0 = state[32];
+  Ame1 = state[33];
+  Ami0 = state[34];
+  Ami1 = state[35];
+  Amo0 = state[36];
+  Amo1 = state[37];
+  Amu0 = state[38];
+  Amu1 = state[39];
+  Asa0 = state[40];
+  Asa1 = state[41];
+  Ase0 = state[42];
+  Ase1 = state[43];
+  Asi0 = state[44];
+  Asi1 = state[45];
+  Aso0 = state[46];
+  Aso1 = state[47];
+  Asu0 = state[48];
+  Asu1 = state[49];
+
+  for (round = 0; round < 24; round += 2)
+    {
+      /* prepareTheta */
+      BCa0 = Aba0 ^ Aga0 ^ Aka0 ^ Ama0 ^ Asa0;
+      BCa1 = Aba1 ^ Aga1 ^ Aka1 ^ Ama1 ^ Asa1;
+      BCe0 = Abe0 ^ Age0 ^ Ake0 ^ Ame0 ^ Ase0;
+      BCe1 = Abe1 ^ Age1 ^ Ake1 ^ Ame1 ^ Ase1;
+      BCi0 = Abi0 ^ Agi0 ^ Aki0 ^ Ami0 ^ Asi0;
+      BCi1 = Abi1 ^ Agi1 ^ Aki1 ^ Ami1 ^ Asi1;
+      BCo0 = Abo0 ^ Ago0 ^ Ako0 ^ Amo0 ^ Aso0;
+      BCo1 = Abo1 ^ Ago1 ^ Ako1 ^ Amo1 ^ Aso1;
+      BCu0 = Abu0 ^ Agu0 ^ Aku0 ^ Amu0 ^ Asu0;
+      BCu1 = Abu1 ^ Agu1 ^ Aku1 ^ Amu1 ^ Asu1;
+
+      /* thetaRhoPiChiIota(round  , A, E) */
+      Da0 = BCu0 ^ ROL32(BCe1, 1);
+      Da1 = BCu1 ^ BCe0;
+      De0 = BCa0 ^ ROL32(BCi1, 1);
+      De1 = BCa1 ^ BCi0;
+      Di0 = BCe0 ^ ROL32(BCo1, 1);
+      Di1 = BCe1 ^ BCo0;
+      Do0 = BCi0 ^ ROL32(BCu1, 1);
+      Do1 = BCi1 ^ BCu0;
+      Du0 = BCo0 ^ ROL32(BCa1, 1);
+      Du1 = BCo1 ^ BCa0;
+
+      Aba0 ^= Da0;
+      BCa0 = Aba0;
+      Age0 ^= De0;
+      BCe0 = ROL32(Age0, 22);
+      Aki1 ^= Di1;
+      BCi0 = ROL32(Aki1, 22);
+      Amo1 ^= Do1;
+      BCo0 = ROL32(Amo1, 11);
+      Asu0 ^= Du0;
+      BCu0 = ROL32(Asu0, 7);
+      Eba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Eba0 ^= round_consts[round * 2 + 0];
+      Ebe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Ebi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Ebo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Ebu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Aba1 ^= Da1;
+      BCa1 = Aba1;
+      Age1 ^= De1;
+      BCe1 = ROL32(Age1, 22);
+      Aki0 ^= Di0;
+      BCi1 = ROL32(Aki0, 21);
+      Amo0 ^= Do0;
+      BCo1 = ROL32(Amo0, 10);
+      Asu1 ^= Du1;
+      BCu1 = ROL32(Asu1, 7);
+      Eba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Eba1 ^= round_consts[round * 2 + 1];
+      Ebe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Ebi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Ebo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Ebu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Abo0 ^= Do0;
+      BCa0 = ROL32(Abo0, 14);
+      Agu0 ^= Du0;
+      BCe0 = ROL32(Agu0, 10);
+      Aka1 ^= Da1;
+      BCi0 = ROL32(Aka1, 2);
+      Ame1 ^= De1;
+      BCo0 = ROL32(Ame1, 23);
+      Asi1 ^= Di1;
+      BCu0 = ROL32(Asi1, 31);
+      Ega0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Ege0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Egi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Ego0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Egu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Abo1 ^= Do1;
+      BCa1 = ROL32(Abo1, 14);
+      Agu1 ^= Du1;
+      BCe1 = ROL32(Agu1, 10);
+      Aka0 ^= Da0;
+      BCi1 = ROL32(Aka0, 1);
+      Ame0 ^= De0;
+      BCo1 = ROL32(Ame0, 22);
+      Asi0 ^= Di0;
+      BCu1 = ROL32(Asi0, 30);
+      Ega1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Ege1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Egi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Ego1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Egu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Abe1 ^= De1;
+      BCa0 = ROL32(Abe1, 1);
+      Agi0 ^= Di0;
+      BCe0 = ROL32(Agi0, 3);
+      Ako1 ^= Do1;
+      BCi0 = ROL32(Ako1, 13);
+      Amu0 ^= Du0;
+      BCo0 = ROL32(Amu0, 4);
+      Asa0 ^= Da0;
+      BCu0 = ROL32(Asa0, 9);
+      Eka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Eke0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Eki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Eko0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Eku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Abe0 ^= De0;
+      BCa1 = Abe0;
+      Agi1 ^= Di1;
+      BCe1 = ROL32(Agi1, 3);
+      Ako0 ^= Do0;
+      BCi1 = ROL32(Ako0, 12);
+      Amu1 ^= Du1;
+      BCo1 = ROL32(Amu1, 4);
+      Asa1 ^= Da1;
+      BCu1 = ROL32(Asa1, 9);
+      Eka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Eke1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Eki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Eko1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Eku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Abu1 ^= Du1;
+      BCa0 = ROL32(Abu1, 14);
+      Aga0 ^= Da0;
+      BCe0 = ROL32(Aga0, 18);
+      Ake0 ^= De0;
+      BCi0 = ROL32(Ake0, 5);
+      Ami1 ^= Di1;
+      BCo0 = ROL32(Ami1, 8);
+      Aso0 ^= Do0;
+      BCu0 = ROL32(Aso0, 28);
+      Ema0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Eme0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Emi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Emo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Emu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Abu0 ^= Du0;
+      BCa1 = ROL32(Abu0, 13);
+      Aga1 ^= Da1;
+      BCe1 = ROL32(Aga1, 18);
+      Ake1 ^= De1;
+      BCi1 = ROL32(Ake1, 5);
+      Ami0 ^= Di0;
+      BCo1 = ROL32(Ami0, 7);
+      Aso1 ^= Do1;
+      BCu1 = ROL32(Aso1, 28);
+      Ema1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Eme1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Emi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Emo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Emu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Abi0 ^= Di0;
+      BCa0 = ROL32(Abi0, 31);
+      Ago1 ^= Do1;
+      BCe0 = ROL32(Ago1, 28);
+      Aku1 ^= Du1;
+      BCi0 = ROL32(Aku1, 20);
+      Ama1 ^= Da1;
+      BCo0 = ROL32(Ama1, 21);
+      Ase0 ^= De0;
+      BCu0 = ROL32(Ase0, 1);
+      Esa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Ese0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Esi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Eso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Esu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Abi1 ^= Di1;
+      BCa1 = ROL32(Abi1, 31);
+      Ago0 ^= Do0;
+      BCe1 = ROL32(Ago0, 27);
+      Aku0 ^= Du0;
+      BCi1 = ROL32(Aku0, 19);
+      Ama0 ^= Da0;
+      BCo1 = ROL32(Ama0, 20);
+      Ase1 ^= De1;
+      BCu1 = ROL32(Ase1, 1);
+      Esa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Ese1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Esi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Eso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Esu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      /* prepareTheta */
+      BCa0 = Eba0 ^ Ega0 ^ Eka0 ^ Ema0 ^ Esa0;
+      BCa1 = Eba1 ^ Ega1 ^ Eka1 ^ Ema1 ^ Esa1;
+      BCe0 = Ebe0 ^ Ege0 ^ Eke0 ^ Eme0 ^ Ese0;
+      BCe1 = Ebe1 ^ Ege1 ^ Eke1 ^ Eme1 ^ Ese1;
+      BCi0 = Ebi0 ^ Egi0 ^ Eki0 ^ Emi0 ^ Esi0;
+      BCi1 = Ebi1 ^ Egi1 ^ Eki1 ^ Emi1 ^ Esi1;
+      BCo0 = Ebo0 ^ Ego0 ^ Eko0 ^ Emo0 ^ Eso0;
+      BCo1 = Ebo1 ^ Ego1 ^ Eko1 ^ Emo1 ^ Eso1;
+      BCu0 = Ebu0 ^ Egu0 ^ Eku0 ^ Emu0 ^ Esu0;
+      BCu1 = Ebu1 ^ Egu1 ^ Eku1 ^ Emu1 ^ Esu1;
+
+      /* thetaRhoPiChiIota(round+1, E, A) */
+      Da0 = BCu0 ^ ROL32(BCe1, 1);
+      Da1 = BCu1 ^ BCe0;
+      De0 = BCa0 ^ ROL32(BCi1, 1);
+      De1 = BCa1 ^ BCi0;
+      Di0 = BCe0 ^ ROL32(BCo1, 1);
+      Di1 = BCe1 ^ BCo0;
+      Do0 = BCi0 ^ ROL32(BCu1, 1);
+      Do1 = BCi1 ^ BCu0;
+      Du0 = BCo0 ^ ROL32(BCa1, 1);
+      Du1 = BCo1 ^ BCa0;
+
+      Eba0 ^= Da0;
+      BCa0 = Eba0;
+      Ege0 ^= De0;
+      BCe0 = ROL32(Ege0, 22);
+      Eki1 ^= Di1;
+      BCi0 = ROL32(Eki1, 22);
+      Emo1 ^= Do1;
+      BCo0 = ROL32(Emo1, 11);
+      Esu0 ^= Du0;
+      BCu0 = ROL32(Esu0, 7);
+      Aba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Aba0 ^= round_consts[round * 2 + 2];
+      Abe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Abi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Abo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Abu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Eba1 ^= Da1;
+      BCa1 = Eba1;
+      Ege1 ^= De1;
+      BCe1 = ROL32(Ege1, 22);
+      Eki0 ^= Di0;
+      BCi1 = ROL32(Eki0, 21);
+      Emo0 ^= Do0;
+      BCo1 = ROL32(Emo0, 10);
+      Esu1 ^= Du1;
+      BCu1 = ROL32(Esu1, 7);
+      Aba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Aba1 ^= round_consts[round * 2 + 3];
+      Abe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Abi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Abo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Abu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Ebo0 ^= Do0;
+      BCa0 = ROL32(Ebo0, 14);
+      Egu0 ^= Du0;
+      BCe0 = ROL32(Egu0, 10);
+      Eka1 ^= Da1;
+      BCi0 = ROL32(Eka1, 2);
+      Eme1 ^= De1;
+      BCo0 = ROL32(Eme1, 23);
+      Esi1 ^= Di1;
+      BCu0 = ROL32(Esi1, 31);
+      Aga0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Age0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Agi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Ago0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Agu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Ebo1 ^= Do1;
+      BCa1 = ROL32(Ebo1, 14);
+      Egu1 ^= Du1;
+      BCe1 = ROL32(Egu1, 10);
+      Eka0 ^= Da0;
+      BCi1 = ROL32(Eka0, 1);
+      Eme0 ^= De0;
+      BCo1 = ROL32(Eme0, 22);
+      Esi0 ^= Di0;
+      BCu1 = ROL32(Esi0, 30);
+      Aga1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Age1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Agi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Ago1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Agu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Ebe1 ^= De1;
+      BCa0 = ROL32(Ebe1, 1);
+      Egi0 ^= Di0;
+      BCe0 = ROL32(Egi0, 3);
+      Eko1 ^= Do1;
+      BCi0 = ROL32(Eko1, 13);
+      Emu0 ^= Du0;
+      BCo0 = ROL32(Emu0, 4);
+      Esa0 ^= Da0;
+      BCu0 = ROL32(Esa0, 9);
+      Aka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Ake0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Aki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Ako0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Aku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Ebe0 ^= De0;
+      BCa1 = Ebe0;
+      Egi1 ^= Di1;
+      BCe1 = ROL32(Egi1, 3);
+      Eko0 ^= Do0;
+      BCi1 = ROL32(Eko0, 12);
+      Emu1 ^= Du1;
+      BCo1 = ROL32(Emu1, 4);
+      Esa1 ^= Da1;
+      BCu1 = ROL32(Esa1, 9);
+      Aka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Ake1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Aki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Ako1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Aku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Ebu1 ^= Du1;
+      BCa0 = ROL32(Ebu1, 14);
+      Ega0 ^= Da0;
+      BCe0 = ROL32(Ega0, 18);
+      Eke0 ^= De0;
+      BCi0 = ROL32(Eke0, 5);
+      Emi1 ^= Di1;
+      BCo0 = ROL32(Emi1, 8);
+      Eso0 ^= Do0;
+      BCu0 = ROL32(Eso0, 28);
+      Ama0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Ame0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Ami0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Amo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Amu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Ebu0 ^= Du0;
+      BCa1 = ROL32(Ebu0, 13);
+      Ega1 ^= Da1;
+      BCe1 = ROL32(Ega1, 18);
+      Eke1 ^= De1;
+      BCi1 = ROL32(Eke1, 5);
+      Emi0 ^= Di0;
+      BCo1 = ROL32(Emi0, 7);
+      Eso1 ^= Do1;
+      BCu1 = ROL32(Eso1, 28);
+      Ama1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Ame1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Ami1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Amo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Amu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+      Ebi0 ^= Di0;
+      BCa0 = ROL32(Ebi0, 31);
+      Ego1 ^= Do1;
+      BCe0 = ROL32(Ego1, 28);
+      Eku1 ^= Du1;
+      BCi0 = ROL32(Eku1, 20);
+      Ema1 ^= Da1;
+      BCo0 = ROL32(Ema1, 21);
+      Ese0 ^= De0;
+      BCu0 = ROL32(Ese0, 1);
+      Asa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+      Ase0 = BCe0 ^ ANDN32(BCi0, BCo0);
+      Asi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+      Aso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+      Asu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+      Ebi1 ^= Di1;
+      BCa1 = ROL32(Ebi1, 31);
+      Ego0 ^= Do0;
+      BCe1 = ROL32(Ego0, 27);
+      Eku0 ^= Du0;
+      BCi1 = ROL32(Eku0, 19);
+      Ema0 ^= Da0;
+      BCo1 = ROL32(Ema0, 20);
+      Ese1 ^= De1;
+      BCu1 = ROL32(Ese1, 1);
+      Asa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+      Ase1 = BCe1 ^ ANDN32(BCi1, BCo1);
+      Asi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+      Aso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+      Asu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+    }
+
+  state[0] = Aba0;
+  state[1] = Aba1;
+  state[2] = Abe0;
+  state[3] = Abe1;
+  state[4] = Abi0;
+  state[5] = Abi1;
+  state[6] = Abo0;
+  state[7] = Abo1;
+  state[8] = Abu0;
+  state[9] = Abu1;
+  state[10] = Aga0;
+  state[11] = Aga1;
+  state[12] = Age0;
+  state[13] = Age1;
+  state[14] = Agi0;
+  state[15] = Agi1;
+  state[16] = Ago0;
+  state[17] = Ago1;
+  state[18] = Agu0;
+  state[19] = Agu1;
+  state[20] = Aka0;
+  state[21] = Aka1;
+  state[22] = Ake0;
+  state[23] = Ake1;
+  state[24] = Aki0;
+  state[25] = Aki1;
+  state[26] = Ako0;
+  state[27] = Ako1;
+  state[28] = Aku0;
+  state[29] = Aku1;
+  state[30] = Ama0;
+  state[31] = Ama1;
+  state[32] = Ame0;
+  state[33] = Ame1;
+  state[34] = Ami0;
+  state[35] = Ami1;
+  state[36] = Amo0;
+  state[37] = Amo1;
+  state[38] = Amu0;
+  state[39] = Amu1;
+  state[40] = Asa0;
+  state[41] = Asa1;
+  state[42] = Ase0;
+  state[43] = Ase1;
+  state[44] = Asi0;
+  state[45] = Asi1;
+  state[46] = Aso0;
+  state[47] = Aso1;
+  state[48] = Asu0;
+  state[49] = Asu1;
+
+  return sizeof(void *) * 4 + sizeof(u32) * 12 * 5 * 2;
+}
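
Note on the rotation counts in keccak_permute_32.h: in a bit-interleaved
layout a 64-bit rotate turns into two 32-bit rotates, with the two words
swapping roles when the rotate count is odd. The sketch below assumes that
word 0 holds the even-numbered bits of a lane and word 1 the odd-numbered
bits. That layout is inferred from the theta lines above (Da0/Da1) and is
not confirmed by the absorb code, which lies outside this section. All
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
rol32 (uint32_t x, unsigned int n)
{
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

static uint64_t
rol64 (uint64_t x, unsigned int n)
{
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Split a lane: even-numbered bits go to w[0], odd-numbered to w[1]. */
static void
interleave (uint64_t lane, uint32_t w[2])
{
  unsigned int j;

  w[0] = w[1] = 0;
  for (j = 0; j < 32; j++)
    {
      w[0] |= (uint32_t)((lane >> (2 * j)) & 1) << j;
      w[1] |= (uint32_t)((lane >> (2 * j + 1)) & 1) << j;
    }
}

/* Inverse of interleave(). */
static uint64_t
deinterleave (const uint32_t w[2])
{
  uint64_t lane = 0;
  unsigned int j;

  for (j = 0; j < 32; j++)
    {
      lane |= (uint64_t)((w[0] >> j) & 1) << (2 * j);
      lane |= (uint64_t)((w[1] >> j) & 1) << (2 * j + 1);
    }
  return lane;
}

/* Rotate an interleaved lane left by n (0..63) with 32-bit rotates only.
 * Even n = 2k: both words rotate by k.
 * Odd n = 2k+1: even word <- ROL32(odd, k+1), odd word <- ROL32(even, k). */
static void
rol64_bi (const uint32_t in[2], unsigned int n, uint32_t out[2])
{
  uint32_t even = in[0], odd = in[1];
  unsigned int k = n / 2;

  assert (n < 64);
  if (n % 2 == 0)
    {
      out[0] = rol32 (even, k);
      out[1] = rol32 (odd, k);
    }
  else
    {
      out[0] = rol32 (odd, k + 1);
      out[1] = rol32 (even, k);
    }
}

/* Returns 1 if the interleaved rotate agrees with a plain 64-bit rotate. */
static int
rol64_bi_check (uint64_t lane, unsigned int n)
{
  uint32_t a[2], b[2];

  interleave (lane, a);
  rol64_bi (a, n, b);
  return deinterleave (b) == rol64 (lane, n);
}
```

This is consistent with the constants in the two files: ROL64(Aki, 43) in
the 64-bit code corresponds to BCi0 = ROL32(Aki1, 22) and
BCi1 = ROL32(Aki0, 21) here, while the even count in ROL64(Age, 44)
becomes ROL32 by 22 on both halves.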
diff --git a/cipher/keccak_permute_64.h b/cipher/keccak_permute_64.h
new file mode 100644
index 0000000..1264f19
--- /dev/null
+++ b/cipher/keccak_permute_64.h
@@ -0,0 +1,290 @@
+/* keccak_permute_64.h - Keccak permute function (simple 64bit)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple/Keccak-simple.c"
+ * implementation by Ronny Van Keer from SUPERCOP toolkit package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+  const u64 *round_consts = round_consts_64bit;
+  u64 Aba, Abe, Abi, Abo, Abu;
+  u64 Aga, Age, Agi, Ago, Agu;
+  u64 Aka, Ake, Aki, Ako, Aku;
+  u64 Ama, Ame, Ami, Amo, Amu;
+  u64 Asa, Ase, Asi, Aso, Asu;
+  u64 BCa, BCe, BCi, BCo, BCu;
+  u64 Da, De, Di, Do, Du;
+  u64 Eba, Ebe, Ebi, Ebo, Ebu;
+  u64 Ega, Ege, Egi, Ego, Egu;
+  u64 Eka, Eke, Eki, Eko, Eku;
+  u64 Ema, Eme, Emi, Emo, Emu;
+  u64 Esa, Ese, Esi, Eso, Esu;
+  u64 *state = hd->u.state64;
+  unsigned int round;
+
+  Aba = state[0];
+  Abe = state[1];
+  Abi = state[2];
+  Abo = state[3];
+  Abu = state[4];
+  Aga = state[5];
+  Age = state[6];
+  Agi = state[7];
+  Ago = state[8];
+  Agu = state[9];
+  Aka = state[10];
+  Ake = state[11];
+  Aki = state[12];
+  Ako = state[13];
+  Aku = state[14];
+  Ama = state[15];
+  Ame = state[16];
+  Ami = state[17];
+  Amo = state[18];
+  Amu = state[19];
+  Asa = state[20];
+  Ase = state[21];
+  Asi = state[22];
+  Aso = state[23];
+  Asu = state[24];
+
+  for (round = 0; round < 24; round += 2)
+    {
+      /* prepareTheta */
+      BCa = Aba ^ Aga ^ Aka ^ Ama ^ Asa;
+      BCe = Abe ^ Age ^ Ake ^ Ame ^ Ase;
+      BCi = Abi ^ Agi ^ Aki ^ Ami ^ Asi;
+      BCo = Abo ^ Ago ^ Ako ^ Amo ^ Aso;
+      BCu = Abu ^ Agu ^ Aku ^ Amu ^ Asu;
+
+      /* thetaRhoPiChiIotaPrepareTheta(round  , A, E) */
+      Da = BCu ^ ROL64(BCe, 1);
+      De = BCa ^ ROL64(BCi, 1);
+      Di = BCe ^ ROL64(BCo, 1);
+      Do = BCi ^ ROL64(BCu, 1);
+      Du = BCo ^ ROL64(BCa, 1);
+
+      Aba ^= Da;
+      BCa = Aba;
+      Age ^= De;
+      BCe = ROL64(Age, 44);
+      Aki ^= Di;
+      BCi = ROL64(Aki, 43);
+      Amo ^= Do;
+      BCo = ROL64(Amo, 21);
+      Asu ^= Du;
+      BCu = ROL64(Asu, 14);
+      Eba = BCa ^ ANDN64(BCe, BCi);
+      Eba ^= (u64)round_consts[round];
+      Ebe = BCe ^ ANDN64(BCi, BCo);
+      Ebi = BCi ^ ANDN64(BCo, BCu);
+      Ebo = BCo ^ ANDN64(BCu, BCa);
+      Ebu = BCu ^ ANDN64(BCa, BCe);
+
+      Abo ^= Do;
+      BCa = ROL64(Abo, 28);
+      Agu ^= Du;
+      BCe = ROL64(Agu, 20);
+      Aka ^= Da;
+      BCi = ROL64(Aka, 3);
+      Ame ^= De;
+      BCo = ROL64(Ame, 45);
+      Asi ^= Di;
+      BCu = ROL64(Asi, 61);
+      Ega = BCa ^ ANDN64(BCe, BCi);
+      Ege = BCe ^ ANDN64(BCi, BCo);
+      Egi = BCi ^ ANDN64(BCo, BCu);
+      Ego = BCo ^ ANDN64(BCu, BCa);
+      Egu = BCu ^ ANDN64(BCa, BCe);
+
+      Abe ^= De;
+      BCa = ROL64(Abe, 1);
+      Agi ^= Di;
+      BCe = ROL64(Agi, 6);
+      Ako ^= Do;
+      BCi = ROL64(Ako, 25);
+      Amu ^= Du;
+      BCo = ROL64(Amu, 8);
+      Asa ^= Da;
+      BCu = ROL64(Asa, 18);
+      Eka = BCa ^ ANDN64(BCe, BCi);
+      Eke = BCe ^ ANDN64(BCi, BCo);
+      Eki = BCi ^ ANDN64(BCo, BCu);
+      Eko = BCo ^ ANDN64(BCu, BCa);
+      Eku = BCu ^ ANDN64(BCa, BCe);
+
+      Abu ^= Du;
+      BCa = ROL64(Abu, 27);
+      Aga ^= Da;
+      BCe = ROL64(Aga, 36);
+      Ake ^= De;
+      BCi = ROL64(Ake, 10);
+      Ami ^= Di;
+      BCo = ROL64(Ami, 15);
+      Aso ^= Do;
+      BCu = ROL64(Aso, 56);
+      Ema = BCa ^ ANDN64(BCe, BCi);
+      Eme = BCe ^ ANDN64(BCi, BCo);
+      Emi = BCi ^ ANDN64(BCo, BCu);
+      Emo = BCo ^ ANDN64(BCu, BCa);
+      Emu = BCu ^ ANDN64(BCa, BCe);
+
+      Abi ^= Di;
+      BCa = ROL64(Abi, 62);
+      Ago ^= Do;
+      BCe = ROL64(Ago, 55);
+      Aku ^= Du;
+      BCi = ROL64(Aku, 39);
+      Ama ^= Da;
+      BCo = ROL64(Ama, 41);
+      Ase ^= De;
+      BCu = ROL64(Ase, 2);
+      Esa = BCa ^ ANDN64(BCe, BCi);
+      Ese = BCe ^ ANDN64(BCi, BCo);
+      Esi = BCi ^ ANDN64(BCo, BCu);
+      Eso = BCo ^ ANDN64(BCu, BCa);
+      Esu = BCu ^ ANDN64(BCa, BCe);
+
+      /* prepareTheta */
+      BCa = Eba ^ Ega ^ Eka ^ Ema ^ Esa;
+      BCe = Ebe ^ Ege ^ Eke ^ Eme ^ Ese;
+      BCi = Ebi ^ Egi ^ Eki ^ Emi ^ Esi;
+      BCo = Ebo ^ Ego ^ Eko ^ Emo ^ Eso;
+      BCu = Ebu ^ Egu ^ Eku ^ Emu ^ Esu;
+
+      /* thetaRhoPiChiIotaPrepareTheta(round+1, E, A) */
+      Da = BCu ^ ROL64(BCe, 1);
+      De = BCa ^ ROL64(BCi, 1);
+      Di = BCe ^ ROL64(BCo, 1);
+      Do = BCi ^ ROL64(BCu, 1);
+      Du = BCo ^ ROL64(BCa, 1);
+
+      Eba ^= Da;
+      BCa = Eba;
+      Ege ^= De;
+      BCe = ROL64(Ege, 44);
+      Eki ^= Di;
+      BCi = ROL64(Eki, 43);
+      Emo ^= Do;
+      BCo = ROL64(Emo, 21);
+      Esu ^= Du;
+      BCu = ROL64(Esu, 14);
+      Aba = BCa ^ ANDN64(BCe, BCi);
+      Aba ^= (u64)round_consts[round + 1];
+      Abe = BCe ^ ANDN64(BCi, BCo);
+      Abi = BCi ^ ANDN64(BCo, BCu);
+      Abo = BCo ^ ANDN64(BCu, BCa);
+      Abu = BCu ^ ANDN64(BCa, BCe);
+
+      Ebo ^= Do;
+      BCa = ROL64(Ebo, 28);
+      Egu ^= Du;
+      BCe = ROL64(Egu, 20);
+      Eka ^= Da;
+      BCi = ROL64(Eka, 3);
+      Eme ^= De;
+      BCo = ROL64(Eme, 45);
+      Esi ^= Di;
+      BCu = ROL64(Esi, 61);
+      Aga = BCa ^ ANDN64(BCe, BCi);
+      Age = BCe ^ ANDN64(BCi, BCo);
+      Agi = BCi ^ ANDN64(BCo, BCu);
+      Ago = BCo ^ ANDN64(BCu, BCa);
+      Agu = BCu ^ ANDN64(BCa, BCe);
+
+      Ebe ^= De;
+      BCa = ROL64(Ebe, 1);
+      Egi ^= Di;
+      BCe = ROL64(Egi, 6);
+      Eko ^= Do;
+      BCi = ROL64(Eko, 25);
+      Emu ^= Du;
+      BCo = ROL64(Emu, 8);
+      Esa ^= Da;
+      BCu = ROL64(Esa, 18);
+      Aka = BCa ^ ANDN64(BCe, BCi);
+      Ake = BCe ^ ANDN64(BCi, BCo);
+      Aki = BCi ^ ANDN64(BCo, BCu);
+      Ako = BCo ^ ANDN64(BCu, BCa);
+      Aku = BCu ^ ANDN64(BCa, BCe);
+
+      Ebu ^= Du;
+      BCa = ROL64(Ebu, 27);
+      Ega ^= Da;
+      BCe = ROL64(Ega, 36);
+      Eke ^= De;
+      BCi = ROL64(Eke, 10);
+      Emi ^= Di;
+      BCo = ROL64(Emi, 15);
+      Eso ^= Do;
+      BCu = ROL64(Eso, 56);
+      Ama = BCa ^ ANDN64(BCe, BCi);
+      Ame = BCe ^ ANDN64(BCi, BCo);
+      Ami = BCi ^ ANDN64(BCo, BCu);
+      Amo = BCo ^ ANDN64(BCu, BCa);
+      Amu = BCu ^ ANDN64(BCa, BCe);
+
+      Ebi ^= Di;
+      BCa = ROL64(Ebi, 62);
+      Ego ^= Do;
+      BCe = ROL64(Ego, 55);
+      Eku ^= Du;
+      BCi = ROL64(Eku, 39);
+      Ema ^= Da;
+      BCo = ROL64(Ema, 41);
+      Ese ^= De;
+      BCu = ROL64(Ese, 2);
+      Asa = BCa ^ ANDN64(BCe, BCi);
+      Ase = BCe ^ ANDN64(BCi, BCo);
+      Asi = BCi ^ ANDN64(BCo, BCu);
+      Aso = BCo ^ ANDN64(BCu, BCa);
+      Asu = BCu ^ ANDN64(BCa, BCe);
+    }
+
+  state[0] = Aba;
+  state[1] = Abe;
+  state[2] = Abi;
+  state[3] = Abo;
+  state[4] = Abu;
+  state[5] = Aga;
+  state[6] = Age;
+  state[7] = Agi;
+  state[8] = Ago;
+  state[9] = Agu;
+  state[10] = Aka;
+  state[11] = Ake;
+  state[12] = Aki;
+  state[13] = Ako;
+  state[14] = Aku;
+  state[15] = Ama;
+  state[16] = Ame;
+  state[17] = Ami;
+  state[18] = Amo;
+  state[19] = Amu;
+  state[20] = Asa;
+  state[21] = Ase;
+  state[22] = Asi;
+  state[23] = Aso;
+  state[24] = Asu;
+
+  return sizeof(void *) * 4 + sizeof(u64) * 12 * 5;
+}
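
Note for reviewers who want to cross-check the unrolled permutation: a
compact loop-based Keccak-f[1600] is easier to compare against the
specification and can serve as a test oracle for the new code. The sketch
below follows the widely used table-driven formulation. It is not part of
the patch, and the function and table names are hypothetical. The round
constants are the standard Keccak-f[1600] values, which are assumed to
match round_consts_64bit (that table is not shown in this section).

```c
#include <assert.h>
#include <stdint.h>

/* Valid for n in 1..63, which covers every rotate count used below. */
static uint64_t
ref_rol64 (uint64_t x, unsigned int n)
{
  assert (n >= 1 && n <= 63);
  return (x << n) | (x >> (64 - n));
}

/* Reference (slow) Keccak-f[1600] on 25 lanes, st[x + 5 * y]. */
static void
keccak_f1600_ref (uint64_t st[25])
{
  static const uint64_t rndc[24] = {
    0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
    0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
    0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008aULL,
    0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000aULL,
    0x000000008000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL,
    0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
    0x000000000000800aULL, 0x800000008000000aULL, 0x8000000080008081ULL,
    0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL
  };
  /* Rho rotate counts and pi destinations, in pi-cycle order. */
  static const unsigned int rotc[24] = {
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
    27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44
  };
  static const unsigned int piln[24] = {
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
    15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1
  };
  uint64_t bc[5], t;
  unsigned int i, j, r;

  for (r = 0; r < 24; r++)
    {
      /* theta: column parities, then mix neighbours into every lane */
      for (i = 0; i < 5; i++)
        bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20];
      for (i = 0; i < 5; i++)
        {
          t = bc[(i + 4) % 5] ^ ref_rol64 (bc[(i + 1) % 5], 1);
          for (j = 0; j < 25; j += 5)
            st[j + i] ^= t;
        }

      /* rho and pi: rotate each lane and move it along the pi cycle */
      t = st[1];
      for (i = 0; i < 24; i++)
        {
          j = piln[i];
          bc[0] = st[j];
          st[j] = ref_rol64 (t, rotc[i]);
          t = bc[0];
        }

      /* chi: nonlinear step on each row of five lanes */
      for (j = 0; j < 25; j += 5)
        {
          for (i = 0; i < 5; i++)
            bc[i] = st[j + i];
          for (i = 0; i < 5; i++)
            st[j + i] ^= ~bc[(i + 1) % 5] & bc[(i + 2) % 5];
        }

      /* iota: add the round constant to lane 0 */
      st[0] ^= rndc[r];
    }
}
```

One way to use it is to run keccak_f1600_ref and the unrolled permute on
copies of the same 25-lane state and compare the results lane by lane.
The unrolled code processes two rounds per loop iteration, alternating
between the A and E variable sets, so 12 iterations correspond to the 24
rounds here.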