[PATCH v2] Add VAES/AVX2 accelerated i386 implementation for AES

Jussi Kivilinna jussi.kivilinna at iki.fi
Sun Jul 16 09:01:18 CEST 2023


* cipher/Makefile.am: Add 'rijndael-vaes-i386.c' and
'rijndael-vaes-avx2-i386.S'.
* cipher/asm-common-i386.h: New.
* cipher/rijndael-internal.h (USE_VAES_I386): New.
* cipher/rijndael-vaes-avx2-i386.S: New.
* cipher/rijndael-vaes-i386.c: New.
* cipher/rijndael-vaes.c: Update header description (add 'AMD64').
* cipher/rijndael.c [USE_VAES]: Add 'USE_VAES_I386' to ifdef around
'_gcry_aes_vaes_*' function prototypes.
(setkey) [USE_VAES_I386]: Add setup of VAES/AVX2/i386 bulk functions.
* configure.ac: Add 'rijndael-vaes-i386.lo' and
'rijndael-vaes-avx2-i386.lo'.
(gcry_cv_gcc_amd64_platform_as_ok): Rename this to ...
(gcry_cv_gcc_x86_platform_as_ok): ... this and change to check for
both AMD64 and i386 assembler compatibility.
(gcry_cv_gcc_win32_platform_as_ok): New.
--

Benchmark on Intel Core(TM) i3-1115G4 (tigerlake, linux-i386):
------------------------------------------

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.127 ns/B      7523 MiB/s     0.379 c/B      2992
        ECB dec |     0.127 ns/B      7517 MiB/s     0.380 c/B      2992
        CBC dec |     0.108 ns/B      8855 MiB/s     0.322 c/B      2993
        CFB dec |     0.107 ns/B      8938 MiB/s     0.319 c/B      2993
        CTR enc |     0.111 ns/B      8589 MiB/s     0.332 c/B      2992
        CTR dec |     0.111 ns/B      8593 MiB/s     0.332 c/B      2993
        XTS enc |     0.140 ns/B      6833 MiB/s     0.418 c/B      2993
        XTS dec |     0.139 ns/B      6863 MiB/s     0.416 c/B      2993
        OCB enc |     0.138 ns/B      6907 MiB/s     0.413 c/B      2993
        OCB dec |     0.139 ns/B      6884 MiB/s     0.415 c/B      2993
       OCB auth |     0.124 ns/B      7679 MiB/s     0.372 c/B      2993

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.053 ns/B     18056 MiB/s     0.158 c/B      2992
        ECB dec |     0.053 ns/B     18009 MiB/s     0.158 c/B      2993
        CBC dec |     0.053 ns/B     17955 MiB/s     0.159 c/B      2993
        CFB dec |     0.054 ns/B     17813 MiB/s     0.160 c/B      2993
        CTR enc |     0.061 ns/B     15633 MiB/s     0.183 c/B      2993
        CTR dec |     0.061 ns/B     15608 MiB/s     0.183 c/B      2993
        XTS enc |     0.082 ns/B     11640 MiB/s     0.245 c/B      2993
        XTS dec |     0.081 ns/B     11717 MiB/s     0.244 c/B      2992
        OCB enc |     0.082 ns/B     11677 MiB/s     0.244 c/B      2993
        OCB dec |     0.089 ns/B     10736 MiB/s     0.266 c/B      2992
       OCB auth |     0.080 ns/B     11883 MiB/s     0.240 c/B      2993
  ECB: ~2.4x faster
  CBC/CFB dec: ~2.0x faster
  CTR: ~1.8x faster
  XTS: ~1.7x faster
  OCB enc: ~1.7x faster
  OCB dec/auth: ~1.5x faster

Benchmark on AMD Ryzen 9 7900X (zen4, linux-i386):
----------------------------------

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.070 ns/B     13582 MiB/s     0.330 c/B      4700
        ECB dec |     0.071 ns/B     13525 MiB/s     0.331 c/B      4700
        CBC dec |     0.072 ns/B     13165 MiB/s     0.341 c/B      4701
        CFB dec |     0.072 ns/B     13197 MiB/s     0.340 c/B      4700
        CTR enc |     0.073 ns/B     13140 MiB/s     0.341 c/B      4700
        CTR dec |     0.073 ns/B     13092 MiB/s     0.342 c/B      4700
        XTS enc |     0.093 ns/B     10268 MiB/s     0.437 c/B      4700
        XTS dec |     0.093 ns/B     10204 MiB/s     0.439 c/B      4700
        OCB enc |     0.088 ns/B     10885 MiB/s     0.412 c/B      4700
        OCB dec |     0.180 ns/B      5290 MiB/s     0.847 c/B      4700
       OCB auth |     0.174 ns/B      5466 MiB/s     0.820 c/B      4700

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.035 ns/B     27469 MiB/s     0.163 c/B      4700
        ECB dec |     0.035 ns/B     27482 MiB/s     0.163 c/B      4700
        CBC dec |     0.036 ns/B     26853 MiB/s     0.167 c/B      4700
        CFB dec |     0.035 ns/B     27452 MiB/s     0.163 c/B      4700
        CTR enc |     0.042 ns/B     22573 MiB/s     0.199 c/B      4700
        CTR dec |     0.042 ns/B     22524 MiB/s     0.199 c/B      4700
        XTS enc |     0.054 ns/B     17731 MiB/s     0.253 c/B      4700
        XTS dec |     0.054 ns/B     17788 MiB/s     0.252 c/B      4700
        OCB enc |     0.043 ns/B     22162 MiB/s     0.202 c/B      4700
        OCB dec |     0.044 ns/B     21918 MiB/s     0.205 c/B      4700
       OCB auth |     0.039 ns/B     24327 MiB/s     0.184 c/B      4700
  ECB: ~2.0x faster
  CBC/CFB dec: ~2.0x faster
  CTR/XTS: ~1.7x faster
  OCB enc: ~2.0x faster
  OCB dec: ~4.1x faster
  OCB auth: ~4.4x faster

Benchmark on AMD Ryzen 7 5800X (zen3, win32):
-----------------------------

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.087 ns/B     11002 MiB/s     0.329 c/B      3800
        ECB dec |     0.088 ns/B     10887 MiB/s     0.333 c/B      3801
        CBC dec |     0.097 ns/B      9831 MiB/s     0.369 c/B      3801
        CFB dec |     0.096 ns/B      9897 MiB/s     0.366 c/B      3800
        CTR enc |     0.104 ns/B      9190 MiB/s     0.394 c/B      3801
        CTR dec |     0.105 ns/B      9083 MiB/s     0.399 c/B      3801
        XTS enc |     0.127 ns/B      7538 MiB/s     0.481 c/B      3801
        XTS dec |     0.127 ns/B      7505 MiB/s     0.483 c/B      3801
        OCB enc |     0.117 ns/B      8180 MiB/s     0.443 c/B      3801
        OCB dec |     0.115 ns/B      8296 MiB/s     0.437 c/B      3800
       OCB auth |     0.107 ns/B      8928 MiB/s     0.406 c/B      3801

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.042 ns/B     22515 MiB/s     0.161 c/B      3801
        ECB dec |     0.043 ns/B     22308 MiB/s     0.163 c/B      3801
        CBC dec |     0.050 ns/B     18910 MiB/s     0.192 c/B      3801
        CFB dec |     0.049 ns/B     19402 MiB/s     0.187 c/B      3801
        CTR enc |     0.053 ns/B     18002 MiB/s     0.201 c/B      3801
        CTR dec |     0.053 ns/B     17944 MiB/s     0.202 c/B      3801
        XTS enc |     0.076 ns/B     12531 MiB/s     0.289 c/B      3801
        XTS dec |     0.077 ns/B     12465 MiB/s     0.291 c/B      3801
        OCB enc |     0.065 ns/B     14719 MiB/s     0.246 c/B      3801
        OCB dec |     0.060 ns/B     15887 MiB/s     0.228 c/B      3801
       OCB auth |     0.054 ns/B     17504 MiB/s     0.207 c/B      3801
  ECB: ~2.0x faster
  CBC/CFB dec: ~1.9x faster
  CTR: ~1.9x faster
  XTS: ~1.6x faster
  OCB enc: ~1.8x faster
  OCB dec/auth: ~1.9x faster

[v2]:
 - Improve CTR performance
 - Improve OCB performance

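A note for reviewers not familiar with the glue layer: rijndael-vaes-i386.c
exports the same '_gcry_aes_vaes_*' bulk entry points as the existing AMD64
glue, so the setkey change in rijndael.c amounts to widening the existing
VAES guard. A rough sketch of the expected hook-up (not the literal hunk;
the bulk_ops field and HWF flag names below are taken from the current
AMD64 VAES path):

  #if defined(USE_VAES) || defined(USE_VAES_I386)
      if ((hwfeatures & HWF_INTEL_VAES_VPCLMUL)
          && (hwfeatures & HWF_INTEL_AVX2))
        {
          /* Setup VAES bulk encryption routines.  */
          bulk_ops->cfb_dec     = _gcry_aes_vaes_cfb_dec;
          bulk_ops->cbc_dec     = _gcry_aes_vaes_cbc_dec;
          bulk_ops->ctr_enc     = _gcry_aes_vaes_ctr_enc;
          bulk_ops->ctr32le_enc = _gcry_aes_vaes_ctr32le_enc;
          bulk_ops->ocb_crypt   = _gcry_aes_vaes_ocb_crypt;
          bulk_ops->xts_crypt   = _gcry_aes_vaes_xts_crypt;
          bulk_ops->ecb_crypt   = _gcry_aes_vaes_ecb_crypt;
        }
  #endif
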
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am               |    3 +-
 cipher/asm-common-i386.h         |  161 ++
 cipher/rijndael-internal.h       |   12 +-
 cipher/rijndael-vaes-avx2-i386.S | 2804 ++++++++++++++++++++++++++++++
 cipher/rijndael-vaes-i386.c      |  231 +++
 cipher/rijndael-vaes.c           |    4 +-
 cipher/rijndael.c                |   19 +-
 configure.ac                     |   47 +-
 8 files changed, 3266 insertions(+), 15 deletions(-)
 create mode 100644 cipher/asm-common-i386.h
 create mode 100644 cipher/rijndael-vaes-avx2-i386.S
 create mode 100644 cipher/rijndael-vaes-i386.c

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 8c7ec095..ada89418 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -108,10 +108,11 @@ EXTRA_libcipher_la_SOURCES = \
 	rijndael-amd64.S rijndael-arm.S                    \
 	rijndael-ssse3-amd64.c rijndael-ssse3-amd64-asm.S  \
 	rijndael-vaes.c rijndael-vaes-avx2-amd64.S         \
+	rijndael-vaes-i386.c rijndael-vaes-avx2-i386.S     \
 	rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S    \
 	rijndael-armv8-aarch64-ce.S rijndael-aarch64.S     \
 	rijndael-ppc.c rijndael-ppc9le.c                   \
-	rijndael-p10le.c rijndael-gcm-p10le.s             \
+	rijndael-p10le.c rijndael-gcm-p10le.s              \
 	rijndael-ppc-common.h rijndael-ppc-functions.h     \
 	rijndael-s390x.c                                   \
 	rmd160.c \
diff --git a/cipher/asm-common-i386.h b/cipher/asm-common-i386.h
new file mode 100644
index 00000000..d746ebc4
--- /dev/null
+++ b/cipher/asm-common-i386.h
@@ -0,0 +1,161 @@
+/* asm-common-i386.h  -  Common macros for i386 assembly
+ *
+ * Copyright (C) 2023 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef GCRY_ASM_COMMON_I386_H
+#define GCRY_ASM_COMMON_I386_H
+
+#include <config.h>
+
+#ifdef HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS
+# define ELF(...) __VA_ARGS__
+#else
+# define ELF(...) /*_*/
+#endif
+
+#ifdef HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS
+# define SECTION_RODATA .section .rdata
+#else
+# define SECTION_RODATA .section .rodata
+#endif
+
+#ifdef HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS
+# define SYM_NAME(name) _##name
+#else
+# define SYM_NAME(name) name
+#endif
+
+#ifdef HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS
+# define DECL_GET_PC_THUNK(reg)
+# define GET_DATA_POINTER(name, reg) leal name, %reg
+#else
+# define DECL_GET_PC_THUNK(reg) \
+      .type __gcry_get_pc_thunk_##reg, @function; \
+      .align 16; \
+      __gcry_get_pc_thunk_##reg:; \
+	CFI_STARTPROC(); \
+	movl (%esp), %reg; \
+	ret_spec_stop; \
+	CFI_ENDPROC()
+# define GET_DATA_POINTER(name, reg) \
+	call __gcry_get_pc_thunk_##reg; \
+	addl $_GLOBAL_OFFSET_TABLE_, %reg; \
+	movl name##@GOT(%reg), %reg;
+#endif
+
+#ifdef HAVE_GCC_ASM_CFI_DIRECTIVES
+/* CFI directives to emit DWARF stack unwinding information. */
+# define CFI_STARTPROC()            .cfi_startproc
+# define CFI_ENDPROC()              .cfi_endproc
+# define CFI_REMEMBER_STATE()       .cfi_remember_state
+# define CFI_RESTORE_STATE()        .cfi_restore_state
+# define CFI_ADJUST_CFA_OFFSET(off) .cfi_adjust_cfa_offset off
+# define CFI_REL_OFFSET(reg,off)    .cfi_rel_offset reg, off
+# define CFI_DEF_CFA_REGISTER(reg)  .cfi_def_cfa_register reg
+# define CFI_REGISTER(ro,rn)        .cfi_register ro, rn
+# define CFI_RESTORE(reg)           .cfi_restore reg
+
+# define CFI_PUSH(reg) \
+	CFI_ADJUST_CFA_OFFSET(4); CFI_REL_OFFSET(reg, 0)
+# define CFI_POP(reg) \
+	CFI_ADJUST_CFA_OFFSET(-4); CFI_RESTORE(reg)
+# define CFI_POP_TMP_REG() \
+	CFI_ADJUST_CFA_OFFSET(-4);
+# define CFI_LEAVE() \
+	CFI_ADJUST_CFA_OFFSET(-4); CFI_DEF_CFA_REGISTER(%esp)
+
+/* CFA expressions are used for pointing CFA and registers to
+ * %esp relative offsets. */
+# define DW_REGNO_eax 0
+# define DW_REGNO_edx 2
+# define DW_REGNO_ecx 1
+# define DW_REGNO_ebx 3
+# define DW_REGNO_esi 6
+# define DW_REGNO_edi 7
+# define DW_REGNO_ebp 5
+# define DW_REGNO_esp 4
+
+# define DW_REGNO(reg) DW_REGNO_ ## reg
+
+/* Fixed length encoding used for integers for now. */
+# define DW_SLEB128_7BIT(value) \
+	0x00|((value) & 0x7f)
+# define DW_SLEB128_28BIT(value) \
+	0x80|((value)&0x7f), \
+	0x80|(((value)>>7)&0x7f), \
+	0x80|(((value)>>14)&0x7f), \
+	0x00|(((value)>>21)&0x7f)
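+/* For example, DW_SLEB128_28BIT(0x90) expands to the byte sequence
+ * 0x90, 0x81, 0x80, 0x00, i.e. LEB128 continuation bytes padded to a
+ * fixed four-byte length. */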
+
+# define CFI_CFA_ON_STACK(esp_offs,cfa_depth) \
+	.cfi_escape \
+	  0x0f, /* DW_CFA_def_cfa_expression */ \
+	    DW_SLEB128_7BIT(11), /* length */ \
+	  0x74, /* DW_OP_breg4, esp + constant */ \
+	    DW_SLEB128_28BIT(esp_offs), \
+	  0x06, /* DW_OP_deref */ \
+	  0x23, /* DW_OP_plus_constu */ \
+	    DW_SLEB128_28BIT((cfa_depth)+4)
+
+# define CFI_REG_ON_STACK(reg,esp_offs) \
+	.cfi_escape \
+	  0x10, /* DW_CFA_expression */ \
+	    DW_SLEB128_7BIT(DW_REGNO(reg)), \
+	    DW_SLEB128_7BIT(5), /* length */ \
+	  0x74, /* DW_OP_breg4, esp + constant */ \
+	    DW_SLEB128_28BIT(esp_offs)
+
+#else
+# define CFI_STARTPROC()
+# define CFI_ENDPROC()
+# define CFI_REMEMBER_STATE()
+# define CFI_RESTORE_STATE()
+# define CFI_ADJUST_CFA_OFFSET(off)
+# define CFI_REL_OFFSET(reg,off)
+# define CFI_DEF_CFA_REGISTER(reg)
+# define CFI_REGISTER(ro,rn)
+# define CFI_RESTORE(reg)
+
+# define CFI_PUSH(reg)
+# define CFI_POP(reg)
+# define CFI_POP_TMP_REG()
+# define CFI_LEAVE()
+
+# define CFI_CFA_ON_STACK(esp_offs,cfa_depth)
+# define CFI_REG_ON_STACK(reg,esp_offs)
+#endif
+
+/* 'ret' instruction replacement for straight-line speculation mitigation. */
+#define ret_spec_stop \
+	ret; int3;
+
+/* This prevents speculative execution into AVX512 code on old AVX512
+ * CPUs. The vpopcntb instruction is available on newer CPUs that do not
+ * suffer from significant frequency drop when 512-bit vectors are
+ * utilized. */
+#define spec_stop_avx512 \
+	vpxord %ymm7, %ymm7, %ymm7; \
+	vpopcntb %xmm7, %xmm7; /* Supported only by newer AVX512 CPUs. */ \
+	vpxord %ymm7, %ymm7, %ymm7;
+
+#define spec_stop_avx512_intel_syntax \
+	vpxord ymm7, ymm7, ymm7; \
+	vpopcntb xmm7, xmm7; /* Supported only by newer AVX512 CPUs. */ \
+	vpxord ymm7, ymm7, ymm7;
+
+#endif /* GCRY_ASM_COMMON_I386_H */
diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h
index 52c892fd..166f2415 100644
--- a/cipher/rijndael-internal.h
+++ b/cipher/rijndael-internal.h
@@ -89,7 +89,7 @@
 # endif
 #endif /* ENABLE_AESNI_SUPPORT */
 
-/* USE_VAES inidicates whether to compile with Intel VAES code.  */
+/* USE_VAES indicates whether to compile with AMD64 VAES code.  */
 #undef USE_VAES
 #if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
      defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
@@ -99,6 +99,16 @@
 # define USE_VAES 1
 #endif
 
+/* USE_VAES_I386 indicates whether to compile with i386 VAES code.  */
+#undef USE_VAES_I386
+#if (defined(HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS)) && \
+     defined(__i386__) && defined(ENABLE_AVX2_SUPPORT) && \
+     defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL) && \
+     defined(USE_AESNI)
+# define USE_VAES_I386 1
+#endif
+
 /* USE_ARM_CE indicates whether to enable ARMv8 Crypto Extension assembly
  * code. */
 #undef USE_ARM_CE
diff --git a/cipher/rijndael-vaes-avx2-i386.S b/cipher/rijndael-vaes-avx2-i386.S
new file mode 100644
index 00000000..245e8443
--- /dev/null
+++ b/cipher/rijndael-vaes-avx2-i386.S
@@ -0,0 +1,2804 @@
+/* VAES/AVX2 i386 accelerated AES for Libgcrypt
+ * Copyright (C) 2023 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#if defined(__i386__)
+#include <config.h>
+#if (defined(HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS)) && \
+    defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) && \
+    defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL)
+
+#include "asm-common-i386.h"
+
+.text
+
+DECL_GET_PC_THUNK(eax);
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+#define AES_OP4(op, key, b0, b1, b2, b3) \
+	op key, b0, b0; \
+	op key, b1, b1; \
+	op key, b2, b2; \
+	op key, b3, b3;
+
+#define VAESENC4(key, b0, b1, b2, b3) \
+	AES_OP4(vaesenc, key, b0, b1, b2, b3)
+
+#define VAESDEC4(key, b0, b1, b2, b3) \
+	AES_OP4(vaesdec, key, b0, b1, b2, b3)
+
+#define XOR4(key, b0, b1, b2, b3) \
+	AES_OP4(vpxor, key, b0, b1, b2, b3)
+
+#define AES_OP2(op, key, b0, b1) \
+	op key, b0, b0; \
+	op key, b1, b1;
+
+#define VAESENC2(key, b0, b1) \
+	AES_OP2(vaesenc, key, b0, b1)
+
+#define VAESDEC2(key, b0, b1) \
+	AES_OP2(vaesdec, key, b0, b1)
+
+#define XOR2(key, b0, b1) \
+	AES_OP2(vpxor, key, b0, b1)
+
+#define VAESENC6(key, b0, b1, b2, b3, b4, b5) \
+	AES_OP4(vaesenc, key, b0, b1, b2, b3); \
+	AES_OP2(vaesenc, key, b4, b5)
+
+#define VAESDEC6(key, b0, b1, b2, b3, b4, b5) \
+	AES_OP4(vaesdec, key, b0, b1, b2, b3); \
+	AES_OP2(vaesdec, key, b4, b5)
+
+#define XOR6(key, b0, b1, b2, b3, b4, b5) \
+	AES_OP4(vpxor, key, b0, b1, b2, b3); \
+	AES_OP2(vpxor, key, b4, b5)
+
+#define CADDR(name, reg) \
+	(name - SYM_NAME(_gcry_vaes_consts))(reg)
+
+/**********************************************************************
+  CBC-mode decryption
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_cbc_dec_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_cbc_dec_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_cbc_dec_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): iv
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 */
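+	/* For reference, a sketch of the matching C-side declaration in
+	 * rijndael-vaes-i386.c, mirroring the AMD64 glue (parameter names
+	 * here are illustrative only):
+	 *   extern void _gcry_vaes_avx2_cbc_dec_i386 (const void *keysched,
+	 *                                             unsigned char *iv,
+	 *                                             void *outbuf,
+	 *                                             const void *inbuf,
+	 *                                             size_t nblocks,
+	 *                                             unsigned int nrounds);
+	 * with all arguments passed on the stack per the i386 ABI. */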
+	CFI_STARTPROC();
+	pushl %edi;
+	CFI_PUSH(%edi);
+	pushl %esi;
+	CFI_PUSH(%esi);
+
+	movl 8+4(%esp), %edi;
+	movl 8+8(%esp), %esi;
+	movl 8+12(%esp), %edx;
+	movl 8+16(%esp), %ecx;
+	movl 8+20(%esp), %eax;
+
+	/* Process 8 blocks per loop. */
+.align 8
+.Lcbc_dec_blk8:
+	cmpl $8, %eax;
+	jb .Lcbc_dec_blk4;
+
+	leal -8(%eax), %eax;
+
+	/* Load input and xor first key. Update IV. */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm0;
+	vmovdqu (2 * 16)(%ecx), %ymm1;
+	vmovdqu (4 * 16)(%ecx), %ymm2;
+	vmovdqu (6 * 16)(%ecx), %ymm3;
+	vmovdqu (%esi), %xmm6; /* Load IV. */
+	vinserti128 $1, %xmm0, %ymm6, %ymm5;
+	vextracti128 $1, %ymm3, (%esi); /* Store IV. */
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vpxor %ymm4, %ymm2, %ymm2;
+	vpxor %ymm4, %ymm3, %ymm3;
+	vmovdqu (1 * 16)(%ecx), %ymm6;
+	vmovdqu (3 * 16)(%ecx), %ymm7;
+
+	/* AES rounds */
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 8+24(%esp);
+	jb .Lcbc_dec_blk8_last;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lcbc_dec_blk8_last;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lcbc_dec_blk8_last:
+	vpxor %ymm4, %ymm5, %ymm5;
+	vpxor %ymm4, %ymm6, %ymm6;
+	vpxor %ymm4, %ymm7, %ymm7;
+	vpxor (5 * 16)(%ecx), %ymm4, %ymm4;
+	leal (8 * 16)(%ecx), %ecx;
+	vaesdeclast %ymm5, %ymm0, %ymm0;
+	vaesdeclast %ymm6, %ymm1, %ymm1;
+	vaesdeclast %ymm7, %ymm2, %ymm2;
+	vaesdeclast %ymm4, %ymm3, %ymm3;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	vmovdqu %ymm2, (4 * 16)(%edx);
+	vmovdqu %ymm3, (6 * 16)(%edx);
+	leal (8 * 16)(%edx), %edx;
+
+	jmp .Lcbc_dec_blk8;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lcbc_dec_blk4:
+	cmpl $4, %eax;
+	jb .Lcbc_dec_blk1;
+
+	leal -4(%eax), %eax;
+
+	/* Load input and xor first key. Update IV. */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm0;
+	vmovdqu (2 * 16)(%ecx), %ymm1;
+	vmovdqu (%esi), %xmm6; /* Load IV. */
+	vinserti128 $1, %xmm0, %ymm6, %ymm5;
+	vextracti128 $1, %ymm1, (%esi); /* Store IV. */
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vmovdqu (1 * 16)(%ecx), %ymm6;
+	leal (4 * 16)(%ecx), %ecx;
+
+	/* AES rounds */
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 8+24(%esp);
+	jb .Lcbc_dec_blk4_last;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lcbc_dec_blk4_last;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESDEC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lcbc_dec_blk4_last:
+	vpxor %ymm4, %ymm5, %ymm5;
+	vpxor %ymm4, %ymm6, %ymm6;
+	vaesdeclast %ymm5, %ymm0, %ymm0;
+	vaesdeclast %ymm6, %ymm1, %ymm1;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	leal (4 * 16)(%edx), %edx;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lcbc_dec_blk1:
+	cmpl $1, %eax;
+	jb .Ldone_cbc_dec;
+
+	leal -1(%eax), %eax;
+
+	/* Load input. */
+	vmovdqu (%ecx), %xmm2;
+	leal 16(%ecx), %ecx;
+
+	/* Xor first key. */
+	vpxor (0 * 16)(%edi), %xmm2, %xmm0;
+
+	/* AES rounds. */
+	vaesdec (1 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (2 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (3 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (4 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (5 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (6 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (7 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (8 * 16)(%edi), %xmm0, %xmm0;
+	vaesdec (9 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (10 * 16)(%edi), %xmm1;
+	cmpl $12, 8+24(%esp);
+	jb .Lcbc_dec_blk1_last;
+	vaesdec %xmm1, %xmm0, %xmm0;
+	vaesdec (11 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (12 * 16)(%edi), %xmm1;
+	jz .Lcbc_dec_blk1_last;
+	vaesdec %xmm1, %xmm0, %xmm0;
+	vaesdec (13 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (14 * 16)(%edi), %xmm1;
+
+	/* Last round and output handling. */
+  .Lcbc_dec_blk1_last:
+	vpxor (%esi), %xmm1, %xmm1;
+	vaesdeclast %xmm1, %xmm0, %xmm0;
+	vmovdqu %xmm2, (%esi);
+	vmovdqu %xmm0, (%edx);
+	leal 16(%edx), %edx;
+
+	jmp .Lcbc_dec_blk1;
+
+.align 8
+.Ldone_cbc_dec:
+	popl %esi;
+	CFI_POP(%esi);
+	popl %edi;
+	CFI_POP(%edi);
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_cbc_dec_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_cbc_dec_i386))
+
+/**********************************************************************
+  CFB-mode decryption
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_cfb_dec_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_cfb_dec_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_cfb_dec_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): iv
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 */
+	CFI_STARTPROC();
+	pushl %edi;
+	CFI_PUSH(%edi);
+	pushl %esi;
+	CFI_PUSH(%esi);
+
+	movl 8+4(%esp), %edi;
+	movl 8+8(%esp), %esi;
+	movl 8+12(%esp), %edx;
+	movl 8+16(%esp), %ecx;
+	movl 8+20(%esp), %eax;
+
+	/* Process 8 blocks per loop. */
+.align 8
+.Lcfb_dec_blk8:
+	cmpl $8, %eax;
+	jb .Lcfb_dec_blk4;
+
+	leal -8(%eax), %eax;
+
+	/* Load IV. */
+	vmovdqu (%esi), %xmm0;
+
+	/* Load input and xor first key. Update IV. */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm5;
+	vinserti128 $1, %xmm5, %ymm0, %ymm0;
+	vmovdqu (1 * 16)(%ecx), %ymm1;
+	vmovdqu (3 * 16)(%ecx), %ymm2;
+	vmovdqu (5 * 16)(%ecx), %ymm3;
+	vmovdqu (7 * 16)(%ecx), %xmm6;
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vpxor %ymm4, %ymm2, %ymm2;
+	vpxor %ymm4, %ymm3, %ymm3;
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	vmovdqu %xmm6, (%esi); /* Store IV. */
+	vmovdqu (2 * 16)(%ecx), %ymm6;
+	vmovdqu (4 * 16)(%ecx), %ymm7;
+
+	/* AES rounds */
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 8+24(%esp);
+	jb .Lcfb_dec_blk8_last;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lcfb_dec_blk8_last;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lcfb_dec_blk8_last:
+	vpxor %ymm4, %ymm5, %ymm5;
+	vpxor %ymm4, %ymm6, %ymm6;
+	vpxor %ymm4, %ymm7, %ymm7;
+	vpxor (6 * 16)(%ecx), %ymm4, %ymm4;
+	leal (8 * 16)(%ecx), %ecx;
+	vaesenclast %ymm5, %ymm0, %ymm0;
+	vaesenclast %ymm6, %ymm1, %ymm1;
+	vaesenclast %ymm7, %ymm2, %ymm2;
+	vaesenclast %ymm4, %ymm3, %ymm3;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	vmovdqu %ymm2, (4 * 16)(%edx);
+	vmovdqu %ymm3, (6 * 16)(%edx);
+	leal (8 * 16)(%edx), %edx;
+
+	jmp .Lcfb_dec_blk8;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lcfb_dec_blk4:
+	cmpl $4, %eax;
+	jb .Lcfb_dec_blk1;
+
+	leal -4(%eax), %eax;
+
+	/* Load IV. */
+	vmovdqu (%esi), %xmm0;
+
+	/* Load input and xor first key. Update IV. */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm5;
+	vinserti128 $1, %xmm5, %ymm0, %ymm0;
+	vmovdqu (1 * 16)(%ecx), %ymm1;
+	vmovdqu (3 * 16)(%ecx), %xmm6;
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	vmovdqu %xmm6, (%esi); /* Store IV. */
+	vmovdqu (2 * 16)(%ecx), %ymm6;
+
+	leal (4 * 16)(%ecx), %ecx;
+
+	/* AES rounds */
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 8+24(%esp);
+	jb .Lcfb_dec_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lcfb_dec_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lcfb_dec_blk4_last:
+	vpxor %ymm4, %ymm5, %ymm5;
+	vpxor %ymm4, %ymm6, %ymm6;
+	vaesenclast %ymm5, %ymm0, %ymm0;
+	vaesenclast %ymm6, %ymm1, %ymm1;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	leal (4 * 16)(%edx), %edx;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lcfb_dec_blk1:
+	cmpl $1, %eax;
+	jb .Ldone_cfb_dec;
+
+	leal -1(%eax), %eax;
+
+	/* Load IV. */
+	vmovdqu (%esi), %xmm0;
+
+	/* Xor first key. */
+	vpxor (0 * 16)(%edi), %xmm0, %xmm0;
+
+	/* Load input as next IV. */
+	vmovdqu (%ecx), %xmm2;
+	leal 16(%ecx), %ecx;
+
+	/* AES rounds. */
+	vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (10 * 16)(%edi), %xmm1;
+	vmovdqu %xmm2, (%esi); /* Store IV. */
+	cmpl $12, 8+24(%esp);
+	jb .Lcfb_dec_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (12 * 16)(%edi), %xmm1;
+	jz .Lcfb_dec_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (14 * 16)(%edi), %xmm1;
+
+	/* Last round and output handling. */
+  .Lcfb_dec_blk1_last:
+	vpxor %xmm2, %xmm1, %xmm1;
+	vaesenclast %xmm1, %xmm0, %xmm0;
+	vmovdqu %xmm0, (%edx);
+	leal 16(%edx), %edx;
+
+	jmp .Lcfb_dec_blk1;
+
+.align 8
+.Ldone_cfb_dec:
+	popl %esi;
+	CFI_POP(%esi);
+	popl %edi;
+	CFI_POP(%edi);
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_cfb_dec_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_cfb_dec_i386))
+
+/**********************************************************************
+  CTR-mode encryption
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_ctr_enc_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_ctr_enc_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_ctr_enc_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): iv
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 */
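+	/* Note: "iv" is the 16-byte big-endian counter block; it is updated
+	 * in place so that it holds the counter value following the last
+	 * processed block. */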
+	CFI_STARTPROC();
+
+	GET_DATA_POINTER(SYM_NAME(_gcry_vaes_consts), eax);
+
+	pushl %ebp;
+	CFI_PUSH(%ebp);
+	movl %esp, %ebp;
+	CFI_DEF_CFA_REGISTER(%ebp);
+
+	subl $(3 * 32 + 3 * 4), %esp;
+	andl $-32, %esp;
+
+	movl %edi, (3 * 32 + 0 * 4)(%esp);
+	CFI_REG_ON_STACK(edi, 3 * 32 + 0 * 4);
+	movl %esi, (3 * 32 + 1 * 4)(%esp);
+	CFI_REG_ON_STACK(esi, 3 * 32 + 1 * 4);
+	movl %ebx, (3 * 32 + 2 * 4)(%esp);
+	CFI_REG_ON_STACK(ebx, 3 * 32 + 2 * 4);
+
+	movl %eax, %ebx;
+	movl 4+4(%ebp), %edi;
+	movl 4+8(%ebp), %esi;
+	movl 4+12(%ebp), %edx;
+	movl 4+16(%ebp), %ecx;
+
+#define prepare_ctr_const(minus_one, minus_two) \
+	vpcmpeqd minus_one, minus_one, minus_one; \
+	vpsrldq $8, minus_one, minus_one;       /* 0:-1 */ \
+	vpaddq minus_one, minus_one, minus_two; /* 0:-2 */
+
+#define inc_le128(x, minus_one, tmp) \
+	vpcmpeqq minus_one, x, tmp; \
+	vpsubq minus_one, x, x; \
+	vpslldq $8, tmp, tmp; \
+	vpsubq tmp, x, x;
+
+#define add2_le128(x, minus_one, minus_two, tmp1, tmp2) \
+	vpcmpeqq minus_one, x, tmp1; \
+	vpcmpeqq minus_two, x, tmp2; \
+	vpor tmp1, tmp2, tmp2; \
+	vpsubq minus_two, x, x; \
+	vpslldq $8, tmp2, tmp2; \
+	vpsubq tmp2, x, x;
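+/* Note: in inc_le128/add2_le128, vpcmpeqq sets the low qword lane to
+ * all-ones exactly when it is about to wrap (it equals -1, resp. -1/-2);
+ * vpsubq of the 0:-1 (0:-2) constant then adds 1 (2) to the low qword,
+ * and the wrap mask shifted into the high lane by vpslldq subtracts -1
+ * there, i.e. propagates the carry into the high qword. */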
+
+#define handle_ctr_128bit_add(nblks) \
+	movl 12(%esi), %eax; \
+	bswapl %eax; \
+	addl $nblks, %eax; \
+	bswapl %eax; \
+	movl %eax, 12(%esi); \
+	jnc 1f; \
+	\
+	movl 8(%esi), %eax; \
+	bswapl %eax; \
+	adcl $0, %eax; \
+	bswapl %eax; \
+	movl %eax, 8(%esi); \
+	\
+	movl 4(%esi), %eax; \
+	bswapl %eax; \
+	adcl $0, %eax; \
+	bswapl %eax; \
+	movl %eax, 4(%esi); \
+	\
+	movl 0(%esi), %eax; \
+	bswapl %eax; \
+	adcl $0, %eax; \
+	bswapl %eax; \
+	movl %eax, 0(%esi); \
+	.align 8; \
+	1:;
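+/* Rough C equivalent of handle_ctr_128bit_add (a sketch, using the
+ * buf_get_be32/buf_put_be32 helpers from bufhelp.h): the counter is
+ * treated as four big-endian 32-bit words and the carry from adding
+ * nblks to the least significant word is propagated upwards:
+ *
+ *   u32 w = buf_get_be32 (iv + 12);
+ *   int carry = (u32)(w + nblks) < w;
+ *   buf_put_be32 (iv + 12, w + nblks);
+ *   for (int i = 8; carry && i >= 0; i -= 4)
+ *     {
+ *       w = buf_get_be32 (iv + i) + 1;
+ *       carry = (w == 0);
+ *       buf_put_be32 (iv + i, w);
+ *     }
+ */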
+
+	cmpl $12, 4+20(%ebp);
+	jae .Lctr_enc_blk12_loop;
+	jmp .Lctr_enc_blk4;
+
+	/* Process 12 blocks per loop. */
+.align 16
+.Lctr_enc_blk12_loop:
+	subl $12, 4+20(%ebp);
+
+	vbroadcasti128 (%esi), %ymm6;
+
+	/* detect if carry handling is needed */
+	movl 12(%esi), %eax;
+	addl $(12 << 24), %eax;
+	jc .Lctr_enc_blk12_handle_carry;
+	movl %eax, 12(%esi);
+
+  .Lctr_enc_blk12_byte_bige_add:
+	/* Increment counters. */
+	vpaddb CADDR(.Lbige_addb_0, %ebx), %ymm6, %ymm0;
+	vpaddb CADDR(.Lbige_addb_2, %ebx), %ymm6, %ymm1;
+	vpaddb CADDR(.Lbige_addb_4, %ebx), %ymm6, %ymm2;
+	vpaddb CADDR(.Lbige_addb_6, %ebx), %ymm6, %ymm3;
+	vpaddb CADDR(.Lbige_addb_8, %ebx), %ymm6, %ymm5;
+	vpaddb CADDR(.Lbige_addb_10, %ebx), %ymm6, %ymm6;
+
+  .Lctr_enc_blk12_rounds:
+	/* AES rounds */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	XOR6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr_enc_blk12_last;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lctr_enc_blk12_last;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lctr_enc_blk12_last:
+	vpxor (0 * 16)(%ecx), %ymm4, %ymm7; /* Xor src to last round key. */
+	vaesenclast %ymm7, %ymm0, %ymm0;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vpxor (2 * 16)(%ecx), %ymm4, %ymm7;
+	vpxor (4 * 16)(%ecx), %ymm4, %ymm0;
+	vaesenclast %ymm7, %ymm1, %ymm1;
+	vaesenclast %ymm0, %ymm2, %ymm2;
+	vpxor (6 * 16)(%ecx), %ymm4, %ymm7;
+	vpxor (8 * 16)(%ecx), %ymm4, %ymm0;
+	vpxor (10 * 16)(%ecx), %ymm4, %ymm4;
+	leal (12 * 16)(%ecx), %ecx;
+	vaesenclast %ymm7, %ymm3, %ymm3;
+	vaesenclast %ymm0, %ymm5, %ymm5;
+	vaesenclast %ymm4, %ymm6, %ymm6;
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	vmovdqu %ymm2, (4 * 16)(%edx);
+	vmovdqu %ymm3, (6 * 16)(%edx);
+	vmovdqu %ymm5, (8 * 16)(%edx);
+	vmovdqu %ymm6, (10 * 16)(%edx);
+	leal (12 * 16)(%edx), %edx;
+
+	cmpl $12, 4+20(%ebp);
+	jae .Lctr_enc_blk12_loop;
+	jmp .Lctr_enc_blk4;
+
+  .align 8
+  .Lctr_enc_blk12_handle_only_ctr_carry:
+	handle_ctr_128bit_add(12);
+	jmp .Lctr_enc_blk12_byte_bige_add;
+
+  .align 8
+  .Lctr_enc_blk12_handle_carry:
+	jz .Lctr_enc_blk12_handle_only_ctr_carry;
+	/* Increment counters (handle carry). */
+	prepare_ctr_const(%ymm4, %ymm7);
+	vmovdqa CADDR(.Lbswap128_mask, %ebx), %ymm2;
+	vpshufb %xmm2, %xmm6, %xmm1; /* be => le */
+	vmovdqa %xmm1, %xmm0;
+	inc_le128(%xmm1, %xmm4, %xmm5);
+	vinserti128 $1, %xmm1, %ymm0, %ymm6; /* ctr: +1:+0 */
+	handle_ctr_128bit_add(12);
+	vpshufb %ymm2, %ymm6, %ymm0;
+	vmovdqa %ymm0, (0 * 32)(%esp);
+	add2_le128(%ymm6, %ymm4, %ymm7, %ymm5, %ymm1); /* ctr: +3:+2 */
+	vpshufb %ymm2, %ymm6, %ymm0;
+	vmovdqa %ymm0, (1 * 32)(%esp);
+	add2_le128(%ymm6, %ymm4, %ymm7, %ymm5, %ymm1); /* ctr: +5:+4 */
+	vpshufb %ymm2, %ymm6, %ymm0;
+	vmovdqa %ymm0, (2 * 32)(%esp);
+	add2_le128(%ymm6, %ymm4, %ymm7, %ymm5, %ymm1); /* ctr: +7:+6 */
+	vpshufb %ymm2, %ymm6, %ymm3;
+	add2_le128(%ymm6, %ymm4, %ymm7, %ymm5, %ymm1); /* ctr: +9:+8 */
+	vpshufb %ymm2, %ymm6, %ymm5;
+	add2_le128(%ymm6, %ymm4, %ymm7, %ymm2, %ymm1); /* ctr: +11:+10 */
+	vmovdqa (0 * 32)(%esp), %ymm0;
+	vmovdqa (1 * 32)(%esp), %ymm1;
+	vmovdqa (2 * 32)(%esp), %ymm2;
+	vpshufb CADDR(.Lbswap128_mask, %ebx), %ymm6, %ymm6;
+
+	jmp .Lctr_enc_blk12_rounds;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lctr_enc_blk4:
+	cmpl $4, 4+20(%ebp);
+	jb .Lctr_enc_blk1;
+
+	subl $4, 4+20(%ebp);
+
+	vbroadcasti128 (%esi), %ymm3;
+
+	/* detect if carry handling is needed */
+	movl 12(%esi), %eax;
+	addl $(4 << 24), %eax;
+	jc .Lctr_enc_blk4_handle_carry;
+	movl %eax, 12(%esi);
+
+  .Lctr_enc_blk4_byte_bige_add:
+	/* Increment counters. */
+	vpaddb CADDR(.Lbige_addb_0, %ebx), %ymm3, %ymm0;
+	vpaddb CADDR(.Lbige_addb_2, %ebx), %ymm3, %ymm1;
+
+  .Lctr_enc_blk4_rounds:
+	/* AES rounds */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	XOR2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr_enc_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lctr_enc_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lctr_enc_blk4_last:
+	vpxor (0 * 16)(%ecx), %ymm4, %ymm5; /* Xor src to last round key. */
+	vpxor (2 * 16)(%ecx), %ymm4, %ymm6;
+	leal (4 * 16)(%ecx), %ecx;
+	vaesenclast %ymm5, %ymm0, %ymm0;
+	vaesenclast %ymm6, %ymm1, %ymm1;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	leal (4 * 16)(%edx), %edx;
+
+	jmp .Lctr_enc_blk1;
+
+  .align 8
+  .Lctr_enc_blk4_handle_only_ctr_carry:
+	handle_ctr_128bit_add(4);
+	jmp .Lctr_enc_blk4_byte_bige_add;
+
+  .align 8
+  .Lctr_enc_blk4_handle_carry:
+	jz .Lctr_enc_blk4_handle_only_ctr_carry;
+	/* Increment counters (handle carry). */
+	prepare_ctr_const(%ymm4, %ymm7);
+	vpshufb CADDR(.Lbswap128_mask, %ebx), %xmm3, %xmm1; /* be => le */
+	vmovdqa %xmm1, %xmm0;
+	inc_le128(%xmm1, %xmm4, %xmm5);
+	vinserti128 $1, %xmm1, %ymm0, %ymm3; /* ctr: +1:+0 */
+	vpshufb CADDR(.Lbswap128_mask, %ebx), %ymm3, %ymm0;
+	handle_ctr_128bit_add(4);
+	add2_le128(%ymm3, %ymm4, %ymm7, %ymm5, %ymm6); /* ctr: +3:+2 */
+	vpshufb CADDR(.Lbswap128_mask, %ebx), %ymm3, %ymm1;
+
+	jmp .Lctr_enc_blk4_rounds;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lctr_enc_blk1:
+	cmpl $1, 4+20(%ebp);
+	jb .Ldone_ctr_enc;
+
+	subl $1, 4+20(%ebp);
+
+	/* Load and increment counter. */
+	vmovdqu (%esi), %xmm0;
+	handle_ctr_128bit_add(1);
+
+	/* AES rounds. */
+	vpxor (0 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (10 * 16)(%edi), %xmm1;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr_enc_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (12 * 16)(%edi), %xmm1;
+	jz .Lctr_enc_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (14 * 16)(%edi), %xmm1;
+
+	/* Last round and output handling. */
+  .Lctr_enc_blk1_last:
+	vpxor (%ecx), %xmm1, %xmm1; /* Xor src to last round key. */
+	leal 16(%ecx), %ecx;
+	vaesenclast %xmm1, %xmm0, %xmm0; /* Last round and xor with xmm1. */
+	vmovdqu %xmm0, (%edx);
+	leal 16(%edx), %edx;
+
+	jmp .Lctr_enc_blk1;
+
+.align 8
+.Ldone_ctr_enc:
+	vpxor %ymm0, %ymm0, %ymm0;
+	movl (3 * 32 + 0 * 4)(%esp), %edi;
+	CFI_RESTORE(edi);
+	movl (3 * 32 + 1 * 4)(%esp), %esi;
+	CFI_RESTORE(esi);
+	movl (3 * 32 + 2 * 4)(%esp), %ebx;
+	CFI_RESTORE(ebx);
+	vmovdqa %ymm0, (0 * 32)(%esp);
+	vmovdqa %ymm0, (1 * 32)(%esp);
+	vmovdqa %ymm0, (2 * 32)(%esp);
+	leave;
+	CFI_LEAVE();
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_ctr_enc_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_ctr_enc_i386))
+
+/**********************************************************************
+  Little-endian 32-bit CTR-mode encryption (GCM-SIV)
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_ctr32le_enc_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_ctr32le_enc_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_ctr32le_enc_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): counter
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 */
+	CFI_STARTPROC();
+
+	GET_DATA_POINTER(SYM_NAME(_gcry_vaes_consts), eax);
+
+	pushl %ebp;
+	CFI_PUSH(%ebp);
+	movl %esp, %ebp;
+	CFI_DEF_CFA_REGISTER(%ebp);
+
+	subl $(3 * 4), %esp;
+
+	movl %edi, (0 * 4)(%esp);
+	CFI_REG_ON_STACK(edi, 0 * 4);
+	movl %esi, (1 * 4)(%esp);
+	CFI_REG_ON_STACK(esi, 1 * 4);
+	movl %ebx, (2 * 4)(%esp);
+	CFI_REG_ON_STACK(ebx, 2 * 4);
+
+	movl %eax, %ebx;
+	movl 4+4(%ebp), %edi;
+	movl 4+8(%ebp), %esi;
+	movl 4+12(%ebp), %edx;
+	movl 4+16(%ebp), %ecx;
+	movl 4+20(%ebp), %eax;
+
+	vbroadcasti128 (%esi), %ymm7; /* Load CTR. */
+
+	/* Process 12 blocks per loop. */
+.align 8
+.Lctr32le_enc_blk12:
+	cmpl $12, %eax;
+	jb .Lctr32le_enc_blk4;
+
+	leal -12(%eax), %eax;
+
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+
+	/* Increment counters. */
+	vpaddd CADDR(.Lle_addd_0, %ebx), %ymm7, %ymm0;
+	vpaddd CADDR(.Lle_addd_2, %ebx), %ymm7, %ymm1;
+	vpaddd CADDR(.Lle_addd_4, %ebx), %ymm7, %ymm2;
+	vpaddd CADDR(.Lle_addd_6, %ebx), %ymm7, %ymm3;
+	vpaddd CADDR(.Lle_addd_8, %ebx), %ymm7, %ymm5;
+	vpaddd CADDR(.Lle_addd_10, %ebx), %ymm7, %ymm6;
+
+	vpaddd CADDR(.Lle_addd_12_2, %ebx), %ymm7, %ymm7;
+	vmovdqu %xmm7, (%esi); /* Store CTR. */
+
+	/* AES rounds */
+	XOR6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr32le_enc_blk8_last;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lctr32le_enc_blk8_last;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC6(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3, %ymm5, %ymm6);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lctr32le_enc_blk8_last:
+	vpxor (0 * 16)(%ecx), %ymm4, %ymm7; /* Xor src to last round key. */
+	vaesenclast %ymm7, %ymm0, %ymm0;
+	vpxor (2 * 16)(%ecx), %ymm4, %ymm7;
+	vaesenclast %ymm7, %ymm1, %ymm1;
+	vpxor (4 * 16)(%ecx), %ymm4, %ymm7;
+	vaesenclast %ymm7, %ymm2, %ymm2;
+	vpxor (6 * 16)(%ecx), %ymm4, %ymm7;
+	vaesenclast %ymm7, %ymm3, %ymm3;
+	vpxor (8 * 16)(%ecx), %ymm4, %ymm7;
+	vpxor (10 * 16)(%ecx), %ymm4, %ymm4;
+	vaesenclast %ymm7, %ymm5, %ymm5;
+	vbroadcasti128 (%esi), %ymm7; /* Reload CTR. */
+	vaesenclast %ymm4, %ymm6, %ymm6;
+	leal (12 * 16)(%ecx), %ecx;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	vmovdqu %ymm2, (4 * 16)(%edx);
+	vmovdqu %ymm3, (6 * 16)(%edx);
+	vmovdqu %ymm5, (8 * 16)(%edx);
+	vmovdqu %ymm6, (10 * 16)(%edx);
+	leal (12 * 16)(%edx), %edx;
+
+	jmp .Lctr32le_enc_blk12;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lctr32le_enc_blk4:
+	cmpl $4, %eax;
+	jb .Lctr32le_enc_blk1;
+
+	leal -4(%eax), %eax;
+
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+
+	/* Increment counters. */
+	vpaddd CADDR(.Lle_addd_0, %ebx), %ymm7, %ymm0;
+	vpaddd CADDR(.Lle_addd_2, %ebx), %ymm7, %ymm1;
+
+	vpaddd CADDR(.Lle_addd_4_2, %ebx), %ymm7, %ymm7;
+
+	/* AES rounds */
+	XOR2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (2 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (3 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (4 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (5 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (6 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (7 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (8 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (9 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (10 * 16)(%edi), %ymm4;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr32le_enc_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (11 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (12 * 16)(%edi), %ymm4;
+	jz .Lctr32le_enc_blk4_last;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (13 * 16)(%edi), %ymm4;
+	VAESENC2(%ymm4, %ymm0, %ymm1);
+	vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+	/* Last round and output handling. */
+  .Lctr32le_enc_blk4_last:
+	vpxor (0 * 16)(%ecx), %ymm4, %ymm5; /* Xor src to last round key. */
+	vpxor (2 * 16)(%ecx), %ymm4, %ymm6;
+	leal (4 * 16)(%ecx), %ecx;
+	vaesenclast %ymm5, %ymm0, %ymm0;
+	vaesenclast %ymm6, %ymm1, %ymm1;
+	vmovdqu %ymm0, (0 * 16)(%edx);
+	vmovdqu %ymm1, (2 * 16)(%edx);
+	leal (4 * 16)(%edx), %edx;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lctr32le_enc_blk1:
+	cmpl $1, %eax;
+	jb .Ldone_ctr32le_enc;
+
+	leal -1(%eax), %eax;
+
+	/* Load and increment counter. */
+	vmovdqu %xmm7, %xmm0;
+	vpaddd CADDR(.Lle_addd_1, %ebx), %xmm7, %xmm7;
+
+	/* AES rounds. */
+	vpxor (0 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+	vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (10 * 16)(%edi), %xmm1;
+	cmpl $12, 4+24(%ebp);
+	jb .Lctr32le_enc_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (12 * 16)(%edi), %xmm1;
+	jz .Lctr32le_enc_blk1_last;
+	vaesenc %xmm1, %xmm0, %xmm0;
+	vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+	vmovdqa (14 * 16)(%edi), %xmm1;
+
+	/* Last round and output handling. */
+  .Lctr32le_enc_blk1_last:
+	vpxor (%ecx), %xmm1, %xmm1; /* Xor src to last round key. */
+	leal 16(%ecx), %ecx;
+	vaesenclast %xmm1, %xmm0, %xmm0; /* Last round and xor with xmm1. */
+	vmovdqu %xmm0, (%edx);
+	leal 16(%edx), %edx;
+
+	jmp .Lctr32le_enc_blk1;
+
+.align 8
+.Ldone_ctr32le_enc:
+	vmovdqu %xmm7, (%esi); /* Store CTR. */
+	movl (0 * 4)(%esp), %edi;
+	CFI_RESTORE(edi);
+	movl (1 * 4)(%esp), %esi;
+	CFI_RESTORE(esi);
+	movl (2 * 4)(%esp), %ebx;
+	CFI_RESTORE(ebx);
+	leave;
+	CFI_LEAVE();
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_ctr32le_enc_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_ctr32le_enc_i386))
+
+/**********************************************************************
+  OCB-mode encryption/decryption/authentication
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_ocb_crypt_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_ocb_crypt_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_ocb_crypt_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): dst
+	 *	(esp + 12): src
+	 *	(esp + 16): nblocks
+	 *	(esp + 20): nrounds
+	 *	(esp + 24): offset
+	 *	(esp + 28): checksum
+	 *	(esp + 32): blkn
+	 *	(esp + 36): L table
+	 *	(esp + 40): encrypt/decrypt/auth mode
+	 */
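+	/* Note: judging from the 'cmpl $1' dispatch below, mode 0 selects
+	 * decryption, mode 1 encryption and any larger value the
+	 * authentication-only path. */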
+	CFI_STARTPROC();
+
+	pushl %ebp;
+	CFI_PUSH(%ebp);
+	movl %esp, %ebp;
+	CFI_DEF_CFA_REGISTER(%ebp);
+
+#define STACK_VEC_POS           0
+#define STACK_TMP_Y0            (STACK_VEC_POS + 0 * 32)
+#define STACK_TMP_Y1            (STACK_VEC_POS + 1 * 32)
+#define STACK_TMP_Y2            (STACK_VEC_POS + 2 * 32)
+#define STACK_TMP_Y3            (STACK_VEC_POS + 3 * 32)
+#define STACK_TMP_Y4            (STACK_VEC_POS + 4 * 32)
+#define STACK_TMP_Y5            (STACK_VEC_POS + 5 * 32)
+#define STACK_FXL_KEY           (STACK_VEC_POS + 6 * 32)
+#define STACK_OFFSET_AND_F_KEY  (STACK_VEC_POS + 7 * 32)
+#define STACK_CHECKSUM          (STACK_VEC_POS + 8 * 32)
+#define STACK_GPR_POS           (9 * 32)
+#define STACK_END_POS           (STACK_GPR_POS + 3 * 4)
+
+	subl $STACK_END_POS, %esp;
+	andl $-32, %esp;
+
+	movl %edi, (STACK_GPR_POS + 0 * 4)(%esp);
+	CFI_REG_ON_STACK(edi, STACK_GPR_POS + 0 * 4);
+	movl %esi, (STACK_GPR_POS + 1 * 4)(%esp);
+	CFI_REG_ON_STACK(esi, STACK_GPR_POS + 1 * 4);
+	movl %ebx, (STACK_GPR_POS + 2 * 4)(%esp);
+	CFI_REG_ON_STACK(ebx, STACK_GPR_POS + 2 * 4);
+
+	movl 4+4(%ebp), %edi;
+	movl 4+8(%ebp), %esi;
+	movl 4+12(%ebp), %edx;
+	movl 4+32(%ebp), %ebx;
+
+	movl 4+24(%ebp), %eax;
+	movl 4+20(%ebp), %ecx;
+	leal (, %ecx, 4), %ecx;
+	vmovdqu (%eax), %xmm1; /* offset */
+	vmovdqa (%edi), %xmm0; /* first key */
+	vpxor %xmm0, %xmm1, %xmm1; /* offset ^ first key */
+	vpxor (%edi, %ecx, 4), %xmm0, %xmm0; /* first key ^ last key */
+	vinserti128 $1, %xmm0, %ymm0, %ymm0;
+	vpxor %ymm2, %ymm2, %ymm2;
+	vmovdqa %xmm1, (STACK_OFFSET_AND_F_KEY)(%esp);
+	vmovdqa %ymm2, (STACK_CHECKSUM)(%esp);
+	vmovdqa %ymm0, (STACK_FXL_KEY)(%esp);
+
+	cmpl $12, 4+16(%ebp);
+	jae .Locb_crypt_blk12_loop;
+	jmp .Locb_crypt_blk4;
+
+	/* Process 12 blocks per loop. */
+.align 16
+.Locb_crypt_blk12_loop:
+	subl $12, 4+16(%ebp);
+
+	movl 4+36(%ebp), %ecx;
+	vmovdqa (%ecx), %xmm7; /* Preload L[0] */
+
+	testl $1, %ebx;
+	jz .Locb_crypt_blk12_nblk_even;
+		/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+		leal 1(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+1)
+		shll $4, %eax;
+		vmovdqa (STACK_OFFSET_AND_F_KEY)(%esp), %xmm1;
+		vpxor (%ecx, %eax), %xmm1, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm1;
+		vmovdqa %ymm1, (STACK_TMP_Y0)(%esp);
+
+		leal 3(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+3)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm0, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm2;
+
+		leal 5(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+5)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm0, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm3;
+
+		leal 7(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+7)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm0, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm4;
+
+		leal 9(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+9)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm0, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm5;
+
+		leal 11(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+11)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm0, %xmm1;
+
+		leal 12(%ebx), %ebx;
+		vpxor %xmm7, %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm6;
+
+		cmpl $1, 4+40(%ebp);
+		jb .Locb_dec_blk12;
+		ja .Locb_auth_blk12;
+		jmp .Locb_enc_blk12;
+
+	.align 8
+	.Locb_crypt_blk12_nblk_even:
+		/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+		vpxor (STACK_OFFSET_AND_F_KEY)(%esp), %xmm7, %xmm1;
+
+		leal 2(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+2)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm1;
+		vmovdqa %ymm1, (STACK_TMP_Y0)(%esp);
+
+		vpxor %xmm7, %xmm0, %xmm1;
+
+		leal 4(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+4)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm2;
+
+		vpxor %xmm7, %xmm0, %xmm1;
+
+		leal 6(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+6)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm3;
+
+		vpxor %xmm7, %xmm0, %xmm1;
+
+		leal 8(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+8)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm4;
+
+		vpxor %xmm7, %xmm0, %xmm1;
+
+		leal 10(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+10)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm5;
+
+		vpxor %xmm7, %xmm0, %xmm1;
+
+		leal 12(%ebx), %ebx;
+		tzcntl %ebx, %eax; // ntz(blkn+12)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm0;
+		vinserti128 $1, %xmm0, %ymm1, %ymm6;
+
+		cmpl $1, 4+40(%ebp);
+		jb .Locb_dec_blk12;
+		ja .Locb_auth_blk12;
+
+	.align 8
+	.Locb_enc_blk12:
+		vmovdqa %ymm2, (STACK_TMP_Y1)(%esp);
+		vmovdqa %ymm3, (STACK_TMP_Y2)(%esp);
+		vmovdqa %ymm4, (STACK_TMP_Y3)(%esp);
+		vmovdqa %ymm5, (STACK_TMP_Y4)(%esp);
+		vmovdqa %ymm6, (STACK_TMP_Y5)(%esp);
+		vmovdqa %xmm0, (STACK_OFFSET_AND_F_KEY)(%esp);
+
+		vmovdqu 0*16(%edx), %ymm1;
+		vmovdqu 2*16(%edx), %ymm2;
+		vmovdqu 4*16(%edx), %ymm3;
+		vmovdqu 6*16(%edx), %ymm4;
+		vmovdqu 8*16(%edx), %ymm5;
+		vmovdqu 10*16(%edx), %ymm6;
+		leal 12*16(%edx), %edx;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor %ymm1, %ymm2, %ymm0;
+		vpxor %ymm3, %ymm4, %ymm7;
+		vpxor %ymm5, %ymm0, %ymm0;
+		vpxor %ymm6, %ymm7, %ymm7;
+		vpxor %ymm0, %ymm7, %ymm7;
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm7, %ymm7;
+
+		/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+		vpxor (STACK_TMP_Y0)(%esp), %ymm1, %ymm1;
+		vpxor (STACK_TMP_Y1)(%esp), %ymm2, %ymm2;
+		vpxor (STACK_TMP_Y2)(%esp), %ymm3, %ymm3;
+		vpxor (STACK_TMP_Y3)(%esp), %ymm4, %ymm4;
+		vpxor (STACK_TMP_Y4)(%esp), %ymm5, %ymm5;
+		vpxor (STACK_TMP_Y5)(%esp), %ymm6, %ymm6;
+
+		vmovdqa %ymm7, (STACK_CHECKSUM)(%esp);
+
+		/* AES rounds */
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
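+		/* Round count selects the key size: nrounds < 12 is AES-128
+		 * (10 rounds), nrounds == 12 is AES-192, otherwise AES-256
+		 * (14 rounds).  The 'jz' further below reuses the flags set
+		 * by this comparison (vaesenc and vbroadcasti128 leave the
+		 * flags intact). */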
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_enc_blk12_last;
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		jz .Locb_enc_blk12_last;
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+
+		/* Last round and output handling. */
+	  .Locb_enc_blk12_last:
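+		/* The per-block offsets saved in STACK_TMP_Y0..Y5 are xored
+		 * into the last-round key from STACK_FXL_KEY, so vaesenclast
+		 * applies both the final AES round and the trailing
+		 * 'xor Offset_i' of OCB in one step. */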
+		vmovdqa (STACK_FXL_KEY)(%esp), %ymm0;
+		vpxor (STACK_TMP_Y0)(%esp), %ymm0, %ymm7;
+		vaesenclast %ymm7, %ymm1, %ymm1;
+		vpxor (STACK_TMP_Y1)(%esp), %ymm0, %ymm7;
+		vmovdqu %ymm1, 0*16(%esi);
+		vpxor (STACK_TMP_Y2)(%esp), %ymm0, %ymm1;
+		vaesenclast %ymm7, %ymm2, %ymm2;
+		vpxor (STACK_TMP_Y3)(%esp), %ymm0, %ymm7;
+		vaesenclast %ymm1, %ymm3, %ymm3;
+		vpxor (STACK_TMP_Y4)(%esp), %ymm0, %ymm1;
+		vaesenclast %ymm7, %ymm4, %ymm4;
+		vpxor (STACK_TMP_Y5)(%esp), %ymm0, %ymm7;
+		vaesenclast %ymm1, %ymm5, %ymm5;
+		vaesenclast %ymm7, %ymm6, %ymm6;
+		vmovdqu %ymm2, 2*16(%esi);
+		vmovdqu %ymm3, 4*16(%esi);
+		vmovdqu %ymm4, 6*16(%esi);
+		vmovdqu %ymm5, 8*16(%esi);
+		vmovdqu %ymm6, 10*16(%esi);
+		leal 12*16(%esi), %esi;
+
+		cmpl $12, 4+16(%ebp);
+		jae .Locb_crypt_blk12_loop;
+		jmp .Locb_crypt_blk12_cleanup;
+
+	.align 8
+	.Locb_auth_blk12:
+		vmovdqa %xmm0, (STACK_OFFSET_AND_F_KEY)(%esp);
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+
+		/* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i)  */
+		vmovdqa (STACK_TMP_Y0)(%esp), %ymm1;
+		vpxor 0*16(%edx), %ymm1, %ymm1;
+		vpxor 2*16(%edx), %ymm2, %ymm2;
+		vpxor 4*16(%edx), %ymm3, %ymm3;
+		vpxor 6*16(%edx), %ymm4, %ymm4;
+		vpxor 8*16(%edx), %ymm5, %ymm5;
+		vpxor 10*16(%edx), %ymm6, %ymm6;
+		leal 12*16(%edx), %edx;
+
+		/* AES rounds */
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_auth_blk12_last;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		jz .Locb_auth_blk12_last;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESENC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (14 * 16)(%edi), %ymm0;
+
+		/* Last round and output handling. */
+	  .Locb_auth_blk12_last:
+		vaesenclast %ymm0, %ymm1, %ymm1;
+		vaesenclast %ymm0, %ymm2, %ymm2;
+		vaesenclast %ymm0, %ymm3, %ymm3;
+		vaesenclast %ymm0, %ymm4, %ymm4;
+		vaesenclast %ymm0, %ymm5, %ymm5;
+		vaesenclast %ymm0, %ymm6, %ymm6;
+
+		vpxor %ymm1, %ymm2, %ymm0;
+		vpxor %ymm3, %ymm4, %ymm4;
+		vpxor %ymm5, %ymm0, %ymm0;
+		vpxor %ymm6, %ymm4, %ymm4;
+		vpxor %ymm0, %ymm4, %ymm4;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm4, %ymm4;
+		vmovdqa %ymm4, (STACK_CHECKSUM)(%esp);
+
+		cmpl $12, 4+16(%ebp);
+		jae .Locb_crypt_blk12_loop;
+		jmp .Locb_crypt_blk12_cleanup;
+
+	.align 8
+	.Locb_dec_blk12:
+		vmovdqa %xmm0, (STACK_OFFSET_AND_F_KEY)(%esp);
+
+		/* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i)  */
+		vmovdqa (STACK_TMP_Y0)(%esp), %ymm1;
+		vmovdqu 0*16(%edx), %ymm0;
+		vmovdqu 2*16(%edx), %ymm7;
+		vpxor %ymm0, %ymm1, %ymm1;
+		vmovdqa %ymm2, (STACK_TMP_Y1)(%esp);
+		vpxor %ymm7, %ymm2, %ymm2;
+		vmovdqu 4*16(%edx), %ymm0;
+		vmovdqu 6*16(%edx), %ymm7;
+		vmovdqa %ymm3, (STACK_TMP_Y2)(%esp);
+		vmovdqa %ymm4, (STACK_TMP_Y3)(%esp);
+		vpxor %ymm0, %ymm3, %ymm3;
+		vpxor %ymm7, %ymm4, %ymm4;
+		vmovdqu 8*16(%edx), %ymm0;
+		vmovdqu 10*16(%edx), %ymm7;
+		leal 12*16(%edx), %edx;
+		vmovdqa %ymm5, (STACK_TMP_Y4)(%esp);
+		vmovdqa %ymm6, (STACK_TMP_Y5)(%esp);
+		vpxor %ymm0, %ymm5, %ymm5;
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+		vpxor %ymm7, %ymm6, %ymm6;
+
+		/* AES rounds */
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_dec_blk12_last;
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		jz .Locb_dec_blk12_last;
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESDEC6(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6);
+
+		/* Last round and output handling. */
+	  .Locb_dec_blk12_last:
+		vmovdqa (STACK_FXL_KEY)(%esp), %ymm0;
+		vpxor (STACK_TMP_Y0)(%esp), %ymm0, %ymm7;
+		vaesdeclast %ymm7, %ymm1, %ymm1;
+		vmovdqu %ymm1, 0*16(%esi);
+		vpxor (STACK_TMP_Y1)(%esp), %ymm0, %ymm1;
+		vpxor (STACK_TMP_Y2)(%esp), %ymm0, %ymm7;
+		vaesdeclast %ymm1, %ymm2, %ymm2;
+		vpxor (STACK_TMP_Y3)(%esp), %ymm0, %ymm1;
+		vaesdeclast %ymm7, %ymm3, %ymm3;
+		vpxor (STACK_TMP_Y4)(%esp), %ymm0, %ymm7;
+		vaesdeclast %ymm1, %ymm4, %ymm4;
+		vpxor (STACK_TMP_Y5)(%esp), %ymm0, %ymm0;
+		vaesdeclast %ymm7, %ymm5, %ymm5;
+		vaesdeclast %ymm0, %ymm6, %ymm6;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor %ymm2, %ymm3, %ymm0;
+		vpxor %ymm4, %ymm5, %ymm7;
+		vpxor %ymm6, %ymm0, %ymm0;
+		vpxor 0*16(%esi), %ymm7, %ymm7;
+		vpxor %ymm0, %ymm7, %ymm7;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm7, %ymm7;
+
+		vmovdqu %ymm2, 2*16(%esi);
+		vmovdqu %ymm3, 4*16(%esi);
+		vmovdqu %ymm4, 6*16(%esi);
+		vmovdqu %ymm5, 8*16(%esi);
+		vmovdqu %ymm6, 10*16(%esi);
+		leal 12*16(%esi), %esi;
+
+		vmovdqa %ymm7, (STACK_CHECKSUM)(%esp);
+
+		cmpl $12, 4+16(%ebp);
+		jae .Locb_crypt_blk12_loop;
+
+.align 8
+.Locb_crypt_blk12_cleanup:
+	vpxor %ymm0, %ymm0, %ymm0;
+	vmovdqa %ymm0, (STACK_TMP_Y0)(%esp);
+	vmovdqa %ymm0, (STACK_TMP_Y1)(%esp);
+	vmovdqa %ymm0, (STACK_TMP_Y2)(%esp);
+	vmovdqa %ymm0, (STACK_TMP_Y3)(%esp);
+	vmovdqa %ymm0, (STACK_TMP_Y4)(%esp);
+	vmovdqa %ymm0, (STACK_TMP_Y5)(%esp);
+
+	/* Process trailing four blocks. */
+.align 8
+.Locb_crypt_blk4:
+	cmpl $4, 4+16(%ebp);
+	jb .Locb_crypt_blk1;
+
+	subl $4, 4+16(%ebp);
+
+	movl 4+36(%ebp), %ecx;
+	vmovdqa (%ecx), %xmm7; /* Preload L[0] */
+
+	testl $1, %ebx;
+	jz .Locb_crypt_blk4_nblk_even;
+		/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+		leal 1(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+1)
+		shll $4, %eax;
+		vmovdqa (STACK_OFFSET_AND_F_KEY)(%esp), %xmm1;
+		vpxor (%ecx, %eax), %xmm1, %xmm1;
+
+		vpxor %xmm7, %xmm1, %xmm2;
+		vinserti128 $1, %xmm2, %ymm1, %ymm6;
+
+		leal 3(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+3)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm2, %xmm3;
+
+		leal 4(%ebx), %ebx;
+		vpxor %xmm7, %xmm3, %xmm4;
+		vinserti128 $1, %xmm4, %ymm3, %ymm7;
+		vmovdqa %xmm4, (STACK_OFFSET_AND_F_KEY)(%esp);
+
+		cmpl $1, 4+40(%ebp);
+		jb .Locb_dec_blk4;
+		ja .Locb_auth_blk4;
+		jmp .Locb_enc_blk4;
+
+	.align 8
+	.Locb_crypt_blk4_nblk_even:
+		/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+		vmovdqa (STACK_OFFSET_AND_F_KEY)(%esp), %xmm1;
+		vpxor %xmm7, %xmm1, %xmm1;
+
+		leal 2(%ebx), %eax;
+		tzcntl %eax, %eax; // ntz(blkn+2)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm1, %xmm2;
+		vinserti128 $1, %xmm2, %ymm1, %ymm6;
+
+		vpxor %xmm7, %xmm2, %xmm3;
+
+		leal 4(%ebx), %ebx;
+		tzcntl %ebx, %eax; // ntz(blkn+4)
+		shll $4, %eax;
+		vpxor (%ecx, %eax), %xmm3, %xmm4;
+		vinserti128 $1, %xmm4, %ymm3, %ymm7;
+		vmovdqa %xmm4, (STACK_OFFSET_AND_F_KEY)(%esp);
+
+		cmpl $1, 4+40(%ebp);
+		jb .Locb_dec_blk4;
+		ja .Locb_auth_blk4;
+
+	.align 8
+	.Locb_enc_blk4:
+		vmovdqu 0*16(%edx), %ymm1;
+		vmovdqu 2*16(%edx), %ymm2;
+		leal 4*16(%edx), %edx;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor %ymm1, %ymm2, %ymm5;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm5, %ymm5;
+		vmovdqa %ymm5, (STACK_CHECKSUM)(%esp);
+
+		/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+		vpxor %ymm6, %ymm1, %ymm1;
+		vpxor %ymm7, %ymm2, %ymm2;
+
+		/* AES rounds */
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_enc_blk4_last;
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		jz .Locb_enc_blk4_last;
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+
+		/* Last round and output handling. */
+	  .Locb_enc_blk4_last:
+		vmovdqa (STACK_FXL_KEY)(%esp), %ymm0;
+		vpxor %ymm0, %ymm6, %ymm6; /* Xor offset to last round key. */
+		vpxor %ymm0, %ymm7, %ymm7;
+		vaesenclast %ymm6, %ymm1, %ymm1;
+		vaesenclast %ymm7, %ymm2, %ymm2;
+		vmovdqu %ymm1, 0*16(%esi);
+		vmovdqu %ymm2, 2*16(%esi);
+		leal 4*16(%esi), %esi;
+
+		jmp .Locb_crypt_blk1;
+
+	.align 8
+	.Locb_auth_blk4:
+		/* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i)  */
+		vpxor 0*16(%edx), %ymm6, %ymm1;
+		vpxor 2*16(%edx), %ymm7, %ymm2;
+		leal 4*16(%edx), %edx;
+
+		/* AES rounds */
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_auth_blk4_last;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		jz .Locb_auth_blk4_last;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESENC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (14 * 16)(%edi), %ymm0;
+
+		/* Last round and output handling. */
+	  .Locb_auth_blk4_last:
+		vaesenclast %ymm0, %ymm1, %ymm1;
+		vaesenclast %ymm0, %ymm2, %ymm2;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor %ymm1, %ymm2, %ymm5;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm5, %ymm5;
+		vmovdqa %ymm5, (STACK_CHECKSUM)(%esp);
+
+		jmp .Locb_crypt_blk1;
+
+	.align 8
+	.Locb_dec_blk4:
+		/* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i)  */
+		vpxor 0*16(%edx), %ymm6, %ymm1;
+		vpxor 2*16(%edx), %ymm7, %ymm2;
+		leal 4*16(%edx), %edx;
+
+		/* AES rounds */
+		vbroadcasti128 (1 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (2 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (3 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (4 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (5 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (6 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (7 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (8 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (9 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_dec_blk4_last;
+		vbroadcasti128 (10 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (11 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		jz .Locb_dec_blk4_last;
+		vbroadcasti128 (12 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+		vbroadcasti128 (13 * 16)(%edi), %ymm0;
+		VAESDEC2(%ymm0, %ymm1, %ymm2);
+
+		/* Last round and output handling. */
+	  .Locb_dec_blk4_last:
+		vmovdqa (STACK_FXL_KEY)(%esp), %ymm0;
+		vpxor %ymm0, %ymm6, %ymm6; /* Xor offset to last round key. */
+		vpxor %ymm0, %ymm7, %ymm7;
+		vaesdeclast %ymm6, %ymm1, %ymm1;
+		vaesdeclast %ymm7, %ymm2, %ymm2;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor %ymm1, %ymm2, %ymm5;
+		vpxor (STACK_CHECKSUM)(%esp), %ymm5, %ymm5;
+
+		vmovdqu %ymm1, 0*16(%esi);
+		vmovdqu %ymm2, 2*16(%esi);
+		leal 4*16(%esi), %esi;
+
+		vmovdqa %ymm5, (STACK_CHECKSUM)(%esp);
+
+	/* Process the remaining blocks, one per loop iteration. */
+.align 8
+.Locb_crypt_blk1:
+	cmpl $1, 4+16(%ebp);
+	jb .Locb_crypt_done;
+
+	subl $1, 4+16(%ebp);
+
+	movl 4+36(%ebp), %ecx;
+	leal 1(%ebx), %ebx;
+	tzcntl %ebx, %eax; // ntz(blkn+1)
+	shll $4, %eax;
+	vmovdqa (STACK_OFFSET_AND_F_KEY)(%esp), %xmm7;
+	vpxor (%ecx, %eax), %xmm7, %xmm7;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	vmovdqa %xmm7, (STACK_OFFSET_AND_F_KEY)(%esp);
+
+	cmpl $1, 4+40(%ebp);
+	jb .Locb_dec_blk1;
+	ja .Locb_auth_blk1;
+		vmovdqu (%edx), %xmm0;
+		leal 16(%edx), %edx;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor (STACK_CHECKSUM)(%esp), %xmm0, %xmm1;
+		vmovdqa %xmm1, (STACK_CHECKSUM)(%esp);
+
+		/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+		vpxor %xmm7, %xmm0, %xmm0;
+
+		/* AES rounds. */
+		vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_enc_blk1_last;
+		vaesenc (10 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+		jz .Locb_enc_blk1_last;
+		vaesenc (12 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+
+		/* Last round and output handling. */
+	  .Locb_enc_blk1_last:
+		vpxor (STACK_FXL_KEY)(%esp), %xmm7, %xmm1;
+		vaesenclast %xmm1, %xmm0, %xmm0;
+		vmovdqu %xmm0, (%esi);
+		leal 16(%esi), %esi;
+
+		jmp .Locb_crypt_blk1;
+
+	.align 8
+	.Locb_auth_blk1:
+		/* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i)  */
+		vpxor (%edx), %xmm7, %xmm0;
+		leal 16(%edx), %edx;
+
+		/* AES rounds. */
+		vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (10 * 16)(%edi), %xmm1;
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_auth_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (12 * 16)(%edi), %xmm1;
+		jz .Locb_auth_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (14 * 16)(%edi), %xmm1;
+
+		/* Last round and output handling. */
+	  .Locb_auth_blk1_last:
+		vpxor (STACK_CHECKSUM)(%esp), %xmm1, %xmm1;
+		vaesenclast %xmm1, %xmm0, %xmm0;
+		vmovdqa %xmm0, (STACK_CHECKSUM)(%esp);
+
+		jmp .Locb_crypt_blk1;
+
+	.align 8
+	.Locb_dec_blk1:
+		/* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i)  */
+		vpxor (%edx), %xmm7, %xmm0;
+		leal 16(%edx), %edx;
+
+		/* AES rounds. */
+		vaesdec (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (9 * 16)(%edi), %xmm0, %xmm0;
+		cmpl $12, 4+20(%ebp);
+		jb .Locb_dec_blk1_last;
+		vaesdec (10 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (11 * 16)(%edi), %xmm0, %xmm0;
+		jz .Locb_dec_blk1_last;
+		vaesdec (12 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (13 * 16)(%edi), %xmm0, %xmm0;
+
+		/* Last round and output handling. */
+	  .Locb_dec_blk1_last:
+		vpxor (STACK_FXL_KEY)(%esp), %xmm7, %xmm1;
+		vaesdeclast %xmm1, %xmm0, %xmm0;
+
+		/* Checksum_i = Checksum_{i-1} xor P_i  */
+		vpxor (STACK_CHECKSUM)(%esp), %xmm0, %xmm1;
+
+		vmovdqu %xmm0, (%esi);
+		leal 16(%esi), %esi;
+
+		vmovdqa %xmm1, (STACK_CHECKSUM)(%esp);
+
+		jmp .Locb_crypt_blk1;
+
+.align 8
+.Locb_crypt_done:
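+	/* Store the final offset: the stack slot keeps the offset already
+	 * combined with the first round key, so xor that key back out
+	 * before handing the offset to the caller. */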
+	movl 4+24(%ebp), %ecx;
+	vmovdqa (STACK_OFFSET_AND_F_KEY)(%esp), %xmm1;
+	vpxor (%edi), %xmm1, %xmm1;
+	vmovdqu %xmm1, (%ecx);
+
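+	/* Fold the two 128-bit lanes of the checksum accumulator and xor
+	 * the result into the caller's checksum. */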
+	movl 4+28(%ebp), %eax;
+	vmovdqa (STACK_CHECKSUM)(%esp), %xmm2;
+	vpxor (STACK_CHECKSUM + 16)(%esp), %xmm2, %xmm2;
+	vpxor (%eax), %xmm2, %xmm2;
+	vmovdqu %xmm2, (%eax);
+
+	movl (STACK_GPR_POS + 0 * 4)(%esp), %edi;
+	CFI_RESTORE(edi);
+	movl (STACK_GPR_POS + 1 * 4)(%esp), %esi;
+	CFI_RESTORE(esi);
+	movl (STACK_GPR_POS + 2 * 4)(%esp), %ebx;
+	CFI_RESTORE(ebx);
+
+	vpxor %ymm0, %ymm0, %ymm0;
+	vmovdqa %ymm0, (STACK_OFFSET_AND_F_KEY)(%esp);
+	vmovdqa %ymm0, (STACK_CHECKSUM)(%esp);
+
+	xorl %eax, %eax;
+	leave;
+	CFI_LEAVE();
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_ocb_crypt_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_ocb_crypt_i386))
+
+/**********************************************************************
+  XTS-mode encryption
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_xts_crypt_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_xts_crypt_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_xts_crypt_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): tweak
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 *	(esp + 28): encrypt
+	 */
+	CFI_STARTPROC();
+
+	GET_DATA_POINTER(SYM_NAME(_gcry_vaes_consts), eax);
+
+	pushl %ebp;
+	CFI_PUSH(%ebp);
+	movl %esp, %ebp;
+	CFI_DEF_CFA_REGISTER(%ebp);
+
+	subl $(4 * 32 + 3 * 4), %esp;
+	andl $-32, %esp;
+
+	movl %edi, (4 * 32 + 0 * 4)(%esp);
+	CFI_REG_ON_STACK(edi, 4 * 32 + 0 * 4);
+	movl %esi, (4 * 32 + 1 * 4)(%esp);
+	CFI_REG_ON_STACK(esi, 4 * 32 + 1 * 4);
+	movl %ebx, (4 * 32 + 2 * 4)(%esp);
+	CFI_REG_ON_STACK(ebx, 4 * 32 + 2 * 4);
+
+	movl %eax, %ebx;
+	movl 4+4(%ebp), %edi;
+	movl 4+8(%ebp), %esi;
+	movl 4+12(%ebp), %edx;
+	movl 4+16(%ebp), %ecx;
+	movl 4+20(%ebp), %eax;
+
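+/* Multiply the 128-bit tweak by x^(shift) in GF(2^128), i.e. perform 'shift'
+ * XTS tweak doublings at once: the top bits of each 64-bit half (pre-arranged
+ * by the .Lxts_high_bit_shuf shuffle) are pulled out, the overflow from
+ * bit 127 is reduced with the 0x87 constant via a carry-less multiply, and
+ * the cross-half carry is folded back in. */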
+#define tweak_clmul(shift, out, tweak, hi_tweak, tmp1, tmp2) \
+	vpsrld $(32-(shift)), hi_tweak, tmp2; \
+	vpsllq $(shift), tweak, out; \
+	vpclmulqdq $0, CADDR(.Lxts_gfmul_clmul, %ebx), tmp2, tmp1; \
+	vpunpckhqdq tmp2, tmp1, tmp1; \
+	vpxor tmp1, out, out;
+
+	/* Prepare tweak. */
+	vmovdqu (%esi), %xmm7;
+	vpshufb CADDR(.Lxts_high_bit_shuf, %ebx), %xmm7, %xmm6;
+	tweak_clmul(1, %xmm5, %xmm7, %xmm6, %xmm0, %xmm1);
+	vinserti128 $1, %xmm5, %ymm7, %ymm7; /* tweak:tweak1 */
+	vpshufb CADDR(.Lxts_high_bit_shuf, %ebx), %ymm7, %ymm6;
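+	/* %ymm7 now holds two consecutive tweaks (low lane: current block,
+	 * high lane: next block), so each 256-bit operation below covers
+	 * two blocks. */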
+
+	/* Process eight blocks per loop. */
+.align 8
+.Lxts_crypt_blk8:
+	cmpl $8, %eax;
+	jb .Lxts_crypt_blk4;
+
+	leal -8(%eax), %eax;
+
+	vmovdqa %ymm7, (0 * 32)(%esp);
+	tweak_clmul(2, %ymm2, %ymm7, %ymm6, %ymm0, %ymm1);
+	vmovdqa %ymm2, (1 * 32)(%esp);
+	tweak_clmul(4, %ymm2, %ymm7, %ymm6, %ymm0, %ymm1);
+	vmovdqa %ymm2, (2 * 32)(%esp);
+	tweak_clmul(6, %ymm2, %ymm7, %ymm6, %ymm0, %ymm1);
+	vmovdqa %ymm2, (3 * 32)(%esp);
+	tweak_clmul(8, %ymm7, %ymm7, %ymm6, %ymm0, %ymm1);
+	vpshufb CADDR(.Lxts_high_bit_shuf, %ebx), %ymm7, %ymm6;
+
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqa (0 * 32)(%esp), %ymm0;
+	vmovdqa (1 * 32)(%esp), %ymm1;
+	vmovdqa (2 * 32)(%esp), %ymm2;
+	vmovdqa (3 * 32)(%esp), %ymm3;
+	vpxor (0 * 16)(%ecx), %ymm0, %ymm0;
+	vpxor (2 * 16)(%ecx), %ymm1, %ymm1;
+	vpxor (4 * 16)(%ecx), %ymm2, %ymm2;
+	vpxor (6 * 16)(%ecx), %ymm3, %ymm3;
+
+	leal (8 * 16)(%ecx), %ecx;
+
+	cmpl $1, 4+28(%ebp);
+	jne .Lxts_dec_blk8;
+		/* AES rounds */
+		XOR4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (1 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_enc_blk8_last;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lxts_enc_blk8_last;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+		/* Last round and output handling. */
+	.Lxts_enc_blk8_last:
+		vpxor (0 * 32)(%esp), %ymm4, %ymm5; /* Xor tweak to last round key. */
+		vaesenclast %ymm5, %ymm0, %ymm0;
+		vpxor (1 * 32)(%esp), %ymm4, %ymm5;
+		vaesenclast %ymm5, %ymm1, %ymm1;
+		vpxor (2 * 32)(%esp), %ymm4, %ymm5;
+		vpxor (3 * 32)(%esp), %ymm4, %ymm4;
+		vaesenclast %ymm5, %ymm2, %ymm2;
+		vaesenclast %ymm4, %ymm3, %ymm3;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		vmovdqu %ymm2, (4 * 16)(%edx);
+		vmovdqu %ymm3, (6 * 16)(%edx);
+		leal (8 * 16)(%edx), %edx;
+
+		jmp .Lxts_crypt_blk8;
+
+	.align 8
+	.Lxts_dec_blk8:
+		/* AES rounds */
+		XOR4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (1 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_dec_blk8_last;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lxts_dec_blk8_last;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+		/* Last round and output handling. */
+	.Lxts_dec_blk8_last:
+		vpxor (0 * 32)(%esp), %ymm4, %ymm5; /* Xor tweak to last round key. */
+		vaesdeclast %ymm5, %ymm0, %ymm0;
+		vpxor (1 * 32)(%esp), %ymm4, %ymm5;
+		vaesdeclast %ymm5, %ymm1, %ymm1;
+		vpxor (2 * 32)(%esp), %ymm4, %ymm5;
+		vpxor (3 * 32)(%esp), %ymm4, %ymm4;
+		vaesdeclast %ymm5, %ymm2, %ymm2;
+		vaesdeclast %ymm4, %ymm3, %ymm3;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		vmovdqu %ymm2, (4 * 16)(%edx);
+		vmovdqu %ymm3, (6 * 16)(%edx);
+		leal (8 * 16)(%edx), %edx;
+
+		jmp .Lxts_crypt_blk8;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lxts_crypt_blk4:
+	/* Try to exit early, as the input length is typically a large power of 2. */
+	cmpl $1, %eax;
+	jb .Ldone_xts_crypt;
+	cmpl $4, %eax;
+	jb .Lxts_crypt_blk1;
+
+	leal -4(%eax), %eax;
+
+	vmovdqa %ymm7, %ymm2;
+	tweak_clmul(2, %ymm3, %ymm7, %ymm6, %ymm0, %ymm1);
+	tweak_clmul(4, %ymm7, %ymm7, %ymm6, %ymm0, %ymm1);
+	vpshufb CADDR(.Lxts_high_bit_shuf, %ebx), %ymm7, %ymm6;
+
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vpxor (0 * 16)(%ecx), %ymm2, %ymm0;
+	vpxor (2 * 16)(%ecx), %ymm3, %ymm1;
+
+	leal (4 * 16)(%ecx), %ecx;
+
+	cmpl $1, 4+28(%ebp);
+	jne .Lxts_dec_blk4;
+		/* AES rounds */
+		XOR2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (1 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_enc_blk4_last;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lxts_enc_blk4_last;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+		/* Last round and output handling. */
+	.Lxts_enc_blk4_last:
+		vpxor %ymm4, %ymm2, %ymm2; /* Xor tweak to last round key. */
+		vpxor %ymm4, %ymm3, %ymm3;
+		vaesenclast %ymm2, %ymm0, %ymm0;
+		vaesenclast %ymm3, %ymm1, %ymm1;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		leal (4 * 16)(%edx), %edx;
+
+		jmp .Lxts_crypt_blk1;
+
+	.align 8
+	.Lxts_dec_blk4:
+		/* AES rounds */
+		XOR2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (1 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_dec_blk4_last;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lxts_dec_blk4_last;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+
+		/* Last round and output handling. */
+	.Lxts_dec_blk4_last:
+		vpxor %ymm4, %ymm2, %ymm2; /* Xor tweak to last round key. */
+		vpxor %ymm4, %ymm3, %ymm3;
+		vaesdeclast %ymm2, %ymm0, %ymm0;
+		vaesdeclast %ymm3, %ymm1, %ymm1;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		leal (4 * 16)(%edx), %edx;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lxts_crypt_blk1:
+	cmpl $1, %eax;
+	jb .Ldone_xts_crypt;
+
+	leal -1(%eax), %eax;
+
+	vpxor (%ecx), %xmm7, %xmm0;
+	vmovdqa %xmm7, %xmm5;
+	tweak_clmul(1, %xmm7, %xmm7, %xmm6, %xmm2, %xmm3);
+	vpshufb CADDR(.Lxts_high_bit_shuf, %ebx), %xmm7, %xmm6;
+
+	leal 16(%ecx), %ecx;
+
+	cmpl $1, 4+28(%ebp);
+	jne .Lxts_dec_blk1;
+		/* AES rounds. */
+		vpxor (0 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (10 * 16)(%edi), %xmm1;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_enc_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (12 * 16)(%edi), %xmm1;
+		jz .Lxts_enc_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (14 * 16)(%edi), %xmm1;
+
+		/* Last round and output handling. */
+	.Lxts_enc_blk1_last:
+		vpxor %xmm1, %xmm5, %xmm5; /* Xor tweak to last round key. */
+		vaesenclast %xmm5, %xmm0, %xmm0;
+		vmovdqu %xmm0, (%edx);
+		leal 16(%edx), %edx;
+
+		jmp .Lxts_crypt_blk1;
+
+	.align 8
+	.Lxts_dec_blk1:
+		/* AES rounds. */
+		vpxor (0 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (9 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (10 * 16)(%edi), %xmm1;
+		cmpl $12, 4+24(%ebp);
+		jb .Lxts_dec_blk1_last;
+		vaesdec %xmm1, %xmm0, %xmm0;
+		vaesdec (11 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (12 * 16)(%edi), %xmm1;
+		jz .Lxts_dec_blk1_last;
+		vaesdec %xmm1, %xmm0, %xmm0;
+		vaesdec (13 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (14 * 16)(%edi), %xmm1;
+
+		/* Last round and output handling. */
+	.Lxts_dec_blk1_last:
+		vpxor %xmm1, %xmm5, %xmm5; /* Xor tweak to last round key. */
+		vaesdeclast %xmm5, %xmm0, %xmm0;
+		vmovdqu %xmm0, (%edx);
+		leal 16(%edx), %edx;
+
+		jmp .Lxts_crypt_blk1;
+
+.align 8
+.Ldone_xts_crypt:
+	/* Store IV. */
+	vmovdqu %xmm7, (%esi);
+
+	vpxor %ymm0, %ymm0, %ymm0;
+	movl (4 * 32 + 0 * 4)(%esp), %edi;
+	CFI_RESTORE(edi);
+	movl (4 * 32 + 1 * 4)(%esp), %esi;
+	CFI_RESTORE(esi);
+	movl (4 * 32 + 2 * 4)(%esp), %ebx;
+	CFI_RESTORE(ebx);
+	vmovdqa %ymm0, (0 * 32)(%esp);
+	vmovdqa %ymm0, (1 * 32)(%esp);
+	vmovdqa %ymm0, (2 * 32)(%esp);
+	vmovdqa %ymm0, (3 * 32)(%esp);
+	leave;
+	CFI_LEAVE();
+	vzeroall;
+	xorl %eax, %eax;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_xts_crypt_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_xts_crypt_i386))
+
+/**********************************************************************
+  ECB-mode encryption
+ **********************************************************************/
+ELF(.type SYM_NAME(_gcry_vaes_avx2_ecb_crypt_i386),@function)
+.globl SYM_NAME(_gcry_vaes_avx2_ecb_crypt_i386)
+.align 16
+SYM_NAME(_gcry_vaes_avx2_ecb_crypt_i386):
+	/* input:
+	 *	(esp + 4): round keys
+	 *	(esp + 8): encrypt
+	 *	(esp + 12): dst
+	 *	(esp + 16): src
+	 *	(esp + 20): nblocks
+	 *	(esp + 24): nrounds
+	 */
+	CFI_STARTPROC();
+	pushl %edi;
+	CFI_PUSH(%edi);
+	pushl %esi;
+	CFI_PUSH(%esi);
+
+	movl 8+4(%esp), %edi;
+	movl 8+8(%esp), %esi;
+	movl 8+12(%esp), %edx;
+	movl 8+16(%esp), %ecx;
+	movl 8+20(%esp), %eax;
+
+	/* Process 8 blocks per loop. */
+.align 8
+.Lecb_blk8:
+	cmpl $8, %eax;
+	jb .Lecb_blk4;
+
+	leal -8(%eax), %eax;
+
+	/* Load input and xor first key. */
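+	/* Each 256-bit load covers two consecutive 16-byte blocks; the
+	 * round key is broadcast to both 128-bit lanes. */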
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm0;
+	vmovdqu (2 * 16)(%ecx), %ymm1;
+	vmovdqu (4 * 16)(%ecx), %ymm2;
+	vmovdqu (6 * 16)(%ecx), %ymm3;
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vpxor %ymm4, %ymm2, %ymm2;
+	vpxor %ymm4, %ymm3, %ymm3;
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	leal (8 * 16)(%ecx), %ecx;
+
+	testl %esi, %esi;
+	jz .Lecb_dec_blk8;
+		/* AES rounds */
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_enc_blk8_last;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lecb_enc_blk8_last;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESENC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+	  .Lecb_enc_blk8_last:
+		vaesenclast %ymm4, %ymm0, %ymm0;
+		vaesenclast %ymm4, %ymm1, %ymm1;
+		vaesenclast %ymm4, %ymm2, %ymm2;
+		vaesenclast %ymm4, %ymm3, %ymm3;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		vmovdqu %ymm2, (4 * 16)(%edx);
+		vmovdqu %ymm3, (6 * 16)(%edx);
+		leal (8 * 16)(%edx), %edx;
+		jmp .Lecb_blk8;
+
+	  .align 8
+	  .Lecb_dec_blk8:
+		/* AES rounds */
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_dec_blk8_last;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lecb_dec_blk8_last;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESDEC4(%ymm4, %ymm0, %ymm1, %ymm2, %ymm3);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+	  .Lecb_dec_blk8_last:
+		vaesdeclast %ymm4, %ymm0, %ymm0;
+		vaesdeclast %ymm4, %ymm1, %ymm1;
+		vaesdeclast %ymm4, %ymm2, %ymm2;
+		vaesdeclast %ymm4, %ymm3, %ymm3;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		vmovdqu %ymm2, (4 * 16)(%edx);
+		vmovdqu %ymm3, (6 * 16)(%edx);
+		leal (8 * 16)(%edx), %edx;
+		jmp .Lecb_blk8;
+
+	/* Handle trailing four blocks. */
+.align 8
+.Lecb_blk4:
+	cmpl $4, %eax;
+	jb .Lecb_blk1;
+
+	leal -4(%eax), %eax;
+
+	/* Load input and xor first key. */
+	vbroadcasti128 (0 * 16)(%edi), %ymm4;
+	vmovdqu (0 * 16)(%ecx), %ymm0;
+	vmovdqu (2 * 16)(%ecx), %ymm1;
+	vpxor %ymm4, %ymm0, %ymm0;
+	vpxor %ymm4, %ymm1, %ymm1;
+	vbroadcasti128 (1 * 16)(%edi), %ymm4;
+	leal (4 * 16)(%ecx), %ecx;
+
+	testl %esi, %esi;
+	jz .Lecb_dec_blk4;
+		/* AES rounds */
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_enc_blk4_last;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lecb_enc_blk4_last;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESENC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+	  .Lecb_enc_blk4_last:
+		vaesenclast %ymm4, %ymm0, %ymm0;
+		vaesenclast %ymm4, %ymm1, %ymm1;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		leal (4 * 16)(%edx), %edx;
+		jmp .Lecb_blk1;
+
+	  .align 8
+	  .Lecb_dec_blk4:
+		/* AES rounds */
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (2 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (3 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (4 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (5 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (6 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (7 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (8 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (9 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (10 * 16)(%edi), %ymm4;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_dec_blk4_last;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (11 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (12 * 16)(%edi), %ymm4;
+		jz .Lecb_dec_blk4_last;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (13 * 16)(%edi), %ymm4;
+		VAESDEC2(%ymm4, %ymm0, %ymm1);
+		vbroadcasti128 (14 * 16)(%edi), %ymm4;
+	  .Lecb_dec_blk4_last:
+		vaesdeclast %ymm4, %ymm0, %ymm0;
+		vaesdeclast %ymm4, %ymm1, %ymm1;
+		vmovdqu %ymm0, (0 * 16)(%edx);
+		vmovdqu %ymm1, (2 * 16)(%edx);
+		leal (4 * 16)(%edx), %edx;
+
+	/* Process trailing one to three blocks, one per loop. */
+.align 8
+.Lecb_blk1:
+	cmpl $1, %eax;
+	jb .Ldone_ecb;
+
+	leal -1(%eax), %eax;
+
+	/* Load input. */
+	vmovdqu (%ecx), %xmm2;
+	leal 16(%ecx), %ecx;
+
+	/* Xor first key. */
+	vpxor (0 * 16)(%edi), %xmm2, %xmm0;
+
+	testl %esi, %esi;
+	jz .Lecb_dec_blk1;
+		/* AES rounds. */
+		vaesenc (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesenc (9 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (10 * 16)(%edi), %xmm1;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_enc_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (11 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (12 * 16)(%edi), %xmm1;
+		jz .Lecb_enc_blk1_last;
+		vaesenc %xmm1, %xmm0, %xmm0;
+		vaesenc (13 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (14 * 16)(%edi), %xmm1;
+	  .Lecb_enc_blk1_last:
+		vaesenclast %xmm1, %xmm0, %xmm0;
+		jmp .Lecb_blk1_end;
+
+	  .align 8
+	  .Lecb_dec_blk1:
+		/* AES rounds. */
+		vaesdec (1 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (2 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (3 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (4 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (5 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (6 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (7 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (8 * 16)(%edi), %xmm0, %xmm0;
+		vaesdec (9 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (10 * 16)(%edi), %xmm1;
+		cmpl $12, 8+24(%esp);
+		jb .Lecb_dec_blk1_last;
+		vaesdec %xmm1, %xmm0, %xmm0;
+		vaesdec (11 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (12 * 16)(%edi), %xmm1;
+		jz .Lecb_dec_blk1_last;
+		vaesdec %xmm1, %xmm0, %xmm0;
+		vaesdec (13 * 16)(%edi), %xmm0, %xmm0;
+		vmovdqa (14 * 16)(%edi), %xmm1;
+	  .Lecb_dec_blk1_last:
+		vaesdeclast %xmm1, %xmm0, %xmm0;
+		jmp .Lecb_blk1_end;
+
+  .align 8
+  .Lecb_blk1_end:
+	vmovdqu %xmm0, (%edx);
+	leal 16(%edx), %edx;
+
+	jmp .Lecb_blk1;
+
+.align 8
+.Ldone_ecb:
+	popl %esi;
+	CFI_POP(%esi);
+	popl %edi;
+	CFI_POP(%edi);
+	vzeroall;
+	ret_spec_stop
+	CFI_ENDPROC();
+ELF(.size SYM_NAME(_gcry_vaes_avx2_ecb_crypt_i386),
+	  .-SYM_NAME(_gcry_vaes_avx2_ecb_crypt_i386))
+
+/**********************************************************************
+  constants
+ **********************************************************************/
+SECTION_RODATA
+
+ELF(.type SYM_NAME(_gcry_vaes_consts),@object)
+.align 32
+SYM_NAME(_gcry_vaes_consts):
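+/* Counter-increment constants: .Lbige_addb_N adds N to the last byte of a
+ * big-endian counter block, .Lle_addd_N adds N to the first little-endian
+ * 32-bit word; used when building counter blocks for the CTR and CTR32LE
+ * paths. */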
+.Lbige_addb_0:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lbige_addb_1:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
+.Lbige_addb_2:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2
+.Lbige_addb_3:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3
+.Lbige_addb_4:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4
+.Lbige_addb_5:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5
+.Lbige_addb_6:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6
+.Lbige_addb_7:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7
+.Lbige_addb_8:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8
+.Lbige_addb_9:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9
+.Lbige_addb_10:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10
+.Lbige_addb_11:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11
+
+.Lle_addd_0:
+	.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_1:
+	.byte 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_2:
+	.byte 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_3:
+	.byte 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_4:
+	.byte 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_5:
+	.byte 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_6:
+	.byte 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_7:
+	.byte 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_8:
+	.byte 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_9:
+	.byte 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_10:
+	.byte 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_11:
+	.byte 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+
+.Lle_addd_4_2:
+	.byte 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+	.byte 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lle_addd_12_2:
+	.byte 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+	.byte 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+
+.Lxts_gfmul_clmul:
+	.long 0x00, 0x87, 0x00, 0x00
+	.long 0x00, 0x87, 0x00, 0x00
+.Lxts_high_bit_shuf:
+	.byte -1, -1, -1, -1, 12, 13, 14, 15
+	.byte 4, 5, 6, 7, -1, -1, -1, -1
+	.byte -1, -1, -1, -1, 12, 13, 14, 15
+	.byte 4, 5, 6, 7, -1, -1, -1, -1
+.Lbswap128_mask:
+	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+
+ELF(.size SYM_NAME(_gcry_vaes_consts),.-SYM_NAME(_gcry_vaes_consts))
+
+#endif /* HAVE_GCC_INLINE_ASM_VAES */
+#endif /* __i386__ */
diff --git a/cipher/rijndael-vaes-i386.c b/cipher/rijndael-vaes-i386.c
new file mode 100644
index 00000000..e10d3ac7
--- /dev/null
+++ b/cipher/rijndael-vaes-i386.c
@@ -0,0 +1,231 @@
+/* VAES/AVX2 i386 accelerated AES for Libgcrypt
+ * Copyright (C) 2023 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#include <config.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "types.h"  /* for byte and u32 typedefs */
+#include "g10lib.h"
+#include "cipher.h"
+#include "bufhelp.h"
+#include "rijndael-internal.h"
+#include "./cipher-internal.h"
+
+
+#ifdef USE_VAES_I386
+
+
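+/* Decryption round keys are prepared with the existing AES-NI helper; the
+ * VAES/AVX2 routines only consume ready-made key schedules. */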
+extern void _gcry_aes_aesni_prepare_decryption(RIJNDAEL_context *ctx);
+
+
+extern void _gcry_vaes_avx2_cbc_dec_i386 (const void *keysched,
+					  unsigned char *iv,
+					  void *outbuf_arg,
+					  const void *inbuf_arg,
+					  size_t nblocks,
+					  unsigned int nrounds);
+
+extern void _gcry_vaes_avx2_cfb_dec_i386 (const void *keysched,
+					  unsigned char *iv,
+					  void *outbuf_arg,
+					  const void *inbuf_arg,
+					  size_t nblocks,
+					  unsigned int nrounds);
+
+extern void _gcry_vaes_avx2_ctr_enc_i386 (const void *keysched,
+					  unsigned char *ctr,
+					  void *outbuf_arg,
+					  const void *inbuf_arg,
+					  size_t nblocks,
+					  unsigned int nrounds);
+
+extern void _gcry_vaes_avx2_ctr32le_enc_i386 (const void *keysched,
+					      unsigned char *ctr,
+					      void *outbuf_arg,
+					      const void *inbuf_arg,
+					      size_t nblocks,
+					      unsigned int nrounds);
+
+extern size_t _gcry_vaes_avx2_ocb_crypt_i386 (const void *keysched,
+					      void *outbuf_arg,
+					      const void *inbuf_arg,
+					      size_t nblocks,
+					      unsigned int nrounds,
+					      unsigned char *offset,
+					      unsigned char *checksum,
+					      unsigned int blkn,
+					      const void *L_table,
+					      int enc_dec_or_auth);
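+/* The last argument selects the operation: 0 = decrypt, 1 = encrypt,
+ * 2 = authenticate (OCB AAD processing). */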
+
+extern void _gcry_vaes_avx2_xts_crypt_i386 (const void *keysched,
+					    unsigned char *tweak,
+					    void *outbuf_arg,
+					    const void *inbuf_arg,
+					    size_t nblocks,
+					    unsigned int nrounds,
+					    int encrypt);
+
+extern void _gcry_vaes_avx2_ecb_crypt_i386 (const void *keysched,
+					    int encrypt,
+					    void *outbuf_arg,
+					    const void *inbuf_arg,
+					    size_t nblocks,
+					    unsigned int nrounds);
+
+
+void
+_gcry_aes_vaes_ecb_crypt (void *context, void *outbuf,
+			  const void *inbuf, size_t nblocks,
+			  int encrypt)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = encrypt ? ctx->keyschenc32 : ctx->keyschdec32;
+  unsigned int nrounds = ctx->rounds;
+
+  if (!encrypt && !ctx->decryption_prepared)
+    {
+      _gcry_aes_aesni_prepare_decryption (ctx);
+      ctx->decryption_prepared = 1;
+    }
+
+  _gcry_vaes_avx2_ecb_crypt_i386 (keysched, encrypt, outbuf, inbuf,
+				   nblocks, nrounds);
+}
+
+void
+_gcry_aes_vaes_cbc_dec (void *context, unsigned char *iv,
+			void *outbuf, const void *inbuf,
+			size_t nblocks)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = ctx->keyschdec32;
+  unsigned int nrounds = ctx->rounds;
+
+  if (!ctx->decryption_prepared)
+    {
+      _gcry_aes_aesni_prepare_decryption (ctx);
+      ctx->decryption_prepared = 1;
+    }
+
+  _gcry_vaes_avx2_cbc_dec_i386 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
+}
+
+void
+_gcry_aes_vaes_cfb_dec (void *context, unsigned char *iv,
+			void *outbuf, const void *inbuf,
+			size_t nblocks)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = ctx->keyschenc32;
+  unsigned int nrounds = ctx->rounds;
+
+  _gcry_vaes_avx2_cfb_dec_i386 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
+}
+
+void
+_gcry_aes_vaes_ctr_enc (void *context, unsigned char *iv,
+			void *outbuf, const void *inbuf,
+			size_t nblocks)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = ctx->keyschenc32;
+  unsigned int nrounds = ctx->rounds;
+
+  _gcry_vaes_avx2_ctr_enc_i386 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
+}
+
+void
+_gcry_aes_vaes_ctr32le_enc (void *context, unsigned char *iv,
+			    void *outbuf, const void *inbuf,
+			    size_t nblocks)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = ctx->keyschenc32;
+  unsigned int nrounds = ctx->rounds;
+
+  _gcry_vaes_avx2_ctr32le_enc_i386 (keysched, iv, outbuf, inbuf, nblocks,
+				     nrounds);
+}
+
+size_t
+_gcry_aes_vaes_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
+			  const void *inbuf_arg, size_t nblocks,
+			  int encrypt)
+{
+  RIJNDAEL_context *ctx = (void *)&c->context.c;
+  const void *keysched = encrypt ? ctx->keyschenc32 : ctx->keyschdec32;
+  unsigned char *outbuf = outbuf_arg;
+  const unsigned char *inbuf = inbuf_arg;
+  u64 blkn;
+
+  if (!encrypt && !ctx->decryption_prepared)
+    {
+      _gcry_aes_aesni_prepare_decryption (ctx);
+      ctx->decryption_prepared = 1;
+    }
+
+  blkn = c->u_mode.ocb.data_nblocks;
+  c->u_mode.ocb.data_nblocks = blkn + nblocks;
+
+  return _gcry_vaes_avx2_ocb_crypt_i386 (keysched, outbuf, inbuf, nblocks,
+					 ctx->rounds, c->u_iv.iv, c->u_ctr.ctr,
+					 (unsigned int)blkn,
+					 &c->u_mode.ocb.L[0], encrypt);
+}
+
+size_t
+_gcry_aes_vaes_ocb_auth (gcry_cipher_hd_t c, const void *inbuf_arg,
+			 size_t nblocks)
+{
+  RIJNDAEL_context *ctx = (void *)&c->context.c;
+  const void *keysched = ctx->keyschenc32;
+  const unsigned char *inbuf = inbuf_arg;
+  u64 blkn = c->u_mode.ocb.aad_nblocks;
+
+  c->u_mode.ocb.aad_nblocks = blkn + nblocks;
+
+  return _gcry_vaes_avx2_ocb_crypt_i386 (keysched, NULL, inbuf, nblocks,
+					 ctx->rounds, c->u_mode.ocb.aad_offset,
+					 c->u_mode.ocb.aad_sum,
+					 (unsigned int)blkn,
+					 &c->u_mode.ocb.L[0], 2);
+}
+
+void
+_gcry_aes_vaes_xts_crypt (void *context, unsigned char *tweak,
+			  void *outbuf, const void *inbuf,
+			  size_t nblocks, int encrypt)
+{
+  RIJNDAEL_context *ctx = context;
+  const void *keysched = encrypt ? ctx->keyschenc32 : ctx->keyschdec32;
+  unsigned int nrounds = ctx->rounds;
+
+  if (!encrypt && !ctx->decryption_prepared)
+    {
+      _gcry_aes_aesni_prepare_decryption (ctx);
+      ctx->decryption_prepared = 1;
+    }
+
+  _gcry_vaes_avx2_xts_crypt_i386 (keysched, tweak, outbuf, inbuf, nblocks,
+				   nrounds, encrypt);
+}
+
+#endif /* USE_VAES_I386 */
diff --git a/cipher/rijndael-vaes.c b/cipher/rijndael-vaes.c
index ce9e18e7..478904d0 100644
--- a/cipher/rijndael-vaes.c
+++ b/cipher/rijndael-vaes.c
@@ -1,4 +1,4 @@
-/* VAES/AVX2 accelerated AES for Libgcrypt
+/* VAES/AVX2 AMD64 accelerated AES for Libgcrypt
  * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
  *
  * This file is part of Libgcrypt.
@@ -81,7 +81,7 @@ extern size_t _gcry_vaes_avx2_ocb_crypt_amd64 (const void *keysched,
 					       unsigned char *offset,
 					       unsigned char *checksum,
 					       unsigned char *L_table,
-					       int encrypt) ASM_FUNC_ABI;
+					       int enc_dec_auth) ASM_FUNC_ABI;
 
 extern void _gcry_vaes_avx2_xts_crypt_amd64 (const void *keysched,
 					     unsigned char *tweak,
diff --git a/cipher/rijndael.c b/cipher/rijndael.c
index 56acb199..f1683007 100644
--- a/cipher/rijndael.c
+++ b/cipher/rijndael.c
@@ -107,8 +107,8 @@ extern void _gcry_aes_aesni_ecb_crypt (void *context, void *outbuf_arg,
 				       int encrypt);
 #endif
 
-#ifdef USE_VAES
-/* VAES (AMD64) accelerated implementation of AES */
+#if defined(USE_VAES_I386) || defined(USE_VAES)
+/* VAES (i386/AMD64) accelerated implementation of AES */
 
 extern void _gcry_aes_vaes_cfb_dec (void *context, unsigned char *iv,
 				    void *outbuf_arg, const void *inbuf_arg,
@@ -569,6 +569,21 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen,
 	  bulk_ops->xts_crypt = _gcry_aes_vaes_xts_crypt;
 	  bulk_ops->ecb_crypt = _gcry_aes_vaes_ecb_crypt;
 	}
+#endif
+#ifdef USE_VAES_I386
+      if ((hwfeatures & HWF_INTEL_VAES_VPCLMUL) &&
+	  (hwfeatures & HWF_INTEL_AVX2))
+	{
+	  /* Setup VAES bulk encryption routines.  */
+	  bulk_ops->cfb_dec = _gcry_aes_vaes_cfb_dec;
+	  bulk_ops->cbc_dec = _gcry_aes_vaes_cbc_dec;
+	  bulk_ops->ctr_enc = _gcry_aes_vaes_ctr_enc;
+	  bulk_ops->ctr32le_enc = _gcry_aes_vaes_ctr32le_enc;
+	  bulk_ops->ocb_crypt = _gcry_aes_vaes_ocb_crypt;
+	  bulk_ops->ocb_auth = _gcry_aes_vaes_ocb_auth;
+	  bulk_ops->xts_crypt = _gcry_aes_vaes_xts_crypt;
+	  bulk_ops->ecb_crypt = _gcry_aes_vaes_ecb_crypt;
+	}
 #endif
     }
 #endif
diff --git a/configure.ac b/configure.ac
index 8ddba0e8..72596f30 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,17 +1781,17 @@ fi
 
 
 #
-# Check whether GCC assembler supports features needed for our amd64
+# Check whether GCC assembler supports features needed for our i386/amd64
 # implementations
 #
 if test $amd64_as_feature_detection = yes; then
-  AC_CACHE_CHECK([whether GCC assembler is compatible for amd64 assembly implementations],
-       [gcry_cv_gcc_amd64_platform_as_ok],
+  AC_CACHE_CHECK([whether GCC assembler is compatible for i386/amd64 assembly implementations],
+       [gcry_cv_gcc_x86_platform_as_ok],
        [if test "$mpi_cpu_arch" != "x86" ||
            test "$try_asm_modules" != "yes" ; then
-          gcry_cv_gcc_amd64_platform_as_ok="n/a"
+          gcry_cv_gcc_x86_platform_as_ok="n/a"
         else
-          gcry_cv_gcc_amd64_platform_as_ok=no
+          gcry_cv_gcc_x86_platform_as_ok=no
           AC_LINK_IFELSE([AC_LANG_PROGRAM(
           [[__asm__(
                 /* Test if '.type' and '.size' are supported.  */
@@ -1807,13 +1807,19 @@ if test $amd64_as_feature_detection = yes; then
                  "xorl \$(123456789/12345678), %ebp;\n\t"
             );
             void asmfunc(void);]], [ asmfunc(); ])],
-          [gcry_cv_gcc_amd64_platform_as_ok=yes])
+          [gcry_cv_gcc_x86_platform_as_ok=yes])
         fi])
-  if test "$gcry_cv_gcc_amd64_platform_as_ok" = "yes" ; then
-     AC_DEFINE(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS,1,
+  if test "$gcry_cv_gcc_x86_platform_as_ok" = "yes" &&
+     test "$ac_cv_sizeof_unsigned_long" = "8"; then
+    AC_DEFINE(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS,1,
               [Defined if underlying assembler is compatible with amd64 assembly implementations])
   fi
-  if test "$gcry_cv_gcc_amd64_platform_as_ok" = "no" &&
+  if test "$gcry_cv_gcc_x86_platform_as_ok" = "yes" &&
+     test "$ac_cv_sizeof_unsigned_long" = "4"; then
+    AC_DEFINE(HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS,1,
+              [Defined if underlying assembler is compatible with i386 assembly implementations])
+  fi
+  if test "$gcry_cv_gcc_x86_platform_as_ok" = "no" &&
      test "$gcry_cv_gcc_attribute_sysv_abi" = "yes" &&
      test "$gcry_cv_gcc_default_abi_is_ms_abi" = "yes"; then
     AC_CACHE_CHECK([whether GCC assembler is compatible for WIN64 assembly implementations],
@@ -1833,6 +1839,25 @@ if test $amd64_as_feature_detection = yes; then
                 [Defined if underlying assembler is compatible with WIN64 assembly implementations])
     fi
   fi
+  if test "$gcry_cv_gcc_x86_platform_as_ok" = "no" &&
+     test "$ac_cv_sizeof_unsigned_long" = "4"; then
+    AC_CACHE_CHECK([whether GCC assembler is compatible for WIN32 assembly implementations],
+      [gcry_cv_gcc_win32_platform_as_ok],
+      [gcry_cv_gcc_win32_platform_as_ok=no
+      AC_LINK_IFELSE([AC_LANG_PROGRAM(
+        [[__asm__(
+              ".text\n\t"
+              ".globl _asmfunc\n\t"
+              "_asmfunc:\n\t"
+              "xorl \$(1234), %ebp;\n\t"
+          );
+          void asmfunc(void);]], [ asmfunc(); ])],
+        [gcry_cv_gcc_win32_platform_as_ok=yes])])
+    if test "$gcry_cv_gcc_win32_platform_as_ok" = "yes" ; then
+      AC_DEFINE(HAVE_COMPATIBLE_GCC_WIN32_PLATFORM_AS,1,
+                [Defined if underlying assembler is compatible with WIN32 assembly implementations])
+    fi
+  fi
 fi
 
 
@@ -3028,6 +3053,10 @@ if test "$found" = "1" ; then
 
          # Build with the Padlock implementation
          GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-padlock.lo"
+
+         # Build with the VAES/AVX2 implementation
+         GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-i386.lo"
+         GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-avx2-i386.lo"
       ;;
    esac
 fi
-- 
2.39.2



