[git] GCRYPT - branch, master, updated. libgcrypt-1.6.0-80-gbf49439

by Jussi Kivilinna cvs at cvs.gnupg.org
Sun May 18 11:57:58 CEST 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".

The branch, master has been updated
       via  bf4943932dae95a0573b63bf32a9b9acd5a6ddf3 (commit)
       via  323b1eb80ff3396d83fedbe5bba9a4e6c412d192 (commit)
       via  98f021961ee65669037bc8bb552a69fd78f610fc (commit)
       via  297532602ed2d881d8fdc393d1961068a143a891 (commit)
      from  e813958419b0ec4439e6caf07d3b2234cffa2bfa (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit bf4943932dae95a0573b63bf32a9b9acd5a6ddf3
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Sat May 17 18:30:39 2014 +0300

    Add Poly1305 to documentation
    
    * doc/gcrypt.texi: Add documentation for Poly1305 MACs and AEAD mode.
    --
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
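
The compatibility rules that this documentation change spells out can be read as a small predicate. The following is only a sketch of those rules as worded in the manual text: the enum, the struct and `sketch_mode_allowed` are hypothetical names made up for illustration and are not part of the libgcrypt API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not libgcrypt types. */
enum sketch_mode {
  SKETCH_MODE_ECB, SKETCH_MODE_CBC, SKETCH_MODE_CFB, SKETCH_MODE_OFB,
  SKETCH_MODE_CTR, SKETCH_MODE_STREAM, SKETCH_MODE_CCM, SKETCH_MODE_GCM,
  SKETCH_MODE_POLY1305
};

struct sketch_cipher {
  const char *name;
  size_t blocksize;        /* block size in bytes; unused for stream ciphers */
  bool is_stream;          /* true for stream ciphers */
  bool is_chacha_or_salsa; /* true for the ChaCha/Salsa family */
};

/* Return true if the documented rules allow MODE with cipher C. */
static bool sketch_mode_allowed(const struct sketch_cipher *c,
                                enum sketch_mode mode)
{
  assert(c != NULL);
  switch (mode) {
  case SKETCH_MODE_STREAM:
    return c->is_stream;                   /* stream mode: stream ciphers only */
  case SKETCH_MODE_POLY1305:
    return c->is_stream && c->is_chacha_or_salsa;
  case SKETCH_MODE_CCM:
  case SKETCH_MODE_GCM:
    return !c->is_stream && c->blocksize == 16;
  default:
    return !c->is_stream;                  /* ECB, CBC, CFB, OFB, CTR */
  }
}
```

With these assumptions, a 16-byte block cipher is accepted for GCM and rejected for Poly1305 mode, while a ChaCha20-like stream cipher is accepted for both stream mode and Poly1305 mode.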

diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi
index d202b8b..d59c095 100644
--- a/doc/gcrypt.texi
+++ b/doc/gcrypt.texi
@@ -1629,6 +1629,11 @@ Galois/Counter Mode (GCM) is an Authenticated Encryption with
 Associated Data (AEAD) block cipher mode, which is specified in
 'NIST Special Publication 800-38D'.
 
+@item  GCRY_CIPHER_MODE_POLY1305
+@cindex Poly1305 based AEAD mode
+Poly1305 is an Authenticated Encryption with Associated Data (AEAD)
+mode, which can be used with ChaCha20 and Salsa20 stream ciphers.
+
 @end table
 
 @node Working with cipher handles
@@ -1655,12 +1660,13 @@ The cipher mode to use must be specified via @var{mode}.  See
 @xref{Available cipher modes}, for a list of supported cipher modes
 and the according constants.  Note that some modes are incompatible
 with some algorithms - in particular, stream mode
-(@code{GCRY_CIPHER_MODE_STREAM}) only works with stream ciphers. The
-block cipher modes (@code{GCRY_CIPHER_MODE_ECB},
+(@code{GCRY_CIPHER_MODE_STREAM}) only works with stream ciphers.
+Poly1305 AEAD mode (@code{GCRY_CIPHER_MODE_POLY1305}) only works with
+ChaCha and Salsa stream ciphers. The block cipher modes (@code{GCRY_CIPHER_MODE_ECB},
 @code{GCRY_CIPHER_MODE_CBC}, @code{GCRY_CIPHER_MODE_CFB},
 @code{GCRY_CIPHER_MODE_OFB} and @code{GCRY_CIPHER_MODE_CTR}) will work
-with any block cipher algorithm. @code{GCRY_CIPHER_MODE_CCM} and
-@code{GCRY_CIPHER_MODE_GCM} modes will only work with block cipher algorithms
+with any block cipher algorithm. CCM mode (@code{GCRY_CIPHER_MODE_CCM}) and
+GCM mode (@code{GCRY_CIPHER_MODE_GCM}) will only work with block cipher algorithms
 which have the block size of 16 bytes.
 
 The third argument @var{flags} can either be passed as @code{0} or as
@@ -3548,6 +3554,30 @@ block cipher algorithm.
 This is GMAC message authentication algorithm based on the SEED
 block cipher algorithm.
 
+@item GCRY_MAC_POLY1305
+This is plain Poly1305 message authentication algorithm, used with
+one-time key.
+
+@item GCRY_MAC_POLY1305_AES
+This is Poly1305-AES message authentication algorithm, used with
+key and one-time nonce.
+
+@item GCRY_MAC_POLY1305_CAMELLIA
+This is Poly1305-Camellia message authentication algorithm, used with
+key and one-time nonce.
+
+@item GCRY_MAC_POLY1305_TWOFISH
+This is Poly1305-Twofish message authentication algorithm, used with
+key and one-time nonce.
+
+@item GCRY_MAC_POLY1305_SERPENT
+This is Poly1305-Serpent message authentication algorithm, used with
+key and one-time nonce.
+
+@item GCRY_MAC_POLY1305_SEED
+This is Poly1305-SEED message authentication algorithm, used with
+key and one-time nonce.
+
 @end table
 @c end table of MAC algorithms
 
@@ -3593,8 +3623,8 @@ underlying block cipher.
 @end deftypefun
 
 
-GMAC algorithms need initialization vector to be set, which can be
-performed with function:
+GMAC algorithms and Poly1305-with-cipher algorithms need an initialization vector to be set,
+which can be done with the function:
 
 @deftypefun gcry_error_t gcry_mac_setiv (gcry_mac_hd_t @var{h}, const void *@var{iv}, size_t @var{ivlen})
 

commit 323b1eb80ff3396d83fedbe5bba9a4e6c412d192
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Fri May 16 21:28:26 2014 +0300

    chacha20: add SSE2/AMD64 optimized implementation
    
    * cipher/Makefile.am: Add 'chacha20-sse2-amd64.S'.
    * cipher/chacha20-sse2-amd64.S: New.
    * cipher/chacha20.c (USE_SSE2): New.
    [USE_SSE2] (_gcry_chacha20_amd64_sse2_blocks): New.
    (chacha20_do_setkey) [USE_SSE2]: Use SSE2 implementation for blocks
    function.
    * configure.ac [host=x86-64]: Add 'chacha20-sse2-amd64.lo'.
    --
    
    Add Andrew Moon's public domain SSE2 implementation of ChaCha20. Original
    source is available at: https://github.com/floodyberry/chacha-opt
    
    Benchmark on Intel i5-4570 (haswell),
    with "--disable-hwf intel-avx2 --disable-hwf intel-ssse3":
    
    Old:
     CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |      1.97 ns/B     483.8 MiB/s      6.31 c/B
         STREAM dec |      1.97 ns/B     483.6 MiB/s      6.31 c/B
    
    New:
     CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |     0.931 ns/B    1024.7 MiB/s      2.98 c/B
         STREAM dec |     0.930 ns/B    1025.0 MiB/s      2.98 c/B
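
The three benchmark columns are related by simple arithmetic, which the sketch below reproduces. The helper names are hypothetical. Because the inputs are the rounded ns/B figures printed above, recomputed throughput differs slightly from the MiB/s column, and the clock rate is only what the table implies, not a measured value.

```c
#include <assert.h>

/* Throughput in MiB/s from a cost in nanoseconds per byte. */
static double sketch_mib_per_sec(double ns_per_byte)
{
  assert(ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speedup factor of the new implementation over the old one. */
static double sketch_speedup(double old_ns_per_byte, double new_ns_per_byte)
{
  assert(new_ns_per_byte > 0.0);
  return old_ns_per_byte / new_ns_per_byte;
}

/* Clock rate in GHz implied by a cycles/byte and ns/byte pair. */
static double sketch_implied_ghz(double cycles_per_byte, double ns_per_byte)
{
  assert(ns_per_byte > 0.0);
  return cycles_per_byte / ns_per_byte;
}
```

For the figures above, `sketch_speedup(1.97, 0.931)` comes out at roughly 2.1, and both rows imply a clock of about 3.2 GHz.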
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
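
For readers following the assembly in this commit, the scalar operation it vectorizes is the ChaCha quarter round. The sketch below is plain C, not code from this commit. In the SSE2 code the rotate by 16 appears as a `pshuflw`/`pshufhw $0xb1` pair, and the rotates by 12, 8 and 7 appear as `pslld`/`psrld`/`pxor` triples. The function names are chosen here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, with n in 1..31. */
static uint32_t sketch_rotl32(uint32_t x, unsigned int n)
{
  assert(n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round on four state words, updated in place. */
static void sketch_quarter_round(uint32_t *a, uint32_t *b,
                                 uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = sketch_rotl32(*d, 16);
  *c += *d; *b ^= *c; *b = sketch_rotl32(*b, 12);
  *a += *b; *d ^= *a; *d = sketch_rotl32(*d, 8);
  *c += *d; *b ^= *c; *b = sketch_rotl32(*b, 7);
}
```

A double round applies this to the four columns and then to the four diagonals of the 4x4 state, which is consistent with the loop counter in the assembly starting at 20 and decreasing by 2 per iteration.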

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 19b0097..8a3bd19 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -60,7 +60,7 @@ EXTRA_libcipher_la_SOURCES = \
 arcfour.c arcfour-amd64.S \
 blowfish.c blowfish-amd64.S blowfish-arm.S \
 cast5.c cast5-amd64.S cast5-arm.S \
-chacha20.c chacha20-ssse3-amd64.S chacha20-avx2-amd64.S \
+chacha20.c chacha20-sse2-amd64.S chacha20-ssse3-amd64.S chacha20-avx2-amd64.S \
 crc.c \
 des.c des-amd64.S \
 dsa.c \
diff --git a/cipher/chacha20-sse2-amd64.S b/cipher/chacha20-sse2-amd64.S
new file mode 100644
index 0000000..4811f40
--- /dev/null
+++ b/cipher/chacha20-sse2-amd64.S
@@ -0,0 +1,652 @@
+/* chacha20-sse2-amd64.S  -  AMD64/SSE2 implementation of ChaCha20
+ *
+ * Copyright (C) 2014 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Based on public domain implementation by Andrew Moon at
+ *  https://github.com/floodyberry/chacha-opt
+ */
+
+#ifdef __x86_64__
+#include <config.h>
+
+#if defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) && USE_CHACHA20
+
+.text
+
+.align 8
+.globl _gcry_chacha20_amd64_sse2_blocks
+.type  _gcry_chacha20_amd64_sse2_blocks,@function;
+_gcry_chacha20_amd64_sse2_blocks:
+.Lchacha_blocks_sse2_local:
+	pushq %rbx
+	pushq %rbp
+	movq %rsp, %rbp
+	andq $~63, %rsp
+	subq $512, %rsp
+	movdqu (%rdi), %xmm8
+	movdqu 16(%rdi), %xmm9
+	movdqu 32(%rdi), %xmm10
+	movdqu 48(%rdi), %xmm11
+	movq $20, %rax
+	movq $1, %r9
+	movdqa %xmm8, 0(%rsp)
+	movdqa %xmm9, 16(%rsp)
+	movdqa %xmm10, 32(%rsp)
+	movdqa %xmm11, 48(%rsp)
+	movq %rax, 64(%rsp)
+	cmpq $256, %rcx
+	jb .Lchacha_blocks_sse2_below256
+	pshufd $0x00, %xmm8, %xmm0
+	pshufd $0x55, %xmm8, %xmm1
+	pshufd $0xaa, %xmm8, %xmm2
+	pshufd $0xff, %xmm8, %xmm3
+	movdqa %xmm0, 128(%rsp)
+	movdqa %xmm1, 144(%rsp)
+	movdqa %xmm2, 160(%rsp)
+	movdqa %xmm3, 176(%rsp)
+	pshufd $0x00, %xmm9, %xmm0
+	pshufd $0x55, %xmm9, %xmm1
+	pshufd $0xaa, %xmm9, %xmm2
+	pshufd $0xff, %xmm9, %xmm3
+	movdqa %xmm0, 192(%rsp)
+	movdqa %xmm1, 208(%rsp)
+	movdqa %xmm2, 224(%rsp)
+	movdqa %xmm3, 240(%rsp)
+	pshufd $0x00, %xmm10, %xmm0
+	pshufd $0x55, %xmm10, %xmm1
+	pshufd $0xaa, %xmm10, %xmm2
+	pshufd $0xff, %xmm10, %xmm3
+	movdqa %xmm0, 256(%rsp)
+	movdqa %xmm1, 272(%rsp)
+	movdqa %xmm2, 288(%rsp)
+	movdqa %xmm3, 304(%rsp)
+	pshufd $0xaa, %xmm11, %xmm0
+	pshufd $0xff, %xmm11, %xmm1
+	movdqa %xmm0, 352(%rsp)
+	movdqa %xmm1, 368(%rsp)
+	jmp .Lchacha_blocks_sse2_atleast256
+.p2align 6,,63
+.Lchacha_blocks_sse2_atleast256:
+	movq 48(%rsp), %rax
+	leaq 1(%rax), %r8
+	leaq 2(%rax), %r9
+	leaq 3(%rax), %r10
+	leaq 4(%rax), %rbx
+	movl %eax, 320(%rsp)
+	movl %r8d, 4+320(%rsp)
+	movl %r9d, 8+320(%rsp)
+	movl %r10d, 12+320(%rsp)
+	shrq $32, %rax
+	shrq $32, %r8
+	shrq $32, %r9
+	shrq $32, %r10
+	movl %eax, 336(%rsp)
+	movl %r8d, 4+336(%rsp)
+	movl %r9d, 8+336(%rsp)
+	movl %r10d, 12+336(%rsp)
+	movq %rbx, 48(%rsp)
+	movq 64(%rsp), %rax
+	movdqa 128(%rsp), %xmm0
+	movdqa 144(%rsp), %xmm1
+	movdqa 160(%rsp), %xmm2
+	movdqa 176(%rsp), %xmm3
+	movdqa 192(%rsp), %xmm4
+	movdqa 208(%rsp), %xmm5
+	movdqa 224(%rsp), %xmm6
+	movdqa 240(%rsp), %xmm7
+	movdqa 256(%rsp), %xmm8
+	movdqa 272(%rsp), %xmm9
+	movdqa 288(%rsp), %xmm10
+	movdqa 304(%rsp), %xmm11
+	movdqa 320(%rsp), %xmm12
+	movdqa 336(%rsp), %xmm13
+	movdqa 352(%rsp), %xmm14
+	movdqa 368(%rsp), %xmm15
+.Lchacha_blocks_sse2_mainloop1:
+	paddd %xmm4, %xmm0
+	paddd %xmm5, %xmm1
+	pxor %xmm0, %xmm12
+	pxor %xmm1, %xmm13
+	paddd %xmm6, %xmm2
+	paddd %xmm7, %xmm3
+	movdqa %xmm6, 96(%rsp)
+	pxor %xmm2, %xmm14
+	pxor %xmm3, %xmm15
+	pshuflw $0xb1,%xmm12,%xmm12
+	pshufhw $0xb1,%xmm12,%xmm12
+	pshuflw $0xb1,%xmm13,%xmm13
+	pshufhw $0xb1,%xmm13,%xmm13
+	pshuflw $0xb1,%xmm14,%xmm14
+	pshufhw $0xb1,%xmm14,%xmm14
+	pshuflw $0xb1,%xmm15,%xmm15
+	pshufhw $0xb1,%xmm15,%xmm15
+	paddd %xmm12, %xmm8
+	paddd %xmm13, %xmm9
+	paddd %xmm14, %xmm10
+	paddd %xmm15, %xmm11
+	movdqa %xmm12, 112(%rsp)
+	pxor %xmm8, %xmm4
+	pxor %xmm9, %xmm5
+	movdqa 96(%rsp), %xmm6
+	movdqa %xmm4, %xmm12
+	pslld $ 12, %xmm4
+	psrld $20, %xmm12
+	pxor %xmm12, %xmm4
+	movdqa %xmm5, %xmm12
+	pslld $ 12, %xmm5
+	psrld $20, %xmm12
+	pxor %xmm12, %xmm5
+	pxor %xmm10, %xmm6
+	pxor %xmm11, %xmm7
+	movdqa %xmm6, %xmm12
+	pslld $ 12, %xmm6
+	psrld $20, %xmm12
+	pxor %xmm12, %xmm6
+	movdqa %xmm7, %xmm12
+	pslld $ 12, %xmm7
+	psrld $20, %xmm12
+	pxor %xmm12, %xmm7
+	movdqa 112(%rsp), %xmm12
+	paddd %xmm4, %xmm0
+	paddd %xmm5, %xmm1
+	pxor %xmm0, %xmm12
+	pxor %xmm1, %xmm13
+	paddd %xmm6, %xmm2
+	paddd %xmm7, %xmm3
+	movdqa %xmm6, 96(%rsp)
+	pxor %xmm2, %xmm14
+	pxor %xmm3, %xmm15
+	movdqa %xmm12, %xmm6
+	pslld $ 8, %xmm12
+	psrld $24, %xmm6
+	pxor %xmm6, %xmm12
+	movdqa %xmm13, %xmm6
+	pslld $ 8, %xmm13
+	psrld $24, %xmm6
+	pxor %xmm6, %xmm13
+	paddd %xmm12, %xmm8
+	paddd %xmm13, %xmm9
+	movdqa %xmm14, %xmm6
+	pslld $ 8, %xmm14
+	psrld $24, %xmm6
+	pxor %xmm6, %xmm14
+	movdqa %xmm15, %xmm6
+	pslld $ 8, %xmm15
+	psrld $24, %xmm6
+	pxor %xmm6, %xmm15
+	paddd %xmm14, %xmm10
+	paddd %xmm15, %xmm11
+	movdqa %xmm12, 112(%rsp)
+	pxor %xmm8, %xmm4
+	pxor %xmm9, %xmm5
+	movdqa 96(%rsp), %xmm6
+	movdqa %xmm4, %xmm12
+	pslld $ 7, %xmm4
+	psrld $25, %xmm12
+	pxor %xmm12, %xmm4
+	movdqa %xmm5, %xmm12
+	pslld $ 7, %xmm5
+	psrld $25, %xmm12
+	pxor %xmm12, %xmm5
+	pxor %xmm10, %xmm6
+	pxor %xmm11, %xmm7
+	movdqa %xmm6, %xmm12
+	pslld $ 7, %xmm6
+	psrld $25, %xmm12
+	pxor %xmm12, %xmm6
+	movdqa %xmm7, %xmm12
+	pslld $ 7, %xmm7
+	psrld $25, %xmm12
+	pxor %xmm12, %xmm7
+	movdqa 112(%rsp), %xmm12
+	paddd %xmm5, %xmm0
+	paddd %xmm6, %xmm1
+	pxor %xmm0, %xmm15
+	pxor %xmm1, %xmm12
+	paddd %xmm7, %xmm2
+	paddd %xmm4, %xmm3
+	movdqa %xmm7, 96(%rsp)
+	pxor %xmm2, %xmm13
+	pxor %xmm3, %xmm14
+	pshuflw $0xb1,%xmm15,%xmm15
+	pshufhw $0xb1,%xmm15,%xmm15
+	pshuflw $0xb1,%xmm12,%xmm12
+	pshufhw $0xb1,%xmm12,%xmm12
+	pshuflw $0xb1,%xmm13,%xmm13
+	pshufhw $0xb1,%xmm13,%xmm13
+	pshuflw $0xb1,%xmm14,%xmm14
+	pshufhw $0xb1,%xmm14,%xmm14
+	paddd %xmm15, %xmm10
+	paddd %xmm12, %xmm11
+	paddd %xmm13, %xmm8
+	paddd %xmm14, %xmm9
+	movdqa %xmm15, 112(%rsp)
+	pxor %xmm10, %xmm5
+	pxor %xmm11, %xmm6
+	movdqa 96(%rsp), %xmm7
+	movdqa %xmm5, %xmm15
+	pslld $ 12, %xmm5
+	psrld $20, %xmm15
+	pxor %xmm15, %xmm5
+	movdqa %xmm6, %xmm15
+	pslld $ 12, %xmm6
+	psrld $20, %xmm15
+	pxor %xmm15, %xmm6
+	pxor %xmm8, %xmm7
+	pxor %xmm9, %xmm4
+	movdqa %xmm7, %xmm15
+	pslld $ 12, %xmm7
+	psrld $20, %xmm15
+	pxor %xmm15, %xmm7
+	movdqa %xmm4, %xmm15
+	pslld $ 12, %xmm4
+	psrld $20, %xmm15
+	pxor %xmm15, %xmm4
+	movdqa 112(%rsp), %xmm15
+	paddd %xmm5, %xmm0
+	paddd %xmm6, %xmm1
+	pxor %xmm0, %xmm15
+	pxor %xmm1, %xmm12
+	paddd %xmm7, %xmm2
+	paddd %xmm4, %xmm3
+	movdqa %xmm7, 96(%rsp)
+	pxor %xmm2, %xmm13
+	pxor %xmm3, %xmm14
+	movdqa %xmm15, %xmm7
+	pslld $ 8, %xmm15
+	psrld $24, %xmm7
+	pxor %xmm7, %xmm15
+	movdqa %xmm12, %xmm7
+	pslld $ 8, %xmm12
+	psrld $24, %xmm7
+	pxor %xmm7, %xmm12
+	paddd %xmm15, %xmm10
+	paddd %xmm12, %xmm11
+	movdqa %xmm13, %xmm7
+	pslld $ 8, %xmm13
+	psrld $24, %xmm7
+	pxor %xmm7, %xmm13
+	movdqa %xmm14, %xmm7
+	pslld $ 8, %xmm14
+	psrld $24, %xmm7
+	pxor %xmm7, %xmm14
+	paddd %xmm13, %xmm8
+	paddd %xmm14, %xmm9
+	movdqa %xmm15, 112(%rsp)
+	pxor %xmm10, %xmm5
+	pxor %xmm11, %xmm6
+	movdqa 96(%rsp), %xmm7
+	movdqa %xmm5, %xmm15
+	pslld $ 7, %xmm5
+	psrld $25, %xmm15
+	pxor %xmm15, %xmm5
+	movdqa %xmm6, %xmm15
+	pslld $ 7, %xmm6
+	psrld $25, %xmm15
+	pxor %xmm15, %xmm6
+	pxor %xmm8, %xmm7
+	pxor %xmm9, %xmm4
+	movdqa %xmm7, %xmm15
+	pslld $ 7, %xmm7
+	psrld $25, %xmm15
+	pxor %xmm15, %xmm7
+	movdqa %xmm4, %xmm15
+	pslld $ 7, %xmm4
+	psrld $25, %xmm15
+	pxor %xmm15, %xmm4
+	movdqa 112(%rsp), %xmm15
+	subq $2, %rax
+	jnz .Lchacha_blocks_sse2_mainloop1
+	paddd 128(%rsp), %xmm0
+	paddd 144(%rsp), %xmm1
+	paddd 160(%rsp), %xmm2
+	paddd 176(%rsp), %xmm3
+	paddd 192(%rsp), %xmm4
+	paddd 208(%rsp), %xmm5
+	paddd 224(%rsp), %xmm6
+	paddd 240(%rsp), %xmm7
+	paddd 256(%rsp), %xmm8
+	paddd 272(%rsp), %xmm9
+	paddd 288(%rsp), %xmm10
+	paddd 304(%rsp), %xmm11
+	paddd 320(%rsp), %xmm12
+	paddd 336(%rsp), %xmm13
+	paddd 352(%rsp), %xmm14
+	paddd 368(%rsp), %xmm15
+	movdqa %xmm8, 384(%rsp)
+	movdqa %xmm9, 400(%rsp)
+	movdqa %xmm10, 416(%rsp)
+	movdqa %xmm11, 432(%rsp)
+	movdqa %xmm12, 448(%rsp)
+	movdqa %xmm13, 464(%rsp)
+	movdqa %xmm14, 480(%rsp)
+	movdqa %xmm15, 496(%rsp)
+	movdqa %xmm0, %xmm8
+	movdqa %xmm2, %xmm9
+	movdqa %xmm4, %xmm10
+	movdqa %xmm6, %xmm11
+	punpckhdq %xmm1, %xmm0
+	punpckhdq %xmm3, %xmm2
+	punpckhdq %xmm5, %xmm4
+	punpckhdq %xmm7, %xmm6
+	punpckldq %xmm1, %xmm8
+	punpckldq %xmm3, %xmm9
+	punpckldq %xmm5, %xmm10
+	punpckldq %xmm7, %xmm11
+	movdqa %xmm0, %xmm1
+	movdqa %xmm4, %xmm3
+	movdqa %xmm8, %xmm5
+	movdqa %xmm10, %xmm7
+	punpckhqdq %xmm2, %xmm0
+	punpckhqdq %xmm6, %xmm4
+	punpckhqdq %xmm9, %xmm8
+	punpckhqdq %xmm11, %xmm10
+	punpcklqdq %xmm2, %xmm1
+	punpcklqdq %xmm6, %xmm3
+	punpcklqdq %xmm9, %xmm5
+	punpcklqdq %xmm11, %xmm7
+	andq %rsi, %rsi
+	jz .Lchacha_blocks_sse2_noinput1
+	movdqu 0(%rsi), %xmm2
+	movdqu 16(%rsi), %xmm6
+	movdqu 64(%rsi), %xmm9
+	movdqu 80(%rsi), %xmm11
+	movdqu 128(%rsi), %xmm12
+	movdqu 144(%rsi), %xmm13
+	movdqu 192(%rsi), %xmm14
+	movdqu 208(%rsi), %xmm15
+	pxor %xmm2, %xmm5
+	pxor %xmm6, %xmm7
+	pxor %xmm9, %xmm8
+	pxor %xmm11, %xmm10
+	pxor %xmm12, %xmm1
+	pxor %xmm13, %xmm3
+	pxor %xmm14, %xmm0
+	pxor %xmm15, %xmm4
+	movdqu %xmm5, 0(%rdx)
+	movdqu %xmm7, 16(%rdx)
+	movdqu %xmm8, 64(%rdx)
+	movdqu %xmm10, 80(%rdx)
+	movdqu %xmm1, 128(%rdx)
+	movdqu %xmm3, 144(%rdx)
+	movdqu %xmm0, 192(%rdx)
+	movdqu %xmm4, 208(%rdx)
+	movdqa 384(%rsp), %xmm0
+	movdqa 400(%rsp), %xmm1
+	movdqa 416(%rsp), %xmm2
+	movdqa 432(%rsp), %xmm3
+	movdqa 448(%rsp), %xmm4
+	movdqa 464(%rsp), %xmm5
+	movdqa 480(%rsp), %xmm6
+	movdqa 496(%rsp), %xmm7
+	movdqa %xmm0, %xmm8
+	movdqa %xmm2, %xmm9
+	movdqa %xmm4, %xmm10
+	movdqa %xmm6, %xmm11
+	punpckldq %xmm1, %xmm8
+	punpckldq %xmm3, %xmm9
+	punpckhdq %xmm1, %xmm0
+	punpckhdq %xmm3, %xmm2
+	punpckldq %xmm5, %xmm10
+	punpckldq %xmm7, %xmm11
+	punpckhdq %xmm5, %xmm4
+	punpckhdq %xmm7, %xmm6
+	movdqa %xmm8, %xmm1
+	movdqa %xmm0, %xmm3
+	movdqa %xmm10, %xmm5
+	movdqa %xmm4, %xmm7
+	punpcklqdq %xmm9, %xmm1
+	punpcklqdq %xmm11, %xmm5
+	punpckhqdq %xmm9, %xmm8
+	punpckhqdq %xmm11, %xmm10
+	punpcklqdq %xmm2, %xmm3
+	punpcklqdq %xmm6, %xmm7
+	punpckhqdq %xmm2, %xmm0
+	punpckhqdq %xmm6, %xmm4
+	movdqu 32(%rsi), %xmm2
+	movdqu 48(%rsi), %xmm6
+	movdqu 96(%rsi), %xmm9
+	movdqu 112(%rsi), %xmm11
+	movdqu 160(%rsi), %xmm12
+	movdqu 176(%rsi), %xmm13
+	movdqu 224(%rsi), %xmm14
+	movdqu 240(%rsi), %xmm15
+	pxor %xmm2, %xmm1
+	pxor %xmm6, %xmm5
+	pxor %xmm9, %xmm8
+	pxor %xmm11, %xmm10
+	pxor %xmm12, %xmm3
+	pxor %xmm13, %xmm7
+	pxor %xmm14, %xmm0
+	pxor %xmm15, %xmm4
+	movdqu %xmm1, 32(%rdx)
+	movdqu %xmm5, 48(%rdx)
+	movdqu %xmm8, 96(%rdx)
+	movdqu %xmm10, 112(%rdx)
+	movdqu %xmm3, 160(%rdx)
+	movdqu %xmm7, 176(%rdx)
+	movdqu %xmm0, 224(%rdx)
+	movdqu %xmm4, 240(%rdx)
+	addq $256, %rsi
+	jmp .Lchacha_blocks_sse2_mainloop_cont
+.Lchacha_blocks_sse2_noinput1:
+	movdqu %xmm5, 0(%rdx)
+	movdqu %xmm7, 16(%rdx)
+	movdqu %xmm8, 64(%rdx)
+	movdqu %xmm10, 80(%rdx)
+	movdqu %xmm1, 128(%rdx)
+	movdqu %xmm3, 144(%rdx)
+	movdqu %xmm0, 192(%rdx)
+	movdqu %xmm4, 208(%rdx)
+	movdqa 384(%rsp), %xmm0
+	movdqa 400(%rsp), %xmm1
+	movdqa 416(%rsp), %xmm2
+	movdqa 432(%rsp), %xmm3
+	movdqa 448(%rsp), %xmm4
+	movdqa 464(%rsp), %xmm5
+	movdqa 480(%rsp), %xmm6
+	movdqa 496(%rsp), %xmm7
+	movdqa %xmm0, %xmm8
+	movdqa %xmm2, %xmm9
+	movdqa %xmm4, %xmm10
+	movdqa %xmm6, %xmm11
+	punpckldq %xmm1, %xmm8
+	punpckldq %xmm3, %xmm9
+	punpckhdq %xmm1, %xmm0
+	punpckhdq %xmm3, %xmm2
+	punpckldq %xmm5, %xmm10
+	punpckldq %xmm7, %xmm11
+	punpckhdq %xmm5, %xmm4
+	punpckhdq %xmm7, %xmm6
+	movdqa %xmm8, %xmm1
+	movdqa %xmm0, %xmm3
+	movdqa %xmm10, %xmm5
+	movdqa %xmm4, %xmm7
+	punpcklqdq %xmm9, %xmm1
+	punpcklqdq %xmm11, %xmm5
+	punpckhqdq %xmm9, %xmm8
+	punpckhqdq %xmm11, %xmm10
+	punpcklqdq %xmm2, %xmm3
+	punpcklqdq %xmm6, %xmm7
+	punpckhqdq %xmm2, %xmm0
+	punpckhqdq %xmm6, %xmm4
+	movdqu %xmm1, 32(%rdx)
+	movdqu %xmm5, 48(%rdx)
+	movdqu %xmm8, 96(%rdx)
+	movdqu %xmm10, 112(%rdx)
+	movdqu %xmm3, 160(%rdx)
+	movdqu %xmm7, 176(%rdx)
+	movdqu %xmm0, 224(%rdx)
+	movdqu %xmm4, 240(%rdx)
+.Lchacha_blocks_sse2_mainloop_cont:
+	addq $256, %rdx
+	subq $256, %rcx
+	cmp $256, %rcx
+	jae .Lchacha_blocks_sse2_atleast256
+	movdqa 0(%rsp), %xmm8
+	movdqa 16(%rsp), %xmm9
+	movdqa 32(%rsp), %xmm10
+	movdqa 48(%rsp), %xmm11
+	movq $1, %r9
+.Lchacha_blocks_sse2_below256:
+	movq %r9, %xmm5
+	andq %rcx, %rcx
+	jz .Lchacha_blocks_sse2_done
+	cmpq $64, %rcx
+	jae .Lchacha_blocks_sse2_above63
+	movq %rdx, %r9
+	andq %rsi, %rsi
+	jz .Lchacha_blocks_sse2_noinput2
+	movq %rcx, %r10
+	movq %rsp, %rdx
+	addq %r10, %rsi
+	addq %r10, %rdx
+	negq %r10
+.Lchacha_blocks_sse2_copyinput:
+	movb (%rsi, %r10), %al
+	movb %al, (%rdx, %r10)
+	incq %r10
+	jnz .Lchacha_blocks_sse2_copyinput
+	movq %rsp, %rsi
+.Lchacha_blocks_sse2_noinput2:
+	movq %rsp, %rdx
+.Lchacha_blocks_sse2_above63:
+	movdqa %xmm8, %xmm0
+	movdqa %xmm9, %xmm1
+	movdqa %xmm10, %xmm2
+	movdqa %xmm11, %xmm3
+	movq 64(%rsp), %rax
+.Lchacha_blocks_sse2_mainloop2:
+	paddd %xmm1, %xmm0
+	pxor %xmm0, %xmm3
+	pshuflw $0xb1,%xmm3,%xmm3
+	pshufhw $0xb1,%xmm3,%xmm3
+	paddd %xmm3, %xmm2
+	pxor %xmm2, %xmm1
+	movdqa %xmm1,%xmm4
+	pslld $12, %xmm1
+	psrld $20, %xmm4
+	pxor %xmm4, %xmm1
+	paddd %xmm1, %xmm0
+	pxor %xmm0, %xmm3
+	movdqa %xmm3,%xmm4
+	pslld $8, %xmm3
+	psrld $24, %xmm4
+	pshufd $0x93,%xmm0,%xmm0
+	pxor %xmm4, %xmm3
+	paddd %xmm3, %xmm2
+	pshufd $0x4e,%xmm3,%xmm3
+	pxor %xmm2, %xmm1
+	pshufd $0x39,%xmm2,%xmm2
+	movdqa %xmm1,%xmm4
+	pslld $7, %xmm1
+	psrld $25, %xmm4
+	pxor %xmm4, %xmm1
+	subq $2, %rax
+	paddd %xmm1, %xmm0
+	pxor %xmm0, %xmm3
+	pshuflw $0xb1,%xmm3,%xmm3
+	pshufhw $0xb1,%xmm3,%xmm3
+	paddd %xmm3, %xmm2
+	pxor %xmm2, %xmm1
+	movdqa %xmm1,%xmm4
+	pslld $12, %xmm1
+	psrld $20, %xmm4
+	pxor %xmm4, %xmm1
+	paddd %xmm1, %xmm0
+	pxor %xmm0, %xmm3
+	movdqa %xmm3,%xmm4
+	pslld $8, %xmm3
+	psrld $24, %xmm4
+	pshufd $0x39,%xmm0,%xmm0
+	pxor %xmm4, %xmm3
+	paddd %xmm3, %xmm2
+	pshufd $0x4e,%xmm3,%xmm3
+	pxor %xmm2, %xmm1
+	pshufd $0x93,%xmm2,%xmm2
+	movdqa %xmm1,%xmm4
+	pslld $7, %xmm1
+	psrld $25, %xmm4
+	pxor %xmm4, %xmm1
+	jnz .Lchacha_blocks_sse2_mainloop2
+	paddd %xmm8, %xmm0
+	paddd %xmm9, %xmm1
+	paddd %xmm10, %xmm2
+	paddd %xmm11, %xmm3
+	andq %rsi, %rsi
+	jz .Lchacha_blocks_sse2_noinput3
+	movdqu 0(%rsi), %xmm12
+	movdqu 16(%rsi), %xmm13
+	movdqu 32(%rsi), %xmm14
+	movdqu 48(%rsi), %xmm15
+	pxor %xmm12, %xmm0
+	pxor %xmm13, %xmm1
+	pxor %xmm14, %xmm2
+	pxor %xmm15, %xmm3
+	addq $64, %rsi
+.Lchacha_blocks_sse2_noinput3:
+	movdqu %xmm0, 0(%rdx)
+	movdqu %xmm1, 16(%rdx)
+	movdqu %xmm2, 32(%rdx)
+	movdqu %xmm3, 48(%rdx)
+	paddq %xmm5, %xmm11
+	cmpq $64, %rcx
+	jbe .Lchacha_blocks_sse2_mainloop2_finishup
+	addq $64, %rdx
+	subq $64, %rcx
+	jmp .Lchacha_blocks_sse2_below256
+.Lchacha_blocks_sse2_mainloop2_finishup:
+	cmpq $64, %rcx
+	je .Lchacha_blocks_sse2_done
+	addq %rcx, %r9
+	addq %rcx, %rdx
+	negq %rcx
+.Lchacha_blocks_sse2_copyoutput:
+	movb (%rdx, %rcx), %al
+	movb %al, (%r9, %rcx)
+	incq %rcx
+	jnz .Lchacha_blocks_sse2_copyoutput
+.Lchacha_blocks_sse2_done:
+	movdqu %xmm11, 48(%rdi)
+	movq %rbp, %rsp
+	pxor %xmm15, %xmm15
+	pxor %xmm7, %xmm7
+	pxor %xmm14, %xmm14
+	pxor %xmm6, %xmm6
+	pxor %xmm13, %xmm13
+	pxor %xmm5, %xmm5
+	pxor %xmm12, %xmm12
+	pxor %xmm4, %xmm4
+	popq %rbp
+	popq %rbx
+	movl $(63 + 512 + 16), %eax
+	pxor %xmm11, %xmm11
+	pxor %xmm3, %xmm3
+	pxor %xmm10, %xmm10
+	pxor %xmm2, %xmm2
+	pxor %xmm9, %xmm9
+	pxor %xmm1, %xmm1
+	pxor %xmm8, %xmm8
+	pxor %xmm0, %xmm0
+	ret
+.size _gcry_chacha20_amd64_sse2_blocks,.-_gcry_chacha20_amd64_sse2_blocks;
+
+#endif /*defined(USE_CHACHA20)*/
+#endif /*__x86_64*/
diff --git a/cipher/chacha20.c b/cipher/chacha20.c
index e2cf442..03416d4 100644
--- a/cipher/chacha20.c
+++ b/cipher/chacha20.c
@@ -47,6 +47,12 @@
 #define CHACHA20_MAX_IV_SIZE  12        /* Bytes.  */
 #define CHACHA20_INPUT_LENGTH (CHACHA20_BLOCK_SIZE / 4)
 
+/* USE_SSE2 indicates whether to compile with Intel SSE2 code. */
+#undef USE_SSE2
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)
+# define USE_SSE2 1
+#endif
+
 /* USE_SSSE3 indicates whether to compile with Intel SSSE3 code. */
 #undef USE_SSSE3
 #if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) && \
@@ -77,6 +83,13 @@ typedef struct CHACHA20_context_s
 } CHACHA20_context_t;
 
 
+#ifdef USE_SSE2
+
+unsigned int _gcry_chacha20_amd64_sse2_blocks(u32 *state, const byte *in,
+                                              byte *out, size_t bytes);
+
+#endif /* USE_SSE2 */
+
 #ifdef USE_SSSE3
 
 unsigned int _gcry_chacha20_amd64_ssse3_blocks(u32 *state, const byte *in,
@@ -323,7 +336,12 @@ chacha20_do_setkey (CHACHA20_context_t * ctx,
   if (keylen != CHACHA20_MAX_KEY_SIZE && keylen != CHACHA20_MIN_KEY_SIZE)
     return GPG_ERR_INV_KEYLEN;
 
+#ifdef USE_SSE2
+  ctx->blocks = _gcry_chacha20_amd64_sse2_blocks;
+#else
   ctx->blocks = chacha20_blocks;
+#endif
+
 #ifdef USE_SSSE3
   if (features & HWF_INTEL_SSSE3)
     ctx->blocks = _gcry_chacha20_amd64_ssse3_blocks;
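
The selection logic in the hunk above can be sketched as a standalone function: SSE2 is part of the x86-64 baseline, so it becomes the default whenever the AMD64 assembly is built, and feature-gated variants override it. Everything here is hypothetical illustration. The flag names and stub functions are invented, the compile-time `#ifdef` is modelled as a runtime argument so the sketch can be exercised, and the AVX2 override is an assumption, since only the SSSE3 check is visible in this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; not the real HWF_* constants. */
enum { SKETCH_HWF_SSSE3 = 1u << 0, SKETCH_HWF_AVX2 = 1u << 1 };

typedef int (*sketch_blocks_fn)(void);

/* Stub implementations that only identify themselves. */
static int sketch_blocks_generic(void) { return 0; }
static int sketch_blocks_sse2(void)    { return 1; }
static int sketch_blocks_ssse3(void)   { return 2; }
static int sketch_blocks_avx2(void)    { return 3; }

/* Pick the blocks function: generic C without the assembly, SSE2 as the
   x86-64 baseline, then overrides for detected CPU features. */
static sketch_blocks_fn sketch_select_blocks(int have_amd64_asm,
                                             unsigned int features)
{
  sketch_blocks_fn fn = have_amd64_asm ? sketch_blocks_sse2
                                       : sketch_blocks_generic;
  if (have_amd64_asm && (features & SKETCH_HWF_SSSE3))
    fn = sketch_blocks_ssse3;
  if (have_amd64_asm && (features & SKETCH_HWF_AVX2))
    fn = sketch_blocks_avx2;   /* assumed ordering, not shown in this hunk */
  assert(fn != NULL);
  return fn;
}
```

This mirrors why the benchmark in the commit message disables the SSSE3 and AVX2 hardware features: with those bits cleared, the SSE2 routine is the one that gets measured.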
diff --git a/configure.ac b/configure.ac
index 47a322b..c5952c7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1815,6 +1815,7 @@ if test "$found" = "1" ; then
    case "${host}" in
       x86_64-*-*)
          # Build with the assembly implementation
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-sse2-amd64.lo"
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-ssse3-amd64.lo"
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-avx2-amd64.lo"
       ;;

commit 98f021961ee65669037bc8bb552a69fd78f610fc
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Sun May 11 20:52:27 2014 +0300

    poly1305: add AMD64/AVX2 optimized implementation
    
    * cipher/Makefile.am: Add 'poly1305-avx2-amd64.S'.
    * cipher/poly1305-avx2-amd64.S: New.
    * cipher/poly1305-internal.h (POLY1305_USE_AVX2)
    (POLY1305_AVX2_BLOCKSIZE, POLY1305_AVX2_STATESIZE)
    (POLY1305_AVX2_ALIGNMENT): New.
    (POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
    (POLY1305_STATE_ALIGNMENT): Use AVX2 versions when needed.
    * cipher/poly1305.c [POLY1305_USE_AVX2]
    (_gcry_poly1305_amd64_avx2_init_ext)
    (_gcry_poly1305_amd64_avx2_finish_ext)
    (_gcry_poly1305_amd64_avx2_blocks, poly1305_amd64_avx2_ops): New.
    (_gcry_poly1305_init) [POLY1305_USE_AVX2]: Use AVX2 implementation if
    AVX2 supported by CPU.
    * configure.ac [host=x86_64]: Add 'poly1305-avx2-amd64.lo'.
    --
    
    Add Andrew Moon's public domain AVX2 implementation of Poly1305. Original
    source is available at: https://github.com/floodyberry/poly1305-opt
    
    Benchmarks on Intel i5-4570 (haswell):
    
    Old:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |     0.448 ns/B    2129.5 MiB/s      1.43 c/B
    
    New:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |     0.205 ns/B    4643.5 MiB/s     0.657 c/B
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
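
The start of `_gcry_poly1305_amd64_avx2_init_ext` in the diff below loads the first 16 key bytes and masks them with the constants `0xffc0fffffff`, `0xfffffc0ffff` and `0xffffffc0f`. Read as C, this appears to combine the usual Poly1305 clamping of `r` with a split into 44-, 44- and 42-bit limbs. The sketch below is an interpretation of those instructions, not code from this commit, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 64-bit little-endian value from 8 bytes. */
static uint64_t sketch_load_le64(const unsigned char *p)
{
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}

/* Clamp the 16-byte r part of a Poly1305 key and split it into
   limbs of 44, 44 and 42 bits (r = r[0] + r[1]*2^44 + r[2]*2^88). */
static void sketch_poly1305_clamp_r(const unsigned char key[16], uint64_t r[3])
{
  assert(key != NULL && r != NULL);
  uint64_t t0 = sketch_load_le64(key);
  uint64_t t1 = sketch_load_le64(key + 8);

  r[0] = t0 & 0xffc0fffffffULL;
  r[1] = ((t0 >> 44) | (t1 << 20)) & 0xfffffc0ffffULL;
  r[2] = (t1 >> 24) & 0x00ffffffc0fULL;
}
```

The masks clear the same bits as the byte-wise clamp (top four bits of bytes 3, 7, 11 and 15, low two bits of bytes 4, 8 and 12). The assembly then repacks these limbs into 26-bit pieces (the `andl $67108863` instructions) for the vector code.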

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index a32ae89..19b0097 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -72,7 +72,7 @@ gost28147.c gost.h \
 gostr3411-94.c \
 md4.c \
 md5.c \
-poly1305-sse2-amd64.S \
+poly1305-sse2-amd64.S poly1305-avx2-amd64.S \
 rijndael.c rijndael-tables.h rijndael-amd64.S rijndael-arm.S \
 rmd160.c \
 rsa.c \
diff --git a/cipher/poly1305-avx2-amd64.S b/cipher/poly1305-avx2-amd64.S
new file mode 100644
index 0000000..0ba7e76
--- /dev/null
+++ b/cipher/poly1305-avx2-amd64.S
@@ -0,0 +1,954 @@
+/* poly1305-avx2-amd64.S  -  AMD64/AVX2 implementation of Poly1305
+ *
+ * Copyright (C) 2014 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Based on public domain implementation by Andrew Moon at
+ *  https://github.com/floodyberry/poly1305-opt
+ */
+
+#include <config.h>
+
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) && \
+    defined(ENABLE_AVX2_SUPPORT)
+
+.text
+
+
+.align 8
+.globl _gcry_poly1305_amd64_avx2_init_ext
+.type  _gcry_poly1305_amd64_avx2_init_ext,@function;
+_gcry_poly1305_amd64_avx2_init_ext:
+.Lpoly1305_init_ext_avx2_local:
+	xor %edx, %edx
+	vzeroupper
+	pushq %r12
+	pushq %r13
+	pushq %r14
+	pushq %r15
+	pushq %rbx
+	movq %rdx, %rcx
+	vpxor %ymm0, %ymm0, %ymm0
+	movq $-1, %r8
+	testq %rcx, %rcx
+	vmovdqu %ymm0, (%rdi)
+	vmovdqu %ymm0, 32(%rdi)
+	vmovdqu %ymm0, 64(%rdi)
+	vmovdqu %ymm0, 96(%rdi)
+	vmovdqu %ymm0, 128(%rdi)
+	movq 8(%rsi), %r9
+	cmove %r8, %rcx
+	movq $0xffc0fffffff, %r8
+	movq %r9, %r13
+	movq (%rsi), %r10
+	andq %r10, %r8
+	shrq $44, %r10
+	movq %r8, %r14
+	shlq $20, %r13
+	orq %r13, %r10
+	movq $0xfffffc0ffff, %r13
+	shrq $24, %r9
+	andq %r13, %r10
+	movq $0xffffffc0f, %r13
+	andq %r13, %r9
+	movl %r8d, %r13d
+	andl $67108863, %r13d
+	movl %r13d, 164(%rdi)
+	movq %r10, %r13
+	shrq $26, %r14
+	shlq $18, %r13
+	orq %r13, %r14
+	movq %r10, %r13
+	shrq $8, %r13
+	andl $67108863, %r14d
+	andl $67108863, %r13d
+	movl %r14d, 172(%rdi)
+	movq %r10, %r14
+	movl %r13d, 180(%rdi)
+	movq %r9, %r13
+	shrq $34, %r14
+	shlq $10, %r13
+	orq %r13, %r14
+	movq %r9, %r13
+	shrq $16, %r13
+	andl $67108863, %r14d
+	movl %r14d, 188(%rdi)
+	movl %r13d, 196(%rdi)
+	cmpq $16, %rcx
+	jbe .Lpoly1305_init_ext_avx2_continue
+	lea (%r9,%r9,4), %r11
+	shlq $2, %r11
+	lea (%r10,%r10), %rax
+	mulq %r11
+	movq %rax, %r13
+	movq %r8, %rax
+	movq %rdx, %r14
+	mulq %r8
+	addq %rax, %r13
+	lea (%r8,%r8), %rax
+	movq %r13, %r12
+	adcq %rdx, %r14
+	mulq %r10
+	shlq $20, %r14
+	movq %rax, %r15
+	shrq $44, %r12
+	movq %r11, %rax
+	orq %r12, %r14
+	movq %rdx, %r12
+	mulq %r9
+	addq %rax, %r15
+	movq %r8, %rax
+	adcq %rdx, %r12
+	addq %r15, %r14
+	lea (%r9,%r9), %r15
+	movq %r14, %rbx
+	adcq $0, %r12
+	mulq %r15
+	shlq $20, %r12
+	movq %rdx, %r11
+	shrq $44, %rbx
+	orq %rbx, %r12
+	movq %rax, %rbx
+	movq %r10, %rax
+	mulq %r10
+	addq %rax, %rbx
+	adcq %rdx, %r11
+	addq %rbx, %r12
+	movq $0xfffffffffff, %rbx
+	movq %r12, %r15
+	adcq $0, %r11
+	andq %rbx, %r13
+	shlq $22, %r11
+	andq %rbx, %r14
+	shrq $42, %r15
+	orq %r15, %r11
+	lea (%r11,%r11,4), %r11
+	addq %r11, %r13
+	movq %rbx, %r11
+	andq %r13, %r11
+	shrq $44, %r13
+	movq %r11, %r15
+	addq %r13, %r14
+	movq $0x3ffffffffff, %r13
+	andq %r14, %rbx
+	andq %r13, %r12
+	movq %rbx, %r13
+	shrq $26, %r15
+	shlq $18, %r13
+	orq %r13, %r15
+	movq %rbx, %r13
+	shrq $44, %r14
+	shrq $8, %r13
+	addq %r14, %r12
+	movl %r11d, %r14d
+	andl $67108863, %r15d
+	andl $67108863, %r14d
+	andl $67108863, %r13d
+	movl %r14d, 204(%rdi)
+	movq %rbx, %r14
+	movl %r13d, 220(%rdi)
+	movq %r12, %r13
+	shrq $34, %r14
+	shlq $10, %r13
+	orq %r13, %r14
+	movq %r12, %r13
+	shrq $16, %r13
+	andl $67108863, %r14d
+	movl %r15d, 212(%rdi)
+	movl %r14d, 228(%rdi)
+	movl %r13d, 236(%rdi)
+	cmpq $32, %rcx
+	jbe .Lpoly1305_init_ext_avx2_continue
+	movq %r9, %rax
+	lea (%rbx,%rbx,4), %r14
+	shlq $2, %r14
+	mulq %r14
+	movq %rdi, -32(%rsp)
+	lea (%r12,%r12,4), %rdi
+	shlq $2, %rdi
+	movq %rax, %r14
+	movq %r10, %rax
+	movq %rdx, %r15
+	mulq %rdi
+	movq %rax, %r13
+	movq %r11, %rax
+	movq %rcx, -16(%rsp)
+	movq %rdx, %rcx
+	mulq %r8
+	addq %rax, %r13
+	movq %rdi, %rax
+	movq %rsi, -24(%rsp)
+	adcq %rdx, %rcx
+	addq %r13, %r14
+	adcq %rcx, %r15
+	movq %r14, %rcx
+	mulq %r9
+	shlq $20, %r15
+	movq %rax, %r13
+	shrq $44, %rcx
+	movq %r11, %rax
+	orq %rcx, %r15
+	movq %rdx, %rcx
+	mulq %r10
+	movq %rax, %rsi
+	movq %rbx, %rax
+	movq %rdx, %rdi
+	mulq %r8
+	addq %rax, %rsi
+	movq %r11, %rax
+	adcq %rdx, %rdi
+	addq %rsi, %r13
+	adcq %rdi, %rcx
+	addq %r13, %r15
+	movq %r15, %rdi
+	adcq $0, %rcx
+	mulq %r9
+	shlq $20, %rcx
+	movq %rdx, %rsi
+	shrq $44, %rdi
+	orq %rdi, %rcx
+	movq %rax, %rdi
+	movq %rbx, %rax
+	mulq %r10
+	movq %rax, %r9
+	movq %r8, %rax
+	movq %rdx, %r10
+	movq $0xfffffffffff, %r8
+	mulq %r12
+	addq %rax, %r9
+	adcq %rdx, %r10
+	andq %r8, %r14
+	addq %r9, %rdi
+	adcq %r10, %rsi
+	andq %r8, %r15
+	addq %rdi, %rcx
+	movq $0x3ffffffffff, %rdi
+	movq %rcx, %r10
+	adcq $0, %rsi
+	andq %rdi, %rcx
+	shlq $22, %rsi
+	shrq $42, %r10
+	orq %r10, %rsi
+	movq -32(%rsp), %rdi
+	lea (%rsi,%rsi,4), %r9
+	movq %r8, %rsi
+	addq %r9, %r14
+	andq %r14, %rsi
+	shrq $44, %r14
+	addq %r14, %r15
+	andq %r15, %r8
+	shrq $44, %r15
+	movq %r8, %r14
+	addq %r15, %rcx
+	movl %esi, %r15d
+	movq %rcx, %r10
+	movq %r8, %r9
+	shrq $26, %rsi
+	andl $67108863, %r15d
+	shlq $18, %r14
+	shrq $34, %r8
+	orq %r14, %rsi
+	shlq $10, %r10
+	shrq $8, %r9
+	orq %r10, %r8
+	shrq $16, %rcx
+	andl $67108863, %esi
+	movl %esi, 252(%rdi)
+	andl $67108863, %r9d
+	movl %ecx, 276(%rdi)
+	andl $67108863, %r8d
+	movl %r15d, 244(%rdi)
+	movl %r9d, 260(%rdi)
+	movl %r8d, 268(%rdi)
+	movq -16(%rsp), %rcx
+	movq -24(%rsp), %rsi
+.Lpoly1305_init_ext_avx2_continue:
+	movl 16(%rsi), %r8d
+	movl %r8d, 284(%rdi)
+	movl 20(%rsi), %r9d
+	movl %r9d, 292(%rdi)
+	movl 24(%rsi), %r10d
+	movl %r10d, 300(%rdi)
+	movl 28(%rsi), %esi
+	movl %esi, 308(%rdi)
+	cmpq $48, %rcx
+	jbe .Lpoly1305_init_ext_avx2_done
+	lea (%r12,%r12,4), %r9
+	shlq $2, %r9
+	lea (%rbx,%rbx), %rax
+	mulq %r9
+	movq %rax, %rsi
+	movq %r11, %rax
+	movq %rdx, %r8
+	mulq %r11
+	addq %rax, %rsi
+	lea (%r11,%r11), %rax
+	movq %rsi, %r10
+	adcq %rdx, %r8
+	mulq %rbx
+	movq %rax, %r13
+	movq %r12, %rax
+	movq %rdx, %rcx
+	addq %r12, %r12
+	mulq %r9
+	addq %rax, %r13
+	movq %r11, %rax
+	movq $0xfffffffffff, %r9
+	adcq %rdx, %rcx
+	andq %r9, %rsi
+	mulq %r12
+	shlq $20, %r8
+	movq %rax, %r11
+	shrq $44, %r10
+	movq %rbx, %rax
+	orq %r10, %r8
+	movq %rdx, %r12
+	mulq %rbx
+	addq %r13, %r8
+	movq %r8, %r14
+	adcq $0, %rcx
+	andq %r9, %r8
+	addq %rax, %r11
+	adcq %rdx, %r12
+	shlq $20, %rcx
+	shrq $44, %r14
+	orq %r14, %rcx
+	addq %r11, %rcx
+	movq %rcx, %rbx
+	adcq $0, %r12
+	shlq $22, %r12
+	shrq $42, %rbx
+	orq %rbx, %r12
+	movq %r9, %rbx
+	lea (%r12,%r12,4), %r15
+	addq %r15, %rsi
+	andq %rsi, %rbx
+	shrq $44, %rsi
+	movl %ebx, %r11d
+	addq %rsi, %r8
+	movq $0x3ffffffffff, %rsi
+	andq %r8, %r9
+	andq %rsi, %rcx
+	shrq $44, %r8
+	movq %r9, %rax
+	addq %r8, %rcx
+	movq %r9, %r8
+	movq %rcx, %r10
+	andl $67108863, %r11d
+	shrq $26, %rbx
+	shlq $18, %r8
+	shrq $34, %r9
+	orq %r8, %rbx
+	shlq $10, %r10
+	shrq $8, %rax
+	orq %r10, %r9
+	shrq $16, %rcx
+	andl $67108863, %ebx
+	andl $67108863, %eax
+	andl $67108863, %r9d
+	movl %r11d, 184(%rdi)
+	movl %r11d, 176(%rdi)
+	movl %r11d, 168(%rdi)
+	movl %r11d, 160(%rdi)
+	movl %ebx, 216(%rdi)
+	movl %ebx, 208(%rdi)
+	movl %ebx, 200(%rdi)
+	movl %ebx, 192(%rdi)
+	movl %eax, 248(%rdi)
+	movl %eax, 240(%rdi)
+	movl %eax, 232(%rdi)
+	movl %eax, 224(%rdi)
+	movl %r9d, 280(%rdi)
+	movl %r9d, 272(%rdi)
+	movl %r9d, 264(%rdi)
+	movl %r9d, 256(%rdi)
+	movl %ecx, 312(%rdi)
+	movl %ecx, 304(%rdi)
+	movl %ecx, 296(%rdi)
+	movl %ecx, 288(%rdi)
+.Lpoly1305_init_ext_avx2_done:
+	movq $0, 320(%rdi)
+	vzeroall
+	popq %rbx
+	popq %r15
+	popq %r14
+	popq %r13
+	popq %r12
+	ret
+.size _gcry_poly1305_amd64_avx2_init_ext,.-_gcry_poly1305_amd64_avx2_init_ext;
+
+
+.align 8
+.globl _gcry_poly1305_amd64_avx2_blocks
+.type  _gcry_poly1305_amd64_avx2_blocks,@function;
+_gcry_poly1305_amd64_avx2_blocks:
+.Lpoly1305_blocks_avx2_local:
+	vzeroupper
+	pushq %rbp
+	movq %rsp, %rbp
+	pushq %rbx
+	andq $-64, %rsp
+	subq $200, %rsp
+	movl $((1<<26)-1), %r8d
+	movl $(5), %r9d
+	movl $((1<<24)), %r10d
+	vmovd %r8d, %xmm0
+	vmovd %r9d, %xmm8
+	vmovd %r10d, %xmm7
+	vpbroadcastq %xmm0, %ymm0
+	vpbroadcastq %xmm8, %ymm8
+	vpbroadcastq %xmm7, %ymm7
+	vmovdqa %ymm7, 168(%rsp)
+	movq 320(%rdi), %rax
+	testb $60, %al
+	je .Lpoly1305_blocks_avx2_9
+	vmovdqa 168(%rsp), %ymm7
+	vpsrldq $8, %ymm7, %ymm1
+	vmovdqa %ymm1, 168(%rsp)
+	testb $4, %al
+	je .Lpoly1305_blocks_avx2_10
+	vpermq $192, %ymm1, %ymm7
+	vmovdqa %ymm7, 168(%rsp)
+.Lpoly1305_blocks_avx2_10:
+	testb $8, %al
+	je .Lpoly1305_blocks_avx2_11
+	vpermq $240, 168(%rsp), %ymm7
+	vmovdqa %ymm7, 168(%rsp)
+.Lpoly1305_blocks_avx2_11:
+	testb $16, %al
+	je .Lpoly1305_blocks_avx2_12
+	vpermq $252, 168(%rsp), %ymm6
+	vmovdqa %ymm6, 168(%rsp)
+.Lpoly1305_blocks_avx2_12:
+	testb $32, %al
+	je .Lpoly1305_blocks_avx2_9
+	vpxor %xmm6, %xmm6, %xmm6
+	vmovdqa %ymm6, 168(%rsp)
+.Lpoly1305_blocks_avx2_9:
+	testb $1, %al
+	jne .Lpoly1305_blocks_avx2_13
+	vmovdqu (%rsi), %ymm3
+	vmovdqu 32(%rsi), %ymm1
+	vpunpcklqdq %ymm1, %ymm3, %ymm2
+	vpunpckhqdq %ymm1, %ymm3, %ymm1
+	vpermq $216, %ymm2, %ymm2
+	vpermq $216, %ymm1, %ymm1
+	vpand %ymm2, %ymm0, %ymm5
+	vpsrlq $26, %ymm2, %ymm4
+	vpand %ymm4, %ymm0, %ymm4
+	vpsllq $12, %ymm1, %ymm3
+	vpsrlq $52, %ymm2, %ymm2
+	vpor %ymm3, %ymm2, %ymm2
+	vpand %ymm2, %ymm0, %ymm3
+	vpsrlq $26, %ymm2, %ymm2
+	vpand %ymm2, %ymm0, %ymm2
+	vpsrlq $40, %ymm1, %ymm1
+	vpor 168(%rsp), %ymm1, %ymm1
+	addq $64, %rsi
+	subq $64, %rdx
+	orq $1, 320(%rdi)
+	jmp .Lpoly1305_blocks_avx2_14
+.Lpoly1305_blocks_avx2_13:
+	vmovdqa (%rdi), %ymm5
+	vmovdqa 32(%rdi), %ymm4
+	vmovdqa 64(%rdi), %ymm3
+	vmovdqa 96(%rdi), %ymm2
+	vmovdqa 128(%rdi), %ymm1
+.Lpoly1305_blocks_avx2_14:
+	cmpq $63, %rdx
+	jbe .Lpoly1305_blocks_avx2_15
+	vmovdqa 160(%rdi), %ymm6
+	vmovdqa %ymm8, 136(%rsp)
+	vmovdqa 192(%rdi), %ymm7
+	vpmuludq %ymm8, %ymm7, %ymm11
+	vmovdqa %ymm11, 104(%rsp)
+	vmovdqa 224(%rdi), %ymm11
+	vmovdqa %ymm11, 72(%rsp)
+	vpmuludq %ymm11, %ymm8, %ymm11
+	vmovdqa %ymm11, 40(%rsp)
+	vmovdqa 256(%rdi), %ymm11
+	vmovdqa %ymm11, 8(%rsp)
+	vpmuludq %ymm11, %ymm8, %ymm11
+	vmovdqa %ymm11, -24(%rsp)
+	vmovdqa 288(%rdi), %ymm13
+	vmovdqa %ymm13, -56(%rsp)
+	vpmuludq %ymm13, %ymm8, %ymm13
+	vmovdqa %ymm13, -88(%rsp)
+.Lpoly1305_blocks_avx2_16:
+	vpmuludq 104(%rsp), %ymm1, %ymm14
+	vmovdqa 40(%rsp), %ymm13
+	vpmuludq %ymm13, %ymm2, %ymm8
+	vpmuludq %ymm13, %ymm1, %ymm13
+	vmovdqa -24(%rsp), %ymm9
+	vpmuludq %ymm9, %ymm2, %ymm10
+	vpmuludq %ymm9, %ymm1, %ymm11
+	vpaddq %ymm8, %ymm14, %ymm14
+	vpmuludq %ymm9, %ymm3, %ymm8
+	vmovdqa -88(%rsp), %ymm12
+	vpmuludq %ymm12, %ymm1, %ymm9
+	vpaddq %ymm10, %ymm13, %ymm13
+	vpmuludq %ymm12, %ymm4, %ymm15
+	vmovdqa %ymm12, %ymm10
+	vpmuludq %ymm12, %ymm3, %ymm12
+	vpaddq %ymm8, %ymm14, %ymm14
+	vpmuludq %ymm10, %ymm2, %ymm10
+	vpmuludq %ymm6, %ymm2, %ymm8
+	vpaddq %ymm15, %ymm14, %ymm14
+	vpmuludq %ymm6, %ymm1, %ymm1
+	vpaddq %ymm12, %ymm13, %ymm13
+	vpmuludq %ymm6, %ymm5, %ymm15
+	vpaddq %ymm10, %ymm11, %ymm11
+	vpmuludq %ymm6, %ymm4, %ymm12
+	vpaddq %ymm8, %ymm9, %ymm9
+	vpmuludq %ymm6, %ymm3, %ymm10
+	vpmuludq %ymm7, %ymm3, %ymm8
+	vpaddq %ymm15, %ymm14, %ymm14
+	vpmuludq %ymm7, %ymm2, %ymm2
+	vpaddq %ymm12, %ymm13, %ymm12
+	vpmuludq %ymm7, %ymm5, %ymm15
+	vpaddq %ymm10, %ymm11, %ymm10
+	vpmuludq %ymm7, %ymm4, %ymm13
+	vpaddq %ymm8, %ymm9, %ymm8
+	vmovdqa 72(%rsp), %ymm9
+	vpmuludq %ymm9, %ymm4, %ymm11
+	vpaddq %ymm2, %ymm1, %ymm1
+	vpmuludq %ymm9, %ymm3, %ymm3
+	vpaddq %ymm15, %ymm12, %ymm12
+	vpmuludq %ymm9, %ymm5, %ymm15
+	vpaddq %ymm13, %ymm10, %ymm10
+	vmovdqa 8(%rsp), %ymm2
+	vpmuludq %ymm2, %ymm5, %ymm9
+	vpaddq %ymm11, %ymm8, %ymm8
+	vpmuludq %ymm2, %ymm4, %ymm4
+	vpaddq %ymm3, %ymm1, %ymm1
+	vpmuludq -56(%rsp), %ymm5, %ymm5
+	vpaddq %ymm15, %ymm10, %ymm10
+	vpaddq %ymm9, %ymm8, %ymm8
+	vpaddq %ymm4, %ymm1, %ymm1
+	vpaddq %ymm5, %ymm1, %ymm5
+	vmovdqu (%rsi), %ymm3
+	vmovdqu 32(%rsi), %ymm2
+	vperm2i128 $32, %ymm2, %ymm3, %ymm1
+	vperm2i128 $49, %ymm2, %ymm3, %ymm2
+	vpunpckldq %ymm2, %ymm1, %ymm15
+	vpunpckhdq %ymm2, %ymm1, %ymm2
+	vpxor %xmm4, %xmm4, %xmm4
+	vpunpckldq %ymm4, %ymm15, %ymm1
+	vpunpckhdq %ymm4, %ymm15, %ymm15
+	vpunpckldq %ymm4, %ymm2, %ymm3
+	vpunpckhdq %ymm4, %ymm2, %ymm2
+	vpsllq $6, %ymm15, %ymm15
+	vpsllq $12, %ymm3, %ymm3
+	vpsllq $18, %ymm2, %ymm2
+	vpaddq %ymm1, %ymm14, %ymm14
+	vpaddq %ymm15, %ymm12, %ymm12
+	vpaddq %ymm3, %ymm10, %ymm10
+	vpaddq %ymm2, %ymm8, %ymm8
+	vpaddq 168(%rsp), %ymm5, %ymm5
+	addq $64, %rsi
+	vpsrlq $26, %ymm14, %ymm4
+	vpsrlq $26, %ymm8, %ymm2
+	vpand %ymm0, %ymm14, %ymm14
+	vpand %ymm0, %ymm8, %ymm8
+	vpaddq %ymm4, %ymm12, %ymm12
+	vpaddq %ymm2, %ymm5, %ymm5
+	vpsrlq $26, %ymm12, %ymm3
+	vpsrlq $26, %ymm5, %ymm9
+	vpand %ymm0, %ymm12, %ymm12
+	vpand %ymm0, %ymm5, %ymm11
+	vpaddq %ymm3, %ymm10, %ymm3
+	vpmuludq 136(%rsp), %ymm9, %ymm9
+	vpaddq %ymm9, %ymm14, %ymm14
+	vpsrlq $26, %ymm3, %ymm2
+	vpsrlq $26, %ymm14, %ymm4
+	vpand %ymm0, %ymm3, %ymm3
+	vpand %ymm0, %ymm14, %ymm5
+	vpaddq %ymm2, %ymm8, %ymm2
+	vpaddq %ymm4, %ymm12, %ymm4
+	vpsrlq $26, %ymm2, %ymm1
+	vpand %ymm0, %ymm2, %ymm2
+	vpaddq %ymm1, %ymm11, %ymm1
+	subq $64, %rdx
+	cmpq $63, %rdx
+	ja .Lpoly1305_blocks_avx2_16
+.Lpoly1305_blocks_avx2_15:
+	testb $64, 320(%rdi)
+	jne .Lpoly1305_blocks_avx2_17
+	vmovdqa %ymm5, (%rdi)
+	vmovdqa %ymm4, 32(%rdi)
+	vmovdqa %ymm3, 64(%rdi)
+	vmovdqa %ymm2, 96(%rdi)
+	vmovdqa %ymm1, 128(%rdi)
+	jmp .Lpoly1305_blocks_avx2_8
+.Lpoly1305_blocks_avx2_17:
+	vpermq $245, %ymm5, %ymm0
+	vpaddq %ymm0, %ymm5, %ymm5
+	vpermq $245, %ymm4, %ymm0
+	vpaddq %ymm0, %ymm4, %ymm4
+	vpermq $245, %ymm3, %ymm0
+	vpaddq %ymm0, %ymm3, %ymm3
+	vpermq $245, %ymm2, %ymm0
+	vpaddq %ymm0, %ymm2, %ymm2
+	vpermq $245, %ymm1, %ymm0
+	vpaddq %ymm0, %ymm1, %ymm1
+	vpermq $170, %ymm5, %ymm0
+	vpaddq %ymm0, %ymm5, %ymm5
+	vpermq $170, %ymm4, %ymm0
+	vpaddq %ymm0, %ymm4, %ymm4
+	vpermq $170, %ymm3, %ymm0
+	vpaddq %ymm0, %ymm3, %ymm3
+	vpermq $170, %ymm2, %ymm0
+	vpaddq %ymm0, %ymm2, %ymm2
+	vpermq $170, %ymm1, %ymm0
+	vpaddq %ymm0, %ymm1, %ymm1
+	vmovd %xmm5, %eax
+	vmovd %xmm4, %edx
+	movl %eax, %ecx
+	shrl $26, %ecx
+	addl %edx, %ecx
+	movl %ecx, %edx
+	andl $67108863, %edx
+	vmovd %xmm3, %esi
+	shrl $26, %ecx
+	movl %ecx, %r11d
+	addl %esi, %r11d
+	vmovd %xmm2, %ecx
+	movl %r11d, %r10d
+	shrl $26, %r10d
+	addl %ecx, %r10d
+	movl %r10d, %r9d
+	andl $67108863, %r9d
+	vmovd %xmm1, %r8d
+	movl %edx, %esi
+	salq $26, %rsi
+	andl $67108863, %eax
+	orq %rax, %rsi
+	movabsq $17592186044415, %rax
+	andq %rax, %rsi
+	andl $67108863, %r11d
+	salq $8, %r11
+	shrl $18, %edx
+	movl %edx, %edx
+	orq %r11, %rdx
+	movq %r9, %rcx
+	salq $34, %rcx
+	orq %rcx, %rdx
+	andq %rax, %rdx
+	shrl $26, %r10d
+	addl %r10d, %r8d
+	salq $16, %r8
+	shrl $10, %r9d
+	movl %r9d, %r9d
+	orq %r9, %r8
+	movabsq $4398046511103, %r10
+	movq %r8, %r9
+	andq %r10, %r9
+	shrq $42, %r8
+	leaq (%r8,%r8,4), %rcx
+	addq %rcx, %rsi
+	movq %rsi, %r8
+	andq %rax, %r8
+	movq %rsi, %rcx
+	shrq $44, %rcx
+	addq %rdx, %rcx
+	movq %rcx, %rsi
+	andq %rax, %rsi
+	shrq $44, %rcx
+	movq %rcx, %rdx
+	addq %r9, %rdx
+	andq %rdx, %r10
+	shrq $42, %rdx
+	leaq (%r8,%rdx,4), %rcx
+	leaq (%rcx,%rdx), %rdx
+	movq %rdx, %rbx
+	andq %rax, %rbx
+	shrq $44, %rdx
+	movq %rdx, %r11
+	addq %rsi, %r11
+	leaq 5(%rbx), %r9
+	movq %r9, %r8
+	shrq $44, %r8
+	addq %r11, %r8
+	movabsq $-4398046511104, %rsi
+	addq %r10, %rsi
+	movq %r8, %rdx
+	shrq $44, %rdx
+	addq %rdx, %rsi
+	movq %rsi, %rdx
+	shrq $63, %rdx
+	subq $1, %rdx
+	movq %rdx, %rcx
+	notq %rcx
+	andq %rcx, %rbx
+	andq %rcx, %r11
+	andq %r10, %rcx
+	andq %rax, %r9
+	andq %rdx, %r9
+	orq %r9, %rbx
+	movq %rbx, (%rdi)
+	andq %r8, %rax
+	andq %rdx, %rax
+	orq %rax, %r11
+	movq %r11, 8(%rdi)
+	andq %rsi, %rdx
+	orq %rcx, %rdx
+	movq %rdx, 16(%rdi)
+.Lpoly1305_blocks_avx2_8:
+	movq -8(%rbp), %rbx
+	vzeroall
+	movq %rbp, %rax
+	subq %rsp, %rax
+	leave
+	addq $8, %rax
+	ret
+.size _gcry_poly1305_amd64_avx2_blocks,.-_gcry_poly1305_amd64_avx2_blocks;
+
+
+.align 8
+.globl _gcry_poly1305_amd64_avx2_finish_ext
+.type  _gcry_poly1305_amd64_avx2_finish_ext,@function;
+_gcry_poly1305_amd64_avx2_finish_ext:
+.Lpoly1305_finish_ext_avx2_local:
+	vzeroupper
+	pushq %rbp
+	movq %rsp, %rbp
+	pushq %r13
+	pushq %r12
+	pushq %rbx
+	andq $-64, %rsp
+	subq $64, %rsp
+	movq %rdi, %rbx
+	movq %rdx, %r13
+	movq %rcx, %r12
+	testq %rdx, %rdx
+	je .Lpoly1305_finish_ext_avx2_22
+	vpxor %xmm0, %xmm0, %xmm0
+	vmovdqa %ymm0, (%rsp)
+	vmovdqa %ymm0, 32(%rsp)
+	movq %rsp, %rax
+	subq %rsp, %rsi
+	testb $32, %dl
+	je .Lpoly1305_finish_ext_avx2_23
+	vmovdqu (%rsp,%rsi), %ymm0
+	vmovdqa %ymm0, (%rsp)
+	leaq 32(%rsp), %rax
+.Lpoly1305_finish_ext_avx2_23:
+	testb $16, %r13b
+	je .Lpoly1305_finish_ext_avx2_24
+	vmovdqu (%rax,%rsi), %xmm0
+	vmovdqa %xmm0, (%rax)
+	addq $16, %rax
+.Lpoly1305_finish_ext_avx2_24:
+	testb $8, %r13b
+	je .Lpoly1305_finish_ext_avx2_25
+	movq (%rax,%rsi), %rdx
+	movq %rdx, (%rax)
+	addq $8, %rax
+.Lpoly1305_finish_ext_avx2_25:
+	testb $4, %r13b
+	je .Lpoly1305_finish_ext_avx2_26
+	movl (%rax,%rsi), %edx
+	movl %edx, (%rax)
+	addq $4, %rax
+.Lpoly1305_finish_ext_avx2_26:
+	testb $2, %r13b
+	je .Lpoly1305_finish_ext_avx2_27
+	movzwl (%rax,%rsi), %edx
+	movw %dx, (%rax)
+	addq $2, %rax
+.Lpoly1305_finish_ext_avx2_27:
+	testb $1, %r13b
+	je .Lpoly1305_finish_ext_avx2_28
+	movzbl (%rax,%rsi), %edx
+	movb %dl, (%rax)
+.Lpoly1305_finish_ext_avx2_28:
+	testb $15, %r13b
+	je .Lpoly1305_finish_ext_avx2_29
+	movb $1, (%rsp,%r13)
+.Lpoly1305_finish_ext_avx2_29:
+	cmpq $47, %r13
+	jbe .Lpoly1305_finish_ext_avx2_30
+	orq $4, 320(%rbx)
+	jmp .Lpoly1305_finish_ext_avx2_31
+.Lpoly1305_finish_ext_avx2_30:
+	cmpq $31, %r13
+	jbe .Lpoly1305_finish_ext_avx2_32
+	orq $8, 320(%rbx)
+	jmp .Lpoly1305_finish_ext_avx2_31
+.Lpoly1305_finish_ext_avx2_32:
+	cmpq $15, %r13
+	jbe .Lpoly1305_finish_ext_avx2_33
+	orq $16, 320(%rbx)
+	jmp .Lpoly1305_finish_ext_avx2_31
+.Lpoly1305_finish_ext_avx2_33:
+	orq $32, 320(%rbx)
+.Lpoly1305_finish_ext_avx2_31:
+	testb $1, 320(%rbx)
+	je .Lpoly1305_finish_ext_avx2_34
+	cmpq $32, %r13
+	ja .Lpoly1305_finish_ext_avx2_34
+	cmpq $17, %r13
+	sbbq %rsi, %rsi
+	notq %rsi
+	addq $2, %rsi
+	cmpq $17, %r13
+	sbbq %rax, %rax
+	movq %rbx, %rdx
+	addq $23, %rax
+	leaq (%rbx,%rax,8), %rax
+	movl $0, %ecx
+.Lpoly1305_finish_ext_avx2_37:
+	movl 244(%rdx), %edi
+	movl %edi, (%rax)
+	movl 252(%rdx), %edi
+	movl %edi, 32(%rax)
+	movl 260(%rdx), %edi
+	movl %edi, 64(%rax)
+	movl 268(%rdx), %edi
+	movl %edi, 96(%rax)
+	movl 276(%rdx), %edi
+	movl %edi, 128(%rax)
+	addq $1, %rcx
+	subq $40, %rdx
+	addq $8, %rax
+	cmpq %rcx, %rsi
+	ja .Lpoly1305_finish_ext_avx2_37
+.Lpoly1305_finish_ext_avx2_34:
+	movl $64, %edx
+	movq %rsp, %rsi
+	movq %rbx, %rdi
+	call .Lpoly1305_blocks_avx2_local
+.Lpoly1305_finish_ext_avx2_22:
+	movq 320(%rbx), %r8
+	testb $1, %r8b
+	je .Lpoly1305_finish_ext_avx2_38
+	leaq -1(%r13), %rax
+	cmpq $47, %rax
+	ja .Lpoly1305_finish_ext_avx2_46
+	cmpq $32, %r13
+	ja .Lpoly1305_finish_ext_avx2_47
+	cmpq $17, %r13
+	sbbq %r9, %r9
+	addq $2, %r9
+	movl $0, %edi
+	cmpq $17, %r13
+	sbbq %rax, %rax
+	notq %rax
+	andl $5, %eax
+	jmp .Lpoly1305_finish_ext_avx2_39
+.Lpoly1305_finish_ext_avx2_41:
+	movl (%rdx), %esi
+	movl %esi, (%rax)
+	movl 8(%rdx), %esi
+	movl %esi, 32(%rax)
+	movl 16(%rdx), %esi
+	movl %esi, 64(%rax)
+	movl 24(%rdx), %esi
+	movl %esi, 96(%rax)
+	movl 32(%rdx), %esi
+	movl %esi, 128(%rax)
+	addq $1, %rcx
+	subq $40, %rdx
+	addq $8, %rax
+	movq %rcx, %rsi
+	subq %rdi, %rsi
+	cmpq %rsi, %r9
+	ja .Lpoly1305_finish_ext_avx2_41
+	cmpq $3, %rcx
+	ja .Lpoly1305_finish_ext_avx2_42
+	leaq 160(%rbx,%rcx,8), %rax
+.Lpoly1305_finish_ext_avx2_43:
+	movl $1, (%rax)
+	movl $0, 32(%rax)
+	movl $0, 64(%rax)
+	movl $0, 96(%rax)
+	movl $0, 128(%rax)
+	addq $1, %rcx
+	addq $8, %rax
+	cmpq $4, %rcx
+	jne .Lpoly1305_finish_ext_avx2_43
+.Lpoly1305_finish_ext_avx2_42:
+	orq $96, %r8
+	movq %r8, 320(%rbx)
+	vpxor %ymm0, %ymm0, %ymm0
+	vmovdqa %ymm0, (%rsp)
+	vmovdqa %ymm0, 32(%rsp)
+	movl $64, %edx
+	movq %rsp, %rsi
+	movq %rbx, %rdi
+	call .Lpoly1305_blocks_avx2_local
+.Lpoly1305_finish_ext_avx2_38:
+	movq 8(%rbx), %rax
+	movq %rax, %rdx
+	salq $44, %rdx
+	orq (%rbx), %rdx
+	shrq $20, %rax
+	movl $24, %edi
+	shlx %rdi, 16(%rbx), %rcx
+	orq %rcx, %rax
+	movl 292(%rbx), %ecx
+	salq $32, %rcx
+	movl 284(%rbx), %esi
+	orq %rsi, %rcx
+	movl 308(%rbx), %esi
+	salq $32, %rsi
+	movl 300(%rbx), %edi
+	orq %rdi, %rsi
+	addq %rcx, %rdx
+	adcq %rsi, %rax
+	movq %rdx, (%r12)
+	movq %rax, 8(%r12)
+	vpxor %xmm0, %xmm0, %xmm0
+	vmovdqu %ymm0, (%rbx)
+	vmovdqu %ymm0, 32(%rbx)
+	vmovdqu %ymm0, 64(%rbx)
+	vmovdqu %ymm0, 96(%rbx)
+	vmovdqu %ymm0, 128(%rbx)
+	vmovdqu %ymm0, 160(%rbx)
+	vmovdqu %ymm0, 192(%rbx)
+	vmovdqu %ymm0, 224(%rbx)
+	jmp .Lpoly1305_finish_ext_avx2_49
+.Lpoly1305_finish_ext_avx2_46:
+	movl $3, %r9d
+	movl $1, %edi
+	movl $10, %eax
+	jmp .Lpoly1305_finish_ext_avx2_39
+.Lpoly1305_finish_ext_avx2_47:
+	movl $3, %r9d
+	movl $0, %edi
+	movl $10, %eax
+.Lpoly1305_finish_ext_avx2_39:
+	leaq 164(%rbx,%rax,8), %rdx
+	leaq 160(%rbx,%rdi,8), %rax
+	movq %rdi, %rcx
+	jmp .Lpoly1305_finish_ext_avx2_41
+.Lpoly1305_finish_ext_avx2_49:
+	movq %rbp, %rax
+	subq %rsp, %rax
+	leaq -24(%rbp), %rsp
+	vzeroall
+	popq %rbx
+	popq %r12
+	popq %r13
+	popq %rbp
+	addq $(8*5), %rax
+	ret
+.size _gcry_poly1305_amd64_avx2_finish_ext,.-_gcry_poly1305_amd64_avx2_finish_ext;
+
+#endif
diff --git a/cipher/poly1305-internal.h b/cipher/poly1305-internal.h
index fa3fe75..0299c43 100644
--- a/cipher/poly1305-internal.h
+++ b/cipher/poly1305-internal.h
@@ -54,23 +54,40 @@
 #endif
 
 
+/* POLY1305_USE_AVX2 indicates whether to compile with AMD64 AVX2 code. */
+#undef POLY1305_USE_AVX2
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) && \
+    defined(ENABLE_AVX2_SUPPORT)
+# define POLY1305_USE_AVX2 1
+# define POLY1305_AVX2_BLOCKSIZE 64
+# define POLY1305_AVX2_STATESIZE 328
+# define POLY1305_AVX2_ALIGNMENT 32
+#endif
+
+
 /* Largest block-size used in any implementation (optimized implementations
  * might use block-size multiple of 16). */
-#ifdef POLY1305_USE_SSE2
+#ifdef POLY1305_USE_AVX2
+# define POLY1305_LARGEST_BLOCKSIZE POLY1305_AVX2_BLOCKSIZE
+#elif defined(POLY1305_USE_SSE2)
 # define POLY1305_LARGEST_BLOCKSIZE POLY1305_SSE2_BLOCKSIZE
 #else
 # define POLY1305_LARGEST_BLOCKSIZE POLY1305_REF_BLOCKSIZE
 #endif
 
 /* Largest state-size used in any implementation. */
-#ifdef POLY1305_USE_SSE2
+#ifdef POLY1305_USE_AVX2
+# define POLY1305_LARGEST_STATESIZE POLY1305_AVX2_STATESIZE
+#elif defined(POLY1305_USE_SSE2)
 # define POLY1305_LARGEST_STATESIZE POLY1305_SSE2_STATESIZE
 #else
 # define POLY1305_LARGEST_STATESIZE POLY1305_REF_STATESIZE
 #endif
 
 /* Minimum alignment for state pointer passed to implementations. */
-#ifdef POLY1305_USE_SSE2
+#ifdef POLY1305_USE_AVX2
+# define POLY1305_STATE_ALIGNMENT POLY1305_AVX2_ALIGNMENT
+#elif defined(POLY1305_USE_SSE2)
 # define POLY1305_STATE_ALIGNMENT POLY1305_SSE2_ALIGNMENT
 #else
 # define POLY1305_STATE_ALIGNMENT POLY1305_REF_ALIGNMENT
diff --git a/cipher/poly1305.c b/cipher/poly1305.c
index cd1902a..fe241c1 100644
--- a/cipher/poly1305.c
+++ b/cipher/poly1305.c
@@ -57,6 +57,25 @@ static const poly1305_ops_t poly1305_amd64_sse2_ops = {
 #endif
 
 
+#ifdef POLY1305_USE_AVX2
+
+void _gcry_poly1305_amd64_avx2_init_ext(void *state, const poly1305_key_t *key);
+unsigned int _gcry_poly1305_amd64_avx2_finish_ext(void *state, const byte *m,
+						  size_t remaining,
+						  byte mac[16]);
+unsigned int _gcry_poly1305_amd64_avx2_blocks(void *ctx, const byte *m,
+					      size_t bytes);
+
+static const poly1305_ops_t poly1305_amd64_avx2_ops = {
+  POLY1305_AVX2_BLOCKSIZE,
+  _gcry_poly1305_amd64_avx2_init_ext,
+  _gcry_poly1305_amd64_avx2_blocks,
+  _gcry_poly1305_amd64_avx2_finish_ext
+};
+
+#endif
+
+
 #ifdef HAVE_U64_TYPEDEF
 
 /* Reference unoptimized poly1305 implementation using 32 bit * 32 bit = 64 bit
@@ -616,6 +635,7 @@ _gcry_poly1305_init (poly1305_context_t * ctx, const byte * key,
   static int initialized;
   static const char *selftest_failed;
   poly1305_key_t keytmp;
+  unsigned int features = _gcry_get_hw_features ();
 
   if (!initialized)
     {
@@ -637,6 +657,12 @@ _gcry_poly1305_init (poly1305_context_t * ctx, const byte * key,
   ctx->ops = &poly1305_default_ops;
 #endif
 
+#ifdef POLY1305_USE_AVX2
+  if (features & HWF_INTEL_AVX2)
+    ctx->ops = &poly1305_amd64_avx2_ops;
+#endif
+  (void)features;
+
   buf_cpy (keytmp.b, key, POLY1305_KEYLEN);
   poly1305_init (ctx, &keytmp);
 
diff --git a/configure.ac b/configure.ac
index 4dc36d5..47a322b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1825,6 +1825,7 @@ case "${host}" in
    x86_64-*-*)
       # Build with the assembly implementation
       GCRYPT_CIPHERS="$GCRYPT_CIPHERS poly1305-sse2-amd64.lo"
+      GCRYPT_CIPHERS="$GCRYPT_CIPHERS poly1305-avx2-amd64.lo"
    ;;
 esac
 

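The run-time selection added to `_gcry_poly1305_init` in the commit above can be pictured with a small stand-alone sketch. This is not libgcrypt code: `demo_ops_t`, `DEMO_HWF_AVX2` and `demo_select_ops` are hypothetical names used only for illustration, and the reference block size of 16 is an assumption. The real code uses `poly1305_ops_t`, `_gcry_get_hw_features ()` and `HWF_INTEL_AVX2` as shown in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware feature bit (not libgcrypt's value). */
#define DEMO_HWF_AVX2 (1u << 0)

/* Hypothetical, trimmed-down ops table: block size plus a label. */
typedef struct
{
  size_t block_size;
  const char *name;
} demo_ops_t;

static const demo_ops_t demo_default_ops = { 16, "ref" };  /* assumed size */
static const demo_ops_t demo_sse2_ops    = { 32, "sse2" }; /* POLY1305_SSE2_BLOCKSIZE */
static const demo_ops_t demo_avx2_ops    = { 64, "avx2" }; /* POLY1305_AVX2_BLOCKSIZE */

/* Start from the build-time default, then let a run-time feature bit
 * override it, mirroring the order of the checks in the diff. */
static const demo_ops_t *
demo_select_ops (unsigned int features, int have_sse2_build)
{
  const demo_ops_t *ops = have_sse2_build ? &demo_sse2_ops : &demo_default_ops;

  if (features & DEMO_HWF_AVX2)
    ops = &demo_avx2_ops;

  assert (ops != NULL);
  return ops;
}
```

Because the AVX2 check comes last, it takes precedence over the SSE2 table whenever the feature bit is reported at run time; a build without AVX2 support simply never compiles that branch.
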
commit 297532602ed2d881d8fdc393d1961068a143a891
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Sun May 11 20:18:49 2014 +0300

    poly1305: add AMD64/SSE2 optimized implementation
    
    * cipher/Makefile.am: Add 'poly1305-sse2-amd64.S'.
    * cipher/poly1305-internal.h (POLY1305_USE_SSE2)
    (POLY1305_SSE2_BLOCKSIZE, POLY1305_SSE2_STATESIZE)
    (POLY1305_SSE2_ALIGNMENT): New.
    (POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
    (POLY1305_STATE_ALIGNMENT): Use SSE2 versions when needed.
    * cipher/poly1305-sse2-amd64.S: New.
    * cipher/poly1305.c [POLY1305_USE_SSE2]
    (_gcry_poly1305_amd64_sse2_init_ext)
    (_gcry_poly1305_amd64_sse2_finish_ext)
    (_gcry_poly1305_amd64_sse2_blocks, poly1305_amd64_sse2_ops): New.
    (_gcry_poly1305_init) [POLY1305_USE_SSE2]: Use SSE2 version.
    * configure.ac [host=x86_64]: Add 'poly1305-sse2-amd64.lo'.
    --
    
    Add Andrew Moon's public domain SSE2 implementation of Poly1305. Original
    source is available at: https://github.com/floodyberry/poly1305-opt
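
As a reading aid for the assembly in this patch, the following C sketch shows what the start of the SSE2 `init_ext` routine appears to do with the first 16 key bytes: clamp `r` while splitting it into 44/44/42-bit limbs, then repack those into five 26-bit limbs for the vector code. The masks and shifts are taken from the assembly; the function names `demo_load_le64` and `demo_clamp_r_to_limbs` are hypothetical and not part of libgcrypt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load 8 bytes as a little-endian 64-bit word (portable byte-wise load). */
static uint64_t
demo_load_le64 (const unsigned char *p)
{
  uint64_t v = 0;
  int i;

  for (i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}

/* Clamp the 16-byte r part of the key and split it into five 26-bit limbs. */
static void
demo_clamp_r_to_limbs (const unsigned char key[16], uint32_t limb[5])
{
  const uint64_t mask26 = 0x3ffffff;
  uint64_t t0, t1, r0, r1, r2;

  assert (key != NULL && limb != NULL);

  t0 = demo_load_le64 (key);
  t1 = demo_load_le64 (key + 8);

  /* 44/44/42-bit limbs; the Poly1305 clamp is folded into the masks. */
  r0 = t0 & 0xffc0fffffffULL;
  r1 = ((t0 >> 44) | (t1 << 20)) & 0xfffffc0ffffULL;
  r2 = (t1 >> 24) & 0x00ffffffc0fULL;

  /* Repack bits 0..129 of (r0, r1, r2) into 26-bit pieces. */
  limb[0] = (uint32_t) (r0 & mask26);
  limb[1] = (uint32_t) (((r0 >> 26) | (r1 << 18)) & mask26);
  limb[2] = (uint32_t) ((r1 >> 8) & mask26);
  limb[3] = (uint32_t) (((r1 >> 34) | (r2 << 10)) & mask26);
  limb[4] = (uint32_t) (r2 >> 16);
}
```

The five results correspond to the values the assembly stores at offsets 84, 92, 100, 108 and 116 of the state.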
    
    Benchmarks on Intel i5-4570 (haswell):
    
    Old:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |     0.844 ns/B    1130.2 MiB/s      2.70 c/B
    
    New:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |     0.448 ns/B    2129.5 MiB/s      1.43 c/B
    
    Benchmarks on Intel i5-2450M (sandy-bridge):
    
    Old:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |      1.25 ns/B     763.0 MiB/s      3.12 c/B
    
    New:
                        |  nanosecs/byte   mebibytes/sec   cycles/byte
     POLY1305           |     0.605 ns/B    1575.9 MiB/s      1.51 c/B
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
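
The benchmark tables above can be compared with a little arithmetic. The helpers below are a hypothetical sketch (the names `demo_speedup` and `demo_mib_per_sec` are not part of libgcrypt or its benchmark tool): one divides the old time per byte by the new one, the other converts nanoseconds per byte into MiB/s.

```c
#include <assert.h>

/* Speed-up factor: old time per byte divided by new time per byte. */
static double
demo_speedup (double old_ns_per_byte, double new_ns_per_byte)
{
  assert (new_ns_per_byte > 0.0);
  return old_ns_per_byte / new_ns_per_byte;
}

/* Throughput in MiB/s derived from nanoseconds per byte. */
static double
demo_mib_per_sec (double ns_per_byte)
{
  assert (ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}
```

With the Haswell figures, `demo_speedup (0.844, 0.448)` gives roughly 1.88x, and the Sandy Bridge figures (1.25 and 0.605 ns/B) give roughly 2.07x. Throughput recomputed from the rounded ns/B column differs slightly from the MiB/s column in the tables, since the tables were presumably produced from unrounded timings.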

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 4468647..a32ae89 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -72,6 +72,7 @@ gost28147.c gost.h \
 gostr3411-94.c \
 md4.c \
 md5.c \
+poly1305-sse2-amd64.S \
 rijndael.c rijndael-tables.h rijndael-amd64.S rijndael-arm.S \
 rmd160.c \
 rsa.c \
diff --git a/cipher/poly1305-internal.h b/cipher/poly1305-internal.h
index d2c6b5c..fa3fe75 100644
--- a/cipher/poly1305-internal.h
+++ b/cipher/poly1305-internal.h
@@ -44,15 +44,37 @@
 #define POLY1305_REF_ALIGNMENT sizeof(void *)
 
 
+/* POLY1305_USE_SSE2 indicates whether to compile with AMD64 SSE2 code. */
+#undef POLY1305_USE_SSE2
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)
+# define POLY1305_USE_SSE2 1
+# define POLY1305_SSE2_BLOCKSIZE 32
+# define POLY1305_SSE2_STATESIZE 248
+# define POLY1305_SSE2_ALIGNMENT 16
+#endif
+
+
 /* Largest block-size used in any implementation (optimized implementations
  * might use block-size multiple of 16). */
-#define POLY1305_LARGEST_BLOCKSIZE POLY1305_REF_BLOCKSIZE
+#ifdef POLY1305_USE_SSE2
+# define POLY1305_LARGEST_BLOCKSIZE POLY1305_SSE2_BLOCKSIZE
+#else
+# define POLY1305_LARGEST_BLOCKSIZE POLY1305_REF_BLOCKSIZE
+#endif
 
 /* Largest state-size used in any implementation. */
-#define POLY1305_LARGEST_STATESIZE POLY1305_REF_STATESIZE
+#ifdef POLY1305_USE_SSE2
+# define POLY1305_LARGEST_STATESIZE POLY1305_SSE2_STATESIZE
+#else
+# define POLY1305_LARGEST_STATESIZE POLY1305_REF_STATESIZE
+#endif
 
 /* Minimum alignment for state pointer passed to implementations. */
-#define POLY1305_STATE_ALIGNMENT POLY1305_REF_ALIGNMENT
+#ifdef POLY1305_USE_SSE2
+# define POLY1305_STATE_ALIGNMENT POLY1305_SSE2_ALIGNMENT
+#else
+# define POLY1305_STATE_ALIGNMENT POLY1305_REF_ALIGNMENT
+#endif
 
 
 typedef struct poly1305_key_s
diff --git a/cipher/poly1305-sse2-amd64.S b/cipher/poly1305-sse2-amd64.S
new file mode 100644
index 0000000..106b119
--- /dev/null
+++ b/cipher/poly1305-sse2-amd64.S
@@ -0,0 +1,1035 @@
+/* poly1305-sse2-amd64.S  -  AMD64/SSE2 implementation of Poly1305
+ *
+ * Copyright (C) 2014 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Based on public domain implementation by Andrew Moon at
+ *  https://github.com/floodyberry/poly1305-opt
+ */
+
+#include <config.h>
+
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)
+
+.text
+
+
+.align 8
+.globl _gcry_poly1305_amd64_sse2_init_ext
+.type  _gcry_poly1305_amd64_sse2_init_ext,@function;
+_gcry_poly1305_amd64_sse2_init_ext:
+.Lpoly1305_init_ext_x86_local:
+	xor %edx, %edx
+	pushq %r12
+	pushq %r13
+	pushq %r14
+	movq %rdx, %r10
+	movq $-1, %rcx
+	testq %r10, %r10
+	pxor %xmm0, %xmm0
+	movq $0xfffffc0ffff, %r9
+	movdqa %xmm0, (%rdi)
+	cmove %rcx, %r10
+	movdqa %xmm0, 16(%rdi)
+	movq $0xffc0fffffff, %rcx
+	movdqa %xmm0, 32(%rdi)
+	movdqa %xmm0, 48(%rdi)
+	movdqa %xmm0, 64(%rdi)
+	movq 8(%rsi), %r11
+	movq %r11, %r8
+	movq (%rsi), %r12
+	andq %r12, %rcx
+	shrq $44, %r12
+	shlq $20, %r8
+	shrq $24, %r11
+	orq %r8, %r12
+	movq $0xffffffc0f, %r8
+	andq %r9, %r12
+	andq %r8, %r11
+	movl %ecx, %r8d
+	andl $67108863, %r8d
+	movq %rcx, %r9
+	movl %r8d, 84(%rdi)
+	movq %r12, %r8
+	shrq $26, %r9
+	shlq $18, %r8
+	orq %r8, %r9
+	movq %r12, %r8
+	shrq $8, %r8
+	andl $67108863, %r9d
+	andl $67108863, %r8d
+	movl %r9d, 92(%rdi)
+	movq %r12, %r9
+	movl %r8d, 100(%rdi)
+	movq %r11, %r8
+	shrq $34, %r9
+	shlq $10, %r8
+	orq %r8, %r9
+	movq %r11, %r8
+	shrq $16, %r8
+	andl $67108863, %r9d
+	movl %r9d, 108(%rdi)
+	cmpq $16, %r10
+	movl %r8d, 116(%rdi)
+	movl 16(%rsi), %r8d
+	movl %r8d, 124(%rdi)
+	movl 20(%rsi), %r8d
+	movl %r8d, 132(%rdi)
+	movl 24(%rsi), %r8d
+	movl %r8d, 140(%rdi)
+	movl 28(%rsi), %esi
+	movl %esi, 148(%rdi)
+	jbe .Lpoly1305_init_ext_sse2_done
+	lea (%r11,%r11,4), %r14
+	shlq $2, %r14
+	lea (%r12,%r12), %rax
+	mulq %r14
+	movq %rax, %r13
+	movq %rcx, %rax
+	movq %rdx, %r8
+	mulq %rcx
+	addq %rax, %r13
+	lea (%rcx,%rcx), %rax
+	movq %r13, %r9
+	adcq %rdx, %r8
+	mulq %r12
+	shlq $20, %r8
+	movq %rax, %rsi
+	shrq $44, %r9
+	movq %r11, %rax
+	orq %r9, %r8
+	movq %rdx, %r9
+	mulq %r14
+	addq %rax, %rsi
+	movq %rcx, %rax
+	adcq %rdx, %r9
+	addq %r11, %r11
+	mulq %r11
+	addq %rsi, %r8
+	movq %rax, %r11
+	movq %r12, %rax
+	movq %rdx, %rcx
+	adcq $0, %r9
+	mulq %r12
+	addq %rax, %r11
+	movq %r8, %rsi
+	adcq %rdx, %rcx
+	shlq $20, %r9
+	shrq $44, %rsi
+	orq %rsi, %r9
+	movq $0xfffffffffff, %rsi
+	addq %r11, %r9
+	movq %r9, %r12
+	adcq $0, %rcx
+	andq %rsi, %r13
+	shlq $22, %rcx
+	andq %rsi, %r8
+	shrq $42, %r12
+	orq %r12, %rcx
+	movq %rsi, %r12
+	lea (%rcx,%rcx,4), %rcx
+	addq %rcx, %r13
+	movq %rsi, %rcx
+	andq %r13, %rcx
+	shrq $44, %r13
+	movq %rcx, %r14
+	addq %r13, %r8
+	movq $0x3ffffffffff, %r13
+	andq %r8, %r12
+	andq %r13, %r9
+	shrq $44, %r8
+	movq %r12, %r11
+	addq %r8, %r9
+	movq %r12, %rax
+	movq %r9, %r13
+	movl %ecx, %r8d
+	shrq $26, %r14
+	andl $67108863, %r8d
+	shlq $18, %r11
+	shrq $34, %rax
+	orq %r11, %r14
+	shlq $10, %r13
+	movq %r12, %r11
+	orq %r13, %rax
+	movq %r9, %r13
+	shrq $8, %r11
+	shrq $16, %r13
+	andl $67108863, %r14d
+	andl $67108863, %r11d
+	andl $67108863, %eax
+	movl %r8d, 88(%rdi)
+	cmpq $64, %r10
+	movl %r8d, 80(%rdi)
+	movl %r14d, 104(%rdi)
+	movl %r14d, 96(%rdi)
+	movl %r11d, 120(%rdi)
+	movl %r11d, 112(%rdi)
+	movl %eax, 136(%rdi)
+	movl %eax, 128(%rdi)
+	movl %r13d, 152(%rdi)
+	movl %r13d, 144(%rdi)
+	jbe .Lpoly1305_init_ext_sse2_done
+	lea (%r9,%r9,4), %r14
+	shlq $2, %r14
+	lea (%r12,%r12), %rax
+	mulq %r14
+	movq %rax, %r8
+	movq %rcx, %rax
+	movq %rdx, %r10
+	mulq %rcx
+	addq %rax, %r8
+	lea (%rcx,%rcx), %rax
+	movq %r8, %r11
+	adcq %rdx, %r10
+	andq %rsi, %r8
+	mulq %r12
+	shlq $20, %r10
+	movq %rax, %r13
+	shrq $44, %r11
+	movq %r9, %rax
+	orq %r11, %r10
+	movq %rdx, %r11
+	mulq %r14
+	addq %rax, %r13
+	movq %rcx, %rax
+	adcq %rdx, %r11
+	addq %r9, %r9
+	mulq %r9
+	addq %r13, %r10
+	movq %rax, %r9
+	movq %r12, %rax
+	movq %rdx, %rcx
+	adcq $0, %r11
+	mulq %r12
+	addq %rax, %r9
+	movq %r10, %r13
+	adcq %rdx, %rcx
+	andq %rsi, %r10
+	shlq $20, %r11
+	shrq $44, %r13
+	orq %r13, %r11
+	addq %r9, %r11
+	movq %rsi, %r9
+	movq %r11, %r12
+	adcq $0, %rcx
+	shlq $22, %rcx
+	shrq $42, %r12
+	orq %r12, %rcx
+	lea (%rcx,%rcx,4), %rcx
+	addq %rcx, %r8
+	andq %r8, %r9
+	shrq $44, %r8
+	movl %r9d, %eax
+	addq %r8, %r10
+	movq $0x3ffffffffff, %r8
+	andq %r10, %rsi
+	andq %r8, %r11
+	shrq $44, %r10
+	movq %rsi, %r8
+	addq %r10, %r11
+	andl $67108863, %eax
+	shrq $26, %r9
+	movq %r11, %r10
+	shlq $18, %r8
+	shlq $10, %r10
+	orq %r8, %r9
+	movq %rsi, %r8
+	shrq $34, %rsi
+	andl $67108863, %r9d
+	shrq $8, %r8
+	orq %r10, %rsi
+	shrq $16, %r11
+	andl $67108863, %r8d
+	andl $67108863, %esi
+	movl %eax, 168(%rdi)
+	movl %eax, 160(%rdi)
+	movl %r9d, 184(%rdi)
+	movl %r9d, 176(%rdi)
+	movl %r8d, 200(%rdi)
+	movl %r8d, 192(%rdi)
+	movl %esi, 216(%rdi)
+	movl %esi, 208(%rdi)
+	movl %r11d, 232(%rdi)
+	movl %r11d, 224(%rdi)
+.Lpoly1305_init_ext_sse2_done:
+	movq $0, 240(%rdi)
+	popq %r14
+	popq %r13
+	popq %r12
+	ret
+.size _gcry_poly1305_amd64_sse2_init_ext,.-_gcry_poly1305_amd64_sse2_init_ext;
+
+
+.align 8
+.globl _gcry_poly1305_amd64_sse2_finish_ext
+.type  _gcry_poly1305_amd64_sse2_finish_ext,@function;
+_gcry_poly1305_amd64_sse2_finish_ext:
+.Lpoly1305_finish_ext_x86_local:
+	pushq %rbp
+	movq %rsp, %rbp
+	subq $64, %rsp
+	andq $~63, %rsp
+	movq %rdx, 32(%rsp)
+	movq %rcx, 40(%rsp)
+	andq %rdx, %rdx
+	jz .Lpoly1305_finish_x86_no_leftover
+	pxor %xmm0, %xmm0
+	movdqa %xmm0, 0+0(%rsp)
+	movdqa %xmm0, 16+0(%rsp)
+	leaq 0(%rsp), %r8
+	testq $16, %rdx
+	jz .Lpoly1305_finish_x86_skip16
+	movdqu 0(%rsi), %xmm0
+	movdqa %xmm0, 0(%r8)
+	addq $16, %rsi
+	addq $16, %r8
+.Lpoly1305_finish_x86_skip16:
+	testq $8, %rdx
+	jz .Lpoly1305_finish_x86_skip8
+	movq 0(%rsi), %rax
+	movq %rax, 0(%r8)
+	addq $8, %rsi
+	addq $8, %r8
+.Lpoly1305_finish_x86_skip8:
+	testq $4, %rdx
+	jz .Lpoly1305_finish_x86_skip4
+	movl 0(%rsi), %eax
+	movl %eax, 0(%r8)
+	addq $4, %rsi
+	addq $4, %r8
+.Lpoly1305_finish_x86_skip4:
+	testq $2, %rdx
+	jz .Lpoly1305_finish_x86_skip2
+	movw 0(%rsi), %ax
+	movw %ax, 0(%r8)
+	addq $2, %rsi
+	addq $2, %r8
+.Lpoly1305_finish_x86_skip2:
+	testq $1, %rdx
+	jz .Lpoly1305_finish_x86_skip1
+	movb 0(%rsi), %al
+	movb %al, 0(%r8)
+	addq $1, %r8
+.Lpoly1305_finish_x86_skip1:
+	cmpq $16, %rdx
+	je .Lpoly1305_finish_x86_is16
+	movb $1, 0(%r8)
+.Lpoly1305_finish_x86_is16:
+	movq $4, %rax
+	jae .Lpoly1305_finish_x86_16andover
+	movq $8, %rax
+.Lpoly1305_finish_x86_16andover:
+	orq %rax, 240(%rdi)
+	leaq 0(%rsp), %rsi
+	movq $32, %rdx
+	callq .Lpoly1305_blocks_x86_local
+.Lpoly1305_finish_x86_no_leftover:
+	testq $1, 240(%rdi)
+	jz .Lpoly1305_finish_x86_not_started
+	movq 32(%rsp), %rdx
+	andq %rdx, %rdx
+	jz .Lpoly1305_finish_x86_r2r
+	cmpq $16, %rdx
+	jg .Lpoly1305_finish_x86_r2r
+	xorl %r10d, %r10d
+	movl 84(%rdi), %eax
+	movl 92(%rdi), %ecx
+	movl 100(%rdi), %edx
+	movl 108(%rdi), %r8d
+	movl 116(%rdi), %r9d
+	movl %eax, 80(%rdi)
+	movl $1, 8+80(%rdi)
+	movl %ecx, 96(%rdi)
+	movl %r10d, 8+96(%rdi)
+	movl %edx, 112(%rdi)
+	movl %r10d, 8+112(%rdi)
+	movl %r8d, 128(%rdi)
+	movl %r10d, 8+128(%rdi)
+	movl %r9d, 144(%rdi)
+	movl %r10d, 8+144(%rdi)
+	jmp .Lpoly1305_finish_x86_combine
+.Lpoly1305_finish_x86_r2r:
+	movl 84(%rdi), %eax
+	movl 92(%rdi), %ecx
+	movl 100(%rdi), %edx
+	movl 108(%rdi), %r8d
+	movl 116(%rdi), %r9d
+	movl %eax, 8+80(%rdi)
+	movl %ecx, 8+96(%rdi)
+	movl %edx, 8+112(%rdi)
+	movl %r8d, 8+128(%rdi)
+	movl %r9d, 8+144(%rdi)
+.Lpoly1305_finish_x86_combine:
+	xorq %rsi, %rsi
+	movq $32, %rdx
+	callq .Lpoly1305_blocks_x86_local
+.Lpoly1305_finish_x86_not_started:
+	movq 0(%rdi), %r8
+	movq 8(%rdi), %r9
+	movq %r9, %r10
+	movq 16(%rdi), %r11
+	shlq $44, %r9
+	shrq $20, %r10
+	shlq $24, %r11
+	orq %r9, %r8
+	orq %r11, %r10
+	pxor %xmm0, %xmm0
+	movl 124(%rdi), %eax
+	movl 132(%rdi), %ecx
+	movl 140(%rdi), %edx
+	movl 148(%rdi), %esi
+	movq 40(%rsp), %r11
+	shlq $32, %rcx
+	shlq $32, %rsi
+	orq %rcx, %rax
+	orq %rsi, %rdx
+	addq %r8, %rax
+	adcq %r10, %rdx
+	movq %rax, 0(%r11)
+	movq %rdx, 8(%r11)
+	movq %rbp, %rax
+	subq %rsp, %rax
+	movq %rbp, %rsp
+	movdqa %xmm0, 0(%rdi)
+	movdqa %xmm0, 16(%rdi)
+	movdqa %xmm0, 32(%rdi)
+	movdqa %xmm0, 48(%rdi)
+	movdqa %xmm0, 64(%rdi)
+	movdqa %xmm0, 80(%rdi)
+	movdqa %xmm0, 96(%rdi)
+	movdqa %xmm0, 112(%rdi)
+	movdqa %xmm0, 128(%rdi)
+	movdqa %xmm0, 144(%rdi)
+	movdqa %xmm0, 160(%rdi)
+	movdqa %xmm0, 176(%rdi)
+	movdqa %xmm0, 192(%rdi)
+	movdqa %xmm0, 208(%rdi)
+	movdqa %xmm0, 224(%rdi)
+	popq %rbp
+	addq $8, %rax
+	ret
+.size _gcry_poly1305_amd64_sse2_finish_ext,.-_gcry_poly1305_amd64_sse2_finish_ext;
+
+
+.align 8
+.globl _gcry_poly1305_amd64_sse2_blocks
+.type  _gcry_poly1305_amd64_sse2_blocks,@function;
+_gcry_poly1305_amd64_sse2_blocks:
+.Lpoly1305_blocks_x86_local:
+	pushq %rbp
+	movq %rsp, %rbp
+	pushq %rbx
+	andq $-64, %rsp
+	subq $328, %rsp
+	movq 240(%rdi), %rax
+	movl $(1<<24), %r8d
+	movl $((1<<26)-1), %r9d
+	movd %r8, %xmm0
+	movd %r9, %xmm5
+	pshufd $0x44, %xmm0, %xmm0
+	pshufd $0x44, %xmm5, %xmm5
+	testb $4, %al
+	je .Lpoly1305_blocks_x86_3
+	psrldq $8, %xmm0
+.Lpoly1305_blocks_x86_3:
+	testb $8, %al
+	je .Lpoly1305_blocks_x86_4
+	pxor %xmm0, %xmm0
+.Lpoly1305_blocks_x86_4:
+	movdqa %xmm0, 168(%rsp)
+	testb $1, %al
+	jne .Lpoly1305_blocks_x86_5
+	movq 16(%rsi), %xmm0
+	movdqa %xmm5, %xmm7
+	movdqa %xmm5, %xmm10
+	movq (%rsi), %xmm6
+	orq $1, %rax
+	subq $32, %rdx
+	movq 8(%rsi), %xmm1
+	punpcklqdq %xmm0, %xmm6
+	movq 24(%rsi), %xmm0
+	pand %xmm6, %xmm7
+	movdqa %xmm6, %xmm9
+	psrlq $52, %xmm6
+	addq $32, %rsi
+	punpcklqdq %xmm0, %xmm1
+	movdqa %xmm1, %xmm0
+	psrlq $26, %xmm9
+	psllq $12, %xmm0
+	movq %rax, 240(%rdi)
+	pand %xmm5, %xmm9
+	por %xmm0, %xmm6
+	psrlq $40, %xmm1
+	pand %xmm6, %xmm10
+	por 168(%rsp), %xmm1
+	psrlq $26, %xmm6
+	pand %xmm5, %xmm6
+.Lpoly1305_blocks_x86_6:
+	movdqa 80(%rdi), %xmm13
+	cmpq $63, %rdx
+	movl $(5), %r8d
+	movd %r8, %xmm14
+	pshufd $0x44, %xmm14, %xmm14
+	movdqa 96(%rdi), %xmm15
+	movdqa %xmm13, -8(%rsp)
+	movdqa 112(%rdi), %xmm0
+	movdqa %xmm14, 136(%rsp)
+	movdqa 128(%rdi), %xmm3
+	movdqa %xmm15, 312(%rsp)
+	pmuludq %xmm14, %xmm15
+	movdqa 144(%rdi), %xmm13
+	movdqa %xmm0, 232(%rsp)
+	pmuludq %xmm14, %xmm0
+	movdqa %xmm3, 152(%rsp)
+	pmuludq %xmm14, %xmm3
+	movdqa %xmm13, 56(%rsp)
+	pmuludq %xmm14, %xmm13
+	movdqa %xmm15, 40(%rsp)
+	movdqa %xmm0, -24(%rsp)
+	movdqa %xmm3, -40(%rsp)
+	movdqa %xmm13, -56(%rsp)
+	jbe .Lpoly1305_blocks_x86_7
+	movdqa 192(%rdi), %xmm15
+	leaq 32(%rsi), %rax
+	movq %rdx, %rcx
+	movdqa 176(%rdi), %xmm14
+	movdqa %xmm15, %xmm2
+	movdqa 208(%rdi), %xmm0
+	movdqa %xmm15, 216(%rsp)
+	movdqa %xmm14, 296(%rsp)
+	movdqa 224(%rdi), %xmm3
+	pmuludq 136(%rsp), %xmm14
+	movdqa -24(%rsp), %xmm13
+	movdqa %xmm14, 8(%rsp)
+	pmuludq 136(%rsp), %xmm2
+	movdqa -40(%rsp), %xmm14
+	movdqa %xmm0, 120(%rsp)
+	pmuludq 136(%rsp), %xmm0
+	movdqa %xmm3, 24(%rsp)
+	movdqa 160(%rdi), %xmm12
+	movdqa %xmm0, %xmm8
+	movdqa -56(%rsp), %xmm15
+	movdqa %xmm13, 88(%rsp)
+	pmuludq 136(%rsp), %xmm3
+	movdqa %xmm2, 104(%rsp)
+	movdqa %xmm0, %xmm13
+	movdqa -8(%rsp), %xmm11
+	movdqa %xmm3, 280(%rsp)
+	movdqa %xmm2, %xmm3
+	movdqa %xmm0, 200(%rsp)
+	movdqa %xmm14, 184(%rsp)
+	movdqa %xmm15, 264(%rsp)
+	jmp .Lpoly1305_blocks_x86_8
+.p2align 6,,63
+.Lpoly1305_blocks_x86_13:
+	movdqa 200(%rsp), %xmm13
+	movdqa %xmm3, %xmm6
+	movdqa 200(%rsp), %xmm8
+	movdqa 104(%rsp), %xmm3
+.Lpoly1305_blocks_x86_8:
+	movdqa 8(%rsp), %xmm4
+	pmuludq %xmm6, %xmm3
+	subq $64, %rcx
+	pmuludq %xmm10, %xmm8
+	movdqa 104(%rsp), %xmm2
+	movdqa 200(%rsp), %xmm0
+	pmuludq %xmm1, %xmm4
+	movdqa 280(%rsp), %xmm15
+	pmuludq %xmm6, %xmm13
+	movdqa 280(%rsp), %xmm14
+	pmuludq %xmm1, %xmm0
+	paddq %xmm3, %xmm4
+	pmuludq %xmm1, %xmm2
+	movdqa 280(%rsp), %xmm3
+	paddq %xmm8, %xmm4
+	pmuludq %xmm9, %xmm15
+	movdqa 280(%rsp), %xmm8
+	pmuludq %xmm10, %xmm14
+	pmuludq %xmm6, %xmm8
+	paddq %xmm13, %xmm2
+	movdqa %xmm6, %xmm13
+	pmuludq %xmm1, %xmm3
+	paddq %xmm15, %xmm4
+	movdqa 296(%rsp), %xmm15
+	pmuludq %xmm12, %xmm13
+	paddq %xmm14, %xmm2
+	movdqa %xmm7, %xmm14
+	paddq %xmm8, %xmm0
+	pmuludq %xmm12, %xmm14
+	movdqa %xmm9, %xmm8
+	pmuludq 296(%rsp), %xmm6
+	pmuludq %xmm12, %xmm8
+	movdqa %xmm6, 248(%rsp)
+	pmuludq %xmm10, %xmm15
+	movq -16(%rax), %xmm6
+	paddq %xmm13, %xmm3
+	movdqa %xmm10, %xmm13
+	paddq %xmm14, %xmm4
+	movq -8(%rax), %xmm14
+	paddq %xmm8, %xmm2
+	movq -32(%rax), %xmm8
+	pmuludq %xmm12, %xmm13
+	paddq %xmm15, %xmm3
+	pmuludq %xmm12, %xmm1
+	movdqa 216(%rsp), %xmm15
+	pmuludq 216(%rsp), %xmm10
+	punpcklqdq %xmm6, %xmm8
+	movq -24(%rax), %xmm6
+	pmuludq %xmm9, %xmm15
+	paddq %xmm13, %xmm0
+	movdqa 296(%rsp), %xmm13
+	paddq 248(%rsp), %xmm1
+	punpcklqdq %xmm14, %xmm6
+	movdqa 296(%rsp), %xmm14
+	pmuludq %xmm9, %xmm13
+	pmuludq 120(%rsp), %xmm9
+	movdqa %xmm15, 72(%rsp)
+	paddq %xmm10, %xmm1
+	movdqa 216(%rsp), %xmm15
+	pmuludq %xmm7, %xmm14
+	movdqa %xmm6, %xmm10
+	paddq %xmm9, %xmm1
+	pmuludq %xmm7, %xmm15
+	paddq %xmm13, %xmm0
+	paddq 72(%rsp), %xmm3
+	movdqa 120(%rsp), %xmm13
+	psllq $12, %xmm10
+	paddq %xmm14, %xmm2
+	movdqa %xmm5, %xmm14
+	pand %xmm8, %xmm14
+	pmuludq %xmm7, %xmm13
+	paddq %xmm15, %xmm0
+	movdqa %xmm14, 248(%rsp)
+	movdqa %xmm8, %xmm14
+	psrlq $52, %xmm8
+	movdqu (%rax), %xmm9
+	por %xmm10, %xmm8
+	pmuludq 24(%rsp), %xmm7
+	movdqu 16(%rax), %xmm10
+	paddq %xmm13, %xmm3
+	pxor %xmm13, %xmm13
+	movdqa %xmm9, %xmm15
+	paddq %xmm7, %xmm1
+	movdqa %xmm6, %xmm7
+	movdqa %xmm10, -72(%rsp)
+	punpckldq %xmm10, %xmm15
+	movdqa %xmm15, %xmm10
+	punpckldq %xmm13, %xmm10
+	punpckhdq -72(%rsp), %xmm9
+	psrlq $40, %xmm6
+	movdqa %xmm10, 72(%rsp)
+	movdqa %xmm9, %xmm10
+	punpckhdq %xmm13, %xmm9
+	psllq $18, %xmm9
+	paddq 72(%rsp), %xmm4
+	addq $64, %rax
+	paddq %xmm9, %xmm3
+	movdqa 40(%rsp), %xmm9
+	cmpq $63, %rcx
+	punpckhdq %xmm13, %xmm15
+	psllq $6, %xmm15
+	punpckldq %xmm13, %xmm10
+	paddq %xmm15, %xmm2
+	psllq $12, %xmm10
+	por 168(%rsp), %xmm6
+	pmuludq %xmm6, %xmm9
+	movdqa 88(%rsp), %xmm15
+	paddq %xmm10, %xmm0
+	movdqa 88(%rsp), %xmm13
+	psrlq $14, %xmm7
+	pand %xmm5, %xmm8
+	movdqa 184(%rsp), %xmm10
+	pand %xmm5, %xmm7
+	pmuludq %xmm7, %xmm15
+	paddq %xmm9, %xmm4
+	pmuludq %xmm6, %xmm13
+	movdqa 184(%rsp), %xmm9
+	paddq 168(%rsp), %xmm1
+	pmuludq %xmm7, %xmm10
+	pmuludq %xmm6, %xmm9
+	paddq %xmm15, %xmm4
+	movdqa 184(%rsp), %xmm15
+	paddq %xmm13, %xmm2
+	psrlq $26, %xmm14
+	movdqa 264(%rsp), %xmm13
+	paddq %xmm10, %xmm2
+	pmuludq %xmm8, %xmm15
+	pand %xmm5, %xmm14
+	paddq %xmm9, %xmm0
+	pmuludq %xmm6, %xmm13
+	movdqa 264(%rsp), %xmm9
+	movdqa 264(%rsp), %xmm10
+	pmuludq %xmm11, %xmm6
+	pmuludq %xmm8, %xmm9
+	paddq %xmm15, %xmm4
+	movdqa 264(%rsp), %xmm15
+	pmuludq %xmm14, %xmm10
+	paddq %xmm13, %xmm3
+	movdqa %xmm7, %xmm13
+	pmuludq %xmm7, %xmm15
+	paddq %xmm6, %xmm1
+	movdqa 312(%rsp), %xmm6
+	paddq %xmm9, %xmm2
+	pmuludq %xmm11, %xmm13
+	movdqa 248(%rsp), %xmm9
+	paddq %xmm10, %xmm4
+	pmuludq %xmm8, %xmm6
+	pmuludq 312(%rsp), %xmm7
+	paddq %xmm15, %xmm0
+	movdqa %xmm9, %xmm10
+	movdqa %xmm14, %xmm15
+	pmuludq %xmm11, %xmm10
+	paddq %xmm13, %xmm3
+	movdqa %xmm8, %xmm13
+	pmuludq %xmm11, %xmm13
+	paddq %xmm6, %xmm3
+	paddq %xmm7, %xmm1
+	movdqa 232(%rsp), %xmm6
+	pmuludq %xmm11, %xmm15
+	pmuludq 232(%rsp), %xmm8
+	paddq %xmm10, %xmm4
+	paddq %xmm8, %xmm1
+	movdqa 312(%rsp), %xmm10
+	paddq %xmm13, %xmm0
+	pmuludq %xmm14, %xmm6
+	movdqa 312(%rsp), %xmm13
+	pmuludq %xmm9, %xmm10
+	paddq %xmm15, %xmm2
+	movdqa 232(%rsp), %xmm7
+	pmuludq %xmm14, %xmm13
+	pmuludq 152(%rsp), %xmm14
+	paddq %xmm14, %xmm1
+	pmuludq %xmm9, %xmm7
+	paddq %xmm6, %xmm3
+	paddq %xmm10, %xmm2
+	movdqa 152(%rsp), %xmm10
+	paddq %xmm13, %xmm0
+	pmuludq %xmm9, %xmm10
+	paddq %xmm7, %xmm0
+	movdqa %xmm4, %xmm7
+	psrlq $26, %xmm7
+	pmuludq 56(%rsp), %xmm9
+	pand %xmm5, %xmm4
+	paddq %xmm7, %xmm2
+	paddq %xmm9, %xmm1
+	paddq %xmm10, %xmm3
+	movdqa %xmm2, %xmm7
+	movdqa %xmm2, %xmm9
+	movdqa %xmm3, %xmm6
+	psrlq $26, %xmm7
+	pand %xmm5, %xmm3
+	psrlq $26, %xmm6
+	paddq %xmm7, %xmm0
+	pand %xmm5, %xmm9
+	paddq %xmm6, %xmm1
+	movdqa %xmm0, %xmm10
+	movdqa %xmm1, %xmm6
+	pand %xmm5, %xmm10
+	pand %xmm5, %xmm1
+	psrlq $26, %xmm6
+	pmuludq 136(%rsp), %xmm6
+	paddq %xmm6, %xmm4
+	movdqa %xmm0, %xmm6
+	psrlq $26, %xmm6
+	movdqa %xmm4, %xmm2
+	movdqa %xmm4, %xmm7
+	paddq %xmm6, %xmm3
+	psrlq $26, %xmm2
+	pand %xmm5, %xmm7
+	movdqa %xmm3, %xmm0
+	paddq %xmm2, %xmm9
+	pand %xmm5, %xmm3
+	psrlq $26, %xmm0
+	paddq %xmm0, %xmm1
+	ja .Lpoly1305_blocks_x86_13
+	leaq -64(%rdx), %rax
+	movdqa %xmm3, %xmm6
+	andl $63, %edx
+	andq $-64, %rax
+	leaq 64(%rsi,%rax), %rsi
+.Lpoly1305_blocks_x86_7:
+	cmpq $31, %rdx
+	jbe .Lpoly1305_blocks_x86_9
+	movdqa -24(%rsp), %xmm13
+	movdqa %xmm6, %xmm0
+	movdqa %xmm6, %xmm3
+	movdqa 40(%rsp), %xmm11
+	movdqa %xmm1, %xmm12
+	testq %rsi, %rsi
+	movdqa -40(%rsp), %xmm2
+	pmuludq %xmm13, %xmm0
+	movdqa %xmm1, %xmm8
+	pmuludq %xmm1, %xmm11
+	movdqa %xmm10, %xmm4
+	movdqa %xmm1, %xmm14
+	pmuludq %xmm2, %xmm3
+	movdqa %xmm6, %xmm15
+	pmuludq %xmm1, %xmm13
+	movdqa %xmm7, %xmm1
+	pmuludq %xmm2, %xmm12
+	paddq %xmm0, %xmm11
+	movdqa -56(%rsp), %xmm0
+	pmuludq %xmm10, %xmm2
+	paddq %xmm3, %xmm13
+	pmuludq %xmm0, %xmm4
+	movdqa %xmm9, %xmm3
+	pmuludq %xmm0, %xmm3
+	paddq %xmm2, %xmm11
+	pmuludq %xmm0, %xmm8
+	movdqa %xmm6, %xmm2
+	pmuludq %xmm0, %xmm2
+	movdqa -8(%rsp), %xmm0
+	paddq %xmm4, %xmm13
+	movdqa 312(%rsp), %xmm4
+	paddq %xmm3, %xmm11
+	pmuludq 312(%rsp), %xmm6
+	movdqa 312(%rsp), %xmm3
+	pmuludq %xmm0, %xmm1
+	paddq %xmm2, %xmm12
+	pmuludq %xmm0, %xmm15
+	movdqa %xmm9, %xmm2
+	pmuludq %xmm0, %xmm2
+	pmuludq %xmm7, %xmm3
+	paddq %xmm1, %xmm11
+	movdqa 232(%rsp), %xmm1
+	pmuludq %xmm0, %xmm14
+	paddq %xmm15, %xmm8
+	pmuludq %xmm10, %xmm0
+	paddq %xmm2, %xmm13
+	movdqa 312(%rsp), %xmm2
+	pmuludq %xmm10, %xmm4
+	paddq %xmm3, %xmm13
+	movdqa 152(%rsp), %xmm3
+	pmuludq %xmm9, %xmm2
+	paddq %xmm6, %xmm14
+	pmuludq 232(%rsp), %xmm10
+	paddq %xmm0, %xmm12
+	pmuludq %xmm9, %xmm1
+	paddq %xmm10, %xmm14
+	movdqa 232(%rsp), %xmm0
+	pmuludq %xmm7, %xmm3
+	paddq %xmm4, %xmm8
+	pmuludq 152(%rsp), %xmm9
+	paddq %xmm2, %xmm12
+	paddq %xmm9, %xmm14
+	pmuludq %xmm7, %xmm0
+	paddq %xmm1, %xmm8
+	pmuludq 56(%rsp), %xmm7
+	paddq %xmm3, %xmm8
+	paddq %xmm7, %xmm14
+	paddq %xmm0, %xmm12
+	je .Lpoly1305_blocks_x86_10
+	movdqu (%rsi), %xmm1
+	pxor %xmm0, %xmm0
+	paddq 168(%rsp), %xmm14
+	movdqu 16(%rsi), %xmm2
+	movdqa %xmm1, %xmm3
+	punpckldq %xmm2, %xmm3
+	punpckhdq %xmm2, %xmm1
+	movdqa %xmm3, %xmm4
+	movdqa %xmm1, %xmm2
+	punpckldq %xmm0, %xmm4
+	punpckhdq %xmm0, %xmm3
+	punpckhdq %xmm0, %xmm1
+	punpckldq %xmm0, %xmm2
+	movdqa %xmm2, %xmm0
+	psllq $6, %xmm3
+	paddq %xmm4, %xmm11
+	psllq $12, %xmm0
+	paddq %xmm3, %xmm13
+	psllq $18, %xmm1
+	paddq %xmm0, %xmm12
+	paddq %xmm1, %xmm8
+.Lpoly1305_blocks_x86_10:
+	movdqa %xmm11, %xmm9
+	movdqa %xmm8, %xmm1
+	movdqa %xmm11, %xmm7
+	psrlq $26, %xmm9
+	movdqa %xmm8, %xmm6
+	pand %xmm5, %xmm7
+	paddq %xmm13, %xmm9
+	psrlq $26, %xmm1
+	pand %xmm5, %xmm6
+	movdqa %xmm9, %xmm10
+	paddq %xmm14, %xmm1
+	pand %xmm5, %xmm9
+	psrlq $26, %xmm10
+	movdqa %xmm1, %xmm0
+	pand %xmm5, %xmm1
+	paddq %xmm12, %xmm10
+	psrlq $26, %xmm0
+	pmuludq 136(%rsp), %xmm0
+	movdqa %xmm10, %xmm2
+	paddq %xmm0, %xmm7
+	psrlq $26, %xmm2
+	movdqa %xmm7, %xmm0
+	pand %xmm5, %xmm10
+	paddq %xmm2, %xmm6
+	psrlq $26, %xmm0
+	pand %xmm5, %xmm7
+	movdqa %xmm6, %xmm2
+	paddq %xmm0, %xmm9
+	pand %xmm5, %xmm6
+	psrlq $26, %xmm2
+	paddq %xmm2, %xmm1
+.Lpoly1305_blocks_x86_9:
+	testq %rsi, %rsi
+	je .Lpoly1305_blocks_x86_11
+	movdqa %xmm7, 0(%rdi)
+	movdqa %xmm9, 16(%rdi)
+	movdqa %xmm10, 32(%rdi)
+	movdqa %xmm6, 48(%rdi)
+	movdqa %xmm1, 64(%rdi)
+	movq -8(%rbp), %rbx
+	leave
+	ret
+.Lpoly1305_blocks_x86_5:
+	movdqa 0(%rdi), %xmm7
+	movdqa 16(%rdi), %xmm9
+	movdqa 32(%rdi), %xmm10
+	movdqa 48(%rdi), %xmm6
+	movdqa 64(%rdi), %xmm1
+	jmp .Lpoly1305_blocks_x86_6
+.Lpoly1305_blocks_x86_11:
+	movdqa %xmm7, %xmm0
+	movdqa %xmm9, %xmm2
+	movdqa %xmm6, %xmm3
+	psrldq $8, %xmm0
+	movabsq $4398046511103, %rbx
+	paddq %xmm0, %xmm7
+	psrldq $8, %xmm2
+	movdqa %xmm10, %xmm0
+	movd %xmm7, %edx
+	paddq %xmm2, %xmm9
+	psrldq $8, %xmm0
+	movl %edx, %ecx
+	movd %xmm9, %eax
+	paddq %xmm0, %xmm10
+	shrl $26, %ecx
+	psrldq $8, %xmm3
+	movdqa %xmm1, %xmm0
+	addl %ecx, %eax
+	movd %xmm10, %ecx
+	paddq %xmm3, %xmm6
+	movl %eax, %r9d
+	shrl $26, %eax
+	psrldq $8, %xmm0
+	addl %ecx, %eax
+	movd %xmm6, %ecx
+	paddq %xmm0, %xmm1
+	movl %eax, %esi
+	andl $67108863, %r9d
+	movd %xmm1, %r10d
+	shrl $26, %esi
+	andl $67108863, %eax
+	andl $67108863, %edx
+	addl %ecx, %esi
+	salq $8, %rax
+	movl %r9d, %ecx
+	shrl $18, %r9d
+	movl %esi, %r8d
+	shrl $26, %esi
+	andl $67108863, %r8d
+	addl %r10d, %esi
+	orq %r9, %rax
+	salq $16, %rsi
+	movq %r8, %r9
+	shrl $10, %r8d
+	salq $26, %rcx
+	orq %r8, %rsi
+	salq $34, %r9
+	orq %rdx, %rcx
+	movq %rsi, %r8
+	shrq $42, %rsi
+	movabsq $17592186044415, %rdx
+	orq %r9, %rax
+	andq %rbx, %r8
+	leaq (%rsi,%rsi,4), %rsi
+	andq %rdx, %rcx
+	andq %rdx, %rax
+	movabsq $-4398046511104, %r10
+	addq %rsi, %rcx
+	movq %rcx, %rsi
+	shrq $44, %rcx
+	addq %rcx, %rax
+	andq %rdx, %rsi
+	movq %rax, %rcx
+	shrq $44, %rax
+	addq %r8, %rax
+	andq %rdx, %rcx
+	andq %rax, %rbx
+	shrq $42, %rax
+	leaq (%rsi,%rax,4), %rsi
+	addq %rbx, %r10
+	addq %rax, %rsi
+	movq %rsi, %r8
+	shrq $44, %rsi
+	andq %rdx, %r8
+	addq %rcx, %rsi
+	leaq 5(%r8), %r9
+	movq %r9, %r11
+	andq %rdx, %r9
+	shrq $44, %r11
+	addq %rsi, %r11
+	movq %r11, %rax
+	andq %r11, %rdx
+	shrq $44, %rax
+	addq %rax, %r10
+	movq %r10, %rax
+	shrq $63, %rax
+	subq $1, %rax
+	movq %rax, %rcx
+	andq %rax, %r9
+	andq %rax, %rdx
+	notq %rcx
+	andq %r10, %rax
+	andq %rcx, %r8
+	andq %rcx, %rsi
+	andq %rbx, %rcx
+	orq %r9, %r8
+	orq %rdx, %rsi
+	orq %rax, %rcx
+	movq %r8, 0(%rdi)
+	movq %rsi, 8(%rdi)
+	movq %rcx, 16(%rdi)
+	movq -8(%rbp), %rbx
+	movq %rbp, %rax
+	subq %rsp, %rax
+	pxor %xmm15, %xmm15
+	pxor %xmm7, %xmm7
+	pxor %xmm14, %xmm14
+	pxor %xmm6, %xmm6
+	pxor %xmm13, %xmm13
+	pxor %xmm5, %xmm5
+	pxor %xmm12, %xmm12
+	pxor %xmm4, %xmm4
+	leave
+	addq $8, %rax
+	pxor %xmm11, %xmm11
+	pxor %xmm3, %xmm3
+	pxor %xmm10, %xmm10
+	pxor %xmm2, %xmm2
+	pxor %xmm9, %xmm9
+	pxor %xmm1, %xmm1
+	pxor %xmm8, %xmm8
+	pxor %xmm0, %xmm0
+	ret
+.size _gcry_poly1305_amd64_sse2_blocks,.-_gcry_poly1305_amd64_sse2_blocks;
+
+#endif
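
Reader's note on the assembly above: the tail of
_gcry_poly1305_amd64_sse2_finish_ext (from the
.Lpoly1305_finish_x86_not_started label) loads three 64-bit words from the
state, recombines them with shifts of 44, 20 and 24, and then adds two 64-bit
words with an add/adc pair before storing 16 bytes to the MAC buffer. Read as
three limbs of 44, 44 and 42 bits plus a 128-bit pad, that step could be
sketched in C as follows. The function name and argument layout are
hypothetical and chosen only for illustration; this is an interpretation of the
listing, not code from libgcrypt.

```c
#include <stdint.h>

/* Illustrative helper (not part of libgcrypt): fold three limbs of 44, 44
   and 42 bits into two 64-bit words, then add a 128-bit pad modulo 2^128.
   Each limb is assumed to be already reduced to its nominal width. */
static void
example_pack_and_add_pad (const uint64_t h[3], const uint64_t pad[2],
                          uint64_t tag[2])
{
  /* Bits 0..63: all of limb 0 plus the low 20 bits of limb 1. */
  uint64_t lo = h[0] | (h[1] << 44);
  /* Bits 64..127: the high 24 bits of limb 1 plus limb 2 from bit 24 up. */
  uint64_t hi = (h[1] >> 20) | (h[2] << 24);

  /* 128-bit addition; the comparison recovers the carry that the
     assembly obtains from the add/adc pair. */
  uint64_t sum_lo = lo + pad[0];
  uint64_t carry = sum_lo < lo;

  tag[0] = sum_lo;
  tag[1] = hi + pad[1] + carry;
}
```

The carry out of the high word is discarded, which matches an addition modulo
2^128.
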
diff --git a/cipher/poly1305.c b/cipher/poly1305.c
index 472ae42..cd1902a 100644
--- a/cipher/poly1305.c
+++ b/cipher/poly1305.c
@@ -38,6 +38,25 @@ static const char *selftest (void);
 

 
 
+#ifdef POLY1305_USE_SSE2
+
+void _gcry_poly1305_amd64_sse2_init_ext(void *state, const poly1305_key_t *key);
+unsigned int _gcry_poly1305_amd64_sse2_finish_ext(void *state, const byte *m,
+						  size_t remaining,
+						  byte mac[16]);
+unsigned int _gcry_poly1305_amd64_sse2_blocks(void *ctx, const byte *m,
+					      size_t bytes);
+
+static const poly1305_ops_t poly1305_amd64_sse2_ops = {
+  POLY1305_SSE2_BLOCKSIZE,
+  _gcry_poly1305_amd64_sse2_init_ext,
+  _gcry_poly1305_amd64_sse2_blocks,
+  _gcry_poly1305_amd64_sse2_finish_ext
+};
+
+#endif
+
+
 #ifdef HAVE_U64_TYPEDEF
 
 /* Reference unoptimized poly1305 implementation using 32 bit * 32 bit = 64 bit
@@ -612,7 +631,11 @@ _gcry_poly1305_init (poly1305_context_t * ctx, const byte * key,
   if (selftest_failed)
     return GPG_ERR_SELFTEST_FAILED;
 
+#ifdef POLY1305_USE_SSE2
+  ctx->ops = &poly1305_amd64_sse2_ops;
+#else
   ctx->ops = &poly1305_default_ops;
+#endif
 
   buf_cpy (keytmp.b, key, POLY1305_KEYLEN);
   poly1305_init (ctx, &keytmp);
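
Reader's note on the poly1305.c change above: the patch adds a second
poly1305_ops_t table and picks one table at compile time in
_gcry_poly1305_init. A minimal, self-contained sketch of that pattern is shown
below. Every name in it (example_ops_t, example_ctx_t, EXAMPLE_USE_WIDE, the toy
backends) is made up for illustration, and the toy "blocks" routine only counts
bytes; it does not compute a MAC and does not mirror the real signatures.

```c
#include <stddef.h>

/* Hypothetical stand-in for an ops table: one set of entry points per
   backend, plus the block size that backend wants to be fed. */
typedef struct
{
  size_t block_size;
  void (*init) (void *state);
  void (*blocks) (void *state, const unsigned char *m, size_t bytes);
} example_ops_t;

typedef struct
{
  const example_ops_t *ops;
  size_t bytes_seen;   /* toy state: bytes handed to blocks() so far */
} example_ctx_t;

static void
toy_init (void *state)
{
  ((example_ctx_t *) state)->bytes_seen = 0;
}

static void
toy_blocks (void *state, const unsigned char *m, size_t bytes)
{
  (void) m;            /* a real backend would absorb the data here */
  ((example_ctx_t *) state)->bytes_seen += bytes;
}

#ifdef EXAMPLE_USE_WIDE
/* Only compiled when the wide backend is enabled, as in the patch. */
static const example_ops_t example_wide_ops = { 32, toy_init, toy_blocks };
#endif

static const example_ops_t example_default_ops = { 16, toy_init, toy_blocks };

static void
example_init (example_ctx_t *ctx)
{
#ifdef EXAMPLE_USE_WIDE
  ctx->ops = &example_wide_ops;
#else
  ctx->ops = &example_default_ops;
#endif
  ctx->ops->init (ctx);
}

/* Pass only whole blocks to the backend; return how many trailing bytes
   the caller still has to buffer. */
static size_t
example_update (example_ctx_t *ctx, const unsigned char *m, size_t len)
{
  size_t whole = len - (len % ctx->ops->block_size);

  if (whole)
    ctx->ops->blocks (ctx, m, whole);
  return len - whole;
}
```

Because callers only go through ctx->ops, switching backends changes the block
size and entry points without touching the calling code.
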
diff --git a/configure.ac b/configure.ac
index 3a0fd52..4dc36d5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1821,6 +1821,13 @@ if test "$found" = "1" ; then
    esac
 fi
 
+case "${host}" in
+   x86_64-*-*)
+      # Build with the assembly implementation
+      GCRYPT_CIPHERS="$GCRYPT_CIPHERS poly1305-sse2-amd64.lo"
+   ;;
+esac
+
 LIST_MEMBER(dsa, $enabled_pubkey_ciphers)
 if test "$found" = "1" ; then
    GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo"

-----------------------------------------------------------------------

Summary of changes:
 cipher/Makefile.am                                 |    3 +-
 ...hacha20-ssse3-amd64.S => chacha20-sse2-amd64.S} |  226 +++--
 cipher/chacha20.c                                  |   18 +
 cipher/poly1305-avx2-amd64.S                       |  954 ++++++++++++++++++
 cipher/poly1305-internal.h                         |   45 +-
 cipher/poly1305-sse2-amd64.S                       | 1035 ++++++++++++++++++++
 cipher/poly1305.c                                  |   49 +
 configure.ac                                       |    9 +
 doc/gcrypt.texi                                    |   42 +-
 9 files changed, 2271 insertions(+), 110 deletions(-)
 copy cipher/{chacha20-ssse3-amd64.S => chacha20-sse2-amd64.S} (78%)
 create mode 100644 cipher/poly1305-avx2-amd64.S
 create mode 100644 cipher/poly1305-sse2-amd64.S


hooks/post-receive
-- 
The GNU crypto library
http://git.gnupg.org