[git] GCRYPT - branch, master, updated. libgcrypt-1.5.0-342-g3ff9d25

by Jussi Kivilinna cvs at cvs.gnupg.org
Mon Oct 28 15:18:07 CET 2013


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".

The branch, master has been updated
       via  3ff9d2571c18cd7a34359f9c60a10d3b0f932b23 (commit)
       via  5a3d43485efdc09912be0967ee0a3ce345b3b15a (commit)
      from  e214e8392671dd30e9c33260717b5e756debf3bf (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 3ff9d2571c18cd7a34359f9c60a10d3b0f932b23
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Sat Oct 26 15:00:48 2013 +0300

    Add ARM NEON assembly implementation of Salsa20
    
    * cipher/Makefile.am: Add 'salsa20-armv7-neon.S'.
    * cipher/salsa20-armv7-neon.S: New.
    * cipher/salsa20.c [USE_ARM_NEON_ASM]: New macro.
    (struct SALSA20_context_s, salsa20_core_t, salsa20_keysetup_t)
    (salsa20_ivsetup_t): New.
    (SALSA20_context_t) [USE_ARM_NEON_ASM]: Add 'use_neon'.
    (SALSA20_context_t): Add 'keysetup', 'ivsetup' and 'core'.
    (salsa20_core): Change 'src' argument to 'ctx'.
    [USE_ARM_NEON_ASM] (_gcry_arm_neon_salsa20_encrypt): New prototype.
    [USE_ARM_NEON_ASM] (salsa20_core_neon, salsa20_keysetup_neon)
    (salsa20_ivsetup_neon): New.
    (salsa20_do_setkey): Setup keysetup, ivsetup and core with default
    functions.
    (salsa20_do_setkey) [USE_ARM_NEON_ASM]: When NEON support is detected,
    set keysetup, ivsetup and core to the ARM NEON functions.
    (salsa20_do_setkey): Call 'ctx->keysetup'.
    (salsa20_setiv): Call 'ctx->ivsetup'.
    (salsa20_do_encrypt_stream) [USE_ARM_NEON_ASM]: Process large buffers
    in ARM NEON implementation.
    (salsa20_do_encrypt_stream): Call 'ctx->core' instead of directly
    calling 'salsa20_core'.
    (selftest): Add test to check large buffer processing and block counter
    updating.
    * configure.ac [neonsupport]: Add 'salsa20-armv7-neon.lo'.
    --
    
    This patch adds a fast ARM NEON assembly implementation of Salsa20. The
    implementation gains extra speed by processing three blocks in parallel
    with the help of the ARM NEON vector processing unit.
    
    The implementation is based on public domain code by Peter Schwabe and
    D. J. Bernstein, available in the SUPERCOP benchmarking framework. For
    more details on this work, see the paper "NEON crypto" by Daniel J.
    Bernstein and Peter Schwabe:
        http://cryptojedi.org/papers/#neoncrypto
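    
    To illustrate the run-time dispatch this patch introduces (the
    'keysetup', 'ivsetup' and 'core' function pointers listed above), here
    is a minimal, self-contained C sketch. It is a simplified illustration
    under assumed names (core_generic, core_neon, do_setkey), not the
    actual libgcrypt code:
    
        /* Hypothetical sketch of the function-pointer dispatch added to
         * salsa20.c: default ops are installed at key setup and replaced
         * by NEON variants when the hardware supports them. */
        #include <stdio.h>
        
        struct ctx_s;
        typedef unsigned int (*core_fn_t)(unsigned int *dst,
                                          struct ctx_s *ctx,
                                          unsigned int rounds);
        
        typedef struct ctx_s
        {
          unsigned int input[16];   /* cipher state */
          core_fn_t core;           /* selected core implementation */
        } ctx_t;
        
        static unsigned int
        core_generic (unsigned int *dst, struct ctx_s *ctx,
                      unsigned int rounds)
        {
          (void)dst; (void)ctx; (void)rounds;
          puts ("generic core");
          return 0;   /* burn stack depth */
        }
        
        static unsigned int
        core_neon (unsigned int *dst, struct ctx_s *ctx,
                   unsigned int rounds)
        {
          (void)dst; (void)ctx; (void)rounds;
          puts ("NEON core");
          return 0;
        }
        
        static void
        do_setkey (ctx_t *ctx, int hw_has_neon)
        {
          ctx->core = core_generic;     /* default ops */
          if (hw_has_neon)
            ctx->core = core_neon;      /* ARM NEON ops instead */
        }
        
        int
        main (void)
        {
          ctx_t ctx;
          do_setkey (&ctx, 1);
          ctx.core (NULL, &ctx, 20);    /* calls the NEON variant */
          return 0;
        }
    
    In the patch itself the selection is driven by
    _gcry_get_hw_features () & HWF_ARM_NEON, as shown in the salsa20.c
    hunk below.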
    
    Benchmark results on Cortex-A8 (1008 MHz):
    
    Before:
     SALSA20        |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |     18.88 ns/B     50.51 MiB/s     19.03 c/B
         STREAM dec |     18.89 ns/B     50.49 MiB/s     19.04 c/B
                    =
     SALSA20R12     |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |     13.60 ns/B     70.14 MiB/s     13.71 c/B
         STREAM dec |     13.60 ns/B     70.13 MiB/s     13.71 c/B
    
    After:
     SALSA20        |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |      5.48 ns/B     174.1 MiB/s      5.52 c/B
         STREAM dec |      5.47 ns/B     174.2 MiB/s      5.52 c/B
                    =
     SALSA20R12     |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |      3.65 ns/B     260.9 MiB/s      3.68 c/B
         STREAM dec |      3.65 ns/B     261.6 MiB/s      3.67 c/B
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index e786713..95d484e 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -71,7 +71,7 @@ md5.c \
 rijndael.c rijndael-tables.h rijndael-amd64.S rijndael-arm.S \
 rmd160.c \
 rsa.c \
-salsa20.c salsa20-amd64.S \
+salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \
 scrypt.c \
 seed.c \
 serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \
diff --git a/cipher/salsa20-armv7-neon.S b/cipher/salsa20-armv7-neon.S
new file mode 100644
index 0000000..5b51301
--- /dev/null
+++ b/cipher/salsa20-armv7-neon.S
@@ -0,0 +1,899 @@
+/* salsa20-armv7-neon.S  -  ARM NEON implementation of Salsa20 cipher
+ *
+ * Copyright © 2013 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__) && \
+    defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+    defined(HAVE_GCC_INLINE_ASM_NEON) && defined(USE_SALSA20)
+
+/*
+ * Based on public domain implementation from SUPERCOP benchmarking framework
+ * by Peter Schwabe and D. J. Bernstein. Paper about the implementation at:
+ *   http://cryptojedi.org/papers/#neoncrypto
+ */
+
+.syntax unified
+.arm
+.fpu neon
+.text
+
+.align 2
+.global _gcry_arm_neon_salsa20_encrypt
+.type  _gcry_arm_neon_salsa20_encrypt,%function;
+_gcry_arm_neon_salsa20_encrypt:
+	/* Modifications:
+	 *  - arguments changed to (void *c, const void *m, unsigned int nblks,
+         *    void *ctx, unsigned int rounds) from (void *c, const void *m,
+         *    unsigned long long mlen, const void *n, const void *k)
+	 *  - nonce and key read from 'ctx' as well as sigma and counter.
+	 *  - read in counter from 'ctx' at the start.
+         *  - update counter in 'ctx' at the end.
+	 *  - length is input as number of blocks, so don't handle tail bytes
+	 *    (this is done in salsa20.c).
+	 */
+	lsl r2,r2,#6
+	vpush {q4,q5,q6,q7}
+	mov r12,sp
+	sub sp,sp,#352
+	and sp,sp,#0xffffffe0
+	strd r4,[sp,#0]
+	strd r6,[sp,#8]
+	strd r8,[sp,#16]
+	strd r10,[sp,#24]
+	str r14,[sp,#224]
+	str r12,[sp,#228]
+	str r0,[sp,#232]
+	str r1,[sp,#236]
+	str r2,[sp,#240]
+	ldr r4,[r12,#64]
+	str r4,[sp,#244]
+	mov r2,r3
+	add r3,r2,#48
+	vld1.8 {q3},[r2]
+	add r0,r2,#32
+	add r14,r2,#40
+	vmov.i64 q3,#0xff
+	str r14,[sp,#160]
+	ldrd r8,[r2,#4]
+	vld1.8 {d0},[r0]
+	ldrd r4,[r2,#20]
+	vld1.8 {d8-d9},[r2]!
+	ldrd r6,[r0,#0]
+	vmov d4,d9
+	ldr r0,[r14]
+	vrev64.i32 d0,d0
+	ldr r1,[r14,#4]
+	vld1.8 {d10-d11},[r2]
+	strd r6,[sp,#32]
+	sub r2,r2,#16
+	strd r0,[sp,#40]
+	vmov d5,d11
+	strd r8,[sp,#48]
+	vext.32 d1,d0,d10,#1
+	strd r4,[sp,#56]
+	ldr r1,[r2,#0]
+	vshr.u32 q3,q3,#7
+	ldr r4,[r2,#12]
+	vext.32 d3,d11,d9,#1
+	ldr r11,[r2,#16]
+	vext.32 d2,d8,d0,#1
+	ldr r8,[r2,#28]
+	vext.32 d0,d10,d8,#1
+	ldr r0,[r3,#0]
+	add r2,r2,#44
+	vmov q4,q3
+	vld1.8 {d6-d7},[r14]
+	vadd.i64 q3,q3,q4
+	ldr r5,[r3,#4]
+	add r12,sp,#256
+	vst1.8 {d4-d5},[r12,: 128]
+	ldr r10,[r3,#8]
+	add r14,sp,#272
+	vst1.8 {d2-d3},[r14,: 128]
+	ldr r9,[r3,#12]
+	vld1.8 {d2-d3},[r3]
+	strd r0,[sp,#64]
+	ldr r0,[sp,#240]
+	strd r4,[sp,#72]
+	strd r10,[sp,#80]
+	strd r8,[sp,#88]
+	nop
+	cmp r0,#192
+	blo ._mlenlowbelow192
+._mlenatleast192:
+	ldrd r2,[sp,#48]
+	vext.32 d7,d6,d6,#1
+	vmov q8,q1
+	ldrd r6,[sp,#32]
+	vld1.8 {d18-d19},[r12,: 128]
+	vmov q10,q0
+	str r0,[sp,#240]
+	vext.32 d4,d7,d19,#1
+	vmov q11,q8
+	vext.32 d10,d18,d7,#1
+	vadd.i64 q3,q3,q4
+	ldrd r0,[sp,#64]
+	vld1.8 {d24-d25},[r14,: 128]
+	vmov d5,d24
+	add r8,sp,#288
+	ldrd r4,[sp,#72]
+	vmov d11,d25
+	add r9,sp,#304
+	ldrd r10,[sp,#80]
+	vst1.8 {d4-d5},[r8,: 128]
+	strd r2,[sp,#96]
+	vext.32 d7,d6,d6,#1
+	vmov q13,q10
+	strd r6,[sp,#104]
+	vmov d13,d24
+	vst1.8 {d10-d11},[r9,: 128]
+	add r2,sp,#320
+	vext.32 d12,d7,d19,#1
+	vmov d15,d25
+	add r6,sp,#336
+	ldr r12,[sp,#244]
+	vext.32 d14,d18,d7,#1
+	vadd.i64 q3,q3,q4
+	ldrd r8,[sp,#88]
+	vst1.8 {d12-d13},[r2,: 128]
+	ldrd r2,[sp,#56]
+	vst1.8 {d14-d15},[r6,: 128]
+	ldrd r6,[sp,#40]
+._mainloop2:
+	str r12,[sp,#248]
+	vadd.i32 q4,q10,q8
+	vadd.i32 q9,q13,q11
+	add r12,r0,r2
+	add r14,r5,r1
+	vshl.i32 q12,q4,#7
+	vshl.i32 q14,q9,#7
+	vshr.u32 q4,q4,#25
+	vshr.u32 q9,q9,#25
+	eor r4,r4,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r4,r0
+	add r14,r7,r5
+	veor q5,q5,q12
+	veor q7,q7,q14
+	veor q4,q5,q4
+	veor q5,q7,q9
+	eor r6,r6,r12,ROR #23
+	eor r3,r3,r14,ROR #23
+	add r12,r6,r4
+	str r7,[sp,#116]
+	add r7,r3,r7
+	ldr r14,[sp,#108]
+	vadd.i32 q7,q8,q4
+	vadd.i32 q9,q11,q5
+	vshl.i32 q12,q7,#9
+	vshl.i32 q14,q9,#9
+	vshr.u32 q7,q7,#23
+	vshr.u32 q9,q9,#23
+	veor q2,q2,q12
+	veor q6,q6,q14
+	veor q2,q2,q7
+	veor q6,q6,q9
+	eor r2,r2,r12,ROR #19
+	str r2,[sp,#120]
+	eor r1,r1,r7,ROR #19
+	ldr r7,[sp,#96]
+	add r2,r2,r6
+	str r6,[sp,#112]
+	add r6,r1,r3
+	ldr r12,[sp,#104]
+	vadd.i32 q7,q4,q2
+	vext.32 q4,q4,q4,#3
+	vadd.i32 q9,q5,q6
+	vshl.i32 q12,q7,#13
+	vext.32 q5,q5,q5,#3
+	vshl.i32 q14,q9,#13
+	eor r0,r0,r2,ROR #14
+	eor r2,r5,r6,ROR #14
+	str r3,[sp,#124]
+	add r3,r10,r12
+	ldr r5,[sp,#100]
+	add r6,r9,r11
+	vshr.u32 q7,q7,#19
+	vshr.u32 q9,q9,#19
+	veor q10,q10,q12
+	veor q12,q13,q14
+	eor r8,r8,r3,ROR #25
+	eor r3,r5,r6,ROR #25
+	add r5,r8,r10
+	add r6,r3,r9
+	veor q7,q10,q7
+	veor q9,q12,q9
+	eor r5,r7,r5,ROR #23
+	eor r6,r14,r6,ROR #23
+	add r7,r5,r8
+	add r14,r6,r3
+	vadd.i32 q10,q2,q7
+	vswp d4,d5
+	vadd.i32 q12,q6,q9
+	vshl.i32 q13,q10,#18
+	vswp d12,d13
+	vshl.i32 q14,q12,#18
+	eor r7,r12,r7,ROR #19
+	eor r11,r11,r14,ROR #19
+	add r12,r7,r5
+	add r14,r11,r6
+	vshr.u32 q10,q10,#14
+	vext.32 q7,q7,q7,#1
+	vshr.u32 q12,q12,#14
+	veor q8,q8,q13
+	vext.32 q9,q9,q9,#1
+	veor q11,q11,q14
+	eor r10,r10,r12,ROR #14
+	eor r9,r9,r14,ROR #14
+	add r12,r0,r3
+	add r14,r2,r4
+	veor q8,q8,q10
+	veor q10,q11,q12
+	eor r1,r1,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r1,r0
+	add r14,r7,r2
+	vadd.i32 q11,q4,q8
+	vadd.i32 q12,q5,q10
+	vshl.i32 q13,q11,#7
+	vshl.i32 q14,q12,#7
+	eor r5,r5,r12,ROR #23
+	eor r6,r6,r14,ROR #23
+	vshr.u32 q11,q11,#25
+	vshr.u32 q12,q12,#25
+	add r12,r5,r1
+	add r14,r6,r7
+	veor q7,q7,q13
+	veor q9,q9,q14
+	veor q7,q7,q11
+	veor q9,q9,q12
+	vadd.i32 q11,q8,q7
+	vadd.i32 q12,q10,q9
+	vshl.i32 q13,q11,#9
+	vshl.i32 q14,q12,#9
+	eor r3,r3,r12,ROR #19
+	str r7,[sp,#104]
+	eor r4,r4,r14,ROR #19
+	ldr r7,[sp,#112]
+	add r12,r3,r5
+	str r6,[sp,#108]
+	add r6,r4,r6
+	ldr r14,[sp,#116]
+	eor r0,r0,r12,ROR #14
+	str r5,[sp,#96]
+	eor r5,r2,r6,ROR #14
+	ldr r2,[sp,#120]
+	vshr.u32 q11,q11,#23
+	vshr.u32 q12,q12,#23
+	veor q2,q2,q13
+	veor q6,q6,q14
+	veor q2,q2,q11
+	veor q6,q6,q12
+	add r6,r10,r14
+	add r12,r9,r8
+	vadd.i32 q11,q7,q2
+	vext.32 q7,q7,q7,#3
+	vadd.i32 q12,q9,q6
+	vshl.i32 q13,q11,#13
+	vext.32 q9,q9,q9,#3
+	vshl.i32 q14,q12,#13
+	vshr.u32 q11,q11,#19
+	vshr.u32 q12,q12,#19
+	eor r11,r11,r6,ROR #25
+	eor r2,r2,r12,ROR #25
+	add r6,r11,r10
+	str r3,[sp,#100]
+	add r3,r2,r9
+	ldr r12,[sp,#124]
+	veor q4,q4,q13
+	veor q5,q5,q14
+	veor q4,q4,q11
+	veor q5,q5,q12
+	eor r6,r7,r6,ROR #23
+	eor r3,r12,r3,ROR #23
+	add r7,r6,r11
+	add r12,r3,r2
+	vadd.i32 q11,q2,q4
+	vswp d4,d5
+	vadd.i32 q12,q6,q5
+	vshl.i32 q13,q11,#18
+	vswp d12,d13
+	vshl.i32 q14,q12,#18
+	eor r7,r14,r7,ROR #19
+	eor r8,r8,r12,ROR #19
+	add r12,r7,r6
+	add r14,r8,r3
+	vshr.u32 q11,q11,#14
+	vext.32 q4,q4,q4,#1
+	vshr.u32 q12,q12,#14
+	veor q8,q8,q13
+	vext.32 q5,q5,q5,#1
+	veor q10,q10,q14
+	eor r10,r10,r12,ROR #14
+	veor q8,q8,q11
+	eor r9,r9,r14,ROR #14
+	veor q10,q10,q12
+	vadd.i32 q11,q7,q8
+	vadd.i32 q12,q9,q10
+	add r12,r0,r2
+	add r14,r5,r1
+	vshl.i32 q13,q11,#7
+	vshl.i32 q14,q12,#7
+	vshr.u32 q11,q11,#25
+	vshr.u32 q12,q12,#25
+	eor r4,r4,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r4,r0
+	add r14,r7,r5
+	veor q4,q4,q13
+	veor q5,q5,q14
+	veor q4,q4,q11
+	veor q5,q5,q12
+	eor r6,r6,r12,ROR #23
+	eor r3,r3,r14,ROR #23
+	add r12,r6,r4
+	str r7,[sp,#116]
+	add r7,r3,r7
+	ldr r14,[sp,#108]
+	vadd.i32 q11,q8,q4
+	vadd.i32 q12,q10,q5
+	vshl.i32 q13,q11,#9
+	vshl.i32 q14,q12,#9
+	vshr.u32 q11,q11,#23
+	vshr.u32 q12,q12,#23
+	veor q2,q2,q13
+	veor q6,q6,q14
+	veor q2,q2,q11
+	veor q6,q6,q12
+	eor r2,r2,r12,ROR #19
+	str r2,[sp,#120]
+	eor r1,r1,r7,ROR #19
+	ldr r7,[sp,#96]
+	add r2,r2,r6
+	str r6,[sp,#112]
+	add r6,r1,r3
+	ldr r12,[sp,#104]
+	vadd.i32 q11,q4,q2
+	vext.32 q4,q4,q4,#3
+	vadd.i32 q12,q5,q6
+	vshl.i32 q13,q11,#13
+	vext.32 q5,q5,q5,#3
+	vshl.i32 q14,q12,#13
+	eor r0,r0,r2,ROR #14
+	eor r2,r5,r6,ROR #14
+	str r3,[sp,#124]
+	add r3,r10,r12
+	ldr r5,[sp,#100]
+	add r6,r9,r11
+	vshr.u32 q11,q11,#19
+	vshr.u32 q12,q12,#19
+	veor q7,q7,q13
+	veor q9,q9,q14
+	eor r8,r8,r3,ROR #25
+	eor r3,r5,r6,ROR #25
+	add r5,r8,r10
+	add r6,r3,r9
+	veor q7,q7,q11
+	veor q9,q9,q12
+	eor r5,r7,r5,ROR #23
+	eor r6,r14,r6,ROR #23
+	add r7,r5,r8
+	add r14,r6,r3
+	vadd.i32 q11,q2,q7
+	vswp d4,d5
+	vadd.i32 q12,q6,q9
+	vshl.i32 q13,q11,#18
+	vswp d12,d13
+	vshl.i32 q14,q12,#18
+	eor r7,r12,r7,ROR #19
+	eor r11,r11,r14,ROR #19
+	add r12,r7,r5
+	add r14,r11,r6
+	vshr.u32 q11,q11,#14
+	vext.32 q7,q7,q7,#1
+	vshr.u32 q12,q12,#14
+	veor q8,q8,q13
+	vext.32 q9,q9,q9,#1
+	veor q10,q10,q14
+	eor r10,r10,r12,ROR #14
+	eor r9,r9,r14,ROR #14
+	add r12,r0,r3
+	add r14,r2,r4
+	veor q8,q8,q11
+	veor q11,q10,q12
+	eor r1,r1,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r1,r0
+	add r14,r7,r2
+	vadd.i32 q10,q4,q8
+	vadd.i32 q12,q5,q11
+	vshl.i32 q13,q10,#7
+	vshl.i32 q14,q12,#7
+	eor r5,r5,r12,ROR #23
+	eor r6,r6,r14,ROR #23
+	vshr.u32 q10,q10,#25
+	vshr.u32 q12,q12,#25
+	add r12,r5,r1
+	add r14,r6,r7
+	veor q7,q7,q13
+	veor q9,q9,q14
+	veor q7,q7,q10
+	veor q9,q9,q12
+	vadd.i32 q10,q8,q7
+	vadd.i32 q12,q11,q9
+	vshl.i32 q13,q10,#9
+	vshl.i32 q14,q12,#9
+	eor r3,r3,r12,ROR #19
+	str r7,[sp,#104]
+	eor r4,r4,r14,ROR #19
+	ldr r7,[sp,#112]
+	add r12,r3,r5
+	str r6,[sp,#108]
+	add r6,r4,r6
+	ldr r14,[sp,#116]
+	eor r0,r0,r12,ROR #14
+	str r5,[sp,#96]
+	eor r5,r2,r6,ROR #14
+	ldr r2,[sp,#120]
+	vshr.u32 q10,q10,#23
+	vshr.u32 q12,q12,#23
+	veor q2,q2,q13
+	veor q6,q6,q14
+	veor q2,q2,q10
+	veor q6,q6,q12
+	add r6,r10,r14
+	add r12,r9,r8
+	vadd.i32 q12,q7,q2
+	vext.32 q10,q7,q7,#3
+	vadd.i32 q7,q9,q6
+	vshl.i32 q14,q12,#13
+	vext.32 q13,q9,q9,#3
+	vshl.i32 q9,q7,#13
+	vshr.u32 q12,q12,#19
+	vshr.u32 q7,q7,#19
+	eor r11,r11,r6,ROR #25
+	eor r2,r2,r12,ROR #25
+	add r6,r11,r10
+	str r3,[sp,#100]
+	add r3,r2,r9
+	ldr r12,[sp,#124]
+	veor q4,q4,q14
+	veor q5,q5,q9
+	veor q4,q4,q12
+	veor q7,q5,q7
+	eor r6,r7,r6,ROR #23
+	eor r3,r12,r3,ROR #23
+	add r7,r6,r11
+	add r12,r3,r2
+	vadd.i32 q5,q2,q4
+	vswp d4,d5
+	vadd.i32 q9,q6,q7
+	vshl.i32 q12,q5,#18
+	vswp d12,d13
+	vshl.i32 q14,q9,#18
+	eor r7,r14,r7,ROR #19
+	eor r8,r8,r12,ROR #19
+	add r12,r7,r6
+	add r14,r8,r3
+	vshr.u32 q15,q5,#14
+	vext.32 q5,q4,q4,#1
+	vshr.u32 q4,q9,#14
+	veor q8,q8,q12
+	vext.32 q7,q7,q7,#1
+	veor q9,q11,q14
+	eor r10,r10,r12,ROR #14
+	ldr r12,[sp,#248]
+	veor q8,q8,q15
+	eor r9,r9,r14,ROR #14
+	veor q11,q9,q4
+	subs r12,r12,#4
+	bhi ._mainloop2
+	strd r8,[sp,#112]
+	ldrd r8,[sp,#64]
+	strd r2,[sp,#120]
+	ldrd r2,[sp,#96]
+	add r0,r0,r8
+	strd r10,[sp,#96]
+	add r1,r1,r9
+	ldrd r10,[sp,#48]
+	ldrd r8,[sp,#72]
+	add r2,r2,r10
+	strd r6,[sp,#128]
+	add r3,r3,r11
+	ldrd r6,[sp,#104]
+	ldrd r10,[sp,#32]
+	ldr r12,[sp,#236]
+	add r4,r4,r8
+	add r5,r5,r9
+	add r6,r6,r10
+	add r7,r7,r11
+	cmp r12,#0
+	beq ._nomessage1
+	ldr r8,[r12,#0]
+	ldr r9,[r12,#4]
+	ldr r10,[r12,#8]
+	ldr r11,[r12,#12]
+	eor r0,r0,r8
+	ldr r8,[r12,#16]
+	eor r1,r1,r9
+	ldr r9,[r12,#20]
+	eor r2,r2,r10
+	ldr r10,[r12,#24]
+	eor r3,r3,r11
+	ldr r11,[r12,#28]
+	eor r4,r4,r8
+	eor r5,r5,r9
+	eor r6,r6,r10
+	eor r7,r7,r11
+._nomessage1:
+	ldr r14,[sp,#232]
+	vadd.i32 q4,q8,q1
+	str r0,[r14,#0]
+	add r0,sp,#304
+	str r1,[r14,#4]
+	vld1.8 {d16-d17},[r0,: 128]
+	str r2,[r14,#8]
+	vadd.i32 q5,q8,q5
+	str r3,[r14,#12]
+	add r0,sp,#288
+	str r4,[r14,#16]
+	vld1.8 {d16-d17},[r0,: 128]
+	str r5,[r14,#20]
+	vadd.i32 q9,q10,q0
+	str r6,[r14,#24]
+	vadd.i32 q2,q8,q2
+	str r7,[r14,#28]
+	vmov.i64 q8,#0xffffffff
+	ldrd r6,[sp,#128]
+	vext.32 d20,d8,d10,#1
+	ldrd r0,[sp,#40]
+	vext.32 d25,d9,d11,#1
+	ldrd r2,[sp,#120]
+	vbif q4,q9,q8
+	ldrd r4,[sp,#56]
+	vext.32 d21,d5,d19,#1
+	add r6,r6,r0
+	vext.32 d24,d4,d18,#1
+	add r7,r7,r1
+	vbif q2,q5,q8
+	add r2,r2,r4
+	vrev64.i32 q5,q10
+	add r3,r3,r5
+	vrev64.i32 q9,q12
+	adds r0,r0,#3
+	vswp d5,d9
+	adc r1,r1,#0
+	strd r0,[sp,#40]
+	ldrd r8,[sp,#112]
+	ldrd r0,[sp,#88]
+	ldrd r10,[sp,#96]
+	ldrd r4,[sp,#80]
+	add r0,r8,r0
+	add r1,r9,r1
+	add r4,r10,r4
+	add r5,r11,r5
+	add r8,r14,#64
+	cmp r12,#0
+	beq ._nomessage2
+	ldr r9,[r12,#32]
+	ldr r10,[r12,#36]
+	ldr r11,[r12,#40]
+	ldr r14,[r12,#44]
+	eor r6,r6,r9
+	ldr r9,[r12,#48]
+	eor r7,r7,r10
+	ldr r10,[r12,#52]
+	eor r4,r4,r11
+	ldr r11,[r12,#56]
+	eor r5,r5,r14
+	ldr r14,[r12,#60]
+	add r12,r12,#64
+	eor r2,r2,r9
+	vld1.8 {d20-d21},[r12]!
+	veor q4,q4,q10
+	eor r3,r3,r10
+	vld1.8 {d20-d21},[r12]!
+	veor q5,q5,q10
+	eor r0,r0,r11
+	vld1.8 {d20-d21},[r12]!
+	veor q2,q2,q10
+	eor r1,r1,r14
+	vld1.8 {d20-d21},[r12]!
+	veor q9,q9,q10
+._nomessage2:
+	vst1.8 {d8-d9},[r8]!
+	vst1.8 {d10-d11},[r8]!
+	vmov.i64 q4,#0xff
+	vst1.8 {d4-d5},[r8]!
+	vst1.8 {d18-d19},[r8]!
+	str r6,[r8,#-96]
+	add r6,sp,#336
+	str r7,[r8,#-92]
+	add r7,sp,#320
+	str r4,[r8,#-88]
+	vadd.i32 q2,q11,q1
+	vld1.8 {d10-d11},[r6,: 128]
+	vadd.i32 q5,q5,q7
+	vld1.8 {d14-d15},[r7,: 128]
+	vadd.i32 q9,q13,q0
+	vadd.i32 q6,q7,q6
+	str r5,[r8,#-84]
+	vext.32 d14,d4,d10,#1
+	str r2,[r8,#-80]
+	vext.32 d21,d5,d11,#1
+	str r3,[r8,#-76]
+	vbif q2,q9,q8
+	str r0,[r8,#-72]
+	vext.32 d15,d13,d19,#1
+	vshr.u32 q4,q4,#7
+	str r1,[r8,#-68]
+	vext.32 d20,d12,d18,#1
+	vbif q6,q5,q8
+	ldr r0,[sp,#240]
+	vrev64.i32 q5,q7
+	vrev64.i32 q7,q10
+	vswp d13,d5
+	vadd.i64 q3,q3,q4
+	sub r0,r0,#192
+	cmp r12,#0
+	beq ._nomessage21
+	vld1.8 {d16-d17},[r12]!
+	veor q2,q2,q8
+	vld1.8 {d16-d17},[r12]!
+	veor q5,q5,q8
+	vld1.8 {d16-d17},[r12]!
+	veor q6,q6,q8
+	vld1.8 {d16-d17},[r12]!
+	veor q7,q7,q8
+._nomessage21:
+	vst1.8 {d4-d5},[r8]!
+	vst1.8 {d10-d11},[r8]!
+	vst1.8 {d12-d13},[r8]!
+	vst1.8 {d14-d15},[r8]!
+	str r12,[sp,#236]
+	add r14,sp,#272
+	add r12,sp,#256
+	str r8,[sp,#232]
+	cmp r0,#192
+	bhs ._mlenatleast192
+._mlenlowbelow192:
+	cmp r0,#0
+	beq ._done
+	b ._mlenatleast1
+._nextblock:
+	sub r0,r0,#64
+._mlenatleast1:
+._handleblock:
+	str r0,[sp,#248]
+	ldrd r2,[sp,#48]
+	ldrd r6,[sp,#32]
+	ldrd r0,[sp,#64]
+	ldrd r4,[sp,#72]
+	ldrd r10,[sp,#80]
+	ldrd r8,[sp,#88]
+	strd r2,[sp,#96]
+	strd r6,[sp,#104]
+	ldrd r2,[sp,#56]
+	ldrd r6,[sp,#40]
+	ldr r12,[sp,#244]
+._mainloop1:
+	str r12,[sp,#252]
+	add r12,r0,r2
+	add r14,r5,r1
+	eor r4,r4,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r4,r0
+	add r14,r7,r5
+	eor r6,r6,r12,ROR #23
+	eor r3,r3,r14,ROR #23
+	add r12,r6,r4
+	str r7,[sp,#132]
+	add r7,r3,r7
+	ldr r14,[sp,#104]
+	eor r2,r2,r12,ROR #19
+	str r6,[sp,#128]
+	eor r1,r1,r7,ROR #19
+	ldr r7,[sp,#100]
+	add r6,r2,r6
+	str r2,[sp,#120]
+	add r2,r1,r3
+	ldr r12,[sp,#96]
+	eor r0,r0,r6,ROR #14
+	str r3,[sp,#124]
+	eor r2,r5,r2,ROR #14
+	ldr r3,[sp,#108]
+	add r5,r10,r14
+	add r6,r9,r11
+	eor r8,r8,r5,ROR #25
+	eor r5,r7,r6,ROR #25
+	add r6,r8,r10
+	add r7,r5,r9
+	eor r6,r12,r6,ROR #23
+	eor r3,r3,r7,ROR #23
+	add r7,r6,r8
+	add r12,r3,r5
+	eor r7,r14,r7,ROR #19
+	eor r11,r11,r12,ROR #19
+	add r12,r7,r6
+	add r14,r11,r3
+	eor r10,r10,r12,ROR #14
+	eor r9,r9,r14,ROR #14
+	add r12,r0,r5
+	add r14,r2,r4
+	eor r1,r1,r12,ROR #25
+	eor r7,r7,r14,ROR #25
+	add r12,r1,r0
+	add r14,r7,r2
+	eor r6,r6,r12,ROR #23
+	eor r3,r3,r14,ROR #23
+	add r12,r6,r1
+	str r7,[sp,#104]
+	add r7,r3,r7
+	ldr r14,[sp,#128]
+	eor r5,r5,r12,ROR #19
+	str r3,[sp,#108]
+	eor r4,r4,r7,ROR #19
+	ldr r7,[sp,#132]
+	add r12,r5,r6
+	str r6,[sp,#96]
+	add r3,r4,r3
+	ldr r6,[sp,#120]
+	eor r0,r0,r12,ROR #14
+	str r5,[sp,#100]
+	eor r5,r2,r3,ROR #14
+	ldr r3,[sp,#124]
+	add r2,r10,r7
+	add r12,r9,r8
+	eor r11,r11,r2,ROR #25
+	eor r2,r6,r12,ROR #25
+	add r6,r11,r10
+	add r12,r2,r9
+	eor r6,r14,r6,ROR #23
+	eor r3,r3,r12,ROR #23
+	add r12,r6,r11
+	add r14,r3,r2
+	eor r7,r7,r12,ROR #19
+	eor r8,r8,r14,ROR #19
+	add r12,r7,r6
+	add r14,r8,r3
+	eor r10,r10,r12,ROR #14
+	eor r9,r9,r14,ROR #14
+	ldr r12,[sp,#252]
+	subs r12,r12,#2
+	bhi ._mainloop1
+	strd r6,[sp,#128]
+	strd r2,[sp,#120]
+	strd r10,[sp,#112]
+	strd r8,[sp,#136]
+	ldrd r2,[sp,#96]
+	ldrd r6,[sp,#104]
+	ldrd r8,[sp,#64]
+	ldrd r10,[sp,#48]
+	add r0,r0,r8
+	add r1,r1,r9
+	add r2,r2,r10
+	add r3,r3,r11
+	ldrd r8,[sp,#72]
+	ldrd r10,[sp,#32]
+	add r4,r4,r8
+	add r5,r5,r9
+	add r6,r6,r10
+	add r7,r7,r11
+	ldr r12,[sp,#236]
+	cmp r12,#0
+	beq ._nomessage10
+	ldr r8,[r12,#0]
+	ldr r9,[r12,#4]
+	ldr r10,[r12,#8]
+	ldr r11,[r12,#12]
+	eor r0,r0,r8
+	ldr r8,[r12,#16]
+	eor r1,r1,r9
+	ldr r9,[r12,#20]
+	eor r2,r2,r10
+	ldr r10,[r12,#24]
+	eor r3,r3,r11
+	ldr r11,[r12,#28]
+	eor r4,r4,r8
+	eor r5,r5,r9
+	eor r6,r6,r10
+	eor r7,r7,r11
+._nomessage10:
+	ldr r14,[sp,#232]
+	str r0,[r14,#0]
+	str r1,[r14,#4]
+	str r2,[r14,#8]
+	str r3,[r14,#12]
+	str r4,[r14,#16]
+	str r5,[r14,#20]
+	str r6,[r14,#24]
+	str r7,[r14,#28]
+	ldrd r6,[sp,#128]
+	ldrd r10,[sp,#112]
+	ldrd r0,[sp,#40]
+	ldrd r4,[sp,#80]
+	add r6,r6,r0
+	add r7,r7,r1
+	add r10,r10,r4
+	add r11,r11,r5
+	adds r0,r0,#1
+	adc r1,r1,#0
+	strd r0,[sp,#40]
+	ldrd r2,[sp,#120]
+	ldrd r8,[sp,#136]
+	ldrd r4,[sp,#56]
+	ldrd r0,[sp,#88]
+	add r2,r2,r4
+	add r3,r3,r5
+	add r0,r8,r0
+	add r1,r9,r1
+	cmp r12,#0
+	beq ._nomessage11
+	ldr r4,[r12,#32]
+	ldr r5,[r12,#36]
+	ldr r8,[r12,#40]
+	ldr r9,[r12,#44]
+	eor r6,r6,r4
+	ldr r4,[r12,#48]
+	eor r7,r7,r5
+	ldr r5,[r12,#52]
+	eor r10,r10,r8
+	ldr r8,[r12,#56]
+	eor r11,r11,r9
+	ldr r9,[r12,#60]
+	eor r2,r2,r4
+	eor r3,r3,r5
+	eor r0,r0,r8
+	eor r1,r1,r9
+	add r4,r12,#64
+	str r4,[sp,#236]
+._nomessage11:
+	str r6,[r14,#32]
+	str r7,[r14,#36]
+	str r10,[r14,#40]
+	str r11,[r14,#44]
+	str r2,[r14,#48]
+	str r3,[r14,#52]
+	str r0,[r14,#56]
+	str r1,[r14,#60]
+	add r0,r14,#64
+	str r0,[sp,#232]
+	ldr r0,[sp,#248]
+	cmp r0,#64
+	bhi ._nextblock
+._done:
+	ldr r2,[sp,#160]
+	ldrd r4,[sp,#0]
+	ldrd r6,[sp,#8]
+	ldrd r8,[sp,#16]
+	ldrd r10,[sp,#24]
+	ldr r12,[sp,#228]
+	ldr r14,[sp,#224]
+	ldrd r0,[sp,#40]
+	strd r0,[r2]
+	sub r0,r12,sp
+	mov sp,r12
+	vpop {q4,q5,q6,q7}
+	add r0,r0,#64
+	bx lr
+.size _gcry_arm_neon_salsa20_encrypt,.-_gcry_arm_neon_salsa20_encrypt;
+
+#endif
diff --git a/cipher/salsa20.c b/cipher/salsa20.c
index 892b9fc..f708b18 100644
--- a/cipher/salsa20.c
+++ b/cipher/salsa20.c
@@ -47,6 +47,15 @@
 # define USE_AMD64 1
 #endif
 
+/* USE_ARM_NEON_ASM indicates whether to enable ARM NEON assembly code. */
+#undef USE_ARM_NEON_ASM
+#if defined(HAVE_ARM_ARCH_V6) && defined(__ARMEL__)
+# if defined(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS) && \
+     defined(HAVE_GCC_INLINE_ASM_NEON)
+#  define USE_ARM_NEON_ASM 1
+# endif
+#endif
+
 
 #define SALSA20_MIN_KEY_SIZE 16  /* Bytes.  */
 #define SALSA20_MAX_KEY_SIZE 32  /* Bytes.  */
@@ -60,7 +69,16 @@
 #define SALSA20R12_ROUNDS    12
 
 
-typedef struct
+struct SALSA20_context_s;
+
+typedef unsigned int (*salsa20_core_t) (u32 *dst, struct SALSA20_context_s *ctx,
+                                        unsigned int rounds);
+typedef void (* salsa20_keysetup_t)(struct SALSA20_context_s *ctx,
+                                    const byte *key, int keylen);
+typedef void (* salsa20_ivsetup_t)(struct SALSA20_context_s *ctx,
+                                   const byte *iv);
+
+typedef struct SALSA20_context_s
 {
   /* Indices 1-4 and 11-14 hold the key (two identical copies for the
      shorter key size), indices 0, 5, 10, 15 are constant, indices 6, 7
@@ -74,6 +92,12 @@ typedef struct
   u32 input[SALSA20_INPUT_LENGTH];
   u32 pad[SALSA20_INPUT_LENGTH];
   unsigned int unused; /* bytes in the pad.  */
+#ifdef USE_ARM_NEON_ASM
+  int use_neon;
+#endif
+  salsa20_keysetup_t keysetup;
+  salsa20_ivsetup_t ivsetup;
+  salsa20_core_t core;
 } SALSA20_context_t;
 
 
@@ -113,10 +137,10 @@ salsa20_ivsetup(SALSA20_context_t *ctx, const byte *iv)
 }
 
 static unsigned int
-salsa20_core (u32 *dst, u32 *src, unsigned int rounds)
+salsa20_core (u32 *dst, SALSA20_context_t *ctx, unsigned int rounds)
 {
   memset(dst, 0, SALSA20_BLOCK_SIZE);
-  return _gcry_salsa20_amd64_encrypt_blocks(src, dst, dst, 1, rounds);
+  return _gcry_salsa20_amd64_encrypt_blocks(ctx->input, dst, dst, 1, rounds);
 }
 
 #else /* USE_AMD64 */
@@ -149,9 +173,9 @@ salsa20_core (u32 *dst, u32 *src, unsigned int rounds)
   } while(0)
 
 static unsigned int
-salsa20_core (u32 *dst, u32 *src, unsigned int rounds)
+salsa20_core (u32 *dst, SALSA20_context_t *ctx, unsigned rounds)
 {
-  u32 pad[SALSA20_INPUT_LENGTH];
+  u32 pad[SALSA20_INPUT_LENGTH], *src = ctx->input;
   unsigned int i;
 
   memcpy (pad, src, sizeof(pad));
@@ -236,6 +260,49 @@ static void salsa20_ivsetup(SALSA20_context_t *ctx, const byte *iv)
 
 #endif /*!USE_AMD64*/
 
+#ifdef USE_ARM_NEON_ASM
+
+/* ARM NEON implementation of Salsa20. */
+unsigned int
+_gcry_arm_neon_salsa20_encrypt(void *c, const void *m, unsigned int nblks,
+                               void *k, unsigned int rounds);
+
+static unsigned int
+salsa20_core_neon (u32 *dst, SALSA20_context_t *ctx, unsigned int rounds)
+{
+  return _gcry_arm_neon_salsa20_encrypt(dst, NULL, 1, ctx->input, rounds);
+}
+
+static void salsa20_ivsetup_neon(SALSA20_context_t *ctx, const byte *iv)
+{
+  memcpy(ctx->input + 8, iv, 8);
+  /* Reset the block counter.  */
+  memset(ctx->input + 10, 0, 8);
+}
+
+static void
+salsa20_keysetup_neon(SALSA20_context_t *ctx, const byte *key, int klen)
+{
+  static const unsigned char sigma32[16] = "expand 32-byte k";
+  static const unsigned char sigma16[16] = "expand 16-byte k";
+
+  if (klen == 16)
+    {
+      memcpy (ctx->input, key, 16);
+      memcpy (ctx->input + 4, key, 16); /* Duplicate 128-bit key. */
+      memcpy (ctx->input + 12, sigma16, 16);
+    }
+  else
+    {
+      /* 32-byte key */
+      memcpy (ctx->input, key, 32);
+      memcpy (ctx->input + 12, sigma32, 16);
+    }
+}
+
+#endif /*USE_ARM_NEON_ASM*/
+
+
 static gcry_err_code_t
 salsa20_do_setkey (SALSA20_context_t *ctx,
                    const byte *key, unsigned int keylen)
@@ -257,7 +324,23 @@ salsa20_do_setkey (SALSA20_context_t *ctx,
       && keylen != SALSA20_MAX_KEY_SIZE)
     return GPG_ERR_INV_KEYLEN;
 
-  salsa20_keysetup (ctx, key, keylen);
+  /* Default ops. */
+  ctx->keysetup = salsa20_keysetup;
+  ctx->ivsetup = salsa20_ivsetup;
+  ctx->core = salsa20_core;
+
+#ifdef USE_ARM_NEON_ASM
+  ctx->use_neon = (_gcry_get_hw_features () & HWF_ARM_NEON) != 0;
+  if (ctx->use_neon)
+    {
+      /* Use ARM NEON ops instead. */
+      ctx->keysetup = salsa20_keysetup_neon;
+      ctx->ivsetup = salsa20_ivsetup_neon;
+      ctx->core = salsa20_core_neon;
+    }
+#endif
+
+  ctx->keysetup (ctx, key, keylen);
 
   /* We default to a zero nonce.  */
   salsa20_setiv (ctx, NULL, 0);
@@ -290,7 +373,7 @@ salsa20_setiv (void *context, const byte *iv, unsigned int ivlen)
   else
     memcpy (tmp, iv, SALSA20_IV_SIZE);
 
-  salsa20_ivsetup (ctx, tmp);
+  ctx->ivsetup (ctx, tmp);
 
   /* Reset the unused pad bytes counter.  */
   ctx->unused = 0;
@@ -340,12 +423,24 @@ salsa20_do_encrypt_stream (SALSA20_context_t *ctx,
     }
 #endif
 
+#ifdef USE_ARM_NEON_ASM
+  if (ctx->use_neon && length >= SALSA20_BLOCK_SIZE)
+    {
+      unsigned int nblocks = length / SALSA20_BLOCK_SIZE;
+      _gcry_arm_neon_salsa20_encrypt (outbuf, inbuf, nblocks, ctx->input,
+                                      rounds);
+      length -= SALSA20_BLOCK_SIZE * nblocks;
+      outbuf += SALSA20_BLOCK_SIZE * nblocks;
+      inbuf  += SALSA20_BLOCK_SIZE * nblocks;
+    }
+#endif
+
   while (length > 0)
     {
       /* Create the next pad and bump the block counter.  Note that it
          is the user's duty to change to another nonce not later than
          after 2^70 processed bytes.  */
-      nburn = salsa20_core (ctx->pad, ctx->input, rounds);
+      nburn = ctx->core (ctx->pad, ctx, rounds);
       burn = nburn > burn ? nburn : burn;
 
       if (length <= SALSA20_BLOCK_SIZE)
@@ -386,12 +481,13 @@ salsa20r12_encrypt_stream (void *context,
 }
 
 
-
 static const char*
 selftest (void)
 {
   SALSA20_context_t ctx;
   byte scratch[8+1];
+  byte buf[256+64+4];
+  int i;
 
   static byte key_1[] =
     { 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
@@ -418,6 +514,23 @@ selftest (void)
   salsa20_encrypt_stream (&ctx, scratch, scratch, sizeof plaintext_1);
   if (memcmp (scratch, plaintext_1, sizeof plaintext_1))
     return "Salsa20 decryption test 1 failed.";
+
+  for (i = 0; i < sizeof buf; i++)
+    buf[i] = i;
+  salsa20_setkey (&ctx, key_1, sizeof key_1);
+  salsa20_setiv (&ctx, nonce_1, sizeof nonce_1);
+  /*encrypt*/
+  salsa20_encrypt_stream (&ctx, buf, buf, sizeof buf);
+  /*decrypt*/
+  salsa20_setkey (&ctx, key_1, sizeof key_1);
+  salsa20_setiv (&ctx, nonce_1, sizeof nonce_1);
+  salsa20_encrypt_stream (&ctx, buf, buf, 1);
+  salsa20_encrypt_stream (&ctx, buf+1, buf+1, (sizeof buf)-1-1);
+  salsa20_encrypt_stream (&ctx, buf+(sizeof buf)-1, buf+(sizeof buf)-1, 1);
+  for (i = 0; i < sizeof buf; i++)
+    if (buf[i] != (byte)i)
+      return "Salsa20 encryption test 2 failed.";
+
   return NULL;
 }
 
diff --git a/configure.ac b/configure.ac
index 114460c..19c97bd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1560,6 +1560,11 @@ if test "$found" = "1" ; then
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS salsa20-amd64.lo"
       ;;
    esac
+
+   if test x"$neonsupport" = xyes ; then
+     # Build with the NEON implementation
+     GCRYPT_CIPHERS="$GCRYPT_CIPHERS salsa20-armv7-neon.lo"
+   fi
 fi
 
 LIST_MEMBER(gost28147, $enabled_ciphers)

commit 5a3d43485efdc09912be0967ee0a3ce345b3b15a
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date:   Sat Oct 26 15:00:48 2013 +0300

    Add AMD64 assembly implementation of Salsa20
    
    * cipher/Makefile.am: Add 'salsa20-amd64.S'.
    * cipher/salsa20-amd64.S: New.
    * cipher/salsa20.c (USE_AMD64): New macro.
    [USE_AMD64] (_gcry_salsa20_amd64_keysetup, _gcry_salsa20_amd64_ivsetup)
    (_gcry_salsa20_amd64_encrypt_blocks): New prototypes.
    [USE_AMD64] (salsa20_keysetup, salsa20_ivsetup, salsa20_core): New.
    [!USE_AMD64] (salsa20_core): Change 'src' to non-constant, update block
    counter in 'salsa20_core' and return burn stack depth.
    [!USE_AMD64] (salsa20_keysetup, salsa20_ivsetup): New.
    (salsa20_do_setkey): Move generic key setup to 'salsa20_keysetup'.
    (salsa20_setkey): Fix burn stack depth.
    (salsa20_setiv): Move generic IV setup to 'salsa20_ivsetup'.
    (salsa20_do_encrypt_stream) [USE_AMD64]: Process large buffers in AMD64
    implementation.
    (salsa20_do_encrypt_stream): Move stack burning to this function...
    (salsa20_encrypt_stream, salsa20r12_encrypt_stream): ...from these
    functions.
    * configure.ac [x86-64]: Add 'salsa20-amd64.lo'.
    --
    
    This patch adds a fast AMD64 assembly implementation of Salsa20. The
    implementation is based on public domain code by D. J. Bernstein,
    available at http://cr.yp.to/snuffle.html (amd64-xmm6). It gains extra
    speed by processing four blocks in parallel with the help of SSE2
    instructions.
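    
    As a minimal sketch of how salsa20.c splits work between the assembly
    routine and the generic tail handling ("length is input as number of
    blocks, so don't handle tail bytes"), consider the following C
    outline. The names encrypt_blocks and encrypt_stream are hypothetical
    stand-ins, not the actual libgcrypt symbols:
    
        /* Hypothetical sketch: whole 64-byte blocks are handed to the
         * assembly routine; any tail shorter than one block is left to
         * the generic per-block loop in salsa20.c. */
        #include <stddef.h>
        
        #define BLOCK_SIZE 64
        
        /* Stand-in for _gcry_salsa20_amd64_encrypt_blocks().  */
        static unsigned int
        encrypt_blocks (unsigned int *input, const unsigned char *src,
                        unsigned char *dst, size_t nblocks, int rounds)
        {
          (void)input; (void)src; (void)dst; (void)nblocks; (void)rounds;
          return 0;   /* burn stack depth */
        }
        
        static void
        encrypt_stream (unsigned int *input, unsigned char *outbuf,
                        const unsigned char *inbuf, size_t length,
                        int rounds)
        {
          if (length >= BLOCK_SIZE)
            {
              size_t nblocks = length / BLOCK_SIZE;  /* whole blocks */
              encrypt_blocks (input, inbuf, outbuf, nblocks, rounds);
              length -= BLOCK_SIZE * nblocks;
              outbuf += BLOCK_SIZE * nblocks;
              inbuf  += BLOCK_SIZE * nblocks;
            }
          /* length is now < BLOCK_SIZE; the remaining tail bytes go
             through the pad buffer in the generic loop.  */
        }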
    
    Benchmark results on Intel Core i5-4570 (3.2 GHz):
    
    Before:
     SALSA20        |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |      3.88 ns/B     246.0 MiB/s     12.41 c/B
         STREAM dec |      3.88 ns/B     246.0 MiB/s     12.41 c/B
                    =
     SALSA20R12     |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |      2.46 ns/B     387.9 MiB/s      7.87 c/B
         STREAM dec |      2.46 ns/B     387.7 MiB/s      7.87 c/B
    
    After:
     SALSA20        |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |     0.985 ns/B     967.8 MiB/s      3.15 c/B
         STREAM dec |     0.987 ns/B     966.5 MiB/s      3.16 c/B
                    =
     SALSA20R12     |  nanosecs/byte   mebibytes/sec   cycles/byte
         STREAM enc |     0.636 ns/B    1500.5 MiB/s      2.03 c/B
         STREAM dec |     0.636 ns/B    1499.2 MiB/s      2.04 c/B
    
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index d7db933..e786713 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -71,7 +71,7 @@ md5.c \
 rijndael.c rijndael-tables.h rijndael-amd64.S rijndael-arm.S \
 rmd160.c \
 rsa.c \
-salsa20.c \
+salsa20.c salsa20-amd64.S \
 scrypt.c \
 seed.c \
 serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \
diff --git a/cipher/salsa20-amd64.S b/cipher/salsa20-amd64.S
new file mode 100644
index 0000000..691df58
--- /dev/null
+++ b/cipher/salsa20-amd64.S
@@ -0,0 +1,924 @@
+/* salsa20-amd64.S  -  AMD64 implementation of Salsa20
+ *
+ * Copyright © 2013 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Based on public domain implementation by D. J. Bernstein at
+ *  http://cr.yp.to/snuffle.html
+ */
+
+#ifdef __x86_64
+#include <config.h>
+#if defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) && defined(USE_SALSA20)
+
+.text
+
+.align 8
+.globl _gcry_salsa20_amd64_keysetup
+.type  _gcry_salsa20_amd64_keysetup, at function;
+_gcry_salsa20_amd64_keysetup:
+	movl   0(%rsi),%r8d
+	movl   4(%rsi),%r9d
+	movl   8(%rsi),%eax
+	movl   12(%rsi),%r10d
+	movl   %r8d,20(%rdi)
+	movl   %r9d,40(%rdi)
+	movl   %eax,60(%rdi)
+	movl   %r10d,48(%rdi)
+	cmp  $256,%rdx
+	jb ._kbits128
+._kbits256:
+	movl   16(%rsi),%edx
+	movl   20(%rsi),%ecx
+	movl   24(%rsi),%r8d
+	movl   28(%rsi),%esi
+	movl   %edx,28(%rdi)
+	movl   %ecx,16(%rdi)
+	movl   %r8d,36(%rdi)
+	movl   %esi,56(%rdi)
+	mov  $1634760805,%rsi
+	mov  $857760878,%rdx
+	mov  $2036477234,%rcx
+	mov  $1797285236,%r8
+	movl   %esi,0(%rdi)
+	movl   %edx,4(%rdi)
+	movl   %ecx,8(%rdi)
+	movl   %r8d,12(%rdi)
+	jmp ._keysetupdone
+._kbits128:
+	movl   0(%rsi),%edx
+	movl   4(%rsi),%ecx
+	movl   8(%rsi),%r8d
+	movl   12(%rsi),%esi
+	movl   %edx,28(%rdi)
+	movl   %ecx,16(%rdi)
+	movl   %r8d,36(%rdi)
+	movl   %esi,56(%rdi)
+	mov  $1634760805,%rsi
+	mov  $824206446,%rdx
+	mov  $2036477238,%rcx
+	mov  $1797285236,%r8
+	movl   %esi,0(%rdi)
+	movl   %edx,4(%rdi)
+	movl   %ecx,8(%rdi)
+	movl   %r8d,12(%rdi)
+._keysetupdone:
+	ret
+
+.align 8
+.globl _gcry_salsa20_amd64_ivsetup
+.type  _gcry_salsa20_amd64_ivsetup, at function;
+_gcry_salsa20_amd64_ivsetup:
+	movl   0(%rsi),%r8d
+	movl   4(%rsi),%esi
+	mov  $0,%r9
+	mov  $0,%rax
+	movl   %r8d,24(%rdi)
+	movl   %esi,44(%rdi)
+	movl   %r9d,32(%rdi)
+	movl   %eax,52(%rdi)
+	ret
+
+.align 8
+.globl _gcry_salsa20_amd64_encrypt_blocks
+.type  _gcry_salsa20_amd64_encrypt_blocks, at function;
+_gcry_salsa20_amd64_encrypt_blocks:
+	/*
+	 * Modifications to original implementation:
+	 *  - Number of rounds is passed in register %r8 (for Salsa20/12).
+	 *  - Length is input as number of blocks, so don't handle tail bytes
+	 *    (this is done in salsa20.c).
+	 */
+	push %rbx
+	shlq $6, %rcx /* blocks to bytes */
+	mov %r8, %rbx
+	mov %rsp,%r11
+	and $31,%r11
+	add $384,%r11
+	sub %r11,%rsp
+	mov  %rdi,%r8
+	mov  %rsi,%rsi
+	mov  %rdx,%rdi
+	mov  %rcx,%rdx
+	cmp  $0,%rdx
+	jbe ._done
+._start:
+	cmp  $256,%rdx
+	jb ._bytes_are_64_128_or_192
+	movdqa 0(%r8),%xmm0
+	pshufd $0x55,%xmm0,%xmm1
+	pshufd $0xaa,%xmm0,%xmm2
+	pshufd $0xff,%xmm0,%xmm3
+	pshufd $0x00,%xmm0,%xmm0
+	movdqa %xmm1,0(%rsp)
+	movdqa %xmm2,16(%rsp)
+	movdqa %xmm3,32(%rsp)
+	movdqa %xmm0,48(%rsp)
+	movdqa 16(%r8),%xmm0
+	pshufd $0xaa,%xmm0,%xmm1
+	pshufd $0xff,%xmm0,%xmm2
+	pshufd $0x00,%xmm0,%xmm3
+	pshufd $0x55,%xmm0,%xmm0
+	movdqa %xmm1,64(%rsp)
+	movdqa %xmm2,80(%rsp)
+	movdqa %xmm3,96(%rsp)
+	movdqa %xmm0,112(%rsp)
+	movdqa 32(%r8),%xmm0
+	pshufd $0xff,%xmm0,%xmm1
+	pshufd $0x55,%xmm0,%xmm2
+	pshufd $0xaa,%xmm0,%xmm0
+	movdqa %xmm1,128(%rsp)
+	movdqa %xmm2,144(%rsp)
+	movdqa %xmm0,160(%rsp)
+	movdqa 48(%r8),%xmm0
+	pshufd $0x00,%xmm0,%xmm1
+	pshufd $0xaa,%xmm0,%xmm2
+	pshufd $0xff,%xmm0,%xmm0
+	movdqa %xmm1,176(%rsp)
+	movdqa %xmm2,192(%rsp)
+	movdqa %xmm0,208(%rsp)
+._bytesatleast256:
+	movl   32(%r8),%ecx
+	movl   52(%r8),%r9d
+	movl %ecx,224(%rsp)
+	movl %r9d,240(%rsp)
+	add  $1,%ecx
+	adc  $0,%r9d
+	movl %ecx,4+224(%rsp)
+	movl %r9d,4+240(%rsp)
+	add  $1,%ecx
+	adc  $0,%r9d
+	movl %ecx,8+224(%rsp)
+	movl %r9d,8+240(%rsp)
+	add  $1,%ecx
+	adc  $0,%r9d
+	movl %ecx,12+224(%rsp)
+	movl %r9d,12+240(%rsp)
+	add  $1,%ecx
+	adc  $0,%r9d
+	movl   %ecx,32(%r8)
+	movl   %r9d,52(%r8)
+	movq %rdx,288(%rsp)
+	mov  %rbx,%rdx
+	movdqa 0(%rsp),%xmm0
+	movdqa 16(%rsp),%xmm1
+	movdqa 32(%rsp),%xmm2
+	movdqa 192(%rsp),%xmm3
+	movdqa 208(%rsp),%xmm4
+	movdqa 64(%rsp),%xmm5
+	movdqa 80(%rsp),%xmm6
+	movdqa 112(%rsp),%xmm7
+	movdqa 128(%rsp),%xmm8
+	movdqa 144(%rsp),%xmm9
+	movdqa 160(%rsp),%xmm10
+	movdqa 240(%rsp),%xmm11
+	movdqa 48(%rsp),%xmm12
+	movdqa 96(%rsp),%xmm13
+	movdqa 176(%rsp),%xmm14
+	movdqa 224(%rsp),%xmm15
+._mainloop1:
+	movdqa %xmm1,256(%rsp)
+	movdqa %xmm2,272(%rsp)
+	movdqa %xmm13,%xmm1
+	paddd %xmm12,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $7,%xmm1
+	pxor  %xmm1,%xmm14
+	psrld $25,%xmm2
+	pxor  %xmm2,%xmm14
+	movdqa %xmm7,%xmm1
+	paddd %xmm0,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $7,%xmm1
+	pxor  %xmm1,%xmm11
+	psrld $25,%xmm2
+	pxor  %xmm2,%xmm11
+	movdqa %xmm12,%xmm1
+	paddd %xmm14,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $9,%xmm1
+	pxor  %xmm1,%xmm15
+	psrld $23,%xmm2
+	pxor  %xmm2,%xmm15
+	movdqa %xmm0,%xmm1
+	paddd %xmm11,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $9,%xmm1
+	pxor  %xmm1,%xmm9
+	psrld $23,%xmm2
+	pxor  %xmm2,%xmm9
+	movdqa %xmm14,%xmm1
+	paddd %xmm15,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $13,%xmm1
+	pxor  %xmm1,%xmm13
+	psrld $19,%xmm2
+	pxor  %xmm2,%xmm13
+	movdqa %xmm11,%xmm1
+	paddd %xmm9,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $13,%xmm1
+	pxor  %xmm1,%xmm7
+	psrld $19,%xmm2
+	pxor  %xmm2,%xmm7
+	movdqa %xmm15,%xmm1
+	paddd %xmm13,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $18,%xmm1
+	pxor  %xmm1,%xmm12
+	psrld $14,%xmm2
+	pxor  %xmm2,%xmm12
+	movdqa 256(%rsp),%xmm1
+	movdqa %xmm12,256(%rsp)
+	movdqa %xmm9,%xmm2
+	paddd %xmm7,%xmm2
+	movdqa %xmm2,%xmm12
+	pslld $18,%xmm2
+	pxor  %xmm2,%xmm0
+	psrld $14,%xmm12
+	pxor  %xmm12,%xmm0
+	movdqa %xmm5,%xmm2
+	paddd %xmm1,%xmm2
+	movdqa %xmm2,%xmm12
+	pslld $7,%xmm2
+	pxor  %xmm2,%xmm3
+	psrld $25,%xmm12
+	pxor  %xmm12,%xmm3
+	movdqa 272(%rsp),%xmm2
+	movdqa %xmm0,272(%rsp)
+	movdqa %xmm6,%xmm0
+	paddd %xmm2,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $7,%xmm0
+	pxor  %xmm0,%xmm4
+	psrld $25,%xmm12
+	pxor  %xmm12,%xmm4
+	movdqa %xmm1,%xmm0
+	paddd %xmm3,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $9,%xmm0
+	pxor  %xmm0,%xmm10
+	psrld $23,%xmm12
+	pxor  %xmm12,%xmm10
+	movdqa %xmm2,%xmm0
+	paddd %xmm4,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $9,%xmm0
+	pxor  %xmm0,%xmm8
+	psrld $23,%xmm12
+	pxor  %xmm12,%xmm8
+	movdqa %xmm3,%xmm0
+	paddd %xmm10,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $13,%xmm0
+	pxor  %xmm0,%xmm5
+	psrld $19,%xmm12
+	pxor  %xmm12,%xmm5
+	movdqa %xmm4,%xmm0
+	paddd %xmm8,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $13,%xmm0
+	pxor  %xmm0,%xmm6
+	psrld $19,%xmm12
+	pxor  %xmm12,%xmm6
+	movdqa %xmm10,%xmm0
+	paddd %xmm5,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $18,%xmm0
+	pxor  %xmm0,%xmm1
+	psrld $14,%xmm12
+	pxor  %xmm12,%xmm1
+	movdqa 256(%rsp),%xmm0
+	movdqa %xmm1,256(%rsp)
+	movdqa %xmm4,%xmm1
+	paddd %xmm0,%xmm1
+	movdqa %xmm1,%xmm12
+	pslld $7,%xmm1
+	pxor  %xmm1,%xmm7
+	psrld $25,%xmm12
+	pxor  %xmm12,%xmm7
+	movdqa %xmm8,%xmm1
+	paddd %xmm6,%xmm1
+	movdqa %xmm1,%xmm12
+	pslld $18,%xmm1
+	pxor  %xmm1,%xmm2
+	psrld $14,%xmm12
+	pxor  %xmm12,%xmm2
+	movdqa 272(%rsp),%xmm12
+	movdqa %xmm2,272(%rsp)
+	movdqa %xmm14,%xmm1
+	paddd %xmm12,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $7,%xmm1
+	pxor  %xmm1,%xmm5
+	psrld $25,%xmm2
+	pxor  %xmm2,%xmm5
+	movdqa %xmm0,%xmm1
+	paddd %xmm7,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $9,%xmm1
+	pxor  %xmm1,%xmm10
+	psrld $23,%xmm2
+	pxor  %xmm2,%xmm10
+	movdqa %xmm12,%xmm1
+	paddd %xmm5,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $9,%xmm1
+	pxor  %xmm1,%xmm8
+	psrld $23,%xmm2
+	pxor  %xmm2,%xmm8
+	movdqa %xmm7,%xmm1
+	paddd %xmm10,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $13,%xmm1
+	pxor  %xmm1,%xmm4
+	psrld $19,%xmm2
+	pxor  %xmm2,%xmm4
+	movdqa %xmm5,%xmm1
+	paddd %xmm8,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $13,%xmm1
+	pxor  %xmm1,%xmm14
+	psrld $19,%xmm2
+	pxor  %xmm2,%xmm14
+	movdqa %xmm10,%xmm1
+	paddd %xmm4,%xmm1
+	movdqa %xmm1,%xmm2
+	pslld $18,%xmm1
+	pxor  %xmm1,%xmm0
+	psrld $14,%xmm2
+	pxor  %xmm2,%xmm0
+	movdqa 256(%rsp),%xmm1
+	movdqa %xmm0,256(%rsp)
+	movdqa %xmm8,%xmm0
+	paddd %xmm14,%xmm0
+	movdqa %xmm0,%xmm2
+	pslld $18,%xmm0
+	pxor  %xmm0,%xmm12
+	psrld $14,%xmm2
+	pxor  %xmm2,%xmm12
+	movdqa %xmm11,%xmm0
+	paddd %xmm1,%xmm0
+	movdqa %xmm0,%xmm2
+	pslld $7,%xmm0
+	pxor  %xmm0,%xmm6
+	psrld $25,%xmm2
+	pxor  %xmm2,%xmm6
+	movdqa 272(%rsp),%xmm2
+	movdqa %xmm12,272(%rsp)
+	movdqa %xmm3,%xmm0
+	paddd %xmm2,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $7,%xmm0
+	pxor  %xmm0,%xmm13
+	psrld $25,%xmm12
+	pxor  %xmm12,%xmm13
+	movdqa %xmm1,%xmm0
+	paddd %xmm6,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $9,%xmm0
+	pxor  %xmm0,%xmm15
+	psrld $23,%xmm12
+	pxor  %xmm12,%xmm15
+	movdqa %xmm2,%xmm0
+	paddd %xmm13,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $9,%xmm0
+	pxor  %xmm0,%xmm9
+	psrld $23,%xmm12
+	pxor  %xmm12,%xmm9
+	movdqa %xmm6,%xmm0
+	paddd %xmm15,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $13,%xmm0
+	pxor  %xmm0,%xmm11
+	psrld $19,%xmm12
+	pxor  %xmm12,%xmm11
+	movdqa %xmm13,%xmm0
+	paddd %xmm9,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $13,%xmm0
+	pxor  %xmm0,%xmm3
+	psrld $19,%xmm12
+	pxor  %xmm12,%xmm3
+	movdqa %xmm15,%xmm0
+	paddd %xmm11,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $18,%xmm0
+	pxor  %xmm0,%xmm1
+	psrld $14,%xmm12
+	pxor  %xmm12,%xmm1
+	movdqa %xmm9,%xmm0
+	paddd %xmm3,%xmm0
+	movdqa %xmm0,%xmm12
+	pslld $18,%xmm0
+	pxor  %xmm0,%xmm2
+	psrld $14,%xmm12
+	pxor  %xmm12,%xmm2
+	movdqa 256(%rsp),%xmm12
+	movdqa 272(%rsp),%xmm0
+	sub  $2,%rdx
+	ja ._mainloop1
+	paddd 48(%rsp),%xmm12
+	paddd 112(%rsp),%xmm7
+	paddd 160(%rsp),%xmm10
+	paddd 208(%rsp),%xmm4
+	movd   %xmm12,%rdx
+	movd   %xmm7,%rcx
+	movd   %xmm10,%r9
+	movd   %xmm4,%rax
+	pshufd $0x39,%xmm12,%xmm12
+	pshufd $0x39,%xmm7,%xmm7
+	pshufd $0x39,%xmm10,%xmm10
+	pshufd $0x39,%xmm4,%xmm4
+	xorl 0(%rsi),%edx
+	xorl 4(%rsi),%ecx
+	xorl 8(%rsi),%r9d
+	xorl 12(%rsi),%eax
+	movl   %edx,0(%rdi)
+	movl   %ecx,4(%rdi)
+	movl   %r9d,8(%rdi)
+	movl   %eax,12(%rdi)
+	movd   %xmm12,%rdx
+	movd   %xmm7,%rcx
+	movd   %xmm10,%r9
+	movd   %xmm4,%rax
+	pshufd $0x39,%xmm12,%xmm12
+	pshufd $0x39,%xmm7,%xmm7
+	pshufd $0x39,%xmm10,%xmm10
+	pshufd $0x39,%xmm4,%xmm4
+	xorl 64(%rsi),%edx
+	xorl 68(%rsi),%ecx
+	xorl 72(%rsi),%r9d
+	xorl 76(%rsi),%eax
+	movl   %edx,64(%rdi)
+	movl   %ecx,68(%rdi)
+	movl   %r9d,72(%rdi)
+	movl   %eax,76(%rdi)
+	movd   %xmm12,%rdx
+	movd   %xmm7,%rcx
+	movd   %xmm10,%r9
+	movd   %xmm4,%rax
+	pshufd $0x39,%xmm12,%xmm12
+	pshufd $0x39,%xmm7,%xmm7
+	pshufd $0x39,%xmm10,%xmm10
+	pshufd $0x39,%xmm4,%xmm4
+	xorl 128(%rsi),%edx
+	xorl 132(%rsi),%ecx
+	xorl 136(%rsi),%r9d
+	xorl 140(%rsi),%eax
+	movl   %edx,128(%rdi)
+	movl   %ecx,132(%rdi)
+	movl   %r9d,136(%rdi)
+	movl   %eax,140(%rdi)
+	movd   %xmm12,%rdx
+	movd   %xmm7,%rcx
+	movd   %xmm10,%r9
+	movd   %xmm4,%rax
+	xorl 192(%rsi),%edx
+	xorl 196(%rsi),%ecx
+	xorl 200(%rsi),%r9d
+	xorl 204(%rsi),%eax
+	movl   %edx,192(%rdi)
+	movl   %ecx,196(%rdi)
+	movl   %r9d,200(%rdi)
+	movl   %eax,204(%rdi)
+	paddd 176(%rsp),%xmm14
+	paddd 0(%rsp),%xmm0
+	paddd 64(%rsp),%xmm5
+	paddd 128(%rsp),%xmm8
+	movd   %xmm14,%rdx
+	movd   %xmm0,%rcx
+	movd   %xmm5,%r9
+	movd   %xmm8,%rax
+	pshufd $0x39,%xmm14,%xmm14
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm5,%xmm5
+	pshufd $0x39,%xmm8,%xmm8
+	xorl 16(%rsi),%edx
+	xorl 20(%rsi),%ecx
+	xorl 24(%rsi),%r9d
+	xorl 28(%rsi),%eax
+	movl   %edx,16(%rdi)
+	movl   %ecx,20(%rdi)
+	movl   %r9d,24(%rdi)
+	movl   %eax,28(%rdi)
+	movd   %xmm14,%rdx
+	movd   %xmm0,%rcx
+	movd   %xmm5,%r9
+	movd   %xmm8,%rax
+	pshufd $0x39,%xmm14,%xmm14
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm5,%xmm5
+	pshufd $0x39,%xmm8,%xmm8
+	xorl 80(%rsi),%edx
+	xorl 84(%rsi),%ecx
+	xorl 88(%rsi),%r9d
+	xorl 92(%rsi),%eax
+	movl   %edx,80(%rdi)
+	movl   %ecx,84(%rdi)
+	movl   %r9d,88(%rdi)
+	movl   %eax,92(%rdi)
+	movd   %xmm14,%rdx
+	movd   %xmm0,%rcx
+	movd   %xmm5,%r9
+	movd   %xmm8,%rax
+	pshufd $0x39,%xmm14,%xmm14
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm5,%xmm5
+	pshufd $0x39,%xmm8,%xmm8
+	xorl 144(%rsi),%edx
+	xorl 148(%rsi),%ecx
+	xorl 152(%rsi),%r9d
+	xorl 156(%rsi),%eax
+	movl   %edx,144(%rdi)
+	movl   %ecx,148(%rdi)
+	movl   %r9d,152(%rdi)
+	movl   %eax,156(%rdi)
+	movd   %xmm14,%rdx
+	movd   %xmm0,%rcx
+	movd   %xmm5,%r9
+	movd   %xmm8,%rax
+	xorl 208(%rsi),%edx
+	xorl 212(%rsi),%ecx
+	xorl 216(%rsi),%r9d
+	xorl 220(%rsi),%eax
+	movl   %edx,208(%rdi)
+	movl   %ecx,212(%rdi)
+	movl   %r9d,216(%rdi)
+	movl   %eax,220(%rdi)
+	paddd 224(%rsp),%xmm15
+	paddd 240(%rsp),%xmm11
+	paddd 16(%rsp),%xmm1
+	paddd 80(%rsp),%xmm6
+	movd   %xmm15,%rdx
+	movd   %xmm11,%rcx
+	movd   %xmm1,%r9
+	movd   %xmm6,%rax
+	pshufd $0x39,%xmm15,%xmm15
+	pshufd $0x39,%xmm11,%xmm11
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm6,%xmm6
+	xorl 32(%rsi),%edx
+	xorl 36(%rsi),%ecx
+	xorl 40(%rsi),%r9d
+	xorl 44(%rsi),%eax
+	movl   %edx,32(%rdi)
+	movl   %ecx,36(%rdi)
+	movl   %r9d,40(%rdi)
+	movl   %eax,44(%rdi)
+	movd   %xmm15,%rdx
+	movd   %xmm11,%rcx
+	movd   %xmm1,%r9
+	movd   %xmm6,%rax
+	pshufd $0x39,%xmm15,%xmm15
+	pshufd $0x39,%xmm11,%xmm11
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm6,%xmm6
+	xorl 96(%rsi),%edx
+	xorl 100(%rsi),%ecx
+	xorl 104(%rsi),%r9d
+	xorl 108(%rsi),%eax
+	movl   %edx,96(%rdi)
+	movl   %ecx,100(%rdi)
+	movl   %r9d,104(%rdi)
+	movl   %eax,108(%rdi)
+	movd   %xmm15,%rdx
+	movd   %xmm11,%rcx
+	movd   %xmm1,%r9
+	movd   %xmm6,%rax
+	pshufd $0x39,%xmm15,%xmm15
+	pshufd $0x39,%xmm11,%xmm11
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm6,%xmm6
+	xorl 160(%rsi),%edx
+	xorl 164(%rsi),%ecx
+	xorl 168(%rsi),%r9d
+	xorl 172(%rsi),%eax
+	movl   %edx,160(%rdi)
+	movl   %ecx,164(%rdi)
+	movl   %r9d,168(%rdi)
+	movl   %eax,172(%rdi)
+	movd   %xmm15,%rdx
+	movd   %xmm11,%rcx
+	movd   %xmm1,%r9
+	movd   %xmm6,%rax
+	xorl 224(%rsi),%edx
+	xorl 228(%rsi),%ecx
+	xorl 232(%rsi),%r9d
+	xorl 236(%rsi),%eax
+	movl   %edx,224(%rdi)
+	movl   %ecx,228(%rdi)
+	movl   %r9d,232(%rdi)
+	movl   %eax,236(%rdi)
+	paddd 96(%rsp),%xmm13
+	paddd 144(%rsp),%xmm9
+	paddd 192(%rsp),%xmm3
+	paddd 32(%rsp),%xmm2
+	movd   %xmm13,%rdx
+	movd   %xmm9,%rcx
+	movd   %xmm3,%r9
+	movd   %xmm2,%rax
+	pshufd $0x39,%xmm13,%xmm13
+	pshufd $0x39,%xmm9,%xmm9
+	pshufd $0x39,%xmm3,%xmm3
+	pshufd $0x39,%xmm2,%xmm2
+	xorl 48(%rsi),%edx
+	xorl 52(%rsi),%ecx
+	xorl 56(%rsi),%r9d
+	xorl 60(%rsi),%eax
+	movl   %edx,48(%rdi)
+	movl   %ecx,52(%rdi)
+	movl   %r9d,56(%rdi)
+	movl   %eax,60(%rdi)
+	movd   %xmm13,%rdx
+	movd   %xmm9,%rcx
+	movd   %xmm3,%r9
+	movd   %xmm2,%rax
+	pshufd $0x39,%xmm13,%xmm13
+	pshufd $0x39,%xmm9,%xmm9
+	pshufd $0x39,%xmm3,%xmm3
+	pshufd $0x39,%xmm2,%xmm2
+	xorl 112(%rsi),%edx
+	xorl 116(%rsi),%ecx
+	xorl 120(%rsi),%r9d
+	xorl 124(%rsi),%eax
+	movl   %edx,112(%rdi)
+	movl   %ecx,116(%rdi)
+	movl   %r9d,120(%rdi)
+	movl   %eax,124(%rdi)
+	movd   %xmm13,%rdx
+	movd   %xmm9,%rcx
+	movd   %xmm3,%r9
+	movd   %xmm2,%rax
+	pshufd $0x39,%xmm13,%xmm13
+	pshufd $0x39,%xmm9,%xmm9
+	pshufd $0x39,%xmm3,%xmm3
+	pshufd $0x39,%xmm2,%xmm2
+	xorl 176(%rsi),%edx
+	xorl 180(%rsi),%ecx
+	xorl 184(%rsi),%r9d
+	xorl 188(%rsi),%eax
+	movl   %edx,176(%rdi)
+	movl   %ecx,180(%rdi)
+	movl   %r9d,184(%rdi)
+	movl   %eax,188(%rdi)
+	movd   %xmm13,%rdx
+	movd   %xmm9,%rcx
+	movd   %xmm3,%r9
+	movd   %xmm2,%rax
+	xorl 240(%rsi),%edx
+	xorl 244(%rsi),%ecx
+	xorl 248(%rsi),%r9d
+	xorl 252(%rsi),%eax
+	movl   %edx,240(%rdi)
+	movl   %ecx,244(%rdi)
+	movl   %r9d,248(%rdi)
+	movl   %eax,252(%rdi)
+	movq 288(%rsp),%rdx
+	sub  $256,%rdx
+	add  $256,%rsi
+	add  $256,%rdi
+	cmp  $256,%rdx
+	jae ._bytesatleast256
+	cmp  $0,%rdx
+	jbe ._done
+._bytes_are_64_128_or_192:
+	movq %rdx,288(%rsp)
+	movdqa 0(%r8),%xmm0
+	movdqa 16(%r8),%xmm1
+	movdqa 32(%r8),%xmm2
+	movdqa 48(%r8),%xmm3
+	movdqa %xmm1,%xmm4
+	mov  %rbx,%rdx
+._mainloop2:
+	paddd %xmm0,%xmm4
+	movdqa %xmm0,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $7,%xmm4
+	psrld $25,%xmm6
+	pxor  %xmm4,%xmm3
+	pxor  %xmm6,%xmm3
+	paddd %xmm3,%xmm5
+	movdqa %xmm3,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $9,%xmm5
+	psrld $23,%xmm6
+	pxor  %xmm5,%xmm2
+	pshufd $0x93,%xmm3,%xmm3
+	pxor  %xmm6,%xmm2
+	paddd %xmm2,%xmm4
+	movdqa %xmm2,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $13,%xmm4
+	psrld $19,%xmm6
+	pxor  %xmm4,%xmm1
+	pshufd $0x4e,%xmm2,%xmm2
+	pxor  %xmm6,%xmm1
+	paddd %xmm1,%xmm5
+	movdqa %xmm3,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $18,%xmm5
+	psrld $14,%xmm6
+	pxor  %xmm5,%xmm0
+	pshufd $0x39,%xmm1,%xmm1
+	pxor  %xmm6,%xmm0
+	paddd %xmm0,%xmm4
+	movdqa %xmm0,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $7,%xmm4
+	psrld $25,%xmm6
+	pxor  %xmm4,%xmm1
+	pxor  %xmm6,%xmm1
+	paddd %xmm1,%xmm5
+	movdqa %xmm1,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $9,%xmm5
+	psrld $23,%xmm6
+	pxor  %xmm5,%xmm2
+	pshufd $0x93,%xmm1,%xmm1
+	pxor  %xmm6,%xmm2
+	paddd %xmm2,%xmm4
+	movdqa %xmm2,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $13,%xmm4
+	psrld $19,%xmm6
+	pxor  %xmm4,%xmm3
+	pshufd $0x4e,%xmm2,%xmm2
+	pxor  %xmm6,%xmm3
+	paddd %xmm3,%xmm5
+	movdqa %xmm1,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $18,%xmm5
+	psrld $14,%xmm6
+	pxor  %xmm5,%xmm0
+	pshufd $0x39,%xmm3,%xmm3
+	pxor  %xmm6,%xmm0
+	paddd %xmm0,%xmm4
+	movdqa %xmm0,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $7,%xmm4
+	psrld $25,%xmm6
+	pxor  %xmm4,%xmm3
+	pxor  %xmm6,%xmm3
+	paddd %xmm3,%xmm5
+	movdqa %xmm3,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $9,%xmm5
+	psrld $23,%xmm6
+	pxor  %xmm5,%xmm2
+	pshufd $0x93,%xmm3,%xmm3
+	pxor  %xmm6,%xmm2
+	paddd %xmm2,%xmm4
+	movdqa %xmm2,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $13,%xmm4
+	psrld $19,%xmm6
+	pxor  %xmm4,%xmm1
+	pshufd $0x4e,%xmm2,%xmm2
+	pxor  %xmm6,%xmm1
+	paddd %xmm1,%xmm5
+	movdqa %xmm3,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $18,%xmm5
+	psrld $14,%xmm6
+	pxor  %xmm5,%xmm0
+	pshufd $0x39,%xmm1,%xmm1
+	pxor  %xmm6,%xmm0
+	paddd %xmm0,%xmm4
+	movdqa %xmm0,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $7,%xmm4
+	psrld $25,%xmm6
+	pxor  %xmm4,%xmm1
+	pxor  %xmm6,%xmm1
+	paddd %xmm1,%xmm5
+	movdqa %xmm1,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $9,%xmm5
+	psrld $23,%xmm6
+	pxor  %xmm5,%xmm2
+	pshufd $0x93,%xmm1,%xmm1
+	pxor  %xmm6,%xmm2
+	paddd %xmm2,%xmm4
+	movdqa %xmm2,%xmm5
+	movdqa %xmm4,%xmm6
+	pslld $13,%xmm4
+	psrld $19,%xmm6
+	pxor  %xmm4,%xmm3
+	pshufd $0x4e,%xmm2,%xmm2
+	pxor  %xmm6,%xmm3
+	sub  $4,%rdx
+	paddd %xmm3,%xmm5
+	movdqa %xmm1,%xmm4
+	movdqa %xmm5,%xmm6
+	pslld $18,%xmm5
+	pxor   %xmm7,%xmm7
+	psrld $14,%xmm6
+	pxor  %xmm5,%xmm0
+	pshufd $0x39,%xmm3,%xmm3
+	pxor  %xmm6,%xmm0
+	ja ._mainloop2
+	paddd 0(%r8),%xmm0
+	paddd 16(%r8),%xmm1
+	paddd 32(%r8),%xmm2
+	paddd 48(%r8),%xmm3
+	movd   %xmm0,%rdx
+	movd   %xmm1,%rcx
+	movd   %xmm2,%rax
+	movd   %xmm3,%r10
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm2,%xmm2
+	pshufd $0x39,%xmm3,%xmm3
+	xorl 0(%rsi),%edx
+	xorl 48(%rsi),%ecx
+	xorl 32(%rsi),%eax
+	xorl 16(%rsi),%r10d
+	movl   %edx,0(%rdi)
+	movl   %ecx,48(%rdi)
+	movl   %eax,32(%rdi)
+	movl   %r10d,16(%rdi)
+	movd   %xmm0,%rdx
+	movd   %xmm1,%rcx
+	movd   %xmm2,%rax
+	movd   %xmm3,%r10
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm2,%xmm2
+	pshufd $0x39,%xmm3,%xmm3
+	xorl 20(%rsi),%edx
+	xorl 4(%rsi),%ecx
+	xorl 52(%rsi),%eax
+	xorl 36(%rsi),%r10d
+	movl   %edx,20(%rdi)
+	movl   %ecx,4(%rdi)
+	movl   %eax,52(%rdi)
+	movl   %r10d,36(%rdi)
+	movd   %xmm0,%rdx
+	movd   %xmm1,%rcx
+	movd   %xmm2,%rax
+	movd   %xmm3,%r10
+	pshufd $0x39,%xmm0,%xmm0
+	pshufd $0x39,%xmm1,%xmm1
+	pshufd $0x39,%xmm2,%xmm2
+	pshufd $0x39,%xmm3,%xmm3
+	xorl 40(%rsi),%edx
+	xorl 24(%rsi),%ecx
+	xorl 8(%rsi),%eax
+	xorl 56(%rsi),%r10d
+	movl   %edx,40(%rdi)
+	movl   %ecx,24(%rdi)
+	movl   %eax,8(%rdi)
+	movl   %r10d,56(%rdi)
+	movd   %xmm0,%rdx
+	movd   %xmm1,%rcx
+	movd   %xmm2,%rax
+	movd   %xmm3,%r10
+	xorl 60(%rsi),%edx
+	xorl 44(%rsi),%ecx
+	xorl 28(%rsi),%eax
+	xorl 12(%rsi),%r10d
+	movl   %edx,60(%rdi)
+	movl   %ecx,44(%rdi)
+	movl   %eax,28(%rdi)
+	movl   %r10d,12(%rdi)
+	movq 288(%rsp),%rdx
+	movl   32(%r8),%ecx
+	movl   52(%r8),%eax
+	add  $1,%ecx
+	adc  $0,%eax
+	movl   %ecx,32(%r8)
+	movl   %eax,52(%r8)
+	cmp  $64,%rdx
+	ja ._bytes_are_128_or_192
+._done:
+	add %r11,%rsp
+	mov %r11,%rax
+	pop %rbx
+	ret
+._bytes_are_128_or_192:
+	sub  $64,%rdx
+	add  $64,%rdi
+	add  $64,%rsi
+	jmp ._bytes_are_64_128_or_192
+.size _gcry_salsa20_amd64_encrypt_blocks,.-_gcry_salsa20_amd64_encrypt_blocks;
+
+#endif /*defined(USE_SALSA20)*/
+#endif /*__x86_64*/
diff --git a/cipher/salsa20.c b/cipher/salsa20.c
index 6189bca..892b9fc 100644
--- a/cipher/salsa20.c
+++ b/cipher/salsa20.c
@@ -40,6 +40,14 @@
 #include "cipher.h"
 #include "bufhelp.h"
 
+
+/* USE_AMD64 indicates whether to compile with AMD64 code. */
+#undef USE_AMD64
+#if defined(__x86_64__) && defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)
+# define USE_AMD64 1
+#endif
+
+
 #define SALSA20_MIN_KEY_SIZE 16  /* Bytes.  */
 #define SALSA20_MAX_KEY_SIZE 32  /* Bytes.  */
 #define SALSA20_BLOCK_SIZE   64  /* Bytes.  */
@@ -83,6 +91,36 @@ typedef struct
 static void salsa20_setiv (void *context, const byte *iv, unsigned int ivlen);
 static const char *selftest (void);
 
+
+#ifdef USE_AMD64
+/* AMD64 assembly implementations of Salsa20. */
+void _gcry_salsa20_amd64_keysetup(u32 *ctxinput, const void *key, int keybits);
+void _gcry_salsa20_amd64_ivsetup(u32 *ctxinput, const void *iv);
+unsigned int
+_gcry_salsa20_amd64_encrypt_blocks(u32 *ctxinput, const void *src, void *dst,
+                                   size_t len, int rounds);
+
+static void
+salsa20_keysetup(SALSA20_context_t *ctx, const byte *key, int keylen)
+{
+  _gcry_salsa20_amd64_keysetup(ctx->input, key, keylen * 8);
+}
+
+static void
+salsa20_ivsetup(SALSA20_context_t *ctx, const byte *iv)
+{
+  _gcry_salsa20_amd64_ivsetup(ctx->input, iv);
+}
+
+static unsigned int
+salsa20_core (u32 *dst, u32 *src, unsigned int rounds)
+{
+  memset(dst, 0, SALSA20_BLOCK_SIZE);
+  return _gcry_salsa20_amd64_encrypt_blocks(src, dst, dst, 1, rounds);
+}
+
+#else /* USE_AMD64 */
+
 

 
 #if 0
@@ -110,8 +148,8 @@ static const char *selftest (void);
     x0 ^= ROTL32 (18, x3 + x2);	    \
   } while(0)
 
-static void
-salsa20_core (u32 *dst, const u32 *src, unsigned rounds)
+static unsigned int
+salsa20_core (u32 *dst, u32 *src, unsigned int rounds)
 {
   u32 pad[SALSA20_INPUT_LENGTH];
   unsigned int i;
@@ -138,31 +176,24 @@ salsa20_core (u32 *dst, const u32 *src, unsigned rounds)
       u32 t = pad[i] + src[i];
       dst[i] = LE_SWAP32 (t);
     }
+
+  /* Update counter. */
+  if (!++src[8])
+    src[9]++;
+
+  /* burn_stack */
+  return ( 3*sizeof (void*) \
+         + 2*sizeof (void*) \
+         + 64 \
+         + sizeof (unsigned int) \
+         + sizeof (u32) );
 }
 #undef QROUND
 #undef SALSA20_CORE_DEBUG
 
-static gcry_err_code_t
-salsa20_do_setkey (SALSA20_context_t *ctx,
-                   const byte *key, unsigned int keylen)
+static void
+salsa20_keysetup(SALSA20_context_t *ctx, const byte *key, int keylen)
 {
-  static int initialized;
-  static const char *selftest_failed;
-
-  if (!initialized )
-    {
-      initialized = 1;
-      selftest_failed = selftest ();
-      if (selftest_failed)
-        log_error ("SALSA20 selftest failed (%s)\n", selftest_failed );
-    }
-  if (selftest_failed)
-    return GPG_ERR_SELFTEST_FAILED;
-
-  if (keylen != SALSA20_MIN_KEY_SIZE
-      && keylen != SALSA20_MAX_KEY_SIZE)
-    return GPG_ERR_INV_KEYLEN;
-
   /* These constants are the little endian encoding of the string
      "expand 32-byte k".  For the 128 bit variant, the "32" in that
      string will be fixed up to "16".  */
@@ -192,6 +223,41 @@ salsa20_do_setkey (SALSA20_context_t *ctx,
       ctx->input[5]  -= 0x02000000; /* Change to "1 dn".  */
       ctx->input[10] += 0x00000004; /* Change to "yb-6".  */
     }
+}
+
+static void salsa20_ivsetup(SALSA20_context_t *ctx, const byte *iv)
+{
+  ctx->input[6] = LE_READ_UINT32(iv + 0);
+  ctx->input[7] = LE_READ_UINT32(iv + 4);
+  /* Reset the block counter.  */
+  ctx->input[8] = 0;
+  ctx->input[9] = 0;
+}
+
+#endif /*!USE_AMD64*/
+
+static gcry_err_code_t
+salsa20_do_setkey (SALSA20_context_t *ctx,
+                   const byte *key, unsigned int keylen)
+{
+  static int initialized;
+  static const char *selftest_failed;
+
+  if (!initialized )
+    {
+      initialized = 1;
+      selftest_failed = selftest ();
+      if (selftest_failed)
+        log_error ("SALSA20 selftest failed (%s)\n", selftest_failed );
+    }
+  if (selftest_failed)
+    return GPG_ERR_SELFTEST_FAILED;
+
+  if (keylen != SALSA20_MIN_KEY_SIZE
+      && keylen != SALSA20_MAX_KEY_SIZE)
+    return GPG_ERR_INV_KEYLEN;
+
+  salsa20_keysetup (ctx, key, keylen);
 
   /* We default to a zero nonce.  */
   salsa20_setiv (ctx, NULL, 0);
@@ -205,7 +271,7 @@ salsa20_setkey (void *context, const byte *key, unsigned int keylen)
 {
   SALSA20_context_t *ctx = (SALSA20_context_t *)context;
   gcry_err_code_t rc = salsa20_do_setkey (ctx, key, keylen);
-  _gcry_burn_stack (300/* FIXME*/);
+  _gcry_burn_stack (4 + sizeof (void *) + 4 * sizeof (void *));
   return rc;
 }
 
@@ -214,28 +280,22 @@ static void
 salsa20_setiv (void *context, const byte *iv, unsigned int ivlen)
 {
   SALSA20_context_t *ctx = (SALSA20_context_t *)context;
+  byte tmp[SALSA20_IV_SIZE];
 
-  if (!iv)
-    {
-      ctx->input[6] = 0;
-      ctx->input[7] = 0;
-    }
-  else if (ivlen == SALSA20_IV_SIZE)
-    {
-      ctx->input[6] = LE_READ_UINT32(iv + 0);
-      ctx->input[7] = LE_READ_UINT32(iv + 4);
-    }
+  if (iv && ivlen != SALSA20_IV_SIZE)
+    log_info ("WARNING: salsa20_setiv: bad ivlen=%u\n", ivlen);
+
+  if (!iv || ivlen != SALSA20_IV_SIZE)
+    memset (tmp, 0, sizeof(tmp));
   else
-    {
-      log_info ("WARNING: salsa20_setiv: bad ivlen=%u\n", ivlen);
-      ctx->input[6] = 0;
-      ctx->input[7] = 0;
-    }
-  /* Reset the block counter.  */
-  ctx->input[8] = 0;
-  ctx->input[9] = 0;
+    memcpy (tmp, iv, SALSA20_IV_SIZE);
+
+  salsa20_ivsetup (ctx, tmp);
+
   /* Reset the unused pad bytes counter.  */
   ctx->unused = 0;
+
+  wipememory (tmp, sizeof(tmp));
 }
 
 
@@ -246,6 +306,8 @@ salsa20_do_encrypt_stream (SALSA20_context_t *ctx,
                            byte *outbuf, const byte *inbuf,
                            unsigned int length, unsigned rounds)
 {
+  unsigned int nburn, burn = 0;
+
   if (ctx->unused)
     {
       unsigned char *p = (void*)ctx->pad;
@@ -266,26 +328,39 @@ salsa20_do_encrypt_stream (SALSA20_context_t *ctx,
       gcry_assert (!ctx->unused);
     }
 
-  for (;;)
+#ifdef USE_AMD64
+  if (length >= SALSA20_BLOCK_SIZE)
+    {
+      unsigned int nblocks = length / SALSA20_BLOCK_SIZE;
+      burn = _gcry_salsa20_amd64_encrypt_blocks(ctx->input, inbuf, outbuf,
+                                                nblocks, rounds);
+      length -= SALSA20_BLOCK_SIZE * nblocks;
+      outbuf += SALSA20_BLOCK_SIZE * nblocks;
+      inbuf  += SALSA20_BLOCK_SIZE * nblocks;
+    }
+#endif
+
+  while (length > 0)
     {
       /* Create the next pad and bump the block counter.  Note that it
          is the user's duty to change to another nonce not later than
          after 2^70 processed bytes.  */
-      salsa20_core (ctx->pad, ctx->input, rounds);
-      if (!++ctx->input[8])
-        ctx->input[9]++;
+      nburn = salsa20_core (ctx->pad, ctx->input, rounds);
+      burn = nburn > burn ? nburn : burn;
 
       if (length <= SALSA20_BLOCK_SIZE)
 	{
 	  buf_xor (outbuf, inbuf, ctx->pad, length);
           ctx->unused = SALSA20_BLOCK_SIZE - length;
-	  return;
+	  break;
 	}
       buf_xor (outbuf, inbuf, ctx->pad, SALSA20_BLOCK_SIZE);
       length -= SALSA20_BLOCK_SIZE;
       outbuf += SALSA20_BLOCK_SIZE;
       inbuf  += SALSA20_BLOCK_SIZE;
-  }
+    }
+
+  _gcry_burn_stack (burn);
 }
 
 
@@ -296,19 +371,7 @@ salsa20_encrypt_stream (void *context,
   SALSA20_context_t *ctx = (SALSA20_context_t *)context;
 
   if (length)
-    {
-      salsa20_do_encrypt_stream (ctx, outbuf, inbuf, length, SALSA20_ROUNDS);
-      _gcry_burn_stack (/* salsa20_do_encrypt_stream: */
-                        2*sizeof (void*)
-                        + 3*sizeof (void*) + sizeof (unsigned int)
-                        /* salsa20_core: */
-                        + 2*sizeof (void*)
-                        + 2*sizeof (void*)
-                        + 64
-                        + sizeof (unsigned int)
-                        + sizeof (u32)
-                        );
-    }
+    salsa20_do_encrypt_stream (ctx, outbuf, inbuf, length, SALSA20_ROUNDS);
 }
 
 
@@ -319,19 +382,7 @@ salsa20r12_encrypt_stream (void *context,
   SALSA20_context_t *ctx = (SALSA20_context_t *)context;
 
   if (length)
-    {
-      salsa20_do_encrypt_stream (ctx, outbuf, inbuf, length, SALSA20R12_ROUNDS);
-      _gcry_burn_stack (/* salsa20_do_encrypt_stream: */
-                        2*sizeof (void*)
-                        + 3*sizeof (void*) + sizeof (unsigned int)
-                        /* salsa20_core: */
-                        + 2*sizeof (void*)
-                        + 2*sizeof (void*)
-                        + 64
-                        + sizeof (unsigned int)
-                        + sizeof (u32)
-                        );
-    }
+    salsa20_do_encrypt_stream (ctx, outbuf, inbuf, length, SALSA20R12_ROUNDS);
 }
 
 
diff --git a/configure.ac b/configure.ac
index 5b7ba0d..114460c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1553,6 +1553,13 @@ LIST_MEMBER(salsa20, $enabled_ciphers)
 if test "$found" = "1" ; then
    GCRYPT_CIPHERS="$GCRYPT_CIPHERS salsa20.lo"
    AC_DEFINE(USE_SALSA20, 1, [Defined if this module should be included])
+
+   case "${host}" in
+      x86_64-*-*)
+         # Build with the assembly implementation
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS salsa20-amd64.lo"
+      ;;
+   esac
 fi
 
 LIST_MEMBER(gost28147, $enabled_ciphers)

-----------------------------------------------------------------------

Summary of changes:
 cipher/Makefile.am          |    2 +-
 cipher/salsa20-amd64.S      |  924 +++++++++++++++++++++++++++++++++++++++++++
 cipher/salsa20-armv7-neon.S |  899 +++++++++++++++++++++++++++++++++++++++++
 cipher/salsa20.c            |  316 +++++++++++----
 configure.ac                |   12 +
 5 files changed, 2076 insertions(+), 77 deletions(-)
 create mode 100644 cipher/salsa20-amd64.S
 create mode 100644 cipher/salsa20-armv7-neon.S


hooks/post-receive
-- 
The GNU crypto library
http://git.gnupg.org

