From cvs at cvs.gnupg.org  Sat Feb  3 13:00:29 2018
From: cvs at cvs.gnupg.org (by Jussi Kivilinna)
Date: Sat, 03 Feb 2018 13:00:29 +0100
Subject: [git] GCRYPT - branch, master, updated. libgcrypt-1.8.1-41-gffdc6f3
Message-ID:

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".

The branch, master has been updated
       via  ffdc6f3623a0bcb41324d562340b2cd1c288e387 (commit)
      from  0b55f349a8b8f4b0ac9ed724c2d5b8dcc9f5401c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit ffdc6f3623a0bcb41324d562340b2cd1c288e387
Author: Jussi Kivilinna
Date:   Wed Jan 31 20:02:48 2018 +0200

    Fix incorrect counter overflow handling for GCM

    * cipher/cipher-gcm.c (gcm_ctr_encrypt): New function to handle
    32-bit CTR increment for GCM.
    (_gcry_cipher_gcm_encrypt, _gcry_cipher_gcm_decrypt): Do not use
    generic CTR implementation directly, use gcm_ctr_encrypt instead.
    * tests/basic.c (_check_gcm_cipher): Add test-vectors for 32-bit CTR
    overflow.
    (check_gcm_cipher): Add 'split input to 15 bytes and 17 bytes'
    test-runs.
    --
    Reported-by: Clemens Lang
    > I believe we have found what seems to be a bug in counter overflow
    > handling in AES-GCM in libgcrypt's implementation. This leads to
    > incorrect results when using a non-12-byte IV and decrypting payloads
    > encrypted with other AES-GCM implementations, such as OpenSSL.
    >
    > According to the NIST Special Publication 800-38D "Recommendation for
    > Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC",
    > section 7.1, algorithm 4, step 3 [NIST38D], the counter increment is
    > defined as inc_32. Section 6.2 of the same document defines the
    > incrementing function inc_s for positive integers s as follows:
    >
    > | the function increments the right-most s bits of the string, regarded
    > | as the binary representation of an integer, modulo 2^s; the remaining,
    > | left-most len(X) - s bits remain unchanged
    >
    > (X is the complete counter value in this case)
    >
    > This problem does not occur when using a 12-byte IV, because AES-GCM has
    > a special case for the initial counter value with 12-byte IVs:
    >
    > | If len(IV)=96, then J_0 = IV || 0^31 || 1
    >
    > i.e., one would have to encrypt (UINT_MAX - 1) * blocksize of data to
    > hit an overflow. However, for non-12-byte IVs, the initial counter value
    > is the output of a hash function, which makes hitting an overflow much
    > more likely.
    >
    > In practice, we have found that using
    >
    > iv = 9e 79 18 8c ff 09 56 1e c9 90 99 cc 6d 5d f6 d3
    > key = 26 56 e5 73 76 03 c6 95 0d 22 07 31 5d 32 5c 6b a5 54 5f 40 23 98 60 f6 f7 06 6f 7a 4f c2 ca 40
    >
    > will reliably trigger an overflow when encrypting 10 MiB of data. It
    > seems that this is caused by re-using the AES-CTR implementation for
    > incrementing the counter.

    Bug was introduced by commit bd4bd23a2511a4bce63c3217cca0d4ecf0c79532
    "GCM: Use counter mode code for speed-up".
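To make the reported behaviour concrete, here is a small standalone C sketch
(illustrative only, not libgcrypt code; the helper names and demo values are
made up) contrasting the inc_32 increment required by SP 800-38D with the
full-width increment that a generic CTR implementation performs:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* inc_32: increment only the right-most 32 bits of a 16-byte, big-endian
 * counter block, modulo 2^32; the left-most 96 bits stay unchanged, as
 * GCM requires. */
static void inc_32 (uint8_t ctr[16])
{
  uint32_t lo = ((uint32_t)ctr[12] << 24) | ((uint32_t)ctr[13] << 16)
              | ((uint32_t)ctr[14] << 8)  |  (uint32_t)ctr[15];
  lo += 1;                      /* wraps modulo 2^32 */
  ctr[12] = lo >> 24;
  ctr[13] = lo >> 16;
  ctr[14] = lo >> 8;
  ctr[15] = lo;
}

/* Full 128-bit increment, as a generic CTR mode does it: a carry out of
 * the low 32 bits propagates into the upper 96 bits.  This is the
 * behaviour that leaked into GCM and caused the reported bug. */
static void inc_128 (uint8_t ctr[16])
{
  int i;
  for (i = 15; i >= 0; i--)
    if (++ctr[i] != 0)
      break;
}

int main (void)
{
  uint8_t a[16], b[16];

  memset (a, 0xaa, 12);          /* arbitrary upper 96 bits          */
  memset (a + 12, 0xff, 4);      /* low 32 bits at the wrap-around   */
  memcpy (b, a, 16);

  inc_32 (a);                    /* low word wraps, upper part intact */
  inc_128 (b);                   /* carry ripples into byte 11        */

  printf ("counters diverge: %s\n", memcmp (a, b, 16) ? "yes" : "no");
  return 0;
}

At the 2^32 boundary the two counters diverge, which is exactly the condition
the new gcm_ctr_encrypt() in the diff below detects so that it can restore the
upper 96 bits after the generic CTR code has run.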
GnuPG-bug-id: 3764 Signed-off-by: Jussi Kivilinna diff --git a/cipher/cipher-gcm.c b/cipher/cipher-gcm.c index 2b8b454..6169d14 100644 --- a/cipher/cipher-gcm.c +++ b/cipher/cipher-gcm.c @@ -1,6 +1,6 @@ /* cipher-gcm.c - Generic Galois Counter Mode implementation * Copyright (C) 2013 Dmitry Eremin-Solenikov - * Copyright (C) 2013 Jussi Kivilinna + * Copyright (C) 2013, 2018 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -556,6 +556,77 @@ do_ghash_buf(gcry_cipher_hd_t c, byte *hash, const byte *buf, } +static gcry_err_code_t +gcm_ctr_encrypt (gcry_cipher_hd_t c, byte *outbuf, size_t outbuflen, + const byte *inbuf, size_t inbuflen) +{ + gcry_err_code_t err = 0; + + while (inbuflen) + { + u32 nblocks_to_overflow; + u32 num_ctr_increments; + u32 curr_ctr_low; + size_t currlen = inbuflen; + byte ctr_copy[GCRY_GCM_BLOCK_LEN]; + int fix_ctr = 0; + + /* GCM CTR increments only least significant 32-bits, without carry + * to upper 96-bits of counter. Using generic CTR implementation + * directly would carry 32-bit overflow to upper 96-bit. Detect + * if input length is long enough to cause overflow, and limit + * input length so that CTR overflow happen but updated CTR value is + * not used to encrypt further input. After overflow, upper 96 bits + * of CTR are restored to cancel out modification done by generic CTR + * encryption. */ + + if (inbuflen > c->unused) + { + curr_ctr_low = gcm_add32_be128 (c->u_ctr.ctr, 0); + + /* Number of CTR increments this inbuflen would cause. */ + num_ctr_increments = (inbuflen - c->unused) / GCRY_GCM_BLOCK_LEN + + !!((inbuflen - c->unused) % GCRY_GCM_BLOCK_LEN); + + if ((u32)(num_ctr_increments + curr_ctr_low) < curr_ctr_low) + { + nblocks_to_overflow = 0xffffffffU - curr_ctr_low + 1; + currlen = nblocks_to_overflow * GCRY_GCM_BLOCK_LEN + c->unused; + if (currlen > inbuflen) + { + currlen = inbuflen; + } + + fix_ctr = 1; + buf_cpy(ctr_copy, c->u_ctr.ctr, GCRY_GCM_BLOCK_LEN); + } + } + + err = _gcry_cipher_ctr_encrypt(c, outbuf, outbuflen, inbuf, currlen); + if (err != 0) + return err; + + if (fix_ctr) + { + /* Lower 32-bits of CTR should now be zero. */ + gcry_assert(gcm_add32_be128 (c->u_ctr.ctr, 0) == 0); + + /* Restore upper part of CTR. 
*/ + buf_cpy(c->u_ctr.ctr, ctr_copy, GCRY_GCM_BLOCK_LEN - sizeof(u32)); + + wipememory(ctr_copy, sizeof(ctr_copy)); + } + + inbuflen -= currlen; + inbuf += currlen; + outbuflen -= currlen; + outbuf += currlen; + } + + return err; +} + + gcry_err_code_t _gcry_cipher_gcm_encrypt (gcry_cipher_hd_t c, byte *outbuf, size_t outbuflen, @@ -595,7 +666,7 @@ _gcry_cipher_gcm_encrypt (gcry_cipher_hd_t c, return GPG_ERR_INV_LENGTH; } - err = _gcry_cipher_ctr_encrypt(c, outbuf, outbuflen, inbuf, inbuflen); + err = gcm_ctr_encrypt(c, outbuf, outbuflen, inbuf, inbuflen); if (err != 0) return err; @@ -642,7 +713,7 @@ _gcry_cipher_gcm_decrypt (gcry_cipher_hd_t c, do_ghash_buf(c, c->u_mode.gcm.u_tag.tag, inbuf, inbuflen, 0); - return _gcry_cipher_ctr_encrypt(c, outbuf, outbuflen, inbuf, inbuflen); + return gcm_ctr_encrypt(c, outbuf, outbuflen, inbuf, inbuflen); } diff --git a/tests/basic.c b/tests/basic.c index c883eb3..42ee819 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -1347,6 +1347,7 @@ check_ofb_cipher (void) static void _check_gcm_cipher (unsigned int step) { +#define MAX_GCM_DATA_LEN (256 + 32) static const struct tv { int algo; @@ -1355,9 +1356,9 @@ _check_gcm_cipher (unsigned int step) int ivlen; unsigned char aad[MAX_DATA_LEN]; int aadlen; - unsigned char plaintext[MAX_DATA_LEN]; + unsigned char plaintext[MAX_GCM_DATA_LEN]; int inlen; - char out[MAX_DATA_LEN]; + char out[MAX_GCM_DATA_LEN]; char tag[MAX_DATA_LEN]; int taglen; int should_fail; @@ -1551,11 +1552,687 @@ _check_gcm_cipher (unsigned int step) "\xee\xb2\xb2\x2a\xaf\xde\x64\x19\xa0\x58\xab\x4f\x6f\x74\x6b\xf4" "\x0f\xc0\xc3\xb7\x80\xf2\x44\x45\x2d\xa3\xeb\xf1\xc5\xd8\x2c\xde" "\xa2\x41\x89\x97\x20\x0e\xf8\x2e\x44\xae\x7e\x3f", - "\xa4\x4a\x82\x66\xee\x1c\x8e\xb0\xc8\xb5\xd4\xcf\x5a\xe9\xf1\x9a" } + "\xa4\x4a\x82\x66\xee\x1c\x8e\xb0\xc8\xb5\xd4\xcf\x5a\xe9\xf1\x9a" }, + /* Test vectors for overflowing CTR. 
*/ + /* After setiv, ctr_low: 0xffffffff */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x86\xdd\x40\xe7", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x7d\x6e\x38\xfd\xd0\x04\x9d\x28\xdf\x4a\x10\x3f\xa3\x9e\xf8\xf8" + "\x6c\x2c\x10\xa7\x91\xab\xc0\x86\xd4\x6d\x69\xea\x58\xc4\xf9\xc0" + "\xd4\xee\xc2\xb0\x9d\x36\xae\xe7\xc9\xa9\x1f\x71\xa8\xee\xa2\x1d" + "\x20\xfd\x79\xc7\xd9\xc4\x90\x51\x38\x97\xb6\x9f\x55\xea\xf3\xf0" + "\x78\xb4\xd3\x8c\xa9\x9b\x32\x7d\x19\x36\x96\xbc\x8e\xab\x80\x9f" + "\x61\x56\xcc\xbd\x3a\x80\xc6\x69\x37\x0a\x89\x89\x21\x82\xb7\x79" + "\x6d\xe9\xb4\x34\xc4\x31\xe0\xbe\x71\xad\xf3\x50\x05\xb2\x61\xab" + "\xb3\x1a\x80\x57\xcf\xe1\x11\x26\xcb\xa9\xd1\xf6\x58\x46\xf1\x69" + "\xa2\xb8\x42\x3c\xe8\x28\x13\xca\x58\xd9\x28\x99\xf8\xc8\x17\x32" + "\x4a\xf9\xb3\x4c\x7a\x47\xad\xe4\x77\x64\xec\x70\xa1\x01\x0b\x88" + "\xe7\x30\x0b\xbd\x66\x25\x39\x1e\x51\x67\xee\xec\xdf\xb8\x24\x5d" + "\x7f\xcb\xee\x7a\x4e\xa9\x93\xf0\xa1\x84\x7b\xfe\x5a\xe3\x86\xb2" + "\xfb\xcd\x39\xe7\x1e\x5e\x48\x65\x4b\x50\x2b\x4a\x99\x46\x3f\x6f" + "\xdb\xd9\x97\xdb\xe5\x6d\xa4\xdd\x6c\x18\x64\x5e\xae\x7e\x2c\xd3" + "\xb4\xf3\x57\x5c\xb5\xf8\x7f\xe5\x87\xb5\x35\xdb\x80\x38\x6e\x2c" + "\x5c\xdd\xeb\x7c\x63\xac\xe4\xb5\x5a\x6a\x40\x6d\x72\x69\x9a\xa9" + "\x8f\x5e\x93\x91\x4d\xce\xeb\x87\xf5\x25\xed\x75\x6b\x3b\x1a\xf2" + "\x0c\xd2\xa4\x10\x45\xd2\x87\xae\x29\x6d\xeb\xea\x66\x5f\xa0\xc2", + "\x8c\x22\xe3\xda\x9d\x94\x8a\xbe\x8a\xbc\x55\x2c\x94\x63\x44\x40" }, + /* After setiv, ctr_low: 0xfffffffe */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x8d\xd1\xc1\xdf", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\xac\x6a\x10\x3f\xe2\x8d\xed\x27\x55\x14\xca\x1f\x03\x67\x0a\xa8" + "\xa1\x07\xbf\x00\x73\x5b\x64\xef\xac\x30\x83\x81\x48\x4c\xaa\xd5" + "\xff\xca\xef\x2f\x77\xbe\xfe\x1b\x20\x5c\x86\x19\xc7\xf9\x11\x99" + "\x27\xc5\x57\xa7\x0a\xc2\xa8\x05\xd9\x07\x2b\xb9\x38\xa4\xef\x58" + "\x92\x74\xcf\x89\xc7\xba\xfc\xb9\x70\xac\x86\xe2\x31\xba\x7c\xf9" + "\xc4\xe2\xe0\x4c\x1b\xe4\x3f\x75\x83\x5c\x40\x0e\xa4\x13\x8b\x04" + "\x60\x78\x57\x29\xbb\xe6\x61\x93\xe3\x16\xf9\x58\x07\x75\xd0\x96" + "\xfb\x8f\x6d\x1e\x49\x0f\xd5\x31\x9e\xee\x31\xe6\x0a\x85\x93\x49" + "\x22\xcf\xd6\x1b\x40\x44\x63\x9c\x95\xaf\xf0\x44\x23\x51\x37\x92" + "\x0d\xa0\x22\x37\xb9\x6d\x13\xf9\x78\xba\x27\x27\xed\x08\x7e\x35" + "\xe4\xe2\x28\xeb\x0e\xbe\x3d\xce\x89\x93\x35\x84\x0f\xa0\xf9\x8d" + "\x94\xe9\x5a\xec\xd4\x0d\x1f\x5c\xbe\x6f\x8e\x6a\x4d\x10\x65\xbb" + "\xc7\x0b\xa0\xd5\x5c\x20\x80\x0b\x4a\x43\xa6\xe1\xb0\xe0\x56\x6a" + "\xde\x90\xe0\x6a\x45\xe7\xc2\xd2\x69\x9b\xc6\x62\x11\xe3\x2b\xa5" + "\x45\x98\xb0\x80\xd3\x57\x4d\x1f\x09\x83\x58\xd4\x4d\xa6\xc5\x95" + "\x87\x59\xb0\x58\x6c\x81\x49\xc5\x95\x18\x23\x1b\x6f\x10\x86\xa2" + "\xd9\x56\x19\x30\xec\xd3\x4a\x4b\xe8\x1c\x11\x37\xfb\x31\x60\x4d" + "\x4f\x9b\xc4\x95\xba\xda\x49\x43\x6c\xc7\x3d\x5b\x13\xf9\x91\xf8", + "\xcd\x2b\x83\xd5\x5b\x5a\x8e\x0b\x2e\x77\x0d\x97\xbf\xf7\xaa\xab" }, + /* After setiv, ctr_low: 0xfffffffd */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x76\x8c\x18\x92", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x3d\x6f\x4e\xf6\xd2\x6f\x4e\xce\xa6\xb4\x4a\x9e\xcb\x57\x13\x90" + "\x51\x3b\xf6\xb2\x40\x55\x0c\x2c\xa2\x85\x44\x72\xf2\x90\xaf\x6b" + "\x86\x8c\x75\x2a\x9c\xd6\x52\x50\xee\xc6\x5f\x59\xbc\x8d\x18\xd7" + "\x87\xa5\x7f\xa0\x13\xd1\x5d\x54\x77\x30\xe2\x5d\x1b\x4f\x87\x9f" + "\x3a\x41\xcb\x6a\xdf\x44\x4f\xa2\x1a\xbc\xfb\x4b\x16\x67\xed\x59" + "\x65\xf0\x77\x48\xca\xfd\xf0\xb6\x90\x65\xca\x23\x09\xca\x83\x43" + "\x8f\xf0\x78\xb4\x5f\x96\x2a\xfd\x29\xae\xda\x62\x85\xc5\x87\x4b" + "\x2a\x3f\xba\xbe\x15\x5e\xb0\x4e\x8e\xe7\x66\xae\xb4\x80\x66\x90" + "\x10\x9d\x81\xb9\x64\xd3\x36\x00\xb2\x95\xa8\x7d\xaf\x54\xf8\xbd" + "\x8f\x7a\xb1\xa1\xde\x09\x0d\x10\xc8\x8e\x1e\x18\x2c\x1e\x73\x71" + "\x2f\x1e\xfd\x16\x6e\xbe\xe1\x3e\xe5\xb4\xb5\xbf\x03\x63\xf4\x5a" + "\x0d\xeb\xff\xe0\x61\x80\x67\x51\xb4\xa3\x1f\x18\xa5\xa9\xf1\x9a" + "\xeb\x2a\x7f\x56\xb6\x01\x88\x82\x78\xdb\xec\xb7\x92\xfd\xef\x56" + "\x55\xd3\x72\x35\xcd\xa4\x0d\x19\x6a\xb6\x79\x91\xd5\xcb\x0e\x3b" + "\xfb\xea\xa3\x55\x9f\x77\xfb\x75\xc2\x3e\x09\x02\x73\x7a\xff\x0e" + "\xa5\xf0\x83\x11\xeb\xe7\xff\x3b\xd0\xfd\x7a\x07\x53\x63\x43\x89" + "\xf5\x7b\xc4\x7d\x3b\x2c\x9b\xca\x1c\xf6\xb2\xab\x13\xf5\xc4\x2a" + "\xbf\x46\x77\x3b\x09\xdd\xd1\x80\xef\x55\x11\x3e\xd8\xe4\x42\x22", + "\xa3\x86\xa1\x5f\xe3\x4f\x3b\xed\x12\x23\xeb\x5c\xb8\x0c\xad\x4a" }, + /* After setiv, ctr_low: 0xfffffffc */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x9b\xc8\xc3\xaf", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x33\x5f\xdc\x8d\x5d\x77\x7b\x78\xc1\x5b\x7b\xb3\xd9\x08\x9a\x0c" + "\xce\x63\x4e\xef\x19\xf8\x8c\x7a\xcb\x31\x39\x93\x69\x7a\x2c\x97" + "\x3a\xb4\x52\x45\x9e\x7b\x78\xbc\xa9\xad\x54\x7f\x88\xa6\xae\xd5" + "\xc0\x8b\x7a\xe4\x23\x6b\xb2\x29\x98\xea\x25\x7a\xae\x11\x0c\xc9" + "\xf3\x77\xa1\x74\x82\xde\x0c\xec\x68\xce\x94\xfd\xb0\xa0\xc5\x32" + "\xd6\xbb\xc3\xe7\xed\x3c\x6f\x0b\x53\x9d\xf3\xc8\xeb\x4e\xee\x99" + "\x19\xc7\x16\xd1\xa5\x59\x1d\xa9\xd3\xe6\x43\x52\x74\x61\x28\xe6" + 
"\xac\xd8\x47\x63\xc2\xb7\x53\x39\xc1\x9a\xb0\xa3\xa4\x26\x14\xd0" + "\x88\xa9\x8c\xc5\x6d\xe9\x21\x7c\xb9\xa5\xab\x67\xe3\x8d\xe9\x1d" + "\xe3\x1c\x7b\xcd\xa4\x12\x0c\xd7\xa6\x5d\x41\xcf\xdd\x3d\xfc\xbc" + "\x2a\xbb\xa2\x7a\x9c\x4b\x3a\x42\x6c\x98\x1d\x50\x99\x9c\xfb\xda" + "\x21\x09\x2a\x31\xff\x05\xeb\xa5\xf1\xba\x65\x78\xbe\x15\x8e\x84" + "\x35\xdd\x45\x29\xcc\xcd\x32\x2d\x27\xe9\xa8\x94\x4b\x16\x16\xcc" + "\xab\xf2\xec\xfb\xa0\xb5\x9d\x39\x81\x3e\xec\x5e\x3d\x13\xd1\x83" + "\x04\x79\x2d\xbb\x2c\x76\x76\x93\x28\x77\x27\x13\xdd\x1d\x3e\x89" + "\x3e\x37\x46\x4c\xb8\x34\xbe\xbf\x9f\x4f\x9f\x37\xff\x0c\xe6\x14" + "\x14\x66\x52\x41\x18\xa9\x39\x2b\x0c\xe5\x44\x04\xb0\x93\x06\x64" + "\x67\xf7\xa0\x19\xa7\x61\xcf\x03\x7b\xcb\xc8\xb3\x88\x28\xe4\xe7", + "\xe6\xe8\x0a\xe3\x72\xfc\xe0\x07\x69\x09\xf2\xeb\xbc\xc8\x6a\xf0" }, + /* After setiv, ctr_low: 0xfffffffb */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x60\x95\x1a\xe2", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\xd8\x32\x5a\xe3\x55\x8e\xb3\xc2\x51\x84\x2b\x09\x01\x5e\x6c\xfb" + "\x4a\xc4\x88\xa0\x33\xe7\x3e\xbf\xe5\x7c\xd2\x00\x4c\x1a\x85\x32" + "\x34\xec\x38\x9d\x18\x5f\xf1\x50\x61\x82\xee\xf3\x84\x5a\x84\x4e" + "\xeb\x29\x08\x4c\x7b\xb5\x27\xec\x7d\x79\x77\xd7\xa1\x68\x91\x32" + "\x2d\xf3\x38\xa9\xd6\x27\x16\xfb\x7d\x8b\x09\x5e\xcf\x1b\x74\x6d" + "\xcf\x51\x91\x91\xa1\xe7\x40\x19\x43\x7b\x0d\xa5\xa9\xa5\xf4\x2e" + "\x7f\x1c\xc7\xba\xa2\xea\x00\xdd\x24\x01\xa8\x66\x1e\x88\xf1\xf6" + "\x0c\x9a\xd6\x2b\xda\x3f\x3e\xb2\x98\xea\x89\xc7\xc6\x63\x27\xb7" + "\x6a\x48\x9a\xee\x1e\x70\xa0\xc8\xec\x3d\xc3\x3e\xb5\xf0\xc2\xb1" + "\xb9\x71\x1a\x69\x9d\xdd\x72\x1e\xfe\x72\xa0\x21\xb8\x9f\x18\x96" + "\x26\xcf\x89\x2e\x92\xf1\x02\x65\xa5\xb4\x2e\xb7\x4e\x12\xbd\xa0" + "\x48\xbe\xf6\x5c\xef\x7e\xf3\x0a\xcf\x9d\x1f\x1e\x14\x70\x3e\xa0" + "\x01\x0f\x14\xbf\x38\x10\x3a\x3f\x3f\xc2\x76\xe0\xb0\xe0\x7c\xc6" + "\x77\x6d\x7f\x69\x8e\xa0\x4b\x00\xc3\x9d\xf9\x0b\x7f\x8a\x8e\xd3" + "\x17\x58\x40\xfe\xaf\xf4\x16\x3a\x65\xff\xce\x85\xbb\x80\xfa\xb8" + "\x34\xc9\xef\x3a\xdd\x04\x46\xca\x8f\x70\x48\xbc\x1c\x71\x4d\x6a" + "\x17\x30\x32\x87\x2e\x2e\x54\x9e\x3f\x15\xed\x17\xd7\xa1\xcf\x6c" + 
"\x5d\x0f\x3c\xee\xf5\x96\xf1\x8f\x68\x1c\xbc\x27\xdc\x10\x3c\x3c", + "\x8c\x31\x06\xbb\xf8\x18\x2d\x9d\xd1\x0d\x03\x56\x2b\x28\x25\x9b" }, + /* After setiv, ctr_low: 0xfffffffa */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x6b\x99\x9b\xda", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x7a\x74\x57\xe7\xc1\xb8\x7e\xcf\x91\x98\xf4\x1a\xa4\xdb\x4d\x2c" + "\x6e\xdc\x05\x0b\xd1\x16\xdf\x25\xa8\x1e\x42\xa6\xf9\x09\x36\xfb" + "\x02\x8a\x10\x7d\xa1\x07\x88\x40\xb7\x41\xfd\x64\xf6\xe3\x92\x20" + "\xfd\xc9\xde\xbd\x88\x46\xd3\x1f\x20\x14\x73\x86\x09\xb6\x68\x61" + "\x64\x90\xda\x24\xa8\x0f\x6a\x10\xc5\x01\xbf\x52\x8a\xee\x23\x44" + "\xd5\xb0\xd8\x68\x5e\x77\xc3\x62\xed\xcb\x3c\x1b\x0c\x1f\x13\x92" + "\x2c\x74\x6d\xee\x40\x1b\x6b\xfe\xbe\x3c\xb8\x02\xdd\x24\x9d\xd3" + "\x3d\x4e\xd3\x9b\x18\xfd\xd6\x8f\x95\xef\xa3\xbf\xa9\x2f\x33\xa8" + "\xc2\x37\x69\x58\x92\x42\x3a\x30\x46\x12\x1b\x2c\x04\xf0\xbf\xa9" + "\x79\x55\xcd\xac\x45\x36\x79\xc0\xb4\xb2\x5f\x82\x88\x49\xe8\xa3" + "\xbf\x33\x41\x7a\xcb\xc4\x11\x0e\xcc\x61\xed\xd1\x6b\x59\x5f\x9d" + "\x20\x6f\x85\x01\xd0\x16\x2a\x51\x1b\x79\x35\x42\x5e\x49\xdf\x6f" + "\x64\x68\x31\xac\x49\x34\xfb\x2b\xbd\xb1\xd9\x12\x4e\x4b\x16\xc5" + "\xa6\xfe\x15\xd3\xaf\xac\x51\x08\x95\x1f\x8c\xd2\x52\x37\x8b\x88" + "\xf3\x20\xe2\xf7\x09\x55\x82\x83\x1c\x38\x5f\x17\xfc\x37\x26\x21" + "\xb8\xf1\xfe\xa9\xac\x54\x1e\x53\x83\x53\x3f\x43\xe4\x67\x22\xd5" + "\x86\xec\xf2\xb6\x4a\x8b\x8a\x66\xea\xe0\x92\x50\x3b\x51\xe4\x00" + "\x25\x2a\x7a\x64\x14\xd6\x09\xe1\x6c\x75\x32\x28\x53\x5e\xb3\xab", + "\x5d\x4b\xb2\x8f\xfe\xa5\x7f\x01\x6d\x78\x6c\x13\x58\x08\xe4\x94" }, + /* After setiv, ctr_low: 0xfffffff9 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x90\xc4\x42\x97", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\xf5\xc1\xed\xb8\x7f\x55\x7b\xb5\x47\xed\xaa\x42\xd2\xda\x33\x41" + "\x4a\xe0\x36\x6d\x51\x28\x40\x9c\x35\xfb\x11\x65\x18\x83\x9c\xb5" + "\x02\xb2\xa7\xe5\x52\x27\xa4\xe8\x57\x3d\xb3\xf5\xea\xcb\x21\x07" + "\x67\xbe\xbe\x0f\xf6\xaa\x32\xa1\x4b\x5e\x79\x4f\x50\x67\xcd\x80" + "\xfc\xf1\x65\xf2\x6c\xd0\xdb\x17\xcc\xf9\x52\x93\xfd\x5e\xa6\xb9" + "\x5c\x9f\xa8\xc6\x36\xb7\x80\x80\x6a\xea\x62\xdc\x61\x13\x45\xbe" + "\xab\x8f\xd8\x99\x17\x51\x9b\x29\x04\x6e\xdb\x3e\x9f\x83\xc6\x35" + "\xb3\x90\xce\xcc\x74\xec\xcb\x04\x41\xac\xb1\x92\xde\x20\xb1\x67" + "\xb0\x38\x14\xaa\x7d\xee\x3c\xb2\xd3\xbb\x2f\x88\x0b\x73\xcf\x7b" + "\x69\xc1\x55\x5b\x2b\xf2\xd4\x38\x2b\x3c\xef\x04\xc9\x14\x7c\x31" + "\xd6\x61\x88\xa8\xb3\x8c\x69\xb4\xbc\xaa\x0d\x15\xd2\xd5\x27\x63" + "\xc4\xa4\x80\xe9\x2b\xe9\xd2\x34\xc9\x0e\x3f\x7b\xd3\x43\x0d\x47" + "\x5d\x37\x8e\x42\xa4\x4e\xef\xcd\xbb\x3a\x5b\xa4\xe1\xb0\x8d\x64" + "\xb7\x0b\x58\x52\xec\x55\xd0\xef\x23\xfe\xf2\x8d\xe0\xd1\x6a\x2c" + "\xaa\x1c\x03\xc7\x3e\x58\x4c\x61\x72\x07\xc6\xfd\x0e\xbc\xd4\x6b" + "\x99\x4f\x91\xda\xff\x6f\xea\x81\x0c\x76\x85\x5d\x0c\x7f\x1c\xb8" + "\x84\x8c\x2f\xe1\x36\x3e\x68\xa0\x57\xf5\xdf\x13\x0a\xd6\xe1\xcd" + "\xae\x23\x99\x4e\xed\x7a\x72\x1b\x7c\xe5\x65\xd1\xb7\xcf\x2f\x73", + "\x1e\x2f\xcf\x3c\x95\x9a\x29\xec\xd3\x37\x90\x8c\x84\x8a\xfb\x95" }, + /* After setiv, ctr_low: 0xfffffff8 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xb7\xfa\xc7\x4f", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x14\x33\xc6\x9d\x04\xd3\x48\x29\x0c\x6a\x24\x27\xdf\x5f\x0a\xd2" + "\x71\xd6\xd0\x18\x04\xc0\x9f\x72\x0a\x60\xb7\x10\x52\x56\xf7\xae" + "\x64\xb0\x28\xd4\xfd\x25\x93\x8e\x67\x7e\xac\xc2\x93\xc7\x54\x2e" + "\x82\x93\x88\x6a\xb9\x8b\x73\xbc\x88\xec\x27\xdd\x4f\x9b\x21\x9e" + "\x77\x98\x70\x0b\xf4\xd8\x55\xfe\xf4\xc3\x3a\xcb\xca\x3a\xfb\xd4" + "\x52\x72\x2f\xf8\xac\xa9\x6a\xf5\x13\xab\x7a\x2e\x9f\x52\x41\xbd" + "\x87\x90\x68\xad\x17\xbd\x5a\xff\xc3\xc6\x10\x4d\xc1\xfe\xfc\x72" + "\x21\xb5\x53\x4a\x3f\xe0\x15\x9f\x29\x36\x23\xc0\x9a\x31\xb2\x0f" + "\xcd\x2f\xa6\xd0\xfc\xe6\x4d\xed\x68\xb3\x3d\x26\x67\xab\x40\xf0" + "\xab\xcf\x72\xc0\x50\xb1\x1e\x86\x38\xe2\xe0\x46\x3a\x2e\x3e\x1d" + "\x07\xd6\x9d\xe8\xfc\xa3\xe7\xac\xc9\xa0\xb3\x22\x05\xbc\xbf\xd2" + "\x63\x44\x66\xfc\xb4\x7b\xb4\x70\x7e\x96\xa9\x16\x1b\xb2\x7d\x93" + "\x44\x92\x5e\xbd\x16\x34\xa7\x11\xd0\xdf\x52\xad\x6f\xbd\x23\x3c" + "\x3d\x58\x16\xaf\x99\x8b\xbb\xa0\xdc\x3a\xff\x17\xda\x56\xba\x77" + "\xae\xc4\xb1\x51\xe2\x61\x4f\xf0\x66\x1b\x4c\xac\x79\x34\x1c\xfd" + "\x6c\x5f\x9a\x2c\x60\xfc\x47\x00\x5f\x2d\x81\xcc\xa9\xdd\x2b\xf4" + "\x5b\x53\x44\x61\xd4\x13\x5a\xf3\x93\xf0\xc9\x24\xd4\xe6\x60\x6f" + "\x78\x02\x0c\x75\x9d\x0d\x23\x97\x35\xe2\x06\x8a\x49\x5e\xe5\xbe", + "\x23\xc0\x4a\x2f\x98\x93\xca\xbd\x2e\x44\xde\x05\xcc\xe7\xf1\xf5" }, + /* After setiv, ctr_low: 0xfffffff7 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x4c\xa7\x1e\x02", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x51\x51\x64\x89\xeb\x9f\xf9\xd6\xb1\xa6\x73\x5f\xf1\x62\xb5\xe4" + "\x00\x80\xdb\x4c\x1c\xce\xe5\x00\xeb\xea\x6c\x57\xe4\x27\xfc\x71" + "\x08\x8c\xa1\xfc\x59\x1d\x07\x45\x3c\xc9\x4e\x0f\xb6\xea\x96\x90" + "\xae\xf7\x81\x1e\x7e\x6c\x5e\x50\xaf\x34\x3e\xa0\x55\x59\x8e\xe7" + "\xc1\xba\x48\xfa\x9e\x07\xf6\x6a\x24\x54\x3e\x9b\xa5\xfe\x31\x16" + 
"\x3d\x4d\x9c\xc4\xe1\xec\x26\xa0\x8b\x59\xa6\xf3\x94\xf8\x88\xda" + "\x1f\x88\x23\x5f\xfb\xfd\x79\xa2\xd3\x62\x30\x66\x69\xd9\x0d\x05" + "\xc0\x75\x4c\xb8\x48\x34\x1d\x97\xcf\x29\x6a\x12\x1c\x26\x54\x1d" + "\x80\xa9\x06\x74\x86\xff\xc6\xb4\x72\xee\x34\xe2\x56\x06\x6c\xf5" + "\x11\xe7\x26\x71\x47\x6b\x05\xbd\xe4\x0b\x40\x78\x84\x3c\xf9\xf2" + "\x78\x34\x2b\x3c\x5f\x0e\x4c\xfb\x17\x39\xdc\x59\x6b\xd1\x56\xac" + "\xe4\x1f\xb9\x19\xbc\xec\xb1\xd0\x6d\x47\x3b\x37\x4d\x0d\x6b\x65" + "\x7c\x70\xe9\xec\x58\xcc\x09\xd4\xd9\xbf\x9f\xe0\x6c\x7f\x60\x28" + "\xd8\xdf\x8e\xd1\x6a\x73\x42\xf3\x50\x01\x79\x68\x41\xc3\xba\x19" + "\x1e\x2d\x30\xc2\x81\x2c\x9f\x11\x8b\xd0\xdc\x31\x3b\x01\xfe\x53" + "\xa5\x11\x13\x22\x89\x40\xb9\x1b\x12\x89\xef\x9a\xcb\xa8\x03\x4f" + "\x54\x1a\x15\x6d\x11\xba\x05\x09\xd3\xdb\xbf\x05\x42\x3a\x5a\x27" + "\x3b\x34\x5c\x58\x8a\x5c\xa4\xc2\x28\xdc\xb2\x3a\xe9\x99\x01\xd6", + "\x30\xb2\xb5\x11\x8a\x3a\x8d\x70\x67\x71\x14\xde\xed\xa7\x43\xb5" }, + /* After setiv, ctr_low: 0xfffffff6 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x47\xab\x9f\x3a", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x05\x72\x44\xa0\x99\x11\x1d\x2c\x4b\x03\x4f\x20\x92\x88\xbe\x55" + "\xee\x31\x2c\xd9\xc0\xc1\x64\x77\x79\xd7\x3e\xfa\x5a\x7d\xf0\x48" + "\xf8\xc8\xfe\x81\x8f\x89\x92\xa6\xc2\x07\xdc\x9f\x3f\xb2\xc8\xf2" + "\xf3\xe9\xe1\xd3\xed\x55\xb4\xab\xc3\x22\xed\x8f\x00\xde\x32\x95" + "\x91\xc0\xc5\xf3\xd3\x93\xf0\xee\x56\x14\x8f\x96\xff\xd0\x6a\xbd" + "\xfc\x57\xc2\xc3\x7b\xc1\x1d\x56\x48\x3f\xa6\xc7\x92\x47\xf7\x2f" + "\x0b\x85\x1c\xff\x87\x29\xe1\xbb\x9b\x14\x6c\xac\x51\x0a\xc0\x7b" + "\x22\x25\xb0\x48\x92\xad\x09\x09\x6e\x39\x8e\x96\x13\x05\x55\x92" + "\xbd\xd7\x5d\x95\x35\xdd\x8a\x9d\x05\x59\x60\xae\xbb\xc0\x85\x92" + "\x4c\x8b\xa0\x3f\xa2\x4a\xe5\x2e\xde\x85\x1a\x39\x10\x22\x11\x1b" + "\xdd\xcc\x96\xf4\x93\x97\xf5\x81\x85\xf3\x33\xda\xa1\x9a\xba\xfd" + "\xb8\xaf\x60\x81\x37\xf1\x02\x88\x54\x15\xeb\x21\xd1\x19\x1a\x1f" + "\x28\x9f\x02\x27\xca\xce\x97\xda\xdc\xd2\x0f\xc5\x0e\x2e\xdd\x4f" + "\x1d\x24\x62\xe4\x6e\x4a\xbe\x96\x95\x38\x0c\xe9\x26\x14\xf3\xf0" + "\x92\xbc\x97\xdc\x38\xeb\x64\xc3\x04\xc1\xa2\x6c\xad\xbd\xf8\x03" + 
"\xa0\xa4\x68\xaa\x9d\x1f\x09\xe6\x62\x95\xa2\x1c\x32\xef\x62\x28" + "\x7e\x54\x6d\x4b\x6a\xcc\x4a\xd0\x82\x47\x46\x0d\x45\x3c\x36\x03" + "\x86\x90\x44\x65\x18\xac\x19\x75\xe6\xba\xb1\x9a\xb4\x5d\x84\x9b", + "\x31\x22\x2b\x11\x6e\x2b\x94\x56\x37\x9d\xc3\xa5\xde\xe7\x6e\xc9" }, + /* After setiv, ctr_low: 0xfffffff5 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xbc\xf6\x46\x77", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x6e\x32\xdb\x04\x32\x57\x15\x78\x0e\x4c\x70\x66\x5c\x91\x43\x0c" + "\x63\x73\xb8\x86\xad\xb0\xf1\x34\x0f\x0c\x7e\xd3\x4e\xcb\xc9\xea" + "\x19\x3c\xb8\x14\xd0\xab\x9e\x9b\x22\xda\x7a\x96\xa7\xf5\xa2\x99" + "\x58\xe3\xd6\x72\x0f\xf5\xdf\x88\xd1\x33\xb1\xe5\x03\x72\x62\x1c" + "\xa7\xf2\x67\x50\x0e\x70\xc3\x7a\x6c\x4a\x90\xba\x78\x9e\xd2\x0b" + "\x29\xd4\xc8\xa7\x57\x06\xf2\xf4\x01\x4b\x30\x53\xea\xf7\xde\xbf" + "\x1c\x12\x03\xcf\x9f\xcf\x80\x8b\x77\xfd\x73\x48\x79\x19\xbe\x38" + "\x75\x0b\x6d\x78\x7d\x79\x05\x98\x65\x3b\x35\x8f\x68\xff\x30\x7a" + "\x6e\xf7\x10\x9e\x11\x25\xc4\x95\x97\x7d\x92\x0f\xbf\x38\x95\xbd" + "\x5d\x2a\xf2\x06\x2c\xd9\x5a\x80\x91\x4e\x22\x7d\x5f\x69\x85\x03" + "\xa7\x5d\xda\x22\x09\x2b\x8d\x29\x67\x7c\x8c\xf6\xb6\x49\x20\x63" + "\xb9\xb6\x4d\xb6\x37\xa3\x7b\x19\xa4\x28\x90\x83\x55\x3d\x4e\x18" + "\xc8\x65\xbc\xd1\xe7\xb5\xcf\x65\x28\xea\x19\x11\x5c\xea\x83\x8c" + "\x44\x1f\xac\xc5\xf5\x3a\x4b\x1c\x2b\xbf\x76\xd8\x98\xdb\x50\xeb" + "\x64\x45\xae\xa5\x39\xb7\xc8\xdf\x5a\x73\x6d\x2d\x0f\x4a\x5a\x17" + "\x37\x66\x1c\x3d\x27\xd5\xd6\x7d\xe1\x08\x7f\xba\x4d\x43\xc2\x29" + "\xf7\xbe\x83\xec\xd0\x3b\x2e\x19\x9e\xf7\xbf\x1b\x16\x34\xd8\xfa" + "\x32\x17\x2a\x90\x55\x93\xd5\x3e\x14\x8d\xd6\xa1\x40\x45\x09\x52", + "\x89\xf2\xae\x78\x38\x8e\xf2\xd2\x52\xa8\xba\xb6\xf2\x5d\x7c\xfc" }, + /* After setiv, ctr_low: 0xfffffff4 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x51\xb2\x9d\x4a", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x1d\xb8\x77\xcd\xcd\xfe\xde\x07\x97\xcb\x97\x3a\x4f\xa0\xd0\xe6" + "\xcc\xcf\x8b\x71\xd5\x65\x3d\xc4\x17\x52\xe7\x1d\x6a\x68\x4a\x77" + "\xca\x04\x4a\xef\x8e\x7e\xce\x79\xa1\x80\x0d\x9e\xd5\xf4\xce\x66" + "\x4d\x54\xb1\x09\xd1\xb6\xb0\x43\x28\xe8\x53\xe2\x24\x9c\x76\xc5" + "\x4d\x22\xf3\x6e\x13\xf3\xd7\xe0\x85\xb8\x9e\x0b\x17\x22\xc0\x79" + "\x2b\x72\x57\xaa\xbd\x43\xc3\xf7\xde\xce\x22\x41\x3c\x7e\x37\x1a" + "\x55\x2e\x36\x0e\x7e\xdc\xb3\xde\xd7\x33\x36\xc9\xc8\x56\x93\x51" + "\x68\x77\x9a\xb0\x08\x5c\x22\x35\xef\x5c\x9b\xbf\x3e\x20\x8a\x84" + "\x3d\xb3\x60\x10\xe1\x97\x30\xd7\xb3\x6f\x40\x5a\x2c\xe0\xe5\x52" + "\x19\xb6\x2b\xed\x6e\x8e\x18\xb4\x8d\x78\xbd\xc4\x9f\x4f\xbd\x82" + "\x98\xd6\x71\x3d\x71\x5b\x78\x73\xee\x8e\x4b\x37\x88\x9e\x21\xca" + "\x00\x6c\xc2\x96\x8d\xf0\xcd\x09\x58\x54\x5a\x58\x59\x8e\x9b\xf8" + "\x72\x93\xd7\xa0\xf9\xc4\xdc\x48\x89\xaa\x31\x95\xda\x4e\x2f\x79" + "\x1e\x37\x49\x92\x2e\x32\x2e\x76\x54\x2a\x64\xa8\x96\x67\xe9\x75" + "\x10\xa6\xeb\xad\xc6\xa8\xec\xb7\x18\x0a\x32\x26\x8d\x6e\x03\x74" + "\x0e\x1f\xfc\xde\x76\xff\x6e\x96\x42\x2d\x80\x0a\xc6\x78\x70\xc4" + "\xd8\x56\x7b\xa6\x38\x2f\xf6\xc0\x9b\xd7\x21\x6e\x88\x5d\xc8\xe5" + "\x02\x6a\x09\x1e\xb3\x46\x44\x80\x82\x5b\xd1\x66\x06\x61\x4f\xb8", + "\x16\x0e\x73\xa3\x14\x43\xdb\x15\x9c\xb0\x0d\x30\x6d\x9b\xe1\xb1" }, + /* After setiv, ctr_low: 0xfffffff3 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xaa\xef\x44\x07", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x42\x71\x54\xe2\xdb\x50\x5d\x3c\x10\xbd\xf8\x60\xbd\xdb\x26\x14" + "\x7d\x13\x59\x98\x28\xfb\x43\x42\xca\x72\xe6\xd8\x58\x00\xa2\x1b" + "\x6a\x61\xb4\x3a\x80\x6b\x9e\x14\xbd\x11\x33\xab\xe9\xb9\x91\x95" + "\xd7\x5d\xc3\x98\x1f\x7f\xcb\xa8\xf0\xec\x31\x26\x51\xea\x2e\xdf" + "\xd9\xde\x70\xf5\x84\x27\x3a\xac\x22\x05\xb9\xce\x2a\xfb\x2a\x83" + "\x1e\xce\x0e\xb2\x31\x35\xc6\xe6\xc0\xd7\xb0\x5f\xf5\xca\xdb\x13" + "\xa7\xfe\x4f\x85\xa3\x4f\x94\x5c\xc1\x04\x12\xde\x6f\xa1\xdb\x41" + "\x59\x82\x22\x22\x65\x97\x6d\xc8\x67\xab\xf3\x90\xeb\xa4\x00\xb3" + "\x7d\x94\x3d\x7b\x2a\xe2\x85\x36\x87\x16\xb8\x19\x92\x02\xe0\x43" + "\x42\x85\xa1\xe6\xb8\x11\x30\xcc\x2c\xd8\x63\x09\x0e\x53\x5f\xa3" + "\xe0\xd4\xee\x0e\x04\xee\x65\x61\x96\x84\x42\x0c\x68\x8d\xb7\x48" + "\xa3\x02\xb4\x82\x69\xf2\x35\xe4\xce\x3b\xe3\x44\xce\xad\x49\x32" + "\xab\xda\x04\xea\x06\x60\xa6\x2a\x7d\xee\x0f\xb8\x95\x90\x22\x62" + "\x9c\x78\x59\xd3\x7b\x61\x02\x65\x63\x96\x9f\x67\x50\xa0\x61\x43" + "\x53\xb2\x3f\x22\xed\x8c\x42\x39\x97\xd9\xbc\x6e\x81\xb9\x21\x97" + "\xc6\x5b\x68\xd7\x7f\xd0\xc5\x4a\xfb\x74\xc4\xfd\x9a\x2a\xb8\x9b" + "\x48\xe0\x00\xea\x6d\xf5\x30\x26\x61\x8f\xa5\x45\x70\xc9\x3a\xea" + "\x6d\x19\x11\x57\x0f\x21\xe6\x0a\x53\x94\xe3\x0c\x99\xb0\x2f\xc5", + "\x92\x92\x89\xcd\x4f\x3c\x6d\xbc\xe8\xb3\x70\x14\x5b\x3c\x12\xe4" }, + /* After setiv, ctr_low: 0xfffffff2 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa1\xe3\xc5\x3f", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\x41\xc3\xcb\xd7\x6e\xde\x2a\xc6\x15\x05\xc6\xba\x27\xae\xcd\x37" + "\xc0\xe5\xbf\xb9\x5c\xdc\xd6\xad\x1a\xe1\x35\x7c\xc0\x85\x85\x51" + "\x8c\x98\x06\xc0\x72\x43\x71\x7a\x2d\x7c\x81\x3c\xe7\xd6\x32\x8e" + 
"\x22\x2b\x46\x95\x6a\xde\x45\x40\x56\xe9\x63\x32\x68\xbf\xb6\x78" + "\xb7\x86\x00\x9d\x2c\x9e\xed\x67\xc1\x9b\x09\x9e\xd9\x0a\x56\xcb" + "\x57\xc9\x48\x14\x23\x4e\x97\x04\xb5\x85\x25\x1d\xcb\x1a\x79\x9b" + "\x54\x06\x95\xad\x16\x81\x84\x3a\x38\xec\x41\x90\x2a\xfa\x50\xe0" + "\xb9\x20\xa6\xeb\xfe\x2e\x5c\xa1\xf6\x3c\x69\x4c\xce\xf8\x30\xe0" + "\x87\x68\xa2\x3a\x9d\xad\x75\xd4\xa5\x6b\x0a\x90\x65\xa2\x27\x64" + "\x9d\xf5\xa0\x6f\xd0\xd3\x62\xa5\x2d\xae\x02\x89\xb4\x1a\xfa\x32" + "\x9b\xa0\x44\xdd\x50\xde\xaf\x41\xa9\x89\x1e\xb0\x41\xbc\x9c\x41" + "\xb0\x35\x5e\xf1\x9a\xd9\xab\x57\x53\x21\xca\x39\xfc\x8b\xb4\xd4" + "\xb2\x19\x8a\xe9\xb2\x24\x1e\xce\x2e\x19\xb0\xd2\x93\x30\xc4\x70" + "\xe2\xf8\x6a\x8a\x99\x3b\xed\x71\x7e\x9e\x98\x99\x2a\xc6\xdd\xcf" + "\x43\x32\xdb\xfb\x27\x22\x89\xa4\xc5\xe0\xa2\x94\xe9\xcf\x9d\x48" + "\xab\x3f\xfa\x4f\x75\x63\x46\xdd\xfe\xfa\xf0\xbf\x6e\xa1\xf9\xca" + "\xb1\x77\x79\x35\x6c\x33\xe1\x57\x68\x50\xe9\x78\x4e\xe4\xe2\xf0" + "\xcf\xe4\x23\xde\xf4\xa7\x34\xb3\x44\x97\x38\xd2\xbd\x27\x44\x0e", + "\x75\x0a\x41\x3b\x87\xe3\xc7\xf6\xd6\xe3\xab\xfa\x4b\xbe\x2e\x56" }, + /* After setiv, ctr_low: 0xfffffff1 */ + { GCRY_CIPHER_AES256, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x5a\xbe\x1c\x72", + 16, + "", 0, + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", + 288, + "\xf1\x3c\x7a\xa4\xa9\xaf\xe7\x49\x19\x7d\xad\x50\xc1\x6a\x84\x87" + "\xf5\x69\xe4\xe5\xc2\x0a\x90\x33\xc3\xeb\x76\x63\x5f\x9b\x1d\xf9" + "\x53\x4a\x2a\x6d\x6b\x61\xe0\x5d\xed\xcb\x98\x0d\xf2\x57\x33\x12" + "\xd1\x44\xaa\x7a\x7e\x4e\x41\x0e\xe6\xa7\x9f\x17\x92\x28\x91\xad" + "\xca\xce\xf2\xa8\x73\x4a\xad\x89\x62\x73\x0b\x9a\x68\x91\xa8\x11" + "\x44\x01\xfd\x57\xe4\xf8\x84\x55\x2b\x66\xdb\xb9\xd6\xee\x83\xe5" + "\x57\xea\x5c\x6a\x23\x87\xdd\x0a\x45\x63\xb4\x0c\x8f\xc5\x9f\x22" + "\xf3\x4f\x4e\x6f\x7b\x14\x62\xf7\x80\x59\x4a\xc5\xc8\xae\x8a\x6f" + "\x5e\xe3\x1e\xe6\xae\xec\x99\x77\x6b\x88\x14\xe3\x58\x88\x61\x74" + "\x38\x91\xa1\x32\xb8\xd2\x39\x6b\xe2\xcb\x8e\x77\xde\x92\x36\x78" + "\xad\x50\xcf\x08\xb8\xfa\x29\x59\xb4\x68\x1b\x23\x10\x57\x32\x92" + "\xf8\xec\xe1\x97\xdb\x30\x85\x22\xb5\x68\x2f\xf2\x98\xda\x06\xee" + "\x65\x02\xe7\xf9\xc8\xc1\xca\x8f\xd3\xed\x4a\x3c\x09\xdd\xde\x64" + 
"\xd9\x85\x17\x2c\x62\x41\x35\x24\xed\x6b\x87\x78\x1e\xb5\x7a\x9b" + "\xa3\x90\xa3\x99\xc7\x39\x51\x10\xb7\x6a\x12\x3b\x64\xfe\x32\x3c" + "\xb6\x84\x9a\x3f\x95\xd3\xcb\x22\x69\x9c\xf9\xb7\xc2\x8b\xf4\x55" + "\x68\x60\x11\x20\xc5\x3e\x0a\xc0\xba\x00\x0e\x88\x96\x66\xfa\xf0" + "\x75\xbc\x2b\x9c\xff\xc5\x33\x7b\xaf\xb2\xa6\x34\x78\x44\x9c\xa7", + "\x01\x24\x0e\x17\x17\xe5\xfc\x90\x07\xfa\x78\xd5\x5d\x66\xa3\xf5" }, }; gcry_cipher_hd_t hde, hdd; - unsigned char out[MAX_DATA_LEN]; + unsigned char out[MAX_GCM_DATA_LEN]; unsigned char tag[GCRY_GCM_BLOCK_LEN]; int i, keylen; gcry_error_t err = 0; @@ -1885,8 +2562,12 @@ check_gcm_cipher (void) _check_gcm_cipher(1); /* Split input to 7 byte buffers. */ _check_gcm_cipher(7); + /* Split input to 15 byte buffers. */ + _check_gcm_cipher(15); /* Split input to 16 byte buffers. */ _check_gcm_cipher(16); + /* Split input to 17 byte buffers. */ + _check_gcm_cipher(17); } ----------------------------------------------------------------------- Summary of changes: cipher/cipher-gcm.c | 77 +++++- tests/basic.c | 689 +++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 759 insertions(+), 7 deletions(-) hooks/post-receive -- The GNU crypto library http://git.gnupg.org _______________________________________________ Gnupg-commits mailing list Gnupg-commits at gnupg.org http://lists.gnupg.org/mailman/listinfo/gnupg-commits From wk at gnupg.org Tue Feb 6 13:35:02 2018 From: wk at gnupg.org (Werner Koch) Date: Tue, 06 Feb 2018 13:35:02 +0100 Subject: [patches] add support for arc4random_buf() In-Reply-To: <82c9df41-d71b-a601-193f-df0c1d098a15@pettijohn-web.com> (Edgar Pettijohn's message of "Mon, 5 Feb 2018 23:25:41 -0600") References: <82c9df41-d71b-a601-193f-df0c1d098a15@pettijohn-web.com> Message-ID: <87607axpmx.fsf@wheatstone.g10code.de> On Tue, 6 Feb 2018 06:25, edgar at pettijohn-web.com said: > Please see attached patches to add support for arc4random_buf() as an > alternate to /dev/{u}random. I tried to be as unobtrusive as possible > and maintain style. It should also allow the user to still define > RANDOM_CONF_ONLY_URANDOM if they would prefer to use > /dev/urandom. This will allow gpg to be used on filesystems mounted > nodev while providing quick, quality randomness. Please describe what arc4random_buf is and where it is used. I also redirect this to the libgcrypt mailing list. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From wk at gnupg.org Tue Feb 6 19:34:20 2018 From: wk at gnupg.org (Werner Koch) Date: Tue, 06 Feb 2018 19:34:20 +0100 Subject: [edgar@pettijohn-web.com] Re: [patches] add support for arc4random_buf() Message-ID: <87eflyvufn.fsf@wheatstone.g10code.de> An embedded message was scrubbed... From: edgar at pettijohn-web.com Subject: Re: [patches] add support for arc4random_buf() Date: Tue, 06 Feb 2018 09:09:54 -0600 Size: 3960 URL: -------------- next part -------------- -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL:

From w.k at berkeley.edu  Wed Feb  7 09:14:46 2018
From: w.k at berkeley.edu (Weikeng Chen)
Date: Wed, 7 Feb 2018 00:14:46 -0800
Subject: Attack on libgcrypt's ElGamal Encryption with Proof of Concept (PoC)
Message-ID:

I am Weikeng Chen from UC Berkeley. This post is joint work with
Erik-Oliver Blass from Airbus.

Libgcrypt is one of the very few cryptographic libraries that implements
ElGamal. On the homepage, we see "Libgcrypt is a general purpose
cryptographic library."

ElGamal is a cryptographic algorithm with efficient multiplicative
homomorphism, the ability of self re-randomization, and the support of
key splitting. Its applications include voting, mix-nets, shuffling, etc.
I am glad that libgcrypt implements this algorithm.

However, handling ElGamal is hard. The ElGamal implementation is not
secure for general purposes. If we use libgcrypt to encrypt messages
directly, rather than using it for hybrid encryption, the security can
be broken.

We prepared the attack on GitHub:
https://github.com/weikengchen/attack-on-libgcrypt-elgamal

The wiki page for explanation:
https://github.com/weikengchen/attack-on-libgcrypt-elgamal/wiki/Attack-on-libgcrypt's-ElGamal-Encryption-with-Proof-of-Concept-(PoC)

Note: This is a ciphertext-only attack.

Suggestions: I would suggest that libgcrypt consider one of the two
options below.

(1) Announce in the manual that ElGamal in libgcrypt should *only be
used for hybrid encryption*:
https://gnupg.org/documentation/manuals/gcrypt/Available-algorithms.html#Available-algorithms
This prevents new libgcrypt users from possible misuse of the ElGamal
implementation for purposes not considered by us.

(2) Consider an implementation of ElGamal with semantic security, which
would better be another algorithm in addition to the current ElGamal.
One problem is the current ElGamal prime generation: it is hard to make
the current prime generation, which is fast, also secure. There is a
reason to use p = 2q + 1 where p and q are both primes.

Any further thoughts?

--
Weikeng Chen @ 795 Soda Hall

From michi4 at schoenitzer.de  Thu Feb  8 18:35:24 2018
From: michi4 at schoenitzer.de (Michael F. Schönitzer)
Date: Thu, 08 Feb 2018 18:35:24 +0100
Subject: ECDH in gcrypt
Message-ID: <1518111324.2017.25.camel@schoenitzer.de>

Hi everyone,

The infotext of the gcrypt repo mentions that it supports ECDH and other
DH variants, but the documentation on the website does not mention it.
Can I do a key exchange via ECDH with gcrypt? Is there any documentation
or sample code that I could look at?

I would be glad about any help.

Cheers,
M

--
Michael F. Schönitzer
Magdalenenstr. 29
80638 München

Mail: michael at schoenitzer.de
Jabber: schoenitzer at jabber.ccc.de
Tel: 089/152315 - Mobil: 017657895702
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 659 bytes
Desc: This is a digitally signed message part
URL:

From stefbon at gmail.com  Fri Feb  9 19:24:28 2018
From: stefbon at gmail.com (Stef Bon)
Date: Fri, 9 Feb 2018 19:24:28 +0100
Subject: ECDH in gcrypt
In-Reply-To: <1518111324.2017.25.camel@schoenitzer.de>
References: <1518111324.2017.25.camel@schoenitzer.de>
Message-ID:

Hi Michael,

you cannot call a function "do dh" or "do ecdh" with gcrypt.
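As a rough illustration of the key-pair generation step described below, a
minimal sketch using gcrypt's S-expression API may help; the curve name and
the final dump are only an example, and deriving the shared secret is a
separate, protocol-specific step that is not shown here:

#include <stdio.h>
#include <gcrypt.h>

int main (void)
{
  gcry_sexp_t params = NULL, keypair = NULL;
  gcry_error_t err;

  /* Minimal libgcrypt initialization for a demo program. */
  if (!gcry_check_version (GCRYPT_VERSION))
    {
      fputs ("libgcrypt version mismatch\n", stderr);
      return 1;
    }
  gcry_control (GCRYCTL_DISABLE_SECMEM, 0);   /* no secure memory needed here */
  gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);

  /* Ask for an ephemeral ECC key pair on an example curve. */
  err = gcry_sexp_build (&params, NULL,
                         "(genkey (ecc (curve \"NIST P-256\")))");
  if (!err)
    err = gcry_pk_genkey (&keypair, params);
  if (err)
    {
      fprintf (stderr, "keygen failed: %s\n", gcry_strerror (err));
      return 1;
    }

  /* The "q" value inside the public-key part is the public point that
   * would be written to a buffer and sent to the peer. */
  gcry_sexp_dump (keypair);

  gcry_sexp_release (keypair);
  gcry_sexp_release (params);
  return 0;
}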
What you can do with gcrypt is the generation of ephemeral key pair (client) and write the public key as string to a buffer, and create the exchange hash, and verify the signature received from the server. See: https://tools.ietf.org/html/rfc5656#section-4 The creation of the shared secret is specific to the algorithm used. Sometimes an extra library is required (with curve25519-sha256 at libssh.org for example). See: https://git.libssh.org/projects/libssh.git/tree/doc/curve25519-sha256 at libssh.org.txt Stef From jussi.kivilinna at iki.fi Sat Feb 10 00:07:48 2018 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 10 Feb 2018 01:07:48 +0200 Subject: [PATCH 2/2] AVX implementation of BLAKE2s In-Reply-To: <151821766363.14794.12253707483104078502.stgit@localhost.localdomain> References: <151821766363.14794.12253707483104078502.stgit@localhost.localdomain> Message-ID: <151821766868.14794.14801974686961532084.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'blake2s-amd64-avx.S'. * cipher/blake2.c (USE_AVX, _gry_blake2s_transform_amd64_avx): New. (BLAKE2S_CONTEXT) [USE_AVX]: Add 'use_avx'. (blake2s_transform): Rename to ... (blake2s_transform_generic): ... this. (blake2s_transform): New. (blake2s_final): Pass 'ctx' pointer to transform function instead of 'S'. (blake2s_init_ctx): Check HW features and enable AVX implementation if supported. * cipher/blake2s-amd64-avx.S: New. * configure.ac: Add 'blake2s-amd64-avx.lo'. -- Benchmark on Intel Core i7-4790K (4.0 Ghz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte BLAKE2S_256 | 1.34 ns/B 711.4 MiB/s 5.36 c/B After (~1.3x faster): | nanosecs/byte mebibytes/sec cycles/byte BLAKE2S_256 | 1.77 ns/B 538.2 MiB/s 7.09 c/B Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 cipher/blake2.c | 52 ++++++++ cipher/blake2s-amd64-avx.S | 276 ++++++++++++++++++++++++++++++++++++++++++++ configure.ac | 1 4 files changed, 326 insertions(+), 5 deletions(-) create mode 100644 cipher/blake2s-amd64-avx.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index b0ee158cc..625a0ef69 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,7 +107,7 @@ rfc2268.c \ camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \ camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S \ blake2.c \ - blake2b-amd64-avx2.S + blake2b-amd64-avx2.S blake2s-amd64-avx.S gost28147.lo: gost-sb.h gost-sb.h: gost-s-box diff --git a/cipher/blake2.c b/cipher/blake2.c index 56f1c7ca8..43ca1bad2 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -30,6 +30,14 @@ #include "cipher.h" #include "hash-common.h" +/* USE_AVX indicates whether to compile with Intel AVX code. */ +#undef USE_AVX +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AVX 1 +#endif + /* USE_AVX2 indicates whether to compile with Intel AVX2 code. 
*/ #undef USE_AVX2 #if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX2) && \ @@ -123,6 +131,9 @@ typedef struct BLAKE2S_CONTEXT_S byte buf[BLAKE2S_BLOCKBYTES]; size_t buflen; size_t outlen; +#ifdef USE_AVX + unsigned int use_avx:1; +#endif } BLAKE2S_CONTEXT; typedef unsigned int (*blake2_transform_t)(void *S, const void *inblk, @@ -481,8 +492,9 @@ static inline void blake2s_increment_counter(BLAKE2S_STATE *S, const int inc) S->t[1] += (S->t[0] < (u32)inc) - (inc < 0); } -static unsigned int blake2s_transform(void *vS, const void *inblks, - size_t nblks) +static unsigned int blake2s_transform_generic(BLAKE2S_STATE *S, + const void *inblks, + size_t nblks) { static const byte blake2s_sigma[10][16] = { @@ -497,7 +509,6 @@ static unsigned int blake2s_transform(void *vS, const void *inblks, { 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 }, { 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0 }, }; - BLAKE2S_STATE *S = vS; unsigned int burn = 0; const byte* in = inblks; u32 m[16]; @@ -596,6 +607,33 @@ static unsigned int blake2s_transform(void *vS, const void *inblks, return burn; } +#ifdef USE_AVX +unsigned int _gcry_blake2s_transform_amd64_avx(BLAKE2S_STATE *S, + const void *inblks, + size_t nblks) ASM_FUNC_ABI; +#endif + +static unsigned int blake2s_transform(void *ctx, const void *inblks, + size_t nblks) +{ + BLAKE2S_CONTEXT *c = ctx; + unsigned int nburn; + + if (0) + {} +#ifdef USE_AVX + if (c->use_avx) + nburn = _gcry_blake2s_transform_amd64_avx(&c->state, inblks, nblks); +#endif + else + nburn = blake2s_transform_generic(&c->state, inblks, nblks); + + if (nburn) + nburn += ASM_EXTRA_STACK; + + return nburn; +} + static void blake2s_final(void *ctx) { BLAKE2S_CONTEXT *c = ctx; @@ -611,7 +649,7 @@ static void blake2s_final(void *ctx) memset (c->buf + c->buflen, 0, BLAKE2S_BLOCKBYTES - c->buflen); /* Padding */ blake2s_set_lastblock (S); blake2s_increment_counter (S, (int)c->buflen - BLAKE2S_BLOCKBYTES); - burn = blake2s_transform (S, c->buf, 1); + burn = blake2s_transform (ctx, c->buf, 1); /* Output full hash to buffer */ for (i = 0; i < 8; ++i) @@ -687,11 +725,17 @@ static gcry_err_code_t blake2s_init_ctx(void *ctx, unsigned int flags, unsigned int dbits) { BLAKE2S_CONTEXT *c = ctx; + unsigned int features = _gcry_get_hw_features (); + (void)features; (void)flags; memset (c, 0, sizeof (*c)); +#ifdef USE_AVX + c->use_avx = !!(features & HWF_INTEL_AVX); +#endif + c->outlen = dbits / 8; c->buflen = 0; return blake2s_init(c, key, keylen); diff --git a/cipher/blake2s-amd64-avx.S b/cipher/blake2s-amd64-avx.S new file mode 100644 index 000000000..f7312dbd0 --- /dev/null +++ b/cipher/blake2s-amd64-avx.S @@ -0,0 +1,276 @@ +/* blake2s-amd64-avx.S - AVX implementation of BLAKE2s + * + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +/* The code is based on public-domain/CC0 BLAKE2 reference implementation + * by Samual Neves, at https://github.com/BLAKE2/BLAKE2/tree/master/sse + * Copyright 2012, Samuel Neves + */ + +#ifdef __x86_64 +#include +#if defined(HAVE_GCC_INLINE_ASM_AVX) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) + +#include "asm-common-amd64.h" + +.text + +/* register macros */ +#define RSTATE %rdi +#define RINBLKS %rsi +#define RNBLKS %rdx +#define RIV %rcx + +/* state structure */ +#define STATE_H 0 +#define STATE_T (STATE_H + 8 * 4) +#define STATE_F (STATE_T + 2 * 4) + +/* vector registers */ +#define ROW1 %xmm0 +#define ROW2 %xmm1 +#define ROW3 %xmm2 +#define ROW4 %xmm3 +#define TMP1 %xmm4 +#define TMP1x %xmm4 +#define R16 %xmm5 +#define R8 %xmm6 + +#define MA1 %xmm8 +#define MA2 %xmm9 +#define MA3 %xmm10 +#define MA4 %xmm11 + +#define MB1 %xmm12 +#define MB2 %xmm13 +#define MB3 %xmm14 +#define MB4 %xmm15 + +/********************************************************************** + blake2s/AVX + **********************************************************************/ + +#define GATHER_MSG(m1, m2, m3, m4, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovd (s0)*4(RINBLKS), m1; \ + vmovd (s1)*4(RINBLKS), m2; \ + vmovd (s8)*4(RINBLKS), m3; \ + vmovd (s9)*4(RINBLKS), m4; \ + vpinsrd $1, (s2)*4(RINBLKS), m1, m1; \ + vpinsrd $1, (s3)*4(RINBLKS), m2, m2; \ + vpinsrd $1, (s10)*4(RINBLKS), m3, m3; \ + vpinsrd $1, (s11)*4(RINBLKS), m4, m4; \ + vpinsrd $2, (s4)*4(RINBLKS), m1, m1; \ + vpinsrd $2, (s5)*4(RINBLKS), m2, m2; \ + vpinsrd $2, (s12)*4(RINBLKS), m3, m3; \ + vpinsrd $2, (s13)*4(RINBLKS), m4, m4; \ + vpinsrd $3, (s6)*4(RINBLKS), m1, m1; \ + vpinsrd $3, (s7)*4(RINBLKS), m2, m2; \ + vpinsrd $3, (s14)*4(RINBLKS), m3, m3; \ + vpinsrd $3, (s15)*4(RINBLKS), m4, m4; + +#define LOAD_MSG_0(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) +#define LOAD_MSG_1(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) +#define LOAD_MSG_2(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) +#define LOAD_MSG_3(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) +#define LOAD_MSG_4(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) +#define LOAD_MSG_5(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) +#define LOAD_MSG_6(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) +#define LOAD_MSG_7(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) +#define LOAD_MSG_8(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) +#define LOAD_MSG_9(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) + +#define LOAD_MSG(r, m1, m2, m3, m4) LOAD_MSG_##r(m1, m2, m3, m4) + +#define ROR_16(in, out) vpshufb R16, in, out; + +#define ROR_8(in, out) vpshufb R8, in, out; + +#define ROR_12(in, out) \ + vpsrld $12, in, TMP1; \ + vpslld $(32 - 12), in, out; \ + vpxor TMP1, out, out; + +#define ROR_7(in, out) \ + vpsrld $7, in, TMP1; \ + vpslld $(32 - 7), in, out; \ + vpxor TMP1, out, out; + +#define G(r1, r2, r3, r4, m, ROR_A, 
ROR_B) \ + vpaddd m, r1, r1; \ + vpaddd r2, r1, r1; \ + vpxor r1, r4, r4; \ + ROR_A(r4, r4); \ + vpaddd r4, r3, r3; \ + vpxor r3, r2, r2; \ + ROR_B(r2, r2); + +#define G1(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_16, ROR_12); + +#define G2(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_8, ROR_7); + +#define MM_SHUFFLE(z,y,x,w) \ + (((z) << 6) | ((y) << 4) | ((x) << 2) | (w)) + +#define DIAGONALIZE(r1, r2, r3, r4) \ + vpshufd $MM_SHUFFLE(0,3,2,1), r2, r2; \ + vpshufd $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpshufd $MM_SHUFFLE(2,1,0,3), r4, r4; + +#define UNDIAGONALIZE(r1, r2, r3, r4) \ + vpshufd $MM_SHUFFLE(2,1,0,3), r2, r2; \ + vpshufd $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpshufd $MM_SHUFFLE(0,3,2,1), r4, r4; + +#define ROUND(r, m1, m2, m3, m4) \ + G1(ROW1, ROW2, ROW3, ROW4, m1); \ + G2(ROW1, ROW2, ROW3, ROW4, m2); \ + DIAGONALIZE(ROW1, ROW2, ROW3, ROW4); \ + G1(ROW1, ROW2, ROW3, ROW4, m3); \ + G2(ROW1, ROW2, ROW3, ROW4, m4); \ + UNDIAGONALIZE(ROW1, ROW2, ROW3, ROW4); + +blake2s_data: +.align 16 +.Liv: + .long 0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A + .long 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19 +.Lshuf_ror16: + .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13 +.Lshuf_ror8: + .byte 1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12 + +.align 64 +.globl _gcry_blake2s_transform_amd64_avx +ELF(.type _gcry_blake2s_transform_amd64_avx, at function;) + +_gcry_blake2s_transform_amd64_avx: + /* input: + * %rdi: state + * %rsi: blks + * %rdx: num_blks + */ + + vzeroupper; + + addq $64, (STATE_T + 0)(RSTATE); + + vmovdqa .Lshuf_ror16 (RIP), R16; + vmovdqa .Lshuf_ror8 (RIP), R8; + + vmovdqa .Liv+(0 * 4) (RIP), ROW3; + vmovdqa .Liv+(4 * 4) (RIP), ROW4; + + vmovdqu (STATE_H + 0 * 4)(RSTATE), ROW1; + vmovdqu (STATE_H + 4 * 4)(RSTATE), ROW2; + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + LOAD_MSG(0, MA1, MA2, MA3, MA4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + +.Loop: + ROUND(0, MA1, MA2, MA3, MA4); + LOAD_MSG(2, MA1, MA2, MA3, MA4); + ROUND(1, MB1, MB2, MB3, MB4); + LOAD_MSG(3, MB1, MB2, MB3, MB4); + ROUND(2, MA1, MA2, MA3, MA4); + LOAD_MSG(4, MA1, MA2, MA3, MA4); + ROUND(3, MB1, MB2, MB3, MB4); + LOAD_MSG(5, MB1, MB2, MB3, MB4); + ROUND(4, MA1, MA2, MA3, MA4); + LOAD_MSG(6, MA1, MA2, MA3, MA4); + ROUND(5, MB1, MB2, MB3, MB4); + LOAD_MSG(7, MB1, MB2, MB3, MB4); + ROUND(6, MA1, MA2, MA3, MA4); + LOAD_MSG(8, MA1, MA2, MA3, MA4); + ROUND(7, MB1, MB2, MB3, MB4); + LOAD_MSG(9, MB1, MB2, MB3, MB4); + sub $1, RNBLKS; + jz .Loop_end; + + lea 64(RINBLKS), RINBLKS; + addq $64, (STATE_T + 0)(RSTATE); + + ROUND(8, MA1, MA2, MA3, MA4); + LOAD_MSG(0, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + + vmovdqa .Liv+(0 * 4) (RIP), ROW3; + vmovdqa .Liv+(4 * 4) (RIP), ROW4; + + vpxor (STATE_H + 0 * 4)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 4)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 4)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 4)(RSTATE); + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + jmp .Loop; + +.Loop_end: + ROUND(8, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + vpxor (STATE_H + 0 * 4)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 4)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 4)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 4)(RSTATE); + + xor %eax, %eax; + vzeroall; + ret; +ELF(.size _gcry_blake2s_transform_amd64_avx, + .-_gcry_blake2s_transform_amd64_avx;) + +#endif /*defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)*/ +#endif /*__x86_64*/ diff --git 
a/configure.ac b/configure.ac index 300c520a6..305b19f7e 100644 --- a/configure.ac +++ b/configure.ac @@ -2421,6 +2421,7 @@ if test "$found" = "1" ; then x86_64-*-*) # Build with the assembly implementation GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2b-amd64-avx2.lo" + GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2s-amd64-avx.lo" ;; esac fi From jussi.kivilinna at iki.fi Sat Feb 10 00:07:43 2018 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 10 Feb 2018 01:07:43 +0200 Subject: [PATCH 1/2] AVX2 implementation of BLAKE2b Message-ID: <151821766363.14794.12253707483104078502.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'blake2b-amd64-avx2.S'. * cipher/blake2.c (USE_AVX2, ASM_FUNC_ABI, ASM_EXTRA_STACK) (_gry_blake2b_transform_amd64_avx2): New. (BLAKE2B_CONTEXT) [USE_AVX2]: Add 'use_avx2'. (blake2b_transform): Rename to ... (blake2b_transform_generic): ... this. (blake2b_transform): New. (blake2b_final): Pass 'ctx' pointer to transform function instead of 'S'. (blake2b_init_ctx): Check HW features and enable AVX2 implementation if supported. * cipher/blake2b-amd64-avx2.S: New. * configure.ac: Add 'blake2b-amd64-avx2.lo'. -- Benchmark on Intel Core i7-4790K (4.0 Ghz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte BLAKE2B_512 | 1.07 ns/B 887.8 MiB/s 4.30 c/B After (~1.4x faster): | nanosecs/byte mebibytes/sec cycles/byte BLAKE2B_512 | 0.771 ns/B 1236.8 MiB/s 3.08 c/B Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 3 cipher/blake2.c | 66 +++++++++- cipher/blake2b-amd64-avx2.S | 298 +++++++++++++++++++++++++++++++++++++++++++ configure.ac | 7 + 4 files changed, 369 insertions(+), 5 deletions(-) create mode 100644 cipher/blake2b-amd64-avx2.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 6e6c5ac03..b0ee158cc 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -106,7 +106,8 @@ twofish.c twofish-amd64.S twofish-arm.S twofish-aarch64.S \ rfc2268.c \ camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \ camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S \ -blake2.c +blake2.c \ + blake2b-amd64-avx2.S gost28147.lo: gost-sb.h gost-sb.h: gost-s-box diff --git a/cipher/blake2.c b/cipher/blake2.c index 0e4cf9bfc..56f1c7ca8 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -30,6 +30,28 @@ #include "cipher.h" #include "hash-common.h" +/* USE_AVX2 indicates whether to compile with Intel AVX2 code. */ +#undef USE_AVX2 +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX2) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AVX2 1 +#endif + +/* AMD64 assembly implementations use SystemV ABI, ABI conversion and additional + * stack to store XMM6-XMM15 needed on Win64. 
*/ +#undef ASM_FUNC_ABI +#undef ASM_EXTRA_STACK +#if defined(USE_AVX2) +# ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS +# define ASM_FUNC_ABI __attribute__((sysv_abi)) +# define ASM_EXTRA_STACK (10 * 16) +# else +# define ASM_FUNC_ABI +# define ASM_EXTRA_STACK 0 +# endif +#endif + #define BLAKE2B_BLOCKBYTES 128 #define BLAKE2B_OUTBYTES 64 #define BLAKE2B_KEYBYTES 64 @@ -67,6 +89,9 @@ typedef struct BLAKE2B_CONTEXT_S byte buf[BLAKE2B_BLOCKBYTES]; size_t buflen; size_t outlen; +#ifdef USE_AVX2 + unsigned int use_avx2:1; +#endif } BLAKE2B_CONTEXT; typedef struct @@ -188,8 +213,9 @@ static inline u64 rotr64(u64 x, u64 n) return ((x >> (n & 63)) | (x << ((64 - n) & 63))); } -static unsigned int blake2b_transform(void *vS, const void *inblks, - size_t nblks) +static unsigned int blake2b_transform_generic(BLAKE2B_STATE *S, + const void *inblks, + size_t nblks) { static const byte blake2b_sigma[12][16] = { @@ -206,7 +232,6 @@ static unsigned int blake2b_transform(void *vS, const void *inblks, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }, { 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 } }; - BLAKE2B_STATE *S = vS; const byte* in = inblks; u64 m[16]; u64 v[16]; @@ -306,6 +331,33 @@ static unsigned int blake2b_transform(void *vS, const void *inblks, return sizeof(void *) * 4 + sizeof(u64) * 16 * 2; } +#ifdef USE_AVX2 +unsigned int _gcry_blake2b_transform_amd64_avx2(BLAKE2B_STATE *S, + const void *inblks, + size_t nblks) ASM_FUNC_ABI; +#endif + +static unsigned int blake2b_transform(void *ctx, const void *inblks, + size_t nblks) +{ + BLAKE2B_CONTEXT *c = ctx; + unsigned int nburn; + + if (0) + {} +#ifdef USE_AVX2 + if (c->use_avx2) + nburn = _gcry_blake2b_transform_amd64_avx2(&c->state, inblks, nblks); +#endif + else + nburn = blake2b_transform_generic(&c->state, inblks, nblks); + + if (nburn) + nburn += ASM_EXTRA_STACK; + + return nburn; +} + static void blake2b_final(void *ctx) { BLAKE2B_CONTEXT *c = ctx; @@ -321,7 +373,7 @@ static void blake2b_final(void *ctx) memset (c->buf + c->buflen, 0, BLAKE2B_BLOCKBYTES - c->buflen); /* Padding */ blake2b_set_lastblock (S); blake2b_increment_counter (S, (int)c->buflen - BLAKE2B_BLOCKBYTES); - burn = blake2b_transform (S, c->buf, 1); + burn = blake2b_transform (ctx, c->buf, 1); /* Output full hash to buffer */ for (i = 0; i < 8; ++i) @@ -397,11 +449,17 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, unsigned int dbits) { BLAKE2B_CONTEXT *c = ctx; + unsigned int features = _gcry_get_hw_features (); + (void)features; (void)flags; memset (c, 0, sizeof (*c)); +#ifdef USE_AVX2 + c->use_avx2 = !!(features & HWF_INTEL_AVX2); +#endif + c->outlen = dbits / 8; c->buflen = 0; return blake2b_init(c, key, keylen); diff --git a/cipher/blake2b-amd64-avx2.S b/cipher/blake2b-amd64-avx2.S new file mode 100644 index 000000000..6bcc5652d --- /dev/null +++ b/cipher/blake2b-amd64-avx2.S @@ -0,0 +1,298 @@ +/* blake2b-amd64-avx2.S - AVX2 implementation of BLAKE2b + * + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +/* The code is based on public-domain/CC0 BLAKE2 reference implementation + * by Samual Neves, at https://github.com/BLAKE2/BLAKE2/tree/master/sse + * Copyright 2012, Samuel Neves + */ + +#ifdef __x86_64 +#include +#if defined(HAVE_GCC_INLINE_ASM_AVX2) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) + +#include "asm-common-amd64.h" + +.text + +/* register macros */ +#define RSTATE %rdi +#define RINBLKS %rsi +#define RNBLKS %rdx +#define RIV %rcx + +/* state structure */ +#define STATE_H 0 +#define STATE_T (STATE_H + 8 * 8) +#define STATE_F (STATE_T + 2 * 8) + +/* vector registers */ +#define ROW1 %ymm0 +#define ROW2 %ymm1 +#define ROW3 %ymm2 +#define ROW4 %ymm3 +#define TMP1 %ymm4 +#define TMP1x %xmm4 +#define R16 %ymm5 +#define R24 %ymm6 + +#define MA1 %ymm8 +#define MA2 %ymm9 +#define MA3 %ymm10 +#define MA4 %ymm11 +#define MA1x %xmm8 +#define MA2x %xmm9 +#define MA3x %xmm10 +#define MA4x %xmm11 + +#define MB1 %ymm12 +#define MB2 %ymm13 +#define MB3 %ymm14 +#define MB4 %ymm15 +#define MB1x %xmm12 +#define MB2x %xmm13 +#define MB3x %xmm14 +#define MB4x %xmm15 + +/********************************************************************** + blake2b/AVX2 + **********************************************************************/ + +#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovq (s0)*8(RINBLKS), m1x; \ + vmovq (s4)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s2)*8(RINBLKS), m1x, m1x; \ + vpinsrq $1, (s6)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m1, m1; \ + vmovq (s1)*8(RINBLKS), m2x; \ + vmovq (s5)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s3)*8(RINBLKS), m2x, m2x; \ + vpinsrq $1, (s7)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m2, m2; \ + vmovq (s8)*8(RINBLKS), m3x; \ + vmovq (s12)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s10)*8(RINBLKS), m3x, m3x; \ + vpinsrq $1, (s14)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m3, m3; \ + vmovq (s9)*8(RINBLKS), m4x; \ + vmovq (s13)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s11)*8(RINBLKS), m4x, m4x; \ + vpinsrq $1, (s15)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m4, m4; + +#define LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) +#define LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) +#define LOAD_MSG_2(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) +#define LOAD_MSG_3(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) +#define LOAD_MSG_4(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) +#define LOAD_MSG_5(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) +#define LOAD_MSG_6(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 
8, 11) +#define LOAD_MSG_7(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) +#define LOAD_MSG_8(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) +#define LOAD_MSG_9(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) +#define LOAD_MSG_10(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) +#define LOAD_MSG_11(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) + +#define LOAD_MSG(r, m1, m2, m3, m4) \ + LOAD_MSG_##r(m1, m2, m3, m4, m1##x, m2##x, m3##x, m4##x) + +#define ROR_32(in, out) vpshufd $0xb1, in, out; + +#define ROR_24(in, out) vpshufb R24, in, out; + +#define ROR_16(in, out) vpshufb R16, in, out; + +#define ROR_63(in, out) \ + vpsrlq $63, in, TMP1; \ + vpaddq in, in, out; \ + vpxor TMP1, out, out; + +#define G(r1, r2, r3, r4, m, ROR_A, ROR_B) \ + vpaddq m, r1, r1; \ + vpaddq r2, r1, r1; \ + vpxor r1, r4, r4; \ + ROR_A(r4, r4); \ + vpaddq r4, r3, r3; \ + vpxor r3, r2, r2; \ + ROR_B(r2, r2); + +#define G1(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_32, ROR_24); + +#define G2(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_16, ROR_63); + +#define MM_SHUFFLE(z,y,x,w) \ + (((z) << 6) | ((y) << 4) | ((x) << 2) | (w)) + +#define DIAGONALIZE(r1, r2, r3, r4) \ + vpermq $MM_SHUFFLE(0,3,2,1), r2, r2; \ + vpermq $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpermq $MM_SHUFFLE(2,1,0,3), r4, r4; + +#define UNDIAGONALIZE(r1, r2, r3, r4) \ + vpermq $MM_SHUFFLE(2,1,0,3), r2, r2; \ + vpermq $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpermq $MM_SHUFFLE(0,3,2,1), r4, r4; + +#define ROUND(r, m1, m2, m3, m4) \ + G1(ROW1, ROW2, ROW3, ROW4, m1); \ + G2(ROW1, ROW2, ROW3, ROW4, m2); \ + DIAGONALIZE(ROW1, ROW2, ROW3, ROW4); \ + G1(ROW1, ROW2, ROW3, ROW4, m3); \ + G2(ROW1, ROW2, ROW3, ROW4, m4); \ + UNDIAGONALIZE(ROW1, ROW2, ROW3, ROW4); + +blake2b_data: +.align 32 +.Liv: + .quad 0x6a09e667f3bcc908, 0xbb67ae8584caa73b + .quad 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1 + .quad 0x510e527fade682d1, 0x9b05688c2b3e6c1f + .quad 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179 +.Lshuf_ror16: + .byte 2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9 +.Lshuf_ror24: + .byte 3, 4, 5, 6, 7, 0, 1, 2, 11, 12, 13, 14, 15, 8, 9, 10 + +.align 64 +.globl _gcry_blake2b_transform_amd64_avx2 +ELF(.type _gcry_blake2b_transform_amd64_avx2, at function;) + +_gcry_blake2b_transform_amd64_avx2: + /* input: + * %rdi: state + * %rsi: blks + * %rdx: num_blks + */ + + vzeroupper; + + addq $128, (STATE_T + 0)(RSTATE); + adcq $0, (STATE_T + 8)(RSTATE); + + vbroadcasti128 .Lshuf_ror16 (RIP), R16; + vbroadcasti128 .Lshuf_ror24 (RIP), R24; + + vmovdqa .Liv+(0 * 8) (RIP), ROW3; + vmovdqa .Liv+(4 * 8) (RIP), ROW4; + + vmovdqu (STATE_H + 0 * 8)(RSTATE), ROW1; + vmovdqu (STATE_H + 4 * 8)(RSTATE), ROW2; + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + LOAD_MSG(0, MA1, MA2, MA3, MA4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + +.Loop: + ROUND(0, MA1, MA2, MA3, MA4); + LOAD_MSG(2, MA1, MA2, MA3, MA4); + ROUND(1, MB1, MB2, MB3, MB4); + LOAD_MSG(3, MB1, MB2, MB3, MB4); + ROUND(2, MA1, MA2, MA3, MA4); + LOAD_MSG(4, MA1, MA2, MA3, MA4); + ROUND(3, MB1, MB2, MB3, MB4); + LOAD_MSG(5, MB1, MB2, MB3, MB4); + ROUND(4, MA1, MA2, MA3, MA4); + LOAD_MSG(6, MA1, MA2, MA3, MA4); + ROUND(5, MB1, MB2, MB3, MB4); + LOAD_MSG(7, MB1, MB2, MB3, MB4); + 
ROUND(6, MA1, MA2, MA3, MA4); + LOAD_MSG(8, MA1, MA2, MA3, MA4); + ROUND(7, MB1, MB2, MB3, MB4); + LOAD_MSG(9, MB1, MB2, MB3, MB4); + ROUND(8, MA1, MA2, MA3, MA4); + LOAD_MSG(10, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + LOAD_MSG(11, MB1, MB2, MB3, MB4); + sub $1, RNBLKS; + jz .Loop_end; + + lea 128(RINBLKS), RINBLKS; + addq $128, (STATE_T + 0)(RSTATE); + adcq $0, (STATE_T + 8)(RSTATE); + + ROUND(10, MA1, MA2, MA3, MA4); + LOAD_MSG(0, MA1, MA2, MA3, MA4); + ROUND(11, MB1, MB2, MB3, MB4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + + vmovdqa .Liv+(0 * 8) (RIP), ROW3; + vmovdqa .Liv+(4 * 8) (RIP), ROW4; + + vpxor (STATE_H + 0 * 8)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 8)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + jmp .Loop; + +.Loop_end: + ROUND(10, MA1, MA2, MA3, MA4); + ROUND(11, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + vpxor (STATE_H + 0 * 8)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 8)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); + + xor %eax, %eax; + vzeroall; + ret; +ELF(.size _gcry_blake2b_transform_amd64_avx2, + .-_gcry_blake2b_transform_amd64_avx2;) + +#endif /*defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)*/ +#endif /*__x86_64*/ diff --git a/configure.ac b/configure.ac index aaf3c82a9..300c520a6 100644 --- a/configure.ac +++ b/configure.ac @@ -2416,6 +2416,13 @@ LIST_MEMBER(blake2, $enabled_digests) if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2.lo" AC_DEFINE(USE_BLAKE2, 1, [Defined if this module should be included]) + + case "${host}" in + x86_64-*-*) + # Build with the assembly implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2b-amd64-avx2.lo" + ;; + esac fi # SHA-1 needs to be included always for example because it is used by From jussi.kivilinna at iki.fi Mon Feb 12 22:59:52 2018 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 12 Feb 2018 23:59:52 +0200 Subject: [PATCH] Make BMI2 inline assembly check more robust In-Reply-To: References: <151551861550.5642.12750471651801313528.stgit@localhost.localdomain> <87h8rsh91o.fsf@wheatstone.g10code.de> Message-ID: <3fbd9adf-36d3-fb45-aaba-0abd5fdbbb90@iki.fi> On 11.01.2018 23:42, Jussi Kivilinna wrote: > On 11.01.2018 14:30, Werner Koch wrote: >> >> Jussi: Do you have more optimization in mind for 1.9? >> > > I have AES XTS optimization patch for ARMv8 coming later this week. > Actually, it would be nice to have Intel SHA Extensions accelerated SHA1/SHA256 implementations in libgcrypt. Problem is that I do not currently have access to machine with CPU that supports these instructions. -Jussi From w.k at berkeley.edu Sat Feb 17 01:53:47 2018 From: w.k at berkeley.edu (Weikeng Chen) Date: Fri, 16 Feb 2018 16:53:47 -0800 Subject: Attack on libgcrypt's ElGamal Encryption with Proof of Concept (PoC) In-Reply-To: <20180217003612.vclm3hfnhqvnyh7t@adversary.org> References: <20180217003612.vclm3hfnhqvnyh7t@adversary.org> Message-ID: Hi Ben, Pardon me for the short reply. Recent I got crazy with some other stuff... (1) For the PyCrypto, we actually attack PyCryptoDome (https://www.pycryptodome.org/), which is currently maintained by Helder Eijs. We extend our results to PyCrypto. (2) It would be fine for signatures. But the problem is when we use ElGamal for encryption, the confidentiality has problems! 
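To make the semantic-security point concrete, here is a generic textbook illustration (it assumes ElGamal over Z_p^* with a generator g of the whole group, so it is not the specific attack in the PoC repository): a passive eavesdropper can compute the Legendre symbol of the plaintext from the public key and the ciphertext alone, using nothing but libgcrypt's MPI functions.

/* Classic Legendre-symbol leak of textbook ElGamal over Z_p^*,
 * assuming g generates the whole group (so g is a quadratic
 * non-residue).  Generic illustration only. */
#include <gcrypt.h>

/* Euler's criterion: +1 if a is a quadratic residue mod p, else -1. */
static int
legendre (gcry_mpi_t a, gcry_mpi_t p)
{
  int r;
  gcry_mpi_t e = gcry_mpi_new (0);
  gcry_mpi_t t = gcry_mpi_new (0);

  gcry_mpi_sub_ui (e, p, 1);
  gcry_mpi_rshift (e, e, 1);            /* e = (p-1)/2 */
  gcry_mpi_powm (t, a, e, p);           /* t = a^((p-1)/2) mod p */
  r = (gcry_mpi_cmp_ui (t, 1) == 0) ? 1 : -1;

  gcry_mpi_release (e);
  gcry_mpi_release (t);
  return r;
}

/* Ciphertext-only: given public key y and ciphertext (c1, c2) with
 * c1 = g^k and c2 = m*y^k, recover legendre(m).  If legendre(y) == 1
 * then y^k is a residue; otherwise legendre(y^k) == legendre(c1),
 * since both equal (-1)^k when g is a non-residue.  Then
 * legendre(m) = legendre(c2) * legendre(y^k). */
static int
leaked_plaintext_symbol (gcry_mpi_t y, gcry_mpi_t c1, gcry_mpi_t c2,
                         gcry_mpi_t p)
{
  int l_yk = (legendre (y, p) == 1) ? 1 : legendre (c1, p);
  return legendre (c2, p) * l_yk;
}

One leaked bit is enough: two chosen plaintexts with different quadratic residuosity become trivially distinguishable, which is exactly the IND-CPA failure, while the hybrid-encryption use is much less exposed because the ElGamal payload there is a random session key rather than attacker-chosen data.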
Exactly as what you point out. RSA is still not perfect -- it does not support multiplicative homomorphism when we want security. On Fri, Feb 16, 2018 at 4:36 PM, Ben McGinnes wrote: > On Wed, Feb 07, 2018 at 12:14:46AM -0800, Weikeng Chen wrote: >> I am Weikeng Chen from UC Berkeley. This post is joint work with >> Erik-Oliver Blass from Airbus. > > Hmmm, let me see if I'm following this correctly ... > >> Libgcrypt is one of the very few cryptographic libraries that >> implements ElGamal. In the homepage, we see "Libgcrypt is a general >> purpose cryptographic library." > > Correct. Libgcrypt is a general purpose library, but that does not > mean that every algorithm, cipher and/or hash in it is general purpose > or ever intended to be implemented that way. For instance it also > contains Serpent, a popular and strong cipher used with disk and > volume encryption, but you won't find it in the OpenPGP implementation > of GPG. > > The correct application of any of the algorithms to be found in > libgcrypt is still determined by the projects which include libgcrypt > as a dependeny, not libgcrypt itself. > >> ElGamal is a cryptographic algorithm with efficient multiplicative >> homomorphism, the ability of self re-randomization, and the support >> of key splitting. Its applications include voting, mix-nets, >> shuffling, etc. I am glad that libgcrypt implements this algorithm. > > You're not the only one glad of it. > >> However, handling the ElGamal is hard. The ElGamal implementation is >> not secure for general purposes. If we use libgcrypt to encrypt >> messages directly, rather than using for hybrid encryption, the >> security can be broken. > > This is not new, it is well known and is, in fact the reason why > El-Gamal has only ever been implemented as an assymmetric cipher > within the OpenPGP specification. Even when first implemented in > OpenPGP in the late 1990s it was always paired with DSA, first DSA-1 > and subsequently DSA-2 when vulnerabilities were found in the former. > > For instance, you will never see a signing or certification El-Gamal > key in any of the OpenPGP formats. You will only ever see it as an > encryption only subkey; mostly with DSA certification keys, but > occasionally with RSA certification keys. Indeed, if you verify the > OpenPGP signatures on this list and thus import the relevant keys you > can see one as you read this message. > >> We prepared the attack on the Github: >> https://github.com/weikengchen/attack-on-libgcrypt-elgamal > > Yeah, speaking of which, I noticed you also cited a similar attack > targeting the El-Gamal implementation in the PyCrypto library. > > You are aware that PyCrypto is a dead project and no longer > maintained, right? > > In fact it hasn't been maintained for several years. Most mainstream > Python development requiring a cryptographic module now uses the > cryptography.py package instead and the original PyCrypto developer > has moved on to contributing to that project. Though some also use > the NaCl package, which leverages libsodium to implement Daniel > Bernstein's elliptic curve implementations. > >> The wiki page for explanation: >> https://github.com/weikengchen/attack-on-libgcrypt-elgamal/wiki/Attack-on-libgcrypt's-ElGamal-Encryption-with-Proof-of-Concept-(PoC) > > We'll certainly take a long hard look at it, though I think I'll defer > to Niibe on this one, his maths and cryptographic skills are sharper > than mine. > >> Suggestions: I would suggest libgcrypt to consider either one of two >> things as follows. 
>> >> (1) Announce on the manual that ElGamal in libgcrypt should *only be >> used for hybrid encryption*: >> https://gnupg.org/documentation/manuals/gcrypt/Available-algorithms.html#Available-algorithms > > The document you cite is essentially an index of what's in the > library, it is quite accurate. > >> This prevents new libgcrypt's users from possible misuse of the >> ElGamal implementation for the purposes not considered by us. > > This would be better served by a separate guide for developers > explaining these things as relevant for all the ciphers and hashes in > the library than singling out any one at the expense of others. More > and more accurate documentation is always better than less. > >> (2) Consider the implementation of ElGamal with semantic security, >> which would better be another algorithm in addition to current >> ElGamal. > > Again, this draws back to the use of any given component of libgcrypt > in other projects, including GnuPG. In that particular case, of > course, there is an additional algorithm used at the same level of > operation: RSA. > >> One problem is about current ElGamal's prime generation. It is hard >> to make the current prime generation, which is fast, secure. There >> is a reason to find p=2q+1 where p and q are both primes. > > This is definitely where I'd ping Niibe for his opinion. If and/or > when he has the time for it. > > > Regards, > Ben -- Weikeng Chen @ 795 Soda Hall From ben at adversary.org Sat Feb 17 01:36:13 2018 From: ben at adversary.org (Ben McGinnes) Date: Sat, 17 Feb 2018 11:36:13 +1100 Subject: Attack on libgcrypt's ElGamal Encryption with Proof of Concept (PoC) In-Reply-To: References: Message-ID: <20180217003612.vclm3hfnhqvnyh7t@adversary.org> On Wed, Feb 07, 2018 at 12:14:46AM -0800, Weikeng Chen wrote: > I am Weikeng Chen from UC Berkeley. This post is joint work with > Erik-Oliver Blass from Airbus. Hmmm, let me see if I'm following this correctly ... > Libgcrypt is one of the very few cryptographic libraries that > implements ElGamal. In the homepage, we see "Libgcrypt is a general > purpose cryptographic library." Correct. Libgcrypt is a general purpose library, but that does not mean that every algorithm, cipher and/or hash in it is general purpose or ever intended to be implemented that way. For instance it also contains Serpent, a popular and strong cipher used with disk and volume encryption, but you won't find it in the OpenPGP implementation of GPG. The correct application of any of the algorithms to be found in libgcrypt is still determined by the projects which include libgcrypt as a dependeny, not libgcrypt itself. > ElGamal is a cryptographic algorithm with efficient multiplicative > homomorphism, the ability of self re-randomization, and the support > of key splitting. Its applications include voting, mix-nets, > shuffling, etc. I am glad that libgcrypt implements this algorithm. You're not the only one glad of it. > However, handling the ElGamal is hard. The ElGamal implementation is > not secure for general purposes. If we use libgcrypt to encrypt > messages directly, rather than using for hybrid encryption, the > security can be broken. This is not new, it is well known and is, in fact the reason why El-Gamal has only ever been implemented as an assymmetric cipher within the OpenPGP specification. Even when first implemented in OpenPGP in the late 1990s it was always paired with DSA, first DSA-1 and subsequently DSA-2 when vulnerabilities were found in the former. 
For instance, you will never see a signing or certification El-Gamal key in any of the OpenPGP formats. You will only ever see it as an encryption only subkey; mostly with DSA certification keys, but occasionally with RSA certification keys. Indeed, if you verify the OpenPGP signatures on this list and thus import the relevant keys you can see one as you read this message. > We prepared the attack on the Github: > https://github.com/weikengchen/attack-on-libgcrypt-elgamal Yeah, speaking of which, I noticed you also cited a similar attack targeting the El-Gamal implementation in the PyCrypto library. You are aware that PyCrypto is a dead project and no longer maintained, right? In fact it hasn't been maintained for several years. Most mainstream Python development requiring a cryptographic module now uses the cryptography.py package instead and the original PyCrypto developer has moved on to contributing to that project. Though some also use the NaCl package, which leverages libsodium to implement Daniel Bernstein's elliptic curve implementations. > The wiki page for explanation: > https://github.com/weikengchen/attack-on-libgcrypt-elgamal/wiki/Attack-on-libgcrypt's-ElGamal-Encryption-with-Proof-of-Concept-(PoC) We'll certainly take a long hard look at it, though I think I'll defer to Niibe on this one, his maths and cryptographic skills are sharper than mine. > Suggestions: I would suggest libgcrypt to consider either one of two > things as follows. > > (1) Announce on the manual that ElGamal in libgcrypt should *only be > used for hybrid encryption*: > https://gnupg.org/documentation/manuals/gcrypt/Available-algorithms.html#Available-algorithms The document you cite is essentially an index of what's in the library, it is quite accurate. > This prevents new libgcrypt's users from possible misuse of the > ElGamal implementation for the purposes not considered by us. This would be better served by a separate guide for developers explaining these things as relevant for all the ciphers and hashes in the library than singling out any one at the expense of others. More and more accurate documentation is always better than less. > (2) Consider the implementation of ElGamal with semantic security, > which would better be another algorithm in addition to current > ElGamal. Again, this draws back to the use of any given component of libgcrypt in other projects, including GnuPG. In that particular case, of course, there is an additional algorithm used at the same level of operation: RSA. > One problem is about current ElGamal's prime generation. It is hard > to make the current prime generation, which is fast, secure. There > is a reason to find p=2q+1 where p and q are both primes. This is definitely where I'd ping Niibe for his opinion. If and/or when he has the time for it. Regards, Ben -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: not available URL: From jussi.kivilinna at iki.fi Sat Feb 17 11:03:55 2018 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 17 Feb 2018 12:03:55 +0200 Subject: [PATCH 1/2] Add Intel SHA Extensions accelerated SHA1 implementation Message-ID: <151886183542.31284.4052837239153069162.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'sha1-intel-shaext.c'. * cipher/sha1-intel-shaext.c: New. * cipher/sha1.c (USE_SHAEXT, _gcry_sha1_transform_intel_shaext): New. (sha1_init) [USE_SHAEXT]: Use shaext implementation is supported. 
(transform) [USE_SHAEXT]: Use shaext if enabled. (transform): Only add ASM_EXTRA_STACK if returned burn length is not zero. * cipher/sha1.h (SHA1_CONTEXT): Add 'use_shaext'. * configure.ac: Add 'sha1-intel-shaext.lo'. (shaextsupport, gcry_cv_gcc_inline_asm_shaext): New. * src/g10lib.h: Add HWF_INTEL_SHAEXT and reorder HWF flags. * src/hwf-x86.c (detect_x86_gnuc): Detect SHA Extensions. * src/hwfeatures.c (hwflist): Add 'intel-shaext'. -- Benchmark on Intel Celeron J3455 (1500 Mhz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 4.50 ns/B 211.7 MiB/s 6.76 c/B After (4.0x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 1.11 ns/B 858.1 MiB/s 1.67 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 625a0ef69..110a48b2c 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -92,6 +92,7 @@ seed.c \ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ + sha1-intel-shaext.c \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ diff --git a/cipher/sha1-intel-shaext.c b/cipher/sha1-intel-shaext.c new file mode 100644 index 000000000..5a2349e1e --- /dev/null +++ b/cipher/sha1-intel-shaext.c @@ -0,0 +1,281 @@ +/* sha1-intel-shaext.S - SHAEXT accelerated SHA-1 transform function + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#include "types.h" + +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && defined(USE_SHA1) && \ + defined(ENABLE_SHAEXT_SUPPORT) + +#if _GCRY_GCC_VERSION >= 40400 /* 4.4 */ +/* Prevent compiler from issuing SSE instructions between asm blocks. */ +# pragma GCC target("no-sse") +#endif + +/* Two macros to be called prior and after the use of SHA-EXT + instructions. There should be no external function calls between + the use of these macros. There purpose is to make sure that the + SSE regsiters are cleared and won't reveal any information about + the key or the data. */ +#ifdef __WIN64__ +/* XMM6-XMM15 are callee-saved registers on WIN64. 
*/ +# define shaext_prepare_variable char win64tmp[2*16] +# define shaext_prepare_variable_size sizeof(win64tmp) +# define shaext_prepare() \ + do { asm volatile ("movdqu %%xmm6, (%0)\n" \ + "movdqu %%xmm7, (%1)\n" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]) \ + : "memory"); \ + } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("movdqu (%0), %%xmm6\n" \ + "movdqu (%1), %%xmm7\n" \ + "pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "movdqa %%xmm0, (%2)\n\t" \ + "movdqa %%xmm0, (%3)\n\t" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]), \ + "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#else +# define shaext_prepare_variable +# define shaext_prepare_variable_size 0 +# define shaext_prepare() do { } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "pxor %%xmm6, %%xmm6\n" \ + "pxor %%xmm7, %%xmm7\n" \ + "movdqa %%xmm0, (%0)\n\t" \ + "movdqa %%xmm0, (%1)\n\t" \ + : \ + : "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#endif + +/* + * Transform nblks*64 bytes (nblks*16 32-bit words) at DATA. + */ +unsigned int +_gcry_sha1_transform_intel_shaext(void *state, const unsigned char *data, + size_t nblks) +{ + static const unsigned char be_mask[16] __attribute__ ((aligned (16))) = + { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; + char save_buf[2 * 16 + 15]; + char *abcd_save; + char *e_save; + shaext_prepare_variable; + + if (nblks == 0) + return 0; + + shaext_prepare (); + + asm volatile ("" : "=r" (abcd_save) : "0" (save_buf) : "memory"); + abcd_save = abcd_save + (-(uintptr_t)abcd_save & 15); + e_save = abcd_save + 16; + + /* byteswap mask => XMM7 */ + asm volatile ("movdqa %[mask], %%xmm7\n\t" /* Preload mask */ + : + : [mask] "m" (*be_mask) + : "memory"); + + /* Load state.. 
ABCD => XMM4, E => XMM5 */ + asm volatile ("movd 16(%[state]), %%xmm5\n\t" + "movdqu (%[state]), %%xmm4\n\t" + "pslldq $12, %%xmm5\n\t" + "pshufd $0x1b, %%xmm4, %%xmm4\n\t" + "movdqa %%xmm5, (%[e_save])\n\t" + "movdqa %%xmm4, (%[abcd_save])\n\t" + : + : [state] "r" (state), [abcd_save] "r" (abcd_save), + [e_save] "r" (e_save) + : "memory" ); + + /* DATA => XMM[0..4] */ + asm volatile ("movdqu 0(%[data]), %%xmm0\n\t" + "movdqu 16(%[data]), %%xmm1\n\t" + "movdqu 32(%[data]), %%xmm2\n\t" + "movdqu 48(%[data]), %%xmm3\n\t" + "pshufb %%xmm7, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm1\n\t" + "pshufb %%xmm7, %%xmm2\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + : + : [data] "r" (data) + : "memory" ); + data += 64; + + while (1) + { + /* Round 0..3 */ + asm volatile ("paddd %%xmm0, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" /* ABCD => E1 */ + "sha1rnds4 $0, %%xmm5, %%xmm4\n\t" + ::: "memory" ); + + /* Round 4..7 */ + asm volatile ("sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $0, %%xmm6, %%xmm4\n\t" + "sha1msg1 %%xmm1, %%xmm0\n\t" + ::: "memory" ); + + /* Round 8..11 */ + asm volatile ("sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1rnds4 $0, %%xmm5, %%xmm4\n\t" + "sha1msg1 %%xmm2, %%xmm1\n\t" + "pxor %%xmm2, %%xmm0\n\t" + ::: "memory" ); + +#define ROUND(imm, E0, E1, MSG0, MSG1, MSG2, MSG3) \ + asm volatile ("sha1nexte %%"MSG0", %%"E0"\n\t" \ + "movdqa %%xmm4, %%"E1"\n\t" \ + "sha1msg2 %%"MSG0", %%"MSG1"\n\t" \ + "sha1rnds4 $"imm", %%"E0", %%xmm4\n\t" \ + "sha1msg1 %%"MSG0", %%"MSG3"\n\t" \ + "pxor %%"MSG0", %%"MSG2"\n\t" \ + ::: "memory" ) + + /* Rounds 12..15 to 64..67 */ + ROUND("0", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("0", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("1", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("1", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("1", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("1", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("1", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("2", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("2", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("2", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("2", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("2", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("3", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("3", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + + if (--nblks == 0) + break; + + /* Round 68..71 */ + asm volatile ("movdqu 0(%[data]), %%xmm0\n\t" + "sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1msg2 %%xmm1, %%xmm2\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pxor %%xmm1, %%xmm3\n\t" + "pshufb %%xmm7, %%xmm0\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Round 72..75 */ + asm volatile ("movdqu 16(%[data]), %%xmm1\n\t" + "sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1msg2 %%xmm2, %%xmm3\n\t" + "sha1rnds4 $3, %%xmm5, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm1\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Round 76..79 */ + asm volatile ("movdqu 32(%[data]), %%xmm2\n\t" + "sha1nexte %%xmm3, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm2\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Merge states, store current. 
*/ + asm volatile ("movdqu 48(%[data]), %%xmm3\n\t" + "sha1nexte (%[e_save]), %%xmm5\n\t" + "paddd (%[abcd_save]), %%xmm4\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "movdqa %%xmm5, (%[e_save])\n\t" + "movdqa %%xmm4, (%[abcd_save])\n\t" + : + : [abcd_save] "r" (abcd_save), [e_save] "r" (e_save), + [data] "r" (data) + : "memory" ); + + data += 64; + } + + /* Round 68..71 */ + asm volatile ("sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1msg2 %%xmm1, %%xmm2\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pxor %%xmm1, %%xmm3\n\t" + ::: "memory" ); + + /* Round 72..75 */ + asm volatile ("sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1msg2 %%xmm2, %%xmm3\n\t" + "sha1rnds4 $3, %%xmm5, %%xmm4\n\t" + ::: "memory" ); + + /* Round 76..79 */ + asm volatile ("sha1nexte %%xmm3, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + ::: "memory" ); + + /* Merge states. */ + asm volatile ("sha1nexte (%[e_save]), %%xmm5\n\t" + "paddd (%[abcd_save]), %%xmm4\n\t" + : + : [abcd_save] "r" (abcd_save), [e_save] "r" (e_save) + : "memory" ); + + /* Save state */ + asm volatile ("pshufd $0x1b, %%xmm4, %%xmm4\n\t" + "psrldq $12, %%xmm5\n\t" + "movdqu %%xmm4, (%[state])\n\t" + "movd %%xmm5, 16(%[state])\n\t" + : + : [state] "r" (state) + : "memory" ); + + shaext_cleanup (abcd_save, e_save); + return 0; +} + +#endif /* HAVE_GCC_INLINE_ASM_SHA_EXT */ diff --git a/cipher/sha1.c b/cipher/sha1.c index 78b172f24..09868aa3f 100644 --- a/cipher/sha1.c +++ b/cipher/sha1.c @@ -68,6 +68,14 @@ # define USE_BMI2 1 #endif +/* USE_SHAEXT indicates whether to compile with Intel SHA Extension code. */ +#undef USE_SHAEXT +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && \ + defined(ENABLE_SHAEXT_SUPPORT) +# define USE_SHAEXT 1 +#endif + /* USE_NEON indicates whether to enable ARM NEON assembly code. */ #undef USE_NEON #ifdef ENABLE_NEON_SUPPORT @@ -138,6 +146,10 @@ sha1_init (void *context, unsigned int flags) #ifdef USE_BMI2 hd->use_bmi2 = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_NEON hd->use_neon = (features & HWF_ARM_NEON) != 0; #endif @@ -311,7 +323,8 @@ transform_blk (void *ctx, const unsigned char *data) * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI #undef ASM_EXTRA_STACK -#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_BMI2) +#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_BMI2) || \ + defined(USE_SHAEXT) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # define ASM_EXTRA_STACK (10 * 16) @@ -340,6 +353,13 @@ _gcry_sha1_transform_amd64_avx_bmi2 (void *state, const unsigned char *data, size_t nblks) ASM_FUNC_ABI; #endif +#ifdef USE_SHAEXT +/* Does not need ASM_FUNC_ABI */ +unsigned int +_gcry_sha1_transform_intel_shaext (void *state, const unsigned char *data, + size_t nblks); +#endif + static unsigned int transform (void *ctx, const unsigned char *data, size_t nblks) @@ -347,29 +367,53 @@ transform (void *ctx, const unsigned char *data, size_t nblks) SHA1_CONTEXT *hd = ctx; unsigned int burn; +#ifdef USE_SHAEXT + if (hd->use_shaext) + { + burn = _gcry_sha1_transform_intel_shaext (&hd->h0, data, nblks); + burn += burn ? 
4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } +#endif #ifdef USE_BMI2 if (hd->use_bmi2) - return _gcry_sha1_transform_amd64_avx_bmi2 (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_avx_bmi2 (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_AVX if (hd->use_avx) - return _gcry_sha1_transform_amd64_avx (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_avx (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_SSSE3 if (hd->use_ssse3) - return _gcry_sha1_transform_amd64_ssse3 (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_ssse3 (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_ARM_CE if (hd->use_arm_ce) - return _gcry_sha1_transform_armv8_ce (&hd->h0, data, nblks); + { + burn = _gcry_sha1_transform_armv8_ce (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) : 0; + return burn; + } #endif #ifdef USE_NEON if (hd->use_neon) - return _gcry_sha1_transform_armv7_neon (&hd->h0, data, nblks) - + 4 * sizeof(void*); + { + burn = _gcry_sha1_transform_armv7_neon (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) : 0; + return burn; + } #endif do diff --git a/cipher/sha1.h b/cipher/sha1.h index d448fcac8..93ce79b5c 100644 --- a/cipher/sha1.h +++ b/cipher/sha1.h @@ -29,6 +29,7 @@ typedef struct unsigned int use_ssse3:1; unsigned int use_avx:1; unsigned int use_bmi2:1; + unsigned int use_shaext:1; unsigned int use_neon:1; unsigned int use_arm_ce:1; } SHA1_CONTEXT; diff --git a/configure.ac b/configure.ac index 305b19f7e..4ae7667b3 100644 --- a/configure.ac +++ b/configure.ac @@ -588,6 +588,14 @@ AC_ARG_ENABLE(aesni-support, aesnisupport=$enableval,aesnisupport=yes) AC_MSG_RESULT($aesnisupport) +# Implementation of the --disable-shaext-support switch. +AC_MSG_CHECKING([whether SHAEXT support is requested]) +AC_ARG_ENABLE(shaext-support, + AC_HELP_STRING([--disable-shaext-support], + [Disable support for the Intel SHAEXT instructions]), + shaextsupport=$enableval,shaextsupport=yes) +AC_MSG_RESULT($shaextsupport) + # Implementation of the --disable-pclmul-support switch. AC_MSG_CHECKING([whether PCLMUL support is requested]) AC_ARG_ENABLE(pclmul-support, @@ -1175,6 +1183,7 @@ AM_CONDITIONAL(MPI_MOD_C_UDIV_QRNND, test "$mpi_mod_c_udiv_qrnnd" = yes) # Reset non applicable feature flags. if test "$mpi_cpu_arch" != "x86" ; then aesnisupport="n/a" + shaextsupport="n/a" pclmulsupport="n/a" sse41support="n/a" avxsupport="n/a" @@ -1329,6 +1338,34 @@ if test "$gcry_cv_gcc_inline_asm_pclmul" = "yes" ; then [Defined if inline assembler supports PCLMUL instructions]) fi + +# +# Check whether GCC inline assembler supports SHA Extensions instructions. 
+# +AC_CACHE_CHECK([whether GCC inline assembler supports SHA Extensions instructions], + [gcry_cv_gcc_inline_asm_shaext], + [if test "$mpi_cpu_arch" != "x86" ; then + gcry_cv_gcc_inline_asm_shaext="n/a" + else + gcry_cv_gcc_inline_asm_shaext=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[void a(void) { + __asm__("sha1rnds4 \$0, %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1nexte %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1msg1 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1msg2 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256rnds2 %%xmm0, %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256msg1 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256msg2 %%xmm1, %%xmm3\n\t":::"cc"); + }]])], + [gcry_cv_gcc_inline_asm_shaext=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_shaext" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_SHAEXT,1, + [Defined if inline assembler supports SHA Extensions instructions]) +fi + + # # Check whether GCC inline assembler supports SSE4.1 instructions. # @@ -1921,6 +1958,11 @@ if test x"$aesnisupport" = xyes ; then aesnisupport="no (unsupported by compiler)" fi fi +if test x"$shaextsupport" = xyes ; then + if test "$gcry_cv_gcc_inline_asm_shaext" != "yes" ; then + shaextsupport="no (unsupported by compiler)" + fi +fi if test x"$pclmulsupport" = xyes ; then if test "$gcry_cv_gcc_inline_asm_pclmul" != "yes" ; then pclmulsupport="no (unsupported by compiler)" @@ -1960,6 +2002,10 @@ if test x"$aesnisupport" = xyes ; then AC_DEFINE(ENABLE_AESNI_SUPPORT, 1, [Enable support for Intel AES-NI instructions.]) fi +if test x"$shaextsupport" = xyes ; then + AC_DEFINE(ENABLE_SHAEXT_SUPPORT, 1, + [Enable support for Intel SHAEXT instructions.]) +fi if test x"$pclmulsupport" = xyes ; then AC_DEFINE(ENABLE_PCLMUL_SUPPORT, 1, [Enable support for Intel PCLMUL instructions.]) @@ -2449,6 +2495,13 @@ case "${host}" in ;; esac +case "$mpi_cpu_arch" in + x86) + # Build with the SHAEXT implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha1-intel-shaext.lo" + ;; +esac + LIST_MEMBER(sm3, $enabled_digests) if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS sm3.lo" @@ -2634,6 +2687,7 @@ GCRY_MSG_SHOW([Try using jitter entropy: ],[$jentsupport]) GCRY_MSG_SHOW([Using linux capabilities: ],[$use_capabilities]) GCRY_MSG_SHOW([Try using Padlock crypto: ],[$padlocksupport]) GCRY_MSG_SHOW([Try using AES-NI crypto: ],[$aesnisupport]) +GCRY_MSG_SHOW([Try using Intel SHAEXT: ],[$shaextsupport]) GCRY_MSG_SHOW([Try using Intel PCLMUL: ],[$pclmulsupport]) GCRY_MSG_SHOW([Try using Intel SSE4.1: ],[$sse41support]) GCRY_MSG_SHOW([Try using DRNG (RDRAND): ],[$drngsupport]) diff --git a/src/g10lib.h b/src/g10lib.h index 961b51541..d41fa0cf7 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -224,14 +224,14 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_INTEL_AVX (1 << 12) #define HWF_INTEL_AVX2 (1 << 13) #define HWF_INTEL_FAST_VPGATHER (1 << 14) - -#define HWF_ARM_NEON (1 << 15) -#define HWF_ARM_AES (1 << 16) -#define HWF_ARM_SHA1 (1 << 17) -#define HWF_ARM_SHA2 (1 << 18) -#define HWF_ARM_PMULL (1 << 19) - -#define HWF_INTEL_RDTSC (1 << 20) +#define HWF_INTEL_RDTSC (1 << 15) +#define HWF_INTEL_SHAEXT (1 << 16) + +#define HWF_ARM_NEON (1 << 17) +#define HWF_ARM_AES (1 << 18) +#define HWF_ARM_SHA1 (1 << 19) +#define HWF_ARM_SHA2 (1 << 20) +#define HWF_ARM_PMULL (1 << 21) diff --git a/src/hwf-x86.c b/src/hwf-x86.c index 0d3a1f40e..b644eda1f 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -357,6 +357,10 @@ detect_x86_gnuc (void) if ((result & HWF_INTEL_AVX2) && !avoid_vpgather) result |= 
HWF_INTEL_FAST_VPGATHER; #endif /*ENABLE_AVX_SUPPORT*/ + + /* Test bit 29 for SHA Extensions. */ + if (features & (1 << 29)) + result |= HWF_INTEL_SHAEXT; } return result; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 1cad546d2..e08166945 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -58,6 +58,7 @@ static struct { HWF_INTEL_AVX2, "intel-avx2" }, { HWF_INTEL_FAST_VPGATHER, "intel-fast-vpgather" }, { HWF_INTEL_RDTSC, "intel-rdtsc" }, + { HWF_INTEL_SHAEXT, "intel-shaext" }, { HWF_ARM_NEON, "arm-neon" }, { HWF_ARM_AES, "arm-aes" }, { HWF_ARM_SHA1, "arm-sha1" }, From jussi.kivilinna at iki.fi Sat Feb 17 11:04:00 2018 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 17 Feb 2018 12:04:00 +0200 Subject: [PATCH 2/2] Add Intel SHA Extensions accelerated SHA256 implementation In-Reply-To: <151886183542.31284.4052837239153069162.stgit@localhost.localdomain> References: <151886183542.31284.4052837239153069162.stgit@localhost.localdomain> Message-ID: <151886184047.31284.8518583557052720359.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'sha256-intel-shaext.c'. * cipher/sha256-intel-shaext.c: New. * cipher/sha256.c (USE_SHAEXT) (_gcry_sha256_transform_intel_shaext): New. (SHA256_CONTEXT): Add 'use_shaext'. (sha256_init, sha224_init) [USE_SHAEXT]: Use shaext if supported. (transform) [USE_SHAEXT]: Use shaext if enabled. (transform): Only add ASM_EXTRA_STACK if returned burn length is not zero. * configure.ac: Add 'sha256-intel-shaext.lo'. -- Benchmark on Intel Celeron J3455 (1500 Mhz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 10.07 ns/B 94.72 MiB/s 15.10 c/B After (3.7x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 2.70 ns/B 353.8 MiB/s 4.04 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 110a48b2c..599e3c103 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -94,7 +94,7 @@ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ sha1-intel-shaext.c \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ - sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ + sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S sha256-intel-shaext.c \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ sm3.c \ diff --git a/cipher/sha256-intel-shaext.c b/cipher/sha256-intel-shaext.c new file mode 100644 index 000000000..0c107bb4c --- /dev/null +++ b/cipher/sha256-intel-shaext.c @@ -0,0 +1,352 @@ +/* sha256-intel-shaext.S - SHAEXT accelerated SHA-256 transform function + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#include "types.h" + +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && defined(USE_SHA256) && \ + defined(ENABLE_SHAEXT_SUPPORT) + +#if _GCRY_GCC_VERSION >= 40400 /* 4.4 */ +/* Prevent compiler from issuing SSE instructions between asm blocks. */ +# pragma GCC target("no-sse") +#endif + +/* Two macros to be called prior and after the use of SHA-EXT + instructions. There should be no external function calls between + the use of these macros. There purpose is to make sure that the + SSE regsiters are cleared and won't reveal any information about + the key or the data. */ +#ifdef __WIN64__ +/* XMM6-XMM15 are callee-saved registers on WIN64. */ +# define shaext_prepare_variable char win64tmp[2*16] +# define shaext_prepare_variable_size sizeof(win64tmp) +# define shaext_prepare() \ + do { asm volatile ("movdqu %%xmm6, (%0)\n" \ + "movdqu %%xmm7, (%1)\n" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]) \ + : "memory"); \ + } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("movdqu (%0), %%xmm6\n" \ + "movdqu (%1), %%xmm7\n" \ + "pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "movdqa %%xmm0, (%2)\n\t" \ + "movdqa %%xmm0, (%3)\n\t" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]), \ + "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#else +# define shaext_prepare_variable +# define shaext_prepare_variable_size 0 +# define shaext_prepare() do { } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "pxor %%xmm6, %%xmm6\n" \ + "pxor %%xmm7, %%xmm7\n" \ + "movdqa %%xmm0, (%0)\n\t" \ + "movdqa %%xmm0, (%1)\n\t" \ + : \ + : "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#endif + +typedef struct u128_s +{ + u32 a, b, c, d; +} u128_t; + +/* + * Transform nblks*64 bytes (nblks*16 32-bit words) at DATA. 
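
For orientation while reading the inline assembly that follows: each sha256rnds2 pair performs four SHA-256 compression rounds, and the hardware consumes W[i]+K[i] values from XMM0, which is why the code adds the K table entry to the message block ("paddd %[constants], %%xmm0") before every round group. A plain-C reference of four such rounds, my own sketch rather than part of the patch:

  /* Hedged reference: four SHA-256 rounds on state s = {a..h} with the
   * round constants already folded into wk[i] = W[i] + K[i]. */
  #include <stdint.h>

  #define ROR32(x,n)  (((x) >> (n)) | ((x) << (32 - (n))))

  static void
  sha256_rounds4_ref (uint32_t s[8], const uint32_t wk[4])
  {
    int i;
    for (i = 0; i < 4; i++)
      {
        uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
        uint32_t e = s[4], f = s[5], g = s[6], h = s[7];
        uint32_t t1 = h + (ROR32(e,6) ^ ROR32(e,11) ^ ROR32(e,25))
                        + ((e & f) ^ (~e & g)) + wk[i];
        uint32_t t2 = (ROR32(a,2) ^ ROR32(a,13) ^ ROR32(a,22))
                        + ((a & b) ^ (a & c) ^ (b & c));
        s[7] = g;  s[6] = f;  s[5] = e;  s[4] = d + t1;
        s[3] = c;  s[2] = b;  s[1] = a;  s[0] = t1 + t2;
      }
  }
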
+ */ +unsigned int +_gcry_sha256_transform_intel_shaext(u32 state[8], const unsigned char *data, + size_t nblks) +{ + static const unsigned char bshuf_mask[16] __attribute__ ((aligned (16))) = + { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; + static const u128_t K[16] __attribute__ ((aligned (16))) = + { + { 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5 }, + { 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5 }, + { 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3 }, + { 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174 }, + { 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc }, + { 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da }, + { 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7 }, + { 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967 }, + { 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13 }, + { 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85 }, + { 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3 }, + { 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070 }, + { 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5 }, + { 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3 }, + { 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208 }, + { 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 } + }; + char save_buf[2 * 16 + 15]; + char *abef_save; + char *cdgh_save; + shaext_prepare_variable; + + if (nblks == 0) + return 0; + + shaext_prepare (); + + asm volatile ("" : "=r" (abef_save) : "0" (save_buf) : "memory"); + abef_save = abef_save + (-(uintptr_t)abef_save & 15); + cdgh_save = abef_save + 16; + + /* byteswap mask => XMM7 */ + asm volatile ("movdqa %[mask], %%xmm7\n\t" /* Preload mask */ + : + : [mask] "m" (*bshuf_mask) + : "memory"); + + /* Load state.. ABEF_SAVE => STATE0 XMM1, CDGH_STATE => STATE1 XMM2 */ + asm volatile ("movups 16(%[state]), %%xmm1\n\t" /* HGFE (xmm=EFGH) */ + "movups 0(%[state]), %%xmm0\n\t" /* DCBA (xmm=ABCD) */ + "movaps %%xmm1, %%xmm2\n\t" + "shufps $0x11, %%xmm0, %%xmm1\n\t" /* ABEF (xmm=FEBA) */ + "shufps $0xbb, %%xmm0, %%xmm2\n\t" /* CDGH (xmm=HGDC) */ + : + : [state] "r" (state) + : "memory" ); + + /* Load message */ + asm volatile ("movdqu 0*16(%[data]), %%xmm3\n\t" + "movdqu 1*16(%[data]), %%xmm4\n\t" + "movdqu 2*16(%[data]), %%xmm5\n\t" + "movdqu 3*16(%[data]), %%xmm6\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "pshufb %%xmm7, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm5\n\t" + "pshufb %%xmm7, %%xmm6\n\t" + : + : [data] "r" (data) + : "memory" ); + data += 64; + + do + { + /* Save state */ + asm volatile ("movdqa %%xmm1, (%[abef_save])\n\t" + "movdqa %%xmm2, (%[cdgh_save])\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + + /* Round 0..3 */ + asm volatile ("movdqa %%xmm3, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[0].a) + : "memory" ); + + /* Round 4..7 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + "sha256msg1 %%xmm4, %%xmm3\n\t" + : + : [constants] "m" (K[1].a) + : "memory" ); + + /* Round 8..11 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + "sha256msg1 %%xmm5, %%xmm4\n\t" + : + : [constants] "m" (K[2].a) + : "memory" ); + +#define ROUND(k, MSG0, MSG1, MSG2, MSG3) \ + asm volatile ("movdqa %%"MSG0", %%xmm0\n\t" \ + "paddd %[constants], %%xmm0\n\t" \ + "sha256rnds2 %%xmm1, 
%%xmm2\n\t" \ + "movdqa %%"MSG0", %%xmm7\n\t" \ + "palignr $4, %%"MSG3", %%xmm7\n\t" \ + "paddd %%xmm7, %%"MSG1"\n\t" \ + "sha256msg2 %%"MSG0", %%"MSG1"\n\t" \ + "psrldq $8, %%xmm0\n\t" \ + "sha256rnds2 %%xmm2, %%xmm1\n\t" \ + "sha256msg1 %%"MSG0", %%"MSG3"\n\t" \ + : \ + : [constants] "m" (K[k].a) \ + : "memory" ) + + /* Rounds 12..15 to 48..51 */ + ROUND(3, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(4, "xmm3", "xmm4", "xmm5", "xmm6"); + ROUND(5, "xmm4", "xmm5", "xmm6", "xmm3"); + ROUND(6, "xmm5", "xmm6", "xmm3", "xmm4"); + ROUND(7, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(8, "xmm3", "xmm4", "xmm5", "xmm6"); + ROUND(9, "xmm4", "xmm5", "xmm6", "xmm3"); + ROUND(10, "xmm5", "xmm6", "xmm3", "xmm4"); + ROUND(11, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(12, "xmm3", "xmm4", "xmm5", "xmm6"); + + if (--nblks == 0) + break; + + /* Round 52..55 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm4, %%xmm7\n\t" + "palignr $4, %%xmm3, %%xmm7\n\t" + "movdqu 0*16(%[data]), %%xmm3\n\t" + "paddd %%xmm7, %%xmm5\n\t" + "sha256msg2 %%xmm4, %%xmm5\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[13].a), [data] "r" (data) + : "memory" ); + + /* Round 56..59 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm5, %%xmm7\n\t" + "palignr $4, %%xmm4, %%xmm7\n\t" + "movdqu 1*16(%[data]), %%xmm4\n\t" + "paddd %%xmm7, %%xmm6\n\t" + "movdqa %[mask], %%xmm7\n\t" /* Reload mask */ + "sha256msg2 %%xmm5, %%xmm6\n\t" + "movdqu 2*16(%[data]), %%xmm5\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[14].a), [mask] "m" (*bshuf_mask), + [data] "r" (data) + : "memory" ); + + /* Round 60..63 */ + asm volatile ("movdqa %%xmm6, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "movdqu 3*16(%[data]), %%xmm6\n\t" + "paddd %[constants], %%xmm0\n\t" + "pshufb %%xmm7, %%xmm4\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm5\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[15].a), [data] "r" (data) + : "memory" ); + data += 64; + + /* Merge states */ + asm volatile ("paddd (%[abef_save]), %%xmm1\n\t" + "paddd (%[cdgh_save]), %%xmm2\n\t" + "pshufb %%xmm7, %%xmm6\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + } + while (1); + + /* Round 52..55 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm4, %%xmm7\n\t" + "palignr $4, %%xmm3, %%xmm7\n\t" + "paddd %%xmm7, %%xmm5\n\t" + "sha256msg2 %%xmm4, %%xmm5\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[13].a) + : "memory" ); + + /* Round 56..59 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm5, %%xmm7\n\t" + "palignr $4, %%xmm4, %%xmm7\n\t" + "paddd %%xmm7, %%xmm6\n\t" + "movdqa %[mask], %%xmm7\n\t" /* Reload mask */ + "sha256msg2 %%xmm5, %%xmm6\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[14].a), [mask] "m" (*bshuf_mask) + : "memory" ); + + /* Round 60..63 */ + asm volatile ("movdqa %%xmm6, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[15].a) + : "memory" ); + + /* Merge states */ + asm 
volatile ("paddd (%[abef_save]), %%xmm1\n\t" + "paddd (%[cdgh_save]), %%xmm2\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + + /* Save state (XMM1=FEBA, XMM2=HGDC) */ + asm volatile ("movaps %%xmm1, %%xmm0\n\t" + "shufps $0x11, %%xmm2, %%xmm1\n\t" /* xmm=ABCD */ + "shufps $0xbb, %%xmm2, %%xmm0\n\t" /* xmm=EFGH */ + "movups %%xmm1, 16(%[state])\n\t" + "movups %%xmm0, 0(%[state])\n\t" + : + : [state] "r" (state) + : "memory" ); + + shaext_cleanup (abef_save, cdgh_save); + return 0; +} + +#endif /* HAVE_GCC_INLINE_ASM_SHA_EXT */ diff --git a/cipher/sha256.c b/cipher/sha256.c index d174321d5..cb6a860ac 100644 --- a/cipher/sha256.c +++ b/cipher/sha256.c @@ -75,6 +75,14 @@ # define USE_AVX2 1 #endif +/* USE_SHAEXT indicates whether to compile with Intel SHA Extension code. */ +#undef USE_SHAEXT +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && \ + defined(ENABLE_SHAEXT_SUPPORT) +# define USE_SHAEXT 1 +#endif + /* USE_ARM_CE indicates whether to enable ARMv8 Crypto Extension assembly * code. */ #undef USE_ARM_CE @@ -103,6 +111,9 @@ typedef struct { #ifdef USE_AVX2 unsigned int use_avx2:1; #endif +#ifdef USE_SHAEXT + unsigned int use_shaext:1; +#endif #ifdef USE_ARM_CE unsigned int use_arm_ce:1; #endif @@ -147,6 +158,10 @@ sha256_init (void *context, unsigned int flags) #ifdef USE_AVX2 hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_ARM_CE hd->use_arm_ce = (features & HWF_ARM_SHA2) != 0; #endif @@ -188,6 +203,10 @@ sha224_init (void *context, unsigned int flags) #ifdef USE_AVX2 hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_ARM_CE hd->use_arm_ce = (features & HWF_ARM_SHA2) != 0; #endif @@ -350,7 +369,8 @@ transform_blk (void *ctx, const unsigned char *data) * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI #undef ASM_EXTRA_STACK -#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) +#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) || \ + defined(USE_SHAEXT) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # define ASM_EXTRA_STACK (10 * 16) @@ -379,6 +399,14 @@ unsigned int _gcry_sha256_transform_amd64_avx2(const void *input_data, size_t num_blks) ASM_FUNC_ABI; #endif +#ifdef USE_SHAEXT +/* Does not need ASM_FUNC_ABI */ +unsigned int +_gcry_sha256_transform_intel_shaext(u32 state[8], + const unsigned char *input_data, + size_t num_blks); +#endif + #ifdef USE_ARM_CE unsigned int _gcry_sha256_transform_armv8_ce(u32 state[8], const void *input_data, @@ -391,27 +419,49 @@ transform (void *ctx, const unsigned char *data, size_t nblks) SHA256_CONTEXT *hd = ctx; unsigned int burn; +#ifdef USE_SHAEXT + if (hd->use_shaext) + { + burn = _gcry_sha256_transform_intel_shaext (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } +#endif + #ifdef USE_AVX2 if (hd->use_avx2) - return _gcry_sha256_transform_amd64_avx2 (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_avx2 (data, &hd->h0, nblks); + burn += burn ? 
4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_AVX if (hd->use_avx) - return _gcry_sha256_transform_amd64_avx (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_avx (data, &hd->h0, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_SSSE3 if (hd->use_ssse3) - return _gcry_sha256_transform_amd64_ssse3 (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_ssse3 (data, &hd->h0, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_ARM_CE if (hd->use_arm_ce) - return _gcry_sha256_transform_armv8_ce (&hd->h0, data, nblks); + { + burn = _gcry_sha256_transform_armv8_ce (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) : 0; + return burn; + } #endif do diff --git a/configure.ac b/configure.ac index 4ae7667b3..b5d72111a 100644 --- a/configure.ac +++ b/configure.ac @@ -2395,6 +2395,13 @@ if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch64-ce.lo" ;; esac + + case "$mpi_cpu_arch" in + x86) + # Build with the SHAEXT implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-intel-shaext.lo" + ;; + esac fi LIST_MEMBER(sha512, $enabled_digests) From cvs at cvs.gnupg.org Sun Feb 18 16:12:58 2018 From: cvs at cvs.gnupg.org (by Jussi Kivilinna) Date: Sun, 18 Feb 2018 16:12:58 +0100 Subject: [git] GCRYPT - branch, master, updated. libgcrypt-1.8.1-45-g0b3ec35 Message-ID: This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "The GNU crypto library". The branch, master has been updated via 0b3ec359e2279c3b46b171372b1b7733bba20cd7 (commit) via d02958bd300d2c80bc92b1e072103e95e256b297 (commit) via da58a62ac1b7a8d97b0895dcb41d15af531e45e5 (commit) via af7fc732f9a7af7a70276f1e8364d2132db314f1 (commit) from ffdc6f3623a0bcb41324d562340b2cd1c288e387 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- commit 0b3ec359e2279c3b46b171372b1b7733bba20cd7 Author: Jussi Kivilinna Date: Thu Feb 15 22:13:28 2018 +0200 Add Intel SHA Extensions accelerated SHA256 implementation * cipher/Makefile.am: Add 'sha256-intel-shaext.c'. * cipher/sha256-intel-shaext.c: New. * cipher/sha256.c (USE_SHAEXT) (_gcry_sha256_transform_intel_shaext): New. (SHA256_CONTEXT): Add 'use_shaext'. (sha256_init, sha224_init) [USE_SHAEXT]: Use shaext if supported. (transform) [USE_SHAEXT]: Use shaext if enabled. (transform): Only add ASM_EXTRA_STACK if returned burn length is not zero. * configure.ac: Add 'sha256-intel-shaext.lo'. 
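
As background on the "only add ASM_EXTRA_STACK if returned burn length is not zero" item listed above: the accelerated transforms report how many stack bytes they may have touched with sensitive data, and a return value of zero means there is nothing to wipe, so the fixed call/ABI overhead must not be added in that case. A minimal sketch of the pattern; names prefixed with example_ are hypothetical, the real code is in the cipher/sha256.c transform() hunk below:

  #include <stddef.h>

  #define EXAMPLE_EXTRA_STACK (10 * 16)  /* Win64 XMM6-XMM15 spill area */

  /* Accelerated transform: returns 0 when nothing sensitive was left on
   * the stack, otherwise the number of bytes that may need wiping. */
  extern unsigned int example_transform_accel (void *state,
                                               const unsigned char *data,
                                               size_t nblks);

  static unsigned int
  example_transform (void *state, const unsigned char *data, size_t nblks)
  {
    unsigned int burn = example_transform_accel (state, data, nblks);

    /* Add call/ABI overhead only when there is something to burn. */
    burn += burn ? 4 * sizeof (void *) + EXAMPLE_EXTRA_STACK : 0;
    return burn;
  }
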
-- Benchmark on Intel Celeron J3455 (1500 Mhz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 10.07 ns/B 94.72 MiB/s 15.10 c/B After (3.7x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 2.70 ns/B 353.8 MiB/s 4.04 c/B Signed-off-by: Jussi Kivilinna diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 110a48b..599e3c1 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -94,7 +94,7 @@ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ sha1-intel-shaext.c \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ - sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ + sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S sha256-intel-shaext.c \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ sm3.c \ diff --git a/cipher/sha256-intel-shaext.c b/cipher/sha256-intel-shaext.c new file mode 100644 index 0000000..0c107bb --- /dev/null +++ b/cipher/sha256-intel-shaext.c @@ -0,0 +1,352 @@ +/* sha256-intel-shaext.S - SHAEXT accelerated SHA-256 transform function + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#include "types.h" + +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && defined(USE_SHA256) && \ + defined(ENABLE_SHAEXT_SUPPORT) + +#if _GCRY_GCC_VERSION >= 40400 /* 4.4 */ +/* Prevent compiler from issuing SSE instructions between asm blocks. */ +# pragma GCC target("no-sse") +#endif + +/* Two macros to be called prior and after the use of SHA-EXT + instructions. There should be no external function calls between + the use of these macros. There purpose is to make sure that the + SSE regsiters are cleared and won't reveal any information about + the key or the data. */ +#ifdef __WIN64__ +/* XMM6-XMM15 are callee-saved registers on WIN64. 
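
A short note on the save area set up further down in this function: a 16-byte-aligned scratch block is carved out of an over-allocated char buffer on the stack, because movdqa requires aligned addresses. The arithmetic is worth spelling out; this is an illustrative sketch, not part of the patch:

  /* Hedged illustration: (-(uintptr_t)p & 15) is the number of bytes
   * needed to round p up to the next multiple of 16, so a pointer into a
   * buffer of size 2*16 + 15 can always be advanced to a 16-byte
   * boundary without leaving the buffer. */
  #include <stdint.h>

  static char *
  align16 (char *p)
  {
    return p + ((-(uintptr_t)p) & 15);
  }
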
*/ +# define shaext_prepare_variable char win64tmp[2*16] +# define shaext_prepare_variable_size sizeof(win64tmp) +# define shaext_prepare() \ + do { asm volatile ("movdqu %%xmm6, (%0)\n" \ + "movdqu %%xmm7, (%1)\n" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]) \ + : "memory"); \ + } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("movdqu (%0), %%xmm6\n" \ + "movdqu (%1), %%xmm7\n" \ + "pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "movdqa %%xmm0, (%2)\n\t" \ + "movdqa %%xmm0, (%3)\n\t" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]), \ + "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#else +# define shaext_prepare_variable +# define shaext_prepare_variable_size 0 +# define shaext_prepare() do { } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "pxor %%xmm6, %%xmm6\n" \ + "pxor %%xmm7, %%xmm7\n" \ + "movdqa %%xmm0, (%0)\n\t" \ + "movdqa %%xmm0, (%1)\n\t" \ + : \ + : "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#endif + +typedef struct u128_s +{ + u32 a, b, c, d; +} u128_t; + +/* + * Transform nblks*64 bytes (nblks*16 32-bit words) at DATA. + */ +unsigned int +_gcry_sha256_transform_intel_shaext(u32 state[8], const unsigned char *data, + size_t nblks) +{ + static const unsigned char bshuf_mask[16] __attribute__ ((aligned (16))) = + { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; + static const u128_t K[16] __attribute__ ((aligned (16))) = + { + { 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5 }, + { 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5 }, + { 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3 }, + { 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174 }, + { 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc }, + { 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da }, + { 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7 }, + { 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967 }, + { 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13 }, + { 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85 }, + { 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3 }, + { 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070 }, + { 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5 }, + { 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3 }, + { 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208 }, + { 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 } + }; + char save_buf[2 * 16 + 15]; + char *abef_save; + char *cdgh_save; + shaext_prepare_variable; + + if (nblks == 0) + return 0; + + shaext_prepare (); + + asm volatile ("" : "=r" (abef_save) : "0" (save_buf) : "memory"); + abef_save = abef_save + (-(uintptr_t)abef_save & 15); + cdgh_save = abef_save + 16; + + /* byteswap mask => XMM7 */ + asm volatile ("movdqa %[mask], %%xmm7\n\t" /* Preload mask */ + : + : [mask] "m" (*bshuf_mask) + : "memory"); + + /* Load state.. 
ABEF_SAVE => STATE0 XMM1, CDGH_STATE => STATE1 XMM2 */ + asm volatile ("movups 16(%[state]), %%xmm1\n\t" /* HGFE (xmm=EFGH) */ + "movups 0(%[state]), %%xmm0\n\t" /* DCBA (xmm=ABCD) */ + "movaps %%xmm1, %%xmm2\n\t" + "shufps $0x11, %%xmm0, %%xmm1\n\t" /* ABEF (xmm=FEBA) */ + "shufps $0xbb, %%xmm0, %%xmm2\n\t" /* CDGH (xmm=HGDC) */ + : + : [state] "r" (state) + : "memory" ); + + /* Load message */ + asm volatile ("movdqu 0*16(%[data]), %%xmm3\n\t" + "movdqu 1*16(%[data]), %%xmm4\n\t" + "movdqu 2*16(%[data]), %%xmm5\n\t" + "movdqu 3*16(%[data]), %%xmm6\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "pshufb %%xmm7, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm5\n\t" + "pshufb %%xmm7, %%xmm6\n\t" + : + : [data] "r" (data) + : "memory" ); + data += 64; + + do + { + /* Save state */ + asm volatile ("movdqa %%xmm1, (%[abef_save])\n\t" + "movdqa %%xmm2, (%[cdgh_save])\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + + /* Round 0..3 */ + asm volatile ("movdqa %%xmm3, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[0].a) + : "memory" ); + + /* Round 4..7 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + "sha256msg1 %%xmm4, %%xmm3\n\t" + : + : [constants] "m" (K[1].a) + : "memory" ); + + /* Round 8..11 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + "sha256msg1 %%xmm5, %%xmm4\n\t" + : + : [constants] "m" (K[2].a) + : "memory" ); + +#define ROUND(k, MSG0, MSG1, MSG2, MSG3) \ + asm volatile ("movdqa %%"MSG0", %%xmm0\n\t" \ + "paddd %[constants], %%xmm0\n\t" \ + "sha256rnds2 %%xmm1, %%xmm2\n\t" \ + "movdqa %%"MSG0", %%xmm7\n\t" \ + "palignr $4, %%"MSG3", %%xmm7\n\t" \ + "paddd %%xmm7, %%"MSG1"\n\t" \ + "sha256msg2 %%"MSG0", %%"MSG1"\n\t" \ + "psrldq $8, %%xmm0\n\t" \ + "sha256rnds2 %%xmm2, %%xmm1\n\t" \ + "sha256msg1 %%"MSG0", %%"MSG3"\n\t" \ + : \ + : [constants] "m" (K[k].a) \ + : "memory" ) + + /* Rounds 12..15 to 48..51 */ + ROUND(3, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(4, "xmm3", "xmm4", "xmm5", "xmm6"); + ROUND(5, "xmm4", "xmm5", "xmm6", "xmm3"); + ROUND(6, "xmm5", "xmm6", "xmm3", "xmm4"); + ROUND(7, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(8, "xmm3", "xmm4", "xmm5", "xmm6"); + ROUND(9, "xmm4", "xmm5", "xmm6", "xmm3"); + ROUND(10, "xmm5", "xmm6", "xmm3", "xmm4"); + ROUND(11, "xmm6", "xmm3", "xmm4", "xmm5"); + ROUND(12, "xmm3", "xmm4", "xmm5", "xmm6"); + + if (--nblks == 0) + break; + + /* Round 52..55 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm4, %%xmm7\n\t" + "palignr $4, %%xmm3, %%xmm7\n\t" + "movdqu 0*16(%[data]), %%xmm3\n\t" + "paddd %%xmm7, %%xmm5\n\t" + "sha256msg2 %%xmm4, %%xmm5\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[13].a), [data] "r" (data) + : "memory" ); + + /* Round 56..59 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm5, %%xmm7\n\t" + "palignr $4, %%xmm4, %%xmm7\n\t" + "movdqu 1*16(%[data]), %%xmm4\n\t" + "paddd %%xmm7, %%xmm6\n\t" + "movdqa %[mask], %%xmm7\n\t" /* Reload mask */ + "sha256msg2 %%xmm5, %%xmm6\n\t" + "movdqu 2*16(%[data]), %%xmm5\n\t" + "psrldq $8, 
%%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[14].a), [mask] "m" (*bshuf_mask), + [data] "r" (data) + : "memory" ); + + /* Round 60..63 */ + asm volatile ("movdqa %%xmm6, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "movdqu 3*16(%[data]), %%xmm6\n\t" + "paddd %[constants], %%xmm0\n\t" + "pshufb %%xmm7, %%xmm4\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm5\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[15].a), [data] "r" (data) + : "memory" ); + data += 64; + + /* Merge states */ + asm volatile ("paddd (%[abef_save]), %%xmm1\n\t" + "paddd (%[cdgh_save]), %%xmm2\n\t" + "pshufb %%xmm7, %%xmm6\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + } + while (1); + + /* Round 52..55 */ + asm volatile ("movdqa %%xmm4, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm4, %%xmm7\n\t" + "palignr $4, %%xmm3, %%xmm7\n\t" + "paddd %%xmm7, %%xmm5\n\t" + "sha256msg2 %%xmm4, %%xmm5\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[13].a) + : "memory" ); + + /* Round 56..59 */ + asm volatile ("movdqa %%xmm5, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "movdqa %%xmm5, %%xmm7\n\t" + "palignr $4, %%xmm4, %%xmm7\n\t" + "paddd %%xmm7, %%xmm6\n\t" + "movdqa %[mask], %%xmm7\n\t" /* Reload mask */ + "sha256msg2 %%xmm5, %%xmm6\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[14].a), [mask] "m" (*bshuf_mask) + : "memory" ); + + /* Round 60..63 */ + asm volatile ("movdqa %%xmm6, %%xmm0\n\t" + "paddd %[constants], %%xmm0\n\t" + "sha256rnds2 %%xmm1, %%xmm2\n\t" + "psrldq $8, %%xmm0\n\t" + "sha256rnds2 %%xmm2, %%xmm1\n\t" + : + : [constants] "m" (K[15].a) + : "memory" ); + + /* Merge states */ + asm volatile ("paddd (%[abef_save]), %%xmm1\n\t" + "paddd (%[cdgh_save]), %%xmm2\n\t" + : + : [abef_save] "r" (abef_save), [cdgh_save] "r" (cdgh_save) + : "memory" ); + + /* Save state (XMM1=FEBA, XMM2=HGDC) */ + asm volatile ("movaps %%xmm1, %%xmm0\n\t" + "shufps $0x11, %%xmm2, %%xmm1\n\t" /* xmm=ABCD */ + "shufps $0xbb, %%xmm2, %%xmm0\n\t" /* xmm=EFGH */ + "movups %%xmm1, 16(%[state])\n\t" + "movups %%xmm0, 0(%[state])\n\t" + : + : [state] "r" (state) + : "memory" ); + + shaext_cleanup (abef_save, cdgh_save); + return 0; +} + +#endif /* HAVE_GCC_INLINE_ASM_SHA_EXT */ diff --git a/cipher/sha256.c b/cipher/sha256.c index d174321..cb6a860 100644 --- a/cipher/sha256.c +++ b/cipher/sha256.c @@ -75,6 +75,14 @@ # define USE_AVX2 1 #endif +/* USE_SHAEXT indicates whether to compile with Intel SHA Extension code. */ +#undef USE_SHAEXT +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && \ + defined(ENABLE_SHAEXT_SUPPORT) +# define USE_SHAEXT 1 +#endif + /* USE_ARM_CE indicates whether to enable ARMv8 Crypto Extension assembly * code. 
*/ #undef USE_ARM_CE @@ -103,6 +111,9 @@ typedef struct { #ifdef USE_AVX2 unsigned int use_avx2:1; #endif +#ifdef USE_SHAEXT + unsigned int use_shaext:1; +#endif #ifdef USE_ARM_CE unsigned int use_arm_ce:1; #endif @@ -147,6 +158,10 @@ sha256_init (void *context, unsigned int flags) #ifdef USE_AVX2 hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_ARM_CE hd->use_arm_ce = (features & HWF_ARM_SHA2) != 0; #endif @@ -188,6 +203,10 @@ sha224_init (void *context, unsigned int flags) #ifdef USE_AVX2 hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_ARM_CE hd->use_arm_ce = (features & HWF_ARM_SHA2) != 0; #endif @@ -350,7 +369,8 @@ transform_blk (void *ctx, const unsigned char *data) * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI #undef ASM_EXTRA_STACK -#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) +#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_AVX2) || \ + defined(USE_SHAEXT) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # define ASM_EXTRA_STACK (10 * 16) @@ -379,6 +399,14 @@ unsigned int _gcry_sha256_transform_amd64_avx2(const void *input_data, size_t num_blks) ASM_FUNC_ABI; #endif +#ifdef USE_SHAEXT +/* Does not need ASM_FUNC_ABI */ +unsigned int +_gcry_sha256_transform_intel_shaext(u32 state[8], + const unsigned char *input_data, + size_t num_blks); +#endif + #ifdef USE_ARM_CE unsigned int _gcry_sha256_transform_armv8_ce(u32 state[8], const void *input_data, @@ -391,27 +419,49 @@ transform (void *ctx, const unsigned char *data, size_t nblks) SHA256_CONTEXT *hd = ctx; unsigned int burn; +#ifdef USE_SHAEXT + if (hd->use_shaext) + { + burn = _gcry_sha256_transform_intel_shaext (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } +#endif + #ifdef USE_AVX2 if (hd->use_avx2) - return _gcry_sha256_transform_amd64_avx2 (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_avx2 (data, &hd->h0, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_AVX if (hd->use_avx) - return _gcry_sha256_transform_amd64_avx (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_avx (data, &hd->h0, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_SSSE3 if (hd->use_ssse3) - return _gcry_sha256_transform_amd64_ssse3 (data, &hd->h0, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha256_transform_amd64_ssse3 (data, &hd->h0, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_ARM_CE if (hd->use_arm_ce) - return _gcry_sha256_transform_armv8_ce (&hd->h0, data, nblks); + { + burn = _gcry_sha256_transform_armv8_ce (&hd->h0, data, nblks); + burn += burn ? 
4 * sizeof(void*) : 0; + return burn; + } #endif do diff --git a/configure.ac b/configure.ac index 4ae7667..b5d7211 100644 --- a/configure.ac +++ b/configure.ac @@ -2395,6 +2395,13 @@ if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch64-ce.lo" ;; esac + + case "$mpi_cpu_arch" in + x86) + # Build with the SHAEXT implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-intel-shaext.lo" + ;; + esac fi LIST_MEMBER(sha512, $enabled_digests) commit d02958bd300d2c80bc92b1e072103e95e256b297 Author: Jussi Kivilinna Date: Tue Feb 13 20:22:41 2018 +0200 Add Intel SHA Extensions accelerated SHA1 implementation * cipher/Makefile.am: Add 'sha1-intel-shaext.c'. * cipher/sha1-intel-shaext.c: New. * cipher/sha1.c (USE_SHAEXT, _gcry_sha1_transform_intel_shaext): New. (sha1_init) [USE_SHAEXT]: Use shaext implementation is supported. (transform) [USE_SHAEXT]: Use shaext if enabled. (transform): Only add ASM_EXTRA_STACK if returned burn length is not zero. * cipher/sha1.h (SHA1_CONTEXT): Add 'use_shaext'. * configure.ac: Add 'sha1-intel-shaext.lo'. (shaextsupport, gcry_cv_gcc_inline_asm_shaext): New. * src/g10lib.h: Add HWF_INTEL_SHAEXT and reorder HWF flags. * src/hwf-x86.c (detect_x86_gnuc): Detect SHA Extensions. * src/hwfeatures.c (hwflist): Add 'intel-shaext'. -- Benchmark on Intel Celeron J3455 (1500 Mhz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 4.50 ns/B 211.7 MiB/s 6.76 c/B After (4.0x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA1 | 1.11 ns/B 858.1 MiB/s 1.67 c/B Signed-off-by: Jussi Kivilinna diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 625a0ef..110a48b 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -92,6 +92,7 @@ seed.c \ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-armv7-neon.S sha1-armv8-aarch32-ce.S sha1-armv8-aarch64-ce.S \ + sha1-intel-shaext.c \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \ sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \ diff --git a/cipher/sha1-intel-shaext.c b/cipher/sha1-intel-shaext.c new file mode 100644 index 0000000..5a2349e --- /dev/null +++ b/cipher/sha1-intel-shaext.c @@ -0,0 +1,281 @@ +/* sha1-intel-shaext.S - SHAEXT accelerated SHA-1 transform function + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#include "types.h" + +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && defined(USE_SHA1) && \ + defined(ENABLE_SHAEXT_SUPPORT) + +#if _GCRY_GCC_VERSION >= 40400 /* 4.4 */ +/* Prevent compiler from issuing SSE instructions between asm blocks. 
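
The pragma that follows deserves a word of explanation: the SHA-1 state is carried in XMM registers *between* separate asm statements, so the compiler must not emit any SSE code of its own (register spills, expanded memcpy, etc.) in between, and disabling SSE for compiler-generated code in this translation unit is the blunt but effective way to guarantee that with GCC. A compile-only sketch of the idea, not taken from the patch:

  /* Hedged illustration: with SSE disabled for compiler codegen, a value
   * left in xmm0 by the first asm statement is still there when the
   * second statement reads it, because no compiler-generated SSE
   * instructions can appear between them. */
  #if defined(__GNUC__) && defined(__x86_64__)
  # pragma GCC target("no-sse")

  void
  example_chained_asm (void)
  {
    asm volatile ("pxor %%xmm0, %%xmm0" ::: "memory");   /* writes xmm0 */
    asm volatile ("paddb %%xmm0, %%xmm0" ::: "memory");  /* reads it back */
  }
  #endif
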
*/ +# pragma GCC target("no-sse") +#endif + +/* Two macros to be called prior and after the use of SHA-EXT + instructions. There should be no external function calls between + the use of these macros. There purpose is to make sure that the + SSE regsiters are cleared and won't reveal any information about + the key or the data. */ +#ifdef __WIN64__ +/* XMM6-XMM15 are callee-saved registers on WIN64. */ +# define shaext_prepare_variable char win64tmp[2*16] +# define shaext_prepare_variable_size sizeof(win64tmp) +# define shaext_prepare() \ + do { asm volatile ("movdqu %%xmm6, (%0)\n" \ + "movdqu %%xmm7, (%1)\n" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]) \ + : "memory"); \ + } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("movdqu (%0), %%xmm6\n" \ + "movdqu (%1), %%xmm7\n" \ + "pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "movdqa %%xmm0, (%2)\n\t" \ + "movdqa %%xmm0, (%3)\n\t" \ + : \ + : "r" (&win64tmp[0]), "r" (&win64tmp[16]), \ + "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#else +# define shaext_prepare_variable +# define shaext_prepare_variable_size 0 +# define shaext_prepare() do { } while (0) +# define shaext_cleanup(tmp0,tmp1) \ + do { asm volatile ("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "pxor %%xmm6, %%xmm6\n" \ + "pxor %%xmm7, %%xmm7\n" \ + "movdqa %%xmm0, (%0)\n\t" \ + "movdqa %%xmm0, (%1)\n\t" \ + : \ + : "r" (tmp0), "r" (tmp1) \ + : "memory"); \ + } while (0) +#endif + +/* + * Transform nblks*64 bytes (nblks*16 32-bit words) at DATA. + */ +unsigned int +_gcry_sha1_transform_intel_shaext(void *state, const unsigned char *data, + size_t nblks) +{ + static const unsigned char be_mask[16] __attribute__ ((aligned (16))) = + { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; + char save_buf[2 * 16 + 15]; + char *abcd_save; + char *e_save; + shaext_prepare_variable; + + if (nblks == 0) + return 0; + + shaext_prepare (); + + asm volatile ("" : "=r" (abcd_save) : "0" (save_buf) : "memory"); + abcd_save = abcd_save + (-(uintptr_t)abcd_save & 15); + e_save = abcd_save + 16; + + /* byteswap mask => XMM7 */ + asm volatile ("movdqa %[mask], %%xmm7\n\t" /* Preload mask */ + : + : [mask] "m" (*be_mask) + : "memory"); + + /* Load state.. 
ABCD => XMM4, E => XMM5 */ + asm volatile ("movd 16(%[state]), %%xmm5\n\t" + "movdqu (%[state]), %%xmm4\n\t" + "pslldq $12, %%xmm5\n\t" + "pshufd $0x1b, %%xmm4, %%xmm4\n\t" + "movdqa %%xmm5, (%[e_save])\n\t" + "movdqa %%xmm4, (%[abcd_save])\n\t" + : + : [state] "r" (state), [abcd_save] "r" (abcd_save), + [e_save] "r" (e_save) + : "memory" ); + + /* DATA => XMM[0..4] */ + asm volatile ("movdqu 0(%[data]), %%xmm0\n\t" + "movdqu 16(%[data]), %%xmm1\n\t" + "movdqu 32(%[data]), %%xmm2\n\t" + "movdqu 48(%[data]), %%xmm3\n\t" + "pshufb %%xmm7, %%xmm0\n\t" + "pshufb %%xmm7, %%xmm1\n\t" + "pshufb %%xmm7, %%xmm2\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + : + : [data] "r" (data) + : "memory" ); + data += 64; + + while (1) + { + /* Round 0..3 */ + asm volatile ("paddd %%xmm0, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" /* ABCD => E1 */ + "sha1rnds4 $0, %%xmm5, %%xmm4\n\t" + ::: "memory" ); + + /* Round 4..7 */ + asm volatile ("sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $0, %%xmm6, %%xmm4\n\t" + "sha1msg1 %%xmm1, %%xmm0\n\t" + ::: "memory" ); + + /* Round 8..11 */ + asm volatile ("sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1rnds4 $0, %%xmm5, %%xmm4\n\t" + "sha1msg1 %%xmm2, %%xmm1\n\t" + "pxor %%xmm2, %%xmm0\n\t" + ::: "memory" ); + +#define ROUND(imm, E0, E1, MSG0, MSG1, MSG2, MSG3) \ + asm volatile ("sha1nexte %%"MSG0", %%"E0"\n\t" \ + "movdqa %%xmm4, %%"E1"\n\t" \ + "sha1msg2 %%"MSG0", %%"MSG1"\n\t" \ + "sha1rnds4 $"imm", %%"E0", %%xmm4\n\t" \ + "sha1msg1 %%"MSG0", %%"MSG3"\n\t" \ + "pxor %%"MSG0", %%"MSG2"\n\t" \ + ::: "memory" ) + + /* Rounds 12..15 to 64..67 */ + ROUND("0", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("0", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("1", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("1", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("1", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("1", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("1", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("2", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("2", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("2", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + ROUND("2", "xmm6", "xmm5", "xmm1", "xmm2", "xmm3", "xmm0"); + ROUND("2", "xmm5", "xmm6", "xmm2", "xmm3", "xmm0", "xmm1"); + ROUND("3", "xmm6", "xmm5", "xmm3", "xmm0", "xmm1", "xmm2"); + ROUND("3", "xmm5", "xmm6", "xmm0", "xmm1", "xmm2", "xmm3"); + + if (--nblks == 0) + break; + + /* Round 68..71 */ + asm volatile ("movdqu 0(%[data]), %%xmm0\n\t" + "sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1msg2 %%xmm1, %%xmm2\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pxor %%xmm1, %%xmm3\n\t" + "pshufb %%xmm7, %%xmm0\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Round 72..75 */ + asm volatile ("movdqu 16(%[data]), %%xmm1\n\t" + "sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1msg2 %%xmm2, %%xmm3\n\t" + "sha1rnds4 $3, %%xmm5, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm1\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Round 76..79 */ + asm volatile ("movdqu 32(%[data]), %%xmm2\n\t" + "sha1nexte %%xmm3, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pshufb %%xmm7, %%xmm2\n\t" + : + : [data] "r" (data) + : "memory" ); + + /* Merge states, store current. 
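
For orientation on the asm above: each sha1rnds4 executes four SHA-1 rounds, and its immediate selects the (boolean function, constant) pair for that part of the schedule, matching the $0/$1/$2/$3 progression in this function. A scalar reference of one such round group, my own sketch rather than part of the patch (the hardware splits the work slightly differently, folding the E term into the message operand via sha1nexte, but the per-group arithmetic is this):

  /* Hedged reference: four SHA-1 rounds; selector 0 -> Ch/0x5A827999,
   * 1 -> Parity/0x6ED9EBA1, 2 -> Maj/0x8F1BBCDC, 3 -> Parity/0xCA62C1D6. */
  #include <stdint.h>

  #define ROL32(x,n)  (((x) << (n)) | ((x) >> (32 - (n))))

  static void
  sha1_rounds4_ref (uint32_t s[5], const uint32_t w[4], int selector)
  {
    static const uint32_t k[4] =
      { 0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6 };
    int i;

    for (i = 0; i < 4; i++)
      {
        uint32_t a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];
        uint32_t f;

        if (selector == 0)      f = (b & c) | (~b & d);          /* Ch */
        else if (selector == 2) f = (b & c) | (b & d) | (c & d); /* Maj */
        else                    f = b ^ c ^ d;                   /* Parity */

        e += ROL32 (a, 5) + f + k[selector] + w[i];
        s[4] = d;  s[3] = c;  s[2] = ROL32 (b, 30);  s[1] = a;  s[0] = e;
      }
  }
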
*/ + asm volatile ("movdqu 48(%[data]), %%xmm3\n\t" + "sha1nexte (%[e_save]), %%xmm5\n\t" + "paddd (%[abcd_save]), %%xmm4\n\t" + "pshufb %%xmm7, %%xmm3\n\t" + "movdqa %%xmm5, (%[e_save])\n\t" + "movdqa %%xmm4, (%[abcd_save])\n\t" + : + : [abcd_save] "r" (abcd_save), [e_save] "r" (e_save), + [data] "r" (data) + : "memory" ); + + data += 64; + } + + /* Round 68..71 */ + asm volatile ("sha1nexte %%xmm1, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1msg2 %%xmm1, %%xmm2\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + "pxor %%xmm1, %%xmm3\n\t" + ::: "memory" ); + + /* Round 72..75 */ + asm volatile ("sha1nexte %%xmm2, %%xmm5\n\t" + "movdqa %%xmm4, %%xmm6\n\t" + "sha1msg2 %%xmm2, %%xmm3\n\t" + "sha1rnds4 $3, %%xmm5, %%xmm4\n\t" + ::: "memory" ); + + /* Round 76..79 */ + asm volatile ("sha1nexte %%xmm3, %%xmm6\n\t" + "movdqa %%xmm4, %%xmm5\n\t" + "sha1rnds4 $3, %%xmm6, %%xmm4\n\t" + ::: "memory" ); + + /* Merge states. */ + asm volatile ("sha1nexte (%[e_save]), %%xmm5\n\t" + "paddd (%[abcd_save]), %%xmm4\n\t" + : + : [abcd_save] "r" (abcd_save), [e_save] "r" (e_save) + : "memory" ); + + /* Save state */ + asm volatile ("pshufd $0x1b, %%xmm4, %%xmm4\n\t" + "psrldq $12, %%xmm5\n\t" + "movdqu %%xmm4, (%[state])\n\t" + "movd %%xmm5, 16(%[state])\n\t" + : + : [state] "r" (state) + : "memory" ); + + shaext_cleanup (abcd_save, e_save); + return 0; +} + +#endif /* HAVE_GCC_INLINE_ASM_SHA_EXT */ diff --git a/cipher/sha1.c b/cipher/sha1.c index 78b172f..09868aa 100644 --- a/cipher/sha1.c +++ b/cipher/sha1.c @@ -68,6 +68,14 @@ # define USE_BMI2 1 #endif +/* USE_SHAEXT indicates whether to compile with Intel SHA Extension code. */ +#undef USE_SHAEXT +#if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \ + defined(HAVE_GCC_INLINE_ASM_SSE41) && \ + defined(ENABLE_SHAEXT_SUPPORT) +# define USE_SHAEXT 1 +#endif + /* USE_NEON indicates whether to enable ARM NEON assembly code. */ #undef USE_NEON #ifdef ENABLE_NEON_SUPPORT @@ -138,6 +146,10 @@ sha1_init (void *context, unsigned int flags) #ifdef USE_BMI2 hd->use_bmi2 = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_BMI2); #endif +#ifdef USE_SHAEXT + hd->use_shaext = (features & HWF_INTEL_SHAEXT) + && (features & HWF_INTEL_SSE4_1); +#endif #ifdef USE_NEON hd->use_neon = (features & HWF_ARM_NEON) != 0; #endif @@ -311,7 +323,8 @@ transform_blk (void *ctx, const unsigned char *data) * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI #undef ASM_EXTRA_STACK -#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_BMI2) +#if defined(USE_SSSE3) || defined(USE_AVX) || defined(USE_BMI2) || \ + defined(USE_SHAEXT) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # define ASM_EXTRA_STACK (10 * 16) @@ -340,6 +353,13 @@ _gcry_sha1_transform_amd64_avx_bmi2 (void *state, const unsigned char *data, size_t nblks) ASM_FUNC_ABI; #endif +#ifdef USE_SHAEXT +/* Does not need ASM_FUNC_ABI */ +unsigned int +_gcry_sha1_transform_intel_shaext (void *state, const unsigned char *data, + size_t nblks); +#endif + static unsigned int transform (void *ctx, const unsigned char *data, size_t nblks) @@ -347,29 +367,53 @@ transform (void *ctx, const unsigned char *data, size_t nblks) SHA1_CONTEXT *hd = ctx; unsigned int burn; +#ifdef USE_SHAEXT + if (hd->use_shaext) + { + burn = _gcry_sha1_transform_intel_shaext (&hd->h0, data, nblks); + burn += burn ? 
4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } +#endif #ifdef USE_BMI2 if (hd->use_bmi2) - return _gcry_sha1_transform_amd64_avx_bmi2 (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_avx_bmi2 (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_AVX if (hd->use_avx) - return _gcry_sha1_transform_amd64_avx (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_avx (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_SSSE3 if (hd->use_ssse3) - return _gcry_sha1_transform_amd64_ssse3 (&hd->h0, data, nblks) - + 4 * sizeof(void*) + ASM_EXTRA_STACK; + { + burn = _gcry_sha1_transform_amd64_ssse3 (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) + ASM_EXTRA_STACK : 0; + return burn; + } #endif #ifdef USE_ARM_CE if (hd->use_arm_ce) - return _gcry_sha1_transform_armv8_ce (&hd->h0, data, nblks); + { + burn = _gcry_sha1_transform_armv8_ce (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) : 0; + return burn; + } #endif #ifdef USE_NEON if (hd->use_neon) - return _gcry_sha1_transform_armv7_neon (&hd->h0, data, nblks) - + 4 * sizeof(void*); + { + burn = _gcry_sha1_transform_armv7_neon (&hd->h0, data, nblks); + burn += burn ? 4 * sizeof(void*) : 0; + return burn; + } #endif do diff --git a/cipher/sha1.h b/cipher/sha1.h index d448fca..93ce79b 100644 --- a/cipher/sha1.h +++ b/cipher/sha1.h @@ -29,6 +29,7 @@ typedef struct unsigned int use_ssse3:1; unsigned int use_avx:1; unsigned int use_bmi2:1; + unsigned int use_shaext:1; unsigned int use_neon:1; unsigned int use_arm_ce:1; } SHA1_CONTEXT; diff --git a/configure.ac b/configure.ac index 305b19f..4ae7667 100644 --- a/configure.ac +++ b/configure.ac @@ -588,6 +588,14 @@ AC_ARG_ENABLE(aesni-support, aesnisupport=$enableval,aesnisupport=yes) AC_MSG_RESULT($aesnisupport) +# Implementation of the --disable-shaext-support switch. +AC_MSG_CHECKING([whether SHAEXT support is requested]) +AC_ARG_ENABLE(shaext-support, + AC_HELP_STRING([--disable-shaext-support], + [Disable support for the Intel SHAEXT instructions]), + shaextsupport=$enableval,shaextsupport=yes) +AC_MSG_RESULT($shaextsupport) + # Implementation of the --disable-pclmul-support switch. AC_MSG_CHECKING([whether PCLMUL support is requested]) AC_ARG_ENABLE(pclmul-support, @@ -1175,6 +1183,7 @@ AM_CONDITIONAL(MPI_MOD_C_UDIV_QRNND, test "$mpi_mod_c_udiv_qrnnd" = yes) # Reset non applicable feature flags. if test "$mpi_cpu_arch" != "x86" ; then aesnisupport="n/a" + shaextsupport="n/a" pclmulsupport="n/a" sse41support="n/a" avxsupport="n/a" @@ -1329,6 +1338,34 @@ if test "$gcry_cv_gcc_inline_asm_pclmul" = "yes" ; then [Defined if inline assembler supports PCLMUL instructions]) fi + +# +# Check whether GCC inline assembler supports SHA Extensions instructions. 
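
The configure probe that follows simply test-compiles each SHA mnemonic once; its result is consumed on the C side roughly as shown below. This gate is taken from the cipher/sha1.c and cipher/sha256.c hunks of this same patch set, so the only new text here is the summary:

  /* The compiler probe yields HAVE_GCC_INLINE_ASM_SHAEXT, the
   * --disable-shaext-support switch controls ENABLE_SHAEXT_SUPPORT, and
   * both must be defined, together with SSE4.1 inline-asm support,
   * before the fast path is compiled in at all. */
  #undef USE_SHAEXT
  #if defined(HAVE_GCC_INLINE_ASM_SHAEXT) && \
      defined(HAVE_GCC_INLINE_ASM_SSE41) && \
      defined(ENABLE_SHAEXT_SUPPORT)
  # define USE_SHAEXT 1
  #endif

Building with --disable-shaext-support therefore removes the code path entirely, independent of what the CPU reports at run time.
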
+# +AC_CACHE_CHECK([whether GCC inline assembler supports SHA Extensions instructions], + [gcry_cv_gcc_inline_asm_shaext], + [if test "$mpi_cpu_arch" != "x86" ; then + gcry_cv_gcc_inline_asm_shaext="n/a" + else + gcry_cv_gcc_inline_asm_shaext=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[void a(void) { + __asm__("sha1rnds4 \$0, %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1nexte %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1msg1 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha1msg2 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256rnds2 %%xmm0, %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256msg1 %%xmm1, %%xmm3\n\t":::"cc"); + __asm__("sha256msg2 %%xmm1, %%xmm3\n\t":::"cc"); + }]])], + [gcry_cv_gcc_inline_asm_shaext=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_shaext" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_SHAEXT,1, + [Defined if inline assembler supports SHA Extensions instructions]) +fi + + # # Check whether GCC inline assembler supports SSE4.1 instructions. # @@ -1921,6 +1958,11 @@ if test x"$aesnisupport" = xyes ; then aesnisupport="no (unsupported by compiler)" fi fi +if test x"$shaextsupport" = xyes ; then + if test "$gcry_cv_gcc_inline_asm_shaext" != "yes" ; then + shaextsupport="no (unsupported by compiler)" + fi +fi if test x"$pclmulsupport" = xyes ; then if test "$gcry_cv_gcc_inline_asm_pclmul" != "yes" ; then pclmulsupport="no (unsupported by compiler)" @@ -1960,6 +2002,10 @@ if test x"$aesnisupport" = xyes ; then AC_DEFINE(ENABLE_AESNI_SUPPORT, 1, [Enable support for Intel AES-NI instructions.]) fi +if test x"$shaextsupport" = xyes ; then + AC_DEFINE(ENABLE_SHAEXT_SUPPORT, 1, + [Enable support for Intel SHAEXT instructions.]) +fi if test x"$pclmulsupport" = xyes ; then AC_DEFINE(ENABLE_PCLMUL_SUPPORT, 1, [Enable support for Intel PCLMUL instructions.]) @@ -2449,6 +2495,13 @@ case "${host}" in ;; esac +case "$mpi_cpu_arch" in + x86) + # Build with the SHAEXT implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha1-intel-shaext.lo" + ;; +esac + LIST_MEMBER(sm3, $enabled_digests) if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS sm3.lo" @@ -2634,6 +2687,7 @@ GCRY_MSG_SHOW([Try using jitter entropy: ],[$jentsupport]) GCRY_MSG_SHOW([Using linux capabilities: ],[$use_capabilities]) GCRY_MSG_SHOW([Try using Padlock crypto: ],[$padlocksupport]) GCRY_MSG_SHOW([Try using AES-NI crypto: ],[$aesnisupport]) +GCRY_MSG_SHOW([Try using Intel SHAEXT: ],[$shaextsupport]) GCRY_MSG_SHOW([Try using Intel PCLMUL: ],[$pclmulsupport]) GCRY_MSG_SHOW([Try using Intel SSE4.1: ],[$sse41support]) GCRY_MSG_SHOW([Try using DRNG (RDRAND): ],[$drngsupport]) diff --git a/src/g10lib.h b/src/g10lib.h index 961b515..d41fa0c 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -224,14 +224,14 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_INTEL_AVX (1 << 12) #define HWF_INTEL_AVX2 (1 << 13) #define HWF_INTEL_FAST_VPGATHER (1 << 14) - -#define HWF_ARM_NEON (1 << 15) -#define HWF_ARM_AES (1 << 16) -#define HWF_ARM_SHA1 (1 << 17) -#define HWF_ARM_SHA2 (1 << 18) -#define HWF_ARM_PMULL (1 << 19) - -#define HWF_INTEL_RDTSC (1 << 20) +#define HWF_INTEL_RDTSC (1 << 15) +#define HWF_INTEL_SHAEXT (1 << 16) + +#define HWF_ARM_NEON (1 << 17) +#define HWF_ARM_AES (1 << 18) +#define HWF_ARM_SHA1 (1 << 19) +#define HWF_ARM_SHA2 (1 << 20) +#define HWF_ARM_PMULL (1 << 21) diff --git a/src/hwf-x86.c b/src/hwf-x86.c index 0d3a1f4..b644eda 100644 --- a/src/hwf-x86.c +++ b/src/hwf-x86.c @@ -357,6 +357,10 @@ detect_x86_gnuc (void) if ((result & HWF_INTEL_AVX2) && !avoid_vpgather) result |= 
HWF_INTEL_FAST_VPGATHER; #endif /*ENABLE_AVX_SUPPORT*/ + + /* Test bit 29 for SHA Extensions. */ + if (features & (1 << 29)) + result |= HWF_INTEL_SHAEXT; } return result; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 1cad546..e081669 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -58,6 +58,7 @@ static struct { HWF_INTEL_AVX2, "intel-avx2" }, { HWF_INTEL_FAST_VPGATHER, "intel-fast-vpgather" }, { HWF_INTEL_RDTSC, "intel-rdtsc" }, + { HWF_INTEL_SHAEXT, "intel-shaext" }, { HWF_ARM_NEON, "arm-neon" }, { HWF_ARM_AES, "arm-aes" }, { HWF_ARM_SHA1, "arm-sha1" }, commit da58a62ac1b7a8d97b0895dcb41d15af531e45e5 Author: Jussi Kivilinna Date: Thu Feb 8 19:45:10 2018 +0200 AVX implementation of BLAKE2s * cipher/Makefile.am: Add 'blake2s-amd64-avx.S'. * cipher/blake2.c (USE_AVX, _gry_blake2s_transform_amd64_avx): New. (BLAKE2S_CONTEXT) [USE_AVX]: Add 'use_avx'. (blake2s_transform): Rename to ... (blake2s_transform_generic): ... this. (blake2s_transform): New. (blake2s_final): Pass 'ctx' pointer to transform function instead of 'S'. (blake2s_init_ctx): Check HW features and enable AVX implementation if supported. * cipher/blake2s-amd64-avx.S: New. * configure.ac: Add 'blake2s-amd64-avx.lo'. -- Benchmark on Intel Core i7-4790K (4.0 Ghz, no turbo): Before: | nanosecs/byte mebibytes/sec cycles/byte BLAKE2S_256 | 1.77 ns/B 538.2 MiB/s 7.09 c/B After (~1.3x faster): | nanosecs/byte mebibytes/sec cycles/byte BLAKE2S_256 | 1.34 ns/B 711.4 MiB/s 5.36 c/B Signed-off-by: Jussi Kivilinna diff --git a/cipher/Makefile.am b/cipher/Makefile.am index b0ee158..625a0ef 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,7 +107,7 @@ rfc2268.c \ camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \ camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S \ blake2.c \ - blake2b-amd64-avx2.S + blake2b-amd64-avx2.S blake2s-amd64-avx.S gost28147.lo: gost-sb.h gost-sb.h: gost-s-box diff --git a/cipher/blake2.c b/cipher/blake2.c index f830c79..0f7494f 100644 --- a/cipher/blake2.c +++ b/cipher/blake2.c @@ -30,6 +30,14 @@ #include "cipher.h" #include "hash-common.h" +/* USE_AVX indicates whether to compile with Intel AVX code. */ +#undef USE_AVX +#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AVX 1 +#endif + /* USE_AVX2 indicates whether to compile with Intel AVX2 code. 
*/ #undef USE_AVX2 #if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX2) && \ @@ -121,6 +129,9 @@ typedef struct BLAKE2S_CONTEXT_S byte buf[BLAKE2S_BLOCKBYTES]; size_t buflen; size_t outlen; +#ifdef USE_AVX + unsigned int use_avx:1; +#endif } BLAKE2S_CONTEXT; typedef unsigned int (*blake2_transform_t)(void *S, const void *inblk, @@ -479,8 +490,9 @@ static inline void blake2s_increment_counter(BLAKE2S_STATE *S, const int inc) S->t[1] += (S->t[0] < (u32)inc) - (inc < 0); } -static unsigned int blake2s_transform(void *vS, const void *inblks, - size_t nblks) +static unsigned int blake2s_transform_generic(BLAKE2S_STATE *S, + const void *inblks, + size_t nblks) { static const byte blake2s_sigma[10][16] = { @@ -495,7 +507,6 @@ static unsigned int blake2s_transform(void *vS, const void *inblks, { 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 }, { 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0 }, }; - BLAKE2S_STATE *S = vS; unsigned int burn = 0; const byte* in = inblks; u32 m[16]; @@ -594,6 +605,33 @@ static unsigned int blake2s_transform(void *vS, const void *inblks, return burn; } +#ifdef USE_AVX +unsigned int _gcry_blake2s_transform_amd64_avx(BLAKE2S_STATE *S, + const void *inblks, + size_t nblks) ASM_FUNC_ABI; +#endif + +static unsigned int blake2s_transform(void *ctx, const void *inblks, + size_t nblks) +{ + BLAKE2S_CONTEXT *c = ctx; + unsigned int nburn; + + if (0) + {} +#ifdef USE_AVX + if (c->use_avx) + nburn = _gcry_blake2s_transform_amd64_avx(&c->state, inblks, nblks); +#endif + else + nburn = blake2s_transform_generic(&c->state, inblks, nblks); + + if (nburn) + nburn += ASM_EXTRA_STACK; + + return nburn; +} + static void blake2s_final(void *ctx) { BLAKE2S_CONTEXT *c = ctx; @@ -609,7 +647,7 @@ static void blake2s_final(void *ctx) memset (c->buf + c->buflen, 0, BLAKE2S_BLOCKBYTES - c->buflen); /* Padding */ blake2s_set_lastblock (S); blake2s_increment_counter (S, (int)c->buflen - BLAKE2S_BLOCKBYTES); - burn = blake2s_transform (S, c->buf, 1); + burn = blake2s_transform (ctx, c->buf, 1); /* Output full hash to buffer */ for (i = 0; i < 8; ++i) @@ -685,11 +723,17 @@ static gcry_err_code_t blake2s_init_ctx(void *ctx, unsigned int flags, unsigned int dbits) { BLAKE2S_CONTEXT *c = ctx; + unsigned int features = _gcry_get_hw_features (); + (void)features; (void)flags; memset (c, 0, sizeof (*c)); +#ifdef USE_AVX + c->use_avx = !!(features & HWF_INTEL_AVX); +#endif + c->outlen = dbits / 8; c->buflen = 0; return blake2s_init(c, key, keylen); diff --git a/cipher/blake2s-amd64-avx.S b/cipher/blake2s-amd64-avx.S new file mode 100644 index 0000000..f7312db --- /dev/null +++ b/cipher/blake2s-amd64-avx.S @@ -0,0 +1,276 @@ +/* blake2s-amd64-avx.S - AVX implementation of BLAKE2s + * + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
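
The AVX code that follows processes four columns (or diagonals) of the BLAKE2s state at a time; the scalar mixing function it implements, with the fixed 16/12/8/7 rotation distances behind the ROR_16/ROR_12/ROR_8/ROR_7 and G1/G2 macros, is the standard BLAKE2s G. A plain-C reference, not taken from the patch:

  #include <stdint.h>

  static inline uint32_t
  ror32 (uint32_t x, int n)
  {
    return (x >> n) | (x << (32 - n));
  }

  /* Hedged scalar reference of BLAKE2s G: one G1+G2 pair in the assembly
   * below corresponds to one call of this function per column/diagonal. */
  static void
  blake2s_g_ref (uint32_t v[16], int a, int b, int c, int d,
                 uint32_t mx, uint32_t my)
  {
    v[a] = v[a] + v[b] + mx;  v[d] = ror32 (v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];       v[b] = ror32 (v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + my;  v[d] = ror32 (v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];       v[b] = ror32 (v[b] ^ v[c], 7);
  }
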
+ */ + +/* The code is based on public-domain/CC0 BLAKE2 reference implementation + * by Samual Neves, at https://github.com/BLAKE2/BLAKE2/tree/master/sse + * Copyright 2012, Samuel Neves + */ + +#ifdef __x86_64 +#include +#if defined(HAVE_GCC_INLINE_ASM_AVX) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) + +#include "asm-common-amd64.h" + +.text + +/* register macros */ +#define RSTATE %rdi +#define RINBLKS %rsi +#define RNBLKS %rdx +#define RIV %rcx + +/* state structure */ +#define STATE_H 0 +#define STATE_T (STATE_H + 8 * 4) +#define STATE_F (STATE_T + 2 * 4) + +/* vector registers */ +#define ROW1 %xmm0 +#define ROW2 %xmm1 +#define ROW3 %xmm2 +#define ROW4 %xmm3 +#define TMP1 %xmm4 +#define TMP1x %xmm4 +#define R16 %xmm5 +#define R8 %xmm6 + +#define MA1 %xmm8 +#define MA2 %xmm9 +#define MA3 %xmm10 +#define MA4 %xmm11 + +#define MB1 %xmm12 +#define MB2 %xmm13 +#define MB3 %xmm14 +#define MB4 %xmm15 + +/********************************************************************** + blake2s/AVX + **********************************************************************/ + +#define GATHER_MSG(m1, m2, m3, m4, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovd (s0)*4(RINBLKS), m1; \ + vmovd (s1)*4(RINBLKS), m2; \ + vmovd (s8)*4(RINBLKS), m3; \ + vmovd (s9)*4(RINBLKS), m4; \ + vpinsrd $1, (s2)*4(RINBLKS), m1, m1; \ + vpinsrd $1, (s3)*4(RINBLKS), m2, m2; \ + vpinsrd $1, (s10)*4(RINBLKS), m3, m3; \ + vpinsrd $1, (s11)*4(RINBLKS), m4, m4; \ + vpinsrd $2, (s4)*4(RINBLKS), m1, m1; \ + vpinsrd $2, (s5)*4(RINBLKS), m2, m2; \ + vpinsrd $2, (s12)*4(RINBLKS), m3, m3; \ + vpinsrd $2, (s13)*4(RINBLKS), m4, m4; \ + vpinsrd $3, (s6)*4(RINBLKS), m1, m1; \ + vpinsrd $3, (s7)*4(RINBLKS), m2, m2; \ + vpinsrd $3, (s14)*4(RINBLKS), m3, m3; \ + vpinsrd $3, (s15)*4(RINBLKS), m4, m4; + +#define LOAD_MSG_0(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) +#define LOAD_MSG_1(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) +#define LOAD_MSG_2(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) +#define LOAD_MSG_3(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) +#define LOAD_MSG_4(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) +#define LOAD_MSG_5(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) +#define LOAD_MSG_6(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11) +#define LOAD_MSG_7(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) +#define LOAD_MSG_8(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) +#define LOAD_MSG_9(m1, m2, m3, m4) \ + GATHER_MSG(m1, m2, m3, m4, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) + +#define LOAD_MSG(r, m1, m2, m3, m4) LOAD_MSG_##r(m1, m2, m3, m4) + +#define ROR_16(in, out) vpshufb R16, in, out; + +#define ROR_8(in, out) vpshufb R8, in, out; + +#define ROR_12(in, out) \ + vpsrld $12, in, TMP1; \ + vpslld $(32 - 12), in, out; \ + vpxor TMP1, out, out; + +#define ROR_7(in, out) \ + vpsrld $7, in, TMP1; \ + vpslld $(32 - 7), in, out; \ + vpxor TMP1, out, out; + +#define G(r1, r2, r3, r4, m, ROR_A, 
ROR_B) \ + vpaddd m, r1, r1; \ + vpaddd r2, r1, r1; \ + vpxor r1, r4, r4; \ + ROR_A(r4, r4); \ + vpaddd r4, r3, r3; \ + vpxor r3, r2, r2; \ + ROR_B(r2, r2); + +#define G1(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_16, ROR_12); + +#define G2(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_8, ROR_7); + +#define MM_SHUFFLE(z,y,x,w) \ + (((z) << 6) | ((y) << 4) | ((x) << 2) | (w)) + +#define DIAGONALIZE(r1, r2, r3, r4) \ + vpshufd $MM_SHUFFLE(0,3,2,1), r2, r2; \ + vpshufd $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpshufd $MM_SHUFFLE(2,1,0,3), r4, r4; + +#define UNDIAGONALIZE(r1, r2, r3, r4) \ + vpshufd $MM_SHUFFLE(2,1,0,3), r2, r2; \ + vpshufd $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpshufd $MM_SHUFFLE(0,3,2,1), r4, r4; + +#define ROUND(r, m1, m2, m3, m4) \ + G1(ROW1, ROW2, ROW3, ROW4, m1); \ + G2(ROW1, ROW2, ROW3, ROW4, m2); \ + DIAGONALIZE(ROW1, ROW2, ROW3, ROW4); \ + G1(ROW1, ROW2, ROW3, ROW4, m3); \ + G2(ROW1, ROW2, ROW3, ROW4, m4); \ + UNDIAGONALIZE(ROW1, ROW2, ROW3, ROW4); + +blake2s_data: +.align 16 +.Liv: + .long 0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A + .long 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19 +.Lshuf_ror16: + .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13 +.Lshuf_ror8: + .byte 1,2,3,0,5,6,7,4,9,10,11,8,13,14,15,12 + +.align 64 +.globl _gcry_blake2s_transform_amd64_avx +ELF(.type _gcry_blake2s_transform_amd64_avx, at function;) + +_gcry_blake2s_transform_amd64_avx: + /* input: + * %rdi: state + * %rsi: blks + * %rdx: num_blks + */ + + vzeroupper; + + addq $64, (STATE_T + 0)(RSTATE); + + vmovdqa .Lshuf_ror16 (RIP), R16; + vmovdqa .Lshuf_ror8 (RIP), R8; + + vmovdqa .Liv+(0 * 4) (RIP), ROW3; + vmovdqa .Liv+(4 * 4) (RIP), ROW4; + + vmovdqu (STATE_H + 0 * 4)(RSTATE), ROW1; + vmovdqu (STATE_H + 4 * 4)(RSTATE), ROW2; + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + LOAD_MSG(0, MA1, MA2, MA3, MA4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + +.Loop: + ROUND(0, MA1, MA2, MA3, MA4); + LOAD_MSG(2, MA1, MA2, MA3, MA4); + ROUND(1, MB1, MB2, MB3, MB4); + LOAD_MSG(3, MB1, MB2, MB3, MB4); + ROUND(2, MA1, MA2, MA3, MA4); + LOAD_MSG(4, MA1, MA2, MA3, MA4); + ROUND(3, MB1, MB2, MB3, MB4); + LOAD_MSG(5, MB1, MB2, MB3, MB4); + ROUND(4, MA1, MA2, MA3, MA4); + LOAD_MSG(6, MA1, MA2, MA3, MA4); + ROUND(5, MB1, MB2, MB3, MB4); + LOAD_MSG(7, MB1, MB2, MB3, MB4); + ROUND(6, MA1, MA2, MA3, MA4); + LOAD_MSG(8, MA1, MA2, MA3, MA4); + ROUND(7, MB1, MB2, MB3, MB4); + LOAD_MSG(9, MB1, MB2, MB3, MB4); + sub $1, RNBLKS; + jz .Loop_end; + + lea 64(RINBLKS), RINBLKS; + addq $64, (STATE_T + 0)(RSTATE); + + ROUND(8, MA1, MA2, MA3, MA4); + LOAD_MSG(0, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + + vmovdqa .Liv+(0 * 4) (RIP), ROW3; + vmovdqa .Liv+(4 * 4) (RIP), ROW4; + + vpxor (STATE_H + 0 * 4)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 4)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 4)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 4)(RSTATE); + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + jmp .Loop; + +.Loop_end: + ROUND(8, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + vpxor (STATE_H + 0 * 4)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 4)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 4)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 4)(RSTATE); + + xor %eax, %eax; + vzeroall; + ret; +ELF(.size _gcry_blake2s_transform_amd64_avx, + .-_gcry_blake2s_transform_amd64_avx;) + +#endif /*defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)*/ +#endif /*__x86_64*/ diff --git 
a/configure.ac b/configure.ac
index 300c520..305b19f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2421,6 +2421,7 @@ if test "$found" = "1" ; then
      x86_64-*-*)
         # Build with the assembly implementation
         GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2b-amd64-avx2.lo"
+        GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2s-amd64-avx.lo"
      ;;
    esac
 fi

commit af7fc732f9a7af7a70276f1e8364d2132db314f1
Author: Jussi Kivilinna
Date:   Sun Jan 14 16:48:17 2018 +0200

    AVX2 implementation of BLAKE2b

    * cipher/Makefile.am: Add 'blake2b-amd64-avx2.S'.
    * cipher/blake2.c (USE_AVX2, ASM_FUNC_ABI, ASM_EXTRA_STACK)
    (_gcry_blake2b_transform_amd64_avx2): New.
    (BLAKE2B_CONTEXT) [USE_AVX2]: Add 'use_avx2'.
    (blake2b_transform): Rename to ...
    (blake2b_transform_generic): ... this.
    (blake2b_transform): New.
    (blake2b_final): Pass 'ctx' pointer to transform function instead of 'S'.
    (blake2b_init_ctx): Check HW features and enable AVX2 implementation
    if supported.
    * cipher/blake2b-amd64-avx2.S: New.
    * configure.ac: Add 'blake2b-amd64-avx2.lo'.
    --

    Benchmark on Intel Core i7-4790K (4.0 GHz, no turbo):

    Before:
                 |  nanosecs/byte   mebibytes/sec   cycles/byte
     BLAKE2B_512 |      1.07 ns/B     887.8 MiB/s      4.30 c/B

    After (~1.4x faster):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte
     BLAKE2B_512 |     0.771 ns/B    1236.8 MiB/s      3.08 c/B

    Signed-off-by: Jussi Kivilinna

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 6e6c5ac..b0ee158 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -106,7 +106,8 @@ twofish.c twofish-amd64.S twofish-arm.S twofish-aarch64.S \
 rfc2268.c \
 camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \
 camellia-aesni-avx2-amd64.S camellia-arm.S camellia-aarch64.S \
-blake2.c
+blake2.c \
+ blake2b-amd64-avx2.S

 gost28147.lo: gost-sb.h
 gost-sb.h: gost-s-box

diff --git a/cipher/blake2.c b/cipher/blake2.c
index 0e4cf9b..f830c79 100644
--- a/cipher/blake2.c
+++ b/cipher/blake2.c
@@ -30,6 +30,26 @@
 #include "cipher.h"
 #include "hash-common.h"

+/* USE_AVX2 indicates whether to compile with Intel AVX2 code. */
+#undef USE_AVX2
+#if defined(__x86_64__) && defined(HAVE_GCC_INLINE_ASM_AVX2) && \
+    (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
+# define USE_AVX2 1
+#endif
+
+/* AMD64 assembly implementations use SystemV ABI, ABI conversion and additional
+ * stack to store XMM6-XMM15 needed on Win64.
*/ +#undef ASM_FUNC_ABI +#undef ASM_EXTRA_STACK +#if defined(USE_AVX2) && defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS) +# define ASM_FUNC_ABI __attribute__((sysv_abi)) +# define ASM_EXTRA_STACK (10 * 16) +#else +# define ASM_FUNC_ABI +# define ASM_EXTRA_STACK 0 +#endif + #define BLAKE2B_BLOCKBYTES 128 #define BLAKE2B_OUTBYTES 64 #define BLAKE2B_KEYBYTES 64 @@ -67,6 +87,9 @@ typedef struct BLAKE2B_CONTEXT_S byte buf[BLAKE2B_BLOCKBYTES]; size_t buflen; size_t outlen; +#ifdef USE_AVX2 + unsigned int use_avx2:1; +#endif } BLAKE2B_CONTEXT; typedef struct @@ -188,8 +211,9 @@ static inline u64 rotr64(u64 x, u64 n) return ((x >> (n & 63)) | (x << ((64 - n) & 63))); } -static unsigned int blake2b_transform(void *vS, const void *inblks, - size_t nblks) +static unsigned int blake2b_transform_generic(BLAKE2B_STATE *S, + const void *inblks, + size_t nblks) { static const byte blake2b_sigma[12][16] = { @@ -206,7 +230,6 @@ static unsigned int blake2b_transform(void *vS, const void *inblks, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }, { 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 } }; - BLAKE2B_STATE *S = vS; const byte* in = inblks; u64 m[16]; u64 v[16]; @@ -306,6 +329,33 @@ static unsigned int blake2b_transform(void *vS, const void *inblks, return sizeof(void *) * 4 + sizeof(u64) * 16 * 2; } +#ifdef USE_AVX2 +unsigned int _gcry_blake2b_transform_amd64_avx2(BLAKE2B_STATE *S, + const void *inblks, + size_t nblks) ASM_FUNC_ABI; +#endif + +static unsigned int blake2b_transform(void *ctx, const void *inblks, + size_t nblks) +{ + BLAKE2B_CONTEXT *c = ctx; + unsigned int nburn; + + if (0) + {} +#ifdef USE_AVX2 + if (c->use_avx2) + nburn = _gcry_blake2b_transform_amd64_avx2(&c->state, inblks, nblks); +#endif + else + nburn = blake2b_transform_generic(&c->state, inblks, nblks); + + if (nburn) + nburn += ASM_EXTRA_STACK; + + return nburn; +} + static void blake2b_final(void *ctx) { BLAKE2B_CONTEXT *c = ctx; @@ -321,7 +371,7 @@ static void blake2b_final(void *ctx) memset (c->buf + c->buflen, 0, BLAKE2B_BLOCKBYTES - c->buflen); /* Padding */ blake2b_set_lastblock (S); blake2b_increment_counter (S, (int)c->buflen - BLAKE2B_BLOCKBYTES); - burn = blake2b_transform (S, c->buf, 1); + burn = blake2b_transform (ctx, c->buf, 1); /* Output full hash to buffer */ for (i = 0; i < 8; ++i) @@ -397,11 +447,17 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags, unsigned int dbits) { BLAKE2B_CONTEXT *c = ctx; + unsigned int features = _gcry_get_hw_features (); + (void)features; (void)flags; memset (c, 0, sizeof (*c)); +#ifdef USE_AVX2 + c->use_avx2 = !!(features & HWF_INTEL_AVX2); +#endif + c->outlen = dbits / 8; c->buflen = 0; return blake2b_init(c, key, keylen); diff --git a/cipher/blake2b-amd64-avx2.S b/cipher/blake2b-amd64-avx2.S new file mode 100644 index 0000000..6bcc565 --- /dev/null +++ b/cipher/blake2b-amd64-avx2.S @@ -0,0 +1,298 @@ +/* blake2b-amd64-avx2.S - AVX2 implementation of BLAKE2b + * + * Copyright (C) 2018 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +/* The code is based on public-domain/CC0 BLAKE2 reference implementation + * by Samual Neves, at https://github.com/BLAKE2/BLAKE2/tree/master/sse + * Copyright 2012, Samuel Neves + */ + +#ifdef __x86_64 +#include +#if defined(HAVE_GCC_INLINE_ASM_AVX2) && \ + (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) + +#include "asm-common-amd64.h" + +.text + +/* register macros */ +#define RSTATE %rdi +#define RINBLKS %rsi +#define RNBLKS %rdx +#define RIV %rcx + +/* state structure */ +#define STATE_H 0 +#define STATE_T (STATE_H + 8 * 8) +#define STATE_F (STATE_T + 2 * 8) + +/* vector registers */ +#define ROW1 %ymm0 +#define ROW2 %ymm1 +#define ROW3 %ymm2 +#define ROW4 %ymm3 +#define TMP1 %ymm4 +#define TMP1x %xmm4 +#define R16 %ymm5 +#define R24 %ymm6 + +#define MA1 %ymm8 +#define MA2 %ymm9 +#define MA3 %ymm10 +#define MA4 %ymm11 +#define MA1x %xmm8 +#define MA2x %xmm9 +#define MA3x %xmm10 +#define MA4x %xmm11 + +#define MB1 %ymm12 +#define MB2 %ymm13 +#define MB3 %ymm14 +#define MB4 %ymm15 +#define MB1x %xmm12 +#define MB2x %xmm13 +#define MB3x %xmm14 +#define MB4x %xmm15 + +/********************************************************************** + blake2b/AVX2 + **********************************************************************/ + +#define GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + s0, s1, s2, s3, s4, s5, s6, s7, s8, \ + s9, s10, s11, s12, s13, s14, s15) \ + vmovq (s0)*8(RINBLKS), m1x; \ + vmovq (s4)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s2)*8(RINBLKS), m1x, m1x; \ + vpinsrq $1, (s6)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m1, m1; \ + vmovq (s1)*8(RINBLKS), m2x; \ + vmovq (s5)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s3)*8(RINBLKS), m2x, m2x; \ + vpinsrq $1, (s7)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m2, m2; \ + vmovq (s8)*8(RINBLKS), m3x; \ + vmovq (s12)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s10)*8(RINBLKS), m3x, m3x; \ + vpinsrq $1, (s14)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m3, m3; \ + vmovq (s9)*8(RINBLKS), m4x; \ + vmovq (s13)*8(RINBLKS), TMP1x; \ + vpinsrq $1, (s11)*8(RINBLKS), m4x, m4x; \ + vpinsrq $1, (s15)*8(RINBLKS), TMP1x, TMP1x; \ + vinserti128 $1, TMP1x, m4, m4; + +#define LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) +#define LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3) +#define LOAD_MSG_2(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4) +#define LOAD_MSG_3(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8) +#define LOAD_MSG_4(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13) +#define LOAD_MSG_5(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9) +#define LOAD_MSG_6(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 
8, 11) +#define LOAD_MSG_7(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10) +#define LOAD_MSG_8(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5) +#define LOAD_MSG_9(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + GATHER_MSG(m1, m2, m3, m4, m1x, m2x, m3x, m4x, \ + 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13 , 0) +#define LOAD_MSG_10(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + LOAD_MSG_0(m1, m2, m3, m4, m1x, m2x, m3x, m4x) +#define LOAD_MSG_11(m1, m2, m3, m4, m1x, m2x, m3x, m4x) \ + LOAD_MSG_1(m1, m2, m3, m4, m1x, m2x, m3x, m4x) + +#define LOAD_MSG(r, m1, m2, m3, m4) \ + LOAD_MSG_##r(m1, m2, m3, m4, m1##x, m2##x, m3##x, m4##x) + +#define ROR_32(in, out) vpshufd $0xb1, in, out; + +#define ROR_24(in, out) vpshufb R24, in, out; + +#define ROR_16(in, out) vpshufb R16, in, out; + +#define ROR_63(in, out) \ + vpsrlq $63, in, TMP1; \ + vpaddq in, in, out; \ + vpxor TMP1, out, out; + +#define G(r1, r2, r3, r4, m, ROR_A, ROR_B) \ + vpaddq m, r1, r1; \ + vpaddq r2, r1, r1; \ + vpxor r1, r4, r4; \ + ROR_A(r4, r4); \ + vpaddq r4, r3, r3; \ + vpxor r3, r2, r2; \ + ROR_B(r2, r2); + +#define G1(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_32, ROR_24); + +#define G2(r1, r2, r3, r4, m) \ + G(r1, r2, r3, r4, m, ROR_16, ROR_63); + +#define MM_SHUFFLE(z,y,x,w) \ + (((z) << 6) | ((y) << 4) | ((x) << 2) | (w)) + +#define DIAGONALIZE(r1, r2, r3, r4) \ + vpermq $MM_SHUFFLE(0,3,2,1), r2, r2; \ + vpermq $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpermq $MM_SHUFFLE(2,1,0,3), r4, r4; + +#define UNDIAGONALIZE(r1, r2, r3, r4) \ + vpermq $MM_SHUFFLE(2,1,0,3), r2, r2; \ + vpermq $MM_SHUFFLE(1,0,3,2), r3, r3; \ + vpermq $MM_SHUFFLE(0,3,2,1), r4, r4; + +#define ROUND(r, m1, m2, m3, m4) \ + G1(ROW1, ROW2, ROW3, ROW4, m1); \ + G2(ROW1, ROW2, ROW3, ROW4, m2); \ + DIAGONALIZE(ROW1, ROW2, ROW3, ROW4); \ + G1(ROW1, ROW2, ROW3, ROW4, m3); \ + G2(ROW1, ROW2, ROW3, ROW4, m4); \ + UNDIAGONALIZE(ROW1, ROW2, ROW3, ROW4); + +blake2b_data: +.align 32 +.Liv: + .quad 0x6a09e667f3bcc908, 0xbb67ae8584caa73b + .quad 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1 + .quad 0x510e527fade682d1, 0x9b05688c2b3e6c1f + .quad 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179 +.Lshuf_ror16: + .byte 2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9 +.Lshuf_ror24: + .byte 3, 4, 5, 6, 7, 0, 1, 2, 11, 12, 13, 14, 15, 8, 9, 10 + +.align 64 +.globl _gcry_blake2b_transform_amd64_avx2 +ELF(.type _gcry_blake2b_transform_amd64_avx2, at function;) + +_gcry_blake2b_transform_amd64_avx2: + /* input: + * %rdi: state + * %rsi: blks + * %rdx: num_blks + */ + + vzeroupper; + + addq $128, (STATE_T + 0)(RSTATE); + adcq $0, (STATE_T + 8)(RSTATE); + + vbroadcasti128 .Lshuf_ror16 (RIP), R16; + vbroadcasti128 .Lshuf_ror24 (RIP), R24; + + vmovdqa .Liv+(0 * 8) (RIP), ROW3; + vmovdqa .Liv+(4 * 8) (RIP), ROW4; + + vmovdqu (STATE_H + 0 * 8)(RSTATE), ROW1; + vmovdqu (STATE_H + 4 * 8)(RSTATE), ROW2; + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + LOAD_MSG(0, MA1, MA2, MA3, MA4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + +.Loop: + ROUND(0, MA1, MA2, MA3, MA4); + LOAD_MSG(2, MA1, MA2, MA3, MA4); + ROUND(1, MB1, MB2, MB3, MB4); + LOAD_MSG(3, MB1, MB2, MB3, MB4); + ROUND(2, MA1, MA2, MA3, MA4); + LOAD_MSG(4, MA1, MA2, MA3, MA4); + ROUND(3, MB1, MB2, MB3, MB4); + LOAD_MSG(5, MB1, MB2, MB3, MB4); + ROUND(4, MA1, MA2, MA3, MA4); + LOAD_MSG(6, MA1, MA2, MA3, MA4); + ROUND(5, MB1, MB2, MB3, MB4); + LOAD_MSG(7, MB1, MB2, MB3, MB4); + 
ROUND(6, MA1, MA2, MA3, MA4); + LOAD_MSG(8, MA1, MA2, MA3, MA4); + ROUND(7, MB1, MB2, MB3, MB4); + LOAD_MSG(9, MB1, MB2, MB3, MB4); + ROUND(8, MA1, MA2, MA3, MA4); + LOAD_MSG(10, MA1, MA2, MA3, MA4); + ROUND(9, MB1, MB2, MB3, MB4); + LOAD_MSG(11, MB1, MB2, MB3, MB4); + sub $1, RNBLKS; + jz .Loop_end; + + lea 128(RINBLKS), RINBLKS; + addq $128, (STATE_T + 0)(RSTATE); + adcq $0, (STATE_T + 8)(RSTATE); + + ROUND(10, MA1, MA2, MA3, MA4); + LOAD_MSG(0, MA1, MA2, MA3, MA4); + ROUND(11, MB1, MB2, MB3, MB4); + LOAD_MSG(1, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + + vmovdqa .Liv+(0 * 8) (RIP), ROW3; + vmovdqa .Liv+(4 * 8) (RIP), ROW4; + + vpxor (STATE_H + 0 * 8)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 8)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); + + vpxor (STATE_T)(RSTATE), ROW4, ROW4; + + jmp .Loop; + +.Loop_end: + ROUND(10, MA1, MA2, MA3, MA4); + ROUND(11, MB1, MB2, MB3, MB4); + + vpxor ROW3, ROW1, ROW1; + vpxor ROW4, ROW2, ROW2; + vpxor (STATE_H + 0 * 8)(RSTATE), ROW1, ROW1; + vpxor (STATE_H + 4 * 8)(RSTATE), ROW2, ROW2; + + vmovdqu ROW1, (STATE_H + 0 * 8)(RSTATE); + vmovdqu ROW2, (STATE_H + 4 * 8)(RSTATE); + + xor %eax, %eax; + vzeroall; + ret; +ELF(.size _gcry_blake2b_transform_amd64_avx2, + .-_gcry_blake2b_transform_amd64_avx2;) + +#endif /*defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS)*/ +#endif /*__x86_64*/ diff --git a/configure.ac b/configure.ac index aaf3c82..300c520 100644 --- a/configure.ac +++ b/configure.ac @@ -2416,6 +2416,13 @@ LIST_MEMBER(blake2, $enabled_digests) if test "$found" = "1" ; then GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2.lo" AC_DEFINE(USE_BLAKE2, 1, [Defined if this module should be included]) + + case "${host}" in + x86_64-*-*) + # Build with the assembly implementation + GCRYPT_DIGESTS="$GCRYPT_DIGESTS blake2b-amd64-avx2.lo" + ;; + esac fi # SHA-1 needs to be included always for example because it is used by ----------------------------------------------------------------------- Summary of changes: cipher/Makefile.am | 6 +- cipher/blake2.c | 116 +++++++++++++- cipher/blake2b-amd64-avx2.S | 298 ++++++++++++++++++++++++++++++++++++ cipher/blake2s-amd64-avx.S | 276 +++++++++++++++++++++++++++++++++ cipher/sha1-intel-shaext.c | 281 ++++++++++++++++++++++++++++++++++ cipher/sha1.c | 64 ++++++-- cipher/sha1.h | 1 + cipher/sha256-intel-shaext.c | 352 +++++++++++++++++++++++++++++++++++++++++++ cipher/sha256.c | 66 +++++++- configure.ac | 69 +++++++++ src/g10lib.h | 16 +- src/hwf-x86.c | 4 + src/hwfeatures.c | 1 + 13 files changed, 1514 insertions(+), 36 deletions(-) create mode 100644 cipher/blake2b-amd64-avx2.S create mode 100644 cipher/blake2s-amd64-avx.S create mode 100644 cipher/sha1-intel-shaext.c create mode 100644 cipher/sha256-intel-shaext.c hooks/post-receive -- The GNU crypto library http://git.gnupg.org _______________________________________________ Gnupg-commits mailing list Gnupg-commits at gnupg.org http://lists.gnupg.org/mailman/listinfo/gnupg-commits From stefbon at gmail.com Mon Feb 19 17:21:16 2018 From: stefbon at gmail.com (Stef Bon) Date: Mon, 19 Feb 2018 17:21:16 +0100 Subject: ECDH in gcrypt In-Reply-To: References: <1518111324.2017.25.camel@schoenitzer.de> Message-ID: Michael, was my post any help? Stef
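A scalar reference for the BLAKE2s/AVX and BLAKE2b/AVX2 patches above, for readers comparing the assembly against the specification: the ROR_32/ROR_24/ROR_16/ROR_63 shuffles (and ROR_16/ROR_12/ROR_8/ROR_7 in the BLAKE2s file) are the fixed rotation amounts of the BLAKE2 G function, and DIAGONALIZE/UNDIAGONALIZE re-align the row vectors so the diagonal half of a round can reuse the column-wise G code. The sketch below is not taken from the patches; it is the standard per-round computation from RFC 7693, written for BLAKE2b (BLAKE2s has the same shape on 32-bit words).

#include <stdint.h>

static inline uint64_t rotr64 (uint64_t x, unsigned int n)
{
  return (x >> n) | (x << (64 - n));
}

/* BLAKE2b G: mixes two message words into four state words.  The
 * rotation amounts 32/24/16/63 are what ROR_32/ROR_24/ROR_16/ROR_63
 * implement with shuffles and shifts; BLAKE2s uses 16/12/8/7. */
static void g64 (uint64_t v[16], int a, int b, int c, int d,
                 uint64_t mx, uint64_t my)
{
  v[a] += v[b] + mx;  v[d] = rotr64 (v[d] ^ v[a], 32);
  v[c] += v[d];       v[b] = rotr64 (v[b] ^ v[c], 24);
  v[a] += v[b] + my;  v[d] = rotr64 (v[d] ^ v[a], 16);
  v[c] += v[d];       v[b] = rotr64 (v[b] ^ v[c], 63);
}

/* One round on the 4x4 state: four column G's (the vectorized G1/G2 on
 * ROW1..ROW4) followed by four diagonal G's (the part bracketed by
 * DIAGONALIZE/UNDIAGONALIZE).  m[] holds the 16 message words already
 * permuted by this round's sigma row, which is what GATHER_MSG sets up. */
static void round64 (uint64_t v[16], const uint64_t m[16])
{
  g64 (v, 0, 4,  8, 12, m[0],  m[1]);
  g64 (v, 1, 5,  9, 13, m[2],  m[3]);
  g64 (v, 2, 6, 10, 14, m[4],  m[5]);
  g64 (v, 3, 7, 11, 15, m[6],  m[7]);
  g64 (v, 0, 5, 10, 15, m[8],  m[9]);
  g64 (v, 1, 6, 11, 12, m[10], m[11]);
  g64 (v, 2, 7,  8, 13, m[12], m[13]);
  g64 (v, 3, 4,  9, 14, m[14], m[15]);
}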
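The GATHER_MSG macros build each round's four message vectors straight from the input block with vmovd/vpinsrd (BLAKE2s) or vmovq/vpinsrq/vinserti128 (BLAKE2b), already permuted by the round's sigma row, instead of loading m[0..15] first and shuffling afterwards. Below is a scalar picture of the lane layout they produce; the helper name and array arguments are hypothetical (nothing in the patch is called gather_msg64), the sigma table is the one the LOAD_MSG_* macros encode, and little-endian loads are assumed, as on x86-64.

#include <stdint.h>
#include <string.h>

/* Message schedule used by the LOAD_MSG_* macros; BLAKE2b rounds 10 and
 * 11 reuse rows 0 and 1. */
static const uint8_t sigma[10][16] =
{
  {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
  { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
  { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
  {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
  {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
  {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
  { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
  { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
  {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
  { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
};

/* Hypothetical scalar equivalent of GATHER_MSG for BLAKE2b: m1/m2 feed
 * the column half of the round (words s0,s2,s4,s6 and s1,s3,s5,s7),
 * m3/m4 the diagonal half (s8..s15).  inblk is the 128-byte block. */
static void gather_msg64 (uint64_t m1[4], uint64_t m2[4],
                          uint64_t m3[4], uint64_t m4[4],
                          const uint8_t *inblk, int r)
{
  const uint8_t *s = sigma[r % 10];
  int i;

  for (i = 0; i < 4; i++)
    {
      memcpy (&m1[i], inblk + 8 * s[2 * i],         8);
      memcpy (&m2[i], inblk + 8 * s[2 * i + 1],     8);
      memcpy (&m3[i], inblk + 8 * s[8 + 2 * i],     8);
      memcpy (&m4[i], inblk + 8 * s[8 + 2 * i + 1], 8);
    }
}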
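On the C side, the blake2.c change in the AVX2 commit is a thin runtime dispatcher: blake2b_init_ctx reads the hardware-feature mask once and latches it into the use_avx2 bitfield, and blake2b_transform then selects the assembly or the generic compression per call, adding ASM_EXTRA_STACK to the burn-stack estimate to cover the Win64 ABI bridge's XMM6-XMM15 spill area. A condensed sketch of that shape follows; the identifiers are the ones visible in the diff, the patch's "if (0) {}" dispatch chain is folded into a plain if/else here, and the surrounding declarations (BLAKE2B_CONTEXT and friends) are assumed from the patch rather than repeated.

static unsigned int
blake2b_transform (void *ctx, const void *inblks, size_t nblks)
{
  BLAKE2B_CONTEXT *c = ctx;
  unsigned int nburn;

#ifdef USE_AVX2
  if (c->use_avx2)
    nburn = _gcry_blake2b_transform_amd64_avx2 (&c->state, inblks, nblks);
  else
#endif
    nburn = blake2b_transform_generic (&c->state, inblks, nblks);

  if (nburn)
    nburn += ASM_EXTRA_STACK;  /* extra stack used by the Win64 ABI wrapper */

  return nburn;
}

/* In blake2b_init_ctx the feature test happens once:
 *
 *   unsigned int features = _gcry_get_hw_features ();
 * #ifdef USE_AVX2
 *   c->use_avx2 = !!(features & HWF_INTEL_AVX2);
 * #endif
 */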
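Two smaller details of the assembly that are easy to gloss over: the addq $128 / adcq $0 pair at the top of _gcry_blake2b_transform_amd64_avx2 is the 128-bit message-byte counter update, while the single addq $64 in the BLAKE2s code appears to update both 32-bit counter words at once by treating them as one little-endian 64-bit value; and the vpxor block at .Loop_end is the usual BLAKE2 feedforward of the working vector into the chaining value. A scalar equivalent, with hypothetical helper names (no functions with these names exist in the patches):

#include <stdint.h>

/* 128-bit byte-counter bump for BLAKE2b: what addq/adcq implement once
 * per 128-byte block. */
static void bump_counter_blake2b (uint64_t t[2], uint64_t inc)
{
  t[0] += inc;
  t[1] += (t[0] < inc);   /* carry into the high word */
}

/* Feedforward done at .Loop_end: h[0..7] ^= v[0..7] ^ v[8..15].
 * ROW1/ROW2 hold v[0..7] and ROW3/ROW4 hold v[8..15] at that point. */
static void feedforward_blake2b (uint64_t h[8], const uint64_t v[16])
{
  int i;

  for (i = 0; i < 8; i++)
    h[i] ^= v[i] ^ v[i + 8];
}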