From jussi.kivilinna@iki.fi Mon May 9 19:33:30 2022
From: jussi.kivilinna@iki.fi (Jussi Kivilinna)
Date: Mon, 9 May 2022 20:33:30 +0300
Subject: [PATCH 1/2] camellia: add amd64 GFNI/AVX512 implementation
Message-ID: <20220509173330.1965253-1-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'camellia-gfni-avx512-amd64.S'.
* cipher/bulkhelp.h (bulk_ocb_prepare_L_pointers_array_blk64): New.
* cipher/camellia-aesni-avx2-amd64.h: Rename internal functions from
"__camellia_???" to "FUNC_NAME(???)"; Minor changes to comments.
* cipher/camellia-gfni-avx512-amd64.S: New.
* cipher/camellia-glue.c (USE_GFNI_AVX512): New.
(CAMELLIA_context): Add 'use_gfni_avx512'.
(_gcry_camellia_gfni_avx512_ctr_enc, _gcry_camellia_gfni_avx512_cbc_dec)
(_gcry_camellia_gfni_avx512_cfb_dec, _gcry_camellia_gfni_avx512_ocb_enc)
(_gcry_camellia_gfni_avx512_ocb_dec)
(_gcry_camellia_gfni_avx512_enc_blk64)
(_gcry_camellia_gfni_avx512_dec_blk64, avx512_burn_stack_depth): New.
(camellia_setkey): Use GFNI/AVX512 if supported by CPU.
(camellia_encrypt_blk1_64, camellia_decrypt_blk1_64): New.
(_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec, _gcry_camellia_cfb_dec)
(_gcry_camellia_ocb_crypt) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path.
(_gcry_camellia_xts_crypt): Change parallel block size from 32 to 64.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Increase test
block size.
* cipher/chacha20-amd64-avx512.S: Clear k-mask registers with xor.
* cipher/poly1305-amd64-avx512.S: Likewise.
* cipher/sha512-avx512-amd64.S: Likewise.
---

Benchmark on Intel i3-1115G4 (tigerlake):

Before (GFNI/AVX2):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.356 ns/B      2679 MiB/s      1.46 c/B      4089
        CFB dec |     0.374 ns/B      2547 MiB/s      1.53 c/B      4089
        CTR enc |     0.409 ns/B      2332 MiB/s      1.67 c/B      4089
        CTR dec |     0.406 ns/B      2347 MiB/s      1.66 c/B      4089
        XTS enc |     0.430 ns/B      2216 MiB/s      1.76 c/B      4090
        XTS dec |     0.433 ns/B      2201 MiB/s      1.77 c/B      4090
        OCB enc |     0.460 ns/B      2071 MiB/s      1.88 c/B      4089
        OCB dec |     0.492 ns/B      1939 MiB/s      2.01 c/B      4089

After (GFNI/AVX512):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.207 ns/B      4600 MiB/s     0.827 c/B      3989
        CFB dec |     0.207 ns/B      4610 MiB/s     0.825 c/B      3989
        CTR enc |     0.218 ns/B      4382 MiB/s     0.868 c/B      3990
        CTR dec |     0.217 ns/B      4389 MiB/s     0.867 c/B      3990
        XTS enc |     0.330 ns/B      2886 MiB/s      1.35 c/B    4097±4
        XTS dec |     0.328 ns/B      2904 MiB/s      1.35 c/B    4097±3
        OCB enc |     0.246 ns/B      3879 MiB/s     0.981 c/B      3990
        OCB dec |     0.247 ns/B      3855 MiB/s     0.987 c/B      3990

CBC dec: 70% faster
CFB dec: 80% faster
CTR: 87% faster
XTS: 31% faster
OCB: 92% faster

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
---
 cipher/Makefile.am                  |    3 +-
 cipher/bulkhelp.h                   |   29 +
 cipher/camellia-aesni-avx2-amd64.h  |   50 +-
 cipher/camellia-gfni-avx512-amd64.S | 1566 +++++++++++++++++++++++++++
 cipher/camellia-glue.c              |  257 ++++-
 cipher/chacha20-amd64-avx512.S      |    2 +-
 cipher/poly1305-amd64-avx512.S      |    4 +-
 cipher/sha512-avx512-amd64.S        |    2 +-
 configure.ac                        |    3 +
 9 files changed, 1873 insertions(+), 43 deletions(-)
 create mode 100644 cipher/camellia-gfni-avx512-amd64.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 55f96014..a6171bf5 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -139,7 +139,8 @@ EXTRA_libcipher_la_SOURCES = \
 	twofish-avx2-amd64.S \
 	rfc2268.c \
 	camellia.c camellia.h camellia-glue.c camellia-aesni-avx-amd64.S \
-	camellia-aesni-avx2-amd64.h camellia-gfni-avx2-amd64.S \
+	camellia-aesni-avx2-amd64.h \
+	camellia-gfni-avx2-amd64.S camellia-gfni-avx512-amd64.S \
 	camellia-vaes-avx2-amd64.S camellia-aesni-avx2-amd64.S \
 	camellia-arm.S camellia-aarch64.S \
 	blake2.c \
diff --git a/cipher/bulkhelp.h b/cipher/bulkhelp.h
index b1b4b2e1..8c322ede 100644
--- a/cipher/bulkhelp.h
+++ b/cipher/bulkhelp.h
@@ -37,6 +37,35 @@ typedef unsigned int (*bulk_crypt_fn_t) (const void *ctx, byte *out,
                                          unsigned int num_blks);
 
 
+static inline ocb_L_uintptr_t *
+bulk_ocb_prepare_L_pointers_array_blk64 (gcry_cipher_hd_t c,
+                                         ocb_L_uintptr_t Ls[64], u64 blkn)
+{
+  unsigned int n = 64 - (blkn % 64);
+  unsigned int i;
+
+  for (i = 0; i < 64; i += 8)
+    {
+      Ls[(i + 0 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+      Ls[(i + 1 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+      Ls[(i + 2 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+      Ls[(i + 3 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+      Ls[(i + 4 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+      Ls[(i + 5 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+      Ls[(i + 6 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+    }
+
+  Ls[(7 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+  Ls[(15 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+  Ls[(23 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+  Ls[(31 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[5];
+  Ls[(39 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+  Ls[(47 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+  Ls[(55 + n) % 64] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+  return &Ls[(63 + n) % 64];
+}
+
+
 static inline ocb_L_uintptr_t *
 bulk_ocb_prepare_L_pointers_array_blk32 (gcry_cipher_hd_t c,
                                          ocb_L_uintptr_t Ls[32], u64 blkn)
diff --git a/cipher/camellia-aesni-avx2-amd64.h b/cipher/camellia-aesni-avx2-amd64.h
index 9cc5621e..411e790f 100644
--- a/cipher/camellia-aesni-avx2-amd64.h
+++ b/cipher/camellia-aesni-avx2-amd64.h
@@ -793,14 +793,13 @@ FUNC_NAME(_constants):
 ELF(.type   FUNC_NAME(_constants),@object;)
 
-.Lshufb_16x16b:
-	.byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3)
-	.byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3)
-
 .Lpack_bswap:
 	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
 	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
 
+.Lshufb_16x16b:
+	.byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3)
+
 /* For CTR-mode IV byteswap */
 .Lbswap128_mask:
 	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
@@ -999,9 +998,9 @@ ELF(.type   FUNC_NAME(_constants),@object;)
 ELF(.size FUNC_NAME(_constants),.-FUNC_NAME(_constants);)
 
 .align 8
-ELF(.type   __camellia_enc_blk32,@function;)
+ELF(.type   FUNC_NAME(enc_blk32),@function;)
 
-__camellia_enc_blk32:
+FUNC_NAME(enc_blk32):
 	/* input:
 	 *	%rdi: ctx, CTX
 	 *	%rax: temporary storage, 512 bytes
@@ -1058,19 +1057,19 @@ __camellia_enc_blk32:
 	ret_spec_stop;
 	CFI_ENDPROC();
-ELF(.size __camellia_enc_blk32,.-__camellia_enc_blk32;)
+ELF(.size FUNC_NAME(enc_blk32),.-FUNC_NAME(enc_blk32);)
 
 .align 8
-ELF(.type   __camellia_dec_blk32,@function;)
+ELF(.type   FUNC_NAME(dec_blk32),@function;)
 
-__camellia_dec_blk32:
+FUNC_NAME(dec_blk32):
 	/* input:
 	 *	%rdi: ctx, CTX
 	 *	%rax: temporary storage, 512 bytes
 	 *	%r8d: 24 for 16 byte key, 32 for larger
-	 *	%ymm0..%ymm15: 16 encrypted blocks
+	 *	%ymm0..%ymm15: 32 encrypted blocks
 	 * output:
-	 *	%ymm0..%ymm15: 16 plaintext blocks, order swapped:
+	 *	%ymm0..%ymm15: 32 plaintext blocks, order swapped:
 	 *       7, 8, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8
 	 */
 	CFI_STARTPROC();
@@ -1123,7 +1122,7 @@ __camellia_dec_blk32:
 	ret_spec_stop;
 	CFI_ENDPROC();
-ELF(.size __camellia_dec_blk32,.-__camellia_dec_blk32;)
+ELF(.size FUNC_NAME(dec_blk32),.-FUNC_NAME(dec_blk32);)
 
 #define inc_le128(x, minus_one, tmp) \
 	vpcmpeqq minus_one, x, tmp; \
@@ -1275,7 +1274,7 @@ FUNC_NAME(ctr_enc):
 
 .align 4
 .Lload_ctr_done:
-	/* inpack16_pre: */
+	/* inpack32_pre: */
 	vpbroadcastq (key_table)(CTX), %ymm15;
 	vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15;
 	vpxor %ymm0, %ymm15, %ymm0;
@@ -1295,7 +1294,7 @@ FUNC_NAME(ctr_enc):
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;
 
-	call __camellia_enc_blk32;
+	call FUNC_NAME(enc_blk32);
 
 	vpxor 0 * 32(%rdx), %ymm7, %ymm7;
 	vpxor 1 * 32(%rdx), %ymm6, %ymm6;
@@ -1313,7 +1312,6 @@ FUNC_NAME(ctr_enc):
 	vpxor 13 * 32(%rdx), %ymm10, %ymm10;
 	vpxor 14 * 32(%rdx), %ymm9, %ymm9;
 	vpxor 15 * 32(%rdx), %ymm8, %ymm8;
-	leaq 32 * 16(%rdx), %rdx;
 
 	write_output(%ymm7, %ymm6, %ymm5, %ymm4, %ymm3, %ymm2, %ymm1, %ymm0,
 		     %ymm15, %ymm14, %ymm13, %ymm12, %ymm11, %ymm10, %ymm9,
@@ -1360,7 +1358,7 @@ FUNC_NAME(cbc_dec):
 		      %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 		      %ymm15, %rdx, (key_table)(CTX, %r8, 8));
 
-	call __camellia_dec_blk32;
+	call FUNC_NAME(dec_blk32);
 
 	/* XOR output with IV */
 	vmovdqu %ymm8, (%rax);
@@ -1429,7 +1427,7 @@ FUNC_NAME(cfb_dec):
 	andq $~63, %rsp;
 	movq %rsp, %rax;
 
-	/* inpack16_pre: */
+	/* inpack32_pre: */
 	vpbroadcastq (key_table)(CTX), %ymm0;
 	vpshufb .Lpack_bswap rRIP, %ymm0, %ymm0;
 	vmovdqu (%rcx), %xmm15;
@@ -1453,7 +1451,7 @@ FUNC_NAME(cfb_dec):
 	vpxor (13 * 32 + 16)(%rdx), %ymm0, %ymm1;
 	vpxor (14 * 32 + 16)(%rdx), %ymm0, %ymm0;
 
-	call __camellia_enc_blk32;
+	call FUNC_NAME(enc_blk32);
 
 	vpxor 0 * 32(%rdx), %ymm7, %ymm7;
 	vpxor 1 * 32(%rdx), %ymm6, %ymm6;
@@ -1596,7 +1594,7 @@ FUNC_NAME(ocb_enc):
 	movl $24, %r10d;
 	cmovel %r10d, %r8d; /* max */
 
-	/* inpack16_pre: */
+	/* inpack32_pre: */
 	vpbroadcastq (key_table)(CTX), %ymm15;
 	vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15;
 	vpxor %ymm0, %ymm15, %ymm0;
@@ -1616,7 +1614,7 @@ FUNC_NAME(ocb_enc):
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;
 
-	call __camellia_enc_blk32;
+	call FUNC_NAME(enc_blk32);
 
 	vpxor 0 * 32(%rsi), %ymm7, %ymm7;
 	vpxor 1 * 32(%rsi), %ymm6, %ymm6;
@@ -1763,7 +1761,7 @@ FUNC_NAME(ocb_dec):
 	movl $24, %r9d;
 	cmovel %r9d, %r8d; /* max */
 
-	/* inpack16_pre: */
+	/* inpack32_pre: */
 	vpbroadcastq (key_table)(CTX, %r8, 8), %ymm15;
 	vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15;
 	vpxor %ymm0, %ymm15, %ymm0;
@@ -1783,7 +1781,7 @@ FUNC_NAME(ocb_dec):
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;
 
-	call __camellia_dec_blk32;
+	call FUNC_NAME(dec_blk32);
 
 	vpxor 0 * 32(%rsi), %ymm7, %ymm7;
 	vpxor 1 * 32(%rsi), %ymm6, %ymm6;
@@ -1957,7 +1955,7 @@ FUNC_NAME(ocb_auth):
 
 	movq %rcx, %r10;
 
-	/* inpack16_pre: */
+	/* inpack32_pre: */
 	vpbroadcastq (key_table)(CTX), %ymm15;
 	vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15;
 	vpxor %ymm0, %ymm15, %ymm0;
@@ -1977,7 +1975,7 @@ FUNC_NAME(ocb_auth):
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;
 
-	call __camellia_enc_blk32;
+	call FUNC_NAME(enc_blk32);
 
 	vpxor %ymm7, %ymm6, %ymm6;
 	vpxor %ymm5, %ymm4, %ymm4;
@@ -2091,7 +2089,7 @@ FUNC_NAME(enc_blk1_32):
 	vpxor (%rax), %ymm0, %ymm0;
 
 2:
-	call __camellia_enc_blk32;
+	call FUNC_NAME(enc_blk32);
 
 #define STORE_OUTPUT(ymm, offset) \
 	cmpl $(1 + 2 * (offset)), %r9d; \
@@ -2189,7 +2187,7 @@ FUNC_NAME(dec_blk1_32):
 	vpxor (%rax), %ymm0, %ymm0;
 
 2:
-	call __camellia_dec_blk32;
+	call FUNC_NAME(dec_blk32);
 
 	STORE_OUTPUT(ymm7, 0);
 	STORE_OUTPUT(ymm6, 1);
diff --git a/cipher/camellia-gfni-avx512-amd64.S b/cipher/camellia-gfni-avx512-amd64.S
new file mode 100644
index 00000000..70e10460
--- /dev/null
+++ b/cipher/camellia-gfni-avx512-amd64.S
@@ -0,0 +1,1566 @@
+/* camellia-gfni-avx512-amd64.S - GFNI/AVX512 implementation of Camellia
+ *
+ * Copyright (C) 2022 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#ifdef __x86_64
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+    defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX512_SUPPORT)
+
+#include "asm-common-amd64.h"
+
+#define CAMELLIA_TABLE_BYTE_LEN 272
+
+/* struct CAMELLIA_context: */
+#define key_table 0
+#define key_bitlength CAMELLIA_TABLE_BYTE_LEN
+
+/* register macros */
+#define CTX %rdi
+#define RIO %r8
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+#define zmm0_x xmm0
+#define zmm1_x xmm1
+#define zmm2_x xmm2
+#define zmm3_x xmm3
+#define zmm4_x xmm4
+#define zmm5_x xmm5
+#define zmm6_x xmm6
+#define zmm7_x xmm7
+#define zmm8_x xmm8
+#define zmm9_x xmm9
+#define zmm10_x xmm10
+#define zmm11_x xmm11
+#define zmm12_x xmm12
+#define zmm13_x xmm13
+#define zmm14_x xmm14
+#define zmm15_x xmm15
+
+#define zmm0_y ymm0
+#define zmm1_y ymm1
+#define zmm2_y ymm2
+#define zmm3_y ymm3
+#define zmm4_y ymm4
+#define zmm5_y ymm5
+#define zmm6_y ymm6
+#define zmm7_y ymm7
+#define zmm8_y ymm8
+#define zmm9_y ymm9
+#define zmm10_y ymm10
+#define zmm11_y ymm11
+#define zmm12_y ymm12
+#define zmm13_y ymm13
+#define zmm14_y ymm14
+#define zmm15_y ymm15
+
+#define mem_ab_0 %zmm16
+#define mem_ab_1 %zmm17
+#define mem_ab_2 %zmm31
+#define mem_ab_3 %zmm18
+#define mem_ab_4 %zmm19
+#define mem_ab_5 %zmm20
+#define mem_ab_6 %zmm21
+#define mem_ab_7 %zmm22
+#define mem_cd_0 %zmm23
+#define mem_cd_1 %zmm24
+#define mem_cd_2 %zmm30
+#define mem_cd_3 %zmm25
+#define mem_cd_4 %zmm26
+#define mem_cd_5 %zmm27
+#define mem_cd_6 %zmm28
+#define mem_cd_7 %zmm29
+
+#define clear_vec4(v0,v1,v2,v3) \
+	vpxord v0, v0, v0; \
+	vpxord v1, v1, v1; \
+	vpxord v2, v2, v2; \
+	vpxord v3, v3, v3
+
+#define clear_zmm16_zmm31() \
+	clear_vec4(%xmm16, %xmm20, %xmm24, %xmm28); \
+	clear_vec4(%xmm17, %xmm21, %xmm25, %xmm29); \
+	clear_vec4(%xmm18, %xmm22, %xmm26, %xmm30); \
+	clear_vec4(%xmm19, %xmm23, %xmm27, %xmm31)
+
+#define clear_regs() \
+	kxorq %k1, %k1, %k1; \
+	vzeroall; \
+	clear_zmm16_zmm31()
+
+/**********************************************************************
+  GFNI helper macros and constants
+ **********************************************************************/
+
+#define BV8(a0,a1,a2,a3,a4,a5,a6,a7) \
+	( (((a0) & 1) << 0) | \
+	  (((a1) & 1) << 1) | \
+	  (((a2) & 1) << 2) | \
+	  (((a3) & 1) << 3) | \
+	  (((a4) & 1) << 4) | \
+	  (((a5) & 1) << 5) | \
+	  (((a6) & 1) << 6) | \
+	  (((a7) & 1) << 7) )
+
+#define BM8X8(l0,l1,l2,l3,l4,l5,l6,l7) \
+	( ((l7) << (0 * 8)) | \
+	  ((l6) << (1 * 8)) | \
+	  ((l5) << (2 * 8)) | \
+	  ((l4) << (3 * 8)) | \
+	  ((l3) << (4 * 8)) | \
+	  ((l2) << (5 * 8)) | \
+	  ((l1) << (6 * 8)) | \
+	  ((l0) << (7 * 8)) )
+
+/* Pre-filters and post-filters constants for Camellia sboxes s1, s2, s3
+ * and s4.
+ *   See http://urn.fi/URN:NBN:fi:oulu-201305311409, pages 43-48.
+ *
+ * Pre-filters are directly from above source, "θ₁"/"θ₄". Post-filters are
+ * combination of function "A" (AES SubBytes affine transformation) and
+ * "ψ₁"/"ψ₂"/"ψ₃".
+ */
+
+/* Constant from "θ₁(x)" and "θ₄(x)" functions. */
+#define pre_filter_constant_s1234 BV8(1, 0, 1, 0, 0, 0, 1, 0)
+
+/* Constant from "ψ₁(A(x))" function: */
+#define post_filter_constant_s14 BV8(0, 1, 1, 1, 0, 1, 1, 0)
+
+/* Constant from "ψ₂(A(x))" function: */
+#define post_filter_constant_s2 BV8(0, 0, 1, 1, 1, 0, 1, 1)
+
+/* Constant from "ψ₃(A(x))" function: */
+#define post_filter_constant_s3 BV8(1, 1, 1, 0, 1, 1, 0, 0)
+
+/**********************************************************************
+  64-way parallel camellia
+ **********************************************************************/
+
+/* roundsm64 (GFNI/AVX512 version)
+ * IN:
+ *   x0..x7: byte-sliced AB state
+ *   mem_cd: register pointer storing CD state
+ *   key: index for key material
+ * OUT:
+ *   x0..x7: new byte-sliced CD state
+ */
+#define roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, t0, t1, t2, t3, t4, t5, \
+		  t6, t7, mem_cd, key) \
+	/* \
+	 * S-function with AES subbytes \
+	 */ \
+	vpbroadcastq .Lpre_filter_bitmatrix_s123 rRIP, t5; \
+	vpbroadcastq .Lpre_filter_bitmatrix_s4 rRIP, t2; \
+	vpbroadcastq .Lpost_filter_bitmatrix_s14 rRIP, t4; \
+	vpbroadcastq .Lpost_filter_bitmatrix_s2 rRIP, t3; \
+	vpbroadcastq .Lpost_filter_bitmatrix_s3 rRIP, t6; \
+	vpxor t7##_x, t7##_x, t7##_x; \
+	vpbroadcastq key, t0; /* higher 64-bit duplicate ignored */ \
+	\
+	/* prefilter sboxes */ \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x0, x0; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x7, x7; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t2, x3, x3; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t2, x6, x6; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x2, x2; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x5, x5; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x1, x1; \
+	vgf2p8affineqb $(pre_filter_constant_s1234), t5, x4, x4; \
+	\
+	/* sbox GF8 inverse + postfilter sboxes 1 and 4 */ \
+	vgf2p8affineinvqb $(post_filter_constant_s14), t4, x0, x0; \
+	vgf2p8affineinvqb $(post_filter_constant_s14), t4, x7, x7; \
+	vgf2p8affineinvqb $(post_filter_constant_s14), t4, x3, x3; \
+	vgf2p8affineinvqb $(post_filter_constant_s14), t4, x6, x6; \
+	\
+	/* sbox GF8 inverse + postfilter sbox 3 */ \
+	vgf2p8affineinvqb $(post_filter_constant_s3), t6, x2, x2; \
+	vgf2p8affineinvqb $(post_filter_constant_s3), t6, x5, x5; \
+	\
+	/* sbox GF8 inverse + postfilter sbox 2 */ \
+	vgf2p8affineinvqb $(post_filter_constant_s2), t3, x1, x1; \
+	vgf2p8affineinvqb $(post_filter_constant_s2), t3, x4, x4; \
+	\
+	vpsrldq $1, t0, t1; \
+	vpsrldq $2, t0, t2; \
+	vpshufb t7, t1, t1; \
+	vpsrldq $3, t0, t3; \
+	\
+	/* P-function */ \
+	vpxorq x5, x0, x0; \
+	vpxorq x6, x1, x1; \
+	vpxorq x7, x2, x2; \
+	vpxorq x4, x3, x3; \
+	\
+	vpshufb t7, t2, t2; \
+	vpsrldq $4, t0, t4; \
+	vpshufb t7, t3, t3; \
+	vpsrldq $5, t0, t5; \
+	vpshufb t7, t4, t4; \
+	\
+	vpxorq x2, x4, x4; \
+	vpxorq x3, x5, x5; \
+	vpxorq x0, x6, x6; \
+	vpxorq x1, x7, x7; \
+	\
+	vpsrldq $6, t0, t6; \
+	vpshufb t7, t5, t5; \
+	vpshufb t7, t6, t6; \
+	\
+	vpxorq x7, x0, x0; \
+	vpxorq x4, x1, x1; \
+	vpxorq x5, x2, x2; \
+	vpxorq x6, x3, x3; \
+	\
+	vpxorq x3, x4, x4; \
+	vpxorq x0, x5, x5; \
+	vpxorq x1, x6, x6; \
+	vpxorq x2, x7, x7; /* note: high and low parts swapped */ \
+	\
+	/* Add key material and result to CD (x becomes new CD) */ \
+	\
+	vpternlogq $0x96, mem_cd##_5, t6, x1; \
+	\
+	vpsrldq $7, t0, t6; \
+	vpshufb t7, t0, t0; \
+	vpshufb t7, t6, t7; \
+	\
+	vpternlogq $0x96, mem_cd##_4, t7, x0; \
+	vpternlogq $0x96, mem_cd##_6, t5, x2; \
+	vpternlogq $0x96, mem_cd##_7, t4, x3; \
+	vpternlogq $0x96, mem_cd##_0, t3, x4; \
+	vpternlogq $0x96, mem_cd##_1, t2, x5; \
+	vpternlogq $0x96, mem_cd##_2, t1, x6; \
+	vpternlogq $0x96, mem_cd##_3, t0, x7;
+
+/*
+ * IN/OUT:
+ *  x0..x7: byte-sliced AB state preloaded
+ *  mem_ab: byte-sliced AB state in memory
+ *  mem_cd: byte-sliced CD state in memory
+ */
+#define two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, i, dir, store_ab) \
+	roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		  y6, y7, mem_cd, (key_table + (i) * 8)(CTX)); \
+	\
+	vmovdqu64 x0, mem_cd##_4; \
+	vmovdqu64 x1, mem_cd##_5; \
+	vmovdqu64 x2, mem_cd##_6; \
+	vmovdqu64 x3, mem_cd##_7; \
+	vmovdqu64 x4, mem_cd##_0; \
+	vmovdqu64 x5, mem_cd##_1; \
+	vmovdqu64 x6, mem_cd##_2; \
+	vmovdqu64 x7, mem_cd##_3; \
+	\
+	roundsm64(x4, x5, x6, x7, x0, x1, x2, x3, y0, y1, y2, y3, y4, y5, \
+		  y6, y7, mem_ab, (key_table + ((i) + (dir)) * 8)(CTX)); \
+	\
+	store_ab(x0, x1, x2, x3, x4, x5, x6, x7, mem_ab);
+
+#define dummy_store(x0, x1, x2, x3, x4, x5, x6, x7, mem_ab) /* do nothing */
+
+#define store_ab_state(x0, x1, x2, x3, x4, x5, x6, x7, mem_ab) \
+	/* Store new AB state */ \
+	vmovdqu64 x4, mem_ab##_4; \
+	vmovdqu64 x5, mem_ab##_5; \
+	vmovdqu64 x6, mem_ab##_6; \
+	vmovdqu64 x7, mem_ab##_7; \
+	vmovdqu64 x0, mem_ab##_0; \
+	vmovdqu64 x1, mem_ab##_1; \
+	vmovdqu64 x2, mem_ab##_2; \
+	vmovdqu64 x3, mem_ab##_3;
+
+#define enc_rounds64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		     y6, y7, mem_ab, mem_cd, i) \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 2, 1, store_ab_state); \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 4, 1, store_ab_state); \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 6, 1, dummy_store);
+
+#define dec_rounds64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		     y6, y7, mem_ab, mem_cd, i) \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 7, -1, store_ab_state); \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 5, -1, store_ab_state); \
+	two_roundsm64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, (i) + 3, -1, dummy_store);
+
+/*
+ * IN:
+ *  v0..3: byte-sliced 32-bit integers
+ * OUT:
+ *  v0..3: (IN << 1)
+ *  t0, t1, t2, zero: (IN >> 7)
+ */
+#define rol32_1_64(v0, v1, v2, v3, t0, t1, t2, zero, one) \
+	vpcmpltb zero, v0, %k1; \
+	vpaddb v0, v0, v0; \
+	vpaddb one, zero, t0{%k1}{z}; \
+	\
+	vpcmpltb zero, v1, %k1; \
+	vpaddb v1, v1, v1; \
+	vpaddb one, zero, t1{%k1}{z}; \
+	\
+	vpcmpltb zero, v2, %k1; \
+	vpaddb v2, v2, v2; \
+	vpaddb one, zero, t2{%k1}{z}; \
+	\
+	vpcmpltb zero, v3, %k1; \
+	vpaddb v3, v3, v3; \
+	vpaddb one, zero, zero{%k1}{z};
+
+/*
+ * IN:
+ *  l: byte-sliced AB state in memory
+ *  r: byte-sliced CD state in memory
+ * OUT:
+ *  x0..x7: new byte-sliced CD state
+ */
+#define fls64(l, l0, l1, l2, l3, l4, l5, l6, l7, r, t0, t1, t2, t3, tt0, \
+	      tt1, tt2, tt3, kll, klr, krl, krr, tmp) \
+	/* \
+	 * t0 = kll; \
+	 * t0 &= ll; \
+	 * lr ^= rol32(t0, 1); \
+	 */ \
+	vpbroadcastd kll, t0; /* only lowest 32-bit used */ \
+	vpbroadcastq .Lbyte_ones rRIP, tmp; \
+	vpxor tt3##_x, tt3##_x, tt3##_x; \
+	vpshufb tt3, t0, t3; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t2; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t1; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t0; \
+	\
+	vpandq l0, t0, t0; \
+	vpandq l1, t1, t1; \
+	vpandq l2, t2, t2; \
+	vpandq l3, t3, t3; \
+	\
+	rol32_1_64(t3, t2, t1, t0, tt0, tt1, tt2, tt3, tmp); \
+	\
+	vpternlogq $0x96, tt2, t0, l4; \
+	vpbroadcastd krr, t0; /* only lowest 32-bit used */ \
+	vmovdqu64 l4, l##_4; \
+	vpternlogq $0x96, tt1, t1, l5; \
+	vmovdqu64 l5, l##_5; \
+	vpternlogq $0x96, tt0, t2, l6; \
+	vmovdqu64 l6, l##_6; \
+	vpternlogq $0x96, tt3, t3, l7; \
+	vmovdqu64 l7, l##_7; \
+	vpxor tt3##_x, tt3##_x, tt3##_x; \
+	\
+	/* \
+	 * t2 = krr; \
+	 * t2 |= rr; \
+	 * rl ^= t2; \
+	 */ \
+	\
+	vpshufb tt3, t0, t3; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t2; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t1; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t0; \
+	\
+	vpternlogq $0x1e, r##_4, t0, r##_0; \
+	vpbroadcastd krl, t0; /* only lowest 32-bit used */ \
+	vpternlogq $0x1e, r##_5, t1, r##_1; \
+	vpternlogq $0x1e, r##_6, t2, r##_2; \
+	vpternlogq $0x1e, r##_7, t3, r##_3; \
+	\
+	/* \
+	 * t2 = krl; \
+	 * t2 &= rl; \
+	 * rr ^= rol32(t2, 1); \
+	 */ \
+	vpshufb tt3, t0, t3; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t2; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t1; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t0; \
+	\
+	vpandq r##_0, t0, t0; \
+	vpandq r##_1, t1, t1; \
+	vpandq r##_2, t2, t2; \
+	vpandq r##_3, t3, t3; \
+	\
+	rol32_1_64(t3, t2, t1, t0, tt0, tt1, tt2, tt3, tmp); \
+	\
+	vpternlogq $0x96, tt2, t0, r##_4; \
+	vpbroadcastd klr, t0; /* only lowest 32-bit used */ \
+	vpternlogq $0x96, tt1, t1, r##_5; \
+	vpternlogq $0x96, tt0, t2, r##_6; \
+	vpternlogq $0x96, tt3, t3, r##_7; \
+	vpxor tt3##_x, tt3##_x, tt3##_x; \
+	\
+	/* \
+	 * t0 = klr; \
+	 * t0 |= lr; \
+	 * ll ^= t0; \
+	 */ \
+	\
+	vpshufb tt3, t0, t3; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t2; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t1; \
+	vpsrldq $1, t0, t0; \
+	vpshufb tt3, t0, t0; \
+	\
+	vpternlogq $0x1e, l4, t0, l0; \
+	vmovdqu64 l0, l##_0; \
+	vpternlogq $0x1e, l5, t1, l1; \
+	vmovdqu64 l1, l##_1; \
+	vpternlogq $0x1e, l6, t2, l2; \
+	vmovdqu64 l2, l##_2; \
+	vpternlogq $0x1e, l7, t3, l3; \
+	vmovdqu64 l3, l##_3;
+
+#define transpose_4x4(x0, x1, x2, x3, t1, t2) \
+	vpunpckhdq x1, x0, t2; \
+	vpunpckldq x1, x0, x0; \
+	\
+	vpunpckldq x3, x2, t1; \
+	vpunpckhdq x3, x2, x2; \
+	\
+	vpunpckhqdq t1, x0, x1; \
+	vpunpcklqdq t1, x0, x0; \
+	\
+	vpunpckhqdq x2, t2, x3; \
+	vpunpcklqdq x2, t2, x2;
+
+#define byteslice_16x16b_fast(a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, \
+			      a3, b3, c3, d3, st0, st1) \
+	transpose_4x4(a0, a1, a2, a3, st0, st1); \
+	transpose_4x4(b0, b1, b2, b3, st0, st1); \
+	\
+	transpose_4x4(c0, c1, c2, c3, st0, st1); \
+	transpose_4x4(d0, d1, d2, d3, st0, st1); \
+	\
+	vbroadcasti64x2 .Lshufb_16x16b rRIP, st0; \
+	vpshufb st0, a0, a0; \
+	vpshufb st0, a1, a1; \
+	vpshufb st0, a2, a2; \
+	vpshufb st0, a3, a3; \
+	vpshufb st0, b0, b0; \
+	vpshufb st0, b1, b1; \
+	vpshufb st0, b2, b2; \
+	vpshufb st0, b3, b3; \
+	vpshufb st0, c0, c0; \
+	vpshufb st0, c1, c1; \
+	vpshufb st0, c2, c2; \
+	vpshufb st0, c3, c3; \
+	vpshufb st0, d0, d0; \
+	vpshufb st0, d1, d1; \
+	vpshufb st0, d2, d2; \
+	vpshufb st0, d3, d3; \
+	\
+	transpose_4x4(a0, b0, c0, d0, st0, st1); \
+	transpose_4x4(a1, b1, c1, d1, st0, st1); \
+	\
+	transpose_4x4(a2, b2, c2, d2, st0, st1); \
+	transpose_4x4(a3, b3, c3, d3, st0, st1); \
+	/* does not adjust output bytes inside vectors */
+
+/* load blocks to registers and apply pre-whitening */
+#define inpack64_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		     y6, y7, rio, key) \
+	vpbroadcastq key, x0; \
+	vpshufb .Lpack_bswap rRIP, x0, x0; \
+	\
+	vpxorq 0 * 64(rio), x0, y7; \
+	vpxorq 1 * 64(rio), x0, y6; \
+	vpxorq 2 * 64(rio), x0, y5; \
+	vpxorq 3 * 64(rio), x0, y4; \
+	vpxorq 4 * 64(rio), x0, y3; \
+	vpxorq 5 * 64(rio), x0, y2; \
+	vpxorq 6 * 64(rio), x0, y1; \
+	vpxorq 7 * 64(rio), x0, y0; \
+	vpxorq 8 * 64(rio), x0, x7; \
+	vpxorq 9 * 64(rio), x0, x6; \
+	vpxorq 10 * 64(rio), x0, x5; \
+	vpxorq 11 * 64(rio), x0, x4; \
+	vpxorq 12 * 64(rio), x0, x3; \
+	vpxorq 13 * 64(rio), x0, x2; \
+	vpxorq 14 * 64(rio), x0, x1; \
+	vpxorq 15 * 64(rio), x0, x0;
+
+/* byteslice pre-whitened blocks and store to temporary memory */
+#define inpack64_post(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		      y6, y7, mem_ab, mem_cd, tmp0, tmp1) \
+	byteslice_16x16b_fast(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, \
+			      y4, y5, y6, y7, tmp0, tmp1); \
+	\
+	vmovdqu64 x0, mem_ab##_0; \
+	vmovdqu64 x1, mem_ab##_1; \
+	vmovdqu64 x2, mem_ab##_2; \
+	vmovdqu64 x3, mem_ab##_3; \
+	vmovdqu64 x4, mem_ab##_4; \
+	vmovdqu64 x5, mem_ab##_5; \
+	vmovdqu64 x6, mem_ab##_6; \
+	vmovdqu64 x7, mem_ab##_7; \
+	vmovdqu64 y0, mem_cd##_0; \
+	vmovdqu64 y1, mem_cd##_1; \
+	vmovdqu64 y2, mem_cd##_2; \
+	vmovdqu64 y3, mem_cd##_3; \
+	vmovdqu64 y4, mem_cd##_4; \
+	vmovdqu64 y5, mem_cd##_5; \
+	vmovdqu64 y6, mem_cd##_6; \
+	vmovdqu64 y7, mem_cd##_7;
+
+/* de-byteslice, apply post-whitening and store blocks */
+#define outunpack64(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, \
+		    y5, y6, y7, key, tmp0, tmp1) \
+	byteslice_16x16b_fast(y0, y4, x0, x4, y1, y5, x1, x5, y2, y6, x2, x6, \
+			      y3, y7, x3, x7, tmp0, tmp1); \
+	\
+	vpbroadcastq key, tmp0; \
+	vpshufb .Lpack_bswap rRIP, tmp0, tmp0; \
+	\
+	vpxorq tmp0, y7, y7; \
+	vpxorq tmp0, y6, y6; \
+	vpxorq tmp0, y5, y5; \
+	vpxorq tmp0, y4, y4; \
+	vpxorq tmp0, y3, y3; \
+	vpxorq tmp0, y2, y2; \
+	vpxorq tmp0, y1, y1; \
+	vpxorq tmp0, y0, y0; \
+	vpxorq tmp0, x7, x7; \
+	vpxorq tmp0, x6, x6; \
+	vpxorq tmp0, x5, x5; \
+	vpxorq tmp0, x4, x4; \
+	vpxorq tmp0, x3, x3; \
+	vpxorq tmp0, x2, x2; \
+	vpxorq tmp0, x1, x1; \
+	vpxorq tmp0, x0, x0;
+
+#define write_output(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
+		     y6, y7, rio) \
+	vmovdqu64 x0, 0 * 64(rio); \
+	vmovdqu64 x1, 1 * 64(rio); \
+	vmovdqu64 x2, 2 * 64(rio); \
+	vmovdqu64 x3, 3 * 64(rio); \
+	vmovdqu64 x4, 4 * 64(rio); \
+	vmovdqu64 x5, 5 * 64(rio); \
+	vmovdqu64 x6, 6 * 64(rio); \
+	vmovdqu64 x7, 7 * 64(rio); \
+	vmovdqu64 y0, 8 * 64(rio); \
+	vmovdqu64 y1, 9 * 64(rio); \
+	vmovdqu64 y2, 10 * 64(rio); \
+	vmovdqu64 y3, 11 * 64(rio); \
+	vmovdqu64 y4, 12 * 64(rio); \
+	vmovdqu64 y5, 13 * 64(rio); \
+	vmovdqu64 y6, 14 * 64(rio); \
+	vmovdqu64 y7, 15 * 64(rio);
+
+.text
+
+#define SHUFB_BYTES(idx) \
+	0 + (idx), 4 + (idx), 8 + (idx), 12 + (idx)
+
+_gcry_camellia_gfni_avx512__constants:
+ELF(.type   _gcry_camellia_gfni_avx512__constants,@object;)
+
+.align 64
+.Lpack_bswap:
+	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
+	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
+	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
+	.long 0x00010203, 0x04050607, 0x80808080, 0x80808080
+
+.Lcounter0123_lo:
+	.quad 0, 0
+	.quad 1, 0
+	.quad 2, 0
+	.quad 3, 0
+
+.align 16
+.Lcounter4444_lo:
+	.quad 4, 0
+.Lcounter8888_lo:
+	.quad 8, 0
+.Lcounter16161616_lo:
+	.quad 16, 0
+.Lcounter1111_hi:
+	.quad 0, 1
+
+.Lshufb_16x16b:
+	.byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3)
+
+/* For CTR-mode IV byteswap */
+.Lbswap128_mask:
+	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+
+.Lbyte_ones:
+	.byte 1, 1, 1, 1, 1, 1, 1, 1
+
+/* Pre-filters and post-filters bit-matrixes for Camellia sboxes s1, s2, s3
+ * and s4.
+ *   See http://urn.fi/URN:NBN:fi:oulu-201305311409, pages 43-48.
+ *
+ * Pre-filters are directly from above source, "θ₁"/"θ₄". Post-filters are
+ * combination of function "A" (AES SubBytes affine transformation) and
+ * "ψ₁"/"ψ₂"/"ψ₃".
+ */
+
+/* Bit-matrix from "θ₁(x)" function: */
+.Lpre_filter_bitmatrix_s123:
+	.quad BM8X8(BV8(1, 1, 1, 0, 1, 1, 0, 1),
+		    BV8(0, 0, 1, 1, 0, 0, 1, 0),
+		    BV8(1, 1, 0, 1, 0, 0, 0, 0),
+		    BV8(1, 0, 1, 1, 0, 0, 1, 1),
+		    BV8(0, 0, 0, 0, 1, 1, 0, 0),
+		    BV8(1, 0, 1, 0, 0, 1, 0, 0),
+		    BV8(0, 0, 1, 0, 1, 1, 0, 0),
+		    BV8(1, 0, 0, 0, 0, 1, 1, 0))
+
+/* Bit-matrix from "θ₄(x)" function: */
+.Lpre_filter_bitmatrix_s4:
+	.quad BM8X8(BV8(1, 1, 0, 1, 1, 0, 1, 1),
+		    BV8(0, 1, 1, 0, 0, 1, 0, 0),
+		    BV8(1, 0, 1, 0, 0, 0, 0, 1),
+		    BV8(0, 1, 1, 0, 0, 1, 1, 1),
+		    BV8(0, 0, 0, 1, 1, 0, 0, 0),
+		    BV8(0, 1, 0, 0, 1, 0, 0, 1),
+		    BV8(0, 1, 0, 1, 1, 0, 0, 0),
+		    BV8(0, 0, 0, 0, 1, 1, 0, 1))
+
+/* Bit-matrix from "ψ₁(A(x))" function: */
+.Lpost_filter_bitmatrix_s14:
+	.quad BM8X8(BV8(0, 0, 0, 0, 0, 0, 0, 1),
+		    BV8(0, 1, 1, 0, 0, 1, 1, 0),
+		    BV8(1, 0, 1, 1, 1, 1, 1, 0),
+		    BV8(0, 0, 0, 1, 1, 0, 1, 1),
+		    BV8(1, 0, 0, 0, 1, 1, 1, 0),
+		    BV8(0, 1, 0, 1, 1, 1, 1, 0),
+		    BV8(0, 1, 1, 1, 1, 1, 1, 1),
+		    BV8(0, 0, 0, 1, 1, 1, 0, 0))
+
+/* Bit-matrix from "ψ₂(A(x))" function: */
+.Lpost_filter_bitmatrix_s2:
+	.quad BM8X8(BV8(0, 0, 0, 1, 1, 1, 0, 0),
+		    BV8(0, 0, 0, 0, 0, 0, 0, 1),
+		    BV8(0, 1, 1, 0, 0, 1, 1, 0),
+		    BV8(1, 0, 1, 1, 1, 1, 1, 0),
+		    BV8(0, 0, 0, 1, 1, 0, 1, 1),
+		    BV8(1, 0, 0, 0, 1, 1, 1, 0),
+		    BV8(0, 1, 0, 1, 1, 1, 1, 0),
+		    BV8(0, 1, 1, 1, 1, 1, 1, 1))
+
+/* Bit-matrix from "ψ₃(A(x))" function: */
+.Lpost_filter_bitmatrix_s3:
+	.quad BM8X8(BV8(0, 1, 1, 0, 0, 1, 1, 0),
+		    BV8(1, 0, 1, 1, 1, 1, 1, 0),
+		    BV8(0, 0, 0, 1, 1, 0, 1, 1),
+		    BV8(1, 0, 0, 0, 1, 1, 1, 0),
+		    BV8(0, 1, 0, 1, 1, 1, 1, 0),
+		    BV8(0, 1, 1, 1, 1, 1, 1, 1),
+		    BV8(0, 0, 0, 1, 1, 1, 0, 0),
+		    BV8(0, 0, 0, 0, 0, 0, 0, 1))
+
+ELF(.size _gcry_camellia_gfni_avx512__constants,.-_gcry_camellia_gfni_avx512__constants;)
+
+.align 8
+ELF(.type   __camellia_gfni_avx512_enc_blk64,@function;)
+
+__camellia_gfni_avx512_enc_blk64:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%r8d: 24 for 16 byte key, 32 for larger
+	 *	%zmm0..%zmm15: 64 plaintext blocks
+	 * output:
+	 *	%zmm0..%zmm15: 64 encrypted blocks, order swapped:
+	 *       7, 8, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8
+	 */
+	CFI_STARTPROC();
+
+	leaq (-8 * 8)(CTX, %r8, 8), %r8;
+
+	inpack64_post(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		      %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		      %zmm15, mem_ab, mem_cd, %zmm30, %zmm31);
+
+.align 8
+.Lenc_loop:
+	enc_rounds64(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		     %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		     %zmm15, mem_ab, mem_cd, 0);
+
+	cmpq %r8, CTX;
+	je .Lenc_done;
+	leaq (8 * 8)(CTX), CTX;
+
+	fls64(mem_ab, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+	      mem_cd, %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+	      %zmm15,
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX),
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX),
+	      %zmm31);
+	jmp .Lenc_loop;
+
+.align 8
+.Lenc_done:
+	/* load CD for output */
+	vmovdqu64 mem_cd_0, %zmm8;
+	vmovdqu64 mem_cd_1, %zmm9;
+	vmovdqu64 mem_cd_2, %zmm10;
+	vmovdqu64 mem_cd_3, %zmm11;
+	vmovdqu64 mem_cd_4, %zmm12;
+	vmovdqu64 mem_cd_5, %zmm13;
+	vmovdqu64 mem_cd_6, %zmm14;
+	vmovdqu64 mem_cd_7, %zmm15;
+
+	outunpack64(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		    %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		    %zmm15, ((key_table) + 8 * 8)(%r8), %zmm30, %zmm31);
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size __camellia_gfni_avx512_enc_blk64,.-__camellia_gfni_avx512_enc_blk64;)
+
+.align 8
+ELF(.type   __camellia_gfni_avx512_dec_blk64,@function;)
+
+__camellia_gfni_avx512_dec_blk64:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%r8d: 24 for 16 byte key, 32 for larger
+	 *	%zmm0..%zmm15: 64 encrypted blocks
+	 * output:
+	 *	%zmm0..%zmm15: 64 plaintext blocks, order swapped:
+	 *       7, 8, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8
+	 */
+	CFI_STARTPROC();
+
+	movq %r8, %rcx;
+	movq CTX, %r8;
+	leaq (-8 * 8)(CTX, %rcx, 8), CTX;
+
+	inpack64_post(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		      %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		      %zmm15, mem_ab, mem_cd, %zmm30, %zmm31);
+
+.align 8
+.Ldec_loop:
+	dec_rounds64(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		     %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		     %zmm15, mem_ab, mem_cd, 0);
+
+	cmpq %r8, CTX;
+	je .Ldec_done;
+
+	fls64(mem_ab, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+	      mem_cd, %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+	      %zmm15,
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX),
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX),
+	      %zmm31);
+
+	leaq (-8 * 8)(CTX), CTX;
+	jmp .Ldec_loop;
+
+.align 8
+.Ldec_done:
+	/* load CD for output */
+	vmovdqu64 mem_cd_0, %zmm8;
+	vmovdqu64 mem_cd_1, %zmm9;
+	vmovdqu64 mem_cd_2, %zmm10;
+	vmovdqu64 mem_cd_3, %zmm11;
+	vmovdqu64 mem_cd_4, %zmm12;
+	vmovdqu64 mem_cd_5, %zmm13;
+	vmovdqu64 mem_cd_6, %zmm14;
+	vmovdqu64 mem_cd_7, %zmm15;
+
+	outunpack64(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		    %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		    %zmm15, (key_table)(CTX), %zmm30, %zmm31);
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size __camellia_gfni_avx512_dec_blk64,.-__camellia_gfni_avx512_dec_blk64;)
+
+#define add_le128(out, in, lo_counter, hi_counter1) \
+	vpaddq lo_counter, in, out; \
+	vpcmpuq $1, lo_counter, out, %k1; \
+	kaddb %k1, %k1, %k1; \
+	vpaddq hi_counter1, out, out{%k1};
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_ctr_enc
+ELF(.type   _gcry_camellia_gfni_avx512_ctr_enc,@function;)
+
+_gcry_camellia_gfni_avx512_ctr_enc:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 *	%rcx: iv (big endian, 128bit)
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	vbroadcasti64x2 .Lbswap128_mask rRIP, %zmm19;
+	vmovdqa64 .Lcounter0123_lo rRIP, %zmm21;
+	vbroadcasti64x2 .Lcounter4444_lo rRIP, %zmm22;
+	vbroadcasti64x2 .Lcounter8888_lo rRIP, %zmm23;
+	vbroadcasti64x2 .Lcounter16161616_lo rRIP, %zmm24;
+	vbroadcasti64x2 .Lcounter1111_hi rRIP, %zmm25;
+
+	/* load IV and byteswap */
+	movq 8(%rcx), %r11;
+	movq (%rcx), %r10;
+	bswapq %r11;
+	bswapq %r10;
+	vbroadcasti64x2 (%rcx), %zmm0;
+	vpshufb %zmm19, %zmm0, %zmm0;
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
+	/* check need for handling 64-bit overflow and carry */
+	cmpq $(0xffffffffffffffff - 64), %r11;
+	ja .Lload_ctr_carry;
+
+	/* construct IVs */
+	vpaddq %zmm21, %zmm0, %zmm15; /* +0:+1:+2:+3 */
+	vpaddq %zmm22, %zmm15, %zmm14; /* +4:+5:+6:+7 */
+	vpaddq %zmm23, %zmm15, %zmm13; /* +8:+9:+10:+11 */
+	vpaddq %zmm23, %zmm14, %zmm12; /* +12:+13:+14:+15 */
+	vpaddq %zmm24, %zmm15, %zmm11; /* +16... */
+	vpaddq %zmm24, %zmm14, %zmm10; /* +20... */
+	vpaddq %zmm24, %zmm13, %zmm9; /* +24... */
+	vpaddq %zmm24, %zmm12, %zmm8; /* +28... */
+	vpaddq %zmm24, %zmm11, %zmm7; /* +32... */
+	vpaddq %zmm24, %zmm10, %zmm6; /* +36... */
+	vpaddq %zmm24, %zmm9, %zmm5; /* +40... */
+	vpaddq %zmm24, %zmm8, %zmm4; /* +44... */
+	vpaddq %zmm24, %zmm7, %zmm3; /* +48... */
+	vpaddq %zmm24, %zmm6, %zmm2; /* +52... */
+	vpaddq %zmm24, %zmm5, %zmm1; /* +56... */
+	vpaddq %zmm24, %zmm4, %zmm0; /* +60... */
+	jmp .Lload_ctr_done;
+
+.align 4
+.Lload_ctr_carry:
+	/* construct IVs */
+	add_le128(%zmm15, %zmm0, %zmm21, %zmm25); /* +0:+1:+2:+3 */
+	add_le128(%zmm14, %zmm15, %zmm22, %zmm25); /* +4:+5:+6:+7 */
+	add_le128(%zmm13, %zmm15, %zmm23, %zmm25); /* +8:+9:+10:+11 */
+	add_le128(%zmm12, %zmm14, %zmm23, %zmm25); /* +12:+13:+14:+15 */
+	add_le128(%zmm11, %zmm15, %zmm24, %zmm25); /* +16... */
+	add_le128(%zmm10, %zmm14, %zmm24, %zmm25); /* +20... */
+	add_le128(%zmm9, %zmm13, %zmm24, %zmm25); /* +24... */
+	add_le128(%zmm8, %zmm12, %zmm24, %zmm25); /* +28... */
+	add_le128(%zmm7, %zmm11, %zmm24, %zmm25); /* +32... */
+	add_le128(%zmm6, %zmm10, %zmm24, %zmm25); /* +36... */
+	add_le128(%zmm5, %zmm9, %zmm24, %zmm25); /* +40... */
+	add_le128(%zmm4, %zmm8, %zmm24, %zmm25); /* +44... */
+	add_le128(%zmm3, %zmm7, %zmm24, %zmm25); /* +48... */
+	add_le128(%zmm2, %zmm6, %zmm24, %zmm25); /* +52... */
+	add_le128(%zmm1, %zmm5, %zmm24, %zmm25); /* +56... */
+	add_le128(%zmm0, %zmm4, %zmm24, %zmm25); /* +60... */
+
+.align 4
+.Lload_ctr_done:
+	vpbroadcastq (key_table)(CTX), %zmm16;
+	vpshufb .Lpack_bswap rRIP, %zmm16, %zmm16;
+
+	/* Byte-swap IVs and update counter. */
+	addq $64, %r11;
+	adcq $0, %r10;
+	vpshufb %zmm19, %zmm15, %zmm15;
+	vpshufb %zmm19, %zmm14, %zmm14;
+	vpshufb %zmm19, %zmm13, %zmm13;
+	vpshufb %zmm19, %zmm12, %zmm12;
+	vpshufb %zmm19, %zmm11, %zmm11;
+	vpshufb %zmm19, %zmm10, %zmm10;
+	vpshufb %zmm19, %zmm9, %zmm9;
+	vpshufb %zmm19, %zmm8, %zmm8;
+	bswapq %r11;
+	bswapq %r10;
+	vpshufb %zmm19, %zmm7, %zmm7;
+	vpshufb %zmm19, %zmm6, %zmm6;
+	vpshufb %zmm19, %zmm5, %zmm5;
+	vpshufb %zmm19, %zmm4, %zmm4;
+	vpshufb %zmm19, %zmm3, %zmm3;
+	vpshufb %zmm19, %zmm2, %zmm2;
+	vpshufb %zmm19, %zmm1, %zmm1;
+	vpshufb %zmm19, %zmm0, %zmm0;
+	movq %r11, 8(%rcx);
+	movq %r10, (%rcx);
+
+	/* inpack64_pre: */
+	vpxorq %zmm0, %zmm16, %zmm0;
+	vpxorq %zmm1, %zmm16, %zmm1;
+	vpxorq %zmm2, %zmm16, %zmm2;
+	vpxorq %zmm3, %zmm16, %zmm3;
+	vpxorq %zmm4, %zmm16, %zmm4;
+	vpxorq %zmm5, %zmm16, %zmm5;
+	vpxorq %zmm6, %zmm16, %zmm6;
+	vpxorq %zmm7, %zmm16, %zmm7;
+	vpxorq %zmm8, %zmm16, %zmm8;
+	vpxorq %zmm9, %zmm16, %zmm9;
+	vpxorq %zmm10, %zmm16, %zmm10;
+	vpxorq %zmm11, %zmm16, %zmm11;
+	vpxorq %zmm12, %zmm16, %zmm12;
+	vpxorq %zmm13, %zmm16, %zmm13;
+	vpxorq %zmm14, %zmm16, %zmm14;
+	vpxorq %zmm15, %zmm16, %zmm15;
+
+	call __camellia_gfni_avx512_enc_blk64;
+
+	vpxorq 0 * 64(%rdx), %zmm7, %zmm7;
+	vpxorq 1 * 64(%rdx), %zmm6, %zmm6;
+	vpxorq 2 * 64(%rdx), %zmm5, %zmm5;
+	vpxorq 3 * 64(%rdx), %zmm4, %zmm4;
+	vpxorq 4 * 64(%rdx), %zmm3, %zmm3;
+	vpxorq 5 * 64(%rdx), %zmm2, %zmm2;
+	vpxorq 6 * 64(%rdx), %zmm1, %zmm1;
+	vpxorq 7 * 64(%rdx), %zmm0, %zmm0;
+	vpxorq 8 * 64(%rdx), %zmm15, %zmm15;
+	vpxorq 9 * 64(%rdx), %zmm14, %zmm14;
+	vpxorq 10 * 64(%rdx), %zmm13, %zmm13;
+	vpxorq 11 * 64(%rdx), %zmm12, %zmm12;
+	vpxorq 12 * 64(%rdx), %zmm11, %zmm11;
+	vpxorq 13 * 64(%rdx), %zmm10, %zmm10;
+	vpxorq 14 * 64(%rdx), %zmm9, %zmm9;
+	vpxorq 15 * 64(%rdx), %zmm8, %zmm8;
+
+	write_output(%zmm7, %zmm6, %zmm5, %zmm4, %zmm3, %zmm2, %zmm1, %zmm0,
+		     %zmm15, %zmm14, %zmm13, %zmm12, %zmm11, %zmm10, %zmm9,
+		     %zmm8, %rsi);
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_ctr_enc,.-_gcry_camellia_gfni_avx512_ctr_enc;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_cbc_dec
+ELF(.type   _gcry_camellia_gfni_avx512_cbc_dec,@function;)
+
+_gcry_camellia_gfni_avx512_cbc_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	movq %rcx, %r9;
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
+	inpack64_pre(%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7,
+		     %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm14,
+		     %zmm15, %rdx, (key_table)(CTX, %r8, 8));
+
+	call __camellia_gfni_avx512_dec_blk64;
+
+	/* XOR output with IV */
+	vmovdqu64 (%r9), %xmm16;
+	vinserti64x2 $1, (0 * 16)(%rdx), %ymm16, %ymm16;
+	vinserti64x4 $1, (1 * 16)(%rdx), %zmm16, %zmm16;
+	vpxorq %zmm16, %zmm7, %zmm7;
+	vpxorq (0 * 64 + 48)(%rdx), %zmm6, %zmm6;
+	vpxorq (1 * 64 + 48)(%rdx), %zmm5, %zmm5;
+	vpxorq (2 * 64 + 48)(%rdx), %zmm4, %zmm4;
+	vpxorq (3 * 64 + 48)(%rdx), %zmm3, %zmm3;
+	vpxorq (4 * 64 + 48)(%rdx), %zmm2, %zmm2;
+	vpxorq (5 * 64 + 48)(%rdx), %zmm1, %zmm1;
+	vpxorq (6 * 64 + 48)(%rdx), %zmm0, %zmm0;
+	vpxorq (7 * 64 + 48)(%rdx), %zmm15, %zmm15;
+	vpxorq (8 * 64 + 48)(%rdx), %zmm14, %zmm14;
+	vpxorq (9 * 64 + 48)(%rdx), %zmm13, %zmm13;
+	vpxorq (10 * 64 + 48)(%rdx), %zmm12, %zmm12;
+	vpxorq (11 * 64 + 48)(%rdx), %zmm11, %zmm11;
+	vpxorq (12 * 64 + 48)(%rdx), %zmm10, %zmm10;
+	vpxorq (13 * 64 + 48)(%rdx), %zmm9, %zmm9;
+	vpxorq (14 * 64 + 48)(%rdx), %zmm8, %zmm8;
+	vmovdqu64 (15 * 64 + 48)(%rdx), %xmm16;
+
+	write_output(%zmm7, %zmm6, %zmm5, %zmm4, %zmm3, %zmm2, %zmm1, %zmm0,
+		     %zmm15, %zmm14, %zmm13, %zmm12, %zmm11, %zmm10, %zmm9,
+		     %zmm8, %rsi);
+
+	/* store new IV */
+	vmovdqu64 %xmm16, (0)(%r9);
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_cbc_dec,.-_gcry_camellia_gfni_avx512_cbc_dec;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_cfb_dec
+ELF(.type   _gcry_camellia_gfni_avx512_cfb_dec,@function;)
+
+_gcry_camellia_gfni_avx512_cfb_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
+	/* inpack64_pre: */
+	vpbroadcastq (key_table)(CTX), %zmm0;
+	vpshufb .Lpack_bswap rRIP, %zmm0, %zmm0;
+	vmovdqu64 (%rcx), %xmm15;
+	vinserti64x2 $1, (%rdx), %ymm15, %ymm15;
+	vinserti64x4 $1, 16(%rdx), %zmm15, %zmm15;
+	vpxorq %zmm15, %zmm0, %zmm15;
+	vpxorq (0 * 64 + 48)(%rdx), %zmm0, %zmm14;
+	vpxorq (1 * 64 + 48)(%rdx), %zmm0, %zmm13;
+	vpxorq (2 * 64 + 48)(%rdx), %zmm0, %zmm12;
+	vpxorq (3 * 64 + 48)(%rdx), %zmm0, %zmm11;
+	vpxorq (4 * 64 + 48)(%rdx), %zmm0, %zmm10;
+	vpxorq (5 * 64 + 48)(%rdx), %zmm0, %zmm9;
+	vpxorq (6 * 64 + 48)(%rdx), %zmm0, %zmm8;
+	vpxorq (7 * 64 + 48)(%rdx), %zmm0, %zmm7;
+	vpxorq (8 * 64 + 48)(%rdx), %zmm0, %zmm6;
+	vpxorq (9 * 64 + 48)(%rdx), %zmm0, %zmm5;
+	vpxorq (10 * 64 + 48)(%rdx), %zmm0, %zmm4;
+	vpxorq (11 * 64 + 48)(%rdx), %zmm0, %zmm3;
+	vpxorq (12 * 64 + 48)(%rdx), %zmm0, %zmm2;
+	vpxorq (13 * 64 + 48)(%rdx), %zmm0, %zmm1;
+	vpxorq (14 * 64 + 48)(%rdx), %zmm0, %zmm0;
+	vmovdqu64 (15 * 64 + 48)(%rdx), %xmm16;
+	vmovdqu64 %xmm16, (%rcx); /* store new IV */
+
+	call __camellia_gfni_avx512_enc_blk64;
+
+	vpxorq 0 * 64(%rdx), %zmm7, %zmm7;
+	vpxorq 1 * 64(%rdx), %zmm6, %zmm6;
+	vpxorq 2 * 64(%rdx), %zmm5, %zmm5;
+	vpxorq 3 * 64(%rdx), %zmm4, %zmm4;
+	vpxorq 4 * 64(%rdx), %zmm3, %zmm3;
+	vpxorq 5 * 64(%rdx), %zmm2, %zmm2;
+	vpxorq 6 * 64(%rdx), %zmm1, %zmm1;
+	vpxorq 7 * 64(%rdx), %zmm0, %zmm0;
+	vpxorq 8 * 64(%rdx), %zmm15, %zmm15;
+	vpxorq 9 * 64(%rdx), %zmm14, %zmm14;
+	vpxorq 10 * 64(%rdx), %zmm13, %zmm13;
+	vpxorq 11 * 64(%rdx), %zmm12, %zmm12;
+	vpxorq 12 * 64(%rdx), %zmm11, %zmm11;
+	vpxorq 13 * 64(%rdx), %zmm10, %zmm10;
+	vpxorq 14 * 64(%rdx), %zmm9, %zmm9;
+	vpxorq 15 * 64(%rdx), %zmm8, %zmm8;
+
+	write_output(%zmm7, %zmm6, %zmm5, %zmm4, %zmm3, %zmm2, %zmm1, %zmm0,
+		     %zmm15, %zmm14, %zmm13, %zmm12, %zmm11, %zmm10, %zmm9,
+		     %zmm8, %rsi);
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_cfb_dec,.-_gcry_camellia_gfni_avx512_cfb_dec;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_ocb_enc
+ELF(.type   _gcry_camellia_gfni_avx512_ocb_enc,@function;)
+
+_gcry_camellia_gfni_avx512_ocb_enc:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[64])
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	pushq %r12;
+	CFI_PUSH(%r12);
+	pushq %r13;
+	CFI_PUSH(%r13);
+	pushq %r14;
+	CFI_PUSH(%r14);
+	pushq %r15;
+	CFI_PUSH(%r15);
+	pushq %rbx;
+	CFI_PUSH(%rbx);
+
+	vmovdqu64 (%rcx), %xmm30;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* Checksum_i = Checksum_{i-1} xor P_i */
+	/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */
+
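+	/* Note: each OCB_INPUT(n, ...) below derives four consecutive
+	 * offsets from the running 128-bit offset kept in %xmm30, XORing
+	 * in one L pointer per step, and packs the four results into the
+	 * 128-bit lanes of %zmm16 so that one ZMM register whitens four
+	 * blocks at a time. */
+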
+#define OCB_INPUT(n, l0reg, l1reg, l2reg, l3reg, zreg, zplain) \
+	vmovdqu64 (n * 64)(%rdx), zplain; \
+	vpxorq (l0reg), %xmm30, %xmm16; \
+	vpxorq (l1reg), %xmm16, %xmm30; \
+	vinserti64x2 $1, %xmm30, %ymm16, %ymm16; \
+	vpxorq (l2reg), %xmm30, %xmm30; \
+	vinserti64x2 $2, %xmm30, %zmm16, %zmm16; \
+	vpxorq (l3reg), %xmm30, %xmm30; \
+	vinserti64x2 $3, %xmm30, %zmm16, %zmm16; \
+	vpxorq zplain, %zmm31, %zmm31; \
+	vpxorq zplain, %zmm16, zreg; \
+	vmovdqu64 %zmm16, (n * 64)(%rsi);
+
+#define OCB_LOAD_PTRS(n) \
+	movq ((n * 4 * 8) + (0 * 8))(%r9), %r10; \
+	movq ((n * 4 * 8) + (1 * 8))(%r9), %r11; \
+	movq ((n * 4 * 8) + (2 * 8))(%r9), %r12; \
+	movq ((n * 4 * 8) + (3 * 8))(%r9), %r13; \
+	movq ((n * 4 * 8) + (4 * 8))(%r9), %r14; \
+	movq ((n * 4 * 8) + (5 * 8))(%r9), %r15; \
+	movq ((n * 4 * 8) + (6 * 8))(%r9), %rax; \
+	movq ((n * 4 * 8) + (7 * 8))(%r9), %rbx;
+
+	OCB_LOAD_PTRS(0);
+	OCB_INPUT(0, %r10, %r11, %r12, %r13, %zmm15, %zmm20);
+	OCB_INPUT(1, %r14, %r15, %rax, %rbx, %zmm14, %zmm21);
+	OCB_LOAD_PTRS(2);
+	OCB_INPUT(2, %r10, %r11, %r12, %r13, %zmm13, %zmm22);
+	vpternlogq $0x96, %zmm20, %zmm21, %zmm22;
+	OCB_INPUT(3, %r14, %r15, %rax, %rbx, %zmm12, %zmm23);
+	OCB_LOAD_PTRS(4);
+	OCB_INPUT(4, %r10, %r11, %r12, %r13, %zmm11, %zmm24);
+	OCB_INPUT(5, %r14, %r15, %rax, %rbx, %zmm10, %zmm25);
+	vpternlogq $0x96, %zmm23, %zmm24, %zmm25;
+	OCB_LOAD_PTRS(6);
+	OCB_INPUT(6, %r10, %r11, %r12, %r13, %zmm9, %zmm20);
+	OCB_INPUT(7, %r14, %r15, %rax, %rbx, %zmm8, %zmm21);
+	OCB_LOAD_PTRS(8);
+	OCB_INPUT(8, %r10, %r11, %r12, %r13, %zmm7, %zmm26);
+	vpternlogq $0x96, %zmm20, %zmm21, %zmm26;
+	OCB_INPUT(9, %r14, %r15, %rax, %rbx, %zmm6, %zmm23);
+	OCB_LOAD_PTRS(10);
+	OCB_INPUT(10, %r10, %r11, %r12, %r13, %zmm5, %zmm24);
+	OCB_INPUT(11, %r14, %r15, %rax, %rbx, %zmm4, %zmm27);
+	vpternlogq $0x96, %zmm23, %zmm24, %zmm27;
+	OCB_LOAD_PTRS(12);
+	OCB_INPUT(12, %r10, %r11, %r12, %r13, %zmm3, %zmm20);
+	OCB_INPUT(13, %r14, %r15, %rax, %rbx, %zmm2, %zmm21);
+	OCB_LOAD_PTRS(14);
+	OCB_INPUT(14, %r10, %r11, %r12, %r13, %zmm1, %zmm23);
+	vpternlogq $0x96, %zmm20, %zmm21, %zmm23;
+	OCB_INPUT(15, %r14, %r15, %rax, %rbx, %zmm0, %zmm24);
+#undef OCB_LOAD_PTRS
+#undef OCB_INPUT
+
+	vpbroadcastq (key_table)(CTX), %zmm16;
+	vpshufb .Lpack_bswap rRIP, %zmm16, %zmm16;
+
+	vpternlogq $0x96, %zmm24, %zmm22, %zmm25;
+	vpternlogq $0x96, %zmm26, %zmm27, %zmm23;
+	vpxorq %zmm25, %zmm23, %zmm20;
+	vextracti64x4 $1, %zmm20, %ymm21;
+	vpxorq %ymm21, %ymm20, %ymm20;
+	vextracti64x2 $1, %ymm20, %xmm21;
+	vpternlogq $0x96, (%r8), %xmm21, %xmm20;
+	vmovdqu64 %xmm30, (%rcx);
+	vmovdqu64 %xmm20, (%r8);
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
+	/* inpack64_pre: */
+	vpxorq %zmm0, %zmm16, %zmm0;
+	vpxorq %zmm1, %zmm16, %zmm1;
+	vpxorq %zmm2, %zmm16, %zmm2;
+	vpxorq %zmm3, %zmm16, %zmm3;
+	vpxorq %zmm4, %zmm16, %zmm4;
+	vpxorq %zmm5, %zmm16, %zmm5;
+	vpxorq %zmm6, %zmm16, %zmm6;
+	vpxorq %zmm7, %zmm16, %zmm7;
+	vpxorq %zmm8, %zmm16, %zmm8;
+	vpxorq %zmm9, %zmm16, %zmm9;
+	vpxorq %zmm10, %zmm16, %zmm10;
+	vpxorq %zmm11, %zmm16, %zmm11;
+	vpxorq %zmm12, %zmm16, %zmm12;
+	vpxorq %zmm13, %zmm16, %zmm13;
+	vpxorq %zmm14, %zmm16, %zmm14;
+	vpxorq %zmm15, %zmm16, %zmm15;
+
+	call __camellia_gfni_avx512_enc_blk64;
+
+	vpxorq 0 * 64(%rsi), %zmm7, %zmm7;
+	vpxorq 1 * 64(%rsi), %zmm6, %zmm6;
+	vpxorq 2 * 64(%rsi), %zmm5, %zmm5;
+	vpxorq 3 * 64(%rsi), %zmm4, %zmm4;
+	vpxorq 4 * 64(%rsi), %zmm3, %zmm3;
+	vpxorq 5 * 64(%rsi), %zmm2, %zmm2;
+	vpxorq 6 * 64(%rsi), %zmm1, %zmm1;
+	vpxorq 7 * 64(%rsi), %zmm0, %zmm0;
+	vpxorq 8 * 64(%rsi), %zmm15, %zmm15;
+	vpxorq 9 * 64(%rsi), %zmm14, %zmm14;
+	vpxorq 10 * 64(%rsi), %zmm13, %zmm13;
+	vpxorq 11 * 64(%rsi), %zmm12, %zmm12;
+	vpxorq 12 * 64(%rsi), %zmm11, %zmm11;
+	vpxorq 13 * 64(%rsi), %zmm10, %zmm10;
+	vpxorq 14 * 64(%rsi), %zmm9, %zmm9;
+	vpxorq 15 * 64(%rsi), %zmm8, %zmm8;
+
+	write_output(%zmm7, %zmm6, %zmm5, %zmm4, %zmm3, %zmm2, %zmm1, %zmm0,
+		     %zmm15, %zmm14, %zmm13, %zmm12, %zmm11, %zmm10, %zmm9,
+		     %zmm8, %rsi);
+
+	popq %rbx;
+	CFI_RESTORE(%rbx);
+	popq %r15;
+	CFI_RESTORE(%r15);
+	popq %r14;
+	CFI_RESTORE(%r14);
+	popq %r13;
+	CFI_RESTORE(%r13);
+	popq %r12;
+	CFI_RESTORE(%r12);
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_ocb_enc,.-_gcry_camellia_gfni_avx512_ocb_enc;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_ocb_dec
+ELF(.type   _gcry_camellia_gfni_avx512_ocb_dec,@function;)
+
+_gcry_camellia_gfni_avx512_ocb_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[64])
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	pushq %r12;
+	CFI_PUSH(%r12);
+	pushq %r13;
+	CFI_PUSH(%r13);
+	pushq %r14;
+	CFI_PUSH(%r14);
+	pushq %r15;
+	CFI_PUSH(%r15);
+	pushq %rbx;
+	CFI_PUSH(%rbx);
+	pushq %r8;
+	CFI_PUSH(%r8);
+
+	vmovdqu64 (%rcx), %xmm30;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* C_i = Offset_i xor DECIPHER(K, P_i xor Offset_i) */
+
+#define OCB_INPUT(n, l0reg, l1reg, l2reg, l3reg, zreg) \
+	vpxorq (l0reg), %xmm30, %xmm16; \
+	vpxorq (l1reg), %xmm16, %xmm30; \
+	vinserti64x2 $1, %xmm30, %ymm16, %ymm16; \
+	vpxorq (l2reg), %xmm30, %xmm30; \
+	vinserti64x2 $2, %xmm30, %zmm16, %zmm16; \
+	vpxorq (l3reg), %xmm30, %xmm30; \
+	vinserti64x2 $3, %xmm30, %zmm16, %zmm16; \
+	vpxorq (n * 64)(%rdx), %zmm16, zreg; \
+	vmovdqu64 %zmm16, (n * 64)(%rsi);
+
+#define OCB_LOAD_PTRS(n) \
+	movq ((n * 4 * 8) + (0 * 8))(%r9), %r10; \
+	movq ((n * 4 * 8) + (1 * 8))(%r9), %r11; \
+	movq ((n * 4 * 8) + (2 * 8))(%r9), %r12; \
+	movq ((n * 4 * 8) + (3 * 8))(%r9), %r13; \
+	movq ((n * 4 * 8) + (4 * 8))(%r9), %r14; \
+	movq ((n * 4 * 8) + (5 * 8))(%r9), %r15; \
+	movq ((n * 4 * 8) + (6 * 8))(%r9), %rax; \
+	movq ((n * 4 * 8) + (7 * 8))(%r9), %rbx;
+
+	OCB_LOAD_PTRS(0);
+	OCB_INPUT(0, %r10, %r11, %r12, %r13, %zmm15);
+	OCB_INPUT(1, %r14, %r15, %rax, %rbx, %zmm14);
+	OCB_LOAD_PTRS(2);
+	OCB_INPUT(2, %r10, %r11, %r12, %r13, %zmm13);
+	OCB_INPUT(3, %r14, %r15, %rax, %rbx, %zmm12);
+	OCB_LOAD_PTRS(4);
+	OCB_INPUT(4, %r10, %r11, %r12, %r13, %zmm11);
+	OCB_INPUT(5, %r14, %r15, %rax, %rbx, %zmm10);
+	OCB_LOAD_PTRS(6);
+	OCB_INPUT(6, %r10, %r11, %r12, %r13, %zmm9);
+	OCB_INPUT(7, %r14, %r15, %rax, %rbx, %zmm8);
+	OCB_LOAD_PTRS(8);
+	OCB_INPUT(8, %r10, %r11, %r12, %r13, %zmm7);
+	OCB_INPUT(9, %r14, %r15, %rax, %rbx, %zmm6);
+	OCB_LOAD_PTRS(10);
+	OCB_INPUT(10, %r10, %r11, %r12, %r13, %zmm5);
+	OCB_INPUT(11, %r14, %r15, %rax, %rbx, %zmm4);
+	OCB_LOAD_PTRS(12);
+	OCB_INPUT(12, %r10, %r11, %r12, %r13, %zmm3);
+	OCB_INPUT(13, %r14, %r15, %rax, %rbx, %zmm2);
+	OCB_LOAD_PTRS(14);
+	OCB_INPUT(14, %r10, %r11, %r12, %r13, %zmm1);
+	OCB_INPUT(15, %r14, %r15, %rax, %rbx, %zmm0);
+#undef OCB_LOAD_PTRS
+#undef OCB_INPUT
+
+	vmovdqu64 %xmm30, (%rcx);
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
+	vpbroadcastq (key_table)(CTX, %r8, 8), %zmm16;
+	vpshufb .Lpack_bswap rRIP, %zmm16, %zmm16;
+
+	/* inpack64_pre: */
+	vpxorq %zmm0, %zmm16, %zmm0;
+	vpxorq %zmm1, %zmm16, %zmm1;
+	vpxorq %zmm2, %zmm16, %zmm2;
+	vpxorq %zmm3, %zmm16, %zmm3;
+	vpxorq %zmm4, %zmm16, %zmm4;
+	vpxorq %zmm5, %zmm16, %zmm5;
+	vpxorq %zmm6, %zmm16, %zmm6;
+	vpxorq %zmm7, %zmm16, %zmm7;
+	vpxorq %zmm8, %zmm16, %zmm8;
+	vpxorq %zmm9, %zmm16, %zmm9;
+	vpxorq %zmm10, %zmm16, %zmm10;
+	vpxorq %zmm11, %zmm16, %zmm11;
+	vpxorq %zmm12, %zmm16, %zmm12;
+	vpxorq %zmm13, %zmm16, %zmm13;
+	vpxorq %zmm14, %zmm16, %zmm14;
+	vpxorq %zmm15, %zmm16, %zmm15;
+
+	call __camellia_gfni_avx512_dec_blk64;
+
+	vpxorq 0 * 64(%rsi), %zmm7, %zmm7;
+	vpxorq 1 * 64(%rsi), %zmm6, %zmm6;
+	vpxorq 2 * 64(%rsi), %zmm5, %zmm5;
+	vpxorq 3 * 64(%rsi), %zmm4, %zmm4;
+	vpxorq 4 * 64(%rsi), %zmm3, %zmm3;
+	vpxorq 5 * 64(%rsi), %zmm2, %zmm2;
+	vpxorq 6 * 64(%rsi), %zmm1, %zmm1;
+	vpxorq 7 * 64(%rsi), %zmm0, %zmm0;
+	vpxorq 8 * 64(%rsi), %zmm15, %zmm15;
+	vpxorq 9 * 64(%rsi), %zmm14, %zmm14;
+	vpxorq 10 * 64(%rsi), %zmm13, %zmm13;
+	vpxorq 11 * 64(%rsi), %zmm12, %zmm12;
+	vpxorq 12 * 64(%rsi), %zmm11, %zmm11;
+	vpxorq 13 * 64(%rsi), %zmm10, %zmm10;
+	vpxorq 14 * 64(%rsi), %zmm9, %zmm9;
+	vpxorq 15 * 64(%rsi), %zmm8, %zmm8;
+
+	write_output(%zmm7, %zmm6, %zmm5, %zmm4, %zmm3, %zmm2, %zmm1, %zmm0,
+		     %zmm15, %zmm14, %zmm13, %zmm12, %zmm11, %zmm10, %zmm9,
+		     %zmm8, %rsi);
+
+	popq %r8;
+	CFI_RESTORE(%r8);
+
+	/* Checksum_i = Checksum_{i-1} xor C_i */
+	vpternlogq $0x96, %zmm7, %zmm6, %zmm5;
+	vpternlogq $0x96, %zmm4, %zmm3, %zmm2;
+	vpternlogq $0x96, %zmm1, %zmm0, %zmm15;
+	vpternlogq $0x96, %zmm14, %zmm13, %zmm12;
+	vpternlogq $0x96, %zmm11, %zmm10, %zmm9;
+	vpternlogq $0x96, %zmm5, %zmm2, %zmm15;
+	vpternlogq $0x96, %zmm12, %zmm9, %zmm8;
+	vpxorq %zmm15, %zmm8, %zmm8;
+
+	vextracti64x4 $1, %zmm8, %ymm0;
+	vpxor %ymm0, %ymm8, %ymm8;
+	vextracti128 $1, %ymm8, %xmm0;
+	vpternlogq $0x96, (%r8), %xmm0, %xmm8;
+	vmovdqu64 %xmm8, (%r8);
+
+	popq %rbx;
+	CFI_RESTORE(%rbx);
+	popq %r15;
+	CFI_RESTORE(%r15);
+	popq %r14;
+	CFI_RESTORE(%r14);
+	popq %r13;
+	CFI_RESTORE(%r13);
+	popq %r12;
+	CFI_RESTORE(%r12);
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_ocb_dec,.-_gcry_camellia_gfni_avx512_ocb_dec;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_enc_blk64
+ELF(.type   _gcry_camellia_gfni_avx512_enc_blk64,@function;)
+
+_gcry_camellia_gfni_avx512_enc_blk64:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+	xorl %eax, %eax;
+
+	vpbroadcastq (key_table)(CTX), %zmm0;
+	vpshufb .Lpack_bswap rRIP, %zmm0, %zmm0;
+
+	vpxorq (0) * 64(%rdx), %zmm0, %zmm15;
+	vpxorq (1) * 64(%rdx), %zmm0, %zmm14;
+	vpxorq (2) * 64(%rdx), %zmm0, %zmm13;
+	vpxorq (3) * 64(%rdx), %zmm0, %zmm12;
+	vpxorq (4) * 64(%rdx), %zmm0, %zmm11;
+	vpxorq (5) * 64(%rdx), %zmm0, %zmm10;
+	vpxorq (6) * 64(%rdx), %zmm0, %zmm9;
+	vpxorq (7) * 64(%rdx), %zmm0, %zmm8;
+	vpxorq (8) * 64(%rdx), %zmm0, %zmm7;
+	vpxorq (9) * 64(%rdx), %zmm0, %zmm6;
+	vpxorq (10) * 64(%rdx), %zmm0, %zmm5;
+	vpxorq (11) * 64(%rdx), %zmm0, %zmm4;
+	vpxorq (12) * 64(%rdx), %zmm0, %zmm3;
+	vpxorq (13) * 64(%rdx), %zmm0, %zmm2;
+	vpxorq (14) * 64(%rdx), %zmm0, %zmm1;
+	vpxorq (15) * 64(%rdx), %zmm0, %zmm0;
+
+	call __camellia_gfni_avx512_enc_blk64;
+
+	vmovdqu64 %zmm7, (0) * 64(%rsi);
+	vmovdqu64 %zmm6, (1) * 64(%rsi);
+	vmovdqu64 %zmm5, (2) * 64(%rsi);
+	vmovdqu64 %zmm4, (3) * 64(%rsi);
+	vmovdqu64 %zmm3, (4) * 64(%rsi);
+	vmovdqu64 %zmm2, (5) * 64(%rsi);
+	vmovdqu64 %zmm1, (6) * 64(%rsi);
+	vmovdqu64 %zmm0, (7) * 64(%rsi);
+	vmovdqu64 %zmm15, (8) * 64(%rsi);
+	vmovdqu64 %zmm14, (9) * 64(%rsi);
+	vmovdqu64 %zmm13, (10) * 64(%rsi);
+	vmovdqu64 %zmm12, (11) * 64(%rsi);
+	vmovdqu64 %zmm11, (12) * 64(%rsi);
+	vmovdqu64 %zmm10, (13) * 64(%rsi);
+	vmovdqu64 %zmm9, (14) * 64(%rsi);
+	vmovdqu64 %zmm8, (15) * 64(%rsi);
+
+	clear_regs();
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_enc_blk64,.-_gcry_camellia_gfni_avx512_enc_blk64;)
+
+.align 8
+.globl _gcry_camellia_gfni_avx512_dec_blk64
+ELF(.type   _gcry_camellia_gfni_avx512_dec_blk64,@function;)
+
+_gcry_camellia_gfni_avx512_dec_blk64:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (64 blocks)
+	 *	%rdx: src (64 blocks)
+	 */
+	CFI_STARTPROC();
+	vpopcntb %zmm16, %zmm16; /* spec stop for old AVX512 CPUs */
+
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+	xorl %eax, %eax;
+
+	vpbroadcastq (key_table)(CTX, %r8, 8), %zmm0;
+	vpshufb .Lpack_bswap rRIP, %zmm0, %zmm0;
+
+	vpxorq (0) * 64(%rdx), %zmm0, %zmm15;
+	vpxorq (1) * 64(%rdx), %zmm0, %zmm14;
+	vpxorq (2) * 64(%rdx), %zmm0, %zmm13;
+	vpxorq (3) * 64(%rdx), %zmm0, %zmm12;
+	vpxorq (4) * 64(%rdx), %zmm0, %zmm11;
+	vpxorq (5) * 64(%rdx), %zmm0, %zmm10;
+	vpxorq (6) * 64(%rdx), %zmm0, %zmm9;
+	vpxorq (7) * 64(%rdx), %zmm0, %zmm8;
+	vpxorq (8) * 64(%rdx), %zmm0, %zmm7;
+	vpxorq (9) * 64(%rdx), %zmm0, %zmm6;
+	vpxorq (10) * 64(%rdx), %zmm0, %zmm5;
+	vpxorq (11) * 64(%rdx), %zmm0, %zmm4;
+	vpxorq (12) * 64(%rdx), %zmm0, %zmm3;
+	vpxorq (13) * 64(%rdx), %zmm0, %zmm2;
+	vpxorq (14) * 64(%rdx), %zmm0, %zmm1;
+	vpxorq (15) * 64(%rdx), %zmm0, %zmm0;
+
+	call __camellia_gfni_avx512_dec_blk64;
+
+	vmovdqu64 %zmm7, (0) * 64(%rsi);
+	vmovdqu64 %zmm6, (1) * 64(%rsi);
+	vmovdqu64 %zmm5, (2) * 64(%rsi);
+	vmovdqu64 %zmm4, (3) * 64(%rsi);
+	vmovdqu64 %zmm3, (4) * 64(%rsi);
+	vmovdqu64 %zmm2, (5) * 64(%rsi);
+	vmovdqu64 %zmm1, (6) * 64(%rsi);
+	vmovdqu64 %zmm0, (7) * 64(%rsi);
+	vmovdqu64 %zmm15, (8) * 64(%rsi);
+	vmovdqu64 %zmm14, (9) * 64(%rsi);
+	vmovdqu64 %zmm13, (10) * 64(%rsi);
+	vmovdqu64 %zmm12, (11) * 64(%rsi);
+	vmovdqu64 %zmm11, (12) * 64(%rsi);
+	vmovdqu64 %zmm10, (13) * 64(%rsi);
+	vmovdqu64 %zmm9, (14) * 64(%rsi);
+	vmovdqu64 %zmm8, (15) * 64(%rsi);
+
+	clear_regs();
+
+	ret_spec_stop;
+	CFI_ENDPROC();
+ELF(.size _gcry_camellia_gfni_avx512_dec_blk64,.-_gcry_camellia_gfni_avx512_dec_blk64;)
+
+#endif /* defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX512_SUPPORT) */
+#endif /* __x86_64 */
diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c
index 00e23750..a854b82d 100644
--- a/cipher/camellia-glue.c
+++ b/cipher/camellia-glue.c
@@ -104,6 +104,12 @@
 # define USE_GFNI_AVX2 1
 #endif
 
+/* USE_GFNI_AVX512 indicates whether to compile with Intel GFNI/AVX512 code. */
+#undef USE_GFNI_AVX512
+#if defined(USE_GFNI_AVX2) && defined(ENABLE_AVX512_SUPPORT)
+# define USE_GFNI_AVX512 1
+#endif
+
 typedef struct
 {
   KEY_TABLE_TYPE keytable;
@@ -115,6 +121,7 @@ typedef struct
   unsigned int use_aesni_avx2:1;/* AES-NI/AVX2 implementation shall be used.  */
   unsigned int use_vaes_avx2:1; /* VAES/AVX2 implementation shall be used.  */
   unsigned int use_gfni_avx2:1; /* GFNI/AVX2 implementation shall be used.  */
+  unsigned int use_gfni_avx512:1; /* GFNI/AVX512 implementation shall be used.  */
 #endif /*USE_AESNI_AVX2*/
 } CAMELLIA_context;
 
@@ -134,7 +141,7 @@ typedef struct
 
 #ifdef USE_AESNI_AVX
 /* Assembler implementations of Camellia using AES-NI and AVX.  Process data
diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 00e23750..a854b82d 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -104,6 +104,12 @@ # define USE_GFNI_AVX2 1 #endif +/* USE_GFNI_AVX512 indicates whether to compile with Intel GFNI/AVX512 code. */ +#undef USE_GFNI_AVX512 +#if defined(USE_GFNI_AVX2) && defined(ENABLE_AVX512_SUPPORT) +# define USE_GFNI_AVX512 1 +#endif + typedef struct { KEY_TABLE_TYPE keytable; @@ -115,6 +121,7 @@ typedef struct unsigned int use_aesni_avx2:1;/* AES-NI/AVX2 implementation shall be used. */ unsigned int use_vaes_avx2:1; /* VAES/AVX2 implementation shall be used. */ unsigned int use_gfni_avx2:1; /* GFNI/AVX2 implementation shall be used. */ + unsigned int use_gfni_avx512:1; /* GFNI/AVX512 implementation shall be used. */ #endif /*USE_AESNI_AVX2*/ } CAMELLIA_context; @@ -134,7 +141,7 @@ typedef struct #ifdef USE_AESNI_AVX /* Assembler implementations of Camellia using AES-NI and AVX. Process data - in 16 block same time. + in 16 blocks at the same time. */ extern void _gcry_camellia_aesni_avx_ctr_enc(CAMELLIA_context *ctx, unsigned char *out, @@ -182,7 +189,7 @@ static const int avx_burn_stack_depth = 16 * CAMELLIA_BLOCK_SIZE + 16 + #ifdef USE_AESNI_AVX2 /* Assembler implementations of Camellia using AES-NI and AVX2. Process data - in 32 block same time. + in 32 blocks at the same time. */ extern void _gcry_camellia_aesni_avx2_ctr_enc(CAMELLIA_context *ctx, unsigned char *out, @@ -238,7 +245,7 @@ static const int avx2_burn_stack_depth = 32 * CAMELLIA_BLOCK_SIZE + 16 + #ifdef USE_VAES_AVX2 /* Assembler implementations of Camellia using VAES and AVX2. Process data - in 32 block same time. + in 32 blocks at the same time. */ extern void _gcry_camellia_vaes_avx2_ctr_enc(CAMELLIA_context *ctx, unsigned char *out, @@ -290,7 +297,7 @@ extern void _gcry_camellia_vaes_avx2_dec_blk1_32(const CAMELLIA_context *ctx, #ifdef USE_GFNI_AVX2 /* Assembler implementations of Camellia using GFNI and AVX2. Process data - in 32 block same time. + in 32 blocks at the same time. */ extern void _gcry_camellia_gfni_avx2_ctr_enc(CAMELLIA_context *ctx, unsigned char *out, @@ -340,6 +347,53 @@ extern void _gcry_camellia_gfni_avx2_dec_blk1_32(const CAMELLIA_context *ctx, ASM_FUNC_ABI; #endif +#ifdef USE_GFNI_AVX512 +/* Assembler implementations of Camellia using GFNI and AVX512. Process data + in 64 blocks at the same time. + */ +extern void _gcry_camellia_gfni_avx512_ctr_enc(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *ctr) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_cbc_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_cfb_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *iv) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_ocb_enc(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[64]) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_ocb_dec(CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[64]) ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_enc_blk64(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in) + ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_dec_blk64(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in) + ASM_FUNC_ABI; + +/* Stack not used by AVX512 implementation.
*/ +static const int avx512_burn_stack_depth = 0; +#endif + static const char *selftest(void); static void _gcry_camellia_ctr_enc (void *context, unsigned char *ctr, @@ -393,6 +447,7 @@ camellia_setkey(void *c, const byte *key, unsigned keylen, ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); ctx->use_vaes_avx2 = 0; ctx->use_gfni_avx2 = 0; + ctx->use_gfni_avx512 = 0; #endif #ifdef USE_VAES_AVX2 ctx->use_vaes_avx2 = (hwf & HWF_INTEL_VAES_VPCLMUL) && (hwf & HWF_INTEL_AVX2); @@ -400,6 +455,9 @@ camellia_setkey(void *c, const byte *key, unsigned keylen, #ifdef USE_GFNI_AVX2 ctx->use_gfni_avx2 = (hwf & HWF_INTEL_GFNI) && (hwf & HWF_INTEL_AVX2); #endif +#ifdef USE_GFNI_AVX512 + ctx->use_gfni_avx512 = (hwf & HWF_INTEL_GFNI) && (hwf & HWF_INTEL_AVX512); +#endif ctx->keybitlength=keylen*8; @@ -592,6 +650,37 @@ camellia_encrypt_blk1_32 (const void *priv, byte *outbuf, const byte *inbuf, return stack_burn_size; } +static unsigned int +camellia_encrypt_blk1_64 (const void *priv, byte *outbuf, const byte *inbuf, + unsigned int num_blks) +{ + const CAMELLIA_context *ctx = priv; + unsigned int stack_burn_size = 0; + unsigned int nburn; + + gcry_assert (num_blks <= 64); + +#ifdef USE_GFNI_AVX512 + if (num_blks == 64 && ctx->use_gfni_avx512) + { + _gcry_camellia_gfni_avx512_enc_blk64 (ctx, outbuf, inbuf); + return avx512_burn_stack_depth; + } +#endif + + do + { + unsigned int curr_blks = num_blks > 32 ? 32 : num_blks; + nburn = camellia_encrypt_blk1_32 (ctx, outbuf, inbuf, curr_blks); + stack_burn_size = nburn > stack_burn_size ? nburn : stack_burn_size; + outbuf += curr_blks * 16; + inbuf += curr_blks * 16; + num_blks -= curr_blks; + } + while (num_blks > 0); + + return stack_burn_size; +} static unsigned int camellia_decrypt_blk1_32 (const void *priv, byte *outbuf, const byte *inbuf, @@ -641,6 +730,38 @@ camellia_decrypt_blk1_32 (const void *priv, byte *outbuf, const byte *inbuf, return stack_burn_size; } +static unsigned int +camellia_decrypt_blk1_64 (const void *priv, byte *outbuf, const byte *inbuf, + unsigned int num_blks) +{ + const CAMELLIA_context *ctx = priv; + unsigned int stack_burn_size = 0; + unsigned int nburn; + + gcry_assert (num_blks <= 64); + +#ifdef USE_GFNI_AVX512 + if (num_blks == 64 && ctx->use_gfni_avx512) + { + _gcry_camellia_gfni_avx512_dec_blk64 (ctx, outbuf, inbuf); + return avx512_burn_stack_depth; + } +#endif + + do + { + unsigned int curr_blks = num_blks > 32 ? 32 : num_blks; + nburn = camellia_decrypt_blk1_32 (ctx, outbuf, inbuf, curr_blks); + stack_burn_size = nburn > stack_burn_size ? nburn : stack_burn_size; + outbuf += curr_blks * 16; + inbuf += curr_blks * 16; + num_blks -= curr_blks; + } + while (num_blks > 0); + + return stack_burn_size; +} + /* Bulk encryption of complete blocks in CTR mode. This function is only intended for the bulk encryption feature of cipher.c. CTR is expected to be @@ -655,6 +776,31 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + int did_use_gfni_avx512 = 0; + + /* Process data in 64 block chunks. 
*/ + while (nblocks >= 64) + { + _gcry_camellia_gfni_avx512_ctr_enc (ctx, outbuf, inbuf, ctr); + nblocks -= 64; + outbuf += 64 * CAMELLIA_BLOCK_SIZE; + inbuf += 64 * CAMELLIA_BLOCK_SIZE; + did_use_gfni_avx512 = 1; + } + + if (did_use_gfni_avx512) + { + if (burn_stack_depth < avx512_burn_stack_depth) + burn_stack_depth = avx512_burn_stack_depth; + } + + /* Use generic code to handle smaller chunks... */ + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -688,7 +834,6 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, } /* Use generic code to handle smaller chunks... */ - /* TODO: use caching instead? */ } #endif @@ -715,7 +860,6 @@ _gcry_camellia_ctr_enc(void *context, unsigned char *ctr, } /* Use generic code to handle smaller chunks... */ - /* TODO: use caching instead? */ } #endif @@ -750,6 +894,31 @@ _gcry_camellia_cbc_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + int did_use_gfni_avx512 = 0; + + /* Process data in 64 block chunks. */ + while (nblocks >= 64) + { + _gcry_camellia_gfni_avx512_cbc_dec (ctx, outbuf, inbuf, iv); + nblocks -= 64; + outbuf += 64 * CAMELLIA_BLOCK_SIZE; + inbuf += 64 * CAMELLIA_BLOCK_SIZE; + did_use_gfni_avx512 = 1; + } + + if (did_use_gfni_avx512) + { + if (burn_stack_depth < avx512_burn_stack_depth) + burn_stack_depth = avx512_burn_stack_depth; + } + + /* Use generic code to handle smaller chunks... */ + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -843,6 +1012,31 @@ _gcry_camellia_cfb_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + int did_use_gfni_avx512 = 0; + + /* Process data in 64 block chunks. */ + while (nblocks >= 64) + { + _gcry_camellia_gfni_avx512_cfb_dec (ctx, outbuf, inbuf, iv); + nblocks -= 64; + outbuf += 64 * CAMELLIA_BLOCK_SIZE; + inbuf += 64 * CAMELLIA_BLOCK_SIZE; + did_use_gfni_avx512 = 1; + } + + if (did_use_gfni_avx512) + { + if (burn_stack_depth < avx512_burn_stack_depth) + burn_stack_depth = avx512_burn_stack_depth; + } + + /* Use generic code to handle smaller chunks... */ + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -938,12 +1132,12 @@ _gcry_camellia_xts_crypt (void *context, unsigned char *tweak, /* Process remaining blocks. */ if (nblocks) { - byte tmpbuf[CAMELLIA_BLOCK_SIZE * 32]; + byte tmpbuf[CAMELLIA_BLOCK_SIZE * 64]; unsigned int tmp_used = CAMELLIA_BLOCK_SIZE; size_t nburn; - nburn = bulk_xts_crypt_128(ctx, encrypt ? camellia_encrypt_blk1_32 - : camellia_decrypt_blk1_32, + nburn = bulk_xts_crypt_128(ctx, encrypt ? camellia_encrypt_blk1_64 + : camellia_decrypt_blk1_64, outbuf, inbuf, nblocks, tweak, tmpbuf, sizeof(tmpbuf) / CAMELLIA_BLOCK_SIZE, &tmp_used); @@ -975,6 +1169,45 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, (void)encrypt; #endif +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + int did_use_gfni_avx512 = 0; + u64 Ls[64]; + u64 *l; + + if (nblocks >= 64) + { + typeof (&_gcry_camellia_gfni_avx512_ocb_dec) bulk_ocb_fn = + encrypt ? _gcry_camellia_gfni_avx512_ocb_enc + : _gcry_camellia_gfni_avx512_ocb_dec; + l = bulk_ocb_prepare_L_pointers_array_blk64 (c, Ls, blkn); + + /* Process data in 64 block chunks. 
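Note: OCB chains its offsets as Offset_i = Offset_{i-1} xor L_{ntz(i)} (RFC 7253), where ntz(i) counts the trailing zero bits of the 1-based block index. Within any 64-block window the ntz pattern of the low six bits repeats, so bulk_ocb_prepare_L_pointers_array_blk64() fills a fixed 64-entry pointer table once and the loop below only patches the single slot whose index is a multiple of 64 via ocb_get_l(). A rough scalar model, for illustration only:

  /* Illustrative scalar form of the OCB offset chain; i must be non-zero. */
  static unsigned int
  ntz (unsigned long long i)
  {
    unsigned int n = 0;
    while ((i & 1) == 0)
      {
        i >>= 1;
        n++;
      }
    return n;
  }

  /* before ciphering block number blkn (1-based): offset ^= L[ntz (blkn)]; */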
*/ + while (nblocks >= 64) + { + blkn += 64; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 64); + + bulk_ocb_fn (ctx, outbuf, inbuf, c->u_iv.iv, c->u_ctr.ctr, Ls); + + nblocks -= 64; + outbuf += 64 * CAMELLIA_BLOCK_SIZE; + inbuf += 64 * CAMELLIA_BLOCK_SIZE; + did_use_gfni_avx512 = 1; + } + } + + if (did_use_gfni_avx512) + { + if (burn_stack_depth < avx512_burn_stack_depth) + burn_stack_depth = avx512_burn_stack_depth; + } + + /* Use generic code to handle smaller chunks... */ + } +#endif + #ifdef USE_AESNI_AVX2 if (ctx->use_aesni_avx2) { @@ -1226,7 +1459,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, static const char* selftest_ctr_128 (void) { - const int nblocks = 32+16+1; + const int nblocks = 64+32+16+1; const int blocksize = CAMELLIA_BLOCK_SIZE; const int context_size = sizeof(CAMELLIA_context); @@ -1239,7 +1472,7 @@ selftest_ctr_128 (void) static const char* selftest_cbc_128 (void) { - const int nblocks = 32+16+2; + const int nblocks = 64+32+16+2; const int blocksize = CAMELLIA_BLOCK_SIZE; const int context_size = sizeof(CAMELLIA_context); @@ -1252,7 +1485,7 @@ selftest_cbc_128 (void) static const char* selftest_cfb_128 (void) { - const int nblocks = 32+16+2; + const int nblocks = 64+32+16+2; const int blocksize = CAMELLIA_BLOCK_SIZE; const int context_size = sizeof(CAMELLIA_context); diff --git a/cipher/chacha20-amd64-avx512.S b/cipher/chacha20-amd64-avx512.S index da24286e..8b4d7499 100644 --- a/cipher/chacha20-amd64-avx512.S +++ b/cipher/chacha20-amd64-avx512.S @@ -287,7 +287,7 @@ _gcry_chacha20_amd64_avx512_blocks16: /* clear the used vector registers */ clear_zmm16_zmm31(); - kmovd %eax, %k2; + kxord %k2, %k2, %k2; vzeroall; /* clears ZMM0-ZMM15 */ /* eax zeroed by round loop. */ diff --git a/cipher/poly1305-amd64-avx512.S b/cipher/poly1305-amd64-avx512.S index 48892777..72303e1e 100644 --- a/cipher/poly1305-amd64-avx512.S +++ b/cipher/poly1305-amd64-avx512.S @@ -1614,8 +1614,8 @@ _gcry_poly1305_amd64_avx512_blocks: FUNC_EXIT() xor eax, eax - kmovw k1, eax - kmovw k2, eax + kxorw k1, k1, k1 + kxorw k2, k2, k2 ret_spec_stop CFI_ENDPROC() ELF(.size _gcry_poly1305_amd64_avx512_blocks, diff --git a/cipher/sha512-avx512-amd64.S b/cipher/sha512-avx512-amd64.S index c0fdbc33..0e3f44ab 100644 --- a/cipher/sha512-avx512-amd64.S +++ b/cipher/sha512-avx512-amd64.S @@ -375,7 +375,7 @@ _gcry_sha512_transform_amd64_avx512: addm([8*5 + CTX],f) addm([8*6 + CTX],g) addm([8*7 + CTX],h) - kmovd MASK_DC_00, eax + kxord MASK_DC_00, MASK_DC_00, MASK_DC_00 vzeroall vmovdqa [rsp + frame_XFER + 0*32], ymm0 /* burn stack */ diff --git a/configure.ac b/configure.ac index e63a7d6d..a7482cf3 100644 --- a/configure.ac +++ b/configure.ac @@ -2758,6 +2758,9 @@ if test "$found" = "1" ; then # Build with the GFNI/AVX2 implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS camellia-gfni-avx2-amd64.lo" + + # Build with the GFNI/AVX512 implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS camellia-gfni-avx512-amd64.lo" fi fi fi -- 2.34.1 From jussi.kivilinna at iki.fi Mon May 9 19:33:36 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Mon, 9 May 2022 20:33:36 +0300 Subject: [PATCH 2/2] cipher: move CBC/CFB/CTR self-tests to tests/basic Message-ID: <20220509173336.1965278-1-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Remove 'cipher-selftest.c' and 'cipher-selftest.h'. * cipher/cipher-selftest.c: Remove (refactor these tests to tests/basic.c). * cipher/cipher-selftest.h: Remove. * cipher/blowfish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/camellia-glue.c (selftest_ctr_128, selftest_cbc_128) (selftest_cfb_128): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/cast5.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/des.c (bulk_selftest_setkey, selftest_ctr, selftest_cbc) (selftest_cfb): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/rijndael.c (selftest_basic_128, selftest_basic_192) (selftest_basic_256): Allocate context from stack instead of heap and handle alignment manually. (selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/serpent.c (selftest_ctr_128, selftest_cbc_128) (selftest_cfb_128): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/sm4.c (selftest_ctr_128, selftest_cbc_128) (selftest_cfb_128): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * cipher/twofish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove. (selftest): Remove CTR/CBC/CFB bulk self-tests. * tests/basic.c (buf_xor, cipher_cbc_bulk_test, buf_xor_2dst) (cipher_cfb_bulk_test, cipher_ctr_bulk_test): New. (check_ciphers): Run cipher_cbc_bulk_test(), cipher_cfb_bulk_test() and cipher_ctr_bulk_test() for block ciphers. --- CBC/CFB/CTR bulk self-tests are quite computationally heavy and slow down use cases where application opens cipher context once, does processing and exits. Better place for these tests is in `tests/basic`. Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 1 - cipher/blowfish.c | 53 --- cipher/camellia-glue.c | 49 --- cipher/cast5.c | 53 --- cipher/cipher-selftest.c | 512 ---------------------- cipher/cipher-selftest.h | 69 --- cipher/des.c | 72 ---- cipher/rijndael-aesni.c | 1 - cipher/rijndael-armv8-ce.c | 1 - cipher/rijndael-padlock.c | 1 - cipher/rijndael-ssse3-amd64.c | 1 - cipher/rijndael-vaes.c | 1 - cipher/rijndael.c | 92 +--- cipher/serpent.c | 53 --- cipher/sm4.c | 50 --- cipher/twofish.c | 49 --- tests/basic.c | 772 ++++++++++++++++++++++++++++++++++ 17 files changed, 780 insertions(+), 1050 deletions(-) delete mode 100644 cipher/cipher-selftest.c delete mode 100644 cipher/cipher-selftest.h diff --git a/cipher/Makefile.am b/cipher/Makefile.am index a6171bf5..250b229e 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -55,7 +55,6 @@ libcipher_la_SOURCES = \ cipher-eax.c \ cipher-siv.c \ cipher-gcm-siv.c \ - cipher-selftest.c cipher-selftest.h \ pubkey.c pubkey-internal.h pubkey-util.c \ md.c \ mac.c mac-internal.h \ diff --git a/cipher/blowfish.c b/cipher/blowfish.c index 7b001306..1b11d718 100644 --- a/cipher/blowfish.c +++ b/cipher/blowfish.c @@ -38,7 +38,6 @@ #include "cipher.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #define BLOWFISH_BLOCKSIZE 8 #define BLOWFISH_KEY_MIN_BITS 8 @@ -856,48 +855,6 @@ _gcry_blowfish_cfb_dec(void *context, unsigned char *iv, void *outbuf_arg, } -/* Run the self-tests for BLOWFISH-CTR, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char * -selftest_ctr (void) -{ - const int nblocks = 4+1; - const int blocksize = BLOWFISH_BLOCKSIZE; - const int context_size = sizeof(BLOWFISH_context); - - return _gcry_selftest_helper_ctr("BLOWFISH", &bf_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for BLOWFISH-CBC, tests bulk CBC decryption. - Returns NULL on success. 
*/ -static const char * -selftest_cbc (void) -{ - const int nblocks = 4+2; - const int blocksize = BLOWFISH_BLOCKSIZE; - const int context_size = sizeof(BLOWFISH_context); - - return _gcry_selftest_helper_cbc("BLOWFISH", &bf_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for BLOWFISH-CFB, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cfb (void) -{ - const int nblocks = 4+2; - const int blocksize = BLOWFISH_BLOCKSIZE; - const int context_size = sizeof(BLOWFISH_context); - - return _gcry_selftest_helper_cfb("BLOWFISH", &bf_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - static const char* selftest(void) { @@ -911,7 +868,6 @@ selftest(void) { 0x41, 0x79, 0x6E, 0xA0, 0x52, 0x61, 0x6E, 0xE4 }; static const byte cipher3[] = { 0xE1, 0x13, 0xF4, 0x10, 0x2C, 0xFC, 0xCE, 0x43 }; - const char *r; bf_setkey( (void *) &c, (const unsigned char*)"abcdefghijklmnopqrstuvwxyz", 26, @@ -931,15 +887,6 @@ selftest(void) if( memcmp( buffer, plain3, 8 ) ) return "Blowfish selftest failed (4)."; - if ( (r = selftest_cbc ()) ) - return r; - - if ( (r = selftest_cfb ()) ) - return r; - - if ( (r = selftest_ctr ()) ) - return r; - return NULL; } diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index a854b82d..c938be71 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -64,7 +64,6 @@ #include "camellia.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #include "bulkhelp.h" /* Helper macro to force alignment to 16 bytes. */ @@ -1454,44 +1453,6 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, return nblocks; } -/* Run the self-tests for CAMELLIA-CTR-128, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char* -selftest_ctr_128 (void) -{ - const int nblocks = 64+32+16+1; - const int blocksize = CAMELLIA_BLOCK_SIZE; - const int context_size = sizeof(CAMELLIA_context); - - return _gcry_selftest_helper_ctr("CAMELLIA", &camellia_setkey, - &camellia_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for CAMELLIA-CBC-128, tests bulk CBC decryption. - Returns NULL on success. */ -static const char* -selftest_cbc_128 (void) -{ - const int nblocks = 64+32+16+2; - const int blocksize = CAMELLIA_BLOCK_SIZE; - const int context_size = sizeof(CAMELLIA_context); - - return _gcry_selftest_helper_cbc("CAMELLIA", &camellia_setkey, - &camellia_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for CAMELLIA-CFB-128, tests bulk CFB decryption. - Returns NULL on success. 
*/ -static const char* -selftest_cfb_128 (void) -{ - const int nblocks = 64+32+16+2; - const int blocksize = CAMELLIA_BLOCK_SIZE; - const int context_size = sizeof(CAMELLIA_context); - - return _gcry_selftest_helper_cfb("CAMELLIA", &camellia_setkey, - &camellia_encrypt, nblocks, blocksize, context_size); -} static const char * selftest(void) @@ -1499,7 +1460,6 @@ selftest(void) CAMELLIA_context ctx; byte scratch[16]; cipher_bulk_ops_t bulk_ops; - const char *r; /* These test vectors are from RFC-3713 */ static const byte plaintext[]= @@ -1563,15 +1523,6 @@ selftest(void) if(memcmp(scratch,plaintext,sizeof(plaintext))!=0) return "CAMELLIA-256 test decryption failed."; - if ( (r = selftest_ctr_128 ()) ) - return r; - - if ( (r = selftest_cbc_128 ()) ) - return r; - - if ( (r = selftest_cfb_128 ()) ) - return r; - return NULL; } diff --git a/cipher/cast5.c b/cipher/cast5.c index 837ea0fe..20bf7479 100644 --- a/cipher/cast5.c +++ b/cipher/cast5.c @@ -45,7 +45,6 @@ #include "bithelp.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" /* USE_AMD64_ASM indicates whether to use AMD64 assembly code. */ #undef USE_AMD64_ASM @@ -991,48 +990,6 @@ _gcry_cast5_cfb_dec(void *context, unsigned char *iv, void *outbuf_arg, } -/* Run the self-tests for CAST5-CTR, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char * -selftest_ctr (void) -{ - const int nblocks = 4+1; - const int blocksize = CAST5_BLOCKSIZE; - const int context_size = sizeof(CAST5_context); - - return _gcry_selftest_helper_ctr("CAST5", &cast_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for CAST5-CBC, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cbc (void) -{ - const int nblocks = 4+2; - const int blocksize = CAST5_BLOCKSIZE; - const int context_size = sizeof(CAST5_context); - - return _gcry_selftest_helper_cbc("CAST5", &cast_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for CAST5-CFB, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cfb (void) -{ - const int nblocks = 4+2; - const int blocksize = CAST5_BLOCKSIZE; - const int context_size = sizeof(CAST5_context); - - return _gcry_selftest_helper_cfb("CAST5", &cast_setkey, - &encrypt_block, nblocks, blocksize, context_size); -} - - static const char* selftest(void) { @@ -1046,7 +1003,6 @@ selftest(void) static const byte cipher[8] = { 0x23, 0x8B, 0x4F, 0xE5, 0x84, 0x7E, 0x44, 0xB2 }; byte buffer[8]; - const char *r; cast_setkey( &c, key, 16, &bulk_ops ); encrypt_block( &c, buffer, plain ); @@ -1082,15 +1038,6 @@ selftest(void) } #endif - if ( (r = selftest_cbc ()) ) - return r; - - if ( (r = selftest_cfb ()) ) - return r; - - if ( (r = selftest_ctr ()) ) - return r; - return NULL; } diff --git a/cipher/cipher-selftest.c b/cipher/cipher-selftest.c deleted file mode 100644 index d7f38a42..00000000 --- a/cipher/cipher-selftest.c +++ /dev/null @@ -1,512 +0,0 @@ -/* cipher-selftest.c - Helper functions for bulk encryption selftests. - * Copyright (C) 2013,2020 Jussi Kivilinna - * - * This file is part of Libgcrypt. - * - * Libgcrypt is free software; you can redistribute it and/or modify - * it under the terms of the GNU Lesser general Public License as - * published by the Free Software Foundation; either version 2.1 of - * the License, or (at your option) any later version. 
- * - * Libgcrypt is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU Lesser General Public License for more details. - * - * You should have received a copy of the GNU Lesser General Public - * License along with this program; if not, see . - */ - -#include -#ifdef HAVE_SYSLOG -# include -#endif /*HAVE_SYSLOG*/ - -#include "types.h" -#include "g10lib.h" -#include "cipher.h" -#include "bufhelp.h" -#include "cipher-selftest.h" -#include "cipher-internal.h" - -#ifdef HAVE_STDINT_H -# include /* uintptr_t */ -#elif defined(HAVE_INTTYPES_H) -# include -#else -/* In this case, uintptr_t is provided by config.h. */ -#endif - -/* Helper macro to force alignment to 16 bytes. */ -#ifdef HAVE_GCC_ATTRIBUTE_ALIGNED -# define ATTR_ALIGNED_16 __attribute__ ((aligned (16))) -#else -# define ATTR_ALIGNED_16 -#endif - - -/* Return an allocated buffers of size CONTEXT_SIZE with an alignment - of 16. The caller must free that buffer using the address returned - at R_MEM. Returns NULL and sets ERRNO on failure. */ -void * -_gcry_cipher_selftest_alloc_ctx (const int context_size, unsigned char **r_mem) -{ - int offs; - unsigned int ctx_aligned_size, memsize; - - ctx_aligned_size = context_size + 15; - ctx_aligned_size -= ctx_aligned_size & 0xf; - - memsize = ctx_aligned_size + 16; - - *r_mem = xtrycalloc (1, memsize); - if (!*r_mem) - return NULL; - - offs = (16 - ((uintptr_t)*r_mem & 15)) & 15; - return (void*)(*r_mem + offs); -} - - -/* Run the self-tests for -CBC-, tests bulk CBC - decryption. Returns NULL on success. */ -const char * -_gcry_selftest_helper_cbc (const char *cipher, gcry_cipher_setkey_t setkey_func, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size) -{ - cipher_bulk_ops_t bulk_ops = { 0, }; - int i, offs; - unsigned char *ctx, *plaintext, *plaintext2, *ciphertext, *iv, *iv2, *mem; - unsigned int ctx_aligned_size, memsize; - - static const unsigned char key[16] ATTR_ALIGNED_16 = { - 0x66,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, - 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x22 - }; - - /* Allocate buffers, align first two elements to 16 bytes and latter to - block size. */ - ctx_aligned_size = context_size + 15; - ctx_aligned_size -= ctx_aligned_size & 0xf; - - memsize = ctx_aligned_size + (blocksize * 2) + (blocksize * nblocks * 3) + 16; - - mem = xtrycalloc (1, memsize); - if (!mem) - return "failed to allocate memory"; - - offs = (16 - ((uintptr_t)mem & 15)) & 15; - ctx = (void*)(mem + offs); - iv = ctx + ctx_aligned_size; - iv2 = iv + blocksize; - plaintext = iv2 + blocksize; - plaintext2 = plaintext + nblocks * blocksize; - ciphertext = plaintext2 + nblocks * blocksize; - - /* Initialize ctx */ - if (setkey_func (ctx, key, sizeof(key), &bulk_ops) != GPG_ERR_NO_ERROR) - { - xfree(mem); - return "setkey failed"; - } - - /* Test single block code path */ - memset (iv, 0x4e, blocksize); - memset (iv2, 0x4e, blocksize); - for (i = 0; i < blocksize; i++) - plaintext[i] = i; - - /* CBC manually. */ - buf_xor (ciphertext, iv, plaintext, blocksize); - encrypt_one (ctx, ciphertext, ciphertext); - memcpy (iv, ciphertext, blocksize); - - /* CBC decrypt. 
*/ - bulk_ops.cbc_dec (ctx, iv2, plaintext2, ciphertext, 1); - if (memcmp (plaintext2, plaintext, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CBC-%d test failed (plaintext mismatch)", cipher, - blocksize * 8); -#else - (void)cipher; /* Not used. */ -#endif - return "selftest for CBC failed - see syslog for details"; - } - - if (memcmp (iv2, iv, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CBC-%d test failed (IV mismatch)", cipher, blocksize * 8); -#endif - return "selftest for CBC failed - see syslog for details"; - } - - /* Test parallelized code paths */ - memset (iv, 0x5f, blocksize); - memset (iv2, 0x5f, blocksize); - - for (i = 0; i < nblocks * blocksize; i++) - plaintext[i] = i; - - /* Create CBC ciphertext manually. */ - for (i = 0; i < nblocks * blocksize; i+=blocksize) - { - buf_xor (&ciphertext[i], iv, &plaintext[i], blocksize); - encrypt_one (ctx, &ciphertext[i], &ciphertext[i]); - memcpy (iv, &ciphertext[i], blocksize); - } - - /* Decrypt using bulk CBC and compare result. */ - bulk_ops.cbc_dec (ctx, iv2, plaintext2, ciphertext, nblocks); - - if (memcmp (plaintext2, plaintext, nblocks * blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CBC-%d test failed (plaintext mismatch, parallel path)", - cipher, blocksize * 8); -#endif - return "selftest for CBC failed - see syslog for details"; - } - if (memcmp (iv2, iv, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CBC-%d test failed (IV mismatch, parallel path)", - cipher, blocksize * 8); -#endif - return "selftest for CBC failed - see syslog for details"; - } - - xfree (mem); - return NULL; -} - -/* Run the self-tests for -CFB-, tests bulk CFB - decryption. Returns NULL on success. */ -const char * -_gcry_selftest_helper_cfb (const char *cipher, gcry_cipher_setkey_t setkey_func, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size) -{ - cipher_bulk_ops_t bulk_ops = { 0, }; - int i, offs; - unsigned char *ctx, *plaintext, *plaintext2, *ciphertext, *iv, *iv2, *mem; - unsigned int ctx_aligned_size, memsize; - - static const unsigned char key[16] ATTR_ALIGNED_16 = { - 0x11,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, - 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x33 - }; - - /* Allocate buffers, align first two elements to 16 bytes and latter to - block size. */ - ctx_aligned_size = context_size + 15; - ctx_aligned_size -= ctx_aligned_size & 0xf; - - memsize = ctx_aligned_size + (blocksize * 2) + (blocksize * nblocks * 3) + 16; - - mem = xtrycalloc (1, memsize); - if (!mem) - return "failed to allocate memory"; - - offs = (16 - ((uintptr_t)mem & 15)) & 15; - ctx = (void*)(mem + offs); - iv = ctx + ctx_aligned_size; - iv2 = iv + blocksize; - plaintext = iv2 + blocksize; - plaintext2 = plaintext + nblocks * blocksize; - ciphertext = plaintext2 + nblocks * blocksize; - - /* Initialize ctx */ - if (setkey_func (ctx, key, sizeof(key), &bulk_ops) != GPG_ERR_NO_ERROR) - { - xfree(mem); - return "setkey failed"; - } - - /* Test single block code path */ - memset(iv, 0xd3, blocksize); - memset(iv2, 0xd3, blocksize); - for (i = 0; i < blocksize; i++) - plaintext[i] = i; - - /* CFB manually. */ - encrypt_one (ctx, ciphertext, iv); - buf_xor_2dst (iv, ciphertext, plaintext, blocksize); - - /* CFB decrypt. 
*/ - bulk_ops.cfb_dec (ctx, iv2, plaintext2, ciphertext, 1); - if (memcmp(plaintext2, plaintext, blocksize)) - { - xfree(mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CFB-%d test failed (plaintext mismatch)", cipher, - blocksize * 8); -#else - (void)cipher; /* Not used. */ -#endif - return "selftest for CFB failed - see syslog for details"; - } - - if (memcmp(iv2, iv, blocksize)) - { - xfree(mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CFB-%d test failed (IV mismatch)", cipher, blocksize * 8); -#endif - return "selftest for CFB failed - see syslog for details"; - } - - /* Test parallelized code paths */ - memset(iv, 0xe6, blocksize); - memset(iv2, 0xe6, blocksize); - - for (i = 0; i < nblocks * blocksize; i++) - plaintext[i] = i; - - /* Create CFB ciphertext manually. */ - for (i = 0; i < nblocks * blocksize; i+=blocksize) - { - encrypt_one (ctx, &ciphertext[i], iv); - buf_xor_2dst (iv, &ciphertext[i], &plaintext[i], blocksize); - } - - /* Decrypt using bulk CBC and compare result. */ - bulk_ops.cfb_dec (ctx, iv2, plaintext2, ciphertext, nblocks); - - if (memcmp(plaintext2, plaintext, nblocks * blocksize)) - { - xfree(mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CFB-%d test failed (plaintext mismatch, parallel path)", - cipher, blocksize * 8); -#endif - return "selftest for CFB failed - see syslog for details"; - } - if (memcmp(iv2, iv, blocksize)) - { - xfree(mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CFB-%d test failed (IV mismatch, parallel path)", cipher, - blocksize * 8); -#endif - return "selftest for CFB failed - see syslog for details"; - } - - xfree(mem); - return NULL; -} - -/* Run the self-tests for -CTR-, tests IV increment - of bulk CTR encryption. Returns NULL on success. */ -const char * -_gcry_selftest_helper_ctr (const char *cipher, gcry_cipher_setkey_t setkey_func, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size) -{ - cipher_bulk_ops_t bulk_ops = { 0, }; - int i, j, offs, diff; - unsigned char *ctx, *plaintext, *plaintext2, *ciphertext, *ciphertext2, - *iv, *iv2, *mem; - unsigned int ctx_aligned_size, memsize; - - static const unsigned char key[16] ATTR_ALIGNED_16 = { - 0x06,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, - 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x21 - }; - - /* Allocate buffers, align first two elements to 16 bytes and latter to - block size. */ - ctx_aligned_size = context_size + 15; - ctx_aligned_size -= ctx_aligned_size & 0xf; - - memsize = ctx_aligned_size + (blocksize * 2) + (blocksize * nblocks * 4) + 16; - - mem = xtrycalloc (1, memsize); - if (!mem) - return "failed to allocate memory"; - - offs = (16 - ((uintptr_t)mem & 15)) & 15; - ctx = (void*)(mem + offs); - iv = ctx + ctx_aligned_size; - iv2 = iv + blocksize; - plaintext = iv2 + blocksize; - plaintext2 = plaintext + nblocks * blocksize; - ciphertext = plaintext2 + nblocks * blocksize; - ciphertext2 = ciphertext + nblocks * blocksize; - - /* Initialize ctx */ - if (setkey_func (ctx, key, sizeof(key), &bulk_ops) != GPG_ERR_NO_ERROR) - { - xfree(mem); - return "setkey failed"; - } - - /* Test single block code path */ - memset (iv, 0xff, blocksize); - for (i = 0; i < blocksize; i++) - plaintext[i] = i; - - /* CTR manually. 
*/ - encrypt_one (ctx, ciphertext, iv); - for (i = 0; i < blocksize; i++) - ciphertext[i] ^= plaintext[i]; - for (i = blocksize; i > 0; i--) - { - iv[i-1]++; - if (iv[i-1]) - break; - } - - memset (iv2, 0xff, blocksize); - bulk_ops.ctr_enc (ctx, iv2, plaintext2, ciphertext, 1); - - if (memcmp (plaintext2, plaintext, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (plaintext mismatch)", cipher, - blocksize * 8); -#else - (void)cipher; /* Not used. */ -#endif - return "selftest for CTR failed - see syslog for details"; - } - - if (memcmp (iv2, iv, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (IV mismatch)", cipher, - blocksize * 8); -#endif - return "selftest for CTR failed - see syslog for details"; - } - - /* Test bulk encryption with typical IV. */ - memset(iv, 0x57, blocksize-4); - iv[blocksize-1] = 1; - iv[blocksize-2] = 0; - iv[blocksize-3] = 0; - iv[blocksize-4] = 0; - memset(iv2, 0x57, blocksize-4); - iv2[blocksize-1] = 1; - iv2[blocksize-2] = 0; - iv2[blocksize-3] = 0; - iv2[blocksize-4] = 0; - - for (i = 0; i < blocksize * nblocks; i++) - plaintext2[i] = plaintext[i] = i; - - /* Create CTR ciphertext manually. */ - for (i = 0; i < blocksize * nblocks; i+=blocksize) - { - encrypt_one (ctx, &ciphertext[i], iv); - for (j = 0; j < blocksize; j++) - ciphertext[i+j] ^= plaintext[i+j]; - for (j = blocksize; j > 0; j--) - { - iv[j-1]++; - if (iv[j-1]) - break; - } - } - - bulk_ops.ctr_enc (ctx, iv2, ciphertext2, plaintext2, nblocks); - - if (memcmp (ciphertext2, ciphertext, blocksize * nblocks)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (ciphertext mismatch, bulk)", cipher, - blocksize * 8); -#endif - return "selftest for CTR failed - see syslog for details"; - } - if (memcmp(iv2, iv, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (IV mismatch, bulk)", cipher, - blocksize * 8); -#endif - return "selftest for CTR failed - see syslog for details"; - } - - /* Test parallelized code paths (check counter overflow handling) */ - for (diff = 0; diff < nblocks; diff++) { - memset(iv, 0xff, blocksize); - iv[blocksize-1] -= diff; - iv[0] = iv[1] = 0; - iv[2] = 0x07; - - for (i = 0; i < blocksize * nblocks; i++) - plaintext[i] = i; - - /* Create CTR ciphertext manually. */ - for (i = 0; i < blocksize * nblocks; i+=blocksize) - { - encrypt_one (ctx, &ciphertext[i], iv); - for (j = 0; j < blocksize; j++) - ciphertext[i+j] ^= plaintext[i+j]; - for (j = blocksize; j > 0; j--) - { - iv[j-1]++; - if (iv[j-1]) - break; - } - } - - /* Decrypt using bulk CTR and compare result. 
*/ - memset(iv2, 0xff, blocksize); - iv2[blocksize-1] -= diff; - iv2[0] = iv2[1] = 0; - iv2[2] = 0x07; - - bulk_ops.ctr_enc (ctx, iv2, plaintext2, ciphertext, nblocks); - - if (memcmp (plaintext2, plaintext, blocksize * nblocks)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (plaintext mismatch, diff: %d)", cipher, - blocksize * 8, diff); -#endif - return "selftest for CTR failed - see syslog for details"; - } - if (memcmp(iv2, iv, blocksize)) - { - xfree (mem); -#ifdef HAVE_SYSLOG - syslog (LOG_USER|LOG_WARNING, "Libgcrypt warning: " - "%s-CTR-%d test failed (IV mismatch, diff: %d)", cipher, - blocksize * 8, diff); -#endif - return "selftest for CTR failed - see syslog for details"; - } - } - - xfree (mem); - return NULL; -} diff --git a/cipher/cipher-selftest.h b/cipher/cipher-selftest.h deleted file mode 100644 index c3090ad1..00000000 --- a/cipher/cipher-selftest.h +++ /dev/null @@ -1,69 +0,0 @@ -/* cipher-selftest.h - Helper functions for bulk encryption selftests. - * Copyright (C) 2013,2020 Jussi Kivilinna - * - * This file is part of Libgcrypt. - * - * Libgcrypt is free software; you can redistribute it and/or modify - * it under the terms of the GNU Lesser general Public License as - * published by the Free Software Foundation; either version 2.1 of - * the License, or (at your option) any later version. - * - * Libgcrypt is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU Lesser General Public License for more details. - * - * You should have received a copy of the GNU Lesser General Public - * License along with this program; if not, see . - */ - -#ifndef G10_SELFTEST_HELP_H -#define G10_SELFTEST_HELP_H - -#include -#include "types.h" -#include "g10lib.h" -#include "cipher.h" - -typedef void (*gcry_cipher_bulk_cbc_dec_t)(void *context, unsigned char *iv, - void *outbuf_arg, - const void *inbuf_arg, - size_t nblocks); - -typedef void (*gcry_cipher_bulk_cfb_dec_t)(void *context, unsigned char *iv, - void *outbuf_arg, - const void *inbuf_arg, - size_t nblocks); - -typedef void (*gcry_cipher_bulk_ctr_enc_t)(void *context, unsigned char *iv, - void *outbuf_arg, - const void *inbuf_arg, - size_t nblocks); - -/* Helper function to allocate an aligned context for selftests. 
*/ -void *_gcry_cipher_selftest_alloc_ctx (const int context_size, - unsigned char **r_mem); - - -/* Helper function for bulk CBC decryption selftest */ -const char * -_gcry_selftest_helper_cbc (const char *cipher, gcry_cipher_setkey_t setkey, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size); - -/* Helper function for bulk CFB decryption selftest */ -const char * -_gcry_selftest_helper_cfb (const char *cipher, gcry_cipher_setkey_t setkey, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size); - -/* Helper function for bulk CTR encryption selftest */ -const char * -_gcry_selftest_helper_ctr (const char *cipher, gcry_cipher_setkey_t setkey, - gcry_cipher_encrypt_t encrypt_one, - const int nblocks, const int blocksize, - const int context_size); - -#endif /*G10_SELFTEST_HELP_H*/ diff --git a/cipher/des.c b/cipher/des.c index 51116fcf..7a81697a 100644 --- a/cipher/des.c +++ b/cipher/des.c @@ -120,7 +120,6 @@ #include "cipher.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #define DES_BLOCKSIZE 8 @@ -1047,66 +1046,6 @@ is_weak_key ( const byte *key ) } -/* Alternative setkey for selftests; need larger key than default. */ -static gcry_err_code_t -bulk_selftest_setkey (void *context, const byte *__key, unsigned __keylen, - cipher_bulk_ops_t *bulk_ops) -{ - static const unsigned char key[24] ATTR_ALIGNED_16 = { - 0x66,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, - 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x22, - 0x18,0x2A,0x39,0x47,0x5E,0x6F,0x75,0x82 - }; - - (void)__key; - (void)__keylen; - - return do_tripledes_setkey(context, key, sizeof(key), bulk_ops); -} - - -/* Run the self-tests for DES-CTR, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char * -selftest_ctr (void) -{ - const int nblocks = 3+1; - const int blocksize = DES_BLOCKSIZE; - const int context_size = sizeof(struct _tripledes_ctx); - - return _gcry_selftest_helper_ctr("3DES", &bulk_selftest_setkey, - &do_tripledes_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for DES-CBC, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cbc (void) -{ - const int nblocks = 3+2; - const int blocksize = DES_BLOCKSIZE; - const int context_size = sizeof(struct _tripledes_ctx); - - return _gcry_selftest_helper_cbc("3DES", &bulk_selftest_setkey, - &do_tripledes_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for DES-CFB, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cfb (void) -{ - const int nblocks = 3+2; - const int blocksize = DES_BLOCKSIZE; - const int context_size = sizeof(struct _tripledes_ctx); - - return _gcry_selftest_helper_cfb("3DES", &bulk_selftest_setkey, - &do_tripledes_encrypt, nblocks, blocksize, context_size); -} - - /* * Performs a selftest of this DES/Triple-DES implementation. * Returns an string with the error text on failure. @@ -1115,8 +1054,6 @@ selftest_cfb (void) static const char * selftest (void) { - const char *r; - /* * Check if 'u32' is really 32 bits wide. This DES / 3DES implementation * need this. 
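Aside: the 'u32' width check referred to above could equally be done at compile time; a hypothetical alternative, not part of this patch:

  /* Fails to compile when u32 is not exactly 32 bits wide. */
  typedef char u32_must_be_32_bits[sizeof (u32) == 4 ? 1 : -1];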
@@ -1296,15 +1233,6 @@ selftest (void) return "DES weak key detection failed"; } - if ( (r = selftest_cbc ()) ) - return r; - - if ( (r = selftest_cfb ()) ) - return r; - - if ( (r = selftest_ctr ()) ) - return r; - return 0; } diff --git a/cipher/rijndael-aesni.c b/cipher/rijndael-aesni.c index ff6b0b26..156af015 100644 --- a/cipher/rijndael-aesni.c +++ b/cipher/rijndael-aesni.c @@ -27,7 +27,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #include "./cipher-internal.h" diff --git a/cipher/rijndael-armv8-ce.c b/cipher/rijndael-armv8-ce.c index b24ae3e9..e53c940e 100644 --- a/cipher/rijndael-armv8-ce.c +++ b/cipher/rijndael-armv8-ce.c @@ -27,7 +27,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #include "./cipher-internal.h" diff --git a/cipher/rijndael-padlock.c b/cipher/rijndael-padlock.c index 3af214d7..2583b834 100644 --- a/cipher/rijndael-padlock.c +++ b/cipher/rijndael-padlock.c @@ -27,7 +27,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #ifdef USE_PADLOCK diff --git a/cipher/rijndael-ssse3-amd64.c b/cipher/rijndael-ssse3-amd64.c index b0723853..0f0abf62 100644 --- a/cipher/rijndael-ssse3-amd64.c +++ b/cipher/rijndael-ssse3-amd64.c @@ -43,7 +43,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #include "./cipher-internal.h" diff --git a/cipher/rijndael-vaes.c b/cipher/rijndael-vaes.c index 0d7d1367..dbcf9afa 100644 --- a/cipher/rijndael-vaes.c +++ b/cipher/rijndael-vaes.c @@ -26,7 +26,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #include "./cipher-internal.h" diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 9b96b616..dddcbc54 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -46,7 +46,6 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" -#include "cipher-selftest.h" #include "rijndael-internal.h" #include "./cipher-internal.h" @@ -1535,7 +1534,7 @@ static const char* selftest_basic_128 (void) { RIJNDAEL_context *ctx; - unsigned char *ctxmem; + unsigned char ctxmem[sizeof(*ctx) + 16]; unsigned char scratch[16]; cipher_bulk_ops_t bulk_ops; @@ -1579,21 +1578,15 @@ selftest_basic_128 (void) }; #endif - /* Because gcc/ld can only align the CTX struct on 8 bytes on the - stack, we need to allocate that context on the heap. 
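Note: the replacement below over-allocates the context on the stack and aligns it by hand. The expression (16 - (addr & 15)) & 15 yields the 0..15 bytes of padding needed to reach the next 16-byte boundary, and 0 when the address is already aligned. The idiom in isolation (uintptr_t from stdint.h):

  /* Sketch of the manual 16-byte alignment idiom used in these selftests. */
  unsigned char ctxmem[sizeof (RIJNDAEL_context) + 16];
  RIJNDAEL_context *ctx =
    (void *)(ctxmem + ((16 - ((uintptr_t)ctxmem & 15)) & 15));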
*/ - ctx = _gcry_cipher_selftest_alloc_ctx (sizeof *ctx, &ctxmem); - if (!ctx) - return "failed to allocate memory"; + ctx = (void *)(ctxmem + ((16 - ((uintptr_t)ctxmem & 15)) & 15)); rijndael_setkey (ctx, key_128, sizeof (key_128), &bulk_ops); rijndael_encrypt (ctx, scratch, plaintext_128); if (memcmp (scratch, ciphertext_128, sizeof (ciphertext_128))) { - xfree (ctxmem); return "AES-128 test encryption failed."; } rijndael_decrypt (ctx, scratch, scratch); - xfree (ctxmem); if (memcmp (scratch, plaintext_128, sizeof (plaintext_128))) return "AES-128 test decryption failed."; @@ -1605,7 +1598,7 @@ static const char* selftest_basic_192 (void) { RIJNDAEL_context *ctx; - unsigned char *ctxmem; + unsigned char ctxmem[sizeof(*ctx) + 16]; unsigned char scratch[16]; cipher_bulk_ops_t bulk_ops; @@ -1626,18 +1619,15 @@ selftest_basic_192 (void) 0x12,0x13,0x1A,0xC7,0xC5,0x47,0x88,0xAA }; - ctx = _gcry_cipher_selftest_alloc_ctx (sizeof *ctx, &ctxmem); - if (!ctx) - return "failed to allocate memory"; + ctx = (void *)(ctxmem + ((16 - ((uintptr_t)ctxmem & 15)) & 15)); + rijndael_setkey (ctx, key_192, sizeof(key_192), &bulk_ops); rijndael_encrypt (ctx, scratch, plaintext_192); if (memcmp (scratch, ciphertext_192, sizeof (ciphertext_192))) { - xfree (ctxmem); return "AES-192 test encryption failed."; } rijndael_decrypt (ctx, scratch, scratch); - xfree (ctxmem); if (memcmp (scratch, plaintext_192, sizeof (plaintext_192))) return "AES-192 test decryption failed."; @@ -1650,7 +1640,7 @@ static const char* selftest_basic_256 (void) { RIJNDAEL_context *ctx; - unsigned char *ctxmem; + unsigned char ctxmem[sizeof(*ctx) + 16]; unsigned char scratch[16]; cipher_bulk_ops_t bulk_ops; @@ -1672,18 +1662,15 @@ selftest_basic_256 (void) 0x9A,0xCF,0x72,0x80,0x86,0x04,0x0A,0xE3 }; - ctx = _gcry_cipher_selftest_alloc_ctx (sizeof *ctx, &ctxmem); - if (!ctx) - return "failed to allocate memory"; + ctx = (void *)(ctxmem + ((16 - ((uintptr_t)ctxmem & 15)) & 15)); + rijndael_setkey (ctx, key_256, sizeof(key_256), &bulk_ops); rijndael_encrypt (ctx, scratch, plaintext_256); if (memcmp (scratch, ciphertext_256, sizeof (ciphertext_256))) { - xfree (ctxmem); return "AES-256 test encryption failed."; } rijndael_decrypt (ctx, scratch, scratch); - xfree (ctxmem); if (memcmp (scratch, plaintext_256, sizeof (plaintext_256))) return "AES-256 test decryption failed."; @@ -1691,60 +1678,6 @@ selftest_basic_256 (void) } -/* Run the self-tests for AES-CTR-128, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char* -selftest_ctr_128 (void) -{ -#ifdef USE_VAES - const int nblocks = 16+1; -#else - const int nblocks = 8+1; -#endif - const int blocksize = BLOCKSIZE; - const int context_size = sizeof(RIJNDAEL_context); - - return _gcry_selftest_helper_ctr("AES", &rijndael_setkey, - &rijndael_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for AES-CBC-128, tests bulk CBC decryption. - Returns NULL on success. */ -static const char* -selftest_cbc_128 (void) -{ -#ifdef USE_VAES - const int nblocks = 16+2; -#else - const int nblocks = 8+2; -#endif - const int blocksize = BLOCKSIZE; - const int context_size = sizeof(RIJNDAEL_context); - - return _gcry_selftest_helper_cbc("AES", &rijndael_setkey, - &rijndael_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for AES-CFB-128, tests bulk CFB decryption. - Returns NULL on success. 
*/ -static const char* -selftest_cfb_128 (void) -{ -#ifdef USE_VAES - const int nblocks = 16+2; -#else - const int nblocks = 8+2; -#endif - const int blocksize = BLOCKSIZE; - const int context_size = sizeof(RIJNDAEL_context); - - return _gcry_selftest_helper_cfb("AES", &rijndael_setkey, - &rijndael_encrypt, nblocks, blocksize, context_size); -} - - /* Run all the self-tests and return NULL on success. This function is used for the on-the-fly self-tests. */ static const char * @@ -1757,15 +1690,6 @@ selftest (void) || (r = selftest_basic_256 ()) ) return r; - if ( (r = selftest_ctr_128 ()) ) - return r; - - if ( (r = selftest_cbc_128 ()) ) - return r; - - if ( (r = selftest_cfb_128 ()) ) - return r; - return r; } diff --git a/cipher/serpent.c b/cipher/serpent.c index dfe5cc28..11eeb079 100644 --- a/cipher/serpent.c +++ b/cipher/serpent.c @@ -30,7 +30,6 @@ #include "bithelp.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #include "bulkhelp.h" @@ -1540,48 +1539,6 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, -/* Run the self-tests for SERPENT-CTR-128, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char* -selftest_ctr_128 (void) -{ - const int nblocks = 16+8+1; - const int blocksize = sizeof(serpent_block_t); - const int context_size = sizeof(serpent_context_t); - - return _gcry_selftest_helper_ctr("SERPENT", &serpent_setkey, - &serpent_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for SERPENT-CBC-128, tests bulk CBC decryption. - Returns NULL on success. */ -static const char* -selftest_cbc_128 (void) -{ - const int nblocks = 16+8+2; - const int blocksize = sizeof(serpent_block_t); - const int context_size = sizeof(serpent_context_t); - - return _gcry_selftest_helper_cbc("SERPENT", &serpent_setkey, - &serpent_encrypt, nblocks, blocksize, context_size); -} - - -/* Run the self-tests for SERPENT-CBC-128, tests bulk CBC decryption. - Returns NULL on success. */ -static const char* -selftest_cfb_128 (void) -{ - const int nblocks = 16+8+2; - const int blocksize = sizeof(serpent_block_t); - const int context_size = sizeof(serpent_context_t); - - return _gcry_selftest_helper_cfb("SERPENT", &serpent_setkey, - &serpent_encrypt, nblocks, blocksize, context_size); -} - - /* Serpent test. */ static const char * @@ -1590,7 +1547,6 @@ serpent_test (void) serpent_context_t context; unsigned char scratch[16]; unsigned int i; - const char *r; static struct test { @@ -1662,15 +1618,6 @@ serpent_test (void) } } - if ( (r = selftest_ctr_128 ()) ) - return r; - - if ( (r = selftest_cbc_128 ()) ) - return r; - - if ( (r = selftest_cfb_128 ()) ) - return r; - return NULL; } diff --git a/cipher/sm4.c b/cipher/sm4.c index 7c7bc1ff..5f8bf224 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -29,7 +29,6 @@ #include "cipher.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #include "bulkhelp.h" /* Helper macro to force alignment to 64 bytes. */ @@ -1429,51 +1428,11 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) return 0; } -/* Run the self-tests for SM4-CTR, tests IV increment of bulk CTR - encryption. Returns NULL on success. 
*/ -static const char* -selftest_ctr_128 (void) -{ - const int nblocks = 16 - 1; - const int blocksize = 16; - const int context_size = sizeof(SM4_context); - - return _gcry_selftest_helper_ctr("SM4", &sm4_setkey, - &sm4_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for SM4-CBC, tests bulk CBC decryption. - Returns NULL on success. */ -static const char* -selftest_cbc_128 (void) -{ - const int nblocks = 16 - 1; - const int blocksize = 16; - const int context_size = sizeof(SM4_context); - - return _gcry_selftest_helper_cbc("SM4", &sm4_setkey, - &sm4_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for SM4-CFB, tests bulk CFB decryption. - Returns NULL on success. */ -static const char* -selftest_cfb_128 (void) -{ - const int nblocks = 16 - 1; - const int blocksize = 16; - const int context_size = sizeof(SM4_context); - - return _gcry_selftest_helper_cfb("SM4", &sm4_setkey, - &sm4_encrypt, nblocks, blocksize, context_size); -} - static const char * sm4_selftest (void) { SM4_context ctx; byte scratch[16]; - const char *r; static const byte plaintext[16] = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, @@ -1498,15 +1457,6 @@ sm4_selftest (void) if (memcmp (scratch, plaintext, sizeof (plaintext))) return "SM4 test decryption failed."; - if ( (r = selftest_ctr_128 ()) ) - return r; - - if ( (r = selftest_cbc_128 ()) ) - return r; - - if ( (r = selftest_cfb_128 ()) ) - return r; - return NULL; } diff --git a/cipher/twofish.c b/cipher/twofish.c index 4ae5d5a6..b300715b 100644 --- a/cipher/twofish.c +++ b/cipher/twofish.c @@ -46,7 +46,6 @@ #include "cipher.h" #include "bufhelp.h" #include "cipher-internal.h" -#include "cipher-selftest.h" #include "bulkhelp.h" @@ -1527,46 +1526,6 @@ _gcry_twofish_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, return nblocks; } - - -/* Run the self-tests for TWOFISH-CTR, tests IV increment of bulk CTR - encryption. Returns NULL on success. */ -static const char * -selftest_ctr (void) -{ - const int nblocks = 16+1; - const int blocksize = TWOFISH_BLOCKSIZE; - const int context_size = sizeof(TWOFISH_context); - - return _gcry_selftest_helper_ctr("TWOFISH", &twofish_setkey, - &twofish_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for TWOFISH-CBC, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cbc (void) -{ - const int nblocks = 16+2; - const int blocksize = TWOFISH_BLOCKSIZE; - const int context_size = sizeof(TWOFISH_context); - - return _gcry_selftest_helper_cbc("TWOFISH", &twofish_setkey, - &twofish_encrypt, nblocks, blocksize, context_size); -} - -/* Run the self-tests for TWOFISH-CFB, tests bulk CBC decryption. - Returns NULL on success. */ -static const char * -selftest_cfb (void) -{ - const int nblocks = 16+2; - const int blocksize = TWOFISH_BLOCKSIZE; - const int context_size = sizeof(TWOFISH_context); - - return _gcry_selftest_helper_cfb("TWOFISH", &twofish_setkey, - &twofish_encrypt, nblocks, blocksize, context_size); -} /* Test a single encryption and decryption with each key size. */ @@ -1577,7 +1536,6 @@ selftest (void) TWOFISH_context ctx; /* Expanded key. */ byte scratch[16]; /* Encryption/decryption result buffer. */ cipher_bulk_ops_t bulk_ops; - const char *r; /* Test vectors for single encryption/decryption. 
Note that I am using * the vectors from the Twofish paper's "known answer test", I=3 for * 128-bit and I=4 for 256-bit, @@ -1627,13 +1585,6 @@ selftest (void) if (memcmp (scratch, plaintext_256, sizeof (plaintext_256))) return "Twofish-256 test decryption failed."; - if ((r = selftest_ctr()) != NULL) - return r; - if ((r = selftest_cbc()) != NULL) - return r; - if ((r = selftest_cfb()) != NULL) - return r; - return NULL; } diff --git a/tests/basic.c b/tests/basic.c index 40a0a474..f5513740 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -27,6 +27,13 @@ #include #include #include +#ifdef HAVE_STDINT_H +# include /* uintptr_t */ +#elif defined(HAVE_INTTYPES_H) +# include +#else +/* In this case, uintptr_t is provided by config.h. */ +#endif #include "../src/gcrypt-int.h" @@ -11672,6 +11679,764 @@ out: + + +static void buf_xor(void *vdst, const void *vsrc1, const void *vsrc2, size_t len) +{ + char *dst = vdst; + const char *src1 = vsrc1; + const char *src2 = vsrc2; + + while (len) + { + *(char *)dst = *(char *)src1 ^ *(char *)src2; + dst++; + src1++; + src2++; + len--; + } +} + +/* Run the tests for -CBC-, tests bulk CBC + decryption. Returns zero on success. */ +static int +cipher_cbc_bulk_test (int cipher_algo) +{ + const int nblocks = 128 - 1; + int i, offs; + int blocksize; + const char *cipher; + gcry_cipher_hd_t hd_one; + gcry_cipher_hd_t hd_cbc; + gcry_error_t err = 0; + unsigned char *plaintext, *plaintext2, *ciphertext, *iv, *iv2, *mem; + unsigned int memsize; + unsigned int keylen; + + static const unsigned char key[32] = { + 0x66,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x22, + 0x66,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x22 + }; + + if (gcry_cipher_test_algo (cipher_algo)) + return -1; + blocksize = gcry_cipher_get_algo_blklen(cipher_algo); + if (blocksize < 8) + return -1; + cipher = gcry_cipher_algo_name (cipher_algo); + keylen = gcry_cipher_get_algo_keylen (cipher_algo); + if (keylen > sizeof(key)) + { + fail ("%s-CBC-%d test failed (key too short)", cipher, blocksize * 8); + return -1; + } + + memsize = (blocksize * 2) + (blocksize * nblocks * 3) + 16; + + mem = xcalloc (1, memsize); + if (!mem) + return -1; + + offs = (16 - ((uintptr_t)mem & 15)) & 15; + iv = (void*)(mem + offs); + iv2 = iv + blocksize; + plaintext = iv2 + blocksize; + plaintext2 = plaintext + nblocks * blocksize; + ciphertext = plaintext2 + nblocks * blocksize; + + err = gcry_cipher_open (&hd_one, cipher_algo, GCRY_CIPHER_MODE_ECB, 0); + if (err) + { + xfree(mem); + fail ("%s-CBC-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_open (&hd_cbc, cipher_algo, GCRY_CIPHER_MODE_CBC, 0); + if (err) + { + gcry_cipher_close (hd_one); + xfree(mem); + fail ("%s-CBC-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + + /* Initialize ctx */ + if (gcry_cipher_setkey (hd_one, key, keylen) || + gcry_cipher_setkey (hd_cbc, key, keylen)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (setkey fail)", cipher, blocksize * 8); + return -1; + } + + /* Test single block code path */ + memset (iv, 0x4e, blocksize); + memset (iv2, 0x4e, blocksize); + for (i = 0; i < blocksize; i++) + plaintext[i] = i; +
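Note: the reference data below is built from the CBC recurrence C_i = E_K(P_i xor C_{i-1}), with C_0 = IV, using only the single-block ECB handle; the bulk CBC decryption is then checked against it. In outline ('prev' is a hypothetical block-sized buffer, shown for illustration only):

  /* Outline of the reference CBC construction used by this test. */
  memcpy (prev, iv, blocksize);                /* C_0 = IV */
  for (i = 0; i < nblocks * blocksize; i += blocksize)
    {
      buf_xor (&ciphertext[i], prev, &plaintext[i], blocksize);
      gcry_cipher_encrypt (hd_one, &ciphertext[i], blocksize,
                           NULL, 0);           /* in-place ECB */
      memcpy (prev, &ciphertext[i], blocksize);
    }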
+ /* CBC manually. */ + buf_xor (ciphertext, iv, plaintext, blocksize); + err = gcry_cipher_encrypt (hd_one, ciphertext, blocksize, + ciphertext, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (ECB encrypt fail)", cipher, blocksize * 8); + return -1; + } + memcpy (iv, ciphertext, blocksize); + + /* CBC decrypt. */ + err = gcry_cipher_setiv (hd_cbc, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_decrypt (hd_cbc, plaintext2, blocksize * 1, + ciphertext, blocksize * 1); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (CBC decrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp (plaintext2, plaintext, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree (mem); + fail ("%s-CBC-%d test failed (plaintext mismatch)", cipher, blocksize * 8); + return -1; + } + +#if 0 /* missing interface for reading IV */ + if (memcmp (iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree (mem); + fail ("%s-CBC-%d test failed (IV mismatch)", cipher, blocksize * 8); + return -1; + } +#endif + + /* Test parallelized code paths */ + memset (iv, 0x5f, blocksize); + memset (iv2, 0x5f, blocksize); + + for (i = 0; i < nblocks * blocksize; i++) + plaintext[i] = i; + + /* Create CBC ciphertext manually. */ + for (i = 0; i < nblocks * blocksize; i+=blocksize) + { + buf_xor (&ciphertext[i], iv, &plaintext[i], blocksize); + err = gcry_cipher_encrypt (hd_one, &ciphertext[i], blocksize, + &ciphertext[i], blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (ECB encrypt fail)", cipher, blocksize * 8); + return -1; + } + memcpy (iv, &ciphertext[i], blocksize); + } + + /* Decrypt using bulk CBC and compare result. */ + err = gcry_cipher_setiv (hd_cbc, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_decrypt (hd_cbc, plaintext2, blocksize * nblocks, + ciphertext, blocksize * nblocks); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree(mem); + fail ("%s-CBC-%d test failed (CBC decrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp (plaintext2, plaintext, nblocks * blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree (mem); + fail ("%s-CBC-%d test failed (plaintext mismatch, parallel path)", + cipher, blocksize * 8); + return -1; + } +#if 0 /* missing interface for reading IV */ + if (memcmp (iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree (mem); + fail ("%s-CBC-%d test failed (IV mismatch, parallel path)", + cipher, blocksize * 8); + return -1; + } +#endif + + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cbc); + xfree (mem); + return 0; +} + + +static void +buf_xor_2dst(void *vdst1, void *vdst2, const void *vsrc, size_t len) +{ + byte *dst1 = vdst1; + byte *dst2 = vdst2; + const byte *src = vsrc; + + for (; len; len--) + *dst1++ = (*dst2++ ^= *src++); +} + +/* Run the tests for -CFB-, tests bulk CFB + decryption. Returns zero on success.
*/ +static int +cipher_cfb_bulk_test (int cipher_algo) +{ + const int nblocks = 128 - 1; + int blocksize; + const char *cipher; + gcry_cipher_hd_t hd_one; + gcry_cipher_hd_t hd_cfb; + gcry_error_t err = 0; + int i, offs; + unsigned char *plaintext, *plaintext2, *ciphertext, *iv, *iv2, *mem; + unsigned int memsize; + unsigned int keylen; + + static const unsigned char key[32] = { + 0x11,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x33, + 0x11,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x33 + }; + + if (gcry_cipher_test_algo (cipher_algo)) + return -1; + blocksize = gcry_cipher_get_algo_blklen(cipher_algo); + if (blocksize < 8) + return -1; + cipher = gcry_cipher_algo_name (cipher_algo); + keylen = gcry_cipher_get_algo_keylen (cipher_algo); + if (keylen > sizeof(key)) + { + fail ("%s-CFB-%d test failed (key too short)", cipher, blocksize * 8); + return -1; + } + + memsize = (blocksize * 2) + (blocksize * nblocks * 3) + 16; + + mem = xcalloc (1, memsize); + if (!mem) + return -1; + + offs = (16 - ((uintptr_t)mem & 15)) & 15; + iv = (void*)(mem + offs); + iv2 = iv + blocksize; + plaintext = iv2 + blocksize; + plaintext2 = plaintext + nblocks * blocksize; + ciphertext = plaintext2 + nblocks * blocksize; + + err = gcry_cipher_open (&hd_one, cipher_algo, GCRY_CIPHER_MODE_ECB, 0); + if (err) + { + xfree(mem); + fail ("%s-CFB-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_open (&hd_cfb, cipher_algo, GCRY_CIPHER_MODE_CFB, 0); + if (err) + { + gcry_cipher_close (hd_one); + xfree(mem); + fail ("%s-CFB-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + + /* Initialize ctx */ + if (gcry_cipher_setkey (hd_one, key, keylen) || + gcry_cipher_setkey (hd_cfb, key, keylen)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (setkey fail)", cipher, blocksize * 8); + return -1; + } + + /* Test single block code path */ + memset(iv, 0xd3, blocksize); + memset(iv2, 0xd3, blocksize); + for (i = 0; i < blocksize; i++) + plaintext[i] = i; + + /* CFB manually. */ + err = gcry_cipher_encrypt (hd_one, ciphertext, blocksize, iv, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (ECB encrypt fail)", cipher, blocksize * 8); + return -1; + } + buf_xor_2dst (iv, ciphertext, plaintext, blocksize); + + /* CFB decrypt. 
*/ + err = gcry_cipher_setiv (hd_cfb, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_decrypt (hd_cfb, plaintext2, blocksize * 1, + ciphertext, blocksize * 1); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (CFB decrypt fail)", cipher, blocksize * 8); + return -1; + } + if (memcmp(plaintext2, plaintext, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (plaintext mismatch)", + cipher, blocksize * 8); + return -1; + } + +#if 0 + if (memcmp(iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (IV mismatch)", + cipher, blocksize * 8); + return -1; + } +#endif + + /* Test parallelized code paths */ + memset(iv, 0xe6, blocksize); + memset(iv2, 0xe6, blocksize); + + for (i = 0; i < nblocks * blocksize; i++) + plaintext[i] = i; + + /* Create CFB ciphertext manually. */ + for (i = 0; i < nblocks * blocksize; i+=blocksize) + { + err = gcry_cipher_encrypt (hd_one, &ciphertext[i], blocksize, + iv, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (ECB encrypt fail)", cipher, blocksize * 8); + return -1; + } + buf_xor_2dst (iv, &ciphertext[i], &plaintext[i], blocksize); + } + + /* Decrypt using bulk CFB and compare result. */ + err = gcry_cipher_setiv (hd_cfb, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_decrypt (hd_cfb, plaintext2, blocksize * nblocks, + ciphertext, blocksize * nblocks); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (CFB decrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp(plaintext2, plaintext, nblocks * blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (plaintext mismatch, parallel path)", + cipher, blocksize * 8); + return -1; + } +#if 0 + if (memcmp(iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + fail ("%s-CFB-%d test failed (IV mismatch, parallel path)", + cipher, blocksize * 8); + return -1; + } +#endif + + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_cfb); + xfree(mem); + return 0; +} + + +/* Run the tests for -CTR-, tests IV increment + of bulk CTR encryption. Returns zero on success. 
*/ +static int +cipher_ctr_bulk_test (int cipher_algo) +{ + const int nblocks = 128 - 1; + int blocksize; + const char *cipher; + gcry_cipher_hd_t hd_one; + gcry_cipher_hd_t hd_ctr; + gcry_error_t err = 0; + int i, j, offs, diff; + unsigned char *plaintext, *plaintext2, *ciphertext, *ciphertext2, + *iv, *iv2, *mem; + unsigned int memsize; + unsigned int keylen; + + static const unsigned char key[32] = { + 0x06,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x21, + 0x06,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x21 + }; + + if (gcry_cipher_test_algo (cipher_algo)) + return -1; + blocksize = gcry_cipher_get_algo_blklen(cipher_algo); + if (blocksize < 8) + return -1; + cipher = gcry_cipher_algo_name (cipher_algo); + keylen = gcry_cipher_get_algo_keylen (cipher_algo); + if (keylen > sizeof(key)) + { + fail ("%s-CTR-%d test failed (key too short)", cipher, blocksize * 8); + return -1; + } + + memsize = (blocksize * 2) + (blocksize * nblocks * 4) + 16; + + mem = xcalloc (1, memsize); + if (!mem) + return -1; + + offs = (16 - ((uintptr_t)mem & 15)) & 15; + iv = (void*)(mem + offs); + iv2 = iv + blocksize; + plaintext = iv2 + blocksize; + plaintext2 = plaintext + nblocks * blocksize; + ciphertext = plaintext2 + nblocks * blocksize; + ciphertext2 = ciphertext + nblocks * blocksize; + + err = gcry_cipher_open (&hd_one, cipher_algo, GCRY_CIPHER_MODE_ECB, 0); + if (err) + { + xfree(mem); + fail ("%s-CTR-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_open (&hd_ctr, cipher_algo, GCRY_CIPHER_MODE_CTR, 0); + if (err) + { + gcry_cipher_close (hd_one); + xfree(mem); + fail ("%s-CTR-%d test failed (cipher open fail)", cipher, blocksize * 8); + return -1; + } + + /* Initialize ctx */ + if (gcry_cipher_setkey (hd_one, key, keylen) || + gcry_cipher_setkey (hd_ctr, key, keylen)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (setkey fail)", cipher, blocksize * 8); + return -1; + } + + /* Test single block code path */ + memset (iv, 0xff, blocksize); + for (i = 0; i < blocksize; i++) + plaintext[i] = i; + + /* CTR manually. 
*/ + err = gcry_cipher_encrypt (hd_one, ciphertext, blocksize, iv, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (ECB encrypt fail)", cipher, blocksize * 8); + return -1; + } + for (i = 0; i < blocksize; i++) + ciphertext[i] ^= plaintext[i]; + for (i = blocksize; i > 0; i--) + { + iv[i-1]++; + if (iv[i-1]) + break; + } + + memset (iv2, 0xff, blocksize); + err = gcry_cipher_setctr (hd_ctr, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_encrypt (hd_ctr, plaintext2, blocksize * 1, + ciphertext, blocksize * 1); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (CTR encrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp (plaintext2, plaintext, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (plaintext mismatch)", + cipher, blocksize * 8); + return -1; + } + +#if 0 + if (memcmp (iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (IV mismatch)", cipher, blocksize * 8); + return -1; + } +#endif + + /* Test bulk encryption with typical IV. */ + memset(iv, 0x57, blocksize-4); + iv[blocksize-1] = 1; + iv[blocksize-2] = 0; + iv[blocksize-3] = 0; + iv[blocksize-4] = 0; + memset(iv2, 0x57, blocksize-4); + iv2[blocksize-1] = 1; + iv2[blocksize-2] = 0; + iv2[blocksize-3] = 0; + iv2[blocksize-4] = 0; + + for (i = 0; i < blocksize * nblocks; i++) + plaintext2[i] = plaintext[i] = i; + + /* Create CTR ciphertext manually. 
*/ + for (i = 0; i < blocksize * nblocks; i+=blocksize) + { + err = gcry_cipher_encrypt (hd_one, &ciphertext[i], blocksize, + iv, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (ECB encrypt fail)", + cipher, blocksize * 8); + return -1; + } + for (j = 0; j < blocksize; j++) + ciphertext[i+j] ^= plaintext[i+j]; + for (j = blocksize; j > 0; j--) + { + iv[j-1]++; + if (iv[j-1]) + break; + } + } + + err = gcry_cipher_setctr (hd_ctr, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_encrypt (hd_ctr, ciphertext2, blocksize * nblocks, + plaintext2, blocksize * nblocks); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (CTR encrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp (ciphertext2, ciphertext, blocksize * nblocks)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (ciphertext mismatch, bulk)", + cipher, blocksize * 8); + return -1; + } +#if 0 + if (memcmp (iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (IV mismatch, bulk)", cipher, blocksize * 8); + return -1; + } +#endif + + /* Test parallelized code paths (check counter overflow handling) */ + for (diff = 0; diff < nblocks; diff++) { + memset(iv, 0xff, blocksize); + iv[blocksize-1] -= diff; + iv[0] = iv[1] = 0; + iv[2] = 0x07; + + for (i = 0; i < blocksize * nblocks; i++) + plaintext[i] = i; + + /* Create CTR ciphertext manually. */ + for (i = 0; i < blocksize * nblocks; i+=blocksize) + { + err = gcry_cipher_encrypt (hd_one, &ciphertext[i], blocksize, + iv, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (ECB encrypt fail)", + cipher, blocksize * 8); + return -1; + } + for (j = 0; j < blocksize; j++) + ciphertext[i+j] ^= plaintext[i+j]; + for (j = blocksize; j > 0; j--) + { + iv[j-1]++; + if (iv[j-1]) + break; + } + } + + /* Decrypt using bulk CTR and compare result. 
*/ + memset(iv2, 0xff, blocksize); + iv2[blocksize-1] -= diff; + iv2[0] = iv2[1] = 0; + iv2[2] = 0x07; + + err = gcry_cipher_setctr (hd_ctr, iv2, blocksize); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (setiv fail)", cipher, blocksize * 8); + return -1; + } + err = gcry_cipher_decrypt (hd_ctr, plaintext2, blocksize * nblocks, + ciphertext, blocksize * nblocks); + if (err) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (CTR decrypt fail)", cipher, blocksize * 8); + return -1; + } + + if (memcmp (plaintext2, plaintext, blocksize * nblocks)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (plaintext mismatch, diff: %d)", + cipher, blocksize * 8, diff); + return -1; + } +#if 0 + if (memcmp(iv2, iv, blocksize)) + { + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + fail ("%s-CTR-%d test failed (IV mismatch, diff: %d)", + cipher, blocksize * 8, diff); + return -1; + } +#endif + } + + gcry_cipher_close (hd_one); + gcry_cipher_close (hd_ctr); + xfree(mem); + return 0; +} + + + static void check_ciphers (void) { @@ -11784,6 +12549,13 @@ check_ciphers (void) check_one_cipher (algos[i], GCRY_CIPHER_MODE_OCB, 0); if (gcry_cipher_get_algo_blklen (algos[i]) == GCRY_XTS_BLOCK_LEN) check_one_cipher (algos[i], GCRY_CIPHER_MODE_XTS, 0); + + if (gcry_cipher_get_algo_blklen (algos[i]) >= 8) + { + cipher_cbc_bulk_test (algos[i]); + cipher_cfb_bulk_test (algos[i]); + cipher_ctr_bulk_test (algos[i]); + } } for (i = 0; algos2[i]; i++) -- 2.34.1 From gniibe at fsij.org Tue May 10 09:05:16 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 10 May 2022 16:05:16 +0900 Subject: Old bug in gcry_mpi_invm producing wrong result In-Reply-To: References: Message-ID: <87pmkl50lv.fsf@akagi.fsij.org> Guido Vranken wrote: > It says that InvMod(18446744073709551615, > 340282366762482138434845932244680310781) is > 170141183381241069226646338154899963903 but that's not true, because > 170141183381241069226646338154899963903 * 18446744073709551615 % > 340282366762482138434845932244680310781 is 4294967297, not 1. Thank you for your report. With libgcrypt 1.8, it works correctly. It is tracked by: https://dev.gnupg.org/T5970 The fix I pushed is: diff --git a/mpi/mpih-const-time.c b/mpi/mpih-const-time.c index b527ad79..9d74d190 100644 --- a/mpi/mpih-const-time.c +++ b/mpi/mpih-const-time.c @@ -204,6 +204,13 @@ _gcry_mpih_cmp_ui (mpi_ptr_t up, mpi_size_t usize, unsigned long v) is_all_zero &= (up[i] == 0); if (is_all_zero) - return up[0] - v; + { + if (up[0] < v) + return -1; + else if (up[0] > v) + return 1; + else + return 0; + } return 1; } The expression up[0] - v is only correct on 32-bit architectures. It may return a wrong result on 64-bit architectures. -- From guidovranken at gmail.com Tue May 10 12:51:00 2022 From: guidovranken at gmail.com (Guido Vranken) Date: Tue, 10 May 2022 12:51:00 +0200 Subject: Old bug in gcry_mpi_invm producing wrong result In-Reply-To: <87pmkl50lv.fsf@akagi.fsij.org> References: <87pmkl50lv.fsf@akagi.fsij.org> Message-ID: Thank you. I have confirmed that your patch resolves the issue. However, I tried again with 1.8.0 and at that version, the reproducer prints "Inverse exists". 
On Tue, May 10, 2022 at 9:05 AM NIIBE Yutaka wrote: > Guido Vranken wrote: > > It says that InvMod(18446744073709551615, > > 340282366762482138434845932244680310781) is > > 170141183381241069226646338154899963903 but that's not true, because > > 170141183381241069226646338154899963903 * 18446744073709551615 % > > 340282366762482138434845932244680310781 is 4294967297, not 1. > > Thank you for your report. With libgcrypt 1.8, it works correctly. > > It is tracked by: https://dev.gnupg.org/T5970 > > The fix I pushed is: > > diff --git a/mpi/mpih-const-time.c b/mpi/mpih-const-time.c > index b527ad79..9d74d190 100644 > --- a/mpi/mpih-const-time.c > +++ b/mpi/mpih-const-time.c > @@ -204,6 +204,13 @@ _gcry_mpih_cmp_ui (mpi_ptr_t up, mpi_size_t usize, > unsigned long v) > is_all_zero &= (up[i] == 0); > > if (is_all_zero) > - return up[0] - v; > + { > + if (up[0] < v) > + return -1; > + else if (up[0] > v) > + return 1; > + else > + return 0; > + } > return 1; > } > > > > The expression up[0] - v is only correct on 32-bit architectures. > It may return a wrong result on 64-bit architectures. > -- From tianjia.zhang at linux.alibaba.com Tue May 10 14:49:47 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 10 May 2022 20:49:47 +0800 Subject: [PATCH] Fix the error of GET_DATA_POINTER for aarch64 on linux Message-ID: <20220510124947.96618-1-tianjia.zhang@linux.alibaba.com> * cipher/asm-common-aarch64.h: Use same macro GET_DATA_POINTER() for linux and windows. -- For multiple labels defined in the same consts object on Linux, except the first label, an error result will be obtained when taking the address of other labels through GET_DATA_POINTER(). The error address is the same as the first label. An erroneous code fragment after compilation is as follows: 0x0000fffff7f18be4 <+12>: adrp x6, 0xfffff7fb8000 0x0000fffff7f18be8 <+16>: ldr x6, [x6, #3216] 0x0000fffff7f18bec <+20>: ld1b {z24.b}, p0/z, [x6] 0x0000fffff7f18bf0 <+24>: adrp x6, 0xfffff7fb8000 0x0000fffff7f18bf4 <+28>: ldr x6, [x6, #3216] 0x0000fffff7f18bf8 <+32>: ldr p1, [x6] This patch fixes this problem by using the ADRP/ADD instruction. Signed-off-by: Tianjia Zhang --- cipher/asm-common-aarch64.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h index d3f7801c7661..b29479a32157 100644 --- a/cipher/asm-common-aarch64.h +++ b/cipher/asm-common-aarch64.h @@ -33,7 +33,7 @@ #define GET_DATA_POINTER(reg, name) \ adrp reg, name at GOTPAGE ; \ add reg, reg, name at GOTPAGEOFF ; -#elif defined(_WIN32) +#elif defined(__linux__) || defined(_WIN32) #define GET_DATA_POINTER(reg, name) \ adrp reg, name ; \ add reg, reg, #:lo12:name ; -- 2.24.3 (Apple Git-128) From gniibe at fsij.org Tue May 10 16:24:02 2022 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 10 May 2022 23:24:02 +0900 Subject: Old bug in gcry_mpi_invm producing wrong result In-Reply-To: References: <87pmkl50lv.fsf@akagi.fsij.org> Message-ID: <875ymd79fh.fsf@jumper.gniibe.org> Guido Vranken wrote: > However, I tried again with 1.8.0 and at that version, the reproducer prints > "Inverse exists". Ah, yes. You are right. I should have been more specific. It was libgcrypt 1.8.6, which fixed the old bug for the return value of gcry_mpi_invm. After that version, it works correctly (either 32-bit or 64-bit) in 1.8 series. But by the commit of 128045a12139fe2e4be877df59da10c7d4857d9a, which is included in libgcrypt 1.9.0 and later, it works incorrectly again (on 64-bit machines). 1.10.2 will include the fix. --
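To make the width problem above concrete, here is a minimal standalone C sketch; the limb type, function name and values are made up to mirror the report, and this is not libgcrypt code:

#include <stdio.h>

typedef unsigned long long limb_t;   /* stand-in for a 64-bit mpi_limb_t */

/* Broken variant: the 64-bit difference is truncated to (int). */
static int cmp_ui_broken (limb_t u0, unsigned long v)
{
  return u0 - v;
}

int main (void)
{
  limb_t u0 = 18446744073709551615ULL;  /* 2^64 - 1, as in the report */

  /* u0 - 1 = 0xfffffffffffffffe; on the usual two's-complement ABIs the
     low 32 bits (0xfffffffe) become -2, so u0 compares as *smaller*
     than 1 even though it is larger.  */
  printf ("broken: %d\n", cmp_ui_broken (u0, 1));

  /* 2^32 - 0 = 0x100000000; its low 32 bits are 0, so two unequal
     values compare as *equal*.  */
  printf ("broken: %d\n", cmp_ui_broken (1ULL << 32, 0));
  return 0;
}

Compiled on a 64-bit target, both calls misreport the comparison, which is why the fix above replaces the subtraction with explicit <, > and == tests.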
From jussi.kivilinna at iki.fi Wed May 11 19:14:04 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 11 May 2022 20:14:04 +0300 Subject: [PATCH] Fix the error of GET_DATA_POINTER for aarch64 on linux In-Reply-To: <20220510124947.96618-1-tianjia.zhang@linux.alibaba.com> References: <20220510124947.96618-1-tianjia.zhang@linux.alibaba.com> Message-ID: Hello, On 10.5.2022 15.49, Tianjia Zhang via Gcrypt-devel wrote: > * cipher/asm-common-aarch64.h: Use same macro GET_DATA_POINTER() for > linux and windows. > -- > > For multiple labels defined in the same consts object on Linux, > except the first label, an error result will be obtained when taking > the address of other labels through GET_DATA_POINTER(). The error > address is the same as the first label. An erroneous code fragment after > compilation is as follows: > > 0x0000fffff7f18be4 <+12>: adrp x6, 0xfffff7fb8000 > 0x0000fffff7f18be8 <+16>: ldr x6, [x6, #3216] > 0x0000fffff7f18bec <+20>: ld1b {z24.b}, p0/z, [x6] > 0x0000fffff7f18bf0 <+24>: adrp x6, 0xfffff7fb8000 > 0x0000fffff7f18bf4 <+28>: ldr x6, [x6, #3216] > 0x0000fffff7f18bf8 <+32>: ldr p1, [x6] > > This patch fixes this problem by using the ADRP/ADD instruction. > > Signed-off-by: Tianjia Zhang > --- > cipher/asm-common-aarch64.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h > index d3f7801c7661..b29479a32157 100644 > --- a/cipher/asm-common-aarch64.h > +++ b/cipher/asm-common-aarch64.h > @@ -33,7 +33,7 @@ > #define GET_DATA_POINTER(reg, name) \ > adrp reg, name at GOTPAGE ; \ > add reg, reg, name at GOTPAGEOFF ; > -#elif defined(_WIN32) > +#elif defined(__linux__) || defined(_WIN32) > #define GET_DATA_POINTER(reg, name) \ > adrp reg, name ; \ > add reg, reg, #:lo12:name ; It might be better to rename GET_DATA_POINTER to GET_LOCAL_POINTER, remove the ifdefs and just use: #define GET_LOCAL_POINTER(reg, name) \ adrp reg, name ; \ add reg, reg, #:lo12:name ; The Apple variant is likely to be broken in the same way as the Linux variant. If we later need support for external objects, we can add GET_EXTERN_POINTER macro. -Jussi From jussi.kivilinna at iki.fi Thu May 12 08:11:24 2022 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 12 May 2022 09:11:24 +0300 Subject: [PATCH] aarch64-asm: use ADR for getting pointers for local labels Message-ID: <20220512061124.757317-1-jussi.kivilinna@iki.fi> * cipher/asm-common-aarch64.h (GET_DATA_POINTER): Remove. (GET_LOCAL_POINTER): New. * cipher/camellia-aarch64.S: Use GET_LOCAL_POINTER instead of ADR instruction directly. * cipher/chacha20-aarch64.S: Use GET_LOCAL_POINTER instead of GET_DATA_POINTER. * cipher/cipher-gcm-armv8-aarch64-ce.S: Likewise. * cipher/crc-armv8-aarch64-ce.S: Likewise. * cipher/sha1-armv8-aarch64-ce.S: Likewise. * cipher/sha256-armv8-aarch64-ce.S: Likewise. * cipher/sm3-aarch64.S: Likewise. * cipher/sm3-armv8-aarch64-ce.S: Likewise. * cipher/sm4-aarch64.S: Likewise. --- Switch to use ADR instead of ADRP/LDR or ADRP/ADD for getting data pointers within assembly files. ADR is more portable across targets and does not require labels to be declared in GOT tables.
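As background for the trade-off described above, a sketch of the two addressing forms (illustrative assembly only, not part of the patch): ADR materializes a PC-relative address in a single instruction with roughly a +-1 MiB reach, which is enough for constants placed next to the code, while ADRP plus ADD works in 4 KiB pages with a +-4 GiB reach at the cost of relocation operators whose spelling differs between ELF, Mach-O and COFF targets.

/* Illustrative AArch64 fragment, not from the patch. */
.text
load_local:
	adr	x0, .Lconsts		/* one insn, PC-relative, ~ +-1 MiB */
	ret

load_far:
	adrp	x0, .Lconsts		/* 4 KiB page of the symbol, +-4 GiB */
	add	x0, x0, #:lo12:.Lconsts	/* low 12 bits within that page */
	ret

.align 4
.Lconsts:
	.byte	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15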
Signed-off-by: Jussi Kivilinna --- cipher/asm-common-aarch64.h | 15 ++------------- cipher/camellia-aarch64.S | 4 ++-- cipher/chacha20-aarch64.S | 8 ++++---- cipher/cipher-gcm-armv8-aarch64-ce.S | 6 +++--- cipher/crc-armv8-aarch64-ce.S | 4 ++-- cipher/sha1-armv8-aarch64-ce.S | 2 +- cipher/sha256-armv8-aarch64-ce.S | 2 +- cipher/sm3-aarch64.S | 2 +- cipher/sm3-armv8-aarch64-ce.S | 2 +- cipher/sm4-aarch64.S | 2 +- 10 files changed, 18 insertions(+), 29 deletions(-) diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h index d3f7801c..b38b17a6 100644 --- a/cipher/asm-common-aarch64.h +++ b/cipher/asm-common-aarch64.h @@ -29,19 +29,8 @@ # define ELF(...) /*_*/ #endif -#ifdef __APPLE__ -#define GET_DATA_POINTER(reg, name) \ - adrp reg, name at GOTPAGE ; \ - add reg, reg, name at GOTPAGEOFF ; -#elif defined(_WIN32) -#define GET_DATA_POINTER(reg, name) \ - adrp reg, name ; \ - add reg, reg, #:lo12:name ; -#else -#define GET_DATA_POINTER(reg, name) \ - adrp reg, :got:name ; \ - ldr reg, [reg, #:got_lo12:name] ; -#endif +#define GET_LOCAL_POINTER(reg, label) \ + adr reg, label; #ifdef HAVE_GCC_ASM_CFI_DIRECTIVES /* CFI directives to emit DWARF stack unwinding information. */ diff --git a/cipher/camellia-aarch64.S b/cipher/camellia-aarch64.S index 30b568d3..c019c168 100644 --- a/cipher/camellia-aarch64.S +++ b/cipher/camellia-aarch64.S @@ -214,7 +214,7 @@ _gcry_camellia_arm_encrypt_block: * w3: keybitlen */ - adr RTAB1, _gcry_camellia_arm_tables; + GET_LOCAL_POINTER(RTAB1, _gcry_camellia_arm_tables); mov RMASK, #(0xff<<4); /* byte mask */ add RTAB2, RTAB1, #(1 * 4); add RTAB3, RTAB1, #(2 * 4); @@ -274,7 +274,7 @@ _gcry_camellia_arm_decrypt_block: * w3: keybitlen */ - adr RTAB1, _gcry_camellia_arm_tables; + GET_LOCAL_POINTER(RTAB1, _gcry_camellia_arm_tables); mov RMASK, #(0xff<<4); /* byte mask */ add RTAB2, RTAB1, #(1 * 4); add RTAB3, RTAB1, #(2 * 4); diff --git a/cipher/chacha20-aarch64.S b/cipher/chacha20-aarch64.S index 2a980b95..540f892b 100644 --- a/cipher/chacha20-aarch64.S +++ b/cipher/chacha20-aarch64.S @@ -206,10 +206,10 @@ _gcry_chacha20_aarch64_blocks4: */ CFI_STARTPROC() - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); add INPUT_CTR, INPUT, #(12*4); ld1 {ROT8.16b}, [CTR]; - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); mov INPUT_POS, INPUT; ld1 {VCTR.16b}, [CTR]; @@ -383,10 +383,10 @@ _gcry_chacha20_poly1305_aarch64_blocks4: mov POLY_RSTATE, x4; mov POLY_RSRC, x5; - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); add INPUT_CTR, INPUT, #(12*4); ld1 {ROT8.16b}, [CTR]; - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); mov INPUT_POS, INPUT; ld1 {VCTR.16b}, [CTR]; diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S index 687fabe3..78f3ad2d 100644 --- a/cipher/cipher-gcm-armv8-aarch64-ce.S +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S @@ -169,7 +169,7 @@ _gcry_ghash_armv8_ce_pmull: cbz x3, .Ldo_nothing; - GET_DATA_POINTER(x5, .Lrconst) + GET_LOCAL_POINTER(x5, .Lrconst) eor vZZ.16b, vZZ.16b, vZZ.16b ld1 {rhash.16b}, [x1] @@ -368,7 +368,7 @@ _gcry_polyval_armv8_ce_pmull: cbz x3, .Lpolyval_do_nothing; - GET_DATA_POINTER(x5, .Lrconst) + GET_LOCAL_POINTER(x5, .Lrconst) eor vZZ.16b, 
vZZ.16b, vZZ.16b ld1 {rhash.16b}, [x1] @@ -589,7 +589,7 @@ _gcry_ghash_setup_armv8_ce_pmull: */ CFI_STARTPROC() - GET_DATA_POINTER(x2, .Lrconst) + GET_LOCAL_POINTER(x2, .Lrconst) eor vZZ.16b, vZZ.16b, vZZ.16b diff --git a/cipher/crc-armv8-aarch64-ce.S b/cipher/crc-armv8-aarch64-ce.S index 7ac884af..b6cdbb3d 100644 --- a/cipher/crc-armv8-aarch64-ce.S +++ b/cipher/crc-armv8-aarch64-ce.S @@ -71,7 +71,7 @@ _gcry_crc32r_armv8_ce_bulk: */ CFI_STARTPROC() - GET_DATA_POINTER(x7, .Lcrc32_constants) + GET_LOCAL_POINTER(x7, .Lcrc32_constants) add x9, x3, #consts_k(5 - 1) cmp x2, #128 @@ -280,7 +280,7 @@ _gcry_crc32_armv8_ce_bulk: */ CFI_STARTPROC() - GET_DATA_POINTER(x7, .Lcrc32_constants) + GET_LOCAL_POINTER(x7, .Lcrc32_constants) add x4, x7, #.Lcrc32_bswap_shuf - .Lcrc32_constants cmp x2, #128 ld1 {v7.16b}, [x4] diff --git a/cipher/sha1-armv8-aarch64-ce.S b/cipher/sha1-armv8-aarch64-ce.S index ea26564b..f95717ee 100644 --- a/cipher/sha1-armv8-aarch64-ce.S +++ b/cipher/sha1-armv8-aarch64-ce.S @@ -109,7 +109,7 @@ _gcry_sha1_transform_armv8_ce: cbz x2, .Ldo_nothing; - GET_DATA_POINTER(x4, .LK_VEC); + GET_LOCAL_POINTER(x4, .LK_VEC); ld1 {vH0123.4s}, [x0] /* load h0,h1,h2,h3 */ ld1 {vK1.4s-vK4.4s}, [x4] /* load K1,K2,K3,K4 */ diff --git a/cipher/sha256-armv8-aarch64-ce.S b/cipher/sha256-armv8-aarch64-ce.S index d0fa6285..5616eada 100644 --- a/cipher/sha256-armv8-aarch64-ce.S +++ b/cipher/sha256-armv8-aarch64-ce.S @@ -119,7 +119,7 @@ _gcry_sha256_transform_armv8_ce: cbz x2, .Ldo_nothing; - GET_DATA_POINTER(x3, .LK); + GET_LOCAL_POINTER(x3, .LK); mov x4, x3 ld1 {vH0123.4s-vH4567.4s}, [x0] /* load state */ diff --git a/cipher/sm3-aarch64.S b/cipher/sm3-aarch64.S index 3fb89006..0e58254b 100644 --- a/cipher/sm3-aarch64.S +++ b/cipher/sm3-aarch64.S @@ -425,7 +425,7 @@ _gcry_sm3_transform_aarch64: CFI_DEF_CFA_REGISTER(RFRAME); sub addr0, sp, #STACK_SIZE; - GET_DATA_POINTER(RKPTR, .LKtable); + GET_LOCAL_POINTER(RKPTR, .LKtable); and sp, addr0, #(~63); /* Preload first block. */ diff --git a/cipher/sm3-armv8-aarch64-ce.S b/cipher/sm3-armv8-aarch64-ce.S index 0900b84f..d592d08a 100644 --- a/cipher/sm3-armv8-aarch64-ce.S +++ b/cipher/sm3-armv8-aarch64-ce.S @@ -170,7 +170,7 @@ _gcry_sm3_transform_armv8_ce: ext CTX2.16b, CTX2.16b, CTX2.16b, #8; .Lloop: - GET_DATA_POINTER(x3, .Lsm3_Ktable); + GET_LOCAL_POINTER(x3, .Lsm3_Ktable); ld1 {v0.16b-v3.16b}, [x1], #64; sub x2, x2, #1; diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S index 306b425e..8d06991b 100644 --- a/cipher/sm4-aarch64.S +++ b/cipher/sm4-aarch64.S @@ -84,7 +84,7 @@ ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) /* Helper macros. */ #define preload_sbox(ptr) \ - GET_DATA_POINTER(ptr, .Lsm4_sbox); \ + GET_LOCAL_POINTER(ptr, .Lsm4_sbox); \ ld1 {v16.16b-v19.16b}, [ptr], #64; \ ld1 {v20.16b-v23.16b}, [ptr], #64; \ ld1 {v24.16b-v27.16b}, [ptr], #64; \ -- 2.34.1
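Tying this back to the bug report earlier in the thread, a hypothetical consts object with two local labels (names and data made up) shows the case the old GOT-based lookup got wrong; with the GET_LOCAL_POINTER macro introduced above, each label resolves to its own address:

/* Hypothetical consts object with two labels (sketch only). */
.text
.align 4
.Lshuffle_mask:
	.byte	3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12
.Lround_const:
	.byte	0x87, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

.globl demo_load_two
demo_load_two:
	GET_LOCAL_POINTER(x6, .Lshuffle_mask)	/* adr -> .Lshuffle_mask */
	ld1	{v0.16b}, [x6]
	GET_LOCAL_POINTER(x6, .Lround_const)	/* adr -> .Lround_const, not
						   aliased to the first label */
	ld1	{v1.16b}, [x6]
	ret

With the old :got: form, both lookups were compiled against the same GOT slot, which is the repeated adrp/ldr pair visible in the disassembly from the report.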
From tianjia.zhang at linux.alibaba.com Thu May 12 16:00:18 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Thu, 12 May 2022 22:00:18 +0800 Subject: [PATCH] aarch64-asm: use ADR for getting pointers for local labels In-Reply-To: <20220512061124.757317-1-jussi.kivilinna@iki.fi> References: <20220512061124.757317-1-jussi.kivilinna@iki.fi> Message-ID: <7602479f-7765-0614-32d2-1bd2690f45af@linux.alibaba.com> Hi Jussi, On 5/12/22 2:11 PM, Jussi Kivilinna wrote: > * cipher/asm-common-aarch64.h (GET_DATA_POINTER): Remove. > (GET_LOCAL_POINTER): New. > * cipher/camellia-aarch64.S: Use GET_LOCAL_POINTER instead of ADR > instruction directly. > * cipher/chacha20-aarch64.S: Use GET_LOCAL_POINTER instead of > GET_DATA_POINTER. > * cipher/cipher-gcm-armv8-aarch64-ce.S: Likewise. > * cipher/crc-armv8-aarch64-ce.S: Likewise. > * cipher/sha1-armv8-aarch64-ce.S: Likewise. > * cipher/sha256-armv8-aarch64-ce.S: Likewise. > * cipher/sm3-aarch64.S: Likewise. > * cipher/sm3-armv8-aarch64-ce.S: Likewise. > * cipher/sm4-aarch64.S: Likewise. > --- > > Switch to use ADR instead of ADRP/LDR or ADRP/ADD for getting > data pointers within assembly files. ADR is more portable across > targets and does not require labels to be declared in GOT tables. > > Signed-off-by: Jussi Kivilinna > --- Looks good to me. I don't have an apple M1 machine, only tested on arm64 Linux. Reviewed-and-tested-by: Tianjia Zhang Best regards, Tianjia > cipher/asm-common-aarch64.h | 15 ++------------- > cipher/camellia-aarch64.S | 4 ++-- > cipher/chacha20-aarch64.S | 8 ++++---- > cipher/cipher-gcm-armv8-aarch64-ce.S | 6 +++--- > cipher/crc-armv8-aarch64-ce.S | 4 ++-- > cipher/sha1-armv8-aarch64-ce.S | 2 +- > cipher/sha256-armv8-aarch64-ce.S | 2 +- > cipher/sm3-aarch64.S | 2 +- > cipher/sm3-armv8-aarch64-ce.S | 2 +- > cipher/sm4-aarch64.S | 2 +- > 10 files changed, 18 insertions(+), 29 deletions(-) > > diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h > index d3f7801c..b38b17a6 100644 > --- a/cipher/asm-common-aarch64.h > +++ b/cipher/asm-common-aarch64.h > @@ -29,19 +29,8 @@ > # define ELF(...) /*_*/ > #endif > > -#ifdef __APPLE__ > -#define GET_DATA_POINTER(reg, name) \ > - adrp reg, name at GOTPAGE ; \ > - add reg, reg, name at GOTPAGEOFF ; > -#elif defined(_WIN32) > -#define GET_DATA_POINTER(reg, name) \ > - adrp reg, name ; \ > - add reg, reg, #:lo12:name ; > -#else > -#define GET_DATA_POINTER(reg, name) \ > - adrp reg, :got:name ; \ > - ldr reg, [reg, #:got_lo12:name] ; > -#endif > +#define GET_LOCAL_POINTER(reg, label) \ > + adr reg, label; > > #ifdef HAVE_GCC_ASM_CFI_DIRECTIVES > /* CFI directives to emit DWARF stack unwinding information. 
*/ > diff --git a/cipher/camellia-aarch64.S b/cipher/camellia-aarch64.S > index 30b568d3..c019c168 100644 > --- a/cipher/camellia-aarch64.S > +++ b/cipher/camellia-aarch64.S > @@ -214,7 +214,7 @@ _gcry_camellia_arm_encrypt_block: > * w3: keybitlen > */ > > - adr RTAB1, _gcry_camellia_arm_tables; > + GET_LOCAL_POINTER(RTAB1, _gcry_camellia_arm_tables); > mov RMASK, #(0xff<<4); /* byte mask */ > add RTAB2, RTAB1, #(1 * 4); > add RTAB3, RTAB1, #(2 * 4); > @@ -274,7 +274,7 @@ _gcry_camellia_arm_decrypt_block: > * w3: keybitlen > */ > > - adr RTAB1, _gcry_camellia_arm_tables; > + GET_LOCAL_POINTER(RTAB1, _gcry_camellia_arm_tables); > mov RMASK, #(0xff<<4); /* byte mask */ > add RTAB2, RTAB1, #(1 * 4); > add RTAB3, RTAB1, #(2 * 4); > diff --git a/cipher/chacha20-aarch64.S b/cipher/chacha20-aarch64.S > index 2a980b95..540f892b 100644 > --- a/cipher/chacha20-aarch64.S > +++ b/cipher/chacha20-aarch64.S > @@ -206,10 +206,10 @@ _gcry_chacha20_aarch64_blocks4: > */ > CFI_STARTPROC() > > - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); > + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); > add INPUT_CTR, INPUT, #(12*4); > ld1 {ROT8.16b}, [CTR]; > - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); > + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); > mov INPUT_POS, INPUT; > ld1 {VCTR.16b}, [CTR]; > > @@ -383,10 +383,10 @@ _gcry_chacha20_poly1305_aarch64_blocks4: > mov POLY_RSTATE, x4; > mov POLY_RSRC, x5; > > - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); > + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); > add INPUT_CTR, INPUT, #(12*4); > ld1 {ROT8.16b}, [CTR]; > - GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); > + GET_LOCAL_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); > mov INPUT_POS, INPUT; > ld1 {VCTR.16b}, [CTR]; > > diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S > index 687fabe3..78f3ad2d 100644 > --- a/cipher/cipher-gcm-armv8-aarch64-ce.S > +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S > @@ -169,7 +169,7 @@ _gcry_ghash_armv8_ce_pmull: > > cbz x3, .Ldo_nothing; > > - GET_DATA_POINTER(x5, .Lrconst) > + GET_LOCAL_POINTER(x5, .Lrconst) > > eor vZZ.16b, vZZ.16b, vZZ.16b > ld1 {rhash.16b}, [x1] > @@ -368,7 +368,7 @@ _gcry_polyval_armv8_ce_pmull: > > cbz x3, .Lpolyval_do_nothing; > > - GET_DATA_POINTER(x5, .Lrconst) > + GET_LOCAL_POINTER(x5, .Lrconst) > > eor vZZ.16b, vZZ.16b, vZZ.16b > ld1 {rhash.16b}, [x1] > @@ -589,7 +589,7 @@ _gcry_ghash_setup_armv8_ce_pmull: > */ > CFI_STARTPROC() > > - GET_DATA_POINTER(x2, .Lrconst) > + GET_LOCAL_POINTER(x2, .Lrconst) > > eor vZZ.16b, vZZ.16b, vZZ.16b > > diff --git a/cipher/crc-armv8-aarch64-ce.S b/cipher/crc-armv8-aarch64-ce.S > index 7ac884af..b6cdbb3d 100644 > --- a/cipher/crc-armv8-aarch64-ce.S > +++ b/cipher/crc-armv8-aarch64-ce.S > @@ -71,7 +71,7 @@ _gcry_crc32r_armv8_ce_bulk: > */ > CFI_STARTPROC() > > - GET_DATA_POINTER(x7, .Lcrc32_constants) > + GET_LOCAL_POINTER(x7, .Lcrc32_constants) > add x9, x3, #consts_k(5 - 1) > cmp x2, #128 > > @@ -280,7 +280,7 @@ _gcry_crc32_armv8_ce_bulk: > */ > CFI_STARTPROC() > > - GET_DATA_POINTER(x7, .Lcrc32_constants) > + GET_LOCAL_POINTER(x7, .Lcrc32_constants) > add x4, x7, #.Lcrc32_bswap_shuf - .Lcrc32_constants > cmp x2, #128 > ld1 {v7.16b}, [x4] > diff --git a/cipher/sha1-armv8-aarch64-ce.S b/cipher/sha1-armv8-aarch64-ce.S > index ea26564b..f95717ee 100644 > --- a/cipher/sha1-armv8-aarch64-ce.S > +++ 
b/cipher/sha1-armv8-aarch64-ce.S > @@ -109,7 +109,7 @@ _gcry_sha1_transform_armv8_ce: > > cbz x2, .Ldo_nothing; > > - GET_DATA_POINTER(x4, .LK_VEC); > + GET_LOCAL_POINTER(x4, .LK_VEC); > > ld1 {vH0123.4s}, [x0] /* load h0,h1,h2,h3 */ > ld1 {vK1.4s-vK4.4s}, [x4] /* load K1,K2,K3,K4 */ > diff --git a/cipher/sha256-armv8-aarch64-ce.S b/cipher/sha256-armv8-aarch64-ce.S > index d0fa6285..5616eada 100644 > --- a/cipher/sha256-armv8-aarch64-ce.S > +++ b/cipher/sha256-armv8-aarch64-ce.S > @@ -119,7 +119,7 @@ _gcry_sha256_transform_armv8_ce: > > cbz x2, .Ldo_nothing; > > - GET_DATA_POINTER(x3, .LK); > + GET_LOCAL_POINTER(x3, .LK); > mov x4, x3 > > ld1 {vH0123.4s-vH4567.4s}, [x0] /* load state */ > diff --git a/cipher/sm3-aarch64.S b/cipher/sm3-aarch64.S > index 3fb89006..0e58254b 100644 > --- a/cipher/sm3-aarch64.S > +++ b/cipher/sm3-aarch64.S > @@ -425,7 +425,7 @@ _gcry_sm3_transform_aarch64: > CFI_DEF_CFA_REGISTER(RFRAME); > > sub addr0, sp, #STACK_SIZE; > - GET_DATA_POINTER(RKPTR, .LKtable); > + GET_LOCAL_POINTER(RKPTR, .LKtable); > and sp, addr0, #(~63); > > /* Preload first block. */ > diff --git a/cipher/sm3-armv8-aarch64-ce.S b/cipher/sm3-armv8-aarch64-ce.S > index 0900b84f..d592d08a 100644 > --- a/cipher/sm3-armv8-aarch64-ce.S > +++ b/cipher/sm3-armv8-aarch64-ce.S > @@ -170,7 +170,7 @@ _gcry_sm3_transform_armv8_ce: > ext CTX2.16b, CTX2.16b, CTX2.16b, #8; > > .Lloop: > - GET_DATA_POINTER(x3, .Lsm3_Ktable); > + GET_LOCAL_POINTER(x3, .Lsm3_Ktable); > ld1 {v0.16b-v3.16b}, [x1], #64; > sub x2, x2, #1; > > diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S > index 306b425e..8d06991b 100644 > --- a/cipher/sm4-aarch64.S > +++ b/cipher/sm4-aarch64.S > @@ -84,7 +84,7 @@ ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) > /* Helper macros. */ > > #define preload_sbox(ptr) \ > - GET_DATA_POINTER(ptr, .Lsm4_sbox); \ > + GET_LOCAL_POINTER(ptr, .Lsm4_sbox); \ > ld1 {v16.16b-v19.16b}, [ptr], #64; \ > ld1 {v20.16b-v23.16b}, [ptr], #64; \ > ld1 {v24.16b-v27.16b}, [ptr], #64; \ From tianjia.zhang at linux.alibaba.com Thu May 12 16:06:08 2022 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Thu, 12 May 2022 22:06:08 +0800 Subject: [PATCH] Fix the error of GET_DATA_POINTER for aarch64 on linux In-Reply-To: References: <20220510124947.96618-1-tianjia.zhang@linux.alibaba.com> Message-ID: <7549e014-80c2-a68d-b155-7cc43c34db40@linux.alibaba.com> Hi Jussi, On 5/12/22 1:14 AM, Jussi Kivilinna wrote: > Hello, > > On 10.5.2022 15.49, Tianjia Zhang via Gcrypt-devel wrote: >> * cipher/asm-common-aarch64.h: Use same macro GET_DATA_POINTER() for >> linux and windows. >> -- >> >> For multiple labels defined in the same consts object on Linux, >> except the first label, an error result will be obtained when taking >> the address of other labels through GET_DATA_POINTER(). The error >> address is the same as the first label. An erroneous code fragment after >> compilation is as follows: >> >> 0x0000fffff7f18be4 <+12>: adrp x6, 0xfffff7fb8000 >> 0x0000fffff7f18be8 <+16>: ldr x6, [x6, #3216] >> 0x0000fffff7f18bec <+20>: ld1b {z24.b}, p0/z, [x6] >> 0x0000fffff7f18bf0 <+24>: adrp x6, 0xfffff7fb8000 >> 0x0000fffff7f18bf4 <+28>: ldr x6, [x6, #3216] >> 0x0000fffff7f18bf8 <+32>: ldr p1, [x6] >> >> This patch fixes this problem by using the ADRP/ADD instruction. >> >> Signed-off-by: Tianjia Zhang >> --- >> cipher/asm-common-aarch64.h | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h >> index d3f7801c7661..b29479a32157 100644 >> --- a/cipher/asm-common-aarch64.h >> +++ b/cipher/asm-common-aarch64.h >> @@ -33,7 +33,7 @@ >> #define GET_DATA_POINTER(reg, name) \ >> adrp reg, name at GOTPAGE ; \ >> add reg, reg, name at GOTPAGEOFF ; >> -#elif defined(_WIN32) >> +#elif defined(__linux__) || defined(_WIN32) >> #define GET_DATA_POINTER(reg, name) \ >> adrp reg, name ; \ >> add reg, reg, #:lo12:name ; > > It might be better to rename GET_DATA_POINTER to GET_LOCAL_POINTER, > remove the ifdefs and just use: > #define GET_LOCAL_POINTER(reg, name) \ > adrp reg, name ; \ > add reg, reg, #:lo12:name ; > > The Apple variant is likely to be broken in the same way as the > Linux variant. If we later need support for external objects, > we can add GET_EXTERN_POINTER macro. > > -Jussi I agree. It's a good idea. I've tested the latest patch. Best regards, Tianjia