[PATCH] MPI helper of table lookup, Least Leak Intended

Wed Feb 19 02:03:31 CET 2025

On 2/17/25 23:30, NIIBE Yutaka wrote:
> Hello,
>
> Thank you for your comments.

You are welcome.

> Jacob Bachmeyer <jcb62281 at gmail.com> wrote:
> [...]
>
>> There might also be architecture-specific instructions that can be used
>> to retrieve a table row without polluting the data cache; allowing
>> architecture-specific overrides here could make a very significant
>> performance difference, as the basic implementation could easily flush
>> the entire data cache if used on a large table.
>>
>> For the base case, reading the entire table is probably the best that
>> you can do, but if you have a "load without temporal locality"
>> instruction (I believe that there are such instructions in SSE, for
>> example), you can avoid the problem, while accessing only a single table
>> row.  (The memory bus is assumed to not be visible to an attacker.)
> Ah, I didn't consider that.
>
> IIUC, you mean something like _mm*_stream_load_si* functions in the
> Intel Intrinsics Guide (to access an entry in a table).
>
>      https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#cats=Load

Those, and analogous instructions on other architectures.
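
As a rough illustration of what such an override could look like on
x86, here is a minimal sketch using _mm_stream_load_si128 (SSE4.1).
The function name copy_row_streaming is hypothetical and this is not
libgcrypt code. It assumes that the row is 16-byte aligned and that
its length is a multiple of 16 bytes, and it falls back to memcpy when
SSE4.1 is not enabled at compile time. One caveat: as far as I know,
Intel documents the non-temporal hint of this load as applying to
write-combining memory, so on ordinary write-back memory it may behave
like a normal load; whether it helps here would need to be measured.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE4_1__)
# include <smmintrin.h>
#endif

/* Hypothetical helper: copy one table row of LEN bytes from SRC to DST.
   Assumes SRC is 16-byte aligned and LEN is a multiple of 16.  */
static void
copy_row_streaming (void *dst, const void *src, size_t len)
{
#if defined(__SSE4_1__)
  size_t i;

  for (i = 0; i < len; i += 16)
    {
      /* Load with a non-temporal hint; the store is an ordinary
         unaligned store.  */
      __m128i v = _mm_stream_load_si128 ((__m128i *) ((const char *) src + i));
      _mm_storeu_si128 ((__m128i *) ((char *) dst + i), v);
    }
#else
  /* Portable fallback: an ordinary copy, without any cache hint.  */
  memcpy (dst, src, len);
#endif
}
```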

> That is interesting to try, and it could be effective when table is
> larger and read-only.  (But when table is larger than a page,
> it might be a target of TLB flush attack to determine which page.)
>
> Note that in this particular case of the modular exponentiation, the
> table size is typically 4 Ki-byte and the entry size is 256-byte.  The
> table is computed in _gcry_mpih_powm_lli before the loop which uses the
> table.

So, for the initial case, simply ensure that the table is page-aligned:
since the table is exactly one page, it then cannot straddle a page
boundary and the TLB leak is avoided.
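
A minimal sketch of that alignment, assuming 4 KiB pages and a C11
library that provides aligned_alloc. The names ASSUMED_PAGE_SIZE,
alloc_table_one_page and fits_one_page are made up for illustration;
real code would query the page size from the system and would probably
want secure memory rather than plain malloc-style memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 4 KiB pages, matching the table size mentioned above.  */
#define ASSUMED_PAGE_SIZE ((size_t) 4096)

/* Hypothetical helper: allocate a one-page table that starts on a page
   boundary.  Returns NULL on failure; release with free().  */
static void *
alloc_table_one_page (void)
{
  return aligned_alloc (ASSUMED_PAGE_SIZE, ASSUMED_PAGE_SIZE);
}

/* Return 1 if the SIZE bytes at P all lie within a single page of
   PAGE bytes, 0 otherwise.  */
static int
fits_one_page (const void *p, size_t size, size_t page)
{
  uintptr_t a = (uintptr_t) p;

  return size != 0 && (a / page) == ((a + size - 1) / page);
}
```

A check like fits_one_page could go into an assertion where the table
is set up, so that a misaligned table is caught early.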

A table of exactly one page is also very unlikely to have a significant
impact on the effectiveness of the data cache, since every modern
processor that I have seen has an L1 data cache much larger than 4 KiB.

I believe that there are also prefetch instructions that would force TLB
entries to be allocated; alternatively, "non-temporal" accesses might
have their own TLB.

> For now, let me apply and push _gcry_mpih_lookup_lli, and
> possible improvement will be done in future.

You should probably add comments noting the caveat that the LLI table
select will read the entire table into the data cache, which is fine for
very small tables but could be very slow if used with a larger table.

This (holding the entire table in the data cache) might actually be 
beneficial, since it avoids the possibility of cache misses introducing 
timing leaks.

Just be sure to document the caveat that the table size must be 
carefully limited.
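
To make the caveat concrete, here is a minimal sketch of a table select
that reads every row. It is not the actual _gcry_mpih_lookup_lli; the
name lookup_all_rows and the use of uint64_t limbs are assumptions made
for illustration. The cost is proportional to the whole table, which is
why the table size has to stay small.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy row IDX of TABLE (NROWS rows of ROWLEN
   limbs each) into OUT.  Every row is read, so the memory access
   pattern does not depend on IDX.  */
static void
lookup_all_rows (uint64_t *out, const uint64_t *table,
                 size_t nrows, size_t rowlen, size_t idx)
{
  size_t i, j;

  for (j = 0; j < rowlen; j++)
    out[j] = 0;

  for (i = 0; i < nrows; i++)
    {
      /* MASK is all ones when i == idx and all zeros otherwise,
         computed without a data-dependent branch.  */
      uint64_t diff = (uint64_t) (i ^ idx);
      uint64_t mask = ((diff | (0 - diff)) >> 63) - 1;

      for (j = 0; j < rowlen; j++)
        out[j] |= table[i * rowlen + j] & mask;
    }
}
```

Whether a compiler keeps this branch-free is not guaranteed, so the
generated code would still need to be checked.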

-- Jacob