<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 2/17/25 23:30, NIIBE Yutaka wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:87wmdnlsza.fsf@haruna.fsij.org">

      <pre wrap="" class="moz-quote-pre">Hello,

Thank you for your comments.</pre>

    </blockquote>

    <p>You are welcome.</p>

    <blockquote type="cite" cite="mid:87wmdnlsza.fsf@haruna.fsij.org">

      <pre wrap="" class="moz-quote-pre">Jacob Bachmeyer <a class="moz-txt-link-rfc2396E" href="mailto:jcb62281@gmail.com">&lt;jcb62281@gmail.com&gt;</a> wrote:

</pre>

      <pre wrap="" class="moz-quote-pre">[...]

</pre>

      <blockquote type="cite">

        <pre wrap="" class="moz-quote-pre">There might also be architecture-specific instructions that can be used 

to retrieve a table row without polluting the data cache; allowing 

architecture-specific overrides here could make a very significant 

performance difference, as the basic implementation could easily flush 

the entire data cache if used on a large table.

For the base case, reading the entire table is probably the best that 

you can do, but if you have a "load without temporal locality" 

instruction (I believe that there are such instructions in SSE, for 

example), you can avoid the problem, while accessing only a single table 

row.  (The memory bus is assumed to not be visible to an attacker.)

</pre>

      </blockquote>

      <pre wrap="" class="moz-quote-pre">

Ah, I didn't consider that.

IIUC, you mean something like _mm*_stream_load_si* functions in the

Intel Intrinsics Guide (to access an entry in a table).

    <a class="moz-txt-link-freetext" href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#cats=Load">https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#cats=Load</a></pre>

    </blockquote>

    <p>Those, and analogous instructions on other architectures.</p>

    <blockquote type="cite" cite="mid:87wmdnlsza.fsf@haruna.fsij.org">

      <pre wrap="" class="moz-quote-pre">That is interesting to try, and it could be effective when table is

larger and read-only.  (But when table is larger than a page,

it might be a target of TLB flush attack to determine which page.)

Note that in this particular case of the modular exponentiation, the

table size is typically 4 Ki-byte and the entry size is 256-byte.  The

table is computed in _gcry_mpih_powm_lli before the loop which uses the

table.</pre>

    </blockquote>

    <p>So, for the initial case, simply ensure that the table is
      page-aligned. Assuming 4 KiB pages, the 4 KiB table then occupies
      exactly one page, so the TLB cannot leak which part of the table
      was accessed.</p>
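    <p>As a rough illustration of the alignment step, here is a minimal
      sketch in C. Everything in it is hypothetical: TABLE_PAGE_SIZE,
      align_to_page and within_one_page are names invented for this
      message, the 4 KiB page size is an assumption, and the cast
      assumes that unsigned long can hold a pointer (as on typical LP64
      systems). It is not what libgcrypt does today.</p>

    <pre>
```c
/* Hypothetical sketch: the page size is assumed to be 4 KiB. */
#define TABLE_PAGE_SIZE 4096ul

/* Round RAW up to the next page boundary.  The caller must have
   over-allocated by at least TABLE_PAGE_SIZE - 1 bytes so that the
   aligned table still fits inside the allocation. */
static unsigned char *
align_to_page (unsigned char *raw)
{
  /* Assumes unsigned long is wide enough to hold a pointer. */
  unsigned long misalign = (unsigned long) raw % TABLE_PAGE_SIZE;

  return raw + (TABLE_PAGE_SIZE - misalign) % TABLE_PAGE_SIZE;
}

/* Return nonzero if the LEN bytes starting at P lie within one page. */
static int
within_one_page (const unsigned char *p, unsigned long len)
{
  unsigned long first = (unsigned long) p / TABLE_PAGE_SIZE;
  unsigned long last = ((unsigned long) p + len - 1) / TABLE_PAGE_SIZE;

  return first == last;
}
```
    </pre>

    <p>The second helper is only meant as a debugging check that the
      table really does not straddle a page boundary.</p>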

    <p>A single page is also very unlikely to have a significant impact
      on the effectiveness of the data cache, since every modern
      processor that I have seen has an L1 data cache much larger than
      4 KiB.</p>

    <p>For tables larger than one page, I believe that there are also
      prefetch instructions that would force TLB entries to be
      allocated; alternatively, "non-temporal" accesses might have
      their own TLB.</p>

    <blockquote type="cite" cite="mid:87wmdnlsza.fsf@haruna.fsij.org">

      <pre wrap="" class="moz-quote-pre">For now, let me apply and push _gcry_mpih_lookup_lli, and

possible improvement will be done in future.

</pre>

    </blockquote>

    <p>You should probably add a comment noting the caveat that the LLI
      table select reads the entire table into the data cache, which
      is fine for very small tables but could be very slow if used with
      a larger table.</p>
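    <p>To make the caveat concrete, here is a minimal sketch of a
      full-scan table select in C. It is not the actual
      _gcry_mpih_lookup_lli; limb_t and table_select are hypothetical
      names, and the code assumes 8-bit bytes. The point is only that
      every entry is read no matter which index is wanted, so the cost
      and the cache footprint grow with the whole table.</p>

    <pre>
```c
/* Hypothetical stand-in for the library's limb type. */
typedef unsigned long limb_t;

/* Copy entry IDX of TABLE into OUT.  TABLE holds N_ENTRIES entries of
   N_LIMBS limbs each.  Every entry is read regardless of IDX, so the
   memory access pattern does not depend on the (secret) index. */
static void
table_select (limb_t *out, const limb_t *table,
              unsigned int n_entries, unsigned int n_limbs,
              unsigned int idx)
{
  unsigned int i, j;

  for (j = 0; j != n_limbs; j++)
    out[j] = 0;

  for (i = 0; i != n_entries; i++)
    {
      /* diff is zero only when i equals idx. */
      limb_t diff = (limb_t) (i ^ idx);
      /* Top bit of (diff | -diff) is set exactly when diff is nonzero.
         Assumes 8-bit bytes. */
      limb_t nonzero = (diff | (0 - diff)) >> (sizeof (limb_t) * 8 - 1);
      /* All ones when i equals idx, all zeros otherwise. */
      limb_t mask = nonzero - 1;

      for (j = 0; j != n_limbs; j++)
        out[j] |= table[i * n_limbs + j] & mask;
    }
}
```
    </pre>

    <p>With the sizes mentioned in this thread (4 KiB table, 256-byte
      entries) and 64-bit limbs, that would be 16 entries of 32 limbs
      each. A compiler could in principle rewrite the masking into
      something less uniform, so the generated code would still need to
      be checked.</p>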

    <p>This (holding the entire table in the data cache) might actually

      be beneficial, since it avoids the possibility of cache misses

      introducing timing leaks.</p>

    <p>Just be sure to document the caveat that the table size must be

      carefully limited.<br>

    </p>

    <p><br>

    </p>

    <p>-- Jacob<br>

    </p>

  </body>

</html>