Discrepancies in extracted photo-id images from dumps

Daniel Kahn Gillmor dkg at fifthhorseman.net
Sat Jan 19 17:23:33 CET 2019


On Sat 2019-01-19 17:10:38 +0100, Stefan Claas wrote:
> Now i wonder why i have such high discrepancies in the numbers?

jpegextractor looks like it uses a simple heuristic to find jpegs.

in particular (quoting from
https://www.digiater.nl/openvms/decus/vmslt02a/net/jpeg-extractor.html):

     jpegextractor uses the fact that valid binary JPEG streams start
     with the byte sequence ff d8 ff and end with the byte sequence ff
     d9. It copies all of those streams to new files. As jpegextractor
     simply looks for the two sequences it does not have to know the
     format of the encapsulating file and thus works with all formats
     that embed JPEG streams.
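
To make that heuristic concrete, here is a minimal sketch of the kind of scan the quoted text describes. It is not jpegextractor's actual code; the function name `find_jpeg_candidates` and the choice to resume scanning after each end marker are assumptions made for illustration.

```python
def find_jpeg_candidates(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) offsets of spans that merely *look* like JPEG streams.

    Hypothetical illustration of the marker-based heuristic, not jpegextractor itself.
    """
    soi = b"\xff\xd8\xff"  # start sequence named in the quoted description
    eoi = b"\xff\xd9"      # end sequence named in the quoted description
    spans = []
    pos = 0
    while True:
        start = data.find(soi, pos)
        if start == -1:
            break
        end = data.find(eoi, start + len(soi))
        if end == -1:
            break
        spans.append((start, end + len(eoi)))
        pos = end + len(eoi)  # assumption: continue after the matched end marker
    return spans
```

Nothing in such a scan checks that the bytes between the two markers decode as an image, so any random occurrence of the start sequence produces a candidate file.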

consider that a lot of OpenPGP key material is high-entropy -- public
keys, cryptographic signatures, etc are all essentially random bytes.
hand-wavy approximations follow; i'd be happy if someone wants to make
them more rigorous.

If we look at runs of three consecutive octets in uniformly random
data, any specific triplet should appear roughly once every 2^(8*3) ==
16777216 triplets, and any specific pair of octets will appear roughly
once every 2^(8*2) == 65536 pairs.


So about every 16 million octets of high-entropy data, you'll find that
starting "ff d8 ff" triplet, and much more frequently you'll find the
ending "ff d9" pair.

So assuming that the bulk of a 1GiB dump is high-entropy data *with no
actual JPEGs in it at all*, you should expect to see jpegextractor have
roughly 1GiB/16MiB == 64 false positive matches.
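
The arithmetic above can be checked with a few lines of Python. This is a rough sketch that assumes the octets are independent and uniformly random, which real dumps only approximate; the helper name `expected_marker_hits` is made up for this example.

```python
def expected_marker_hits(n_octets: int, marker_len: int) -> float:
    """Expected occurrences of one specific marker in uniformly random octets.

    Assumes every octet is independent and uniform (a hand-wavy model).
    """
    # Number of positions at which a marker of this length could begin.
    positions = max(n_octets - marker_len + 1, 0)
    # Each position matches with probability 1 / 256**marker_len.
    return positions / 256 ** marker_len


GIB = 2 ** 30
starts = expected_marker_hits(GIB, 3)  # the "ff d8 ff" triplet
ends = expected_marker_hits(GIB, 2)    # the "ff d9" pair
print(f"expected 'ff d8 ff' starts in 1GiB: {starts:.1f}")
print(f"expected 'ff d9' ends in 1GiB: {ends:.1f}")
```

Under that model the start triplet is expected about 64 times per GiB and the end pair about 16384 times, so almost every false start will find some end marker after it.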

that doesn't quite add up to the number of extras that you're seeing
from jpegextractor, but it suggests that there will be a large number of
false positives by that mechanism at any rate.

have you tried looking at the jpegs that jpegextractor produces?

     --dkg