Differentiating GPG data from random data

Tue Nov 25 05:21:36 CET 2008

On Nov 24, 2008, at 5:19 PM, Ted wrote:

> Hi,
>
> Hope this is not off-topic here.
>
> I'm writing a program that searches for files that are made up of
> random data. GPG data (that is not ascii armored) is consistently
> identified by the program. That's expected as GPG data is very random.
> However, even though GPG data passes the random tests, I'm not
> interested in finding GPG encrypted files, so I thought I would write
> a routine to exclude these files based on the first few bytes of the
> file, but I'm not comfortable with doing that. It's not ideal, but
> seems to work OK. Basically I'm skipping random data files that have
> certain bytes in the beginning like so:
>
> Symmetric:
> Hex(8c 0d 04 03)
> Dec(140 13 4 3)
>
> Asymmetric:
> Hex(85 02 0e 03)
> Dec(133 2 14 3)
>
> This works well in informal testing on multiple systems running
> various versions of GPG, but after reading the RFCs I bet it will
> fail a lot in the real world. That's why I thought I might pose the
> question to this list. Is there a simple way to skip most GPG
> encrypted files without implementing 4880? It does not have to be
> perfect, but perhaps there is something better than what I have
> described above.

Those bytes will more-or-less work, but as you say won't catch  
everything.  In OpenPGP, the first few octets cover the length and  
type of the packet, so those bytes hardcode a particular length, which  
is probably not what you want.  For example, the "85 02 0e 03" from
your example is an old-style public-key encrypted session key packet
(85) with a two-octet length of 0x020e = 526 bytes, followed by the
version octet (03).  That will only match a particular key size.
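
To make the length point concrete, here is a minimal sketch of how the
leading octets decode under the RFC 4880 packet header rules.  The
function name parse_header and its return shape are my own choices for
illustration, not anything GPG provides.

```python
def parse_header(data: bytes):
    """Decode the OpenPGP packet header at the start of data.

    Returns (format, tag, body_length). body_length is None when the
    header uses an indeterminate (old format) or partial (new format)
    length. parse_header is a hypothetical helper, not a GPG API.
    """
    if len(data) < 2 or not data[0] & 0x80:
        raise ValueError("not an OpenPGP packet header")
    first = data[0]

    if first & 0x40:
        # New format: tag is the low 6 bits, length follows in 1, 2 or 5 octets.
        tag = first & 0x3F
        o1 = data[1]
        if o1 < 192:
            return "new", tag, o1
        if o1 < 224:
            if len(data) < 3:
                raise ValueError("truncated header")
            return "new", tag, ((o1 - 192) << 8) + data[2] + 192
        if o1 == 255:
            if len(data) < 6:
                raise ValueError("truncated header")
            return "new", tag, int.from_bytes(data[2:6], "big")
        return "new", tag, None  # partial body length

    # Old format: tag is bits 5-2, length type is bits 1-0.
    tag = (first >> 2) & 0x0F
    length_type = first & 0x03
    if length_type == 3:
        return "old", tag, None  # indeterminate length
    size = 1 << length_type  # 1, 2 or 4 length octets
    if len(data) < 1 + size:
        raise ValueError("truncated header")
    return "old", tag, int.from_bytes(data[1:1 + size], "big")


# The two prefixes from the original question:
print(parse_header(bytes.fromhex("85020e03")))  # asymmetric example
print(parse_header(bytes.fromhex("8c0d0403")))  # symmetric example
```

Tag 1 is the public-key encrypted session key packet and tag 3 is the
symmetric-key encrypted session key packet, so both prefixes pin down
a tag, a header format, and an exact body length all at once.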

The problem is that OpenPGP has so many different ways to encode a
particular packet that writing a rule loose enough to match them all
will inevitably have a huge number of false positives.  For example,
hex 84, 85, 86, and C1 can all indicate an asymmetrically encrypted
message.  85 is the most common (and 84 would be extremely uncommon),
but they are all possible.  Some OpenPGP programs start the message
with a marker packet instead: A8, A9, AA, or CA (though it is
virtually always A8).  GPG will read such a message, but doesn't
generate it.
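
One way to express the looser rule is to match on the packet tag in
the first octet rather than on fixed bytes.  This is only a sketch:
the names LEADING_TAGS, packet_tag and looks_like_openpgp are
hypothetical, and the choice of tags (1 = public-key session key,
3 = symmetric-key session key, 10 = marker) is my assumption about
what an encrypted message usually starts with.

```python
# Tags assumed to open an encrypted OpenPGP message (an assumption,
# not an exhaustive list): 1 = PKESK, 3 = SKESK, 10 = marker packet.
LEADING_TAGS = {1, 3, 10}


def packet_tag(first_octet: int):
    """Return the packet tag encoded in the first octet, or None."""
    if not first_octet & 0x80:
        return None  # bit 7 must be set in every OpenPGP packet header
    if first_octet & 0x40:
        return first_octet & 0x3F  # new format: low 6 bits
    return (first_octet >> 2) & 0x0F  # old format: bits 5-2


def looks_like_openpgp(data: bytes) -> bool:
    """Loose first-octet test; expect false positives on random data."""
    return bool(data) and packet_tag(data[0]) in LEADING_TAGS


# Count how many of the 256 possible first octets this rule accepts.
accepted = sum(looks_like_openpgp(bytes([b])) for b in range(256))
print(accepted, "of 256 first octets match")
```

A rule like this accepts 15 of the 256 possible first octets, so
roughly 6% of genuinely random files would be skipped by mistake.
Matching an exact four-byte prefix goes the other way: almost no
random file is skipped, but many real GPG files are missed.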

For your purpose, is it better to have false positives or false
negatives?  That is, is it better to accidentally include some GPG
files, or better to accidentally exclude some non-GPG files?  That
would help in figuring out how many bytes you want to match on.

David