Validation of User ID with invalid (non UTF-8) encoding

Tue Apr 29 14:08:58 CEST 2014

On Tue, 29 Apr 2014 11:11, martijn.list at gmail.com said:

> Some keys stored on the public key servers have User IDs which seem to
> be encoded with a different encoding than UTF-8.

Right.  Old PGP versions didn't care about the requirement for UTF-8 and
used whatever encoding the terminal was configured for (e.g. Latin-1).
But that should only be a display problem.  See below for the code GPA
uses to detect and fix the display problem.

> $ gpg --check-sigs 0xA8364AC589C44886
> pub   1024D/89C44886 1999-09-30
> uid                  Lasse M\xberkedahl Larsen <lml at gr3.dk>
> sig!         89C44886 1999-09-30  Lasse M\xberkedahl Larsen <lml at gr3.dk>
> sub   2048g/0CA36EF9 1999-09-30
> sig!         89C44886 1999-09-30  Lasse M\xberkedahl Larsen <lml at gr3.dk>
>
> My own Java based tool however fails to validate this User ID, i.e., the
> calculated hash always returns a different value. Also PGP desktop

Note that the above output is for humans and has been sanitized to
inhibit attacks using ANSI control sequences.  To check the signature
you need to use the bare OpenPGP packets and not some gpg output.

I am not aware of any PGP problems with user ids - the verification uses
the data verbatim and is transparent to the encoding.
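
To make "verbatim" concrete: for a V4 certification signature, RFC 4880
(section 5.2.4) feeds the user ID into the hash as the octet 0xB4, a
four-octet big-endian length, and then the raw bytes of the User ID
packet body, with no re-encoding at any point.  The sketch below builds
only that byte sequence.  The helper name uid_hash_input is hypothetical,
and a complete verification also hashes the key material before this
part and the signature trailer after it (V3 certifications hash the
user ID bytes without the five-octet prefix).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the user ID portion of the hash input
   for a V4 certification into OUT, which must have room for
   UIDLEN + 5 bytes.  UID is the raw User ID packet body, copied
   byte for byte regardless of its character encoding.  Returns the
   number of bytes written.  */
static size_t
uid_hash_input (const unsigned char *uid, size_t uidlen, unsigned char *out)
{
  out[0] = 0xb4;                                   /* User ID marker.  */
  out[1] = (unsigned char)((uidlen >> 24) & 0xff); /* Length, big endian.  */
  out[2] = (unsigned char)((uidlen >> 16) & 0xff);
  out[3] = (unsigned char)((uidlen >> 8) & 0xff);
  out[4] = (unsigned char)(uidlen & 0xff);
  memcpy (out + 5, uid, uidlen);                   /* Bytes as stored.  */
  return uidlen + 5;
}
```

A tool that decodes the user ID into a native string and encodes it
again before hashing will produce different bytes for a non-UTF-8 user
ID, which would explain a hash mismatch.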

Shalom-Salam,

   Werner

====
#include <string.h>
#include <glib.h>

/* Return the user ID, making sure it is properly UTF-8 encoded.
   Allocates a new string, which must be freed with g_free ().  */
static gchar *
string_to_utf8 (const gchar *string)
{
  const char *s;

  if (!string)
    return NULL;

  /* Due to a bug in old and not so old PGP versions user IDs have
     been copied verbatim into the key.  Thus many users with Umlauts
     et al. in their name will see their names garbled.  Although this
     is not an issue for me (;-)), I have a couple of friends with
     Umlauts in their name, so let's try to make their life easier by
     detecting invalid encodings and converting them from Latin-1.  We use
     this even for X.509 because it may make things even better given
     all the invalid encodings often found in X.509 certificates.  */
  for (s = string; *s && !(*s & 0x80); s++)
    ;
  if (*s && ((s[1] & 0xc0) == 0x80) && ( ((*s & 0xe0) == 0xc0)
                                         || ((*s & 0xf0) == 0xe0)
                                         || ((*s & 0xf8) == 0xf0)
                                         || ((*s & 0xfc) == 0xf8)
                                         || ((*s & 0xfe) == 0xfc)) )
    {
      /* Possible utf-8 character followed by continuation byte.
         Although this might still be Latin-1 we better assume that it
         is valid utf-8. */
      return g_strdup (string);
    }
  else if (*s && !strchr (string, 0xc3))
    {
      /* No 0xC3 character in the string; assume that it is Latin-1.  */
      return g_convert (string, -1, "UTF-8", "ISO-8859-1", NULL, NULL, NULL);
    }
  else
    {
      /* Everything else is assumed to be UTF-8.  We do this even though
         we know the encoding is not valid.  However, as we only test
         the first non-ASCII character, valid encodings might
         follow.  */
      return g_strdup (string);
    }
}
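
The heuristic above inspects only the first non-ASCII byte.  A stricter
variant, sketched here in plain C without GLib, validates the whole
string and falls back to a Latin-1 conversion only when validation
fails.  This is not GPA's code: the names is_valid_utf8, latin1_to_utf8
and user_id_to_utf8 are hypothetical, and the result must be released
with free () rather than g_free ().

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 if the NUL-terminated string S is well-formed UTF-8.  */
static int
is_valid_utf8 (const unsigned char *s)
{
  while (*s)
    {
      int i, n;          /* N is the number of continuation bytes.  */
      unsigned long cp;  /* Decoded code point.  */

      if (*s < 0x80)
        {
          s++;
          continue;
        }
      else if ((*s & 0xe0) == 0xc0) { n = 1; cp = *s & 0x1f; }
      else if ((*s & 0xf0) == 0xe0) { n = 2; cp = *s & 0x0f; }
      else if ((*s & 0xf8) == 0xf0) { n = 3; cp = *s & 0x07; }
      else
        return 0;        /* Stray continuation or invalid lead byte.  */

      s++;
      for (i = 0; i < n; i++, s++)
        {
          if ((*s & 0xc0) != 0x80)
            return 0;    /* Also catches a premature NUL.  */
          cp = (cp << 6) | (*s & 0x3f);
        }

      /* Reject overlong forms, surrogates and values above U+10FFFF.  */
      if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800)
          || (n == 3 && cp < 0x10000) || cp > 0x10ffff
          || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;
    }
  return 1;
}

/* Convert the Latin-1 string STRING to a newly allocated UTF-8 string.  */
static char *
latin1_to_utf8 (const char *string)
{
  const unsigned char *s = (const unsigned char *)string;
  unsigned char *buf, *p;

  buf = malloc (2 * strlen (string) + 1);  /* Worst case: 2 bytes each.  */
  if (!buf)
    return NULL;
  for (p = buf; *s; s++)
    {
      if (*s < 0x80)
        *p++ = *s;
      else
        {
          *p++ = (unsigned char)(0xc0 | (*s >> 6));
          *p++ = (unsigned char)(0x80 | (*s & 0x3f));
        }
    }
  *p = 0;
  return (char *)buf;
}

/* Return a newly allocated UTF-8 version of STRING, assuming Latin-1
   whenever STRING is not valid UTF-8.  */
static char *
user_id_to_utf8 (const char *string)
{
  char *copy;

  if (!string)
    return NULL;
  if (!is_valid_utf8 ((const unsigned char *)string))
    return latin1_to_utf8 (string);
  copy = malloc (strlen (string) + 1);
  if (copy)
    strcpy (copy, string);
  return copy;
}
```

The same ambiguity remains as in the heuristic above: a Latin-1 string
whose bytes happen to form valid UTF-8 sequences is returned unchanged.
The conversion is for display only; signature checks still need the
original bytes.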

-- 
Thoughts are free.  Exceptions are regulated by a federal law.