Proofreadable base64 (was Re: Printing Keys and using OCR.)

Mon May 28 18:35:55 CEST 2007

Not meaning to kick a dead thread, but this whole conversation has
gotten me thinking about how to produce an effective variant of base64
for paper storage.  Base64 is an interesting solution because it fully
encodes raw data into printable characters.  Yet it was obviously not
designed for data on paper, at least initially, because several of the
glyphs it uses are easy to confuse with one another.

Correcting this wouldn't be the first time this sort of thing has been
done.  For some reason the first example that jumps to mind is the
8b/10b coding used in Serial ATA.  I'm no electrical engineer, but
intuitively, encoding each 8-bit word as an equivalent 10-bit word with
superior signal characteristics makes sense to me.

That said, the recipe for base64 is already well-known--each character
represents its 6-bit index in the string "A-Za-z0-9+/".  I really don't
think anyone wants to do too much messing with this elegant formula.
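
As a reminder of how that recipe works, here is a minimal Python sketch
that encodes one 3-byte group by index lookup.  The names ALPHABET and
encode_group are made up for illustration, and padding of short groups
is left out.

```python
import string

# The 64-character string "A-Za-z0-9+/" with its ranges expanded.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly three bytes as four base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("this sketch handles only full 3-byte groups")
    value = int.from_bytes(three_bytes, "big")  # 24 bits in total
    # Take the 24 bits six at a time, most significant first, and use
    # each 6-bit value as an index into the alphabet.
    return "".join(ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0))
```

For full 3-byte groups this should agree with the standard library's
base64.b64encode.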

I've come up with something which I haven't yet tried to implement but
which I think would be interesting to try.  Let's call it "proofreadable
base64".  It's not terribly efficient, but we're going for
recoverability more than efficiency.

It goes something like this:  We can assume that each line of our medium
is capable of relaying 76 relatively legible characters.  The first 32
are data in normal base64.  Then, there is a space and a CRC-24 as
specified in OpenPGP, written as four base64 characters.  Then, there
are two spaces.  After this, the first 37 characters of the line (data,
space, and CRC) are repeated, except as if they were filtered through
the command:

tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'

That is, for every "REGNADKCIN" that appears on the left side, there is
a "H46D03A28D" on the right side.
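
For anyone without tr at hand, the same filter can be sketched in
Python with a translation table.  The names SRC, DST, MIRROR and mirror
are made up here; the two alphabets are simply the tr arguments with
their ranges expanded.

```python
import string

# First tr argument: A-Z, a-z, 0-9, then + / =
SRC = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
# Second tr argument: 0-9, A-Z, + / =, then a-z
DST = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase

MIRROR = str.maketrans(SRC, DST)


def mirror(text: str) -> str:
    """Translate base64 text the way the tr command would.

    Characters outside the alphabet, such as spaces, pass through unchanged.
    """
    return text.translate(MIRROR)


print(mirror("REGNADKCIN"))  # H46D03A28D
```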

The output should be printed using a legible, fixed-width font in order
to preserve column alignment.
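
Putting the pieces together, a line encoder might look roughly like the
Python sketch below.  Several things here are assumptions rather than
part of the proposal: that the CRC covers only the 24 raw bytes on its
own line, that a short final line is padded with spaces to keep the
columns aligned, and all of the function names.  The CRC-24 constants
are the ones given in the OpenPGP specification.

```python
import base64
import string

# The two tr alphabets with their ranges expanded.
SRC = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
DST = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase
MIRROR = str.maketrans(SRC, DST)

BYTES_PER_LINE = 24  # 24 raw bytes encode to exactly 32 base64 characters


def crc24(data: bytes) -> int:
    """CRC-24 using the OpenPGP initial value and generator polynomial."""
    crc = 0xB704CE
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB
    return crc & 0xFFFFFF


def encode_line(chunk: bytes) -> str:
    """Build one 76-column line from up to 24 raw bytes."""
    # Assumption: a short final chunk is space-padded to 32 columns.
    data = base64.b64encode(chunk).decode("ascii").ljust(32)
    # Assumption: the CRC is computed over this line's raw bytes only.
    check = base64.b64encode(crc24(chunk).to_bytes(3, "big")).decode("ascii")
    left = f"{data} {check}"                    # 32 + 1 + 4 = 37 columns
    return f"{left}  {left.translate(MIRROR)}"  # 37 + 2 + 37 = 76 columns


def encode(payload: bytes):
    """Split the payload into 24-byte chunks and encode each as one line."""
    return [encode_line(payload[i:i + BYTES_PER_LINE])
            for i in range(0, len(payload), BYTES_PER_LINE)]
```

Calling encode() on the raw key material gives a list of strings, each
76 characters wide, ready to be printed in a fixed-width font.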

For our 137.5% increase in size (76 printed characters for every 32 of
base64), we've gotten a great deal of correctability.  Firstly, every
base64 character has effectively become a less ambiguous digraph in
this encoding.  It's probably easy for OCR to confuse 0, O, o, and Q in
base64, but the pairs 0/n, O/E, o/b, Q/G are far less ambiguous.
Secondly, an equivalently disambiguated CRC-24 on each line can tell a
program which lines need to be reexamined in the first place.  Combined
with the first property, this could go a long way in helping the
computer correct its own errors.  For example, if the CRC came up
incorrect and an invalid o/n pair appeared in the input, the program
would certainly try correcting it to a 0/n pair.
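
A checking program along these lines could be sketched in Python as
follows.  This is only one possible strategy, not a worked-out design:
for every column where the two halves disagree it tries both readings,
and it accepts the first combination whose CRC matches.  The function
names, the per-line CRC coverage, and the search limit are all
assumptions.

```python
import base64
import binascii
import itertools
import string

SRC = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
DST = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase
MIRROR = str.maketrans(SRC, DST)
UNMIRROR = str.maketrans(DST, SRC)  # undoes the tr filter


def crc24(data: bytes) -> int:
    """CRC-24 using the OpenPGP initial value and generator polynomial."""
    crc = 0xB704CE
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB
    return crc & 0xFFFFFF


def check(left: str) -> bool:
    """True if the CRC field (columns 33-36) matches the data field (0-31)."""
    try:
        raw = base64.b64decode(left[:32].rstrip(), validate=True)
        stored = base64.b64decode(left[33:37], validate=True)
    except (binascii.Error, ValueError):
        return False
    return crc24(raw) == int.from_bytes(stored, "big")


def repair(line: str, limit: int = 4096):
    """Return a corrected 37-column left half, or None if none is found."""
    left = line[:37]
    from_right = line[39:76].translate(UNMIRROR)
    # Where the halves agree there is one reading; where they disagree,
    # try the left reading first and then the right one.
    options = [[a] if a == b else [a, b] for a, b in zip(left, from_right)]
    candidates = itertools.product(*options)
    for candidate in itertools.islice(candidates, limit):
        text = "".join(candidate)
        if check(text):
            return text
    return None
```

With this approach an o/n pair is resolved without any table of
lookalike glyphs: the right-hand n translates back to 0, so 0 is the
second reading tried for that column.  It cannot help when both halves
of a pair were misread.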

Finally, in the event that this relatively simple checking mechanism is
forgotten, we can cover up the last three fields on each line (the CRC
and the two mirrored fields), scan the page, and try to read it in as
plain base64.  (That said, we could really also prepend the source of a
checking program to the printed output. :-)
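
In software, that fallback amounts to keeping only the first 32 columns
of each line.  A minimal Python sketch, assuming the scanned lines are
already available as strings and that decode_plain is just an
illustrative name:

```python
import base64


def decode_plain(lines):
    """Ignore everything past column 32 and decode the rest as plain base64."""
    # strip() drops any space padding on a short final line.
    text = "".join(line[:32].strip() for line in lines)
    return base64.b64decode(text)
```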

What does everyone think?

Thanks
PSM
