Proofreadable base64 (was Re: Printing Keys and using OCR.)
Casey Jones
groups at caseyljones.net
Tue May 29 13:01:55 CEST 2007
Peter S. May wrote:
> After this, the first part of the
> line is repeated, except it is as if it were filtered
> through the command:
>
> tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
>
> That is, for every "REGNADKCIN" that appears
> on the left side, there is
> a "H46D03A28D" on the right side.
That's a clever way of dramatically increasing the "uniqueness" of each
character and reducing ambiguity for the OCR. It would be useful for
both error detection and error correction. If it could be integrated
into the OCR engine itself, it would be even more effective. Although
Gallager (LDPC) or Turbo codes would give much better error correction
for a given amount of storage space, your method would be way easier
to implement.
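Just to make the idea concrete, here is a rough sketch in Python of
how I picture the duplicated-line check working (the two alphabet
strings just spell out your tr mapping; the line width and function
names are my own invention, not anything from your tool):

    import base64

    SRC = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
    DST = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+/=abcdefghijklmnopqrstuvwxyz"
    TRANSLATE = str.maketrans(SRC, DST)

    def printable_lines(data, width=40):
        """Yield 'original  translated' pairs, one per printed line."""
        text = base64.b64encode(data).decode("ascii")
        for i in range(0, len(text), width):
            chunk = text[i:i + width]
            yield chunk + "  " + chunk.translate(TRANSLATE)

    def pair_consistent(left, right):
        """After OCR, the halves should still agree under the mapping."""
        return left.translate(TRANSLATE) == right

    for line in printable_lines(b"example key material"):
        print(line)

A character the OCR misreads in either half will almost certainly
break the correspondence between the two, which is what makes the
scheme so useful for detection.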
I'm leaning strongly against base64. There are just too many characters
that can be easily confused. Base32 would be nearly as dense (5 bits
instead of 6, per char) and would allow many tough characters to be left
out. A simple conversion chart for base32 chars could take up just one
line at the bottom of the page. The conversion to base32 and back would
be very easy. Selecting the 32 unambiguous characters to use as the
symbol set would require some care, though, and probably some testing
to find out which symbols OCR programs get wrong most often.
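As a sketch of how simple the conversion could be (the 32-symbol
alphabet below is only my guess at an OCR-friendly set; the real one
would have to come out of that testing):

    # Hypothetical symbol set: I, O, 0 and 1 left out to dodge the
    # obvious lookalikes; testing would refine this choice.
    SYMBOLS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"

    def encode_base32(data):
        """Pack bytes into 5-bit groups, one symbol per group."""
        bits = "".join(format(byte, "08b") for byte in data)
        bits += "0" * (-len(bits) % 5)   # pad to a multiple of 5 bits
        return "".join(SYMBOLS[int(bits[i:i + 5], 2)]
                       for i in range(0, len(bits), 5))

    def decode_base32(text, nbytes):
        """Reverse of encode_base32; nbytes is the original byte count."""
        bits = "".join(format(SYMBOLS.index(ch), "05b") for ch in text)
        return bytes(int(bits[i:i + 8], 2)
                     for i in range(0, nbytes * 8, 8))

The one-line conversion chart at the bottom of the page could then
literally just be the SYMBOLS string.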
> ...this wouldn't be the first time this sort of thing were done.
The only thing I've found similar is the Centinel Data Archiving Project.
http://www.cedarcreek.umn.edu/tools/t1003.html
The pdf file is a much clearer explanation than the other two.
Centinel seems to be just an error-detecting code at the beginning of
each line. That might be good enough, but I'm starting to think that
some error correction would be highly desirable. Even a little error
correction could be a huge advantage over error detection alone.
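I don't know Centinel's exact format, but per-line detection can be as
small as something like this (the check size and layout are just my
assumption, not theirs):

    import zlib

    def add_line_check(line):
        """Prefix a printed line with a 4-hex-digit check value."""
        check = zlib.crc32(line.encode("ascii")) & 0xFFFF
        return format(check, "04X") + " " + line

    def line_ok(printed):
        """Flag a line that was mistyped or mis-OCR'd (detection only)."""
        check, line = printed.split(" ", 1)
        expected = zlib.crc32(line.encode("ascii")) & 0xFFFF
        return int(check, 16) == expected

Detection like that only tells you which line to re-scan or re-type;
correction would let you recover the line without going back to the
paper at all.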
> For some reason the first example that jumps to mind is 8-to-10 coding
> as used in Serial ATA. I'm no electrical engineer, but by some
> intuition the encoding of an 8-bit word into an exactly equivalent
> 10-bit word with superior signal characteristics makes sense to me.
I think most error correction codes mix the code bits with the data
bits. I'd like to keep the data in separate blocks to make it easy for
humans to separate and decode it. Unfortunately, separating out the
error correction bits probably makes the code less robust. If we want
to intermix the error correction code, maybe we could include a note
at the bottom that says "the third, sixth, ninth, etc. columns and
rows are error correction data". We also don't need the feature of
hard drives and some signaling methods that ensures a good mixture of
ones and zeros in order to keep the signal oscillating. We can have
all zeros or all ones on paper if we want, with no signal-detection
problems.
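To make that kind of layout concrete, a toy version (far weaker than a
real Gallager/Turbo or Reed-Solomon code, and the grouping of two data
symbols per parity symbol is only my illustration) might look like:

    def add_parity(grid):
        """grid: rows of small ints (e.g. 5-bit symbol values); assumes
        an even number of rows, each of even length."""
        with_cols = []
        for row in grid:
            new_row = []
            for i in range(0, len(row), 2):
                a, b = row[i], row[i + 1]
                new_row += [a, b, a ^ b]   # every third column is parity
            with_cols.append(new_row)
        out = []
        for j in range(0, len(with_cols), 2):
            r1, r2 = with_cols[j], with_cols[j + 1]
            parity_row = [x ^ y for x, y in zip(r1, r2)]
            out += [r1, r2, parity_row]    # every third row is parity
        return out

A single bad symbol then fails exactly one row check and one column
check; the intersection says where it is, and the XOR of the rest of
its group says what it should have been.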
I was thinking about just using a normal typewriter-size font. But
then I realized that a font half the size would not only improve data
density, it would also leave room for extra error correction. A small
font with more error correction would probably be more reliable than
a large font with less.
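Back-of-the-envelope, with made-up page dimensions just to show the
shape of the trade-off:

    chars_large = 80 * 60    # assumed full-size font: 80 cols x 60 rows
    chars_small = 160 * 120  # roughly 4x after halving the font
    parity_fraction = 1 / 3  # e.g. a third of the page spent on parity
    data_small = chars_small * (1 - parity_fraction)
    print(data_small / chars_large)  # about 2.7x the data, plus correction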