Proofreadable base64 (was Re: Printing Keys and using OCR.)

Casey Jones groups at caseyljones.net
Tue May 29 13:01:55 CEST 2007


Peter S. May wrote:
 > After this, the first part of the
 > line is repeated, except it is as if it were filtered
 > through the command:
 >
 > tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
 >
 > That is, for every "REGNADKCIN" that appears
 > on the left side, there is
 > a "H46D03A28D" on the right side.

That's a clever way of dramatically increasing the "uniqueness" of each 
character, which reduces ambiguity for the OCR. It would be useful for 
both error detection and error correction, and if it could be integrated 
into the OCR engine itself it would be even more effective. Although 
Gallager (LDPC) or Turbo codes would give much better error correction 
for a given amount of storage space, your method would be far easier to 
implement.
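
Just to make the idea concrete, here's a rough Python sketch (my own 
illustration, not Peter's exact tool) of using the rotated copy to 
cross-check OCR output:

import string

# the same mapping as  tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
SRC = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
DST = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase
ROTATE = str.maketrans(SRC, DST)

def make_line(chunk):
    # print the base64 data followed by its rotated twin
    return chunk + "  " + chunk.translate(ROTATE)

def suspects(left, right):
    # positions where the two OCR'd halves no longer agree
    return [i for i, (a, b) in enumerate(zip(left, right))
            if a.translate(ROTATE) != b]

print(make_line("REGNADKCIN"))                 # REGNADKCIN  H46D03A28D
print(suspects("REGNADKClN", "H46D03A28D"))    # [8]  (lowercase l misread for I)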

I'm leaning strongly against base64. There are just too many characters 
that can easily be confused. Base32 would be nearly as dense (5 bits per 
character instead of 6) and would allow many of the troublesome 
characters to be left out. A simple conversion chart for the base32 
characters could take up just one line at the bottom of the page, and 
the conversion to base32 and back would be very easy. Selecting the 32 
unambiguous characters to use as the symbol set would require some care, 
and maybe some testing to find out which symbols the OCR programs get 
wrong most often.
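
To give a rough idea of how simple the conversion is, here's a Python 
sketch with a made-up 32-character alphabet (digits 2-9 plus uppercase 
minus I and O, just a placeholder until testing picks the real set):

# Rough sketch of base32 with a custom symbol set.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"   # 32 hypothetical symbols

def encode(data):
    # pack the bytes into one big integer, then peel off 5 bits per symbol
    bits = len(data) * 8
    n = int.from_bytes(data, "big") << (-bits % 5)    # pad to a multiple of 5
    return "".join(ALPHABET[(n >> s) & 31]
                   for s in range((bits + 4) // 5 * 5 - 5, -1, -5))

def decode(text, length):
    n = 0
    for ch in text:
        n = (n << 5) | ALPHABET.index(ch)
    return (n >> (-length * 8 % 5)).to_bytes(length, "big")   # drop the padding

key = bytes(range(16))            # stand-in for real key material
line = encode(key)
print(line, decode(line, len(key)) == key)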

> ...this wouldn't be the first time this sort of thing were done.
The only similar thing I've found is the Centinel Data Archiving Project:
http://www.cedarcreek.umn.edu/tools/t1003.html
The PDF file there is a much clearer explanation than the other two 
documents. Centinel seems to be just an error-detecting code at the 
beginning of each line. That might be good enough, but I'm starting to 
think that some error correction would be highly desirable. Even a 
little error correction could be a huge advantage over detection alone.
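
For comparison, here's what detection-only looks like (just an 
illustration, not Centinel's actual format): a check value at the start 
of each line tells you the line was misread, but gives no help fixing it.

import zlib

def emit(line_bytes):
    # a CRC-32 in hex at the start of the printed line
    return "%08X %s" % (zlib.crc32(line_bytes), line_bytes.decode())

def verify(printed):
    crc, text = printed.split(" ", 1)
    return int(crc, 16) == zlib.crc32(text.encode())

good = emit(b"REGNADKCIN")
print(verify(good))               # True
print(verify(good[:-1] + "M"))    # False: detected, but no clue where or what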

>  For some reason the first example that jumps to mind is 8-to-10 coding
> as used in Serial ATA.  I'm no electrical engineer, but by some
> intuition the encoding of an 8-bit word into an exactly equivalent
> 10-bit word with superior signal characteristics makes sense to me.

I think most error correction codes mix the code bits in with the data 
bits. I'd like to keep the data in separate blocks to make it easy for 
humans to separate and decode it, although separating out the error 
correction bits probably makes the code less robust. If we do want to 
intermix the error correction code, maybe we could include a note at the 
bottom that says "the third, sixth, ninth, etc. columns and rows are 
error correction data". We also don't need the feature of hard drives 
and some signaling methods that ensures a good mixture of ones and zeros 
to keep the signal oscillating; on paper we can have all zeros or all 
ones if we want, with no signal detection problems.
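
Here's a toy example of keeping the check data in its own blocks (my own 
illustration, not any particular standard code): XOR parity over every 
row and column of a data grid. A single bad byte shows up as exactly one 
mismatched row and one mismatched column, so it can be located and fixed.

from functools import reduce

def parity(grid):
    rows = [reduce(lambda a, b: a ^ b, row) for row in grid]
    cols = [reduce(lambda a, b: a ^ b, col) for col in zip(*grid)]
    return rows, cols

def repair(grid, rows, cols):
    new_rows, new_cols = parity(grid)
    bad_r = [i for i, (a, b) in enumerate(zip(rows, new_rows)) if a != b]
    bad_c = [j for j, (a, b) in enumerate(zip(cols, new_cols)) if a != b]
    if len(bad_r) == 1 and len(bad_c) == 1:          # exactly one bad byte
        r, c = bad_r[0], bad_c[0]
        grid[r][c] ^= rows[r] ^ new_rows[r]          # flip the wrong bits back
    return grid

data = [[0x52, 0x45, 0x47], [0x4E, 0x41, 0x44], [0x4B, 0x43, 0x49]]
rows, cols = parity(data)
data[1][2] ^= 0x20                               # simulate an OCR misread
print(repair(data, rows, cols)[1][2] == 0x44)    # True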

I was thinking about just using a normal typewriter-size font, but then 
I realized that if we use a font half that size, it would not only 
improve data density but also leave room for extra error correction. A 
small font with more error correction would probably be more reliable 
than a large font with less.
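
Some back-of-envelope numbers (assuming a classic 80x60 typewriter page; 
the real figures depend on the printer and margins):

full_size   = 80 * 60              # ~4800 chars per page at typewriter size
half_size   = 160 * 120            # ~19200 chars with half the height and width
with_parity = half_size * 2 // 3   # even giving a third of those to parity...
print(full_size, with_parity)      # ...leaves 12800 data chars vs 4800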


