Printing Keys and using OCR (was: Proofreadable base64)

Peter Palfrader peter at palfrader.org
Fri Sep 21 00:59:00 CEST 2007


On Mon, 28 May 2007, Peter S. May wrote:

> Not meaning to kick a dead thread

This must be a zombie by now :)

> I've come up with something which I haven't yet tried to implement but
> which I think would be interesting to try.  Let's call it "proofreadable
> base64".  It's not terribly efficient, but we're going for
> recoverability more than efficiency.
> 
> It goes something like this:  We can assume that each line of our medium
> is capable of relaying 76 relatively legible characters.  The first 32
> are data in normal base64.  Then, there is a space and a CRC-24 as
> specified in OpenPGP.  Then, there are two spaces.  After this, the
> first part of the line is repeated, except it is as if it were filtered
> through the command:
> 
> tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'

Nice idea.  When trying to find decent backup methods for my new Tor
identity key I came across this thread.

I played all day with OCR and friends.  In the process I wrote a small
script that does what you suggest.  I tried to keep it small enough to
print it along with whatever data you have - I clearly failed there.
But other than that it works nicely.
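
For anyone who wants to see the line layout without reading the whole
attachment, here is a rough shell sketch of how a single output line is
built.  It assumes GNU coreutils (base64, sha1sum, cut, tr) and only
handles text without NUL bytes.  encode_line is a name made up for this
example; the attached Perl script remains the actual implementation.
Note that the script uses the first 12 hex digits of SHA-1 where the
original proposal suggested CRC-24.

```shell
#!/bin/sh
# Sketch only: format one line in the same layout as the attached script.
# encode_line is a hypothetical helper, not part of that script.
# Usage: encode_line LINE_NUMBER CHUNK   (CHUNK: up to 18 bytes of text)
encode_line() {
  n=$1
  chunk=$2
  # plain base64 of the chunk (18 bytes become 24 characters)
  b64=$(printf '%s' "$chunk" | base64)
  # first 12 hex digits of the chunk's SHA-1
  hash=$(printf '%s' "$chunk" | sha1sum | cut -c1-12)
  # second copy of the base64 text, every character swapped for another
  tred=$(printf '%s' "$b64" | tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z')
  printf '%06d  %-24s  %s  %-24s\n' "$n" "$b64" "$hash" "$tred"
}

# Inverse of the tr mapping: turns the second copy back into base64.
untr() {
  tr '0-9A-Z+/=a-z' 'A-Za-z0-9+/='
}
```

Calling encode_line 1 'hello world' prints one line holding the index
000001, the base64 text, the hash prefix and the transformed copy.
Because the two copies never share a glyph at the same position, an OCR
confusion between two similar characters is unlikely to damage both
copies in the same way.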

I used the OCR-A font available from a CTAN[0] mirror near you to print the
output of my script.

Then I used gocr[1] (0.41-1 as shipped in Debian etch) to turn a scan back
into data.  That didn't work out so well at first - gocr had real
trouble distinguishing zeroes and the letter D like Delta.  Fortunately
gocr has an option to disable its internal recognition engine and
instead use a mode whereby it asks you about characters it doesn't
recognize - initially that's all of them - and writes that to a
database.  In the end it asked me for about 300 chars out of 8000 - most
of them at the beginning of the text - but produced the original text
with only a few mishaps, which were caught easily using the encoding
described above.
[maybe I should also try a more recent version of gocr]
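
How such mishaps get caught can be sketched in shell as follows.  This
is an illustration of the idea rather than the attached script itself:
check_line is a hypothetical helper, and it assumes GNU coreutils
(base64 -d, sha1sum, cut, tr).  A line is accepted if either of its two
copies decodes to data whose SHA-1 prefix matches the hash column.

```shell
#!/bin/sh
# Sketch only: verify one scanned line of the form
#   <index>  <base64>  <12 hex digits of SHA-1>  <transformed base64>
# check_line is a hypothetical helper name.
# Prints the base64 copy that matches the hash; returns 1 if neither does.
check_line() {
  # split the line into its four whitespace-separated columns
  read -r idx plain hash tred <<EOF
$1
EOF
  # map the transformed copy back to ordinary base64
  other=$(printf '%s' "$tred" | tr '0-9A-Z+/=a-z' 'A-Za-z0-9+/=')
  for copy in "$plain" "$other"; do
    got=$(printf '%s' "$copy" | base64 -d 2>/dev/null | sha1sum | cut -c1-12)
    if [ "$got" = "$hash" ]; then
      printf '%s\n' "$copy"
      return 0
    fi
  done
  return 1
}
```

A single misread character in one copy is therefore harmless as long as
the other copy on the same line was read correctly; only lines where
both copies are damaged need a look at the paper.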


If anybody wants to play with this, I uploaded my two scans to
http://asteria.noreply.org/~weasel/ocr/

To use gocr with database learning and its internal recognition engine
turned off (-m 256 switches off the engine, -m 130 uses and extends the
database), simply run:
  mkdir db; gocr -m 256 -m 130 -i 1.ppm -o 1.txt


I guess playing with encodings other than base64 might be the next step.
There was a strong point made for simply using base16, maybe with
different characters that play nicely with gocr using OCR-A.
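
A base16 variant with a substitute alphabet could look roughly like the
sketch below.  The 16 characters in ALPHABET are an arbitrary
placeholder and have not been tested against gocr or OCR-A; the function
names are likewise made up for this example.  It relies only on od, tr,
fold and printf.

```shell
#!/bin/sh
# Sketch only: base16 with a replaceable alphabet.
# ALPHABET is an arbitrary placeholder (16 distinct characters),
# not a set that has been tested with gocr or OCR-A.
ALPHABET='ACEFHKLMNPRTUWXY'

b16_encode() {
  # hex-dump stdin, strip od's spacing, then swap hex digits for ALPHABET
  od -An -v -tx1 | tr -d ' \n' | tr '0-9a-f' "$ALPHABET"
}

b16_decode() {
  # map back to hex digits, split into pairs, emit one byte per pair
  tr "$ALPHABET" '0-9a-f' | fold -w 2 |
  while read -r byte || [ -n "$byte" ]; do
    # convert the hex pair to a three-digit octal escape for printf
    printf "\\$(printf '%03o' "0x$byte")"
  done
}
```

The output is twice as long as the input instead of base64's four
thirds, but every character can be picked so that no two of them look
alike to the OCR engine.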



Optar[2] is another nice tool which I tried today.  While it does not
provide the "fallback to typing it all in" option it shows promise.
Using the default values I still had several bitflips after scanning in
the printout, though.  Future tests will probably include changing optar's
parameters to larger dots (I don't need 200kb per page), and maybe
preprocessing the data with par2[3].


Cheers,
Peter

0. http://www.ctan.org/
   http://www.ctan.org/cgi-bin/search.py?metadataSearch=ocr-a&metadataSearchSubmit=Search
1. http://packages.debian.org/gocr
   http://packages.debian.org/etch/gocr
   http://jocr.sourceforge.net/
2. http://ronja.twibright.com/optar/
3. http://www.par2.net/
-- 
                           |  .''`.  ** Debian GNU/Linux **
      Peter Palfrader      | : :' :      The  universal
 http://www.palfrader.org/ | `. `'      Operating System
                           |   `-    http://www.debian.org/
-------------- next part --------------
#!/usr/bin/perl

use strict;
use warnings;
use Digest::SHA1 qw(sha1_hex);
use MIME::Base64;

if (@ARGV != 1 ||
  $ARGV[0] !~ /^-[de]$/) {
  die "Usage: $0 -d|-e\n";
};

if ($ARGV[0] eq '-e') {
  # encoding.  not needed for decoding
  undef $/;
  my ($bytes, $totallength, $totalhash, $line);
  $bytes = <STDIN>;
  $totallength = length($bytes);
  $totalhash = sha1_hex($bytes);
  $line = 1;
  printf("<line>  <data in base64>  <first 12 chars  <base64 with tr\n");
  printf("                           of sha1 in hex>  'A-Za-z0-9+/=' '0-9A-Z+/=a-z'>\n");
  printf("-A-B-C-\n");
  while (length($bytes) > 0) {
    my ($this, $encoded, $tred, $hash);
    $this = substr($bytes, 0, 18, '');
    $encoded = encode_base64($this, '');
    ($tred = $encoded) =~ tr#A-Za-z0-9+/=#0-9A-Z+/=a-z#;
    $hash = substr( sha1_hex($this), 0, 12);
    printf("%06d  %-24s  %s  %-24s\n", $line++, $encoded, $hash, $tred);
  };
  printf("-A-B-C-\n");
  print("XXXXXX total length: $totallength\n");
  print("XXXXXX SHA1: $totalhash\n");
} else {
  # decoding
  my (@bytes, $line, $found_marker, $exit);
  $exit = 0;
  $line = 0;
  $found_marker = 0;
  while (<STDIN>) {
    chomp;
    if ($_ eq '-A-B-C-') {
      $found_marker = 1;
      last;
    };
  };
  unless ($found_marker) {
    die ("Did not find start marker '-A-B-C-' in input\n");
  };
  $found_marker = 0;

  while (<STDIN>) {
    $line++;
    chomp;
    if ($_ eq '-A-B-C-') {
      $found_marker = 1;
      last;
    };
    my ($l, $d, $h, $t, $t2, $decoded_d, $decoded_t, $hashd, $hasht, $bytes) = split;
    $bytes = '';
    ($t2 = $t) =~ tr#0-9A-Z+/=a-z#A-Za-z0-9+/=#;
    $decoded_d = decode_base64($d);
    $decoded_t = decode_base64($t2);
    $hashd = substr( sha1_hex($decoded_d), 0, 12);
    $hasht = substr( sha1_hex($decoded_t), 0, 12);

    if ($l != $line) { warn ("Line $line: wrong index $l\n"); };
    if (length($t2) != length($d)) {
      warn("Line $line: data copies have different length.\n");
    } elsif ($t2 ne $d) {
      warn("Line $line: data copies do not match.\n");
      for (my $i=0; $i<length($d); $i++) {
        my ($a, $b, $b2);
        $a = substr($d,$i,1);
        $b = substr($t,$i,1);
        $b2 = substr($t2,$i,1);
        warn("Line $line: character ".($i+1)." mismatch.  Think it is $a and $b.\n") if ($a ne $b2);
      };
    };
    if ($h ne $hashd) {
      warn("Line $line: hash does not match data (on the left, plain copy)\n");
    } else {
      $bytes = $decoded_d;
    };
    if ($h ne $hasht) {
      warn("Line $line: hash does not match data copy (on the right, transformed copy)\n");
    } else {
      $bytes = $decoded_t;
    };
    if (length($bytes) == 0) {
      if ($t2 eq $d) {
        $bytes = $decoded_d;
      } else {
        warn("Line $line: no data, will use 18 zeroes instead\n");
        $bytes = "\0" x 18;
        $exit = 1;
      };
    };
    push @bytes, $bytes;
  };
  unless ($found_marker) {
    warn ("Did not find end marker '-A-B-C-' in input\n");
  };

  print STDERR "total length: ", length(join('', @bytes)), "\n";
  print STDERR "SHA1: ", sha1_hex(join('', @bytes)), "\n";
  print join('', @bytes);
  exit($exit);
};

# Copyright (c) 2007 Peter Palfrader <peter at palfrader.org>
# 
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
# 
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

