TOFU performance / DB format

Wed Oct 28 11:12:32 CET 2015

Hi Andre,

Thanks for sharing your concerns.

At Tue, 27 Oct 2015 15:30:23 +0100,
Andre Heinecke wrote:
> On Friday 23 October 2015 19:09:24 Neal H. Walfield wrote:
> > Unfortunately, this change only helps with the flat database format.
> > For the split format, we're still looking at over 90 seconds for the
> > initial listing and 3 seconds for subsequent listings.
> 
> I can confirm that thanks to your performance improvements initializing the flat 
> db is now really as fast as I would expect. << 1second. :-)
> 
> Great, Thanks!

Great.

> > I'd be interested to hear people's opinions about whether the split
> > format (with its ability to be more easily synced) is still
> > interesting despite the slow performance.
> 
> I'm not sure that keeping two database formats and making them configurable is 
> a good idea.
> 
> - It makes things harder to maintain. Obviously you have two codepaths with 
> different issues. e.g. The performance for the flat format is now fine while the 
> performance for the split format is still bad. You will receive bug reports 
> and you will always have to ask "are you using flat / split"
> 
> The majority will use whatever is default but you will still have to maintain 
> the code for the minority that uses another option.

These are valid concerns, but I don't think they are major concerns.

First, the code differences are rather minor.  The code for updating a
split format DB is nearly identical to that for updating the flat
format.  The primary difference is the routing, which is quite
straightforward.

Further, I've written some tests for the TOFU DB code, and make check
runs them twice: once using the flat format and once using the split
format.

> - It makes things harder to document. "If you are using tofu-db-split format 
> you will have to sync that folder. If you are using tofu-db-flat format you 
> will have to sync that db." etc.
> 
> - I don't see how the split format solves two way sync:
> 
>   -- Two way sync is not really supported in GnuPG. We have the monolithic
>    keyring and the trust.db and it's just not supported afaik to merge
>    two gnupg home dirs.
>   -- How should such a two-way sync work? If I would only use the "last
>   updated" db from two systems how would I merge them in case of conflicts?
> 
> With regards to easier sync. It can be easier Just to copy bulk data if it is 
> small then lots and lots of different files. E.g with rsync to a system 
> connected over the internet i see the following values:
> 
> (Ok that is cheating a bit as the tofu.db does not really contain any data but 
> the initialization)
> 
> First run:
> rsync -rvzP tofu.d remote:/tmp/tofu.d  0.54s user 0.10s system 3% cpu 20.231 
> total

I think there might be a bit of confusion.  rsync doesn't do two-way
synchronization.  It can be used to create a backup and to update that
backup, but it only ever copies in one direction.

A two-way sync works as follows.

Consider my laptop (L) and my desktop (D).  L and D initially have the
exact same data.  Now, I do some work on L and modify the file 'l'.
Later, I do some work on D and modify the file 'd'.

L and D have now not only diverged from the original, but they have
diverged from each other.  Thus, if we were to use rsync to
synchronize L with D, the changes to 'd' would be lost.  Likewise, if
we were to use rsync to synchronize D with L, the changes to 'l' would
be lost.

Unison works by treating each file as the basic unit of
synchronization and recording the state of each file (by saving a
checksum and the last modification time).  When it synchronizes L and
D, it sees that D has the original copy of 'l' and sends L's updated
version of 'l' to D.  Likewise it sees that L has the original version
of 'd' and sends D's updated version of 'd' to L.  Only if a file is
updated on both L and D does the user need to manually intervene and
either choose a copy or manually merge the changes.
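
To make the comparison concrete, here is a minimal Python sketch of
the idea.  It is not Unison's actual code: the dictionaries mapping
file names to version strings stand in for the saved checksums and
modification times, and the name two_way_sync is made up for
illustration.

```python
def two_way_sync(archive, laptop, desktop):
    """Reconcile two replicas against the state saved at the last sync.

    Each argument maps a file name to a fingerprint of its contents.
    Returns (merged, conflicts): the files that could be reconciled
    automatically, and the names that need manual intervention.
    """
    merged = {}
    conflicts = []
    for name in sorted(set(archive) | set(laptop) | set(desktop)):
        base = archive.get(name)
        on_l = laptop.get(name)
        on_d = desktop.get(name)
        if on_l == on_d:
            chosen = on_l          # both sides agree
        elif on_l == base:
            chosen = on_d          # only the desktop changed it
        elif on_d == base:
            chosen = on_l          # only the laptop changed it
        else:
            conflicts.append(name) # changed on both sides
            continue
        if chosen is not None:     # None means the file was deleted
            merged[name] = chosen
    return merged, conflicts


archive = {"l": "v0", "d": "v0"}
laptop = {"l": "v1", "d": "v0"}    # 'l' was modified on L
desktop = {"l": "v0", "d": "v1"}   # 'd' was modified on D
print(two_way_sync(archive, laptop, desktop))
# ({'d': 'v1', 'l': 'v1'}, [])
```

Both changes survive, which is exactly what a one-way copy in either
direction cannot achieve.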

This approach isn't perfect, but in practice it works great: I rarely
have to manually fix something.  (I've been using this setup for about
a decade, I think.)

The difficulty with the TOFU data is that we update the DB whenever we
see a new signed message.  Thus, for active users of GnuPG, we expect
there to be a fair amount of churn.  Further, it is not possible to
merge two DBs by hand.  (We could and probably will write a tool to do
this.  Nevertheless, we'd then have to teach users about it, which is
a pain.)  By splitting the DB isn't many small, atomic chunks, the
chance of a conflict decreases dramatically.  But, equally important,
if two chunks do diverge, choosing one copy randomly still results in
a usable DB with little data loss.
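
A rough Python sketch of why this helps.  The one-chunk-per-email
layout in split_chunk is an assumption for illustration only, not a
description of the actual on-disk layout, and the addresses are made
up.

```python
def flat_chunk(email):
    # Flat format: every binding lives in the same file.
    return "tofu.db"


def split_chunk(email):
    # Assumption for illustration: one chunk per email address.
    return email


def conflicting_chunks(seen_on_l, seen_on_d, chunk_of):
    """Chunks modified on both machines since the last sync."""
    touched_l = {chunk_of(e) for e in seen_on_l}
    touched_d = {chunk_of(e) for e in seen_on_d}
    return touched_l & touched_d


# Signed messages seen on each machine since the last sync.
seen_on_l = ["alice@example.org", "bob@example.org"]
seen_on_d = ["carol@example.org", "bob@example.org"]

print(conflicting_chunks(seen_on_l, seen_on_d, flat_chunk))
# {'tofu.db'}
print(conflicting_chunks(seen_on_l, seen_on_d, split_chunk))
# {'bob@example.org'}
```

With the flat format the whole DB is in conflict as soon as both
machines have seen any signed message.  With the split layout only
the chunk touched on both sides conflicts, and picking either copy
loses just one machine's updates for that one chunk.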

Personally, I find this improved ability to do two-way synchronization
a big win.  But, given the huge performance difference, I think it is
reasonable to make the flat format the default DB format.

Thanks,

:) Neal