full-text v. regular expression userid searches (was: Re: [svn] GnuPG - r3867 - trunk/keyserver)

David Shaw dshaw at jabberwocky.com
Fri Aug 19 06:35:22 CEST 2005


On Fri, Aug 19, 2005 at 12:24:05AM -0400, Jason Harris wrote:
> On Thu, Aug 18, 2005 at 10:48:54PM -0400, David Shaw wrote:
> > On Thu, Aug 18, 2005 at 09:32:23PM -0400, Jason Harris wrote:
> 
> > > LDAP?  (pks and SKS can, AFAIK.)
> > 
> > No, the other way around.  LDAP actually supports everything here and
> > more since it has an actual search syntax with wildcards.  Both pks
> > and SKS searches are much more limited and inherently substring.  In
> > pks, "exact" means "exact substring with whole words" and "not exact"
> > means "whole word match".  Not quite sure exactly what SKS does, but I
> > know the search facility there is being tinkered with as we speak.
> 
> pks considers "words" to be 2 or more chars long; SKS allows single-
> character "words."  Neither currently stores the full userid strings
> in a separate db (table), so both match whole words, fetch the candidate
> keys, and check for exact substring matches in the candidate keys.
> 
> Supporting (e.g., POSIX) Regular Expression searches would be interesting,
> both in GPG and (HKP) keyserver keyrings, but searching the 2545113+
> userids (on 2209793+ keys) on (well-synchronized) keyservers could be
> unacceptably slow.  The raw text is 99425942+ bytes (94.8+MB).
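
[A minimal sketch of the two-step lookup described in the quoted text above:
whole-word index match first, then an exact substring check on the candidate
keys only.  This is not the pks or SKS code.  The key IDs, userids, and
function names are hypothetical, and ANDing multiple query words together is
an assumption.]

```python
import re
from collections import defaultdict

MIN_WORD_LEN = 2  # pks-style minimum; 1 would admit single-character words


def words(text, min_len=MIN_WORD_LEN):
    """Return the set of lowercase alphanumeric words of at least min_len chars."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_len}


def build_index(keys, min_len=MIN_WORD_LEN):
    """Map each word to the set of key IDs whose userids contain that word."""
    index = defaultdict(set)
    for keyid, userids in keys.items():
        for uid in userids:
            for w in words(uid, min_len):
                index[w].add(keyid)
    return index


def search(keys, index, query, exact=False, min_len=MIN_WORD_LEN):
    """Whole-word match via the index; optionally verify an exact substring."""
    qwords = words(query, min_len)
    if not qwords:
        return []
    # Step 1: candidate keys must contain every query word (assumed AND).
    candidates = set.intersection(*(index.get(w, set()) for w in qwords))
    # Step 2: only the candidates' userids are scanned for the full string.
    if exact:
        needle = query.lower()
        candidates = {
            k for k in candidates
            if any(needle in uid.lower() for uid in keys[k])
        }
    return sorted(candidates)


# Hypothetical sample data, for illustration only.
KEYS = {
    "0xAAAA0001": ["Alice Example <alice@example.org>"],
    "0xAAAA0002": ["Example Alice <ea@example.net>"],
    "0xAAAA0003": ["Bob Sample <bob@example.org>"],
}
INDEX = build_index(KEYS)
```

[With this data, search(KEYS, INDEX, "alice example") matches the first two
keys, while exact=True keeps only 0xAAAA0001.  The substring scan touches
only the candidates, not the full set of userids.]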

I don't know that it would be unusably slow: searching in a database
is a very well-researched problem and there are ways to speed things
up.  I seem to recall the old PGP LDAP server did quite well in
searching, and it had quite a large number of keys to search through.
Even though it wasn't synchronized with the HKP world, it was pretty
up to date (being the default server for PGP).

Not that I'm suggesting regex searches.  I think they're overkill for
the problem at hand.  Even LDAP doesn't do full regex.

> NB:  The legacy LDAP server fails this search:
> 
>   gpgkeys: LDAP search for: (&(pgpuserid=*jaso*harr*)(pgpdisabled=0))
> 
> although the GD (LDAP) server quickly returns (always only one match)
> 0x341A91C4.  But, I don't consider the GD a good benchmark since it has
> so few keys, can (and does?) stop looking after the first match, etc.
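
[A rough sketch of how a filter like the one in the quoted log line is put
together, and what its "*" wildcards would match.  The helper names are
hypothetical, the sample userids are made up, and case-insensitive matching
is an assumption about the server's matching rule.  Escaping follows the
LDAP string filter rules (RFC 4515).]

```python
import re


def ldap_escape(fragment):
    """Escape characters that are special inside an LDAP filter value."""
    out = fragment.replace("\\", r"\5c")  # backslash first, so escapes survive
    out = out.replace("*", r"\2a").replace("(", r"\28").replace(")", r"\29")
    return out.replace("\x00", r"\00")


def build_filter(*fragments):
    """Build an AND filter: userid contains the fragments in order, key not disabled."""
    pattern = "*" + "*".join(ldap_escape(f) for f in fragments) + "*"
    return f"(&(pgpuserid={pattern})(pgpdisabled=0))"


def wildcard_match(fragments, value):
    """Emulate *a*b* substring matching: fragments must appear in this order."""
    regex = ".*".join(re.escape(f) for f in fragments)
    return re.search(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None
```

[Here build_filter("jaso", "harr") gives the filter string shown in the log
line above.  wildcard_match(["jaso", "harr"], "Jason Harris <jh@example.org>")
is true, while "Harris, Jason" does not match because the fragments are
ordered.  This is substring matching with wildcards, well short of a regex.]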

Which legacy LDAP server are you testing with?  PGP.com's old server
is gone, and I think horowitz is currently broken: any search returns
no responses.

David



More information about the Gnupg-devel mailing list