safe ephemeral GnuPG homedirs

Sun Jun 16 16:27:37 CEST 2019

hi folks--

I've been looking at the notmuch test suite, which uses an ephemeral
GNUPGHOME in several places.  §4.5.2 "Ephemeral home directories" of
GnuPG's info pages clearly states that this is the preferred way to
perform isolated GnuPG actions.  However, the notmuch project is having
difficulty doing this safely in a robust cross-platform way.  This is
made more complex by the fact that test suites are frequently run in
unusual system configurations (e.g. schroot, minimalist containers,
etc), which may not have all the trappings of a full-blown desktop
machine.

The issues here revolve around sockaddr_un.sun_path size limits,
arbitrary selection of hard-coded paths for runtime directories, etc,
and i know that a lot of work has been put into trying to make socketdir
selection robust, but if even the technically-sophisticated notmuch
development community is still struggling [0], it's hard to imagine that
ephemeral homedirs are being used safely in other comparable contexts.

(for reference, all of the notes here are related to GnuPG 2.2.16 from
debian experimental, on a debian buster system)

I'm wondering whether some form of explicit signalling would work
better, though i'm reluctant to make the logic for socketdir selection
even more complex.

Is there some way to simplify this logic?  Or, failing that, can we
offer some portable way for a user of GnuPG to select a "safe" ephemeral
homedir at runtime?  or at least to get an explicit alert about what we
think the problem is likely to be?

Some example proposals to make things more robust (these are not
mutually exclusive):

a) We could offer "gpgconf --mkdir-ephemeral" -- it would make an
   ephemeral directory given the state of the current system and the
   logic of socketdir selection, such that it knows that the socket
   paths will fit.

b) gpgconf --list-dir and gpgconf --launch could emit an explicit
   warning to stderr if gpgconf detects that the socketdir is likely to
   be unreachable due to sizeof(sun_path) limits.

c) all parts of GnuPG that offer or connect to one of the sockets that
   are known to be too long could try to use relative paths when either
   bind()ing or connect()ing.  Doing this safely seems pretty
   complicated, since the lack of bindat() or connectat() likely means
   we need to use chdir() or fchdir() before bind() or connect(), and of
   course the different current working directory has other consequences
   for the rest of the process.  maybe this can be done safely within a
   fork() where a child process does socket(), chdir() and
   bind()/connect(), and then returns the resultant file descriptor to
   the parent process?  The complexity of implementation here is
   daunting, and it doesn't account for other clients of the GnuPG
   daemons, who will have to figure out how to do this themselves.  But
   it would still be more robust than the current situation.  I think
   Justus Winter worked on something like this, but maybe it never got
   integrated?  I don't know how portable it is to pass file
   descriptors between processes, or what to do on systems that don't
   have fork() -- maybe it's not necessary on those systems because they
   may not have the same limits on sun_path ?
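
To make proposal (c) concrete, here is a minimal Python sketch of the
fork()/chdir()/bind() idea.  It is an illustration only, not GnuPG code:
bind_relative is a hypothetical helper name, the descriptor is handed
back with SCM_RIGHTS over a socketpair (socket.send_fds and
socket.recv_fds, which need Python 3.9 or later), and it assumes a
platform that has both fork() and descriptor passing.

```python
import os
import socket


def bind_relative(directory, name):
    """Bind an AF_UNIX socket at directory/name using only a relative path.

    A child process does chdir() + bind(), so the parent's working
    directory never changes.  The bound descriptor is returned to the
    parent over a socketpair.
    """
    parent_end, child_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:  # child process
        status = 1
        try:
            parent_end.close()
            os.chdir(directory)
            listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            # Relative bind: only len(name) counts against sun_path.
            listener.bind(name)
            socket.send_fds(child_end, [b"ok"], [listener.fileno()])
            status = 0
        finally:
            # Never return into the parent's code path from the child.
            os._exit(status)
    # parent process
    child_end.close()
    try:
        msg, fds, _flags, _addr = socket.recv_fds(parent_end, 16, 1)
    finally:
        parent_end.close()
        os.waitpid(pid, 0)
    if msg != b"ok" or not fds:
        raise OSError("child could not bind %r in %r" % (name, directory))
    # Wrap the received descriptor; the caller may now listen() on it.
    return socket.socket(fileno=fds[0])
```

Only the short relative name counts against sun_path here, and the
parent's current working directory is untouched.  A connect() variant
would have the same shape.  It does nothing for other clients of the
daemons, which would still have to do the same dance themselves.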

What do you think about any of the above?
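
For proposals (a) and (b), the behaviour could look roughly like the
following Python sketch.  Everything in it is an assumption made for
illustration: mkdir_ephemeral and socket_path_fits are hypothetical
names, the 108-byte limit is the GNU/Linux value mentioned later in this
message, sockets are assumed to live directly inside the homedir, and
S.gpg-agent.browser is assumed to be the longest socket name involved.

```python
import os
import sys
import tempfile

SUN_PATH_SIZE = 108                     # assumption: sizeof(sun_path) on GNU/Linux
LONGEST_SOCKET = "S.gpg-agent.browser"  # assumption: longest socket name in the homedir


def socket_path_fits(homedir, name=LONGEST_SOCKET, limit=SUN_PATH_SIZE):
    """True if homedir/name plus its NUL terminator stays below limit bytes.

    The comparison is strict (conservative) because the tests in this
    message saw a 108-byte path, NUL included, rejected.
    """
    path = os.path.join(homedir, name)
    return len(os.fsencode(path)) + 1 < limit


def mkdir_ephemeral(bases=None):
    """Create a mode-0700 directory under the first base where sockets fit."""
    if bases is None:
        bases = [tempfile.gettempdir(), "/tmp"]
    for base in bases:
        try:
            homedir = tempfile.mkdtemp(prefix="gnupg-", dir=base)
        except OSError:
            continue  # base missing or not writable, try the next one
        if socket_path_fits(homedir):
            return homedir
        os.rmdir(homedir)
        # Proposal (b): say explicitly what the problem is likely to be.
        print("warning: socket paths under %s would not fit in sun_path" % base,
              file=sys.stderr)
    raise OSError("no candidate base directory yields a short enough socket path")
```

A caller would export the returned path as GNUPGHOME and remove the
directory when done.  The point is only that the length check happens
once, up front, with an explicit message when it fails.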

--------

As an aside, in reviewing the code for socket selection, i came across
this comment in common/homedir.c, _gnupg_socketdir_internal():

  /* It has been suggested to first check XDG_RUNTIME_DIR envvar.
   * However, the specs state that the lifetime of the directory MUST
   * be bound to the user being logged in.  Now GnuPG may also be run
   * as a background process with no (desktop) user logged in.  Thus
   * we better don't do that.  */

I don't understand how this statement justifies the logic used here, or
justifies not relying on the explicit signal of XDG_RUNTIME_DIR.  A few
observations:

 * the parenthetical aside in "no (desktop) user" isn't appropriate --
   non-desktop users (e.g. remote ssh logins, virtual terminal logins,
   etc) can have XDG_RUNTIME_DIR set.  On many systems with a typical
   PAM stack, XDG_RUNTIME_DIR is set appropriately (and the directory is
   created, if necessary) from most session initiations, whether it is
   ssh, virtual terminal, graphical desktop, etc.

 * if XDG_RUNTIME_DIR is set, then it seems legitimate to just use it
   directly, because we are *not* in the use case of being run as a
   background process with no available session.

 * if the concern is that XDG_RUNTIME_DIR will get cleaned up on session
   logout, that is *also* a concern for choosing the hard-coded paths
   that we choose.  They too may be cleaned up on session logout.

-------------

Finally, i note that even gpg-connect-agent gives oddly different error
messages depending on the length of the socket paths, at least on a
GNU/Linux system where sizeof(sun_path) == 108.

I ran some tests where no gpg-agent was initially running, on a system
where "/run/user/$(id -u)" is not present, and where $XDG_RUNTIME_DIR is
not set.  I tried varying the length of the standard gpg-agent socket
path (given in the environment variable $SOCKLEN):

    echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
    export GNUPGHOME=/tmp/gnupg-home-test/$(python -c "print(\"a\"*($SOCKLEN-34))")
    mkdir -p -m 0700 "$GNUPGHOME"
    echo "GNUPGHOME=$GNUPGHOME"
    ls -la "$(gpgconf --list-dir agent-socket)"
    gpgconf --list-dir agent-socket
    wc -c <<<"$(gpgconf --list-dir agent-socket)"
    gpg-connect-agent "getinfo pid" "killagent" /bye
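
The constant 34 in the script can be checked with a few lines of
Python.  The sketch assumes, as the script does, that with no runtime
directory available the socket is created directly inside $GNUPGHOME as
S.gpg-agent; agent_socket_path is a hypothetical helper name.

```python
PARENT = "/tmp/gnupg-home-test/"  # literal prefix from the script above
SOCKET = "/S.gpg-agent"           # assumption: socket sits directly in $GNUPGHOME


def agent_socket_path(socklen):
    """Rebuild the socket path the script produces for a given $SOCKLEN."""
    overhead = len(PARENT) + len(SOCKET) + 1  # 21 + 12 + 1 (NUL) = 34
    return PARENT + "a" * (socklen - overhead) + SOCKET


for socklen in (99, 100, 107, 108):
    path = agent_socket_path(socklen)
    # len(path) + 1 counts the NUL terminator and equals $SOCKLEN.
    print(socklen, len(path) + 1)
```

The wc -c line in the script reports the same number, because the
here-string appends a newline that stands in for the NUL terminator.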

In tests where the path (including NUL terminator) is less than 100
chars, gpg-connect-agent succeeds.

In tests where the path (including NUL terminator) is between 100 and
107 chars long (inclusive), gpg-connect-agent fails after a 5 second
timeout with:

    gpg-connect-agent: can't connect to the agent: IPC connect call failed

When the path (including NUL terminator) is 108 chars long,
gpg-connect-agent fails after a 5 second timeout with:

    gpg-connect-agent: can't connect to the agent: File name too long

Some questions:

 * why is it failing for paths between 100 and 107 ?  Is that because
   other socket paths are longer than agent-socket, and gpg-agent is
   failing to launch because *any* of those sockets are too long?  If
   so, can gpg-connect-agent be smarter about that?

 * why does gpg-connect-agent need to wait 5 seconds to time out if it
   already knows that the file name is too long?
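
The first question can at least be checked arithmetically.  The sketch
below assumes that gpg-agent also creates S.gpg-agent.ssh,
S.gpg-agent.extra and S.gpg-agent.browser next to the main socket, and
that each of them hits "too long" at the same 108-byte length (NUL
included) observed above for the main socket.  Whether gpg-agent really
refuses to start in that case is the unverified part.

```python
LIMIT = 108  # length (with NUL) at which "File name too long" was observed
AGENT = "S.gpg-agent"
SIBLINGS = ["S.gpg-agent.ssh", "S.gpg-agent.extra", "S.gpg-agent.browser"]


def first_failing_agent_length(limit=LIMIT):
    """Smallest agent-socket length at which some sibling socket reaches limit."""
    # The longest sibling name determines how much earlier trouble starts.
    extra = max(len(name) for name in SIBLINGS) - len(AGENT)
    return limit - extra


print(first_failing_agent_length())
```

The longest sibling name is 8 characters longer than S.gpg-agent, so
under these assumptions trouble would start at 108 - 8 = 100.  That
matches the observed boundary, though it does not prove the cause.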

These are subtle issues that might be hard to fix immediately.  I'd like
to file some record of them in https://dev.gnupg.org/ to make sure they
don't get lost.  I'm not sure which of the many points in this message
deserve their own issue reference, though.  let me know if you have a
preference.

Regards,

        --dkg

[0] see the thread around recent Message-ID:
    <87r27z63uk.fsf@ra.horus-it.com>, Subject: "[PATCH] configure: fix
    mktemp call for macOS" on
    https://notmuchmail.org/pipermail/notmuch/2019/