WKD proper behavior on fetch error

Thu Jan 14 01:47:12 CET 2021

On 2021-01-13 at 10:12 +0100, Neal H. Walfield wrote:
> I'd like to clarify what Sequoia is doing (wrong).
> (...)

Hello Neal

Thanks for chiming in and explaining the steps taken by sequoia.

I'll try to re-focus this subthread back on the initial topic of your
email.

> The I-D says "Only if the required sub-domain does not exist, they
> SHOULD fall back to the direct method."  The text doesn't say: "If
> there is an error, they SHOULD fallback to the direct method unless
> the required sub-domain does not exist, in which case they MUST NOT
> fall back to the direct method."  So, strictly speaking, I don't
> think Sequoia is violating the specification.

I understand this to mean "only use the direct method if the
required sub-domain does not exist", with the SHOULD meaning that the
direct method is not required (not sure why, I would have probably used
a MUST). As such, I do think sequoia is non-conformant, although I'm
more interested in determining the proper behaviour of a WKD client.
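To make my reading concrete, here is a minimal sketch of the rule as a predicate. The outcome names are my own (hypothetical, not taken from the I-D or from sequoia); the only point is that a single outcome permits the direct method.

```python
from enum import Enum, auto


class AdvancedOutcome(Enum):
    """Hypothetical outcomes of trying the advanced method."""
    KEY_FOUND = auto()
    NO_SUCH_SUBDOMAIN = auto()  # openpgpkey.<domain> does not exist in DNS
    CONNECT_FAILED = auto()     # name resolves, but the connection fails
    BAD_CERTIFICATE = auto()    # TLS validation fails
    NOT_FOUND = auto()          # HTTP 404 from the advanced host


def may_fall_back_to_direct(outcome: AdvancedOutcome) -> bool:
    """Strict reading: only a non-existent sub-domain permits the direct method."""
    return outcome is AdvancedOutcome.NO_SUCH_SUBDOMAIN


if __name__ == "__main__":
    for outcome in AdvancedOutcome:
        print(f"{outcome.name:18} fall back: {may_fall_back_to_direct(outcome)}")
```

Under the other reading (the one sequoia implements, as I understand it), the predicate would return True for every outcome except KEY_FOUND.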

> We thought about this question, but we couldn't figure out a
> satisfactory answer.  The worst attack we could come up with is:
> (...)
> So sure, that's possible, but it seems like WKD shouldn't foo.com's
> biggest worry in that case.

I can make up some scenarios where foo.com is hosted on EvilCDN and
openpgpkey.foo.com on a safe server. But I agree that's not too likely.
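For reference, the two methods query different hosts, which is what makes such a split possible. A rough sketch of the URL construction as I understand the draft (SHA-1 of the lowercased local part, z-base-32 encoded); the function names are mine, and lowercasing with str.lower() is a simplification that also touches non-ASCII characters:

```python
import hashlib
import urllib.parse
from typing import Tuple

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"


def zbase32(data: bytes) -> str:
    """z-base-32 encode data whose bit length is a multiple of 5 (true for SHA-1)."""
    bits = len(data) * 8
    if bits % 5:
        raise ValueError("bit length must be a multiple of 5 in this sketch")
    n = int.from_bytes(data, "big")
    # Walk the value from the most significant 5-bit group to the least.
    return "".join(ZBASE32[(n >> shift) & 31] for shift in range(bits - 5, -1, -5))


def wkd_urls(address: str) -> Tuple[str, str]:
    """Return (advanced_url, direct_url) for an address like jdoe@foo.com."""
    local, domain = address.rsplit("@", 1)
    domain = domain.lower()
    hu = zbase32(hashlib.sha1(local.lower().encode("utf-8")).digest())
    query = "?l=" + urllib.parse.quote(local)
    advanced = (f"https://openpgpkey.{domain}/.well-known/openpgpkey/"
                f"{domain}/hu/{hu}{query}")
    direct = f"https://{domain}/.well-known/openpgpkey/hu/{hu}{query}"
    return advanced, direct


if __name__ == "__main__":
    for url in wkd_urls("jdoe@foo.com"):
        print(url)
```

Only the first URL is served by openpgpkey.foo.com; the second is served by whoever runs the apex website.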

I think it would be good if sq stopped after processing
openpgpkey.foo.com, mainly to follow the principle of least surprise.

If the key can only be placed in one place, then it MUST be good (or
bad, but it will be consistent).

If the admins wanted to use the advanced method, but misconfigured it,
while testing only with sequoia (but having a good direct method), they
would be misled to think WKD is properly implemented, while it is not.

The team managing foo.com may be completely different (e.g. marketing)
than those handling pgp keys (e.g. email sysadmins). By keeping the
openpgpkey subdomain for themselves, the admins can give complete
control of the apex website to marketing (known to fill their
credentials on phishing pages from time to time) without giving them
any access to publish pgp keys.

Hard to debug errors: "when fetching your key via wkd, I do receive
your key from your server, but it expired in 2018!" can have people
scratching their head for a long time (the last key is there, there is
no 2018 key stored here…) until figuring out that the server
certificate of openpgpkey.foo.com expired (and they had an old key on
the main website). A direct error "Certificate of openpgpkey.foo.com
expired 2 days ago" would have been much clearer.

> On the other hand, implementing this prohibition means that a DNS
> server can prevent its clients from using WKD by forcing all
> openpgpkey subdomains to resolve to 127.1.  That's hard to notice,
> because everything else still appears to work.

I think it's the other way round. If sequoia falls back to the direct
method and returns "No WKD key published for jdoe at foo.com", the
problem will go unnoticed. A hard error of "Couldn't connect to openpgpkey.foo.com
(127.0.0.1)" or "The certificate of openpgpkey.foo.com (1.2.3.4) is not
trusted" would make it noticeable.

Probably the most important part of the rule: "all implementations of
WKD should behave in the same way". I don't mind if it was gnupg that
was changed to behave like sequoia, but given identical conditions,
ideally all clients (and the draft reading) should produce the same
result (find key X, an error, etc.).

> we helped an organization deploy WKD, and they had a similar problem…

It was misconfigured. Spawning an https server for openpgpkey (as they
did) is one way to solve it. They could as well have made an openpgpkey
record pointing to the same server as the apex domain, and used a
certificate with both names. Or even installed the keys on the server
providing the 404. It seems wrong to make it an issue for the
client to figure out where the keys may be.
There is a long history of browsers helpfully "fixing" the encoding or
Content-Type of files, which caused a lot of harm in the long term:
security issues derived from browser sniffing overriding the declared
type when people really meant what they said. It seems unlikely that
the same could happen here, but the idea is the same: the server
should be properly set up, rather than the client fixing the errors
for the user.

I would recommend removing the or_else case and failing with an error if
the advanced method is (supposedly) set up but fails. At least, I think
there should be a diagnostic e.g. "WKD advanced method configured but
broken. Connection to openpgpkey.foo.com (1.2.3.4) failed: Bad
certificate. Trying direct method" although I would prefer a hard
error.
(Of course, if the user explicitly requested the client/library to only
use the direct method, ignore certificate errors, etc. it'd be fine to
do so)
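A sketch of what I have in mind, with the network parts replaced by injected callables so it stays self-contained. All names here (NoSuchHost, FetchError, WkdError, wkd_lookup, the strict flag) are hypothetical and not sequoia's API:

```python
from typing import Callable, Optional


class NoSuchHost(Exception):
    """Hypothetical: the host name does not exist in DNS."""


class FetchError(Exception):
    """Hypothetical: the host exists but the fetch failed (connect, TLS, ...)."""


class WkdError(Exception):
    """Hard error reported to the user."""


# A fetcher takes the mail domain and returns key bytes, or None on a 404.
Fetcher = Callable[[str], Optional[bytes]]


def wkd_lookup(domain: str, fetch_advanced: Fetcher, fetch_direct: Fetcher,
               strict: bool = True,
               warn: Callable[[str], None] = print) -> Optional[bytes]:
    host = f"openpgpkey.{domain}"
    try:
        # A 404 (None) from the advanced host is a final answer: no fallback.
        return fetch_advanced(domain)
    except NoSuchHost:
        # The only case where the direct method is tried silently.
        return fetch_direct(domain)
    except FetchError as err:
        message = ("WKD advanced method configured but broken. "
                   f"Connection to {host} failed: {err}.")
        if strict:
            raise WkdError(message) from err
        # Lenient mode: at least tell the user before falling back.
        warn(message + " Trying direct method")
        return fetch_direct(domain)
```

With strict=True a bad certificate on openpgpkey.foo.com surfaces as an error; with strict=False the user still gets the key from the apex domain, but not without being told why.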

Best regards

PS: If I'm reading your code correctly, it would not only fall back to
the apex domain in case of a certificate error, but also if the key is
not found (404). So removing a key (by deleting the container file)
could have the unintended effect of still finding a result through
the direct method.

PPS: Another benefit would be that we could have avoided this long
thread. :-)