Here's one way to try to frame the question: imagine the situation as a
game with two players on one team, "defense", named Alice and Bob; Alice
wants to send a message to Bob.  A player on the opposing team,
"offense", named Mallory, is also trying to send a message to Bob, and
wants to trick Bob into thinking that her message comes from Alice.

The way the game is played, either Alice or Mallory gets to send a
message.  Bob has to decide whether the message actually came from
Alice.  If Bob gets it right, the "defense" wins.  If Bob gets it wrong,
the "offense" wins.  The game is played multiple times.

Is that the scenario you're thinking of?  If so, does the defense need
to win 100% of the time over thousands of games?  Or is it acceptable
for offense to win occasionally?

In any case, the question is: how much work does Mallory need to do to get
Bob to make a mistake?  How frequently can Mallory trick Bob into
accepting mail from her as though it were from Alice?  Conversely, how
many messages that were actually from Alice can Bob accidentally reject
without making Alice upset enough to give up on the entire
communications scheme?
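
To make those two rates concrete, here is a rough sketch of how rounds
of the game could be tallied.  Everything in it is an assumption made
for illustration: the names (Tally, play), the idea of modelling Bob as
accepting genuine and forged messages with fixed probabilities, and the
numbers in the usage example.  A real user trial would replace the
simulated Bob with actual people.

```python
import random
from dataclasses import dataclass


@dataclass
class Tally:
    """Counts for a series of rounds (hypothetical structure)."""
    from_alice: int = 0      # rounds where Alice was the sender
    from_mallory: int = 0    # rounds where Mallory was the sender
    false_rejects: int = 0   # genuine Alice messages Bob rejected
    false_accepts: int = 0   # Mallory messages Bob accepted as Alice's

    @property
    def false_reject_rate(self) -> float:
        # How often Bob wrongly rejects Alice; 0.0 if Alice never sent.
        if self.from_alice == 0:
            return 0.0
        return self.false_rejects / self.from_alice

    @property
    def false_accept_rate(self) -> float:
        # How often Mallory tricks Bob; 0.0 if Mallory never sent.
        if self.from_mallory == 0:
            return 0.0
        return self.false_accepts / self.from_mallory


def play(rounds, p_mallory, p_accept_genuine, p_accept_forged, seed=0):
    """Simulate `rounds` rounds of the game and return a Tally.

    p_mallory:        chance that Mallory (not Alice) sends in a round
    p_accept_genuine: chance Bob accepts a message really from Alice
    p_accept_forged:  chance Bob accepts a message forged by Mallory
    All three are made-up model parameters, not measured values.
    """
    rng = random.Random(seed)
    tally = Tally()
    for _ in range(rounds):
        forged = rng.random() < p_mallory
        accept_prob = p_accept_forged if forged else p_accept_genuine
        accepted = rng.random() < accept_prob
        if forged:
            tally.from_mallory += 1
            if accepted:
                tally.false_accepts += 1   # offense wins this round
        else:
            tally.from_alice += 1
            if not accepted:
                tally.false_rejects += 1   # offense also wins here
    return tally


if __name__ == "__main__":
    # Illustrative numbers only.
    t = play(rounds=10_000, p_mallory=0.1,
             p_accept_genuine=0.98, p_accept_forged=0.05)
    print("false accept rate:", t.false_accept_rate)
    print("false reject rate:", t.false_reject_rate)
```

Keeping the two rates separate matters: a Bob who rejects everything
has a false accept rate of zero, but Alice would give up on him quickly.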

When you frame the problem this way, you can start thinking more
concretely about what "bulletproof" means, and you can actually design
user trials to test proposals.

There are probably other ways to concretize the problem; this is just
one that i've come up with.  But without a concrete way to understand
what we're looking for, it's tough to get people to agree on what words
like "bulletproof" or "easy to read" or "cryptographically secure"
mean.

I suspect (as discussed upthread) that TOFU will have better metrics for
"defense" at the game described above than any attempt that involves
asking people to visually distinguish deterministically-generated
identicons.  But i don't know, because i haven't tested it.

