current charset guessing

Alain Bench xwck at oreka.com
Sat Jan 15 01:17:10 CET 2005


 On Tuesday, January 11, 2005 at 3:54:43 PM +0100, Werner Koch wrote:

> On Fri, 7 Jan 2005 18:23:34 +0100 (CET), Alain Bench said:
>> fails on implicit charsets, ambiguous names, or platform specific
>> spellings.
> we would need to reimplement everything from libiconv or check whether
> a proper libiconv is available.

    I assume you meant libcharset. Yes: Reimplement, or reuse it, use it
if available, or even provide it. After all it's already squatting in
the tarball: gnupg-1.4.0/intl/localcharset.c :-)

    But beware of the Win32 OEM/ANSI mismatch problem.


>> nl_langinfo(CODESET) also needs sanitizing
> this is something libiconv should care about

    Libcharset, yes. Also libiconv accepts some limited common aliases
as parameters, and that helps, but not in all cases. A platform specific
iconv *should* accept the same names that were output by its
nl_langinfo(CODESET), or so I hope. Hybrid cases are complicated.

    Example: Allegedly some versions of AIX may report "IBM-850" as
CODESET. That name is unknown to libiconv 1.9.2, which knows 4 aliases:

| $ iconv -l | grep 850
| 850 CP850 IBM850 CSPC850MULTILINGUAL

    But on AIX, libcharset canonicalises "IBM-850" to "CP850" and
reports this known name.
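
    The AIX example above amounts to an alias lookup before the name
is handed to iconv. Here is a minimal sketch in C, assuming a small
hard-coded table. The names alias_table, names_equal and
canonical_charset are hypothetical, and this is not how
localcharset.c is actually organised:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical alias table (illustration only): the left column is a
   platform spelling, the right column is a name that the iconv -l
   output above lists for code page 850. */
static const struct {
    const char *alias;
    const char *canonical;
} alias_table[] = {
    { "IBM-850", "CP850" },   /* AIX spelling mentioned above */
    { "IBM850",  "CP850" },
    { "850",     "CP850" },
};

/* Case-insensitive comparison of two ASCII names; returns 1 if equal. */
int names_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the canonical name for `name`, or `name` itself if the
   table has no entry for it. */
const char *canonical_charset(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++)
        if (names_equal(name, alias_table[i].alias))
            return alias_table[i].canonical;
    return name;
}
```

    A caller would pass the nl_langinfo(CODESET) result through
canonical_charset() before using it. Unknown names pass through
unchanged, so the table only has to cover the platform-specific
spellings.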


>> "make check" also fails
> changed to a warning.
>> in CP-1252, not in Latin-1
> Removed that.
>> there is a whole set of aliases [28591 == L1]
> I adapted that list and use it for Windows.

    Great! Thank you. :-)


>> Libcharset seems to call GetACP() only, never GetConsoleOutputCP().
>> IIUC that would be false for console apps?
> GetConsoleOutputCP is the correct thing to do but it does not harm to
> fall back to getACP.

    I wonder how The Bat!™ can make GetConsoleOutputCP() return 0 when
needed. FreeConsole() to detach, or something like that?

    I wonder how to decide which function applies, OCP or ACP. I mean:
Typically one gives 850 the other 1252. When you run in text mode,
that's typically in a console: OEM CP 850 is good. But try to run GnuPG
w32cli-1.4.0a in the default rxvt of MSYS-1.0.10. That rxvt is a Latin-1
terminal, but GetConsoleOutputCP() and GnuPG still report CP850, and of
course umlauts are garbled (real key on keyservers):

| $ gpg -vvv --list-keys BD7C8AA1
| pub   4096R/BD7C8AA1 2005-01-01 [expires: 2005-12-31]
| uid                  Hans M\x81ller <ndof at gmx.li>
| sub   4096R/0958388C 2005-01-01 [expires: 2005-12-31]
|
| gpg: using character set `CP850'
| gpg: using PGP trust model

    Normally it should be "Müller" (u umlaut, whose CP850 code is
0x81). BTW I wonder why stdout/stderr are reordered.
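
    The fallback Werner describes (console code page first, ANSI code
page when the console call returns 0) could look like the sketch
below. GetConsoleOutputCP() and GetACP() are the real Win32 calls; the
names pick_codepage and current_codepage are hypothetical. Note that
this does nothing for the rxvt case above: there GetConsoleOutputCP()
returns a non-zero 850, so the sketch would still choose CP850.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Pure helper (hypothetical name): prefer the console output code
   page, and fall back to the ANSI code page when the console value
   is 0, i.e. when the console query failed. */
unsigned int pick_codepage(unsigned int console_cp, unsigned int ansi_cp)
{
    return console_cp != 0 ? console_cp : ansi_cp;
}

#ifdef _WIN32
/* Win32 only: query both code pages and apply the rule above. */
unsigned int current_codepage(void)
{
    return pick_codepage(GetConsoleOutputCP(), GetACP());
}
#endif
```

    Keeping the decision in a helper that takes plain numbers makes
the rule easy to check on any platform: pick_codepage(0, 1252) gives
1252, and pick_codepage(850, 1252) gives 850.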


>> US-Ascii is not Latin-1.
> we got a lot of complaints about these warnings from US people and it
> seems reasonable that many more machines are not configured properly
> for Latin-1 than those who are explicitly using ascii.

    Tolerance for misconfigured systems is good, but maybe not at the
cost of breaking legitimate usage, even rare. May I make two proposals:

 -1) Get rid of the warning message on simple display of unconvertible
chars. Unconfigured locale people would see (faked here):

| $ gpg -vvv --list-keys 0x882B59FD
| gpg: using character set `ASCII'
| gpg: using PGP trust model
| pub    512D/882B59FD 2005-01-08
| uid                  Ren\xc3\xa9 Lec\xc5\x93ur <joe at foo.bar>

    No more annoying warning, but a strange accents display that may
lead them to read the doc and fix locale. Doc may point to
Sven Mascheck's site, and provide quick "export LANG=en_US" hint.
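
    The escaped display in proposal 1 can be produced by a small
byte-wise filter. This is only a sketch of the idea, assuming that
every byte outside printable ASCII is written as \xNN. The function
name escape_non_ascii is hypothetical and is not GnuPG's actual
printing code:

```c
#include <stdio.h>
#include <stddef.h>

/* Copy `in` to `out`, writing every byte outside printable ASCII
   (0x20..0x7e) as a four-character \xNN escape.
   Returns the number of characters written, not counting the
   terminating NUL, or -1 if `outsize` is too small. */
int escape_non_ascii(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    for (; *p; p++) {
        if (*p >= 0x20 && *p < 0x7f) {
            if (used + 1 >= outsize)      /* keep room for the NUL */
                return -1;
            out[used++] = (char)*p;
        } else {
            if (used + 4 >= outsize)      /* 4 chars plus the NUL */
                return -1;
            snprintf(out + used, 5, "\\x%02x", (unsigned int)*p);
            used += 4;
        }
    }
    if (used >= outsize)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

    Fed the UTF-8 bytes of the user ID in the faked listing above, it
yields the same Ren\xc3\xa9 Lec\xc5\x93ur form, with no warning
involved.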


 -2) Make and document a special "novars" case: Where locale variables
are applicable (not Win32), if all 3 of LC_ALL, LC_CTYPE, and LANG are
unset, and so far guessed charset is US-Ascii, then charset = Latin-1.

    In -vvv mode, unconf guys would be hinted (again faked output):

| $ gpg -vvv --list-keys "René"
| gpg: using character set `ISO-8859-1' novars fallback: see http://explanations
| gpg: using PGP trust model
| pub    512D/882B59FD 2005-01-08
| uid                  René Lec\xc5\x93ur <joe at foo.bar>

    No more annoying warning, and accents hopefully displayed as well
as possible, even in these adverse conditions. Harmless for normal guys.
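
    The "novars" rule of proposal 2 is short enough to sketch in C.
Everything here is an assumption made for illustration: the function
names, the treatment of an empty variable as unset, and the set of
spellings accepted as US-Ascii. A real caller would skip this on
Win32, as said above.

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: a variable that is set but empty counts as unset. */
static int is_unset(const char *value)
{
    return value == NULL || *value == '\0';
}

/* Assumption: these spellings all mean US-Ascii. */
static int is_us_ascii(const char *charset)
{
    return strcmp(charset, "US-ASCII") == 0
        || strcmp(charset, "ASCII") == 0
        || strcmp(charset, "ANSI_X3.4-1968") == 0;
}

/* The rule itself: if all three locale variables are unset and the
   charset guessed so far is US-Ascii, use Latin-1 instead.
   Otherwise keep the guess. */
const char *novars_fallback(const char *guessed, const char *lc_all,
                            const char *lc_ctype, const char *lang)
{
    if (is_unset(lc_all) && is_unset(lc_ctype) && is_unset(lang)
        && is_us_ascii(guessed))
        return "ISO-8859-1";
    return guessed;
}

/* Convenience wrapper reading the real environment. */
const char *novars_fallback_from_env(const char *guessed)
{
    return novars_fallback(guessed, getenv("LC_ALL"),
                           getenv("LC_CTYPE"), getenv("LANG"));
}
```

    With LANG=C set explicitly the guess stays US-Ascii, so people who
really want plain ASCII keep it. Only the fully unconfigured case is
redirected to Latin-1.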


Bye!	Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with
Lotus Notes 5. This lacks necessary references and breaks threads.


