current charset guessing

Fri Jan 7 18:23:34 CET 2005

 On Thursday, January 6, 2005 at 12:22:43 PM +0100, Werner Koch wrote:

> On Mon, 3 Jan 2005 01:26:57 +0100 (CET), Alain Bench said:
>> to guess current charset when CODESET lacks?
> Please try the attached patch.

    Thanks. It works in many cases, but fails on implicit charsets,
ambiguous names, or platform-specific spellings. Examples:

| $ LC_CTYPE=pl_PL gpg -vvv --some-function
| gpg: using character set `iso-8859-1'
| [...]

    This runs in Latin-1, while the real implicit charset of pl_PL is
Latin-2.


| $ LC_CTYPE=ja_JP.EUC gpg -vvv
| gpg: conversion from `utf-8' to `EUC' not available

    We know from context it's EUC-JP, but iconv can't decide if it's
EUC-JP, EUC-KR, EUC-CN, EUC-TW, or EUC-JISX0213.


| $ LC_CTYPE=fr_FR.iso885915@euro gpg -vvv
| gpg: conversion from `utf-8' to `iso885915' not available

    This "alias" of Latin-9 is not known to libiconv 1.9.2, which knows
only six of them:

| $ iconv -l | grep ISO-8859-15
| ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 ISO_8859-15:1998 LATIN-9
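
    To make the three failure classes concrete, here is a minimal Python
sketch (not GnuPG code) that splits a locale name into its parts. The
helper name split_locale() and the regular expression are my own
assumptions, for illustration only:

```python
import re

# Assumed shape of a POSIX-style locale name:
#   language[_TERRITORY][.codeset][@modifier]
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_locale(name):
    """Return (language, territory, codeset, modifier); missing parts are None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError("unrecognized locale name: %r" % (name,))
    return match.group("language", "territory", "codeset", "modifier")


for name in ("pl_PL", "ja_JP.EUC", "fr_FR.iso885915@euro"):
    print(name, "->", split_locale(name))
```

    The first name carries no codeset at all (implicit charset), the
second carries one that is ambiguous without the territory, and the
third carries a platform spelling that still has to be mapped to a name
iconv knows.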


    Note that in such locales, "make check" also fails:

| $ make check
| [...]
| Making all in checks
| make[2]: Entering directory `/tmp/gnupg-1.4.0/checks'
| echo '#!/bin/sh' >./gpg_dearmor
| echo "../g10/gpg --no-options --no-greeting \
|              --no-secmem-warning --batch --dearmor" >>./gpg_dearmor
| chmod 755 ./gpg_dearmor
| ./gpg_dearmor > ./pubring.gpg < ./pubring.asc
| gpg: conversion from `utf-8' to `cp-1252' not available
| make[2]: *** [pubring.gpg] Error 2
| make[2]: Leaving directory `/tmp/gnupg-1.4.0/checks'
| make[1]: *** [all-recursive] Error 1
| make[1]: Leaving directory `/tmp/gnupg-1.4.0'
| make: *** [all] Error 2


    Those platform-specific charset spellings are quite common. That's
why libcharset does some sanitizing, and why Mutt has a big internal
table of aliases and provides the "iconv-hook" command so users can add
to this table. BTW the output of nl_langinfo(CODESET) also needs
sanitizing on some platforms.

    Libcharset's locale_charset() itself takes care of the appropriate
nl_langinfo(CODESET), GetACP(), DosQueryCp() on OS/2,
setlocale(LC_ALL, ""), getenv() parsing, and canonicalization through an
internal alias table or an external $LIBDIR/charset.alias file.
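
    As a rough illustration of that canonicalization step, a Python
sketch follows. It is not libcharset's algorithm: the tables, their
contents, and the names normalize() and canonical_charset() are
hypothetical, chosen only to cover the failing examples above.

```python
import re

# Hypothetical alias table: sanitized spelling -> name known to iconv.
ALIASES = {
    "iso88591": "ISO-8859-1",
    "iso88592": "ISO-8859-2",
    "iso885915": "ISO-8859-15",
    "cp1252": "CP1252",
    "utf8": "UTF-8",
}

# Hypothetical disambiguation of the bare "EUC" codeset by territory.
EUC_BY_TERRITORY = {
    "JP": "EUC-JP",
    "KR": "EUC-KR",
    "CN": "EUC-CN",
    "TW": "EUC-TW",
}


def normalize(name):
    """Sanitize a spelling: lowercase, then drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical_charset(codeset, territory=None):
    """Return a canonical charset name, or None when the spelling is unknown."""
    key = normalize(codeset)
    if key == "euc":
        # Ambiguous on its own; only the territory can decide.
        return EUC_BY_TERRITORY.get(territory)
    return ALIASES.get(key)


print(canonical_charset("iso885915"))
print(canonical_charset("cp-1252"))
print(canonical_charset("EUC", "JP"))
```

    An unknown spelling yields None here, so a caller can fall back to
something else instead of passing a name iconv will reject.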


>> On Win32 the name for Latin-1 is not CP1252, but CP28591.
> My reference says 1252 thus mapping 1252 to Latin-1 is correct.

    CP-1252 is a superset of Latin-1 with 27 more chars. For example,
one can write "Lecœur" (oe ligature) in CP-1252, but not in Latin-1:

| $ grep U0153 glibc-2.3.3/localedata/charmaps/CP1252
| <U0153>     /x9c         LATIN SMALL LIGATURE OE
| $ echo -e "\234" | iconv -f cp1252 -t iso-8859-1
| iconv: (stdin): cannot convert

    I'd say treating CP-1252 as Latin-1 is good enough and better than
nothing when iconv is unavailable. But it's suboptimal when iconv is
there. And it could even lead to wrongly and unnoticeably inserting
UTF-8 encoded control chars inside a key UID. Think Mr Lecœur enters
his name:

| $ echo -ne "\234" | iconv -f iso-8859-1 -t utf-8 | hex
| C2 9C
| $ grep U009C glibc-2.3.3/localedata/charmaps/UTF-8
| <U009C>     /xc2/x9c     STRING TERMINATOR (ST)

    That's obviously wrong, but when Mr Lecœur enters or displays his
UID (in same conditions), everything seems correct to him. Other people
see garbage, though.
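
    The same round trip can be replayed with Python's standard codecs.
This is only a sketch of the effect, not of what gpg does internally;
the variable names are mine:

```python
NAME = "Lec\u0153ur"  # "Lecoeur" with U+0153 LATIN SMALL LIGATURE OE

# Bytes a CP-1252 terminal sends when Mr Lecoeur types his name.
typed = NAME.encode("cp1252")

# Latin-1 has no code point for the ligature at all.
try:
    NAME.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Input wrongly treated as Latin-1: byte 0x9C becomes control char U+009C.
stored_wrong = typed.decode("latin-1").encode("utf-8")

# Input correctly treated as CP-1252: byte 0x9C becomes U+0153.
stored_right = typed.decode("cp1252").encode("utf-8")

# Displaying the wrong UID under the same wrong assumption gives back
# the original bytes, so the error is invisible to its author.
shown_back = stored_wrong.decode("utf-8").encode("latin-1")

print("latin-1 can encode:", latin1_can_encode)
print("stored (wrong): ", stored_wrong.hex())
print("stored (right): ", stored_right.hex())
print("round trip hides the error:", shown_back == typed)
```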


> If CP28591 is also a Latin-1 encoding, libiconv should handle this.

    I agree libiconv probably should handle CP28591 as an alias of
Latin-1. But it does not (straight binaries gnupg-w32cli-1.4.0a.zip and
libiconv-1.9.1.dll.zip from gnupg.org, on Win2000sp4 cmd.exe):

| C:\>chcp
| Page de codes active : 28591
|
| C:\>gpg -vvv
| gpg: conversion from `utf-8' to `CP28591' not available

    Looking in libiconv 1.9.2 source: No such alias. But in
libiconv-1.9.2/libcharset/lib/localcharset.c there is a whole set of
aliases:

| /* Determine a canonical name for the current locale's character encoding.
| # if defined WIN32
|     cp = "CP936" "\0" "GBK" "\0"
|          "CP1361" "\0" "JOHAB" "\0"
|          "CP20127" "\0" "ASCII" "\0"
|          "CP20866" "\0" "KOI8-R" "\0"
|          "CP21866" "\0" "KOI8-RU" "\0"
|          "CP28591" "\0" "ISO-8859-1" "\0"
|          "CP28592" "\0" "ISO-8859-2" "\0"
|          "CP28593" "\0" "ISO-8859-3" "\0"
|          "CP28594" "\0" "ISO-8859-4" "\0"
|          "CP28595" "\0" "ISO-8859-5" "\0"
|          "CP28596" "\0" "ISO-8859-6" "\0"
|          "CP28597" "\0" "ISO-8859-7" "\0"
|          "CP28598" "\0" "ISO-8859-8" "\0"
|          "CP28599" "\0" "ISO-8859-9" "\0"
|          "CP28605" "\0" "ISO-8859-15" "\0";
| # endif
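
    For readers unfamiliar with that layout: the string is a flat run of
"alias NUL canonical NUL" pairs. A small Python sketch of how such a
table can be read follows. The function names are hypothetical, and the
lookup is my assumption of the intent, not a transcription of
localcharset.c:

```python
# Same layout as the C string above: ALIAS \0 CANONICAL \0, back to back.
WIN32_TABLE = (
    "CP936\0GBK\0"
    "CP1361\0JOHAB\0"
    "CP20127\0ASCII\0"
    "CP20866\0KOI8-R\0"
    "CP21866\0KOI8-RU\0"
    "CP28591\0ISO-8859-1\0"
    "CP28592\0ISO-8859-2\0"
    "CP28605\0ISO-8859-15\0"
)


def parse_alias_table(table):
    """Turn a NUL-separated alias/canonical string into a dict."""
    fields = table.split("\0")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field after the final NUL
    if len(fields) % 2 != 0:
        raise ValueError("alias table must hold alias/canonical pairs")
    return dict(zip(fields[0::2], fields[1::2]))


def canonical_from_codepage(codepage, aliases):
    """Map a numeric code page to a charset name; unknown pages stay CPnnn."""
    name = "CP%d" % codepage
    return aliases.get(name, name)


aliases = parse_alias_table(WIN32_TABLE)
print(canonical_from_codepage(28591, aliases))
print(canonical_from_codepage(1252, aliases))
```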


>> Latin-9 is not Latin-1.
> Right: I have remove that.

    Good: Thanks. ISO-8859-15 is Latin-9, just to confuse us:

| ISO-8859-1	Latin-1
| ISO-8859-2	Latin-2
| ISO-8859-3	Latin-3
| ISO-8859-4	Latin-4
| ISO-8859-5	Cyrillic
| ISO-8859-6	Arabic
| ISO-8859-7	Greek
| ISO-8859-8	Hebrew
| ISO-8859-9	Latin-5
| ISO-8859-10	Latin-6
| ISO-8859-11	Thai
| ISO-8859-13	Latin-7
| ISO-8859-14	Latin-8
| ISO-8859-15	Latin-9
| ISO-8859-16	Latin-10


> For W32 use GetACP as error fallback.

    Hum... Libcharset seems to call GetACP() only, never
GetConsoleOutputCP(). IIUC that would be wrong for console apps? I'm
unable to check, having no MinGW32 compiler. Probably the console code
page 850 and the ANSI code page 1252 mismatch by default :-(.
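
    What such a mismatch would look like can be simulated without
Windows, using Python's cp1252 and cp850 codecs. The pairing 1252/850 is
the assumption made above, not something I could verify on a real
console, and shown_on_console() is a hypothetical helper:

```python
ANSI_CP = "cp1252"  # assumed GetACP() answer
OEM_CP = "cp850"    # assumed console output code page


def shown_on_console(text, assumed=ANSI_CP, actual=OEM_CP):
    """Encode for the code page the program assumed, decode as the console would."""
    return text.encode(assumed).decode(actual)


sample = "caf\u00e9"  # "cafe" with e-acute
print(ascii(sample), "is displayed as", ascii(shown_on_console(sample)))
print("pure ASCII survives:", shown_on_console("gpg") == "gpg")
```

    Only the 7-bit part of the text survives; every accented character
is shown as something else.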


> we silently assume that plain ASCII is actually meant as Latin-1. This
> makes sense because many Unix system don't have their locale set up
> properly and thus would get annoying error messages and we have to
> handle all the "bug" reports. Latin-1 has always been the character
> set used for 8 bit characters on Unix systems.

    US-ASCII is not Latin-1. I frequently use a 7-bit terminal in a
LANG=fr_FR.us-ascii locale, with or without //TRANSLITeration. That
false aliasing would break it. It would also break any case where
locales are unset and the charset is different. Bad idea, I'd say.
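
    A short Python sketch of why the aliasing hurts a 7-bit terminal.
Python has no //TRANSLIT, so errors="replace" is used here as a crude
stand-in (an assumption on my part: iconv would pick a closer
approximation than "?"):

```python
def fits_7bit(data):
    """True when every byte can be shown on a 7-bit terminal."""
    return all(byte < 0x80 for byte in data)


text = "d\u00e9j\u00e0"  # "deja" with accents

# What gets sent if US-ASCII is silently treated as Latin-1.
as_latin1 = text.encode("latin-1")

# What a real US-ASCII conversion sends (crude substitute for //TRANSLIT).
as_ascii = text.encode("ascii", errors="replace")

print("latin-1 output fits 7 bits:", fits_7bit(as_latin1))
print("ascii output fits 7 bits:  ", fits_7bit(as_ascii))
```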


Bye!	Alain.
-- 
Everything about locales on Sven Mascheck's excellent site at new
location <URL:http://www.in-ulm.de/~mascheck/locale/>. The little tester
utility is at <URL:http://www.in-ulm.de/~mascheck/locale/checklocale.c>.