Encodings

George Pauliuc pauliuc@gmx.net
17 Nov 2002 19:22:30 +0200


On Sun, 2002-11-17 at 12:39, Lorenzo Cappelletti wrote:
> Hi all,

Hello Lorenzo!

> I think that UTF-8 doesn't suit everybody's needs and that the 2-letter
> language ID (en, de, etc.) is not enough.  Am I right?

Well... I can speak only for the Romanian part - and that is okay.  I
mean "ro" is enough.  The Republic of Moldova is using Romanian again
(although for political reasons some want to call it Moldavian - it is
the exact same language, with some old words, sometimes written in
Cyrillic thanks to the USSR).  Bottom line, I don't think there will
ever be a "ro_MD" id.

UTF-8 is also okay, but only from a theoretical standpoint.  Real life
shows that most browsing is done on Windows, and the most used browser
is in most cases IE.  Both Windows and IE have a hard time obeying the
standards.  First, Romanian is said to be fully supported, at least
in XP.  But I hear that not even XP has fonts with all the needed
characters.  So, on a Windows machine with the default fonts, some of
the chars in my translation will show up as squares (in place of the
characters the font has no glyph for).

Regarding IE - beware of how it might react.  My main project
(http://romdict.sourceforge.net/) uses CSS and a transparent PNG.  I was
surprised to find out on my girlfriend's computer (she has Windows) that
IE5.5 doesn't show the transparency in the PNG, and that the text isn't
justified as requested by the CSS.  As you can see, that page isn't
anything fancy.  But I guess there are other problems involving Windows
and IE that should be addressed.  I might add that I installed Phoenix
(a stripped down Mozilla version) on that computer and, although the
UTF-8 encoded page couldn't be shown right, the PNG and CSS had no
problems and the site was displayed like it should.

I think encoding (at least in my case) is a tough decision.  I already
told Lorenzo the story, but I'll bother the list again for some advice.

In Romanian we have two special chars: an 's' and a 't' with a comma
below.  The Latin2 charset should have addressed this problem.  But
there was a mistake.  They assumed that it is a cedilla and not a comma.
So there we have an 's' and a 't' with a cedilla below.  The explanation
goes like this - the Turks do have an 's' with a cedilla.  But the 't'
with a cedilla isn't in any language I know.  I can't state there isn't
one, but I have never heard of such a thing.  The Turks later received
a new Latin encoding (5?), so Latin2 should have been corrected.  But it
wasn't.  Some font producers corrected the cedilla 't'.  That led to
inconsistency, and the 's' still has a cedilla.  Finally, there came the
Latin10 encoding, which is correct.  This raises two problems.  One:
there is almost no font support for Latin10 - mostly illegally cracked
Latin2 fonts (the two letters have the same code, only the 'picture'
description is different).  Two: systems like Windows associate Romanian
with Latin2, thus most users who have Latin10 fonts installed still use
Latin2 fonts.  In case the browser isn't smart enough to get its own
fonts, it will resort to the default and show my Latin10 encoded page
in, say, Latin1 fonts.  And that results in a mess.
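
To make the "same code, different picture" point concrete, here is a
minimal Python sketch.  It assumes Python's standard codec names
iso8859_2 (Latin2), iso8859_16 (Latin10) and latin_1 (Latin1); the
helper name decode_all is made up for this illustration.

```python
import unicodedata

# The two bytes that hold the lowercase Romanian 's' and 't'
# in both Latin2 and Latin10.
ROMANIAN_BYTES = b"\xba\xfe"


def decode_all(data):
    """Decode the same bytes under each charset (hypothetical helper).

    Returns a dict mapping codec name to the list of Unicode character
    names the bytes turn into under that codec.
    """
    result = {}
    for codec in ("iso8859_2", "iso8859_16", "latin_1"):
        text = data.decode(codec)
        result[codec] = [unicodedata.name(ch) for ch in text]
    return result


if __name__ == "__main__":
    for codec, names in decode_all(ROMANIAN_BYTES).items():
        print(codec, names)
```

Under Latin2 the bytes come out as the cedilla letters, under Latin10
as the comma-below letters, and under Latin1 as two unrelated
characters, which is the mess a browser produces when it falls back to
the wrong charset.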

Unicode repeated the mistake.  They say the cedilla 't' is for Romanian
(?) and no other language is listed there.  Again, this character might
not even exist in any language.  They corrected the problem by adding
the two missing characters in Latin Extended-B.  If things stopped here,
UTF-8 would be the ideal encoding.  But, as I said above, the fonts that
come with Windows do not have those two characters.  Worse (or
logically), the keyboard mappings produce the cedilla characters and
perpetuate the mistake.
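
Since keyboards keep producing the cedilla letters, one workaround is
to map them to the comma-below letters before publishing.  The sketch
below is only an illustration: the name fix_romanian is hypothetical,
and it must be applied to Romanian text only, because the 's' with a
cedilla is a legitimate Turkish letter.

```python
# Cedilla forms (Latin Extended-A) -> comma-below forms (Latin Extended-B).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # capital S
    "\u015f": "\u0219",  # small s
    "\u0162": "\u021a",  # capital T
    "\u0163": "\u021b",  # small t
})


def fix_romanian(text):
    """Replace cedilla s/t with comma-below s/t (hypothetical helper)."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    typed = "\u015fi \u0163ar\u0103"   # typed with a cedilla keyboard map
    fixed = fix_romanian(typed)
    print(fixed)
    print(fixed.encode("utf-8"))       # the bytes a UTF-8 page would carry
```

This only fixes the text itself; whether the result shows up right
still depends on the reader having a font with the two characters.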

Bottom line, I don't know what encoding to use.  Latin10 might result
in Latin1 being used even in some of the cases where Latin10 fonts are
installed.  Latin2 is incorrect.  And UTF-8 on Windows doesn't show up
right.  I was told a Mac can print my project pages okay (even the
Unicode pages).  And I've tested that on various Unices.  As long as
the fonts are there, the pages are okay.

> 2. WML files should not get messed up when one just opens and saves a
> .wml file.  I've heard from Debian crew that it has happened because of
> some text editors.

To prevent that, everybody should have their own copy of the main .wml
in his/her branch.  It is redundant, but it avoids some problems.
Finally, recode can do a great job of changing the character encoding
with almost no effort.

> To resize the problem, I don't think it's necessary that any web
> translator be able to view other translators' translations.  What
> matters is that, when translators work on their translation, they
> won't spoil others' work just because their editor doesn't correctly
> handle (by ignoring) non-Latin encodings.

Or use UTF-8 for the .wml, then convert the output to whatever encoding
would be used on the site.
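
A rough Python sketch of that conversion step, under my own
assumptions: the function name transcode and the default target
iso8859_16 (Latin10) are just examples, and the output is assumed to be
HTML, so that characters missing from the target charset can fall back
to numeric character references instead of being lost.

```python
def transcode(data, source="utf-8", target="iso8859_16"):
    """Convert generated output from one encoding to another.

    Hypothetical helper: characters the target charset cannot hold
    become HTML numeric references (e.g. &#537;) rather than errors.
    """
    text = data.decode(source)
    return text.encode(target, errors="xmlcharrefreplace")


if __name__ == "__main__":
    page = "\u0219i \u021bar\u0103".encode("utf-8")
    print(transcode(page))                      # Latin10 holds every char
    print(transcode(page, target="iso8859_2"))  # Latin2 lacks the comma forms
```

With Latin10 as the target each letter becomes a single byte; with
Latin2 the comma-below letters come out as numeric references, which
again only display correctly if the reader's fonts have them.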

