Japanese and UTF8
IIDA Yosiaki
iida@ring.gr.jp
17 Feb 2000 21:00:33 +0900
Werner Koch writes:
>I expect that EOC_JP uses state shifting.
I take it you meant EUC-JP :-).
Viewed one octet at a time, it has state shifting.
Viewed as multi-octet sequences, it doesn't.
* EUC uses two kinds of single shift characters. They are
  SS2R and SS3R, which are coded at 8/14 and 8/15 (0x8e and
  0x8f) respectively.
  In this sense, it has state shifting.
* OTOH,
  - ASCII character set (strictly speaking, it may be the
    Latin Alphabetic character set of JIS X 0201, but the
    difference between the two is small) is ALWAYS designated
    in the G0 element. G0 is ALWAYS invoked in the GL area.
  - 2 byte KANZI character set is ALWAYS designated in the G1
    element. A sequence of two GR bytes without a leading
    single shift invokes G1.
  - KATAKANA character set is ALWAYS designated in the G2
    element. A sequence of SS2R and one GR byte invokes G2.
  - 2 byte supplementary KANZI character set is ALWAYS
    designated in G3. A sequence of SS3R and two GR bytes
    invokes G3.
  - Other sequences are illegal.
  In this sense, it doesn't have state shifting and thus the
  logic can be hard-coded (if you want :-).
Roughly, the following is the code:
if (isascii(c)) {
    /* bit pattern 0xxx xxxx */
    Frob_Ascii(c);
} else switch (0xff & c) {
case 0x8e:
    /* bit pattern 1000 1110 1xxx xxxx */
    Get_one_more_byte_and_frob_it_as_KANA();
    break;
case 0x8f:
    /* bit pattern 1000 1111 1xxx xxxx 1xxx xxxx */
    Get_two_more_bytes_and_frob_them_as_supplementary_KANZI();
    break;
default:
    /* mask as in the switch, so that a signed char compares correctly */
    if (0xa0 < (0xff & c) && (0xff & c) < 0xff) {
        /* bit pattern 1xxx xxxx 1xxx xxxx */
        Get_one_more_byte_and_frob_them_as_KANZI(c);
    } else {
        Alert_Error(c);
    }
}
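
For anyone who wants something compilable, here is a minimal sketch of
the same dispatch, working on a byte buffer and also checking the
trailing bytes. The names euc_kind, is_gr and euc_classify are made up
for this example and are not from any library. It assumes every
trailing byte must be a GR byte in the range 0xa1..0xfe, and it does
not check whether a code point is actually assigned in the character
set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */
enum euc_kind { EUC_ASCII, EUC_KANA, EUC_KANZI, EUC_SUPP_KANZI, EUC_ERROR };

/* Assumption: a usable GR byte lies in 0xa1..0xfe. */
static int is_gr(unsigned char b)
{
    return 0xa1 <= b && b <= 0xfe;
}

/*
 * Classify the character that starts at s[0]; n is the number of
 * bytes available.  The length of the sequence is stored in *len.
 * On error *len is 1, so that the caller can skip one byte and retry.
 */
static enum euc_kind euc_classify(const unsigned char *s, size_t n, size_t *len)
{
    assert(s != NULL && len != NULL);
    *len = 1;
    if (n == 0)
        return EUC_ERROR;
    if (s[0] < 0x80)                    /* 0xxx xxxx: G0 in GL */
        return EUC_ASCII;
    if (s[0] == 0x8e) {                 /* SS2R + one GR byte: G2 */
        if (n >= 2 && is_gr(s[1])) {
            *len = 2;
            return EUC_KANA;
        }
        return EUC_ERROR;
    }
    if (s[0] == 0x8f) {                 /* SS3R + two GR bytes: G3 */
        if (n >= 3 && is_gr(s[1]) && is_gr(s[2])) {
            *len = 3;
            return EUC_SUPP_KANZI;
        }
        return EUC_ERROR;
    }
    if (is_gr(s[0]) && n >= 2 && is_gr(s[1])) {   /* two GR bytes: G1 */
        *len = 2;
        return EUC_KANZI;
    }
    return EUC_ERROR;                   /* everything else is illegal */
}
```

Since the first byte alone decides which branch is taken, the function
needs no state between calls; a caller can just advance by *len and
call it again.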
--
iida