Japanese and UTF8

IIDA Yosiaki iida@ring.gr.jp
17 Feb 2000 21:00:33 +0900

Werner Koch writes:

>I expect that EOC_JP uses state shifting.
I see you meant EUC-JP :-).
At the single-octet level, it has state shifting.
At the multi-octet level, it doesn't.

* EUC uses two kinds of single shift characters.
  They are SS2R and SS3R, which are coded at 8/14 and 8/15
  respectively.  In this sense, it has state shifting.

* OTOH,
  - The ASCII character set (strictly speaking, it may be the
    Latin Alphabetic character set of JIS X 0201, but the
    difference between the two is no big deal) is ALWAYS
    designated to the G0 element.  G0 is ALWAYS invoked in
    the GL area.
  - The 2 byte KANZI character set is ALWAYS designated to
    the G1 element.  A two GR byte sequence without a leading
    single shift invokes G1.
  - The KATAKANA character set is ALWAYS designated to the
    G2 element.  A sequence of SS2R and one GR byte invokes G2.
  - The 2 byte supplementary KANZI character set is ALWAYS
    designated to the G3 element.  A sequence of SS3R and two
    GR bytes invokes G3.
  - Other sequences are illegal.
  In this sense, it doesn't have state shifting and thus
  the logic can be hard coded (if you want :-).

Roughly, the following is the code:

	if (isascii(c)) {
		/* bit pattern 0xxx xxxx */
		Frob_Ascii(c);
	} else switch (0xff & c) {
	case 0x8e:
		/* bit pattern 1000 1110  1xxx xxxx */
		Get_one_more_byte_and_frob_it_as_KANA();
		break;
	case 0x8f:
		/* bit pattern 1000 1111  1xxx xxxx  1xxx xxxx */
		Get_two_more_bytes_and_frob_them_as_supplementary_KANZI();
		break;
	default:
		/* mask first, so that a signed char also compares correctly */
		if (0xa0 < (0xff & c) && (0xff & c) < 0xff) {
			/* bit pattern 1xxx xxxx  1xxx xxxx */
			Get_one_more_byte_and_frob_them_as_KANZI(c);
		} else {
			Alert_Error(c);
		}
	}

--
iida
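
For anyone who wants to try the dispatch described in this message, here is a minimal self-contained sketch in C. It is not part of the original post. The names euc_class, is_gr and euc_jp_count are hypothetical, made up for illustration. The sketch only checks that trailing bytes fall in the GR range 0xA1 to 0xFE; it does not check whether a code point is actually assigned in the corresponding JIS character set.

```c
#include <stddef.h>

/* Code sets of EUC-JP, one per G element (hypothetical names). */
enum euc_class {
    EUC_ASCII,       /* G0: single GL byte */
    EUC_KANZI,       /* G1: two GR bytes */
    EUC_KANA,        /* G2: SS2 (0x8e) + one GR byte */
    EUC_SUPP_KANZI,  /* G3: SS3 (0x8f) + two GR bytes */
    EUC_CLASS_COUNT
};

/* True if b is a GR byte usable inside a character (0xa1..0xfe). */
static int is_gr(unsigned char b)
{
    return 0xa1 <= b && b <= 0xfe;
}

/*
 * Count the characters of each class in buf[0..len).
 * Returns 0 on success, -1 on an illegal or truncated sequence
 * (counts then holds the totals up to the point of failure).
 */
int euc_jp_count(const unsigned char *buf, size_t len,
                 size_t counts[EUC_CLASS_COUNT])
{
    size_t i = 0, j, need;
    enum euc_class cls;
    int k;

    for (k = 0; k < EUC_CLASS_COUNT; k++)
        counts[k] = 0;

    while (i < len) {
        unsigned char c = buf[i];

        /* The lead byte alone decides class and length: no state kept. */
        if (c < 0x80) {
            cls = EUC_ASCII;       need = 0;
        } else if (c == 0x8e) {
            cls = EUC_KANA;        need = 1;
        } else if (c == 0x8f) {
            cls = EUC_SUPP_KANZI;  need = 2;
        } else if (is_gr(c)) {
            cls = EUC_KANZI;       need = 1;
        } else {
            return -1;             /* 0x80..0x8d, 0x90..0xa0, 0xff */
        }

        if (len - i - 1 < need)
            return -1;             /* truncated sequence */
        for (j = 1; j <= need; j++)
            if (!is_gr(buf[i + j]))
                return -1;         /* trailing byte not in GR */

        counts[cls]++;
        i += need + 1;
    }
    return 0;
}
```

For example, feeding it the bytes 'A', 0xa4 0xa2, 0x8e 0xb1, 0x8f 0xb0 0xa1 should count one character in each of the four classes, while a lone 0xa4 or a lone 0x80 should be rejected.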