Japanese and UTF8

17 Feb 2000 21:00:33 +0900

Werner Koch writes:

>I expect that EOC_JP uses state shifting.

  I take it you meant EUC-JP :-).

  At the level of single octets, it has state shifting.
  At the level of multi-octet sequences, it doesn't.

* EUC uses two kinds of single shift characters.  They are
  SS2R and SS3R, which are coded at 8/14 and 8/15 (0x8E and
  0x8F) respectively.  In this sense, it has state shifting.

* OTOH,
  - ASCII character set (strictly speaking, it may be the
    Latin alphabetic character set of JIS X 0201, but the
    difference between the two is minor) is ALWAYS designated
    in the G0 element.  G0 is ALWAYS invoked in the GL area.
  - 2 byte KANZI character set is ALWAYS designated in the G1
    element.  A sequence of two GR bytes without a leading
    single shift invokes G1.
  - KATAKANA character set is ALWAYS designated in G2
    element.  Sequence of SS2R and one GR byte invokes G2.
  - 2 byte supplementary KANZI character set is ALWAYS
    designated in G3.  Sequence of SS3R and two GR bytes
    invokes G3.
  - Other sequences are illegal.
  In this sense, it doesn't have state shifting, and thus the
  logic can be hard coded (if you want :-).
  Roughly, the code looks like this:

if (isascii(c)) {
  /* bit pattern 0xxx xxxx */
  Frob_Ascii(c);
} else switch (0xff & c) {
case 0x8e:
  /* bit pattern 1000 1110  1xxx xxxx */
  Get_one_more_byte_and_frob_it_as_KANA();
  break;
case 0x8f:
  /* bit pattern 1000 1111  1xxx xxxx  1xxx xxxx */
  Get_two_more_bytes_and_frob_them_as_supplementary_KANZI();
  break;
default:
  /* mask again: c may be a signed char and thus negative here */
  if (0xa0 < (0xff & c) && (0xff & c) < 0xff) {
    /* bit pattern 1xxx xxxx  1xxx xxxx */
    Get_one_more_byte_and_frob_them_as_KANZI(c);
  } else {
    Alert_Error(c);
  }
}
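
For what it's worth, here is a minimal self-contained sketch of the
same dispatch that also checks the trailing bytes.  The names COLROW,
euc_kind, euc_classify and is_gr are made up for illustration and do
not come from any library.  As an assumption, it accepts any GR byte
(0xA1..0xFE) after SS2R, although JIS X 0201 KATAKANA only defines
0xA1..0xDF.

```c
#include <assert.h>
#include <stddef.h>

/* ISO 2022 column/row notation: 8/14 is column 8, row 14, i.e. 0x8E. */
#define COLROW(col, row) (((col) << 4) | (row))

enum { SS2R = COLROW(8, 14), SS3R = COLROW(8, 15) };

/* Hypothetical labels for the four legal cases, plus an error marker. */
typedef enum {
    EUC_ASCII,       /* G0: one GL byte                 */
    EUC_KANZI,       /* G1: two GR bytes                */
    EUC_KANA,        /* G2: SS2R followed by one GR byte  */
    EUC_SUPP_KANZI,  /* G3: SS3R followed by two GR bytes */
    EUC_ERROR
} euc_kind;

/* GR bytes that can encode a graphic character: 0xA1..0xFE. */
static int is_gr(unsigned char b)
{
    return 0xA1 <= b && b <= 0xFE;
}

/*
 * Classify the character that starts at s[0], with n bytes available.
 * The number of bytes consumed is stored in *len.  On error *len is 1,
 * so that a caller can skip one byte and try to resynchronise.
 */
static euc_kind euc_classify(const unsigned char *s, size_t n, size_t *len)
{
    assert(s != NULL && len != NULL);
    *len = 1;
    if (n == 0)
        return EUC_ERROR;
    if (s[0] < 0x80)
        return EUC_ASCII;
    if (s[0] == SS2R) {
        if (n >= 2 && is_gr(s[1])) { *len = 2; return EUC_KANA; }
        return EUC_ERROR;
    }
    if (s[0] == SS3R) {
        if (n >= 3 && is_gr(s[1]) && is_gr(s[2])) { *len = 3; return EUC_SUPP_KANZI; }
        return EUC_ERROR;
    }
    if (is_gr(s[0]) && n >= 2 && is_gr(s[1])) {
        *len = 2;
        return EUC_KANZI;
    }
    return EUC_ERROR;  /* C1 controls, 0xA0, 0xFF, or a truncated sequence */
}
```

A caller would walk a buffer by calling euc_classify() repeatedly and
advancing by *len each time.  No state has to be carried from one
character to the next, which is the point made above.
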
--
  iida