Version 14 (modified by ross@…, 12 years ago)

Background on Unicode

  • Unicode (or equivalently ISO 10646-1) defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
  • Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc., and some glyphs may correspond to a sequence of characters (e.g. a base character followed by combining mark characters).
  • Unicode specifies how mixed left-to-right and right-to-left scripts are to be displayed (the BiDi algorithm).
  • The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
  • For reasons of backwards compatibility and space efficiency, there are a variety of variable-length encodings of the code points themselves into byte streams.
    • UTF-8 keeps the ASCII characters in their traditional coding: a single byte with the top bit clear. Every other character is coded as a sequence of two to four bytes, each with the top bit set.
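    • UTF-16 codes each character in the range 0x0 to 0xffff as a single 16-bit unit. This does not cover the entire code space, so code points from 0x10000 upwards are coded as a 'surrogate pair' of two successive 16-bit units, the first in the range 0xd800 to 0xdbff and the second in the range 0xdc00 to 0xdfff. These two ranges are reserved for this purpose and never stand for characters by themselves, so the encoding carries no state from one character to the next. Most characters in common use fit in a single 16-bit unit.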
    • UCS-4 (also known as UTF-32) uses a full 32-bit word per character.
    • To make things more exciting, the UTF-16 and UCS-4 encodings each come in two variations, big-endian and little-endian, traditionally following the byte order of the machine the data was written on. So if you read a raw byte-stream and want to convert it to 16-bit or 32-bit chunks, you first need to work out the byte-ordering. The stream may begin with a 'byte-order mark' (the non-printing character 0xfeff), which settles the question. Since the mark may or may not be present, readers often fall back on heuristics applied to the first few bytes.
  • Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity.
  • Unix-like systems and many others traditionally deal with byte-streams. Various regional encodings are still widely used, but UTF-8 is growing in popularity.
  • Windows NT and its successors use 16-bit characters: UCS-2 originally, and UTF-16 since Windows 2000.
  • Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type wchar_t (wide character) is UCS-4.
  • Any system must be able to read/write files that originated on any other platform.
  • As an example of the complex heuristics needed to guess the encoding of any particular file, see the XML standard (Appendix F, on autodetection of character encodings).
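
The byte-level encodings in the list above can be made concrete with a small sketch. The following Python is illustrative only: the function names encode_utf8, encode_utf16_units and sniff_bom are hypothetical, the UTF-16 routine uses the standard surrogate-pair arithmetic, and the byte-order-mark check is just the simplest possible heuristic (a file with no mark yields None and would need further guessing). The codec names returned by sniff_bom are Python's own, where UCS-4 appears as "utf-32".

```python
import codecs


def _check(cp):
    # Valid code points are 0x0..0x10ffff, excluding the surrogate range.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)


def encode_utf8(cp):
    """Encode one code point as UTF-8 bytes (hypothetical helper)."""
    _check(cp)
    if cp < 0x80:        # ASCII: one byte, top bit clear
        return bytes([cp])
    if cp < 0x800:       # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf16_units(cp):
    """Encode one code point as a list of 16-bit units (hypothetical helper)."""
    _check(cp)
    if cp < 0x10000:     # fits in a single 16-bit unit
        return [cp]
    v = cp - 0x10000     # 20 bits, split 10/10 across a surrogate pair
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


# Byte-order marks, longest first: the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a byte-order mark, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

As a sanity check, encode_utf8(cp) should agree with Python's built-in chr(cp).encode("utf-8") for any valid code point, and a little-endian UTF-16 file that begins with the bytes 0xff 0xfe should be reported by sniff_bom as "utf-16-le".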

See also UnicodeInHaskellSource and CharAsUnicode.