Changes between Initial Version and Version 1 of CharAsUnicode

Dec 6, 2005 11:53:38 AM


     1= The Char type =
     3The Haskell 98 Report (section "Characters and Strings") states that the type `Char` represents [wiki:Unicode], which seems to be the canonical choice.
     4The functions of the `Char` module work with Unicode in GHC and Hugs, with one divergence from the Report:
     5 * `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
     6More sophisticated functions could be provided by additional libraries.
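
The divergence noted above can be observed directly. The following is a minimal sketch, assuming GHC's `Data.Char`; the helper names `isCasedLetter` and `divergent` are hypothetical and exist only for this illustration.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- Hypothetical helper: the narrower reading of "alphabetic" mentioned
-- above, i.e. the union of lower- and upper-case letters.
isCasedLetter :: Char -> Bool
isCasedLetter c = isLower c || isUpper c

-- Keep only the characters on which the two predicates disagree:
-- alphabetic according to isAlpha, but neither lower- nor upper-case.
divergent :: [Char] -> [Char]
divergent = filter (\c -> isAlpha c && not (isCasedLetter c))
```

For example, `divergent "aZ\1488"` keeps only U+05D0 (Hebrew alef), a letter that has no case.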
     8== Input and Output ==
     10The Haskell 98 `Prelude` and `IO` modules provide I/O primitives using the `Char` type.
     12 * All character-based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale.
     13 * Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset.
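
The "bytes treated as characters" behaviour can be sketched as a pair of pure functions. This is an illustration only: the names `bytesAsChars` and `charsAsBytes` are hypothetical, and truncating code points above 255 on output is an assumption made here, not something the text specifies.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Input: each octet becomes the character with the same code point,
-- so only U+0000..U+00FF (the Latin-1 subset) can ever be produced.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- Output: each character is narrowed to a single octet.
-- ASSUMPTION: code points above 255 are truncated modulo 256.
charsAsBytes :: String -> [Word8]
charsAsBytes = map (fromIntegral . ord)
```

Under this reading a UTF-8 encoded "é" (octets `0xC3 0xA9`) is read as two separate characters, and writing U+20AC loses information.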
     15Assuming we retain Unicode as the representation of `Char`:
     17 * Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation?
     18 * [wiki:BinaryIO] is needed anyway, and would provide a base for these encodings.
     19 * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However, it might not be in the Prelude if we [wiki:Prelude shrink the Prelude].
     21== Strings in System functions ==
     23Native system calls use varying representations of strings:
     25 * Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all).
     26 * The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16.
     28Haskell 98 defines `FilePath` as `String` (used in the `Prelude`, `IO` and `Directory` modules).
     29The functions in the `System` module use `String` for program arguments and environment values.
     31 * Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale.
     32 * Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.
     33 * The ForeignFunctionInterface specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations.
     35A disadvantage of using encodings is that some byte-strings may not be legal encodings, so that, for example, using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many.
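
One possible shape for such an abstract type is sketched below. Everything in it is hypothetical: the name `OSString`, the byte-list representation (which fits a Unix-like system; a Windows variant would presumably wrap UTF-16 code units), and the use of plain ASCII as a stand-in for the locale encoding.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type: a string kept in the form the O/S supplied.
-- A program argument held as an OSString can be handed back to the O/S
-- as a filename unchanged, whether or not it is a legal encoding.
newtype OSString = OSString [Word8]
  deriving (Eq, Show)

-- Conversion to String is partial: not every byte-string is a legal encoding.
-- ASSUMPTION: ASCII stands in for the locale's encoding in this sketch.
toString :: OSString -> Maybe String
toString (OSString bs)
  | all (< 0x80) bs = Just (map (chr . fromIntegral) bs)
  | otherwise       = Nothing

-- Conversion from String is partial as well: not every character is encodable.
fromString :: String -> Maybe OSString
fromString s
  | all ((< 0x80) . ord) s = Just (OSString (map (fromIntegral . ord) s))
  | otherwise              = Nothing
```

The `Maybe` results make the failure cases from the paragraph above explicit instead of hiding them in a lossy conversion.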
     37== A Straw-Man Proposal ==
     39 * '''I/O.''' 
     40   All raw I/O is in terms of octets, i.e. {{{Word8}}}
     41 * '''Conversions.'''
     42   Pure functions exist to convert octets to and from any particular encoding:
     43   {{{
     44   stringDecode :: Encoding -> [Word8] -> [Char]
     45   stringEncode :: Encoding -> [Char] -> [Word8]
     46   }}}
     47   The codecs must operate on strings, not individual characters, because some
     48   encodings use variable-length sequences of octets.
     49 * '''Efficiency.'''
     50   Semantically, character-based I/O is a simple composition of the raw
     51   I/O primitives with an encoding conversion function.  However, for
     52   efficiency, an implementation might choose to provide certain encoded
     53   I/O operations primitively.  If such primitives are exposed to the
     54   user, they should have standard names so that other implementations can
     55   provide the same functionality in pure Haskell Prime.
     56 * '''Locales.'''
     57   It may be possible to retain the traditional I/O signatures for
     58   hGetChar, hPutChar, readFile, writeFile, etc., but only by introducing
     59   a stateful notion of ''current encoding'' associated with each
     60   individual handle.  The default encoding could be inherited from the
     61   operating system environment, but it should also be possible to
     62   change the encoding explicitly.
     63   {{{
     64   getIOEncoding :: Handle -> IO Encoding
     65   setIOEncoding :: Encoding -> Handle -> IO ()
     66   resetIOEncoding :: Handle -> IO ()  -- go back to default
     67   }}}
     68 * '''Filenames, program arguments, environment.'''
     69   * Filenames are stored in Haskell as {{{[Char]}}}, but the operating
     70     system should receive {{{[Word8]}}} for any I/O using filenames.
     71     Some encoding conversion is therefore required.  Usually, this will
     72     be platform-dependent, and so the actual encoding may be hidden
     73     from the programmer as part of the default locale.
     74   * Program arguments, and symbols from the environment, are supplied
     75     by the operating system to the Haskell program as {{{[Word8]}}}.
     76     The program is responsible for conversion to {{{[Char]}}}.  Again,
     77     there may be a default encoding chosen based on the locale.
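
The conversion functions of the straw-man proposal might be sketched as follows. This is not a definitive implementation: the constructors `Latin1` and `UTF8`, the helpers `readChars` and `writeChars`, the use of U+FFFD for malformed input, and truncation in the Latin-1 encoder are all assumptions of this sketch. The UTF-8 decoder is also simplified and does not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ASSUMPTION: a closed set of encodings; the proposal leaves Encoding abstract.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map (fromIntegral . ord)  -- truncates above U+00FF
stringEncode UTF8   = concatMap (map fromIntegral . utf8 . ord)
  where
    utf8 :: Int -> [Int]
    utf8 n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
    cont x = 0x80 .|. (x .&. 0x3F)              -- continuation octet 10xxxxxx

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go . map fromIntegral
  where
    go :: [Int] -> [Char]
    go [] = []
    go (b:bs)
      | b < 0x80              = chr b : go bs
      | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F) bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F) bs
      | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07) bs
      | otherwise             = '\xFFFD' : go bs   -- stray or invalid octet
    -- Consume k continuation octets; on failure skip only the lead octet.
    multi k acc bs =
      let (cs, rest) = splitAt k bs
      in if length cs == k && all isCont cs
           then safeChr (foldl (\a c -> (a `shiftL` 6) .|. (c .&. 0x3F)) acc cs) : go rest
           else '\xFFFD' : go bs
    isCont c = c .&. 0xC0 == 0x80
    safeChr n | n > 0x10FFFF = '\xFFFD'
              | otherwise    = chr n

-- Character I/O as a composition of a raw octet primitive with a codec.
-- The raw reader and writer are parameters, so no particular I/O API is assumed.
readChars :: Functor m => Encoding -> m [Word8] -> m [Char]
readChars enc rawRead = fmap (stringDecode enc) rawRead

writeChars :: Encoding -> ([Word8] -> m ()) -> [Char] -> m ()
writeChars enc rawWrite = rawWrite . stringEncode enc
```

Because the codecs work on whole lists, a multi-octet sequence is decoded as one unit, which is the reason the proposal gives for not converting character by character. A per-handle current encoding would amount to storing an `Encoding` value alongside the handle and passing it to `readChars` and `writeChars`.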