Version 1 (modified by ross@…, 10 years ago)

Unicode in Haskell source

The Haskell 98 Report (Lexical Structure) claims that Haskell source code uses the Unicode character set. Haskell source code is stored in text files using various character sets and encodings. If Unicode were allowed, how would implementations know which encoding was used?

  • Jhc allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8. Several uses of Unicode characters in place of Haskell keywords are permitted:
    • '→' ('\x2192') is equivalent to '->'
    • '←' ('\x2190') is equivalent to '<-'
    • '∷' ('\x2237') is equivalent to '::'
    • '‥' ('\x2025') is equivalent to '..'
    • '⇒' ('\x21d2') is equivalent to '=>'
    • '∀' ('\x2200') is equivalent to 'forall'
    • '∃' ('\x2203') is equivalent to 'exists' (see ExistentialQuantification)
    In addition there is experimental support for defining new operators and names using various Unicode characters.
  • Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
  • Others treat source code as ISO 8859-1 (Latin-1).
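
As an illustration of the Jhc equivalences listed above, the table can be written down as a lookup from each Unicode token to its ASCII spelling. This is only a sketch: the names asciiEquivalent and toAscii are hypothetical, and it is not how Jhc itself implements the feature (a real lexer would leave comments and literals alone).

```haskell
import Data.Maybe (fromMaybe)

-- | ASCII spelling of each Unicode token from the Jhc list above.
-- (Hypothetical helper, not part of any implementation.)
asciiEquivalent :: Char -> Maybe String
asciiEquivalent c = lookup c
  [ ('\x2192', "->")
  , ('\x2190', "<-")
  , ('\x2237', "::")
  , ('\x2025', "..")
  , ('\x21d2', "=>")
  , ('\x2200', "forall")
  , ('\x2203', "exists")
  ]

-- | Naive character-by-character rewrite. It also rewrites characters
-- inside comments and literals, which a real lexer must not do.
toAscii :: String -> String
toAscii = concatMap (\c -> fromMaybe [c] (asciiEquivalent c))
```

For example, toAscii applied to "f ∷ a → a" gives "f :: a -> a".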

Some things we could do:

  • Revert to US-ASCII, Latin-1 or implementation-defined character sets.
  • Allow Unicode with the encoding specified outside source files (e.g. by the current locale, as currently done by Hugs). This would make Haskell source containing non-ASCII characters non-portable.
  • Allow Unicode, with a mechanism for specifying encoding in the source file, e.g.
    • Introduce a pragma {-# ENCODING e #-} with a range of possible values of the encoding e (cf. IANA character sets). If the pragma is present, it must be at the beginning of the file. If it is not present, the file is assumed to be encoded in US-ASCII. Note that even if the pragma is present, some heuristic may be needed to get as far as interpreting the encoding declaration, as in XML. The fact that the first three characters must be {-# will be useful here. Haskell implementations must support at least the encodings US-ASCII, ISO-8859-1, and UTF-8.
  • Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this).
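
A minimal sketch of how the proposed {-# ENCODING e #-} pragma might be read. It assumes the start of the file has already been decoded with an ASCII-compatible encoding, so that the leading {-# is recognisable; it does not attempt the XML-style heuristics for other encodings. The function name declaredEncoding is hypothetical.

```haskell
import Control.Monad (guard)
import Data.List (isPrefixOf, stripPrefix)
import Data.Maybe (fromMaybe)

-- | Encoding named by a leading {-# ENCODING e #-} pragma.
-- Falls back to "US-ASCII" when the file does not start with one,
-- following the default suggested in the proposal above.
declaredEncoding :: String -> String
declaredEncoding src = fromMaybe "US-ASCII" $ do
  rest <- stripPrefix "{-#" src            -- pragma must open the file
  let (body, close) = break (== '#') rest
  guard ("#-}" `isPrefixOf` close)         -- pragma must be closed
  case words body of
    ["ENCODING", e] -> Just e
    _               -> Nothing             -- some other pragma
```

With this sketch, a file beginning {-# ENCODING UTF-8 #-} yields "UTF-8", while a file beginning with a module header or a different pragma yields "US-ASCII".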

If Unicode is allowed, should its use be restricted?

  • Haskell 98 already has character escapes for arbitrary Unicode characters in character and string literals. Thus Unicode in these literals can always be transformed into a portable form.
  • Haskell 98 permits upper, title and lower case alphabetic characters (but not other alphabetic characters) in identifiers, and symbol or punctuation characters in symbols. Thus a source text may not be representable in all encodings (especially ASCII).
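
The first point can be sketched with the standard Show instance for String, which renders every non-ASCII character as a numeric escape, so the result is pure ASCII. The names portableLiteral and needsEscaping are hypothetical. Nothing comparable exists for identifiers and symbols, which have no escaped form.

```haskell
import Data.Char (isAscii)

-- | Render a string as an ASCII-only Haskell string literal.
-- 'show' writes non-ASCII characters as decimal escapes.
portableLiteral :: String -> String
portableLiteral = show

-- | True when the text contains a character outside ASCII, i.e. when
-- it cannot be stored unchanged in a US-ASCII source file.
needsEscaping :: String -> Bool
needsEscaping = any (not . isAscii)
```

For instance, the two-character string "λx" is rendered as the literal "\955x" (including the surrounding quotes), which reads back as the same string.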

It is not reasonable to display all Unicode characters with the same width, but the Haskell 98 Report (Layout) says:

For the purposes of the layout rule, Unicode characters in a source program are considered to be of the same, fixed, width as an ASCII character. However, to avoid visual confusion, programmers should avoid writing programs in which the meaning of implicit layout depends on the width of non-space characters.

Is this adequate?
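
To make the quoted rule concrete, here is a small sketch of column counting in which every character, whatever its displayed width, advances the column by one. It also assumes the Report's convention that the first column is 1 and tab stops are 8 characters apart. The name nextColumn is hypothetical.

```haskell
-- | Column (1-based) of the character that would follow the given
-- line prefix. Every non-tab character counts as width one.
nextColumn :: String -> Int
nextColumn = foldl step 1
  where
    -- a tab moves to the next tab stop (columns 1, 9, 17, ...)
    step col '\t' = ((col - 1) `div` 8 + 1) * 8 + 1
    -- any other character, ASCII or not, advances by one column
    step col _    = col + 1
```

Under this rule the prefix "名前 = do " contains eight characters, so the next token is in column 9, the same as after "ab = do ". An editor that draws each of the two ideographs two cells wide would show that token two cells further right, which is the kind of visual confusion the Report warns about.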