Version 2 (modified by autrijus@…, 9 years ago)
Source Encoding Detection
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8) or require the encoding to be signalled by out-of-band means, which makes Haskell source code containing characters outside the ASCII range non-portable.
This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
This proposal does not cover user-specified source encoding.
This heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.
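    import Data.Word

    data EncodedSource
        = UTF8 [Word8]
        | UTF16 Endian [Word8]
        | UTF32 Endian [Word8]
        -- UserDefined ... (user-specified encodings are not covered here)

    data Endian = LittleEndian | BigEndian

    -- Patterns are tried top to bottom, so their order is significant.
    detectSourceEncoding :: [Word8] -> EncodedSource
    detectSourceEncoding bytes = case bytes of
        -- Empty and one-byte inputs
        []                       -> UTF8 []
        [0x00]                   -> invalidNulls
        xs@[_]                   -> UTF8 xs
        -- Two-byte inputs, plus the UTF-16 big-endian BOM at any length
        [0xFF, 0xFE]             -> UTF16 LittleEndian []
        (0xFE:0xFF:xs)           -> UTF16 BigEndian xs
        [0x00, 0x00]             -> invalidNulls
        xs@[0x00, _]             -> UTF16 BigEndian xs
        xs@[_, 0x00]             -> UTF16 LittleEndian xs
        xs@[_, _]                -> UTF8 xs
        -- UTF-8 BOM; tested before the generic three-byte case so that an
        -- input consisting of the BOM alone still has it consumed
        (0xEF:0xBB:0xBF:xs)      -> UTF8 xs
        -- Three-byte inputs
        [0x00, 0x00, 0x00]       -> invalidNulls
        xs@[_, _, _]             -> UTF8 xs
        -- Remaining BOMs; UTF-32 LE is tested before UTF-16 LE because
        -- FF FE 00 00 begins with FF FE
        (0x00:0x00:0xFE:0xFF:xs) -> UTF32 BigEndian xs
        (0xFF:0xFE:0x00:0x00:xs) -> UTF32 LittleEndian xs
        (0xFF:0xFE:xs)           -> UTF16 LittleEndian xs  -- FF FE is the little-endian BOM
        -- No BOM: infer the encoding from the position of zero bytes
        (0x00:0x00:0x00:0x00:_)  -> invalidNulls
        xs@(0x00:0x00:0x00:_)    -> UTF32 BigEndian xs
        xs@(_:0x00:0x00:0x00:_)  -> UTF32 LittleEndian xs
        (0x00:0x00:_)            -> invalidNulls
        xs@(0x00:_)              -> UTF16 BigEndian xs
        xs@(_:0x00:_)            -> UTF16 LittleEndian xs
        xs                       -> UTF8 xs
      where
        invalidNulls = error "(implementation-specific error message)"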
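Once the encoding has been identified, the remaining bytes still have to be decoded into characters before lexical analysis. The proposal does not specify that step; the following is only a minimal sketch of one way to decode the UTF-16 payload, using `base` alone. The names `codeUnits`, `decodeUnits` and `decodeUtf16` are hypothetical, and the local `Endian` type merely mirrors the one in the proposal so that the block is self-contained.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word8, Word16)

-- Mirrors the proposal's Endian type (repeated here for self-containment).
data Endian = LittleEndian | BigEndian

-- Pair up bytes into 16-bit code units.
-- Returns Nothing if a trailing odd byte is left over.
codeUnits :: Endian -> [Word8] -> Maybe [Word16]
codeUnits _ []         = Just []
codeUnits _ [_]        = Nothing
codeUnits e (a:b:rest) = (unit :) <$> codeUnits e rest
  where
    unit = case e of
      BigEndian    -> combine a b
      LittleEndian -> combine b a
    combine :: Word8 -> Word8 -> Word16
    combine hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Turn code units into characters, combining surrogate pairs.
-- Returns Nothing on an unpaired surrogate.
decodeUnits :: [Word16] -> Maybe String
decodeUnits [] = Just []
decodeUnits (u:us)
  | u < 0xD800 || u > 0xDFFF = (chr (fromIntegral u) :) <$> decodeUnits us
  | u <= 0xDBFF, (l:rest) <- us, l >= 0xDC00, l <= 0xDFFF =
      (chr (0x10000 + ((fromIntegral u - 0xD800) `shiftL` 10)
                    + (fromIntegral l - 0xDC00)) :) <$> decodeUnits rest
  | otherwise = Nothing

-- Hypothetical entry point: decode a BOM-free UTF-16 payload.
decodeUtf16 :: Endian -> [Word8] -> Maybe String
decodeUtf16 e bytes = codeUnits e bytes >>= decodeUnits
```

Returning `Maybe` keeps malformed input (an odd byte count or a lone surrogate) distinct from valid text; a real implementation would presumably report a proper diagnostic instead.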
The heuristic has the following properties:
- A byte-order mark is optional for all three encodings.
- If present, byte-order marks are consumed before lexical analysis.
- Source code known to begin with the NUL character (U+0000) is disallowed.
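The byte-order mark handling described in these properties can be illustrated in isolation. The block below is a sketch only: the `Bom` type and the `splitBom` function are hypothetical names that do not appear in the proposal. It separates a leading byte-order mark from the rest of the input, so that the mark never reaches the lexer.

```haskell
import Data.Word (Word8)

-- Hypothetical type naming the five byte-order marks involved.
data Bom = Utf8Bom | Utf16BEBom | Utf16LEBom | Utf32BEBom | Utf32LEBom
  deriving (Eq, Show)

-- Split a leading byte-order mark, if any, from the rest of the input.
-- UTF-32 marks are tested before UTF-16 marks because the UTF-32 LE mark
-- (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
splitBom :: [Word8] -> (Maybe Bom, [Word8])
splitBom bytes = case bytes of
  (0x00:0x00:0xFE:0xFF:rest) -> (Just Utf32BEBom, rest)
  (0xFF:0xFE:0x00:0x00:rest) -> (Just Utf32LEBom, rest)
  (0xEF:0xBB:0xBF:rest)      -> (Just Utf8Bom, rest)
  (0xFE:0xFF:rest)           -> (Just Utf16BEBom, rest)
  (0xFF:0xFE:rest)           -> (Just Utf16LEBom, rest)
  _                          -> (Nothing, bytes)
```

Reading FF FE 00 00 as a UTF-32 mark rather than as a UTF-16 mark followed by U+0000 loses nothing here, since source code beginning with the NUL character is disallowed anyway.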
Furthermore, as long as the first logical character in the program is below code point 0xFF (the "ASCII/Latin1" range), this heuristic can always gracefully handle two common classes of text editor flaws:
- Emitting a byte-order mark for UTF-8 text.
- Omitting the byte-order mark for UTF-16 or UTF-32 text.
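To see why a missing byte-order mark is recoverable in that case, note that a non-NUL character in the ASCII/Latin1 range occupies a single non-zero byte, so the position of the zero bytes among the first four bytes reveals the encoding width and byte order. The following sketch isolates that idea; `Guess`, `guessFromNulls` and the small encoders are hypothetical names introduced for illustration, and the encoders assume every character is below U+0100.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical result type for this sketch.
data Guess = GuessUtf8 | GuessUtf16BE | GuessUtf16LE | GuessUtf32BE | GuessUtf32LE
  deriving (Eq, Show)

-- Guess the encoding of BOM-less input from the zero bytes in its first
-- four bytes. Nothing means "leading NUL character" or "fewer than four
-- bytes", neither of which this sketch handles.
guessFromNulls :: [Word8] -> Maybe Guess
guessFromNulls bytes = case take 4 bytes of
  [0x00, 0x00, 0x00, 0x00] -> Nothing
  [0x00, 0x00, 0x00, _]    -> Just GuessUtf32BE
  [_, 0x00, 0x00, 0x00]    -> Just GuessUtf32LE
  [0x00, 0x00, _, _]       -> Nothing
  [0x00, _, _, _]          -> Just GuessUtf16BE
  [_, 0x00, _, _]          -> Just GuessUtf16LE
  [_, _, _, _]             -> Just GuessUtf8
  _                        -> Nothing

-- Demo encoders; valid only for characters below U+0100 (an assumption).
low :: Char -> Word8
low = fromIntegral . ord

utf16le, utf16be, utf32le, utf32be :: String -> [Word8]
utf16le = concatMap (\c -> [low c, 0])
utf16be = concatMap (\c -> [0, low c])
utf32le = concatMap (\c -> [low c, 0, 0, 0])
utf32be = concatMap (\c -> [0, 0, 0, low c])
```

For example, `guessFromNulls (utf16le "module")` sees the bytes 6D 00 6F 00 and yields `Just GuessUtf16LE`, while the plain ASCII bytes of `"module"` contain no zero bytes and yield `Just GuessUtf8`.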
Pros
- Ensures uniform treatment of Unicode in source code.
- Disallows implicit ISO-8859-* encodings in source code, ensuring portability.
Cons
- Mandating minimum support for UTF-8 and UTF-16 places an implementation burden on compiler writers.
- Existing code relying on a non-UTF-8, locale- or implementation-specific encoding will need conversion.