
UTF (Unicode Transformation Format)

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Coded Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values.

Every UTF encoding maps each code point (except surrogates) to a unique sequence of bytes. The number in an encoding's name indicates the number of bits per code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
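As a rough illustration of code values and the surrogate exception, the following Python sketch counts how many code values each encoding needs for a single code point. The helper names `code_value_counts` and `is_encodable` and the sample characters are assumptions made for this example; only Python's built-in codecs are used.

```python
def code_value_counts(ch):
    """Count the code values needed to encode one code point."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code values
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code values
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code values
    }


def is_encodable(code_point):
    """Return False for surrogates, which the strict UTF-8 codec rejects."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Sample code points: U+0041, U+00E9, U+20AC, U+1F600
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_value_counts(ch))

print(is_encodable(0xD800))  # a surrogate code point
```

The big-endian codec names are used so that no byte order mark is added to the count.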

UTF encodings include:

  • UTF-1, a retired predecessor of UTF-8 that maximizes compatibility with ISO 2022 (no longer part of The Unicode Standard);
  • UTF-7, a 7-bit encoding sometimes used in e-mail and often considered obsolete (not part of The Unicode Standard; documented only in an informational RFC, so it is not on the Internet Standards Track either);
  • UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;
  • UTF-EBCDIC, an 8-bit variable-width encoding similar to UTF-8, but designed for compatibility with EBCDIC (not part of The Unicode Standard);
  • UTF-16, a 16-bit, variable-width encoding;
  • UTF-32, a 32-bit, fixed-width encoding.
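
To make the differences between these encodings concrete, here is a minimal Python sketch that encodes one short string with each encoding available among Python's standard codecs. UTF-1 and UTF-EBCDIC are not among them and are left out. The names `SAMPLE`, `CODECS`, and `encode_all` are assumptions made for this example.

```python
SAMPLE = "A\u20ac"  # U+0041 and U+20AC; an arbitrary sample for this sketch

# Big-endian variants are chosen so that no byte order mark is emitted.
CODECS = ("utf-7", "utf-8", "utf-16-be", "utf-32-be")


def encode_all(text):
    """Map each codec name to the bytes it produces for text."""
    return {name: text.encode(name) for name in CODECS}


for name, data in encode_all(SAMPLE).items():
    print(f"{name:10} {data.hex(' ')}")
```

In the output, the ASCII letter keeps its single byte 0x41 under UTF-8, the euro sign takes three bytes in UTF-8 but one 16-bit code value in UTF-16, and every UTF-7 byte stays below 0x80.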