ietf-nntp NNTP and 16-bit charsets

Mon May 7 09:48:08 PDT 2001

Charles Lindsey <chl at clw.cs.man.ac.uk> writes:

> I see that Stan has confirmed that UTF-16 and UTF-32 are the correct
> ones to refer to. But I would still like an explanation of what the
> subtle difference between those terms is.

> I see that RFC 2279 refers to UCS-2 and UCS-4. I presume also that
> UTF-16 and UTF-32 are ways of representing (or encoding, though the
> encoding is trivial here) the 10646 charsets, such as UTF-8 is an
> encoding of those charsets. So what are UCS-2 and UCS-4?

Well, I'm not the best person to ask since I haven't been doing the
Unicode stuff for very long, but as I understand it, it goes something
like this:

  Unicode is a set of characters, organized into planes and assigned
  ordinary mathematical numbers.

  UCS-2 is the ISO version of Unicode encoded in two octets.  UCS-4 is
  the ISO version of Unicode encoded in four octets.  These are mostly
  of interest only if you care about ISO 10646 rather than just following
  the work of the Unicode Consortium.

  UTF-* is the representation on the wire of a Unicode character, with
  the exact encoding nailed down, including the byte order where that
  matters (UTF-8 is a plain octet stream and has no byte order to
  specify).  UTF is a Unicode (not ISO) thing.

Basically, one pretty much never cares about UCS-2 and UCS-4.  The
important parts are the Unicode abstract code points (pure numbers) and
the UTF-* transformations (how those numbers are represented on the wire).
As near as I can tell, the main reason UCS-* exists at all is that it's
what's in the ISO standards, so one talks about it in relation to
compatibility between Unicode and ISO 10646.
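
To make the code point versus UTF-* distinction concrete, here is a
minimal Python sketch.  It is only an illustration, not anything taken
from the standards: the helper name `encodings` is made up, and it
leans on Python's built-in codecs, using the big-endian variants so
that no byte order mark is emitted.

```python
def encodings(ch):
    """Return the code point of one character and its UTF-* octets as hex.

    Hypothetical helper for illustration only.  The "-be" codecs fix the
    byte order to big-endian and do not prepend a byte order mark.
    """
    return {
        "code_point": ord(ch),                      # the abstract number
        "utf-8": ch.encode("utf-8").hex(),          # 1 to 4 octets
        "utf-16-be": ch.encode("utf-16-be").hex(),  # 2 or 4 octets
        "utf-32-be": ch.encode("utf-32-be").hex(),  # always 4 octets
    }


if __name__ == "__main__":
    # U+0041 (LATIN CAPITAL LETTER A) and U+1D11E (MUSICAL SYMBOL G CLEF),
    # the latter being outside the first plane.
    for ch in ("A", "\U0001D11E"):
        print(f"U+{ord(ch):04X}", encodings(ch))
```

For "A" the code point is 65 and the octets are 41, 0041 and 00000041
respectively.  For U+1D11E, which lies outside the first plane, UTF-8
gives f09d849e, UTF-16 needs the surrogate pair d834dd1e, and UTF-32 is
just the number itself, 0001d11e.  The number stays the same throughout;
only the octets on the wire differ.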

See <http://www.unicode.org/glossary/>.

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>