Message from Ned Freed: ietf-nntp NNTP and 16-bit charsets

Sun May 6 00:49:07 PDT 2001

> > > >        NOTE: Texts using charsets which represent characters as
> > > >        sequences of 16 or 32 bits (e.g. UCS-2 and UCS-4) cannot be
> > > >        reliably conveyed in the above format.

> > > False.  16-bit or 32-bit character sets that have an encoding that
> > > avoids NUL, CR, or LF work fine.  Possibly a pedantic point, but it
> > > wouldn't surprise me if there are legacy Asian character sets with that
> > > property.

> > They may well get it right in the low order byte, but I would be
> > surprised if they all got it right in the high order byte.

> Surely you would agree that designing a 16-bit or 32-bit character set
> with this property is not particularly hard?  There's a very obvious range
> of character numbers that you simply don't assign.

> I don't know whether anyone has designed such a character set, but it's
> clearly possible.  I believe that makes the note above factually
> incorrect.

Not only have such charsets been designed, they are the rule rather than the
exception for multibyte charsets. The exception is UTF-16, and it really is
just that: an exception.

I already explained in a previous message that the various legacy Asian
charsets work this way. I've written code to handle most of the Asian charsets,
so this happens to be something I know a lot about.
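
As a rough illustration of that point, here is a minimal Python sketch that
encodes a few sample strings with legacy multibyte codecs from the Python
standard library and looks for NUL, CR, or LF octets in the result. The helper
name forbidden_octets and the sample strings are assumptions made for this
example; it is a spot check on a handful of characters, not a proof about
every code point in each charset.

```python
# Octets that the format cannot carry inside a line of text:
# NUL, LF and CR.
FORBIDDEN = frozenset({0x00, 0x0A, 0x0D})


def forbidden_octets(data: bytes) -> set:
    """Return the forbidden octet values that occur in data."""
    return set(data) & FORBIDDEN


# Hypothetical sample strings, one per legacy multibyte codec.
SAMPLES = {
    "euc_jp": "日本語",
    "shift_jis": "日本語",
    "iso2022_jp": "日本語",
    "big5": "中文",
    "gb2312": "中文",
    "euc_kr": "한국어",
}

for codec, text in SAMPLES.items():
    encoded = text.encode(codec)
    # Print the codec, the encoded octets in hex, and any forbidden octets.
    print(codec, encoded.hex(), sorted(forbidden_octets(encoded)))

# By contrast, UTF-16 puts a NUL octet into every ASCII character.
print("utf-16-be", sorted(forbidden_octets("A".encode("utf-16-be"))))
```

For these samples the legacy codecs should report no forbidden octets, while
the UTF-16 line reports the NUL octet.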

> > Anyway, I have made my copy read "cannot, in general, be reliably
> > conveyed". But I have to keep reminding myself that I am not the editor
> > of this list, so the text is just a suggestion for Stan to act upon, of
> > course.

> Why not just say exactly what we mean?

That's what we should do here.

>     NOTE: Texts using encodings (such as UTF-16 or UTF-32) that may
>     contain the NUL octet or the CR or LF octets in contexts other than
>     the CRLF line ending cannot be reliably conveyed in the above format.

> I believe that UTF-16 and UTF-32 are the correct things to reference, not
> UCS-2 or UCS-4, but someone who has a firmer grasp on the difference
> between a charset and an encoding may want to check me on that.

You are quite correct: The charsets to reference are UTF-16 and UTF-32.
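
To make the problem with these two encodings concrete, the short Python
sketch below encodes three characters and shows the octets that result. The
variable names are made up for the example, and big-endian byte order without
a byte order mark is assumed; the little-endian forms have the same octets in
the opposite order.

```python
# U+010A in UTF-16BE is the octets 01 0A: the second octet is LF.
lf_case = "\u010a".encode("utf-16-be")

# U+0D0A in UTF-16BE is the octets 0D 0A, which cannot be told
# apart from a CRLF line ending.
crlf_case = "\u0d0a".encode("utf-16-be")

# Any ASCII character in UTF-32BE carries three NUL octets.
nul_case = "A".encode("utf-32-be")

print(lf_case, crlf_case, nul_case)
```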

                                Ned