ietf-nntp Message from Ned Freed: NNTP and 16-bit charsets

Thu Apr 26 17:12:16 PDT 2001

> Some discussion on the USEFOR list has drawn attention to a possible
> problem with charsets using 16bit characters. Consider the following
> situation.

> An article with its headers written in UTF-8, and with a Content-Type that
> specifies charset=some-16-bit-set. Yes, life would be simpler if it had
> used charset=utf-8, but 16-bit charsets are quite legitimate as MIME
> objects. Again, it might be better to have used Content-Transfer-Encoding:
> base64 (and maybe that will have to be mandated somewhere), but suppose it
> is actually being sent with encoding 8bit, or even binary.

Actually, whether or not it is legal depends on the content-type, the charset,
and the encoding. There are restrictions that forbid some combinations.

In particular, charsets that use NUL characters or which represent CR or LF as
anything other than 8bit CR or LF characters cannot be represented using either
a 7bit or 8bit encoding. Your only encoding choices for such a charset are
quoted-printable, base64, or binary.
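
As a rough illustration of the restriction above, the following Python sketch checks one sample string rather than a whole charset. The helper name forbids_7bit_or_8bit is hypothetical, and the test is an assumption about how one might approximate the rule: it flags any encoded form that contains a NUL octet, or that contains CR or LF octets not corresponding to real CR or LF characters in the text.

```python
def forbids_7bit_or_8bit(text: str, charset: str) -> bool:
    """Return True if this sample, encoded in charset, breaks the 7bit/8bit rules.

    Hypothetical helper: it inspects a single sample, not the charset definition.
    """
    data = text.encode(charset)
    # Rule 1: NUL octets may not appear under a 7bit or 8bit encoding.
    if b"\x00" in data:
        return True
    # Rule 2: every CR or LF octet must be a real CR or LF character.
    real_breaks = text.count("\r") + text.count("\n")
    break_octets = data.count(b"\r") + data.count(b"\n")
    return break_octets != real_breaks


# UTF-8 keeps CR and LF as single octets and never emits NUL for this text.
print(forbids_7bit_or_8bit("hello\r\n", "utf-8"))
# UTF-16 pads ASCII characters with NUL octets.
print(forbids_7bit_or_8bit("hello\r\n", "utf-16-be"))
```

A sample that trips this check would have to go out as quoted-printable or base64, or as binary where the transport permits it.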

The text top-level media type has the same CR-LF restriction, but doesn't
have the NUL restriction.

> The first question to address is what our present draft actually says
> about this situation, and it turns out to be unclear.

> First, any commands from client to server are in UTF-8. OK so far. We have
> just issued a valid ARTICLE command for this article. What next?

> We get a one-line 220 response, which we interpret as UTF-8 characters
> terminated by CRLF.

> Then we get the headers of the article, which we interpret as UTF-8
> characters terminated by CRLF.

> And then maybe we get a stream of bytes representing the 16-bit characters
> of the body which continues until we get five consecutive octets that look
> like "CRLF.CRLF".

> Is that supposed to be allowed or not? The only text in the draft that
> appears to address the issue is in section 4, and it does not really say
> one way or the other. It seems to imply that "lines" will be recognisable
> in the octet stream, and that CRLFs will somehow magically get inserted
> there (whereas the octet pair CRLF might well represent some printable
> character in the 16-bit charset). And how is the "point-stuffing" rule to
> be interpreted?

Well, while it doesn't come out and say it, it is clear that the response is
line-oriented and that lines must be terminated by CRLFs. This implicitly
disallows the use of the binary encoding.
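
To make the hazard concrete, here is a small Python sketch showing how a body in a 16-bit charset can contain the five-octet terminator without any line break in the text. The function name find_false_terminator and the choice of UTF-16BE and of these three code points are assumptions made purely for illustration.

```python
TERMINATOR = b"\r\n.\r\n"  # the five octets that end a multi-line response


def find_false_terminator(text: str, charset: str = "utf-16-be") -> int:
    """Return the octet offset of an accidental terminator, or -1 if none.

    Hypothetical helper for illustration only.
    """
    return text.encode(charset).find(TERMINATOR)


# Three characters, none of them CR, LF, or ".".
body = "\u4e0d\u0a2e\u0d0a"
print(body.encode("utf-16-be").hex())  # 4e0d0a2e0d0a
print(find_false_terminator(body))     # 1
```

The match starts in the middle of the first character, which is why a line-oriented reader cannot tell it from the real end of the response.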

> So I think the wording needs to be clarified. But first, we have to decide
> what we INTEND. I see the following possibilities:

> 1. We say that such an arbitrary octet stream is to be subjected to a
> transformation which does special things when any pairs of adjacent (not
> necessarily part of the same 16-bit char) octets happen to look like CRLF,
> and maybe inserts an extra "." byte after such a pair, and maybe does
> other things with single bytes that happen to look like CR, LF or NULL.
> And then there is an inverse transformation at the client end which
> restores it all to the original octet stream. This all sounds horribly
> messy, but it might in fact actually work with little bother, even in
> existing implementations.

This approach has been tried in the messaging world. It has been found to be
surprisingly hard to get right, and it is why we implemented an extension to
SMTP that uses counted chunks for transferring binary material rather than
overloading dot-stuffing. (Dot stuffing is also claimed by some to be an
efficiency concern. I don't buy it personally, but the claim has been made.)
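
The contrast between the two approaches can be sketched in Python as follows. Dot-stuffing is shown for line-oriented content only. The counted framing (a decimal length, CRLF, then the raw octets) is a made-up format for illustration and is not the wire format of the SMTP extension mentioned above. All function names are hypothetical.

```python
from typing import List


def dot_stuff(lines: List[bytes]) -> bytes:
    """Join lines with CRLF, doubling any leading dot, then add the terminator."""
    out = bytearray()
    for line in lines:
        if line.startswith(b"."):
            out += b"."
        out += line + b"\r\n"
    out += b".\r\n"
    return bytes(out)


def dot_unstuff(wire: bytes) -> List[bytes]:
    """Inverse of dot_stuff: stop at the lone dot, strip one leading dot."""
    lines = []
    for line in wire.split(b"\r\n"):
        if line == b".":
            break
        lines.append(line[1:] if line.startswith(b".") else line)
    return lines


def frame_chunk(payload: bytes) -> bytes:
    """Hypothetical counted framing: length header, CRLF, raw octets."""
    return str(len(payload)).encode("ascii") + b"\r\n" + payload


def read_chunk(wire: bytes) -> bytes:
    """Read exactly as many octets as the header announces; no scanning."""
    header, _, rest = wire.partition(b"\r\n")
    return rest[: int(header)]


lines = [b"hello", b".hidden", b".."]
print(dot_unstuff(dot_stuff(lines)) == lines)

binary = b"\x00\r\n.\r\n\r"
print(read_chunk(frame_chunk(binary)) == binary)
```

The counted form never inspects the payload, so octets that happen to look like CR, LF, NUL, or the terminator need no special handling.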

> 2. We say that only 8-bit charsets are supported, and that any bytes that
> look like ASCII CR, LF (& NULL?) must indeed be interpreted as such. I
> think UTF-8 would pass that test. Effectively, we would be insisting on
> the use of an appropriate Content-Transfer-Encoding.

The right thing to say in this case is that you don't allow the binary encoding
or any new encoding with a comparable output range. Problem solved given the
existing restrictions on other encodings.

                                Ned