ietf-nntp NNTP and 16-bit charsets

Thu Apr 26 05:23:19 PDT 2001

Some discussion on the USEFOR list has drawn attention to a possible
problem with charsets using 16-bit characters. Consider the following
situation.

An article has its headers written in UTF-8, and a Content-Type that
specifies charset=some-16-bit-set. Yes, life would be simpler if it had
used charset=utf-8, but 16-bit charsets are quite legitimate as MIME
objects. Again, it might be better to have used Content-Transfer-Encoding:
base64 (and maybe that will have to be mandated somewhere), but suppose it
is actually being sent with encoding 8bit, or even binary.

The first question to address is what our present draft actually says
about this situation, and it turns out to be unclear.

First, any commands from client to server are in UTF-8. OK so far. We have
just issued a valid ARTICLE command for this article. What next?

We get a one-line 220 response, which we interpret as UTF-8 characters
terminated by CRLF.

Then we get the headers of the article, which we interpret as UTF-8
characters terminated by CRLF.

And then maybe we get a stream of bytes representing the 16-bit characters
of the body, which continues until we get five consecutive octets that look
like "CRLF.CRLF".
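
To make the hazard concrete, here is a small Python sketch. It assumes
UTF-16BE as the 16-bit charset (chosen purely for illustration; the
scenario above does not name one), and the sample characters are picked
only for their octet values.

```python
# Assumption: UTF-16BE stands in for the unnamed 16-bit charset.
CRLF = b"\r\n"
TERMINATOR = b"\r\n.\r\n"  # the five octets that end a multi-line response

# A single character, U+0D0A, encodes as the octet pair 0D 0A.
single = "\u0d0a".encode("utf-16-be")

# A CRLF can also straddle two characters:
# U+410D then U+0A42 gives the octets 41 0D 0A 42.
straddle = "\u410d\u0a42".encode("utf-16-be")

# Three characters whose octets are 0D 0A 2E 0D 0A 41, i.e. something
# that looks like the end-of-article marker followed by "A".
false_end = "\u0d0a\u2e0d\u0a41".encode("utf-16-be")

print(single == CRLF)                     # True
print(CRLF in straddle)                   # True
print(false_end.startswith(TERMINATOR))   # True
```

So a receiver that simply scans the octet stream for the terminator could
stop in the middle of such a body.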

Is that supposed to be allowed or not? The only text in the draft that
appears to address the issue is in section 4, and it does not really say
one way or the other. It seems to imply that "lines" will be recognisable
in the octet stream, and that CRLFs will somehow magically get inserted
there (whereas the octet pair CRLF might well represent some printable
character in the 16-bit charset). And how is the "point-stuffing" rule to
be interpreted?

So I think the wording needs to be clarified. But first, we have to decide
what we INTEND. I see the following possibilities:

1. We say that such an arbitrary octet stream is to be subjected to a
transformation which does special things when any pair of adjacent octets
(not necessarily part of the same 16-bit char) happens to look like CRLF,
and maybe inserts an extra "." byte after such a pair, and maybe does
other things with single bytes that happen to look like CR, LF or NUL.
And then there is an inverse transformation at the client end which
restores it all to the original octet stream. This all sounds horribly
messy, but it might in fact actually work with little bother, even in
existing implementations.
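
A minimal sketch of what possibility 1 might look like, in Python. The
function names stuff() and unstuff() are hypothetical, and the choice to
always add a CRLF before the final "." line is an assumption made here so
that the inverse is exact; the draft does not say this. Handling of lone
CR, LF or NUL octets is left out.

```python
# Hypothetical helpers; names and details are illustrative only.
TERMINATOR = b"\r\n.\r\n"

def stuff(body: bytes) -> bytes:
    """Sender side: treat the body as opaque octets and frame it."""
    # Double any "." that follows a CRLF octet pair, wherever the pair
    # falls relative to 16-bit character boundaries.
    out = body.replace(b"\r\n.", b"\r\n..")
    # A "." as the very first octet also sits at the start of a line.
    if out.startswith(b"."):
        out = b"." + out
    # Assumption: always append CRLF plus the "." line, even when the
    # body does not itself end in CRLF.
    return out + TERMINATOR

def unstuff(wire: bytes) -> bytes:
    """Client side: find the end marker and undo stuff()."""
    end = wire.find(TERMINATOR)
    if end < 0:
        raise ValueError("no terminator found")
    data = wire[:end]
    # Remove the extra "." at the start, then after each CRLF pair.
    if data.startswith(b"."):
        data = data[1:]
    return data.replace(b"\r\n..", b"\r\n.")

# Octets 0D 0A 2E 0D 0A 41: a body that contains a false end marker.
body = b"\r\n.\r\nA"
wire = stuff(body)
print(wire)                    # b'\r\n..\r\nA\r\n.\r\n'
print(unstuff(wire) == body)   # True
```

After stuffing, every "CRLF." in the body is followed by a second ".", so
the first "CRLF.CRLF" on the wire is the real terminator. The price is
that the client can no longer treat the result as text lines; it gets
back an octet stream and must decode the charset afterwards.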

2. We say that only 8-bit charsets are supported, and that any bytes that
look like ASCII CR, LF (and NUL?) must indeed be interpreted as such. I
think UTF-8 would pass that test. Effectively, we would be insisting on
the use of an appropriate Content-Transfer-Encoding.
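
The test in possibility 2 can be illustrated with a crude count-based
check in Python. The helper name is hypothetical, UTF-16BE is again only
an assumed stand-in for a 16-bit charset, and the last part shows base64
as one example of a Content-Transfer-Encoding that sidesteps the problem.

```python
import base64

CONTROL_OCTETS = (0x00, 0x0A, 0x0D)  # NUL, LF, CR

def control_octets_are_genuine(text: str, charset: str) -> bool:
    """Crude check: the encoded form contains exactly as many NUL, LF
    and CR octets as the text contains NUL, LF and CR characters."""
    encoded = text.encode(charset)
    return all(
        encoded.count(bytes([c])) == text.count(chr(c))
        for c in CONTROL_OCTETS
    )

# Two real line endings around three characters picked for their octets.
sample = "line one\r\n\u0d0a\u2e0d\u0a41\r\n"

print(control_octets_are_genuine(sample, "utf-8"))      # True
print(control_octets_are_genuine(sample, "utf-16-be"))  # False

# With base64 the 16-bit body travels as short lines of ASCII
# letters, digits, "+", "/" and "=", so no line can start with ".".
wrapped = base64.encodebytes(sample.encode("utf-16-be")).replace(b"\n", b"\r\n")
print(b"." in wrapped)                                   # False
```

UTF-8 passes because every octet of a multi-octet sequence has its high
bit set, so octets below 0x80 only ever stand for the ASCII character of
the same value.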

Are there any other solutions, and is there any existing practice to guide
us?

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl at clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5