[NNTP] Re: New NNTP drafts approaching IETF Last Call

Clive D.W. Feather clive at demon.net
Tue Mar 15 09:18:40 PST 2005


Mark Crispin said:
> The text in section 3.2:
[...]

Actually it's 3.1. Anyway, let's start with the basics. Further up it says:

    The character set for all NNTP commands is UTF-8 [RFC3629].

Moving to the command syntax in section 9.1, the generic form is:

    keyword *(WS token)

keyword is ASCII; token is UTF-8. The syntaxes of the individual commands
meet this requirement. For responses, any arguments after the code are
required to be ASCII, while the trailing comment is UTF-8.

So the issue only applies to the subsequent lines of multi-line commands
and responses.
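
A minimal sketch (mine, not from the draft) of what that division means in
practice, using a check function of my own naming: the keyword must be ASCII,
any arguments must be valid UTF-8, and NUL, CR and LF may not appear at all:

    def command_line_ok(line: bytes) -> bool:
        """Check one command line (without its terminating CRLF)."""
        if any(b in (0x00, 0x0A, 0x0D) for b in line):       # no NUL, LF or CR
            return False
        keyword, *args = line.split(b' ')
        if not keyword or not all(0x21 <= b <= 0x7E for b in keyword):
            return False                                     # keyword is ASCII
        for arg in args:
            try:
                arg.decode('utf-8')                          # arguments are UTF-8
            except UnicodeDecodeError:
                return False
        return True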

>    Note that texts using an encoding (such as UTF-16 or UTF-32) that may
>    contain the octets NUL, LF, or CR other than a CRLF pair cannot be
>    reliably conveyed in the above format (that is, they violate the MUST
>    requirement above).  However, except when stated otherwise, this
>    specification does not require the content to be UTF-8 and therefore
>    it MAY include octets above and below 128 mixed arbitrarily.
> 
> seems silly to me.  Nobody sends UTF-16, UTF-32, UCS-2, or UCS-4 data in 
> Internet protocol commands.  Viewed one way, it's a tautology; viewed 
> another, it confuses contexts.

Firstly, this text appears in the context of multi-line responses - it's
immediately after the bit about dot-stuffing. The actual requirement is:

   1.  The response consists of a sequence of one or more "lines", each 
       being a stream of octets ending with a CRLF pair.  Apart from
       those line endings, the stream MUST NOT include the octets NUL,
       LF, or CR.

and this is echoed in the syntax. So it's not about protocol commands, and I
don't see where the context is confused.
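
A small illustration (not from the draft) of why such encodings fall foul of
that rule: even pure ASCII text, once encoded as UTF-16, contains a NUL octet
for every character.

    FORBIDDEN = {0x00, 0x0A, 0x0D}                  # NUL, LF, CR

    def conveyable(octets: bytes) -> bool:
        """True if the octets avoid NUL, LF and CR (line endings stripped)."""
        return not any(b in FORBIDDEN for b in octets)

    print(conveyable("Hello".encode('utf-8')))      # True
    print(conveyable("Hello".encode('utf-16-be')))  # False: 00 48 00 65 ...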

> Furthermore, the second sentence, while obviously intended to maintain 
> compatibility with the past, is short-sighted and will lead to 
> compatibility problems forever.

It's consistent with the real world, however.

> Suggest the following rewrite:
> 
>    Note: implementations prior to this specification used octets other
>    than CR, NUL, and LF arbitrarily; the character set of any octets
>    greater than 128 is indeterminate with old servers.  Server
>    implementations which comply with this specification (and thus
>    advertise VERSION 2 in CAPABILITIES) MUST send UTF-8 strings in
>    responses exclusively; and client implementations MUST treat any
>    response string from a server which advertises VERSION 2 as being
>    in UTF-8.

Sorry, but this is just unacceptable to me.

Firstly, the NNTP specification is written to pass around very generic
"articles" - it could front-end many different content databases, of which
Usenet is only one. If you read 3.6 and Appendix A you'll see how we've
been careful to keep it generic like that. In particular:

   The content of a header SHOULD be in UTF-8.  However, if a server
   receives an article from elsewhere that uses octets in the range 128
   to 255 in some other manner, it MAY pass it to a client without
   modification.  Therefore clients MUST be prepared to receive such
   headers and also data derived from them (e.g.  in the responses from
   the OVER (Section 8.3) command) and MUST NOT assume that they are
   always UTF-8.  How the client will then process those headers,
   including identifying the encoding used, is outside the scope of this
   document.

and, at the start of Appendix A:

   NNTP is most often used for transferring articles that conform to RFC
   1036 [RFC1036] (such articles are called "Netnews articles" here).
   It is also sometimes used for transferring email messages that
   conform to RFC 2822 [RFC2822] (such articles are called "email
   articles" here).  In this situation, articles must conform both to
   this specification and to that other one; this appendix describes
   some relevant issues.

Note particularly "must conform both to [...] that other one". Any
requirements on character set should be related to the article semantics,
not to the transfer protocol syntax.
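
In other words, a client has to cope either way. A sketch of the consequence
of the text quoted from 3.6, with an illustrative fallback of my own choosing
(the draft deliberately leaves the choice of heuristic to the client):

    def decode_header(raw: bytes) -> str:
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return raw.decode('latin-1')    # lossless octet-to-character map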

Secondly, there is no way that your proposal is going to work. Suppose that
I upgrade my server to talk NNTPv2. You are going to require me to convert
*EVERY* article body received to UTF-8? That completely breaks the spirit
of Usenet and probably the wording of Usefor/RFC1036.

I've just run an analysis of around 64,000 articles that arrived on our
news server today. I checked two things:
  * Do they have a Content-Type: header?
  * Do they use characters coded 128 or above and, if so, is the usage
    consistent with UTF-8? (A sketch of this check follows the table.)

Content-Type:   Coding      Occurrences
    Yes          ASCII          1367
    Yes          UTF-8             6   (one of which was my own test message)
    Yes          other          1295
    No           ASCII          8133
    No           UTF-8             0
    No           other         53426
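
(For the record, the "consistent with UTF-8" test amounts to something like
the sketch below; it is not the exact script used to produce those figures.)

    def classify(body: bytes) -> str:
        if all(b < 0x80 for b in body):
            return 'ASCII'
        try:
            body.decode('utf-8')            # do the octets >= 128 form valid UTF-8?
            return 'UTF-8'
        except UnicodeDecodeError:
            return 'other'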

That says there's a long way to go in the adoption of UTF-8, and the right
place to start is in Usefor, not NNTP - for the same reason that it's RFC
2822, not RFC 2821, that deals with email content.

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive at davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            |


