[NNTP] Re: New NNTP drafts approaching IETF Last Call
Clive D.W. Feather
clive at demon.net
Tue Mar 15 09:18:40 PST 2005
Mark Crispin said:
> The text in section 3.2:
[...]
Actually it's 3.1. Anyway, let's start with the basics. Further up it says:
   The character set for all NNTP commands is UTF-8 [RFC3629].
Moving to the command syntax in 9.1, the generic syntax is:
   keyword *(WS token)
keyword is ASCII; token is UTF-8. The syntaxes of the individual commands
meet this requirement. For responses, any arguments after the code are
required to be ASCII, while the trailing comment is UTF-8.
So the issue only applies to the subsequent lines of multi-line commands
and responses.
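As a minimal sketch of the generic syntax quoted above (an ASCII keyword followed by whitespace-separated UTF-8 tokens), the following hypothetical check illustrates the split; the function name and the printable-ASCII range used for the keyword are my assumptions, not spec text:

```python
# Illustrative check of the "keyword *(WS token)" shape:
# the keyword must be ASCII, each token must be valid UTF-8.

def check_command_line(line: bytes) -> bool:
    """Return True if `line` fits the keyword *(WS token) shape."""
    parts = line.split(b' ')
    keyword, tokens = parts[0], parts[1:]
    # keyword restricted to printable ASCII (an assumption for the sketch)
    if not keyword or not all(0x21 <= b <= 0x7E for b in keyword):
        return False
    for tok in tokens:
        if not tok:
            return False
        try:
            tok.decode('utf-8')   # token must be well-formed UTF-8
        except UnicodeDecodeError:
            return False
    return True
```

A command with no arguments (e.g. `LIST`) passes trivially, since there are no tokens to check.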
> Note that texts using an encoding (such as UTF-16 or UTF-32) that may
> contain the octets NUL, LF, or CR other than a CRLF pair cannot be
> reliably conveyed in the above format (that is, they violate the MUST
> requirement above). However, except when stated otherwise, this
> specification does not require the content to be UTF-8 and therefore
> it MAY include octets above and below 128 mixed arbitrarily.
>
> seems silly to me. Nobody sends UTF-16, UTF-32, UCS-2, or UCS-4 data in
> Internet protocol commands. Viewed one way, it's a tautology; viewed
> another, it confuses contexts.
Firstly, this text appears in the context of multi-line responses - it's
immediately after the bit about dot-stuffing. The actual requirement is:
   1. The response consists of a sequence of one or more "lines", each
      being a stream of octets ending with a CRLF pair. Apart from
      those line endings, the stream MUST NOT include the octets NUL,
      LF, or CR.
and this is echoed in the syntax. So it's not protocol commands and I don't
see where the context is confused.
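The combination of that requirement with the dot-stuffing rule it follows can be sketched as below; the serialisation function is illustrative, not taken from the specification:

```python
# Sketch of sending a multi-line data block: each line is checked for
# forbidden octets, dot-stuffed, and CRLF-terminated; the block ends
# with a lone "." line.

def encode_multiline(lines: list[bytes]) -> bytes:
    out = bytearray()
    for line in lines:
        if any(b in (0x00, 0x0A, 0x0D) for b in line):
            raise ValueError("line contains NUL, LF, or CR")
        if line.startswith(b'.'):
            out += b'.'              # dot-stuffing: double a leading dot
        out += line + b'\r\n'
    out += b'.\r\n'                  # terminating line
    return bytes(out)
```

Note that apart from the NUL/LF/CR exclusion, the line content is an arbitrary octet stream, which is exactly why the paragraph under discussion does not require it to be UTF-8.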
> Furthermore, the second sentence, while obviously intended to maintain
> compatibility with the past, is short-sighted and will lead to
> compatibility problems forever.
It's consistent with the real world, however.
> Suggest the following rewrite:
>
> Note: implementations prior to this specification used octets other
> than CR, NUL, and LF arbitrarily; the character set of any octets
> greater than 128 is indeterminate with old servers. Server
> implementations which comply with this specification (and thus
> advertise VERSION 2 in CAPABILITIES) MUST send UTF-8 strings in
> responses exclusively; and client implementations MUST treat any
> response string from a server which advertises VERSION 2 as being
> in UTF-8.
Sorry, but this is just unacceptable to me.
Firstly, the NNTP specification is written to pass around very generic
"articles" - it could front-end many different content databases, of which
Usenet is only one. If you read 3.6 and Appendix A you'll see how we've
been careful to keep it generic like that. In particular:
   The content of a header SHOULD be in UTF-8. However, if a server
   receives an article from elsewhere that uses octets in the range 128
   to 255 in some other manner, it MAY pass it to a client without
   modification. Therefore clients MUST be prepared to receive such
   headers and also data derived from them (e.g. in the responses from
   the OVER (Section 8.3) command) and MUST NOT assume that they are
   always UTF-8. How the client will then process those headers,
   including identifying the encoding used, is outside the scope of this
   document.
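A hypothetical client-side reading of that rule might look like the sketch below: try UTF-8 first, but never assume it. The fall-back to Latin-1 is purely an illustrative choice on my part (it maps octets to characters losslessly); the spec deliberately leaves the fall-back behaviour to the client:

```python
# Client sketch: attempt UTF-8, fall back to a legacy interpretation.
# Latin-1 is an illustrative fallback, not mandated by the spec.

def decode_header(raw: bytes) -> str:
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')  # lossless octet-to-char fallback
```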
and, at the start of Appendix A:
   NNTP is most often used for transferring articles that conform to RFC
   1036 [RFC1036] (such articles are called "Netnews articles" here).
   It is also sometimes used for transferring email messages that
   conform to RFC 2822 [RFC2822] (such articles are called "email
   articles" here). In this situation, articles must conform both to
   this specification and to that other one; this appendix describes
   some relevant issues.
Note particularly "must conform both to [...] that other one". Any
requirements on character set should be related to the article semantics,
not to the transfer protocol syntax.
Secondly, there is no way that your proposal is going to work. Suppose that
I upgrade my server to talk NNTPv2. You are going to require me to convert
*EVERY* article body received to UTF-8? That completely breaks the spirit
of Usenet and probably the wording of Usefor/RFC1036.
I've just run an analysis of around 64,000 articles that arrived on our
news server today. I checked two things:
* Do they have a Content-Type: header?
* Do they use characters coded 128 or above and, if so, is the usage
consistent with UTF-8?
    Header   Coding   Occurrences
    Yes      ASCII          1367
    Yes      UTF-8             6   (one of which was my own test message)
    Yes      other          1295
    No       ASCII          8133
    No       UTF-8             0
    No       other         53426
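The coding check described above can be sketched as follows; the classification function is my reconstruction of the test, not the actual survey tooling, which isn't shown in the post:

```python
# Rough sketch of the analysis: classify an article body as ASCII,
# valid UTF-8, or "other" (uses octets >= 128 inconsistently with UTF-8).

def classify_coding(body: bytes) -> str:
    if all(b < 128 for b in body):
        return 'ASCII'
    try:
        body.decode('utf-8')      # well-formed UTF-8?
        return 'UTF-8'
    except UnicodeDecodeError:
        return 'other'
```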
That says there's a long way to go in the adoption of UTF-8, and the right
place to start is in Usefor, not NNTP. For the same reason that it's RFC
2822, not 2821, that deals with email content.
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8495 6138
Internet Expert | Home: <clive at davros.org> | Fax: +44 870 051 9937
Demon Internet | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc | |