[NNTP] LISTGROUP

Mon Mar 28 06:02:47 PST 2005

Charles Lindsey said:
>> The formal syntax uses special non-terminals S-CHAR, S-NONTAB, and S-TEXT
>> that have two separate definitions: one that MUST be accepted, and one that
>> SHOULD be generated.
>> * Header content in articles (when unfolded) is S-CHAR.
>> * Header content in HDR/OVER responses is S-NONTAB.
>> * Newsgroup description is S-TEXT.
>> Apart from article bodies and the HELP output, the entire remaining syntax
>> is UTF-8 based.
>
>> For the record:
>
>>                   MUST accept                 SHOULD generate
>>     S-CHAR        %x21-FF                     any UTF-8 from U+0021 upwards
>>     S-NONTAB      any except TAB              any UTF-8 except TAB
>>     S-TEXT        any but not beginning       any UTF-8, but beginning
>>                   with TAB or SP              with U+0021 or above
>
>> (in all the above, "any" excludes NUL, CR, and LF).
>
> Now I am even more confused, because one has to define carefully when
> "accept" applies and when "generate" applies.

The above was my paraphrasing. The actual text reads:

3.6:
   The content of a header SHOULD be in UTF-8.  However, if a server
   receives an article from elsewhere that uses octets in the range 128
   to 255 in some other manner, it MAY pass it to a client without
   modification.  Therefore clients MUST be prepared to receive such
   headers and also data derived from them (e.g.  in the responses from
   the OVER (Section 8.3) command) and MUST NOT assume that they are
   always UTF-8.  How the client will then process those headers,
   including identifying the encoding used, is outside the scope of this
   document.

We don't say anything specific in the description of LIST NEWSGROUPS.

Finally, the formal syntax says:

   The following non-terminals require special consideration.  They
   represent situations where material SHOULD be restricted to UTF-8,
   but implementations MUST be able to cope with other character
   encodings.  Therefore there are two sets of definitions for them.

   Implementations MUST accept any content that meets this syntax:

     S-CHAR   = %x21-FF
     S-NONTAB = CTRL / SP / S-CHAR
     S-TEXT   = (CTRL / S-CHAR) *B-CHAR

   Implementations SHOULD only generate content that meets this syntax:

     S-CHAR   = P-CHAR
     S-NONTAB = U-NONTAB
     S-TEXT   = U-TEXT

Clearly this needs to be made clearer.

> So an article is POSTed containing the header "Subject: !@#$". That is a
> "MUST accept", so the server accepts it. What does the server then do with
> it? That is not really our business, but having accepted it we should not
> be surprised if it stores it and/or attempts to relay it to other sites.

The intent is that that behaviour is conforming.

> So the server has stored it, and now some other client tries to READ it.
> Are you saying that your "SHOULD generate" is violated if the article is
> now sent, including that "!@#$", in response to the READ? Likewise, is that
> "SHOULD generate" violated if the server becomes a client and says IHAVE
> that article to another server, and then sends it as-is (in which case
> it is a "MUST accept" for the other site)?

No to both.

> In fact, I think it is clear that all existing implementations will simply
> include that "!@#$" in all the relevant places, simply because it is too
> much hassle and a waste of resources to try and detect these obscure
> happenings.

Exactly.

I think the text in 3.6 is approximately correct, but it mixes up roles.
I will change it:

-  The content of a header SHOULD be in UTF-8.  However, if a server
+  The content of a header SHOULD be in UTF-8.  However, if an implementation
   receives an article from elsewhere that uses octets in the range 128
-  to 255 in some other manner, it MAY pass it to a client without
+  to 255 in some other manner, it MAY pass it to a client or server without
-  modification.  Therefore clients MUST be prepared to receive such
+  modification.  Therefore implementations MUST be prepared to receive such
   headers and also data derived from them (e.g.  in the responses from
   the OVER (Section 8.3) command) and MUST NOT assume that they are
-  always UTF-8.  How the client will then process those headers,
+  always UTF-8.  Any external processing of those headers,
   including identifying the encoding used, is outside the scope of this
   document.
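
As an illustration only (not part of the draft text), here is a minimal
Python sketch of a receiver that does not assume UTF-8 in data derived from
headers, such as an OVER response line. It assumes the OVER fields are
TAB-separated, which is why the line is split on raw octets before any
decoding. The names decode_field and split_over_line and the Latin-1
fallback are hypothetical choices; identifying the real encoding remains
outside the scope of the document.

```python
from typing import List


def decode_field(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode one field, trying UTF-8 first and a fallback second.

    The fallback encoding is an assumption of this sketch; the draft
    deliberately does not say how a receiver should guess the encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every octet to a code point, so this cannot fail.
        return raw.decode(fallback)


def split_over_line(line: bytes) -> List[str]:
    """Split an OVER-style line on TAB octets, then decode each field."""
    fields = line.rstrip(b"\r\n").split(b"\t")
    return [decode_field(field) for field in fields]
```

Splitting before decoding means one field in a legacy encoding cannot
prevent the other fields on the same line from being read as UTF-8.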

I've added to LIST NEWSGROUPS:

   The description SHOULD be in UTF-8.  However, servers sometimes
   obtain the information from an external source which has used a
   different encoding (one that uses octets in the range 128 to 255
   in some other manner).  In this case they MAY pass it on unchanged
   and clients MUST be prepared to receive such descriptions.
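
One possible server-side reading of that paragraph, sketched in Python under
stated assumptions: a description that is already UTF-8 is sent as-is, one
whose source encoding happens to be known is re-encoded to UTF-8, and
anything else is passed on unchanged. The function name
description_for_wire and its source_encoding parameter are hypothetical,
and the re-encoding step is an option this sketch chooses to take, not
something the text above requires.

```python
from typing import Optional


def description_for_wire(raw: bytes,
                         source_encoding: Optional[str] = None) -> bytes:
    """Return the octets to send for one newsgroup description."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    if source_encoding is not None:
        try:
            # Re-encoding existing content: the result is UTF-8.
            return raw.decode(source_encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            pass  # unknown codec or bad data: fall through
    # Encoding unknown: pass the description on unchanged.
    return raw
```

A client still has to cope with the last branch, which is the case the
"MUST be prepared to receive" wording covers.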

Finally, I've changed the formal syntax to:

   Implementations MUST accept any content that meets this syntax:

     S-CHAR   = %x21-FF
     S-NONTAB = CTRL / SP / S-CHAR
     S-TEXT   = (CTRL / S-CHAR) *B-CHAR

   and MAY pass such content on unaltered.

   When generating new content or re-encoding existing content,
   implementations SHOULD conform to this syntax:

     S-CHAR   = P-CHAR
     S-NONTAB = U-NONTAB
     S-TEXT   = U-TEXT

I could easily be convinced that that SHOULD should be a MUST.
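
To make the accept/generate split concrete, here is a rough Python sketch
that follows the paraphrased table earlier in this message rather than the
formal ABNF, so it may differ from the real grammar at the edges. Each
function checks a whole octet string; all function names are hypothetical,
and "valid UTF-8" is approximated by Python's strict decoder.

```python
TAB, SP = 0x09, 0x20
EXCLUDED = frozenset({0x00, 0x0A, 0x0D})  # "any" excludes NUL, LF, and CR


def is_utf8(data: bytes) -> bool:
    """True if the octets decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "MUST accept" column of the table.
def accept_s_char(data: bytes) -> bool:
    return all(b >= 0x21 for b in data)  # every octet in %x21-FF


def accept_s_nontab(data: bytes) -> bool:
    return all(b != TAB and b not in EXCLUDED for b in data)


def accept_s_text(data: bytes) -> bool:
    # Non-empty, not beginning with TAB or SP, no NUL/CR/LF anywhere.
    return (bool(data) and data[0] not in (TAB, SP)
            and not EXCLUDED.intersection(data))


# "SHOULD generate" column: the same shape, restricted to UTF-8.
def generate_s_char(data: bytes) -> bool:
    return accept_s_char(data) and is_utf8(data)


def generate_s_nontab(data: bytes) -> bool:
    return accept_s_nontab(data) and is_utf8(data)


def generate_s_text(data: bytes) -> bool:
    # Additionally, the first character must be U+0021 or above.
    return accept_s_text(data) and data[0] >= 0x21 and is_utf8(data)
```

For example, the Latin-1 octets b"caf\xe9" satisfy accept_s_char but not
generate_s_char: content of that kind may be passed on unaltered, but it is
not what an implementation would produce when generating new content.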

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive at davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            |