[ietf-nntp] Further syntax

Thu Mar 4 01:54:39 PST 2004

Russ Allbery said:
>> * Delimiters between fields vary:
>>   HDR                - single space
>>   LIST ACTIVE        - one or more spaces
>>   LIST ACTIVE.TIMES  - one or more spaces
>>   LIST DISTRIB.PATS  - colon
>>   LIST DISTRIBUTIONS - one or more spaces
>>   LIST EXTENSIONS    - single space
>>   LIST NEWSGROUPS    - white space
>>   NEWGROUPS          - one or more spaces
>>   OVER               - single tab

> We could specify the delimiter for LIST EXTENSIONS as one or more spaces,
[...]

Okay, made that one change and nothing else. This means that the "normal"
delimiter is one or more spaces, with the others being exceptions for good
reasons.
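
To make the delimiter table concrete, here is a rough sketch of how a client might split response lines under these rules. The names `DELIMITERS` and `split_fields` and the sample lines are purely illustrative assumptions, not anything taken from the draft.

```python
import re

# Delimiters as listed above, with LIST EXTENSIONS changed to
# "one or more spaces". DELIMITERS and split_fields are hypothetical names.
_SPACES = re.compile(r" +")
DELIMITERS = {
    "HDR": re.compile(r" "),                   # single space
    "LIST ACTIVE": _SPACES,
    "LIST ACTIVE.TIMES": _SPACES,
    "LIST DISTRIB.PATS": re.compile(r":"),     # colon
    "LIST DISTRIBUTIONS": _SPACES,
    "LIST EXTENSIONS": _SPACES,
    "LIST NEWSGROUPS": re.compile(r"[ \t]+"),  # white space (assumed SP/TAB)
    "NEWGROUPS": _SPACES,
    "OVER": re.compile(r"\t"),                 # single tab
}


def split_fields(command, line, maxsplit=0):
    """Split one response line (without its CRLF) into fields.

    maxsplit=0 means "no limit"; pass maxsplit=1 when the last field
    may itself contain the delimiter (e.g. a description).
    """
    return DELIMITERS[command].split(line, maxsplit=maxsplit)
```

Because OVER uses exactly one tab, two adjacent tabs yield an empty field rather than being collapsed, which is what distinguishes it from the "one or more spaces" commands.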

>> * I had to guess what characters are allowed in some contents. Here's the
>> choices I've made:
>>   HDR and OVER - the header contents are UTF-8 printable, they cannot
>>     contain tabs or other controls, but may contain spaces.
> 
> Hm.  In practice, HDR and OVER are going to return the raw octets from the
> article.  This may or may not be in UTF-8.  I'm not sure what to do about
> this.  Obviously, 8-bit characters in article headers mean that the
> articles are violating the underlying article format standard, at least at
> present (and non-UTF-8 characters will probably always mean that), but on
> the other hand we've tried to keep NNTP agnostic about the underlying
> articles and people are definitely using NNTP with locally-defined
> character sets.

Um, not quite. We have put some structure into an article, even though it's
much looser than RFC 1036 or RFC 2822:

    article = 1*header CRLF body
    header = header-name ":" [CRLF] SP header-content CRLF
    header-content = *(P-CHAR / [CRLF] WS)
    body = *(*B-CHAR CRLF)

P-CHAR is ASCII printable or UTF-8. B-CHAR is any raw byte except NUL, CR,
and LF. header-name is ASCII printable other than colon.

So we've already limited header contents to UTF-8. See below.
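
As a rough illustration of what the header-content rule admits, the sketch below checks a byte string against it. It assumes WS means SP or TAB, treats "UTF-8" as any validly encoded non-ASCII character, and uses a hypothetical function name; it is one reading of the grammar, not the draft's definition.

```python
import re

# A CRLF is only acceptable when immediately followed by WS (folding).
# WS is assumed here to be SP or TAB.
_FOLD = re.compile(rb"\r\n(?=[ \t])")


def is_valid_header_content(raw: bytes) -> bool:
    """Check raw against header-content = *(P-CHAR / [CRLF] WS)."""
    unfolded = _FOLD.sub(b"", raw)
    try:
        # Strict decoding rejects octets that are not valid UTF-8.
        text = unfolded.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        if ch in " \t":
            continue  # WS
        if 0x21 <= cp <= 0x7E or cp >= 0x80:
            continue  # P-CHAR: ASCII printable or non-ASCII via UTF-8
        return False  # NUL, bare CR/LF, or another control character
    return True
```

Under this reading a Latin-1 header such as `b"caf\xe9"` is rejected while its UTF-8 form `b"caf\xc3\xa9"` is accepted.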

> I think we should specify HDR and OVER as containing raw bytes excluding
> the standard suspects (CR, LF, NUL) and, in the case of OVER, TAB.

Whatever we do for article headers (see below) should apply to HDR and
OVER.

>>   LIST ACTIVE.TIMES - second field has no limit on the number of digits;
>>     third field is UTF-8 and must start with a printable character, but
>>     may contain spaces or controls.
>>   LIST DISTRIBUTIONS - both fields are UTF-8; the distribution must not
>>     contain spaces or controls, but the description can.
>>   LIST NEWSGROUPS - the description is UTF-8 and must start with a
>>     printable character, but may contain spaces or controls.
> 
> These are all other places where people may be using local character sets.
> I'm not sure that we really want to dive into specifying a character set
> here, although in practice if the newsgroup names are in UTF-8, it doesn't
> work very well for the descriptions to be in any other character set.  On
> the other hand, there is *substantial* existing practice for LIST
> NEWSGROUPS containing random local character sets, and it's not clear what
> servers should really do about that.
> 
> I think that really resolving this is a bit out of the scope of our
> working group, since the newsgroup description is really a USEFOR thing.

So you would say that "UTF-8" in each of the above should be relaxed to
"arbitrary high-bit-set characters", but the remaining limitations (e.g. no
spaces for distributions) should remain?

I can do this, but on the other hand we *do* say that a purpose of this
update is to make UTF-8 the primary character set.

For LIST ACTIVE.TIMES, I think we want UTF-8 to be consistent with
mailboxes. Distributions, I think, are like newsgroups: the names are
probably ASCII at present and should extend to UTF-8, while the
descriptions should be treated like those in LIST NEWSGROUPS.

But I'll leave it for you to decide.

> I'm hesitant to dive into the middle of these particular fights, and would
> prefer to bail and specify that they're in the same sort of unspecified
> character set as we say for ARTICLE.  I know this isn't ideal, but for the
> client author it *is* accurate and is what they will encounter in practice
> when using these existing commands.

As I said, we already limit the *header* section of an article to UTF-8.
Or, rather, the grammar does; I've just noticed that the text says:

    The content MUST NOT contain CRLF but is otherwise unrestricted;
    in particular, it MAY be empty. 

So we need to decide whether header contents can contain arbitrary octets
in the %x80-FF range, or whether they are limited to UTF-8. Once decided,
the text and/or grammar can be fixed; all the decisions above will follow
on automatically.
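
To show how the two options differ in practice, here is a small sketch that tests the same octets under each. The function name and the `utf8_only` flag are made up for illustration, and folding and other control characters are deliberately ignored to keep the contrast visible.

```python
def allowed_under(raw: bytes, utf8_only: bool) -> bool:
    """Test header content under one of the two candidate rules.

    utf8_only=False: arbitrary octets in the %x80-FF range are allowed.
    utf8_only=True:  high-bit octets must form valid UTF-8.
    Simplification: NUL, CR and LF are rejected outright in both cases.
    """
    if any(b in raw for b in (b"\x00", b"\r", b"\n")):
        return False
    if not utf8_only:
        return True
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A header in a local 8-bit character set, e.g. `b"caf\xe9"`, passes the first rule and fails the second; that is exactly the existing practice at issue for LIST NEWSGROUPS descriptions.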

>>   LIST OVERVIEW.FMT - all the text quoted in 8.5.2.2 is case-sensitive.
> Hm.  It doesn't make a lot of sense to me for this text to be
> case-sensitive when everywhere else in NNTP header names are
> case-insensitive.

Hmm, of course they are. Mind, we never actually said that for metadata
items (I've added it now).

I had taken "the first 7 lines MUST be exactly" as specifying case as well.
I've now added "(except for the case of letters)" and altered the grammar
accordingly.

The remaining case-sensitivity in the grammar is:
* status field in LIST ACTIVE response;
* extension labels (but not their arguments).

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive at davros.org>  | *** NOTE CHANGE ***
Demon Internet      | WWW: http://www.davros.org | Fax:    +44 870 051 9937
Thus plc            |                            | Mobile: +44 7973 377646