[ietf-nntp] Further syntax
Clive D.W. Feather
clive at demon.net
Thu Mar 4 01:54:39 PST 2004
Russ Allbery said:
>> * Delimiters between fields vary:
>> HDR - single space
>> LIST ACTIVE - one or more spaces
>> LIST ACTIVE.TIMES - one or more spaces
>> LIST DISTRIB.PATS - colon
>> LIST DISTRIBUTIONS - one or more spaces
>> LIST EXTENSIONS - single space
>> LIST NEWSGROUPS - white space
>> NEWGROUPS - one or more spaces
>> OVER - single tab
> We could specify the delimiter for LIST EXTENSIONS as one or more spaces,
[...]
Okay, made that one change and nothing else. This means that the "normal"
delimiter is one or more spaces, with the others being exceptions for good
reasons.
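As a rough illustration of the delimiter rules above, here is a minimal Python sketch of a per-command field splitter. The table `FIELD_RULES` and the function `split_fields` are hypothetical names, not anything from the draft. The delimiters follow the quoted list, with LIST EXTENSIONS changed to "one or more spaces". The split limits are an assumption made here so that a trailing free-text field keeps its embedded spaces.

```python
import re

# command -> (delimiter regex, maximum number of splits; 0 means no limit).
# Delimiters are taken from the list quoted above. The split limits are an
# assumption of this sketch, not something the draft states.
FIELD_RULES = {
    "HDR": (r" ", 1),                  # single space, content may hold spaces
    "LIST ACTIVE": (r" +", 0),         # one or more spaces
    "LIST ACTIVE.TIMES": (r" +", 2),   # one or more spaces
    "LIST DISTRIB.PATS": (r":", 0),    # colon
    "LIST DISTRIBUTIONS": (r" +", 1),  # one or more spaces
    "LIST EXTENSIONS": (r" +", 0),     # one or more spaces (the change made here)
    "LIST NEWSGROUPS": (r"[ \t]+", 1), # white space
    "NEWGROUPS": (r" +", 0),           # one or more spaces
    "OVER": (r"\t", 0),                # single tab, empty fields are preserved
}


def split_fields(command, line):
    """Split one response line (without its CRLF) into fields."""
    pattern, maxsplit = FIELD_RULES[command]
    return re.split(pattern, line, maxsplit=maxsplit)


print(split_fields("LIST ACTIVE", "misc.test  3002322 3000234 y"))
print(split_fields("OVER", "1\tSubject here\t\tx"))
```

Note that the single-tab rule for OVER keeps an empty field as an empty string, which the "one or more spaces" rule cannot do; that is one of the "good reasons" for the exception.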
>> * I had to guess what characters are allowed in some contents. Here's the
>> choices I've made:
>> HDR and OVER - the header contents are UTF-8 printable, they cannot
>> contain tabs or other controls, but may contain spaces.
>
> Hm. In practice, HDR and OVER are going to return the raw octets from the
> article. This may or may not be in UTF-8. I'm not sure what to do about
> this. Obviously, 8-bit characters in article headers mean that the
> articles are violating the underlying article format standard, at least at
> present (and non-UTF-8 characters will probably always mean that), but on
> the other hand we've tried to keep NNTP agnostic about the underlying
> articles and people are definitely using NNTP with locally-defined
> character sets.
Um, not quite. We have put some structure into an article, even though it's
much looser than 1036 or 2822:
article = 1*header CRLF body
header = header-name ":" [CRLF] SP header-content CRLF
header-content = *(P-CHAR / [CRLF] WS)
body = *(*B-CHAR CRLF)
P-CHAR is ASCII printable or UTF-8. B-CHAR is raw bytes except NUL, CR, and
LF. header-name is ASCII printable other than colon.
So we've already limited header contents to UTF-8. See below.
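To make the effect of that grammar concrete, here is a minimal Python sketch that checks a single header against it. The function name `parse_header` is hypothetical. The sketch assumes that "ASCII printable" means %x21-7E, that WS means SP or TAB, and that "UTF-8" means any sequence Python's strict UTF-8 decoder accepts; none of those readings is confirmed by the text above.

```python
import re


def parse_header(raw: bytes):
    """Return (name, unfolded content) if raw fits the header rule, else None."""
    if not raw.endswith(b"\r\n"):
        return None
    name, sep, rest = raw[:-2].partition(b":")
    # header-name: one or more ASCII printables other than colon (%x3A).
    if not sep or not re.fullmatch(rb"[\x21-\x39\x3b-\x7e]+", name):
        return None
    # ":" [CRLF] SP : an optional fold, then a mandatory space.
    if rest.startswith(b"\r\n"):
        rest = rest[2:]
    if not rest.startswith(b" "):
        return None
    # [CRLF] WS : a CRLF inside the content is only legal directly before
    # SP or TAB, so remove those folds and reject any CR or LF left over.
    content = re.sub(rb"\r\n(?=[ \t])", b"", rest[1:])
    if b"\r" in content or b"\n" in content:
        return None
    try:
        text = content.decode("utf-8")  # the grammar's UTF-8 restriction
    except UnicodeDecodeError:
        return None
    # Assumption: no ASCII controls other than TAB are allowed.
    if any((ord(c) < 0x20 and c != "\t") or ord(c) == 0x7F for c in text):
        return None
    return name.decode("ascii"), text


print(parse_header(b"Subject: caf\xc3\xa9\r\n"))  # UTF-8: accepted
print(parse_header(b"Subject: caf\xe9\r\n"))      # Latin-1 octet: rejected
```

Under this reading a header carrying a bare ISO 8859-1 octet is simply not a valid header, which is the restriction at issue.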
> I think we should specify HDR and OVER as containing raw bytes excluding
> the standard suspects (CR, LF, NUL) and, in the case of OVER, TAB.
Whatever we do for article headers (see below) should apply to HDR and
OVER.
> > LIST ACTIVE.TIMES - second field has no limit on the number of digits;
> > third field is UTF-8 and must start with a printable character, but
> > may contain spaces or controls.
> > LIST DISTRIBUTIONS - both fields are UTF-8; the distribution must not
> > contain spaces or controls, but the description can.
> > LIST NEWSGROUPS - the description is UTF-8 and must start with a
> > printable character, but may contain spaces or controls.
>
> These are all other places where people may be using local character sets.
> I'm not sure that we really want to dive into specifying a character set
> here, although in practice if the newsgroup names are in UTF-8, it doesn't
> work very well for the descriptions to be in any other character set. On
> the other hand, there is *substantial* existing practice for LIST
> NEWSGROUPS containing random local character sets, and it's not clear what
> servers should really do about that.
>
> I think that really resolving this is a bit out of the scope of our
> working group, since the newsgroup description is really a USEFOR thing.
So you would say that "UTF-8" in each of the above should be relaxed to
"arbitrary high-bit-set characters", but the remaining limitations (e.g. no
spaces for distributions) should remain?
I can do this, but on the other hand we *do* say that a purpose of this
update is to make UTF-8 the primary character set.
For LIST ACTIVE.TIMES, I think we want UTF-8 to be consistent with
mailboxes. Distributions are like newsgroups, I think: the names are
presently probably ASCII and should extend to UTF-8, while the descriptions
should be like LIST NEWSGROUPS.
But I'll leave it for you to decide.
> I'm hesitant to dive into the middle of these particular fights, and would
> prefer to bail and specify that they're in the same sort of unspecified
> character set as we say for ARTICLE. I know this isn't ideal, but for the
> client author it *is* accurate and is what they will encounter in practice
> when using these existing commands.
As I said, we already limit the *header* section of an article to UTF-8.
Or, rather, the grammar does; I've just noticed that the text says:
The content MUST NOT contain CRLF but is otherwise unrestricted;
in particular, it MAY be empty.
So we need to decide whether header contents can contain arbitrary octets
in the %x80-FF range, or whether they are limited to UTF-8. Once decided,
the text and/or grammar can be fixed; all the decisions above will follow
on automatically.
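The two candidate rules can be put side by side in a short Python sketch. The function name `content_allowed` and its `utf8_only` flag are hypothetical. The sketch assumes the content has already been unfolded and that both readings exclude NUL, CR and LF (the "standard suspects").

```python
def content_allowed(content: bytes, utf8_only: bool) -> bool:
    """Check unfolded header content under one of the two candidate rules."""
    # Assumption: both readings forbid NUL, CR and LF.
    if any(octet in content for octet in (b"\x00", b"\r", b"\n")):
        return False
    if not utf8_only:
        # Reading 1: arbitrary octets in the %x80-FF range are acceptable.
        return True
    # Reading 2: high-bit octets must form well-formed UTF-8.
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1 = "café".encode("latin-1")  # b"caf\xe9"
utf8 = "café".encode("utf-8")      # b"caf\xc3\xa9"
print(content_allowed(latin1, utf8_only=False))  # True
print(content_allowed(latin1, utf8_only=True))   # False
print(content_allowed(utf8, utf8_only=True))     # True
```

Only the second call differs between the two readings, and locally-defined character sets fall into exactly that case.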
>> LIST OVERVIEW.FMT - all the text quoted in 8.5.2.2 is case-sensitive.
> Hm. It doesn't make a lot of sense to me for this text to be
> case-sensitive when everywhere else in NNTP header names are
> case-insensitive.
Hmm, of course they are. Mind, we never actually said that for metadata
items (I've added it now).
I had taken "the first 7 lines MUST be exactly" as specifying case as well.
I've now added "(except for the case of letters)" and altered the grammar
accordingly.
The remaining case-sensitivity in the grammar is:
* status field in LIST ACTIVE response;
* extension labels (but not their arguments).
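For a client, the "(except for the case of letters)" wording amounts to comparing the fixed leading lines of the LIST OVERVIEW.FMT response without regard to case. A minimal Python sketch follows. The function name `leading_lines_match` is hypothetical, and the field names in the usage lines are illustrative rather than quoted from 8.5.2.2.

```python
def leading_lines_match(received, required):
    """True if the first len(required) lines equal `required`, ignoring case."""
    if len(received) < len(required):
        return False
    # zip() stops at the end of `required`, so extra trailing lines in
    # `received` are not examined here.
    return all(r.lower() == q.lower() for r, q in zip(received, required))


# Illustrative values only; the real list of required lines is in the draft.
required = ["Subject:", "From:"]
print(leading_lines_match(["subject:", "FROM:", "Xref:full"], required))  # True
print(leading_lines_match(["From:", "Subject:"], required))               # False
```

The status field and extension labels, by contrast, would still be compared with plain `==`.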
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8495 6138
Internet Expert | Home: <clive at davros.org> | *** NOTE CHANGE ***
Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937
Thus plc | | Mobile: +44 7973 377646