Comments on draft-ietf-nntpext-base-02.txt

Mon Nov 10 08:05:58 PST 1997

I first tried to post this to the list about a month ago, as suggested
by Stan Barber. However, according to the archives it never made it. In
the meantime, I am now fully subscribed to the list, so let's hope it
makes it this time.

Due to the delay, some of this may now be a little dated. For example, I
see that you have been talking about UTF-8, and may even be prepared to
go further down that line than I am suggesting here.

I am speaking as a member of the usenet-format list that is working on a
successor to RFC1036. That does not imply that I can speak _for_ the group
(who can?), but I have been assigned the task of producing the text for the
Newsgroups: and Distribution: headers, and it is clear that the Newsgroups
header in particular has to interface nicely with the NNTP protocol.

Our process is only just starting (we do not even have a complete first
draft yet) whereas yours seems to be well advanced. However, I hope you
will agree that we should be on the lookout for potential
incompatibilities, and that hopefully our two standards will be as
consistent as we can manage.

Actually, having studied your draft, I find remarkably little that should
cause any difficulty. Our major concern is with non-Ascii charsets, and
you seem to have put in enough handles as almost to cover them. Almost, but
not quite.

There are two major decisions that we are more or less committed to:

1. The transport mechanism for news MUST be 8-bit clean. One will still be
required to use appropriate MIME headers in an article that strays from
pure Ascii, but in effect the Transfer-Encoding should be 8bit.

2. We are under pressure from the Scandinavians (so far, but I expect the
Japanese are not far behind) to allow non-Ascii characters in newsgroup
names. One option is to let them be in iso-8859-1 by default
(effectively, that would allow any of the iso-8859-x). But that does not
give us the rest of Unicode. A more likely possibility is to let them be
in UTF-8 by default. Note that, with both these possibilities, all
existing usage (newsgroup names in ascii) will continue to work without
hindrance, since Ascii is a proper subset of both options.

Another step we are taking is some carefully defined terminology for
	posting agent
	injecting agent
	relaying agent
	serving agent
	reading agent
	followup agent
with (I hope) obvious meanings (a serving agent is one that actually
maintains a news spool). So NNTP will sit between a posting agent and an
injecting agent (for posting) or between a serving agent and a reading
agent (for reading) or between two relaying agents (for the IHAVE
protocol). Of course, one piece of software will often perform the
functions of several of those agents.

So, now to the nitty gritty.

          4.   Basic Operation.

...

            NNTP operates over any reliable data stream 8-bit-wide
            channel. When running over TCP/IP, the official port for the
            NNTP service is 119. ...

Yes, this sets the right tone for what follows.

            The default character set for all NNTP commands is US-
            ASCII[2].

Well, I might try to talk you into UTF-8, but for now I would be happy
with some mention of the CHARSET command at this point (e.g. "but see..."
or "except as provided in...").

            Commands in the NNTP MUST consist of a case-
            insensitive keyword, which MAY be followed by one or more
            arguments.  All commands MUST be terminated by a CRLF pair.
            Multiple commands MUST not be permitted on the same line.
            Keywords MUST consist of printable US-ASCII characters.

Yes, no question of Ascii for the commands.

            Unless otherwise noted elsewhere in this document, Arguments
            SHOULD consist of printable US-ASCII characters.

Could you add "However, arguments outside the US-ASCII range SHOULD be
passed to the server, even if a CHARSET command has not been given."

            Keywords and
            arguments MUST be each separated by one or more SPACE or TAB
            characters. Keywords MUST be at least three characters and
            MUST NOT exceed 12 characters.  Command lines MUST not exceed
            512 characters, which includes the terminating CRLF pair.

I think the position I am trying to get to is that clients which want to
use, say, UTF-8 SHOULD give a CHARSET command, and if 204 is returned the
server MUST accept such characters. If 404 or 500 is returned, the client
MAY press on regardless, which in practice will probably mean that most
straightforward things will work (but not wildmats probably), and even if
some command then returns an error - well that was just Tough. This seems
to be consistent with what you discuss in section 8.1, and gives clients
the maximum opportunity to use whatever facilities they understand without
the NNTP daemon getting in the way.
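
As a rough sketch of the client-side policy I have in mind (the function name
and the reply texts are purely illustrative, and nothing here is taken from
the draft beyond the 204, 404 and 500 codes):

```python
def charset_policy(reply_line: str) -> str:
    """Classify the server's reply to a CHARSET command.

    "guaranteed"  - 204: the server MUST accept characters in that charset.
    "best-effort" - 404 or 500: the client MAY press on regardless and
                    accept that some later command might fail.
    "unknown"     - any other reply code.
    """
    code = reply_line[:3]
    if code == "204":
        return "guaranteed"
    if code in ("404", "500"):
        return "best-effort"
    return "unknown"
```

A client would send "CHARSET UTF-8", pass the first line of the reply to this
function, and carry on in either of the first two cases.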

            Each response MUST start with a three-digit status indicator
            that is sufficient to distinguish all responses. Responses to
            certain commands MAY be multi-line. In these cases, which are
            clearly indicated below, after sending the first line of the
            response and a CRLF, any additional lines are sent, each
            terminated by a CRLF pair. When all lines of the response have
            been sent, a final line MUST be sent, consisting of a
            termination octet (ASCII decimal code 046, ".") and a CRLF
            pair.  If any line of the multi-line response begins with the
            termination octet, the line MUST be "byte-stuffed" by pre-
            pending the termination octet to that line of the response.
            Hence, a multi-line response is terminated with the five
            octets "CRLF.CRLF".  When examining a multi-line response, the
            client MUST check to see if the line begins with the
            termination octet. If so and if octets other than CRLF follow,
            the first octet of the line (the termination octet) MUST be
            stripped away.  If so and if CRLF immediately follows the
            termination character, then the response from the NNTP server
            is ended and the line containing ".CRLF" MUST not considered
            part of the multi-line response.

What I would like here is for a multi-line response (at least when its
contents consist of headers or bodies for articles) to be regarded as a
stream of octets, any octet being allowed except those corresponding to
Ascii NUL, CR and LF (that is the 8bit MIME definition). CRLF to mean line
end, and byte stuffing for CRLF.CRLF as usual. The interpretation is an
issue between the server and the client, with the aid of any MIME headers
that may be contained within the data, and any CHARSET command that might
have been given (well, the CHARSET part of that might be controversial -
see later).

I note that you do not specify any maximum line length. MIME suggests a
maximum of 998 octets plus CRLF. I could live with that.

BTW, it is not clear how a multi-line response consisting of zero lines is
to be delivered. Is a ".CRLF" as a parameter after a command sufficient,
even though it is not a .CRLF at the start of a line?
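
To make the octet-stream view concrete, here is a minimal sketch of the
byte stuffing done purely on octets, with no charset knowledge at all. The
function names are my own invention. It assumes the status line and its CRLF
have already been sent or consumed, so that a zero-line response is simply
".CRLF" on its own.

```python
def encode_multiline(lines):
    """Byte-stuff a list of lines (bytes, without CRLF) and add the terminator."""
    out = bytearray()
    for line in lines:
        # The 8bit rule: any octet is allowed except NUL, CR and LF.
        if any(bad in line for bad in (b"\x00", b"\r", b"\n")):
            raise ValueError("NUL, CR and LF are not allowed inside a line")
        if line.startswith(b"."):
            out += b"."          # double a leading termination octet
        out += line + b"\r\n"
    out += b".\r\n"              # terminating line
    return bytes(out)


def decode_multiline(data):
    """Undo the byte stuffing; return the lines before the terminating line."""
    lines = []
    for line in data.split(b"\r\n"):
        if line == b".":
            return lines         # terminator: not part of the response
        if line.startswith(b"."):
            line = line[1:]      # strip the stuffed octet
        lines.append(line)
    raise ValueError("missing terminating line")
```

Octets above 0x7F pass through untouched; what they mean is left to the
server and the client, as suggested above.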

          5.   The WILDMAT format

I think the interpretation of a wildmat is going to be the principal
difficulty in implementing strange charsets. The interpretation of a
wildmat will need to depend on any CHARSET command previously given and
accepted.

A * in a wildmat should never be a problem.

A ? will not be a problem with any iso-8859-x, but could be awkward with
UTF-8 (where 1 character may be represented by several octets, although it
is always possible to tell how many octets by looking at the first).
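
A small sketch of that last point: the length of a UTF-8 sequence can be read
off its first octet, which is what a "?" in a wildmat would need in order to
step over one character. The function names are illustrative. I assume
sequences of at most 4 octets here, although RFC 2044 also defines 5- and
6-octet forms, and no check is made that the continuation octets are valid.

```python
def utf8_seq_len(first_octet: int) -> int:
    """Number of octets in the UTF-8 sequence that starts with first_octet."""
    if first_octet < 0x80:
        return 1                      # plain Ascii
    if 0xC0 <= first_octet < 0xE0:
        return 2
    if 0xE0 <= first_octet < 0xF0:
        return 3
    if 0xF0 <= first_octet < 0xF8:
        return 4
    # 0x80-0xBF is a continuation octet; 0xF8 and above are not handled here.
    raise ValueError("not the first octet of a UTF-8 sequence")


def split_utf8_chars(octets: bytes):
    """Split a UTF-8 octet string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(octets):
        n = utf8_seq_len(octets[i])
        chars.append(octets[i:i + n])
        i += n
    return chars
```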

[a-zà-ý] will be impossible if the wildmat interpreter does not understand
the charset in question.

But even so, wildmat implementations written for Ascii will likely work
quite well for people looking for soc.culture.* even if one of the groups
retrieved does turn out to be soc.culture.ålesund.

Just for interest, I have created a local group "local.aáb", that is
"local.aM-^Aáb" in case your reader could not cope or, in hex, 
"6c 6f 63 61 6c 2e 61 81 e1 62"

That includes both a legitimate printable iso-8859-1 character, and also
an iso-8859-1 control character (which could actually arise in a UTF-8
encoding).  CNEWS created the group without difficulty, my UNIX file
system (Solaris 2.3) created a directory of that name, and Stan's reference
implementation of NNTP was happy to find it using commands such as
	list active local.aáb
	list active local*
	list active local.a??b
	group local.aáb
Also Netscape (again working through Stan's NNTP daemon) was able to post
articles to the group and retrieve them from it.
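
For anyone who wants to check the octets of that group name, a short sketch
(the helper name is mine). It decodes the hex string given above as
iso-8859-1 and also shows that, as it stands, the name is not itself valid
UTF-8, since 0x81 appears without a preceding lead octet.

```python
GROUP_HEX = "6c 6f 63 61 6c 2e 61 81 e1 62"


def is_valid_utf8(octets: bytes) -> bool:
    """True if the octets decode cleanly as UTF-8."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


name = bytes.fromhex(GROUP_HEX)          # the ten octets of the group name
as_latin1 = name.decode("iso-8859-1")    # every octet maps to one character
```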

Of course, this rather simple experiment does not _prove_ anything, but it
gives grounds for optimism (indeed, it worked out much better than my
expectation before I actually tried it).

P.S. After writing the above, CNEWS baulked during the next expiry run when it
was checking and updating the active file :-( . Doesn't seem like a
showstopper.

          10.1.1.1  GROUP

            GROUP ggg

The parameter ggg, which appears in several commands, will be the most
common application for charsets other than Ascii. Ideally, we might like
the default to be UTF-8. More realistically, it should be interpreted in
accordance with any CHARSET command.

I think what I would like to see in your text here is an explicit
reference to the CHARSET command, and the things it might cause to become
acceptable for ggg.

The other command where this issue arises is LISTGROUP.

          10.2.1    ARTICLE

            ARTICLE [<message-id>|nnn]

            This response displays the header, a blank line, then the body
            (text) of the specified article. The optional parameter nnn is
            the numeric id of an article in the current news group and
            MUST be chosen from the range of articles provided when the
            news group was selected.  If it is omitted, the current
            article is assumed. Message-id is the message id of an article
            as shown in that article's header.

This is another place where the output should be regarded as a stream of
octets, with anything other than Ascii NUL, CR and LF allowed. See my
remarks concerning multi-line responses above.

          10.3.1    POST

            POST

            If posting is permitted, the article MUST be presented in the
            format specified by RFC 1036, and MUST include all required
            header lines. After the article's header and body have been
            completely sent by the client to the server, a further
            response code MUST be returned to indicate success or failure
            of the posting attempt.

I believe this is a little too strong. The required headers in RFC1036
include Message-ID and Date. It is usually safer to let the injecting
agent (i.e., in practice, 'inews') generate these, and the injecting agent
will in general be the server rather than the client in the usual NNTP
setup. Of course, if the client does generate these headers in a manner
acceptable to 'inews', then that is fine, but I would not regard that as
the norm.

            The text forming the header and body of the message to be
            posted MUST be sent by the client using the conventions for
            text received from the news server: A single period (".") on a
            line indicates the end of the text, with lines starting with a
            period in the original text having that period doubled during
            transmission.

This is, of course, the main example of where the multi-line format needs
to be interpreted as an octet stream, as already mentioned.

            No attempt shall be made by the server to filter characters,
            fold or limit lines, or otherwise process incoming text. The
            intent is that the server just passes the incoming message to
            be posted to the server installation's news posting software,
            which is not part of this specification.

Agree entirely. This lecture cannot be repeated too often, and will
undoubtedly figure prominently in our draft.

          10.3.2    IHAVE

            IHAVE <message-id>

I do not think we have any plans to permit non-ascii characters in
Message-IDs (but, I suppose one day they might find themselves in DNS
domain names, which should make life interesting for a while :-) ).

            However, the server may elect not to post or forward the
            article if after further examination of the article it deems
            it inappropriate to do so. The 436 or 437 error codes MUST be
            returned as appropriate to the situation.

I believe that MUST should be a SHOULD, on account of the permitted
exception which you mention in 10.3.2.1. Otherwise, CNEWS is broken (and
that would be a shame :-) ).

          10.4.1    LIST

            LIST [ACTIVE [wildmat]]

This is the first example of a wildmat that might want to set a pattern
involving non-ascii characters. I think my remarks above have covered this
issue sufficiently. Other commands where this arises include LIST
ACTIVE.TIMES, LIST NEWSGROUPS, PAT, NEWGROUPS and NEWNEWS.

            The response to the LIST keyword with no parameters returns a
            list of valid news groups and associated information.  Each
            news group is sent as a line of text in the following format:

               group last first status

The "group" could, of course, contain characters in UTF-8 or whatever. I
think the client has to accept whatever the server sends. It will, of
course, be some time before all clients are able to display group names
provided in UTF-8, but my experiment showed that even existing clients can
make some sort of sense of them. I do not expect it ever to be the case
that all clients will be able to display everything in unicode, but if
you intend to read japanese groups, then you will presumably invest in a
japanese client - other clients should just indicate that something
undisplayable is there.

However, this is an issue for client implementors, and not for NNTP or
even for the new 1036.

Note that this same situation arises in the LIST DISTRIB.PATS command
(except that wildmats may also be returned there, which should be
interesting).

          10.4.3    LIST DISTRIBUTIONS

            LIST DISTRIBUTIONS

My present thinking is to leave Distributions in strict ascii for now.

          10.4.5    LIST NEWSGROUPS

               LIST NEWSGROUPS [wildmat]

            The newsgroups file is maintained by some news transport
            systems to contain the name of each news group that is
            active on the server and a short description about the
            purpose of each news group. Each line in the file contains
            two fields, the news group name and a short explanation of
            the purpose of that news group.

I think we have to accept that the short description lines may be in any
language and any charset. My earlier remarks on multi-line responses
should cover the situation. Other commands where texts might appear in
strange charsets are LIST SUBSCRIPTIONS, OVER and (possibly) LIST
OVERVIEW.FMT.

          10.4.8    LISTGROUP

               LISTGROUP [ggg]

               Note that the name of the news group is not case-dependent.
               It must otherwise match a news group obtained from the LIST
               command or an error will result.

Agreed, but doesn't this remark also apply to various other commands with
a ggg or wildmat parameter?

          10.4.9    OVER

            OVER [range]

            Each line of output MUST be formatted with the article number,
            followed by each of the headers in the overview database or
            the article itself (when the data is not available in the
            overview database) for that article separated by a tab
            character.  The sequence of fields must be in this order:
            subject, author, date, message-id, references, byte count, and
            line count. Other optional fields may follow line count. Where
            no data exists, a null field must be provided (i.e. the output
            will have two tab characters adjacent to each other). Servers
            should not output fields for articles that have been removed
            since the overview database was created.

I find the provision for sending the whole article rather strange. Surely
the headers alone would suffice, and I would have expected a special
response code for this situation, and also some indication of how the
multi-line response was to be formatted (e.g. to indicate the gap between
successive articles).
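
For the normal case, the tab-separated line described in the quoted text
could be picked apart along these lines. This is only a sketch: the field
names and the function name are my own labels, and the sample values in any
test are invented.

```python
OVER_FIELDS = (
    "number", "subject", "author", "date",
    "message_id", "references", "bytes", "lines",
)


def parse_over_line(line: str) -> dict:
    """Split one OVER output line into its fixed fields plus optional extras."""
    parts = line.split("\t")
    if len(parts) < len(OVER_FIELDS):
        raise ValueError("too few fields in overview line")
    record = dict(zip(OVER_FIELDS, parts))
    # A null field shows up as an empty string (two adjacent tabs).
    record["extra"] = parts[len(OVER_FIELDS):]
    return record
```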

          12.1 CHARSET

            CHARSET [charset]

            The CHARSET command is used to change the default character
            set for certain types of arguments: group names and the
            contents of article headers. The argument must be the name of
            a character set registered with the IANA. The server MUST
            return 204 if the specified character set is supported.
            Otherwise, the server MUST return 404.

I think this mechanism will work fine for charsets which have US-ASCII as
a subset. This includes UTF-8 and all of iso-8859-x. It is then possible
to regard the whole command line (including the command name) as being
within that charset. Things start to get more complex with other
byte-sized charsets.  If someone insists on specifying EBCDIC, for
example, are parts of the line (e.g. the command name) meant to be in
US-ASCII and parts (e.g. group names) meant to be in EBCDIC? Ugh! However,
EBCDIC is unlikely to be a problem in practice.

But suppose someone wants to specify a 16bit or 32bit charset (UCS-2 or
UCS-4). That is going to get reeeeal messy if the line contains both 8bit
and 16bit characters.

I would suggest the only solution is to ban such charsets, and state that
anyone needing to use UCS-2 or UCS-4 should first encode it into UTF-8
(yes, that is always possible).

            When used as arguments to commands, group names and the
            contents of article headers MUST be decoded before comparing
            text in a character set other than US-ASCII. US-ASCII must be
            supported; other character sets may be supported.

I think you are trying to say that both the header and the parameter
should be in the same charset before any comparison is done (this
particularly applies to wildmats). Which is converted to what is an
implementation issue. For example, an implementor might decide to convert
everything into 16bit UCS-2 or even 32bit UCS-4, and then process that.
Alternatively, it is not too hard to write a wildmat comparator working
directly with UTF-8 (because it preserves the lexicographic ordering of
UCS-4).
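
A sketch of why the ordering property matters for wildmats (function names
are illustrative only): sorting by UTF-8 octets gives the same result as
sorting by character code, so a range such as [à-ý] can be tested by plain
octet comparison without decoding anything.

```python
def same_order(names) -> bool:
    """True if sorting by character code and by UTF-8 octets agree."""
    by_code_point = sorted(names)
    by_utf8_octets = sorted(names, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_octets


def in_range_utf8(ch: bytes, lo: bytes, hi: bytes) -> bool:
    """Test lo <= ch <= hi on the UTF-8 encodings of single characters."""
    return lo <= ch <= hi
```

For instance, in_range_utf8 applied to the encodings of "é", "à" and "ý"
succeeds, while the encoding of "z" falls outside that range.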

            The use of CHARSET with no argument will reset the default
            character set to US-ASCII.

            Note that only argument processing is affected by the
            character set. The server MUST not translate any part of any
            multi-line response returned to the client based on the
            current character set.

I am not so sure of this. Probably agreed for bodies (where the MIME
headers should prevail).

          12.4 NEWGROUPS

            NEWGROUPS date time [GMT] [<wildmat>]

            Time must also be specified.  It must be as 6 digits HHMMSS
            with HH being hours on the 24-hour clock, MM minutes 00-59,
            and SS seconds 00-59.  The time is assumed to be in the
            server's timezone unless the token "GMT" appears in which case
            both time and date are evaluated at the 0 meridian.

The use of "GMT" seems to be going out of fashion. Maybe you should accept
"UTC" and "+0000" as well. The Date: formats in both Mail and News seem to
be going that way. Likewise in the NEWNEWS command.
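
Accepting the extra tokens costs a server very little. A sketch, assuming
(my assumption, not the draft's) that the token is matched without regard to
case, and with a function name of my own choosing:

```python
import re

UTC_TOKENS = {"GMT", "UTC", "+0000"}


def parse_time(hhmmss: str, zone_token=None):
    """Parse the HHMMSS argument and the optional zone token.

    Returns (hours, minutes, seconds, is_utc). With no token the time is
    taken to be in the server's own timezone, so is_utc is False.
    """
    if not re.fullmatch(r"[0-9]{6}", hhmmss):
        raise ValueError("time must be exactly 6 digits")
    hh, mm, ss = int(hhmmss[0:2]), int(hhmmss[2:4]), int(hhmmss[4:6])
    if hh > 23 or mm > 59 or ss > 59:
        raise ValueError("time out of range")
    if zone_token is None:
        is_utc = False
    elif zone_token.upper() in UTC_TOKENS:
        is_utc = True
    else:
        raise ValueError("unrecognised zone token")
    return hh, mm, ss, is_utc
```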

          12.5 NEWNEWS

            NEWNEWS newsgroups date time [GMT] [<distributions>]

            A list of message-ids of articles posted or received to the
            specified news group since "date" will be listed. The format
            of the listing will be one message-id per line, as though text
            were being sent.  A single line consisting solely of one
            period followed by CR-LF will terminate the list.

Didn't you mean to say that the newsgroups parameter was a comma-separated
list of wildmats?
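
If that reading is right, the matching might look something like the sketch
below. Note that Python's fnmatch is used only as a stand-in for a real
wildmat matcher (the two are similar but not identical), that the comparison
here is case-sensitive, and that any negation syntax is ignored. The function
name is mine.

```python
from fnmatch import fnmatchcase


def matches_any(group: str, newsgroups_arg: str) -> bool:
    """True if group matches any pattern in a comma-separated list."""
    patterns = newsgroups_arg.split(",")
    return any(fnmatchcase(group, pattern) for pattern in patterns)
```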

REFERENCES

See

RFC2044 "UTF-8, a transformation format of Unicode and ISO 10646"
RFC2130 "The Report of the IAB Character Set Workshop" 29 Feb - 1 Mar 1996
(deals with why UTF-8 should be the default Character Encoding Scheme)

Charles H. Lindsey ---------At Home, doing my own thing-------------------------
Email:     chl at clw.cs.man.ac.uk   Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 437 4506       Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5