ietf-nntp Overview and non-printing characters

Sun Jul 8 03:21:09 PDT 2001

In reviewing the draft, I found that we'd not yet resolved the previous
discussion of overview and we still have the original text in the draft
regarding how to generate overview information:

    Any sequence of US-ASCII space or non-printing characters in a field
    MUST be replaced by a single US-ASCII space.

ISO-2022 encoding uses ESC to introduce a character set switch, and this
is used in ISO-2022-JP encoding for Japanese.  Checking ten days' worth of
traffic in the fj.* hierarchy, I found 463 articles with raw ESC
characters in the Subject lines.

INN has always, since the introduction of the XOVER command, passed all
characters other than CR, LF, and TAB from the Subject header into the
overview database unmolested.  Those ESC characters are necessary for the
correct parsing of those Subject headers; if the ESC characters are
replaced by spaces, the entire subject line is altered into something
nonsensical.  Since the overview information is frequently used to display
the initial thread tree, I think that's a pretty significant and important
factor to take into account.
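
To make the ESC point concrete, here is a minimal Python sketch of what
happens to an ISO-2022-JP Subject when every ESC is turned into a space.
The sample subject text and the variable names are made up for
illustration, and the munging shown is only the ESC part of the draft rule,
not any server's actual code.

```python
# Illustrative only: a made-up three-character Japanese subject
# (U+65E5 U+672C U+8A9E), encoded the way fj.* articles commonly are.
subject = "\u65e5\u672c\u8a9e"
raw = subject.encode("iso2022_jp")

# The encoded bytes contain ESC sequences that switch character sets.
has_esc = b"\x1b" in raw

# Apply just the ESC part of the draft rule: ESC becomes a space.
munged = raw.replace(b"\x1b", b" ")

# Without the ESC bytes the decoder never leaves ASCII mode, so the
# payload bytes come back as unrelated ASCII characters.
round_trip = raw.decode("iso2022_jp")
damaged = munged.decode("iso2022_jp", errors="replace")

print(repr(raw))
print(repr(damaged))
```

The untouched bytes decode back to the original subject, while the munged
bytes decode to a string of ASCII punctuation and letters that has nothing
to do with it.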

Furthermore, RFC 1036, by way of RFC 822, explicitly allows ESC in
unstructured fields, so one can't even point at a standard and say that it
isn't allowed.  It is.  (Not to mention that NNTP isn't, strictly
speaking, required to provide only RFC 1036 messages.)

Clearly (to me, anyway), ESC at a minimum has to be excepted from the
current requirement to munge non-printing characters or we'll break
reasonably widespread existing practice.  I personally would prefer to go
further
than that and simply remove the bit about non-printing characters and only
treat TAB, CR, and LF as special.  The previous discussion on this list
appeared to support that, as does implementation experience in both INN
and its predecessors.

As for the other part of the current language, compacting adjacent
whitespace, I think that's "obviously" wrong and would break existing
usage.  For example, I know people who score down articles with subjects
ending in several spaces and then a number, since this is a common
signature of spam.  I think they'd be very surprised when that stopped
working because the scoring feature of their newsreader worked based on
overview and suddenly there were no more repeated spaces in overview.
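
As a rough illustration of that scoring case, the following Python sketch
uses a hypothetical rule (three or more spaces followed by digits at the
end of the subject) and a simplified stand-in for the whitespace
compaction in the current draft text.  The pattern, the function name, and
the sample subject are all assumptions, not taken from any real
newsreader.

```python
import re

# Hypothetical scoring rule: subject ends in several spaces, then a number.
SPAM_TAIL = re.compile(r" {3,}\d+$")


def collapse_whitespace(field):
    """Simplified stand-in for the draft rule: runs of space/TAB -> one space."""
    return re.sub(r"[ \t]+", " ", field)


subject = "MAKE MONEY FAST      48213"  # made-up sample subject

matches_raw = bool(SPAM_TAIL.search(subject))
matches_collapsed = bool(SPAM_TAIL.search(collapse_whitespace(subject)))

print(matches_raw, matches_collapsed)
```

The rule fires on the subject as posted but not on the compacted overview
entry, because the run of spaces it keys on is gone.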

Since we're defining a new command in this area, I think we should take
the opportunity to clean this up properly.  I would define "properly" as (a)
following RFC 2822 folding semantics, and (b) making the minimal necessary
changes to the field contents to not break the protocol and no more.  I
think there's a strong defense of (b) on the grounds of preserving as much
information as possible, and given that INN has been doing this since 1.4
I can't see any counterarguments on the grounds that it might break
something.  Anything that this would break has already been broken when
talking to INN servers for many years now.

So to sum it all up, I propose replacing the paragraph:

    The content of any subsequent field is given by the response to the
    LIST OVERVIEW.FMT command.  A field may be empty (in which case there
    will be two adjacent US-ASCII tabs, and a sequence of trailing
    US-ASCII tabs may be omitted).  Any sequence of US-ASCII space or
    non-printing characters in a field MUST be replaced by a single
    US-ASCII space.

with the two paragraphs:

    The content of each field is formed by taking the original content
    (such as the raw subject line from the article), removing all US-ASCII
    CRLF pairs, and then replacing each remaining US-ASCII NUL, TAB, CR,
    or LF character with a single US-ASCII space.

    The content of any subsequent field is given by the response to the
    LIST OVERVIEW.FMT command.  A field may be empty (in which case there
    will be two adjacent US-ASCII tabs, and a sequence of trailing
    US-ASCII tabs may be omitted).

This implements RFC 2822 unfolding (in which CRLF followed by whitespace
should be treated as equivalent to the following whitespace) and still
handles the case of bare CR or LF.  It also handles NUL, just in the name
of generality and in case we eventually extend NNTP to handle pure binary
encodings.
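
The proposed rule is small enough to sketch directly.  This is a minimal
Python rendering of the wording above, operating on raw bytes; the
function name overview_field is made up for the example, and this is not
how INN or any other server currently behaves.

```python
# Translation table: NUL, TAB, CR, and LF each map to a single space.
_TO_SPACE = bytes.maketrans(b"\x00\t\r\n", b"    ")


def overview_field(raw: bytes) -> bytes:
    """Apply the proposed rule to one header's raw content.

    Step 1: remove every CRLF pair (RFC 2822 unfolding).
    Step 2: replace each remaining NUL, TAB, CR, or LF with a space.
    Everything else, including ESC and repeated spaces, is left alone.
    """
    unfolded = raw.replace(b"\r\n", b"")
    return unfolded.translate(_TO_SPACE)


# A folded header: the CRLF goes away, the following space stays.
print(overview_field(b"Re: a rather long\r\n subject line"))
# Bare control characters that would break the overview line format.
print(overview_field(b"a\tb\rc\nd\x00e"))
```

Because only those four characters are touched after unfolding, an
ISO-2022-JP subject or a subject with several trailing spaces passes
through byte for byte.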

This approach has the advantage of being maximally correct (at least in my
opinion) and the possible disadvantage of not being *anyone's* existing
practice for folded headers.  Personally, I consider what INN currently
does with folded headers to be simply wrong, particularly given what RFC
2822 very clearly says about unfolding headers, and would be happy to just
fix it.  In Andrew's survey of the behavior of other news servers from a
while back, it looked like some of the others may actually already be
doing real unfolding, but there wasn't enough data to tell for sure.

Opinions?

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>