ietf-nntp OVER and PAT

Charles Lindsey chl at clw.cs.man.ac.uk
Fri Nov 17 05:03:46 PST 2000


I promised to summarise where I thought we had reached when we last
discussed these things at the end of August.

OVER
----

We were examining this because we wanted any header
munging^H^H^H^H^H^H^Hcanonicalisation done in PAT to be the same as that
done in OVER, for the benefit of people who wanted to implement PAT by
searching the overview database.

There was confusion as to what was supposed to happen.

Are adjacent SP characters to be rendered as a single SP? Andrew Gierth
claimed not, because he regarded SP as a "printing character" (which is
technically correct), whereas TAB was not. But clearly TABs have to be
converted to SP. What if there are adjacent TABs, or TAB followed by SP,
and so on?

What is supposed to happen with folding? It was reported that INN would
replace "CR LF SP" by "SP SP SP", which would seem to violate all
interpretations of the rules. Moreover, other systems produced a variety
of other effects for this case.

What was supposed to happen to other control characters? The opinion
seemed to be that they should just be left in as they were.

However, no one seemed to have read what the draft actually said (and still
says in 9.5.2.2), which is:

	"Any sequence of US-ASCII space or non-printing characters in a
	field MUST be replaced by a single US-ASCII space."

IOW, all whitespace, folding and control characters are collapsed to a
single SP, which is a nice simple rule to understand, and makes it easy to
do pattern matching against it (especially when using wildmats in XPAT
style, where there is no means to specify multiple spaces in the wildmat,
nor any means to specify arbitrary whitespace, even if \020 is
introduced).

So can we first of all agree to accept that rule for OVER, even though
no one currently implements it, on the grounds that there is no consistent
interpretation in use anyway?

I would however add a NOTE:

	NOTE: Neither a non-breaking space character (such as that
	represented by the two octets 0xC2A0 in UTF-8) nor any
	non-printing character outside of the strict US-ASCII subset of
	UTF-8 is to be replaced in this manner.

Indeed, the text I quoted should actually say "... or US-ASCII non-printing
characters ...".
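
To make the proposed rule concrete, here is a minimal sketch in Python. It
assumes the header value arrives as raw UTF-8 octets and that "US-ASCII
non-printing" means octets 0x00-0x1F and 0x7F; the function name
canonicalize is mine, not something from the draft.

```python
import re

# Assumption: "US-ASCII space or non-printing" = SP (0x20), the C0
# controls (0x00-0x1F, which covers TAB, CR and LF) and DEL (0x7F).
_ASCII_SPACE_OR_CONTROL = re.compile(rb"[\x00-\x20\x7f]+")


def canonicalize(field: bytes) -> bytes:
    """Replace every run of US-ASCII space/non-printing octets by one SP.

    Octets above 0x7F are never touched, so a UTF-8 non-breaking space
    (0xC2 0xA0) and non-ASCII control characters survive unchanged.
    """
    return _ASCII_SPACE_OR_CONTROL.sub(b" ", field)
```

Because only octets below 0x80 are examined, the NOTE comes for free when
working on the UTF-8 encoding: a folded "CR LF TAB" and a run of spaces both
become a single SP, while the two octets of a non-breaking space are left
alone.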


PAT
---

The text in the latest draft is clearly wrong, since the syntax allows
exactly one wildmat, whereas the text speaks of "one or more wildmats".

I think the consensus we reached was that we should try to implement the
intention of XPAT. I.e. we would allow several wildmats, and the gap
between the wildmats would match some form of whitespace/folding, which is
how XPAT is usually implemented currently.

I grant you that talk in recent days has been going in the opposite
direction, but I think that sticking with XPAT has much to commend it,
other things being equal (which they probably are not). Moreover,
implementors would likely just point PAT at their existing implementations
of XPAT, whereas they might choose not to implement it at all if we make
it seriously different.

The way we were describing it back in August was "first canonicalize the
header, then match against the wildmat(s)". The canonicalization needed to
be the same as that in OVER, so that implementors could use the overview
database. It was agreed that the matching against the wildmat(s) needed to
be "anchored at both ends" of the header, because that is standard wildmat
semantics. So if we accept the canonicalization in OVER as collapsing all
whitespace to single SP (see above), then the semantics of the whole thing
become clear and even clean (relatively speaking).
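
A rough sketch of "canonicalize, then match" in Python, under several
assumptions of mine: the header is already decoded to a string, Python's
fnmatchcase stands in for a real wildmat matcher (the two syntaxes are
similar but not identical), matching is case-sensitive, and the gap between
several wildmats is taken to be exactly one SP after canonicalization. The
names canonicalize and pat_matches are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Assumption: same character set as the OVER rule (US-ASCII SP, C0
# controls and DEL); each run collapses to a single SP.
_ASCII_SPACE_OR_CONTROL = re.compile(r"[\x00-\x20\x7f]+")


def canonicalize(value: str) -> str:
    """Collapse whitespace, folding and control characters to one SP."""
    return _ASCII_SPACE_OR_CONTROL.sub(" ", value)


def pat_matches(header_value: str, wildmats: list[str]) -> bool:
    """XPAT-style match of one header against one or more wildmats.

    The wildmats are joined with a single SP, which is what any run of
    whitespace in the header has become after canonicalization.
    fnmatchcase matches the whole string, i.e. it is anchored at both
    ends, so a bare word only matches a header consisting of that word.
    """
    pattern = " ".join(wildmats)
    return fnmatchcase(canonicalize(header_value), pattern)
```

With this, a folded Subject such as "Re: OVER CR LF TAB and PAT" is matched
by the two wildmats "Re:*" and "PAT", but not by the single wildmat "OVER";
one would have to write "*OVER*" for that, because of the anchoring.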

There should be an explicit statement that implementors need only do PAT
matching against the headers they choose to include in their Overviews
(they MAY do more, of course). So we need an extra response code, for
which I suggest:
	521  Sorry, we don't do that header

There was also a suggestion to return an "article number" of zero when using
PAT with an article specified by <message-id>, rather than returning a
"412 no newsgroup selected" which is really a bit silly. There is an
example in the present text which could be removed entirely if that change
were made.

So that is how PAT would look if we do it in XPAT style. A bit ugly, but
backwards compatible.

The Bad News is that it does not work well with "comma as indication of
alternative" for the reasons I have already posted. Please read that
example carefully.

The alternative is to use some \020 or other notation to represent space
in a wildmat. Note that this would then match arbitrary
whitespace/folding/control-characters if the canonicalization in OVER is
retained, which is actually quite a useful feature.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl at clw.cs.man.ac.uk  Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 436 6131      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5


