ietf-nntp Wildmats

Mon Mar 12 03:01:42 PST 2001

In <oPAEy7Yr4Jq6EwV$@romana.davros.org> "Clive D. W. Feather" <clive at on-the-train.demon.co.uk> writes:

>>OTOH, none of { * , ? [ \ } can ever occur in a newsgroup-name, certainly
>>not in [USEFOR] and not in any current usage either (1036 seems silent on
>>the issue), so it might be argued that wildmat-escape could be omitted
>>entirely.
> [...]
>>>[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
>>>for general character escapes. Do we need these any more or shall we drop
>>>them ? My inclination is to drop them for now; we may want to resurrect
>>>them if we do more generic wildmats, but they aren't needed for newsgroup
>>>names.]
>>
>>Yes, drop them.

> If we do this, then I recommend that we ban backslash entirely in 
> wildmats, including in [...] sets. This allows them to be used in the 
> future if we ever need an escape mechanism and in the meantime 
> eliminates some potential confusion.

If you ban backslash, then there are other candidates for banning too,
such as comma which you mention later on. I think it would be simpler
just to specify those characters which you DO allow, with a NOTE to the
effect that wildmats are for matching newsgroup-names, and the allowed
characters reflect that.

According to USEFOR (RFC 1036 is silent) the allowed characters are:

<Unicode Letter, Lowercase>
<Unicode Letter, Other>
<Unicode Number, Decimal Digit>
<Unicode Number, Other>
"+" / "-" / "_"

Translating that into something more manageable, and including the upper
case letters (though not strictly allowed, the NNTP standard is not the
place to say that, and there are a few, possibly bogus, groups that use
them) we get

ASCII lower case
ASCII upper case
ASCII digits
ASCII "+" / "-" / "_"
Any UTF-8 non-ASCII
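
To make that restricted alphabet concrete, here is a minimal Python sketch
of a per-character check. The names ALLOWED_ASCII, is_allowed_name_char and
disallowed_chars are hypothetical, and the sketch covers only the
newsgroup-name characters listed above; the "." separator and the wildmat
metacharacters themselves are assumed to be handled elsewhere.

```python
import string

# Assumption: the ASCII part of the alphabet is exactly the list above
# (lower case, upper case, digits, and "+" / "-" / "_").
ALLOWED_ASCII = set(
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "+-_"
)


def is_allowed_name_char(ch):
    """Return True if ch is in the listed ASCII set or is any non-ASCII character."""
    return ch in ALLOWED_ASCII or ord(ch) > 0x7F


def disallowed_chars(text):
    """Return the characters of text that fall outside the allowed alphabet."""
    return [ch for ch in text if not is_allowed_name_char(ch)]
```

With this check, backslash and comma are rejected simply because they are
not on the list, so no separate "banned" list is needed.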

> Which reminds me: *somewhere* we need text along the lines of:

>     If a parameter that is specified as a wildmat does not meet the
>     syntax of 5.1.1, the NNTP server MAY place some interpretation on it
>     (not specified by this document) or otherwise MUST generate a 501
>     response.

I think what you should say is that extensions to this standard MAY extend the
list of allowed characters or augment the syntax in a backward-compatible
fashion so as to allow the use of wildmats in other contexts, in which
case the extended syntax MAY be used even when matching newsgroups (though
it would not cause anything new to be matched). The only extension likely
to want to do this is XPAT, which is delightfully silent on this issue so
far.

But restricting the syntax as I have suggested above should remove all of
Andrew's worries at a stroke.

>>>  5.1.2 Formalised syntax
>>>[Warning: this ain't pretty, but I think it's correct and unambiguous.]
>>
>>*You* know that there is a Van Wijngaarden Grammar hiding in there, and
>>*I* know that there is a Van Wijngaarden Grammar hiding in there, but for
>>the unfortunate masses who do not know, this is just too complicated and
>>counterproductive.

>>I recommend to leave it all out.

> Does anyone want this section to be left ?

I think it was a useful exercise to demonstrate that your original method
was better.

>>>  this range consists of every character whose code lies
>>>  between the two characters in the range, inclusive. Thus "[a-dg]"
>>>  is equivalent to "[abcdg]"; each match any of the five characters "a",
>>>  "b", "c", "d", or "g". Note that the codes are always those of
>>>  ISO 10646, no matter what the local character set is.
>>
>>Hmmmm! Are we sure we know what the collating order is for arbitrary UTF-8
>>characters?

> You may feel a little more comfortable knowing that ISO 10646 and UTF-8 
> lexical orders are the same. So in this case the characters that are 
> matched are:

Yes, I seemed to remember that there was such an equivalence. Perhaps
you should say:

"Note that the codes are always the 16-bit characters of ISO 10646, no
matter what the local character set is."

and then add a NOTE about the lexical order of UTF-8.
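
The equivalence is easy to spot-check. The following Python sketch sorts a
handful of characters once by code point and once by their UTF-8 byte
strings; the sample characters are an arbitrary choice made for
illustration, picked to straddle the 1-, 2-, 3- and 4-byte encoding
boundaries.

```python
# Arbitrary sample spanning the UTF-8 length boundaries (illustration only).
samples = [
    "~", "a", "z",
    "\u0080", "\u00e9", "\u07ff",
    "\u0800", "\uffff",
    "\U00010000", "\U0010ffff",
]

# Order by ISO 10646 code point.
by_code_point = sorted(samples, key=ord)

# Order by the raw bytes of the UTF-8 encoding (bytes compare lexically).
by_utf8_bytes = sorted(samples, key=lambda ch: ch.encode("utf-8"))

# The two orderings agree for this sample.
assert by_code_point == by_utf8_bytes
```

So an implementation that compares the UTF-8 bytes directly should get the
same range membership as one that decodes to code points first.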

>>>  If the first char in a range has a higher code than the second one, the
>>>  characters represented by the range are determined by the implementation.
>>>  This must be done in a consistent manner, so that, for example,
>>>  "[d-a],[^d-a]" will match every possible character.
>>
>>>[W7: do we want to remove the consistency requirement ? This would mean
>>>that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
>>>that the two ranges in "[d-a][d-a]" might match different sets.]
>>
>>I should forbid it entirely in your disambiguating rules. Or else say that
>>such a range never matches anything.

> We had this discussion previously. The consensus was that all ranges 
> should match exactly one character, and that such ranges were allowed 
> but had no definition of which characters matched. That is, the existing 
> wording.

> The only question left is the one I asked: do we require consistency or 
> not ?

Hmm! I would prefer disallowing it. But if the consensus was otherwise,
then I suppose consistency is a good thing. Would it be possible to say
that [d-a] was allowed syntactically, but would never match anything? In
that case it would still happen that "[d-a],[^d-a]" would match
everything.
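
A minimal Python sketch of that rule, assuming a set has already been parsed
into (low, high) pairs of single characters; the function name set_matches
and that pair representation are hypothetical, not anything in the draft.
A reversed pair is simply an empty interval, so the negated form matches
every character.

```python
def set_matches(ch, ranges, negated=False):
    """Test one character against a parsed [...] set.

    ranges is a list of (low, high) single-character pairs; a lone
    character such as "g" is written as ("g", "g"). Comparison is by code
    point, so a reversed pair like ("d", "a") contains no characters.
    """
    hit = any(ord(low) <= ord(ch) <= ord(high) for low, high in ranges)
    return hit != negated


# "[a-dg]" behaves like "[abcdg]".
a_to_d_and_g = [("a", "d"), ("g", "g")]
assert all(set_matches(c, a_to_d_and_g) for c in "abcdg")
assert not set_matches("e", a_to_d_and_g)

# "[d-a]" never matches, so "[^d-a]" always does.
reversed_range = [("d", "a")]
assert not any(set_matches(c, reversed_range) for c in "abcdxyz")
assert all(set_matches(c, reversed_range, negated=True) for c in "abcdxyz")
```

This gives the consistency property for free: for any character, exactly one
of "[d-a]" and "[^d-a]" matches.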

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl at clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5