ietf-nntp Wildmats

Thu Mar 8 04:13:00 PST 2001

In <20010307164438.K85255 at demon.net> "Clive D.W. Feather" <clive at demon.net> writes:

>  5.1 Wildmat syntax

>[W1: do we want to exclude ! and require it to be escaped as \! ? Doing so
>eliminates an ambiguity in the syntax. On the other hand, common usage is
>that ! is not special other than at the start of a pattern. My inclination
>is to not escape it.]

I agree.

>    wildmat-escape = wildmat-hide / wildmat-special

>    wildmat-hide = "\" wildmat-hidden

>    wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
>        ; exclude 0-9, A-Z, a-z

>[W2: this allows any non-alphanumeric to be escaped with \. Is this too
>general; should it be limited to * , ? [ and perhaps !. My inclination is
>to leave it as shown here.]

>[W3: if leaving it generic, do we want to include or exclude non-ascii
>characters ? The above says include, but I'm inclined to exclude them.]

I think you got your 'include's and 'exclude's crossed in there. Anyway,
my view is that, if you are going to allow '\' escapes at all, then you
allow them on _everything_; no arbitrary rules to have to remember. I think
that is what you are saying.
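
For reference, the quoted wildmat-hidden rule amounts to a simple test on
the character code. A minimal sketch in Python, assuming the pattern has
already been decoded into characters; the function name is illustrative
only and appears nowhere in the draft:

```python
def is_wildmat_hidden(ch: str) -> bool:
    """True if ch may follow a backslash under the quoted wildmat-hidden rule."""
    cp = ord(ch)
    return (
        0x21 <= cp <= 0x2F      # ! " # $ % & ' ( ) * + , - . /
        or 0x3A <= cp <= 0x40   # : ; < = > ? @
        or 0x5B <= cp <= 0x60   # [ \ ] ^ _ `
        or 0x7B <= cp <= 0x7F   # { | } ~ and DEL
        or cp > 0x7F            # UTF-8-non-ascii
    )


# The characters under discussion are all escapable; alphanumerics are not.
print([c for c in "*,?[\\!a0Z" if is_wildmat_hidden(c)])
```

Read literally, %x7B-7F also admits DEL (%x7F), and space (%x20) is not
escapable.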

OTOH, none of { * , ? [ \ } can ever occur in a newsgroup-name, certainly
not in [USEFOR] and not in any current usage either (RFC 1036 seems silent on
the issue), so it might be argued that wildmat-escape could be omitted
entirely. The only place where it might be useful is in XPAT (which is
still defined in the Common Extensions RFC, though I hope it will now be
quietly forgotten).

>    wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
>        "\" %x55 hex hex hex hex hex hex hex hex

>[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
>for general character escapes. Do we need these any more or shall we drop
>them ? My inclination is to drop them for now; we may want to resurrect
>them if we do more generic wildmats, but they aren't needed for newsgroup
>names.]

Yes, drop them. Their only possible use is in XPAT, and XPAT has already
made other arrangements to deal with the problem (it is _just_ about
possible to work out what XPAT is supposed to do from the Common
Extensions RFC).

>  - Within a wildmat-set-body, the character "-" shall be parsed as being
>    a wildmat-range-delim unless:
>    * it is the first or last character in the wildmat-set-body, or
>    * either of the two immediately preceding characters is a "-" that can
>      be parsed as a wildmat-range-range (this determination is made from
                                    ^^^^^
                                    delim
>      left to right, so that in "[%----b-c]" only the first and fourth
>      dashes are wildmat-range-delims).
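
The left-to-right determination quoted above (reading "delim" for the
marked word) can be sketched as follows. This is only an illustration: it
assumes the function is handed the wildmat-set-body alone, without the
enclosing brackets or any leading negation, and the name range_delims is
hypothetical.

```python
def range_delims(body: str) -> list[int]:
    """Return the offsets of the dashes in body that act as range delimiters."""
    delims: set[int] = set()
    for i, ch in enumerate(body):
        if ch != "-":
            continue
        # A dash at either end of the body is an ordinary character.
        if i == 0 or i == len(body) - 1:
            continue
        # A dash within two positions after a delimiter is an ordinary character.
        if (i - 1) in delims or (i - 2) in delims:
            continue
        delims.add(i)
    return sorted(delims)


print(range_delims("%----b-c"))   # offsets 1 and 4: the first and fourth dashes
```

For "[%----b-c]" this yields the ranges "%--" and "--b", followed by the
ordinary characters "-" and "c".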

>  5.1.2 Formalised syntax

>  A wildmat is equivalently described by the following syntax. This
>  version is more complex but only has one possible parse for every
>  valid wildmat, rather than relying on separate notes. It generates
>  the same wildmats as that in the previous section.

>[Warning: this ain't pretty, but I think it's correct and unambiguous.]

*You* know that there is a Van Wijngaarden Grammar hiding in there, and
*I* know that there is a Van Wijngaarden Grammar hiding in there, but for
the unfortunate masses who do not know, this is just too complicated and
counterproductive (though it might give a clue as to how to implement it).
Methinks you have been sitting in too many delayed trains :-) .

I recommend leaving it all out.

>[In general: -p- means "(start of) a positive case", -t- means "(start of)
                                                      ^^^
                                                      -n-
>a negative case", -t- means "generic or trailing", -x- means "dash
>excluded".]

>  5.2 Wildmat semantics

>  "*" matches zero or more characters. It can match an empty string, but
>  it cannot match only part of a UTF-8 sequence that consists of more than
>  one octet.

ITYM "it cannot match a subsequence of a UTF-8 sequence that is not
aligned to the character boundaries". Otherwise you are forbidding a match
with a sequence of genuine UTF-8 characters that is embedded in a longer
sequence of UTF-8 characters.
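
The alignment condition can be checked octet by octet. A minimal sketch,
assuming the string is well-formed UTF-8 held as bytes and that offsets lie
between 0 and the length of the data; both function names are illustrative:

```python
def is_char_boundary(data: bytes, i: int) -> bool:
    """True if offset i falls between two UTF-8 characters in data."""
    if i == 0 or i == len(data):
        return True
    # Continuation octets have the form 10xxxxxx; a character never starts on one.
    return (data[i] & 0xC0) != 0x80


def span_is_aligned(data: bytes, start: int, end: int) -> bool:
    """True if data[start:end] is a whole number of UTF-8 characters."""
    return is_char_boundary(data, start) and is_char_boundary(data, end)


data = "a\u00e9b".encode("utf-8")      # four octets: 61 C3 A9 62
print(span_is_aligned(data, 1, 3))     # the two octets of U+00E9: aligned
print(span_is_aligned(data, 1, 2))     # only the first of them: not aligned
```

An implementation that matches on decoded characters rather than octets
gets this property for free.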

>  5.2.1 Wildmat sets

>  A wildmat-set matches exactly one character in the string. Which
>  characters are matched depend on the wildmat-set-body.

>[W5: the grammar does not treat \ in sets as special, just as another
>character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
>Are we happy with this ? It matches existing practice as I understand it,
>and I'm inclined to keep it.]

I agree.
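
A quick check of the W5 example. Python's fnmatch module is NOT a wildmat
implementation (it does not split on commas, and it negates a set with "!"
rather than "^"), but it happens to treat "\" inside a set as an ordinary
character, so it serves here purely as a stand-in:

```python
from fnmatch import fnmatchcase

# The set is [b\]; the "c]" after it is then matched literally.
pattern = r"a[b\]c]"

candidates = ["abc]", "a\\c]", "a]c", "ac"]
matches = [s for s in candidates if fnmatchcase(s, pattern)]
print(matches)   # the two strings named in W5
```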

>[W6: the grammar does not treat , in sets as special. This means that the
>wildmat "a[b,c]d" is a single pattern that matches the three strings "abd",
>"a,d", and "acd". Are we happy with this ? It matches existing practice
>but means that you can't split a wildmat into the component patterns just
>by looking for unescaped commas. I'm inclined to keep it as it is.]

I agree. There is, in fact, NO existing practice, as I have explained in
reply to Andrew.
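
The consequence noted in W6 is that a splitter has to track whether it is
inside a set. A minimal sketch, under assumptions that go beyond the quoted
text: that "\" outside a set hides the next character, and that a "]"
placed first in a set (after any "^") is an ordinary member. The name
split_wildmat is hypothetical.

```python
def split_wildmat(wildmat: str) -> list[str]:
    """Split a wildmat on commas that lie outside any [...] set."""
    parts, start, i, n = [], 0, 0, len(wildmat)
    while i < n:
        ch = wildmat[i]
        if ch == "\\" and i + 1 < n:
            i += 2                      # assumed: backslash hides the next character
            continue
        if ch == "[":
            j = i + 1
            if j < n and wildmat[j] == "^":
                j += 1                  # skip the negation marker
            if j < n and wildmat[j] == "]":
                j += 1                  # assumed: a leading "]" is an ordinary member
            end = wildmat.find("]", j)
            if end != -1:
                i = end + 1             # skip the whole set, commas included
                continue
        elif ch == ",":
            parts.append(wildmat[start:i])
            start = i + 1
        i += 1
    parts.append(wildmat[start:])
    return parts


print(split_wildmat("a[b,c]d"))               # one pattern
print(split_wildmat("comp.*,!comp.lang.*"))   # two patterns
```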

>  The body is split into wildmat-set-ranges and wildmat-set-chars.
>  Each wildmat-set-char specifies a single character that the set will
>  match. Each wildmat-set-range specifies a range of characters that the
>  set will match; this range consists of every character whose code lies
>  between the two characters in the range, inclusive. Thus "[a-dg]"
>  is equivalent to "[abcdg]"; each match any of the five characters "a",
>  "b", "c", "d", or "g". Note that the codes are always those of
>  ISO 10646, no matter what the local character set is.

Hmmmm! Are we sure we know what the collating order is for arbitrary UTF-8
characters?
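
Taking the draft at its word, the order is plain numeric order of ISO 10646
codes, with no locale collation involved. For well-formed UTF-8, comparing
the encoded octet strings gives the same order, which the sketch below
spot-checks on a handful of characters; in_range is an illustrative name
only.

```python
def in_range(ch: str, lo: str, hi: str) -> bool:
    """Range membership by ISO 10646 code, ignoring any locale collation."""
    return ord(lo) <= ord(ch) <= ord(hi)


# "[a-dg]" is equivalent to "[abcdg]".
members = [c for c in "abcdefg" if in_range(c, "a", "d") or c == "g"]
print(members)

# Sorting by code and sorting by encoded octets agree on these samples.
samples = ["a", "Z", "\u00e9", "\u20ac", "\U00010348", "~", "\u00ff"]
by_code = sorted(samples, key=ord)
by_octets = sorted(samples, key=lambda c: c.encode("utf-8"))
print(by_code == by_octets)
```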

>  If the first char in a range has a higher code than the second one, the
>  characters represented by the range are determined by the implementation.
>  This must be done in a consistent manner, so that, for example,
>  "[d-a],[^d-a]" will match every possible character.

>[W7: do we want to remove the consistency requirement ? This would mean
>that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
>that the two ranges in "[d-a][d-a]" might match different sets.]

I should forbid it entirely in your disambiguating rules. Or else say that
such a range never matches anything.
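
The two alternatives suggested here can be sketched side by side. This is
an illustration only, not anything the draft specifies; the function name
and the "strict" parameter are hypothetical.

```python
def range_codes(lo: str, hi: str, strict: bool = True) -> range:
    """Return the codes covered by the range lo-hi.

    strict=True  forbids a reversed range outright;
    strict=False treats a reversed range as matching nothing.
    """
    if ord(lo) > ord(hi):
        if strict:
            raise ValueError(f"reversed range {lo}-{hi} is not allowed")
        return range(0)                 # empty: never matches anything
    return range(ord(lo), ord(hi) + 1)  # inclusive of both ends


print([chr(c) for c in range_codes("a", "d")])
print(len(range_codes("d", "a", strict=False)))
```

Either choice removes the need for the consistency requirement in W7, since
"[d-a]" then has no implementation-defined content.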

>  9.4.1 LIST

>  If the optional wildmat parameter is specified, the list is
>! limited to only the groups whose names match the wildmat. This
                   ^^^
                   those
>! will normally be very efficient if the wildmat is a simple group
>! name.

>  9.4.2 LIST ACTIVE.TIMES

>  If the optional wildmat parameter is specified, the list is
>! limited to only the groups whose names match the wildmat. This
                   ^^^
                   those
>! will normally be very efficient if the wildmat is a simple group
>! name.

>  9.4.5 LIST NEWSGROUPS

>!    the optional wildmat parameter is specified, the list is
>!    limited to only the groups that match the wildmat (no
                      ^^^
                      those
>!    matching is done on the group descriptions). This will

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl at clw.cs.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5