ietf-nntp Wildmats
Charles Lindsey
chl at clw.cs.man.ac.uk
Thu Mar 8 04:13:00 PST 2001
In <20010307164438.K85255 at demon.net> "Clive D.W. Feather" <clive at demon.net> writes:
> 5.1 Wildmat syntax
>[W1: do we want to exclude ! and require it to be escaped as \! ? Doing so
>eliminates an ambiguity in the syntax. On the other hand, common usage is
>that ! is not special other than at the start of a pattern. My inclination
>is to not escape it.]
I agree.
> wildmat-escape = wildmat-hide / wildmat-special
> wildmat-hide = "\" wildmat-hidden
> wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
> ; exclude 0-9, A-Z, a-z
>[W2: this allows any non-alphanumeric to be escaped with \. Is this too
>general; should it be limited to * , ? [ and perhaps !. My inclination is
>to leave it as shown here.]
>[W3: if leaving it generic, do we want to include or exclude non-ascii
>characters ? The above says include, but I'm inclined to exclude them.]
I think you got your 'include's and 'exclude's crossed in there. Anyway,
my view is that, if you are going to allow '\' escapes at all, then you
allow them on _everything_; no arbitrary rules to have to remember. I think
that is what you are saying.
OTOH, none of { * , ? [ \ } can ever occur in a newsgroup-name, certainly
not in [USEFOR] and not in any current usage either (RFC 1036 seems silent on
the issue), so it might be argued that wildmat-escape could be omitted
entirely. The only place where it might be useful is in XPAT (which is
still defined in the Common Extensions RFC, though I hope it will now be
quietly forgotten).
> wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
> "\" %x55 hex hex hex hex hex hex hex hex
>[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
>for general character escapes. Do we need these any more or shall we drop
>them ? My inclination is to drop them for now; we may want to resurrect
>them if we do more generic wildmats, but they aren't needed for newsgroup
>names.]
Yes, drop them. Their only possible use is in XPAT, and XPAT has already
made other arrangements to deal with the problem (it is _just_ about
possible to work out what XPAT is supposed to do from the Common
Extensions RFC).
> - Within a wildmat-set-body, the character "-" shall be parsed as being
> a wildmat-range-delim unless:
> * it is the first or last character in the wildmat-set-body, or
> * either of the two immediately preceding characters is a "-" that can
> be parsed as a wildmat-range-range (this determination is made from
[In the line above, for "wildmat-range-range" read "wildmat-range-delim".]
> left to right, so that in "[%----b-c]" only the first and fourth
> dashes are wildmat-range-delims).
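The left-to-right rule quoted above can be checked mechanically. What follows is only a minimal sketch, not text from the draft: the function name range_delims is hypothetical, and it assumes the body is passed without the enclosing brackets or any leading "^".

```python
def range_delims(body: str) -> list[int]:
    """Return the indexes in a set body where "-" acts as a range delimiter."""
    delims: list[int] = []
    for i, ch in enumerate(body):
        if ch != "-":
            continue
        # A dash in first or last position is an ordinary character.
        if i == 0 or i == len(body) - 1:
            continue
        # A dash one or two places after a delimiter is ordinary as well;
        # the decision is made left to right, so earlier results are final.
        if (i - 1) in delims or (i - 2) in delims:
            continue
        delims.append(i)
    return delims


# Body of "[%----b-c]": indexes 1 and 4 are the first and fourth dashes.
print(range_delims("%----b-c"))  # [1, 4]
```

Under these assumptions the body "%----b-c" parses as the range "%" to "-", the range "-" to "b", and then the ordinary characters "-" and "c".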
> 5.1.2 Formalised syntax
> A wildmat is equivalently described by the following syntax. This
> version is more complex but only has one possible parse for every
> valid wildmat, rather than relying on separate notes. It generates
> the same wildmats as that in the previous section.
>[Warning: this ain't pretty, but I think it's correct and unambiguous.]
*You* know that there is a Van Wijngaarden Grammar hiding in there, and
*I* know that there is a Van Wijngaarden Grammar hiding in there, but for
the unfortunate masses who do not know, this is just too complicated and
counterproductive (though it might give a clue as to how to implement it).
Methinks you have been sitting in too many delayed trains :-) .
I recommend leaving it all out.
>[In general: -p- means "(start of) a positive case", -t- means "(start of)
[In the line above, for "-t-" read "-n-".]
>a negative case", -t- means "generic or trailing", -x- means "dash
>excluded".]
> 5.2 Wildmat semantics
> "*" matches zero or more characters. It can match an empty string, but
> it cannot match only part of a UTF-8 sequence that consists of more than
> one octet.
ITYM "it cannot match a subsequence of a UTF-8 sequence that is not
aligned to the character boundaries". Otherwise you are forbidding a match
with a sequence of genuine UTF-8 characters that is embedded in a longer
sequence of UTF-8 characters.
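To make the "aligned to the character boundaries" wording concrete, here is a small sketch that tests whether an octet range starts and ends on UTF-8 character boundaries. The name is_char_aligned is hypothetical, and the input is assumed to be valid UTF-8 already.

```python
def is_char_aligned(data: bytes, start: int, end: int) -> bool:
    """True if data[start:end] begins and ends on UTF-8 character boundaries."""
    def boundary(pos: int) -> bool:
        # Offsets 0 and len(data) are always boundaries; elsewhere the octet
        # at pos must not be a continuation octet (binary 10xxxxxx).
        return pos == 0 or pos == len(data) or (data[pos] & 0xC0) != 0x80

    return 0 <= start <= end <= len(data) and boundary(start) and boundary(end)


name = "fr.caf\u00e9.misc".encode("utf-8")  # e-acute occupies octets 6 and 7

print(is_char_aligned(name, 3, 8))  # True: the whole of "cafe-acute"
print(is_char_aligned(name, 6, 8))  # True: one complete two-octet character
print(is_char_aligned(name, 3, 7))  # False: ends inside the two-octet sequence
```

A "*" that is only ever allowed to cover aligned ranges can still match a run of multi-octet characters embedded in a longer string, which is the case the original wording seemed to forbid.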
> 5.2.1 Wildmat sets
> A wildmat-set matches exactly one character in the string. Which
> characters are matched depend on the wildmat-set-body.
>[W5: the grammar does not treat \ in sets as special, just as another
>character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
>Are we happy with this ? It matches existing practice as I understand it,
>and I'm inclined to keep it.]
I agree.
>[W6: the grammar does not treat , in sets as special. This means that the
>wildmat "a[b,c]d" is a single pattern that matches the three strings "abd",
>"a,d", and "acd". Are we happy with this ? It matches existing practice
>but means that you can't split a wildmat into the component patterns just
>by looking for unescaped commas. I'm inclined to keep it as it is.]
I agree. There is, in fact, NO existing practice, as I have explained in
reply to Andrew.
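The practical consequence of W5 and W6 is that a splitter has to track sets rather than just look for commas. A minimal sketch follows; the name split_patterns is hypothetical, and it assumes that "\" hides the next character only outside a set and that a "]" directly after "[" or "[^" is an ordinary member, neither of which is settled by the text quoted here.

```python
def split_patterns(wildmat: str) -> list[str]:
    """Split a wildmat into its component patterns at top-level commas."""
    patterns: list[str] = []
    current: list[str] = []
    i, n = 0, len(wildmat)
    while i < n:
        ch = wildmat[i]
        if ch == "\\" and i + 1 < n:
            # Outside a set, "\" hides the next character (so "\," stays).
            current.append(wildmat[i:i + 2])
            i += 2
        elif ch == "[":
            # Copy the whole set verbatim: "," and "\" are ordinary inside.
            j = i + 1
            if j < n and wildmat[j] == "^":
                j += 1
            if j < n and wildmat[j] == "]":
                j += 1  # assumption: a leading "]" is a literal member
            while j < n and wildmat[j] != "]":
                j += 1
            current.append(wildmat[i:j + 1])
            i = j + 1
        elif ch == ",":
            patterns.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    patterns.append("".join(current))
    return patterns


print(split_patterns("a[b,c]d,comp.*"))  # ['a[b,c]d', 'comp.*']
```

With the same assumptions, "a[b\]c],x" splits into "a[b\]c]" and "x", because the set ends at the first "]" even though a "\" precedes it.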
> The body is split into wildmat-set-ranges and wildmat-set-chars.
> Each wildmat-set-char specifies a single character that the set will
> match. Each wildmat-set-range specifies a range of characters that the
> set will match; this range consists of every character whose code lies
> between the two characters in the range, inclusive. Thus "[a-dg]"
> is equivalent to "[abcdg]"; each match any of the five characters "a",
> "b", "c", "d", or "g". Note that the codes are always those of
> ISO 10646, no matter what the local character set is.
Hmmmm! Are we sure we know what the collating order is for arbitrary UTF-8
characters?
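On the ordering question, one fact that may help: comparing UTF-8 encodings octet by octet gives the same order as comparing ISO 10646 code values, so a range defined by code is well defined whichever form an implementation holds. This says nothing about language-specific collation. The sketch below illustrates both points; expand_range is a hypothetical helper and the sample characters are arbitrary.

```python
def expand_range(first: str, last: str) -> str:
    """Every character whose code lies between first and last, inclusive."""
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))


# "[a-dg]" and "[abcdg]" describe the same five characters.
assert set(expand_range("a", "d") + "g") == set("abcdg")

# Arbitrary samples with 1, 2, 3 and 4 octet encodings.
samples = ["a", "Z", "\u00e9", "\u00ff", "\u0100", "\u20ac", "\U00010348"]
by_code = sorted(samples, key=ord)
by_octets = sorted(samples, key=lambda ch: ch.encode("utf-8"))
print(by_code == by_octets)  # True
```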
> If the first char in a range has a higher code than the second one, the
> characters represented by the range are determined by the implementation.
> This must be done in a consistent manner, so that, for example,
> "[d-a],[^d-a]" will match every possible character.
>[W7: do we want to remove the consistency requirement ? This would mean
>that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
>that the two ranges in "[d-a][d-a]" might match different sets.]
I should forbid it entirely in your disambiguating rules. Or else say that
such a range never matches anything.
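Both alternatives are easy to state in code. A minimal sketch, with the hypothetical name range_members and a strict flag invented purely to switch between the two behaviours:

```python
def range_members(first: str, last: str, strict: bool = False) -> set[str]:
    """Characters covered by a range, with reversed ranges handled explicitly."""
    if ord(first) > ord(last):
        if strict:
            # Option 1: forbid a reversed range outright.
            raise ValueError(f"reversed range {first!r}-{last!r}")
        # Option 2: a reversed range never matches anything.
        return set()
    return {chr(c) for c in range(ord(first), ord(last) + 1)}


print(range_members("d", "a"))          # set()
print(sorted(range_members("a", "d")))  # ['a', 'b', 'c', 'd']
```

With the second option "[d-a]" matches nothing and "[^d-a]" matches every character, so "[d-a],[^d-a]" still matches every possible character and the consistency requirement is met without any implementation-defined behaviour.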
> 9.4.1 LIST
> If the optional wildmat parameter is specified, the list is
>! limited to only the groups whose names match the wildmat. This
[In the line above, for "the groups" read "those groups".]
>! will normally be very efficient if the wildmat is a simple group
>! name.
> 9.4.2 LIST ACTIVE.TIMES
> If the optional wildmat parameter is specified, the list is
>! limited to only the groups whose names match the wildmat. This
[In the line above, for "the groups" read "those groups".]
>! will normally be very efficient if the wildmat is a simple group
>! name.
> 9.4.5 LIST NEWSGROUPS
>! the optional wildmat parameter is specified, the list is
>! limited to only the groups that match the wildmat (no
[In the line above, for "the groups" read "those groups".]
>! matching is done on the group descriptions). This will
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl at clw.cs.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5