ietf-nntp Wildmats
Clive D.W. Feather
clive at demon.net
Wed Mar 7 08:44:38 PST 2001
New wording at <http://www.davros.org/nntp-texts/section-5.txt>. Repeated
here for discussion.
NNTP proposed text
Section 5
Last changed 2001-03-07 16:45 UTC
Since almost all the text is new, change markers are not used.
There are a number of open issues that I would like discussed. I've
described these in unindented bracketed text and given each an issue
number of the form W1, W2, etc. Other comments are also placed in
unindented bracketed text.
5. The WILDMAT format
The WILDMAT format described here is based on the version
first developed by Rich Salz [5], which in turn was derived from
the format used in the UNIX "find" command to articulate file names.
It was developed to provide a uniform mechanism for matching
patterns in the same manner that the UNIX shell matches filenames.
5.1 Wildmat syntax
5.1.1 Simplified syntax
A wildmat is described by the following augmented BNF[9] syntax
(note that this syntax contains ambiguities and special cases described
at the end):
wildmat = wildmat-pattern *("," ["!"] wildmat-pattern)
wildmat-pattern = 1*wildmat-item
wildmat-item = wildmat-exact / wildmat-escape / wildmat-wild
wildmat-exact = %x21-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
    UTF-8-non-ascii ; exclude * , ? [ \
[W1: do we want to exclude ! and require it to be escaped as \! ? Doing so
eliminates an ambiguity in the syntax. On the other hand, common usage is
that ! is not special other than at the start of a pattern. My inclination
is to not escape it.]
wildmat-escape = wildmat-hide / wildmat-special
wildmat-hide = "\" wildmat-hidden
wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
    ; exclude 0-9, A-Z, a-z
[W2: this allows any non-alphanumeric to be escaped with \. Is this too
general? Should it be limited to * , ? [ and perhaps ! ? My inclination is
to leave it as shown here.]
[W3: if leaving it generic, do we want to include or exclude non-ascii
characters ? The above says include, but I'm inclined to exclude them.]
wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
    "\" %x55 hex hex hex hex hex hex hex hex
[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
for general character escapes. Do we need these any more or shall we drop
them ? My inclination is to drop them for now; we may want to resurrect
them if we do more generic wildmats, but they aren't needed for newsgroup
names.]
wildmat-wild = "*" / "?" / wildmat-set
wildmat-set = "[" ["^"] wildmat-set-body "]"
wildmat-set-body = 1*wildmat-set-item
wildmat-set-item = wildmat-set-char / wildmat-set-range
wildmat-set-char = %x21-7F / UTF-8-non-ascii
wildmat-set-range = wildmat-set-char wildmat-range-delim wildmat-set-char
wildmat-range-delim = "-"
UTF-8-non-ascii is defined in section 13.
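As a quick sanity check on the hex ranges above, here is a minimal Python sketch (not part of the proposed text) that lists which printable ASCII characters fall outside wildmat-exact and wildmat-hidden. The constant and function names are made up for illustration, and UTF-8-non-ascii is deliberately left out of the check.

```python
# Inclusive code-point ranges copied from the wildmat-exact and
# wildmat-hidden rules (ASCII part only; UTF-8-non-ascii is ignored here).
WILDMAT_EXACT = [(0x21, 0x29), (0x2B, 0x2B), (0x2D, 0x3E), (0x40, 0x5A), (0x5D, 0x7F)]
WILDMAT_HIDDEN = [(0x21, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7F)]


def covered(ranges):
    """Return the set of code points covered by a list of inclusive ranges."""
    return {cp for lo, hi in ranges for cp in range(lo, hi + 1)}


def excluded(ranges, lo=0x21, hi=0x7F):
    """Return, as a string, the characters in lo..hi that the ranges omit."""
    present = covered(ranges)
    return "".join(chr(cp) for cp in range(lo, hi + 1) if cp not in present)


print(excluded(WILDMAT_EXACT))   # the characters * , ? [ \
print(excluded(WILDMAT_HIDDEN))  # the digits, then A-Z, then a-z
```

The two results agree with the "exclude" comments attached to the rules.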
This syntax must be interpreted subject to the following rules:
- Where a wildmat-pattern is not immediately preceded by "!", it shall
not begin with a "!".
- Where a wildmat-set-body is not immediately preceded by "^", it shall
not begin with a "^".
- The character "]" may only appear in a wildmat-set-body if it is the
very first character of that body.
[I think this rule is better than the -1- and -2- rules I had before.]
- Within a wildmat-set-body, the character "-" shall be parsed as being
a wildmat-range-delim unless:
* it is the first or last character in the wildmat-set-body, or
* either of the two immediately preceding characters is a "-" that can
be parsed as a wildmat-range-delim (this determination is made from
left to right, so that in "[%----b-c]" only the first and fourth
dashes are wildmat-range-delims).
[I've had another go at this wording.]
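The left-to-right dash rule can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, under the assumption that the body has already been extracted (no surrounding brackets and no leading "^"); the function name range_delims is hypothetical.

```python
def range_delims(body):
    """Return the indexes of the dashes in a wildmat-set-body that act as
    wildmat-range-delims, deciding from left to right."""
    delims = []
    for i, ch in enumerate(body):
        if ch != "-":
            continue
        if i == 0 or i == len(body) - 1:
            continue  # first or last character: an ordinary character
        if (i - 1) in delims or (i - 2) in delims:
            continue  # one of the two preceding characters is a delimiter
        delims.append(i)
    return delims


print(range_delims("%----b-c"))  # [1, 4]: the first and fourth dashes
print(range_delims("a-c-f"))     # [1]: the second dash is a plain character
```

Because earlier decisions are recorded in delims before later dashes are examined, the result depends only on the characters to the left of each dash, which is what the "left to right" wording requires.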
5.1.2 Formalised syntax
A wildmat is equivalently described by the following syntax. This
version is more complex but only has one possible parse for every
valid wildmat, rather than relying on separate notes. It generates
the same wildmats as that in the previous section.
[Warning: this ain't pretty, but I think it's correct and unambiguous.]
[In general: -p- means "(start of) a positive case", -n- means "(start of)
a negative case", -t- means "generic or trailing", -x- means "dash
excluded".]
wildmat = wildmat-p-pattern *("," wildmat-t-pattern)
wildmat-t-pattern = wildmat-p-pattern / "!" wildmat-n-pattern
wildmat-p-pattern = wildmat-p-item *wildmat-t-item
wildmat-n-pattern = 1*wildmat-t-item
wildmat-t-item = "!" / wildmat-p-item
wildmat-p-item = wildmat-p-exact / wildmat-escape / wildmat-wild
wildmat-p-exact = %x22-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
    UTF-8-non-ascii ; exclude ! * , ? [ \
wildmat-escape = wildmat-hide / wildmat-special
wildmat-hide = "\" wildmat-hidden
wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
    ; exclude 0-9, A-Z, a-z
wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
    "\" %x55 hex hex hex hex hex hex hex hex
wildmat-wild = "*" / "?" / wildmat-set
wildmat-set = wildmat-p-set / wildmat-n-set
wildmat-p-set = "[" wildmat-p-set-body "]"
wildmat-n-set = "[" "^" wildmat-n-set-body "]"
wildmat-p-set-body = wildmat-p-set-subbody *wildmat-t-set-subbody /
    wildmat-p-set-tail / wildmat-p-set-subbody wildmat-t-set-tail /
    wildmat-p-set-subbody *wildmat-t-set-subbody wildmat-t-set-tail
wildmat-n-set-body = wildmat-n-set-subbody *wildmat-t-set-subbody /
    wildmat-n-set-tail / wildmat-n-set-subbody wildmat-t-set-tail /
    wildmat-n-set-subbody *wildmat-t-set-subbody wildmat-t-set-tail
wildmat-p-set-subbody = wildmat-p-set-range /
    wildmat-p-set-char *wildmat-x-set-char wildmat-x-set-range
wildmat-n-set-subbody = wildmat-n-set-range /
    wildmat-n-set-char *wildmat-x-set-char wildmat-x-set-range
wildmat-t-set-subbody = wildmat-t-set-range /
    wildmat-t-set-char *wildmat-x-set-char wildmat-x-set-range
wildmat-p-set-range = wildmat-p-set-char "-" wildmat-t-set-char
wildmat-n-set-range = wildmat-n-set-char "-" wildmat-t-set-char
wildmat-t-set-range = wildmat-t-set-char "-" wildmat-t-set-char
wildmat-x-set-range = wildmat-x-set-char "-" wildmat-t-set-char
wildmat-p-set-tail = wildmat-p-set-char *wildmat-x-set-char *1"-"
wildmat-n-set-tail = wildmat-n-set-char *wildmat-x-set-char *1"-"
wildmat-t-set-tail = wildmat-t-set-char *wildmat-x-set-char *1"-"
wildmat-p-set-char = wildmat-c-set-char / "-" / "]"
wildmat-n-set-char = wildmat-c-set-char / "-" / "]" / "^"
wildmat-t-set-char = wildmat-c-set-char / "-" / "^"
wildmat-x-set-char = wildmat-c-set-char / "^"
wildmat-c-set-char = %x21-2C / %x2E-5C / %x5F-7F / UTF-8-non-ascii
    ; exclude - ] ^
UTF-8-non-ascii is defined in section 13.
The following additional syntax defines the terms used in section 5.2:
wildmat-pattern = wildmat-p-pattern / wildmat-n-pattern
wildmat-exact = "!" / wildmat-p-exact
wildmat-set-body = wildmat-p-set-body / wildmat-n-set-body
wildmat-set-range = wildmat-p-set-range / wildmat-n-set-range /
    wildmat-t-set-range
wildmat-set-char = wildmat-n-set-char
    ; which is a superset of the other wildmat-?-set-char definitions
    ; characters forming part of a wildmat-set-range are excluded
5.2 Wildmat semantics
A wildmat is tested against a string, and either matches or does not
match. To do this, each constituent wildmat-pattern is matched against
the string and the rightmost pattern that matches is identified. If
that wildmat-pattern is not preceded with "!", the whole wildmat matches.
If it is preceded by "!", or if no wildmat-pattern matches, the whole
wildmat does not match.
For example, consider the wildmat "a*,!*b,*c*":
the string "aaa" matches because the rightmost match is with "a*"
the string "abb" does not match because the rightmost match is with "*b"
the string "ccb" matches because the rightmost match is with "*c*"
the string "xxx" does not match because no wildmat-pattern matches
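The rightmost-match rule can be sketched in Python as follows. This is a simplified illustration, not a full implementation: it assumes the patterns use only literal characters, "*" and "?", and it splits on every comma, so sets and "\" escapes are not handled. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern made of literals, "*" and "?" into a compiled
    regular expression (sets and "\\" escapes are NOT handled)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildmat_matches(wildmat, string):
    """Apply the rightmost-match rule to a comma-separated wildmat."""
    result = False  # if no pattern matches, the wildmat does not match
    for item in wildmat.split(","):  # naive split: assumes no "," in sets
        negated = item.startswith("!")
        pattern = item[1:] if negated else item
        # fullmatch anchors the pattern at both ends of the string
        if pattern_to_regex(pattern).fullmatch(string):
            result = not negated  # a later match overrides an earlier one
    return result


for s in ("aaa", "abb", "ccb", "xxx"):
    print(s, wildmat_matches("a*,!*b,*c*", s))
```

Scanning from left to right and letting each later match overwrite the result is equivalent to finding the rightmost matching pattern and checking whether it is preceded by "!".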
A wildmat-pattern matches a string if the string can be broken into
components, each of which matches the corresponding wildmat-item in
the pattern; the matches must be in the same order, and the whole string
must be used in the match. The pattern is "anchored"; that is, the first
and last characters in the string must match the first and last item
respectively (unless that item is an asterisk matching zero characters).
A wildmat-exact matches the same character (which may be more than one
octet in UTF-8).
"?" matches exactly one character (which may be more than one octet).
"*" matches zero or more characters. It can match an empty string, but
it cannot match only part of a UTF-8 sequence that consists of more than
one octet.
[The next few items match the syntax above; if we change the syntax I'll
change or delete these to match.]
"\" followed by a character other than an ASCII letter or digit matches
that character; it can be used, for example, to match a literal "*" or
"," in a string.
"\s" matches one or more spaces (note that for the PAT command other
white-space characters are replaced by space).
"\u1234" matches the character with code 1234 hex (this will have the
UTF-8 sequence %xE1 %x88 %xB4). There are always exactly four hexadecimal
digits following the "u". "\u0020" is another way to match a space, or
"\u003F" another way to match a question mark.
"\U12345678" matches the character with code 12345678 hex. There are
always exactly eight hexadecimal digits following the "U". "\U00001234"
is equivalent to "\u1234".
If "\" is followed by any other character, the behaviour is
undefined.
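The octet sequences quoted for the "\u" and "\U" escapes can be checked with a short Python sketch. The helper name escape_to_utf8 is made up for illustration. Note one limitation of this sketch: Python's chr() only accepts codes up to 10FFFF hex, so a value such as "\U12345678" raises ValueError here rather than producing an encoding.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def escape_to_utf8(escape):
    """Return the UTF-8 octets named by a "\\uXXXX" or "\\UXXXXXXXX" escape."""
    expected_digits = {"u": 4, "U": 8}
    if len(escape) < 2 or escape[0] != "\\" or escape[1] not in expected_digits:
        raise ValueError("not a \\u or \\U escape")
    digits = escape[2:]
    if len(digits) != expected_digits[escape[1]]:
        raise ValueError("wrong number of hexadecimal digits")
    if not all(d in HEX_DIGITS for d in digits):
        raise ValueError("non-hexadecimal digit in escape")
    # chr() raises ValueError for codes above 0x10FFFF
    return chr(int(digits, 16)).encode("utf-8")


print(escape_to_utf8("\\u1234"))      # b'\xe1\x88\xb4'
print(escape_to_utf8("\\u0020"))      # b' '
print(escape_to_utf8("\\U00001234"))  # b'\xe1\x88\xb4'
```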
5.2.1 Wildmat sets
A wildmat-set matches exactly one character in the string. Which
characters are matched depends on the wildmat-set-body.
[W5: the grammar does not treat \ in sets as special, just as another
character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
Are we happy with this ? It matches existing practice as I understand it,
and I'm inclined to keep it.]
[W6: the grammar does not treat , in sets as special. This means that the
wildmat "a[b,c]d" is a single pattern that matches the three strings "abd",
"a,d", and "acd". Are we happy with this ? It matches existing practice
but means that you can't split a wildmat into the component patterns just
by looking for unescaped commas. I'm inclined to keep it as it is.]
If the body is preceded by "^", the set is "inverted". That is, it
matches a character if and only if the set without a "^" prefix would
not match the character, and vice versa.
The body is split into wildmat-set-ranges and wildmat-set-chars.
Each wildmat-set-char specifies a single character that the set will
match. Each wildmat-set-range specifies a range of characters that the
set will match; this range consists of every character whose code lies
between the two characters in the range, inclusive. Thus "[a-dg]"
is equivalent to "[abcdg]"; each matches any of the five characters "a",
"b", "c", "d", or "g". Note that the codes are always those of
ISO 10646, no matter what the local character set is.
If the first char in a range has a higher code than the second one, the
characters represented by the range are determined by the implementation.
This must be done in a consistent manner, so that, for example,
"[d-a],[^d-a]" will match every possible character.
[W7: do we want to remove the consistency requirement ? This would mean
that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
that the two ranges in "[d-a][d-a]" might match different sets.]
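To make the set semantics concrete, here is a minimal Python sketch of testing one character against an already-parsed set. The representation (a list of single characters and (first, last) tuples) and the name set_matches are assumptions made for this example. For a reversed range the sketch arbitrarily chooses "matches nothing"; the text leaves that choice to the implementation and only asks for consistency.

```python
def set_matches(char, items, inverted=False):
    """Test one character against a parsed wildmat-set.

    items holds single characters (wildmat-set-chars) and (first, last)
    tuples (wildmat-set-ranges); inverted is True when the body was
    preceded by "^".
    """
    code = ord(char)  # ord() gives the ISO 10646 code of the character
    hit = False
    for item in items:
        if isinstance(item, tuple):
            first, last = ord(item[0]), ord(item[1])
            # Assumption: a reversed range such as "d-a" matches nothing.
            if first <= code <= last:
                hit = True
        elif item == char:
            hit = True
    return hit != inverted


a_to_d_and_g = [("a", "d"), "g"]  # the body of "[a-dg]"
print([c for c in "abcdefgh" if set_matches(c, a_to_d_and_g)])
```

Since the inverted result is computed as the negation of the same membership test, "[d-a]" and "[^d-a]" are complementary under this choice, which satisfies the consistency requirement.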
Implementers must be careful to apply the pattern-matching process
to whole characters encoded in UTF-8, and not to individual octets.
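A small Python sketch of the difference between octets and characters, assuming the string arrives as UTF-8 octets and is decoded before any matching is done. The function name is hypothetical, and it models only the "?" item.

```python
def question_mark_matches(octets):
    """Return True if "?" would match the input, that is, if the input is
    exactly one character.

    The octets are decoded first, so a multi-octet UTF-8 sequence counts
    as one character. Malformed UTF-8 raises UnicodeDecodeError.
    """
    return len(octets.decode("utf-8")) == 1


pound = b"\xc2\xa3"  # UTF-8 encoding of the pound sterling symbol
print(len(pound))                    # 2 (octets)
print(question_mark_matches(pound))  # True: one character
print(question_mark_matches(b"ab"))  # False: two characters
```

A matcher that worked on the raw octets instead would wrongly require "??" to match the pound sterling symbol.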
5.3 Examples
[Again, these correspond to the current text and may need changing if
we change other things.]
In these examples, $ and @ are used to represent the two octets 0xC2
and 0xA3 respectively; $@ is thus the UTF-8 encoding for the pound
sterling symbol, shown as # in the descriptions.
Wildmat     Description of strings that match
abc         the one string "abc"
abc,def     the two strings "abc" and "def"
$@          the one character string "#"
a*          any string that begins with "a"
a*b         any string that begins with "a" and ends with "b"
a*,*b       any string that begins with "a" or ends with "b"
a*,!*b      any string that begins with "a" and does not end with "b"
a*,!*b,c*   any string that begins with "a" and does not end with "b", and
            any string that begins with "c" no matter what it ends with
a*,c*,!*b   any string that begins with "a" or "c" and does not end
            with "b"
a\u0062c    the one string "abc"
a\u002a     the one string "a*"
a\*         the one string "a*"
abc\,def    the one string "abc,def"
a\u0020c    the one string "a c"
a\sc        the strings "a c", "a  c", "a   c", "a    c", etc.
?a*         any string with "a" as its second character
??a*        any string with "a" as its third character
*a?         any string with "a" as its penultimate character
*a??        any string with "a" as its antepenultimate character
[abc]       the three strings "a", "b", and "c"
[^abc]      any one character string except the three "a", "b", and "c"
[a-zA-Z]    any one character string consisting of an ASCII letter
[0-9]*      any string beginning with an ASCII digit
[a$@]       the two strings "a" and "#"
[a-$@]      the 67 one character strings from "a" to "#"
[a^bc]      the four strings "a", "^", "b", and "c"
[a-c-]      the four strings "a", "b", "c", and "-"
[a-c-f]     the five strings "a", "b", "c", "-", and "f"
[-a0-]      the three strings "-", "a", and "0"
[-a0-2]     the five strings "-", "a", "0", "1", and "2"
[--0]       the four strings "-", ".", "/", and "0"
[]abc]      the four strings "]", "a", "b", and "c"
[ab]c]      the two strings "ac]" and "bc]"
[a\]c]      the two strings "ac]" and "\c]"
a[b,c]d     the three strings "abd", "a,d" and "acd"
a\[b,c]d    the two strings "a[b" and "c]d"
[b-a]       some unspecified set of one character strings
[^b-a]      all one character strings not matched by the previous pattern
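Most of the examples above can be checked mechanically. The following Python sketch is one possible matcher, written for illustration only and not taken from any existing implementation. Its assumptions: the input is a decoded string rather than octets, "\" followed by any character hides that character (the "\s", "\u" and "\U" specials are not implemented), a reversed range matches nothing, and invalid wildmats are not diagnosed. All names are hypothetical.

```python
def parse_set(pattern, i):
    """Parse the set starting at pattern[i] == "["; return (item, next index)."""
    i += 1
    inverted = i < len(pattern) and pattern[i] == "^"
    if inverted:
        i += 1
    end = pattern.index("]", i + 1)  # a "]" in first position is ordinary
    body = pattern[i:end]
    members, ranges, j = set(), [], 0
    while j < len(body):
        # A dash is a range delimiter only if a character follows it.
        if j + 2 < len(body) and body[j + 1] == "-":
            ranges.append((ord(body[j]), ord(body[j + 2])))
            j += 3
        else:
            members.add(body[j])
            j += 1
    return ("set", inverted, members, ranges), end + 1


def parse_wildmat(wildmat):
    """Split a wildmat into (negated, items) pairs, one per wildmat-pattern."""
    patterns, items, negated, i = [], [], False, 0
    while i < len(wildmat):
        ch = wildmat[i]
        if ch == ",":  # only reached outside sets and escapes
            patterns.append((negated, items))
            items, negated = [], False
            i += 1
            if i < len(wildmat) and wildmat[i] == "!":
                negated = True
                i += 1
        elif ch == "\\":
            items.append(("char", wildmat[i + 1]))
            i += 2
        elif ch == "[":
            item, i = parse_set(wildmat, i)
            items.append(item)
        elif ch in "*?":
            items.append((ch,))
            i += 1
        else:
            items.append(("char", ch))
            i += 1
    patterns.append((negated, items))
    return patterns


def match_items(items, text):
    """Anchored match of a list of items against a string of characters."""
    if not items:
        return text == ""
    head, rest = items[0], items[1:]
    if head[0] == "*":  # try every possible length for the asterisk
        return any(match_items(rest, text[k:]) for k in range(len(text) + 1))
    if text == "":
        return False
    ch = text[0]
    if head[0] == "char":
        ok = head[1] == ch
    elif head[0] == "?":
        ok = True
    else:
        _, inverted, members, ranges = head
        hit = ch in members or any(lo <= ord(ch) <= hi for lo, hi in ranges)
        ok = hit != inverted
    return ok and match_items(rest, text[1:])


def wildmat_match(wildmat, text):
    """Return True if text matches the wildmat (rightmost match decides)."""
    result = False
    for negated, items in parse_wildmat(wildmat):
        if match_items(items, text):
            result = not negated
    return result


print(wildmat_match("a[b,c]d", "a,d"))   # True
print(wildmat_match("a*,c*,!*b", "cb"))  # False
```

The greedy pairing in parse_set (take "x-y" as a range whenever a dash follows the current character and is not the last character of the body) gives the same result as the left-to-right dash rule in section 5.1.1 for the examples in the table.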
========
[The following changes also need be made to other sections for consistency.
In addition the formal grammar will need updating.]
9.4 The LIST Keyword
9.4.1 LIST
[...]
If the optional wildmat parameter is specified, the list is
! limited to only the groups whose names match the wildmat. This
! will normally be very efficient if the wildmat is a simple group
! name.
9.4.2 LIST ACTIVE.TIMES
LIST ACTIVE.TIMES [wildmat]
[...]
If the optional wildmat parameter is specified, the list is
! limited to only the groups whose names match the wildmat. This
! will normally be very efficient if the wildmat is a simple group
! name.
9.4.4 LIST DISTRIB.PATS
LIST DISTRIB.PATS
The distrib.pats file is maintained by some news transport
systems to allow clients to choose a value for the
Distribution: line in the header of a news article being
posted. The information returned consists of lines, in no
particular order, each of which contains three fields
! separated by colons: a weight, a wildmat (which may be a simple
! group name), and a Distribution: value, in that order.
[...]
9.4.5 LIST NEWSGROUPS
LIST NEWSGROUPS [wildmat]
[...]
If the information is not available, the
server will return the 503 response. If the server does not
recognize the command it should return a 501 response. If
! the optional wildmat parameter is specified, the list is
! limited to only the groups that match the wildmat (no
! matching is done on the group descriptions). This will
! normally be very efficient if the wildmat is a simple group
! name. If nothing is matched
an empty list is returned, not an error.
11.4 NEWNEWS
! NEWNEWS wildmat date time [GMT]
! The message-ids of all articles added to a set of newsgroups
! since the given date-time will be listed. The set consists
! of all newsgroups whose name matches the wildmat.
The format of the listing will be one message-id per line, as
though text were being sent. Each message-id SHALL appear only
once in a response. The order of the response has no specific
significance and may vary from response to response in the
! same session. Date and time are in the same format as the
! NEWGROUPS command.
Note that an empty list (i.e., the text body returned by this
command consists only of the terminating period) is a possible
valid response, and indicates that there is currently no new
news.
Clients SHOULD make all queries in Coordinated Universal Time
when possible.
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8371 1138
Internet Expert | Home: <clive at davros.org> | Fax: +44 20 8371 1037
Demon Internet | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc | | Mobile: +44 7973 377646