ietf-nntp Wildmats

Clive D.W. Feather clive at demon.net
Wed Mar 7 08:44:38 PST 2001


New wording at <http://www.davros.org/nntp-texts/section-5.txt>. Repeated
here for discussion.


NNTP proposed text
Section 5
Last changed 2001-03-07 16:45 UTC

Since almost all the text is new, change markers are not used.

There are a number of open issues that I would like discussed. I've
described these in unindented bracketed text and given each an issue
number of the form W1, W2, etc. Other comments are also placed in
unindented bracketed text.


  5. The WILDMAT format

  The WILDMAT format described here is based on the version
  first developed by Rich Salz [5], which in turn was derived from
  the format used in the UNIX "find" command to specify file names.
  It was developed to provide a uniform mechanism for matching
  patterns in the same manner that the UNIX shell matches filenames.

  5.1 Wildmat syntax

  5.1.1 Simplified syntax

  A wildmat is described by the following augmented BNF[9] syntax
  (note that this syntax contains ambiguities and special cases described
  at the end):

    wildmat = wildmat-pattern *("," ["!"] wildmat-pattern)

    wildmat-pattern = 1*wildmat-item

    wildmat-item = wildmat-exact / wildmat-escape / wildmat-wild

    wildmat-exact = %x21-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
        UTF-8-non-ascii  ; exclude * , ? [ \

[W1: do we want to exclude ! and require it to be escaped as \! ? Doing so
eliminates an ambiguity in the syntax. On the other hand, common usage is
that ! is not special other than at the start of a pattern. My inclination
is to not escape it.]

    wildmat-escape = wildmat-hide / wildmat-special

    wildmat-hide = "\" wildmat-hidden

    wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
        ; exclude 0-9, A-Z, a-z

[W2: this allows any non-alphanumeric to be escaped with \. Is this too
general; should it be limited to * , ? [ and perhaps ! ? My inclination is
to leave it as shown here.]

[W3: if leaving it generic, do we want to include or exclude non-ascii
characters ? The above says include, but I'm inclined to exclude them.]

    wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
        "\" %x55 hex hex hex hex hex hex hex hex

[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
for general character escapes. Do we need these any more or shall we drop
them ? My inclination is to drop them for now; we may want to resurrect
them if we do more generic wildmats, but they aren't needed for newsgroup
names.]

    wildmat-wild = "*" / "?" / wildmat-set

    wildmat-set = "[" ["^"] wildmat-set-body "]"

    wildmat-set-body = 1*wildmat-set-item

    wildmat-set-item = wildmat-set-char / wildmat-set-range

    wildmat-set-char = %x21-7F / UTF-8-non-ascii

    wildmat-set-range = wildmat-set-char wildmat-range-delim wildmat-set-char

    wildmat-range-delim = "-"

  UTF-8-non-ascii is defined in section 13.

  This syntax must be interpreted subject to the following rules:

  - Where a wildmat-pattern is not immediately preceded by "!", it shall
    not begin with a "!".

  - Where a wildmat-set-body is not immediately preceded by "^", it shall
    not begin with a "^".

  - The character "]" may only appear in a wildmat-set-body if it is the
    very first character of that body.

[I think this rule is better than the -1- and -2- rules I had before.]

  - Within a wildmat-set-body, the character "-" shall be parsed as being
    a wildmat-range-delim unless:
    * it is the first or last character in the wildmat-set-body, or
    * either of the two immediately preceding characters is a "-" that can
      be parsed as a wildmat-range-delim (this determination is made from
      left to right, so that in "[%----b-c]" only the first and fourth
      dashes are wildmat-range-delims).

[I've had another go at this wording.]
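
The left-to-right determination in the dash rule can be sketched in Python.
This is an illustration only, not proposed text: the name range_delims is
invented here, and the body is assumed to have been extracted already, with
the enclosing brackets and any leading "^" removed.

```python
def range_delims(body):
    """Return the offsets of the "-" characters that act as range delimiters.

    `body` is a wildmat-set-body as a string of whole characters.
    """
    is_delim = [False] * len(body)
    for i, ch in enumerate(body):
        if ch != "-":
            continue
        # A dash at the very start or very end of the body is a plain character.
        if i == 0 or i == len(body) - 1:
            continue
        # If either of the two preceding characters is a dash already parsed
        # as a delimiter, this dash is a range end-point or a plain character.
        if is_delim[i - 1] or (i >= 2 and is_delim[i - 2]):
            continue
        is_delim[i] = True
    return [i for i, flag in enumerate(is_delim) if flag]
```

For the body of "[%----b-c]" this reports offsets 1 and 4, that is, the
first and fourth dashes.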

  5.1.2 Formalised syntax

  A wildmat is equivalently described by the following syntax. This
  version is more complex but only has one possible parse for every
  valid wildmat, rather than relying on separate notes. It generates
  the same wildmats as that in the previous section.

[Warning: this ain't pretty, but I think it's correct and unambiguous.]

[In general: -p- means "(start of) a positive case", -n- means "(start of)
a negative case", -t- means "generic or trailing", -x- means "dash
excluded".]

    wildmat = wildmat-p-pattern *("," wildmat-t-pattern)

    wildmat-t-pattern = wildmat-p-pattern / "!" wildmat-n-pattern

    wildmat-p-pattern = wildmat-p-item *wildmat-t-item

    wildmat-n-pattern = 1*wildmat-t-item

    wildmat-t-item = "!" / wildmat-p-item

    wildmat-p-item = wildmat-p-exact / wildmat-escape / wildmat-wild

    wildmat-p-exact = %x22-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
        UTF-8-non-ascii  ; exclude ! * , ? [ \

    wildmat-escape = wildmat-hide / wildmat-special

    wildmat-hide = "\" wildmat-hidden

    wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
        ; exclude 0-9, A-Z, a-z

    wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
        "\" %x55 hex hex hex hex hex hex hex hex

    wildmat-wild = "*" / "?" / wildmat-set

    wildmat-set = wildmat-p-set / wildmat-n-set

    wildmat-p-set = "[" wildmat-p-set-body "]"

    wildmat-n-set = "[" "^" wildmat-n-set-body "]"

    wildmat-p-set-body = wildmat-p-set-subbody *wildmat-t-set-subbody /
        wildmat-p-set-tail / wildmat-p-set-subbody wildmat-t-set-tail /
        wildmat-p-set-subbody *wildmat-t-set-subbody wildmat-t-set-tail

    wildmat-n-set-body = wildmat-n-set-subbody *wildmat-t-set-subbody /
        wildmat-n-set-tail / wildmat-n-set-subbody wildmat-t-set-tail /
        wildmat-n-set-subbody *wildmat-t-set-subbody wildmat-t-set-tail

    wildmat-p-set-subbody = wildmat-p-set-range /
        wildmat-p-set-char *wildmat-x-set-char wildmat-x-set-range

    wildmat-n-set-subbody = wildmat-n-set-range /
        wildmat-n-set-char *wildmat-x-set-char wildmat-x-set-range

    wildmat-t-set-subbody = wildmat-t-set-range /
        wildmat-t-set-char *wildmat-x-set-char wildmat-x-set-range

    wildmat-p-set-range = wildmat-p-set-char "-" wildmat-t-set-char

    wildmat-n-set-range = wildmat-n-set-char "-" wildmat-t-set-char

    wildmat-t-set-range = wildmat-t-set-char "-" wildmat-t-set-char

    wildmat-x-set-range = wildmat-x-set-char "-" wildmat-t-set-char

    wildmat-p-set-tail = wildmat-p-set-char *wildmat-x-set-char *1"-"

    wildmat-n-set-tail = wildmat-n-set-char *wildmat-x-set-char *1"-"

    wildmat-t-set-tail = wildmat-t-set-char *wildmat-x-set-char *1"-"

    wildmat-p-set-char = wildmat-c-set-char / "-" / "]"

    wildmat-n-set-char = wildmat-c-set-char / "-" / "]" / "^"

    wildmat-t-set-char = wildmat-c-set-char / "-" / "^"

    wildmat-x-set-char = wildmat-c-set-char / "^"

    wildmat-c-set-char = %x21-2C / %x2E-5C / %x5F-7F / UTF-8-non-ascii
        ; exclude - ] ^

  UTF-8-non-ascii is defined in section 13.

  The following additional syntax defines the terms used in section 5.2:

    wildmat-pattern = wildmat-p-pattern / wildmat-n-pattern

    wildmat-exact = "!" / wildmat-p-exact

    wildmat-set-body = wildmat-p-set-body / wildmat-n-set-body

    wildmat-set-range = wildmat-p-set-range / wildmat-n-set-range /
        wildmat-t-set-range

    wildmat-set-char = wildmat-n-set-char
        ; which is a superset of the other wildmat-?-set-char definitions
        ; characters forming part of a wildmat-set-range are excluded

  5.2 Wildmat semantics

  A wildmat is tested against a string, and either matches or does not
  match. To do this, each constituent wildmat-pattern is matched against
  the string and the rightmost pattern that matches is identified. If
  that wildmat-pattern is not preceded by "!", the whole wildmat matches.
  If it is preceded by "!", or if no wildmat-pattern matches, the whole
  wildmat does not match.

  For example, consider the wildmat "a*,!*b,*c*":

    the string "aaa" matches because the rightmost match is with "a*"
    the string "abb" does not match because the rightmost match is with "*b"
    the string "ccb" matches because the rightmost match is with "*c*"
    the string "xxx" does not match because no wildmat-pattern matches
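
The "rightmost match decides" rule can be sketched as follows. This is an
illustration under stated assumptions, not proposed text: the wildmat is
taken as already split into (negated, pattern) pairs, the name
wildmat_matches is invented here, and Python's fnmatch.fnmatchcase stands in
for the wildmat-pattern matcher (it behaves like a wildmat for "*" and "?"
but not for escapes or inverted sets).

```python
from fnmatch import fnmatchcase


def wildmat_matches(patterns, string, match=fnmatchcase):
    """Apply a list of (negated, pattern) pairs to `string`.

    `match(string, pattern)` is any anchored pattern matcher; fnmatchcase is
    only a stand-in for a real wildmat-pattern matcher.
    """
    result = False  # no pattern matching at all means "does not match"
    for negated, pattern in patterns:
        if match(string, pattern):
            # A later (further right) match overrides any earlier one.
            result = not negated
    return result


# The wildmat "a*,!*b,*c*" from the example above.
EXAMPLE = [(False, "a*"), (True, "*b"), (False, "*c*")]
```

With EXAMPLE, the strings "aaa" and "ccb" match while "abb" and "xxx" do not.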

  A wildmat-pattern matches a string if the string can be broken into
  components, each of which matches the corresponding wildmat-item in
  the pattern; the matches must be in the same order, and the whole string
  must be used in the match. The pattern is "anchored"; that is, the first
  and last characters in the string must match the first and last item
  respectively (unless that item is an asterisk matching zero characters).

  A wildmat-exact matches the same character (which may be more than one
  octet in UTF-8).

  "?" matches exactly one character (which may be more than one octet).

  "*" matches zero or more characters. It can match an empty string, but
  it cannot match only part of a UTF-8 sequence that consists of more than
  one octet.
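
A minimal sketch of anchored matching for patterns built only from
wildmat-exact characters, "?" and "*" follows. Escapes and sets are
deliberately left out, and the name pattern_matches is invented for
illustration. It works on Python strings, which are sequences of whole
characters, so "?" and "*" never see part of a UTF-8 sequence provided the
octets are decoded first.

```python
def pattern_matches(pattern, string):
    """Anchored match; `pattern` may contain only exact characters, ? and *."""
    p = s = 0
    star_p = star_s = -1  # last "*" seen and the string offset it was tried at
    while s < len(string):
        if p < len(pattern) and pattern[p] == "*":
            # Start by letting "*" match zero characters; remember it so
            # that we can come back and let it absorb more.
            star_p, star_s = p, s
            p += 1
        elif p < len(pattern) and (pattern[p] == "?" or pattern[p] == string[s]):
            p += 1
            s += 1
        elif star_p >= 0:
            # Backtrack: let the most recent "*" absorb one more character.
            star_s += 1
            p, s = star_p + 1, star_s
        else:
            return False
    # The whole string is used; only asterisks matching zero characters
    # may remain in the pattern.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)
```

For example, pattern_matches("?", b"\xc2\xa3".decode("utf-8")) is true,
because the two octets form a single character.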

[The next few items match the syntax above; if we change the syntax I'll
change or delete these to match.]

  "\" followed by any character other than an ASCII letter or digit matches
  that character; it can be used, for example, to match a literal "*" or
  "," in a string.

  "\s" matches one or more spaces (note that for the PAT command other
  white-space characters are replaced by space).
  
  "\u1234" matches the character with code 1234 hex (this will have the
  UTF-8 sequence %xE1 %x88 %xB4). There are always exactly four hexadecimal
  digits following the "u". "\u0020" is another way to match a space, or
  "\u003F" another way to match a question mark.

  "\U12345678" matches the character with code 12345678 hex. There are
  always exactly eight hexadecimal digits following the "U". "\U00001234"
  is equivalent to "\u1234".

  If "\" is followed by any other character, the behaviour is
  undefined.
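
The two hexadecimal escapes can be sketched as below. This is an
illustration only: the name decode_special is invented here, and Python's
chr() only accepts codes up to 10FFFF hex, so an escape such as "\U12345678"
raises ValueError in this sketch even though the syntax allows it.

```python
import re

# Exactly four hex digits after "u", exactly eight after "U".
_SPECIAL = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_special(escape):
    """Return the single character named by a \\uXXXX or \\UXXXXXXXX escape."""
    m = _SPECIAL.fullmatch(escape)
    if m is None:
        raise ValueError("not a \\u or \\U escape: %r" % escape)
    # Exactly one of the two groups is set, depending on which form matched.
    return chr(int(m.group(1) or m.group(2), 16))
```

For instance, decode_special("\\u1234").encode("utf-8") gives the three
octets E1 88 B4 quoted above.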

  5.2.1 Wildmat sets

  A wildmat-set matches exactly one character in the string. Which
  characters are matched depends on the wildmat-set-body.

[W5: the grammar does not treat \ in sets as special, just as another
character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
Are we happy with this ? It matches existing practice as I understand it,
and I'm inclined to keep it.]

[W6: the grammar does not treat , in sets as special. This means that the
wildmat "a[b,c]d" is a single pattern that matches the three strings "abd",
"a,d", and "acd". Are we happy with this ? It matches existing practice
but means that you can't split a wildmat into the component patterns just
by looking for unescaped commas. I'm inclined to keep it as it is.]

  If the body is preceded by "^", the set is "inverted". That is, it
  matches a character if and only if the set without a "^" prefix would
  not match the character, and vice versa.

  The body is split into wildmat-set-ranges and wildmat-set-chars.
  Each wildmat-set-char specifies a single character that the set will
  match. Each wildmat-set-range specifies a range of characters that the
  set will match; this range consists of every character whose code lies
  between the two characters in the range, inclusive. Thus "[a-dg]"
  is equivalent to "[abcdg]"; each matches any of the five characters "a",
  "b", "c", "d", or "g". Note that the codes are always those of
  ISO 10646, no matter what the local character set is.
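
The set semantics can be sketched as follows, assuming the body has already
been split into single characters and (low, high) ranges. The name
set_matches and the item representation are invented for illustration;
Python's ord() returns the ISO 10646 code of a character. A range whose
first code is higher than its second matches nothing in this sketch, which
is an assumption rather than anything the text requires.

```python
def set_matches(items, ch, inverted=False):
    """Test one character against a parsed wildmat-set-body.

    `items` is a list whose elements are either a single character
    (wildmat-set-char) or a (low, high) tuple (wildmat-set-range).
    """
    hit = False
    for item in items:
        if isinstance(item, tuple):
            low, high = item
            # Inclusive comparison on character codes, not on octets.
            if ord(low) <= ord(ch) <= ord(high):
                hit = True
        elif item == ch:
            hit = True
    # An inverted set matches exactly when the plain set would not.
    return hit != inverted
```

With this representation "[a-dg]" is [("a", "d"), "g"] and accepts the same
characters as list("abcdg").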

  If the first char in a range has a higher code than the second one, the
  characters represented by the range are determined by the implementation.
  This must be done in a consistent manner, so that, for example,
  "[d-a],[^d-a]" will match every possible character.

[W7: do we want to remove the consistency requirement ? This would mean
that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
that the two ranges in "[d-a][d-a]" might match different sets.]

  Implementers must be careful to apply the pattern-matching process
  to whole characters encoded in UTF-8, and not to individual octets.

  5.3  Examples

[Again, these correspond to the current text and may need changing if
we change other things.]

  In these examples, $ and @ are used to represent the two octets 0xC2
  and 0xA3 respectively; $@ is thus the UTF-8 encoding for the pound
  sterling symbol, shown as # in the descriptions.

  Wildmat    Description of strings that match

  abc        the one string "abc"
  abc,def    the two strings "abc" and "def"
  $@         the one character string "#"
  a*         any string that begins with "a"
  a*b        any string that begins with "a" and ends with "b"
  a*,*b      any string that begins with "a" or ends with "b"
  a*,!*b     any string that begins with "a" and does not end with "b"
  a*,!*b,c*  any string that begins with "a" and does not end with "b", and
             any string that begins with "c" no matter what it ends with
  a*,c*,!*b  any string that begins with "a" or "c" and does not end
             with "b"
  a\u0062c   the one string "abc"
  a\u002a    the one string "a*"
  a\*        the one string "a*"
  abc\,def   the one string "abc,def"
  a\u0020c   the one string "a c"
  a\sc       the strings "a c", "a  c", "a   c", "a    c", etc.
  ?a*        any string with "a" as its second character
  ??a*       any string with "a" as its third character
  *a?        any string with "a" as its penultimate character
  *a??       any string with "a" as its antepenultimate character
  [abc]      the three strings "a", "b", and "c"
  [^abc]     any one character string except the three "a", "b", and "c"
  [a-zA-Z]   any one character string consisting of an ASCII letter
  [0-9]*     any string beginning with an ASCII digit
  [a$@]      the two strings "a" and "#"
  [a-$@]     the 67 one character strings from "a" to "#"
  [a^bc]     the four strings "a", "^", "b", and "c"
  [a-c-]     the four strings "a", "b", "c", and "-"
  [a-c-f]    the five strings "a", "b", "c", "-", and "f"
  [-a0-]     the three strings "-", "a", and "0"
  [-a0-2]    the five strings "-", "a", "0", "1", and "2"
  [--0]      the four strings "-", ".", "/", and "0"
  []abc]     the four strings "]", "a", "b", and "c"
  [ab]c]     the two strings "ac]" and "bc]"
  [a\]c]     the two strings "ac]" and "\c]"
  a[b,c]d    the three strings "abd", "a,d" and "acd"
  a\[b,c]d   the two strings "a[b" and "c]d"
  [b-a]      some unspecified set of one character strings
  [^b-a]     all one character strings not matched by the previous pattern

========

[The following changes also need to be made to other sections for consistency.
In addition the formal grammar will need updating.]


  9.4 The LIST Keyword

  9.4.1 LIST

[...]

  If the optional wildmat parameter is specified, the list is
! limited to only the groups whose names match the wildmat. This
! will normally be very efficient if the wildmat is a simple group
! name.

  9.4.2 LIST ACTIVE.TIMES

  LIST ACTIVE.TIMES [wildmat]

[...]

  If the optional wildmat parameter is specified, the list is
! limited to only the groups whose names match the wildmat. This
! will normally be very efficient if the wildmat is a simple group
! name.

  9.4.4 LIST DISTRIB.PATS

  LIST DISTRIB.PATS

  The distrib.pats file is maintained by some news transport
  systems to allow clients to choose a value for the
  Distribution: line in the header of a news article being
  posted. The information returned consists of lines, in no
  particular order, each of which contains three fields
! separated by colons: a weight, a wildmat (which may be a simple
! group name), and a Distribution: value, in that order.

[...]

  9.4.5 LIST NEWSGROUPS

     LIST NEWSGROUPS [wildmat]

[...]
     If the information is not available, the
     server will return the 503 response. If the server does not
     recognize the command it should return a 501 response. If
!    the optional wildmat parameter is specified, the list is
!    limited to only the groups that match the wildmat (no
!    matching is done on the group descriptions). This will
!    normally be very efficient if the wildmat is a simple group
!    name. If nothing is matched
     an empty list is returned, not an error.

  11.4 NEWNEWS

! NEWNEWS wildmat date time [GMT]

! The message-ids of all articles added to a set of newsgroups
! since the given date-time will be listed. The set consists
! of all newsgroups whose name matches the wildmat.
  The format of the listing will be one message-id per line, as
  though text were being sent. Each message-id SHALL appear only
  once in a response. The order of the response has no specific
  significance and may vary from response to response in the
! same session. Date and time are in the same format as the
! NEWGROUPS command.

  Note that an empty list (i.e., the text body returned by this
  command consists only of the terminating period) is a possible
  valid response, and indicates that there is currently no new
  news.

  Clients SHOULD make all queries in Coordinated Universal Time 
  when possible. 

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive at davros.org>  | Fax:  +44 20 8371 1037
Demon Internet      | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc            |                            | Mobile: +44 7973 377646 


