ietf-nntp wildmat routines and text
Clive D.W. Feather
clive at demon.net
Thu Jul 27 15:32:51 PDT 2000
Russ Allbery said:
> Below is the documentation that I wrote for INN on how wildmat patterns
> work, with all the references to @ removed. This may be suitable for the
> standard, although it could probably use some pruning before put into an
> RFC since it's intended to be wordy and clear right now.
I've taken this, plus the other comments made on wildmats, and attempted to
write a new section 5. This is it. People may find it rather formal, but I
felt that that was better than leaving ambiguities in. I've included a
*lot* more examples.
I added \u at the outer level, but not special meanings for \ inside
character classes. The wording can easily be adjusted if we want to
remove the first or add the second.
5. The WILDMAT format
! The WILDMAT format described here is based on the version
! first developed by Rich Salz [5] which was derived from the format
used in the UNIX "find" command to articulate file names. It
was developed to provide a uniform mechanism for matching
patterns in the same manner that the UNIX shell matches
! filenames.
5.1 Wildmat structure
! A wildmat pattern consists of one or more component patterns. If
! there is more than one, they are separated by commas. Each component
! pattern can optionally be prefixed with an exclamation mark (which is
! not part of the component pattern). A string is tested against a
! wildmat pattern as follows:
! * test the string against each component and note which match;
! * if none match, the string does not match the wildmat;
! * if the rightmost component that matches is prefixed with an exclamation
! mark, the string does not match the wildmat;
! * otherwise the string matches the wildmat.
!
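The rightmost-match rule above can be sketched in a few lines of Python. This is only an illustration, not proposed text: the names split_components, simple_match, and wildmat_match are hypothetical, the stand-in component matcher understands only literal characters, "?", and "*", and splitting on every unescaped comma (without looking inside set specifiers) is an assumption.

```python
from typing import Callable, List, Tuple


def split_components(wildmat: str) -> List[Tuple[bool, str]]:
    """Split on unescaped commas; return (negated, component) pairs."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(wildmat):
        ch = wildmat[i]
        if ch == "\\" and i + 1 < len(wildmat):
            # Keep a backslash escape (such as "\,") intact in the component.
            current.append(wildmat[i:i + 2])
            i += 2
            continue
        if ch == ",":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    # The exclamation mark is a prefix, not part of the component pattern.
    return [(p.startswith("!"), p[1:] if p.startswith("!") else p) for p in parts]


def simple_match(component: str, text: str) -> bool:
    """Stand-in matcher (assumption): literals, '?' and '*' only."""
    if not component:
        return not text
    head, rest = component[0], component[1:]
    if head == "*":
        return any(simple_match(rest, text[k:]) for k in range(len(text) + 1))
    return bool(text) and head in ("?", text[0]) and simple_match(rest, text[1:])


def wildmat_match(
    wildmat: str,
    text: str,
    match_component: Callable[[str, str], bool] = simple_match,
) -> bool:
    """The rightmost matching component decides; no match means no match."""
    result = False
    for negated, component in split_components(wildmat):
        if match_component(component, text):
            result = not negated
    return result
```

Scanning left to right and letting each later match overwrite the result is equivalent to finding the rightmost matching component, and it avoids a second pass.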
! A component pattern consists of one or more units (there is no separator
! between the units). A unit consists of any of the following:
! [1] any ASCII character in the range %x22 to %x7E except for %x2A, %x2C,
! %x3F, %x5B, and %x5C (thus the excluded characters are control codes,
! space, exclamation mark, asterisk, comma, question mark, open square
! bracket, backslash, and delete);
! [2] any multi-octet UTF-8 character;
! [3] backslash, "u", and then four hexadecimal digits;
! [4] backslash, "U", and then eight hexadecimal digits;
! [5] asterisk;
! [6] question mark;
! [7] backslash followed by any non-alphanumeric ASCII character in the
! range %x21 to %x7E;
! [8] a set specifier.
!
! A string is matched against a component pattern by matching each
! character in the string against a corresponding unit in the pattern.
! (Apart from asterisk, each unit matches exactly one character;
! asterisk matches any number of characters including zero.) The
! pattern is "anchored"; that is, the first and last characters in the
! string must match the first and last unit respectively (unless that
! unit is an asterisk matching zero characters). The various units
! match characters as follows:
! [1] and [2] match precisely that character.
! [3] and [4] match the character that has the ISO 10646 code given by
! the hexadecimal number (so "\u00a3" matches the pound sterling
! character, which is represented as the two octets %xC2 %xA3 in UTF-8).
! [5] matches any number of characters, including zero.
! [6] matches any one character.
! [7] matches the ASCII character following the backslash (this may itself
! be a backslash).
! [8] matches any of the characters in the set.
!
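As an illustration of units [1] to [7], here is a minimal Python sketch that turns a component pattern into a list of units and then performs the anchored match. The names parse_units and match_units are hypothetical, set specifiers (unit [8]) are deliberately left out, and rejecting a backslash followed by any other alphanumeric character is an assumption about how malformed patterns should be handled.

```python
import string
from typing import List, Optional, Tuple

Unit = Tuple[str, Optional[str]]  # ("lit", char), ("any", None) or ("star", None)
HEX_DIGITS = set(string.hexdigits)


def parse_units(component: str) -> List[Unit]:
    """Split a component pattern into units [1] to [7]."""
    units: List[Unit] = []
    i, n = 0, len(component)
    while i < n:
        ch = component[i]
        if ch == "*":
            units.append(("star", None))
            i += 1
        elif ch == "?":
            units.append(("any", None))
            i += 1
        elif ch == "\\":
            if i + 1 >= n:
                raise ValueError("dangling backslash")
            nxt = component[i + 1]
            if nxt in ("u", "U"):
                # \u takes four hexadecimal digits, \U takes eight.
                width = 4 if nxt == "u" else 8
                digits = component[i + 2:i + 2 + width]
                if len(digits) != width or not set(digits) <= HEX_DIGITS:
                    raise ValueError("bad \\%s escape" % nxt)
                units.append(("lit", chr(int(digits, 16))))
                i += 2 + width
            elif "!" <= nxt <= "~" and not nxt.isalnum():
                units.append(("lit", nxt))
                i += 2
            else:
                raise ValueError("backslash before %r is not defined" % nxt)
        elif ch == "[":
            raise NotImplementedError("set specifiers are outside this sketch")
        else:
            units.append(("lit", ch))
            i += 1
    return units


def match_units(units: List[Unit], text: str) -> bool:
    """Anchored match: every character must be consumed by some unit."""
    if not units:
        return text == ""
    kind, value = units[0]
    if kind == "star":
        # Try every possible length for the asterisk, including zero.
        return any(match_units(units[1:], text[k:]) for k in range(len(text) + 1))
    if not text or (kind == "lit" and value != text[0]):
        return False
    return match_units(units[1:], text[1:])
```

Because a Python str is a sequence of code points rather than octets, each unit here consumes one whole character.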
! A set specifier consists of:
! * open square bracket ([)
! * an optional caret (^)
! * one or more set values, which are either:
! - an individual character (which may be multi-octet)
! - a range specifier, given by two characters (which may be multi-octet)
! separated with a minus (-)
! * close square bracket (])
!
! A caret, minus, or close square bracket is always taken to have its
! special meaning where possible. Thus a close square bracket can only
! be included in the set as the first of the set values; a minus can
! only be included as the first or last of the set values, or as the
! second character of a range specifier; and a caret cannot be included
! as the first of the set values unless the optional caret is also
! present.
!
! If the set specifier includes the optional caret, the set consists
! of all the characters that would not be in the set if the caret were
! omitted.
!
! If the set specifier does not include the optional caret, then the
! set consists of:
! - all the individual characters;
! - for each range, all the characters whose codes are greater than or
! equal to that of the first character and less than or equal to that
! of the second character.
! In character ranges, the codes used are those of ISO 10646, no matter
! what the local character set is. If the first character has a higher
! code than the second, the meaning is undefined.
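The set rules can be illustrated with a short Python sketch. The names parse_set and in_set are hypothetical. Since the meaning of a range whose first character has the higher code is undefined, the sketch simply lets such a range match nothing, which is an assumption rather than a requirement.

```python
from typing import List, Set, Tuple


def parse_set(pattern: str, start: int = 0) -> Tuple[bool, Set[str], List[Tuple[str, str]], int]:
    """Parse a set specifier beginning at pattern[start] == '['.

    Returns (negated, singles, ranges, index just past the closing ']').
    """
    if pattern[start] != "[":
        raise ValueError("not a set specifier")
    i = start + 1
    # A leading caret is always the optional caret, never a set value.
    negated = i < len(pattern) and pattern[i] == "^"
    if negated:
        i += 1
    singles: Set[str] = set()
    ranges: List[Tuple[str, str]] = []
    first = True
    while i < len(pattern):
        ch = pattern[i]
        if ch == "]" and not first:
            return negated, singles, ranges, i + 1
        # A minus forms a range unless it is the last of the set values.
        if i + 2 < len(pattern) and pattern[i + 1] == "-" and pattern[i + 2] != "]":
            ranges.append((ch, pattern[i + 2]))
            i += 3
        else:
            singles.add(ch)
            i += 1
        first = False
    raise ValueError("unterminated set specifier")


def in_set(ch: str, negated: bool, singles: Set[str], ranges: List[Tuple[str, str]]) -> bool:
    """Membership test; single-character str comparison uses code points."""
    member = ch in singles or any(lo <= ch <= hi for lo, hi in ranges)
    return member != negated
```

For example, parse_set("[a-c-]") yields one range ("a", "c") plus the single character "-", and parse_set("[ab]c]") stops at the first close square bracket, leaving "c]" to be matched as ordinary units.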
Implementers must be careful to apply the pattern-matching process
to whole characters encoded in UTF-8, and not to individual octets.
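To make the point about whole characters concrete, the following Python sketch splits a UTF-8 octet string into one chunk per character, so that "?" or a set can be applied to each chunk rather than to each octet. The name utf8_chars is hypothetical, and accepting sequences of at most four octets is an assumption; the sketch also does not check for overlong encodings.

```python
from typing import List


def utf8_chars(raw: bytes) -> List[bytes]:
    """Split UTF-8 octets into per-character chunks (1 to 4 octets each)."""
    chars: List[bytes] = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:
            width = 1                      # plain ASCII
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid lead octet at offset %d" % i)
        chunk = raw[i:i + width]
        # Every octet after the lead must look like 10xxxxxx.
        if len(chunk) != width or any((b >> 6) != 0b10 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        chars.append(chunk)
        i += width
    return chars
```

With this split, the pound sterling character (%xC2 %xA3) is a single chunk, so the pattern "a?" matches "a" followed by that character, whereas an octet-wise matcher would wrongly require "a??".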
5.2 Examples
!   Wildmat     Description of strings that match
!
!   abc         the one string "abc"
!   abc,def     the two strings "abc" and "def"
!   a*          any string that begins with "a"
!   a*b         any string that begins with "a" and ends with "b"
!   a*,*b       any string that begins with "a" or ends with "b"
!   a*,!*b      any string that begins with "a" and does not end with "b"
!   a*,!*b,c*   any string that begins with "a" and does not end with "b",
!               or any string that begins with "c"
!   a*,c*,!*b   any string that begins with "a" or "c" and does not end
!               with "b"
!   a\u0062c    the one string "abc"
!   a\u002a     the one string "a*"
!   a\*         the one string "a*"
!   abc\,def    the one string "abc,def"
!   ?a*         any string with "a" as its second character
!   ??a*        any string with "a" as its third character
!   *a?         any string with "a" as its penultimate character
!   *a??        any string with "a" as its antepenultimate character
!   [abc]       the three strings "a", "b", and "c"
!   [^abc]      any one character string except the three "a", "b", and "c"
!   [a-zA-Z]    any one character string consisting of an ASCII letter
!   [0-9]*      any string beginning with an ASCII digit
!   [a^bc]      the four strings "a", "^", "b", and "c"
!   [a-c-]      the four strings "a", "b", "c", and "-"
!   []abc]      the four strings "]", "a", "b", and "c"
!   [ab]c]      the two strings "ac]" and "bc]"
!   [a\]c]      the two strings "ac]" and "\c]"
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8371 1138
Internet Expert | Home: <clive at davros.org> | Fax: +44 20 8371 1037
Demon Internet | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc | | Mobile: +44 7973 377646