ietf-nntp Proposed wildmat text

Charles Lindsey chl at clw.cs.man.ac.uk
Tue Dec 5 06:45:13 PST 2000


I have at last got round to looking at Clive's wildmat wording
(http://www.davros.org/nntp-texts/section-5.txt). Comments follow:

>  5. The WILDMAT format
>
>  The WILDMAT format described here is based on the version
>  first developed by Rich Salz [5] which was derived from the format
>  used in the UNIX "find" command to articulate file names. It
>  was developed to provide a uniform mechanism for matching
>  patterns in the same manner that the UNIX shell matches filenames.
>
>  5.1 Wildmat syntax

I am not entirely sure that BNF is the right way to describe this.
Sometimes things are much easier to describe in plain words, and this may
be one of them. But sticking to BNF for the moment ...
>
>  A wildmat is described by the following augmented BNF[8] syntax:
>
>    wildmat = wildmat-pattern *("," ["!"] wildmat-pattern)

I thought we had agreed that there could be a '!' on the first
wildmat-pattern, so one could write
	NEWNEWS !alt.* ...
the understanding being that a wildmat starting with '!' was to be
interpreted as if it had been "*,!alt.*". That would accord with present
behaviour of NEWNEWS.
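
To make the intended reading concrete, here is a minimal Python sketch of
that rewrite. The function name normalize_wildmat is purely illustrative
and does not come from any draft or implementation; the sketch only shows
the "treat a leading '!' as if '*' came first" rule and does no matching.

```python
def normalize_wildmat(wildmat: str) -> str:
    """Rewrite a wildmat that starts with '!' as if "*," preceded it.

    Hypothetical helper: only the leading-'!' rule from the text above
    is modelled here; no pattern matching is attempted.
    """
    if wildmat.startswith("!"):
        return "*," + wildmat
    return wildmat
```

So normalize_wildmat("!alt.*") gives "*,!alt.*", and a wildmat that does
not start with '!' is returned unchanged.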

>
>    wildmat-pattern = 1*wildmat-item
>
>    wildmat-item = wildmat-exact / wildmat-wild / wildmat-escape
>
>    wildmat-exact = %x21-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
>      UTF-8-non-ascii ; exclude * , ? [ \

Please exclude '!' also and let people escape it (\!) if they ever need
it. That would be consistent with the treatment of '*', '?', ',', etc.

>
>    wildmat-escape = wildmat-hide / wildmat-special
>
>    wildmat-hide = "\" (%x22-2F / %x3A-40 / %x5B-60 / %x7B-7F /
>      UTF-8-non-ascii) ; exclude 0-9, A-Z, a-z
>
>    wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
>      "\" %x55 hex hex hex hex hex hex hex hex

Would we still need this if we do away with the PAT command? Or would we
still need it for people who wanted to specify strange newsgroup names
with UTF-8 characters in them, and who didn't have a convenient capability
to generate same from their keyboards?

And to specify SP, I would still like a 2-hex-digit option (since people
who think in purely ASCII terms won't see the need for \u0020 when \x20
looks more familiar). (but if there is no PAT, I don't care.)
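
For illustration only, a decoder for these escapes might look like the
Python sketch below. It rests on several assumptions: that "\s" stands
for SP (the quoted grammar does not say what it means), that the
2-hex-digit form would be spelled \xHH as suggested above (it is not in
Clive's text), and that the input is already a decoded character string.
The name decode_special is hypothetical.

```python
import re

# An escaped backslash is matched first so that the "s" in "\\s" is not
# mistaken for the start of a special escape.
_SPECIAL = re.compile(
    r"\\\\|\\(?:s|x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)


def decode_special(text: str) -> str:
    """Expand \\s, \\xHH, \\uHHHH and \\UHHHHHHHH; leave everything else alone."""

    def _replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        if digits is not None:
            # chr() raises ValueError for values above 0x10FFFF.
            return chr(int(digits, 16))
        if match.group(0) == "\\s":
            return " "  # ASSUMPTION: \s denotes a space
        return match.group(0)  # escaped backslash, kept as written

    return _SPECIAL.sub(_replace, text)
```

With this sketch, "\x20", "\u0020" and "\s" all decode to a single space,
which is the equivalence being asked for.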

>
>    wildmat-wild = "*" / "?" / wildmat-set
>
>    wildmat-set = "[" ["^"] wildmat-set-body "]"
>
>    wildmat-set-body = wildmat-set-1 *wildmat-set-2
>
>    wildmat-set-1 = wildmat-set-1-char / wildmat-set-1-range
>
>    wildmat-set-2 = wildmat-set-2-char / wildmat-set-2-range
>
>    wildmat-set-1-char = %21-7F / UTF-8-non-ascii
>
>    wildmat-set-2-char = %21-5C / %5E-7F / UTF-8-non-ascii ; exclude ]
>
>    wildmat-set-1-range = wildmat-set-1-char "-" wildmat-set-2-char
>
>    wildmat-set-2-range = wildmat-set-2-char "-" wildmat-set-2-char
>
No, you are trying to define meanings for all sorts of stupid things like
[a-------b] which are quite useless, but will complicate implementation
tremendously. People will have expectations of how this notation should
work based on its use in filename globbing and regular expressions. We
should match those expectations, but we should not try to do more.

I have been looking into how this notation is defined in other situations.
Mostly, it isn't :-( .

Bourne sh and csh explain what a-b means, but ignore the issue of all
'awkward' cases. They do not address the issue of ']' at all. Ksh goes so
far as to say that any '-' should be first or last, but still ignores ']'.
Note that file globbing in all these shells uses '!' rather than '^' for
negating the whole set.

Turning to regular expressions, I find man pages for regexp(5) and
regex(5), with no clear indication of which is which (this is Solaris 7 I
am looking at). However, it seems that regex(5) is the latest POSIX
thinking on the matter, and at least it gives clear answers to all the
'awkward' cases (regexp(5) still does not account for the possibility of
"a-b" where either 'a' or 'b' is itself a '-'). So I would suggest that we
base our thinking on regex(5).

Summary of regex(5) features:

'^' at the start means negate the whole set. '^' anywhere else means
itself. Henceforth, when I say "start", I mean "after removing the
initial '^', if any".

']' at the (revised) start means itself. Elsewhere, it terminates the set.

'-' at the start or end means itself (except in the form [--a]).
Elsewhere, it indicates a range. If you want both ']' and '-' in your set,
then you put ']' at the start and '-' at the end, as in []abc-].

'a-b' is a range PROVIDED a<=b in the collating order. If a>b, the RE is
invalid. 'b' may be a '-' (as in [%--]) but 'a' may not be a minus (as in
[%--:]) EXCEPT at the start (as in [--:]). 'a-b-c' is invalid in all
circumstances.

Thus it will be seen that regex(5) sidesteps all the 'awkward' cases by
declaring most of them as 'invalid'. If we were to do that (either by
excluding them syntactically, or declaring the effect of using them as
"undefined"), then we would fulfil all of people's reasonable expectations
without burdening implementors unnecessarily. Note that regex(5) also
specifies lots of other bells and whistles (including taking the collating
order from the locale) which we should quietly ignore.

Here is a grammar that encompasses what I would now suggest:

   wildmat-set = "[" wildmat-set-body "]" /
                 "[" "^" negated-wildmat-set-body "]"

   wildmat-set-body = wildmat-item-1 *wildmat-item-rest ["-"]

   negated-wildmat-set-body =
              negated-wildmat-item-1 *wildmat-item-rest ["-"]

   wildmat-item-1 = ( "-" / "]" / wildmat-char-other ) [ wildmat-range ]

   negated-wildmat-item-1 = 
              ( "-" / "]" / "^" / wildmat-char-other ) [ wildmat-range ]

   wildmat-item-rest =    ( "^" / wildmat-char-other ) [ wildmat-range ]

   wildmat-range = "-" ( "-" /  wildmat-char-other )

   wildmat-char-other = %x21-2C / %x2E-5C / %x5F-7F / UTF-8-non-ascii
                       ; exclude SP, "-", "]", "^"
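
One way to experiment with this grammar is to transcribe it rule by rule
into a Python regular expression and let the regex engine do the
backtracking. The sketch below does that. It is not a reference
implementation: the name is_valid_set is made up for the example, and
UTF-8-non-ascii is assumed to mean any single already-decoded character
above U+007F, since no grammar for it has been given.

```python
import re

# wildmat-char-other: printable ASCII except "-", "]", "^", plus
# (ASSUMPTION) any non-ASCII character standing in for UTF-8-non-ascii.
OTHER = r"[\x21-\x2C\x2E-\x5C\x5F-\x7F\u0080-\U0010FFFF]"

RANGE = rf"(?:-(?:-|{OTHER}))"                    # wildmat-range
ITEM_1 = rf"(?:(?:-|\]|{OTHER}){RANGE}?)"         # wildmat-item-1
NEG_ITEM_1 = rf"(?:(?:-|\]|\^|{OTHER}){RANGE}?)"  # negated-wildmat-item-1
ITEM_REST = rf"(?:(?:\^|{OTHER}){RANGE}?)"        # wildmat-item-rest

BODY = rf"{ITEM_1}{ITEM_REST}*-?"                 # wildmat-set-body
NEG_BODY = rf"{NEG_ITEM_1}{ITEM_REST}*-?"         # negated-wildmat-set-body

WILDMAT_SET = re.compile(rf"\[(?:{BODY}|\^{NEG_BODY})\]")


def is_valid_set(text: str) -> bool:
    """Return True if text, brackets included, is a complete wildmat-set."""
    return WILDMAT_SET.fullmatch(text) is not None
```

Under these assumptions the sketch accepts [a-z], []abc-], [--a], [%--]
and [^^], and rejects [a-b-c], [a-------b], [a b], [] and [^].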

BTW, I see that no grammar has been given for UTF-8-non-ascii. That needs
to be fixed.

Observe that I have not actually forbidden 'a-b' where a>b. This could be
done by a suitable piece of text, but OTOH one could argue no harm is done
by leaving it in.
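
If we did decide to forbid it, the check is cheap. Here is a rough Python
sketch that lists the out-of-order ranges in a set. It assumes the set has
already been accepted by the grammar above, and it compares endpoints by
code point rather than by any locale collating order (my assumption, in
line with ignoring the locale). The name reversed_ranges is hypothetical.

```python
def reversed_ranges(set_text: str) -> list:
    """Return (low, high) pairs in a bracket set where low sorts after high.

    ASSUMPTION: set_text is a syntactically valid set, brackets included,
    and ordering is by code point.
    """
    body = set_text[1:-1]        # strip the enclosing "[" and "]"
    if body.startswith("^"):
        body = body[1:]          # drop the negation marker
    bad = []
    i = 0
    while i < len(body):
        # A "-" with a character on both sides forms a range.
        if i + 2 < len(body) and body[i + 1] == "-":
            low, high = body[i], body[i + 2]
            if low > high:
                bad.append((low, high))
            i += 3
        else:
            i += 1               # single character or trailing "-"
    return bad
```

An implementation that wanted to reject such sets would simply treat a
non-empty result as an error; one that did not care could skip the check.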

Here is some output from a modern version of 'ed'. I claim it is accepting
and rejecting exactly those cases which would be accepted and rejected by
the above grammar.

chl% ed
a
&
-
9


