ietf-nntp Proposed wildmat text

Wed Dec 6 04:17:35 PST 2000

Charles Lindsey said:
>>  5.1 Wildmat syntax
> 
> I am not entirely sure that BNF is the right way to describe this.
> Sometimes things are much easier to describe in plain words, and this may
> be one of them.

Plain words are fine until you hit ambiguities. By the time you've ironed
them out, formal syntax may be easier.

>>  A wildmat is described by the following augmented BNF[8] syntax:
>>
>>    wildmat = wildmat-pattern *("," ["!"] wildmat-pattern)
> I thought we had agreed that there could be a '!' on the first
> wildmat-pattern, so one could write
> 	NEWNEWS !alt.* ...
> the understanding being that a wildmat starting with '!' was to be
> interpreted as if it had been "*,!alt.*". That would accord with present
> behaviour of NEWNEWS.

This would be fairly easy to change, though it does break the "rightmost
match" rule. I'll change it if/when I'm told that that's the WG consensus.
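
To make the two readings concrete, here is a minimal Python sketch of the
"rightmost match" rule together with the proposed treatment of a leading
'!' (read as if the wildmat began with "*,"). The function name is
hypothetical, fnmatchcase from the standard library stands in for a real
wildmat-pattern matcher, and splitting on every comma ignores escaped
commas, so take it as an illustration only, not as text for the draft.

```python
from fnmatch import fnmatchcase


def wildmat_matches(wildmat, name):
    """Return True if name is selected by wildmat (rightmost match wins)."""
    # Assumption: a wildmat starting with '!' is read as "*," + wildmat.
    if wildmat.startswith("!"):
        wildmat = "*," + wildmat
    # Walk the patterns right to left; the first one that matches decides.
    for item in reversed(wildmat.split(",")):
        negated = item.startswith("!")
        pattern = item[1:] if negated else item
        if fnmatchcase(name, pattern):
            return not negated
    # No pattern matched at all, so the name is not selected.
    return False
```

With this sketch, "!alt.*" selects comp.lang.c but not alt.test, and
"comp.*,!comp.lang.*,comp.lang.c" selects comp.lang.c but not
comp.lang.python.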

>>    wildmat-exact = %x21-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
>>      UTF-8-non-ascii ; exclude * , ? [ \
> 
> Please exclude '!' also and let people escape it (\!) if they ever need
> it. That would be consistent with the treatment of '*', '?', ',', etc.

No it wouldn't. The other characters can occur anywhere within a wildmat
pattern; it's just that they aren't interpreted as a wildmat-exact when
they do. Allowing ! anywhere except at the start of a non-negated wildmat
reduces the number of "can't happen" cases and simplifies the parser.

>>    wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
>>      "\" %x55 hex hex hex hex hex hex hex hex
> 
> Would we still need this if we do away with the PAT command? Or would we
> still need it for people who wanted to specify strange newsgroup names
> with UTF-8 characters in them, and who didn't have a convenient capability
> to generate same from their keyboards?

The purposes of \u are:
- for people who want to insert non-ASCII characters and don't have UTF-8
- to escape control characters
- as another way of escaping special characters like comma
I really don't know if we need it without PAT - opinions?

> And to specify SP, I would still like a 2-hex-digit option (since people
> who think in purely ASCII terms won't see the need for \u0020 when \x20
> looks more familiar). (but if there is no PAT, I don't care.)

That's what \s was for, modulo the discussion about white-space folding.
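
For reference, decoding the three wildmat-special forms could look roughly
like the Python sketch below. It assumes that "hex" accepts both upper and
lower case digits, and it does not handle a backslash that is itself
escaped; the function name is made up for the example.

```python
import re

# \s, \u plus four hex digits, or \U plus eight hex digits.
_SPECIAL = re.compile(r"\\(?:s|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def decode_special(text):
    """Replace each wildmat-special escape with the character it names."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        if digits is None:
            return " "  # the \s form stands for SP
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(digits, 16))
    return _SPECIAL.sub(replace, text)
```

So a\sb and a\u0020b both decode to "a b", and \u002C gives a comma.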

>>    wildmat-set = "[" ["^"] wildmat-set-body "]"
>>    wildmat-set-body = wildmat-set-1 *wildmat-set-2
>>    wildmat-set-1 = wildmat-set-1-char / wildmat-set-1-range
>>    wildmat-set-2 = wildmat-set-2-char / wildmat-set-2-range
>>    wildmat-set-1-char = %x21-7F / UTF-8-non-ascii
>>    wildmat-set-2-char = %x21-5C / %x5E-7F / UTF-8-non-ascii ; exclude ]
>>    wildmat-set-1-range = wildmat-set-1-char "-" wildmat-set-2-char
>>    wildmat-set-2-range = wildmat-set-2-char "-" wildmat-set-2-char
>
> No, you are trying to define meanings for all sorts of stupid things like
> [a-------b] which are quite useless, but will complicate implementation
> tremendously.

On the contrary, the rather hairy-looking syntax there actually expresses
the *simplest* implementation: scan left-to-right for "x-y" triples
that don't overlap previous triples.
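
As a rough illustration of that scan, here is a Python sketch. It takes the
text between the brackets (after any leading '^' has been removed) and
returns a list of (low, high) pairs, with a single character stored as a
pair of itself. The names are hypothetical, and the choice that a range
with low > high matches nothing is only an assumption made for the sketch.

```python
def parse_set_body(body):
    """Scan left to right, taking "x-y" triples that don't overlap."""
    items = []
    i = 0
    while i < len(body):
        # A triple needs a '-' in the middle and a character after it.
        if i + 2 < len(body) and body[i + 1] == "-":
            items.append((body[i], body[i + 2]))
            i += 3
        else:
            items.append((body[i], body[i]))
            i += 1
    return items


def set_matches(body, ch):
    """True if ch falls in any single character or range of the set body."""
    # Assumption: a reversed range (low > high) simply matches nothing.
    return any(low <= ch <= high for low, high in parse_set_body(body))
```

Under this scan, "%--:" comes out as the range % to - plus the character
":", "a-b-c" as the range a to b plus "-" plus "c", and "]abc-" as five
single characters.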

> I have been looking into how this notation is defined in other situations.
> Mostly, it isn't :-( .

Exactly.

> Summary of regex(5) features:
> 
> '^' at the start means negate the whole set. '^' anywhere else means
> itself. Henceforth, when I say "start", I mean "after removing the
> initial '^', if any".
> 
> ']' at the (revised) start means itself. Elsewhere, it terminates the set.

Shown in my syntax by the -1 and -2 distinctions.

> '-' at the start or end means itself (except in the form [--a]).
> Elsewhere, it indicates a range. If you want both ']' and '-' in your set,
> then you put ']' at the start and '-' at the end, as in []abc-].

That's what happens in my syntax.

> 'a-b' is a range PROVIDED a<=b in the collating order. If a>b, the RE is
> invalid. 'b' may be a '-' (as in [%--]) but 'a' may not be a minus (as in
> [%--:]) EXCEPT at the start (as in [--:]). 'a-b-c' is invalid in all
> circumstances.

If a>b, then the entire contents of the set become undefined, but it
should still match exactly one character (IMO). But I'm not too bothered
about that. [%--:] is the range % to - plus the character :, in both
regex and my syntax. The only debate is about a-b-c, which I think is a-b
plus - plus c, and you disagree. I'm willing to be persuaded that this
should change.

> Thus it will be seen that regex(5) sidesteps all the 'awkward' cases by
> declaring most of them as 'invalid'. If we were to do that (either by
> excluding them syntactically, or declaring the effect of using them as
> "undefined"), then we would fulfil all of people's reasonable expectations
> without burdening implementors unnecessarily.

I can only see one, or perhaps two, places where I've defined something
that regex doesn't. I don't see why either of those is a burden.

> Note that regex(5) also
> specifies lots of other bells and whistles (including taking the collating
> order from the locale) which we should quietly ignore.

Agreed.

> Here is a grammar that encompasses what I would now suggest:
[...]
> BTW, I see that no grammar has been given for UTF-8-non-ascii. That needs
> to be fixed.

That can just be a cross-reference to the grammar at the end of the
document.

> Observe that I have not actually forbidden 'a-b' where a>b. This could be
> done by a suitable piece of text, but OTOH one could argue no harm is done
> by leaving it in.

That's also the approach I took. I think, however, that in this case the
set contents are unspecified but the set still matches one character from
the set contents.

> Here is some output from a modern version of 'ed'. I claim it is accepting
> and rejecting exactly those cases which would be accepted and rejected by
> the above grammar.
> 
> chl% ed
> a
> &
> -
> 9

Huh ?

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive at davros.org>  | Fax:  +44 20 8371 1037
Demon Internet      | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc            |                            | Mobile: +44 7973 377646