ietf-nntp Proposed wildmat text
Clive D.W. Feather
clive at demon.net
Wed Dec 6 04:17:35 PST 2000
Charles Lindsey said:
>> 5.1 Wildmat syntax
>
> I am not entirely sure that BNF is the right way to describe this.
> Sometimes things are much easier to describe in plain words, and this may
> be one of them.
Plain words are fine until you hit ambiguities. By the time you've ironed
them out, formal syntax may be easier.
>> A wildmat is described by the following augmented BNF[8] syntax:
>>
>> wildmat = wildmat-pattern *("," ["!"] wildmat-pattern)
> I thought we had agreed that there could be a '!' on the first
> wildmat-pattern, so one could write
> NEWNEWS !alt.* ...
> the understanding being that a wildmat starting with '!' was to be
> interpreted as if it had been "*,!alt.*". That would accord with present
> behaviour of NEWNEWS.
This would be fairly easy to change, though it does break the "rightmost
match" rule. I'll change it if/when I'm told that that's the WG consensus.
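For anyone who wants to experiment with the two readings, here is a minimal sketch of rightmost-match evaluation with the proposed leading-"!" rewrite. It is not text from the draft. The name wildmat_match is hypothetical, and Python's fnmatchcase stands in for matching a single wildmat-pattern, which is an assumption: it handles "*" and "?" but knows nothing of "\" escapes or "[^...]" sets, and the naive split ignores escaped commas.

```python
from fnmatch import fnmatchcase


def wildmat_match(wildmat: str, name: str) -> bool:
    """Return True if name is selected by wildmat (sketch only)."""
    # Naive split: escaped commas are not handled in this sketch.
    patterns = wildmat.split(",")

    # Proposed rule from the thread: a wildmat starting with "!" is
    # interpreted as if it had been written "*,!rest".
    if patterns[0].startswith("!"):
        patterns.insert(0, "*")

    # Rightmost match: the last pattern that matches decides the result.
    for pat in reversed(patterns):
        negated = pat.startswith("!")
        if negated:
            pat = pat[1:]
        # Assumption: fnmatchcase approximates one wildmat-pattern.
        if fnmatchcase(name, pat):
            return not negated

    # Nothing matched at all, so the name is not selected.
    return False
```

With this sketch, "!alt.*" selects comp.lang.c but not alt.test, and "comp.*,!comp.lang.*,comp.lang.c" selects comp.lang.c but not comp.lang.perl.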
>> wildmat-exact = %x21-29 / %x2B / %x2D-3E / %x40-5A / %x5D-7F /
>> UTF-8-non-ascii ; exclude * , ? [ \
>
> Please exclude '!' also and let people escape it (\!) if they ever need
> it. That would be consistent with the treatment of '*', '?', ',', etc.
No it wouldn't. The other characters can occur anywhere within a wildmat
pattern; it's just that they aren't interpreted as a wildmat-exact when
they do. Allowing ! anywhere except at the start of a non-negated wildmat
reduces the number of "can't happen" cases and simplifies the parser.
>> wildmat-special = "\" %x73 / "\" %x75 hex hex hex hex /
>> "\" %x55 hex hex hex hex hex hex hex hex
>
> Would we still need this if we do away with the PAT command? Or would we
> still need it for people who wanted to specify strange newsgroup names
> with UTF-8 characters in them, and who didn't have a convenient capability
> to generate same from their keyboards?
The purposes of \u are:
- for people who want to insert non-ASCII characters and don't have UTF-8
- to escape control characters
- as another way of escaping special characters like comma
I really don't know if we need it without PAT - opinions ?
> And to specify SP, I would still like a 2-hex-digit option (since people
> who think in purely ASCII terms won't see the need for \u0020 when \x20
> looks more familiar). (but if there is no PAT, I don't care.)
That's what \s was for, modulo the discussion about white-space folding.
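As an illustration of the three wildmat-special forms, here is a rough sketch of a decoder. It assumes that "hex" means one hexadecimal digit in either case and that \s stands for SP, as discussed above. The name decode_special is hypothetical and nothing here comes from the draft.

```python
import re

# \s, \u plus 4 hex digits, or \U plus 8 hex digits.
_SPECIAL = re.compile(r"\\(s|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})")


def decode_special(pattern: str) -> str:
    """Replace each wildmat-special with the character it denotes."""

    def repl(match: re.Match) -> str:
        body = match.group(1)
        if body == "s":
            return " "  # assumption: \s denotes a single space
        # Drop the leading "u" or "U" and read the rest as a code point.
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(body[1:], 16))

    return _SPECIAL.sub(repl, pattern)
```

A real matcher could not simply decode first and parse afterwards: \u002C has to stay a literal comma rather than become a pattern separator, so the decoded character would need to be marked as already escaped.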
>> wildmat-set = "[" ["^"] wildmat-set-body "]"
>> wildmat-set-body = wildmat-set-1 *wildmat-set-2
>> wildmat-set-1 = wildmat-set-1-char / wildmat-set-1-range
>> wildmat-set-2 = wildmat-set-2-char / wildmat-set-2-range
>> wildmat-set-1-char = %x21-7F / UTF-8-non-ascii
>> wildmat-set-2-char = %x21-5C / %x5E-7F / UTF-8-non-ascii ; exclude ]
>> wildmat-set-1-range = wildmat-set-1-char "-" wildmat-set-2-char
>> wildmat-set-2-range = wildmat-set-2-char "-" wildmat-set-2-char
>
> No, you are trying to define meanings for all sorts of stupid things like
> [a-------b] which are quite useless, but will complicate implementation
> tremendously.
On the contrary, the rather hairy-looking syntax there actually expresses
the *simplest* implementation: scan left-to-right for "x-y" triples
that don't overlap previous triples.
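The left-to-right scan can be sketched roughly as follows. The function names are hypothetical. The body passed in is assumed to be the text between "[" and "]" with any initial "^" already removed, so locating the closing "]" is left out. Treating a reversed range as empty is also an assumption, since that case is unspecified.

```python
def parse_set_body(body: str) -> list[tuple[str, str]]:
    """Split a set body into (low, high) pairs, scanning left to right."""
    items = []
    i = 0
    while i < len(body):
        # An "x-y" triple: a character, a "-", and one more character.
        if i + 2 < len(body) and body[i + 1] == "-":
            items.append((body[i], body[i + 2]))
            i += 3  # the next triple must not overlap this one
        else:
            # A lone character is recorded as a one-character range.
            items.append((body[i], body[i]))
            i += 1
    return items


def set_matches(body: str, ch: str, negated: bool = False) -> bool:
    """Return True if ch is matched by the (possibly negated) set."""
    # Assumption: a reversed range (low > high) contains nothing.
    hit = any(low <= ch <= high for low, high in parse_set_body(body))
    return hit != negated
```

Under this scan "a-b-c" comes out as a-b plus - plus c, "%--:" as the range % to - plus the character :, and "a-------b" as the three triples a--, --- and --b.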
> I have been looking into how this notation is defined in other situations.
> Mostly, it isn't :-( .
Exactly.
> Summary of regex(5) features:
>
> '^' at the start means negate the whole set. '^' anywhere else means
> itself. Henceforth, when I say "start", I mean "after removing the
> initial '^', if any".
>
> ']' at the (revised) start means itself. Elsewhere, it terminates the set.
Shown in my syntax by the -1 and -2 distinctions.
> '-' at the start or end means itself (except in the form [--a]).
> Elsewhere, it indicates a range. If you want both ']' and '-' in your set,
> then you put ']' at the start and '-' at the end, as in []abc-].
That's what happens in my syntax.
> 'a-b' is a range PROVIDED a<=b in the collating order. If a>b, the RE is
> invalid. 'b' may be a '-' (as in [%--]) but 'a' may not be a minus (as in
> [%--:]) EXCEPT at the start (as in [--:]). 'a-b-c' is invalid in all
> circumstances.
If a>b, then the entire contents of the set become undefined, but it
should still match exactly one character (IMO). But I'm not too bothered
about that. [%--:] is the range % to - plus the character :, in both
regex and my syntax. The only debate is about a-b-c, which I think is a-b
plus - plus c, and you disagree. I'm willing to be persuaded that this
should change.
> Thus it will be seen that regex(5) sidesteps all the 'awkward' cases by
> declaring most of them as 'invalid'. If we were to do that (either by
> excluding them syntactically, or declaring the effect of using them as
> "undefined"), then we would fulfil all of people's reasonable expectations
> without burdening implementors unnecessarily.
I can only see one, or perhaps two, places where I've defined something
that regex doesn't. I don't see why either of those is a burden.
> Note that regex(5) also
> specifies lots of other bells and whistles (including taking the collating
> order from the locale) which we should quietly ignore.
Agreed.
> Here is a grammar that encompasses what I would now suggest:
[...]
> BTW, I see that no grammar has been given for UTF-8-non-ascii. That needs
> to be fixed.
That can just be a cross-reference to the grammar at the end of the
document.
> Observe that I have not actually forbidden 'a-b' where a>b. This could be
> done by a suitable piece of text, but OTOH one could argue no harm is done
> by leaving it in.
That's also the approach I took. I think, however, that in this case the
set contents are unspecified but the set still matches one character from
the set contents.
> Here is some output from a modern version of 'ed'. I claim it is accepting
> and rejecting exactly those cases which would be accepted and rejected by
> the above grammar.
>
> chl% ed
> a
> &
> -
> 9
Huh ?
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8371 1138
Internet Expert | Home: <clive at davros.org> | Fax: +44 20 8371 1037
Demon Internet | WWW: http://www.davros.org | DFax: +44 20 8371 4037
Thus plc | | Mobile: +44 7973 377646