ietf-nntp Wildmats
Clive D. W. Feather
clive at on-the-train.demon.co.uk
Fri Mar 9 01:11:07 PST 2001
-----BEGIN PGP SIGNED MESSAGE-----
In message <G9voLo.D0I at clw.cs.man.ac.uk>, Charles Lindsey
<chl at clw.cs.man.ac.uk> writes
>> wildmat-escape = wildmat-hide / wildmat-special
>
>> wildmat-hide = "\" wildmat-hidden
>
>> wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
>> ; exclude 0-9, A-Z, a-z
>
>>[W2: this allows any non-alphanumeric to be escaped with \. Is this too
>>general; should it be limited to * , ? [ and perhaps !. My inclination is
>>to leave it as shown here.]
>
>>[W3: if leaving it generic, do we want to include or exclude non-ascii
>>characters ? The above says include, but I'm inclined to exclude them.]
>
>I think you got your 'include's and 'exclude's crossed in there. Anyway,
>my view is that, if you are going to allow '\' escapes at all, then you
>allow them on _everything_; no arbitrary rules to have to remember. I think
>that is what you are saying.
No, my personal preference was to allow only the ASCII punctuation marks
to be escaped. That is:

    wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F
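
As a rough illustration of that restricted rule, the four ranges can be tested directly against a character's code. This is only a sketch; the function name is_escapable is made up for the example and is not part of any draft.

```python
def is_escapable(ch: str) -> bool:
    """True if ch falls in the ASCII-only wildmat-hidden ranges quoted above.

    The name is hypothetical; the four ranges are copied from the ABNF:
    %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F (digits and letters are excluded).
    """
    cp = ord(ch)
    return (
        0x21 <= cp <= 0x2F
        or 0x3A <= cp <= 0x40
        or 0x5B <= cp <= 0x60
        or 0x7B <= cp <= 0x7F
    )
```

Under this rule "\*" and "\[" would be valid escapes, while "\a", "\7" and a backslash followed by any non-ASCII character would not.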
>OTOH, none of { * , ? [ \ } can ever occur in a newsgroup-name, certainly
>not in [USEFOR] and not in any current usage either (1036 seems silent on
>the issue), so it might be argued that wildmat-escape could be omitted
>entirely.
[...]
>>[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
>>for general character escapes. Do we need these any more or shall we drop
>>them ? My inclination is to drop them for now; we may want to resurrect
>>them if we do more generic wildmats, but they aren't needed for newsgroup
>>names.]
>
>Yes, drop them.
If we do this, then I recommend that we ban backslash entirely in
wildmats, including in [...] sets. This allows them to be used in the
future if we ever need an escape mechanism and in the meantime
eliminates some potential confusion.
Which reminds me: *somewhere* we need text along the lines of:

    If a parameter that is specified as a wildmat does not meet the
    syntax of 5.1.1, the NNTP server MAY place some interpretation on it
    (not specified by this document) or otherwise MUST generate a 501
    response.
>> * either of the two immediately preceding characters is a "-" that can
>> be parsed as a wildmat-range-range (this determination is made from
> ^^^^^
> delim
Oops.
>> 5.1.2 Formalised syntax
>>[Warning: this ain't pretty, but I think it's correct and unambiguous.]
>
>*You* know that there is a Van Wijngaarden Grammar hiding in there, and
>*I* know that there is a Van Wijngaarden Grammar hiding in there, but for
>the unfortunate masses who do not know, this is just too complicated and
>counterproductive
Yes, this may be the case.
I wrote this section because there had been some disquiet with the idea
that the syntax was subject to statements like those I placed at the
end. I therefore did this as an alternative.
>(though it might give a clue as to how to implement it).
TBH, I wouldn't do it that way; using the first grammar gives a more
natural approach.
>Methinks you have been sitting in too many delayed trains :-) .
Actually I did it in the office.
Some torturing of yacc has led me to the annoying conclusion that the
grammar for sets is actually LR(2), not LR(1).
>I recommend to leave it all out.
Does anyone want this section to be kept in ?
>>[In general: -p- means "(start of) a positive case", -t- means "(start of)
> ^^^
> -n-
Oops again.
>> 5.2 Wildmat semantics
>
>
>> "*" matches zero or more characters. It can match an empty string, but
>> it cannot match only part of a UTF-8 sequence that consists of more than
>> one octet.
>
>ITYM "it cannot match a subsequence of a UTF-8 sequence that is not
>aligned to the character boundaries". Otherwise you are forbidding a match
>with a sequence of genuine UTF-8 characters that is embedded in a longer
>sequence of UTF-8 characters.
I think we have different views of what "UTF-8 sequence" means. I was
thinking of a single character; you obviously aren't. However, I don't
have any problem with changing the wording.
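
Whatever wording is chosen, the intent is that "*" only ever consumes whole characters. One way to get that, sketched here under the assumption that both pattern and candidate name are valid UTF-8, is to decode the octets first and match on characters. The helper names are made up, and only "*" and literal characters are handled.

```python
def star_match(pattern: str, text: str) -> bool:
    """Match text against a pattern made only of literals and "*".

    Works on decoded characters, so "*" can never stop in the middle
    of a multi-octet UTF-8 sequence. (Hypothetical helper.)
    """
    if not pattern:
        return not text
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # "*" consumes zero or more whole characters.
        return any(star_match(rest, text[i:]) for i in range(len(text) + 1))
    return bool(text) and head == text[0] and star_match(rest, text[1:])


def star_match_octets(pattern: bytes, octets: bytes) -> bool:
    """Decode both sides first; invalid UTF-8 raises UnicodeDecodeError."""
    return star_match(pattern.decode("utf-8"), octets.decode("utf-8"))
```

For example, star_match_octets(b"a*b", bytes([0x61, 0xC2, 0xA3, 0x62])) succeeds because "*" takes the whole two-octet character, whereas an octet-level matcher could be tempted to split 0xC2 from 0xA3.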
>>[W5: the grammar does not treat \ in sets as special, just as another
>>character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
>>Are we happy with this ? It matches existing practice as I understand it,
>>and I'm inclined to keep it.]
>
>I agree.
See above: at the moment I think explicitly forbidding it would be
better, since nobody is expected to use it.
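
For anyone who wants to see the W5 behaviour as currently drafted (backslash is just another set member), here is a deliberately small sketch. It is not the full grammar: it assumes no ranges, no negation and no "*" or "?", and it closes a set at the first "]" after the set's first member. The function name is invented for the example.

```python
def match_with_sets(pattern: str, text: str) -> bool:
    """Match literals and simple [...] sets; backslash has no special meaning.

    Simplifications (assumptions of this sketch): no ranges, no negation,
    no "*" or "?". An unterminated set raises ValueError.
    """
    i = j = 0
    while i < len(pattern):
        if j >= len(text):
            return False
        if pattern[i] == "[":
            # The set ends at the first "]" after its first member.
            end = pattern.index("]", i + 2)
            if text[j] not in pattern[i + 1:end]:
                return False
            i = end + 1
        else:
            if pattern[i] != text[j]:
                return False
            i += 1
        j += 1
    return j == len(text)
```

With the pattern "a[b\]c]" the set is the two characters "b" and "\", followed by the literals "c" and "]", so both "abc]" and "a\c]" match and "a]c]" does not.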
>>[W6: the grammar does not treat , in sets as special.
>I agree. There is, in fact, NO existing practice, as I have explained in
>reply to Andrew.
Again, would we be better off forbidding comma entirely for now and
continuing this argument only when there's a need ?
>> this range consists of every character whose code lies
>> between the two characters in the range, inclusive. Thus "[a-dg]"
>> is equivalent to "[abcdg]"; each match any of the five characters "a",
>> "b", "c", "d", or "g". Note that the codes are always those of
>> ISO 10646, no matter what the local character set is.
>
>Hmmmm! Are we sure we know what the collating order is for arbitrary UTF-8
>characters?
Yes: we're not using any particular locale's collating order, but simply
the numerical order of ISO 10646 codes. So [@-$], where @ is the
sequence 0xC2 0xA3 and $ is the sequence 0xE1 0x94 0xB5, matches the
characters with ISO 10646 codes U+00A3 to U+1535 inclusive; that is,
0x1493 different codes.
You may feel a little more comfortable knowing that ISO 10646 and UTF-8
lexical orders are the same. So in this case the characters that are
matched are:
- first octet 0xC2, second octet 0xA3 or greater
- first octet 0xC3 to 0xE0 inclusive
- first octet 0xE1, second octet 0x93 or less
- first octet 0xE1, second octet 0x94, third octet 0xB5 or less
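
The arithmetic above can be checked mechanically. The following is just a sketch to confirm the numbers; the variable and function names are mine, and the sample code points in the ordering check are arbitrary.

```python
# Decode the two end points of the example range from their UTF-8 octets.
lo = bytes([0xC2, 0xA3]).decode("utf-8")        # U+00A3
hi = bytes([0xE1, 0x94, 0xB5]).decode("utf-8")  # U+1535

# Number of codes in the range, both ends included.
count = ord(hi) - ord(lo) + 1


def in_range(ch: str, first: str = lo, last: str = hi) -> bool:
    """Range membership by ISO 10646 code, independent of any locale."""
    return ord(first) <= ord(ch) <= ord(last)


# Sorting by code and sorting by UTF-8 octets give the same order.
samples = [chr(cp) for cp in (0x1536, 0x7F, 0x800, 0xA2, 0x1535, 0xA3, 0x7FF)]
same_order = sorted(samples) == sorted(samples, key=lambda c: c.encode("utf-8"))
```

Here count comes out as 0x1493, and same_order is true for these samples, which is the property that lets an implementation compare octets directly if it prefers.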
>> If the first char in a range has a higher code than the second one, the
>> characters represented by the range are determined by the implementation.
>> This must be done in a consistent manner, so that, for example,
>> "[d-a],[^d-a]" will match every possible character.
>
>>[W7: do we want to remove the consistency requirement ? This would mean
>>that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
>>that the two ranges in "[d-a][d-a]" might match different sets.]
>
>I should forbid it entirely in your disambiguating rules. Or else say that
>such a range never matches anything.
We had this discussion previously. The consensus was that all ranges
should match exactly one character, and that such ranges were allowed
but had no definition of which characters matched. That is, the existing
wording.
The only question left is the one I asked: do we require consistency or
not ?
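
To make the question concrete, here is a sketch of one consistent choice, namely treating a reversed range as containing nothing. This is only one of the interpretations the existing wording would permit, not a proposal, and the names are hypothetical.

```python
def range_members(first: str, last: str) -> range:
    """Codes covered by first-last.

    Assumption of this sketch: a reversed range such as d-a is empty.
    Other consistent choices would be equally permitted.
    """
    return range(ord(first), ord(last) + 1)


def in_set(ch: str, first: str, last: str, negated: bool = False) -> bool:
    """Test ch against a one-range set, [first-last] or [^first-last]."""
    hit = ord(ch) in range_members(first, last)
    return hit != negated
```

Because the same function decides both the plain and the negated set, every character matches exactly one of "[d-a]" and "[^d-a]", which is what the consistency requirement asks for.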
>> 9.4.1 LIST
>
>> If the optional wildmat parameter is specified, the list is
>>! limited to only the groups whose names match the wildmat. This
> ^^^
> those
>>! will normally be very efficient if the wildmat is a simple group
>>! name.
[etc.]
Noted.
- --
Clive D.W. Feather | Internet Expert | Work: <clive at demon.net>
Tel: +44 20 8371 1138 | Demon Internet | Home: <clive at davros.org>
Fax: +44 20 8371 1037 | Thus plc | Web: <http://www.davros.org>
Written on my laptop; please observe the Reply-To address
-----BEGIN PGP SIGNATURE-----
Version: PGPsdk version 1.7.1
iQEVAwUBOqieKiNAHP3TFZrhAQGEowf+PEYrxvFFbxxAEuOd5M5HgeVVTnJzEisS
pMQZHQLwORck8RkfcmXF8OmbPyi9Snb9LSpsg51YD17VSSNpNBrVLiZavhVR+08Q
mBpMy9CuMEb+VzJrRJf+5zMsmCiFKzOTNIyIDLq1cucNtXMnkvKwdYke//amLz5B
tgcodhwXomFDok34000sAwDzvtwkYhtiyngTAorLQruM48wUh/2L1u4F6kcBBC9B
D6QN6PZr7/q9CVzv4Wid6PjzeG7JoWTKjt6STf4GPpOsl/K99M85opYn4uAGPIAk
oizOwW5PKZiM/i7Ecy4PKuNVTh4nUJquhL1+89mYcZL7Sru6IPt3Sg==
=Chhg
-----END PGP SIGNATURE-----