ietf-nntp Wildmats

Clive D. W. Feather clive at on-the-train.demon.co.uk
Fri Mar 9 01:11:07 PST 2001


-----BEGIN PGP SIGNED MESSAGE-----

 In message <G9voLo.D0I at clw.cs.man.ac.uk>, Charles Lindsey 
 <chl at clw.cs.man.ac.uk> writes
>>    wildmat-escape = wildmat-hide / wildmat-special
>
>>    wildmat-hide = "\" wildmat-hidden
>
>>    wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F / UTF-8-non-ascii
>>        ; exclude 0-9, A-Z, a-z
>
>>[W2: this allows any non-alphanumeric to be escaped with \. Is this too
>>general; should it be limited to * , ? [ and perhaps !. My inclination is
>>to leave it as shown here.]
>
>>[W3: if leaving it generic, do we want to include or exclude non-ascii
>>characters ? The above says include, but I'm inclined to exclude them.]
>
>I think you got your 'include's and 'exclude's crossed in there. Anyway,
>my view is that, if you are going to allow '\' escapes at all, then you
>allow them on _everything_; no arbitrary rules to have to remember. I think
>that is what you are saying.

 No, my personal preference was to allow only the ASCII punctuation marks 
 to be escaped. That is:
     wildmat-hidden = %x21-2F / %x3A-40 / %x5B-60 / %x7B-7F
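
 As a rough sketch of what that narrower rule covers (Python; the names
 HIDDEN_RANGES and is_escapable are illustrative only and appear nowhere
 in the draft), the four ranges work out to the 32 ASCII punctuation
 marks plus DEL (%x7F):

```python
import string

# The four ranges from the ABNF line above, as inclusive (low, high) pairs.
HIDDEN_RANGES = [(0x21, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7F)]


def is_escapable(ch: str) -> bool:
    """True if the single character ch lies in one of the four ranges."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in HIDDEN_RANGES)


# Collect every ASCII character the rule would allow after a backslash.
escapable = {chr(c) for c in range(0x80) if is_escapable(chr(c))}

# Compare against Python's own list of ASCII punctuation.
extra = escapable - set(string.punctuation)
missing = set(string.punctuation) - escapable
```

 Here "extra" holds only DEL and "missing" is empty, so digits, letters
 and all non-ASCII characters are excluded.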

>OTOH, none of { * , ? [ \ } can ever occur in a newsgroup-name, certainly
>not in [USEFOR] and not in any current usage either (1036 seems silent on
>the issue), so it might be argued that wildmat-escape could be omitted
>entirely.
 [...]
>>[W4: wildmat-specials were \s for white space and \u1234 and \U12345678
>>for general character escapes. Do we need these any more or shall we drop
>>them ? My inclination is to drop them for now; we may want to resurrect
>>them if we do more generic wildmats, but they aren't needed for newsgroup
>>names.]
>
>Yes, drop them.

 If we do this, then I recommend that we ban backslash entirely in 
 wildmats, including in [...] sets. This leaves it free to be used in 
 the future if we ever need an escape mechanism, and in the meantime 
 eliminates some potential confusion.

 Which reminds me: *somewhere* we need text along the lines of:

     If a parameter that is specified as a wildmat does not meet the
     syntax of 5.1.1, the NNTP server MAY place some interpretation on it
     (not specified by this document) or otherwise MUST generate a 501
     response.

>>    * either of the two immediately preceding characters is a "-" that can
>>      be parsed as a wildmat-range-range (this determination is made from
>                                    ^^^^^
>                                   delim

 Oops.

>>  5.1.2 Formalised syntax
>>[Warning: this ain't pretty, but I think it's correct and unambiguous.]
>
>*You* know that there is a Van Wijngaarden Grammar hiding in there, and
>*I* know that there is a Van Wijngaarden Grammar hiding in there, but for
>the unfortunate masses who do not know, this is just too complicated and
>counterproductive

 Yes, this may be the case.

 I wrote this section because there had been some disquiet with the idea 
 that the syntax was subject to statements like those I placed at the 
 end. I therefore did this as an alternative.

>(though it might give a clue as to how to implement it).

 TBH, I wouldn't do it that way; using the first grammar gives a more 
 natural approach.

>Methinks you have been sitting in too many delayed trains :-) .

 Actually I did it in the office.

 Some torturing of yacc has led me to the annoying conclusion that the 
 grammar for sets is actually LR(2), not LR(1).

>I recommend to leave it all out.

 Does anyone want this section to be kept ?

>>[In general: -p- means "(start of) a positive case", -t- means "(start of)
>                                                      ^^^
>                                                     -n-

 Oops again.

>>  5.2 Wildmat semantics
>
>
>>  "*" matches zero or more characters. It can match an empty string, but
>>  it cannot match only part of a UTF-8 sequence that consists of more than
>>  one octet.
>
>ITYM "it cannot match a subsequence of a UTF-8 sequence that is not
>aligned to the character boundaries". Otherwise you are forbidding a match
>with a sequence of genuine UTF-8 characters that is embedded in a longer
>sequence of UTF-8 characters.

 I think we have different views of what "UTF-8 sequence" means. I was 
 thinking of a single character; you obviously aren't. However, I don't 
 have any problem with changing the wording.
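
 To illustrate the point about boundaries, here is a small sketch
 (Python; star_prefixes and is_valid_utf8 are made-up helper names, and
 this is not a wildmat matcher). It lists every prefix that a leading
 "*" could consume, first working in characters and then in octets:

```python
def star_prefixes(seq):
    """Every prefix of seq that a leading "*" could consume, shortest first."""
    return [seq[:i] for i in range(len(seq) + 1)]


def is_valid_utf8(octets: bytes) -> bool:
    """True if octets decode cleanly, i.e. end on a character boundary."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "caf\u00e9"  # four characters; the last is 0xC3 0xA9 in UTF-8

by_char = star_prefixes(name)                    # split points between characters
by_octet = star_prefixes(name.encode("utf-8"))   # split points between octets

# Octet-level prefixes that stop in the middle of a multi-octet character.
split_mid_character = [p for p in by_octet if not is_valid_utf8(p)]
```

 Working in octets produces one extra candidate that ends after 0xC3
 alone. That is the kind of match the wording is meant to forbid;
 decoding first and matching on characters avoids it by construction.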

>>[W5: the grammar does not treat \ in sets as special, just as another
>>character, so "a[b\]c]" matches the two strings "abc]" and "a\c]".
>>Are we happy with this ? It matches existing practice as I understand it,
>>and I'm inclined to keep it.]
>
>I agree.

 See above: at the moment I think explicitly forbidding it would be 
 better, since nobody is expected to use it.

>>[W6: the grammar does not treat , in sets as special.

>I agree. There is, in fact, NO existing practice, as I have explained in
>reply to Andrew.

 Again, would we be better off forbidding comma entirely for now and 
 continuing this argument only when there's a need ?

>>  this range consists of every character whose code lies
>>  between the two characters in the range, inclusive. Thus "[a-dg]"
>>  is equivalent to "[abcdg]"; each match any of the five characters "a",
>>  "b", "c", "d", or "g". Note that the codes are always those of
>>  ISO 10646, no matter what the local character set is.
>
>Hmmmm! Are we sure we know what the collating order is for arbitrary UTF-8
>characters?

 Yes: we're not using any particular locale's collating order, but simply 
 the numerical order of ISO 10646 codes. So [@-$], where @ is the 
 sequence 0xC2 0xA3 and $ is the sequence 0xE1 0x94 0xB5, matches the 
 characters with ISO 10646 codes U+00A3 to U+1535 inclusive; that is, 
 0x1493 different codes.

 You may feel a little more comfortable knowing that ISO 10646 and UTF-8 
 lexical orders are the same. So in this case the characters that are 
 matched are:
 - first octet 0xC2, second octet 0xA3 or greater
 - first octet 0xC3 to 0xE0 inclusive
 - first octet 0xE1, second octet 0x93 or less
 - first octet 0xE1, second octet 0x94, third octet 0xB5 or less
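
 For anyone who wants to check the arithmetic, the following sketch
 (Python; all names are illustrative) decodes the two endpoints, counts
 the codes in the range, and compares the octet-based description above
 with a plain comparison of ISO 10646 codes:

```python
low = bytes([0xC2, 0xA3]).decode("utf-8")         # first character of the range
high = bytes([0xE1, 0x94, 0xB5]).decode("utf-8")  # last character of the range
count = ord(high) - ord(low) + 1                  # inclusive range size


def in_range(ch: str) -> bool:
    """Membership test using ISO 10646 codes only."""
    return ord(low) <= ord(ch) <= ord(high)


def in_range_by_octets(o: bytes) -> bool:
    """The same test expressed on UTF-8 octets, following the four cases above."""
    return (
        (o[0] == 0xC2 and o[1] >= 0xA3)
        or (0xC3 <= o[0] <= 0xE0)
        or (o[0] == 0xE1 and o[1] <= 0x93)
        or (o[0] == 0xE1 and o[1] == 0x94 and o[2] <= 0xB5)
    )


# Both tests should agree on every code tried (surrogates are not reached).
agree = all(
    in_range(chr(c)) == in_range_by_octets(chr(c).encode("utf-8"))
    for c in range(0x3000)
)

# Sorting by code and sorting by encoded octets should give the same order.
sample = [chr(c) for c in (0x41, 0xA2, 0xA3, 0x7FF, 0x800, 0x1535, 0x1536, 0xFFFD, 0x10000)]
same_order = sorted(sample, key=ord) == sorted(sample, key=lambda ch: ch.encode("utf-8"))
```

 A matcher can therefore compare either decoded codes or raw octet
 sequences and get the same answer for a range.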

>>  If the first char in a range has a higher code than the second one, the
>>  characters represented by the range are determined by the implementation.
>>  This must be done in a consistent manner, so that, for example,
>>  "[d-a],[^d-a]" will match every possible character.
>
>>[W7: do we want to remove the consistency requirement ? This would mean
>>that, for example, "[d-a]" and "[^d-a]" might both match the same set, or
>>that the two ranges in "[d-a][d-a]" might match different sets.]
>
>I should forbid it entirely in your disambiguating rules. Or else say that
>such a range never matches anything.

 We had this discussion previously. The consensus was that all ranges 
 should match exactly one character, and that such ranges were allowed 
 but had no definition of which characters matched. That is, the existing 
 wording.

 The only question left is the one I asked: do we require consistency or 
 not ?
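
 To make the consistency requirement concrete, here is a sketch of one
 possible implementation choice (Python; set_matches is a hypothetical
 helper, and treating a reversed range as empty is purely an assumption
 for illustration, since the wording deliberately leaves the choice to
 the implementation):

```python
def set_matches(first: str, last: str, ch: str, negated: bool = False) -> bool:
    """Does the one-range set [first-last] (or [^first-last]) match ch?

    Assumption: a reversed range such as d-a simply contains no
    characters. Other choices are equally permitted by the draft.
    """
    hit = ord(first) <= ord(ch) <= ord(last)  # always False when first > last
    return hit != negated


sample = ["a", "b", "d", "z", "0", "\u00a3", "\u1535"]

# Consistency: each character is matched by exactly one of [d-a] and [^d-a].
consistent = all(
    set_matches("d", "a", ch) != set_matches("d", "a", ch, negated=True)
    for ch in sample
)
```

 Because the negated form is computed from the same membership test,
 "[d-a],[^d-a]" covers every character whichever set the implementation
 picks for "d-a". Dropping the requirement would allow the two forms to
 be computed independently.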

>>  9.4.1 LIST
>
>>  If the optional wildmat parameter is specified, the list is
>>! limited to only the groups whose names match the wildmat. This
>                   ^^^
>                  those
>>! will normally be very efficient if the wildmat is a simple group
>>! name.

 [etc.]

 Noted.

- -- 
Clive D.W. Feather    | Internet Expert      | Work: <clive at demon.net>
Tel: +44 20 8371 1138 | Demon Internet       | Home: <clive at davros.org>
Fax: +44 20 8371 1037 | Thus plc             | Web:  <http://www.davros.org>
Written on my laptop; please observe the Reply-To address

-----BEGIN PGP SIGNATURE-----
Version: PGPsdk version 1.7.1

iQEVAwUBOqieKiNAHP3TFZrhAQGEowf+PEYrxvFFbxxAEuOd5M5HgeVVTnJzEisS
pMQZHQLwORck8RkfcmXF8OmbPyi9Snb9LSpsg51YD17VSSNpNBrVLiZavhVR+08Q
mBpMy9CuMEb+VzJrRJf+5zMsmCiFKzOTNIyIDLq1cucNtXMnkvKwdYke//amLz5B
tgcodhwXomFDok34000sAwDzvtwkYhtiyngTAorLQruM48wUh/2L1u4F6kcBBC9B
D6QN6PZr7/q9CVzv4Wid6PjzeG7JoWTKjt6STf4GPpOsl/K99M85opYn4uAGPIAk
oizOwW5PKZiM/i7Ecy4PKuNVTh4nUJquhL1+89mYcZL7Sru6IPt3Sg==
=Chhg
-----END PGP SIGNATURE-----


