ietf-nntp wildmat routines and text

Fri Jul 28 11:19:38 PDT 2000

Clive D W Feather <clive at demon.net> writes:
> Russ Allbery said:

>> Depends on how simple and stupid the algorithm they're using is.  The
>> "simplest thing that could possibly work" is to just backslash every
>> character without caring if it's alphanumeric or not.  It complicates
>> the meaning of backslash any way you cut it; the question is whether
>> anyone was taking advantage of that meaning of backslash.

> Well, I'm coming from the background of C programming, where it's well
> known that backslash has two meanings:
> (1) before an alphanumeric, introduces some special feature (e.g. "\n");
> (2) before other characters, removes the special meaning of that character.

Right, and this is also common in Perl.  I'm just not sure that everyone
using wildmats would have the same expectations.  The INN wildmat man page
has always pretty clearly said that you can escape absolutely anything
with a backslash.
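
To make the difference concrete, here is a minimal Python sketch of the
"backslash every character" quoting mentioned above, assuming the INN rule
that a backslash always makes the next character literal.  The names
quote_wildmat and unquote_wildmat are hypothetical, not part of INN.

```python
def quote_wildmat(text: str) -> str:
    """Quote a literal string by putting a backslash before every character."""
    return "".join("\\" + ch for ch in text)


def unquote_wildmat(pattern: str) -> str:
    """Undo the quoting, assuming backslash means 'next character is literal'."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 1
            if i == len(pattern):
                # A wildmat ending in a backslash is ill-formed.
                raise ValueError("wildmat ends in a backslash")
        out.append(pattern[i])
        i += 1
    return "".join(out)
```

Under the C-style reading, the "\n" that this quoting produces for a
literal "n" would no longer mean "n", which is exactly the sort of existing
usage that would break.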

>> It also makes the wildmat parser mildly more complicated (because now
>> it has to know how to do ISO 10646 to UTF-8 conversion

> Though it's pretty trivial.

True.
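
For reference, the conversion being called trivial amounts to a few shifts
and masks.  A minimal sketch, assuming code points are limited to U+10FFFF
and that surrogates are not checked for; the function name encode_utf8 is
hypothetical:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one ISO 10646 code point as UTF-8 octets (no surrogate check)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        # One octet: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range (assumed limit U+10FFFF)")
```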

>> and has to handle more ill-formed wildmats; one of the rather nice
>> things about wildmats is that outside of character classes, I think the
>> only ill-formed wildmat is one ending in a backslash).

> or a malformed UTF-8 sequence.

At least for INN, I don't intend to treat malformed UTF-8 sequences as an
error.

UTF-8 has a bunch of *really* nice properties for doing things like this.
One of them is that the UTF-8 "continuation" octets (the second and
subsequent octets of a multibyte character) fall into the infrequently
used portion of ISO 8859-1, at least for our purposes (they're mostly
punctuation and similar things, not the accented characters, which look
like the start of a multibyte character).  My plan at the moment is to do
matching an octet at a time rather than worrying about converting from
UTF-8 to Unicode (except for metacharacters, of course), and for the "skip
past a character" operation just stop as soon as an octet that doesn't
look like a continuation octet is encountered.  That means that ISO 8859-1
high-bit octets will mostly be treated correctly as single characters
since the next octet won't look like a continuation.
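
The "skip past a character" step described above can be sketched in a few
lines of Python.  This is an illustration under the stated plan, not INN's
actual code; the names is_continuation and skip_char are hypothetical.

```python
def is_continuation(octet: int) -> bool:
    """True for octets of the form 10xxxxxx (0x80 through 0xBF)."""
    return (octet & 0xC0) == 0x80


def skip_char(data: bytes, pos: int) -> int:
    """Return the index just past the character that starts at pos.

    Always consumes the octet at pos, then any continuation octets that
    follow it.  No attempt is made to validate the UTF-8 sequence.
    """
    pos += 1
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


utf8 = "a\u00e9b".encode("utf-8")      # b"a\xc3\xa9b"
latin1 = "a\u00e9b".encode("latin-1")  # b"a\xe9b"
print(skip_char(utf8, 1))    # 3: both octets of the UTF-8 e-acute
print(skip_char(latin1, 1))  # 2: the single ISO 8859-1 octet
```

The case that goes wrong is an ISO 8859-1 high-bit octet followed by one in
the 0x80 to 0xBF range: skip_char(b"\xe9\xbbx", 0) returns 2, treating the
two ISO 8859-1 characters as one.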

Using that approach, 95% or so of the wildmats actually used on ISO 8859-1
text or containing ISO 8859-1 characters will do the right thing, which is
rather valuable for the transition period while a lot of people are still
doing "just send 8" while using ISO 8859-1 or similar character sets.
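
To show the transition-period behaviour, here is a toy octet-at-a-time
matcher in Python.  It is only a sketch: it handles "*", "?", and backslash
but not character classes, and the names wildmat and skip_char are
hypothetical rather than taken from INN.

```python
def skip_char(data: bytes, pos: int) -> int:
    """Skip one octet plus any following continuation octets (10xxxxxx)."""
    pos += 1
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def wildmat(pattern: bytes, text: bytes) -> bool:
    """Match text against pattern, comparing literals one octet at a time."""
    def match(p: int, t: int) -> bool:
        while p < len(pattern):
            c = pattern[p]
            if c == ord("*"):
                # Try the rest of the pattern at each character boundary.
                p += 1
                while True:
                    if match(p, t):
                        return True
                    if t >= len(text):
                        return False
                    t = skip_char(text, t)
            if t >= len(text):
                return False
            if c == ord("?"):
                # "?" consumes one whole character, not one octet.
                t = skip_char(text, t)
                p += 1
                continue
            if c == ord("\\"):
                p += 1
                if p >= len(pattern):
                    return False  # ill-formed: ends in a backslash
                c = pattern[p]
            if text[t] != c:
                return False
            p += 1
            t += 1
        return t == len(text)

    return match(0, 0)


print(wildmat(b"caf?", "caf\u00e9".encode("utf-8")))    # True
print(wildmat(b"caf?", "caf\u00e9".encode("latin-1")))  # True
```

The same "caf?" pattern matches whether the text was sent as UTF-8 or as
raw ISO 8859-1, which is the point of the approach.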

I don't recommend that this approach be in the standard, of course, so
long as it's permitted (say, by stating that the match status of any
wildmat containing malformed UTF-8 sequences is undefined or
implementation-defined, or some more IETF-like phrasing of that).

> Would a way to increase consensus be to say that [ in a wildmat
> introduces implementation-defined behaviour, and leave it at that ?

I don't think so.  I'd rather define character classes; they're
complicated, but they're useful.

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>