[NNTP] Re: New NNTP drafts approaching IETF Last Call

Wed Mar 23 03:45:39 PST 2005

In <Pine.WNT.4.63.0503221333220.5332 at Tomobiki-Cho.CAC.Washington.EDU> Mark Crispin <MRC at CAC.Washington.EDU> writes:

>The following text on page 9:

>    The term "character" means a single Unicode code point and
>    implementations are not required to carry out normalisation.  Thus
>    U+0084 (A-dieresis) is one character while U+0041 U+0308 (A composed
>    with dieresis) is two; the two need not be treated as equivalent.

>is problematic and is unlikely to pass muster.

>Welcome to the wonderful world of stringprep.

No, stringprep is for implementing the idea of "early normalization", as
espoused by Martin Duerst and the Unicode people. He who generates a UTF-8
string that may have to be recognized by software is responsible for
normalizing. He who transports it can then assume it is normalized and is
not obliged to confirm the normalization (at much expense) en route. If
the generator of the string screws up, then "garbage in, garbage out"
applies.
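
To make the division of labour concrete, here is a minimal sketch in
Python. It assumes NFC as the normalization form purely for illustration
(the choice of form is not settled anywhere in this thread), and the
function name prepare_at_source is hypothetical. The precomposed form used
is U+00C4, LATIN CAPITAL LETTER A WITH DIAERESIS.

```python
import unicodedata


def prepare_at_source(text: str) -> bytes:
    """Normalize once, where the string is generated, then encode as UTF-8.

    NFC is an assumption made for this sketch, not a form chosen by any
    NNTP or Usefor document.
    """
    return unicodedata.normalize("NFC", text).encode("utf-8")


precomposed = "\u00c4"   # one code point: A with diaeresis
decomposed = "A\u0308"   # two code points: A + combining diaeresis

# Unnormalized, the two spellings produce different octets on the wire.
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# Normalized by the generator, they become identical, so nothing
# downstream needs to repeat the work.
assert prepare_at_source(precomposed) == prepare_at_source(decomposed)
```

The transport never calls prepare_at_source; if the generator skips it,
the mismatch simply propagates.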

So as regards article contents, it is up to the standards for the article
format to require such normalizations as may be necessary. If NNTP itself
requires certain formats within commands, then it would be in order to
specify normalizations if such were needed, and likewise for any critical
strings in responses. However, there is nothing in the present standard
where these are needed AFAICS.

For example, most text in responses, even if not written in English, is
for human consumption only. Message-ids are in any case restricted to
ASCII.

The only potential worry is newsgroup-names. Currently, these are
restricted to ASCII, but the Usefor WG is charged with producing some form
of I18N newsgroup-names after it has completed its present tasks. The
format of these is yet to be decided. The canonical form on the wire may
be UTF-8 (in which case it will need to be normalized, presumably with a
something-prep). Or it may be some encoding into ASCII. Or whatever.

As far as NNTP is concerned, it has to recognize newsgroup-names but,
following the principle of early normalization, it is entitled to assume
that any newsgroup-name presented to it is already normalized, and so a
simple comparison of the octets presented is sufficient to compare two
names, or to determine whether a given name matches a given wildmat (that
should be made clear in the text, if it is not already so). We have
already taken care to ensure that the syntax and semantics of wildmats are
UTF-8 proof.
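
As a rough sketch of what that means for a server (Python again; the
function names and the newsgroup-names are hypothetical, and the matcher
is deliberately simplified to a single pattern in which "*" matches any
run of characters and "?" matches one character, with no comma-separated
lists or "!" negation):

```python
import re


def same_group(a: bytes, b: bytes) -> bool:
    """Compare two newsgroup-names octet for octet; no normalization here."""
    return a == b


def matches_pattern(name: bytes, pattern: bytes) -> bool:
    """Match a name against one simplified wildmat-style pattern.

    Both arguments are decoded as UTF-8 so that "?" consumes one code
    point rather than one octet. Nothing is normalized.
    """
    text = name.decode("utf-8")
    parts = []
    for ch in pattern.decode("utf-8"):
        if ch == "*":
            parts.append(".*")      # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None


normalized = "example.\u00c4".encode("utf-8")     # precomposed form
unnormalized = "example.A\u0308".encode("utf-8")  # decomposed form

assert same_group(normalized, normalized)
assert not same_group(normalized, unnormalized)   # garbage in, garbage out
assert matches_pattern(normalized, b"example.?")
assert not matches_pattern(unnormalized, b"example.?")  # two code points
```

The unnormalized name fails both checks, which is exactly the outcome that
early normalization assigns to the generator of the string, not to NNTP.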

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl at clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5