ietf-nntp Comments on draft-15.pdf

Wed Jan 2 03:31:10 PST 2002

Stan Barber said:
>> P53-17	I find it odd that the commands DATE, HELP, NEWGROUPS and NEWNEWS
>> 	are described _after_ the CONCLUSION step, and even after the
>> 	Extensions.  I would recommend some section re-ordering here (but
>> 	I don't think I want a wholesale re-ordering as some others have
>> 	suggested).
> 
> These commands are here because they are not usually used in a session. 
> So, they are defined after all the commands that are usually used in a 
> session.

That is extraordinarily arrogant of you.

In the last 57,641 NNTP sessions made by romana.davros.org, the NEWGROUPS
command was used at least once in every session, the NEWNEWS command at
least twice, and the GROUP, LAST, NEXT, STAT, IHAVE, LIST ACTIVE.TIMES,
and LIST DISTRIB.PATS commands (all part of your "usually used" section)
*never*.

This machine is not alone. I strongly suspect that the Demon servers see
far more NEWNEWS commands than NEXT commands.

You don't know what the "usual" pattern of use is. Therefore ordering the
document on that basis is wrong. The ordering should be logical:

- greeting step
- mandatory commands
- conclusion step        ) either way round
- extensions             ) would be logical

>> P61+8,14
>> 	The UTF-8 syntax in USEFOR is:
[...]
>> 	The difference is that USEFOR has excluded more octets
>> 	that are not supposed to occur in UTF-8, including all those which
>> 	would belong to Unicode "surrogates". Do we want to make the two
>> 	drafts identical at this point, for the removal of all confusion?
> 
> It might be more appropriate for one or the other group to publish a 
> UTF-8 definition as its own RFC and then have both groups refer to it. 

UTF-8 is *defined* by the Unicode Consortium in cooperation with ISO.

What we're talking about here is a syntax notation. To write the syntax
so as to include exactly the valid sequences and no others is almost
impossible, especially when you look at character semantics. You *always*
need to say that, despite the syntax, it is not permitted to use sequences
forbidden by the formal definition.

So the question is how much effort to put in. For example:

* The minimal unambiguous syntax is:

    UTF-8-non-ascii = %xC0-FF *%x80-BF

* A trivial change is:

    UTF-8-non-ascii = %xC2-FD 1*5%x80-BF

which eliminates many invalid sequences.

* You can expose the basic structure of UTF-8 using the syntax we have
at present:

  UTF-8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6 
  UTF8-1 = %x80-BF
  UTF8-2 = %xC0-DF UTF8-1 
  UTF8-3 = %xE0-EF 2UTF8-1
  UTF8-4 = %xF0-F7 3UTF8-1
  UTF8-5 = %xF8-FB 4UTF8-1
  UTF8-6 = %xFC-FD 5UTF8-1

Again, C0 can be changed to C2.

* You can eliminate all the "wrong length" sequences (those that encode a
value in more octets than it needs):

  UTF-8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6 
  UTF8-1 = %x80-BF
  UTF8-2 = %xC2-DF UTF8-1 
  UTF8-3 = %xE0 %xA0-BF  UTF8-1 / %xE1-EF 2UTF8-1
  UTF8-4 = %xF0 %x90-BF 2UTF8-1 / %xF1-F7 3UTF8-1
  UTF8-5 = %xF8 %x88-BF 3UTF8-1 / %xF9-FB 4UTF8-1
  UTF8-6 = %xFC %x84-BF 4UTF8-1 / %xFD 5UTF8-1

* You can eliminate "surrogates" by changing one line of that:

  UTF8-3 = %xE0 %xA0-BF UTF8-1 / %xE1-EC 2UTF8-1 /
           %xED %x80-9F UTF8-1 / %xEE-EF 2UTF8-1

* You can eliminate all values outside Unicode's declared limit of
U+10FFFF:

  UTF-8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
  UTF8-1 = %x80-BF
  UTF8-2 = %xC2-DF UTF8-1 
  UTF8-3 = %xE0 %xA0-BF UTF8-1 / %xE1-EC 2UTF8-1 /
           %xED %x80-9F UTF8-1 / %xEE-EF 2UTF8-1
  UTF8-4 = %xF0 %x90-BF 2UTF8-1 / %xF1-F3 3UTF8-1 /
           %xF4 %x80-8F 2UTF8-1

The choice is ours!
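
To make the trade-off concrete, here is a minimal sketch in Python that
transcribes two of the grammars above into byte-level regular expressions:
the "trivial change" form and the strictest form (no overlong encodings, no
surrogates, nothing above U+10FFFF). The names LOOSE, STRICT, matches and
python_accepts are mine, for illustration only; they come from neither
draft. Python's own UTF-8 decoder is used purely as an independent check on
single non-ASCII characters, not as a statement of what either draft
requires.

```python
import re

# UTF8-1 in the grammars above: a single continuation octet.
TAIL = rb"[\x80-\xBF]"

# "Trivial change" form: UTF-8-non-ascii = %xC2-FD 1*5%x80-BF
LOOSE = re.compile(rb"[\xC2-\xFD]" + TAIL + rb"{1,5}")

# Strictest form, transcribed rule by rule.
UTF8_2 = rb"[\xC2-\xDF]" + TAIL
UTF8_3 = (
    rb"\xE0[\xA0-\xBF]" + TAIL            # no overlong 3-octet forms
    + rb"|[\xE1-\xEC]" + TAIL + rb"{2}"
    + rb"|\xED[\x80-\x9F]" + TAIL         # stops short of the surrogates
    + rb"|[\xEE-\xEF]" + TAIL + rb"{2}"
)
UTF8_4 = (
    rb"\xF0[\x90-\xBF]" + TAIL + rb"{2}"  # no overlong 4-octet forms
    + rb"|[\xF1-\xF3]" + TAIL + rb"{3}"
    + rb"|\xF4[\x80-\x8F]" + TAIL + rb"{2}"  # nothing above U+10FFFF
)
STRICT = re.compile(rb"(?:" + UTF8_2 + rb"|" + UTF8_3 + rb"|" + UTF8_4 + rb")")


def matches(pattern, octets):
    """True if the whole octet string is one UTF-8-non-ascii per pattern."""
    return pattern.fullmatch(octets) is not None


def python_accepts(octets):
    """True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


SAMPLES = [
    b"\xC3\xA9",          # U+00E9, valid
    b"\xC3\xA9\xA9",      # one continuation octet too many
    b"\xE0\x80\x80",      # overlong encoding of U+0000
    b"\xED\xA0\x80",      # surrogate U+D800
    b"\xF4\x8F\xBF\xBF",  # U+10FFFF, the upper limit
    b"\xF4\x90\x80\x80",  # U+110000, just past the limit
]

if __name__ == "__main__":
    for s in SAMPLES:
        print(s.hex(" "), matches(LOOSE, s), matches(STRICT, s), python_accepts(s))
```

On these samples the loose pattern accepts everything, while the strict
pattern agrees with the decoder in every case. That is the point made
above: the cheap syntax still needs a prose rule forbidding sequences the
formal definition excludes, and the longer syntax needs much less of one.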

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:  +44 20 8371 1138
Internet Expert     | Home:  <clive at davros.org>  | Fax:  +44 20 8371 4037
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            |