ietf-nntp UTF-8 syntax

Mon Jan 7 01:21:27 PST 2002

-----BEGIN PGP SIGNED MESSAGE-----

Okay, first a primer for those who aren't up to date.

Unicode defines far more than 256 characters (it has a 20.087 bit coding
space). To allow these characters to be represented in an 8 bit stream,
there is a standard encoding method called UTF-8. At its very simplest,
this has the form:

    %x00-7F / %xC0-FF *%x80-BF

That is, characters 0 to 127 are encoded as themselves, while all other
characters are encoded as a sequence of octets with the top bit set. Bit
6 of each such octet indicates whether it is the first octet in the
sequence (bit 6 set) or a subsequent octet (bit 6 clear).
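
As a rough illustration of that simplest form, here is a Python sketch that groups an octet stream into sequences using nothing but the top two bits. The name split_sequences is mine, not from any draft, and putting a stray continuation octet in a group of its own is an assumption of this sketch rather than anything the syntax says.

```python
def split_sequences(data: bytes) -> list[bytes]:
    """Group octets per the simplest form: %x00-7F / %xC0-FF *%x80-BF."""
    sequences: list[bytes] = []
    for octet in data:
        is_continuation = (octet & 0xC0) == 0x80  # top bit set, bit 6 clear
        if is_continuation and sequences and sequences[-1][0] >= 0xC0:
            # Subsequent octet: attach it to the sequence already in progress.
            sequences[-1] += bytes([octet])
        else:
            # ASCII octet, or a first octet (bits 7 and 6 both set).
            # A continuation octet with no first octet before it also ends
            # up here, as a group of its own (a choice made by this sketch).
            sequences.append(bytes([octet]))
    return sequences
```

For example, the octets %x41 %xC3 %xA9 %x42 come out as three groups: %x41, then %xC3 %xA9, then %x42.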

While this would suffice as a definition for our purposes, UTF-8 goes
further by using the first octet to also encode the length of the
sequence. So the syntax can be further narrowed down to:

    %x00-7F /
    %xC0-DF 1%x80-BF /
    %xE0-EF 2%x80-BF /
    %xF0-F7 3%x80-BF /
    %xF8-FB 4%x80-BF /
    %xFC-FD 5%x80-BF

Note how we now have a number of invalid sequences, such as the triplet
%xC8 %x9F %xAA (the first octet announces a two-octet sequence, so the
third octet has nothing to belong to). These are never generated and can
never appear in valid UTF-8.
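
To make the length rule concrete, here is a Python sketch of a checker for the narrowed syntax. The function names are invented for the example, and it checks only the lengths and the continuation octets, nothing more.

```python
def sequence_length(first: int) -> int:
    """Total octets announced by a first octet; 0 if it cannot start a sequence."""
    if first <= 0x7F:
        return 1
    if 0xC0 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF7:
        return 4
    if 0xF8 <= first <= 0xFB:
        return 5
    if 0xFC <= first <= 0xFD:
        return 6
    return 0  # a continuation octet (%x80-BF), or %xFE / %xFF


def matches_length_syntax(data: bytes) -> bool:
    """True if data is a concatenation of sequences of the announced lengths."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        # Every octet after the first must exist and lie in %x80-BF.
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += n
    return True
```

With this, %xC8 %x9F passes, while %xC8 %x9F %xAA fails because the trailing %xAA cannot start a sequence.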

In actual fact, there are four further classes of octet sequence that
are forbidden by the formal specification of UTF-8. These are:

(1) sequences that encode a value greater than 0x10FFFF;
(2) sequences that encode the values 0xD800 to 0xDFFF, since these are
    surrogates, used only in pairs within a different encoding called
    UTF-16;
(3) sequences that encode the same value as another, shorter, sequence
    (for example, the value 0x0080 is encoded by all of the sequences:
        %xC2 %x80
        %xE0 %x82 %x80
        %xF0 %x80 %x82 %x80
        %xF8 %x80 %x80 %x82 %x80
        %xFC %x80 %x80 %x80 %x82 %x80
    but only the first of these is permitted);
(4) sequences that encode the values 0xFDD0 to 0xFDEF, or that encode
    values such that (V & 0xFFFE) == 0xFFFE (these codes are permitted
    for internal use but not during data interchange).
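
For anyone who wants to play with these four classes, here is a Python sketch that decodes a single sequence to its raw value and reports which class, if any, it falls into. The names raw_value, MIN_VALUE and forbidden_class are made up for the example, and the input is assumed to be one sequence that already has the right length for its first octet.

```python
def raw_value(seq: bytes) -> int:
    """Decode one sequence to its integer value, with no validity checks."""
    n = len(seq)
    if n == 1:
        return seq[0]
    # The first octet carries (7 - n) payload bits; mask off the length prefix.
    value = seq[0] & (0x7F >> n)
    for octet in seq[1:]:
        value = (value << 6) | (octet & 0x3F)  # six payload bits per octet
    return value


# Smallest value that genuinely needs a sequence of each length.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def forbidden_class(seq: bytes) -> int:
    """Return 1-4 for the forbidden classes listed above, or 0 if none applies."""
    v = raw_value(seq)
    if v > 0x10FFFF:
        return 1
    if 0xD800 <= v <= 0xDFFF:
        return 2
    if v < MIN_VALUE[len(seq)]:
        return 3  # a shorter sequence encodes the same value
    if 0xFDD0 <= v <= 0xFDEF or (v & 0xFFFE) == 0xFFFE:
        return 4
    return 0
```

All five encodings of 0x0080 listed under (3) decode to the same raw value, but only %xC2 %x80 gets class 0; the other four are reported as class 3.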

The first three of these are nice large blocks, and a simple change to
the syntax eliminates them:

   UTF-8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1 = %x80-BF
   UTF8-2 = %xC2-DF UTF8-1
   UTF8-3 = %xE0 %xA0-BF UTF8-1 / %xE1-EC 2UTF8-1 /
            %xED %x80-9F UTF8-1 / %xEE-EF 2UTF8-1
   UTF8-4 = %xF0 %x90-BF 2UTF8-1 / %xF1-F3 3UTF8-1 /
            %xF4 %x80-8F 2UTF8-1
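
The grammar above translates almost mechanically into a regular expression over bytes. Here is a Python sketch of that translation, using the standard re module; the names UTF8_NON_ASCII and is_valid_non_ascii are mine, and the function tests exactly one sequence at a time.

```python
import re

# One alternative per branch of UTF8-2, UTF8-3 and UTF8-4 above.
UTF8_NON_ASCII = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"            # UTF8-2
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"       # UTF8-3, no overlong forms
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"       # UTF8-3, no 0xD800 to 0xDFFF
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"    # UTF8-4, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"    # UTF8-4, nothing above 0x10FFFF
)


def is_valid_non_ascii(seq: bytes) -> bool:
    """True if seq is exactly one sequence matching UTF-8-non-ascii."""
    return UTF8_NON_ASCII.fullmatch(seq) is not None
```

This accepts %xC2 %x80 and rejects %xE0 %x82 %x80, %xED %xA0 %x80 and %xF4 %x90 %x80 %x80. It still accepts %xEF %xBF %xBE, since class (4) is not covered by this syntax.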

The last one is a bit more complicated, because these codes are rather
scattered.
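
To show just how scattered, here is a Python sketch that tests for class (4) values and lists the ranges they occupy. The names is_noncharacter and noncharacter_ranges are invented for the example; "noncharacter" is the Unicode term for these values.

```python
def is_noncharacter(value: int) -> bool:
    """True for the class (4) values: 0xFDD0-0xFDEF and every xxFFFE / xxFFFF."""
    return 0xFDD0 <= value <= 0xFDEF or (value & 0xFFFE) == 0xFFFE


def noncharacter_ranges() -> list[tuple[int, int]]:
    """Inclusive (first, last) ranges of class (4) values up to 0x10FFFF."""
    ranges = [(0xFDD0, 0xFDEF)]
    # The coding space up to 0x10FFFF splits into 17 blocks of 0x10000
    # values, and the last two values of each block are excluded.
    for block in range(17):
        base = block << 16
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges
```

That is 18 separate ranges covering only 66 values in total, which is why this class does not reduce to a couple of tidy syntax rules the way the first three do.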

However, we also have the problem that implementations should be
prepared for incoming sequences that are invalid in one way or another.
Unicode says that they must be removed from the data stream in some way
or other - this could be simple removal, or software can attempt to spot
"damaged" data and repair it in some way. There is also a code 0xFFFD
(encoded as %xEF %xBF %xBD) which is intended to act as a replacement for
anything unacceptable.
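
As an illustration of the two simplest approaches, here is a Python sketch that leans on the error handlers of Python's built-in UTF-8 decoder. The function names are mine. Note that Python's decoder is only a stand-in here: it accepts the class (4) values, and how many replacement characters it substitutes for a longer run of bad octets is its own choice, not something taken from this message.

```python
REPLACEMENT = "\uFFFD"  # encodes as %xEF %xBF %xBD


def drop_illegal(data: bytes) -> bytes:
    """Simple removal: silently discard octets the decoder rejects."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def replace_illegal(data: bytes) -> bytes:
    """Substitute the replacement character for octets the decoder rejects."""
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

Given the octets for "ab", a stray %xAA, then "cd", drop_illegal returns "abcd" and replace_illegal returns "ab" %xEF %xBF %xBD "cd". Valid input passes through both unchanged.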

- ----

On the one hand, we have the general "be liberal in what you accept, be
conservative in what you send" rule. On the other hand, we don't want to
get too complicated in our specification. On the third hand, we want to
address the major issues. Here is how I would handle this.

In the syntax, replace all the UTF-8 stuff with:

    UTF-8-non-ascii = 1*%x80-FF ; see notes below

but then add a section at the bottom:

    Note: valid UTF-8 sequences all match the syntax:

    UTF-8-valid-non-ascii = %xC2-DF 1UTF8-1 / %xE0 %xA0-BF 1UTF8-1 /
                            %xE1-EC 2UTF8-1 / %xED %x80-9F 1UTF8-1 /
                            %xEE-EF 2UTF8-1 / %xF0 %x90-BF 2UTF8-1 /
                            %xF1-F3 3UTF8-1 / %xF4 %x80-8F 2UTF8-1
    UTF8-1 = %x80-BF

    Any sequence of bytes matching UTF-8-non-ascii but not
    UTF-8-valid-non-ascii is an "illegal sequence". See section 14.5.

Then add a new section to "Security Considerations":

  14.5 UTF-8 issues

    The UTF-8 specification [99] permits only certain sequences of
    octets with the high bit set. Other sequences are "illegal". The
    Unicode standard identifies a number of security issues related
    to illegal sequences and forbids their generation by conforming
    implementations.

    NNTP clients and servers MUST NOT generate illegal sequences. They
    SHOULD detect such sequences and take some appropriate action. This
    could include:
    - closing the connection
    - generating a 501 response code
    - replacing illegal sequences by a "guessed" valid sequence (based
      on properties of the UTF-8 encoding)
    - replacing illegal sequences by the sequence %xEF %xBF %xBD, which
      encodes the "replacement character".

    [99] http://www.unicode.org

>Do you believe that no more changes to the BNF concerning UTF8 will be required
>after this change?

Yes.

- -- 
Clive D.W. Feather    | Internet Expert      | Work: <clive at demon.net>
Tel: +44 20 8371 1138 | Demon Internet       | Home: <clive at davros.org>
Fax: +44 20 8371 1037 | Thus plc             | Web:  <http://www.davros.org>
Written on my laptop; please observe the Reply-To address

-----BEGIN PGP SIGNATURE-----
Version: PGPsdk version 1.7.1

iQEVAwUBPDlolSNAHP3TFZrhAQG50wf+JEyj3GBIK634ILtklRWczPC5uKnJki9J
HzZxLSHe2w53YrrcTBGOkQu0qmhN0AgQSLoAnT+P4v8UtemSm8m2gxdiSCeIxoDo
mJBGZ+dWfmWlvOaqTbSYJSRDoKD+ICtMKXhp18zGchnjHsK/acPcNoNPWcgOPb5W
BJje4jOgC3nCB2r9Cuw/zn47soQwqq+PmLO4l19C6pNRxwmtdlqAZSTouV9clIw/
ojtDpiSD31ctxBdi+g3GVaLSDAku2FtIQsKnWtjwBO6S5j1tCxK+voLApS2Bxt6l
yEfASkNFj6gmLJ1HZlzDuD/ckzOfVaHKAH14IKjo5R2HykBv+jiA4g==
=oL9w
-----END PGP SIGNATURE-----