[NNTP] Internationalisation

Sat Apr 16 17:28:36 PDT 2005

Clive D W Feather <clive at demon.net> writes:

> First draft. Comments welcome.

I think you might be trying a bit too hard here to avoid putting any
constraints on anything.  I think we can be more succinct and a little
stronger.  I'll try to do this on a line-by-line basis, but to sum up, I
think we can say:

 o The character set of article bodies SHOULD be tagged in the article
   headers via some mechanism such as [MIME].

 o Generators of article headers are strongly encouraged to use a US-ASCII
   encoding such as [RFC2047] until such time as another approach has been
   standardized; 8-bit encodings (including UTF-8) are not forbidden by
   this standard but are likely to cause interoperability problems.

 o Generators of newsgroup descriptions are strongly encouraged to use
   US-ASCII or UTF-8 until a successor to [RFC1036] standardizes a
   particular approach.  8-bit encodings other than UTF-8 are not
   forbidden by this standard but are known to cause interoperability
   problems in practice.

 o Although UTF-8 is allowed for newsgroup names by this standard in
   anticipation of where the article format standard may go, newsgroup
   names SHOULD be restricted to US-ASCII for the time being until a
   successor to [RFC1036] standardizes some other approach.  Character
   sets other than US-ASCII and UTF-8 MUST NOT be used for newsgroup
   names.

 o In all cases, implementations which receive articles from other sources
   MUST correctly handle arbitrary 8-bit data in article headers, article
   bodies, and newsgroup descriptions and MAY pass such data along via the
   NNTP protocol even if it does not follow the above recommendations.

and in fact I would replace nearly all of section 10.2 with the above
text.
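
To make the header and newsgroup name bullets above concrete, here is a
rough Python sketch.  It is only an illustration under stated assumptions:
the function names are invented for this example and do not come from any
draft or implementation, and Python's standard email.header module stands
in for whatever RFC 2047 encoder a real news client would use.

```python
from email.header import Header, decode_header


def encode_header_value(text: str) -> str:
    """Hypothetical helper: return an RFC 2047 encoded-word form of text.

    The result contains only US-ASCII characters, which is the form the
    recommendation above encourages for article headers.
    """
    return Header(text, charset="utf-8").encode()


def classify_newsgroup_name(raw: bytes) -> str:
    """Hypothetical helper: classify the octets of a newsgroup name."""
    try:
        raw.decode("ascii")
        return "ascii"    # the recommended form for the time being
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"    # allowed, but discouraged for now
    except UnicodeDecodeError:
        return "invalid"  # neither US-ASCII nor UTF-8: MUST NOT be used


if __name__ == "__main__":
    print(encode_header_value("Grüße"))
    print(classify_newsgroup_name(b"comp.lang.c"))
    print(classify_newsgroup_name("fr.réseaux".encode("utf-8")))
    print(classify_newsgroup_name("fr.réseaux".encode("latin-1")))
```

The classification only looks at whether the octets decode; it does not
attempt any of the canonicalization questions raised later in this message.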

>    10.  Internationalisation Considerations

>    10.1  Introduction and historical situation

>    RFC 977 [RFC977] was written at a time when internationalisation was
>    not seen as a significant issue.  As such, it was written on the
>    assumption that all communication would be in ASCII and use only a
>    7-bit transport layer.

Agree with Charles that a ", although in practice all known NNTP
implementations are 8-bit clean" wouldn't be out of line here.

>    Since then, Usenet and NNTP have spread throughout the world.  In the
>    absence of standards for handling the issues of language and
>    character sets, countries, newsgroup hierarchies, and individuals
>    have all found different solutions which work for them but are not
>    necessarily appropriate elsewhere.  For example, some have adopted a
>    default 8-bit character set appropriate to their needs (such as
>    ISO8859-1 in Western Europe or KOI-8 in Russia), others have used
>    ASCII (either US-ASCII or national variants) in headers but local 16-
>    bit character sets in article bodies, and still others have gone for
>    a combination of MIME and UTF-8.

I wouldn't mention UTF-8 here.  There's very little use of UTF-8 in
practice on Usenet, and what little there is appears in bodies and is
tagged via MIME, so it's already covered by the MIME reference.

>    With the increased use of MIME in email, it is becoming more common
>    to find MIME headers identifying the character set of the body, but
>    this is far from universal.

>    The resulting confusion does not help interoperability.

>    One point that has been generally accepted is that articles can
>    contain octets with the top bit set, and NNTP is only expected to
>    operate on 8-bit clean transport paths.

This seems fine.

>    10.2  This specification

>    Part of the role of this present specification is to eliminate this
>    confusion and promote interoperability as far as possible.  At the
>    same time, it is necessary to accept the existence of the present
>    situation and not gratuitously break existing implementations and
>    arrangements, even if they are less than optimal.  Therefore current
>    practice has been taken into consideration in producing this
>    specification.

>    The NNTP itself is extended from US-ASCII [ANSI1986] to UTF-8
>    [RFC3629] in this specification.  Except in the specific areas
>    discussed below, UTF-8 (which is a superset of ASCII) is mandatory
>    and implementations MUST NOT use any other encoding.

This is a bit too chatty, I think.  I would say something like:

    This standard extends NNTP from US-ASCII [ANSI1986] to UTF-8
    [RFC3629], and for most portions of the protocol, UTF-8 (which is a
    superset of ASCII) is mandatory and implementations MUST NOT use any
    other encoding.  For article headers and bodies, use of MIME is
    strongly recommended.  However, given widely divergent existing
    practices, an attempt to require that all NNTP data meet a particular
    encoding and tagging standard at this time would be premature and
    unsuccessful.

    Accordingly, this protocol allows arbitrary 8-bit data in article
    headers, article bodies, and newsgroup descriptions, subject to the
    following recommendations:

and then insert my text from above.
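
As a sketch of what "allows arbitrary 8-bit data" could mean for an
implementation, the Python below keeps header values as raw octets and only
decodes them for display.  The function names and the Latin-1 display
fallback are assumptions made for this example, not anything the draft
specifies.

```python
from typing import Tuple


def split_header_line(line: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical helper: split a raw header line into (name, value).

    Both parts stay as bytes, so no octet is altered or rejected.
    """
    name, _, value = line.partition(b":")
    return name, value.strip()


def join_header_line(name: bytes, value: bytes) -> bytes:
    """Rebuild the header line for relaying, using the untouched octets."""
    return name + b": " + value


def display_text(value: bytes) -> str:
    """Best-effort text for showing to a user; never used when relaying."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every octet to some character,
        # so this cannot fail, though the result may look wrong.
        return value.decode("latin-1")


if __name__ == "__main__":
    raw = b"Subject: \xf0\xd2\xc9\xd7\xc5\xd4"  # untagged KOI8-R text
    name, value = split_header_line(raw)
    print(join_header_line(name, value) == raw)
    print(display_text(value))
```

The point of the separation is that a guess about the character set affects
only what a reader sees, never what gets passed along to the next server.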

Then drop everything up to:

>    This requirement affects the ARTICLE (Section 6.2.1), BODY
>    (Section 6.2.3), HDR (Section 8.5), HEAD (Section 6.2.2), IHAVE
>    (Section 6.3.2), OVER (Section 8.3), and POST (Section 6.3.1)
>    commands.

>    The second area of deviation is the newsgroups list returned by the
>    LIST NEWSGROUPS (Section 7.6.6) command.  The actual newsgroup name
>    is required to be in UTF-8 - in practice, Usenet newsgroup names are
>    almost all US-ASCII - but the descriptive text is normally generated
>    according to the standards of the local hierarchy and, once again,
>    may not conform to UTF-8.

Replace these two paragraphs with:

    The recommendations for article headers and bodies affect the ARTICLE
    (Section 6.2.1), BODY (Section 6.2.3), HDR (Section 8.5), HEAD
    (Section 6.2.2), IHAVE (Section 6.3.2), OVER (Section 8.3) and POST
    (Section 6.3.1) commands.  The recommendations for newsgroup
    descriptions affect the LIST NEWSGROUPS (Section 7.6.6) command.

Do we really need to allow deviation in HELP text?  I would tend to just
require it be UTF-8 and not worry about it.

>    10.3  Outstanding issues

>    10.3.1  Article format

>    While the primary use of NNTP is for transmitting articles that
>    conform to RFC 1036 [RFC1036] (Netnews articles), it is also used for
>    other formats (see Appendix A).  It is therefore most appropriate
>    that internationalisation issues related to article formats be
>    addressed in the relevant specifications.  For Netnews articles, this
>    is any successor to RFC 1036.  For email messages, it is RFC 2822
>    [RFC2822].

>    Of course, any article transmitted via NNTP needs to conform to this
>    specification as well.

This seems fine.

>    10.3.2  Newsgroup names and descriptions

I would drop most of this section; I think we're justifying ourselves too
much around newsgroup descriptions and the text I wrote above would cover
this better.  In particular, I think this:

>    o  servers SHOULD by default report to their administrator any use of
>       character sets other than UTF-8 in the newsgroups list data (see
>       Section 7.6.6);

isn't a good idea; what is the administrator going to do with that sort of
report?  And I think the newsgroup name issue is dealt with better by my
text above.

I would instead have a 10.3.2 section entitled Canonicalization that says
something like:

    Restricting newsgroup names to UTF-8 is not a complete solution.  When
    new newsgroup names are created or a user is asked to enter a
    newsgroup name, some form of canonicalization will need to take place.
    This specification does not attempt to define that canonicalization;
    servers are expected to match newsgroup names octet-by-octet for the
    time being.  Further work is needed in this area in conjunction with
    the article format standard.

    In the meantime, any implementation experimenting with UTF-8 newsgroup
    names is strongly cautioned that a future specification may require
    that those names be canonicalized when used with NNTP and that
    canonicalization may not be compatible with their experimental
    newsgroup names.
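
A small Python sketch of why octet-by-octet matching is not a complete
answer: two spellings of what a user would consider the same name can
differ in their UTF-8 octets.  The group name is invented, the helper name
is hypothetical, and Unicode NFC is used purely as one example of a
canonical form, not as the form any future specification would choose.

```python
import unicodedata


def same_group_octets(a: str, b: str) -> bool:
    """Hypothetical helper: compare names the way servers do today."""
    return a.encode("utf-8") == b.encode("utf-8")


# The same visible name, written two ways.
composed = "fr.caf\u00e9"      # "é" as a single code point
decomposed = "fr.cafe\u0301"   # "e" followed by a combining acute accent

if __name__ == "__main__":
    # Octet-by-octet matching treats these as different newsgroups.
    print(same_group_octets(composed, decomposed))

    # After normalizing both to one form (NFC here, as an assumption),
    # they compare equal.
    print(same_group_octets(
        unicodedata.normalize("NFC", composed),
        unicodedata.normalize("NFC", decomposed),
    ))
```

A group created under the decomposed spelling would no longer match its own
name if a later standard required the composed form, which is the
incompatibility the caution above is about.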

Thank you very much for doing a first draft!  I hope this is helpful.

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>