[NNTP] Internationalisation

Clive D.W. Feather clive at demon.net
Fri Apr 8 09:29:46 PDT 2005


First draft. Comments welcome.

   10.  Internationalisation Considerations

   10.1  Introduction and historical situation

   RFC 977 [RFC977] was written at a time when internationalisation was
   not seen as a significant issue.  As such, it was written on the
   assumption that all communication would be in ASCII and use only a
   7-bit transport layer.

   Since then, Usenet and NNTP have spread throughout the world.  In the
   absence of standards for handling the issues of language and
   character sets, countries, newsgroup hierarchies, and individuals
   have all found different solutions which work for them but are not
   necessarily appropriate elsewhere.  For example, some have adopted a
   default 8-bit character set appropriate to their needs (such as
   ISO 8859-1 in Western Europe or KOI-8 in Russia), others have used
   ASCII (either US-ASCII or national variants) in headers but local 16-
   bit character sets in article bodies, and still others have gone for
   a combination of MIME and UTF-8.  With the increased use of MIME in
   email, it is becoming more common to find MIME headers identifying
   the character set of the body, but this is far from universal.

   The resulting confusion does not help interoperability.

   One point that has been generally accepted is that articles can
   contain octets with the top bit set, and NNTP is only expected to
   operate on 8-bit clean transport paths.

   10.2  This specification

   Part of the role of this present specification is to eliminate this
   confusion and promote interoperability as far as possible.  At the
   same time, it is necessary to accept the existence of the present
   situation and not gratuitously break existing implementations and
   arrangements, even if they are less than optimal.  Current practice
   has therefore been taken into consideration in producing this
   specification.

   The NNTP itself is extended from US-ASCII [ANSI1986] to UTF-8
   [RFC3629] in this specification.  Except in the specific areas
   discussed below, UTF-8 (which is a superset of ASCII) is mandatory
   and implementations MUST NOT use any other encoding.

   The major deviation from this requirement lies in the topic of
   articles and data derived from them.  As described in Section 3.6,
   articles consist of a set of headers and then a body.  While the
   names of headers (e.g.  "From" or "Subject") are limited to US-ASCII,
   some header values (and, of course, the article body) are generated
   by users using software which adopts local practices; for example, it
   may encode all text in ISO 8859-1 without including a MIME header to
   that effect.

   OUTSTANDING ISSUE

      Include references to MIME?  To 8859-1?  To KOI-8?

   In an ideal world it would be possible to declare such usage non-
   conforming and ignore it, but in practice any specification that
   attempted to do so would be ignored.  Therefore this version of NNTP
   allows this practice.  More specifically, while implementations
   SHOULD only allow the creation of new articles where the headers
   conform to UTF-8, where an article is obtained from an external
   source an implementation MAY pass it on, and derive data from it
   (such as the response to the HDR command), even though the article or
   the data is not valid UTF-8.  Implementations MUST transfer such
   articles and data correctly.  (Nevertheless, a client or server MAY
   elect not to post or forward the article if, after further
   examination of the article, it deems it inappropriate to do so.)
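
   As an illustration only, the distinction between newly created
   articles and articles obtained from an external source might be
   sketched as follows.  The function names and the "locally_created"
   flag are hypothetical and not part of this specification, and the
   sketch assumes that headers and body are separated by the first
   empty line.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (pure US-ASCII always does)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def split_headers(article: bytes) -> bytes:
    """Return the header portion: everything before the first empty line."""
    head, _sep, _body = article.partition(b"\r\n\r\n")
    return head


def accept_article(article: bytes, locally_created: bool) -> bool:
    """Hypothetical policy hook, not defined by the specification.

    Locally created articles are refused when their headers are not valid
    UTF-8 (the SHOULD above).  Articles from an external source are accepted
    regardless (the MAY above); the caller must then relay the octets
    unchanged rather than re-encoding them.
    """
    if locally_created:
        return is_valid_utf8(split_headers(article))
    return True
```

   Note that the sketch never transcodes: an article that is accepted
   is kept as the original octets, which is what allows it to be
   transferred correctly.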

   This requirement affects the ARTICLE (Section 6.2.1), BODY
   (Section 6.2.3), HDR (Section 8.5), HEAD (Section 6.2.2), IHAVE
   (Section 6.3.2), OVER (Section 8.3), and POST (Section 6.3.1)
   commands.

   The second area of deviation is the newsgroups list returned by the
   LIST NEWSGROUPS (Section 7.6.6) command.  The actual newsgroup name
   is required to be in UTF-8 - in practice, Usenet newsgroup names are
   almost all US-ASCII - but the descriptive text is normally generated
   according to the standards of the local hierarchy and, once again,
   may not conform to UTF-8.
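
   A client reading this list therefore cannot treat both fields alike.
   The following is a minimal sketch, assuming each line holds the
   newsgroup name, then whitespace, then the description; the function
   name is hypothetical.  The name is decoded strictly, while the
   description is kept as raw octets together with a flag saying
   whether it happens to be valid UTF-8.

```python
from typing import Tuple


def parse_newsgroups_line(line: bytes) -> Tuple[str, bytes, bool]:
    """Split one line into (name, raw description, description_is_utf8).

    Raises ValueError for an empty line and UnicodeDecodeError if the
    newsgroup name itself is not valid UTF-8.
    """
    parts = line.rstrip(b"\r\n").split(None, 1)  # split on the first whitespace run
    if not parts:
        raise ValueError("empty newsgroups line")
    name = parts[0].decode("utf-8")  # strict: the name is required to be UTF-8
    description = parts[1] if len(parts) > 1 else b""
    try:
        description.decode("utf-8")
        description_is_utf8 = True
    except UnicodeDecodeError:
        # The description follows local conventions; keep the octets as they are.
        description_is_utf8 = False
    return name, description, description_is_utf8
```

   How a description that is not valid UTF-8 is then displayed is a
   matter of local policy and is deliberately left out of the sketch.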

   The final deviation is the HELP (Section 7.2) command.  The help text
   that this returns is typically created by server operators and is not
   presented to normal users.  It does not appear profitable to put
   restrictions on this text.

   10.3  Outstanding issues

   10.3.1  Article format

   While the primary use of NNTP is for transmitting articles that
   conform to RFC 1036 [RFC1036] (Netnews articles), it is also used for
   other formats (see Appendix A).  It is therefore most appropriate
   that internationalisation issues related to article formats be
   addressed in the relevant specifications.  For Netnews articles, this
   is any successor to RFC 1036.  For email messages, it is RFC 2822
   [RFC2822].

   Of course, any article transmitted via NNTP needs to conform to this
   specification as well.

   10.3.2  Newsgroup names and descriptions

   Newsgroup names are required by this specification to be in UTF-8
   and, in practice, are almost always US-ASCII.  The spread of
   implementations that conform to this specification should suffice to
   encourage the phasing out of the few non-conforming names in use.

   Restricting newsgroup names to UTF-8 is not a complete solution to
   the issues, of course.  In particular, when new newsgroup names are
   created or a user is asked to enter a newsgroup name, some form of
   canonicalisation will need to take place.

   Newsgroup descriptions are a more difficult problem.  The pressures
   that have restricted newsgroup names to US-ASCII - essentially that
   it is more likely to remain unaltered during transmission - do not
   apply to descriptions.  More variation has therefore sprung up and
   there will be difficult problems involved in any transition to UTF-8.

   Since the primary use of NNTP is with Netnews, and since this
   information is normally distributed through specially formatted
   articles, it is recommended that these issues be addressed in any
   successor to RFC 1036.  In the meantime:

   o  servers SHOULD by default report to their administrator any use of
      character sets other than UTF-8 in the newsgroups list data (see
      Section 7.6.6);

   o  administrators of Netnews hierarchies SHOULD NOT permit the
      creation of newsgroups with names that are not US-ASCII, as any
      name that does not conform to the eventual specifications in this
      regard is likely to be a permanent source of interoperability
      issues.
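
   The two recommendations above could be checked mechanically.  The
   following is a sketch only: the function name and the report format
   are hypothetical, and the input is assumed to be the newsgroups list
   data as one byte string per line, with the newsgroup name as the
   first whitespace-delimited field.

```python
from typing import Iterable, List, Tuple


def audit_newsgroups(lines: Iterable[bytes]) -> List[Tuple[int, str]]:
    """Return (line number, problem) pairs to show to an administrator."""
    report = []
    for number, line in enumerate(lines, start=1):
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        if not fields[0].isascii():
            # Second recommendation: names should stay within US-ASCII.
            report.append((number, "name is not US-ASCII"))
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            # First recommendation: report character sets other than UTF-8.
            report.append((number, "line is not valid UTF-8"))
    return report
```

   A server might run such a check whenever the newsgroups list data is
   updated and pass any non-empty report to its administrator.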


   10.3.3  Other

   While the text of the HELP response remains an open issue, it is
   unclear whether there is benefit in attempting to solve it.

-- 
Clive D.W. Feather  | Work:  <clive at demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive at davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc            |                            |


