[NNTP] Internationalisation
Clive D.W. Feather
clive at demon.net
Fri Apr 8 09:29:46 PDT 2005
First draft. Comments welcome.
10. Internationalisation Considerations
10.1 Introduction and historical situation
RFC 977 [RFC977] was written at a time when internationalisation was
not seen as a significant issue. As such, it was written on the
assumption that all communication would be in ASCII and use only a
7-bit transport layer.
Since then, Usenet and NNTP have spread throughout the world. In the
absence of standards for handling the issues of language and
character sets, countries, newsgroup hierarchies, and individuals
have all found different solutions which work for them but are not
necessarily appropriate elsewhere. For example, some have adopted a
default 8-bit character set appropriate to their needs (such as
ISO 8859-1 in Western Europe or KOI-8 in Russia), others have used
ASCII (either US-ASCII or national variants) in headers but local 16-
bit character sets in article bodies, and still others have gone for
a combination of MIME and UTF-8. With the increased use of MIME in
email, it is becoming more common to find MIME headers identifying
the character set of the body, but this is far from universal.
The resulting confusion does not help interoperability.
One point that has been generally accepted is that articles can
contain octets with the top bit set, and NNTP is only expected to
operate on 8-bit clean transport paths.
10.2 This specification
Part of the role of this present specification is to eliminate this
confusion and promote interoperability as far as possible. At the
same time, it is necessary to accept the existence of the present
situation and not gratuitously break existing implementations and
arrangements, even if they are less than optimal. Therefore current
practice has been taken into consideration while producing this
specification.
NNTP itself is extended from US-ASCII [ANSI1986] to UTF-8
[RFC3629] in this specification. Except in the specific areas
discussed below, UTF-8 (which is a superset of ASCII) is mandatory
and implementations MUST NOT use any other encoding.
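The superset relationship can be illustrated with a minimal Python
sketch. It is not part of the draft, and the command line used is
only a sample chosen for illustration. Any US-ASCII text encodes to
the same octets under UTF-8, while a non-ASCII character becomes a
sequence of octets that all have the top bit set:

```python
# Illustrative only: US-ASCII text is octet-for-octet identical in UTF-8.
command = "GROUP misc.test\r\n"  # sample command line, assumed for illustration

ascii_octets = command.encode("ascii")
utf8_octets = command.encode("utf-8")
same_octets = ascii_octets == utf8_octets

# A non-ASCII character needs more than one octet in UTF-8,
# and every one of those octets has the top bit set.
e_acute = "\u00e9".encode("utf-8")
top_bits_set = all(octet >= 0x80 for octet in e_acute)
```

This is why an existing ASCII-only implementation already produces
valid UTF-8 for everything it sends.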
The major deviation from this requirement lies in the topic of
articles and data derived from them. As described in Section 3.6,
articles consist of a set of headers and then a body. While the
names of headers (e.g. "From" or "Subject") are limited to US-ASCII,
some header values (and, of course, the article body) are generated
by users using software which adopts local practices; for example, it
may encode all text in ISO 8859-1 without including a MIME header
to that effect.
OUTSTANDING ISSUE
Include references to MIME? To 8859-1? To KOI-8?
In an ideal world it would be possible to declare such usage non-
conforming and ignore it, but in practice any specification that
attempted to do so would be ignored. Therefore this version of NNTP
allows this practice. More specifically, while implementations
SHOULD only allow the creation of new articles where the headers
conform to UTF-8, where an article is obtained from an external
source an implementation MAY pass it on, and derive data from it
(such as the response to the HDR command), even though the article or
the data is not valid UTF-8. Implementations MUST transfer such
articles and data correctly. (Nevertheless, a client or server MAY
elect not to post or forward the article if, after further
examination of the article, it deems it inappropriate to do so.)
This requirement affects the ARTICLE (Section 6.2.1), BODY
(Section 6.2.3), HDR (Section 8.5), HEAD (Section 6.2.2), IHAVE
(Section 6.3.2), OVER (Section 8.3), and POST (Section 6.3.1)
commands.
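One possible reading of the rules above is sketched below in Python.
This is an assumption-laden illustration rather than a required
design: the function names are hypothetical, and the only facts taken
from the text are that new articles SHOULD have UTF-8 headers and
that articles from external sources are transferred unaltered.

```python
from typing import Iterable


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def may_create_article(header_lines: Iterable[bytes]) -> bool:
    """Hypothetical policy for new articles: every header line must be UTF-8."""
    return all(is_valid_utf8(line) for line in header_lines)


def relay_article(raw_article: bytes) -> bytes:
    """Hypothetical relay step for an article obtained from an external source.

    The octets are returned untouched, whether or not they are valid
    UTF-8, so that the article is transferred correctly.
    """
    return raw_article
```

For instance, a header line containing the single octet 0xE9 (e-acute
in ISO 8859-1) fails the check for new articles, yet the same octets
would still be relayed unchanged if they arrived from another server.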
The second area of deviation is the newsgroups list returned by the
LIST NEWSGROUPS (Section 7.6.6) command. The actual newsgroup name
is required to be in UTF-8 - in practice, Usenet newsgroup names are
almost all US-ASCII - but the descriptive text is normally generated
according to the standards of the local hierarchy and, once again,
may not conform to UTF-8.
The final deviation is the HELP (Section 7.2) command. The help text
that this returns is typically created by server operators and is not
presented to normal users. It does not appear profitable to put
restrictions on this text.
10.3 Outstanding issues
10.3.1 Article format
While the primary use of NNTP is for transmitting articles that
conform to RFC 1036 [RFC1036] (Netnews articles), it is also used for
other formats (see Appendix A). It is therefore most appropriate
that internationalisation issues related to article formats be
addressed in the relevant specifications. For Netnews articles, this
is any successor to RFC 1036. For email messages, it is RFC 2822
[RFC2822].
Of course, any article transmitted via NNTP needs to conform to this
specification as well.
10.3.2 Newsgroup names and descriptions
Newsgroup names are required by this specification to be in UTF-8
and, in practice, are almost always US-ASCII. The spread of
implementations that conform to this specification should suffice to
encourage the phasing out of the few non-conforming names in use.
Restricting newsgroup names to UTF-8 is not a complete solution to
the issues, of course. In particular, when new newsgroup names are
created or a user is asked to enter a newsgroup name, some form of
canonicalisation will need to take place.
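To see why UTF-8 alone is not enough, consider the following Python
sketch. It is only an illustration: the sample string is invented,
and the use of Unicode Normalization Form C is an assumption made
here, not a choice made by this draft. Two names that look identical
to a user can differ as octets until they are brought to a common
form:

```python
import unicodedata

composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two spellings render identically but are different octet strings.
differ_as_octets = composed.encode("utf-8") != decomposed.encode("utf-8")


def canonical(name: str) -> str:
    """Hypothetical canonical form; NFC is an assumption, not a draft rule."""
    return unicodedata.normalize("NFC", name)


match_after_normalising = canonical(composed) == canonical(decomposed)
```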
Newsgroup descriptions are a more difficult problem. The pressures
that have restricted newsgroup names to US-ASCII - essentially that
it is more likely to remain unaltered during transmission - do not
apply to descriptions. More variation has therefore sprung up and
there will be difficult problems involved in any transition to UTF-8.
Since the primary use of NNTP is with Netnews, and since this
information is normally distributed through specially formatted
articles, it is recommended that these issues be addressed in any
successor to RFC 1036. In the meantime:
o servers SHOULD by default report to their administrator any use of
character sets other than UTF-8 in the newsgroups list data (see
Section 7.6.6);
o administrators of Netnews hierarchies SHOULD NOT permit the
creation of newsgroups with names that are not US-ASCII, as any
name that does not conform to the eventual specifications in this
regard is likely to be a permanent source of interoperability
issues.
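The two recommendations above could be checked along the following
lines. This Python sketch is hypothetical: the function names are
invented, and it assumes each line of newsgroups list data is
available as raw octets with the newsgroup name as the first
whitespace-delimited field.

```python
from typing import Iterable, List, Tuple


def find_non_utf8_lines(lines: Iterable[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (name, raw line) for each list line that is not valid UTF-8."""
    offending = []
    for line in lines:
        try:
            line.decode("utf-8")  # strict decoding rejects malformed sequences
        except UnicodeDecodeError:
            # The name is taken to be the first whitespace-delimited field.
            fields = line.split(None, 1)
            name = fields[0] if fields else b""
            offending.append((name, line))
    return offending


def name_is_us_ascii(name: bytes) -> bool:
    """Return True if every octet of the name is in the US-ASCII range."""
    return name.isascii()
```

A server might run the first function when loading its newsgroups
list and report the result to its administrator, while a hierarchy
administrator might apply the second before approving a new name.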
10.3.3 Other
While the text of the HELP response remains an open issue, it is
unclear whether there is benefit in attempting to solve it.
--
Clive D.W. Feather | Work: <clive at demon.net> | Tel: +44 20 8495 6138
Internet Expert | Home: <clive at davros.org> | Fax: +44 870 051 9937
Demon Internet | WWW: http://www.davros.org | Mobile: +44 7973 377646
Thus plc | |