[NNTP] Internationalisation

Mon Apr 11 04:20:55 PDT 2005

In <20050408162946.GW94541 at finch-staff-1.thus.net> "Clive D.W. Feather" <clive at demon.net> writes:

>First draft. Comments welcome.

I think this pudding has been somewhat over-egged :-) .

Generally, the less we say on this issue, the less likely it is to come
back to haunt us later.

>   10.  Internationalisation Considerations

>   10.1  Introduction and historical situation

>   RFC 977 [RFC977] was written at a time when internationalisation was
>   not seen as a significant issue.  As such, it was written on the
>   assumption that all communication would be in ASCII and use only a
>   7-bit transport layer.

Just to forestall the Bruce Lillys of this world, a parenthetical remark
to the effect that all known current implementations are nevertheless 8bit
clean would help.

>   Since then, Usenet and NNTP have spread throughout the world.  In the
>   absence of standards for handling the issues of language and
>   character sets, countries, newsgroup hierarchies, and individuals
>   have all found different solutions which work for them but are not
>   necessarily appropriate elsewhere.  For example, some have adopted a
>   default 8-bit character set appropriate to their needs (such as
>   ISO8859-1 in Western Europe or KOI-8 in Russia), others have used
>   ASCII (either US-ASCII or national variants) in headers but local 16-
>   bit character sets in article bodies, and still others have gone for
>   a combination of MIME and UTF-8.  With the increased use of MIME in
>   email, it is becoming more common to find MIME headers identifying
>   the character set of the body, but this is far from universal.

Again, whilst all of these deviations have been seen somewhere, they are
not as common as this text would suggest. Successors to RFC 1036, such as
Usefor or its proposed I18N extension, may yet decide to try to put the
genie back in the bottle (or they may not, or they may weasel their way
around the problem). So you need to be completely neutral regarding such
possibilities, and not say anything which could be taken as encouraging
(or discouraging for that matter) such deviations.

In particular, I doubt any such I18N extension is going to give any
comfort at all to those who would use strange charsets in bodies without
proper MIME headers.

>   The resulting confusion does not help interoperability.

>   One point that has been generally accepted is that articles can
>   contain octets with the top bit set, and NNTP is only expected to
>   operate on 8-bit clean transport paths.

Agreed.

>   10.2  This specification

>   Part of the role of this present specification is to eliminate this
>   confusion and promote interoperability as far as possible.  At the
>   same time, it is necessary to accept the existence of the present
>   situation and not gratuitously break existing implementations and
>   arrangements, even if they are less than optimal.  Therefore current
>   practice has been taken into consideration in producing this
>   specification.

I think I would rather say:

Whilst this specification has been designed to place no gratuitous
obstacles to the continued transport of articles which exceed the
specifications in RFC 1036 in such ways, it should not be read as
condoning such practices (which is a matter properly left to future
extensions of RFC 1036).

>   The NNTP itself is extended from US-ASCII [ANSI1986] to UTF-8
>   [RFC3629] in this specification.  Except in the specific areas
>   discussed below, UTF-8 (which is a superset of ASCII) is mandatory
>   and implementations MUST NOT use any other encoding.

>   The major deviation from this requirement lies in the topic of
>   articles and data derived from them.  As described in Section 3.6,
>   articles consist of a set of headers and then a body.  While the
>   names of headers (e.g.  "From" or "Subject") are limited to US-ASCII,
>   some header values (and, of course, the article body) are generated
>   by users using software which adopts local practices; for example, it
>   may encode all text in ISO 8859-1 without including a MIME header
>   to that effect.

... software which may have adopted other charsets and/or practices ...

>   OUTSTANDING ISSUE

>      Include references to MIME?  To 8859-1?  To KOI-8?

>   In an ideal world it would be possible to declare such usage non-
>   conforming and ignore it, but in practice any specification that
>   attempted to do so would be ignored.  Therefore this version of NNTP
>   allows this practice.  More specifically, while implementations
>   SHOULD only allow the creation of new articles where the headers
>   conform to UTF-8, where an article is obtained from an external
>   source an implementation MAY pass it on, and derive data from it
>   (such as the response to the HDR command), even though the article or
>   the data is not valid UTF-8.  Implementations MUST transfer such
>   articles and data correctly.  (Nevertheless, a client or server MAY
>   elect not to post or forward the article if, after further
>   examination of the article, it deems it inappropriate to do so.)

I think I would omit the first two sentences.
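
For what it is worth, the part of that paragraph worth keeping (pass the
octets on untouched, whether or not they are valid UTF-8) can be sketched
in Python. The function names below are hypothetical and come neither from
the draft nor from any real server; the sketch only shows that data can be
classified as UTF-8 or not without being altered on the way through.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the octets form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def relay_unchanged(data: bytes) -> bytes:
    """Round-trip header data through str without losing any octet.

    errors="surrogateescape" maps each undecodable octet to a lone
    surrogate on decode and restores the original octet on encode.
    """
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Illustrative header values only: the same text in two encodings.
latin1_subject = "Subject: caf\u00e9".encode("latin-1")  # not valid UTF-8
utf8_subject = "Subject: caf\u00e9".encode("utf-8")      # valid UTF-8
```

An implementation might use the first function to decide whether to warn
its administrator, and still hand on exactly the octets it received.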

>   This requirement affects the ARTICLE (Section 6.2.1), BODY
>   (Section 6.2.3), HDR (Section 8.5), HEAD (Section 6.2.2), IHAVE
>   (Section 6.3.2), OVER (Section 8.3), and POST (Section 6.3.1)
>   commands.

>   The second area of deviation is the newsgroups list returned by the
>   LIST NEWSGROUPS (Section 7.6.6) command.  The actual newsgroup name
>   is required to be in UTF-8 - in practice, Usenet newsgroup names are
>   almost all US-ASCII - but the descriptive text is normally generated
>   according to the standards of the local hierarchy and, once again,
>   may not conform to UTF-8.

Again, you need to allow for the possibility that the I18N extension to
Usefor may go in for some encoding of newsgroup-names into ASCII, rather
than expressing them directly in UTF-8 (though I think you have pretty
well precluded any other ways of handling them, which will not please the
Chinese :-( ).

At the time that Usefor decided to postpone the I18N of newsgroup-names to
a future document, there were mutterings about such encodings, even
involving the ghastly punycode. I am sure that you and I agree that would
be a terrible approach to take, but we still need to take a neutral stance
on it in this NNTP standard.

>   The final deviation is the HELP (Section 7.2) command.  The help text
>   that this returns is typically created by server operators and is not
>   presented to normal users.  It does not appear profitable to put
>   restrictions on this text.

By which you mean that you do not care if it is in KOI-8?

>   10.3  Outstanding issues

>   10.3.1  Article format

>   While the primary use of NNTP is for transmitting articles that
>   conform to RFC 1036 [RFC1036] (Netnews articles), it is also used for
>   other formats (see Appendix A).  It is therefore most appropriate
>   that internationalisation issues related to article formats be
>   addressed in the relevant specifications.  For Netnews articles, this
>   is any successor to RFC 1036.  For email messages, it is RFC 2822
>   [RFC2822].

Agreed, except that there could also be extensions/successors to RFC 2822.

>   Of course, any article transmitted via NNTP needs to conform to this
>   specification as well.

>   10.3.2  Newsgroup names and descriptions

>   Newsgroup names are required by this specification to be in UTF-8
>   and, in practice, are almost always US-ASCII.  The spread of
>   implementations that conform to this specification should suffice to
>   encourage the phasing out of the few non-conforming names in use.

ITYM those that conform to the successors and extensions mentioned above.

>   Restricting newsgroup names to UTF-8 is not a complete solution to
>   the issues, of course.  In particular, when new newsgroup names are
>   created or a user is asked to enter a newsgroup name, some form of
>   canonicalisation will need to take place.

I think you need to make it clear here that implementations of this new
NNTP standard MUST consider two newsgroup-names to refer to the same group
IFF they are represented by the same sequence of octets. Thus such
canonicalization as may be needed is the responsibility of the clients, or
of the machinery (undefined by this standard) that admits new
newsgroups to the active file of the server.
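
To make the point concrete, here is a minimal Python sketch of the
octet-for-octet rule. The group name is made up, and the use of Unicode
NFC as the canonical form is purely my assumption for illustration; no
particular canonicalisation is specified anywhere yet.

```python
import unicodedata


def same_newsgroup(a: bytes, b: bytes) -> bool:
    """Server-side rule: identical octet sequences, nothing else."""
    return a == b


def canonicalise(name: str) -> bytes:
    """Hypothetical client-side step: NFC-normalise, then encode as UTF-8."""
    return unicodedata.normalize("NFC", name).encode("utf-8")


composed = "fr.caf\u00e9"      # e-acute as a single code point
decomposed = "fr.cafe\u0301"   # 'e' followed by a combining acute accent

# The two spellings look alike to a user but differ as octets, so the
# server treats them as different groups unless something upstream has
# canonicalised them first.
raw_match = same_newsgroup(composed.encode("utf-8"), decomposed.encode("utf-8"))
canonical_match = same_newsgroup(canonicalise(composed), canonicalise(decomposed))
```

Here raw_match is False and canonical_match is True, which is exactly why
the canonicalisation has to happen before the name ever reaches the server.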

>   Newsgroup descriptions are a more difficult problem.  The pressures
>   that have restricted newsgroup names to US-ASCII - essentially that
>   it is more likely to remain unaltered during transmission - do not
>   apply to descriptions.  More variation has therefore sprung up and
>   there will be difficult problems involved in any transition to UTF-8.

Hmmmm! Might it not be possible to allow the latitude you have permitted
for the texts of articles to apply here also? I am not sure. You certainly
need to be able to extract a valid list of newsgroup-names from a LIST
NEWSGROUPS response, even if the remainder of the descriptions looks like
gibberish, which pretty well limits you to UTF-8 (unless the
newsgroup-names are encoded in ASCII - yech!).
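
A rough Python sketch of what I have in mind follows. It assumes each
line of the response is the newsgroup-name, then spaces or tabs, then the
description; the function name and the sample group are hypothetical. The
name is decoded strictly as UTF-8, while the description is kept as raw
octets so that it cannot get in the way, whatever charset it is in.

```python
import re


def split_newsgroups_line(line: bytes):
    """Split one LIST NEWSGROUPS line into (name, description).

    The name must decode as UTF-8 (UnicodeDecodeError is raised if it
    does not); the description is returned as raw octets, unexamined.
    """
    parts = re.split(rb"[ \t]+", line.rstrip(b"\r\n"), maxsplit=1)
    name = parts[0].decode("utf-8")
    description = parts[1] if len(parts) > 1 else b""
    return name, description


# Sample line with a KOI8-R description, which is not valid UTF-8.
koi8_description = "\u0422\u0435\u0441\u0442".encode("koi8_r")
sample = b"relcom.test\t" + koi8_description + b"\r\n"
name, description = split_newsgroups_line(sample)
```

The list of names survives intact even though the description would look
like gibberish to anything expecting UTF-8.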

>   Since the primary use of NNTP is with Netnews, and since this
>   information is normally distributed through specially formatted
>   articles, it is recommended that these issues be addressed in any
>   successor to RFC 1036.  In the meantime:

>   o  servers SHOULD by default report to their administrator any use of
>      character sets other than UTF-8 in the newsgroups list data (see
>      Section 7.6.6);

>   o  administrators of Netnews hierarchies SHOULD NOT permit the
>      creation of newsgroups with names that are not US-ASCII, as any
>      name that does not conform to the eventual specifications in this
>      regard is likely to be a permanent source of interoperability
>      issues.

Do those actually add anything to what has already been said earlier?

>   10.3.3  Other

>   While the text of the HELP response remains an open issue, it is
>   unclear whether there is benefit in attempting to solve it.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl at clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5