ietf-nntp CHARSET in nntp

Sun Oct 5 14:28:50 PDT 1997

On Wed, 1 Oct 1997, John Myers wrote:

> I'd certainly like to echo Brian Hernacki's sentiments, plus add some
> more information.

I very much agree with Brian and John.

> Use of UTF-8 is strongly recommended, both by the report of the IAB
> Character Set Workshop (RFC 2130) and the proposed IETF Policy on
> Character Sets and Languages (draft-alvestrand-charset-policy-01.txt)
> 
> Of course, one cannot Just-Send-UTF-8 in an existing deployed
> protocol--one needs to negotiate the ability to use anything besides the
> existing US-ASCII.  It does, however, simplify things greatly to make
> the only charset that can be negotiated be UTF-8.

It depends :-). There are certainly places in protocols where an
upgrade to UTF-8 is possible without negotiation. There are two
reasons for this. The first is that UTF-8 is completely
ASCII-compatible, i.e. everything that is ASCII stays ASCII, and
nothing that looks like ASCII is anything other than ASCII.
The second reason is that UTF-8 can easily be distinguished
from other character encodings by some simple and highly
specific heuristics. This is particularly important in
cases where, up to now, other 8-bit encodings were used against
the specs.
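
To illustrate the two properties above, here is a minimal Python
sketch. The function name looks_like_utf8 is made up for this example,
and it relies on Python's built-in strict UTF-8 decoder, which follows
the current definition of UTF-8 and may be stricter than what a 1997
implementation would have accepted.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has 8-bit bytes and is well-formed UTF-8.

    Pure ASCII returns False, because there is nothing to distinguish:
    ASCII is identical in UTF-8, Latin-1 and most other 8-bit encodings.
    """
    if data.isascii():
        return False
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# ASCII text is byte-for-byte the same in UTF-8.
ascii_unchanged = "comp.lang.c".encode("utf-8") == b"comp.lang.c"

# Every byte of a non-ASCII character has the high bit set, so no part
# of it can be mistaken for an ASCII character.
high_bits_only = all(b >= 0x80 for b in "ü".encode("utf-8"))

utf8_detected = looks_like_utf8("Zürich".encode("utf-8"))
latin1_rejected = not looks_like_utf8("Zürich".encode("latin-1"))
```

Short Latin-1 strings can occasionally happen to be valid UTF-8, so
this stays a heuristic, although a quite specific one.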

As an example, UTF-8 could easily be introduced as an encoding
for search keys. Applications that don't understand 8-bit
will fold it and probably not find the nonsense they fold it to.
Even if a request fails, that's just what is expected.
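
As a rough sketch of what such folding does, assume (this is only one
possible reading of "fold") that a 7-bit application simply clears the
high bit of every byte. The helper name strip_high_bit is invented for
this example.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 8 of every byte, as a 7-bit-only application might."""
    return bytes(b & 0x7F for b in data)


key = "Zürich".encode("utf-8")   # b"Z\xc3\xbcrich"
folded = strip_high_bit(key)     # 0xC3 -> "C", 0xBC -> "<"
```

The folded key is ASCII nonsense that is unlikely to match anything,
so the search just comes back empty instead of doing harm.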

> The approach being taken in other existing protocols, such as IMAP, is
> to add a command to negotiate the language of the server-issued
> human-readable text (such as error messages).  The extension command is
> defined such that negotiation of a language has as a side-effect the
> negotiation of UTF-8.  (Put another way, the extension defines that
> error messages in the non-default language are encoded in UTF-8).

This is an efficient way of combining two things that are somewhat,
although not exactly, related.

> I'm not sure what is the right thing to do with group names,
> it depends on what the right behavior is when a server is asked to
> present a non-ascii group name to a legacy NNTP client.

The right behaviour is most probably undefined, because officially
it is an error anyway. On the other hand, there have been reports
that most News software seems to be quite tolerant of 8-bit encodings.
So the typical behaviour seems to be to just send that name, in its
binary form, as the server knows it. In some areas, e.g. Scandinavia,
this has already led to experiments based on Latin-1.

The bottleneck seems not to be the client-server communication or
the client or server implementation, but the file system(s) used
to store news articles.

It might be possible to separate the issue of 8-bit compatibility
from the issue of UTF-8. This was done for FTP filenames in the
context of the FTPEXT WG. First, filenames were officially allowed
to be 8-bit, which was widespread practice but not officially allowed
before. Second, UTF-8 was defined as the preferred encoding for
interoperability. The same thing might be possible for newsgroup
names, i.e. that NNTP implementations only guarantee the preservation
of 8-bit byte strings, while it is the job of the client implementations
to interpret these strings, preferably as UTF-8.
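
On the client side, this division of labour might look roughly like
the following Python sketch. The function name display_group_name and
the group name used are hypothetical, and the Latin-1 fallback is only
an assumption motivated by the Scandinavian experiments mentioned
above, not something any spec prescribes.

```python
def display_group_name(raw: bytes) -> str:
    """Interpret an opaque 8-bit group name for display.

    The server is assumed to have passed the bytes through untouched.
    UTF-8 is tried first; Latin-1 serves as a legacy fallback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # never fails: every byte maps


name = "no.test.blåbær"  # made-up group name
from_utf8 = display_group_name(name.encode("utf-8"))
from_latin1 = display_group_name(name.encode("latin-1"))
```

Both calls give back the same readable name, while the server never
had to know which encoding was in use.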

There are of course some differences between FTP and NNTP, in that
newsgroup names have a much more global scope and repeated use
when compared to FTP file names.

Also, please note that the issue of casing should be considered.
It is quite impossible to have case insensitivity in an international
context. This is not the fault of Unicode/ISO 10646, it is just
because case relations are different in different languages.
For newsgroup names, this probably means that (with some backwards
compatibility exceptions for ASCII), newsgroup names must always
be in lowercase.
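
A small Python sketch of why case-insensitive matching breaks down.
It uses Python's built-in, language-independent case mappings; the
helper name survives_case_round_trip is invented for this example.

```python
def survives_case_round_trip(name: str) -> bool:
    """True if upper-casing and then lower-casing gives the name back."""
    return name.upper().lower() == name


# German sharp s has no single uppercase letter in the default mapping.
german_upper = "straße".upper()                    # "STRASSE"
german_ok = survives_case_round_trip("straße")     # lowers to "strasse"

# Turkish dotless i uppercases to plain "I", which lowercases to "i".
# Turkish readers would expect "I" to lowercase to dotless "ı" instead.
turkish_ok = survives_case_round_trip("ı")

# Plain ASCII names are unaffected.
ascii_ok = survives_case_round_trip("comp.lang.c")
```

Names that are simply kept in lowercase never need such mappings at
all, and can be compared byte for byte.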

Regards,	Martin.