From: Jean-Michel POURE <jm.poure@freesurf.fr>
Subject: Java's Unicode Notation
Date: Thu, 08 Nov 2001 14:12:04 +0100
Message-ID: <4.2.0.58.20011108141018.00a59dc0@pop.freesurf.fr>
> Dear Tatsuo,
>
> Would it be possible to use the Java Unicode Notation to define UTF-8
> strings in PostgreSQL 7.2?
No. It's too late. We are in the beta freeze stage.
> Information can be found on http://czyborra.com/utf/
>
> Do you think it is hard to implement?
>
> Best regards,
> Jean-Michel POURE
>
> ************************************************
> Java's Unicode Notation
> There are some less compact but more readable ASCII transformations, the
> most important of which is the Java Unicode Notation, as allowed in Java
> source code and processed by Java's native2ascii converter:
> putwchar(c)
> {
>   if (c >= 0x10000) {
>     printf ("\\u%04x\\u%04x", 0xD7C0 + (c >> 10), 0xDC00 | (c & 0x3FF));
>   }
>   else if (c >= 0x100) printf ("\\u%04x", c);
>   else putchar (c);
> }
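For reference, the quoted routine can be written as a standalone C function. This is only a sketch: the name java_escape, the buffer interface, and the rejection of out-of-range input are my assumptions, not part of the quoted page. It keeps the quoted thresholds, so code points below 0x100 are written as a single byte.

```c
#include <stdio.h>

/* Hypothetical helper (not from the quoted page): writes the Java-style
 * escape for code point c into buf and returns the number of characters
 * written, or -1 if c is not a Unicode scalar value. */
int java_escape(unsigned long c, char *buf, size_t size)
{
    /* Assumption: reject lone surrogates and values above U+10FFFF. */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return -1;

    if (c >= 0x10000) {
        /* 0xD7C0 is 0xD800 - (0x10000 >> 10), so this is the high surrogate. */
        unsigned long hi = 0xD7C0 + (c >> 10);
        unsigned long lo = 0xDC00 | (c & 0x3FF);
        return snprintf(buf, size, "\\u%04lx\\u%04lx", hi, lo);
    }
    if (c >= 0x100)
        return snprintf(buf, size, "\\u%04lx", c);

    /* Same threshold as the quoted code: pass the byte through unchanged. */
    return snprintf(buf, size, "%c", (int)c);
}
```

With a 16-byte buffer, java_escape(0x20AC, buf, sizeof buf) should leave \u20ac in buf, and 0x10000 should give \ud800\udc00.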
> The advantage of the \u20ac notation is that it is very easy to type it in
> on any old ASCII keyboard and easy to look up the intended character if you
> happen to have a copy of the Unicode book or the
> {unidata2,names2,unihan}.txt files from the Unicode FTP site or CD-ROM or
> know that U+20AC is the €.
> What's not so nice about the \u20ac notation is that the small letters are
> quite unusual for Unicode characters, the backslashes have to be quoted for
> many Unix tools, the four hexdigits without a terminator may appear merged
> with the following word as in \u00a333 for £33, it is unclear when and how
> you have to escape the backslash character itself, 6 bytes for one
> character may be considered wasteful, and there is no way to clearly
> present the characters beyond \uffff without \ud800\udc00 surrogates, and
> last but not least the plain hexnumbers may not be very helpful.
> JAVA is one of the target and source encodings of yudit and its uniconv
> converter.
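The point about the missing terminator is easy to see in a reader for this notation. Below is a minimal sketch, assuming a hypothetical function java_unescape_unit that accepts only the plain \uXXXX form. It reads exactly four hex digits, so whatever follows them is ordinary text. It does not combine surrogate pairs or handle an escaped backslash.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if ch is not one. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Hypothetical helper: if s starts with \uXXXX, stores the 16-bit value in
 * *unit and returns 6 (the characters consumed); otherwise returns 0. */
size_t java_unescape_unit(const char *s, unsigned int *unit)
{
    unsigned int value = 0;
    int i;

    if (s[0] != '\\' || s[1] != 'u')
        return 0;

    /* Exactly four hex digits; a NUL terminator fails the digit test,
     * so a truncated escape never reads past the end of the string. */
    for (i = 2; i < 6; i++) {
        int d = hex_value(s[i]);
        if (d < 0)
            return 0;
        value = value * 16 + (unsigned int)d;
    }
    *unit = value;
    return 6;
}
```

Given the C string for \u00a333, the function consumes six characters and stores 0xA3, and the trailing 33 remains as plain text.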
>