Re: Java's Unicode Notation

From: Patrice Hédé
Subject: Re: Java's Unicode Notation
Msg-id: 20011112190314.A2495@idf.net
In reply to: Java's Unicode Notation (Jean-Michel POURE <jm.poure@freesurf.fr>)
List: pgsql-hackers

Hi,

I'm replying to the original mail, since it contains the description itself.

* Jean-Michel POURE <jm.poure@freesurf.fr> [011107 22:04]:
> Dear all,
> 
> Could it be possible to use the Java Unicode Notation to define UTF-8 
> strings in PostgreSQL 7.2.
> Information can be found on http://czyborra.com/utf/
> 
> Best regards,
> Jean-Michel pOURE
> 
> ************************************************
> 
> Java's Unicode Notation
> There are some less compact but more readable ASCII transformations
> the most important of which is the Java Unicode Notation as allowed
> in Java source code and processed by Java's native2ascii converter:
> 
> putwchar(c)
> {
>   if (c >= 0x10000) {
>     printf ("\\u%04x\\u%04x" , 0xD7C0 + (c >> 10), 0xDC00 | c & 0x3FF);
>   }
>   else if (c >= 0x100) printf ("\\u%04x", c);
>   else putchar (c);
> }
> 
> The advantage of the \u20ac notation is that it is very easy to type
> it in on any old ASCII keyboard and easy to look up the intended
> character if you happen to have a copy of the Unicode book or the
> {unidata2,names2,unihan}.txt files from the Unicode FTP site or
> CD-ROM or know what U+20AC is the €.
                                   ^^^
Was that the codepoint from the proprietary Windows charset for the
euro, disguised in a mail advertising itself as "iso-8859-1", which
doesn't have the euro sign? ;)

[No wonder Unicode is really needed in Europe!]

> What's not so nice about the \u20ac notation is that the small
> letters are quite unusual for Unicode characters, the backslashes
> have to be quoted for many Unix tools, the four hexdigits without a
> terminator may appear merged with the following word as in \u00a333
> for £33, it is unclear when and how you have to escape the backslash
> character itself, 6 bytes for one character may be considered
> wasteful, and there is no way to clearly present the characters
> beyond \uffff without \ud800\udc00 surrogates, and last but not
> least the plain hexnumbers may not be very helpful.
> 
> JAVA is one of the target and source encodings of yudit and its
> uniconv converter.

I have to disagree about this feature... well, not about the idea, but
the implementation.

First, the use of surrogates to describe codepoints above 0xffff.
Surrogates are NOT Unicode characters. They only exist for the UTF-16
encoding, which is the encoding used by Java and Windows. However,
PostgreSQL, like most Unix tools, uses UTF-8 as its encoding.

Encoding codepoints over 0xffff with two surrogates in UTF-8 is
illegal... So, you should forget about this, as this is an unnatural
extra step.
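
A minimal sketch of the point above, in Python (chosen only for illustration; the helper names `combine_surrogates` and `utf8_or_none` are hypothetical and not part of PostgreSQL or of the quoted code). It combines a UTF-16 surrogate pair back into the codepoint it stands for, shows that UTF-8 encodes that codepoint directly in four bytes, and shows that Python's strict UTF-8 encoder refuses a lone surrogate:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the codepoint it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_or_none(code_point: int):
    """Return the UTF-8 bytes for a codepoint, or None if the encoder rejects it."""
    try:
        return chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return None


# The pair D800/DC00 is the Java-style spelling of U+10000.
cp = combine_surrogates(0xD800, 0xDC00)
print(hex(cp))               # the codepoint behind the pair
print(utf8_or_none(cp))      # four UTF-8 bytes, no surrogates involved
print(utf8_or_none(0xD800))  # a lone surrogate is rejected by the encoder
```

So a \uXXXX\uXXXX pair would have to be recombined into one codepoint before it could be stored as UTF-8, which is the extra step mentioned above.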

I've seen the notation \v010000 somewhere (using \v for 6-digit
codepoints), but I don't like it much either.

I agree with your idea of being able to express Unicode codepoints
directly with escape sequences. I personally like Perl's solution:

\x{20ac}
\x{010123}
\x{7e}

The braces make the length of the codepoint unambiguous (I have
often typed one "0" too many or too few myself when writing Unicode
codepoints).
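
As a rough illustration of why the braces help, here is a small Python sketch of a decoder for this notation. The function name `expand_braced_escapes` and the limit of 1 to 6 hex digits are assumptions made for the example; this is not how Perl or PostgreSQL implements it:

```python
import re

# One backslash, "x", then 1 to 6 hex digits inside braces.
BRACED_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{1,6})\}")


def expand_braced_escapes(text: str) -> str:
    """Replace every \\x{HEX} escape in text with the character it names."""

    def replace(match):
        code_point = int(match.group(1), 16)
        # Reject values outside Unicode and the surrogate range.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"not a valid Unicode character: {match.group(0)}")
        return chr(code_point)

    return BRACED_ESCAPE.sub(replace, text)


# The closing brace ends the escape, so the digits that follow stay literal.
print(expand_braced_escapes(r"\x{a3}33").encode("utf-8"))
print(expand_braced_escapes(r"\x{010123}").encode("utf-8"))
print(expand_braced_escapes(r"\x{7e}"))
```

With the braces, \x{a3}33 can only mean the pound sign followed by "33", and a codepoint above 0xffff needs no surrogates.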

I don't mind \u{...} instead of \x{...}. But a lot of PostgreSQL users
would be familiar with \x{} notation :) [Me being the first one]

I think this is something for psql, however. Where is "\n"
translated, for example? Anyway, for 7.3... :)

Patrice.

-- 
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/

