Re: Java's Unicode Notation
From | Patrice Hédé |
---|---|
Subject | Re: Java's Unicode Notation |
Date | |
Msg-id | 20011112190314.A2495@idf.net |
In reply to | Java's Unicode Notation (Jean-Michel POURE <jm.poure@freesurf.fr>) |
List | pgsql-hackers |
Hi,

I'm replying to the original mail, as it contains the description itself.

* Jean-Michel POURE <jm.poure@freesurf.fr> [011107 22:04]:
> Dear all,
>
> Could it be possible to use the Java Unicode Notation to define UTF-8
> strings in PostgreSQL 7.2.
> Information can be found on http://czyborra.com/utf/
>
> Best regards,
> Jean-Michel pOURE
>
> ************************************************
>
> Java's Unicode Notation
> There are some less compact but more readable ASCII transformations
> the most important of which is the Java Unicode Notation as allowed
> in Java source code and processed by Java's native2ascii converter:
>
> putwchar(c)
> {
>   if (c >= 0x10000) {
>     printf ("\\u%04x\\u%04x" , 0xD7C0 + (c >> 10), 0xDC00 | c & 0x3FF);
>   }
>   else if (c >= 0x100) printf ("\\u%04x", c);
>   else putchar (c);
> }
>
> The advantage of the \u20ac notation is that it is very easy to type
> it in on any old ASCII keyboard and easy to look up the intended
> character if you happen to have a copy of the Unicode book or the
> {unidata2,names2,unihan}.txt files from the Unicode FTP site or
> CD-ROM or know what U+20AC is the .
                                   ^^^
Was that the codepoint for the Windows proprietary charset for the
Euro, disguised in a mail advertising itself as "iso-8859-1", which
doesn't have the euro sign? ;)

[No wonder Unicode is really needed in Europe!]

> What's not so nice about the \u20ac notation is that the small
> letters are quite unusual for Unicode characters, the backslashes
> have to be quoted for many Unix tools, the four hexdigits without a
> terminator may appear merged with the following word as in \u00a333
> for £33, it is unclear when and how you have to escape the backslash
> character itself, 6 bytes for one character may be considered
> wasteful, and there is no way to clearly present the characters
> beyond \uffff without \ud800\udc00 surrogates, and last but not
> least the plain hexnumbers may not be very helpful.
>
> JAVA is one of the target and source encodings of yudit and its
> uniconv converter.

I have to disagree about this feature... well, not about the idea,
but about the implementation.

First, the use of surrogates to describe codepoints beyond 0xffff.
Surrogates are NOT Unicode characters. They are only meaningful in the
UTF-16 encoding, which is the encoding used by Java and Windows.
However, PostgreSQL, like most Unix tools, uses UTF-8 as its encoding.
Encoding codepoints over 0xffff with two surrogates in UTF-8 is
illegal... So you should forget about this, as it is an unnatural
extra step.

I've seen somewhere the notation \v010000 (using \v for 6-char
codepoints). But I don't like it too much either.

I agree with your idea of being able to express Unicode codepoints
directly with escape characters. I personally like Perl's solution:

  \x{20ac}
  \x{010123}
  \x{7e}

The braces make the length of the codepoint unambiguous (I've often
put one "0" too many or too few in Unicode codepoint descriptions
myself).

I don't mind \u{...} instead of \x{...}. But a lot of PostgreSQL
users would be familiar with the \x{} notation :) [Me being the first
one]

I think that this is something for psql, however. Where is "\n"
translated, for example?

Anyway, for 7.3... :)

Patrice.

-- 
Patrice Hédé
email: patrice hede à islande org
www  : http://www.islande.org/
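
To make the brace notation discussed above concrete, here is a minimal sketch in C of how an escape such as \x{20ac} could be decoded and written out as UTF-8. The function names parse_brace_escape and utf8_encode are hypothetical and chosen for illustration only; they are not PostgreSQL or psql functions, and nothing here reflects how either actually handles escapes. The sketch assumes a NUL-terminated input string and rejects surrogates, following the point that they have no place in UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded:
 * surrogates (0xD800..0xDFFF) and values above 0x10FFFF are rejected. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 only */
    if (cp > 0x10FFFF) return 0;                  /* outside Unicode */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Parse a brace-delimited escape "\x{HEX}" at the start of s.
 * On success, stores the value in *cp and returns the number of input
 * characters consumed. Returns 0 for anything malformed: wrong prefix,
 * empty braces, non-hex digit, missing '}', or a value above 0x10FFFF. */
size_t parse_brace_escape(const char *s, uint32_t *cp)
{
    size_t i = 3;
    uint32_t v = 0;

    if (s[0] != '\\' || s[1] != 'x' || s[2] != '{') return 0;
    if (s[i] == '}') return 0;                    /* "\x{}" */

    for (; s[i] != '}'; i++) {
        char c = s[i];
        int d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return 0;            /* also catches the terminating NUL */
        v = (v << 4) | (uint32_t)d;
        if (v > 0x10FFFF) return 0;
    }
    *cp = v;
    return i + 1;                                 /* include the '}' */
}
```

Because the closing brace ends the escape, an input like \x{a3}33 is read as the codepoint 0xA3 followed by the literal text "33", which avoids the \u00a333 ambiguity mentioned in the quoted text. Leading zeros are harmless, so \x{010123} and \x{10123} would decode to the same value.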