Re: Unicode string literals versus the world

From: Sam Mason
Subject: Re: Unicode string literals versus the world
Msg-id: 20090416143440.GM12225@frubble.xen.chris-lamb.co.uk
In reply to: Re: Unicode string literals versus the world  (Marko Kreen <markokr@gmail.com>)
Responses: Re: Unicode string literals versus the world  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Unicode string literals versus the world  (Marko Kreen <markokr@gmail.com>)
List: pgsql-hackers

On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
> On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:
> > Microsoft have also gone this way in C#, named code points are not
> > supported however.
> 
> And it handles also non-BMP codepoints with \u escape similarly:
> 
>   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
> 
> This makes it even more standard.

I fail to see what you're pointing out here; as far as I understand it,
\u is for BMP code points and \U extends the range out to 32bit code
points.  I can't see anything about non-BMP and \u in the above link,
you appear free to write your own surrogate pairs but that seems like an
independent issue.
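
As a point of comparison, Python 3 string literals make the same split:
\u takes exactly four hex digits and \U takes exactly eight.  The snippet
below is only a sketch of that convention in Python (the variable names
are made up for illustration); it says nothing about how C# behaves.

```python
# \u takes exactly four hex digits, so on its own it can only name a
# BMP code point.
bmp = '\u0041'            # LATIN CAPITAL LETTER A

# \U takes exactly eight hex digits and reaches the other planes.
non_bmp = '\U00010302'    # a code point outside the BMP

print(hex(ord(bmp)))
print(hex(ord(non_bmp)))
print(len(non_bmp))       # length in code points, as Python 3 counts it
```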

I'd not realised before that C# is specified to use UTF-16 as its
internal encoding.

> >  This would be following the BitC[2] project, especially if it was more
> >  like:
> >
> >   \{U+xxxx}
> 
> We already got yet-another-unique-way-of-escaping-unicode with U&.
> 
> Now let's try to support some actual standard also.

That comes across *very* negatively; I hope it's just a language issue.

I read your parent post as soliciting opinions on possible ways to
encode Unicode characters in PG's literals.  The U&'lit' was criticised,
you posted some suggestions, I followed up with what I hoped to be a
useful addition.  It seems useful here to separate "de jure" from "de
facto" standards; implementing U&'lit' would be following the de jure
standard, anything else would be de facto.
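
For concreteness, here is a minimal sketch of how the escapes inside a
U&'lit' body can be read, assuming the default backslash escape
character: \xxxx is four hex digits, \+xxxxxx is six hex digits, and a
doubled backslash is a literal backslash.  The function name
decode_u_amp is hypothetical, and the sketch ignores the UESCAPE clause
and does no validation of the resulting code points.

```python
import re

# One escape: a doubled backslash, \+ followed by six hex digits,
# or a backslash followed by four hex digits.
_ESCAPE = re.compile(r'\\(\\|\+[0-9A-Fa-f]{6}|[0-9A-Fa-f]{4})')


def decode_u_amp(body):
    """Decode the body of a U&'...' literal (hypothetical helper)."""
    def replace(match):
        token = match.group(1)
        if token == '\\':
            return '\\'                      # \\ is a literal backslash
        return chr(int(token.lstrip('+'), 16))
    return _ESCAPE.sub(replace, body)


print(decode_u_amp(r'd\0061t\+000061'))
```

Both escape forms in the example name U+0061, so the call prints "data".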

A survey of existing SQL implementations would seem to be more appropriate
as well:

Oracle: UNISTR(string-literal) and \xxxx
 It looks as though Oracle originally used UCS-2 internally (i.e. BMP
 only) but more recently Unicode support has been improved to allow
 other planes.

MS-SQL Server:
 can't find anything remotely useful; best seems to be to use
 NCHAR(integer-expression) which looks somewhat unmaintainable.

DB2: U&string-literal and \xxxxxx
 i.e. it follows the SQL-2003 spec

FireBird:
 can't find much either; support looks somewhat low on the ground

MySQL:
 same again, seems to assume query is encoded in UTF-8

Summary seems to be that either I'm bad at searching or support for
Unicode doesn't seem very complete in the database world and people work
around it somehow.

> You did not read my mail carefully enough - the Java and also Python/C#
> already support non-BMP chars with '\u' and exactly the same (utf16) way.

Again, I think this may be a language issue; if not then more verbose
explanations help, maybe something like "sorry, I obviously didn't
explain that very well".  You will of course have felt you explained it
perfectly well, but everybody enters a discussion with different
intuitions and biases, email has a nasty habit of accentuating these
differences and compounding them with language problems.

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that <D800 DF02> corresponds to
U+10302, Python would appear to follow my intuitions in that:
 ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?
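
For reference, the arithmetic behind the <D800 DF02> to U+10302 mapping
can be sketched as below.  This is just the UTF-16 surrogate formula
written out by hand; the function name combine_surrogates is made up for
illustration and is not part of Python.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point), code_point)
```

For the pair above this gives 0x10302, i.e. 66306.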

--  Sam  http://samason.me.uk/

