Re: Bug in UTF8-Validation Code?

From: Albe Laurenz
Subject: Re: Bug in UTF8-Validation Code?
Date:
Msg-id: AFCCBB403D7E7A4581E48F20AF3E5DB20203E26F@EXADV1.host.magwien.gv.at
In reply to: Bug in UTF8-Validation Code?  (Mario Weilguni <mweilguni@sime.com>)
Responses: Re: Bug in UTF8-Validation Code?  (Mark Dilger <pgsql@markdilger.com>)
List: pgsql-hackers
Mark Dilger wrote:
>> What I suggest (and what Oracle implements, and isn't CHR() and
>> ASCII() partly for Oracle compatibility?) is that CHR() and ASCII()
>> convert between a character (in database encoding) and
>> that database encoding in numeric form.
>
> Looking at Oracle documentation, it appears that you get different
> behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it
> with the argument USING NCHAR_CS.  Oracle 9i and higher have an
> additional function called NCHR(X) which is supposed to be the same as
> CHR(X USING NCHAR_CS).
>
> On http://www.oraclehome.co.uk/chr-function.htm it says that "To use
> UTF8, you specify using nchar_cs in the argument list".  Does this
> mean that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as
> Albe Laurenz wants?  Vice versa?

That web page is at least misleading, if not downright wrong.

It's just that an Oracle database has two character sets, a "database
character set" and a "national character set", the latter always being a
UNICODE encoding (the name "national character set" is somewhat
misleading).

This baroque concept dates from the days when nobody had a UNICODE
database, but people still wanted to store characters not supported
by the "database character set"; in that case you could define a column
to be in the "national character set".

CHR(n) and CHR(n USING NCHAR_CS) = NCHR(n) are the same function, only
that the first one uses the "database character set" and the latter ones
the "national character set".
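To make the intended semantics concrete, here is a small Python sketch of what I mean by "convert between a character and its encoding in numeric form" (the function name is mine, purely for illustration, and this is of course not how Oracle or PostgreSQL implement it internally):

```python
def chr_in_encoding(n: int, encoding: str) -> str:
    """Interpret n as the numeric value of a character's byte sequence
    in the given encoding and return that character.  This is the
    CHR() behavior proposed above: n is the encoded form, not the
    UNICODE code point."""
    nbytes = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(nbytes, "big").decode(encoding)

# The same character, addressed via two different encodings:
assert chr_in_encoding(0xE9, "latin-1") == "\u00e9"    # one byte, E9
assert chr_in_encoding(0xC3A9, "utf-8") == "\u00e9"    # two bytes, C3 A9
```

The point is that the numeric argument means something different depending on which character set is in effect, which is exactly why Oracle needs both CHR(n) and NCHR(n).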

Nowadays this Oracle concept of a "national character set" is nearly
obsolete; one normally uses a UNICODE "database character set".

Oracle has two things to say about CHR():
  "For single-byte character sets, if n > 256, then Oracle Database
   returns the binary equivalent of n mod 256. For multibyte character
   sets, n must resolve to one entire code point. Invalid code points
   are not validated, and the result of specifying invalid code points
   is indeterminate."

It seems that Oracle means "encoding" when it says "code point" :^)
We should of course reject invalid arguments!
I don't know if I like this modulus thing for single byte encodings
or not...
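For the record, the documented single-byte behavior amounts to something like this (a sketch only; the function name is mine, and whether PostgreSQL should copy this wrap-around is exactly the open question):

```python
def oracle_chr_single_byte(n: int, encoding: str = "latin-1") -> str:
    """Sketch of Oracle's documented CHR() behavior for single-byte
    character sets: out-of-range n is silently reduced mod 256
    instead of being rejected."""
    return bytes([n % 256]).decode(encoding)

assert oracle_chr_single_byte(65) == "A"
# 321 mod 256 == 65, so the out-of-range argument silently aliases 'A':
assert oracle_chr_single_byte(321) == oracle_chr_single_byte(65)
```

Silently aliasing distinct arguments to the same character is the sort of thing we would presumably rather reject with an error.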
  "Use of the CHR function (either with or without the optional USING
   NCHAR_CS clause) results in code that is not portable between ASCII-
   and EBCDIC-based machine architectures."

There's one thing that strikes me as weird in your implementation:

> pgsql=# select chr(0);
> ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in
> "UTF8"

0x00 is a valid UNICODE code point and also a valid UTF-8 character!
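Any conforming UTF-8 codec agrees; Python's, for example, round-trips NUL without complaint (whether the server can *store* a NUL byte in a text value is a separate question from UTF-8 validity):

```python
# U+0000 (NUL) is a legitimate UNICODE code point, and its UTF-8
# encoding is the single byte 0x00.
encoded = "\x00".encode("utf-8")
assert encoded == b"\x00"                  # encodes to exactly one byte
assert encoded.decode("utf-8") == "\x00"   # and decodes back without error
```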

To me (maybe only to me) CHR() and ASCII() have always had the look
and feel of "type casts" between "char" and integer, with all the lack
of portability this might imply.

Yours,
Laurenz Albe

