Re: XPATH vs. server_encoding != UTF-8

From Florian Pflug
Subject Re: XPATH vs. server_encoding != UTF-8
Msg-id 799546AE-B11C-4718-BA19-A182E796F6C4@phlo.org
In response to Re: XPATH vs. server_encoding != UTF-8  (Florian Pflug <fgp@phlo.org>)
List pgsql-hackers
On Jul24, 2011, at 01:25 , Florian Pflug wrote:
> On Jul23, 2011, at 22:49 , Peter Eisentraut wrote:
>
>> On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote:
>>> The current thread about JSON and the ensuing discussion about the
>>> XML types' behaviour in non-UTF8 databases made me try out how well
>>> XPATH() copes with that situation. The code, at least, looks
>>> suspicious - XPATH neither verifies that the server encoding is UTF-8,
>>> nor does it pass the server encoding on to libxml's xpath functions.
>>
>> This issue is on the Todo list, and there are some archive links there.
>
> Thanks for the pointer, but I think the discussion there doesn't
> really apply here.

Upon further reflection, I came to realize that it in fact does apply.

All the non-XPath related XML *parsing* seems to go through xml_parse(),
but we also use libxml to write XML, making XMLELEMENT() and friends
equally susceptible to all kinds of encoding trouble. For the fun of it,
try the following in an ISO-8859-1 database (with client_encoding correctly
set up, so the umlaut-a reaches the backend unharmed):
 select xmlelement(name "r", xmlattributes('ä' as a));

you get
    xmlelement
-------------------
 <r a="&#x4000;"/>

Well, actually, you only get that about 9 times out of 10. Sometimes
you instead get
        xmlelement
---------------------------
 <r a="&#x4001;\x01\x01"/>

It seems that libxml reads past the terminating zero byte if it's
preceded by an invalid UTF-8 byte sequence (like 0xe4 0x00 in the example
above). Ouch!
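
To illustrate why the output changes from run to run, here is a minimal
sketch of a UTF-8 decoder that trusts the lead byte and never validates the
continuation bytes. This is NOT libxml's actual code; the function name and
the bytes placed after the terminator are assumptions made for illustration.
0xe4 has the bit pattern 1110xxxx, which announces a three-byte sequence, so
such a decoder consumes the zero terminator plus whatever byte follows it.

```python
def naive_utf8_decode3(buf: bytes, pos: int) -> int:
    """Decode a 3-byte UTF-8 sequence WITHOUT checking that the two
    trailing bytes are valid continuation bytes (10xxxxxx).
    Hypothetical helper, for illustration only."""
    lead, c1, c2 = buf[pos], buf[pos + 1], buf[pos + 2]
    return ((lead & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F)


# 'ä' in ISO-8859-1, followed by the C string terminator.
latin1_cstring = "ä".encode("iso-8859-1") + b"\x00"

# The third byte is whatever happens to sit behind the terminator in
# memory; 0x00 and 0x01 are assumed values here.
print(hex(naive_utf8_decode3(latin1_cstring + b"\x00", 0)))  # 0x4000
print(hex(naive_utf8_decode3(latin1_cstring + b"\x01", 0)))  # 0x4001

# A validating decoder rejects the same input instead of reading on.
try:
    latin1_cstring.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Under these assumptions the decoded code point depends on the byte behind
the terminator, which would match the U+4000 and U+4001 results above.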

Also, passing encoding ASCII to libxml's parser doesn't prevent it from
expanding entity references referring to characters outside the ASCII
range. So even with my patch applied you can make XPATH() return wrong
results. For example (0xe4 is the Unicode code point of umlaut-a):
 select xpath('/r/@a', '<r a="&#xe4;"/>'::xml);

gives (*with* my patch applied)
 xpath
-------
 {ä}
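
The effect is not specific to libxml. As a rough sketch, the same thing can
be shown with Python's standard-library parser (expat-based, used here only
as a stand-in): the document consists of ASCII bytes only, yet the parsed
attribute value does not.

```python
import xml.etree.ElementTree as ET

# Every byte of the input is ASCII; the umlaut exists only as a
# numeric character reference.
doc = b'<r a="&#xe4;"/>'
print(all(b < 0x80 for b in doc))   # True

# The parser expands the reference while parsing.
value = ET.fromstring(doc).get("a")
print(value == "\u00e4")            # True

# The expanded value no longer fits into ASCII ...
try:
    value.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")

# ... and its UTF-8 form differs from its ISO-8859-1 form, so a result
# handed back as UTF-8 would need converting to the server encoding.
print(value.encode("utf-8"))        # b'\xc3\xa4'
print(value.encode("iso-8859-1"))   # b'\xe4'
```

So restricting the declared input encoding says nothing about the
characters that come out of the parser.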

So scratch the whole idea. There doesn't seem to be a simple way to
make the XML type work sanely in a non-UTF-8 setting :-(. Apart from
simple input and output that is, which already seems to work correctly
regardless of the server encoding.

BTW, for the sake of getting this into the archives just in case someone
decides to fix this and stumbles over this thread:

It seems to me that the easiest way to fix XML generation in the non-UTF-8
case would be to cease using libxml for emitting XML at all. The only
non-trivial use of libxml there is the escaping of attribute values, and
we do already have our own escape_xml() function - it just needs to be
taught the additional escapes needed for attribute values. (libxml is
also used to convert binary values to base64 or hexadecimal notation,
but there're no encoding issues there)
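
A rough sketch of what such an attribute escaper has to do, written in
Python for brevity. The names escape_xml_attribute and xml_attribute are
hypothetical, and this is not the backend's escape_xml(). The set of escapes
assumes double-quoted attribute values and follows the XML 1.0 rules: & and
< must be escaped, the quote delimiter must be escaped, and tab, newline and
carriage return are written as character references so that attribute-value
normalization does not turn them into spaces.

```python
# Escapes assumed for a double-quoted attribute value.
_ATTR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#x9;",
    "\n": "&#xA;",
    "\r": "&#xD;",
}


def escape_xml_attribute(value: str) -> str:
    """Escape a string for use inside a double-quoted XML attribute.
    Every other character, ASCII or not, is passed through untouched."""
    return "".join(_ATTR_ESCAPES.get(ch, ch) for ch in value)


def xml_attribute(name: str, value: str) -> str:
    """Render name="value" with the value escaped."""
    return f'{name}="{escape_xml_attribute(value)}"'


print(xml_attribute("a", "ä"))
print(xml_attribute("a", 'x < y & "z"\n'))
```

Since only ASCII characters are ever rewritten, a function like this never
has to interpret the input as UTF-8, and the umlaut passes through unchanged
in whatever encoding the string is stored in.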

best regards,
Florian Pflug
