Re: [HACKERS] possible encoding issues with libxml2 functions

Поиск
Список
Период
Сортировка
От Pavel Stehule
Тема Re: [HACKERS] possible encoding issues with libxml2 functions
Дата
Msg-id CAFj8pRBpUONGfVh-ixD-tje_tfjr0gSi6gjcAYNBcbyTgQttaQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] possible encoding issues with libxml2 functions  (Noah Misch <noah@leadboat.com>)
Ответы Re: [HACKERS] possible encoding issues with libxml2 functions  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers


2017-08-20 9:21 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
> 2017-08-20 4:17 GMT+02:00 Noah Misch <noah@leadboat.com>:
> > On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
> > > I am sending some POC  - it does support XPATH and XMLTABLE for not UTF8
> > > server encoding.
> > >
> > > In this case, all strings should be converted to UTF8 before call libXML2
> > > functions, and result should be converted back from UTF8.
> >
> > Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
> > Please start a new thread for this, and add it to the open CommitFest.
> >
> > In this thread, would you provide the version of your patch that I
> > described
> > in my 2017-08-08 post to this thread?  That's a back-patchable bug fix.
>
>
> There are three issues:
>
> 1. processing 1byte encoding XMLs documents with encoding declaration -
> should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is very
> short and safe - can be apply immediately (there is regress tests)

We should fix that problem, yes.  encoding_for_xmlCtxtReadMemory.patch is not
the right fix; see below.

> 2 encoding issues in XPath specification (and  namespaces) - because
> multibytes chars are not usually used in tag names, this issue hit minimum
> users.
>
> 3. encoding issues in XPath and XMLTABLE results - this is bad issue - the
> function XMLTABLE will not be functional on non UTF8 databases. Fortunately
> - there are less users with this encoding, but probably should be apply as
> fix in 10/11 Postgres.

(2) and (3) are long-documented (since commit 14180f9, 2009-06) limitations in
xpath functions.  That's why I would treat an improvement as a new feature and
not back-patch it.  It is tempting to fix v10 so XMLTABLE is born without this
limitation, but it is too late in the release cycle.

I agree 


Reviewing encoding_for_xmlCtxtReadMemory.patch:

On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
> -             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> +             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> +                                                             pg_encoding_to_char(GetDatabaseEncoding()), 0);

This assumes libxml2 understands every encoding name that
pg_encoding_to_char() can return, but it does not.  xpath-bugfix.patch was
better.  Here are the relevant parts of my review of that patch.

I understand to this argument. 

On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
> On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > One possible fix - and similar technique is used more times in xml.c code
> > .. xmlroot
>
> > +   /* remove header */
> > +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
>
> > -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> > +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -

> I like this parse_xml_decl() approach better.  Would you update
> your patch to use it and to add the test case I give above?

Again, would you make those two edits?

Now, I am not sure so it is correct fix. We will fix case, when server is in UTF8, but maybe we will break some cases when server is not in UTF8 (although it is broken already).

I am think so correct solution is encoding to UTF8 and passing encoding parameter. It will works safely in all cases - and we don't need cut header. We should not to cut header if server encoding is not in UTF8 and we don't pass encoding as parameter. When we pass encoding as parameter, then cutting header is not necessary.

What do you think about last patch?

Regards

Pavel

 

Thanks,
nm

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Noah Misch
Дата:
Сообщение: Re: [HACKERS] possible encoding issues with libxml2 functions
Следующее
От: Nikolay Shaplov
Дата:
Сообщение: Re: [HACKERS] pgbench tap tests & minor fixes