Обсуждение: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

Поиск
Список
Период
Сортировка

BUG #16046: xpath returns CDATA tag along with the value in postgres 12

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16046
Logged by:          Mohammad Mostafa Kamal
Email address:      mostafa_bit0108@hotmail.com
PostgreSQL version: 12.0
Operating system:   Windows 7
Description:

While using xpath to extract text from a CDATA section of xml, it returns
CDATA tag along with the value. 

Query: SELECT
unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select
5]]></aname></cname>'::xml))

Output - pg11: select 5 

Output - pg12:  <![CDATA[select 5]]>


Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> While using xpath to extract text from a CDATA section of xml, it returns
> CDATA tag along with the value.
>
> Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml))
>
> Output - pg11: select 5
>
> Output - pg12: <![CDATA[select 5]]>

I haven't actually bisected to make sure, but I imagine this is a
consequence of commit 251cf2e27bec98274e8bb002608680bdc211319e.
What's not entirely clear to me is whether it's an intentional
effect, or a bug.  Authors, any comments?

            regards, tom lane



Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Chapman Flack
Дата:
On 10/24/19 17:38, Tom Lane wrote:
>> Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml))
>>
>> Output - pg11: select 5 
>>
>> Output - pg12: <![CDATA[select 5]]>
> 
> ... What's not entirely clear to me is whether it's an intentional
> effect, or a bug.  Authors, any comments?

Hmm. I would say the pg12 behavior is "not wrong". But it's unexpected.
xpath's return type is xml (well, array of), so the result must have a
form that can escape any characters mistakable for markup. In this example,
there aren't any, but once tweaked so there are:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));

pg12: <![CDATA[select 5 & 6 <yahoo!>]]>

the necessity is clear.

The other valid option would be to return, not CDATA, but a regular
text node, which would look like straight text if there were no special
characters in it, and would otherwise have every such character individually
escaped. That's what I get from pg11:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));

pg11: select 5 & 6 <yahoo!>

So what pg11 is doing is also "not wrong" (in this respect, anyway).
And looks "more natural", in the case where the value has no characters
that need escaping.

Which may or may not be a good thing. Perhaps it could lead the unwary
in some cases to think such a query is giving a directly usable
text string back, which will be harmless until the one time a value
with escaping comes back. (The no-surprises way to get back a directly
usable text string, if that's what's wanted, would be with XMLTABLE
and an output column of text type.)

Oddly, what pg12 is doing seems to be influenced by the form of escaping
used in the input:

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml));
              unnest
-----------------------------------
 <![CDATA[select 5 & 6 <yahoo!>]]>

SELECT
 unnest(xpath('//cname/aname/text()',
  '<cname><aname>select 5 & 6 <yahoo!></aname></cname>'::xml));
             unnest
---------------------------------
 select 5 & 6 <yahoo!>


Either form of result is correct, and having it respect the form that was
used in the input might even be delightfully smart.

I haven't looked in the code just now to see if it is intentionally being
delightfully smart, or more simplistic-and-lucky.

Regards,
-Chap



Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Chapman Flack
Дата:
On 10/24/19 19:14, Chapman Flack wrote:
>  <![CDATA[select 5 & 6 <yahoo!>]]>
>  select 5 & 6 <yahoo!>
> 
> Either form of result is correct, and having it respect the form that was
> used in the input might even be delightfully smart.
> 
> I haven't looked in the code just now to see if it is intentionally being
> delightfully smart, or more simplistic-and-lucky.

It appears to be probably unintentional-but-ok: libxml tags a CDATA
section differently (XML_CDATA_SECTION_NODE) than a text node
(XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of

if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)

where it gets dumped faithfully by xmlNodeDump(), with the upshot
that the result always will be well-formed XML, and will respect whether
the input was supplied as CDATA or not. We could choose to say that's
what we meant it to do all along.

Regards,
-Chap



Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Chapman Flack
Дата:
On 10/24/19 19:14, Chapman Flack wrote:

> Which may or may not be a good thing. Perhaps it could lead the unwary
> in some cases to think such a query is giving a directly usable
> text string back, which will be harmless until the one time a value
> with escaping comes back. (The no-surprises way to get back a directly
> usable text string, if that's what's wanted, would be with XMLTABLE
> and an output column of text type.)

If the original reporter is happy with getting xml back, instead of
raw text, but only wants the form of escaping to be the same as it was
in pg11, the xpath //cname/aname/text() can just be rewritten as
string(//cname/aname), which will lose any CDATA-ness and consistently
produce the form where individual characters are escaped, in 11 or 12.

I think.

Regards,
-Chap



Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Markus Winand
Дата:

On 25 Oct 2019, at 01:34, Chapman Flack <chap@anastigmatix.net> wrote:

On 10/24/19 19:14, Chapman Flack wrote:
<![CDATA[select 5 & 6 <yahoo!>]]>
select 5 &amp; 6 &lt;yahoo!&gt;

Either form of result is correct, and having it respect the form that was
used in the input might even be delightfully smart.

I haven't looked in the code just now to see if it is intentionally being
delightfully smart, or more simplistic-and-lucky.

It appears to be probably unintentional-but-ok: libxml tags a CDATA
section differently (XML_CDATA_SECTION_NODE) than a text node
(XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of

if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)

This was intentional. Earlier versions of the patch had the CDATA explicitly listed[0]. I suggested to reverse the logic later [1].

My concern at that time was about xmltable but I think the current behavior is fine for xpath too. If you pick fragment(s) out of an XML document the most reasonable thing to do is to return that part(s) of the XML as it actually appears in the input.

 We could choose to say that's
what we meant it to do all along.

+1

-markus


Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12

От
Tom Lane
Дата:
Chapman Flack <chap@anastigmatix.net> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest
---------------
 select 5>4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

            regards, tom lane



Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Pavel Stehule
Дата:


pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Chapman Flack <chap@anastigmatix.net> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest     
---------------
 select 5&gt;4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

I tested XMLTABLE function, and it is work how it is expected. About xpath function - I don't know - there is a agreement so result is correct, and I can say so form of result of Postgres 12 is maybe little bit better, because you will get raw result without any transformation, what is sense of CDATA tag. Unfortunately, I agree, so this behave can surprise for developers and it's clear incompatible behave. Trimming CDATA tag or using string function is easy.

Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.

Regards

Pavel


                        regards, tom lane

Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12

От
Mohammad Mostafa Kamal
Дата:
Thank you all so very much for your valuable comments.

So we can be take it for granted that, the behavior of CDATA will remain as it is currently now in pg12 and we need change our queries accordingly to handle this from now on.

Thanks again,
Mostafa

From: Pavel Stehule <pavel.stehule@gmail.com>
Sent: Tuesday, November 5, 2019 10:28 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Chapman Flack <chap@anastigmatix.net>; mostafa_bit0108@hotmail.com <mostafa_bit0108@hotmail.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>; Markus Winand <markus.winand@winand.at>
Subject: Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12
 


pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Chapman Flack <chap@anastigmatix.net> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)

> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.

> I think.

Yeah, that does work the same for me in 11 and HEAD:

regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
    unnest     
---------------
 select 5&gt;4
(1 row)

So I'm inclined to write this off as an intended, or at least not
unwanted, change.  However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.

The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:

  * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
    Chapman Flack)

    Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
    where nothing was output for a node, or an unexpected error was
    thrown, or necessary escaping of XML special characters was omitted.

Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?

I tested XMLTABLE function, and it is work how it is expected. About xpath function - I don't know - there is a agreement so result is correct, and I can say so form of result of Postgres 12 is maybe little bit better, because you will get raw result without any transformation, what is sense of CDATA tag. Unfortunately, I agree, so this behave can surprise for developers and it's clear incompatible behave. Trimming CDATA tag or using string function is easy.

Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.

Regards

Pavel


                        regards, tom lane