Обсуждение: BUG #16046: xpath returns CDATA tag along with the value in postgres 12
BUG #16046: xpath returns CDATA tag along with the value in postgres 12
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16046 Logged by: Mohammad Mostafa Kamal Email address: mostafa_bit0108@hotmail.com PostgreSQL version: 12.0 Operating system: Windows 7 Description: While using xpath to extract text from a CDATA section of xml, it returns CDATA tag along with the value. Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml)) Output - pg11: select 5 Output - pg12: <![CDATA[select 5]]>
PG Bug reporting form <noreply@postgresql.org> writes: > While using xpath to extract text from a CDATA section of xml, it returns > CDATA tag along with the value. > > Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml)) > > Output - pg11: select 5 > > Output - pg12: <![CDATA[select 5]]> I haven't actually bisected to make sure, but I imagine this is a consequence of commit 251cf2e27bec98274e8bb002608680bdc211319e. What's not entirely clear to me is whether it's an intentional effect, or a bug. Authors, any comments? regards, tom lane
On 10/24/19 17:38, Tom Lane wrote: >> Query: SELECT unnest(xpath('//cname/aname/text()','<cname><aname><![CDATA[select 5]]></aname></cname>'::xml)) >> >> Output - pg11: select 5 >> >> Output - pg12: <![CDATA[select 5]]> > > ... What's not entirely clear to me is whether it's an intentional > effect, or a bug. Authors, any comments? Hmm. I would say the pg12 behavior is "not wrong". But it's unexpected. xpath's return type is xml (well, array of), so the result must have a form that can escape any characters mistakable for markup. In this example, there aren't any, but once tweaked so there are: SELECT unnest(xpath('//cname/aname/text()', '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml)); pg12: <![CDATA[select 5 & 6 <yahoo!>]]> the necessity is clear. The other valid option would be to return, not CDATA, but a regular text node, which would look like straight text if there were no special characters in it, and would otherwise have every such character individually escaped. That's what I get from pg11: SELECT unnest(xpath('//cname/aname/text()', '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml)); pg11: select 5 & 6 <yahoo!> So what pg11 is doing is also "not wrong" (in this respect, anyway). And looks "more natural", in the case where the value has no characters that need escaping. Which may or may not be a good thing. Perhaps it could lead the unwary in some cases to think such a query is giving a directly usable text string back, which will be harmless until the one time a value with escaping comes back. (The no-surprises way to get back a directly usable text string, if that's what's wanted, would be with XMLTABLE and an output column of text type.) Oddly, what pg12 is doing seems to be influenced by the form of escaping used in the input: SELECT unnest(xpath('//cname/aname/text()', '<cname><aname><![CDATA[select 5 & 6 <yahoo!>]]></aname></cname>'::xml)); unnest ----------------------------------- <![CDATA[select 5 & 6 <yahoo!>]]> SELECT unnest(xpath('//cname/aname/text()', '<cname><aname>select 5 & 6 <yahoo!></aname></cname>'::xml)); unnest --------------------------------- select 5 & 6 <yahoo!> Either form of result is correct, and having it respect the form that was used in the input might even be delightfully smart. I haven't looked in the code just now to see if it is intentionally being delightfully smart, or more simplistic-and-lucky. Regards, -Chap
On 10/24/19 19:14, Chapman Flack wrote: > <![CDATA[select 5 & 6 <yahoo!>]]> > select 5 & 6 <yahoo!> > > Either form of result is correct, and having it respect the form that was > used in the input might even be delightfully smart. > > I haven't looked in the code just now to see if it is intentionally being > delightfully smart, or more simplistic-and-lucky. It appears to be probably unintentional-but-ok: libxml tags a CDATA section differently (XML_CDATA_SECTION_NODE) than a text node (XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE) where it gets dumped faithfully by xmlNodeDump(), with the upshot that the result always will be well-formed XML, and will respect whether the input was supplied as CDATA or not. We could choose to say that's what we meant it to do all along. Regards, -Chap
On 10/24/19 19:14, Chapman Flack wrote: > Which may or may not be a good thing. Perhaps it could lead the unwary > in some cases to think such a query is giving a directly usable > text string back, which will be harmless until the one time a value > with escaping comes back. (The no-surprises way to get back a directly > usable text string, if that's what's wanted, would be with XMLTABLE > and an output column of text type.) If the original reporter is happy with getting xml back, instead of raw text, but only wants the form of escaping to be the same as it was in pg11, the xpath //cname/aname/text() can just be rewritten as string(//cname/aname), which will lose any CDATA-ness and consistently produce the form where individual characters are escaped, in 11 or 12. I think. Regards, -Chap
On 25 Oct 2019, at 01:34, Chapman Flack <chap@anastigmatix.net> wrote:On 10/24/19 19:14, Chapman Flack wrote:<![CDATA[select 5 & 6 <yahoo!>]]>
select 5 & 6 <yahoo!>
Either form of result is correct, and having it respect the form that was
used in the input might even be delightfully smart.
I haven't looked in the code just now to see if it is intentionally being
delightfully smart, or more simplistic-and-lucky.
It appears to be probably unintentional-but-ok: libxml tags a CDATA
section differently (XML_CDATA_SECTION_NODE) than a text node
(XML_TEXT_NODE), so a CDATA node falls into the catch-all branch of
if (cur->type != XML_ATTRIBUTE_NODE && cur->type != XML_TEXT_NODE)
My concern at that time was about xmltable but I think the current behavior is fine for xpath too. If you pick fragment(s) out of an XML document the most reasonable thing to do is to return that part(s) of the XML as it actually appears in the input.
We could choose to say that's
what we meant it to do all along.
+1
-markus
Chapman Flack <chap@anastigmatix.net> writes: > On 10/24/19 19:14, Chapman Flack wrote: >> Which may or may not be a good thing. Perhaps it could lead the unwary >> in some cases to think such a query is giving a directly usable >> text string back, which will be harmless until the one time a value >> with escaping comes back. (The no-surprises way to get back a directly >> usable text string, if that's what's wanted, would be with XMLTABLE >> and an output column of text type.) > If the original reporter is happy with getting xml back, instead of > raw text, but only wants the form of escaping to be the same as it was > in pg11, the xpath //cname/aname/text() can just be rewritten as > string(//cname/aname), which will lose any CDATA-ness and consistently > produce the form where individual characters are escaped, in 11 or 12. > I think. Yeah, that does work the same for me in 11 and HEAD: regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml)); unnest --------------- select 5>4 (1 row) So I'm inclined to write this off as an intended, or at least not unwanted, change. However, this complaint does reinforce Alvaro's original choice not to back-patch 251cf2e27. The v12 release notes don't call this out as a compatibility issue, nor mention anything about CDATA handling: * Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand, Chapman Flack) Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases where nothing was output for a node, or an unexpected error was thrown, or necessary escaping of XML special characters was omitted. Should we extend that item (if so, what to say exactly?) and/or move it into the "Observe the following incompatibilities" list? regards, tom lane
pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Chapman Flack <chap@anastigmatix.net> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)
> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.
> I think.
Yeah, that does work the same for me in 11 and HEAD:
regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
unnest
---------------
select 5>4
(1 row)
So I'm inclined to write this off as an intended, or at least not
unwanted, change. However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.
The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:
* Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
Chapman Flack)
Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
where nothing was output for a node, or an unexpected error was
thrown, or necessary escaping of XML special characters was omitted.
Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?
Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.
Regards
Pavel
regards, tom lane
Re: BUG #16046: xpath returns CDATA tag along with the value inpostgres 12
От
Mohammad Mostafa Kamal
Дата:
Thank you all so very much for your valuable comments.
So we can be take it for granted that, the behavior of CDATA will remain as it is currently now in pg12 and we need change our queries accordingly to handle this from now on.
Thanks again,
Mostafa
Mostafa
From: Pavel Stehule <pavel.stehule@gmail.com>
Sent: Tuesday, November 5, 2019 10:28 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Chapman Flack <chap@anastigmatix.net>; mostafa_bit0108@hotmail.com <mostafa_bit0108@hotmail.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>; Markus Winand <markus.winand@winand.at>
Subject: Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12
Sent: Tuesday, November 5, 2019 10:28 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Chapman Flack <chap@anastigmatix.net>; mostafa_bit0108@hotmail.com <mostafa_bit0108@hotmail.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>; Markus Winand <markus.winand@winand.at>
Subject: Re: BUG #16046: xpath returns CDATA tag along with the value in postgres 12
pá 25. 10. 2019 v 19:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Chapman Flack <chap@anastigmatix.net> writes:
> On 10/24/19 19:14, Chapman Flack wrote:
>> Which may or may not be a good thing. Perhaps it could lead the unwary
>> in some cases to think such a query is giving a directly usable
>> text string back, which will be harmless until the one time a value
>> with escaping comes back. (The no-surprises way to get back a directly
>> usable text string, if that's what's wanted, would be with XMLTABLE
>> and an output column of text type.)
> If the original reporter is happy with getting xml back, instead of
> raw text, but only wants the form of escaping to be the same as it was
> in pg11, the xpath //cname/aname/text() can just be rewritten as
> string(//cname/aname), which will lose any CDATA-ness and consistently
> produce the form where individual characters are escaped, in 11 or 12.
> I think.
Yeah, that does work the same for me in 11 and HEAD:
regression=# SELECT unnest(xpath('string(//cname/aname)','<cname><aname><![CDATA[select 5>4]]></aname></cname>'::xml));
unnest
---------------
select 5>4
(1 row)
So I'm inclined to write this off as an intended, or at least not
unwanted, change. However, this complaint does reinforce Alvaro's
original choice not to back-patch 251cf2e27.
The v12 release notes don't call this out as a compatibility issue,
nor mention anything about CDATA handling:
* Fix assorted bugs in XML functions (Pavel Stehule, Markus Winand,
Chapman Flack)
Specifically, in XMLTABLE, xpath(), and xmlexists(), fix some cases
where nothing was output for a node, or an unexpected error was
thrown, or necessary escaping of XML special characters was omitted.
Should we extend that item (if so, what to say exactly?) and/or move
it into the "Observe the following incompatibilities" list?
Somewhere should be strong warning about change of processing of CDATA nodes - originally these nodes was converted to text and escaped, now they returned in raw format - without any conversions. The example with "string" function should be used there.
Regards
Pavel
regards, tom lane