Обсуждение: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
Hi! I’ve found two issues with XML/XPath integration in PostgreSQL. Two patches are attached. I’ve just separated them becausethe second one is an incompatible change. * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression. > Column_expressions that match TEXT or CDATA nodes must return > the contents of the node itself, not the content of the > non-existing childs (i.e. the empty string). The following query returns a wrong result in master: SELECT * FROM (VALUES ('<xml>text</xml>'::xml) , ('<xml><![CDATA[<cdata>]]></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" TEXT PATH 'node()' ) t x | node() --------------------------------+-------- <xml>text</xml> | <xml><![CDATA[<cdata>]]></xml> | (the “node()” column is empty) The correct result can be obtained with patch 0001 applied: x | node() --------------------------------+--------- <xml>text</xml> | text <xml><![CDATA[<cdata>]]></xml> | <cdata> The patch adopts existing tests to guard against future regressions. * Patch 0002: Set correct context for XPath evaluation. > According to the ISO standard, the context of XMLTABLE's XPath > row_expression is the document node of the XML[1] - not the root node. > > The respective code is shared with non-ISO functions such as xpath > and xpath_exists so that the change also affects these functions. It’s an incompatible change - even the regression tests need to be adjusted because they test for the “wrong” behaviour. The documentation, on the other hand, does neither document the behaviour explicitly, no does it show any example that breaks due to this patch. The alternatives to this patch are (1) documenting the current standard-breaking behaviour and (2) fixing the context only for ISO functions. Personally, I don’t like either of them. I’ve checked the libxml2 code to see what’s the proper way to set the context to the document node. It’s indeed just “(xmlNodePtr)ctxt->doc”. See: https://github.com/GNOME/libxml2/blob/7abec671473b837f99181442d59edd0cc2ee01d1/xpath.c#L14306 -markus -- [1] SQL/XML:2011, 6.18 General Rule 1aii2.
Вложения
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
Markus Winand wrote: Hi Markus, > I’ve found two issues with XML/XPath integration in PostgreSQL. Two > patches are attached. I’ve just separated them because the second one > is an incompatible change. Thanks for these. I'll have a look at them after the commitfest is over. In the meantime, if Peter or Pavel have comments, it'd be good to hear them. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
Hello Markus, Markus Winand wrote: > * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression. > > > Column_expressions that match TEXT or CDATA nodes must return > > the contents of the node itself, not the content of the > > non-existing childs (i.e. the empty string). > > The following query returns a wrong result in master: > > SELECT * > FROM (VALUES ('<xml>text</xml>'::xml) > , ('<xml><![CDATA[<cdata>]]></xml>'::xml) > ) d(x) > CROSS JOIN LATERAL > XMLTABLE('/xml' > PASSING x > COLUMNS "node()" TEXT PATH 'node()' > ) t > The correct result can be obtained with patch 0001 applied: > > x | node() > --------------------------------+--------- > <xml>text</xml> | text > <xml><![CDATA[<cdata>]]></xml> | <cdata> > > The patch adopts existing tests to guard against future regressions. I remembered while reading this that Craig Ringer had commented that XML text sections can have multiple children: just put XML comments amidst the text. To test this, I added a comment in one of the text values, in the regression database: update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml; With the modified data, the new query in the regression tests fails: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS id int PATH '@id', _id FOR ORDINALITY, country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, country_id text PATH 'COUNTRY_ID', region_id int PATH 'REGION_ID', size float PATH 'SIZE', unit text PATH 'SIZE/@unit', premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified'); ERROR: more than one value returned by column XPath expression This seems really undesirable, so I looked into getting it fixed. Turns out that this error is being thrown from inside the same function we're modifying, line 4559. I didn't have a terribly high opinion of that ereport() when working on this feature to begin with, so now that it's proven to provide a bad experience, let's try removing it. With that change (patch attached), the feature seems to work; I tried with this query, which seems to give the expected output (notice we have some columns of type xml, others of type text, with and without the text() function in the path): SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL, country_name_n text PATH 'COUNTRY_NAME' NOT NULL, country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL); country_name │ country_name_xml │ country_name_n │ country_name_xml_n ────────────────┼──────────────────┼────────────────┼─────────────────────────────────────────────────────── Australia │ Australia │ Australia │ <COUNTRY_NAME>Australia</COUNTRY_NAME> China │ China │ China │ <COUNTRY_NAME>China</COUNTRY_NAME> HongKong │ HongKong │ HongKong │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME> India │ India │ India │ <COUNTRY_NAME>India</COUNTRY_NAME> Japan │ Japan │ Japan │ <COUNTRY_NAME>Japan</COUNTRY_NAME> Singapore │ Singapore │ Singapore │ <COUNTRY_NAME>Singapore</COUNTRY_NAME> Czech Republic │ Czech Republic │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME> Germany │ Germany │ Germany │ <COUNTRY_NAME>Germany</COUNTRY_NAME> France │ France │ France │ <COUNTRY_NAME>France</COUNTRY_NAME> Egypt │ Egypt │ Egypt │ <COUNTRY_NAME>Egypt</COUNTRY_NAME> Sudan │ Sudan │ Sudan │ <COUNTRY_NAME>Sudan</COUNTRY_NAME> (11 filas) This means that the concatenation now works for all types, not just xml, so I can do this also: update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml; SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text float PATH 'SIZE/text()', "SIZE" float, size_xml xml PATH 'SIZE'); country_name │ size_text │ SIZE ────────────────┼───────────┼────── Australia │ │ China │ │ HongKong │ │ India │ │ Japan │ │ Singapore │ 791 │ 791 Czech Republic │ │ Germany │ │ France │ │ Egypt │ │ Sudan │ │ (11 filas) I'm not sure if this concatenation of things that are not text or XML is undesirable for whatever reason. Any clues? Also, array indexes behave funny. First let's add more XML comments inside that number, and query for the subscripts: update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml; SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text float PATH 'SIZE/text()', size_text_1 float PATH 'SIZE/text()[1]', size_text_2 float PATH 'SIZE/text()[2]', "SIZE" float, size_xml xml PATH 'SIZE') where size_text is not null; country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼─────────────────────────────────────────────────────── Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE> (1 fila) I don't know what to make of that. Seems pretty broken. After this, I looked for some examples of XPath syntax in the interwebs. I came across the | operator (which apparently is a "union" of some sort). Behold: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text text PATH 'SIZE/text() | SIZE/@unit') where size_text is not null ; country_name │ size_text ──────────────┼─────────── Singapore │ km791 (1 fila) The "units" property is ahead of the text(), which is pretty odd. But if I remove the text() call, it puts the units after the text: SELECT xmltable.* FROM (SELECT data FROM xmldata) x, LATERAL XMLTABLE('/ROWS/ROW' PASSING data COLUMNS country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, size_text text PATH 'SIZE | SIZE/@unit') where size_text is not null ; country_name │ size_text ──────────────┼───────────────────────────────────────────────────────── Singapore │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km (1 fila) Again, I don't know what to make of this. Anyway, this seems firmly in the libxml side of things; the only conclusion I have is that nobody ever uses libxml terribly much for complex XPath processing. Basing another test on your original test case, look at the first row here. Is this result correct? SELECT * FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" TEXT PATH 'node()' ) t; x │ node() ───────────────────────────────────────────────────────────┼──────────────────────────────────── <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> Why was the comment contents expanded inside node()? Note what happens if I change the type from text to xml in that column: SELECT * FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" xml PATH 'node()' ) t; x │ node() ───────────────────────────────────────────────────────────┼──────────────────────────────────────────────── <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> (2 filas) Further note that, per the standard, you can omit the PATH clause in which case the column name is used as the path, which seems to work correctly. Phew! > * Patch 0002: Set correct context for XPath evaluation. > > > According to the ISO standard, the context of XMLTABLE's XPath > > row_expression is the document node of the XML[1] - not the root node. > > > > The respective code is shared with non-ISO functions such as xpath > > and xpath_exists so that the change also affects these functions. > > It’s an incompatible change - even the regression tests need to be adjusted because they > test for the “wrong” behaviour. I'm really afraid to change the non-ISO functions in PG10, since it's already released for quite some time with this long-standing behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10 and leave the non-iso functions unpatched. I think the easiest route is to patch all functions only for PostgreSQL 11. Maybe if XMLTABLE is considered unacceptably broken in PostgreSQL 10 we can patch only that one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
От
Markus Winand
Дата:
Hi and thanks for your efforts. > On 2018-04-26, at 21:18 , Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Hello Markus, > > Markus Winand wrote: > >> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression. >> >>> Column_expressions that match TEXT or CDATA nodes must return >>> the contents of the node itself, not the content of the >>> non-existing childs (i.e. the empty string). >> >> The following query returns a wrong result in master: >> >> SELECT * >> FROM (VALUES ('<xml>text</xml>'::xml) >> , ('<xml><![CDATA[<cdata>]]></xml>'::xml) >> ) d(x) >> CROSS JOIN LATERAL >> XMLTABLE('/xml' >> PASSING x >> COLUMNS "node()" TEXT PATH 'node()' >> ) t > >> The correct result can be obtained with patch 0001 applied: >> >> x | node() >> --------------------------------+--------- >> <xml>text</xml> | text >> <xml><![CDATA[<cdata>]]></xml> | <cdata> >> >> The patch adopts existing tests to guard against future regressions. > > I remembered while reading this that Craig Ringer had commented that XML > text sections can have multiple children: just put XML comments amidst > the text. As per my understanding of XML, this is a sibling—not a child—of two text nodes. > To test this, I added a comment in one of the text values, in > the regression database: > > update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml; > > With the modified data, the new query in the regression tests fails: > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data > COLUMNS id int PATH '@id', > _id FOR ORDINALITY, > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > country_id text PATH 'COUNTRY_ID', > region_id int PATH 'REGION_ID', > size float PATH 'SIZE', > unit text PATH 'SIZE/@unit', > premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified'); > > ERROR: more than one value returned by column Path expression I’ve tried matching “node()” against XML that includes a mixture of text and comment nodes in other database products. Two out of two failed with a similar error message. - ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence - SQLCODE=-16003, SQLSTATE=10507, SQLERRMC=( item(), item()+ ) See: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.messages.sql.doc/doc/msql16003n.html So I digged through the standard to figure out what the standard-mandated behaviour is. The bottom line is that an error is the standard mandated behavior because SQL uses a XPath/XQuery “cast" in the procedure. The XPath/XQuery spec says “If the result of atomization is a sequence of more than one atomic value, a type error is raised [err:XPTY0004]” (compare that to the ORA- above). https://www.w3.org/TR/xpath-31/#id-cast For reference, how I came there: The XPath/XQuery cast is triggered for non XML types in XMLCAST (SQL-14:2011 6.6 GR 4 h iv.) which is in turn triggered by XMLTABLE (SQL-14:2011 7.1 SR 4 e ii 8). [Note: I only have SQL-14:2011, thus no references to :2016] > This seems really undesirable, so I looked into getting it fixed. Turns > out that this error is being thrown from inside the same function we're > modifying, line 4559. I didn't have a terribly high opinion of that > ereport() when working on this feature to begin with, so now that it's > proven to provide a bad experience, let's try removing it. With that > change (patch attached), the feature seems to work; I tried with this > query, which seems to give the expected output (notice we have some > columns of type xml, others of type text, with and without the text() > function in the path): > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data COLUMNS > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL, > country_name_n text PATH 'COUNTRY_NAME' NOT NULL, > country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL); > country_name │ country_name_xml │ country_name_n │ country_name_xml_n > ────────────────┼──────────────────┼────────────────┼─────────────────────────────────────────────────────── > Australia │ Australia │ Australia │ <COUNTRY_NAME>Australia</COUNTRY_NAME> > China │ China │ China │ <COUNTRY_NAME>China</COUNTRY_NAME> > HongKong │ HongKong │ HongKong │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME> > India │ India │ India │ <COUNTRY_NAME>India</COUNTRY_NAME> > Japan │ Japan │ Japan │ <COUNTRY_NAME>Japan</COUNTRY_NAME> > Singapore │ Singapore │ Singapore │ <COUNTRY_NAME>Singapore</COUNTRY_NAME> > Czech Republic │ Czech Republic │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME> > Germany │ Germany │ Germany │ <COUNTRY_NAME>Germany</COUNTRY_NAME> > France │ France │ France │ <COUNTRY_NAME>France</COUNTRY_NAME> > Egypt │ Egypt │ Egypt │ <COUNTRY_NAME>Egypt</COUNTRY_NAME> > Sudan │ Sudan │ Sudan │ <COUNTRY_NAME>Sudan</COUNTRY_NAME> > (11 filas) > > > This means that the concatenation now works for all types, not just xml, so I > can do this also: > > update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml; > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data > COLUMNS > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > size_text float PATH 'SIZE/text()', > "SIZE" float, size_xml xml PATH 'SIZE'); > country_name │ size_text │ SIZE > ────────────────┼───────────┼────── > Australia │ │ > China │ │ > HongKong │ │ > India │ │ > Japan │ │ > Singapore │ 791 │ 791 > Czech Republic │ │ > Germany │ │ > France │ │ > Egypt │ │ > Sudan │ │ > (11 filas) > > I'm not sure if this concatenation of things that are not text or XML is > undesirable for whatever reason. Any clues? As per above the reason against it is standard conformance. > Also, array indexes behave funny. First let's add more XML comments > inside that number, and query for the subscripts: > > update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml; > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data > COLUMNS > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > size_text float PATH 'SIZE/text()', > size_text_1 float PATH 'SIZE/text()[1]', > size_text_2 float PATH 'SIZE/text()[2]', > "SIZE" float, size_xml xml PATH 'SIZE') > where size_text is not null; > > country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml > ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼─────────────────────────────────────────────────────── > Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE> > (1 fila) > > I don't know what to make of that. Seems pretty broken. Absolutely. Also, node() matching comments or processing instructions seems to be broken too: SELECT * FROM (VALUES ('<xml><!--comment--></xml>'::xml) , ('<xml><?pi content?></xml>'::xml) ) d(x) CROSS JOIN LATERAL XMLTABLE('/xml' PASSING x COLUMNS "node()" TEXT PATH 'node()' ) t x | node() ---------------------------+-------- <xml><!--comment--></xml> | <xml><?pi content?></xml> | (2 rows) I can look into this, but it may take a while. As per my understanding explained above, I’d base future work on my original patch. > After this, I looked for some examples of XPath syntax in the interwebs. > I came across the | operator (which apparently is a "union" of some > sort). Behold: > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data > COLUMNS > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > size_text text PATH 'SIZE/text() | SIZE/@unit') > where size_text is not null ; > > country_name │ size_text > ──────────────┼─────────── > Singapore │ km791 > (1 fila) > > The "units" property is ahead of the text(), which is pretty odd. As per above, I think this is outside the spec (as it matches several nodes). Other than that it is related to the “document order” in which XPath generally returns matched nodes (except a few axis that return their matched in reverse order—e.g. ancestor::) > But if I > remove the text() call, it puts the units after the text: > > SELECT xmltable.* > FROM (SELECT data FROM xmldata) x, > LATERAL XMLTABLE('/ROWS/ROW' > PASSING data > COLUMNS > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > size_text text PATH 'SIZE | SIZE/@unit') > where size_text is not null ; > country_name │ size_text > ──────────────┼───────────────────────────────────────────────────────── > Singapore │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km > (1 fila) > > Again, I don't know what to make of this. As per above, the document order is: 1. <size> opening tag 2. the @unit attribute for size 3. the text() inside the size. Now that you match (1) instead of (3), it is returned first. > Anyway, this seems firmly in > the libxml side of things; the only conclusion I have is that nobody > ever uses libxml terribly much for complex XPath processing. > > > Basing another test on your original test case, look at the first row > here. Is this result correct? > > SELECT * > FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) > , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) > ) d(x) > CROSS JOIN LATERAL > XMLTABLE('/xml' > PASSING x > COLUMNS "node()" TEXT PATH 'node()' > ) t; > x │ node() > ───────────────────────────────────────────────────────────┼──────────────────────────────────── > <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt > <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> > > Why was the comment contents expanded inside node()? Because node() matches also comments and your patches concatenates all matches. > Note what happens > if I change the type from text to xml in that column: > > SELECT * > FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) > , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) > ) d(x) > CROSS JOIN LATERAL > XMLTABLE('/xml' > PASSING x > COLUMNS "node()" xml PATH 'node()' > ) t; > > x │ node() > ───────────────────────────────────────────────────────────┼──────────────────────────────────────────────── > <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt > <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> > (2 filas) The comment seems to be wrong. I guess it’s fine if the CDATA gets transformed in to an equivalent string using the XML entities. Yet, it might be better avoiding it. Will look into it, but may take a while. > Further note that, per the standard, you can omit the PATH clause in > which case the column name is used as the path, which seems to work > correctly. Phew! > >> * Patch 0002: Set correct context for XPath evaluation. >> >>> According to the ISO standard, the context of XMLTABLE's XPath >>> row_expression is the document node of the XML[1] - not the root node. >>> >>> The respective code is shared with non-ISO functions such as xpath >>> and xpath_exists so that the change also affects these functions. >> >> It’s an incompatible change - even the regression tests need to be adjusted because they >> test for the “wrong” behaviour. > > I'm really afraid to change the non-ISO functions in PG10, since > it's already released for quite some time with this long-standing > behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10 > and leave the non-iso functions unpatched. I think the easiest route is > to patch all functions only for PostgreSQL 11. Maybe if XMLTABLE is > considered unacceptably broken in PostgreSQL 10 we can patch only that > one. I’m in favour of doing all at once in PG11. As the XMLTABLE examples in the docs consistently use absolute XPath expressions, there is hope that people do it this way so that they won’t be affected by the patch. When I stumbled upon this issue, I first though that the context is not set at all. I did not even come to my mind that the context could be there, but wrongly set. I only found out that it actually is set when I made the patch. In other words, I would not have made code relying on the wrong behaviour. Unfortunately, this argument doesn’t hold for the non-ISO functions, which are also longer in PostgreSQL. -markus
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
Hello hackers, Any objections fixing this on Pg11? My opinion is that it's better to fix it now rather than waiting for pg12. It's already broken in pg10 (xmltable) and older (rest of the xml stuff). As Markus advised, I'd not backpatch fixes for fear of breaking stuff, but I'd rather release pg11 with the correct behavior. I intend to get it done this week if there are no serious objections. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
Hi On 2018-Mar-27, Markus Winand wrote: > I’ve found two issues with XML/XPath integration in PostgreSQL. Two > patches are attached. I’ve just separated them because the second one > is an incompatible change. > > * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression. I pushed 0001 to REL_10_STABLE and master -- since there is no incompatible behavior change. (Well, there is a behavior change, but it's pretty hard to imagine someone would be relying on the bogus old output.) I'll see about 0002 now -- only to master, as discussed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
On 2018-Jun-20, Alvaro Herrera wrote: > I'll see about 0002 now -- only to master, as discussed. Here's a patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
I have pushed the patch now (in your original form rather than my later formulation) -- let's see how the buildfarm likes it. There are (at least) three issues remaining, as per below; Pavel, do you have any insight on these? First one is about array indexes not working sanely (I couldn't get this to work in Oracle) > > Also, array indexes behave funny. First let's add more XML comments > > inside that number, and query for the subscripts: > > > > update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml; > > > > SELECT xmltable.* > > FROM (SELECT data FROM xmldata) x, > > LATERAL XMLTABLE('/ROWS/ROW' > > PASSING data > > COLUMNS > > country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, > > size_text float PATH 'SIZE/text()', > > size_text_1 float PATH 'SIZE/text()[1]', > > size_text_2 float PATH 'SIZE/text()[2]', > > "SIZE" float, size_xml xml PATH 'SIZE') > > where size_text is not null; > > > > country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml > > ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼─────────────────────────────────────────────────────── > > Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE> > > (1 fila) The second one is about (lack of!) processing instructions and comments: > Also, node() matching comments or processing instructions > seems to be broken too: > > SELECT * > FROM (VALUES ('<xml><!--comment--></xml>'::xml) > , ('<xml><?pi content?></xml>'::xml) > ) d(x) > CROSS JOIN LATERAL > XMLTABLE('/xml' > PASSING x > COLUMNS "node()" TEXT PATH 'node()' > ) t > > x | node() > ---------------------------+-------- > <xml><!--comment--></xml> | > <xml><?pi content?></xml> | > (2 rows) > > I can look into this, but it may take a while. Compare the empty second columns with oracle behavior, which returns the contents of the PI and the comment. As a script for http://rextester.com/l/oracle_online_compiler create table xmltb (data xmltype) \\ insert into xmltb values ('<xml><!--the comment is here--></xml>') \\ insert into xmltb values ('<xml><?pi php_stuff(); do_stuff("hello"); ?></xml>') \\ SELECT * FROM xmltb, XMLTABLE('/xml' PASSING data COLUMNS node varchar2(100) PATH 'node()') t \\ drop table xmltb \\ The third issue is the way we output comments when they're in a column of type XML: > > Note what happens if I change the type from text to xml in that > > column: > > > > SELECT * > > FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) > > , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) > > ) d(x) > > CROSS JOIN LATERAL > > XMLTABLE('/xml' > > PASSING x > > COLUMNS "node()" xml PATH 'node()' > > ) t; > > > > x │ node() > > ───────────────────────────────────────────────────────────┼──────────────────────────────────────────────── > > <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt > > <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> > > (2 filas) > > The comment seems to be wrong. > > I guess it’s fine if the CDATA gets transformed in to an equivalent > string using the XML entities. Yet, it might be better avoiding it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
От
Markus Winand
Дата:
Hi! > I have pushed the patch now (in your original form rather than my later > formulation) -- let's see how the buildfarm likes it. There are (at > least) three issues remaining, as per below; Pavel, do you have any > insight on these? Attached patch 0001 fixes all three issues and one more that is, again, an incompatible change (see below). > On 2018-06-21, at 22:27 , Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > First one is about array indexes not working sanely (I couldn't get this > to work in Oracle) > >>> Also, array indexes behave funny. First let's add more XML comments >>> inside that number, and query for the subscripts: >>> >>> update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml; >>> >>> SELECT xmltable.* >>> FROM (SELECT data FROM xmldata) x, >>> LATERAL XMLTABLE('/ROWS/ROW' >>> PASSING data >>> COLUMNS >>> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL, >>> size_text float PATH 'SIZE/text()', >>> size_text_1 float PATH 'SIZE/text()[1]', >>> size_text_2 float PATH 'SIZE/text()[2]', >>> "SIZE" float, size_xml xml PATH 'SIZE') >>> where size_text is not null; >>> >>> country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml >>> ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼─────────────────────────────────────────────────────── >>> Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE> >>> (1 fila) The concatenation was caused by the use of xmlNodeListGetString, which I have completely eliminated because it was even wrong for element nodes. As per my understanding (and how Oracle and DB2 do it), text nodes contained in nested elements must also be included in the string representation. This is an incompatible change that needs adoption of an regression test: SELECT * FROM xmltable('/root' passing '<root><element>a1a<!-- aaaa -->a2a<?aaaaa?> <!--z--> bbbb<x>xxx</x>cccc</element></root>'COLUMNS element text); element -------------------- a1aa2a bbbbcccc This should actually include the ‘xxx’ from the nested <x> element as well. As I could not find a libxml function that does this, I’ve introduced a new one (xml_xmlnodetostring) in the vein of libxml’s xmlNodeListGetString. > The second one is about (lack of!) processing instructions and comments: > >> Also, node() matching comments or processing instructions >> seems to be broken too: >> >> SELECT * >> FROM (VALUES ('<xml><!--comment--></xml>'::xml) >> , ('<xml><?pi content?></xml>'::xml) >> ) d(x) >> CROSS JOIN LATERAL >> XMLTABLE('/xml' >> PASSING x >> COLUMNS "node()" TEXT PATH 'node()' >> ) t >> >> x | node() >> ---------------------------+-------- >> <xml><!--comment--></xml> | >> <xml><?pi content?></xml> | >> (2 rows) >> This was easy to fix: PI’s and comments contain their data in their own content, not in a child. > The third issue is the way we output comments when they're in a column > of type XML: > >>> Note what happens if I change the type from text to xml in that >>> column: >>> >>> SELECT * >>> FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml) >>> , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml) >>> ) d(x) >>> CROSS JOIN LATERAL >>> XMLTABLE('/xml' >>> PASSING x >>> COLUMNS "node()" xml PATH 'node()' >>> ) t; >>> >>> x │ node() >>> ───────────────────────────────────────────────────────────┼──────────────────────────────────────────────── >>> <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt >>> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff> >>> (2 filas) I included comments, CDATA sections and PIs into the list of nodes that need to use libxml's xmlNodeDump. The 0001 patch also removes the distinction between the case of "count == 1 && typid == XMLOID” and "count > 1" because I don’t see any reason for two different code paths there (except that one string copy is saved in the count == 1 case). The 0001 patch also adds regression tests for all three issues. There is a second patch, for which I don’t have strong opinions, that eliminates the use of xmlStrdup in two cases. As far as I can see, the xmlStrdup is only there so the xmlFree at the end can be done unconditionally. I found it disturbing to do a strdup for the sole purpose it can be free-ed again. However, the patch is not a beauty, just don’t apply if there is any doubt. -markus
Вложения
Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context
От
Alvaro Herrera
Дата:
On 2018-Jul-08, Markus Winand wrote: > Hi! > > > I have pushed the patch now (in your original form rather than my later > > formulation) -- let's see how the buildfarm likes it. There are (at > > least) three issues remaining, as per below; Pavel, do you have any > > insight on these? > > Attached patch 0001 fixes all three issues and one more that is, again, > an incompatible change (see below). For closure: this was re-published as https://postgr.es/m/3e8eab9e-7289-6c23-5e2c-153cccea2257@anastigmatix.net and got pushed as commit 251cf2e27be. Thank you for reporting this! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services