Обсуждение: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

Поиск
Список
Период
Сортировка

XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

От
Markus Winand
Дата:
Hi!

I’ve found two issues with XML/XPath integration in PostgreSQL. Two patches are attached. I’ve just separated them
becausethe second one is an incompatible change. 

* Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.

> Column_expressions that match TEXT or CDATA nodes must return
> the contents of the node itself, not the content of the
> non-existing childs (i.e. the empty string).

The following query returns a wrong result in master:

SELECT *
  FROM (VALUES ('<xml>text</xml>'::xml)
             , ('<xml><![CDATA[<cdata>]]></xml>'::xml)
       ) d(x)
  CROSS JOIN LATERAL
        XMLTABLE('/xml'
                 PASSING x
                 COLUMNS "node()" TEXT PATH 'node()'
                ) t

               x                | node()
--------------------------------+--------
 <xml>text</xml>                |
 <xml><![CDATA[<cdata>]]></xml> |

(the “node()” column is empty)

The correct result can be obtained with patch 0001 applied:

               x                | node()
--------------------------------+---------
 <xml>text</xml>                | text
 <xml><![CDATA[<cdata>]]></xml> | <cdata>

The patch adopts existing tests to guard against future regressions.

* Patch 0002: Set correct context for XPath evaluation.

> According to the ISO standard, the context of XMLTABLE's XPath
> row_expression is the document node of the XML[1] - not the root node.
>
> The respective code is shared with non-ISO functions such as xpath
> and xpath_exists so that the change also affects these functions.

It’s an incompatible change - even the regression tests need to be adjusted because they
test for the “wrong” behaviour. The documentation, on the other hand, does neither
document the behaviour explicitly, no does it show any example that breaks due to this patch.

The alternatives to this patch are (1) documenting the current standard-breaking behaviour
and (2) fixing the context only for ISO functions. Personally, I don’t like either of them.

I’ve checked the libxml2 code to see what’s the proper way to set the context to the
document node. It’s indeed just “(xmlNodePtr)ctxt->doc”.

See: https://github.com/GNOME/libxml2/blob/7abec671473b837f99181442d59edd0cc2ee01d1/xpath.c#L14306

-markus

--
[1] SQL/XML:2011, 6.18 General Rule 1aii2.










Вложения

Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
Markus Winand wrote:

Hi Markus,

> I’ve found two issues with XML/XPath integration in PostgreSQL. Two
> patches are attached. I’ve just separated them because the second one
> is an incompatible change.

Thanks for these.  I'll have a look at them after the commitfest is
over.  In the meantime, if Peter or Pavel have comments, it'd be good to
hear them.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
Hello Markus,

Markus Winand wrote:

> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.
> 
> > Column_expressions that match TEXT or CDATA nodes must return
> > the contents of the node itself, not the content of the
> > non-existing childs (i.e. the empty string).
> 
> The following query returns a wrong result in master:
> 
> SELECT *
>   FROM (VALUES ('<xml>text</xml>'::xml)
>              , ('<xml><![CDATA[<cdata>]]></xml>'::xml)
>        ) d(x)
>   CROSS JOIN LATERAL
>         XMLTABLE('/xml'
>                  PASSING x
>                  COLUMNS "node()" TEXT PATH 'node()'
>                 ) t

> The correct result can be obtained with patch 0001 applied:
> 
>                x                | node()
> --------------------------------+---------
>  <xml>text</xml>                | text
>  <xml><![CDATA[<cdata>]]></xml> | <cdata>
>     
> The patch adopts existing tests to guard against future regressions.

I remembered while reading this that Craig Ringer had commented that XML
text sections can have multiple children: just put XML comments amidst
the text.  To test this, I added a comment in one of the text values, in
the regression database:

update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml;

With the modified data, the new query in the regression tests fails:

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data
                         COLUMNS id int PATH '@id',
                                  _id FOR ORDINALITY,
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  country_id text PATH 'COUNTRY_ID',
                                  region_id int PATH 'REGION_ID',
                                  size float PATH 'SIZE',
                                  unit text PATH 'SIZE/@unit',
                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');

ERROR:  more than one value returned by column XPath expression

This seems really undesirable, so I looked into getting it fixed.  Turns
out that this error is being thrown from inside the same function we're
modifying, line 4559.  I didn't have a terribly high opinion of that
ereport() when working on this feature to begin with, so now that it's
proven to provide a bad experience, let's try removing it.  With that
change (patch attached), the feature seems to work; I tried with this
query, which seems to give the expected output (notice we have some
columns of type xml, others of type text, with and without the text()
function in the path):

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data COLUMNS
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  country_name_n text PATH 'COUNTRY_NAME' NOT NULL,
                                  country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL);
  country_name  │ country_name_xml │ country_name_n │                  country_name_xml_n                   
────────────────┼──────────────────┼────────────────┼───────────────────────────────────────────────────────
 Australia      │ Australia        │ Australia      │ <COUNTRY_NAME>Australia</COUNTRY_NAME>
 China          │ China            │ China          │ <COUNTRY_NAME>China</COUNTRY_NAME>
 HongKong       │ HongKong         │ HongKong       │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME>
 India          │ India            │ India          │ <COUNTRY_NAME>India</COUNTRY_NAME>
 Japan          │ Japan            │ Japan          │ <COUNTRY_NAME>Japan</COUNTRY_NAME>
 Singapore      │ Singapore        │ Singapore      │ <COUNTRY_NAME>Singapore</COUNTRY_NAME>
 Czech Republic │ Czech Republic   │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME>
 Germany        │ Germany          │ Germany        │ <COUNTRY_NAME>Germany</COUNTRY_NAME>
 France         │ France           │ France         │ <COUNTRY_NAME>France</COUNTRY_NAME>
 Egypt          │ Egypt            │ Egypt          │ <COUNTRY_NAME>Egypt</COUNTRY_NAME>
 Sudan          │ Sudan            │ Sudan          │ <COUNTRY_NAME>Sudan</COUNTRY_NAME>
(11 filas)


This means that the concatenation now works for all types, not just xml, so I
can do this also:

update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml;

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data
                         COLUMNS 
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  size_text float PATH 'SIZE/text()',
                                  "SIZE" float, size_xml xml PATH 'SIZE');
  country_name  │ size_text │ SIZE 
────────────────┼───────────┼──────
 Australia      │           │     
 China          │           │     
 HongKong       │           │     
 India          │           │     
 Japan          │           │     
 Singapore      │       791 │  791
 Czech Republic │           │     
 Germany        │           │     
 France         │           │     
 Egypt          │           │     
 Sudan          │           │     
(11 filas)

I'm not sure if this concatenation of things that are not text or XML is
undesirable for whatever reason.  Any clues?

Also, array indexes behave funny.  First let's add more XML comments
inside that number, and query for the subscripts:

update xmldata set data = regexp_replace(data::text, '7<!--small country-->91',
'<!--ah-->7<!--oh-->9<!--uh-->1')::xml;

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data
                         COLUMNS
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  size_text float PATH 'SIZE/text()',
                                  size_text_1 float PATH 'SIZE/text()[1]',
                                  size_text_2 float PATH 'SIZE/text()[2]',
                                  "SIZE" float, size_xml xml PATH 'SIZE')
where size_text is not null;

 country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │                       size_xml
          
 

──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
 Singapore    │       791 │         791 │          91 │           1 │  791 │ <SIZE
unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
(1 fila)

I don't know what to make of that.  Seems pretty broken.

After this, I looked for some examples of XPath syntax in the interwebs.
I came across the | operator (which apparently is a "union" of some
sort).  Behold:

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data
                         COLUMNS
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  size_text text PATH 'SIZE/text() | SIZE/@unit')
where size_text is not null ;

 country_name │ size_text 
──────────────┼───────────
 Singapore    │ km791
(1 fila)

The "units" property is ahead of the text(), which is pretty odd.  But if I
remove the text() call, it puts the units after the text:

SELECT  xmltable.*
   FROM (SELECT data FROM xmldata) x,
        LATERAL XMLTABLE('/ROWS/ROW'
                         PASSING data
                         COLUMNS
                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
                                  size_text text PATH 'SIZE | SIZE/@unit')
where size_text is not null ;
 country_name │                        size_text                        
──────────────┼─────────────────────────────────────────────────────────
 Singapore    │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km
(1 fila)

Again, I don't know what to make of this.  Anyway, this seems firmly in
the libxml side of things; the only conclusion I have is that nobody
ever uses libxml terribly much for complex XPath processing.


Basing another test on your original test case, look at the first row
here.  Is this result correct?

SELECT *
  FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
             , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
       ) d(x)
  CROSS JOIN LATERAL
        XMLTABLE('/xml'
                 PASSING x
                 COLUMNS "node()" TEXT PATH 'node()'
                ) t;
                             x                             │               node()               
───────────────────────────────────────────────────────────┼────────────────────────────────────
 <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
 <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>

Why was the comment contents expanded inside node()?  Note what happens
if I change the type from text to xml in that column:

SELECT *
  FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
             , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
       ) d(x)
  CROSS JOIN LATERAL
        XMLTABLE('/xml'
                 PASSING x
                 COLUMNS "node()" xml PATH 'node()'
                ) t;

                             x                             │                     node()                     
───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
 <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
 <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
(2 filas)


Further note that, per the standard, you can omit the PATH clause in
which case the column name is used as the path, which seems to work
correctly.  Phew!

> * Patch 0002: Set correct context for XPath evaluation.
> 
> > According to the ISO standard, the context of XMLTABLE's XPath
> > row_expression is the document node of the XML[1] - not the root node.
> > 
> > The respective code is shared with non-ISO functions such as xpath
> > and xpath_exists so that the change also affects these functions.
> 
> It’s an incompatible change - even the regression tests need to be adjusted because they
> test for the “wrong” behaviour.

I'm really afraid to change the non-ISO functions in PG10, since
it's already released for quite some time with this long-standing
behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10
and leave the non-iso functions unpatched.  I think the easiest route is
to patch all functions only for PostgreSQL 11.  Maybe if XMLTABLE is
considered unacceptably broken in PostgreSQL 10 we can patch only that
one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

От
Markus Winand
Дата:
Hi and thanks for your efforts.

> On 2018-04-26, at 21:18 , Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Hello Markus,
>
> Markus Winand wrote:
>
>> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.
>>
>>> Column_expressions that match TEXT or CDATA nodes must return
>>> the contents of the node itself, not the content of the
>>> non-existing childs (i.e. the empty string).
>>
>> The following query returns a wrong result in master:
>>
>> SELECT *
>>  FROM (VALUES ('<xml>text</xml>'::xml)
>>             , ('<xml><![CDATA[<cdata>]]></xml>'::xml)
>>       ) d(x)
>>  CROSS JOIN LATERAL
>>        XMLTABLE('/xml'
>>                 PASSING x
>>                 COLUMNS "node()" TEXT PATH 'node()'
>>                ) t
>
>> The correct result can be obtained with patch 0001 applied:
>>
>>               x                | node()
>> --------------------------------+---------
>> <xml>text</xml>                | text
>> <xml><![CDATA[<cdata>]]></xml> | <cdata>
>>
>> The patch adopts existing tests to guard against future regressions.
>
> I remembered while reading this that Craig Ringer had commented that XML
> text sections can have multiple children: just put XML comments amidst
> the text.

As per my understanding of XML, this is a sibling—not a child—of two text nodes.

> To test this, I added a comment in one of the text values, in
> the regression database:
>
> update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml;
>
> With the modified data, the new query in the regression tests fails:
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data
>                         COLUMNS id int PATH '@id',
>                                  _id FOR ORDINALITY,
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  country_id text PATH 'COUNTRY_ID',
>                                  region_id int PATH 'REGION_ID',
>                                  size float PATH 'SIZE',
>                                  unit text PATH 'SIZE/@unit',
>                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
>
> ERROR:  more than one value returned by column Path expression

I’ve tried matching “node()” against XML that includes a mixture of text and
comment nodes in other database products. Two out of two failed with a
similar error message.

- ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence
- SQLCODE=-16003, SQLSTATE=10507, SQLERRMC=( item(), item()+ )
  See:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.messages.sql.doc/doc/msql16003n.html

So I digged through the standard to figure out what the standard-mandated
behaviour is.

The bottom line is that an error is the standard mandated behavior because
SQL uses a XPath/XQuery “cast" in the procedure. The XPath/XQuery spec
says “If the result of atomization is a sequence of more than one atomic value,
a type error is raised [err:XPTY0004]” (compare that to the ORA- above).
https://www.w3.org/TR/xpath-31/#id-cast

For reference, how I came there:
The XPath/XQuery cast is triggered for non XML types in
XMLCAST (SQL-14:2011 6.6 GR 4 h iv.) which is in turn triggered by
XMLTABLE (SQL-14:2011 7.1 SR 4 e ii 8).
[Note: I only have SQL-14:2011, thus no references to :2016]


> This seems really undesirable, so I looked into getting it fixed.  Turns
> out that this error is being thrown from inside the same function we're
> modifying, line 4559.  I didn't have a terribly high opinion of that
> ereport() when working on this feature to begin with, so now that it's
> proven to provide a bad experience, let's try removing it.  With that
> change (patch attached), the feature seems to work; I tried with this
> query, which seems to give the expected output (notice we have some
> columns of type xml, others of type text, with and without the text()
> function in the path):
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data COLUMNS
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  country_name_n text PATH 'COUNTRY_NAME' NOT NULL,
>                                  country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL);
>  country_name  │ country_name_xml │ country_name_n │                  country_name_xml_n
> ────────────────┼──────────────────┼────────────────┼───────────────────────────────────────────────────────
> Australia      │ Australia        │ Australia      │ <COUNTRY_NAME>Australia</COUNTRY_NAME>
> China          │ China            │ China          │ <COUNTRY_NAME>China</COUNTRY_NAME>
> HongKong       │ HongKong         │ HongKong       │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME>
> India          │ India            │ India          │ <COUNTRY_NAME>India</COUNTRY_NAME>
> Japan          │ Japan            │ Japan          │ <COUNTRY_NAME>Japan</COUNTRY_NAME>
> Singapore      │ Singapore        │ Singapore      │ <COUNTRY_NAME>Singapore</COUNTRY_NAME>
> Czech Republic │ Czech Republic   │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME>
> Germany        │ Germany          │ Germany        │ <COUNTRY_NAME>Germany</COUNTRY_NAME>
> France         │ France           │ France         │ <COUNTRY_NAME>France</COUNTRY_NAME>
> Egypt          │ Egypt            │ Egypt          │ <COUNTRY_NAME>Egypt</COUNTRY_NAME>
> Sudan          │ Sudan            │ Sudan          │ <COUNTRY_NAME>Sudan</COUNTRY_NAME>
> (11 filas)
>
>
> This means that the concatenation now works for all types, not just xml, so I
> can do this also:
>
> update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml;
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data
>                         COLUMNS
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  size_text float PATH 'SIZE/text()',
>                                  "SIZE" float, size_xml xml PATH 'SIZE');
>  country_name  │ size_text │ SIZE
> ────────────────┼───────────┼──────
> Australia      │           │
> China          │           │
> HongKong       │           │
> India          │           │
> Japan          │           │
> Singapore      │       791 │  791
> Czech Republic │           │
> Germany        │           │
> France         │           │
> Egypt          │           │
> Sudan          │           │
> (11 filas)
>
> I'm not sure if this concatenation of things that are not text or XML is
> undesirable for whatever reason.  Any clues?

As per above the reason against it is standard conformance.

> Also, array indexes behave funny.  First let's add more XML comments
> inside that number, and query for the subscripts:
>
> update xmldata set data = regexp_replace(data::text, '7<!--small country-->91',
'<!--ah-->7<!--oh-->9<!--uh-->1')::xml;
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data
>                         COLUMNS
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  size_text float PATH 'SIZE/text()',
>                                  size_text_1 float PATH 'SIZE/text()[1]',
>                                  size_text_2 float PATH 'SIZE/text()[2]',
>                                  "SIZE" float, size_xml xml PATH 'SIZE')
> where size_text is not null;
>
> country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │                       size_xml
            
>
──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
> Singapore    │       791 │         791 │          91 │           1 │  791 │ <SIZE
unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
> (1 fila)
>
> I don't know what to make of that.  Seems pretty broken.

Absolutely.

Also, node() matching comments or processing instructions
seems to be broken too:

SELECT *
 FROM (VALUES ('<xml><!--comment--></xml>'::xml)
            , ('<xml><?pi content?></xml>'::xml)
      ) d(x)
 CROSS JOIN LATERAL
       XMLTABLE('/xml'
                PASSING x
                COLUMNS "node()" TEXT PATH 'node()'
               ) t

             x             | node()
---------------------------+--------
 <xml><!--comment--></xml> |
 <xml><?pi content?></xml> |
(2 rows)

I can look into this, but it may take a while.

As per my understanding explained above, I’d base future work on my original patch.

> After this, I looked for some examples of XPath syntax in the interwebs.
> I came across the | operator (which apparently is a "union" of some
> sort).  Behold:
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data
>                         COLUMNS
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  size_text text PATH 'SIZE/text() | SIZE/@unit')
> where size_text is not null ;
>
> country_name │ size_text
> ──────────────┼───────────
> Singapore    │ km791
> (1 fila)
>
> The "units" property is ahead of the text(), which is pretty odd.

As per above, I think this is outside the spec (as it matches several nodes).

Other than that it is related to the “document order” in which XPath
generally returns matched nodes (except a few axis that return their matched
in reverse order—e.g. ancestor::)

>  But if I
> remove the text() call, it puts the units after the text:
>
> SELECT  xmltable.*
>   FROM (SELECT data FROM xmldata) x,
>        LATERAL XMLTABLE('/ROWS/ROW'
>                         PASSING data
>                         COLUMNS
>                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>                                  size_text text PATH 'SIZE | SIZE/@unit')
> where size_text is not null ;
> country_name │                        size_text
> ──────────────┼─────────────────────────────────────────────────────────
> Singapore    │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km
> (1 fila)
>
> Again, I don't know what to make of this.

As per above, the document order is:

1. <size> opening tag
2. the @unit attribute for size
3. the text() inside the size.

Now that you match (1) instead of (3), it is returned first.

>  Anyway, this seems firmly in
> the libxml side of things; the only conclusion I have is that nobody
> ever uses libxml terribly much for complex XPath processing.
>
>
> Basing another test on your original test case, look at the first row
> here.  Is this result correct?
>
> SELECT *
>  FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
>             , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
>       ) d(x)
>  CROSS JOIN LATERAL
>        XMLTABLE('/xml'
>                 PASSING x
>                 COLUMNS "node()" TEXT PATH 'node()'
>                ) t;
>                             x                             │               node()
> ───────────────────────────────────────────────────────────┼────────────────────────────────────
> <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
>
> Why was the comment contents expanded inside node()?

Because node() matches also comments and your patches concatenates all matches.

> Note what happens
> if I change the type from text to xml in that column:
>
> SELECT *
>  FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
>             , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
>       ) d(x)
>  CROSS JOIN LATERAL
>        XMLTABLE('/xml'
>                 PASSING x
>                 COLUMNS "node()" xml PATH 'node()'
>                ) t;
>
>                             x                             │                     node()
> ───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
> <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
> (2 filas)

The comment seems to be wrong.

I guess it’s fine if the CDATA gets transformed in to an equivalent
string using the XML entities. Yet, it might be better avoiding it.

Will look into it, but may take a while.

> Further note that, per the standard, you can omit the PATH clause in
> which case the column name is used as the path, which seems to work
> correctly.  Phew!
>
>> * Patch 0002: Set correct context for XPath evaluation.
>>
>>> According to the ISO standard, the context of XMLTABLE's XPath
>>> row_expression is the document node of the XML[1] - not the root node.
>>>
>>> The respective code is shared with non-ISO functions such as xpath
>>> and xpath_exists so that the change also affects these functions.
>>
>> It’s an incompatible change - even the regression tests need to be adjusted because they
>> test for the “wrong” behaviour.
>
> I'm really afraid to change the non-ISO functions in PG10, since
> it's already released for quite some time with this long-standing
> behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10
> and leave the non-iso functions unpatched.  I think the easiest route is
> to patch all functions only for PostgreSQL 11.  Maybe if XMLTABLE is
> considered unacceptably broken in PostgreSQL 10 we can patch only that
> one.

I’m in favour of doing all at once in PG11.

As the XMLTABLE examples in the docs consistently use absolute XPath
expressions, there is hope that people do it this way so that they won’t be
affected by the patch.

When I stumbled upon this issue, I first though that the context is not set
at all. I did not even come to my mind that the context could be there,
but wrongly set. I only found out that it actually is set when I made the patch.

In other words, I would not have made code relying on the wrong behaviour.
Unfortunately, this argument doesn’t hold for the non-ISO functions, which
are also longer in PostgreSQL.

-markus



Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
Hello hackers,

Any objections fixing this on Pg11?  My opinion is that it's better to
fix it now rather than waiting for pg12.  It's already broken in pg10
(xmltable) and older (rest of the xml stuff).  As Markus advised, I'd
not backpatch fixes for fear of breaking stuff, but I'd rather release
pg11 with the correct behavior.

I intend to get it done this week if there are no serious objections.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
Hi

On 2018-Mar-27, Markus Winand wrote:

> I’ve found two issues with XML/XPath integration in PostgreSQL. Two
> patches are attached. I’ve just separated them because the second one
> is an incompatible change.
> 
> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.

I pushed 0001 to REL_10_STABLE and master -- since there is no
incompatible behavior change.  (Well, there is a behavior change, but
it's pretty hard to imagine someone would be relying on the bogus old
output.)

I'll see about 0002 now -- only to master, as discussed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
On 2018-Jun-20, Alvaro Herrera wrote:

> I'll see about 0002 now -- only to master, as discussed.

Here's a patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
I have pushed the patch now (in your original form rather than my later
formulation) -- let's see how the buildfarm likes it.  There are (at
least) three issues remaining, as per below; Pavel, do you have any
insight on these?

First one is about array indexes not working sanely (I couldn't get this
to work in Oracle)

> > Also, array indexes behave funny.  First let's add more XML comments
> > inside that number, and query for the subscripts:
> > 
> > update xmldata set data = regexp_replace(data::text, '7<!--small country-->91',
'<!--ah-->7<!--oh-->9<!--uh-->1')::xml;
> > 
> > SELECT  xmltable.*
> >   FROM (SELECT data FROM xmldata) x,
> >        LATERAL XMLTABLE('/ROWS/ROW'
> >                         PASSING data
> >                         COLUMNS
> >                                  country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> >                                  size_text float PATH 'SIZE/text()',
> >                                  size_text_1 float PATH 'SIZE/text()[1]',
> >                                  size_text_2 float PATH 'SIZE/text()[2]',
> >                                  "SIZE" float, size_xml xml PATH 'SIZE')
> > where size_text is not null;
> > 
> > country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │                       size_xml
             
 
> >
──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
> > Singapore    │       791 │         791 │          91 │           1 │  791 │ <SIZE
unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
> > (1 fila)

The second one is about (lack of!) processing instructions and comments:

> Also, node() matching comments or processing instructions
> seems to be broken too:
> 
> SELECT *
>  FROM (VALUES ('<xml><!--comment--></xml>'::xml)
>             , ('<xml><?pi content?></xml>'::xml)
>       ) d(x)
>  CROSS JOIN LATERAL
>        XMLTABLE('/xml'
>                 PASSING x
>                 COLUMNS "node()" TEXT PATH 'node()'
>                ) t
> 
>              x             | node()
> ---------------------------+--------
>  <xml><!--comment--></xml> |
>  <xml><?pi content?></xml> |
> (2 rows)
> 
> I can look into this, but it may take a while.

Compare the empty second columns with oracle behavior, which returns the
contents of the PI and the comment.  As a script for
http://rextester.com/l/oracle_online_compiler

create table xmltb (data xmltype) \\
insert into xmltb values ('<xml><!--the comment is here--></xml>') \\
insert into xmltb values ('<xml><?pi php_stuff(); do_stuff("hello"); ?></xml>') \\
SELECT *  FROM xmltb, XMLTABLE('/xml' PASSING data COLUMNS node varchar2(100) PATH 'node()') t \\
drop table xmltb \\


The third issue is the way we output comments when they're in a column
of type XML:

> > Note what happens if I change the type from text to xml in that
> > column:
> > 
> > SELECT *
> >  FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
> >             , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
> >       ) d(x)
> >  CROSS JOIN LATERAL
> >        XMLTABLE('/xml'
> >                 PASSING x
> >                 COLUMNS "node()" xml PATH 'node()'
> >                ) t;
> > 
> >                             x                             │                     node()                     
> > ───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
> > <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
> > <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
> > (2 filas)
> 
> The comment seems to be wrong.
> 
> I guess it’s fine if the CDATA gets transformed in to an equivalent
> string using the XML entities. Yet, it might be better avoiding it.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

От
Markus Winand
Дата:
Hi!

> I have pushed the patch now (in your original form rather than my later
> formulation) -- let's see how the buildfarm likes it.  There are (at
> least) three issues remaining, as per below; Pavel, do you have any
> insight on these?


Attached patch 0001 fixes all three issues and one more that is, again, 
an incompatible change (see below).



> On 2018-06-21, at 22:27 , Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> First one is about array indexes not working sanely (I couldn't get this
> to work in Oracle)
>
>>> Also, array indexes behave funny.  First let's add more XML comments
>>> inside that number, and query for the subscripts:
>>>
>>> update xmldata set data = regexp_replace(data::text, '7<!--small country-->91',
'<!--ah-->7<!--oh-->9<!--uh-->1')::xml;
>>>
>>> SELECT  xmltable.*
>>>  FROM (SELECT data FROM xmldata) x,
>>>       LATERAL XMLTABLE('/ROWS/ROW'
>>>                        PASSING data
>>>                        COLUMNS
>>>                                 country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
>>>                                 size_text float PATH 'SIZE/text()',
>>>                                 size_text_1 float PATH 'SIZE/text()[1]',
>>>                                 size_text_2 float PATH 'SIZE/text()[2]',
>>>                                 "SIZE" float, size_xml xml PATH 'SIZE')
>>> where size_text is not null;
>>>
>>> country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │                       size_xml
              
>>>
──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
>>> Singapore    │       791 │         791 │          91 │           1 │  791 │ <SIZE
unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
>>> (1 fila)

The concatenation was caused by the use of xmlNodeListGetString, which I have
completely eliminated because it was even wrong for element nodes.

As per my understanding (and how Oracle and DB2 do it), text nodes contained
in nested elements must also be included in the string representation. This is an
incompatible change that needs adoption of an regression test:

SELECT * FROM xmltable('/root' passing '<root><element>a1a<!-- aaaa -->a2a<?aaaaa?> <!--z-->
bbbb<x>xxx</x>cccc</element></root>'COLUMNS element text); 
      element
--------------------
a1aa2a   bbbbcccc

This should actually include the ‘xxx’ from the nested <x> element as well.

As I could not find a libxml function that does this, I’ve introduced a new one
(xml_xmlnodetostring) in the vein of libxml’s xmlNodeListGetString.

> The second one is about (lack of!) processing instructions and comments:
>
>> Also, node() matching comments or processing instructions
>> seems to be broken too:
>>
>> SELECT *
>> FROM (VALUES ('<xml><!--comment--></xml>'::xml)
>>            , ('<xml><?pi content?></xml>'::xml)
>>      ) d(x)
>> CROSS JOIN LATERAL
>>       XMLTABLE('/xml'
>>                PASSING x
>>                COLUMNS "node()" TEXT PATH 'node()'
>>               ) t
>>
>>             x             | node()
>> ---------------------------+--------
>> <xml><!--comment--></xml> |
>> <xml><?pi content?></xml> |
>> (2 rows)
>>

This was easy to fix: PI’s and comments contain their data in their own content,
not in a child.


> The third issue is the way we output comments when they're in a column
> of type XML:
>
>>> Note what happens if I change the type from text to xml in that
>>> column:
>>>
>>> SELECT *
>>> FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
>>>            , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
>>>      ) d(x)
>>> CROSS JOIN LATERAL
>>>       XMLTABLE('/xml'
>>>                PASSING x
>>>                COLUMNS "node()" xml PATH 'node()'
>>>               ) t;
>>>
>>>                            x                             │                     node()
>>> ───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
>>> <xml>te<!-- ahoy -->xt</xml>                              │ te ahoy xt
>>> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
>>> (2 filas)

I included comments, CDATA sections and PIs into the list of nodes that need to use libxml's xmlNodeDump.

The 0001 patch also removes the distinction between the case of
"count == 1 && typid == XMLOID” and "count > 1" because I don’t see any reason
for two different code paths there (except that one string copy is saved in
the count == 1 case).

The 0001 patch also adds regression tests for all three issues.

There is a second patch, for which I don’t have strong opinions, that eliminates
the use of xmlStrdup in two cases. As far as I can see, the xmlStrdup is only there
so the xmlFree at the end can be done unconditionally. I found it disturbing to do
a strdup for the sole purpose it can be free-ed again. However, the patch is not a beauty,
just don’t apply if there is any doubt.

-markus


Вложения

Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated withwrong context

От
Alvaro Herrera
Дата:
On 2018-Jul-08, Markus Winand wrote:

> Hi!
> 
> > I have pushed the patch now (in your original form rather than my later
> > formulation) -- let's see how the buildfarm likes it.  There are (at
> > least) three issues remaining, as per below; Pavel, do you have any
> > insight on these?
> 
> Attached patch 0001 fixes all three issues and one more that is, again, 
> an incompatible change (see below).

For closure: this was re-published as
https://postgr.es/m/3e8eab9e-7289-6c23-5e2c-153cccea2257@anastigmatix.net
and got pushed as commit 251cf2e27be.

Thank you for reporting this!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services