Thread: Regression with large XML data input


Regression with large XML data input

From
Michael Paquier
Date:
Hi all,
(Adding in CC Tom and Erik, as committer and author.)

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

The problem was introduced by the change in xml_parse() from
xmlParseBalancedChunkMemory() to xmlNewNode() +
xmlParseInNodeContext(), where we now use a temporary root node for
the case where parse_as_document is false.  That change was made to
avoid an issue in xmlParseBalancedChunkMemory() present in libxml2
2.13.0-2.13.2, a bug that has already been fixed upstream.

If the input XML data is large enough, one gets a failure at the top
of the latest branches, where things worked properly before.  Here is
a short test case (courtesy of a colleague; I've modified it
slightly):
CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
BEGIN
 BEGIN
   INSERT INTO xmldata (id, message) VALUES
     ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) );
   RAISE NOTICE 'insert 40MB successful';
   EXCEPTION WHEN OTHERS THEN RAISE NOTICE 'Error insert 40MB: %', SQLERRM;
 END;
END $$;

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory(), fixes the issue.  A quick POC is
attached.  It fails one case in check-world with SERIALIZE, because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(); still, the regression is gone, and I am
wondering if there is a better solution that dodges the original
problem while still accepting this case.  One good thing is that
xmlParseBalancedChunkMemory() is able to return the list of nodes that
we need for this code path of xml_parse().  So perhaps one solution
would be to add a code path using xmlParseBalancedChunkMemory()
depending on the options we want to process, keeping the code path
with the fake "content-root" node for the XML SERIALIZE case.
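
For readers not currently staring at xml_parse(), here is a minimal
sketch of the two CONTENT-mode call patterns being discussed (function
names are mine, and PostgreSQL's error-handling machinery is omitted):

#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* pre-f68d6aabb7e2 coding: no parse options can be passed down */
static xmlNodePtr
parse_content_old_way(xmlDocPtr doc, const xmlChar *chunk)
{
    xmlNodePtr  node_list = NULL;

    if (xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                    chunk, &node_list) != 0)
        return NULL;            /* parse failure */
    return node_list;
}

/* current coding: a fake "content-root" node is the context node */
static xmlNodePtr
parse_content_new_way(xmlDocPtr doc, const char *chunk, int options)
{
    xmlNodePtr  root;
    xmlNodePtr  node_list = NULL;

    root = xmlNewNode(NULL, (const xmlChar *) "content-root");
    if (root == NULL)
        return NULL;
    xmlDocSetRootElement(doc, root);    /* attaches root to doc */

    if (xmlParseInNodeContext(root, chunk, (int) strlen(chunk),
                              options, &node_list) != XML_ERR_OK)
    {
        xmlFreeNodeList(node_list);
        return NULL;            /* parse failure */
    }
    return node_list;
}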

The patch in question was applied first as 6082b3d5d3d1 on HEAD,
impacting v18~, then backpatched down to all stable branches as
f68d6aabb7e2 & friends, introducing the regression in all the stable
branches since the minor releases done in August 2024: 12.20, 13.16,
14.13, 15.8 and 16.4.

Thoughts or comments?
--
Michael

Attachments

Re: Regression with large XML data input

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> A customer has reported a regression with the parsing of rather large
> XML data, introduced by the set of backpatches done with f68d6aabb7e2
> & friends.

Bleah.

> Switching back to the previous code, where we rely on
> xmlParseBalancedChunkMemory() fixes the issue.

Yeah, just reverting these commits might be an acceptable answer,
since the main point was to work around a bleeding-edge bug:

>> * Early 2.13.x releases of libxml2 contain a bug that causes
>> xmlParseBalancedChunkMemory to return the wrong status value in some
>> cases.  This breaks our regression tests.  While that bug is now fixed
>> upstream and will probably never be seen in any production-oriented
>> distro, it is currently a problem on some more-bleeding-edge-friendly
>> platforms.

Presumably that problem is now gone, a year later.  The other point
about

>> * xmlParseBalancedChunkMemory is considered to depend on libxml2's
>> semi-deprecated SAX1 APIs, and will go away when and if they do.

is still hypothetical I think.  But we might want to keep this bit:

>> While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
>> since that code path is not going to use it.

            regards, tom lane



Re: Regression with large XML data input

From
Michael Paquier
Date:
On Wed, Jul 23, 2025 at 11:28:38PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Switching back to the previous code, where we rely on
>> xmlParseBalancedChunkMemory() fixes the issue.
>
> Yeah, just reverting these commits might be an acceptable answer,
> since the main point was to work around a bleeding-edge bug:

Still, it is not possible to do exactly that on all the branches,
because of the business with XMLSERIALIZE that requires some options
for xmlParseInNodeContext(), is it?

>>> * Early 2.13.x releases of libxml2 contain a bug that causes
>>> xmlParseBalancedChunkMemory to return the wrong status value in some
>>> cases.  This breaks our regression tests.  While that bug is now fixed
>>> upstream and will probably never be seen in any production-oriented
>>> distro, it is currently a problem on some more-bleeding-edge-friendly
>>> platforms.
>
> Presumably that problem is now gone, a year later.  The other point
> about

I would probably agree that it does not seem worth caring about this
range in the early 2.13 series.  I didn't mention it upthread, but all
my tests were with Debian sid's libxml2, which seems to be a 2.12.7
flavor with some 2.9.14 pieces, based on what apt is telling me.  I
did not test with a different version from upstream, but I'm pretty
sure that's the same story.

>>> * xmlParseBalancedChunkMemory is considered to depend on libxml2's
>>> semi-deprecated SAX1 APIs, and will go away when and if they do.
>
> is still hypothetical I think.  But we might want to keep this bit:

Worth mentioning upstream 4f329dc52490, I guess, added to the 2.14
branch:
parser: Implement xmlCtxtParseContent

This implements xmlCtxtParseContent, a better alternative to
xmlParseInNodeContext or xmlParseBalancedChunkMemory. It accepts a
parser context and a parser input, making it a lot more versatile.

I am not sure if this should be considered for all our stable
branches, but it seems worth keeping in mind.

>>> While here, avoid allocating an xmlParserCtxt in DOCUMENT parse mode,
>>> since that code path is not going to use it.

Are you planning to look at that for the next minor release?  It would
take me a couple of hours to dig into all that, and I am sure that I
am going to need your stamp or Erik's to avoid doing a stupid thing.
--
Michael

Attachments

Re: Regression with large XML data input

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Are you planning to look at that for the next minor release?  It would
> take me a couple of hours to dig into all that, and I am sure that I
> am going to need your stamp or Erik's to avoid doing a stupid thing.

It was my commit, so my responsibility, so I'll deal with it.
Planning to wait for Erik's input though.

            regards, tom lane



Re: Regression with large XML data input

From
Tom Lane
Date:
I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> A customer has reported a regression with the parsing of rather large
>> XML data, introduced by the set of backpatches done with f68d6aabb7e2
>> & friends.

> Bleah.

The supplied test case hides important details in the error message.
If you get rid of the exception block so that the error is reported
in full, what you see is

regression=# CREATE TEMP TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
CREATE TABLE
regression=# DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
regression$# BEGIN
regression$#    INSERT INTO xmldata (id, message) VALUES
regression$#      ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml)
);
regression$# END $$;
ERROR:  invalid XML content
DETAIL:  line 1: internal error: Huge input lookup
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                                                                               ^
CONTEXT:  SQL statement "INSERT INTO xmldata (id, message) VALUES
     ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) )"
PL/pgSQL function inline_code_block line 3 at SQL statement
regression=#

That is, what we are hitting is libxml2's internal protections
against processing "too large" input.  I am not really sure
why the other coding failed to hit this same thing, but I wonder
if we shouldn't leave well enough alone.  See commits 2197d0622
and f2743a7d7, where we tried to enable such cases and then
decided it was too risky.  I'm afraid now that our prior coding
might have allowed billion-laugh-like cases to be reachable.

            regards, tom lane



Re: Regression with large XML data input

From
Erik Wienhold
Date:
On 2025-07-24 20:10 +0200, Tom Lane wrote:
> The supplied test case hides important details in the error message.
> If you get rid of the exception block so that the error is reported
> in full, what you see is
> 
> regression=# CREATE TEMP TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
> CREATE TABLE
> regression=# DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
> regression$# BEGIN
> regression$#    INSERT INTO xmldata (id, message) VALUES
> regression$#      ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml));
> regression$# END $$;
> ERROR:  invalid XML content
> DETAIL:  line 1: internal error: Huge input lookup
> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>                                                                                ^
> CONTEXT:  SQL statement "INSERT INTO xmldata (id, message) VALUES
>      ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) )"
> PL/pgSQL function inline_code_block line 3 at SQL statement
> regression=# 
> 
> That is, what we are hitting is libxml2's internal protections
> against processing "too large" input.  I am not really sure
> why the other coding failed to hit this same thing, but I wonder
> if we shouldn't leave well enough alone.  See commits 2197d0622
> and f2743a7d7, where we tried to enable such cases and then
> decided it was too risky.  I'm afraid now that our prior coding
> might have allowed billion-laugh-like cases to be reachable.

I was just looking into Michael's fix when I saw your message.  The fix
works on libxml2 2.14.5.  But on 2.13.8 xmlParseBalancedChunkMemory
returns XML_ERR_RESOURCE_LIMIT and I get this error:

    ERROR:  invalid XML content
    DETAIL:  line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Not sure how to set XML_PARSE_HUGE without an xmlParserCtxtPtr at hand,
though.  Also haven't looked into why 2.14.5 is not subject to that
resource limit.  But as you've already noted, that code was heavily
refactored in 2.13 [1].

[1] https://www.postgresql.org/message-id/716736.1720376901%40sss.pgh.pa.us

-- 
Erik Wienhold



Re: Regression with large XML data input

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
>>> A customer has reported a regression with the parsing of rather large
>>> XML data, introduced by the set of backpatches done with f68d6aabb7e2
>>> & friends.

BTW, further testing shows that the same failure occurs at
f68d6aabb7e2^.  So AFAICS, the answer as to why the behavior
changed there is that it didn't.

            regards, tom lane



Re: Regression with large XML data input

From
Erik Wienhold
Date:
On 2025-07-24 05:12 +0200, Michael Paquier wrote:
> Switching back to the previous code, where we rely on
> xmlParseBalancedChunkMemory() fixes the issue.  A quick POC is
> attached.  It fails one case in check-world with SERIALIZE because I
> am not sure it is possible to pass down some options through
> xmlParseBalancedChunkMemory(), still the regression is gone, and I am
> wondering if there is not a better solution to be able to dodge the
> original problem and still accept this case.

The whitespace can be preserved by setting xmlKeepBlanksDefault before
parsing.  See attached v2.  That function is deprecated, though.  But
libxml2 uses thread-local globals [1], so it should be safe.  Other
than that, I see no other way to set XML_PARSE_NOBLANKS with
xmlParseBalancedChunkMemory.

[1] https://gitlab.gnome.org/GNOME/libxml2/-/blob/408bd0e18e6ddba5d18e51d52da0f7b3ca1b4421/parserInternals.c#L2833
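
In code form, the idea is roughly this (a sketch only; the actual v2
patch is attached above and may differ in details):

    /* emulate XML_PARSE_NOBLANKS around xmlParseBalancedChunkMemory() */
    int         save_keep_blanks;
    xmlNodePtr  node_list = NULL;

    save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
    res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                           utf8string + count, &node_list);
    xmlKeepBlanksDefault(save_keep_blanks);   /* restore previous value */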

-- 
Erik Wienhold

Attachments

Re: Regression with large XML data input

From
Tom Lane
Date:
I wrote:
> BTW, further testing shows that the same failure occurs at
> f68d6aabb7e2^.  So AFAICS, the answer as to why the behavior
> changed there is that it didn't.

Oh, wait ... the plot thickens!  The above statement is true
when testing on my Mac with libxml2 2.13.8 from MacPorts.
With either HEAD or f68d6aabb7e2^, I get errors similar to
what Erik just showed:

ERROR:  invalid XML content
DETAIL:  line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

However, when testing on RHEL8 with libxml2 2.9.7, indeed
I get "Huge input lookup" with our current code but no
failure with f68d6aabb7e2^.

The way I interpret these results is that in older libxml2 versions,
xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
that does exist in newer versions.  So even if we were to do some kind
of reversion, it would only prevent the error in libxml2 versions that
lack that check.  And in those versions we'd probably be exposing
ourselves to resource-exhaustion problems.

On the whole I'm thinking more and more that we don't want to
touch this.  Our recommendation for processing multi-megabyte
chunks of XML should be "don't".  Unless we want to find or
write a replacement for libxml2 ... which we have discussed,
but so far nothing's happened.

            regards, tom lane



Re: Regression with large XML data input

From
Jim Jones
Date:

On 24.07.25 21:23, Tom Lane wrote:
> Oh, wait ... the plot thickens!  The above statement is true
> when testing on my Mac with libxml2 2.13.8 from MacPorts.
> With either HEAD or f68d6aabb7e2^, I get errors similar to
> what Erik just showed:
>
> ERROR:  invalid XML content
> DETAIL:  line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I get the same error with libxml2 2.9.14 on Ubuntu.

> However, when testing on RHEL8 with libxml2 2.9.7, indeed
> I get "Huge input lookup" with our current code but no
> failure with f68d6aabb7e2^.
>
> The way I interpret these results is that in older libxml2 versions,
> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
> that does exist in newer versions.  So even if we were to do some kind
> of reversion, it would only prevent the error in libxml2 versions that
> lack that check.  And in those versions we'd probably be exposing
> ourselves to resource-exhaustion problems.
>
> On the whole I'm thinking more and more that we don't want to
> touch this.  Our recommendation for processing multi-megabyte
> chunks of XML should be "don't".  Unless we want to find or
> write a replacement for libxml2 ... which we have discussed,
> but so far nothing's happened.

I also believe that addressing this limitation may not be worth the
associated risks. Moreover, a 10MB text node is rather large and
probably exceeds the needs of most users.

Best, Jim



Re: Regression with large XML data input

From
Michael Paquier
Date:
On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> On 24.07.25 21:23, Tom Lane wrote:
>> However, when testing on RHEL8 with libxml2 2.9.7, indeed
>> I get "Huge input lookup" with our current code but no
>> failure with f68d6aabb7e2^.
>>
>> The way I interpret these results is that in older libxml2 versions,
>> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
>> that does exist in newer versions.  So even if we were to do some kind
>> of reversion, it would only prevent the error in libxml2 versions that
>> lack that check.  And in those versions we'd probably be exposing
>> ourselves to resource-exhaustion problems.

Linux distributions do not seem very eager to add this check, though?
The top of Debian sid uses a version of libxml2 where the difference
shows up, so it means that we have years ahead with the old code.

If we were discussing things from the perspective where this new code
was added after a major version bump of Postgres, I would not argue
much about that: breakages happen every year and users adapt their
applications to them.  Here, however, we are talking about a change in
a stable branch, across a minor version, which should be a bit more
flawless from a user perspective?  I may be influenced by the fact of
seeing a customer impacted by that, of course, there is no denying
that.  The point is that this enforces one behavior that's part of
2.13 and onwards.  Versions of PG before f68d6aabb7e2 were still OK
with that, and the new code of Postgres closes the door completely.
Even if the behavior that Postgres had when linking with libxml2 2.12
or older was kind of "accidental", because
xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
it was there, and users relied on it.

One possibility that I could see here for stable branches would be to
make the code a bit smarter depending on LIBXML_VERSION, where we
could keep the new code for 2.13 onwards, but keep a compatible
behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().
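
Roughly like this (a sketch of the idea, not a worked-out patch;
LIBXML_VERSION encodes 2.13.0 as 21300):

    #if LIBXML_VERSION >= 21300
        /* 2.13.0 and newer: keep the new xmlParseInNodeContext() coding */
        res = xmlParseInNodeContext(root, (char *) utf8string + count,
                                    strlen((char *) utf8string + count),
                                    options, &node_list);
    #else
        /* 2.12.x and older: previous xmlParseBalancedChunkMemory() coding */
        res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                               utf8string + count, &node_list);
    #endif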

>> On the whole I'm thinking more and more that we don't want to
>> touch this.  Our recommendation for processing multi-megabyte
>> chunks of XML should be "don't".  Unless we want to find or
>> write a replacement for libxml2 ... which we have discussed,
>> but so far nothing's happened.
>
> I also believe that addressing this limitation may not be worth the
> associated risks. Moreover, a 10MB text node is rather large and
> probably exceeds the needs of most users.

Yeah, still some people use it, so while I am OK to accept this as
the conclusion at the end, and to report back on this thread that
workarounds are required in applications to split the inputs, that was
really surprising.  (Aka from the point of view of the customer whose
application suddenly fails after what should have been a "simple"
minor update.)
--
Michael

Attachments

Re: Regression with large XML data input

From
Robert Treat
Date:
On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> > On 24.07.25 21:23, Tom Lane wrote:
> >> However, when testing on RHEL8 with libxml2 2.9.7, indeed
> >> I get "Huge input lookup" with our current code but no
> >> failure with f68d6aabb7e2^.
> >>
> >> The way I interpret these results is that in older libxml2 versions,
> >> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
> >> that does exist in newer versions.  So even if we were to do some kind
> >> of reversion, it would only prevent the error in libxml2 versions that
> >> lack that check.  And in those versions we'd probably be exposing
> >> ourselves to resource-exhaustion problems.
>
> Linux distributions do not seem very eager to add this check, though?
> The top of Debian sid uses a version of libxml2 where the difference
> shows up, so it means that we have years ahead with the old code.
>
> If we were discussing things from the perspective where this new code
> was added after a major version bump of Postgres, I would not argue
> much about that: breakages happen every year and users adapt their
> applications to them.  Here, however, we are talking about a change in
> a stable branch, across a minor version, which should be a bit more
> flawless from a user perspective?  I may be influenced by the fact of
> seeing a customer impacted by that, of course, there is no denying
> that.  The point is that this enforces one behavior that's part of
> 2.13 and onwards.  Versions of PG before f68d6aabb7e2 were still OK
> with that, and the new code of Postgres closes the door completely.
> Even if the behavior that Postgres had when linking with libxml2 2.12
> or older was kind of "accidental", because
> xmlParseBalancedChunkMemory() lacked the XML_ERR_RESOURCE_LIMIT check,
> it was there, and users relied on it.
>
> One possibility that I could see here for stable branches would be to
> make the code a bit smarter depending on LIBXML_VERSION, where we
> could keep the new code for 2.13 onwards, but keep a compatible
> behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().
>

While I am pretty sympathetic to the idea that we hang our hats on
"Postgres doesn't break things in minor version updates", and this
seems to betray that, one scenario where we would break things is if
it were the only reasonable option wrt a bug / security fix, which
this seems potentially close to. If we can come up with a workaround
like the above though, it would certainly be nice to give people a
path forward even if it ends up with a breaking major version change.
This at least eliminates the "surprise!" factor.

> >> On the whole I'm thinking more and more that we don't want to
> >> touch this.  Our recommendation for processing multi-megabyte
> >> chunks of XML should be "don't".  Unless we want to find or
> >> write a replacement for libxml2 ... which we have discussed,
> >> but so far nothing's happened.
> >
> > I also believe that addressing this limitation may not be worth the
> > associated risks. Moreover, a 10MB text node is rather large and
> > probably exceeds the needs of most users.
>
> Yeah, still some people use it, so while I am OK to accept this as
> the conclusion at the end, and to report back on this thread that
> workarounds are required in applications to split the inputs, that was
> really surprising.  (Aka from the point of view of the customer whose
> application suddenly fails after what should have been a "simple"
> minor update.)

There are a lot of public data sets that provide xml dumps as a
generic format for "non-commercial databases", and those can often be
quite large. I suspect we don't see those use cases a lot because
historically users have been forced to resort to perl/python/etc
scripts to convert the data prior to ingesting. Which is to say, I
think these use cases are more common than we think, and if there were
ever a stable implementation that supported these large use cases,
we'll start to see more of it.


Robert Treat
https://xzilla.net



Re: Regression with large XML data input

From
Tom Lane
Date:
Robert Treat <rob@xzilla.net> writes:
> On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael@paquier.xyz> wrote:
>> If we were discussing things from the perspective where this new code
>> was added after a major version bump of Postgres, I would not argue
>> much about that: breakages happen every year and users adapt their
>> applications to them.  Here, however, we are talking about a change in
>> a stable branch, across a minor version, which should be a bit more
>> flawless from a user perspective?

> While I am pretty sympathetic to the idea that we hang our hats on
> "Postgres doesn't break things in minor version updates", and this
> seems to betray that, one scenario where we would break things is if
> it were the only reasonable option wrt a bug / security fix, which
> this seems potentially close to.

I'll be the first to say that I'm not too pleased with it either.
However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently.  So we don't have sole control --- or sole responsibility
--- here.

I'd be more excited about trying to avoid the failure if I were not
afraid that "avoid the failure" really means "re-expose a security
hazard".  Why should we believe that if libxml2 throws a
resource-limit error (for identical inputs) in one code path and not
another, that's anything but a missed error check in the second path?
(Maybe this is the same thing Robert is saying, not quite sure.)

> There are a lot of public data sets that provide xml dumps as a
> generic format for "non-commercial databases", and those can often be
> quite large. I suspect we don't see those use cases a lot because
> historically users have been forced to resort to perl/python/etc
> scripts to convert the data prior to ingesting. Which is to say, I
> think these use cases are more common than we think, and if there were
> ever a stable implementation that supported these large use cases,
> we'll start to see more of it.

Yeah, it's a real shame that we don't have more-reliable
infrastructure for XML.  I'm not volunteering to fix it though...

            regards, tom lane



Re: Regression with large XML data input

From
Michael Paquier
Date:
On Fri, Jul 25, 2025 at 02:21:26PM -0400, Tom Lane wrote:
> I'll be the first to say that I'm not too pleased with it either.
> However, from Jim Jones' result upthread, a "minor update" of libxml2
> could also have caused this problem: 2.9.7 and 2.9.14 behave
> differently.  So we don't have sole control --- or sole responsibility
> --- here.

This sentence is incorrect after I have double-checked the behaviors I
am seeing based on local builds of libxml2 2.9.7 and 2.9.14.  For
example with the top of REL_15_STABLE, with and without
72c65d6658d4, I am getting (removing the exception in the example does
not matter if it's a success):
- libxml2 2.9.7  + top of REL_15_STABLE => test failure
- libxml2 2.9.14 + top of REL_15_STABLE => test failure
- libxml2 2.9.7  + top of REL_15_STABLE + revert of 72c65d6658d4
=> test success
- libxml2 2.9.14 + top of REL_15_STABLE + revert of 72c65d6658d4
=> test success

So if one uses a version of libxml2 in the 2.9.X series, he/she would
be able to see the large data case work with Postgres at
72c65d6658d4^1, and a failure with 72c65d6658d4 and onwards.  Taking
Postgres in isolation, any version of libxml2 in the 2.9.X series
prevents the case from working.  This does not depend on the 2.9.X
minor version, only on the fact that we wrote the new Postgres code
against a newer major version of libxml2.

Please note that this is also the behavior I see in a Debian sid
environment, and I guess any existing Debian release: we rely on
libxml2 2.9.X, so a minor upgrade of Postgres is the factor able to
trigger the behavior change.  It seems to me that there's an argument
for compatibility with the 2.9.X series, which still seems quite
present in the wild, and that we could decide one solution over the
other in xml_parse() based on LIBXML_VERSION.  What I am seeing is
that, at a fixed major version of libxml2, Postgres holds the
responsibility here.
--
Michael

Attachments

Re: Regression with large XML data input

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jul 25, 2025 at 02:21:26PM -0400, Tom Lane wrote:
>> However, from Jim Jones' result upthread, a "minor update" of libxml2
>> could also have caused this problem: 2.9.7 and 2.9.14 behave
>> differently.  So we don't have sole control --- or sole responsibility
>> --- here.

> This sentence is incorrect after I have double-checked the behaviors I
> am seeing based on local builds of libxml2 2.9.7 and 2.9.14.

Hmm, okay, I misread Jim's results then.  But there still remains
the big question: what reason is there to believe that it's safe
to return to the old behavior?  If newer libxml2 versions report
XML_ERR_RESOURCE_LIMIT on the same input, doesn't it seem likely
that there's a live hazard in the old code?

            regards, tom lane



Re: Regression with large XML data input

From
Michael Paquier
Date:
On Sun, Jul 27, 2025 at 10:16:47PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> This sentence is incorrect after I have double-checked the behaviors I
>> am seeing based on local builds of libxml2 2.9.7 and 2.9.14.
>
> Hmm, okay, I misread Jim's results then.  But there still remains
> the big question: what reason is there to believe that it's safe
> to return to the old behavior?  If newer libxml2 versions report
> XML_ERR_RESOURCE_LIMIT on the same input, doesn't it seem likely
> that there's a live hazard in the old code?

Yes, I guess that it's not entirely safe to use libxml2 2.9.X if we
rely on the old code, but it also sucks for customers that relied on
the old behavior to not have a way out in stable branches, and there
are still plenty of environments that rely on 2.9.X.

That's the reason why I am suggesting a hybrid approach on stable
branches where the regression shows up, where it would be possible to
store large XML data values for *some* cases, keeping some
compatibility:
- if building Postgres with 2.13.X, use the new code, as on HEAD,
forcing the restrictions.
- if building Postgres with 2.12.X or older, then use the old code, as
in what Postgres did before f68d6aabb.

If one wants to enforce a stricter behavior, then it is enough to
upgrade the version of libxml2 on a stable branch of PG.  If one
wants to preserve compatibility, then keep the older version of
libxml2 but be aware that, while things are compatible in terms of
input handling, libxml2 was unsafe and it was so for years.  Hence,
based on the number of environments still using 2.9.X, that seems
worth the hack on stable branches.

I am not going to argue against the commits that have reached
REL_18_STABLE to add compatibility for libxml2 2.13.X; we can leave
this stuff as-is and enforce stronger restrictions across all versions
of libxml2, letting users deal with application changes across a major
version change of PG.  So, while I am aware that my argument is not
perfect, it would at least give packagers and users the option to use
the previous compatibility layer if they want, leaving the stable
branches of PG somewhat intact.  What I think is bad for users is the
fact that Postgres closes this window entirely, on a stable branch,
even if the old behavior was accidental based on the previous coding.
I understand that from the point of view of a maintainer this is
rather bad, but from the customer point of view the current situation
is also bad to deal with in the scope of a minor upgrade, because
applications suddenly break.
--
Michael

Attachments

Re: Regression with large XML data input

From
"David G. Johnston"
Date:
On Sunday, July 27, 2025, Michael Paquier <michael@paquier.xyz> wrote:
I am not going to argue against the commits that have reached
REL_18_STABLE to add compatibility for libxml2 2.13.X; we can leave
this stuff as-is and enforce stronger restrictions across all versions
of libxml2, letting users deal with application changes across a major
version change of PG.  So, while I am aware that my argument is not
perfect, it would at least give packagers and users the option to use
the previous compatibility layer if they want, leaving the stable
branches of PG somewhat intact.  What I think is bad for users is the
fact that Postgres closes this window entirely, on a stable branch,
even if the old behavior was accidental based on the previous coding.
I understand that from the point of view of a maintainer this is
rather bad, but from the customer point of view the current situation
is also bad to deal with in the scope of a minor upgrade, because
applications suddenly break.

Is breaking with tradition and implementing the solution that locks down the system, but also gives DBAs the ability to make an informed runtime decision to bypass said restriction, on the table?

David J.

Re: Regression with large XML data input

From
Jim Jones
Date:

On 28.07.25 04:47, Michael Paquier wrote:
> I understand that from the point of view of a
> maintainer this is rather bad, but from the customer point of view the
> current situation is also bad to deal with in the scope of a minor
> upgrade, because applications suddenly break.

I totally get it --- from the user’s perspective, it’s hard to see this
as a bugfix.

I was wondering whether using XML_PARSE_HUGE in xml_parse's options
could help address this, for example:

options = XML_PARSE_NOENT | XML_PARSE_DTDATTR | XML_PARSE_HUGE
          | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);


According to libxml2's parserInternals.h:

/**
 * Maximum size allowed for a single text node when building a tree.
 * This is not a limitation of the parser but a safety boundary feature,
 * use XML_PARSE_HUGE option to override it.
 * Introduced in 2.9.0
 */
#define XML_MAX_TEXT_LENGTH 10000000

/**
 * Maximum size allowed when XML_PARSE_HUGE is set.
 */
#define XML_MAX_HUGE_LENGTH 1000000000

The XML_MAX_TEXT_LENGTH limit is what we're hitting now, but
XML_MAX_HUGE_LENGTH is extremely generous. Here's a quick PoC using
XML_PARSE_HUGE:

psql (19devel)
Type "help" for help.

postgres=# CREATE TABLE xmldata (message xml);
CREATE TABLE
postgres=# DO $$
DECLARE huge_size text := repeat('X', 1000000000);
BEGIN
  INSERT INTO xmldata (message) VALUES
  ((('<foo><bar>' || huge_size ||'</bar></foo>')::xml));
END $$;
DO
postgres=# SELECT pg_size_pretty(length(message::text)::bigint) FROM
xmldata;
 pg_size_pretty
----------------
 954 MB
(1 row)

While XML_MAX_HUGE_LENGTH prevents unlimited memory usage, it still
opens the door to potential resource exhaustion. I couldn't find a way
to dynamically adjust this limit in libxml2.

One idea would be to guard XML_PARSE_HUGE behind a GUC --- say,
xml_enable_huge_parsing. That would at least allow controlled
environments to opt in. But of course, that wouldn't help current releases.

Best regards, Jim



Re: Regression with large XML data input

From
Erik Wienhold
Date:
On 2025-07-28 09:45 +0200, Jim Jones wrote:
> 
> On 28.07.25 04:47, Michael Paquier wrote:
> > I understand that from the point of view of a maintainer this is
> > rather bad, but from the customer point of view the current
> > situation is also bad to deal with in the scope of a minor upgrade,
> > because applications suddenly break.
> 
> I totally get it --- from the user’s perspective, it’s hard to see
> this as a bugfix.
> 
> I was wondering whether using XML_PARSE_HUGE in xml_parse's options
> could help address this, for example:
> 
> options = XML_PARSE_NOENT | XML_PARSE_DTDATTR | XML_PARSE_HUGE
>           | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);

This also came to my mind, but it was already tried and reverted soon
after for security reasons. [1]

> One idea would be to guard XML_PARSE_HUGE behind a GUC --- say,
> xml_enable_huge_parsing. That would at least allow controlled
> environments to opt in. But of course, that wouldn't help current
> releases.

+1 for new major releases.  But normal users must not be allowed to
enable that GUC.  So probably context PGC_SU_BACKEND.
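
A hypothetical shape for that (the GUC name is Jim's suggestion; the
entry layout follows guc_tables.c, and everything else is illustrative
only, not a worked-out patch):

    /* new global, false by default */
    bool        xml_enable_huge_parsing = false;

    /* candidate entry for ConfigureNamesBool[] in guc_tables.c */
    {
        {"xml_enable_huge_parsing", PGC_SU_BACKEND, CLIENT_CONN_STATEMENT,
            gettext_noop("Allows libxml2's XML_PARSE_HUGE when parsing XML values."),
            NULL
        },
        &xml_enable_huge_parsing,
        false,
        NULL, NULL, NULL
    },

    /* and in xml_parse(), folded into the parse options */
    options = XML_PARSE_NOENT | XML_PARSE_DTDATTR
        | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS)
        | (xml_enable_huge_parsing ? XML_PARSE_HUGE : 0);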

I'm leaning towards Michael's proposal of adding a libxml2 version check
in the stable branches before REL_18_STABLE and parsing the content with
xmlParseBalancedChunkMemory on versions up to 2.12.x.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f2743a7d70e7b2891277632121bb51e739743a47

-- 
Erik Wienhold



Re: Regression with large XML data input

From
Tom Lane
Date:
Erik Wienhold <ewie@ewie.name> writes:
> I'm leaning towards Michael's proposal of adding a libxml2 version check
> in the stable branches before REL_18_STABLE and parsing the content with
> xmlParseBalancedChunkMemory on versions up to 2.12.x.

I spent some time investigating this.  It appears that even when using
our old code with xmlParseBalancedChunkMemory (I tested against our
commit 6082b3d5d^), libxml2 versions 2.12.x and 2.13.x will throw
a resource-exhaustion error; but earlier and later releases do not.
Drilling down with "git bisect", the first libxml2 commit that throws
an error is

commit 834b8123efc4a4c369671cad2f1b0eb744ae67e9
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Tue Aug 8 15:21:28 2023 +0200

    parser: Stream data when reading from memory
    
    Don't create a copy of the whole input buffer. Read the data chunk by
    chunk to save memory.

and then the one that restores it to not throwing an error is

commit 7148b778209aaaf684c156c5e2e40d6e477f13f7
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Sun Jul 7 16:11:08 2024 +0200

    parser: Optimize memory buffer I/O
    
    Reenable zero-copy IO for zero-terminated static memory buffers.
    
    Don't stream zero-terminated dynamic memory buffers on top of creating
    a copy.

So apparently the "resource exhaustion" complaint is not about
anything deeper than not wanting to make a copy of a many-megabyte
input string, and the fact that it appeared and disappeared is an
artifact of libxml2 implementation choices that we have no control
over (and, IMO, no responsibility for working around).

Based on this, I'm okay with reverting to using
xmlParseBalancedChunkMemory: while that won't help users of 2.12.x or
2.13.x, it will help users of earlier and later libxml2.  But I think
we should just do that across the board, not insert libxml2 version
tests nor do it differently in different PG versions.

I've not looked at the details of the proposed patches, but will
do so now that the direction to go in is apparent.

            regards, tom lane



Re: Regression with large XML data input

From
Tom Lane
Date:
I wrote:
> I've not looked at the details of the proposed patches, but will
> do so now that the direction to go in is apparent.

Erik's v2 is slightly wrong as to the save-and-restore logic for
the KeepBlanks setting: we need to restore in the error path too,
and we'd better mark the save variable volatile since it's modified
inside the PG_TRY.  I made some other cosmetic changes, mainly to
avoid calculating "options" when it won't be used.  I tested the
attached v3 against RHEL8's libxml2-2.9.7, as well as against today's
libxml2 git master, and it accepts the problematic input on both.

            regards, tom lane

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index f7b731825fc..3379d392260 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -1769,7 +1769,7 @@ xml_doctype_in_content(const xmlChar *str)
  * xmloption_arg, but a DOCTYPE node in the input can force DOCUMENT mode).
  *
  * If parsed_nodes isn't NULL and we parse in CONTENT mode, the list
- * of parsed nodes from the xmlParseInNodeContext call will be returned
+ * of parsed nodes from the xmlParseBalancedChunkMemory call will be returned
  * to *parsed_nodes.  (It is caller's responsibility to free that.)
  *
  * Errors normally result in ereport(ERROR), but if escontext is an
@@ -1795,6 +1795,7 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
     PgXmlErrorContext *xmlerrcxt;
     volatile xmlParserCtxtPtr ctxt = NULL;
     volatile xmlDocPtr doc = NULL;
+    volatile int save_keep_blanks = -1;
 
     /*
      * This step looks annoyingly redundant, but we must do it to have a
@@ -1822,7 +1823,6 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
     PG_TRY();
     {
         bool        parse_as_document = false;
-        int            options;
         int            res_code;
         size_t        count = 0;
         xmlChar    *version = NULL;
@@ -1853,18 +1853,6 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
                 parse_as_document = true;
         }
 
-        /*
-         * Select parse options.
-         *
-         * Note that here we try to apply DTD defaults (XML_PARSE_DTDATTR)
-         * according to SQL/XML:2008 GR 10.16.7.d: 'Default values defined by
-         * internal DTD are applied'.  As for external DTDs, we try to support
-         * them too (see SQL/XML:2008 GR 10.16.7.e), but that doesn't really
-         * happen because xmlPgEntityLoader prevents it.
-         */
-        options = XML_PARSE_NOENT | XML_PARSE_DTDATTR
-            | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);
-
         /* initialize output parameters */
         if (parsed_xmloptiontype != NULL)
             *parsed_xmloptiontype = parse_as_document ? XMLOPTION_DOCUMENT :
@@ -1874,11 +1862,26 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
 
         if (parse_as_document)
         {
+            int            options;
+
+            /* set up parser context used by xmlCtxtReadDoc */
             ctxt = xmlNewParserCtxt();
             if (ctxt == NULL || xmlerrcxt->err_occurred)
                 xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
                             "could not allocate parser context");
 
+            /*
+             * Select parse options.
+             *
+             * Note that here we try to apply DTD defaults (XML_PARSE_DTDATTR)
+             * according to SQL/XML:2008 GR 10.16.7.d: 'Default values defined
+             * by internal DTD are applied'.  As for external DTDs, we try to
+             * support them too (see SQL/XML:2008 GR 10.16.7.e), but that
+             * doesn't really happen because xmlPgEntityLoader prevents it.
+             */
+            options = XML_PARSE_NOENT | XML_PARSE_DTDATTR
+                | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);
+
             doc = xmlCtxtReadDoc(ctxt, utf8string,
                                  NULL,    /* no URL */
                                  "UTF-8",
@@ -1900,10 +1903,7 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
         }
         else
         {
-            xmlNodePtr    root;
-            xmlNodePtr    oldroot PG_USED_FOR_ASSERTS_ONLY;
-
-            /* set up document with empty root node to be the context node */
+            /* set up document that xmlParseBalancedChunkMemory will add to */
             doc = xmlNewDoc(version);
             if (doc == NULL || xmlerrcxt->err_occurred)
                 xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
@@ -1916,36 +1916,23 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
                             "could not allocate XML document");
             doc->standalone = standalone;
 
-            root = xmlNewNode(NULL, (const xmlChar *) "content-root");
-            if (root == NULL || xmlerrcxt->err_occurred)
-                xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
-                            "could not allocate xml node");
-
-            /*
-             * This attaches root to doc, so we need not free it separately;
-             * and there can't yet be any old root to free.
-             */
-            oldroot = xmlDocSetRootElement(doc, root);
-            Assert(oldroot == NULL);
+            /* set parse options --- have to do this the ugly way */
+            save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
 
             /* allow empty content */
             if (*(utf8string + count))
             {
                 xmlNodePtr    node_list = NULL;
-                xmlParserErrors res;
-
-                res = xmlParseInNodeContext(root,
-                                            (char *) utf8string + count,
-                                            strlen((char *) utf8string + count),
-                                            options,
-                                            &node_list);
 
-                if (res != XML_ERR_OK || xmlerrcxt->err_occurred)
+                res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
+                                                       utf8string + count,
+                                                       &node_list);
+                if (res_code != 0 || xmlerrcxt->err_occurred)
                 {
-                    xmlFreeNodeList(node_list);
                     xml_errsave(escontext, xmlerrcxt,
                                 ERRCODE_INVALID_XML_CONTENT,
                                 "invalid XML content");
+                    xmlFreeNodeList(node_list);
                     goto fail;
                 }
 
@@ -1961,6 +1948,8 @@ fail:
     }
     PG_CATCH();
     {
+        if (save_keep_blanks != -1)
+            xmlKeepBlanksDefault(save_keep_blanks);
         if (doc != NULL)
             xmlFreeDoc(doc);
         if (ctxt != NULL)
@@ -1972,6 +1961,9 @@ fail:
     }
     PG_END_TRY();
 
+    if (save_keep_blanks != -1)
+        xmlKeepBlanksDefault(save_keep_blanks);
+
     if (ctxt != NULL)
         xmlFreeParserCtxt(ctxt);


Re: Regression with large XML data input

From
Michael Paquier
Date:
On Mon, Jul 28, 2025 at 04:16:07PM -0400, Tom Lane wrote:
> Erik's v2 is slightly wrong as to the save-and-restore logic for
> the KeepBlanks setting: we need to restore in the error path too,
> and we'd better mark the save variable volatile since it's modified
> inside the PG_TRY.  I made some other cosmetic changes, mainly to
> avoid calculating "options" when it won't be used.  I tested the
> attached v3 against RHEL8's libxml2-2.9.7, as well as against today's
> libxml2 git master, and it accepts the problematic input on both.

This has been applied already as of 71c0921b649d, and I've
double-checked the behavior with my local libxml2 builds as well.
Thanks for the fix, the result looks OK to me!

+            /* set parse options --- have to do this the ugly way */
+            save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
[...]
+        if (save_keep_blanks != -1)
+            xmlKeepBlanksDefault(save_keep_blanks);
[...]
+    if (save_keep_blanks != -1)
+        xmlKeepBlanksDefault(save_keep_blanks);

I didn't spot that this option existed, though.  That's an interesting
trick, Erik.  It's a bit weird that we have to live with this style,
but with the TRY/CATCH blocks we are already relying on, it's not that
bad in the end.
--
Michael

Attachments

Re: Regression with large XML data input

From
Jim Jones
Date:

On 28.07.25 22:16, Tom Lane wrote:
> Erik's v2 is slightly wrong as to the save-and-restore logic for
> the KeepBlanks setting: we need to restore in the error path too,
> and we'd better mark the save variable volatile since it's modified
> inside the PG_TRY.  I made some other cosmetic changes, mainly to
> avoid calculating "options" when it won't be used.  I tested the
> attached v3 against RHEL8's libxml2-2.9.7, as well as against today's
> libxml2 git master, and it accepts the problematic input on both.

Out of curiosity, what's the reasoning behind keeping node_list instead
of directly using parsed_nodes in the xmlParseBalancedChunkMemory call?

Example:

if (*(utf8string + count))
{
    res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                           utf8string + count,
                                           parsed_nodes);
    if (res_code != 0 || xmlerrcxt->err_occurred)
    {
        xml_errsave(escontext, xmlerrcxt,
                    ERRCODE_INVALID_XML_CONTENT,
                    "invalid XML content");
        goto fail;
    }
}

I was also wondering if we should add a GUC to PG 19 to enable
XML_MAX_HUGE_LENGTH if needed.  If we go down that route, we'd likely
need to revisit xmlParseBalancedChunkMemory (again!) since it appears
to be hardcoded to XML_MAX_TEXT_LENGTH.  Any thoughts?

Best regards, Jim
Attachments

Re: Regression with large XML data input

From
Tom Lane
Date:
Jim Jones <jim.jones@uni-muenster.de> writes:
> Out of curiosity, what's the reasoning behind keeping node_list instead
> of directly using parsed_nodes in the xmlParseBalancedChunkMemory call?

In the original coding, there was a hazard of the node list getting
leaked if the caller passed parsed_nodes == NULL.  Or at least I
thought there was.  It may be that all releases of libxml2 are smart
enough to free the node list if there's no way to pass it back,
but I guess we had reason not to trust it.  Possibly there's something
about that in the discussion that led up to 6082b3d5d, though I see
I neglected to mention it in the commit message.

            regards, tom lane



Re: Regression with large XML data input

From
Jim Jones
Date:

On 29.07.25 14:11, Tom Lane wrote:
> In the original coding, there was a hazard of the node list getting
> leaked if the caller passed parsed_nodes == NULL.  Or at least I
> thought there was.  It may be that all releases of libxml2 are smart
> enough to free the node list if there's no way to pass it back,
> but I guess we had reason not to trust it.  Possibly there's something
> about that in the discussion that led up to 6082b3d5d, though I see
> I neglected to mention it in the commit message.

I see.. thanks for explaining.
I went through the discussions and the libxml2 issue, and I also think
it is prudent to keep it like that :)

Could you add a short comment to it? Something like this:

/*
 * We use a local variable (node_list) to receive the result
 * from xmlParseBalancedChunkMemory(), even though we might
 * eventually return it via parsed_nodes. This ensures that we
 * retain control of the memory and can safely free it here
 * in case of parse errors or early exits.
 *
 * If parsing fails, we free node_list immediately. If parsing
 * succeeds, we assign it to *parsed_nodes (if provided), which
 * will later be attached to the document tree. Otherwise, if
 * the caller is not interested in the parsed nodes (i.e.,
 * parsed_nodes == NULL), we free them immediately.
 *
 */


Thanks!

Best, Jim



Re: Regression with large XML data input

From
Tom Lane
Date:
Jim Jones <jim.jones@uni-muenster.de> writes:
> On 29.07.25 14:11, Tom Lane wrote:
>> In the original coding, there was a hazard of the node list getting
>> leaked if the caller passed parsed_nodes == NULL.  Or at least I
>> thought there was.  It may be that all releases of libxml2 are smart
>> enough to free the node list if there's no way to pass it back,
>> but I guess we had reason not to trust it.  Possibly there's something
>> about that in the discussion that led up to 6082b3d5d, though I see
>> I neglected to mention it in the commit message.

> I see.. thanks for explaining.

Re-reading the prior thread, I see that my memory above is quite
faulty: we added the node_list intermediate variable as a way to
detect errors when the return code from xmlParseBalancedChunkMemory
couldn't be trusted.  So I think you're right to question whether we
still need it.  I tried reverting to just passing parsed_nodes, and
I don't see any leak in either the normal or error paths --- so at
least with the quite-old version of libxml2 I'm testing, there is no
such bug.

> I went through the discussions and the libxml2 issue, and I also think
> it is prudent to keep it like that :)

I've got mixed feelings about it now.  I think the $64 question
is whether there are any cases in which xmlParseBalancedChunkMemory
thinks things are fine (and returns a node list) but then we conclude
there's an error, perhaps as a consequence of xmlerrcxt->err_occurred
having become set earlier.  That's a little bit of a stretch.

In any case, I now realize that I broke that scenario yesterday
because I forgot that xml_errsave could throw a longjmp --- so freeing
the node list after calling it is too late :-(
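
Concretely, with a NULL escontext, xml_errsave() ereports and never
returns, so a correct ordering would have to free first (a sketch):

    if (res_code != 0 || xmlerrcxt->err_occurred)
    {
        /* free before xml_errsave(), which may not return */
        xmlFreeNodeList(node_list);
        node_list = NULL;
        xml_errsave(escontext, xmlerrcxt,
                    ERRCODE_INVALID_XML_CONTENT,
                    "invalid XML content");
        goto fail;
    }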

On the whole I'm inclined to revert to the previous coding without
a node_list variable.

            regards, tom lane



Re: Regression with large XML data input

From
Jim Jones
Date:

On 29.07.25 17:27, Tom Lane wrote:
> Re-reading the prior thread, I see that my memory above is quite
> faulty: we added the node_list intermediate variable as a way to
> detect errors when the return code from xmlParseBalancedChunkMemory
> couldn't be trusted.  So I think you're right to question whether we
> still need it.  I tried reverting to just passing parsed_nodes, and
> I don't see any leak in either the normal or error paths --- so at
> least with the quite-old version of libxml2 I'm testing, there is no
> such bug.
>
>> I went through the discussions and the libxml2 issue, and I also think
>> it is prudent to keep it like that :)
> I've got mixed feelings about it now.  I think the $64 question
> is whether there are any cases in which xmlParseBalancedChunkMemory
> thinks things are fine (and returns a node list) but then we conclude
> there's an error, perhaps as a consequence of xmlerrcxt->err_occurred
> having become set earlier.  That's a little bit of a stretch.
>
> In any case, I now realize that I broke that scenario yesterday
> because I forgot that xml_errsave could throw a longjmp --- so freeing
> the node list after calling it is too late :-(
>
> On the whole I'm inclined to revert to the previous coding without
> a node_list variable.

I also didn't spot any leaks, but I was rather hesitant to remove it
after re-reading the code, since there's still a risk of leakage if the
caller fails to free parsed_nodes in case of an error. However, it seems
that only xmltotext_with_options relies on this, and even then, the
result of parsed_nodes is added to a document that gets freed in case of
failure, so it looks like we're covered.

Best regards, Jim



Re: Regression with large XML data input

From
Tom Lane
Date:
Jim Jones <jim.jones@uni-muenster.de> writes:
> I also didn't spot any leaks, but I was rather hesitant to remove it
> after re-reading the code, since there's still a risk of leakage if the
> caller fails to free parsed_nodes in case of an error. However, it seems
> that only xmltotext_with_options relies on this, and even then, the
> result of parsed_nodes is added to a document that gets freed in case of
> failure, so it looks like we're covered.

Yeah, that's a separate issue: once we assign it to *parsed_nodes,
it's on the caller's head to make sure it's freed.  I also spent
awhile staring at xmltotext_with_options.  The early exit for !indent
is safer than it looks: we can only take that path in DOCUMENT mode,
and we won't have any content nodes in that case.  But if we hit
an error somewhere between there and attaching the content nodes to
the document, they'd get leaked.  It looks to me like the only
plausible way for that to happen is an OOM failure within libxml2
operations, which is unlikely but not impossible.  So perhaps it's
worth the trouble to make xmltotext_with_options more wary.

            regards, tom lane



Re: Regression with large XML data input

From
"zengman"
Date:
> The whitespace can be preserved by setting xmlKeepBlanksDefault before
> parsing.  See attached v2.  That function is deprecated, though.  But
> libxml2 uses thread-local globals, so it should be safe.  Other than
> that, I see no other way to set XML_PARSE_NOBLANKS with
> xmlParseBalancedChunkMemory.
> 
> [1] https://gitlab.gnome.org/GNOME/libxml2/-/blob/408bd0e18e6ddba5d18e51d52da0f7b3ca1b4421/parserInternals.c#L2833

Hi everyone, 

I have a small issue that needs resolving.

My environment:
```
postgres@zxm-VMware-Virtual-Platform:~$ uname -i -s -r -v
Linux 6.11.0-29-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 13 20:29:41 UTC 2025 x86_64
postgres@zxm-VMware-Virtual-Platform:~$ 
postgres@zxm-VMware-Virtual-Platform:~$ xml2-config --version 
2.12.7
```

After setting COPT=-Werror, compilation fails with the following errors (warnings are enforced as errors):
```
xml.c: In function ‘xml_parse’:
xml.c:1919:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1919 |                         save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
      |                         ^~~~~~~~~~~~~~~~
In file included from xml.c:51:
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
xml.c:1943:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1943 |                         xmlKeepBlanksDefault(save_keep_blanks);
      |                         ^~~~~~~~~~~~~~~~~~~~
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
xml.c:1956:17: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1956 |                 xmlKeepBlanksDefault(save_keep_blanks);
      |                 ^~~~~~~~~~~~~~~~~~~~
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
```

These deprecation warnings do not impact the test results in any way. 
Therefore, I have attached a patch to suppress these specific warnings. 
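
(For illustration only; the attached patch is not reproduced here and
may do something different.  One way to silence such a warning at the
call sites is a GCC diagnostic pragma:)

    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
    save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
    #pragma GCC diagnostic pop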

--
Regards,
Man Zeng
www.openhalo.org
Attachments

Re: Regression with large XML data input

From
Erik Wienhold
Date:
On 2025-12-25 08:29 +0100, zengman wrote:
> After setting COPT=-Werror, compilation fails with the following errors (warnings are enforced as errors):
> ```
> xml.c: In function ‘xml_parse’:
> xml.c:1919:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
> [...]
> ```
> 
> These deprecation warnings do not impact the test results in any way. 
> Therefore, I have attached a patch to suppress these specific warnings. 

The patch works when building with COPT=-Werror.  I guess the change is
okay since we already make use of -Wno-deprecated-declarations in
src/backend/jit/llvm/Makefile:

    # LLVM 14 produces deprecation warnings.  We'll need to make some changes
    # before the relevant functions are removed, but for now silence the warnings.
    ifeq ($(GCC), yes)
    LLVM_CFLAGS += -Wno-deprecated-declarations
    endif

But do we need the same guard for GCC here as well?
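
Something along these lines, by analogy with the LLVM snippet above
(hypothetical sketch; the actual Makefile change may look different):

    # hypothetical guard, mirroring the LLVM one above
    ifeq ($(GCC), yes)
    xml.o: CFLAGS += -Wno-deprecated-declarations
    endif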

Alternatively, can you upgrade to libxml2 2.13.3+ which undeprecated
xmlKeepBlanksDefault?

-- 
Erik Wienhold



Re: Regression with large XML data input

From
"zengman"
Date:
> Alternatively, can you upgrade to libxml2 2.13.3+ which undeprecated
> xmlKeepBlanksDefault?
Hi,

I downloaded the libxml2 source code, built it, and upgraded to version 2.13.4, but the result seems to be the same.
 
I'm wondering whether the libxml2 shipped by different operating systems meets our requirements, 
or whether we need to point the build at a particular libxml2 version externally.

```
xml.c: In function ‘xml_parse’:
xml.c:1919:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1919 |                         save_keep_blanks = xmlKeepBlanksDefault(preserve_whitespace ? 1 : 0);
      |                         ^~~~~~~~~~~~~~~~
In file included from xml.c:51:
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
xml.c:1943:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1943 |                         xmlKeepBlanksDefault(save_keep_blanks);
      |                         ^~~~~~~~~~~~~~~~~~~~
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
xml.c:1956:17: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
 1956 |                 xmlKeepBlanksDefault(save_keep_blanks);
      |                 ^~~~~~~~~~~~~~~~~~~~
/usr/include/libxml2/libxml/parser.h:957:17: note: declared here
  957 |                 xmlKeepBlanksDefault    (int val);
      |                 ^~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[4]: *** [../../../../src/Makefile.global:970: xml.o] Error 1
make[4]: Leaving directory '/home/postgres/code/postgres/src/backend/utils/adt'
make[3]: *** [../../../src/backend/common.mk:37: adt-recursive] Error 2
make[3]: Leaving directory '/home/postgres/code/postgres/src/backend/utils'
make[2]: *** [common.mk:37: utils-recursive] Error 2
make[2]: Leaving directory '/home/postgres/code/postgres/src/backend'
make[1]: *** [Makefile:42: all-backend-recurse] Error 2
make[1]: Leaving directory '/home/postgres/code/postgres/src'
make: *** [GNUmakefile:11: all-src-recurse] Error 2
postgres@zxm-VMware-Virtual-Platform:~/code/postgres$ xml2-config --version 
2.13.4
```
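
Looking at the output again, the errors still point at
/usr/include/libxml2, so the compiler is presumably still picking up
the system 2.12.7 headers rather than the freshly built 2.13.4.  What
I plan to try next is re-running configure pointed at the local copy,
roughly like this (the install path is just my guess):

```
./configure --with-libxml XML2_CONFIG=/usr/local/bin/xml2-config
make
```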

--
Regards,
Man Zeng
www.openhalo.org

Re: Regression with large XML data input

From
Tom Lane
Date:
Erik Wienhold <ewie@ewie.name> writes:
> On 2025-12-25 08:29 +0100, zengman wrote:
>> After setting COPT=-Werror, compilation fails with the following errors (warnings are enforced as errors):
>> xml.c: In function ‘xml_parse’:
>> xml.c:1919:25: error: ‘xmlKeepBlanksDefault’ is deprecated [-Werror=deprecated-declarations]
>> These deprecation warnings do not impact the test results in any way.
>> Therefore, I have attached a patch to suppress these specific warnings.

> The patch works when building with COPT=-Werror.  I guess the change is
> okay since we already make use of -Wno-deprecated-declarations in
> src/backend/jit/llvm/Makefile:

>     # LLVM 14 produces deprecation warnings.  We'll need to make some changes
>     # before the relevant functions are removed, but for now silence the warnings.
>     ifeq ($(GCC), yes)
>     LLVM_CFLAGS += -Wno-deprecated-declarations
>     endif

Actually, that was a temporary hack that has long outlived its
usefulness [1].  I don't love doing that for xml.c, because it
would put a permanent block on our seeing future deprecation
issues (until they become outright build failures).  Still...

> But do we need the same guard for GCC here as well?

Yeah, we would.  Non-gcc-alike compilers are unlikely to accept that
switch.  We would also need a fix for the meson build system.

> Alternatively, can you upgrade to libxml2 2.13.3+ which undeprecated
> xmlKeepBlanksDefault?

That would be my preferred answer, but after a quick census of the
buildfarm I'm not sure we can get away with it.  The animals that
are showing these warnings are all running RHEL 10 or clones of it,
which is going to be in-support till 2035 or so.  Maybe Red Hat
will upgrade to a newer libxml2 version than whatever they chose
for RHEL 10.0, but I would not count on it.

            regards, tom lane

[1] https://www.postgresql.org/message-id/1407185.1766682319%40sss.pgh.pa.us



Re: Regression with large XML data input

From
"zengman"
Date:
> usefulness [1].  I don't love doing that for xml.c, because it
> would put a permanent block on our seeing future deprecation
> issues (until they become outright build failures).  Still...

As long as the relevant code here remains unmodified, this might be an acceptable temporary solution.

> > But do we need the same guard for GCC here as well?
> 
> Yeah, we would.  Non-gcc-alike compilers are unlikely to accept that
> switch.  We would also need a fix for the meson build system.

Patch 0002 adds the GCC check.  Additionally, I tested the meson build system and saw no warnings with the
`-Dc_args="-Wall"` configuration.
 

```
meson setup .. --prefix=$PGHOME -Dc_args="-Wall"
postgres@zxm-VMware-Virtual-Platform:~/code/postgres/build$   meson compile -j8 
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /usr/bin/ninja -j 8
[2471/2471] Linking target src/interfaces/ecpg/test/thread/alloc
postgres@zxm-VMware-Virtual-Platform:~/code/postgres/build$ 
```

--
Regards,
Man Zeng
www.openhalo.org
Attachments