Discussion: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15592
Logged by:          Dmitry Marakasov
Email address:      amdmi3@amdmi3.ru
PostgreSQL version: 11.1
Operating system:   FreeBSD 12.0
Description:

Shamefully I wasn't aware of the array && operator, so I've been using the
following construct to check for array intersection of two columns of text[]
type:

SELECT ... WHERE EXISTS(SELECT unnest(first) INTERSECT SELECT unnest(second));

After upgrading from PostgreSQL 10.x to 11.x this query began to consume
excessive amounts of memory (several gigabytes per hundred thousand rows).
I've narrowed it down to this simple case, which demonstrates the problem:

CREATE TABLE test AS
SELECT generate_series(1,500000) AS id, '{a,b}'::text[] AS first,
'{a}'::text[] AS second;

SELECT DISTINCT EXISTS(SELECT unnest(first) INTERSECT SELECT unnest(second))
FROM test;

This query makes the worker process eat up about 2 GB of memory. Memory usage
grows linearly during SELECT execution, and the memory is freed afterwards.
The memory overuse doesn't happen on PostgreSQL 10, so this may be a regression
in 11. My original case was quite silly and fixable by using an array
operator, but I suspect that this problem may fire in legitimate cases too,
so I thought it was worth reporting.
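
For reference, a minimal sketch of the array-operator rewrite alluded to
above (using the test table from the reproduction below; for arrays without
NULL elements, && is true exactly when the two arrays share an element):

-- Broadly equivalent overlap check, without the per-row INTERSECT subplan:
SELECT DISTINCT first && second
FROM test;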

Additional info:

Execution plan (it's the same on 10 and 11):
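
(Presumably captured with an invocation like the following; the ANALYZE,
VERBOSE and BUFFERS options are inferred from the timing, Output: and
Buffers: lines in the output shown below.)

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT DISTINCT EXISTS(SELECT unnest(first) INTERSECT SELECT unnest(second))
FROM test;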

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=36527.18..36527.29 rows=2 width=1) (actual time=5005.132..5005.132 rows=1 loops=1)
   Output: ((SubPlan 1))
   Group Key: (SubPlan 1)
   Buffers: shared hit=2048 read=4125 dirtied=4125 written=4093
   ->  Seq Scan on public.test  (cost=0.00..35215.42 rows=524705 width=1) (actual time=0.053..4875.718 rows=500000 loops=1)
         Output: (SubPlan 1)
         Buffers: shared hit=2048 read=4125 dirtied=4125 written=4093
         SubPlan 1
           ->  HashSetOp Intersect  (cost=0.00..4.54 rows=100 width=36) (actual time=0.005..0.005 rows=1 loops=500000)
                 Output: "*SELECT* 1".unnest, (0)
                 ->  Append  (cost=0.00..4.04 rows=200 width=36) (actual time=0.001..0.004 rows=3 loops=500000)
                       ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.52 rows=100 width=36) (actual time=0.001..0.002 rows=2 loops=500000)
                             Output: "*SELECT* 1".unnest, 0
                             ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=0.001..0.001 rows=2 loops=500000)
                                   Output: unnest(test.first)
                                   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=500000)
                       ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..1.52 rows=100 width=36) (actual time=0.001..0.001 rows=1 loops=500000)
                             Output: "*SELECT* 2".unnest, 1
                             ->  ProjectSet  (cost=0.00..0.52 rows=100 width=32) (actual time=0.001..0.001 rows=1 loops=500000)
                                   Output: unnest(test.second)
                                   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=500000)
 Planning Time: 0.185 ms
 Execution Time: 5120.463 ms

I've also managed to get MemoryContextStats(TopMemoryContext) output (from my
production query, not the simplified one above), here it is:
https://gist.github.com/AMDmi3/b55b2f3f6acf06cf6eab59fc043c7a0f and this
line looks suspicious:

> 344753 more child contexts containing 2824216576 total in 344753 blocks; 2735944032 free (5 chunks); 88272544 used
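
As an aside (not part of the original report): on PostgreSQL 14 and later a
similar per-context breakdown can be obtained from SQL via the
pg_backend_memory_contexts view, without attaching a debugger to call
MemoryContextStats():

-- PostgreSQL 14+ only (not available on the 10/11 releases discussed here):
-- show the largest memory contexts of the current backend.
SELECT name, total_bytes, free_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;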


PG Bug reporting form <noreply@postgresql.org> writes:
> After upgrading from PostgreSQL 10.x to 11.x this query began to consume
> excessive amounts of memory (several gigabytes per hundred thousand rows).
> I've narrowed it down to this simple case, which demonstrates the problem:

> CREATE TABLE test AS
> SELECT generate_series(1,500000) AS id, '{a,b}'::text[] AS first,
> '{a}'::text[] AS second;

> SELECT DISTINCT EXISTS(SELECT unnest(first) INTERSECT SELECT unnest(second))
> FROM test;

Ugh.  Something is creating ExprContexts and not freeing them --- a memory
context dump taken at ExecutorEnd shows

      ExecutorState: 100663296 total in 22 blocks; 4625424 free (19 chunks); 96037872 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ...
        499916 more child contexts containing 4095311872 total in 499916 blocks; 3967322616 free (1 chunks); 127989256 used

Not sure where the problem is, yet, but for sure this is a bug.
Thanks for the report!

            regards, tom lane


Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:

On January 14, 2019 8:24:40 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>PG Bug reporting form <noreply@postgresql.org> writes:
>> After upgrading from PostgreSQL 10.x to 11.x this query began to consume
>> excessive amounts of memory (several gigabytes per hundred thousand rows).
>> I've narrowed it down to this simple case, which demonstrates the problem:
>
>> CREATE TABLE test AS
>> SELECT generate_series(1,500000) AS id, '{a,b}'::text[] AS first,
>> '{a}'::text[] AS second;
>
>> SELECT DISTINCT EXISTS(SELECT unnest(first) INTERSECT SELECT unnest(second))
>> FROM test;
>
>Ugh.  Something is creating ExprContexts and not freeing them --- a memory
>context dump taken at ExecutorEnd shows
>
>      ExecutorState: 100663296 total in 22 blocks; 4625424 free (19 chunks); 96037872 used
>        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
>        ...
>        499916 more child contexts containing 4095311872 total in 499916 blocks; 3967322616 free (1 chunks); 127989256 used
>
>Not sure where the problem is, yet, but for sure this is a bug.
>Thanks for the report!

Think I know where the problem is - let me have a coffee and check? I think I might have a fix lying around...
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Andres Freund <andres@anarazel.de> writes:
> On January 14, 2019 8:24:40 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ugh.  Something is creating ExprContexts and not freeing them --- a
>> memory context dump taken at ExecutorEnd shows

> Think I know where the problem is - let me have a coffee and check? I think I might have a fix lying around...

I had just determined that the extra context was added by bf6c614a2,
which at this point has accumulated a pretty serious collection of
bugs, judging by subsequent mentions in the commit log.

I'm inclined to think that entire patch needs review.

            regards, tom lane


I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Think I know where the problem is - let me have a coffee and check? I think I might have a fix lying around...

> I had just determined that the extra context was added by bf6c614a2,

The short answer here is that this addition to BuildTupleHashTable:

    hashtable->exprcontext = CreateExprContext(parent->state);

allocates memory that is not freed by freeing the hashtable's tablecxt,
breaking the API for grouping hashtables.

Why does a hashtable need its own exprcontext now when it didn't before?

            regards, tom lane


Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:
Hi,

On 2019-01-14 11:57:55 -0500, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> Think I know where the problem is - let me have a coffee and check? I think I might have a fix lying around...
>
> > I had just determined that the extra context was added by bf6c614a2,
> > which at this point has accumulated a pretty serious collection of
> > bugs, judging by subsequent mentions in the commit log.

Hm, there was one bug related to this before? Fixing up test failures in
sepgsql (which I can't run locally) just after commit doesn't really
count.


> > I had just determined that the extra context was added by bf6c614a2,
> 
> The short answer here is that this addition to BuildTupleHashTable:
> 
>     hashtable->exprcontext = CreateExprContext(parent->state);
> 
> allocates memory that is not freed by freeing the hashtable's tablecxt,
> breaking the API for grouping hashtables.
> 
> Why does a hashtable need its own exprcontext now when it didn't before?

Because the comparison is now done via the ExprState machinery rather than manual
fmgr invocations.  I'd discussed a patch solving this a few weeks ago
with Andrew Gierth, but got stuck with the fact that essentially all
fixes entail either an API or an ABI break - but I think we gotta do
that anyway.

For performance reasons the API really should be changed so we don't
allocate/deallocate the entire hashtable on each round - I've prototyped
a fix that only resets it, and there are noticeable performance gains. But
that's not something we can easily do in the back branches.

Let me rebase and clean up my patches; it's probably easier to discuss
with them in front of us.

Greetings,

Andres Freund


Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:
Hi,

On 2019-01-14 10:04:23 -0800, Andres Freund wrote:
> On 2019-01-14 11:57:55 -0500, Tom Lane wrote:
> > The short answer here is that this addition to BuildTupleHashTable:
> > 
> >     hashtable->exprcontext = CreateExprContext(parent->state);
> > 
> > allocates memory that is not freed by freeing the hashtable's tablecxt,
> > breaking the API for grouping hashtables.
> > 
> > Why does a hashtable need its own exprcontext now when it didn't before?
> 
> Because the comparison is now done via the ExprState machinery rather than manual
> fmgr invocations.  I'd discussed a patch solving this a few weeks ago
> with Andrew Gierth, but got stuck with the fact that essentially all
> fixes entail either an API or an ABI break - but I think we gotta do
> that anyway.
> 
> For performance reasons the API really should be changed so we don't
> allocate/deallocate the entire hashtable on each round - I've prototyped
> a fix that only resets it, and there are noticeable performance gains. But
> that's not something we can easily do in the back branches.
> 
> Let me rebase and clean up my patches; it's probably easier to discuss
> with them in front of us.

Attached is a series of patches that fix this, and the issue in [1],
namely that we do not reuse JITed expressions, leading to significant
overhead when JIT compiling.

The first patch just moves the ExprContext into the tablecxt of the
tuple hash table, and uses a standalone expression instead of one linked
into the estate. I think that's OK for the table's purpose, because the
expression used is tightly restricted, being built with
ExecBuildGroupingEqual(). Previously we'd call fmgr functions directly,
so there can't be any dependency on expression parents here (and it's
not clear how that'd ever make any sense). Given that, I think it's ok
to not explicitly shutdown the expr context.  If somebody disagrees, we
could change this, but freeing individual ExprContexts is
O(#exprcontexts), so I'd prefer not to unnecessarily go there.

This is sufficient to fix the memory leak problem (but is slower than
10, due to the overhead of creating the comparator expression
repeatedly).

The remaining patches add 0002) support for resetting a simplehash
hashtable, 0003) BuildTupleHashTableExt(), which allows placing the
tuple hashtable into a separate memory context besides tablecxt, and 0004)
changes to in-core tuple hash users to reset instead of recreate hashtables.
This has the advantage of being doable without breaking external users
of the API, while still avoiding recreation of the comparator expression
(and the slot, and the TupleHashTable itself), which in turn avoids the
JIT overhead.

These patches, especially surrounding comments, need a bit of polish,
but I thought it'd be good to discuss them before investing more time.

Comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/15486-05850f065da42931%40postgresql.org

Attachments

Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:
Hi,

On 2019-01-14 14:28:38 -0800, Andres Freund wrote:
> Attached is a series of patches that fix this, and the issue in [1],
> namely that we do not reuse JITed expressions, leading to significant
> overhead when JIT compiling.
> 
> The first patch just moves the ExprContext into the tablecxt of the
> tuple hash table, and uses a standalone expression instead of one linked
> into the estate. I think that's OK for the table's purpose, because the
> expression used is tightly restricted, being built with
> ExecBuildGroupingEqual(). Previously we'd call fmgr functions directly,
> so there can't be any dependency on expression parents here (and it's
> not clear how that'd ever make any sense). Given that, I think it's ok
> to not explicitly shutdown the expr context.  If somebody disagrees, we
> could change this, but freeing individual ExprContexts is
> O(#exprcontexts), so I'd prefer not to unnecessarily go there.
> 
> This is sufficient to fix the memory leak problem (but is slower than
> 10, due to the overhead of creating the comparator expression
> repeatedly).
> 
> The remaining patches add 0002) support for resetting a simplehash
> hashtable, 0003) BuildTupleHashTableExt(), which allows placing the
> tuple hashtable into a separate memory context besides tablecxt, and 0004)
> changes to in-core tuple hash users to reset instead of recreate hashtables.
> This has the advantage of being doable without breaking external users
> of the API, while still avoiding recreation of the comparator expression
> (and the slot, and the TupleHashTable itself), which in turn avoids the
> JIT overhead.
> 
> These patches, especially surrounding comments, need a bit of polish,
> but I thought it'd be good to discuss them before investing more time.

I've pushed these now.

Does anybody have an opinion about removing the BuildTupleHashTable /
BuildTupleHashTableExt split in master?

Dmitry, the fix will be included in the next minor release, which is
scheduled for next week.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
>> ... Given that, I think it's ok
>> to not explicitly shutdown the expr context.

Uh, what?

> Dmitry, the fix will be included in the next minor release, which is
> scheduled for next week.

This doesn't really seem like the kind of patch to push on a release
weekend.  At this point you can't even be confident of getting a readout
from every active buildfarm member.

            regards, tom lane


Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:
Hi,

On 2019-02-09 09:34:41 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> ... Given that, I think it's ok
> >> to not explicitly shutdown the expr context.
> 
> Uh, what?

Well, you explicitly removed the surrounding reasoning. What's the issue
you see here? The generated expression cannot contain anything that uses
shutdown callbacks (it can only contain *_FETCHSOME, *_VAR,
*_NOT_DISTINCT, *_QUAL expression steps). No user defined expressions
have access to the ExprContext.  I'd written about this a few weeks ago;
nobody responded with a better suggestion.


> > Dmitry, the fix will be included in the next minor release, which is
> > scheduled for next week.
> 
> This doesn't really seem like the kind of patch to push on a release
> weekend.  At this point you can't even be confident of getting a readout
> from every active buildfarm member.

Yea, I'd hoped to push this earlier, but unfortunate family issues +
related travel + work travel prevented me from getting to it sooner.
The memory leak is significant, the patch hasn't materially
changed in the two weeks since it was posted, and there's nothing
operating system / architecture related, which I think makes this
acceptable.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-02-09 09:34:41 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> ... Given that, I think it's ok
>>> to not explicitly shutdown the expr context.

>> Uh, what?

> Well, you explicitly removed the surrounding reasoning. What's the issue
> you see here? The generated expression cannot contain anything that uses
> shutdown callbacks

Even if that happens to be true today, it certainly looks like a booby
trap primed and ready to bite somebody in the future.  An ExprContext
that fails to provide one of the basic services of an ExprContext ---
indeed, the *only* basic service of an ExprContext, since whatever else
it does is just a memory context feature --- doesn't sound like a great
plan to me.

What is it you're actually hoping to do by removing this guarantee?
(I apologize for not having been paying attention to this thread,
but I do have finite bandwidth.)

>> This doesn't really seem like the kind of patch to push on a release
>> weekend.  At this point you can't even be confident of getting a readout
>> from every active buildfarm member.

> Yea, I'd hoped to push this earlier, but unfortunate family issues +
> related travel + work travel prevented me from getting to it sooner.
> The memory leak is significant, the patch hasn't materially
> changed in the two weeks since it was posted, and there's nothing
> operating system / architecture related, which I think makes this
> acceptable.

I'm thinking more of the valgrind and clobber_cache_always members,
which are pretty slow by definition.

            regards, tom lane


Re: BUG #15592: Memory overuse with subquery containing unnest() and set operations (11.x regression)

From:
Andres Freund
Date:
Hi,

On 2019-02-09 10:09:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-02-09 09:34:41 -0500, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> ... Given that, I think it's ok
> >>> to not explicitly shutdown the expr context.
> 
> >> Uh, what?
> 
> > Well, you explicitly removed the surrounding reasoning. What's the issue
> > you see here? The generated expression cannot contain anything that uses
> > shutdown callbacks
> 
> Even if that happens to be true today, it certainly looks like a booby
> trap primed and ready to bite somebody in the future.  An ExprContext
> that fails to provide one of the basic services of an ExprContext ---
> indeed, the *only* basic service of an ExprContext, since whatever else
> it does is just a memory context feature --- doesn't sound like a great
> plan to me.
> 
> What is it you're actually hoping to do by removing this guarantee?
> (I apologize for not having been paying attention to this thread,
> but I do have finite bandwidth.)

I think we probably can do better in master, but I don't see a good
solution that's not expensive in v11. The tuple hash table could be
created / destroyed at a prodigious rate before 317ffdfea / 356687bd825,
and I don't see a good way to get rid of needing an ExprContext created
therein. We could register a callback on the memory context to drop the
ExprContext, but unfortunately dropping ExprContexts retail isn't
particularly cheap, as it has to go through a singly linked list
(something we ought to fix one day by using a doubly linked list, but
certainly not in a minor release).  After the aforementioned changes that's
much less of an issue, but without an API break, I don't see how we can
make sure that external code isn't broken by forcing it to only reset
tuple hash tables rather than recreating them via a memory context
reset.

FWIW, for me the most basic service of ExprContext is to provide the
scan/inner/outer slots.


> >> This doesn't really seem like the kind of patch to push on a release
> >> weekend.  At this point you can't even be confident of getting a readout
> >> from every active buildfarm member.
> 
> > Yea, I'd hoped to push this earlier, but unfortunate family issues +
> > related travel + work travel prevented me from getting to it sooner.
> > The memory leak is significant, the patch hasn't materially
> > changed in the two weeks since it was posted, and there's nothing
> > operating system / architecture related, which I think makes this
> > acceptable.
> 
> I'm thinking more of the valgrind and clobber_cache_always members,
> which are pretty slow by definition.

I'd run valgrind locally (for the regression/isolation tests, not for
the TAP tests, but that seems more than plenty for this change). I
didn't run with CLOBBER_CACHE_ALWAYS, but that seems pretty unlikely to
be affected by this change.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-02-09 10:09:57 -0500, Tom Lane wrote:
>> What is it you're actually hoping to do by removing this guarantee?

> I think we probably can do better in master, but I don't see a good
> solution that's not expensive in v11. The tuple hash table could be
> created / destroyed at a prodigious rate before 317ffdfea / 356687bd825,
> and I don't see a good way to get rid of needing an ExprContext created
> therein. We could register a callback on the memory context to drop the
> ExprContext, but unfortunately dropping ExprContexts retail isn't
> particularly cheap, as it has to go through a singly linked list
> (something we ought to fix one day by using a doubly linked list, but
> certainly not in a minor release).

Why not?  I think converting that list into a dlist might be a perfectly
reasonable thing to do.  Certainly there are ABI considerations in the
back branches, but we could probably dodge those adequately with the
usual hack of putting the added field at the end.

            regards, tom lane