Обсуждение: BUG #15847: Running out of memory when planning full outer joins involving many partitions

Поиск
Список
Период
Сортировка

BUG #15847: Running out of memory when planning full outer joins involving many partitions

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15847
Logged by:          Feike Steenbergen
Email address:      feikesteenbergen@gmail.com
PostgreSQL version: 11.3
Operating system:   Ubuntu 18.04.2 LTS
Description:

Hi all,

We've had a few reports recently that had a backend consume a lot of
memory
causing either an OOM-kill or kubernetes rescheduling their PostgreSQL
pod.

The actual report and some graphs and details can be found here:
https://github.com/timescale/timescaledb/issues/1274

Note: The behavior is not TimescaleDB specific, it also happens on a
vanilla
PostgreSQL installation without Timescale installed.

For this specific bug report there were two things that clearly stood out:

- a FULL OUTER JOIN is done
- many partitions (thousands) are involved

The problematic behavior is as follows:

While planning a query, the backend uses a full CPU and it's memory keeps
increasing until either:

- ERROR: 53200: out of memory for systems with overcommit disabled
- or killed by OOM
- or rescheduled (kubernetes)

The backend seems to be in add_child_rel_equivalences during this time and
does
not respond to SIGINT while it is in there.

I've encountered this problem on 11.3, 10.8 and 9.6.13 (with table
inheritance
instead of declarative partitioning).

regards,

Feike Steenbergen



/*
The below SQL should reproduce the issue on a machine with <= 16GB memory:

This is a surrogate test case to trigger the problematic behavior.

The actual report of the user involved multiple tables, but to simplify
things here I'm just reusing the same partitioned table with a lot of
partitions
*/

CREATE TABLE buggy(
    inserted timestamptz not null
)
PARTITION BY RANGE (inserted);

-- Create some partitions
DO $BODY$
DECLARE
    partname text;
    start date := date_trunc('week', '1999-12-31'::date);
BEGIN
    FOR i IN 0..1000 LOOP
        partname := format('buggy_%s', to_char(start, 'IYYYIW'));
        EXECUTE format( $$CREATE TABLE %I PARTITION OF %I
                          FOR VALUES FROM (%L) TO (%L)$$,
                        partname,
                        'buggy',
                        start,
                        start + 7
                );
        start := start + 7;
    END LOOP;
END;
$BODY$;


-- This works fine
EXPLAIN
SELECT
    inserted
FROM
    buggy b1
LEFT JOIN
    buggy b2 USING (inserted)
LEFT JOIN
    buggy b3 USING (inserted)
LEFT JOIN
    buggy b4 USING (inserted
);

/*
This will either do the following:
- `ERROR:  53200: out of memory` for systems with overcommit disabled, or
- an out-of-memory kill (kernel)
- rescheduling of the pod (k8s)
*/
EXPLAIN
SELECT
    inserted
FROM
    buggy b1
FULL OUTER JOIN
    buggy b2 USING (inserted)
FULL OUTER JOIN
    buggy b3 USING (inserted)
FULL OUTER JOIN
    buggy b4 USING (inserted)
;


PG Bug reporting form <noreply@postgresql.org> writes:
> We've had a few reports recently that had a backend consume a lot of
> memory causing either an OOM-kill or kubernetes rescheduling their
> PostgreSQL pod.
> For this specific bug report there were two things that clearly stood out:
> - a FULL OUTER JOIN is done
> - many partitions (thousands) are involved

I poked into this and found the cause.  For the sample query, we have
an EquivalenceClass containing the expression
    COALESCE(COALESCE(Var_1_1, Var_2_1), Var_3_1)
where each of the Vars belongs to an appendrel parent.
add_child_rel_equivalences() needs to add expressions representing the
transform of that to each child relation.  That is, if the children
of table 1 are A1 and A2, of table 2 are B1 and B2, and of table 3
are C1 and C2, what we'd like to add are the expressions
    COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_3_1)
    COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_3_1)
    COALESCE(COALESCE(Var_1_1, Var_B1_1), Var_3_1)
    COALESCE(COALESCE(Var_1_1, Var_B2_1), Var_3_1)
    COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C1_1)
    COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C2_1)
However, what it's actually producing is additional combinations for
each appendrel after the first, because each call also mutates the
previously-added child expressions.  So in this example we also get
    COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_3_1)
    COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_3_1)
    COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_C1_1)
    COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_C2_1)
    COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_C1_1)
    COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_C2_1)
With two appendrels involved, that's O(N^2) expressions; with
three appendrels, more like O(N^3).

This is by no means specific to FULL JOINs; you could get the same
behavior with join clauses like "WHERE t1.a + t2.b + t3.c = t4.d".

These extra expressions don't have any use, since we're not
going to join the children directly to each other.  So we need
to fix add_child_rel_equivalences() to not do that.  The
simplest way seems to be to make it ignore em_is_child EC members,
which requires that we use the adjust_appendrel_attrs_multilevel
machinery if we're trying to convert one of the original expressions
for a grandchild relation.  As attached.  This patch fixes the
described performance problem and still passes check-world.

While I don't have any hesitation about pushing this patch into HEAD,
I do feel a bit nervous about back-patching it, particularly right
before a set of minor releases.  I don't think that we consider large
numbers of partitions to be a well-supported case in v11 or before,
so for the released branches I'd rather just say "if it hurts, don't
do that".

As an aside, adjust_appendrel_attrs_multilevel() makes me positively ill
(and it's not the head cold I have today...).  It's unbelievably
brute-force, which might be okay if it were something we'd execute only
once per query, but examples like this can require it to be executed
thousands of times.  Still, right now is probably not a good time to blow
it up and rewrite it.

            regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index e9bd5ea..b50e9cc 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2135,11 +2135,10 @@ add_child_rel_equivalences(PlannerInfo *root,
             continue;

         /*
-         * No point in searching if parent rel not mentioned in eclass; but we
-         * can't tell that for sure if parent rel is itself a child.
+         * No point in searching if child's topmost parent rel is not
+         * mentioned in eclass.
          */
-        if (parent_rel->reloptkind == RELOPT_BASEREL &&
-            !bms_is_subset(parent_rel->relids, cur_ec->ec_relids))
+        if (!bms_is_subset(child_rel->top_parent_relids, cur_ec->ec_relids))
             continue;

         foreach(lc2, cur_ec->ec_members)
@@ -2149,18 +2148,41 @@ add_child_rel_equivalences(PlannerInfo *root,
             if (cur_em->em_is_const)
                 continue;        /* ignore consts here */

-            /* Does it reference parent_rel? */
-            if (bms_overlap(cur_em->em_relids, parent_rel->relids))
+            /*
+             * We consider only original EC members here, not
+             * already-transformed child members.  Otherwise, if some original
+             * member expression references more than one appendrel, we'd get
+             * an O(N^2) explosion of useless derived expressions for
+             * combinations of children.
+             */
+            if (cur_em->em_is_child)
+                continue;        /* ignore children here */
+
+            /* Does this member reference child's topmost parent rel? */
+            if (bms_overlap(cur_em->em_relids, child_rel->top_parent_relids))
             {
                 /* Yes, generate transformed child version */
                 Expr       *child_expr;
                 Relids        new_relids;
                 Relids        new_nullable_relids;

-                child_expr = (Expr *)
-                    adjust_appendrel_attrs(root,
-                                           (Node *) cur_em->em_expr,
-                                           1, &appinfo);
+                if (parent_rel->reloptkind == RELOPT_BASEREL)
+                {
+                    /* Simple single-level transformation */
+                    child_expr = (Expr *)
+                        adjust_appendrel_attrs(root,
+                                               (Node *) cur_em->em_expr,
+                                               1, &appinfo);
+                }
+                else
+                {
+                    /* Must do multi-level transformation */
+                    child_expr = (Expr *)
+                        adjust_appendrel_attrs_multilevel(root,
+                                                          (Node *) cur_em->em_expr,
+                                                          child_rel->relids,
+                                                          child_rel->top_parent_relids);
+                }

                 /*
                  * Transform em_relids to match.  Note we do *not* do
@@ -2169,7 +2191,7 @@ add_child_rel_equivalences(PlannerInfo *root,
                  * don't want the child member to be marked as constant.
                  */
                 new_relids = bms_difference(cur_em->em_relids,
-                                            parent_rel->relids);
+                                            child_rel->top_parent_relids);
                 new_relids = bms_add_members(new_relids, child_rel->relids);

                 /*
@@ -2177,10 +2199,11 @@ add_child_rel_equivalences(PlannerInfo *root,
                  * parent and child relids are singletons.
                  */
                 new_nullable_relids = cur_em->em_nullable_relids;
-                if (bms_overlap(new_nullable_relids, parent_rel->relids))
+                if (bms_overlap(new_nullable_relids,
+                                child_rel->top_parent_relids))
                 {
                     new_nullable_relids = bms_difference(new_nullable_relids,
-                                                         parent_rel->relids);
+                                                         child_rel->top_parent_relids);
                     new_nullable_relids = bms_add_members(new_nullable_relids,
                                                           child_rel->relids);
                 }

Re: BUG #15847: Running out of memory when planning full outer joinsinvolving many partitions

От
Feike Steenbergen
Дата:
On Thu, 13 Jun 2019 at 19:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This patch fixes the
> described performance problem and still passes check-world.

Thanks! I can see drastic improvements in a 3-way join
and the 4-way join is now actually working.

Feike