Обсуждение: ERROR: no relation entry for relid 6

Поиск
Список
Период
Сортировка

ERROR: no relation entry for relid 6

От
Richard Guo
Дата:
I came across $subject on master and here is the query I'm running.

create table t (a int unique, b int);

explain (costs off)
select 1 from t t1
    left join t t2 on true
   inner join t t3 on true
    left join t t4 on t2.a = t4.a and t2.a = t3.a;
ERROR:  no relation entry for relid 6

I looked into it and it should be caused by some problem in outer-join
removal.  After we've decided that the join to t4 is removable, which is
no problem, one of the things we need to do is to remove any joinquals
referencing t4 from the joininfo lists.  In this query, there are two
such quals, 't2.a = t4.a' and 't2.a = t3.a'.  And both quals have two
versions, one with t1/t2 join in their nulling bitmap and one without.
The former version would be treated as being "pushed down" because its
required_relids exceed the scope of joinrelids, due to the t1/t2 join
included in the qual's nulling bitmap but not included in joinrelids.
And as a result this version of quals would be put back.  I think this
is where the problem is.  Ideally we should not put them back.

This issue seems to have existed for a while, and is revealed by the
change in c8b881d2 recently.  I'm not sure how to fix it yet.  What I'm
thinking is that maybe this has something to do with the loose ends we
have in make_outerjoininfo.  Actually in this query the t1/t2 join
cannot commute with the join to t4.  If we can know that in
make_outerjoininfo, we'd have added t1/t2 join to the min_lefthand of
the join to t4, and that would avoid this issue.

Thanks
Richard

Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Tue, May 23, 2023 at 2:38 PM Richard Guo <guofenglinux@gmail.com> wrote:
I came across $subject on master and here is the query I'm running.

create table t (a int unique, b int);

explain (costs off)
select 1 from t t1
    left join t t2 on true
   inner join t t3 on true
    left join t t4 on t2.a = t4.a and t2.a = t3.a;
ERROR:  no relation entry for relid 6

I looked into it and it should be caused by some problem in outer-join
removal.  After we've decided that the join to t4 is removable, which is
no problem, one of the things we need to do is to remove any joinquals
referencing t4 from the joininfo lists.  In this query, there are two
such quals, 't2.a = t4.a' and 't2.a = t3.a'.  And both quals have two
versions, one with t1/t2 join in their nulling bitmap and one without.
The former version would be treated as being "pushed down" because its
required_relids exceed the scope of joinrelids, due to the t1/t2 join
included in the qual's nulling bitmap but not included in joinrelids.
And as a result this version of quals would be put back.  I think this
is where the problem is.  Ideally we should not put them back.

This issue seems to have existed for a while, and is revealed by the
change in c8b881d2 recently.  I'm not sure how to fix it yet.  What I'm
thinking is that maybe this has something to do with the loose ends we
have in make_outerjoininfo.  Actually in this query the t1/t2 join
cannot commute with the join to t4.  If we can know that in
make_outerjoininfo, we'd have added t1/t2 join to the min_lefthand of
the join to t4, and that would avoid this issue.

Considering that clone clauses should always be outer-join clauses not
filter clauses, I'm wondering if we can add an additional check for that
in RINFO_IS_PUSHED_DOWN, something like

 #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
-   ((rinfo)->is_pushed_down || \
-    !bms_is_subset((rinfo)->required_relids, joinrelids))
+   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
+    ((rinfo)->is_pushed_down || \
+     !bms_is_subset((rinfo)->required_relids, joinrelids)))

This change can fix the case shown upthread.  But I doubt it's the
perfect fix we want.

Thanks
Richard

Re: ERROR: no relation entry for relid 6

От
Tom Lane
Дата:
Richard Guo <guofenglinux@gmail.com> writes:
>> create table t (a int unique, b int);
>> 
>> explain (costs off)
>> select 1 from t t1
>> left join t t2 on true
>> inner join t t3 on true
>> left join t t4 on t2.a = t4.a and t2.a = t3.a;
>> ERROR:  no relation entry for relid 6

Ugh.

> Considering that clone clauses should always be outer-join clauses not
> filter clauses, I'm wondering if we can add an additional check for that
> in RINFO_IS_PUSHED_DOWN, something like

>  #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
> -   ((rinfo)->is_pushed_down || \
> -    !bms_is_subset((rinfo)->required_relids, joinrelids))
> +   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
> +    ((rinfo)->is_pushed_down || \
> +     !bms_is_subset((rinfo)->required_relids, joinrelids)))

I don't like that one bit; it seems way too likely to introduce
bad side-effects elsewhere.

I think the real issue is that "is pushed down" has never been a
conceptually accurate way of thinking about what
remove_rel_from_query needs to do here.  Using RINFO_IS_PUSHED_DOWN
happened to work up to now, but it's an abuse of that macro, and
changing the macro's behavior isn't the right way to fix it.

Having said that, I'm not sure what is a better way to think about it.
It seems like our data structure doesn't have a clear tie between
restrictinfos and their original join level, or at least, to the extent
that it did the "clone clause" mechanism has broken it.

I wonder if we could do something involving recording the
rinfo_serial numbers of all the clauses extracted from a particular
syntactic join level (we could keep this in a bitmapset attached
to each SpecialJoinInfo, perhaps) and then filtering the joinclauses
on the basis of serial numbers instead of required_relids.

            regards, tom lane



Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Wed, May 24, 2023 at 2:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> Considering that clone clauses should always be outer-join clauses not
> filter clauses, I'm wondering if we can add an additional check for that
> in RINFO_IS_PUSHED_DOWN, something like

>  #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
> -   ((rinfo)->is_pushed_down || \
> -    !bms_is_subset((rinfo)->required_relids, joinrelids))
> +   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
> +    ((rinfo)->is_pushed_down || \
> +     !bms_is_subset((rinfo)->required_relids, joinrelids)))

I don't like that one bit; it seems way too likely to introduce
bad side-effects elsewhere.

Agreed.  I also do not have too much confidence in it.
 
I think the real issue is that "is pushed down" has never been a
conceptually accurate way of thinking about what
remove_rel_from_query needs to do here.  Using RINFO_IS_PUSHED_DOWN
happened to work up to now, but it's an abuse of that macro, and
changing the macro's behavior isn't the right way to fix it.

Having said that, I'm not sure what is a better way to think about it.
It seems like our data structure doesn't have a clear tie between
restrictinfos and their original join level, or at least, to the extent
that it did the "clone clause" mechanism has broken it.

I wonder if we could do something involving recording the
rinfo_serial numbers of all the clauses extracted from a particular
syntactic join level (we could keep this in a bitmapset attached
to each SpecialJoinInfo, perhaps) and then filtering the joinclauses
on the basis of serial numbers instead of required_relids.

I think this is a better way to fix the issue.  I went ahead and drafted
a patch as attached.  But I doubt that the collecting of rinfo_serial
numbers is thorough in the patch.  Should we also collect the
rinfo_serial of quals generated in reconsider_outer_join_clauses()?  I
believe that we do not need to consider the quals from
generate_base_implied_equalities(), since they are all supposed to be
restriction clauses not join clauses.

Thanks
Richard
Вложения

Re: ERROR: no relation entry for relid 6

От
Tom Lane
Дата:
Richard Guo <guofenglinux@gmail.com> writes:
> On Wed, May 24, 2023 at 2:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if we could do something involving recording the
>> rinfo_serial numbers of all the clauses extracted from a particular
>> syntactic join level (we could keep this in a bitmapset attached
>> to each SpecialJoinInfo, perhaps) and then filtering the joinclauses
>> on the basis of serial numbers instead of required_relids.

> I think this is a better way to fix the issue.  I went ahead and drafted
> a patch as attached.  But I doubt that the collecting of rinfo_serial
> numbers is thorough in the patch.  Should we also collect the
> rinfo_serial of quals generated in reconsider_outer_join_clauses()?

Not sure.  However, I thought of a possible fix that doesn't require
so much new mechanism: we could consider potentially-commuted outer
joins as part of the relid set that's okay to discard, as in the
attached.  This is still relying on RINFO_IS_PUSHED_DOWN, which I
continue to feel is not quite the right thing here, but on the other
hand that logic survived for years without trouble.  What broke it
was addition of outer-join relids to the mix.  I briefly considered
proposing that we simply ignore all outer-join relids in the test that
decides whether to keep or discard a joinqual, but this way seems at
least slightly more principled.

A couple of notes:

1. The test case you give upthread only needs to ignore commute_below_l,
that is it still passes without the lines

+    join_plus_commute = bms_add_members(join_plus_commute,
+                                        removable_sjinfo->commute_above_r);

Based on what deconstruct_distribute_oj_quals is doing, it seems
likely to me that there are cases that require ignoring
commute_above_r, but I've not tried to devise one.  It'd be good to
have one to include in the commit, if we can find one.

2. We could go a little further in refactoring, specifically the
computation of joinrelids could be moved into remove_rel_from_query,
since remove_useless_joins itself doesn't need it.  Not sure if
this'd be an improvement or not.  Certainly it'd be a loser if
remove_useless_joins grew a reason to need the value; but I can't
foresee a reason for that to happen --- remove_rel_from_query is
where basically all the work is going on.

3. I called the parameter removable_sjinfo because sjinfo is already
used within another loop, leading to a shadowed-parameter warning.
In a green field we'd probably have called the parameter sjinfo
and used another name for the loop's local variable, but that would
make the patch bulkier without adding anything.  Haven't decided
whether to rename before commit or leave it as-is.

            regards, tom lane

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index a61e35f92d..54ae0379b3 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -35,7 +35,8 @@

 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-static void remove_rel_from_query(PlannerInfo *root, int relid, int ojrelid,
+static void remove_rel_from_query(PlannerInfo *root, int relid,
+                                  SpecialJoinInfo *removable_sjinfo,
                                   Relids joinrelids);
 static void remove_rel_from_restrictinfo(RestrictInfo *rinfo,
                                          int relid, int ojrelid);
@@ -93,7 +94,7 @@ restart:
         if (sjinfo->ojrelid != 0)
             joinrelids = bms_add_member(joinrelids, sjinfo->ojrelid);

-        remove_rel_from_query(root, innerrelid, sjinfo->ojrelid, joinrelids);
+        remove_rel_from_query(root, innerrelid, sjinfo, joinrelids);

         /* We verify that exactly one reference gets removed from joinlist */
         nremoved = 0;
@@ -331,10 +332,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
  * the planner's data structures that will actually be consulted later.
  */
 static void
-remove_rel_from_query(PlannerInfo *root, int relid, int ojrelid,
+remove_rel_from_query(PlannerInfo *root, int relid,
+                      SpecialJoinInfo *removable_sjinfo,
                       Relids joinrelids)
 {
     RelOptInfo *rel = find_base_rel(root, relid);
+    int            ojrelid = removable_sjinfo->ojrelid;
+    Relids        join_plus_commute;
     List       *joininfos;
     Index        rti;
     ListCell   *l;
@@ -454,8 +458,19 @@ remove_rel_from_query(PlannerInfo *root, int relid, int ojrelid,
      * outerjoin-delayed quals, which can get marked as requiring the rel in
      * order to force them to be evaluated at or above the join.  We can't
      * just discard them, though.  Only quals that logically belonged to the
-     * outer join being discarded should be removed from the query.
-     *
+     * outer join being discarded should be removed from the query.  However,
+     * we might encounter a qual that is a clone of a deletable qual with some
+     * outer-join relids added (see deconstruct_distribute_oj_quals).  To
+     * ensure we get rid of such clones as well, add the relids of all OJs
+     * commutable with this one to the set we test against for
+     * pushed-down-ness.
+     */
+    join_plus_commute = bms_union(joinrelids,
+                                  removable_sjinfo->commute_below_l);
+    join_plus_commute = bms_add_members(join_plus_commute,
+                                        removable_sjinfo->commute_above_r);
+
+    /*
      * We must make a copy of the rel's old joininfo list before starting the
      * loop, because otherwise remove_join_clause_from_rels would destroy the
      * list while we're scanning it.
@@ -467,7 +482,7 @@ remove_rel_from_query(PlannerInfo *root, int relid, int ojrelid,

         remove_join_clause_from_rels(root, rinfo, rinfo->required_relids);

-        if (RINFO_IS_PUSHED_DOWN(rinfo, joinrelids))
+        if (RINFO_IS_PUSHED_DOWN(rinfo, join_plus_commute))
         {
             /*
              * There might be references to relid or ojrelid in the

Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Fri, May 26, 2023 at 6:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
1. The test case you give upthread only needs to ignore commute_below_l,
that is it still passes without the lines

+    join_plus_commute = bms_add_members(join_plus_commute,
+                                        removable_sjinfo->commute_above_r);

Based on what deconstruct_distribute_oj_quals is doing, it seems
likely to me that there are cases that require ignoring
commute_above_r, but I've not tried to devise one.  It'd be good to
have one to include in the commit, if we can find one.

It seems that queries of the second form of identity 3 require ignoring
commute_above_r.

select 1 from t t1 left join (t t2 left join t t3 on t2.a = t3.a) on
t1.a = t2.a;

When removing t2/t3 join, the clone of 't2.a = t3.a' with t1 join in the
nulling bitmap would be put back if we do not ignore commute_above_r.
There is no observable problem though because it would be rejected later
in subbuild_joinrel_restrictlist, but still I think it should not be put
back in the first place.
 
2. We could go a little further in refactoring, specifically the
computation of joinrelids could be moved into remove_rel_from_query,
since remove_useless_joins itself doesn't need it.  Not sure if
this'd be an improvement or not.  Certainly it'd be a loser if
remove_useless_joins grew a reason to need the value; but I can't
foresee a reason for that to happen --- remove_rel_from_query is
where basically all the work is going on.

+1 to move the computation of joinrelids into remove_rel_from_query().
We also do that in join_is_removable().

BTW, I doubt that the check of 'sjinfo->ojrelid != 0' in
remove_useless_joins() is needed.  Since we've known that the join is
removable, I think we can just Assert that sjinfo->ojrelid cannot be 0.

--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -91,8 +91,8 @@ restart:

        /* Compute the relid set for the join we are considering */
        joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
-       if (sjinfo->ojrelid != 0)
-           joinrelids = bms_add_member(joinrelids, sjinfo->ojrelid);
+       Assert(sjinfo->ojrelid != 0);
+       joinrelids = bms_add_member(joinrelids, sjinfo->ojrelid);

        remove_rel_from_query(root, innerrelid, sjinfo, joinrelids);
 
3. I called the parameter removable_sjinfo because sjinfo is already
used within another loop, leading to a shadowed-parameter warning.
In a green field we'd probably have called the parameter sjinfo
and used another name for the loop's local variable, but that would
make the patch bulkier without adding anything.  Haven't decided
whether to rename before commit or leave it as-is.

Personally I prefer to rename before commit but I'm OK with both ways.

Thanks
Richard

Re: ERROR: no relation entry for relid 6

От
Tom Lane
Дата:
Richard Guo <guofenglinux@gmail.com> writes:
> On Fri, May 26, 2023 at 6:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Based on what deconstruct_distribute_oj_quals is doing, it seems
>> likely to me that there are cases that require ignoring
>> commute_above_r, but I've not tried to devise one.  It'd be good to
>> have one to include in the commit, if we can find one.

> It seems that queries of the second form of identity 3 require ignoring
> commute_above_r.
> select 1 from t t1 left join (t t2 left join t t3 on t2.a = t3.a) on
> t1.a = t2.a;
> When removing t2/t3 join, the clone of 't2.a = t3.a' with t1 join in the
> nulling bitmap would be put back if we do not ignore commute_above_r.
> There is no observable problem though because it would be rejected later
> in subbuild_joinrel_restrictlist, but still I think it should not be put
> back in the first place.

Ah.  I realized that we could make the problem testable by adding
assertions that a joinclause we're not removing doesn't contain
any surviving references to the target rel or join.  That turns
out to fire (without the bug fix) in half a dozen existing test
cases, so I felt that we didn't need to add another one.

I did the other refactoring we discussed and pushed it.
Thanks for the report and review!

            regards, tom lane



Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Sat, May 27, 2023 at 12:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ah.  I realized that we could make the problem testable by adding
assertions that a joinclause we're not removing doesn't contain
any surviving references to the target rel or join.  That turns
out to fire (without the bug fix) in half a dozen existing test
cases, so I felt that we didn't need to add another one.

I did the other refactoring we discussed and pushed it.
Thanks for the report and review!

Thanks for pushing it!

I've managed to find some problems on master with the help of the new
assertions.  First the query below would trigger the new assertions.

create table t (a int unique, b int);

explain (costs off)
select 1 from t t1 left join
    (select t2.a, 1 as c from
     t t2 left join t t3 on t2.a = t3.a) s
on true left join t t4 on true where s.a < s.c;
server closed the connection unexpectedly

Note that 's.c' would be wrapped in PHV so the qual 's.a < s.c' is
actually 't2.a < PHV(1)', and it is one of t3's joinquals.  When we're
removing the t2/t3 join, we decide that this PHV is no longer needed so
we remove it entirely rather than just remove references from it.  But
actually its phrels still have references to t3 and t2/t3 join.  So when
we put back the qual 's.a < s.c', we will trigger the new assertions.

At first glance I thought we can just remove the new assertions.  But
then I figured out that the problem is more complex than that.  If the
PHV contains lateral references, removing the PHV entirely would cause
us to lose the information about the lateral references, and that may
cause wrong query results in some cases.  Consider the query below
(after removing the two new assertions, or in a non-assert build).

explain (costs off)
select 1 from t t1 left join
    lateral (select t2.a, coalesce(t1.a, 1) as c from
             t t2 left join t t3 on t2.a = t3.a) s
on true left join t t4 on true where s.a < s.c;
                    QUERY PLAN
--------------------------------------------------
 Nested Loop Left Join
   ->  Nested Loop
         ->  Seq Scan on t t1
         ->  Materialize
               ->  Seq Scan on t t2
                     Filter: (a < COALESCE(a, 1))
   ->  Materialize
         ->  Seq Scan on t t4
(8 rows)

The PHV of 'coalesce(t1.a, 1)' has lateral reference to t1 but we'd lose
this information because we've removed this PHV entirely in
remove_rel_from_query.  As a consequence, we'd fail to extract the
lateral dependency for t2 and fail to build the nestloop parameters for
the t1/t2 join.  And that causes wrong query results.  We can see that
if we insert some data into the table.

insert into t select 1,1;
insert into t select 2,2;

On v15 the query above gives

 ?column?
----------
        1
        1
(2 rows)

but on master it gives

 ?column?
----------
(0 rows)

I haven't thought through how to fix it, but I suspect that we may need
to do more checking before we decide to remove PHVs in
remove_rel_from_query.

Thanks
Richard

Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Tue, May 30, 2023 at 10:28 AM Richard Guo <guofenglinux@gmail.com> wrote:
I haven't thought through how to fix it, but I suspect that we may need
to do more checking before we decide to remove PHVs in
remove_rel_from_query.

Hmm, maybe we can additionally check if the PHV needs to be evaluated
above the join.  If so it cannot be removed.

--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -425,7 +425,8 @@ remove_rel_from_query(PlannerInfo *root, int relid, SpecialJoinInfo *sjinfo)

        Assert(!bms_is_member(relid, phinfo->ph_lateral));
        if (bms_is_subset(phinfo->ph_needed, joinrelids) &&
-           bms_is_member(relid, phinfo->ph_eval_at))
+           bms_is_member(relid, phinfo->ph_eval_at) &&
+           !bms_is_member(ojrelid, phinfo->ph_eval_at))
        {
            root->placeholder_list = foreach_delete_current(root->placeholder_list,
                                                            l);

Does this make sense?

Thanks
Richard

Re: ERROR: no relation entry for relid 6

От
Tom Lane
Дата:
Richard Guo <guofenglinux@gmail.com> writes:
> On Tue, May 30, 2023 at 10:28 AM Richard Guo <guofenglinux@gmail.com> wrote:
>> I haven't thought through how to fix it, but I suspect that we may need
>> to do more checking before we decide to remove PHVs in
>> remove_rel_from_query.

Oh, I like this example!  It shows a place where we are now smarter
than we used to be, because v15 and earlier fail to recognize that
the join could be removed.  But we do have to clean up the query
properly afterwards.

> Hmm, maybe we can additionally check if the PHV needs to be evaluated
> above the join.  If so it cannot be removed.

Yeah, that seems to make sense, and it squares with the existing
comment saying that PHVs used above the join can't be removed.
Pushed that way.

            regards, tom lane



Re: ERROR: no relation entry for relid 6

От
Richard Guo
Дата:

On Fri, Jun 9, 2023 at 5:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> Hmm, maybe we can additionally check if the PHV needs to be evaluated
> above the join.  If so it cannot be removed.
Yeah, that seems to make sense, and it squares with the existing
comment saying that PHVs used above the join can't be removed.
Pushed that way.

Thanks for pushing it!  I've closed the open item for it.

Thanks
Richard