Обсуждение: BUG #15395: Assert failure when using CURRENT OF with inheritance

Поиск
Список
Период
Сортировка

BUG #15395: Assert failure when using CURRENT OF with inheritance

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15395
Logged by:          Matvey Arye
Email address:      mat@timescale.com
PostgreSQL version: 10.4
Operating system:   OSX
Description:

The following SQL script causes an Assert failure if PostgreSQL is compiled
with asserts. When run on a PostgreSQL instance with Asserts turned off,
this script seems to work as expected. Thus, I suspect that it may be safe
to just remove the Assert at execCurrent.c:239 but am not familiar enough
with this part of the code to be sure.

The SQL script is:
```
CREATE TABLE current_check (currentid int, payload text);

CREATE TABLE current_check_1 () INHERITS (current_check);
CREATE TABLE current_check_2 () INHERITS (current_check);

INSERT INTO current_check_1 VALUES
    (1, 'abc');
INSERT INTO current_check_2 VALUES
    (2, 'bcd'),
    (3, 'cde'),
    (4, 'def');

BEGIN;

DECLARE current_check_cursor SCROLL CURSOR FOR SELECT * FROM
current_check;

FETCH ABSOLUTE 2 FROM current_check_cursor;
FETCH ABSOLUTE 1 FROM current_check_cursor;

DELETE FROM current_check WHERE CURRENT OF current_check_cursor RETURNING
*;

COMMIT;
```

The backtrace for the Assert is:
```
  * frame #0: 0x00007fff54a19b66 libsystem_kernel.dylib`__pthread_kill +
10
    frame #1: 0x00007fff54be4080 libsystem_pthread.dylib`pthread_kill +
333
    frame #2: 0x00007fff549751ae libsystem_c.dylib`abort + 127
    frame #3: 0x0000000102581630
postgres`ExceptionalCondition(conditionName="!(((bool) (((const
void*)(current_tid) != ((void*)0)) && ((current_tid)->ip_posid !=0))))",
errorType="FailedAssertion", fileName="execCurrent.c", lineNumber=239) at
assert.c:54
    frame #4: 0x00000001021b025f
postgres`execCurrentOf(cexpr=0x00007f890604bf00,
econtext=0x00007f890609c640, table_oid=16402,
current_tid=0x00007ffeedcf4588) atexecCurrent.c:239
    frame #5: 0x00000001021ffbd5
postgres`TidListEval(tidstate=0x00007f890609a6b8) at nodeTidscan.c:224
    frame #6: 0x00000001021ff4de postgres`TidNext(node=0x00007f890609a6b8)
at nodeTidscan.c:329
    frame #7: 0x00000001021cabab
postgres`ExecScanFetch(node=0x00007f890609a6b8, accessMtd=(postgres`TidNext
at nodeTidscan.c:304), recheckMtd=(postgres`TidRecheck at
nodeTidscan.c:420)) at execScan.c:97
    frame #8: 0x00000001021ca8d3 postgres`ExecScan(node=0x00007f890609a6b8,
accessMtd=(postgres`TidNext at nodeTidscan.c:304),
recheckMtd=(postgres`TidRecheck at nodeTidscan.c:420)) at execScan.c:164
    frame #9: 0x00000001021ff082
postgres`ExecTidScan(pstate=0x00007f890609a6b8) at nodeTidscan.c:453
    frame #10: 0x00000001021c8d72
postgres`ExecProcNodeFirst(node=0x00007f890609a6b8) at execProcnode.c:430
    frame #11: 0x00000001021f27f2
postgres`ExecProcNode(node=0x00007f890609a6b8) at executor.h:250
    frame #12: 0x00000001021f182f
postgres`ExecModifyTable(pstate=0x00007f8906098f00) at
nodeModifyTable.c:1645
    frame #13: 0x00000001021c5a82
postgres`ExecProcNode(node=0x00007f8906098f00) at executor.h:250
    frame #14: 0x00000001021c1813
postgres`ExecutePlan(estate=0x00007f8906098840,
planstate=0x00007f8906098f00, use_parallel_mode='\0', operation=CMD_DELETE,
sendTuples='\x01', numberTuples=0, direction=ForwardScanDirection,
dest=0x00007f8906000278, execute_once='\x01') at execMain.c:1722
    frame #15: 0x00000001021c16db
postgres`standard_ExecutorRun(queryDesc=0x00007f8906000310,
direction=ForwardScanDirection, count=0, execute_once='\x01') at
execMain.c:363
    frame #16: 0x00000001021c1497
postgres`ExecutorRun(queryDesc=0x00007f8906000310,
direction=ForwardScanDirection, count=0, execute_once='\x01') at
execMain.c:306
    frame #17: 0x00000001023cfc4e
postgres`ProcessQuery(plan=0x00007f890608eea0, sourceText="DELETE FROM
current_check WHERE CURRENT OF current_check_cursor RETURNING *;",
params=0x0000000000000000, queryEnv=0x0000000000000000,
dest=0x00007f8906000278, completionTag="") at pquery.c:161
    frame #18: 0x00000001023cecb8
postgres`PortalRunMulti(portal=0x00007f8903806040, isTopLevel='\x01',
setHoldSnapshot='\x01', dest=0x00007f8906000278, altdest=0x00000001027d3348,
completionTag="") at pquery.c:1286
    frame #19: 0x00000001023ce6b4
postgres`FillPortalStore(portal=0x00007f8903806040, isTopLevel='\x01') at
pquery.c:1033
    frame #20: 0x00000001023ce285
postgres`PortalRun(portal=0x00007f8903806040, count=9223372036854775807,
isTopLevel='\x01', run_once='\x01', dest=0x00007f890608ef98,
altdest=0x00007f890608ef98, completionTag="") at pquery.c:768
    frame #21: 0x00000001023c9c2c
postgres`exec_simple_query(query_string="DELETE FROM current_check WHERE
CURRENT OF current_check_cursor RETURNING *;") at postgres.c:1099
    frame #22: 0x00000001023c8e37 postgres`PostgresMain(argc=1,
argv=0x00007f8905006d10, dbname="single", username="default_perm_user") at
postgres.c:4088
    frame #23: 0x000000010230f08e
postgres`BackendRun(port=0x00007f89049006c0) at postmaster.c:4405
    frame #24: 0x000000010230e403
postgres`BackendStartup(port=0x00007f89049006c0) at postmaster.c:4077
    frame #25: 0x000000010230d3e5 postgres`ServerLoop at postmaster.c:1755
    frame #26: 0x000000010230ab25 postgres`PostmasterMain(argc=8,
argv=0x00007f8902c03160) at postmaster.c:1363
    frame #27: 0x000000010222e529 postgres`main(argc=8,
argv=0x00007f8902c03160) at main.c:228
```


Re: BUG #15395: Assert failure when using CURRENT OF with inheritance

От
Tom Lane
Дата:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> The following SQL script causes an Assert failure if PostgreSQL is compiled
> with asserts.

Nice catch!  This has evidently been broken for a long time.

> When run on a PostgreSQL instance with Asserts turned off,
> this script seems to work as expected. Thus, I suspect that it may be safe
> to just remove the Assert at execCurrent.c:239 but am not familiar enough
> with this part of the code to be sure.

No, I don't think that's a great fix.  After tracing through it, I can
see that what's going wrong is:

1. The plan tree for the cursor looks like

EXPLAIN DECLARE current_check_cursor SCROLL CURSOR FOR SELECT * FROM current_check;
                                QUERY PLAN
--------------------------------------------------------------------------
 Append  (cost=0.00..58.11 rows=2541 width=36)
   ->  Seq Scan on current_check  (cost=0.00..0.00 rows=1 width=36)
   ->  Seq Scan on current_check_1  (cost=0.00..22.70 rows=1270 width=36)
   ->  Seq Scan on current_check_2  (cost=0.00..22.70 rows=1270 width=36)
(4 rows)

2. The "FETCH ABSOLUTE 2" command runs through the single tuple available
from the seqscans on current_check and current_check_1, and then stops
on the first tuple available from current_check_2.

3. The "FETCH ABSOLUTE 1" command rewinds the executor and fetches the
first available tuple, which is in current_check_1.

4. At this point, the SeqScan node for current_check has been run to
completion, so its ScanTupleSlot is empty.  The node for current_check_1
is correctly positioned on a tuple.  The node for current_check_2 has
not been visited yet, so its state is as ExecReScan left it.  The trouble
is that the only thing that's happened to its ScanTupleSlot is where
heapam.c's initscan() does

    ItemPointerSetInvalid(&scan->rs_ctup.t_self);

It's only rather accidental that the plan node's ScanTupleSlot is
pointing at that at all, but it is.  So execCurrent.c sees an apparently
valid scan tuple containing an invalid t_self, and it spits up, which
I think is entirely a good thing.


I think that the correct fix is to ensure that the ScanTupleSlot gets
cleared during ExecReScan, which fortunately is pretty easy to mechanize
because all scan node types call ExecScanReScan.  So the execScan.c
part of the patch below seems to be enough to resolve the reported case.

However, we're not really out of the woods yet, because I noticed that
places like ExecReScanAppend will postpone calling ExecReScan if there's
a chgParam flag set on their sub-nodes.  This means that we can't
necessarily rely on ExecutorRewind performing an immediate call of
ExecReScan for every scan node in the plan.  It seems to me, therefore,
that we need to teach execCurrent.c to check for pending-rescan flags
and not believe that a node is really on a tuple if it finds such a
flag at or above the node.  The execCurrent.c part of the attached patch
does that.

I've not tried yet to create an actual test case for the chgParam-based
failure.  It's definitely possible that that problem is only hypothetical
at the moment because cursor plans that would satisfy search_plan_tree
would be too simple to contain any such flags.  But I think we'd better
add that logic anyway.

            regards, tom lane

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index f70e54f..01139e4 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -23,7 +23,8 @@


 static char *fetch_cursor_param_value(ExprContext *econtext, int paramId);
-static ScanState *search_plan_tree(PlanState *node, Oid table_oid);
+static ScanState *search_plan_tree(PlanState *node, Oid table_oid,
+                 bool *pending_rescan);


 /*
@@ -156,8 +157,10 @@ execCurrentOf(CurrentOfExpr *cexpr,
          * aggregation.
          */
         ScanState  *scanstate;
+        bool        pending_rescan = false;

-        scanstate = search_plan_tree(queryDesc->planstate, table_oid);
+        scanstate = search_plan_tree(queryDesc->planstate, table_oid,
+                                     &pending_rescan);
         if (!scanstate)
             ereport(ERROR,
                     (errcode(ERRCODE_INVALID_CURSOR_STATE),
@@ -177,8 +180,12 @@ execCurrentOf(CurrentOfExpr *cexpr,
                      errmsg("cursor \"%s\" is not positioned on a row",
                             cursor_name)));

-        /* Now OK to return false if we found an inactive scan */
-        if (TupIsNull(scanstate->ss_ScanTupleSlot))
+        /*
+         * Now OK to return false if we found an inactive scan.  It is
+         * inactive either if it's not positioned on a row, or there's a
+         * rescan pending for it.
+         */
+        if (TupIsNull(scanstate->ss_ScanTupleSlot) || pending_rescan)
             return false;

         /*
@@ -291,10 +298,20 @@ fetch_cursor_param_value(ExprContext *econtext, int paramId)
  *
  * Search through a PlanState tree for a scan node on the specified table.
  * Return NULL if not found or multiple candidates.
+ *
+ * If a candidate is found, set *pending_rescan to true if that candidate
+ * or any node above it has a pending rescan action, i.e. chgParam != NULL.
+ * That indicates that we shouldn't consider the node to be positioned on a
+ * valid tuple, even if its own state would indicate that it is.  (Caller
+ * must initialize *pending_rescan to false, and should not trust its state
+ * if multiple candidates are found.)
  */
 static ScanState *
-search_plan_tree(PlanState *node, Oid table_oid)
+search_plan_tree(PlanState *node, Oid table_oid,
+                 bool *pending_rescan)
 {
+    ScanState  *result = NULL;
+
     if (node == NULL)
         return NULL;
     switch (nodeTag(node))
@@ -314,7 +331,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
                 ScanState  *sstate = (ScanState *) node;

                 if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
-                    return sstate;
+                    result = sstate;
                 break;
             }

@@ -325,13 +342,13 @@ search_plan_tree(PlanState *node, Oid table_oid)
         case T_AppendState:
             {
                 AppendState *astate = (AppendState *) node;
-                ScanState  *result = NULL;
                 int            i;

                 for (i = 0; i < astate->as_nplans; i++)
                 {
                     ScanState  *elem = search_plan_tree(astate->appendplans[i],
-                                                        table_oid);
+                                                        table_oid,
+                                                        pending_rescan);

                     if (!elem)
                         continue;
@@ -339,7 +356,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
                         return NULL;    /* multiple matches */
                     result = elem;
                 }
-                return result;
+                break;
             }

             /*
@@ -348,13 +365,13 @@ search_plan_tree(PlanState *node, Oid table_oid)
         case T_MergeAppendState:
             {
                 MergeAppendState *mstate = (MergeAppendState *) node;
-                ScanState  *result = NULL;
                 int            i;

                 for (i = 0; i < mstate->ms_nplans; i++)
                 {
                     ScanState  *elem = search_plan_tree(mstate->mergeplans[i],
-                                                        table_oid);
+                                                        table_oid,
+                                                        pending_rescan);

                     if (!elem)
                         continue;
@@ -362,7 +379,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
                         return NULL;    /* multiple matches */
                     result = elem;
                 }
-                return result;
+                break;
             }

             /*
@@ -371,18 +388,31 @@ search_plan_tree(PlanState *node, Oid table_oid)
              */
         case T_ResultState:
         case T_LimitState:
-            return search_plan_tree(node->lefttree, table_oid);
+            result = search_plan_tree(node->lefttree,
+                                      table_oid,
+                                      pending_rescan);
+            break;

             /*
              * SubqueryScan too, but it keeps the child in a different place
              */
         case T_SubqueryScanState:
-            return search_plan_tree(((SubqueryScanState *) node)->subplan,
-                                    table_oid);
+            result = search_plan_tree(((SubqueryScanState *) node)->subplan,
+                                      table_oid,
+                                      pending_rescan);
+            break;

         default:
             /* Otherwise, assume we can't descend through it */
             break;
     }
-    return NULL;
+
+    /*
+     * If we found a candidate at or below this node, then this node's
+     * chgParam indicates a pending rescan that will affect this node.
+     */
+    if (result && node->chgParam != NULL)
+        *pending_rescan = true;
+
+    return result;
 }
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index caf9173..f84a3fb 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -263,6 +263,12 @@ ExecScanReScan(ScanState *node)
 {
     EState       *estate = node->ps.state;

+    /*
+     * We must clear the scan tuple so that observers (e.g., execCurrent.c)
+     * can tell that this plan node is not positioned on a tuple.
+     */
+    ExecClearTuple(node->ss_ScanTupleSlot);
+
     /* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
     if (estate->es_epqScanDone != NULL)
     {

Re: BUG #15395: Assert failure when using CURRENT OF with inheritance

От
Tom Lane
Дата:
I wrote:
> I've not tried yet to create an actual test case for the chgParam-based
> failure.  It's definitely possible that that problem is only hypothetical
> at the moment because cursor plans that would satisfy search_plan_tree
> would be too simple to contain any such flags.  But I think we'd better
> add that logic anyway.

After fooling with that for awhile longer, indeed there doesn't seem to
be a way to cause such a failure today.  The closest thing I was able
to find was that an InitPlan's output params could get marked for
recalculation, resulting in chgParam getting set on nodes that reference
the InitPlan.  A cursor query could certainly contain an InitPlan ---
but there doesn't seem to be any code path that would result in forcing
recalculation of one, except with things that couldn't be in a simple
cursor plan, such as NestLoop joins.  Still, it's not real hard to
conceive of future changes that would render this a live issue, so
I still feel that the search_plan_tree change is a good idea.

So, I've added a test case based on your example and pushed the patch.

BTW, I don't understand your report that just removing the Assert made
it work for you.  That would've resulted in returning an invalid TID
to nodeTidscan.c, which AFAICS should've ended up with an ereport about
a bad block number in heap_fetch.

            regards, tom lane


Re: BUG #15395: Assert failure when using CURRENT OF with inheritance

От
Mat Arye
Дата:
Sorry for the delayed response. The patch looks good to me and your explanation makes sense.

I just re-ran my original test sql script on an official docker image of postgres (which obviously has asserts off) and didn't get any errors + the correct row is deleted.
I can investigate further if you think it's worthwhile.

Thanks for all your help.

-Mat

On Sun, Sep 23, 2018 at 4:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> I've not tried yet to create an actual test case for the chgParam-based
> failure.  It's definitely possible that that problem is only hypothetical
> at the moment because cursor plans that would satisfy search_plan_tree
> would be too simple to contain any such flags.  But I think we'd better
> add that logic anyway.

After fooling with that for awhile longer, indeed there doesn't seem to
be a way to cause such a failure today.  The closest thing I was able
to find was that an InitPlan's output params could get marked for
recalculation, resulting in chgParam getting set on nodes that reference
the InitPlan.  A cursor query could certainly contain an InitPlan ---
but there doesn't seem to be any code path that would result in forcing
recalculation of one, except with things that couldn't be in a simple
cursor plan, such as NestLoop joins.  Still, it's not real hard to
conceive of future changes that would render this a live issue, so
I still feel that the search_plan_tree change is a good idea.

So, I've added a test case based on your example and pushed the patch.

BTW, I don't understand your report that just removing the Assert made
it work for you.  That would've resulted in returning an invalid TID
to nodeTidscan.c, which AFAICS should've ended up with an ereport about
a bad block number in heap_fetch.

                        regards, tom lane