Обсуждение: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

Поиск
Список
Период
Сортировка

BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17355
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14.1
Operating system:   Ubuntu 20.04
Description:

When running the following script (with 20 concurrent insert/update/delete
queries):
cat << 'EOF' | psql postgres
CREATE EXTENSION postgres_fdw;

CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
DO $d$
    BEGIN
        EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (dbname '$$||current_database()||$$',
                     port '$$||current_setting('port')||$$'
            )$$;
    END;
$d$;

CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE TABLE async_pt (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE base_tbl2 (a int, b int);
CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000)
TO (3000)
  SERVER loopback OPTIONS (table_name 'base_tbl2');
CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO
(4000);
EOF

for i in `seq 20`; do
cat << 'EOF' | psql postgres >psql-$i.log &
INSERT INTO async_pt SELECT i FROM generate_series(3000, 3100) i;
DELETE FROM async_pt;
UPDATE async_pt SET a = a;
EOF
done
wait

I get crashes with the following stacktrace:
Core was generated by `postgres: law postgres [local] DELETE
                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0f438e6c42 in postgresReScanForeignScan (node=0x5580b57f23b8) at
postgres_fdw.c:1651
1651            if (!fsstate->cursor_exists)
(gdb) bt
#0  0x00007f0f438e6c42 in postgresReScanForeignScan (node=0x5580b57f23b8) at
postgres_fdw.c:1651
#1  0x00005580b4987bf9 in ExecReScanForeignScan (node=0x5580b57f23b8) at
nodeForeignscan.c:326
#2  0x00005580b4942c0a in ExecReScan (node=0x5580b57f23b8) at
execAmi.c:235
#3  0x00005580b49815bb in ExecProcNode (node=0x5580b57f23b8) at
../../../src/include/executor/executor.h:255
#4  0x00005580b4981e3b in ExecAppend (pstate=0x5580b57f1f48) at
nodeAppend.c:360
#5  0x00005580b4967c88 in ExecProcNodeFirst (node=0x5580b57f1f48) at
execProcnode.c:463
#6  0x00005580b495b774 in ExecProcNode (node=0x5580b57f1f48) at
../../../src/include/executor/executor.h:257
#7  0x00005580b4960373 in EvalPlanQualNext (epqstate=0x5580b57cf360) at
execMain.c:2612
#8  0x00005580b495fc99 in EvalPlanQual (epqstate=0x5580b57cf360,
relation=0x7f0f37b3d598, rti=3, 
    inputslot=0x5580b57d0540) at execMain.c:2383
#9  0x00005580b49a70ed in ExecDelete (mtstate=0x5580b57cf278,
resultRelInfo=0x5580b57cf5c8, tupleid=0x7ffdf2fa6002, 
    oldtuple=0x0, planSlot=0x5580b57dc9a8, epqstate=0x5580b57cf360,
estate=0x5580b57ceff0, processReturning=true, 
    canSetTag=true, changingPart=false, tupleDeleted=0x0, epqreturnslot=0x0)
at nodeModifyTable.c:1243
#10 0x00005580b49a95b3 in ExecModifyTable (pstate=0x5580b57cf278) at
nodeModifyTable.c:2599
#11 0x00005580b4967c88 in ExecProcNodeFirst (node=0x5580b57cf278) at
execProcnode.c:463
#12 0x00005580b495b774 in ExecProcNode (node=0x5580b57cf278) at
../../../src/include/executor/executor.h:257
#13 0x00005580b495e392 in ExecutePlan (estate=0x5580b57ceff0,
planstate=0x5580b57cf278, use_parallel_mode=false, 
    operation=CMD_DELETE, sendTuples=false, numberTuples=0,
direction=ForwardScanDirection, dest=0x5580b57dad98, 
    execute_once=true) at execMain.c:1551
#14 0x00005580b495beab in standard_ExecutorRun (queryDesc=0x5580b5711f80,
direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:361
#15 0x00005580b495bc96 in ExecutorRun (queryDesc=0x5580b5711f80,
direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:305
#16 0x00005580b4bd7cf8 in ProcessQuery (plan=0x5580b57d9ac0,
sourceText=0x5580b56e9440 "DELETE FROM async_pt;", 
    params=0x0, queryEnv=0x0, dest=0x5580b57dad98, qc=0x7ffdf2fa64b0) at
pquery.c:160
#17 0x00005580b4bd98b7 in PortalRunMulti (portal=0x5580b574d090,
isTopLevel=true, setHoldSnapshot=false, 
    dest=0x5580b57dad98, altdest=0x5580b57dad98, qc=0x7ffdf2fa64b0) at
pquery.c:1274
#18 0x00005580b4bd8d9d in PortalRun (portal=0x5580b574d090,
count=9223372036854775807, isTopLevel=true, run_once=true, 
    dest=0x5580b57dad98, altdest=0x5580b57dad98, qc=0x7ffdf2fa64b0) at
pquery.c:788
#19 0x00005580b4bd1bd9 in exec_simple_query (query_string=0x5580b56e9440
"DELETE FROM async_pt;") at postgres.c:1214
#20 0x00005580b4bd6ac8 in PostgresMain (argc=1, argv=0x7ffdf2fa66d0,
dbname=0x5580b5714f78 "postgres", 
    username=0x5580b5714f58 "law") at postgres.c:4486
#21 0x00005580b4afb357 in BackendRun (port=0x5580b570dab0) at
postmaster.c:4530
#22 0x00005580b4afabb2 in BackendStartup (port=0x5580b570dab0) at
postmaster.c:4252
#23 0x00005580b4af69a7 in ServerLoop () at postmaster.c:1745
#24 0x00005580b4af6104 in PostmasterMain (argc=3, argv=0x5580b56e34b0) at
postmaster.c:1417
#25 0x00005580b49e566a in main (argc=3, argv=0x5580b56e34b0) at main.c:209

postmaster.log contains:
2022-01-06 10:46:20.145 MSK [403022] ERROR:  cannot re-evaluate a Foreign
Update or Delete during EvalPlanQual
2022-01-06 10:46:20.145 MSK [403022] STATEMENT:  UPDATE async_pt SET a =
a;
2022-01-06 10:46:20.320 MSK [403059] LOG:  unexpected EOF on client
connection with an open transaction
2022-01-06 10:46:20.320 MSK [402977] LOG:  server process (PID 403043) was
terminated by signal 11: Segmentation fault
2022-01-06 10:46:20.320 MSK [402977] DETAIL:  Failed process was running:
DELETE FROM async_pt;


On Thu, Jan 6, 2022 at 5:25 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> When running the following script (with 20 concurrent insert/update/delete
> queries):
> cat << 'EOF' | psql postgres
> CREATE EXTENSION postgres_fdw;
>
> CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
> DO $d$
>     BEGIN
>         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
>             OPTIONS (dbname '$$||current_database()||$$',
>                      port '$$||current_setting('port')||$$'
>             )$$;
>     END;
> $d$;
>
> CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
> CREATE TABLE async_pt (a int, b int) PARTITION BY RANGE (a);
> CREATE TABLE base_tbl2 (a int, b int);
> CREATE FOREIGN TABLE async_p2 PARTITION OF async_pt FOR VALUES FROM (2000)
> TO (3000)
>   SERVER loopback OPTIONS (table_name 'base_tbl2');
> CREATE TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO
> (4000);
> EOF
>
> for i in `seq 20`; do
> cat << 'EOF' | psql postgres >psql-$i.log &
> INSERT INTO async_pt SELECT i FROM generate_series(3000, 3100) i;
> DELETE FROM async_pt;
> UPDATE async_pt SET a = a;
> EOF
> done
> wait
>
> I get crashes with the following stacktrace:

I haven't tried to reproduce this yet, but I have one question: did
you set the async_capable option to true?

Thanks for the report!

Best regards,
Etsuro Fujita



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alexander Lakhin
Дата:
06.01.2022 12:56, Etsuro Fujita wrote:
> I haven't tried to reproduce this yet, but I have one question: did
> you set the async_capable option to true?
The async_capable setting doesn't affect this, so I've filed it as a
separate bug report.
All the other parameters have default values in my setup.

Best regards,
Alexander



On Thu, Jan 6, 2022 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 06.01.2022 12:56, Etsuro Fujita wrote:
> > I haven't tried to reproduce this yet, but I have one question: did
> > you set the async_capable option to true?

> The async_capable setting doesn't affect this, so I've filed it as a
> separate bug report.
> All the other parameters have default values in my setup.

Ok, thanks!

Best regards,
Etsuro Fujita



On Fri, Jan 7, 2022 at 3:19 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> > 06.01.2022 12:56, Etsuro Fujita wrote:
> > > I haven't tried to reproduce this yet

I think a simple reproducer for this issue is:

1) Define the partitioned table async_pt as shown by Alexander.
2) Run concurrent transactions as follows.

Session A: insert into async_pt values (3000);
Session A: begin;
Session A: update async_pt set a = a;
Session B: delete from async_pt;
Session A: commit;

The commit in Session A would cause a server crash in Session B due to
the segmentation fault.

Also, I think a simple reproducer for the “cannot re-evaluate a
Foreign Update or Delete during EvalPlanQual” error is:

1) Define the partitioned table async_pt as shown by Alexander.
2) Run concurrent transactions as follows.

Session A: insert into async_pt values (3000);
Session A: begin;
Session A: update async_pt set a = a;
Session B: update async_pt set a = a;
Session A: commit;

The commit in Session A would cause the transaction in Session B to
abort with that error.

I think the root cause of these issues is that because of the rework
for inherited UPDATE/DELETE in commit 86dc90056, ForeignScan nodes
doing direct modifications are re-evaluated as part of the
EvalPlanQual subtree when doing an EvalPlanQual check, which breaks
the assumption that those ForeignScan nodes should never be
re-evaluated by EvalPlanQual, leading to these issues.  To fix, I’d
like to propose to ignore those ForeignScan nodes in
ExecForeignScan/ExecReScanForeignScan when doing that recheck, like
the attached.

Best regards,
Etsuro Fujita

Вложения

Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alexander Lakhin
Дата:
Hello Etsura-san,
18.01.2022 11:01, Etsuro Fujita wrote:
> On Fri, Jan 7, 2022 at 3:19 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Thu, Jan 6, 2022 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> 06.01.2022 12:56, Etsuro Fujita wrote:
>>>> I haven't tried to reproduce this yet
> I think a simple reproducer for this issue is:
>
> 1) Define the partitioned table async_pt as shown by Alexander.
> 2) Run concurrent transactions as follows.
>
> Session A: insert into async_pt values (3000);
> Session A: begin;
> Session A: update async_pt set a = a;
> Session B: delete from async_pt;
> Session A: commit;
>
> The commit in Session A would cause a server crash in Session B due to
> the segmentation fault.
>
> Also, I think a simple reproducer for the “cannot re-evaluate a
> Foreign Update or Delete during EvalPlanQual” error is:
>
> 1) Define the partitioned table async_pt as shown by Alexander.
> 2) Run concurrent transactions as follows.
>
> Session A: insert into async_pt values (3000);
> Session A: begin;
> Session A: update async_pt set a = a;
> Session B: update async_pt set a = a;
> Session A: commit;
>
> The commit in Session A would cause the transaction in Session B to
> abort with that error.
Thanks for the explanation and the fix!
I've made a simple isolation test (see attachment) to confirm this. And
it crashes without the fix as you said.
With your fix it works as expected (I was wondering how "plan->operation
!= CMD_SELECT" with RETURNING will work and I haven't seen any issues yet.)

(Besides that I've observed an infinite waiting for ShareLock with
step "s1i" { INSERT INTO pt VALUES (2000); }
This doesn't happen with a regular (not foreign) table.)

Best regards,
Alexander

Вложения
Hi Alexander,

On Tue, Jan 25, 2022 at 3:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've made a simple isolation test (see attachment) to confirm this. And
> it crashes without the fix as you said.
> With your fix it works as expected

Thanks for testing!

> (I was wondering how "plan->operation
> != CMD_SELECT" with RETURNING will work and I haven't seen any issues yet.)

EvalPlanQual rechecking is irrelevant for direct modifications even
when the UPDATE/DELETE query has a RETURNING clause; in which case we
compute the RETURNING expressions locally, but that is done *after*
updating/deleting foreign rows on the remote side.

> (Besides that I've observed an infinite waiting for ShareLock with
> step "s1i" { INSERT INTO pt VALUES (2000); }
> This doesn't happen with a regular (not foreign) table.)

You mean the lock wait occurs on the remote side, not on the local
side?  If so, I think that that is expected behavior because a write
conflict occurs on the remote side in that case.  Maybe I don’t fully
understand your words, so could you elaborate a bit more on your
observation?

Best regards,
Etsuro Fujita



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alexander Lakhin
Дата:
Hello Etsuro-san,

30.01.2022 13:59, Etsuro Fujita wrote:
>> (Besides that I've observed an infinite waiting for ShareLock with
>> step "s1i" { INSERT INTO pt VALUES (2000); }
>> This doesn't happen with a regular (not foreign) table.)
> You mean the lock wait occurs on the remote side, not on the local
> side?  If so, I think that that is expected behavior because a write
> conflict occurs on the remote side in that case.  Maybe I don’t fully
> understand your words, so could you elaborate a bit more on your
> observation?
Yes, you are right, that was expected behavior. I didn't realize that
the isolationtester itself resolves blocking when the target table is local.
The isolationtester controls the step execution using
pg_isolation_test_session_is_blocked(), but when the target table is
foreign, it can not determine correctly whether one step blocking other
(cause it checks local, not remote session pids) and just hangs.
Sorry for the noise.

Best regards,
Alexander



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
Hi Alexander,

On Tue, Feb 1, 2022 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 30.01.2022 13:59, Etsuro Fujita wrote:
> >> (Besides that I've observed an infinite waiting for ShareLock with
> >> step "s1i" { INSERT INTO pt VALUES (2000); }
> >> This doesn't happen with a regular (not foreign) table.)
> > You mean the lock wait occurs on the remote side, not on the local
> > side?  If so, I think that that is expected behavior because a write
> > conflict occurs on the remote side in that case.  Maybe I don’t fully
> > understand your words, so could you elaborate a bit more on your
> > observation?
> Yes, you are right, that was expected behavior. I didn't realize that
> the isolationtester itself resolves blocking when the target table is local.
> The isolationtester controls the step execution using
> pg_isolation_test_session_is_blocked(), but when the target table is
> foreign, it can not determine correctly whether one step blocking other
> (cause it checks local, not remote session pids) and just hangs.

Ok, thanks for the explanation!

Attached is an updated patch.  I tweaked a comment a little bit and
added the commit message.  I didn’t include your test because in my
understanding we don’t add such a test into the postgres_fdw
regression test.  I’ll commit the patch if there are no objections
from you (or anyone else).

Best regards,
Etsuro Fujita

Вложения

Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alexander Lakhin
Дата:
01.02.2022 11:59, Etsuro Fujita wrote:
> Attached is an updated patch.  I tweaked a comment a little bit and
> added the commit message.  I didn’t include your test because in my
> understanding we don’t add such a test into the postgres_fdw
> regression test.  I’ll commit the patch if there are no objections
> from you (or anyone else).
Thank you! I have no objections.
That test was not intended for committing, I just wanted to show how
I've tested the fix.

Best regards,
Alexander



On Tue, Feb 1, 2022 at 5:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Hi Alexander,
>
> On Tue, Feb 1, 2022 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> > 30.01.2022 13:59, Etsuro Fujita wrote:
> > >> (Besides that I've observed an infinite waiting for ShareLock with
> > >> step "s1i" { INSERT INTO pt VALUES (2000); }
> > >> This doesn't happen with a regular (not foreign) table.)
> > > You mean the lock wait occurs on the remote side, not on the local
> > > side?  If so, I think that that is expected behavior because a write
> > > conflict occurs on the remote side in that case.  Maybe I don’t fully
> > > understand your words, so could you elaborate a bit more on your
> > > observation?
> > Yes, you are right, that was expected behavior. I didn't realize that
> > the isolationtester itself resolves blocking when the target table is local.
> > The isolationtester controls the step execution using
> > pg_isolation_test_session_is_blocked(), but when the target table is
> > foreign, it can not determine correctly whether one step blocking other
> > (cause it checks local, not remote session pids) and just hangs.
>
> Ok, thanks for the explanation!
>
> Attached is an updated patch.  I tweaked a comment a little bit and
> added the commit message.  I didn’t include your test because in my
> understanding we don’t add such a test into the postgres_fdw
> regression test.  I’ll commit the patch if there are no objections
> from you (or anyone else).

Thank you for working on this and sorry about failing to reply to this sooner.

Your proposed fix makes sense.

I was trying to understand in what way commit 86dc90056's is related
to this, because you mention it here in the thread and also in the
commit message.  I see that c3928b467, also mentioned in your commit
message, fixed a related bug [1], which in turn refers to 1375422c as
the cause [2].  c3928b467 fixed things so that BeginDirectModify()
won't get called if EPQ is active, so fdw_state won't get set.  Given
that, it also fixed ForeignNext() to not enter the FDW to perform a
non-SELECT operation in the EPQ-active case, which made sure its
caller Exec[Foreign]Scan() won't enter the FDW in that case.  Though
it didn't also fix ExecReScan[ForeignScan](), so this bug report.
Your patch fixes both ExecForeignScan() and ExecReScanForeignScan() to
not enter the FDW during EvalPlanQual() in non-SELECT cases.

Before commit 86dc90056, EPQState.plan would be set to the 1st child
subplan, which if it happened to be a direct-modify ForeignScan node,
this bug would manifest.  After that commit though, a direct-modify
child subplan would *always* be present in EPQState.plan by way of its
parent Append node or some parent node thereof being set as
EPQState.plan, so the bug would now always manifest.  So my
understanding is that the commit 86dc90056 only changed *when* the bug
manifests, which is now "always" if direct-modify children are
present.  Is that understanding correct?

BTW, isn't the following code in ForeignNext() added by c3928b467
non-reachable after your patch:

        /*
         * direct modifications cannot be re-evaluated, so shouldn't get here
         * during EvalPlanQual processing
         */
        if (estate->es_epq_active != NULL)
            elog(ERROR, "cannot re-evaluate a Foreign Update or Delete
during EvalPlanQual");

Should that be converted to an Assert(estate->es_epq_active == NULL)?

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] EvalPlanQual() attempting to initialize a direct-modify
ForeignScan node without the necessary ResultRelInfo having been set
up.

[2] As of 1375422c, ResultRelInfos are initialized in
ExecInitModifyTable() which won't be invoked during
EvalPlanQualStart(), so EvalPlanQual() can no longer rely on accessing
a direct-modify ForeignScan node's ResultRelInfo.



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
Hi Alexander,

On Wed, Feb 2, 2022 at 3:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 01.02.2022 11:59, Etsuro Fujita wrote:
> > Attached is an updated patch.  I tweaked a comment a little bit and
> > added the commit message.  I didn’t include your test because in my
> > understanding we don’t add such a test into the postgres_fdw
> > regression test.  I’ll commit the patch if there are no objections
> > from you (or anyone else).
> Thank you! I have no objections.
> That test was not intended for committing, I just wanted to show how
> I've tested the fix.

OK

Thanks!

Best regards,
Etsuro Fujita



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
Hi Amit-san,

On Wed, Feb 2, 2022 at 12:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Your proposed fix makes sense.

Cool!

> I was trying to understand in what way commit 86dc90056's is related
> to this, because you mention it here in the thread and also in the
> commit message.  I see that c3928b467, also mentioned in your commit
> message, fixed a related bug [1], which in turn refers to 1375422c as
> the cause [2].

I think the diagnosis in commit c3928b467 isn't correct; I think the
cause is commit 86dc90056, because before that commit, direct-modify
ForeignScan nodes didn't get called during EvalPlanQual processing,
even in the case of an inherited UPDATE/DELETE.  (In direct
modification, all the work for it except the RETURNING computation is
done on the remote side, so we don't need to invoke EvalPlanQual!)
BUT as you know, commit 86dc90056 changed things so that such
ForeignScan nodes can get called during EvalPlanQual processing, as
they are contained as part of the EvalPlanQual subtree in the case of
an inherited UPDATE/DELETE, which I think led to the issues in commit
c3928b467 as well as this bug report.

> Before commit 86dc90056, EPQState.plan would be set to the 1st child
> subplan, which if it happened to be a direct-modify ForeignScan node,
> this bug would manifest.  After that commit though, a direct-modify
> child subplan would *always* be present in EPQState.plan by way of its
> parent Append node or some parent node thereof being set as
> EPQState.plan, so the bug would now always manifest.  So my
> understanding is that the commit 86dc90056 only changed *when* the bug
> manifests, which is now "always" if direct-modify children are
> present.  Is that understanding correct?

As I mentioned above, I think we didn't have this bug before commit
86dc90056.  As for when this bug happens, I think it would depend on
the partitioned-table definition; eg, for a partitioned table like the
one discussed in the thread for commit c3928b467, it wouldn't happen.

> BTW, isn't the following code in ForeignNext() added by c3928b467
> non-reachable after your patch:
>
>         /*
>          * direct modifications cannot be re-evaluated, so shouldn't get here
>          * during EvalPlanQual processing
>          */
>         if (estate->es_epq_active != NULL)
>             elog(ERROR, "cannot re-evaluate a Foreign Update or Delete
> during EvalPlanQual");
>
> Should that be converted to an Assert(estate->es_epq_active == NULL)?

+1  I updated the patch as such.  Attached is a new version.  I also
tweaked a comment a litttle bit further.

Thanks!

Best regards,
Etsuro Fujita

Вложения
Fujita-san,

On Wed, Feb 2, 2022 at 7:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Feb 2, 2022 at 12:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I was trying to understand in what way commit 86dc90056's is related
> > to this, because you mention it here in the thread and also in the
> > commit message.  I see that c3928b467, also mentioned in your commit
> > message, fixed a related bug [1], which in turn refers to 1375422c as
> > the cause [2].
>
> I think the diagnosis in commit c3928b467 isn't correct; I think the
> cause is commit 86dc90056, because before that commit, direct-modify
> ForeignScan nodes didn't get called during EvalPlanQual processing,
> even in the case of an inherited UPDATE/DELETE.  (In direct
> modification, all the work for it except the RETURNING computation is
> done on the remote side, so we don't need to invoke EvalPlanQual!)
> BUT as you know, commit 86dc90056 changed things so that such
> ForeignScan nodes can get called during EvalPlanQual processing, as
> they are contained as part of the EvalPlanQual subtree in the case of
> an inherited UPDATE/DELETE, which I think led to the issues in commit
> c3928b467 as well as this bug report.

You're right, thanks for the explanation.

What I embarrassingly forgot is that, before 86dc90056, each child
subplan would be set to be passed to EvalPlanQ ual() when in its turn
came in the ExecModifyTable()'s main loo, using the following code
that 86dc90056 removed:

        if (TupIsNull(planSlot))
-       {
-           /* advance to next subplan if any */
-           node->mt_whichplan++;
-           if (node->mt_whichplan < node->mt_nplans)
-           {
-               resultRelInfo++;
-               subplanstate = node->mt_plans[node->mt_whichplan];
-               junkfilter = resultRelInfo->ri_junkFilter;
-               EvalPlanQualSetPlan(&node->mt_epqstate, subplanstate->plan,
-                                   node->mt_arowmarks[node->mt_whichplan]);
-               continue;
-           }
-           else
-               break;
-       }

Direct-modify subplans, even if they would get set to be passed to
EvalPlanQual(), would never actually be processed for the reasons you
mention.  Now that there's only one subplan and all the child subplans
just its subnodes, any direct-modify nodes *would* get processed
because they're in the plan that gets passed to EvalPlanQual().

> > Before commit 86dc90056, EPQState.plan would be set to the 1st child
> > subplan, which if it happened to be a direct-modify ForeignScan node,
> > this bug would manifest.  After that commit though, a direct-modify
> > child subplan would *always* be present in EPQState.plan by way of its
> > parent Append node or some parent node thereof being set as
> > EPQState.plan, so the bug would now always manifest.  So my
> > understanding is that the commit 86dc90056 only changed *when* the bug
> > manifests, which is now "always" if direct-modify children are
> > present.  Is that understanding correct?
>
> As I mentioned above, I think we didn't have this bug before commit
> 86dc90056.  As for when this bug happens, I think it would depend on
> the partitioned-table definition; eg, for a partitioned table like the
> one discussed in the thread for commit c3928b467, it wouldn't happen.

Yeah, got it.

> > BTW, isn't the following code in ForeignNext() added by c3928b467
> > non-reachable after your patch:
> >
> >         /*
> >          * direct modifications cannot be re-evaluated, so shouldn't get here
> >          * during EvalPlanQual processing
> >          */
> >         if (estate->es_epq_active != NULL)
> >             elog(ERROR, "cannot re-evaluate a Foreign Update or Delete
> > during EvalPlanQual");
> >
> > Should that be converted to an Assert(estate->es_epq_active == NULL)?
>
> +1  I updated the patch as such.  Attached is a new version.  I also
> tweaked a comment a litttle bit further.
>
> Thanks!

Thanks, looks good to me.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alvaro Herrera
Дата:
On 2022-Feb-01, Etsuro Fujita wrote:

> I didn’t include your test because in my
> understanding we don’t add such a test into the postgres_fdw
> regression test.  I’ll commit the patch if there are no objections
> from you (or anyone else).

I think it would be good to reconsider this.  We currently don't have
any isolation test files for postgres_fdw, but that doesn't mean we
can't add some (we now have several subdirs where both types of tests
coexist).  And if we don't have these tests, it will be easier to break
this again in the future without noticing if we decide to rework
executor code again, which I'm sure we'll do.

Maybe you don't need to add the test in back-branches, which is more
work; but if you only target branch master, it should only require an
ISOLATION line in the Makefile listing the new test.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
Hi Amit-san,

On Wed, Feb 2, 2022 at 11:06 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Feb 2, 2022 at 7:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Wed, Feb 2, 2022 at 12:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > BTW, isn't the following code in ForeignNext() added by c3928b467
> > > non-reachable after your patch:
> > >
> > >         /*
> > >          * direct modifications cannot be re-evaluated, so shouldn't get here
> > >          * during EvalPlanQual processing
> > >          */
> > >         if (estate->es_epq_active != NULL)
> > >             elog(ERROR, "cannot re-evaluate a Foreign Update or Delete
> > > during EvalPlanQual");
> > >
> > > Should that be converted to an Assert(estate->es_epq_active == NULL)?
> >
> > +1  I updated the patch as such.  Attached is a new version.  I also
> > tweaked a comment a litttle bit further.

> Thanks, looks good to me.

Ok, I have committed the patch.

Thanks again!

Best regards,
Etsuro Fujita



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
Hi Alvaro,

On Wed, Feb 2, 2022 at 11:42 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Feb-01, Etsuro Fujita wrote:
> > I didn’t include your test because in my
> > understanding we don’t add such a test into the postgres_fdw
> > regression test.  I’ll commit the patch if there are no objections
> > from you (or anyone else).
>
> I think it would be good to reconsider this.  We currently don't have
> any isolation test files for postgres_fdw, but that doesn't mean we
> can't add some (we now have several subdirs where both types of tests
> coexist).  And if we don't have these tests, it will be easier to break
> this again in the future without noticing if we decide to rework
> executor code again, which I'm sure we'll do.

Agreed.

> Maybe you don't need to add the test in back-branches, which is more
> work; but if you only target branch master, it should only require an
> ISOLATION line in the Makefile listing the new test.

I think it would be better to create a separate patch so that we test
not only this but other EvalPlanQual stuff such as ForeignRecheck, so
I applied the proposed patch to HEAD as well.

Thanks!

Best regards,
Etsuro Fujita



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Marina Polyakova
Дата:
On 2022-02-03 09:38, Etsuro Fujita wrote:
> I think it would be better to create a separate patch so that we test
> not only this but other EvalPlanQual stuff such as ForeignRecheck, so
> I applied the proposed patch to HEAD as well.
> 
> Thanks!
> 
> Best regards,
> Etsuro Fujita

Hello everyone in this thread!

Looks like gcc 11.2.0 is now reporting a warning in the branches 
master/REL_14_STABLE, e.g. see CI 
https://github.com/postgres/postgres/runs/5047720839:

[06:30:18.201] nodeForeignscan.c: In function ‘ForeignNext’:
[06:30:18.201] nodeForeignscan.c:47:13: error: unused variable ‘estate’ 
[-Werror=unused-variable]
[06:30:18.201]    47 |  EState    *estate = node->ss.ps.state;
[06:30:18.201]       |             ^~~~~~

Please could you fix it? E.g. like in the attached patch - the following 
commands work for me with it:

$ ./configure && COPT=-Werror make world-bin

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
Вложения

Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Alvaro Herrera
Дата:
On 2022-Feb-03, Etsuro Fujita wrote:

> > Maybe you don't need to add the test in back-branches, which is more
> > work; but if you only target branch master, it should only require an
> > ISOLATION line in the Makefile listing the new test.
> 
> I think it would be better to create a separate patch so that we test
> not only this but other EvalPlanQual stuff such as ForeignRecheck, so
> I applied the proposed patch to HEAD as well.

Sure, that sounds good.  Are you or is anybody else working on that?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)



Re: BUG #17355: Server crashes on ExecReScanForeignScan in postgres_fdw when accessing foreign partition

От
Etsuro Fujita
Дата:
On Mon, Feb 7, 2022 at 12:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Feb-03, Etsuro Fujita wrote:
> > > Maybe you don't need to add the test in back-branches, which is more
> > > work; but if you only target branch master, it should only require an
> > > ISOLATION line in the Makefile listing the new test.
> >
> > I think it would be better to create a separate patch so that we test
> > not only this but other EvalPlanQual stuff such as ForeignRecheck, so
> > I applied the proposed patch to HEAD as well.
>
> Sure, that sounds good.  Are you or is anybody else working on that?

I don't have time on this for v15.  I'm planning to work on it for
v16, but if anybody else feels like doing so for v15 or later, feel
free.

Best regards,
Etsuro Fujita