postgres crash on concurrent update of inheritance partitioned table

Поиск
Список
Период
Сортировка
От Oleksii Kliukin
Тема postgres crash on concurrent update of inheritance partitioned table
Дата
Msg-id EEF05F66-2871-4786-992B-5F45C92FEE2E@hintbits.com
обсуждение исходный текст
Ответы Re: postgres crash on concurrent update of inheritance partitionedtable  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-bugs
Hello there,

We have experienced a few crashes a day since upgrading to pg 12.1 last
week, during the process of updating an inheritance-partitioned table by
several concurrent sessions.

Here’s the reproducer that consistency crashes pg 12.1 (and master) on my
machine.

Setup:

create table p(flag boolean);
create table c1() inherits(p);
create table c2() inherits(p);
insert into c1 select false from generate_series(1,10);
insert into c2 select false from generate_series(1,10);

session 1:                    session 2:
begin;                        begin
update p set flag = true;
                            update p set flag = true;
commit;                        server closed the connection unexpectedly
                                This probably means the server terminated abnormally
                                before or while processing the request.
                            The connection to the server was lost. Attempting reset: Failed.

The original backtrace we had from a production server (built w/o assertions), where it crashed from time to time.

#0  0x0000557a89e7b3f4 in EvalPlanQualBegin (epqstate=epqstate@entry=0x557a8c3bed50) at execMain.c:2735
#1  0x0000557a89e7b80b in EvalPlanQual (epqstate=epqstate@entry=0x557a8c3bed50, relation=relation@entry=0x557a8bcdce98,
rti=268,inputslot=inputslot@entry=0x557a8c55eb00) at execMain.c:2454 
#2  0x0000557a89e9d633 in ExecUpdate (mtstate=mtstate@entry=0x557a8c3bec58, tupleid=0x7ffc99962f8a, oldtuple=0x0,
slot=<optimizedout>, planSlot=0x557a8c50f4d8, epqstate=epqstate@entry=0x557a8c3bed50, estate=0x557a8c3bb0f0,
canSetTag=true)
    at nodeModifyTable.c:1387
#3  0x0000557a89e9e612 in ExecModifyTable (pstate=0x557a8c3bec58) at nodeModifyTable.c:2223
#4  0x0000557a89e78b8b in ExecProcNode (node=0x557a8c3bec58) at ../../../src/include/executor/executor.h:239
#5  ExecutePlan (execute_once=<optimized out>, dest=0x7f1cdbf8c5c8, direction=<optimized out>, numberTuples=0,
sendTuples=<optimizedout>, operation=CMD_UPDATE, use_parallel_mode=<optimized out>, planstate=0x557a8c3bec58,
estate=0x557a8c3bb0f0)at execMain.c:1646 
#6  standard_ExecutorRun (queryDesc=0x557a8c3ce440, direction=<optimized out>, count=0, execute_once=<optimized out>)
atexecMain.c:364 
#7  0x00007f1ceff25035 in pgss_ExecutorRun (queryDesc=0x557a8c3ce440, direction=ForwardScanDirection, count=0,
execute_once=<optimizedout>) at pg_stat_statements.c:893 
#8  0x0000557a89fcef60 in ProcessQuery (plan=<optimized out>,
    sourceText=0x557a8afd4c40 "BEGIN;\n    CREATE TEMP TABLE foobar(a x, b y, c bigint, d e, f bigint, g bigint) ON
COMMITDROP;\n    COPY foobar(a, b"..., params=0x0, 
    queryEnv=0x0, dest=0x7f1cdbf8c5c8, completionTag=0x7ffc999633b0 "") at pquery.c:161
#9  0x0000557a89fcf190 in PortalRunMulti (portal=portal@entry=0x557a8b06ce60, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x7f1cdbf8c5c8, altdest=altdest@entry=0x7f1cdbf8c5c8, 
    completionTag=completionTag@entry=0x7ffc999633b0 "") at pquery.c:1283
#10 0x0000557a89fcfcca in PortalRun (portal=portal@entry=0x557a8b06ce60, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0x7f1cdbf8c5c8,
altdest=altdest@entry=0x7f1cdbf8c5c8,
    completionTag=0x7ffc999633b0 "") at pquery.c:796
#11 0x0000557a89fcbd45 in exec_simple_query (
    query_string=0x557a8afd4c40 ""BEGIN;\n    CREATE TEMP TABLE foobar(a x, b y, c bigint, d e, f bigint, g bigint) ON
COMMITDROP;\n    COPY foobar(a, b"...) at postgres.c:1215 
#12 0x0000557a89fcd5b1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x557a8b0134c8, dbname=<optimized out>,
username=<optimizedout>) at postgres.c:4236 
#13 0x0000557a89f578f0 in BackendRun (port=0x557a8b011290, port=0x557a8b011290) at postmaster.c:4437
#14 BackendStartup (port=0x557a8b011290) at postmaster.c:4128
#15 ServerLoop () at postmaster.c:1704
#16 0x0000557a89f58850 in PostmasterMain (argc=5, argv=0x557a8afceae0) at postmaster.c:1377
#17 0x0000557a89cd66ba in main (argc=5, argv=0x557a8afceae0) at main.c:228

The actual line it crashes is (my sources are a bit newer than the
backtrace, so the line number doesn’t match exactly):

File: src/backend/executor/execMain.c
2721:             /*
2722:              * Force evaluation of any InitPlan outputs that could be needed
2723:              * by the subplan, just in case they got reset since
2724:              * EvalPlanQualStart (see comments therein).
2725:              */
2726:             ExecSetParamPlanMulti(rcplanstate->plan->extParam,
2727:                                   GetPerTupleExprContext(parentestate));
2728:

rcplanstate->plan was NULL during the crash.

Poking into this, we found the following anomaly:

(gdb) p *epqstate
$2 = {parentestate = 0x557a8c3bb0f0, epqParam = 0, tuple_table = 0x557aa2e81918, relsubs_slot = 0x557a8c3ea3e8, plan =
0x7f1cdbf30798,arowMarks = 0x557a8c532208, origslot = 0x557a8c50f4d8, recheckestate = 0x557a8be7b7c0, relsubs_rowmark =
0x557a8b0ae5d0,
  relsubs_done = 0x557a8b53b9f8, recheckplanstate = 0x557a8b53b9f8}

(Notice relsubs_done = 0x557a8b53b9f8, recheckplanstate = 0x557a8b53b9f8
share the same address, while being completely different fields).

It looks like
File: src/backend/executor/execMain.c
2714:         MemSet(epqstate->relsubs_done, 0, rtsize * sizeof(bool));

zeroed-out not just the relsubs_done, but the recheckplansate, as it pointed
to the same address, causing the segfault.

Our theory is that the underlying context the epqstate members are allocated
at gets reset at some point (perhaps when finishing updating a single
partition), but the corresponding struct fields are not zeroed out, so they
are reused and the memory they point to is allocated elsewhere, most
likely for the next struct member, causing the condition above.

The backtrace from the attempt to reproduce it (with asserts enabled):

(gdb) bt
#0  0x00007f90d113df25 in raise () from /usr/lib/libc.so.6
#1  0x00007f90d1127897 in abort () from /usr/lib/libc.so.6
#2  0x000055ba146697f5 in ExceptionalCondition (conditionName=0x55ba148423e0 "!(epqstate->relsubs_rowmark[scanrelid -
1]== ((void *)0))", errorType=0x55ba14842399 "FailedAssertion", fileName=0x55ba148423cf "execScan.c", lineNumber=89) at
assert.c:54
#3  0x000055ba142cd24a in ExecScanFetch (node=0x55ba1661ff00, accessMtd=0x55ba14300f8c <SeqNext>,
recheckMtd=0x55ba14301031<SeqRecheck>) at execScan.c:89 
#4  0x000055ba142cd420 in ExecScan (node=0x55ba1661ff00, accessMtd=0x55ba14300f8c <SeqNext>, recheckMtd=0x55ba14301031
<SeqRecheck>)at execScan.c:200 
#5  0x000055ba1430107f in ExecSeqScan (pstate=0x55ba1661ff00) at nodeSeqscan.c:112
#6  0x000055ba142caea0 in ExecProcNodeFirst (node=0x55ba1661ff00) at execProcnode.c:445
#7  0x000055ba142bf78f in ExecProcNode (node=0x55ba1661ff00) at ../../../src/include/executor/executor.h:239
#8  0x000055ba142c3ef6 in EvalPlanQualNext (epqstate=0x55ba16605638) at execMain.c:2695
#9  0x000055ba142c38b2 in EvalPlanQual (epqstate=0x55ba16605638, relation=0x7f904c2f1918, rti=4,
inputslot=0x55ba1661aa50)at execMain.c:2467 
#10 0x000055ba142fb8ab in ExecUpdate (mtstate=0x55ba16605540, tupleid=0x7ffe17d1d02a, oldtuple=0x0,
slot=0x55ba16615510,planSlot=0x55ba16612df0, epqstate=0x55ba16605638, estate=0x55ba16604e68, canSetTag=true) at
nodeModifyTable.c:1387
#11 0x000055ba142fcee3 in ExecModifyTable (pstate=0x55ba16605540) at nodeModifyTable.c:2223
#12 0x000055ba142caea0 in ExecProcNodeFirst (node=0x55ba16605540) at execProcnode.c:445
#13 0x000055ba142bf78f in ExecProcNode (node=0x55ba16605540) at ../../../src/include/executor/executor.h:239
#14 0x000055ba142c2167 in ExecutePlan (estate=0x55ba16604e68, planstate=0x55ba16605540, use_parallel_mode=false,
operation=CMD_UPDATE,sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x7f904c2c4b00,
execute_once=true)at execMain.c:1646 
#15 0x000055ba142bfe08 in standard_ExecutorRun (queryDesc=0x55ba16604a58, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:364 
#16 0x000055ba142bfc29 in ExecutorRun (queryDesc=0x55ba16604a58, direction=ForwardScanDirection, count=0,
execute_once=true)at execMain.c:308 
#17 0x000055ba144db8c1 in ProcessQuery (plan=0x7f904c2c4a08, sourceText=0x55ba1651a6b8 "UPDATE p i\n        SET flag =
true;",params=0x0, queryEnv=0x0, dest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:161 
#18 0x000055ba144dd385 in PortalRunMulti (portal=0x55ba16583028, isTopLevel=true, setHoldSnapshot=false,
dest=0x7f904c2c4b00,altdest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:1283 
#19 0x000055ba144dc8c5 in PortalRun (portal=0x55ba16583028, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x7f904c2c4b00,altdest=0x7f904c2c4b00, completionTag=0x7ffe17d1d4a0 "") at pquery.c:796 
#20 0x000055ba144d5f20 in exec_simple_query (query_string=0x55ba1651a6b8 "UPDATE p i\n        SET flag = true;") at
postgres.c:1215
#21 0x000055ba144da82f in PostgresMain (argc=1, argv=0x55ba16546c20, dbname=0x55ba16546b00 "shard4",
username=0x55ba16516d78"ildus") at postgres.c:4236 
#22 0x000055ba14423fb7 in BackendRun (port=0x55ba16539850) at postmaster.c:4437
#23 0x000055ba144236b6 in BackendStartup (port=0x55ba16539850) at postmaster.c:4128
#24 0x000055ba1441f739 in ServerLoop () at postmaster.c:1704
#25 0x000055ba1441eec5 in PostmasterMain (argc=1, argv=0x55ba16514d30) at postmaster.c:1377
#26 0x000055ba1433588d in main (argc=1, argv=0x55ba16514d30) at main.c:228


We (Ildus Kurbengaliev and I) came with a tentative fix of zero-ing out
epqstate members at EvalPlanQualEnd, namely:

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ac92340af5..b37023c46e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2996,4 +2996,6 @@ EvalPlanQualEnd(EPQState *epqstate)
        epqstate->recheckestate = NULL;
        epqstate->recheckplanstate = NULL;
        epqstate->origslot = NULL;
+       epqstate->relsubs_done = NULL;
+       epqstate->relsubs_rowmark = NULL;
 }

With the patch above I haven’t experienced any crashes on my test setup.

This is likely related to the recent crash Julian Schauder has reported at
https://www.postgresql.org/message-id/flat/16219-578ae955109d9fd2%40postgresql.org

and possibly to
https://www.postgresql.org/message-id/16184-2f718844be10c863%40postgresql.org

Regards,
Oleksii “Alex” Kliukin




В списке pgsql-bugs по дате отправления:

Предыдущее
От: Pavel Suderevsky
Дата:
Сообщение: Re: BUG #16159: recovery requests WALs for the next timelines beforetimeline switch LSN has been reached
Следующее
От: Tom Lane
Дата:
Сообщение: Re: BUG #16237: When restoring database, backend disconnects or crashes when foreign key is created