Discussion: BUG #18986: SIGSEGV in nodeModifyTable.c during Parallel Execution
The following bug has been logged on the website:

Bug reference:      18986
Logged by:          Yaroslav Syrytsia
Email address:      y@lll.gd
PostgreSQL version: 17.5
Operating system:   Ubuntu 22.04.5
Description:

Hello,

During testing a flow in an app, I encountered a segfault that only happens
during parallel execution:
https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

Looking into the coredump, it looks like a NULL pointer dereference is
happening at
https://github.com/postgres/postgres/blob/5e2f3df49d4298c6097789364a5a53be172f6e85/src/backend/executor/nodeModifyTable.c#L752

After further investigation, that part of the code looks suspicious as well:

static void
ExecInitInsertProjection(ModifyTableState *mtstate,
                         ResultRelInfo *resultRelInfo)
{
    ...
    /* Build ProjectionInfo if needed (it probably isn't). */
    if (need_projection)
    {
        TupleDesc   relDesc = RelationGetDescr(resultRelInfo->ri_RelationDesc);

        /* need an expression context to do the projection */
        if (mtstate->ps.ps_ExprContext == NULL)
            ExecAssignExprContext(estate, &mtstate->ps);

        resultRelInfo->ri_projectNew =
            ExecBuildProjectionInfo(insertTargetList,
                                    mtstate->ps.ps_ExprContext,
                                    resultRelInfo->ri_newTupleSlot,
                                    &mtstate->ps,
                                    relDesc);
    }

    resultRelInfo->ri_projectNewInfoValid = true;   /* <-- set to true even
                                                     * when need_projection is
                                                     * false, leaving
                                                     * ri_projectNew NULL */
}

(gdb) frame 0
#0  ExecGetUpdateNewTuple (oldSlot=0x55defb35ae20, planSlot=0x55defb41d9d8,
    relinfo=0x55defb312108)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:752
752         econtext = newProj->pi_exprContext;
(gdb) p relinfo
$1 = (ResultRelInfo *) 0x55defb312108
(gdb) p *relinfo
$2 = {type = T_ResultRelInfo, ri_RangeTableIndex = 15,
  ri_RelationDesc = 0x7f0c0eb80500, ri_NumIndices = 21,
  ri_IndexRelationDescs = 0x55defb358248,
  ri_IndexRelationInfo = 0x55defb358350, ri_RowIdAttNo = 1,
  ri_extraUpdatedCols = 0x55defb357300, ri_projectNew = 0x0,
  ri_newTupleSlot = 0x55defb323cf0, ri_oldTupleSlot = 0x55defb323ae8,
  ri_projectNewInfoValid = true, ri_needLockTagTuple = false,
  ri_TrigDesc = 0x55defb312310, ri_TrigFunctions = 0x55defb31a318,
  ri_TrigWhenExprs = 0x55defb2d8cf8, ri_TrigInstrument = 0x0,
  ri_ReturningSlot = 0x0, ri_TrigOldSlot = 0x55defb35ae20,
  ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0, ri_FdwState = 0x0,
  ri_usesFdwDirectModify = false, ri_NumSlots = 0,
  ri_NumSlotsInitialized = 0, ri_BatchSize = 0, ri_Slots = 0x0,
  ri_PlanSlots = 0x0, ri_WithCheckOptions = 0x0,
  ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0,
  ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x55defb358140,
  ri_NumGeneratedNeededI = 0, ri_NumGeneratedNeededU = 2,
  ri_returningList = 0x55defb39c630, ri_projectReturning = 0x55defb323590,
  ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0,
  ri_MergeActions = {0x55defb324518, 0x0, 0x55defb328e50},
  ri_MergeJoinCondition = 0x0, ri_PartitionCheckExpr = 0x0,
  ri_ChildToRootMap = 0x0, ri_ChildToRootMapValid = false,
  ri_RootToChildMap = 0x0, ri_RootToChildMapValid = false,
  ri_RootResultRelInfo = 0x0, ri_PartitionTupleSlot = 0x0,
  ri_CopyMultiInsertBuffer = 0x0, ri_ancestorResultRels = 0x0}

Where:
- ri_projectNewInfoValid = true
- ri_projectNew = 0x0

Versions tested:
- 17.5 (Ubuntu, apt, with and without pgroonga)
- postgres:18beta1-bookworm (Docker, without pgroonga)
- postgres:17.5-bookworm (Docker, without pgroonga)

Yaroslav.
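P.S. For reference, the function at the crash site (lightly abridged here
from the nodeModifyTable.c link above, with annotations added) shows why
this combination of fields segfaults:

TupleTableSlot *
ExecGetUpdateNewTuple(ResultRelInfo *relinfo,
                      TupleTableSlot *planSlot,
                      TupleTableSlot *oldSlot)
{
    ProjectionInfo *newProj = relinfo->ri_projectNew;
    ExprContext *econtext;

    /* Passes: ri_projectNewInfoValid is true in the coredump */
    Assert(relinfo->ri_projectNewInfoValid);

    /* newProj is NULL (ri_projectNew = 0x0), so this load is the SIGSEGV */
    econtext = newProj->pi_exprContext;
    econtext->ecxt_outertuple = planSlot;
    econtext->ecxt_scantuple = oldSlot;
    return ExecProject(newProj);
}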
PG Bug reporting form <noreply@postgresql.org> writes:
> During testing a flow in an app, I encountered a segfault that only happens
> during parallel execution:
> https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7

This report isn't terribly helpful, since you have not explained
how to trigger the crash. If we can't duplicate it, it's hard
to decide what is the appropriate fix.

			regards, tom lane
On Tue, Jul 15, 2025 at 02:13:53PM -0400, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > During testing a flow in an app, I encountered a segfault that only happens
> > during parallel execution:
> > https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7
>
> This report isn't terribly helpful, since you have not explained
> how to trigger the crash. If we can't duplicate it, it's hard
> to decide what is the appropriate fix.

For the sake of visibility, in case the information on GitHub gets lost:
the query can be summarized by this pattern, but we don't have any
information about the end of it:

WITH replaced AS (
    DELETE FROM replaceable_events_before_update
    WHERE replaced_by_id = ANY($1)
    RETURNING attr_list_1, [...]
),
source_data (attr_list_2) AS (
    SELECT * from replaced
    UNION ALL
    VALUES (attr_list_3 ...

That's most likely a MERGE generated by an ORM, given this many parameters
and columns listed. Without the full pattern or an idea of the schema
(this involves triggers), what we can do is limited, as Tom says.

And the relevant portion of the backtrace:

#7  ExecGetUpdateNewTuple (oldSlot=0x55b0842d0288, planSlot=0x55b084249d68,
    relinfo=0x55b0836a9b88)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:752
#8  ExecBRUpdateTriggers (estate=0x55b08369a708, epqstate=0x55b0836a9a68,
    relinfo=0x55b0836a9b88, tupleid=0x7fffc373628a, fdw_trigtuple=0x0,
    newslot=0x55b083720eb0, tmresult=0x7fffc37361d0, tmfd=0x7fffc37362e0)
    at commands/./build/../src/backend/commands/trigger.c:2979
#9  0x000055b08158628e in ExecUpdatePrologue (context=context@entry=0x7fffc37362c0,
    resultRelInfo=resultRelInfo@entry=0x55b0836a9b88,
    tupleid=tupleid@entry=0x7fffc373628a, oldtuple=oldtuple@entry=0x0,
    slot=slot@entry=0x55b083720eb0, result=result@entry=0x7fffc37361d0)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:1949
#10 0x000055b081587e75 in ExecMergeMatched (context=<optimized out>,
    resultRelInfo=<optimized out>, tupleid=0x7fffc373628a, oldtuple=0x0,
    canSetTag=false, matched=0x7fffc3736290)
    at executor/./build/../src/backend/executor/nodeModifyTable.c:3023

If you could extract the query string from the backtrace, that would be
one piece of information. With triggers and MERGE in mind, it may be
possible to reproduce something.
--
Michael
Hello,

I made a small reproducer here: https://github.com/joy4eg/pg-merge-crash

It creates a few tables and then executes MERGE statements in parallel.

Yaroslav.

> On 2025-07-15 20:13 CEST Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > During testing a flow in an app, I encountered a segfault that only happens
> > during parallel execution:
> > https://gist.github.com/joy4eg/708458e204f52129a8e54a13534586b7
>
> This report isn't terribly helpful, since you have not explained
> how to trigger the crash. If we can't duplicate it, it's hard
> to decide what is the appropriate fix.
>
> 			regards, tom lane
On Wed, 16 Jul 2025 at 01:10, Michael Paquier <michael@paquier.xyz> wrote:
>
> And the relevant portion of the backtrace:
>
> #7  ExecGetUpdateNewTuple (oldSlot=0x55b0842d0288, planSlot=0x55b084249d68,
>     relinfo=0x55b0836a9b88)
>     at executor/./build/../src/backend/executor/nodeModifyTable.c:752
> #8  ExecBRUpdateTriggers (estate=0x55b08369a708, epqstate=0x55b0836a9a68,
>     relinfo=0x55b0836a9b88, tupleid=0x7fffc373628a, fdw_trigtuple=0x0,
>     newslot=0x55b083720eb0, tmresult=0x7fffc37361d0, tmfd=0x7fffc37362e0)
>     at commands/./build/../src/backend/commands/trigger.c:2979
> #9  0x000055b08158628e in ExecUpdatePrologue (context=context@entry=0x7fffc37362c0,
>     resultRelInfo=resultRelInfo@entry=0x55b0836a9b88,
>     tupleid=tupleid@entry=0x7fffc373628a, oldtuple=oldtuple@entry=0x0,
>     slot=slot@entry=0x55b083720eb0, result=result@entry=0x7fffc37361d0)
>     at executor/./build/../src/backend/executor/nodeModifyTable.c:1949
> #10 0x000055b081587e75 in ExecMergeMatched (context=<optimized out>,
>     resultRelInfo=<optimized out>, tupleid=0x7fffc373628a, oldtuple=0x0,
>     canSetTag=false, matched=0x7fffc3736290)
>     at executor/./build/../src/backend/executor/nodeModifyTable.c:3023
Hmm, I can see the problem. This particular stack trace should never happen:
#7 ExecGetUpdateNewTuple
#8 ExecBRUpdateTriggers
#9 ExecUpdatePrologue
#10 ExecMergeMatched
ExecBRUpdateTriggers() only calls ExecGetUpdateNewTuple() if
GetTupleForTrigger() sets epqslot_candidate to a non-null value, and
GetTupleForTrigger() should only set epqslot_candidate (*epqslot) to
non-null if we're not doing a MERGE (see commit 9321c79).
The problem, however, is that GetTupleForTrigger() tests for MERGE by doing

    if (estate->es_plannedstmt->commandType != CMD_MERGE)

which doesn't work if the MERGE is inside a CTE. So we need a different
way for ExecBRUpdateTriggers() / GetTupleForTrigger() to know that it's
being called from within a MERGE.
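To make that concrete, consider a query of this shape (a hypothetical
sketch; table and column names invented):

    WITH m AS (
        MERGE INTO target t USING source s ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET val = s.val
        RETURNING t.id
    )
    SELECT count(*) FROM m;

Here estate->es_plannedstmt describes the outer SELECT, so its commandType
is CMD_SELECT rather than CMD_MERGE, and the test above takes the
non-MERGE branch. GetTupleForTrigger() then returns an EPQ slot candidate
as if this were a plain UPDATE, which sends ExecBRUpdateTriggers() into
ExecGetUpdateNewTuple() on a ResultRelInfo whose update projection was
never built.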
Regards,
Dean
On Wed, 16 Jul 2025 at 10:55, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> The problem, however, is that GetTupleForTrigger() tests for MERGE by doing
>
>     if (estate->es_plannedstmt->commandType != CMD_MERGE)
>
> which doesn't work if the MERGE is inside a CTE. So we need a different
> way for ExecBRUpdateTriggers() / GetTupleForTrigger() to know that it's
> being called from within a MERGE.

Attached is a reproducer using the merge-match-recheck isolation test.

I believe that the bug only goes back to v17, because MERGE could not
appear inside a CTE prior to that.

I think the best thing to do is to pass the commandType to
ExecBRUpdateTriggers(), except that in v17 we shouldn't change the
signature of ExecBRUpdateTriggers(), so a wrapper function will be
needed, similar to what 9321c79 did.

Question: is it OK to change the signature of ExecBRUpdateTriggers()
in v18?

Regards,
Dean
On Wed, Jul 16, 2025 at 11:42:33AM +0100, Dean Rasheed wrote:
> Attached is a reproducer using the merge-match-recheck isolation test.
>
> I believe that the bug only goes back to v17, because MERGE could not
> appear inside a CTE prior to that.

That was fast, nice!

> I think the best thing to do is to pass the commandType to
> ExecBRUpdateTriggers(), except that in v17 we shouldn't change the
> signature of ExecBRUpdateTriggers(), so a wrapper function will be
> needed, similar to what 9321c79 did.

Yes, changing ExecBRUpdateTriggers() would not be a good idea on a
stable branch. I can see that at least timescaledb calls it directly,
so a minor-upgrade breakage would be bad for them.

> Question: is it OK to change the signature of ExecBRUpdateTriggers()
> in v18?

We are still in beta, so that's not a problem for v18 and HEAD.
--
Michael
On Wed, 16 Jul 2025 at 11:49, Michael Paquier <michael@paquier.xyz> wrote:
>
> > Question: is it OK to change the signature of ExecBRUpdateTriggers()
> > in v18?
>
> We are still in beta, so that's not a problem for v18 and HEAD.

Cool. I thought so, but I wanted to check.

Attached is a patch for HEAD/v18, and a slightly different one for v17,
preserving the trigger ABI in the standard way.

I decided to do this by adding an extra "is_merge_update" boolean
parameter, rather than passing the commandType, because that looked
slightly neater. It was also necessary to update ExecBRDeleteTriggers(),
since otherwise a concurrent MERGE DELETE could do the wrong thing
(execute the wrong action, rather than seg-faulting). That was picked up
by an existing isolation test case added by 9321c79, so there's no need
for more tests.

I haven't tested this against the OP's reproducer.

Regards,
Dean
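For anyone reading along without the patch files, the v17 patch follows
the usual back-branch ABI-preserving shape, roughly like this (a sketch
only; the new function's name is invented here, not necessarily what the
patch uses):

/* New entry point carrying the extra flag (name invented for this
 * sketch); its body is the existing ExecBRUpdateTriggers() logic, with
 * is_merge_update passed down to GetTupleForTrigger() in place of the
 * es_plannedstmt->commandType check. */
extern bool ExecBRUpdateTriggersNew(EState *estate, EPQState *epqstate,
                                    ResultRelInfo *relinfo,
                                    ItemPointer tupleid,
                                    HeapTuple fdw_trigtuple,
                                    TupleTableSlot *newslot,
                                    TM_Result *tmresult,
                                    TM_FailureData *tmfd,
                                    bool is_merge_update);

/* Old signature kept as a thin wrapper, so extensions compiled against
 * v17 (e.g. timescaledb) keep working across a minor release. */
bool
ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
                     ResultRelInfo *relinfo,
                     ItemPointer tupleid, HeapTuple fdw_trigtuple,
                     TupleTableSlot *newslot, TM_Result *tmresult,
                     TM_FailureData *tmfd)
{
    return ExecBRUpdateTriggersNew(estate, epqstate, relinfo, tupleid,
                                   fdw_trigtuple, newslot, tmresult, tmfd,
                                   false);
}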
On Wed, Jul 16, 2025 at 02:05:06PM +0100, Dean Rasheed wrote:
> I decided to do this by adding an extra "is_merge_update" boolean
> parameter, rather than passing the commandType, because that looked
> slightly neater. It was also necessary to update ExecBRDeleteTriggers(),
> since otherwise a concurrent MERGE DELETE could do the wrong thing
> (execute the wrong action, rather than seg-faulting). That was picked up
> by an existing isolation test case added by 9321c79, so there's no need
> for more tests.

Looks sensible here, for both the HEAD and v17 flavors.

> I haven't tested this against the OP's reproducer.

Other than waiting for the OP to actually test the fix, I'm not sure
there is much more we can do. Anyway, the stack that one gets with the
isolation test is very close to what the OP has reported, so I'm ready
to bet a coffee that you have managed to narrow this one down.

I have also raised an issue with timescale to make them aware of the
problem. I have no idea if this affects them; I'm just acting as a
reporter:
https://github.com/timescale/timescaledb/issues/8392
--
Michael
On Thu, 17 Jul 2025 at 06:45, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jul 16, 2025 at 02:05:06PM +0100, Dean Rasheed wrote:
> > I decided to do this by adding an extra "is_merge_update" boolean
> > parameter, rather than passing the commandType, because that looked
> > slightly neater. It was also necessary to update ExecBRDeleteTriggers(),
> > since otherwise a concurrent MERGE DELETE could do the wrong thing
> > (execute the wrong action, rather than seg-faulting). That was picked up
> > by an existing isolation test case added by 9321c79, so there's no need
> > for more tests.
>
> Looks sensible here, for both the HEAD and v17 flavors.

Thanks for looking.

> > I haven't tested this against the OP's reproducer.
>
> Other than waiting for the OP to actually test the fix, I'm not sure
> there is much more we can do.

I'll push this in a day or so in any case, since it's clearly fixing
*an* issue, even if it doesn't entirely fix the OP's issue.

> Anyway, the stack that one gets with the isolation test is very close
> to what the OP has reported, so I'm ready to bet a coffee that you
> have managed to narrow this one down.

Yeah, that was my thinking too.

> I have also raised an issue with timescale to make them aware of the
> problem. I have no idea if this affects them; I'm just acting as a
> reporter:
> https://github.com/timescale/timescaledb/issues/8392

Cool, thanks for doing that.

Regards,
Dean
On Thu, 17 Jul 2025 at 07:46, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> I'll push this in a day or so in any case, since it's clearly fixing
> *an* issue, even if it doesn't entirely fix the OP's issue.

Pushed.

Regards,
Dean
Hello,
I just made a fresh build:
commit 91ad1bdef8e46983d82d2723c6e1c05e004f74c5 (HEAD -> 17_stable, origin/REL_17_STABLE)
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Date:   Fri Jul 18 10:01:31 2025 +0100

    Fix concurrent update trigger issues with MERGE in a CTE.
Can confirm that the issue is resolved now.
Thanks!
Yaroslav.
On Jul 18 2025, at 11:19 am, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> On Thu, 17 Jul 2025 at 07:46, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> >
> > I'll push this in a day or so in any case, since it's clearly fixing
> > *an* issue, even if it doesn't entirely fix the OP's issue.
>
> Pushed.
>
> Regards,
> Dean