Hi,
On 2017-12-21 02:42:25 -0800, Andres Freund wrote:
> Trying to debug this I found another issue. I'd placed a sleep(10) in
> ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
> some reason. Segfault time:
>
> #0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at
/home/andres/src/postgresql/src/include/storage/s_lock.h:228
> #1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at
/home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
> #2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
> #3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at
/home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
> So, afaics no workers had yet attached, the leader accepted the cancel
> interrupt, the dsm segments were destroyed, and as part of cleanup
> cv_sleep_target was supposed to be reset, which fails, because it's
> memory has since been freed. Looking at how that can happen.
Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.
Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).
Adding that seems to fix the crash.
This seems like something we need to backpatch.
Greetings,
Andres Freund