kevent latch paths don't handle postmaster death well

From: Tom Lane
Subject: kevent latch paths don't handle postmaster death well
Msg-id: 3624029.1602701929@sss.pgh.pa.us
Replies: Re: kevent latch paths don't handle postmaster death well  (Thomas Munro <thomas.munro@gmail.com>)
         Re: kevent latch paths don't handle postmaster death well  (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
I noticed that chipmunk failed [1] with a rather interesting log:

2020-10-14 08:57:01.661 EEST [27048:6] pg_regress/prepared_xacts LOG:  statement: UPDATE pxtest1 SET foobar = 'bbb' WHERE foobar = 'aaa';
2020-10-14 08:57:01.721 EEST [27048:7] pg_regress/prepared_xacts LOG:  statement: SELECT * FROM pxtest1;
2020-10-14 08:57:01.823 EEST [27048:8] pg_regress/prepared_xacts FATAL:  postmaster exited during a parallel transaction
TRAP: FailedAssertion("entry->trans == NULL", File: "pgstat.c", Line: 909, PID: 27048)
2020-10-14 08:57:01.861 EEST [27051:1] ERROR:  could not attach to dynamic shared area
2020-10-14 08:57:01.861 EEST [27051:2] STATEMENT:  SELECT * FROM pxtest1;

I do not know what happened to the postmaster, but seeing that chipmunk
is a very small machine running a pretty old Linux kernel, it's plausible
to guess that the OOM killer decided to pick on the postmaster.  (I wonder
whether Heikki has taken any steps to prevent that on that machine.)
My concern today is not that the postmaster died, but that the subsequent
response was an Assert failure.  Not good.

I tried to reproduce this by dint of manually kill -9'ing the postmaster
during the select_parallel regression test.  Figuring that a slower
machine would give me a better chance of success at that, I used an old
Mac laptop that wasn't doing anything else.  I did not succeed yet, but
what I did reproducibly get (in five out of five tries) was a leader
process that was permanently stuck on a latch, waiting for its worker(s)
to die.  Needless to say, there were no workers and never would be.
The stack trace varied a bit, but here's an interesting case:

(gdb) bt
#0  0x90b267ac in kevent ()
#1  0x003b76e8 in WaitEventSetWaitBlock [inlined] () at latch.c:1506
#2  0x003b76e8 in WaitEventSetWait (set=0x10003a8, timeout=-1, occurred_events=<value temporarily unavailable, due to optimizations>, nevents=1, wait_event_info=<value temporarily unavailable, due to optimizations>) at latch.c:1309
#3  0x003b814c in WaitLatch (latch=<value temporarily unavailable, due to optimizations>, wakeEvents=17, timeout=-1, wait_event_info=134217729) at latch.c:411
#4  0x0032e77c in WaitForBackgroundWorkerShutdown (handle=0x10280a4) at bgworker.c:1139
#5  0x000bc6fc in WaitForParallelWorkersToExit (pcxt=0xc6f84c) at parallel.c:876
#6  0x000bc99c in DestroyParallelContext (pcxt=0xc6f84c) at parallel.c:958
#7  0x000bdc48 in dlist_is_empty [inlined] () at lib/ilist.h:1231
#8  0x000bdc48 in AtEOXact_Parallel (isCommit=4 '\004') at parallel.c:1224
#9  0x000ccf24 in AbortTransaction () at xact.c:2702
#10 0x000cd534 in AbortOutOfAnyTransaction () at xact.c:4623
#11 0x00550b54 in ShutdownPostgres (code=<value temporarily unavailable, due to optimizations>, arg=<value temporarily unavailable, due to optimizations>) at postinit.c:1195
#12 0x003b5ff0 in shmem_exit (code=1) at ipc.c:239
#13 0x003b6168 in proc_exit_prepare (code=1) at ipc.c:194
#14 0x003b6240 in proc_exit (code=1) at ipc.c:107
#15 0x00541f6c in errfinish (filename=<value temporarily unavailable, due to optimizations>, lineno=<value temporarily unavailable, due to optimizations>, funcname=0x5bd55c "WaitForParallelWorkersToExit") at elog.c:578
#16 0x000bc748 in WaitForParallelWorkersToExit (pcxt=0xc6f84c) at parallel.c:885
#17 0x000bc7c8 in ReinitializeParallelDSM (pcxt=0xc6f84c) at parallel.c:471
#18 0x0021e468 in ExecParallelReinitialize (planstate=0x109a9a0, pei=0xc3a09c, sendParams=0x0) at execParallel.c:906
#19 0x00239f6c in ExecGather (pstate=0x109a848) at nodeGather.c:177
#20 0x00221cd8 in ExecProcNodeInstr (node=0x109a848) at execProcnode.c:466
#21 0x0024f7a8 in ExecNestLoop (pstate=0x1099700) at executor/executor.h:244
#22 0x00221cd8 in ExecProcNodeInstr (node=0x1099700) at execProcnode.c:466
#23 0x0022edbc in ExecProcNode [inlined] () at executor.h:244
#24 0x0022edbc in fetch_input_tuple (aggstate=0x10993d0) at executor.h:589

We appear to have already realized that the postmaster died, since we're
inside proc_exit.  WaitForBackgroundWorkerShutdown is doing this:

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                       WAIT_EVENT_BGWORKER_SHUTDOWN);

which one would certainly hope would not block at all if the postmaster
is already dead, yet it's doing so.  I guess that the kevent stuff is
failing to handle the case of another WaitLatch call after the postmaster
is already known dead.
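
For anyone who wants to poke at this outside the tree, here's a bare-bones
standalone sketch (my own toy, not latch.c; the function name is invented)
of the kqueue primitive the kevent path is built on, namely watching a pid
with EVFILT_PROC/NOTE_EXIT.  Two corner cases worth noting: the EV_ADD
registration fails outright with ESRCH if the target is already gone, and,
if I remember the BSD semantics right, the exit notification is effectively
one-shot, so a caller that comes back and waits again later can't count on
being woken a second time; it has to remember that it already saw the death.
Whether either of those is exactly what latch.c is tripping over here I
haven't verified.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Toy illustration, not PostgreSQL code: block until the given pid exits,
 * using EVFILT_PROC/NOTE_EXIT.  Returns 1 once the process is known dead
 * (including the case where it was already dead at registration time),
 * -1 on unexpected failure.
 */
static int
watch_for_exit(pid_t pid)
{
	int			kq = kqueue();
	struct kevent ev;

	if (kq < 0)
		return -1;

	/* Register interest in the target process's exit. */
	EV_SET(&ev, pid, EVFILT_PROC, EV_ADD, NOTE_EXIT, 0, NULL);
	if (kevent(kq, &ev, 1, NULL, 0, NULL) < 0)
	{
		int			save_errno = errno;

		close(kq);
		/* ESRCH here means the process is already gone. */
		return (save_errno == ESRCH) ? 1 : -1;
	}

	/* Wait (indefinitely) for the NOTE_EXIT notification. */
	if (kevent(kq, NULL, 0, &ev, 1, NULL) < 0)
	{
		close(kq);
		return -1;
	}

	close(kq);
	return 1;
}

int
main(void)
{
	/* e.g., watch our parent process */
	printf("parent dead? %d\n", watch_for_exit(getppid()));
	return 0;
}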

In case it helps, I checked the contents of the WaitEventSet:

(gdb) p *LatchWaitSet
$2 = {
  nevents = 2,
  nevents_space = 2,
  events = 0x10003cc,
  latch = 0xa1b673c,
  latch_pos = 0,
  exit_on_postmaster_death = 0 '\0',
  kqueue_fd = 12,
  kqueue_ret_events = 0x10003ec,
  report_postmaster_not_running = 0 '\0'
}
(gdb) p LatchWaitSet->events[0]
$3 = {
  pos = 0,
  events = 1,
  fd = 10,
  user_data = 0x0
}
(gdb) p LatchWaitSet->events[1]
$4 = {
  pos = 1,
  events = 16,
  fd = 3,
  user_data = 0x0
}
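
To decode those, events=1 and events=16 should be WL_LATCH_SET and
WL_POSTMASTER_DEATH respectively (which also agrees with wakeEvents=17 in
frame #3 above).  Quoting the latch.h flag bits from memory, so double-check
against the tree:

/* wait event flag bits, per latch.h (paraphrased from memory) */
#define WL_LATCH_SET         (1 << 0)	/* = 1,  events[0] above */
#define WL_SOCKET_READABLE   (1 << 1)
#define WL_SOCKET_WRITEABLE  (1 << 2)
#define WL_TIMEOUT           (1 << 3)
#define WL_POSTMASTER_DEATH  (1 << 4)	/* = 16, events[1] above */
#define WL_EXIT_ON_PM_DEATH  (1 << 5)

So the set does contain the postmaster-death event, as expected.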


I thought possibly this was an ancient-macOS problem, but I've now
reproduced substantially the same behavior on an up-to-date Catalina
machine (10.15.7), so I do not think we can write it off that way.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2020-10-14%2000%3A04%3A08


