Thread: pg_prewarm bgworker could break fast shutdown

pg_prewarm bgworker could break fast shutdown

From: Alexander Kukushkin

Hello,

If a fast shutdown is initiated before pg_prewarm has managed to load
buffers from the dump (and start the main loop), the pg_prewarm
bgworker process never exits on SIGTERM, effectively preventing a
clean shutdown of the cluster.

This problem has bitten me a few times, but yesterday I managed to attach to
the pg_prewarm process and got a stack trace:
(gdb) bt
#0  0x00007f394d788d27 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000056059d6412f9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc598f2b00, cur_timeout=-1, set=0x56059f5757d8) at ./build/../src/backend/storage/ipc/latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x56059f5757d8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc598f2b00, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217728) at ./build/../src/backend/storage/ipc/latch.c:1000
#3  0x000056059d641748 in WaitLatchOrSocket (latch=0x7f393ec32164, wakeEvents=wakeEvents@entry=17, sock=sock@entry=-1, timeout=-1, timeout@entry=0, wait_event_info=wait_event_info@entry=134217728) at ./build/../src/backend/storage/ipc/latch.c:385
#4  0x000056059d641805 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=17, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=134217728) at ./build/../src/backend/storage/ipc/latch.c:339
#5  0x000056059d5e1d40 in WaitForBackgroundWorkerShutdown (handle=0x56059f57e9b0) at ./build/../src/backend/postmaster/bgworker.c:1153
#6  0x00007f3944e1a180 in apw_start_database_worker () at ./build/../contrib/pg_prewarm/autoprewarm.c:866
#7  0x00007f3944e1a739 in apw_load_buffers () at ./build/../contrib/pg_prewarm/autoprewarm.c:404
#8  autoprewarm_main (main_arg=<optimized out>) at ./build/../contrib/pg_prewarm/autoprewarm.c:203
#9  0x000056059d5e16ee in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:834
#10 0x000056059d5ed58c in do_start_bgworker (rw=0x56059f56cd10) at ./build/../src/backend/postmaster/postmaster.c:5713
#11 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:5939
#12 0x000056059d5ee02d in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5086
#13 <signal handler called>
#14 0x00007f394d77e0f7 in select () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x000056059d5ee58b in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1671
#16 0x000056059d5f038d in PostmasterMain (argc=17, argv=0x56059f51a080) at ./build/../src/backend/postmaster/postmaster.c:1380
#17 0x000056059d37a992 in main (argc=17, argv=0x56059f51a080) at ./build/../src/backend/main/main.c:228

This happened on 11.9, but after looking at HEAD I think the problem
still exists.

Regards,
--
Alexander Kukushkin



Re: pg_prewarm bgworker could break fast shutdown

From: Tom Lane

Alexander Kukushkin <cyberdemn@gmail.com> writes:
> If a fast shutdown is initiated before pg_prewarm has managed to load
> buffers from the dump (and start the main loop), the pg_prewarm
> bgworker process never exits on SIGTERM, effectively preventing a
> clean shutdown of the cluster.

I might be wrong about this, but I suspect what you've got here is that
the postmaster never launched the child bgworker (and now never will
launch it), so GetBackgroundWorkerPid returns BGWH_NOT_YET_STARTED
and then WaitForBackgroundWorkerShutdown keeps on waiting.  If that
interpretation is accurate then the same problem could occur with
parallel query.  (And I believe it's been sufficiently demonstrated
that parallel query falls over very easily in such corner cases,
so this isn't an astonishing conclusion.)
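
To make the suspected hang concrete, here is a minimal standalone model of
such a wait loop. It is a sketch under stated assumptions, not the code in
bgworker.c: only the BGWH_* status names come from PostgreSQL, while
WorkerSlot, get_worker_status(), wait_for_worker_shutdown() and the
max_wakeups cap are hypothetical stand-ins (the real loop sleeps on a latch
and has no such cap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Status names mirror PostgreSQL's BgwHandleStatus; the rest is illustrative. */
typedef enum
{
    BGWH_STARTED,
    BGWH_NOT_YET_STARTED,
    BGWH_STOPPED,
    BGWH_POSTMASTER_DIED
} BgwHandleStatus;

/* Hypothetical stand-in for the shared-memory slot describing one worker. */
typedef struct
{
    bool launched;   /* the postmaster actually forked the worker */
    bool exited;     /* the worker has run and terminated */
} WorkerSlot;

/* Simplified analogue of asking for the worker's status. */
static BgwHandleStatus
get_worker_status(const WorkerSlot *slot)
{
    if (!slot->launched)
        return BGWH_NOT_YET_STARTED;
    return slot->exited ? BGWH_STOPPED : BGWH_STARTED;
}

/*
 * Simplified analogue of the wait loop: it only leaves once the status is
 * BGWH_STOPPED. Returns the number of wakeups used, or -1 if the cap was
 * reached. The cap is an assumption that lets the sketch terminate.
 */
static int
wait_for_worker_shutdown(const WorkerSlot *slot, int max_wakeups)
{
    assert(slot != NULL && max_wakeups >= 0);

    for (int wakeups = 0; wakeups < max_wakeups; wakeups++)
    {
        if (get_worker_status(slot) == BGWH_STOPPED)
            return wakeups;
        /* The real code would sleep here until its latch is set. */
    }
    return -1;
}
```

In this model a slot that was never launched stays at BGWH_NOT_YET_STARTED no
matter how many wakeups are allowed, so the waiter never sees BGWH_STOPPED;
that is the behavior the backtrace would be consistent with.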

I'm inclined to think what we need to do about this is to have
the postmaster transition all pending worker-start requests into
STOPPED state, or some new FAILED state, when it starts trying to
shut stuff down.  I sure don't see any such logic there now ---
it just sends out a bunch of SIGTERMs and that's it.  Also,
it looks like bgworker_should_start_now() doesn't distinguish
postmaster states that don't allow starting a bgworker right
this moment, but probably will allow it later, from states in
which it never will be allowed and we need to fail the request
not just postpone it indefinitely.
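
Roughly, the distinction could look like the following standalone sketch.
Everything in it is hypothetical and exists only for illustration: the
three-state SimPmState, the StartDecision and RequestState enums, and both
functions are invented here and are not the postmaster's actual states or
APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced set of postmaster states. */
typedef enum
{
    SIM_PM_STARTUP,        /* cannot start workers yet, but will later */
    SIM_PM_RUN,            /* workers may be started right now */
    SIM_PM_SHUTTING_DOWN   /* workers will never be started again */
} SimPmState;

/* Hypothetical three-way answer instead of a plain yes/no. */
typedef enum
{
    START_NOW,
    START_LATER,
    START_NEVER
} StartDecision;

/* Hypothetical per-request state visible to whoever is waiting on it. */
typedef enum
{
    REQ_PENDING,
    REQ_RUNNING,
    REQ_STOPPED
} RequestState;

/* Separate "not now" from "never", which a boolean check cannot express. */
static StartDecision
classify_start_request(SimPmState state)
{
    switch (state)
    {
        case SIM_PM_STARTUP:
            return START_LATER;
        case SIM_PM_RUN:
            return START_NOW;
        case SIM_PM_SHUTTING_DOWN:
            return START_NEVER;
    }
    return START_NEVER;
}

/*
 * Once starting is no longer possible, move every still-pending request to
 * the stopped state so that waiters can give up. Returns how many requests
 * were failed; requests already running are left alone.
 */
static int
fail_pending_requests(RequestState *requests, size_t n, SimPmState state)
{
    int failed = 0;

    assert(requests != NULL || n == 0);

    if (classify_start_request(state) != START_NEVER)
        return 0;

    for (size_t i = 0; i < n; i++)
    {
        if (requests[i] == REQ_PENDING)
        {
            requests[i] = REQ_STOPPED;
            failed++;
        }
    }
    return failed;
}
```

With something along these lines, a request that is still pending when
shutdown begins would end up stopped rather than postponed forever, and a
waiter polling its state would have something to wake up on.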

            regards, tom lane