Postmaster doesn't correctly handle crashes in PM_STARTUP state

From: Andres Freund
Subject: Postmaster doesn't correctly handle crashes in PM_STARTUP state
Msg-id: 20230729215124.ra4rbwck5dlawvmo@awork3.anarazel.de
List: pgsql-hackers

Hi,

While testing something I made the checkpointer process intentionally crash as
soon as it started up.  The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:

2023-07-29 14:32:39.241 PDT [65031] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG:  checkpointer process (PID 65032) was terminated by signal 11: Segmentation fault:11
2023-07-29 14:32:39.244 PDT [65031] LOG:  terminating any other active server processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG:  process 65035 taking over ProcSignal slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG:  shutting down because restart_after_crash is off

Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.

I could not initially reproduce this on Linux.

After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...

The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:

    /* We now transit into a state of waiting for children to die */
    if (pmState == PM_RECOVERY ||
        pmState == PM_HOT_STANDBY ||
        pmState == PM_RUN ||
        pmState == PM_STOP_BACKENDS ||
        pmState == PM_SHUTDOWN)
        pmState = PM_WAIT_BACKENDS;
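
To make the gap concrete, here is a minimal standalone sketch of that check. It is not the postmaster code itself: the enum is a simplified stand-in that models only the states named in this message, and the helpers state_after_child_crash() and crash_leaves_state_unchanged() are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/*
 * Simplified stand-in for the postmaster's state enum. Only the states
 * mentioned in this message are modeled; this is an assumption made for
 * illustration, not the real definition.
 */
typedef enum
{
    PM_STARTUP,
    PM_RECOVERY,
    PM_HOT_STANDBY,
    PM_RUN,
    PM_STOP_BACKENDS,
    PM_WAIT_BACKENDS,
    PM_SHUTDOWN,
    PM_SHUTDOWN_2
} PMState;

/*
 * Hypothetical helper: returns the state that results from applying the
 * check quoted above to the given state.
 */
static PMState
state_after_child_crash(PMState pmState)
{
    /* Same condition as the quoted snippet from HandleChildCrash() */
    if (pmState == PM_RECOVERY ||
        pmState == PM_HOT_STANDBY ||
        pmState == PM_RUN ||
        pmState == PM_STOP_BACKENDS ||
        pmState == PM_SHUTDOWN)
        pmState = PM_WAIT_BACKENDS;

    return pmState;
}

/*
 * Hypothetical helper: true if a child crash observed in this state leaves
 * the state as it was, i.e. no transition to PM_WAIT_BACKENDS happens.
 */
static bool
crash_leaves_state_unchanged(PMState pmState)
{
    return state_after_child_crash(pmState) == pmState;
}
```

In this model a crash seen in PM_RUN or PM_RECOVERY moves to PM_WAIT_BACKENDS, while PM_STARTUP (and PM_SHUTDOWN_2) fall through the condition and stay where they were.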

Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on Linux as well.

I haven't fully dug through the history, but this looks to be quite an old problem.


Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.

Greetings,

Andres Freund


