Re: Weird failure with latches in curculio on v15

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Weird failure with latches in curculio on v15
Дата
Msg-id 20230203071557.ofq56vmuxx5jgbey@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Weird failure with latches in curculio on v15  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: Weird failure with latches in curculio on v15  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi,

On 2023-02-02 10:23:21 +0900, Michael Paquier wrote:
> On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> > On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >> The main thing that system() brings to the table is platform-specific
> >> knowledge of where the shell is.  I'm not very sure that we want to
> >> wire in "/bin/sh".
> > 
> > We seem to be doing OK with using SHELLPROG in pg_regress, which just
> > seems to be using $SHELL from the build environment.
> 
> It looks like this had better centralize a bit more of the logic from
> pg_regress.c if that were to happen, to avoid more fuzzy logic with
> WIN32.  That becomes invasive for a back-patch.

I don't think we should consider backpatching such a change. There's
enough subtlety that I'd want to see it bake for some time.


> By the way, there is something that's itching me a bit here.  9a740f8
> has enlarged by a lot the window between PreRestoreCommand() and
> PostRestoreCommand(), however curculio has reported a failure on
> REL_15_STABLE, where we only manipulate my_wait_event_info while the
> flag is on.  Or I am getting that right that there is no way out of it
> unless we remove the dependency to system() even in the back-branches?
> Could there be an extra missing piece here?

Yea, that's indeed odd.


Ugh, I think I might understand what's happening:

The signal arrives just after the fork() (within system()). Because we
have all our processes configure themselves as process group leaders,
and we signal the entire process group (c.f. signal_child()), both the
child process and the parent will process the signal. So we'll end up
doing a proc_exit() in both. As both are trying to remove themselves
from the same PGPROC etc entry, that doesn't end well.

I don't see how we can solve that properly as long as we use system().

A workaround for the back branches could be to have a test in
StartupProcShutdownHandler() that tests if MyProcPid == getpid(), and
not do the proc_exit() if they don't match. We probably should just do
an _exit() in that case.


I doubt the idea to signal the entire process group in in signal_child()
is good. I regularly see core dumps of archive commands because we sent
SIGQUIT during an immediate shutdown, and of course cp etc don't have a
SIGQUIT handler, and the default action is to core dump.

But a replacement for it is not a small amount of work. While a
subprocess is running, we can't just handle SIGQUIT with _exit() while
the subprocess is running, we need to first signal the child with
something appropriate (SIGKILL?).

OTOH, the current approach only works on systems with setsid(2) support,
so we probably shouldn't rely so hard on it anyway.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jim Jones
Дата:
Сообщение: Re: [PATCH] Add pretty-printed XML output option
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Weird failure with latches in curculio on v15