Re: Weird failure with latches in curculio on v15

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Weird failure with latches in curculio on v15
Дата
Msg-id 20230201165801.33ydbxvjdbomjqa7@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Weird failure with latches in curculio on v15  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Weird failure with latches in curculio on v15  (Robert Haas <robertmhaas@gmail.com>)
Re: Weird failure with latches in curculio on v15  (Nathan Bossart <nathandbossart@gmail.com>)
Список pgsql-hackers
Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before.  Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.


> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).

And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
  API to abstract restoring files

- the error message for a failed restore command seems to have gotten
  worse:
  could not restore file \"%s\" from archive: %s"
  ->
  "%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
  xlogarchive.c. I realize the intention is that shell_archive.c is
  going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() doesn't seems less
  useful at its new location

- explanation of why we use GetOldestRestartPoint() is halfway lost


My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to top-level
email proposing "recovery modules".

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Borisov
Дата:
Сообщение: Re: About PostgreSQL Core Team
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Weird failure with latches in curculio on v15