Re: Weird failure with latches in curculio on v15

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Weird failure with latches in curculio on v15
Дата
Msg-id 20230201105514.rsjl4bnhb65giyvo@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Weird failure with latches in curculio on v15  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Weird failure with latches in curculio on v15  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Hi,

On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database off assertion failures has four like that, all 15 and master:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
> 
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.

Ugh, no wonder we're getting crashes. This whole business seems bogus as
hell.


RestoreArchivedFile():
...
    /*
     * Check signals before restore command and reset afterwards.
     */
    PreRestoreCommand();

    /*
     * Copy xlog from archival storage to XLOGDIR
     */
    ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);

    PostRestoreCommand();


/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
    int            save_errno = errno;

    if (in_restore_command)
        proc_exit(1);
    else
        shutdown_requested = true;
    WakeupRecovery();

    errno = save_errno;
}

Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.



There's *a lot* of stuff happening inside shell_restore() that's not
compatible with doing proc_exit() inside a signal handler. We're
allocating memory! Interact with stdout.

There's also the fact that system() isn't signal safe, but that's a much
less likely problematic issue.


This appears to have gotten worse over a sequence of commits. The
following commits each added something betwen PreRestoreCommand() and
PostRestoreCommand().


commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao <fujii@postgresql.org>
Date:   2021-11-22 10:28:21 +0900
 
    Report wait events for local shell commands like archive_command.

added pgstat_report_wait_start/end. Unlikely to cause big issues, but
not good.


commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2022-08-29 13:55:38 -0400

    Clean up inconsistent use of fflush().

Made it a bit worse by adding an fflush(). That certainly seems like it
could cause hangs.


commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier <michael@paquier.xyz>
Date:   2023-01-16 16:31:43 +0900

    Refactor code in charge of running shell-based recovery commands

which completely broke the mechanism. We suddenly run the entirety of
shell_restore(), which does pallocs etc to build the string passed to
system, and raises errors, all within a section in which a signal
handler can invoke proc_exit().  That's just completely broken.


Sorry, but particularly in this area, you got to be a heck of a lot more
careful.

I don't see a choice but to revert the recent changes. They need a
fairly large rewrite.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dagfinn Ilmari Mannsåker
Дата:
Сообщение: Re: Clarify deleting comments and security labels in synopsis
Следующее
От: David Rowley
Дата:
Сообщение: Re: heapgettup refactoring