interrupt processing during FATAL processing

From Andres Freund
Subject interrupt processing during FATAL processing
Msg-id 20220715172938.t7uivghdx5vj36cn@awork3.anarazel.de
List pgsql-hackers
Hi,

One of the problems we found ([1]) when looking at spurious failures of the
recovery conflict test ([2]) is that a single backend can FATAL out multiple
times.  That seems independent enough from that thread that I thought it best
to raise separately.

The problem is that during the course of FATAL processing, we end up
processing interrupts, which then can process the next recovery conflict
interrupt. This happens because while we send the FATAL to the client
(errfinish()->EmitErrorReport()) we'll process interrupts in various
places. If e.g. a new recovery conflict interrupt has been raised (which
startup.c does at an absurd rate), that'll then trigger a new FATAL.

Sources for recursive processing of interrupts are e.g. < ERROR ereports like
the COMMERROR in internal_flush().

A similar problem does *not* exist for ERROR, because errfinish()
PG_RE_THROW()s before the EmitErrorReport() and PostgresMain() does
HOLD_INTERRUPTS() first thing after sigsetjmp().

One might reasonably think that the proc_exit_inprogress logic in die(),
RecoveryConflictInterrupt() etc.  protects us against this - but it doesn't,
because we've not yet set it when doing EmitErrorReport(), it gets set later
during proc_exit().

I'm not certain what the best way to address this is.

One approach would be to put a HOLD_INTERRUPTS() before EmitErrorReport() when
handling a FATAL error. I'm a bit worried this might make it harder to
interrupt FATAL errors when they're blocking sending the message to the client
- ProcessClientWriteInterrupt() won't do its thing. OTOH, it looks like we
already have that problem if there's a FATAL after the sigsetjmp() block did
HOLD_INTERRUPTS(), because errfinish() won't reset InterruptHoldoffCount like
it does for ERROR.

Another approach would be to set proc_exit_inprogress earlier, before the
EmitErrorReport(). That's a bit ugly, because it ties ipc.c more closely to
elog.c, but it also "feels" correct to me. OTOH, it's at best a partial
protection, because it doesn't prevent already pending interrupts from being
processed.

I guess we could instead add a dedicated variable indicating whether we're
currently processing a FATAL error? I was considering exposing a function
checking whether elog's recursion_depth is != 0, and short-circuit
ProcessInterrupts() based on that, but I don't think that'd be good - we want
to be able to interrupt some super-noisy NOTICE or such.


I suspect that we should, even if it does not address this issue, reset
InterruptHoldoffCount in errfinish() for FATALs, similar to ERRORs, so that
FATALs can benefit from the logic in ProcessClientWriteInterrupt().

Greetings,

Andres Freund

[1] https://postgr.es/m/20220701231833.vh76xkf3udani3np%40alap3.anarazel.de
[2] clearly we needed that test urgently, but I can't deny regretting the
    consequence of having to fix the plethora of bugs it's uncovering


