Hi,
On 2022-08-03 16:33:46 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
> > The reason nothing might get logged in some cases is that
> > e.g. ResolveRecoveryConflictWithLock() tells
> > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> > /*
> > * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> > * "waiting" in PS display by disabling its argument report_waiting
> > * because the caller, WaitOnLock(), has already reported that.
> > */
> >
> > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> > logging anything.
>
> I understand why we need to avoid adding "waiting" to the PS status
> when we've already done that, but it doesn't seem like that should
> imply skipping ereport() of log messages.
>
> I think we could redesign the way the ps display works to make things
> a whole lot simpler. Let's have a function set_ps_display() and
> another function set_ps_display_suffix(). What gets reported to the OS
> is the concatenation of the two. Calling set_ps_display() implicitly
> resets the suffix to empty.
>
> AFAICS, that'd let us get rid of this tricky logic, and some other
> tricky logic as well. Here, we'd just say set_ps_display_suffix("
> waiting") and not worry about whether the caller might have already
> done something similar.
That sounds like it'd be an improvement. Of course we still need to fix that
we can signal at a rate not allowing the other side to handle the conflict,
but at least that'd be easier to identify...
> > Another question I have about ResolveRecoveryConflictWithLock() is whether
> > it's ok that we don't check deadlocks around the
> > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> > only block if there's a recovery conflict, in which killing the process ought
> > to succeed?
>
> The startup process is supposed to always "win" in any deadlock
> situation, so I'm not sure what you think is a problem here. We get
> the conflicting lockers. We kill them. If they don't die, that's a
> bug, but killing ourselves doesn't really help anything; if we die,
> the whole system goes down, which seems undesirable.
The way deadlock timeout for the startup process works is that we wait for it
to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK to the
backends. So it's not that the startup process would die.
The question is basically whether there are cases were
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but
PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't. It seems plausible that there isn't,
but it's also not obvious enough that I'd fully trust it.
Greetings,
Andres Freund