Обсуждение: Re: Accept recovery conflict interrupt on blocked writing
Bonjour Anthonin, On Mon, Jan 13, 2025 at 11:31 PM Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote: > To avoid blocking recovery for an extended period of time, this patch > changes client write interrupts by handling recovery conflict > interrupts instead of ignoring them. Since the interrupt happens while > we're likely to have partially written results on the socket, there's > no easy way to keep protocol sync so the session needs to be > terminated. Right. Before commit 0da096d7 in v17, the recovery conflict code running in a signal handler would have set ProcDiePending, so this looks like an unintended regression due to that commit. In your patch you have CHECK_FOR_INTERRUPTS(), but I think that means it could also service other interrupts that we probably don't want to introduce without more study, no? For example if ProcessRecoveryConflictInterrupts() returns without throwing, and another pending interrupt flag is also set. Thinking of other proc signals/interrupts, of course it's not really acceptable that we don't process at least also global barriers here, ie for a time controllable by a client, but let's call that an independent problem. I think the minimal thing for this specific problem might be to use ProcessRecoveryConflictInterrupts() directly. I understand why you created DoingBlockingWrite, based on the variable named DoingQueryRead. I wonder if it would be better to model it on QueryCancelHoldoffCount, inventing QueryCancelPromoteToFatalCount, but I'm not sure. The read from the socket also seems to have a variant of the problem, unless I'm missing something, except in that case I'm not sure it's new. For sending you're proposing that our only choice is to kill the session, which makes sense, but the equivalent read side stay-in-sync strategy is to keep deferring until the client gets around to sending a complete message, if I read that right. Hmm, I wonder if we could write tests for both directions in 031_recovery_conflict.pl, using a Perl to do raw pq protocol in TCP, as seen in some other recent proposals... I might have a crack at that soon unless you want to.
Hi,
On 2025-01-17 17:01:53 +0100, Anthonin Bonnefoy wrote:
> I've cleaned up the tests: I've created a dedicated PgProto
> (definitely open to suggestions for a better name...) module
> containing all the helpers to send and receive messages on a raw
> socket in 0001.
Might be worth using it it in src/test/postmaster/t/002_start_stop.pl? That
has e.g. code to send a startup message.
> Also, I think the DoingBlockingWrite variable was unnecessary? Calling
> ProcessRecoveryConflictInterrupts should be enough as I don't think
> QueryCancelHoldoffCount can be >0 when writing on a socket it should always
> be able to cancel the statement.
I'd probably add an assertion to that effect, it's too easy for that to change
in a few years otherwise.
> Previously, all interrupts except process dying were ignored while a
> process was blocked writing to a socket. If the connection to the client
> was broken (no clean FIN nor RST), a process sending results to the
> client could be stuck for 924s until the TCP retransmission timeout is
> reached. During this time, it was possible for the process to conflict
> with recovery: For example, the process returning results can have a
> conflicting buffer pin.
>
> To avoid blocking recovery for an extended period of time, this patch
> changes client write interrupts by handling recovery conflict interrupts
> instead of ignoring them. Since the interrupt happens while we're likely
> to have partially written results on the socket, there's no easy way to
> keep protocol sync so the session needs to be terminated.
I don't see anything implementing the promotion of ERRORs to FATAL?  You're
preventing the error message being sent to the client, but I don't think that
causes the connection to be terminated.  The pre-existing code doesn't have
that problem, because it's only active when ProcDiePending is already set.
In fact, the test your patch added goes through
ProcessRecoveryConflictInterrupts() multiple times:
2025-01-17 12:52:47.842 EST [3376411] LOG:  recovery still waiting after 20.709 ms: recovery conflict on buffer pin
2025-01-17 12:52:47.842 EST [3376411] CONTEXT:  WAL redo at 0/3462288 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F,
nplans:0, nredirected: 0, ndead: 0, nun>
 
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
...
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
3376451: recovery conflict processing done
write(8192) = -1: 11/Resource temporarily unavailable
3376451: recovery conflict interrupt while blocked
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with
recovery
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl DETAIL:  User was holding shared buffer pin for too
long.
2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl STATEMENT:  
                                BEGIN;
                                DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM
test_recovery_conflict_table1;
                                FETCH FORWARD FROM test_recovery_conflict_cursor;
                                SELECT generate_series(1, 100000);
                        
backend 3376451> 2025-01-17 12:52:48.072 EST [3376411] LOG:  recovery finished waiting after 250.681 ms: recovery
conflicton buffer pin
 
I don't actually know why the conflict ends up being resolved after a bunch of
retries.
Note also the "backend>" (to which I added the PID to identify it) which gets
emitted. Just setting whereToSendOutput = DestNone has side effects when not
actually in a process exit status...
> @@ -581,6 +582,22 @@ ProcessClientWriteInterrupt(bool blocked)
>          else
>              SetLatch(MyLatch);
>      }
> +    else if (blocked && RecoveryConflictPending)
> +    {
> +        /*
> +         * We're conflicting with recovery while being blocked writing. This
> +         * can happen when the process is returning results and no ACK is
> +         * received (broken connection, client overloaded...), eventually
> +         * saturating the socket buffer while the process holds a page pin
> +         * that eventually conflict with recovery.
> +         */
> +        if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
I was about to say that we shouldn't get here with CritSectionCount != 0, but
unfortunately I'm not sure that's true. Thinking of e.g. debug messages or
such. And it's of course a copy of existing code...
Playing around with this it's unfortunately worse - we very commonly get to
ProcessClientWriteInterrupt() with InterruptHoldoffCount > 0. Including just
after processing recovery conflict interrupts. The reason for that is that if
we do throw an error the first thing we do is to hold interrupts
(c.f. sigsetjmp() in PostgresMain()) and then we call EmitErrorReport().
Unfortunately I suspect that this means any protection we'd get from this
version of the patch is rather incomplete - if the recovery conflict is
triggered while not blocked, we'll ERROR out and report the message with
interrupts held. That cycle of recovery conflict signalling wouldn't ever be
resolved in such a case, I think. Of course we do re-signal recovery conflict
interrupts rather aggressively - but I'm not sure that's going to help
reliably.
I've previously suggested that we ought to report some errors to the client by
trying to send the data in a a non-blocking way, and just terminate if we
can't send the data immediately. IIRC Tom was concerned that that would lead
to too many lost messages.  Perhaps we could instead implement this with a
fairly timer that terminates connections when stuck trying to write out an
error message?
> +        {
> +            if (whereToSendOutput == DestRemote)
> +                whereToSendOutput = DestNone;
> +            ProcessRecoveryConflictInterrupts();
> +        }
> +    }
>
This largely duplicates the existing code in
ProcessClientWriteInterrupt(). Perhaps it's worth flipping the order of
branches so that we first check that we're neither blocked nor have
InterruptHoldoffCount == 0 && CritSectionCount == 0?
Greetings,
Andres Freund
			
		Hi, On 2025-01-17 13:03:35 -0500, Andres Freund wrote: > I don't see anything implementing the promotion of ERRORs to FATAL? You're > preventing the error message being sent to the client, but I don't think that > causes the connection to be terminated. The pre-existing code doesn't have > that problem, because it's only active when ProcDiePending is already set. > > In fact, the test your patch added goes through > ProcessRecoveryConflictInterrupts() multiple times: > > 2025-01-17 12:52:47.842 EST [3376411] LOG: recovery still waiting after 20.709 ms: recovery conflict on buffer pin > 2025-01-17 12:52:47.842 EST [3376411] CONTEXT: WAL redo at 0/3462288 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F, nplans:0, nredirected: 0, ndead: 0, nun> > 3376451: recovery conflict interrupt while blocked > 3376451: recovery conflict processing done > write(8192) = -1: 11/Resource temporarily unavailable > 3376451: recovery conflict interrupt while blocked > 3376451: recovery conflict processing done > ... > write(8192) = -1: 11/Resource temporarily unavailable > 3376451: recovery conflict interrupt while blocked > 3376451: recovery conflict processing done > write(8192) = -1: 11/Resource temporarily unavailable > 3376451: recovery conflict interrupt while blocked > 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl ERROR: canceling statement due to conflict with recovery > 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl DETAIL: User was holding shared buffer pin for too long. > 2025-01-17 12:52:48.072 EST [3376451] 031_recovery_conflict.pl STATEMENT: > BEGIN; > DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1; > FETCH FORWARD FROM test_recovery_conflict_cursor; > SELECT generate_series(1, 100000); > > backend 3376451> 2025-01-17 12:52:48.072 EST [3376411] LOG: recovery finished waiting after 250.681 ms: recovery conflicton buffer pin > > I don't actually know why the conflict ends up being resolved after a bunch of > retries. It's because the test sets deadlock_timeout lower than max_standby_streaming_delay. > Note also the "backend>" (to which I added the PID to identify it) which gets > emitted. Just setting whereToSendOutput = DestNone has side effects when not > actually in a process exit status... I the the only reason the patch works at all is because we end up in InteractiveBackend()'s EOF handling, because InteractiveBackend reads from stdin. That's closed here, but I don't think we have any guarantee that stdin isn't something that can be read from. Greetings, Andres Freund
Hi,
On 2025-01-17 13:03:35 -0500, Andres Freund wrote:
> > Previously, all interrupts except process dying were ignored while a
> > process was blocked writing to a socket. If the connection to the client
> > was broken (no clean FIN nor RST), a process sending results to the
> > client could be stuck for 924s until the TCP retransmission timeout is
> > reached. During this time, it was possible for the process to conflict
> > with recovery: For example, the process returning results can have a
> > conflicting buffer pin.
> >
> > To avoid blocking recovery for an extended period of time, this patch
> > changes client write interrupts by handling recovery conflict interrupts
> > instead of ignoring them. Since the interrupt happens while we're likely
> > to have partially written results on the socket, there's no easy way to
> > keep protocol sync so the session needs to be terminated.
>
> I don't see anything implementing the promotion of ERRORs to FATAL?  You're
> preventing the error message being sent to the client, but I don't think that
> causes the connection to be terminated.  The pre-existing code doesn't have
> that problem, because it's only active when ProcDiePending is already set.
The easiest way to do that promotion is probably to set ExitOnAnyError to
true before entering ProcessRecoveryConflictInterrupts().  That does prevent
ending up in InteractiveBackend().
But I suspect this won't suffice:
> > @@ -581,6 +582,22 @@ ProcessClientWriteInterrupt(bool blocked)
> >          else
> >              SetLatch(MyLatch);
> >      }
> > +    else if (blocked && RecoveryConflictPending)
> > +    {
> > +        /*
> > +         * We're conflicting with recovery while being blocked writing. This
> > +         * can happen when the process is returning results and no ACK is
> > +         * received (broken connection, client overloaded...), eventually
> > +         * saturating the socket buffer while the process holds a page pin
> > +         * that eventually conflict with recovery.
> > +         */
> > +        if (InterruptHoldoffCount == 0 && CritSectionCount == 0)
>
> I was about to say that we shouldn't get here with CritSectionCount != 0, but
> unfortunately I'm not sure that's true. Thinking of e.g. debug messages or
> such. And it's of course a copy of existing code...
>
> Playing around with this it's unfortunately worse - we very commonly get to
> ProcessClientWriteInterrupt() with InterruptHoldoffCount > 0. Including just
> after processing recovery conflict interrupts. The reason for that is that if
> we do throw an error the first thing we do is to hold interrupts
> (c.f. sigsetjmp() in PostgresMain()) and then we call EmitErrorReport().
>
> Unfortunately I suspect that this means any protection we'd get from this
> version of the patch is rather incomplete - if the recovery conflict is
> triggered while not blocked, we'll ERROR out and report the message with
> interrupts held. That cycle of recovery conflict signalling wouldn't ever be
> resolved in such a case, I think. Of course we do re-signal recovery conflict
> interrupts rather aggressively - but I'm not sure that's going to help
> reliably.
This unfortunately indeed is true.  If I, instead of the generate_series(),
add the following to the pgproto query:
            DO \$\$BEGIN RAISE 'endless scream: %', repeat('a', 1000000); END;\$\$;
the conflict doesn't get handled.
Example log, with a fair bit of added debugging added:
2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl ERROR:  endless scream: aa...
2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl CONTEXT:  PL/pgSQL function inline_code_block line 1 at
RAISE
2025-01-17 13:31:46.226 EST [3407842] 031_recovery_conflict.pl STATEMENT:
                                BEGIN;
                                DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM
test_recovery_conflict_table1;
                                FETCH FORWARD FROM test_recovery_conflict_cursor;
                                DO $$BEGIN RAISE 'endless scream: %', repeat('a', 1000000); END;$$;
3407842: start fronted error
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(8192) = 8192: 0/Success
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(992023) = 219264: 0/Success
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: ProcessClientWriteInterrupt blocked: 0, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 0
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:46.231 EST [3407800] DEBUG:  reaping dead processes
2025-01-17 13:31:46.232 EST [3407800] DEBUG:  releasing pm child slot 3
2025-01-17 13:31:46.232 EST [3407800] DEBUG:  client backend (PID 3407844) exited with exit code 0
2025-01-17 13:31:46.407 EST [3407804] DEBUG:  sendtime 2025-01-17 13:31:46.407326-05 receipttime 2025-01-17
13:31:46.40737-05replication apply delay 0 ms transfer latency 1 ms
 
2025-01-17 13:31:46.407 EST [3407804] DEBUG:  sending write 0/3460F78 flush 0/3460DA0 apply 0/3460DA0
2025-01-17 13:31:46.407 EST [3407804] DEBUG:  sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407832: !HoldingBufferPinThatDelaysRecovery
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:46.707 EST [3407803] LOG:  recovery still waiting after 300.099 ms: recovery conflict on buffer pin
2025-01-17 13:31:46.707 EST [3407803] CONTEXT:  WAL redo at 0/3460DA0 for Heap2/PRUNE_VACUUM_SCAN: , isCatalogRel: F,
nplans:0, nredirected: 0, ndead: 0, nunused: 1, unused: [21]; blkref #0: rel 1663/16385/16386, blk 0
 
3407832: !HoldingBufferPinThatDelaysRecovery
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407842: write(772759) = -1: 11/Resource temporarily unavailable
3407842: ProcessClientWriteInterrupt blocked: 1, IHC: 1, CSC: 0, ProcDieP: 0, RecoveryConflictPending: 1
3407842: write(772759) = -1: 11/Resource temporarily unavailable
2025-01-17 13:31:56.417 EST [3407804] DEBUG:  sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
2025-01-17 13:32:06.427 EST [3407804] DEBUG:  sending write 0/3460F78 flush 0/3460F78 apply 0/3460DA0
2025-01-17 13:32:07.224 EST [3407804] DEBUG:  sendtime 2025-01-17 13:32:07.224295-05 receipttime 2025-01-17
13:32:07.224413-05replication apply delay 20817 ms transfer latency 1 ms
 
Greetings,
Andres Freund