BUG #18961: Race scenario where max_standby_streaming_delay is not honored
От | PG Bug reporting form |
---|---|
Тема | BUG #18961: Race scenario where max_standby_streaming_delay is not honored |
Дата | |
Msg-id | 18961-455d3d9b0225b24b@postgresql.org обсуждение исходный текст |
Список | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 18961 Logged by: Anthony Hsu Email address: erwaman@gmail.com PostgreSQL version: 17.5 Operating system: Linux Description: In the current ResolveRecoveryConflictWithBufferPin implementation in standby.c, I think there's a race scenario where a backend holding a conflicting buffer pin won't receive a PROCSIG_RECOVERY_CONFLICT_BUFFERPIN message promptly: 1. Assume max_standby_streaming_delay has expired when the startup process enters ResolveRecoveryConflictWithBufferPin 2. Assume backend 1 holds a conflicting buffer pin while backend 2 does not 3. Since we are past the standby limit time, the startup process broadcasts PROCSIG_RECOVERY_CONFLICT_BUFFERPIN here [1] without enabling any timeouts 4. Then the startup process waits to be woken up via ProcWaitForSignal(WAIT_EVENT_BUFFER_PIN) here [2] 5. Suppose backend 2 receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN first, sees it does not hold a conflicting buffer pin, and *then* proceeds to pin the buffer 6. Suppose then backend 1 receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, processes interrupts, and cancels itself. During cleanup, in UnpinBuffer(), it will see the pin count is still > 1 (startup process + backend 2 have it pinned), so it will NOT wake up the startup process. 7. The startup process will only get woken up once backend 2 unpins the buffer and the pin count reaches 1 (or some other signal causes the startup process's latch to be set). Only then will it try to acquire the cleanup lock again and broadcast another PROCSIG_RECOVERY_CONFLICT_BUFFERPIN message if it fails to acquire the cleanup lock. The unexpected behavior is in step (7), where it could be arbitrarily long until the startup process is woken up again. The expected behavior is that since max_standby_streaming_delay has already expired, the startup process should wake up quickly and broadcast another PROCSIG_RECOVERY_CONFLICT_BUFFERPIN message if there are still conflicting backends. I was able to reproduce this scenario with some custom code to control the execution sequence. One way to fix this scenario is to just remove the `if` block here [3] entirely so that we always enable the STANDBY_TIMEOUT and STANDBY_DEADLOCK_TIMEOUT timeouts. [1] https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L805 [2] https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L842 [3] https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L800-L806
В списке pgsql-bugs по дате отправления: