Discussion: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions
Hi all,

We have observed an issue where pg_promote() returns false and issues a timeout warning prematurely, even if the standby server is successfully promoted later within the specified timeout period.

Problem Description

The current implementation of pg_promote() calculates a fixed number of loop iterations based on the timeout value, assuming each loop waits exactly 100 ms for the backend latch. However, if the backend receives an unrelated signal (e.g., from client_connection_check_interval), it wakes up early. These repeated, unrelated wakeups cause the loop counter to deplete much faster than intended, leading to a premature timeout.

Reproduction

Set up a standby server with pg_promote() modified so that it does not write the promote file, which blocks the promotion. Then, with client_connection_check_interval = 1, a premature timeout can be triggered consistently. In the example below, a 10-second timeout expires in roughly 107 ms:

    postgres=# set client_connection_check_interval=1;
    SET
    postgres=# \timing
    Timing is on.
    postgres=# select pg_promote(true, 10);
    WARNING:  server did not promote within 10 seconds
    ┌────────────┐
    │ pg_promote │
    ├────────────┤
    │ f          │
    └────────────┘
    (1 row)

    Time: 107.783 ms

Proposed Fix

The attached patch modifies the logic to loop based on the actual elapsed time rather than a fixed number of iterations. This ensures that pg_promote() respects the specified timeout regardless of how many times the backend latch is signaled.

After applying the patch, the timeout behaves as expected:

    postgres=# set client_connection_check_interval=1;
    SET
    postgres=# \timing
    Timing is on.
    postgres=# select pg_promote(true, 10);
    WARNING:  server did not promote within 10 seconds
    ┌────────────┐
    │ pg_promote │
    ├────────────┤
    │ f          │
    └────────────┘
    (1 row)

    Time: 10000.865 ms (00:10.001)

We would like to submit this patch for the community's consideration.

Best regards
Robert Pang
Google
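The difference between the two loop strategies can be sketched with a small standalone simulation. This is not the attached patch and it uses no PostgreSQL code: the simulated clock, the function names, and the WAITS_PER_SECOND constant are assumptions made for illustration, and the only behavior taken from the report is that each loop iteration asks for a 100 ms wait that an unrelated wakeup can cut short.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constant: 10 waits per second means a 100 ms wait slice. */
#define WAITS_PER_SECOND 10

/*
 * Simulated latch wait (hypothetical stand-in, not a PostgreSQL call).
 * The caller asks to sleep timeout_ms, but an unrelated wakeup arrives
 * every wake_ms, so the wait ends at whichever comes first.
 * Returns the simulated number of milliseconds actually slept.
 */
static int64_t
simulated_wait_ms(int64_t timeout_ms, int64_t wake_ms)
{
	assert(timeout_ms > 0 && wake_ms > 0);
	return (wake_ms < timeout_ms) ? wake_ms : timeout_ms;
}

/*
 * Counter-based loop: a fixed number of iterations, each assumed to last
 * a full wait slice. Returns the simulated time elapsed before giving up.
 */
int64_t
elapsed_with_counter_ms(int wait_seconds, int64_t wake_ms)
{
	int64_t		now_ms = 0;

	for (int i = 0; i < wait_seconds * WAITS_PER_SECOND; i++)
		now_ms += simulated_wait_ms(1000 / WAITS_PER_SECOND, wake_ms);

	return now_ms;
}

/*
 * Deadline-based loop: keeps waiting until the simulated clock reaches
 * the deadline, no matter how many early wakeups occur.
 */
int64_t
elapsed_with_deadline_ms(int wait_seconds, int64_t wake_ms)
{
	int64_t		now_ms = 0;
	int64_t		end_ms = now_ms + (int64_t) wait_seconds * 1000;

	while (now_ms < end_ms)
	{
		int64_t		slice_ms = 1000 / WAITS_PER_SECOND;
		int64_t		remaining_ms = end_ms - now_ms;

		/* Never wait past the deadline. */
		if (remaining_ms < slice_ms)
			slice_ms = remaining_ms;

		now_ms += simulated_wait_ms(slice_ms, wake_ms);
	}

	return now_ms;
}
```

With a 10-second timeout and a wakeup every 1 ms, the counter-based loop uses up its 100 iterations after about 100 ms of simulated time, which is in line with the roughly 107 ms observed above, while the deadline-based loop runs for the full 10000 ms.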
Attachments
Re: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions
From
Michael Paquier
Date:
On Wed, Mar 11, 2026 at 09:44:07AM -0700, Robert Pang wrote:
> The current implementation of pg_promote() calculates a fixed number
> of loop iterations based on the timeout value, assuming each loop
> waits exactly 100 ms for the backend latch. However, if the backend
> receives an unrelated signal (e.g., from
> client_connection_check_interval), it wakes up early. These repeated,
> unrelated wakeups cause the loop counter to deplete much faster than
> intended, leading to a premature timeout.

It is true that we can do better here, and your proposal about having
a more precise timeout calculation looks like a sensible improvement
for this case.

No objections regarding your patch. I would like to apply it on HEAD,
if there are no objections.
--
Michael
Attachments
Re: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions
From
getiancheng
Date:
---- Replied Message ----
From: Michael Paquier <michael@paquier.xyz>
Date: 3/25/2026 10:28
To: Robert Pang <robertpang@google.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions

> On Wed, Mar 11, 2026 at 09:44:07AM -0700, Robert Pang wrote:
>> The current implementation of pg_promote() calculates a fixed number
>> of loop iterations based on the timeout value, assuming each loop
>> waits exactly 100 ms for the backend latch. However, if the backend
>> receives an unrelated signal (e.g., from
>> client_connection_check_interval), it wakes up early. These repeated,
>> unrelated wakeups cause the loop counter to deplete much faster than
>> intended, leading to a premature timeout.
>
> It is true that we can do better here, and your proposal about having
> a more precise timeout calculation looks like a sensible improvement
> for this case.
>
> No objections regarding your patch. I would like to apply it on HEAD,
> if there are no objections.
> --
> Michael

Hi all,

Overall LGTM. Just a small comment:

"+ end_time = GetCurrentTimestamp() + wait_seconds * 1000000L;"

I think we can use TimestampTzPlusSeconds(GetCurrentTimestamp(), wait_seconds).

Best regards
Tiancheng Ge
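The review comment above amounts to replacing hand-written microsecond arithmetic with a named helper. The following is a minimal standalone sketch of that idea, assuming timestamps are 64-bit microsecond counts, as the `* 1000000L` in the quoted line suggests. The names timestamp_us and deadline_after_seconds are hypothetical stand-ins and are not PostgreSQL's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamp type: microseconds in 64 bits. */
typedef int64_t timestamp_us;

#define USECS_PER_SECOND INT64_C(1000000)

/*
 * Returns the timestamp that lies wait_seconds after now.
 * The seconds value is widened to 64 bits before the multiplication, so
 * the result does not depend on the platform's width of "long".
 */
static inline timestamp_us
deadline_after_seconds(timestamp_us now, int wait_seconds)
{
	assert(wait_seconds >= 0);
	return now + (int64_t) wait_seconds * USECS_PER_SECOND;
}
```

One plausible benefit of a helper like this, which the thread itself does not spell out, is that the unit conversion lives in one place and no longer relies on an `L` suffix, since `long` is only 32 bits wide on some platforms such as 64-bit Windows.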
Re: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions
From
Robert Pang
Date:
Hi Michael and Tiancheng
On Tue, Mar 24, 2026 at 8:06 PM getiancheng_2012 <18663776784@163.com> wrote:
> Hi all,
>
> Overall LGTM. Just a small comment:
>
> "+ end_time = GetCurrentTimestamp() + wait_seconds * 1000000L;"
>
> I think we can use TimestampTzPlusSeconds(GetCurrentTimestamp(), wait_seconds).
>
> Best regards
> Tiancheng Ge

Thank you for reviewing this patch. The use of TimestampTzPlusSeconds() will be good.
Best regards
Robert Pang
Re: [PATCH] Fix premature timeout in pg_promote() caused by signal interruptions
From
Michael Paquier
Date:
On Wed, Mar 25, 2026 at 09:05:34PM -0700, Robert Pang wrote:
> Thank you for reviewing this patch. The use of TimestampTzPlusSeconds()
> will be good.

Yep. Included the suggestion and applied the patch on HEAD. Thanks,
all.
--
Michael