Re: sync_standbys_defined and pg_stat_replication
От | Ants Aasma |
---|---|
Тема | Re: sync_standbys_defined and pg_stat_replication |
Дата | |
Msg-id | CANwKhkMTkckNSoz0kUYFvaMDA_6g2uuABcvhoowu64e9e+op=A@mail.gmail.com обсуждение исходный текст |
Ответ на | sync_standbys_defined and pg_stat_replication (Jeremy Schneider <schneider@ardentperf.com>) |
Ответы |
Re: sync_standbys_defined and pg_stat_replication
|
Список | pgsql-hackers |
On Tue, 7 Oct 2025 at 08:59, Jeremy Schneider <schneider@ardentperf.com> wrote: > For failover to work correctly, if someone changes the GUC > synchronous_standby_names to enable sync replication, then we need to > understand the exact moment when backends will begin to block in order > to correctly determine when we can failover without data loss. There is an early out in SyncRepWaitForLSN() when WalSndCtl->sync_standbys_status has SYNC_STANDBY_DEFINED unset. That flag gets set by the checkpointer in SyncRepUpdateSyncStandbysDefined() via CheckpointWriteDelay() among other places. But only when it's not executing a fast checkpoint or it's not behind on checkpoints. In other words, synchronous_standby_names will not become effective until checkpointer has some downtime. While this is a small problem on its own, there is no way to check if this has happened or not. For the config update getting delayed the fix seems simple - just do the config update unconditionally. Patch attached. For the other problem, my thinking is to provide a new function that allows a user to check if synchronous replication is active. Ideally this function would give other information also needed by cluster managers. Specifically when a replica is removed from synchronous standby names we would need still need to consider that replica as a potential synchronous replica until a quorum matching the current synchronous_standby_names setting overtakes the last LSN confirmed by a replica matching the removed name. To illustrate the situation where this is needed, consider s_s_n = 'ANY 1 (A B)'. While this setting is active we have to check latest replicated LSN from both A and B to know which one to promote. Lets say transaction X is replicated to A and confirmed, but not yet to B. Now A is removed so s_s_n becomes 'ANY 1 (B)'. Based on this setting it is always safe to promote B, but until B receives the LSN that was on primary when synchronous_standby_names was changed, it might not have all the data. This one is possible to work around by checking the relevant values from pg_stat_replication, but it would be nice to have a neater interface. My proposal is something like this: postgres=# SELECT * FROM pg_sync_replication_status(); is_active | synchronous_standby_names | has_quorum -----------+---------------------------+------------ t | ANY 1 (A B) | f (1 row) Thoughts? Regards, Ants Aasma
Вложения
В списке pgsql-hackers по дате отправления: