Re: Allow reading LSN written by walreciever, but not flushed yet
From | Jeremy Schneider
Subject | Re: Allow reading LSN written by walreciever, but not flushed yet
Date |
Msg-id | 20251005164209.04de4c4b@ardentperf.com
In reply to | Re: Allow reading LSN written by walreciever, but not flushed yet (Andrey Borodin <x4mmm@yandex-team.ru>)
List | pgsql-hackers
On Wed, 21 May 2025 18:38:02 +0300
Andrey Borodin <x4mmm@yandex-team.ru> wrote:

> > On 21 May 2025, at 15:03, Fujii Masao <masao.fujii@oss.nttdata.com>
> > wrote:
> >
> > On 2025/05/21 17:35, Andrey Borodin wrote:
> >> Well, we implemented this and made tests that do a lot of
> >> failovers. These tests observed data loss in some infrequent cases
> >> due to wrong new primary selection. Because "few seconds" is
> >> actually unknown random time.
> >
> > I see your point. But doesn't a similar issue exist even with the
> > write LSN? For example, even if node1's write LSN is ahead of
> > node2's at one moment, node2 might catch up or surpass it a few
> > seconds later.
> >
> > If the walreceiver is no longer running, we can assume the write
> > LSN has reached its final value. So by waiting for the walreceiver
> > to exit on both nodes, we can "safely" compare their write LSNs to
> > decide which one is ahead. Also, in this situation, since
> > XLogWalRcvFlush() is called during WalRcvDie(), the flush LSN seems
> > effectively guaranteed to match the write LSN. So it seems also
> > safe to use the flush LSN.
>
> You are right. Receive LSN is meaningless when receive is in
> progress. So the only way to know receive LSN is to stop receiving...
> I need to think more about it.

When we're making a decision about cluster reconfiguration and
promoting a standby to be the new writer, usually the old writer has
stopped sending - so I think we will stop receiving pretty quickly
(network issues notwithstanding). Eventually the in-flight WAL will
get synced and replayed on the replicas.

This thread/request might partly be about whether Postgres cluster
management software can make a promotion decision right away, or
whether it needs to delay and give the system time to sync WAL, or
resort to directly decoding WAL that isn't yet synced. A large and
stressed system could get into a state where fsync takes a while.

I think it simplifies our ability to ensure correctness in cluster
reconfiguration algorithms if we have direct access to the write LSN
for cases where synchronous_commit=remote_write; we can then avoid
resorting to delays or external tools like lwaldump.

With quorum replication, we need to design the promotion logic
carefully to determine which replica has the latest COMMITs that were
acknowledged to the client.

-Jeremy Schneider
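P.S. For illustration only, here's a rough sketch of the workflow Fujii-san
describes above: wait for the walreceiver to exit on each candidate standby,
then compare the now-final receive LSNs to pick a promotion target. It only
touches pg_stat_wal_receiver and pg_last_wal_receive_lsn(), which exist today;
the Python helper names, connection handling, and timeout policy are made up
for the example and are not from any proposed patch.

    # Minimal sketch of promotion-candidate selection under the assumption
    # (per the discussion above) that the receive LSN is only meaningful
    # once the walreceiver has exited, at which point WalRcvDie() has
    # flushed and pg_last_wal_receive_lsn() reflects everything written.
    import time
    import psycopg2

    def lsn_to_int(lsn):
        """Convert a textual LSN like '0/3000060' to an integer for comparison."""
        hi, lo = lsn.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    def final_receive_lsn(dsn, timeout=10.0):
        """Wait for the walreceiver on this standby to exit, then return its
        receive LSN as an integer."""
        deadline = time.monotonic() + timeout
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute("SELECT count(*) FROM pg_stat_wal_receiver")
                if cur.fetchone()[0] == 0:
                    # walreceiver has exited; its final flush covers all
                    # WAL it had written, so this value is stable now.
                    cur.execute("SELECT pg_last_wal_receive_lsn()")
                    lsn = cur.fetchone()[0]
                    if lsn is None:
                        raise RuntimeError("no WAL ever received on %s" % dsn)
                    return lsn_to_int(lsn)
                time.sleep(0.2)
        raise TimeoutError("walreceiver still running on %s" % dsn)

    def pick_promotion_candidate(dsns):
        """Return the DSN of the standby that received the most WAL."""
        lsns = {dsn: final_receive_lsn(dsn) for dsn in dsns}
        return max(lsns, key=lsns.get)

The point of the thread, as I read it, is whether the "wait for the
walreceiver to exit" step (or an external tool like lwaldump) can be avoided
by exposing the write LSN directly.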