Re: Allow reading LSN written by walreciever, but not flushed yet

From Jeremy Schneider
Subject Re: Allow reading LSN written by walreciever, but not flushed yet
Date
Msg-id 20251005164209.04de4c4b@ardentperf.com
In reply to Re: Allow reading LSN written by walreciever, but not flushed yet  (Andrey Borodin <x4mmm@yandex-team.ru>)
List pgsql-hackers
On Wed, 21 May 2025 18:38:02 +0300
Andrey Borodin <x4mmm@yandex-team.ru> wrote:

> > On 21 May 2025, at 15:03, Fujii Masao <masao.fujii@oss.nttdata.com>
> > wrote:
> > 
> > On 2025/05/21 17:35, Andrey Borodin wrote:
> >> Well, we implemented this and made tests that do a lot of
> >> failovers. These tests observed data loss in some infrequent cases
> >> due to wrong new primary selection. Because "few seconds" is
> >> actually unknown random time.
> > 
> > I see your point. But doesn't a similar issue exist even with the
> > write LSN? For example, even if node1's write LSN is ahead of
> > node2's at one moment, node2 might catch up or surpass it a few
> > seconds later.
> > 
> > If the walreceiver is no longer running, we can assume the write
> > LSN has reached its final value. So by waiting for the walreceiver
> > to exit on both nodes, we can "safely" compare their write LSNs to
> > decide which one is ahead. Also, in this situation, since
> > XLogWalRcvFlush() is called during WalRcvDie(), the flush LSN seems
> > effectively guaranteed to match the write LSN. So it seems also
> > safe to use the flush LSN.
> 
> You are right. Receive LSN is meaningless when receive is in
> progress. So the only way to know receive LSN is to stop receiving...
> I need to think more about it.

When we're making a decision about cluster reconfiguration and
promoting a standby to be the new writer, usually the writer has
stopped sending - so I think we will stop receiving pretty quickly
(network issues notwithstanding).

Eventually the in-flight WAL will get sync'd and replayed on replicas.
This thread/request might partly be about whether postgres cluster
management software can make a promotion decision right away or whether
it needs to delay and give the system time to sync WAL, or resort to
directly decoding WAL which isn't yet sync'd.

A large and stressed system could get into a state where fsync takes
a while. I think direct access to the write LSN makes it simpler to
ensure correctness in cluster reconfiguration algorithms when
synchronous_commit=remote_write; we can then avoid resorting to delays
or external tools like lwaldump.

With quorum replication, we need to design promotion logic carefully to
determine which replica has the latest COMMITs that were acknowledged
to the client.
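
To make the comparison step concrete, here is a rough sketch (Python,
purely illustrative) of how management software might pick the most
advanced candidate once it has collected a final write LSN from each
replica. The function names and the dict of node names to LSN strings
are hypothetical; how those LSNs are obtained is exactly the open
question in this thread. The only thing taken as given is the textual
LSN format: two hex numbers separated by a slash, the high and low 32
bits of a 64-bit position.

```python
from typing import Dict, Optional


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_most_advanced(write_lsns: Dict[str, str]) -> Optional[str]:
    """Return the node with the highest LSN, or None if there are no candidates.

    write_lsns maps a (hypothetical) node name to the LSN text that node
    reported after its walreceiver stopped. Ties go to the first node seen.
    """
    if not write_lsns:
        return None
    return max(write_lsns, key=lambda node: parse_lsn(write_lsns[node]))
```

Note that the LSNs have to be compared numerically, not as strings:
'F/0' sorts after '10/0' as text even though '10/0' is the later
position.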

-Jeremy Schneider



