Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
From | Masahiko Sawada |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAD21AoB=Rf-SASOJR2WqvWcrA5Q3S2oUBACVLdJPaA8x6EchBA@mail.gmail.com |
In response to | RE: POC: enable logical decoding when wal_level = 'replica' without a server restart ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-hackers |
On Thu, Jul 31, 2025 at 5:00 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > I thought we could fix this issue by checking the number of in-use
> > logical slots while holding ReplicationSlotControlLock and
> > LogicalDecodingControlLock, but it seems we need to deal with another
> > race condition too between backends and startup processes at the end
> > of recovery.
> >
> > Currently the backend skips controlling logical decoding status if the
> > server is in recovery (by checking RecoveryInProgress()), but it's
> > possible that a backend process tries to drop a logical slot after the
> > startup process calling UpdateLogicalDecodingStatusEndOfRecovery() and
> > before accepting writes.
>
> Right. I also verified on local and found that
> ReplicationSlotDropAcquired()->DisableLogicalDecodingIfNecessary() sometimes
> skips to modify the status because RecoveryInProgress is still false.
>
> > In this case, the backend ends up not
> > disabling logical decoding and it remains enabled. I think we would
> > somehow need to delay the logical decoding status change in this
> > period until the recovery completes.
>
> My primitive idea was to 1) keep startup acquiring the lock till end of recovery
> and 2) DisableLogicalDecodingIfNecessary() acquires lock before checking the
> recovery status, but it could not work well. Not sure but WaitForProcSignalBarrier()
> stucked if the process acquired LogicalDecodingControlLock lock....

I don't think it's realistic to keep holding an LWLock until recovery
actually completes, because we perform a checkpoint after that. In the
latest version of the patch (attached), I introduced a flag in shared
memory to delay any logical decoding status change until recovery
completes. The implementation got more complex than I expected, but I
don't have a better idea. I'm open to other approaches.

Also, I incorporated all the comments I've received so far [1][2][3] and
updated the documentation.
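
To make the idea of delaying the status change concrete, here is a minimal,
single-process sketch in plain C. It is not code from the patch: every
struct, field, and function name below is hypothetical, locking is left out
entirely (real code would have to do this under a lock such as
LogicalDecodingControlLock), and it only models the one scenario discussed
above, where the last logical slot is dropped between the end-of-recovery
decision and the point where recovery is actually complete.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for state kept in shared memory. Every name here is
 * invented for illustration; none of it is taken from the patch.
 */
typedef struct DemoDecodingCtl
{
    bool in_recovery;               /* models RecoveryInProgress() */
    bool logical_decoding_enabled;  /* the status being controlled */
    bool delay_status_change;       /* set by startup at end of recovery */
    bool pending_disable;           /* a change requested while delayed */
    int  n_logical_slots;           /* in-use logical slots */
} DemoDecodingCtl;

/* Start in recovery; decoding is enabled only if a logical slot exists. */
void
demo_init(DemoDecodingCtl *ctl, int n_logical_slots)
{
    assert(n_logical_slots >= 0);
    ctl->in_recovery = true;
    ctl->logical_decoding_enabled = (n_logical_slots > 0);
    ctl->delay_status_change = false;
    ctl->pending_disable = false;
    ctl->n_logical_slots = n_logical_slots;
}

/* Startup process: decide the status at end of recovery, open the window. */
void
demo_startup_end_of_recovery(DemoDecodingCtl *ctl)
{
    ctl->logical_decoding_enabled = (ctl->n_logical_slots > 0);
    ctl->delay_status_change = true;
}

/* Backend: drop one logical slot, then disable decoding if it was the last. */
void
demo_drop_logical_slot(DemoDecodingCtl *ctl)
{
    assert(ctl->n_logical_slots > 0);
    ctl->n_logical_slots--;
    if (ctl->n_logical_slots > 0)
        return;                         /* other slots still need decoding */

    if (ctl->delay_status_change)
        ctl->pending_disable = true;    /* defer instead of skipping */
    else if (!ctl->in_recovery)
        ctl->logical_decoding_enabled = false;
    /* else: still replaying WAL, so the backend does not touch the status */
}

/* Startup process: recovery is over; apply any deferred change. */
void
demo_recovery_completed(DemoDecodingCtl *ctl)
{
    ctl->in_recovery = false;
    ctl->delay_status_change = false;
    if (ctl->pending_disable && ctl->n_logical_slots == 0)
        ctl->logical_decoding_enabled = false;
    ctl->pending_disable = false;
}
```

Calling demo_init() with one slot, then demo_startup_end_of_recovery(),
demo_drop_logical_slot(), and demo_recovery_completed() in that order leaves
logical decoding disabled at the end, whereas without the delay flag the
drop would have been skipped and decoding would have stayed enabled. Whether
the real implementation records a pending request like this or makes the
backend wait is a design choice this sketch does not try to settle.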
Regards,

[1] https://www.postgresql.org/message-id/CALDaNm3BfG1hpWVEaqwBgXpcEGSQXDi536OzB2%3D8SFTz-v%2B3CA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJpy0uDxap0YKLx5N45_Vz49QARjioUaOb1qpaiV0PBkYoivRg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/OSCPR01MB149663D242F6E97630758DD6EF55AA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com