Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
От | Masahiko Sawada |
---|---|
Тема | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Дата | |
Msg-id | CAD21AoDkppZA6XcaPhiC-E6_03E9_Kxx=B3ecUNRoOs7WyKDHg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (shveta malik <shveta.malik@gmail.com>) |
Список | pgsql-hackers |
On Tue, Oct 14, 2025 at 2:24 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Oct 14, 2025 at 2:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > On Fri, Oct 10, 2025 at 1:54 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Fri, Oct 10, 2025 at 9:32 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > On Thu, Oct 9, 2025 at 4:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > Thank you for the comments. > > > > > > > > > > I've attached the updated patch. As we discussed, it now uses the lazy > > > > > deactivation in all cases and has the discussed points in the comments > > > > > atop logicalctl.c. > > > > > > > > > > > > > Thank You for making the changes. I am in the process of verifying the > > > > patch. Please find one concern. > > > > Thank you for reviewing the patch! > > > > > > > > > > When I promote a standby, there's a brief period during > > > > recovery-completion when status_change_allowed is set to true and also > > > > RecoveryInProgress() is still true. If I try to create a logical > > > > replication slot on the standby during this window, it fails with the > > > > following error: > > > > > > > > postgres=# SELECT pg_create_logical_replication_slot('slot1', > > > > 'pgoutput', false, false, false); > > > > ERROR: logical decoding on standby requires "effective_wal_level" >= > > > > "logical" on the primary > > > > HINT: To enable logical decoding, set "wal_level" >= "logical" or > > > > create at least one logical slot when "wal_level" = "replica". > > > > > > > > Note that the primary already has logical slots. This issue occurs > > > > because logical decoding is disabled by > > > > UpdateLogicalDecodingStatusEndOfRecovery() due to no existing slots on > > > > standby while RecoveryInProgress() is still true. Should > > > > CheckLogicalDecodingRequirements() also consider > > > > 'status_change_allowed' on standby to handle this transition more > > > > gracefully? > > > > Good point. I think we can check status_change_allowed instead of > > RecoveryInProgress(). Fixed the patch accordingly. > > > > Yes, you are right. Even RecoveryInProgress() is not needed. > Verified, it works well now. > > > > > > > 1) > > > I am creating a subscription for publication present on standby, I get > > > this error: > > > > > > postgres=# CREATE subscription sub1 connection 'dbname=postgres > > > host=localhost user=shveta port=5434' publication pub1; > > > ERROR: could not create replication slot "sub1": ERROR: logical > > > decoding on standby requires "effective_wal_level" >= "logical" on the > > > primary > > > HINT: To enable logical decoding, set "wal_level" >= "logical" or > > > create at least one logical slot when "wal_level" = "replica". > > > > > > Should Hint also mention 'on primary' i.e. 'To enable logical decoding > > > on primary, set..' , otherwise it could be slightly confusing to > > > follow HINT. > > > > Fixed. > > > > > > > > 2) > > > CheckLogicalDecodingRequirements(): > > > > > > Is there a reason for moving 'logical decoding on standby requires..' > > > error before 'logical decoding requires a database connection' error? > > > Shall we keep the same order as earlier? > > > > Fixed. > > > > > > > > 3) > > > GetActiveWalLevelOnStandby() is not being used anywhere now. > > > > Right. However, it's an exposed function so I'm somewhat hesitant with > > removing it. > > > > > > > > 4) > > > + if (!old_pending_disable) > > > + { > > > + volatile PROC_HDR *procglobal = ProcGlobal; > > > + ProcNumber checkpointerProc = procglobal->checkpointerProc; > > > + > > > + /* Wake up the checkpointer */ > > > + if (checkpointerProc != INVALID_PROC_NUMBER) > > > + SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch); > > > + } > > > > > > Why are we waking the checkpointer only when pending_disable changes > > > from false to true? When this function is called and > > > old_pending_disable is still true, wouldn’t that be an even stronger > > > reason to wake the checkpointer? > > > > I guess even if we set the checkpointer's latch multiple times, it > > would not help wake it up quickly because the checkpointer might be > > just busy with its checkpointing work. > > Okay, I see your point. > > > Having said that, it would help > > simplify the code structure a bit. Given that > > RequestDisableLogicalDecoding() function expects to be called only > > after dropping the possibly-last logical slot, it would not be harmful > > even if we set its latch anyway. And it might help the case where the > > checkpointer is temporarily not working and we could not set it latch. > > So I took your idea. Thanks! > > Okay. > > > > > > > > > 5) > > > + /* > > > + * Update shmem flags. We don't need to care about the order of setting > > > + * global flag and writing the WAL record writes are not allowed yet. > > > + */ > > > > > > Comment is somewhat problematic. > > > > > > 6) > > > > > > ReportSlotInvalidation(): > > > + appendStringInfoString(&err_detail, _("Logical decoding on standby > > > requires \"wal_level\" >= \"logical\" or at least one logical slot on > > > the primary server.")); > > > > > > This message could be a little ambiguous. It is not completely clear > > > that that wal_level=logical is needed on standby or on primary. How > > > about below msg: > > > > > > Logical decoding on standby requires the primary server to have either > > > wal_level >= logical or at least one logical replication slot created. > > > > > > 7) > > > + "effective_wal_level got decreased to 'replica' on the primary to > > > invalidate stnadby's slots" > > > > > > stnadby --> standby > > > > Fixed above points. > > > > > > > > 8) > > > + # Trigger promotion with no wait for the startup process to reach the > > > + # injection point. > > > + $standby5->safe_psql('postgres', qq[select pg_promote(false)]); > > > + note('promote the standby and waiting for injection_point'); > > > + $standby5->wait_for_event('startup', > > > + 'startup-logical-decoding-status-change-end-of-recovery'); > > > + note( > > > + "injection_point > > > 'startup-logical-decoding-status-change-end-of-recovery' is reached" > > > + ); > > > > > > Is the comment correct here? We are waiting to reach the injection > > > point? But the comment says 'no wait for'? > > > > I meant to promote the server without wait (i.e., passing false to > > pg_promote() function). I modified the comments. > > > > > > > > 9) > > > + # Drop the logical slot, requesting to disable logical decoding to > > > the checkpointer. > > > + # It has to wait for the recovery to complete before disabling > > > logical decoding. > > > > > > Double space in 'It has'. > > > > > > 10) > > > + # Drop the logical slot, requesting to disable logical decoding to > > > the checkpointer. > > > + # It has to wait for the recovery to complete before disabling > > > logical decoding. > > > + $standby5->safe_psql('postgres', > > > + qq[select pg_drop_replication_slot('standby5_slot');]); > > > + > > > + # Resume the startup process to complete the recovery. > > > + $standby5->safe_psql('postgres', > > > + qq[select injection_points_wakeup('startup-logical-decoding-status-change-end-of-recovery')] > > > + ); > > > > > > I think between these 2 steps, we should check logs for 'waiting for > > > recovery completion to change logical decoding status'. > > > The test will become more clear then. > > > > Fixed the above points. > > > > I've attached the updated version patch. > > > > Thanks. I was testing slot-sync worker flow and noticed that even if > 'sync_replication_slots' is enabled, it takes quite some time (1-2 > mins) for slot sync worker to start. On noticing > XLOG_LOGICAL_DECODING_STATUS_CHANGE, isn't there a way to (or > shouldn't we) wakeup postmaster to start slotsync worker immediately? > > Logs for timestamps reference: > > 2025-10-14 14:41:32.852 IST [208475] LOG: replication slot > synchronization requires "effective_wal_level" >= "logical" on the > primary > 2025-10-14 14:41:32.852 IST [208475] HINT: To enable logical decoding > on primary, set "wal_level" >= "logical" or create at least one > logical slot when "wal_level" = "replica". > 2025-10-14 14:41:35.499 IST [208481] LOG: update logical decoding status to 1 > 2025-10-14 14:41:35.499 IST [208481] CONTEXT: WAL redo at 0/03000060 > for XLOG/LOGICAL_DECODING_STATUS_CHANGE: true > 2025-10-14 14:42:53.611 IST [208529] LOG: slot sync worker started > ------ I think that this particular situation happened because the postmaster was sleeping and didn't have any tasks. If a new connection comes in or the existing connection exits, the postmaster launches the slotsync worker. I believe it would not be a serious issue in practice. > There was a lot of discussion regarding slotsync worker in the past, > please let me know if it is concluded and I missed it somehow. Since we've discussed that it's out of scope of this patch that we shutdown dynamically upon logical decoding status changes, even if logical decoding status gets disabled on the standby, the slotsync worker keeps working. If it gets enabled again, we don't need to wake up the postmaster as the slotsync worker would already be running. I think we can introduce more dynamic control over the slotsync worker in a separate patch if we need. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: