Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
От | Masahiko Sawada |
---|---|
Тема | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Дата | |
Msg-id | CAD21AoC70bOrtMXGA85Ltk8hm0cHFYQ+JfRuofbcQUz3CbHLSg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (shveta malik <shveta.malik@gmail.com>) |
Список | pgsql-hackers |
On Fri, Jun 6, 2025 at 3:02 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Jun 4, 2025 at 3:40 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Jun 4, 2025 at 6:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, May 20, 2025 at 9:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Yeah, I find the idea that the presence of a logical slot will allow > > > > the user to enable logical decoding/replication more appealing than > > > > this new alternative, leaving aside the challenges of realizing it. > > > > +1. This idea appears more user-friendly and easier to understand > > compared to other approaches, such as having multiple GUCs or using > > ALTER SYSTEM. > > > > > I've drafted this idea. Here are summary for attached two patches: > > > > > > 0001 patch allows us to create a logical slot without WAL reservation. > > > > > > 0002 patch is the main patch for dynamically enabling/disabling > > > logical decoding when wal_level is 'replica'. > > > > Thank You for the patches. I have done some initial testing, it seems > > to be working well. I will do more testing and review and will share > > further feedback. > > I reviewed further and had few concerns: Thank you for reviewing this feature! > > 1) > We now invalidate slots on standby if the primary (with > wal_level=replica) has dropped the last logical slot and internally > reverted its runtime (effective) wal_level back to replica. Consider > the following scenario involving a cascaded logical replication setup: > > a) The publisher is configured with wal_level = replica and has > created a publication (pub1). > b) A subscriber server creates a subscription (sub1) to pub1. As part > of the slot creation for sub1, the publisher's effective wal_level is > switched to logical. > c) The publisher also has a physical standby, which in turn has its > own logical subscriber, named standby_sub1. > > At this point, everything works as expected i.e. changes from the > publisher flow through the physical standby and are replicated to > standby_sub1. Now if the user drops sub1, the replication slot on the > primary is also dropped. Since this was the last logical slot, the > primary automatically switches its effective wal_level back to > replica. This change propagates to the standby, causing it to > invalidate the slot for standby_sub1. As a result, the standby logs > the following error: > > STATEMENT: START_REPLICATION SLOT "standby_sub1" LOGICAL 0/0 (...) > ERROR: logical decoding needs to be enabled on the primary > > Even if we manually recreate a logical slot on the primary afterward, > the standby_sub1 subscriber is not able to proceed: > ERROR: can no longer access replication slot "standby_sub1" > DETAIL: This replication slot has been invalidated due to > "wal_level_insufficient". > > So the removal of the logical subscriber for the publisher has somehow > restricted the logical subscriber of standby to work. Is this > behaviour acceptable? > > Without this feature, if I manually switch back wal_level to replica > on primary, then it will fail to start. This makes the issue obvious > and prevents misconfiguration. > FATAL: logical replication slot "sub2" exists, but "wal_level" < "logical" > HINT: Change "wal_level" to be "logical" or higher. > > But the current behaviour is harder to diagnose, as the problem is > effectively hidden behind subscription/slot creation/deletion. The most upstream server in replication configuration would carefully need to keep having at least one logical slot. One way to keep effective_wal_level 'logical' on the publisher where wal_level = 'replica' is to have a logical slot without WAL reservation that is not relevant with any subscriptions. It could require an extra logical slot but seems workable. Does it resolve this concern? > 2) > 'show effective_wal_level' shows output as 'logical' if a slot exists > on primary. But on physical standby, it still shows it as 'replica' > even in the presence of slots. Is this intentional? Yes. I think we should disallow the standbys to create a logical slot as long as they use wal_level = 'replica', because otherwise the standby would need to invalidate the logical slot at a promotion. Which could cause a large down time in a failover case. > 3) > I haven’t tested this yet, but I’d like to discuss what the expected > behavior should be if a slot exists on the primary but is marked as > invalidated. Will an invalidated slot still cause the effective > wal_level to remain at logical, or will invalidating the only logical > slot trigger a switch back to replica? > There is a chance that a slot with un-reserved wal may be invalidated > due to time-out. Good point. I think we don't need to decrease the effective_wal_level to 'replica' even if we invalidate all logical slots. We need neither WAL reservation nor dead tuple retention in order to set effective_wal_level to 'logical' so I think it's straightforward that effective_wal_level value depends on only the presence of logical slots. If dle_replication_slot_timeout affects also logical slots created with immeidately_reserve=false, we might want to exclude them to avoid confusion. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: