Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Дата
Msg-id CAD21AoARX01kBsech_e_wETTQ9ZL+N-UC0OhTPvmYH9Zc1kdzA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: POC: enable logical decoding when wal_level = 'replica' without a server restart  (Peter Smith <smithpb2250@gmail.com>)
Список pgsql-hackers
On Tue, Dec 2, 2025 at 2:39 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Dec 1, 2025 at 4:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Nov 27, 2025 at 12:33 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Sawada-San.
> > >
> > > Some review comments for v30-0001.
> >
> > Thank you for the comments!
> >
> > > ======
> > > src/backend/replication/logical/logicalctl.c
> > >
> > > 6b.
> > > Would it be better to have a self-documenting enum {UNKNOWN, ENABLED,
> > > DISABLED} instead of some magic int values that need lots of comments
> > > explaining how to use them?
> >
> > No, we should not cast enum values directly to a boolean. We have a
> > precedent of such usage like LocalXLogInsertAllowed, and I don't think
> > we can reduce the comments even if we use a self-documenting enum
> > since the usage is complicated. We could consider #define values 0, 1,
> > and -1 but it seems to be overkill to this usage.
> >
>
> I never suggested casting enum values directly to booleans. FYI - I've
> attached a diff patch to demonstrate what I meant. There are fewer
> comments since there is no need to explain special values -1,0,1.

Thank you for the patch. I understood what you meant but I don't think
it's a good idea. While it removes three lines of comment, it
introduces a conditional branch:

- XLogLogicalInfoXactCache = (int) XLogLogicalInfo;
+ XLogLogicalInfoXactCache = XLogLogicalInfo ?
+     CACHE_XLOGLOGICALINFO_ENABLED :
+     CACHE_XLOGLOGICALINFO_DISABLED;

Since this function can be called in hot paths I would choose the
original implementation, and believe that LocalXLogInsertAllowed uses
an integer value for the same reason.

>
> //////////
>
> Anyway, here are some more review comments for v31-0001.
>
> ======
> src/backend/replication/logical/decode.c
>
> 1.
> + /*
> + * a XLOG_PARAMETER_CHANGE record might change wal_level to
> + * 'replica' or 'minimal', but we don't check the logical decoding
> + * availability here because it's updated by a
> + * XLOG_LOGICAL_DECODING_STATUS_CHANGE record.
> + */
>
>
> Missed a previous review comment. Start this comment with a capital 'A'.
>
> ======
> src/backend/replication/logical/logicalctl.c
>
> CheckXLogLogicalInfo:
>
> 2.
> + XLogLogicalInfoXactCache = (int) XLogLogicalInfo;
> +
> + return XLogLogicalInfoXactCache;
> +}
>
>
> This function should be returning the bool -- i.e. the
> XLogLogicalInfo, not the int XLogLogicalInfoCache.

Fixed.

>
> ======
> src/test/recovery/t/050_effective_wal_level.pl
>
> 3.
> +# Wait for the checkpointer to decrease effective_wal_level to 'replica'.
> +sub wait_for_logical_decoding_disabled
> +{
> + my ($node) = @_;
> +
> + $node->poll_query_until('postgres',
> + qq[select current_setting('effective_wal_level') = 'replica';]);
> +}
>
> Should this subroutine also execute a CHECKPOINT here? Otherwise, it
> might be a long wait before the checkpoint happens automatically.
> Also, it would align better with the comment; otherwise, this
> subroutine has nothing really to do with checkpoints at all.

I don't think we need to execute a CHECKPOINT since we wake up the
checkpointer process after initiating a disable request. Did you see a
case where it took a long time to decrease effective_wal_level to
'replica'? If so and it's not because the checkpointer was busy with a
checkpointing, we should fix it separately.

> ~~~
>
> 4.
> +# Initialize the primary server with wal_level = 'replica'.
> +my $primary = PostgreSQL::Test::Cluster->new('primary');
> +$primary->init(allows_streaming => 1);
> +$primary->append_conf('postgresql.conf', "log_min_messages = debug1");
> +$primary->start();
>
> Would it be nicer to explicitly set wal_level? Otherwise, perhaps the
> comment should mention that this is the default.

Sorry, I don't understand the comment. IIUC specifying
'allows_streaming => 1' to init() sets 'replica' to wal_level. If it's
omitted, 'minimal' is set to wal_level. Explicitly setting wal_level =
'replica' is redundant with what init(allows_streaming => 1) does, and
the comment already mentions that the server starts with
wal_level='replica'.

>
> ~~~
>
> 5.
> +# Create a physical slot.
> +$primary->safe_psql('postgres',
> + qq[select pg_create_physical_replication_slot('test_phy_slot', false, false)]
> +);
> +
> +# Physical slots don't affect effective_wal_level.
> +test_wal_level($primary, "replica|replica",
> + "effective_wal_level doesn't change with a new physical slot");
> +$primary->safe_psql('postgres',
> + qq[select pg_drop_replication_slot('test_phy_slot')]);
>
> That seems like a single test case, so the code/comments can be combined.
>
> SUGGESTION
> Check that creating a physical slot doesn't affect effective_wal_level.
>
> ~~~
>
> 6.
> +# Create a temporary logical slot but exits without releasing it explicitly.
> +# This enables logical decoding but skips disabling it and delegates to the
> +# checkpointer.
>
> typo: /but exits/ but exit/
>
> ~~~
>
> 7.
> +# Create a new logical slot and check if effective_wal_level must be increased
> +# to 'logical'.
>
> typo: /if effective_wal_level must be increased/that
> effective_wal_level is increased/
>
> ~~~
>
> 8.
> +# Check if the server cannot start with wal_level='minimal' as long as there is
> +# at least one replication slot.
> +$primary->adjust_conf('postgresql.conf', 'wal_level', 'minimal');
> +$primary->adjust_conf('postgresql.conf', 'max_wal_senders', '0');
> +$primary->stop;
> +
> +command_fails(
> + [
> + 'pg_ctl',
> + '--pgdata' => $primary->data_dir,
> + '--log' => $primary->logfile,
> + 'start',
> + ],
> + "cannot server with wal_level='minimal' as there is in-use logical slot");
> +
>
> 8a.
> Tweak comment.
>
> SUGGESTION
> Verify that the server cannot start with wal_level='minimal' when
> there is at least one replication slot.
>
> ~
>
> 8b.
> Typo (missing word "start"?) in "cannot server"
> Typo /is in-use/is an in-use/
>

Fixed.

> ~~~
>
> 9.
> +# Revert the modified settings.
> +$primary->adjust_conf('postgresql.conf', 'wal_level', 'replica');
> +$primary->adjust_conf('postgresql.conf', 'max_wal_senders', '10');
> +
> +# Add other settings to test if we disable logical decoding when
> invalidating the last
> +# logical slot.
> +$primary->append_conf(
> + 'postgresql.conf',
> + qq[
> +min_wal_size = 32MB
> +max_wal_size = 32MB
> +max_slot_wal_keep_size = 16MB
> +]);
> +$primary->start;
>
> Can those setting changes all be combined so there is only one primary->start?

I think there already only one primary->start in the above code chunk.
What is your suggestion?

>
> ~~~
>
> 10.
> +# Check if logical decoding is disabled after invalidating the last
> logical slot.
>
> There are quite a few "check if" comments like this one that could be
> worded more strongly. I won't make review comments for all of them,
> but here is an example.
>
> SUGGESTION:
> Verify that logical decoding is disabled after invalidating the last
> logical slot.

Fixed related places.

>
> ~~~
>
> 11.
> +# Recreate the logical slot to enable logical decoding again.
> +$primary->safe_psql('postgres',
> + qq[select pg_drop_replication_slot('test_slot')]);
> +$primary->safe_psql('postgres',
> + qq[select pg_create_logical_replication_slot('test_slot', 'pgoutput')])
>
> Should this also call 'test_wal_level' subroutine to verify the states?

I think that path is already covered by other tests. For example we have:

# Create a new logical slot and check that effective_wal_level must be increased
# to 'logical'.
$primary->safe_psql('postgres',
        qq[select pg_create_logical_replication_slot('test_slot', 'pgoutput')]);
test_wal_level($primary, "replica|logical",
        "effective_wal_level increased to 'logical' upon a logical
slot creation"
);

> ~~~
>
> 12.
> +# Initialize standby2 node and start it with wal_level = 'logical'.
> +my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
> +$standby2->init_from_backup($primary, 'my_backup', has_streaming => 1);
>
> I got confused around here ... Probably I am mistaken, but I thought:
>
> First, there was $primary (a primary) and $standby1 (a standby)
> Then, the $standby1 got promoted (so now $standby1 is a primary and
> $primary is the standby)
>
> And, now we are making another $standby2 from the $primary.
>
> AFAIK, that $primary is no longer a primary instance, right?
>
> (???)
>
> So, if $primary is not really the primary anymore, then now I am
> unsure if later test comments like below are correct:
>
> +# Regardless of their wal_level values, effective_wal_level values on the
> +# standby and the cascaded standby depend on the primary's value, 'logical'.
>
> and
>
> +# Drop the primary's last logical slot, decreasing effective_wal_level to
> +# 'replica' on all nodes.
>
> etc.

In 050_effective_wal_level.pl, we repeat to create a standby server of
$primary and promote it, in order to test the behavior at the
promotion. Once the standby is promoted, we do some tests against it
and stop it, and then create another standby for $primary for other
tests. Therefore, the $primary never becomes a standby and we don't
use $standby{1,2,3,4} as the primary server.

>
> ~~~
>
> 13.
> +# Check if the effective_wal_level remains 'logical' on both nodes
>
> typo: "the both"

Fixed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления: