Thread: logical decoding and replication of sequences


logical decoding and replication of sequences

From: Tomas Vondra
Date:
Hi,

One of the existing limitations of logical decoding / replication is
that it does not care about sequences. The annoying consequence is that
after a failover to a logical replica, all the table data may be
replicated but the sequences are still at their initial values, requiring
some custom solution that moves the sequences forward enough to prevent
duplicate values.

There have been attempts to address this in the past, most recently [1],
but none of them got in due to various issues.

This is a new attempt, based on [1] (but with many significant parts
added or reworked), to deal with this. The primary purpose of sharing
it is to get feedback and opinions on the design decisions. It's still
a WIP - it works fine AFAICS, but some of the bits may be a bit hackish.

The overall goal is to have the same sequence data on the primary and
logical replica, or something sufficiently close to that, so that the
replica after a failover does not generate duplicate values.

This patch does a couple of basic things:

1) extends logical decoding to handle sequences. It adds a new
    callback, similar to what we have for messages. There's a bit of
    complexity with transactional vs. non-transactional behavior, more
    about that later

2) extends test_decoding to support this new callback, printing the
    sequence increments (the decoded WAL records)

3) extends built-in replication to support sequences, so publications
    may contain both tables and sequences, sequence data gets synced
    when creating subscriptions, etc.


transactional vs. non-transactional
-----------------------------------

The first part (extending logical decoding) is simple in principle. We
simply decode the sequence updates, but then comes a challenge - should
we treat them transactionally and stash them in the reorder buffer, or
pass them to the output plugin right away?

For messages, this can be specified as a flag when adding the message,
so the user can decide depending on the message purpose. For sequences,
all we do is nextval(), and the right behavior depends on the context in
which it's used - we can't just pick one of the two approaches.

Consider this, for example:

   CREATE SEQUENCE s;
   BEGIN;
     SELECT nextval('s') FROM generate_series(1,1000) s(i);
   ROLLBACK;

If we handle this "transactionally", we'd stash the "nextval" increment
into the transaction, and then discard it due to the rollback, so the
output plugin (and replica) would never get it. So this is an argument
for non-transactional behavior.

On the other hand, consider this:

   CREATE SEQUENCE s;
   BEGIN;
      ALTER SEQUENCE s RESTART WITH 2000;
      SELECT nextval('s') FROM generate_series(1,1000) s(i);
   ROLLBACK;

In this case the ALTER creates a new relfilenode, and the ROLLBACK
discards it, including the effects of the nextval calls. So here we
should treat it transactionally, stash the increment(s) in the
transaction and just discard it all on rollback.

A somewhat similar example is this:

   BEGIN;
      CREATE SEQUENCE s;
      SELECT nextval('s') FROM generate_series(1,1000) s(i);
   COMMIT;

Again - the decoded nextval needs to be handled transactionally, because
otherwise it's going to be very difficult for custom plugins to combine
this with DDL replication.

So the patch does a fairly simple thing:

1) By default, sequences are treated non-transactionally, i.e. sent to
    the output plugin right away.

2) We track sequences created in running (sub)transactions, and those
    are handled transactionally. This includes the ALTER SEQUENCE cases,
    which create a new relfilenode that is then used as an identifier.

It's a bit more complex, because of cases like this:

   BEGIN;
      CREATE SEQUENCE s;
      SAVEPOINT a;
      SELECT nextval('s') FROM generate_series(1,1000) s(i);
      ROLLBACK TO a;
   COMMIT;

because we must not discard the nextval changes - this is handled by
always stashing the nextval changes into the subxact in which the
sequence relfilenode was created.

The tracking is a bit cumbersome - there's a hash table mapping each
relfilenode to the XID of the transaction in which it was created.
AFAICS that works, but it might be an issue with many sequences created
in running transactions. Not sure.
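The tracking scheme described above can be sketched as a toy model (Python, all names hypothetical - this is an illustration of the bookkeeping, not actual reorderbuffer code): relfilenodes created in in-progress transactions go into a hash, and a decoded sequence change is treated transactionally only if its relfilenode is found there, in which case it's stashed into the creating (sub)transaction.

```python
class SequenceTracker:
    """Toy model of the relfilenode -> creating-XID hash described above."""

    def __init__(self):
        # relfilenode -> XID of the (sub)transaction that created it
        self.created_in = {}

    def record_creation(self, relfilenode, xid):
        self.created_in[relfilenode] = xid

    def classify(self, relfilenode):
        """Return (transactional, stash_xid) for a decoded sequence change."""
        xid = self.created_in.get(relfilenode)
        if xid is None:
            return (False, None)   # pass to the output plugin right away
        return (True, xid)         # stash into the creating (sub)xact

    def forget(self, xids):
        # on commit/abort of the top-level xact, drop its relfilenodes
        self.created_in = {fn: x for fn, x in self.created_in.items()
                           if x not in xids}

tracker = SequenceTracker()
tracker.record_creation(relfilenode=16385, xid=100)  # CREATE SEQUENCE in xact 100
print(tracker.classify(16385))   # (True, 100) -> transactional
print(tracker.classify(99999))   # (False, None) -> non-transactional
```

Stashing into the *creating* XID (rather than the XID doing the nextval) is what makes the SAVEPOINT example above work: a later ROLLBACK TO cannot discard the changes.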


detecting sequence creation
---------------------------

Detection that a sequence (or rather its relfilenode) was created is
done by adding a "created" flag to xl_seq_rec, and setting it to
"true" in the first WAL record after the creation. There might be some
other way, but this seemed simple enough.


applying the sequence (ResetSequence2)
--------------------------------------

The decoding pretty much just extracts last_value, log_cnt and is_called
from the sequence, and passes them to the output plugin. On the replica
we extract those from the message and write them to the local sequence
using a new ResetSequence2 function.

It's possible we don't really need log_cnt and is_called. After all,
log_cnt is zero most of the time anyway, and the worst thing that could
happen if we ignore it is we skip a couple values (which seems fine).
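To make the semantics of the replicated state concrete, here is a tiny illustration (a hypothetical helper, not PostgreSQL code) of what the (last_value, is_called) pair means for the next nextval() call, assuming an ascending sequence:

```python
def next_after_reset(last_value, is_called, increment=1):
    """First value nextval() would hand out after applying the replicated
    state. If is_called is false, last_value itself has not been handed
    out yet; otherwise the next value follows it."""
    return last_value + increment if is_called else last_value

print(next_after_reset(33, is_called=True))    # 34
print(next_after_reset(1, is_called=False))    # 1
```

Ignoring log_cnt (i.e. treating it as zero) merely forces the sequence to WAL-log a new chunk earlier, which at worst skips a few values - consistent with the point above.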


syncing sequences in a subscription
-----------------------------------

After creating a subscription, the sequences get synchronized just like
tables. This part is a bit hacked together, and there's definitely room
for improvement - e.g. a new bgworker is started for each sequence, as
we simply treat both tables and sequences as "relations". But all we
need to do for sequences is copy the (last_value, log_cnt, is_called)
state and call ResetSequence2, so maybe we could sync all sequences in
a single worker, or something like that.


new "sequence" publication action
---------------------------------

The publications now have a new "sequence" publication action, which is
enabled by default. This determines whether the publication decodes
sequences or not.


FOR ALL SEQUENCES
-----------------

It should be possible to create FOR ALL SEQUENCES publications, just
like we have FOR ALL TABLES. But this produces shift/reduce conflicts
in the grammar, and I didn't bother dealing with that. So for now it's
required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...


no streaming support yet
------------------------

There's no support for streaming of in-progress transactions yet, but
it should be trivial to add.


GetCurrentTransactionId() in nextval
------------------------------------

There's a bit annoying behavior of nextval() - if you do this:

   BEGIN;
     CREATE SEQUENCE s;
     SAVEPOINT a;
     SELECT nextval('s') FROM generate_series(1,100) s(i);
   COMMIT;

then the WAL record for nextval (right after the savepoint) will have
XID 0 (easy to see in pg_waldump). That's kinda strange, and it causes
problems in DecodeSequence() when calling

     SnapBuildProcessChange(builder, xid, buf->origptr)

for transactional changes, because that expects a valid XID. Fixing
this required adding GetCurrentTransactionId() to nextval() and two
other functions, which until now were only doing

   if (RelationNeedsWAL(seqrel))
     GetTopTransactionId();

I'm not sure if this has some particularly bad consequences.


regards

[1] 
https://www.postgresql.org/message-id/flat/1710ed7e13b.cd7177461430746.3372264562543607781%40highgo.ca

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
Hi,

Seems the cfbot was not entirely happy with the patch on some platforms,
so here's a fixed version. There was a bogus call to the
ensure_transaction function (which does not exist at all) and a silly
bug in transaction management in apply_handle_sequence.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
A rebased patch, addressing a minor bitrot due to 4daa140a2f5.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
On 6/23/21 4:14 PM, Tomas Vondra wrote:
> A rebased patch, addressing a minor bitrot due to 4daa140a2f5.
> 

Meh, forgot to attach the patch as usual, of course ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: vignesh C
Date:
On Wed, Jun 23, 2021 at 7:55 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/23/21 4:14 PM, Tomas Vondra wrote:
> > A rebased patch, addressing a minor bitrot due to 4daa140a2f5.
> >
>
> Meh, forgot to attach the patch as usual, of course ...

The patch does not apply on HEAD anymore, could you rebase and post an
updated patch? I'm changing the status to "Waiting for Author".

Regards,
Vignesh



Re: logical decoding and replication of sequences

From: Peter Eisentraut
Date:
On 08.06.21 00:28, Tomas Vondra wrote:
> 
> new "sequence" publication action
> ---------------------------------
> 
> The publications now have a new "sequence" publication action, which is
> enabled by default. This determines whether the publication decodes
> sequences or what.
> 
> 
> FOR ALL SEQUENCES
> -----------------
> 
> It should be possible to create FOR ALL SEQUENCES publications, just
> like we have FOR ALL TABLES. But this produces shift/reduce conflicts
> in the grammar, and I didn't bother dealing with that. So for now it's
> required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...

I have been thinking about these DDL-level issues a bit.  The most 
common use case will be to have a bunch of tables with implicit 
sequences, and you just want to replicate them from here to there 
without too much effort.  So ideally an implicit sequence should be 
replicated by default if the table is part of a publication (unless 
sequences are turned off by the publication option).

We already have support for things like that in 
GetPublicationRelations(), where a partitioned table is expanded to 
include the actual partitions.  I think that logic could be reused.  So 
in general I would have GetPublicationRelations() include sequences and 
don't have GetPublicationSequenceRelations() at all.  Then sequences 
could also be sent by pg_publication_tables(), maybe add a relkind 
column.  And then you also don't need so much duplicate DDL code, if you 
just consider everything as a relation.  For example, there doesn't seem 
to be an actual need to have fetch_sequence_list() and subsequent 
processing on the subscriber side.  It does the same thing as 
fetch_table_list(), so it might as well just all be one thing.

We do, however, probably need some checking that we don't replicate 
tables to sequences or vice versa.

We probably also don't need a separate FOR ALL SEQUENCES option.  What 
users really want is a "for everything" option.  We could think about 
renaming or alternative syntax, but in principle I think FOR ALL TABLES 
should include sequences by default.

Tests under src/test/subscription/ are needed.

I'm not sure why test_decoding needs a skip-sequences option.  The 
source code says it's for backward compatibility, but I don't see why we 
need that.



Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:

On 7/20/21 5:30 PM, Peter Eisentraut wrote:
> On 08.06.21 00:28, Tomas Vondra wrote:
>>
>> new "sequence" publication action
>> ---------------------------------
>>
>> The publications now have a new "sequence" publication action, which is
>> enabled by default. This determines whether the publication decodes
>> sequences or what.
>>
>>
>> FOR ALL SEQUENCES
>> -----------------
>>
>> It should be possible to create FOR ALL SEQUENCES publications, just
>> like we have FOR ALL TABLES. But this produces shift/reduce conflicts
>> in the grammar, and I didn't bother dealing with that. So for now it's
>> required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...
> 
> I have been thinking about these DDL-level issues a bit.  The most
> common use case will be to have a bunch of tables with implicit
> sequences, and you just want to replicate them from here to there
> without too much effort.  So ideally an implicit sequence should be
> replicated by default if the table is part of a publication (unless
> sequences are turned off by the publication option).
> 

Agreed, that seems like a reasonable approach.

> We already have support for things like that in
> GetPublicationRelations(), where a partitioned table is expanded to
> include the actual partitions.  I think that logic could be reused.  So
> in general I would have GetPublicationRelations() include sequences and
> don't have GetPublicationSequenceRelations() at all.  Then sequences
> could also be sent by pg_publication_tables(), maybe add a relkind
> column.  And then you also don't need so much duplicate DDL code, if you
> just consider everything as a relation.  For example, there doesn't seem
> to be an actual need to have fetch_sequence_list() and subsequent
> processing on the subscriber side.  It does the same thing as
> fetch_table_list(), so it might as well just all be one thing.
> 

Not sure. I agree with replicating implicit sequences by default,
without having to add them to the publication. But I think we should
allow adding other sequences too, and I think some of this code and
differentiation from tables will be needed.

FWIW I'm not claiming there are no duplicate parts - I've mostly
copy-pasted the table-handling code for sequences, and I'll look into
reusing some of it.

> We do, however, probably need some checking that we don't replicate
> tables to sequences or vice versa.
> 

True. I haven't tried doing such silly things yet.

> We probably also don't need a separate FOR ALL SEQUENCES option.  What
> users really want is a "for everything" option.  We could think about
> renaming or alternative syntax, but in principle I think FOR ALL TABLES
> should include sequences by default.
> 
> Tests under src/test/subscription/ are needed.
> 

Yeah, true. At the moment there are just tests in test_decoding, mostly
because the previous patch versions did not add support for sequences to
built-in replication. Will fix.

> I'm not sure why test_decoding needs a skip-sequences option.  The
> source code says it's for backward compatibility, but I don't see why we
> need that.

Hmmm, I'm a bit baffled by skip-sequences too. I think Cary added it to
limit changes in the test_decoding tests, while the misleading comment
about backwards compatibility comes from me.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
Here's a rebased version of the patch, no other changes.

I think the crucial aspect of this that needs discussion/feedback the
most is the transactional vs. non-transactional behavior. All the other
questions are less important / cosmetic.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
Hi,

Here's an updated version of this patch - rebased to current master, 
and fixing some of the issues raised in Peter's review.

Mainly, this adds a TAP test to src/test/subscription, focusing on 
testing the various situations with transactional and non-transactional 
behavior (with subtransactions and various ROLLBACK versions).

This new TAP test however uncovered an issue with wait_for_catchup(), 
because that uses pg_current_wal_lsn() to wait for replication of all 
the changes. But that does not work when the sequence gets advanced in a 
transaction that is then aborted, e.g. like this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK;

The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, 
which is updated by XLogFlush() - but only in RecordTransactionCommit. 
Which makes sense, because only the committed stuff is "visible". But 
the non-transactional behavior changes this, because now some of the 
changes from aborted transactions may need to be replicated. Which means 
the wait_for_catchup() ends up not waiting for the sequence change.

One option would be adding XLogFlush() to RecordTransactionAbort(), but 
my guess is we don't do that intentionally (even though aborts should be 
fairly rare in most workloads).

I'm not entirely sure changing this (replicating changes from aborted 
xacts) is a good idea. Maybe it'd be better to replicate this "lazily" - 
instead of replicating the advances right away, we might remember which 
sequences were advanced in the transaction, and then replicate current 
state for those sequences at commit time.

The idea is that if an increment is "invisible" we probably don't need 
to replicate it, it's fine to replicate the next "visible" increment. So 
for example given

BEGIN;
SELECT nextval('s');
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

we don't need to replicate the first change, but we need to replicate 
the second one.

The trouble is we don't decode individual sequence advances, just those 
that update the sequence tuple (so every 32 values or so). So we'd need 
to remember the first increment, in a way that is (a) persistent across 
restarts and (b) shared by all backends.

The other challenge seems to be ordering of the changes - at the moment 
we have no issues with this, because increments on the same sequence are 
replicated immediately, in the WAL order. But postponing them to commit 
time would affect this order.
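The "lazy" commit-time idea above could be sketched roughly like this (Python, all names hypothetical; a sketch of the bookkeeping only, not decoder code). Note how re-reading the current state at commit covers increments made by parallel transactions that later aborted, and how nothing at all is emitted for aborted transactions:

```python
class LazySequenceReplicator:
    """Toy model: remember which sequences each xact advanced, emit
    their *current* state at commit time instead of replicating each
    WAL-logged advance immediately."""

    def __init__(self, read_current_state):
        self.read_current_state = read_current_state  # seq -> last_value
        self.touched = {}                             # xid -> set of sequences

    def on_advance(self, xid, seq):
        self.touched.setdefault(xid, set()).add(seq)

    def on_commit(self, xid, emit):
        # re-read the sequence, so advances made (and discarded) by
        # parallel rolled-back transactions are still covered
        for seq in sorted(self.touched.pop(xid, ())):
            emit(seq, self.read_current_state(seq))

    def on_abort(self, xid):
        self.touched.pop(xid, None)   # nothing emitted for aborted xacts

state = {"s": 1000}
out = []
r = LazySequenceReplicator(lambda seq: state[seq])
r.on_advance(100, "s"); r.on_abort(100)          # aborted: emits nothing
r.on_advance(101, "s"); state["s"] = 2000
r.on_commit(101, lambda s, v: out.append((s, v)))
print(out)    # [('s', 2000)]
```

The ordering problem mentioned above shows up here too: states are emitted in commit order, not in WAL order of the underlying advances.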


I've also briefly looked at the code duplication - there's a couple 
functions in the patch that I shamelessly copy-pasted and tweaked to 
handle sequences instead of tables:

publicationcmds.c
-----------------

1) OpenTableList/CloseTableList -> OpenSequenceList/CloseSequenceList

Trivial differences, trivial to get rid of - the only difference is 
pretty much just table_open vs. relation_open.


2) AlterPublicationTables -> AlterPublicationSequences

This is a bit more complicated, because the tables also handle 
inheritance (which I think does not apply to sequences). Other than 
that, it's calling the functions from (1).


subscriptioncmds.c
------------------

1) fetch_table_list, fetch_sequence_list

Minimal differences, essentially just the catalog name.

2) AlterSubscription_refresh

A lot of duplication, but the code handling tables and sequences is 
almost exactly the same and can be reused fairly easily (moved to a 
separate function, called for tables and then sequences).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: Robert Haas
Date:
On Tue, Jul 20, 2021 at 5:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I think the crucial aspect of this that needs discussion/feedback the
> most is the transactional vs. non-transactional behavior. All the other
> questions are less important / cosmetic.

Yeah, it seems really tricky to me to get this right. The hard part
is, I think, mostly figuring out what the right behavior really is.

DDL in PostgreSQL is transactional. Non-DDL operations on sequences
are non-transactional. If a given transaction does only one of those
things, it seems clear enough what to do, but when the same
(sub)transaction does both, it gets messy. I'd be tempted to think
about something like:

1. When a transaction performs only non-transactional operations on
sequences, they are emitted immediately.

2. If a transaction performs transactional operations on sequences,
the decoded operations acquire a dependency on the transaction and
cannot be emitted until that transaction is fully decoded. When commit
or abort of that XID is reached, emit the postponed non-transactional
operations at that point.

I think this is similar to what you've designed, but I'm not sure that
it's exactly equivalent. I think in particular that it may be better
to insist that all of these operations are non-transactional and that
the debate is only about when they can be sent, rather than trying to
sort of convert them into an equivalent series of transactional
operations. That approach seems confusing especially in the case where
some (sub)transactions abort.

But this is just my $0.02.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences

From: Peter Eisentraut
Date:
On 30.07.21 20:26, Tomas Vondra wrote:
> Here's an updated version of this patch - rebased to current master, 
> and fixing some of the issues raised in Peter's review.

This patch needs an update, as various conflicts have arisen now.

As was discussed before, it might be better to present a separate patch 
for just the logical decoding part for now, since the replication and 
DDL stuff has the potential to conflict heavily with other patches being 
discussed right now.  It looks like cutting this patch in two should be 
doable easily.

I looked through the test cases in test_decoding again.  It all looks 
pretty sensible.  If anyone can think of any other tricky or dubious 
cases, we can add them there.  It's easiest to discuss these things with 
concrete test cases rather than in theory.

One slightly curious issue is that this can make sequence values go 
backwards, when seen by the logical decoding consumer, like in the test 
case:

+ BEGIN
+ sequence: public.test_sequence transactional: 1 created: 1 last_value: 
1, log_cnt: 0 is_called: 0
+ COMMIT
+ sequence: public.test_sequence transactional: 0 created: 0 last_value: 
33, log_cnt: 0 is_called: 1
+ BEGIN
+ sequence: public.test_sequence transactional: 1 created: 1 last_value: 
4, log_cnt: 0 is_called: 1
+ COMMIT
+ sequence: public.test_sequence transactional: 0 created: 0 last_value: 
334, log_cnt: 0 is_called: 1

I suppose that's okay, since it's not really the intention that someone 
is concurrently consuming sequence values on the subscriber.  Maybe 
something for the future.  Fixing that would require changing the way 
transactional sequence DDL updates these values, so it's not directly 
the job of the decoding to address this.

A small thing I found: Maybe the text that test_decoding produces for 
sequences can be made to look more consistent with the one for tables. 
For example, in

+ BEGIN
+ sequence: public.test_table_a_seq transactional: 1 created: 1 
last_value: 1, log_cnt: 0 is_called: 0
+ sequence: public.test_table_a_seq transactional: 1 created: 0 
last_value: 33, log_cnt: 0 is_called: 1
+ table public.test_table: INSERT: a[integer]:1 b[integer]:100
+ table public.test_table: INSERT: a[integer]:2 b[integer]:200
+ COMMIT

note how the punctuation is different.



Re: logical decoding and replication of sequences

From: Tomas Vondra
Date:
Hi,

On 9/23/21 12:27 PM, Peter Eisentraut wrote:
> On 30.07.21 20:26, Tomas Vondra wrote:
>> Here's an updated version of this patch - rebased to current master, 
>> and fixing some of the issues raised in Peter's review.
> 
> This patch needs an update, as various conflicts have arisen now.
> 
> As was discussed before, it might be better to present a separate patch 
> for just the logical decoding part for now, since the replication and 
> DDL stuff has the potential to conflict heavily with other patches being 
> discussed right now.  It looks like cutting this patch in two should be 
> doable easily.
> 

Attached is the rebased patch, split into three parts:

1) basic decoding infrastructure (decoding, reorderbuffer etc.)
2) support for sequences in test_decoding
3) support for sequences in built-in replication (catalogs, syntax, ...)

The last part is the largest one - I'm sure we'll have discussions about 
the grammar, adding sequences automatically, etc. But as you said, let's 
focus on the first part, which deals with the required decoding stuff.

I've added a couple comments, explaining how we track sequences, why we 
need the XID in nextval() etc. I've also added streaming support.


> I looked through the test cases in test_decoding again.  It all looks 
> pretty sensible.  If anyone can think of any other tricky or dubious 
> cases, we can add them there.  It's easiest to discuss these things with 
> concrete test cases rather than in theory.
> 
> One slightly curious issue is that this can make sequence values go 
> backwards, when seen by the logical decoding consumer, like in the test 
> case:
> 
> + BEGIN
> + sequence: public.test_sequence transactional: 1 created: 1 last_value: 
> 1, log_cnt: 0 is_called: 0
> + COMMIT
> + sequence: public.test_sequence transactional: 0 created: 0 last_value: 
> 33, log_cnt: 0 is_called: 1
> + BEGIN
> + sequence: public.test_sequence transactional: 1 created: 1 last_value: 
> 4, log_cnt: 0 is_called: 1
> + COMMIT
> + sequence: public.test_sequence transactional: 0 created: 0 last_value: 
> 334, log_cnt: 0 is_called: 1
> 
> I suppose that's okay, since it's not really the intention that someone 
> is concurrently consuming sequence values on the subscriber.  Maybe 
> something for the future.  Fixing that would require changing the way 
> transactional sequence DDL updates these values, so it's not directly 
> the job of the decoding to address this.
> 

Yeah, that's due to how ALTER SEQUENCE does things, and I agree redoing 
that seems well out of scope for this patch. What seems a bit suspicious 
is that some of the ALTER SEQUENCE changes have "created: 1" - it's 
probably correct, though, because those ALTER SEQUENCE statements can be 
rolled-back, so we see it as if a new sequence is created. The flag name 
might be a bit confusing, though.

> A small thing I found: Maybe the text that test_decoding produces for 
> sequences can be made to look more consistent with the one for tables. 
> For example, in
> 
> + BEGIN
> + sequence: public.test_table_a_seq transactional: 1 created: 1 
> last_value: 1, log_cnt: 0 is_called: 0
> + sequence: public.test_table_a_seq transactional: 1 created: 0 
> last_value: 33, log_cnt: 0 is_called: 1
> + table public.test_table: INSERT: a[integer]:1 b[integer]:100
> + table public.test_table: INSERT: a[integer]:2 b[integer]:200
> + COMMIT
> 
> note how the punctuation is different.

I did tweak this a bit, hopefully it's more consistent.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: logical decoding and replication of sequences

From: Hannu Krosing
Date:
Just a note for some design decisions

> 1) By default, sequences are treated non-transactionally, i.e. sent to the output plugin right away.

If our aim is just to make sure that all user-visible data in
*transactional* tables is consistent with sequence state, then one
very much simplified approach could be to track the results of
nextval() calls in a transaction and, at COMMIT, put the latest sequence
value in WAL (or just track the sequences affected and put the latest
sequence state in WAL at commit, which needs an extra read of the
sequence but protects against race conditions with parallel transactions
which get rolled back later).

This avoids sending redundant changes for multiple nextval() calls
(like loading a million-row table with sequence-generated id column)

And one can argue that we can safely ignore anything in rolled-back
transactions. This is assuming that even if we did advance the sequence
past the last value sent by the latest committed transaction, it does
not matter for database consistency.

It can matter if users just call nextval() in rolled-back transactions
and somehow expect these values to be replicated, based on reasoning
along the lines of "sequences are not transactional - so rollbacks
should not matter".

Or we may get away with much less detailed sequence tracking on the
source if we just keep track of the xmin of the sequence and send the
sequence info over at commit if it == current_transaction_id?


-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.

On Fri, Sep 24, 2021 at 9:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> On 9/23/21 12:27 PM, Peter Eisentraut wrote:
> > On 30.07.21 20:26, Tomas Vondra wrote:
> >> Here's an updated version of this patch - rebased to current master,
> >> and fixing some of the issues raised in Peter's review.
> >
> > This patch needs an update, as various conflicts have arisen now.
> >
> > As was discussed before, it might be better to present a separate patch
> > for just the logical decoding part for now, since the replication and
> > DDL stuff has the potential to conflict heavily with other patches being
> > discussed right now.  It looks like cutting this patch in two should be
> > doable easily.
> >
>
> Attached is the rebased patch, split into three parts:
>
> 1) basic decoding infrastructure (decoding, reorderbuffer etc.)
> 2) support for sequences in test_decoding
> 3) support for sequences in built-in replication (catalogs, syntax, ...)
>
> The last part is the largest one - I'm sure we'll have discussions about
> the grammar, adding sequences automatically, etc. But as you said, let's
> focus on the first part, which deals with the required decoding stuff.
>
> I've added a couple comments, explaining how we track sequences, why we
> need the XID in nextval() etc. I've also added streaming support.
>
>
> > I looked through the test cases in test_decoding again.  It all looks
> > pretty sensible.  If anyone can think of any other tricky or dubious
> > cases, we can add them there.  It's easiest to discuss these things with
> > concrete test cases rather than in theory.
> >
> > One slightly curious issue is that this can make sequence values go
> > backwards, when seen by the logical decoding consumer, like in the test
> > case:
> >
> > + BEGIN
> > + sequence: public.test_sequence transactional: 1 created: 1 last_value:
> > 1, log_cnt: 0 is_called: 0
> > + COMMIT
> > + sequence: public.test_sequence transactional: 0 created: 0 last_value:
> > 33, log_cnt: 0 is_called: 1
> > + BEGIN
> > + sequence: public.test_sequence transactional: 1 created: 1 last_value:
> > 4, log_cnt: 0 is_called: 1
> > + COMMIT
> > + sequence: public.test_sequence transactional: 0 created: 0 last_value:
> > 334, log_cnt: 0 is_called: 1
> >
> > I suppose that's okay, since it's not really the intention that someone
> > is concurrently consuming sequence values on the subscriber.  Maybe
> > something for the future.  Fixing that would require changing the way
> > transactional sequence DDL updates these values, so it's not directly
> > the job of the decoding to address this.
> >
>
> Yeah, that's due to how ALTER SEQUENCE does things, and I agree redoing
> that seems well out of scope for this patch. What seems a bit suspicious
> is that some of the ALTER SEQUENCE changes have "created: 1" - it's
> probably correct, though, because those ALTER SEQUENCE statements can be
> rolled-back, so we see it as if a new sequence is created. The flag name
> might be a bit confusing, though.
>
> > A small thing I found: Maybe the text that test_decoding produces for
> > sequences can be made to look more consistent with the one for tables.
> > For example, in
> >
> > + BEGIN
> > + sequence: public.test_table_a_seq transactional: 1 created: 1
> > last_value: 1, log_cnt: 0 is_called: 0
> > + sequence: public.test_table_a_seq transactional: 1 created: 0
> > last_value: 33, log_cnt: 0 is_called: 1
> > + table public.test_table: INSERT: a[integer]:1 b[integer]:100
> > + table public.test_table: INSERT: a[integer]:2 b[integer]:200
> > + COMMIT
> >
> > note how the punctuation is different.
>
> I did tweak this a bit, hopefully it's more consistent.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 9/25/21 22:05, Hannu Krosing wrote:
> Just a note for some design decisions
> 
>> 1) By default, sequences are treated non-transactionally, i.e. sent to the output plugin right away.
> 
> If our aim is just to make sure that all user-visible data in
> *transactional* tables is consistent with sequence state then  one
> very much simplified approach to this could be to track the results of
> nextval() calls in a transaction at COMMIT put the latest sequence
> value in WAL (or just track the sequences affected and put the latest
> sequence state in WAL at commit which needs extra read of sequence but
> protects against race conditions with parallel transactions which get
> rolled back later)
> 

Not sure. TBH I feel rather uneasy about adding more stuff in COMMIT.

> This avoids sending redundant changes for multiple nextval() calls
> (like loading a million-row table with sequence-generated id column)
> 

Yeah, it'd be nice to optimize this a bit, somehow. But I'd bet 
it's a negligible amount of data / changes, compared to the table.

> And one can argue that we can safely ignore anything in ROLLBACKED
> sequences. This is assuming that even if we did advance the sequence
> past the last value sent by the latest COMMITTED transaction it does
> not matter for database consistency.
> 

I don't think we can ignore aborted (ROLLBACK) transactions, in the 
sense that you can't just discard the increments. Imagine you have this 
sequence of transactions:

BEGIN;
SELECT nextval('s');  -- allocates new chunk of values
ROLLBACK;

BEGIN;
SELECT nextval('s'); -- returns one of the cached values
COMMIT;

If you ignore the aborted transaction, then the sequence increment won't 
be replicated -- but that's wrong, because the user now has a visible 
sequence value from that chunk.

So I guess we'd have to maintain a cache of sequences incremented in the 
current session, do nothing in aborted transactions (i.e. keep the 
contents but don't log anything) and log/reset at commit.

I wonder if multiple sessions make this even more problematic (e.g. due 
to a session just disconnecting mid-transaction, without writing the 
abort record at all). But AFAICS that's not an issue, because the other 
session has a separate cache for the sequence.
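The chunk-caching behavior underlying this can be sketched with a toy model. This is purely illustrative Python, not PostgreSQL source; the 32-value fetch-ahead mirrors the default WAL batching in sequence.c, and all names here are made up:

```python
# Toy model of sequence chunk caching: nextval() pre-allocates a chunk of
# 32 values and WAL-logs only the chunk top.
class Sequence:
    def __init__(self):
        self.on_disk = 0      # last_value written to WAL/disk
        self.cached_upto = 0  # top of the pre-logged chunk
        self.current = 0      # last value handed out to callers

    def nextval(self):
        if self.current >= self.cached_upto:
            # allocate a new chunk and log its top value; this on-disk
            # update survives even if the transaction later aborts
            self.cached_upto = self.current + 32
            self.on_disk = self.cached_upto
        self.current += 1
        return self.current

seq = Sequence()

# T1: BEGIN; SELECT nextval('s'); ROLLBACK;  -- chunk allocation is kept
v1 = seq.nextval()   # logs chunk top 32, returns 1

# T2: BEGIN; SELECT nextval('s'); COMMIT;  -- served from the cached chunk
v2 = seq.nextval()   # returns 2, writes no new WAL

assert (v1, v2) == (1, 2)
# The replica must learn about the increment logged by the aborted T1,
# or the committed value 2 could be reissued after failover.
assert seq.on_disk == 32
```

This is why discarding increments from aborted transactions loses information a committed transaction later depends on.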

> It can matter if customers just call nextval() in rolled-back
> transactions and somehow expect these values to be replicated based on
> reasoning along "sequences are not transactional - so rollbacks should
> not matter" .
> 

I don't think we guarantee anything for data in transactions that did 
not commit, so this seems like a non-issue. I.e. we don't need to go out 
of our way to guarantee something we never promised.

> Or we may get away with most in-detail sequence tracking on the source
> if we just keep track of the xmin of the sequence and send the
> sequence info over at commit if it == current_transaction_id ?
> 

Not sure I understand this proposal. Can you explain?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

I've spent a bit of time exploring the alternative approach outlined by 
Hannu, i.e. tracking sequences accessed by the transaction, and logging 
the final state just once at COMMIT. Attached is an experimental version 
of the patch series doing that - 0001 does the original approach 
(decoding the sequence updates from WAL) and then 0002 reworks it to 
this alternative solution. The 0003 and 0004 stay mostly the same, 
except for minor fixes. Some of the tests in 0003/0004 fail, because 
0002 changes the semantics in various ways (more about that later).

The original approach (0001) may seem complex at first, but in principle 
it just decodes changes to the sequence relation, and either stashes them 
in the transaction (just like other changes) or applies them right away. 
I'd say that's the most complicated part - deciding whether the change is 
transactional or not.

0002 reworks that so that it doesn't decode the existing WAL records, 
but tracks sequences which have been modified (updated on-disk state) 
and then accessed in the current transaction. And then at COMMIT time we 
write a new WAL message with info about the sequence.

I realized we already cache sequences for each session - seqhashtab in 
sequence.c. It doesn't have any concept of a transaction, but it seems 
fairly easy to make that possible. I did this by adding two flags

  - needs_log - means the session advanced the sequence (on disk)
  - touched   - true if the current xact called nextval() etc.

The idea is that what matters is updates to on-disk state, so whenever 
we do that we set needs_log. But it only matters when the changes are 
made visible in a committed transaction. Consider for example this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,10000) s(i);
ROLLBACK;
SELECT nextval('s');

The first nextval() call certainly sets both flags to true, at least for 
default sequences caching 32 values. But the values are not confirmed to 
the user because of the rollback - this resets 'touched' flag, but 
leaves 'needs_log' set to true.

And then the next nextval() - which may easily be just from cache - sets 
touched=true again, and logs the sequence state at (implicit) commit. 
Which resets both flags again.

The logging/cleanup happens in AtEOXact_Sequences(), which gets called 
before commit/abort. This walks all cached sequences and writes the 
state for those with both flags true (or resets the flags on abort).
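To illustrate, here is a toy model of the flag transitions described above. Only the flag names needs_log/touched come from this design; everything else is invented for illustration:

```python
# Toy model of the needs_log/touched tracking in the session cache.
wal = []

class SeqCacheEntry:
    def __init__(self):
        self.needs_log = False  # session advanced the on-disk state
        self.touched = False    # current xact called nextval() etc.

def nextval(entry, wrote_wal):
    if wrote_wal:
        entry.needs_log = True
    entry.touched = True

def at_eoxact(entry, committed):
    # models AtEOXact_Sequences(): log at commit when both flags are set
    if committed and entry.needs_log and entry.touched:
        wal.append("sequence state")
        entry.needs_log = False
    entry.touched = False       # reset on both commit and abort

e = SeqCacheEntry()

# BEGIN; SELECT nextval('s') FROM generate_series(1,10000); ROLLBACK;
nextval(e, wrote_wal=True)
at_eoxact(e, committed=False)
assert e.needs_log and not e.touched and wal == []

# SELECT nextval('s');  -- may be served purely from the session cache
nextval(e, wrote_wal=False)
at_eoxact(e, committed=True)    # implicit commit logs the state
assert wal == ["sequence state"]
assert not e.needs_log and not e.touched
```

The key point is that a rollback clears only 'touched', so a later commit still knows the on-disk state needs to be logged.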

The cache also keeps info about the last "sequence state" in the 
session, which is then used when writing it into WAL.


To write the sequence state into WAL, I've added a new WAL record 
xl_logical_sequence to RM_LOGICALMSG_ID, next to the xl_logical_message. 
It's a bit arbitrary, maybe it should be part of RM_SEQ_ID, but it does 
the trick. I don't think this is the main issue and it's easy enough to 
move it elsewhere if needed.

So, that seems fairly straightforward, and it may reduce the number of 
replication messages for large transactions. Unfortunately, it's not 
much simpler compared to the first approach - the amount of code is 
about the same, and there's a bunch of other issues.

The main issue seems to be about ordering. Consider multiple sessions 
all advancing the sequence. With the "old" approach this was naturally 
ordered - the order in which the increments were written to WAL made 
sense. But the sessions may advance the sequences in one order and then 
commit in a different order, which mixes the updates. Consider for 
example this scenario with two concurrent transactions:

T1: nextval('s') -> allocates values [1,32]
T2: nextval('s') -> allocates values [33,64]
T2: commit -> logs [33,64]
T1: commit -> logs [1,32]

The result is that the sequence on the replica diverged, because it 
replayed the increments in the opposite order.
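A minimal simulation of this ordering hazard (illustrative only): the replica applies commit-time states blindly in commit order, so it ends up behind the primary:

```python
# States are logged at COMMIT, so the replica replays them in commit
# order, not in allocation order.
commit_log = [
    ("T2", 64),   # T2 committed first, logging last_value = 64
    ("T1", 32),   # T1 committed second, logging last_value = 32
]

replica_last_value = 0
for _, last_value in commit_log:
    replica_last_value = last_value   # replica overwrites blindly

# The primary handed out values up to 64, but the replica ended at 32.
assert replica_last_value == 32
```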

I can think of two ways to fix this. Firstly, we could "merge" the 
increments in some smart way, e.g. by discarding values considered 
"stale" (like decrements). But that seems pretty fragile, because the 
sequence may be altered in various ways, reset, etc. And it seems more 
like transferring responsibility to someone else instead of actually 
solving the issue.

The other fix is simply reading the current sequence state from disk at 
commit and logging that (instead of the values cached from the last 
increment). But I'm rather skeptical about doing such things right 
before COMMIT.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

Here's a slightly improved version of the patch, fixing a couple issues 
I failed to notice. It also addresses a couple of the issues described 
in the last message, although mostly to show what would need to be done.

1) Handle sequences dropped in the xact by calling try_relation_open, 
and just doing nothing if not found. Otherwise we'd get a failure in 
reorderbuffer, when decoding the change.

2) Fixed nextval_internal() to log the correct last_value (the one we 
write into WAL).

3) Reread the sequence state in AtEOXact_Sequences, to deal with the 
ordering issue described before. This makes (2) somewhat pointless, 
because we just read whatever is on disk at that point. But having both 
makes it easier to experiment / see what'd happen.

4) Log the state in DefineSequence() - without this we'd not have the 
initial sequence state in the WAL, because only nextval/setval etc. do 
the logging. The old approach (decoding the sequence tuple) does not 
have this issue.


Change (3) alters the behavior in a somewhat strange way. Consider this 
case with two concurrent transactions:

T1: BEGIN;
T2: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: SELECT nextval('s') FROM generate_series(1,100) s(i);
T1: COMMIT;
T2: COMMIT;

The result is that both transactions have used the same sequence, and so 
will re-read the state from disk. But at that point the state is exactly 
the same, so we'll log the same thing twice.

There's a much deeper issue, though. The current patch only logs the 
sequence if the session generated WAL when incrementing the sequence 
(which happens every 32 values). But other sessions may already use 
values from this range, so consider for example this:

T1: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: BEGIN;
T2: SELECT nextval('s');
T2: COMMIT;
T1: ROLLBACK;

Which unfortunately means T2 already used a value, but the increment may 
not be logged at that time (or ever). This seems like a fatal issue, 
because it means we need to log *all* sequences the transaction touches, 
not just those that wrote the increment to WAL. That might still work 
for large transactions consuming many sequence values, but it's pretty 
inefficient for small OLTP transactions that only need one or two values 
from the sequence.
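This fatal case can be sketched as follows (hypothetical model, not actual code; the concrete numbers assume T1 consumed 100 values, with the WAL-logged top at 128):

```python
# Logging only at commit, and only by the session that itself wrote
# sequence WAL, loses T2's value when T1 aborts.
logged = []             # what the commit-time approach replicates

# T1: long nextval() run advances the on-disk state (writes sequence WAL)
on_disk_top = 128
t1_wrote_seq_wal = True

# T2: a single nextval() served from the already-advanced range, no WAL
t2_value = 101
t2_wrote_seq_wal = False

# T2 commits first: it never wrote sequence WAL, so it logs nothing
if t2_wrote_seq_wal:
    logged.append(on_disk_top)

# T1 rolls back: its pending commit-time log entry is never written

assert t2_value <= on_disk_top
assert logged == []   # the replica never learns about values up to 128
```

Hence the conclusion that every touched sequence would need logging, not just those that wrote WAL.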

So I think just decoding the sequence tuples is a better solution - for 
large transactions (consuming many values from the sequence) it may be 
more expensive (i.e. send more records to replica). But I doubt that 
matters too much - it's likely negligible compared to other data for 
large transactions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 22.11.21 01:47, Tomas Vondra wrote:
> So I think just decoding the sequence tuples is a better solution - for 
> large transactions (consuming many values from the sequence) it may be 
> more expensive (i.e. send more records to replica). But I doubt that 
> matters too much - it's likely negligible compared to other data for 
> large transactions.

I agree that the original approach is better.  It was worth trying out 
this alternative, but it seems quite complicated.  I note that a lot of 
additional code had to be added around several areas of the code, 
whereas the original patch really just touched the logical decoding 
code, as it should.  This doesn't prevent anyone from attempting to 
optimize things somehow in the future, but for now let's move forward 
with the simple approach.



Re: logical decoding and replication of sequences

From
Petr Jelinek
Date:
> On 22. 11. 2021, at 16:44, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 22.11.21 01:47, Tomas Vondra wrote:
>> So I think just decoding the sequence tuples is a better solution - for
>> large transactions (consuming many values from the sequence) it may be
>> more expensive (i.e. send more records to replica). But I doubt that
>> matters too much - it's likely negligible compared to other data for
>> large transactions.
>
> I agree that the original approach is better.  It was worth trying out
> this alternative, but it seems quite complicated.  I note that a lot of
> additional code had to be added around several areas of the code, whereas
> the original patch really just touched the logical decoding code, as it
> should.  This doesn't prevent anyone from attempting to optimize things
> somehow in the future, but for now let's move forward with the simple
> approach.

+1

--
Petr Jelinek




Re: logical decoding and replication of sequences

From
Andres Freund
Date:
Hi,

On 2021-09-25 22:05:43 +0200, Hannu Krosing wrote:
> If our aim is just to make sure that all user-visible data in
> *transactional* tables is consistent with sequence state then  one
> very much simplified approach to this could be to track the results of
> nextval() calls in a transaction at COMMIT put the latest sequence
> value in WAL (or just track the sequences affected and put the latest
> sequence state in WAL at commit which needs extra read of sequence but
> protects against race conditions with parallel transactions which get
> rolled back later)

I think this is a bad idea. It's architecturally more complicated and prevents
use cases because sequence values aren't guaranteed to be as new as on the
original system. You'd need to track all sequence use somehow *even if there
is no relevant WAL generated* in a transaction. There's simply no evidence of
sequence use in a transaction if that transaction uses a previously logged
sequence value.

Greetings,

Andres Freund



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 11/23/21 02:01, Andres Freund wrote:
> Hi,
> 
> On 2021-09-25 22:05:43 +0200, Hannu Krosing wrote:
>> If our aim is just to make sure that all user-visible data in
>> *transactional* tables is consistent with sequence state then  one
>> very much simplified approach to this could be to track the results of
>> nextval() calls in a transaction at COMMIT put the latest sequence
>> value in WAL (or just track the sequences affected and put the latest
>> sequence state in WAL at commit which needs extra read of sequence but
>> protects against race conditions with parallel transactions which get
>> rolled back later)
> 
> I think this is a bad idea. It's architecturally more complicated and prevents
> use cases because sequence values aren't guaranteed to be as new as on the
> original system. You'd need to track all sequence use somehow *even if there
> is no relevant WAL generated* in a transaction. There's simply no evidence of
> sequence use in a transaction if that transaction uses a previously logged
> sequence value.
> 

Not quite. We already have a cache of all sequences used by a session 
(see seqhashtab in sequence.c), and it's not that hard to extend it to 
per-transaction tracking. That's what the last two versions do, mostly.

But there are various issues with that approach, described in my last 
message(s).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
I have checked the 0001 and 0003 patches.  (I think we have dismissed
the approach in 0002 for now.  And let's talk about 0004 later.)

I have attached a few fixup patches, described further below.

# 0001

The argument "create" for fill_seq_with_data() is always true (and
patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
should be documented somewhere.

About the comment you added in nextval_internal(): It's a good
explanation, so I would leave it in.  I also agree with the
conclusion of the comment that the current solution is reasonable.  We
probably don't need the same comment again in fill_seq_with_data() and
again in do_setval().  Note that both of the latter functions already
point to nextval_internal() for other comments, so the same can be
relied upon here as well.

The order of the new fields sequence_cb and stream_sequence_cb is a
bit inconsistent compared to truncate_cb and stream_truncate_cb.
Maybe take another look to make the order more uniform.

Some documentation needs to be added to the Logical Decoding chapter.
I have attached a patch that I think catches all the places that need
to be updated, with some details left for you to fill in. ;-) The
ordering of some of the documentation chunks reflects the order in
which the callbacks appear in the header files, which might not be
optimal; see above.

In reorderbuffer.c, you left a comment about how to access a sequence
tuple.  There is an easier way, using Form_pg_sequence_data, which is
how sequence.c also does it.  See attached patch.

# 0003

The tests added in 0003 don't pass for me.  It appears that you might
have forgotten to update the expected files after you added some tests
or something.  The cfbot shows the same.  See attached patch for a
correction, but do check what your intent was.

As mentioned before, we probably don't need the skip-sequences option
in test_decoding.
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 12/7/21 15:38, Peter Eisentraut wrote:
> I have checked the 0001 and 0003 patches.  (I think we have dismissed
> the approach in 0002 for now.  And let's talk about 0004 later.)
> 

Right, I think that's correct.

> I have attached a few fixup patches, described further below.
> 
> # 0001
> 
> The argument "create" for fill_seq_with_data() is always true (and
> patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
> should be documented somewhere.
> 

Good point. I think it could be removed, but IIRC it'll be needed when
adding sequence decoding to the built-in replication, and that patch is
meant to be just an implementation of the API, without changing WAL.

OTOH I don't see it in the last version of that patch (in ResetSequence2
call) so maybe it's not really needed. I've kept it for now, but I'll
investigate.

> About the comment you added in nextval_internal(): It's a good
> explanation, so I would leave it in.  I also agree with the
> conclusion of the comment that the current solution is reasonable.  We
> probably don't need the same comment again in fill_seq_with_data() and
> again in do_setval().  Note that both of the latter functions already
> point to nextval_internal() for other comments, so the same can be
> relied upon here as well.
> 

True. I moved it a bit in nextval_internal() and removed the other
copies. The existing references should be enough.

> The order of the new fields sequence_cb and stream_sequence_cb is a
> bit inconsistent compared to truncate_cb and stream_truncate_cb.
> Maybe take another look to make the order more uniform.
> 

You mean in OutputPluginCallbacks? I'd actually argue it's the truncate
callbacks that are inconsistent - in the regular section truncate_cb is
right before commit_cb, while in the streaming section it's at the end.

Or what order do you think would be better? I'm fine with changing it.

> Some documentation needs to be added to the Logical Decoding chapter.
> I have attached a patch that I think catches all the places that need
> to be updated, with some details left for you to fill in. ;-) The
> ordering of some of the documentation chunks reflects the order in
> which the callbacks appear in the header files, which might not be
> optimal; see above.
> 

Thanks. I added a bit about the callbacks being optional and what the
parameters mean (only for sequence_cb, as the stream_ callbacks
generally don't copy that bit).

> In reorderbuffer.c, you left a comment about how to access a sequence
> tuple.  There is an easier way, using Form_pg_sequence_data, which is
> how sequence.c also does it.  See attached patch.
> 

Yeah, that looks much nicer.

> # 0003
> 
> The tests added in 0003 don't pass for me.  It appears that you might
> have forgotten to update the expected files after you added some tests
> or something.  The cfbot shows the same.  See attached patch for a
> correction, but do check what your intent was.
> 

Yeah. I suspect I removed the expected results due to the experimental
rework. I haven't noticed that because some of the tests for the
built-in replication are expected to fail.


Attached is an updated version of the first two parts (infrastructure
and test_decoding), with all your fixes merged.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 08.12.21 01:23, Tomas Vondra wrote:
>> The argument "create" for fill_seq_with_data() is always true (and
>> patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
>> should be documented somewhere.
>>
> 
> Good point. I think it could be removed, but IIRC it'll be needed when
> adding sequence decoding to the built-in replication, and that patch is
> meant to be just an implementation of the API, without changing WAL.
> 
> OTOH I don't see it in the last version of that patch (in ResetSequence2
> call) so maybe it's not really needed. I've kept it for now, but I'll
> investigate.

Ok, please check.  If it is needed, perhaps then we need a way for 
test_decoding() to simulate it, for testing.  But perhaps it's not needed.

>> The order of the new fields sequence_cb and stream_sequence_cb is a
>> bit inconsistent compared to truncate_cb and stream_truncate_cb.
>> Maybe take another look to make the order more uniform.
>>
> 
> You mean in OutputPluginCallbacks? I'd actually argue it's the truncate
> callbacks that are inconsistent - in the regular section truncate_cb is
> right before commit_cb, while in the streaming section it's at the end.

Ok, that makes sense.  Then leave yours.

When the question about fill_seq_with_data() is resolved, I have no more 
comments on this part.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

here's an updated version of the patches, dealing with almost all of the
issues (at least in the 0001 and 0002 parts). The main changes:

1) I've removed the 'created' flag from fill_seq_with_data, as
discussed. I don't think it's needed by any of the parts (not even 0003,
AFAICS). We still need it in xl_seq_rec, though.

2) GetCurrentTransactionId() added to sequence.c are called only with
wal_level=logical, to minimize the overhead.


There's still one remaining problem, that I already explained in [1].
The problem is that with this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,100);
  ROLLBACK;


The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
which is updated by XLogFlush() - but only in RecordTransactionCommit.
Which makes sense, because only the committed stuff is "visible".

But the non-transactional behavior of sequence decoding disagrees with
this, because now some of the changes from aborted transactions may be
replicated. Which means the wait_for_catchup() ends up not waiting for
the sequence change to be replicated. This is an issue for tests in
patch 0003, at least.

My concern is this actually affects other places waiting for things
getting replicated :-/

I have outlined some ideas how to deal with this in [1], but we got
sidetracked by exploring the alternative approach.


regards


[1]
https://www.postgresql.org/message-id/3d6df331-5532-6848-eb45-344b265e0238%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> here's an updated version of the patches, dealing with almost all of the
> issues (at least in the 0001 and 0002 parts). The main changes:
>
> 1) I've removed the  'created' flag from fill_seq_with_data, as
> discussed. I don't think it's needed by any of the parts (not even 0003,
> AFAICS). We still need it in xl_seq_rec, though.
>
> 2) GetCurrentTransactionId() added to sequence.c are called only with
> wal_level=logical, to minimize the overhead.
>
>
> There's still one remaining problem, that I already explained in [1].
> The problem is that with this:
>
>   BEGIN;
>   SELECT nextval('s') FROM generate_series(1,100);
>   ROLLBACK;
>
>
> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
> which is updated by XLogFlush() - but only in RecordTransactionCommit.
> Which makes sense, because only the committed stuff is "visible".
>
> But the non-transactional behavior of sequence decoding disagrees with
> this, because now some of the changes from aborted transactions may be
> replicated. Which means the wait_for_catchup() ends up not waiting for
> the sequence change to be replicated. This is an issue for tests in
> patch 0003, at least.
>
> My concern is this actually affects other places waiting for things
> getting replicated :-/
>

By any chance, will this impact synchronous replication as well which
waits for commits to be replicated?

How is this patch dealing with prepared transaction case where at
prepare we will send transactional changes and then later if rollback
prepared happens then the publisher will rollback changes related to
new relfilenode but subscriber would have already replayed the updated
seqval change which won't be rolled back?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 14.12.21 02:31, Tomas Vondra wrote:
> There's still one remaining problem, that I already explained in [1].
> The problem is that with this:
> 
>    BEGIN;
>    SELECT nextval('s') FROM generate_series(1,100);
>    ROLLBACK;
> 
> 
> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
> which is updated by XLogFlush() - but only in RecordTransactionCommit.
> Which makes sense, because only the committed stuff is "visible".
> 
> But the non-transactional behavior of sequence decoding disagrees with
> this, because now some of the changes from aborted transactions may be
> replicated. Which means the wait_for_catchup() ends up not waiting for
> the sequence change to be replicated.

I can't think of a reason why this might be a problem.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 12/15/21 14:20, Amit Kapila wrote:
> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> here's an updated version of the patches, dealing with almost all of the
>> issues (at least in the 0001 and 0002 parts). The main changes:
>>
>> 1) I've removed the  'created' flag from fill_seq_with_data, as
>> discussed. I don't think it's needed by any of the parts (not even 0003,
>> AFAICS). We still need it in xl_seq_rec, though.
>>
>> 2) GetCurrentTransactionId() added to sequence.c are called only with
>> wal_level=logical, to minimize the overhead.
>>
>>
>> There's still one remaining problem, that I already explained in [1].
>> The problem is that with this:
>>
>>   BEGIN;
>>   SELECT nextval('s') FROM generate_series(1,100);
>>   ROLLBACK;
>>
>>
>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>> Which makes sense, because only the committed stuff is "visible".
>>
>> But the non-transactional behavior of sequence decoding disagrees with
>> this, because now some of the changes from aborted transactions may be
>> replicated. Which means the wait_for_catchup() ends up not waiting for
>> the sequence change to be replicated. This is an issue for tests in
>> patch 0003, at least.
>>
>> My concern is this actually affects other places waiting for things
>> getting replicated :-/
>>
> 
> By any chance, will this impact synchronous replication as well which
> waits for commits to be replicated?
> 

Physical or logical replication? Physical is certainly not replicated.

For logical replication, it's more complicated.

> How is this patch dealing with prepared transaction case where at
> prepare we will send transactional changes and then later if rollback
> prepared happens then the publisher will rollback changes related to
> new relfilenode but subscriber would have already replayed the updated
> seqval change which won't be rolled back?
> 

I'm not sure what exact scenario you are describing, but in general we
don't rollback sequence changes anyway, so this should not cause any
divergence between the publisher and subscriber.

Or are you talking about something like ALTER SEQUENCE? I think that
should trigger the transactional behavior for the new relfilenode, so
the subscriber won't see those changes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 12/15/21 14:51, Tomas Vondra wrote:
> On 12/15/21 14:20, Amit Kapila wrote:
>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated version of the patches, dealing with almost all of the
>>> issues (at least in the 0001 and 0002 parts). The main changes:
>>>
>>> 1) I've removed the  'created' flag from fill_seq_with_data, as
>>> discussed. I don't think it's needed by any of the parts (not even 0003,
>>> AFAICS). We still need it in xl_seq_rec, though.
>>>
>>> 2) GetCurrentTransactionId() added to sequence.c are called only with
>>> wal_level=logical, to minimize the overhead.
>>>
>>>
>>> There's still one remaining problem, that I already explained in [1].
>>> The problem is that with this:
>>>
>>>   BEGIN;
>>>   SELECT nextval('s') FROM generate_series(1,100);
>>>   ROLLBACK;
>>>
>>>
>>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
>>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>>> Which makes sense, because only the committed stuff is "visible".
>>>
>>> But the non-transactional behavior of sequence decoding disagrees with
>>> this, because now some of the changes from aborted transactions may be
>>> replicated. Which means the wait_for_catchup() ends up not waiting for
>>> the sequence change to be replicated. This is an issue for tests in
>>> patch 0003, at least.
>>>
>>> My concern is this actually affects other places waiting for things
>>> getting replicated :-/
>>>
>>
>> By any chance, will this impact synchronous replication as well which
>> waits for commits to be replicated?
>>
> 
> Physical or logical replication? Physical is certainly not replicated.
> 
> For logical replication, it's more complicated.
> 

Apologies, sent too early ... I think it's more complicated for logical
sync replication, because of a scenario like this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,100);  <-- writes WAL
  ROLLBACK;

  SELECT nextval('s');

The first transaction advances the sequence enough to generate a WAL
record, which we write every 32 values. But it's rolled back, so it does
not update LogwrtResult.Write, because that happens only at commit.

And then the nextval() generates a value from the sequence without
generating WAL, so it doesn't update the LSN either (IIRC). That'd mean
a sync replication may not wait for this change to reach the subscriber.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Alvaro Herrera
Date:
Looking at 0003,

On 2021-Dec-14, Tomas Vondra wrote:

> diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml
> index bb4ef5e5e22..4d166ad3f9c 100644
> --- a/doc/src/sgml/ref/alter_publication.sgml
> +++ b/doc/src/sgml/ref/alter_publication.sgml
> @@ -31,7 +31,9 @@ ALTER PUBLICATION <replaceable class="parameter">name</replaceable> RENAME TO <r
>  <phrase>where <replaceable class="parameter">publication_object</replaceable> is one of:</phrase>
>  
>      TABLE [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ] [, ... ]
> +    SEQUENCE <replaceable class="parameter">sequence_name</replaceable> [ * ] [, ... ]
>      ALL TABLES IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ]
> +    ALL SEQUENCE IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ]

Note that this says ALL SEQUENCE; I think it should be ALL SEQUENCES.

> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 3d4dd43e47b..f037c17985b 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -9762,6 +9762,26 @@ PublicationObjSpec:
...
> +            | ALL SEQUENCE IN_P SCHEMA ColId
> +                {
> +                    $$ = makeNode(PublicationObjSpec);
> +                    $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_SCHEMA;
> +                    $$->name = $5;
> +                    $$->location = @5;
> +                }
> +            | ALL SEQUENCES IN_P SCHEMA CURRENT_SCHEMA
> +                {
> +                    $$ = makeNode(PublicationObjSpec);
> +                    $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_CUR_SCHEMA;
> +                    $$->location = @5;
> +                }

And here you have ALL SEQUENCE in one spot and ALL SEQUENCES in the
other.

BTW I think these enum values should use the plural too,
PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA (not SEQUENCE).  I suppose you
copied from PUBLICATIONOBJ_TABLE_IN_CUR_SCHEMA, but that too seems to be
a mistake: should be PUBLICATIONOBJ_TABLES_IN_CUR_SCHEMA.


> @@ -10097,6 +10117,12 @@ UnlistenStmt:
>                  }
>          ;
>  
> +/*
> + * FIXME
> + *
> + * opt_publication_for_sequences and publication_for_sequences should be
> + * copies for sequences
> + */

Not sure if this FIXME is relevant or should just be removed.

> @@ -10105,6 +10131,12 @@ UnlistenStmt:
>   *        BEGIN / COMMIT / ROLLBACK
>   *        (also older versions END / ABORT)
>   *
> + * ALTER PUBLICATION name ADD SEQUENCE sequence [, sequence2]
> + *
> + * ALTER PUBLICATION name DROP SEQUENCE sequence [, sequence2]
> + *
> + * ALTER PUBLICATION name SET SEQUENCE sequence [, sequence2]
> + *

This comment addition seems misplaced?

> diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
> index 2f412ca3db3..e30bf7b1b55 100644
> --- a/src/bin/psql/tab-complete.c
> +++ b/src/bin/psql/tab-complete.c
> @@ -1647,13 +1647,13 @@ psql_completion(const char *text, int start, int end)
>          COMPLETE_WITH("ADD", "DROP", "OWNER TO", "RENAME TO", "SET");
>      /* ALTER PUBLICATION <name> ADD */
>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD"))
> -        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
> +        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");
>      /* ALTER PUBLICATION <name> DROP */
>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "DROP"))
> -        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
> +        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");
>      /* ALTER PUBLICATION <name> SET */
>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET"))
> -        COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE");
> +        COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");

I think you should also add "ALL SEQUENCES IN SCHEMA" to these lists.


>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA"))

... and perhaps make this "ALL", "TABLES|SEQUENCES", "IN", "SCHEMA".


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Nunca confiaré en un traidor.  Ni siquiera si el traidor lo he creado yo"
(Barón Vladimir Harkonnen)



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 12/15/21 17:42, Alvaro Herrera wrote:
> Looking at 0003,
> 
> On 2021-Dec-14, Tomas Vondra wrote:
> 
>> diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml
>> index bb4ef5e5e22..4d166ad3f9c 100644
>> --- a/doc/src/sgml/ref/alter_publication.sgml
>> +++ b/doc/src/sgml/ref/alter_publication.sgml
>> @@ -31,7 +31,9 @@ ALTER PUBLICATION <replaceable class="parameter">name</replaceable> RENAME TO <r
>>  <phrase>where <replaceable class="parameter">publication_object</replaceable> is one of:</phrase>
>>  
>>      TABLE [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ] [, ... ]
>> +    SEQUENCE <replaceable class="parameter">sequence_name</replaceable> [ * ] [, ... ]
>>      ALL TABLES IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ]
>> +    ALL SEQUENCE IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ]
> 
> Note that this says ALL SEQUENCE; I think it should be ALL SEQUENCES.
> 
>> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
>> index 3d4dd43e47b..f037c17985b 100644
>> --- a/src/backend/parser/gram.y
>> +++ b/src/backend/parser/gram.y
>> @@ -9762,6 +9762,26 @@ PublicationObjSpec:
> ...
>> +            | ALL SEQUENCE IN_P SCHEMA ColId
>> +                {
>> +                    $$ = makeNode(PublicationObjSpec);
>> +                    $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_SCHEMA;
>> +                    $$->name = $5;
>> +                    $$->location = @5;
>> +                }
>> +            | ALL SEQUENCES IN_P SCHEMA CURRENT_SCHEMA
>> +                {
>> +                    $$ = makeNode(PublicationObjSpec);
>> +                    $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_CUR_SCHEMA;
>> +                    $$->location = @5;
>> +                }
> 
> And here you have ALL SEQUENCE in one spot and ALL SEQUENCES in the
> other.
> 
> BTW I think these enum values should use the plural too,
> PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA (not SEQUENCE).  I suppose you
> copied from PUBLICATIONOBJ_TABLE_IN_CUR_SCHEMA, but that too seems to be
> a mistake: should be PUBLICATIONOBJ_TABLES_IN_CUR_SCHEMA.
> 
> 
>> @@ -10097,6 +10117,12 @@ UnlistenStmt:
>>                  }
>>          ;
>>  
>> +/*
>> + * FIXME
>> + *
>> + * opt_publication_for_sequences and publication_for_sequences should be
>> + * copies for sequences
>> + */
> 
> Not sure if this FIXME is relevant or should just be removed.
> 
>> @@ -10105,6 +10131,12 @@ UnlistenStmt:
>>   *        BEGIN / COMMIT / ROLLBACK
>>   *        (also older versions END / ABORT)
>>   *
>> + * ALTER PUBLICATION name ADD SEQUENCE sequence [, sequence2]
>> + *
>> + * ALTER PUBLICATION name DROP SEQUENCE sequence [, sequence2]
>> + *
>> + * ALTER PUBLICATION name SET SEQUENCE sequence [, sequence2]
>> + *
> 
> This comment addition seems misplaced?
> 
>> diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
>> index 2f412ca3db3..e30bf7b1b55 100644
>> --- a/src/bin/psql/tab-complete.c
>> +++ b/src/bin/psql/tab-complete.c
>> @@ -1647,13 +1647,13 @@ psql_completion(const char *text, int start, int end)
>>          COMPLETE_WITH("ADD", "DROP", "OWNER TO", "RENAME TO", "SET");
>>      /* ALTER PUBLICATION <name> ADD */
>>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD"))
>> -        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
>> +        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");
>>      /* ALTER PUBLICATION <name> DROP */
>>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "DROP"))
>> -        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
>> +        COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");
>>      /* ALTER PUBLICATION <name> SET */
>>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET"))
>> -        COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE");
>> +        COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE|SEQUENCE");
> 
> I think you should also add "ALL SEQUENCES IN SCHEMA" to these lists.
> 
> 
>>      else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA"))
> 
> ... and perhaps make this "ALL", "TABLES|SEQUENCES", "IN", "SCHEMA".
> 

Thanks for the review. I'm aware 0003 is still incomplete and subject to
change - it's certainly not meant for commit yet. The current 0003 patch
is sufficient for testing the infrastructure, but we need to figure out
how to make it easier to use, what to do with implicit sequences and
similar things. Peter had some ideas in [1].

[1]
https://www.postgresql.org/message-id/359bf6d0-413d-292a-4305-e99eeafead39%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Wed, Dec 15, 2021 at 7:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/15/21 14:20, Amit Kapila wrote:
> > On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Hi,
> >>
> >> here's an updated version of the patches, dealing with almost all of the
> >> issues (at least in the 0001 and 0002 parts). The main changes:
> >>
> >> 1) I've removed the  'created' flag from fill_seq_with_data, as
> >> discussed. I don't think it's needed by any of the parts (not even 0003,
> >> AFAICS). We still need it in xl_seq_rec, though.
> >>
> >> 2) GetCurrentTransactionId() added to sequence.c are called only with
> >> wal_level=logical, to minimize the overhead.
> >>
> >>
> >> There's still one remaining problem, that I already explained in [1].
> >> The problem is that with this:
> >>
> >>   BEGIN;
> >>   SELECT nextval('s') FROM generate_series(1,100);
> >>   ROLLBACK;
> >>
> >>
> >> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
> >> which is updated by XLogFlush() - but only in RecordTransactionCommit.
> >> Which makes sense, because only the committed stuff is "visible".
> >>
> >> But the non-transactional behavior of sequence decoding disagrees with
> >> this, because now some of the changes from aborted transactions may be
> >> replicated. Which means the wait_for_catchup() ends up not waiting for
> >> the sequence change to be replicated. This is an issue for tests in
> >> patch 0003, at least.
> >>
> >> My concern is this actually affects other places waiting for things
> >> getting replicated :-/
> >>
> >
> > By any chance, will this impact synchronous replication as well which
> > waits for commits to be replicated?
> >
>
> Physical or logical replication?
>

logical replication.

> Physical is certainly not replicated.
>
> For logical replication, it's more complicated.
>
> > How is this patch dealing with prepared transaction case where at
> > prepare we will send transactional changes and then later if rollback
> > prepared happens then the publisher will rollback changes related to
> > new relfilenode but subscriber would have already replayed the updated
> > seqval change which won't be rolled back?
> >
>
> I'm not sure what exact scenario you are describing, but in general we
> don't rollback sequence changes anyway, so this should not cause any
> divergence between the publisher and subscriber.
>
> Or are you talking about something like ALTER SEQUENCE? I think that
> should trigger the transactional behavior for the new relfilenode, so
> the subscriber won't see that changes.
>
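
The quoted point that sequence changes are never rolled back is easy to
verify directly (a sketch; `s2` is a hypothetical sequence name):

```sql
CREATE SEQUENCE s2;

BEGIN;
SELECT nextval('s2');   -- returns 1
ROLLBACK;

-- the increment survives the rollback; sequence advances are
-- non-transactional
SELECT nextval('s2');   -- returns 2, not 1
```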

Yeah, something like an Alter Sequence and nextval combination. I see
that it will be handled because of the transactional behavior for the
new relfilenode, as a new relfilenode is created for applying each
sequence change. I think on the apply side, the patch always creates a
new relfilenode irrespective of whether the sequence message is
transactional or not. Do we need to do that for the non-transactional
messages as well?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:


On 12/16/21 12:59, Amit Kapila wrote:
> On Wed, Dec 15, 2021 at 7:21 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/15/21 14:20, Amit Kapila wrote:
>>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> here's an updated version of the patches, dealing with almost all of the
>>>> issues (at least in the 0001 and 0002 parts). The main changes:
>>>>
>>>> 1) I've removed the  'created' flag from fill_seq_with_data, as
>>>> discussed. I don't think it's needed by any of the parts (not even 0003,
>>>> AFAICS). We still need it in xl_seq_rec, though.
>>>>
>>>> 2) GetCurrentTransactionId() added to sequence.c are called only with
>>>> wal_level=logical, to minimize the overhead.
>>>>
>>>>
>>>> There's still one remaining problem, that I already explained in [1].
>>>> The problem is that with this:
>>>>
>>>>   BEGIN;
>>>>   SELECT nextval('s') FROM generate_series(1,100);
>>>>   ROLLBACK;
>>>>
>>>>
>>>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
>>>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>>>> Which makes sense, because only the committed stuff is "visible".
>>>>
>>>> But the non-transactional behavior of sequence decoding disagrees with
>>>> this, because now some of the changes from aborted transactions may be
>>>> replicated. Which means the wait_for_catchup() ends up not waiting for
>>>> the sequence change to be replicated. This is an issue for tests in
>>>> patch 0003, at least.
>>>>
>>>> My concern is this actually affects other places waiting for things
>>>> getting replicated :-/
>>>>
>>>
>>> By any chance, will this impact synchronous replication as well which
>>> waits for commits to be replicated?
>>>
>>
>> Physical or logical replication?
>>
> 
> logical replication.
> 
>> Physical is certainly not replicated.
>>
>> For logical replication, it's more complicated.
>>
>>> How is this patch dealing with prepared transaction case where at
>>> prepare we will send transactional changes and then later if rollback
>>> prepared happens then the publisher will rollback changes related to
>>> new relfilenode but subscriber would have already replayed the updated
>>> seqval change which won't be rolled back?
>>>
>>
>> I'm not sure what exact scenario you are describing, but in general we
>> don't rollback sequence changes anyway, so this should not cause any
>> divergence between the publisher and subscriber.
>>
>> Or are you talking about something like ALTER SEQUENCE? I think that
>> should trigger the transactional behavior for the new relfilenode, so
>> the subscriber won't see that changes.
>>
> 
> Yeah, something like Alter Sequence and nextval combination. I see
> that it will be handled because of the transactional behavior for the
> new relfilenode as for applying each sequence change, a new
> relfilenode is created.

Right.

> I think on apply side, the patch always creates a new relfilenode
> irrespective of whether the sequence message is transactional or not.
> Do we need to do that for the non-transactional messages as well?
> 

Good question. I don't think that's necessary, I'll see if we can simply
update the tuple (mostly just like redo).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 12/15/21 14:51, Tomas Vondra wrote:
> On 12/15/21 14:20, Amit Kapila wrote:
>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated version of the patches, dealing with almost all of the
>>> issues (at least in the 0001 and 0002 parts). The main changes:
>>>
>>> 1) I've removed the  'created' flag from fill_seq_with_data, as
>>> discussed. I don't think it's needed by any of the parts (not even 0003,
>>> AFAICS). We still need it in xl_seq_rec, though.
>>>
>>> 2) GetCurrentTransactionId() added to sequence.c are called only with
>>> wal_level=logical, to minimize the overhead.
>>>
>>>
>>> There's still one remaining problem, that I already explained in [1].
>>> The problem is that with this:
>>>
>>>    BEGIN;
>>>    SELECT nextval('s') FROM generate_series(1,100);
>>>    ROLLBACK;
>>>
>>>
>>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write,
>>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>>> Which makes sense, because only the committed stuff is "visible".
>>>
>>> But the non-transactional behavior of sequence decoding disagrees with
>>> this, because now some of the changes from aborted transactions may be
>>> replicated. Which means the wait_for_catchup() ends up not waiting for
>>> the sequence change to be replicated. This is an issue for tests in
>>> patch 0003, at least.
>>>
>>> My concern is this actually affects other places waiting for things
>>> getting replicated :-/
>>>
>>
>> By any chance, will this impact synchronous replication as well which
>> waits for commits to be replicated?
>>
> 
> Physical or logical replication? Physical is certainly not replicated.
> 

Actually, I take that back. It does affect physical (sync) replication 
just as well, and I think it might be considered a data loss issue. I 
started a new thread to discuss that, so that it's not buried in this 
thread about logical replication.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

Here's a rebased version of the patch series. I decided to go back to 
the version from 2021/12/14, which does not include the changes to WAL 
logging. So this has the same issue with nextval() now waiting for WAL 
flush (and/or sync replica), as demonstrated by occasional failures of 
the TAP test in 0003, but my reasoning is this:

1) This is a preexisting issue, affecting sequences in general. It's not 
related to this patch, really. The fix will be independent of this, and 
there's little reason to block the decoding until that happens.

2) We've discussed a couple ways to fix this in [1] - logging individual 
sequence increments, flushing everything right away, waiting for page 
LSN, etc. My opinion is we'll use some form of waiting for page LSN, but 
no matter what fix we use it'll have almost no impact on this patch. 
There might be minor changes to the test, but that's about it.

3) There are ways to stabilize the tests even without that - it's enough 
to generate a little bit of WAL / get XID, not just nextval(). But 
that's only for 0003 which I don't intend to commit yet, and 0001/0002 
have no problems at all.


regards

[1] 
https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 26.01.22 03:16, Tomas Vondra wrote:
> Here's a rebased version of the patch series. I decided to go back to 
> the version from 2021/12/14, which does not include the changes to WAL 
> logging.

I notice that test_decoding still has skip-sequences enabled by default, 
"for backward compatibility".  I think we had concluded in previous 
discussions that we don't need that.  I would remove the option altogether.




Re: logical decoding and replication of sequences

From
Petr Jelinek
Date:
I would not remove it altogether; there are plenty of consumers of this extension's output in the wild (even if I think
it's unfortunate) that might not be interested in sequences, but changing it to off by default certainly makes sense.

--
Petr Jelinek

> On 26. 1. 2022, at 11:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 26.01.22 03:16, Tomas Vondra wrote:
>> Here's a rebased version of the patch series. I decided to go back to the version from 2021/12/14, which does not
include the changes to WAL logging.
>
> I notice that test_decoding still has skip-sequences enabled by default, "for backward compatibility".  I think we
had concluded in previous discussions that we don't need that.  I would remove the option altogether.
>




Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 1/26/22 14:01, Petr Jelinek wrote:
> I would not remove it altogether, there is plenty of consumers of 
> this extension's output in the wild (even if I think it's
> unfortunate) that might not be interested in sequences, but changing
> it to off by default certainly makes sense.

Indeed. Attached is an updated patch series, with 0003 switching it to 
false by default (and fixing the fallout). For commit I'll merge that 
into 0002, of course.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 27.01.22 00:32, Tomas Vondra wrote:
> 
> On 1/26/22 14:01, Petr Jelinek wrote:
>> I would not remove it altogether, there is plenty of consumers of this 
>> extension's output in the wild (even if I think it's
>> unfortunate) that might not be interested in sequences, but changing
>> it to off by default certainly makes sense.
> 
> Indeed. Attached is an updated patch series, with 0003 switching it to 
> false by default (and fixing the fallout). For commit I'll merge that 
> into 0002, of course.

(could be done in separate patches too IMO)

test_decoding.c uses %zu several times for int64 values, which is not 
correct.  This should use INT64_FORMAT or %lld with a cast to (long long 
int).

I don't know, maybe it's worth commenting somewhere how the expected 
values in contrib/test_decoding/expected/sequence.out come about. 
Otherwise, it's quite a puzzle to determine where the 3830 comes from, 
for example.

I think the skip-sequences options is better turned around into a 
positive name like include-sequences.  There is a mix of "skip" and 
"include" options in test_decoding, but there are more "include" ones 
right now.

In the 0003, a few files have been missed, apparently, so the tests 
don't fully pass.  See attached patch.

I haven't looked fully through the 0004 patch, but I notice that there 
was a confusing mix of FOR ALL SEQUENCE and FOR ALL SEQUENCES.  I have 
corrected that in the other attached patch.

Other than the mentioned cosmetic issues, I think 0001-0003 are ready to go.
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 1/27/22 17:05, Peter Eisentraut wrote:
> On 27.01.22 00:32, Tomas Vondra wrote:
>>
>> On 1/26/22 14:01, Petr Jelinek wrote:
>>> I would not remove it altogether, there is plenty of consumers of 
>>> this extension's output in the wild (even if I think it's
>>> unfortunate) that might not be interested in sequences, but changing
>>> it to off by default certainly makes sense.
>>
>> Indeed. Attached is an updated patch series, with 0003 switching it to 
>> false by default (and fixing the fallout). For commit I'll merge that 
>> into 0002, of course.
> 
> (could be done in separate patches too IMO)
> 
> test_decoding.c uses %zu several times for int64 values, which is not 
> correct.  This should use INT64_FORMAT or %lld with a cast to (long long 
> int).
> 

Good point - INT64_FORMAT seems better. Also, the formatting was not 
quite right (missing space after the colon), so I fixed that too.

> I don't know, maybe it's worth commenting somewhere how the expected 
> values in contrib/test_decoding/expected/sequence.out come about. 
> Otherwise, it's quite a puzzle to determine where the 3830 comes from, 
> for example.
> 

Yeah, that's probably a good idea - I had to think about the expected 
output repeatedly, so an explanation would help. I'll do that in the 
next version of the patch.

> I think the skip-sequences options is better turned around into a 
> positive name like include-sequences.  There is a mix of "skip" and 
> "include" options in test_decoding, but there are more "include" ones 
> right now.
> 

Hmmm. I don't see much difference between skip-sequences and 
include-sequences, but I don't feel very strongly about it either so I 
switched that to include-sequences (which defaults to true).
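
For context, the renamed option is passed to test_decoding the usual way
(a sketch; the slot name is made up, and the option name reflects the
rename discussed here):

```sql
SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'test_decoding');

SELECT nextval('s');

-- suppress sequence output explicitly; after the rename it defaults to on
SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL,
    'include-sequences', '0');
```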

> In the 0003, a few files have been missed, apparently, so the tests 
> don't fully pass.  See attached patch.
> 

D'oh! I'd swear I've fixed those too.

> I haven't looked fully through the 0004 patch, but I notice that there 
> was a confusing mix of FOR ALL SEQUENCE and FOR ALL SEQUENCES.  I have 
> corrected that in the other attached patch.
> 
> Other than the mentioned cosmetic issues, I think 0001-0003 are ready to go.

Thanks. I think we'll have time to look at 0004 more closely once the 
initial parts get committed.


Attached is a a rebased/squashed version of the patches, with all the 
fixes discussed here.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
I've polished & pushed the first part adding sequence decoding 
infrastructure etc. Attached are the two remaining parts.

I plan to wait a day or two and then push the test_decoding part. The 
last part (for built-in replication) will need more work and maybe 
rethinking the grammar etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 2/10/22 19:17, Tomas Vondra wrote:
> I've polished & pushed the first part adding sequence decoding
> infrastructure etc. Attached are the two remaining parts.
> 
> I plan to wait a day or two and then push the test_decoding part. The
> last part (for built-in replication) will need more work and maybe
> rethinking the grammar etc.
> 

I've pushed the second part, adding sequences to test_decoding.

Here's the remaining part, rebased, with a small tweak in the TAP test
to eliminate the issue with not waiting for sequence increments. I've
kept the tweak in a separate patch, so that we can throw it away easily
if we happen to resolve the issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 2/12/22 01:34, Tomas Vondra wrote:
> On 2/10/22 19:17, Tomas Vondra wrote:
>> I've polished & pushed the first part adding sequence decoding
>> infrastructure etc. Attached are the two remaining parts.
>>
>> I plan to wait a day or two and then push the test_decoding part. The
>> last part (for built-in replication) will need more work and maybe
>> rethinking the grammar etc.
>>
> 
> I've pushed the second part, adding sequences to test_decoding.
> 
> Here's the remaining part, rebased, with a small tweak in the TAP test
> to eliminate the issue with not waiting for sequence increments. I've
> kept the tweak in a separate patch, so that we can throw it away easily
> if we happen to resolve the issue.
> 

Hmm, cfbot was not happy about this, so here's a version fixing the
elog() format issue reported by CirrusCI/mingw by ditching the log
message. It was useful for debugging, but otherwise just noise.

I'm a bit puzzled about the macOS failure, though. It seems as if the
test does not wait for the subscriber long enough, but this is with the
tweaked test variant, so it should not have the rollback issue. And I
haven't seen this failure on any other machine.

Regarding adding decoding of sequences to the built-in replication,
there is a couple questions that we need to discuss first before
cleaning up the code etc. Most of them are related to syntax and
handling of various sequence variants.


1) Firstly, what about implicit sequences. That is, if you create a
table with SERIAL or BIGSERIAL column, that'll have a sequence attached.
Should those sequences be added to the publication when the table gets
added? Or should we require adding them separately? Or should that be
specified in the grammar, somehow? Should we have INCLUDING SEQUENCES
for ALTER PUBLICATION ... ADD TABLE ...?

I think we shouldn't require replicating the sequence, because who knows
what the schema is on the subscriber? We want to allow differences, so
maybe the sequence is not there. I'd start with just adding them
separately, because that just seems simpler, but maybe there are good
reasons to support adding them in ADD TABLE.
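
To illustrate, with a SERIAL column the implicit sequence is a separate
object, so under the approach above it would be published separately
(a sketch using the grammar proposed by this patch series, which may
still change; `t` and `pub` are made-up names):

```sql
CREATE TABLE t (id bigserial PRIMARY KEY, val text);

-- the implicitly created, owned sequence
SELECT pg_get_serial_sequence('t', 'id');   -- public.t_id_seq

CREATE PUBLICATION pub FOR TABLE t;

-- the sequence has to be added explicitly; ADD TABLE does not pull it in
ALTER PUBLICATION pub ADD SEQUENCE t_id_seq;
```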


2) Should it be possible to add sequences that are also associated with
a serial column, without the table being replicated too? I'd say yes, if
people want to do that - I don't think it can cause any issues, and it's
possible to just use sequence directly for non-serial columns anyway.
Which is the same thing, but we can't detect that.


3) What about sequences for UNLOGGED tables? At the moment we don't
allow sequences to be UNLOGGED (Peter revived his patch [1], but that's
not committed yet). Again, I'd say it's up to the user to decide which
sequences are replicated - it's similar to (2).


4) I wonder if we actually want FOR ALL SEQUENCES. On the one hand it'd
be symmetrical with FOR ALL TABLES, which is the other object type we
can replicate. So it'd seem reasonable to handle them in a similar way.
But it's causing some shift/reduce error in the grammar, so it'll need
some changes.



regards


[1]
https://www.postgresql.org/message-id/8da92c1f-9117-41bc-731b-ce1477a77d69@enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 2/12/22 20:58, Tomas Vondra wrote:
> On 2/12/22 01:34, Tomas Vondra wrote:
>> On 2/10/22 19:17, Tomas Vondra wrote:
>>> I've polished & pushed the first part adding sequence decoding
>>> infrastructure etc. Attached are the two remaining parts.
>>>
>>> I plan to wait a day or two and then push the test_decoding part. The
>>> last part (for built-in replication) will need more work and maybe
>>> rethinking the grammar etc.
>>>
>>
>> I've pushed the second part, adding sequences to test_decoding.
>>
>> Here's the remaining part, rebased, with a small tweak in the TAP test
>> to eliminate the issue with not waiting for sequence increments. I've
>> kept the tweak in a separate patch, so that we can throw it away easily
>> if we happen to resolve the issue.
>>
> 
> Hmm, cfbot was not happy about this, so here's a version fixing the
> elog() format issue reported by CirrusCI/mingw by ditching the log
> message. It was useful for debugging, but otherwise just noise.
> 

There was another elog() making mingw unhappy, so here's a fix for that.

This should also fix an issue on the macOS machine. This is a thinko in
the tests, because wait_for_catchup() may not wait for all the sequence
increments after a rollback. The default mode is "write" which uses
pg_current_wal_lsn(), and that may be a bit stale after a rollback.
Doing a simple insert after the rollback fixes this (using other LSN,
like pg_current_wal_insert_lsn() would work too, but it'd cause long
waits in the test).
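
In SQL terms, the test tweak boils down to issuing a small committed
write after the rollback (a sketch; the table name is hypothetical):

```sql
BEGIN;
SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK;

-- committing any trivial change flushes WAL past the sequence records,
-- so pg_current_wal_lsn() (the 'write' mode used by wait_for_catchup)
-- now covers them
INSERT INTO dummy_tbl VALUES (1);
```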


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 13.02.22 14:10, Tomas Vondra wrote:
> Here's the remaining part, rebased, with a small tweak in the TAP test
> to eliminate the issue with not waiting for sequence increments. I've
> kept the tweak in a separate patch, so that we can throw it away easily
> if we happen to resolve the issue.

This basically looks fine to me.  You have identified a few XXX and 
FIXME spots that should be addressed.

Here are a few more comments:

* general

Handling of owned sequences, as previously discussed.  This would 
probably be a localized change somewhere in get_rel_sync_entry(), so it 
doesn't affect the overall structure of the patch.

pg_dump support is missing.

Some psql \dxxx support should probably be there.  Check where existing 
publication-table relationships are displayed.

* src/backend/catalog/system_views.sql

+         LATERAL pg_get_publication_sequences(P.pubname) GPT,

The GPT presumably stood for "get publication tables", so it should be changed.

(There might be a few more copy-and-paste occasions like this around.  I 
have not written down all of them.)

* src/backend/commands/publicationcmds.c

This adds a new publication option called "sequence".  I would have 
expected it to be named "sequences", but that's just my opinion.

But in any case, the option is not documented at all.

Some of the new functions added in this file are almost exact duplicates 
of the analogous functions for tables.  For example, 
PublicationAddSequences()/PublicationDropSequences() are almost
exactly the same as PublicationAddTables()/PublicationDropTables(). 
This could be handled with less duplication by just adding an ObjectType 
argument to the existing functions.

* src/backend/commands/sequence.c

Could use some refactoring of ResetSequence()/ResetSequence2().  Maybe 
call the latter ResetSequenceData() and have the former call it internally.

* src/backend/commands/subscriptioncmds.c

Also lots of duplication between tables and sequences in this file.

* src/backend/replication/logical/tablesync.c

The comment says it needs sequence support, but there appears to be 
support for initial sequence syncing.  What exactly is missing here?

* src/test/subscription/t/028_sequences.pl

Change to use done_testing() (see 549ec201d6132b7c7ee11ee90a4e02119259ba5b).



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
On 2/15/22 10:00, Peter Eisentraut wrote:
> On 13.02.22 14:10, Tomas Vondra wrote:
>> Here's the remaining part, rebased, with a small tweak in the TAP test
>> to eliminate the issue with not waiting for sequence increments. I've
>> kept the tweak in a separate patch, so that we can throw it away easily
>> if we happen to resolve the issue.
> 
> This basically looks fine to me.  You have identified a few XXX and
> FIXME spots that should be addressed.
> 
> Here are a few more comments:
> 
> * general
> 
> Handling of owned sequences, as previously discussed.  This would
> probably be a localized change somewhere in get_rel_sync_entry(), so it
> doesn't affect the overall structure of the patch.
> 

So you're suggesting not to track owned sequences in pg_publication_rel
explicitly, and to handle them dynamically in the output plugin? So
when calling get_rel_sync_entry on the sequence, we'd check whether
it's owned by a table that is replicated.

We'd want a way to enable/disable this for each publication, but that
makes the lookups more complicated. Also, we'd probably need the same
logic elsewhere (e.g. in psql, when listing sequences in a publication).

I'm not sure we want this complexity, maybe we should simply deal with
this in the ALTER PUBLICATION and track all sequences (owned or not) in
pg_publication_rel.

> pg_dump support is missing.
> 

Will fix.

> Some psql \dxxx support should probably be there.  Check where existing
> publication-table relationships are displayed.
> 

Yeah, I noticed this while adding regression tests. Currently, \dRp+
shows something like this:

                         Publication testpub_fortbl
            Owner           | All tables | Inserts | Updates  ...
  --------------------------+------------+---------+--------- ...
   regress_publication_user | f          | t       | t        ...
  Tables:
      "pub_test.testpub_nopk"
      "public.testpub_tbl1"

or

                     Publication testpub5_forschema
            Owner           | All tables | Inserts | Updates |  ...
  --------------------------+------------+---------+---------+- ...
   regress_publication_user | f          | t       | t       |  ...
  Tables from schemas:
      "CURRENT_SCHEMA"
      "public"

I wonder if we should copy the same block for sequences, so

                         Publication testpub_fortbl
            Owner           | All tables | Inserts | Updates  ...
  --------------------------+------------+---------+--------- ...
   regress_publication_user | f          | t       | t        ...
  Tables:
      "pub_test.testpub_nopk"
      "public.testpub_tbl1"
  Sequences:
      "pub_test.sequence_s1"
      "public.sequence_s2"

And same for "tables from schemas" etc.


> * src/backend/catalog/system_views.sql
> 
> +         LATERAL pg_get_publication_sequences(P.pubname) GPT,
> 
> The GPT presumably stood for "get publication tables", so should be
> changed.
> 
> (There might be a few more copy-and-paste occasions like this around.  I
> have not written down all of them.)
> 

Will fix.

> * src/backend/commands/publicationcmds.c
> 
> This adds a new publication option called "sequence".  I would have
> expected it to be named "sequences", but that's just my opinion.
> 
> But in any case, the option is not documented at all.
> 
> Some of the new functions added in this file are almost exact duplicates
> of the analogous functions for tables.  For example,
> PublicationAddSequences()/PublicationDropSequences() are almost
> exactly the same as PublicationAddTables()/PublicationDropTables(). This
> could be handled with less duplication by just adding an ObjectType
> argument to the existing functions.
> 

Yes, I noticed that too, and I'll review this later, after making sure
the behavior is correct.

> * src/backend/commands/sequence.c
> 
> Could use some refactoring of ResetSequence()/ResetSequence2().  Maybe
> call the latter ResetSequenceData() and have the former call it internally.
> 

Will check.

> * src/backend/commands/subscriptioncmds.c
> 
> Also lots of duplication between tables and sequences in this file.
> 

Same as the case above.

> * src/backend/replication/logical/tablesync.c
> 
> The comment says it needs sequence support, but there appears to be
> support for initial sequence syncing.  What exactly is missing here?
> 

I think that's just an obsolete comment.

> * src/test/subscription/t/028_sequences.pl
> 
> Change to use done_testing() (see
> 549ec201d6132b7c7ee11ee90a4e02119259ba5b).

Will fix.


Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the
same time? That seems like a reasonable thing people might want.

The patch probably also needs to modify pg_publication_namespace to
track whether the schema is FOR TABLES IN SCHEMA or FOR SEQUENCES.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Sat, Feb 19, 2022 at 7:48 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the
> same time? That seems like a reasonable thing people might want.
>

+1. That seems reasonable to me as well. I think the same will apply
to FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
Hi,

Here's an updated version of the patch, fixing most of the issues from
reviews so far. There's still a couple FIXME comments, but I think those
are minor and/or straightforward to deal with.

The main improvements:

1) Elimination of a lot of redundant code - previously there was one
function handling tables and an almost exact copy handling sequences.
Now a single function handles both, possibly with a "sequences" flag to
tweak the behavior.

2) A couple of other functions gained a "sequences" flag, determining
which objects are "interesting". For example PublicationAddSchemas
needs to know whether it's FOR ALL SEQUENCES or FOR ALL TABLES IN
SCHEMA. I don't think we can just use relkind here easily, because for
tables we need to handle various types of tables (regular,
partitioned, ...).

3) I also renamed a couple functions with "tables" in the name, which
are now used for sequences too. So for example OpenTablesList() is
renamed to OpenRelationList() and so on.

4) Addition of a number of regression tests to "publication.sql", which
showed a lot of issues, mostly related to not distinguishing tables and
sequences when handling "FOR ALL TABLES [IN SCHEMA]" and "FOR ALL
SEQUENCES [IN SCHEMA]".

5) Proper tracking of FOR ALL [TABLES|SEQUENCES] IN SCHEMA in a catalog.
The pg_publication_namespace gained a pnsequences flag, which determines
which case it is. So for example if you do

  ALTER PUBLICATION p ADD ALL TABLES IN SCHEMA s;
  ALTER PUBLICATION p ADD ALL SEQUENCES IN SCHEMA s;

there will be two rows in the catalog, one with 't' and one with 'f' in
the new column. I'm not sure this is the best way to track this - maybe
it'd be better to have two flags, and keep a single row. Or maybe we
should have an array of relkinds (but that has the issue with tables
having multiple relkinds mentioned before). Multiple rows make it more
convenient to add/remove publication schemas - with a single row it'd be
necessary to either insert a new row or update an existing one when
adding the schema, and similarly for dropping it.
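
With the proposed pnsequences flag, the catalog contents after the two
commands above would look roughly like this (a sketch; the column is
part of this patch, not an existing catalog column, and the OIDs are
made up):

```sql
SELECT pnpubid, pnnspid, pnsequences FROM pg_publication_namespace;
--  pnpubid | pnnspid | pnsequences
-- ---------+---------+-------------
--    16390 |   16389 | f     <- ADD ALL TABLES IN SCHEMA s
--    16390 |   16389 | t     <- ADD ALL SEQUENCES IN SCHEMA s
```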

But maybe there are reasons / precedent to design this differently ...

6) I'm not entirely sure the object_address changes (handling of the
pnsequences flag) are correct.

7) This updates psql to do roughly the same thing as for tables, so
\dRp now lists sequences added either directly or through a schema, so
you might get a footer like this:

  \dRp+ testpub_mix
  ...
  Tables:
      "public.testpub_tbl1"
  Tables from schemas:
      "pub_test"
  Sequences:
      "public.testpub_seq1"
  Sequences from schemas:
      "pub_test"

Maybe it's a bit too verbose, though. It also adds "All sequences" and
"Sequences" columns to the publication description, but I don't think
that can be done much differently.

FWIW I had to switch the describeOneTableDetails() chunk dealing with
sequences from printQuery() to printTable() in order to handle dynamic
footers.

There's also a change in \dn+ because a schema may be included in one
publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication
with "FOR ALL TABLES IN SCHEMA". So I modified the output to

  \dn+ pub_test1
  ...
  Publications:
      "testpub_schemas" (sequences)
      "testpub_schemas" (tables)

But maybe it'd be better to aggregate this into a single line like

  \dn+ pub_test1
  ...
  Publications:
      "testpub_schemas" (tables, sequences)

Opinions?

8) There's a shortcoming in the current grammar, because you can specify
either

  CREATE PUBLICATION p FOR ALL TABLES;

or

  CREATE PUBLICATION p FOR ALL SEQUENCES;

but it's not possible to do

  CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES;

which seems like a fairly reasonable thing users might want to do.

The problem is that "FOR ALL TABLES" (and same for sequences) is
hard-coded in the grammar, not defined as PublicationObjSpec. This also
means you can't do

  ALTER PUBLICATION p ADD ALL TABLES;

AFAICS there are two ways to fix this - adding the combinations into the
definition of CreatePublicationStmt, or adding FOR ALL TABLES (and
sequences) to PublicationObjSpec.

9) Another grammar limitation is that we don't cross-check the relkind,
so for example

  ALTER PUBLICATION p ADD TABLE sequence;

might actually work. Should be easy to fix, though.

10) Added pg_dump support (including tests). I'll add more tests to
check more grammar combinations.

11) I need to test more grammar combinations in the TAP test too, to
verify the output plugin interprets the stuff correctly.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 2/19/22 06:33, Amit Kapila wrote:
> On Sat, Feb 19, 2022 at 7:48 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the
>> same time? That seems like a reasonable thing people might want.
>>
> 
> +1. That seems reasonable to me as well. I think the same will apply
> to  FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA.
> 

It already works for "IN SCHEMA" because that's handled as a publication
object, but FOR ALL TABLES and FOR ALL SEQUENCES are defined directly in
CreatePublicationStmt.

Which also means you can't do ALTER PUBLICATION and change it to FOR ALL
TABLES. Which is a bit annoying, but OK. It's a bit weird that FOR ALL
TABLES is mentioned in the docs for ALTER PUBLICATION as if it were
supported.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Wed, Feb 23, 2022 at 4:54 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> 7) This updates psql to do roughly the same thing as for tables, so \dRp
> now list sequences added either directly or through schema, so you might
> get footer like this:
>
>   \dRp+ testpub_mix
>   ...
>   Tables:
>       "public.testpub_tbl1"
>   Tables from schemas:
>       "pub_test"
>   Sequences:
>       "public.testpub_seq1"
>   Sequences from schemas:
>       "pub_test"
>
> Maybe it's a bit too verbose, though. It also addes "All sequences" and
> "Sequences" columns into the publication description, but I don't think
> that can be done much differently.
>
> FWIW I had to switch the describeOneTableDetails() chunk dealing with
> sequences from printQuery() to printTable() in order to handle dynamic
> footers.
>
> There's also a change in \dn+ because a schema may be included in one
> publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication
> with "FOR ALL TABLES IN SCHEMA". So I modified the output to
>
>   \dn+ pub_test1
>   ...
>   Publications:
>       "testpub_schemas" (sequences)
>       "testpub_schemas" (tables)
>
> But maybe it'd be better to aggregate this into a single line like
>
>   \dn+ pub_test1
>   ...
>   Publications:
>       "testpub_schemas" (tables, sequences)
>
> Opinions?
>

I think the second one (aggregated) might be slightly better, as that
will lead to fewer lines when there are multiple such publications,
but it should be okay if you and others prefer the first.

> 8) There's a shortcoming in the current grammar, because you can specify
> either
>
>   CREATE PUBLICATION p FOR ALL TABLES;
>
> or
>
>   CREATE PUBLICATION p FOR ALL SEQUENCES;
>
> but it's not possible to do
>
>   CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES;
>
> which seems like a fairly reasonable thing users might want to do.
>

Isn't it better to support this with a syntax as indicated by Tom in
one of his earlier emails on this topic [1]? IIUC, it would be as
follows:

CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES;

> The problem is that "FOR ALL TABLES" (and same for sequences) is
> hard-coded in the grammar, not defined as PublicationObjSpec. This also
> means you can't do
>
>   ALTER PUBLICATION p ADD ALL TABLES;
>
> AFAICS there are two ways to fix this - adding the combinations into the
> definition of CreatePublicationStmt, or adding FOR ALL TABLES (and
> sequences) to PublicationObjSpec.
>

I can imagine that adding to PublicationObjSpec will look compatible
with existing code but maybe another way will also be okay.

[1] - https://www.postgresql.org/message-id/877603.1629120678%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 2/23/22 12:10, Amit Kapila wrote:
> On Wed, Feb 23, 2022 at 4:54 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> 7) This updates psql to do roughly the same thing as for tables, so \dRp
>> now list sequences added either directly or through schema, so you might
>> get footer like this:
>>
>>   \dRp+ testpub_mix
>>   ...
>>   Tables:
>>       "public.testpub_tbl1"
>>   Tables from schemas:
>>       "pub_test"
>>   Sequences:
>>       "public.testpub_seq1"
>>   Sequences from schemas:
>>       "pub_test"
>>
>> Maybe it's a bit too verbose, though. It also addes "All sequences" and
>> "Sequences" columns into the publication description, but I don't think
>> that can be done much differently.
>>
>> FWIW I had to switch the describeOneTableDetails() chunk dealing with
>> sequences from printQuery() to printTable() in order to handle dynamic
>> footers.
>>
>> There's also a change in \dn+ because a schema may be included in one
>> publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication
>> with "FOR ALL TABLES IN SCHEMA". So I modified the output to
>>
>>   \dn+ pub_test1
>>   ...
>>   Publications:
>>       "testpub_schemas" (sequences)
>>       "testpub_schemas" (tables)
>>
>> But maybe it'd be better to aggregate this into a single line like
>>
>>   \dn+ pub_test1
>>   ...
>>   Publications:
>>       "testpub_schemas" (tables, sequences)
>>
>> Opinions?
>>
> 
> I think the second one (aggregated) might be slightly better as that
> will lead to a lesser number of lines when there are multiple such
> publications but it should be okay if you and others prefer first.
> 

Maybe, but I don't think it's very common to have that many schemas
added to the same publication. And it probably does not make much
difference whether you have 1000 or 2000 items in the list - both are
either acceptable or unacceptable, I think.

But I plan to look at this a bit more.

>> 8) There's a shortcoming in the current grammar, because you can specify
>> either
>>
>>   CREATE PUBLICATION p FOR ALL TABLES;
>>
>> or
>>
>>   CREATE PUBLICATION p FOR ALL SEQUENCES;
>>
>> but it's not possible to do
>>
>>   CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES;
>>
>> which seems like a fairly reasonable thing users might want to do.
>>
> 
> Isn't it better to support this with a syntax as indicated by Tom in
> one of his earlier emails on this topic [1]? IIUC, it would be as
> follows:
> 
> CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES;
> 

Yes. That's mostly what I meant by adding this to PublicationObjSpec.

>> The problem is that "FOR ALL TABLES" (and same for sequences) is
>> hard-coded in the grammar, not defined as PublicationObjSpec. This also
>> means you can't do
>>
>>   ALTER PUBLICATION p ADD ALL TABLES;
>>
>> AFAICS there are two ways to fix this - adding the combinations into the
>> definition of CreatePublicationStmt, or adding FOR ALL TABLES (and
>> sequences) to PublicationObjSpec.
>>
> 
> I can imagine that adding to PublicationObjSpec will look compatible
> with existing code but maybe another way will also be okay.
> 

I think just hard-coding this into CreatePublicationStmt would work, but
it'll be an issue once/if we start adding more options. I'm not sure if
it makes sense to replicate other relkinds, but maybe DDL?

I'll try tweaking PublicationObjSpec, and we'll see.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
"Euler Taveira"
Date:
On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote:
Maybe, but I don't think it's very common to have that many schemas
added to the same publication. And it probably does not make much
difference whether you have 1000 or 2000 items in the list - either both
are acceptable or unacceptable, I think.
Wouldn't it confuse users? Hey, duplicate publication. How? Wait. Doh.

I think just hard-coding this into CreatePublicationStmt would work, but
it'll be an issue once/if we start adding more options. I'm not sure if
it makes sense to replicate other relkinds, but maybe DDL?
Materialized views? Since you mentioned DDL, maybe we can use the CREATE
PUBLICATION syntax to select which DDL commands we want to replicate.


--
Euler Taveira

Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 2/23/22 18:33, Euler Taveira wrote:
> On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote:
>> Maybe, but I don't think it's very common to have that many
>> schemas added to the same publication. And it probably does not
>> make much difference whether you have 1000 or 2000 items in the
>> list - either both are acceptable or unacceptable, I think.
>
> Wouldn't it confuse users? Hey, duplicate publication. How? Wait.
> Doh.
> 

I don't follow. Duplicate publications? This talks about rows in
pg_publication_namespace, not pg_publication.

>> I think just hard-coding this into CreatePublicationStmt would
>> work, but it'll be an issue once/if we start adding more options.
>> I'm not sure if it makes sense to replicate other relkinds, but
>> maybe DDL?
>
> Materialized view? As you mentioned DDL, maybe we can use the CREATE 
> PUBLICATION syntax to select which DDL commands we want to
> replicate.
> 

Well, yeah. But that doesn't really say whether to hard-code it into the
CREATE PUBLICATION syntax or cover it by PublicationObjSpec. Hard-coding
it also means we can't ALTER the publication to be FOR ALL TABLES etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
"Euler Taveira"
Date:
On Wed, Feb 23, 2022, at 4:18 PM, Tomas Vondra wrote:
On 2/23/22 18:33, Euler Taveira wrote:
> On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote:
>> Maybe, but I don't think it's very common to have that many
>> schemas added to the same publication. And it probably does not
>> make much difference whether you have 1000 or 2000 items in the
>> list - either both are acceptable or unacceptable, I think.
>
> Wouldn't it confuse users? Hey, duplicate publication. How? Wait.
> Doh.


I don't follow. Duplicate publications? This talks about rows in
pg_publication_namespace, not pg_publication.
I was referring to

  Publications:
      "testpub_schemas" (sequences)
      "testpub_schemas" (tables)

versus

  Publications:
      "testpub_schemas" (tables, sequences)

I prefer the latter.


--
Euler Taveira

Re: logical decoding and replication of sequences

From:
Peter Eisentraut
Date:
On 23.02.22 12:10, Amit Kapila wrote:
> Isn't it better to support this with a syntax as indicated by Tom in
> one of his earlier emails on this topic [1]? IIUC, it would be as
> follows:
> 
> CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES;

I don't think there is any point in supporting this.  What FOR ALL 
TABLES was really supposed to mean was "everything you can get your 
hands on".  I think we should just redefine FOR ALL TABLES to mean that, 
maybe replace it with a different syntax.  If you want to exclude 
sequences for some reason, there is already a publication option for 
that.  And FOR ALL SEQUENCES by itself doesn't make any sense in practice.
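
For illustration, excluding sequences via the option would look
something like this (a sketch; the option name "sequence" vs.
"sequences" is still being discussed upthread and is not part of any
released PostgreSQL version):

```sql
CREATE PUBLICATION p FOR ALL TABLES WITH (sequence = false);
```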

Are there any other object types besides tables and sequences that we 
might want to logically replicate in the future and whose possible 
syntax we should think about?  I can't think of anything.




Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/10/22 19:17, Tomas Vondra wrote:
> > I've polished & pushed the first part adding sequence decoding
> > infrastructure etc. Attached are the two remaining parts.
> >
> > I plan to wait a day or two and then push the test_decoding part. The
> > last part (for built-in replication) will need more work and maybe
> > rethinking the grammar etc.
> >
>
> I've pushed the second part, adding sequences to test_decoding.
>

The test_decoding tests have been failing randomly in the last few
days. I am not completely sure, but the failures might be related to
this work. Two of them appear to be due to the same reason:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07

TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
"reorderbuffer.c", Line: 1173, PID: 35013)
0   postgres                            0x00593de0 ExceptionalCondition + 160\\0

Another:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48

--- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out
2022-02-14 20:19:14.000000000 +0000
+++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out
2022-02-16 07:42:18.000000000 +0000
@@ -126,6 +126,7 @@
   table public.replication_example: INSERT: id[integer]:4
somedata[integer]:3 text[character varying]:null
testcolumn1[integer]:null
   table public.replication_example: INSERT: id[integer]:5
somedata[integer]:4 text[character varying]:null
testcolumn1[integer]:2 testcolumn2[integer]:1
   COMMIT
+  sequence public.replication_example_id_seq: transactional:0
last_value: 38 log_cnt: 0 is_called:1
   BEGIN
   table public.replication_example: INSERT: id[integer]:6
somedata[integer]:5 text[character varying]:null
testcolumn1[integer]:3 testcolumn2[integer]:null
   COMMIT
@@ -133,7 +134,7 @@
   table public.replication_example: INSERT: id[integer]:7
somedata[integer]:6 text[character varying]:null
testcolumn1[integer]:4 testcolumn2[integer]:null
   table public.replication_example: INSERT: id[integer]:8
somedata[integer]:7 text[character varying]:null
testcolumn1[integer]:5 testcolumn2[integer]:null
testcolumn3[integer]:1
   COMMIT
- (15 rows)
+ (16 rows)

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 2/10/22 19:17, Tomas Vondra wrote:
> > > I've polished & pushed the first part adding sequence decoding
> > > infrastructure etc. Attached are the two remaining parts.
> > >
> > > I plan to wait a day or two and then push the test_decoding part. The
> > > last part (for built-in replication) will need more work and maybe
> > > rethinking the grammar etc.
> > >
> >
> > I've pushed the second part, adding sequences to test_decoding.
> >
>
> The test_decoding is failing randomly in the last few days. I am not
> completely sure but they might be related to this work. The two of
> these appears to be due to the same reason:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>
> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
> "reorderbuffer.c", Line: 1173, PID: 35013)
> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>

While reviewing the code for this, I noticed that in
sequence_decode(), we don't call ReorderBufferProcessXid to register
the first known lsn in WAL for the current xid. The similar functions
logicalmsg_decode() and heap_decode() do call ReorderBufferProcessXid
even if they decide not to queue or send the change. Is there a reason
for not doing the same here? However, I am not able to deduce any
scenario where the lack of this would lead to such an assertion
failure. Any thoughts?

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Peter Smith
Date:
On Tue, Mar 1, 2022 at 10:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > On 2/10/22 19:17, Tomas Vondra wrote:
> > > > I've polished & pushed the first part adding sequence decoding
> > > > infrastructure etc. Attached are the two remaining parts.
> > > >
> > > > I plan to wait a day or two and then push the test_decoding part. The
> > > > last part (for built-in replication) will need more work and maybe
> > > > rethinking the grammar etc.
> > > >
> > >
> > > I've pushed the second part, adding sequences to test_decoding.
> > >
> >
> > The test_decoding is failing randomly in the last few days. I am not
> > completely sure but they might be related to this work. The two of
> > these appears to be due to the same reason:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
> >
> > TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
> > "reorderbuffer.c", Line: 1173, PID: 35013)
> > 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
> >

FYI, it looks like the same assertion has failed again on the same
build-farm machine [1]

>
> While reviewing the code for this, I noticed that in
> sequence_decode(), we don't call ReorderBufferProcessXid to register
> the first known lsn in WAL for the current xid. The similar functions
> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid
> even if they decide not to queue or send the change. Is there a reason
> for not doing the same here? However, I am not able to deduce any
> scenario where lack of this will lead to such an Assertion failure.
> Any thoughts?
>

------
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-03%2023%3A14%3A26

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 3/1/22 12:53, Amit Kapila wrote:
> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>> I've polished & pushed the first part adding sequence decoding
>>>> infrastructure etc. Attached are the two remaining parts.
>>>>
>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>> last part (for built-in replication) will need more work and maybe
>>>> rethinking the grammar etc.
>>>>
>>>
>>> I've pushed the second part, adding sequences to test_decoding.
>>>
>>
>> The test_decoding is failing randomly in the last few days. I am not
>> completely sure but they might be related to this work. The two of
>> these appears to be due to the same reason:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>
>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>> "reorderbuffer.c", Line: 1173, PID: 35013)
>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>
> 
> While reviewing the code for this, I noticed that in
> sequence_decode(), we don't call ReorderBufferProcessXid to register
> the first known lsn in WAL for the current xid. The similar functions
> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid
> even if they decide not to queue or send the change. Is there a reason
> for not doing the same here? However, I am not able to deduce any
> scenario where lack of this will lead to such an Assertion failure.
> Any thoughts?
> 

Thanks, that seems like an omission. Will fix.
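
Presumably the fix just mirrors what logicalmsg_decode() and
heap_decode() already do - a simplified sketch (exact placement inside
sequence_decode() is an assumption, not the actual patch):

```c
/* Register the xid and this record's LSN with the reorder buffer
 * before any early return that skips queuing the change, so that
 * the transaction's first_lsn is tracked even when nothing is sent
 * to the output plugin. */
ReorderBufferProcessXid(ctx->reorder,
                        XLogRecGetXid(buf->record),
                        buf->origptr);
```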


regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
On 2/28/22 12:46, Amit Kapila wrote:
> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/10/22 19:17, Tomas Vondra wrote:
>>> I've polished & pushed the first part adding sequence decoding
>>> infrastructure etc. Attached are the two remaining parts.
>>>
>>> I plan to wait a day or two and then push the test_decoding part. The
>>> last part (for built-in replication) will need more work and maybe
>>> rethinking the grammar etc.
>>>
>>
>> I've pushed the second part, adding sequences to test_decoding.
>>
> 
> The test_decoding is failing randomly in the last few days. I am not
> completely sure but they might be related to this work. The two of
> these appears to be due to the same reason:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
> 
> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
> "reorderbuffer.c", Line: 1173, PID: 35013)
> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
> 

This might be related to the issue reported by Amit, i.e. that
sequence_decode does not call ReorderBufferProcessXid(). If this keeps
failing, we'll have to add some extra debug info (logging LSN etc.), at
least temporarily. It'd be valuable to inspect the WAL too.

> Another:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48
> 
> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out
> 2022-02-14 20:19:14.000000000 +0000
> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out
> 2022-02-16 07:42:18.000000000 +0000
> @@ -126,6 +126,7 @@
>    table public.replication_example: INSERT: id[integer]:4
> somedata[integer]:3 text[character varying]:null
> testcolumn1[integer]:null
>    table public.replication_example: INSERT: id[integer]:5
> somedata[integer]:4 text[character varying]:null
> testcolumn1[integer]:2 testcolumn2[integer]:1
>    COMMIT
> +  sequence public.replication_example_id_seq: transactional:0
> last_value: 38 log_cnt: 0 is_called:1
>    BEGIN
>    table public.replication_example: INSERT: id[integer]:6
> somedata[integer]:5 text[character varying]:null
> testcolumn1[integer]:3 testcolumn2[integer]:null
>    COMMIT
> @@ -133,7 +134,7 @@
>    table public.replication_example: INSERT: id[integer]:7
> somedata[integer]:6 text[character varying]:null
> testcolumn1[integer]:4 testcolumn2[integer]:null
>    table public.replication_example: INSERT: id[integer]:8
> somedata[integer]:7 text[character varying]:null
> testcolumn1[integer]:5 testcolumn2[integer]:null
> testcolumn3[integer]:1
>    COMMIT
> - (15 rows)
> + (16 rows)
> 

Interesting. I can think of one reason that might cause this - we log
the first sequence increment after a checkpoint. So if a checkpoint
happens in an unfortunate place, there'll be an extra WAL record. On
slow / busy machines that's quite possible, I guess.

I wonder if these two issues might be related ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/7/22 17:39, Tomas Vondra wrote:
> 
> 
> On 3/1/22 12:53, Amit Kapila wrote:
>> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>>> I've polished & pushed the first part adding sequence decoding
>>>>> infrastructure etc. Attached are the two remaining parts.
>>>>>
>>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>>> last part (for built-in replication) will need more work and maybe
>>>>> rethinking the grammar etc.
>>>>>
>>>>
>>>> I've pushed the second part, adding sequences to test_decoding.
>>>>
>>>
>>> The test_decoding is failing randomly in the last few days. I am not
>>> completely sure but they might be related to this work. The two of
>>> these appears to be due to the same reason:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>>
>>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>>> "reorderbuffer.c", Line: 1173, PID: 35013)
>>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>>
>>
>> While reviewing the code for this, I noticed that in
>> sequence_decode(), we don't call ReorderBufferProcessXid to register
>> the first known lsn in WAL for the current xid. The similar functions
>> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid
>> even if they decide not to queue or send the change. Is there a reason
>> for not doing the same here? However, I am not able to deduce any
>> scenario where lack of this will lead to such an Assertion failure.
>> Any thoughts?
>>
> 
> Thanks, that seems like an omission. Will fix.
> 

I've pushed this simple fix. Not sure it'll fix the assert failures on
skink/locust, though. Given the lack of information it'll be difficult
to verify. So let's wait a bit.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/7/22 17:53, Tomas Vondra wrote:
> On 2/28/22 12:46, Amit Kapila wrote:
>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>> I've polished & pushed the first part adding sequence decoding
>>>> infrastructure etc. Attached are the two remaining parts.
>>>>
>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>> last part (for built-in replication) will need more work and maybe
>>>> rethinking the grammar etc.
>>>>
>>>
>>> I've pushed the second part, adding sequences to test_decoding.
>>>
>>
>> The test_decoding is failing randomly in the last few days. I am not
>> completely sure but they might be related to this work. The two of
>> these appears to be due to the same reason:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>
>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>> "reorderbuffer.c", Line: 1173, PID: 35013)
>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>
> 
> This might be related to the issue reported by Amit, i.e. that
> sequence_decode does not call ReorderBufferProcessXid(). If this keeps
> failing, we'll have to add some extra debug info (logging LSN etc.), at
> least temporarily. It'd be valuable to inspect the WAL too.
> 
>> Another:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48
>>
>> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out
>> 2022-02-14 20:19:14.000000000 +0000
>> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out
>> 2022-02-16 07:42:18.000000000 +0000
>> @@ -126,6 +126,7 @@
>>    table public.replication_example: INSERT: id[integer]:4
>> somedata[integer]:3 text[character varying]:null
>> testcolumn1[integer]:null
>>    table public.replication_example: INSERT: id[integer]:5
>> somedata[integer]:4 text[character varying]:null
>> testcolumn1[integer]:2 testcolumn2[integer]:1
>>    COMMIT
>> +  sequence public.replication_example_id_seq: transactional:0
>> last_value: 38 log_cnt: 0 is_called:1
>>    BEGIN
>>    table public.replication_example: INSERT: id[integer]:6
>> somedata[integer]:5 text[character varying]:null
>> testcolumn1[integer]:3 testcolumn2[integer]:null
>>    COMMIT
>> @@ -133,7 +134,7 @@
>>    table public.replication_example: INSERT: id[integer]:7
>> somedata[integer]:6 text[character varying]:null
>> testcolumn1[integer]:4 testcolumn2[integer]:null
>>    table public.replication_example: INSERT: id[integer]:8
>> somedata[integer]:7 text[character varying]:null
>> testcolumn1[integer]:5 testcolumn2[integer]:null
>> testcolumn3[integer]:1
>>    COMMIT
>> - (15 rows)
>> + (16 rows)
>>
> 
> Interesting. I can think of one reason that might cause this - we log
> the first sequence increment after a checkpoint. So if a checkpoint
> happens in an unfortunate place, there'll be an extra WAL record. On
> slow / busy machines that's quite possible, I guess.
> 

I've tweaked the checkpoint_interval to make checkpoints more aggressive
(set it to 1s), and it seems my hunch was correct - it produces failures
exactly like this one. The best fix probably is to just disable decoding
of sequences in those tests that are not aimed at testing sequence decoding.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/7/22 22:25, Tomas Vondra wrote:
> 
> 
> On 3/7/22 17:53, Tomas Vondra wrote:
>> On 2/28/22 12:46, Amit Kapila wrote:
>>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>>> I've polished & pushed the first part adding sequence decoding
>>>>> infrastructure etc. Attached are the two remaining parts.
>>>>>
>>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>>> last part (for built-in replication) will need more work and maybe
>>>>> rethinking the grammar etc.
>>>>>
>>>>
>>>> I've pushed the second part, adding sequences to test_decoding.
>>>>
>>>
>>> The test_decoding is failing randomly in the last few days. I am not
>>> completely sure but they might be related to this work. The two of
>>> these appears to be due to the same reason:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>>
>>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>>> "reorderbuffer.c", Line: 1173, PID: 35013)
>>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>>
>>
>> This might be related to the issue reported by Amit, i.e. that
>> sequence_decode does not call ReorderBufferProcessXid(). If this keeps
>> failing, we'll have to add some extra debug info (logging LSN etc.), at
>> least temporarily. It'd be valuable to inspect the WAL too.
>>
>>> Another:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48
>>>
>>> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out
>>> 2022-02-14 20:19:14.000000000 +0000
>>> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out
>>> 2022-02-16 07:42:18.000000000 +0000
>>> @@ -126,6 +126,7 @@
>>>    table public.replication_example: INSERT: id[integer]:4
>>> somedata[integer]:3 text[character varying]:null
>>> testcolumn1[integer]:null
>>>    table public.replication_example: INSERT: id[integer]:5
>>> somedata[integer]:4 text[character varying]:null
>>> testcolumn1[integer]:2 testcolumn2[integer]:1
>>>    COMMIT
>>> +  sequence public.replication_example_id_seq: transactional:0
>>> last_value: 38 log_cnt: 0 is_called:1
>>>    BEGIN
>>>    table public.replication_example: INSERT: id[integer]:6
>>> somedata[integer]:5 text[character varying]:null
>>> testcolumn1[integer]:3 testcolumn2[integer]:null
>>>    COMMIT
>>> @@ -133,7 +134,7 @@
>>>    table public.replication_example: INSERT: id[integer]:7
>>> somedata[integer]:6 text[character varying]:null
>>> testcolumn1[integer]:4 testcolumn2[integer]:null
>>>    table public.replication_example: INSERT: id[integer]:8
>>> somedata[integer]:7 text[character varying]:null
>>> testcolumn1[integer]:5 testcolumn2[integer]:null
>>> testcolumn3[integer]:1
>>>    COMMIT
>>> - (15 rows)
>>> + (16 rows)
>>>
>>
>> Interesting. I can think of one reason that might cause this - we log
>> the first sequence increment after a checkpoint. So if a checkpoint
>> happens in an unfortunate place, there'll be an extra WAL record. On
>> slow / busy machines that's quite possible, I guess.
>>
> 
> I've tweaked the checkpoint_interval to make checkpoints more aggressive
> (set it to 1s), and it seems my hunch was correct - it produces failures
> exactly like this one. The best fix probably is to just disable decoding
> of sequences in those tests that are not aimed at testing sequence decoding.
> 

I've pushed a fix for this, adding "include-sequences=0" to a couple
test_decoding tests, which were failing with concurrent checkpoints.

Unfortunately, I realized we have a similar issue in the "sequences"
tests too :-( Imagine you do a series of sequence increments, e.g.

  SELECT nextval('s') FROM generate_series(1,100);

If there's a concurrent checkpoint, this may add an extra WAL record,
affecting the decoded output (and also the data stored in the sequence
relation itself). Not sure what to do about this ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/7/22 22:11, Tomas Vondra wrote:
> 
> 
> On 3/7/22 17:39, Tomas Vondra wrote:
>>
>>
>> On 3/1/22 12:53, Amit Kapila wrote:
>>> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> On 2/10/22 19:17, Tomas Vondra wrote:
>>>>>> I've polished & pushed the first part adding sequence decoding
>>>>>> infrastructure etc. Attached are the two remaining parts.
>>>>>>
>>>>>> I plan to wait a day or two and then push the test_decoding part. The
>>>>>> last part (for built-in replication) will need more work and maybe
>>>>>> rethinking the grammar etc.
>>>>>>
>>>>>
>>>>> I've pushed the second part, adding sequences to test_decoding.
>>>>>
>>>>
>>>> The test_decoding is failing randomly in the last few days. I am not
>>>> completely sure but they might be related to this work. The two of
>>>> these appears to be due to the same reason:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07
>>>>
>>>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>>>> "reorderbuffer.c", Line: 1173, PID: 35013)
>>>> 0   postgres                            0x00593de0 ExceptionalCondition + 160\\0
>>>>
>>>
>>> While reviewing the code for this, I noticed that in
>>> sequence_decode(), we don't call ReorderBufferProcessXid to register
>>> the first known lsn in WAL for the current xid. The similar functions
>>> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid
>>> even if they decide not to queue or send the change. Is there a reason
>>> for not doing the same here? However, I am not able to deduce any
>>> scenario where lack of this will lead to such an Assertion failure.
>>> Any thoughts?
>>>
>>
>> Thanks, that seems like an omission. Will fix.
>>
> 
> I've pushed this simple fix. Not sure it'll fix the assert failures on
> skink/locust, though. Given the lack of information it'll be difficult
> to verify. So let's wait a bit.
> 

I've done about 5000 runs of 'make check' in test_decoding, on two rpi
machines (one armv7, one aarch64). Not a single assert failure :-(

How come skink/locust hit that in just a couple runs?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Wed, Mar 9, 2022 at 4:14 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/7/22 22:11, Tomas Vondra wrote:
> >
> > I've pushed this simple fix. Not sure it'll fix the assert failures on
> > skink/locust, though. Given the lack of information it'll be difficult
> > to verify. So let's wait a bit.
> >
>
> I've done about 5000 runs of 'make check' in test_decoding, on two rpi
> machines (one armv7, one aarch64). Not a single assert failure :-(
>
> How come skink/locust hit that in just a couple runs?
>

Has it failed since you pushed the fix? I don't think so, or am I
missing something? Even if it doesn't occur again, it would have been
better if we had some theory on how it occurred in the first place,
because that would make us more confident that we don't have any
related problem left.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/9/22 12:41, Amit Kapila wrote:
> On Wed, Mar 9, 2022 at 4:14 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/7/22 22:11, Tomas Vondra wrote:
>>>
>>> I've pushed this simple fix. Not sure it'll fix the assert failures on
>>> skink/locust, though. Given the lack of information it'll be difficult
>>> to verify. So let's wait a bit.
>>>
>>
>> I've done about 5000 runs of 'make check' in test_decoding, on two rpi
>> machines (one armv7, one aarch64). Not a single assert failure :-(
>>
>> How come skink/locust hit that in just a couple runs?
>>
> 
> Is it failed after you pushed a fix? I don't think so or am I missing
> something? I feel even if doesn't occur again it would have been
> better if we had some theory on how it occurred in the first place
> because that would make us feel more confident that we won't have any
> related problem left.
> 

I don't think it failed yet - we have to wait a bit longer to make any
conclusions, though. On skink it failed only twice over 1 month. I agree
it'd be nice to have some theory, but I really don't have one.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 23.02.22 00:24, Tomas Vondra wrote:
> Here's an updated version of the patch, fixing most of the issues from
> reviews so far. There's still a couple FIXME comments, but I think those
> are minor and/or straightforward to deal with.

This patch needs a rebase because of a conflict in 
expected/publication.out.  In addition, see the attached fixup patch to 
get the pg_dump tests passing (and some other stuff).

028_sequences.pl should be renamed to 029, since there is now another 028.

In psql, the output of \dRp and \dRp+ is inconsistent.  The former shows

All tables | All sequences | Inserts | Updates | Deletes | Truncates | 
Sequences | Via root

the latter shows

All tables | All sequences | Inserts | Updates | Deletes | Sequences | 
Truncates | Via root

I think the first order is the best one.

That's all for now, I'll come back with more reviewing later.
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/10/22 12:07, Peter Eisentraut wrote:
> On 23.02.22 00:24, Tomas Vondra wrote:
>> Here's an updated version of the patch, fixing most of the issues from
>> reviews so far. There's still a couple FIXME comments, but I think those
>> are minor and/or straightforward to deal with.
> 
> This patch needs a rebase because of a conflict in
> expected/publication.out.  In addition, see the attached fixup patch to
> get the pg_dump tests passing (and some other stuff).
> 

OK, rebased patch attached.

> 028_sequences.pl should be renamed to 029, since there is now another 028.
> 

Renamed.

> In psql, the output of \dRp and \dRp+ is inconsistent.  The former shows
> 
> All tables | All sequences | Inserts | Updates | Deletes | Truncates |
> Sequences | Via root
> 
> the latter shows
> 
> All tables | All sequences | Inserts | Updates | Deletes | Sequences |
> Truncates | Via root
> 
> I think the first order is the best one.
> 

Good idea, I've tweaked the code to use the former order.

> That's all for now, I'll come back with more reviewing later.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/7/22 22:25, Tomas Vondra wrote:
> >>
> >> Interesting. I can think of one reason that might cause this - we log
> >> the first sequence increment after a checkpoint. So if a checkpoint
> >> happens in an unfortunate place, there'll be an extra WAL record. On
> >> slow / busy machines that's quite possible, I guess.
> >>
> >
> > I've tweaked the checkpoint_interval to make checkpoints more aggressive
> > (set it to 1s), and it seems my hunch was correct - it produces failures
> > exactly like this one. The best fix probably is to just disable decoding
> > of sequences in those tests that are not aimed at testing sequence decoding.
> >
>
> I've pushed a fix for this, adding "include-sequences=0" to a couple
> test_decoding tests, which were failing with concurrent checkpoints.
>
> Unfortunately, I realized we have a similar issue in the "sequences"
> tests too :-( Imagine you do a series of sequence increments, e.g.
>
> >   SELECT nextval('s') FROM generate_series(1,100);
>
> If there's a concurrent checkpoint, this may add an extra WAL record,
> affecting the decoded output (and also the data stored in the sequence
> relation itself). Not sure what to do about this ...
>

I am also not sure what to do about it, but maybe we can increase
the checkpoint timeout or other parameters for these tests in some
way to reduce the chances of such failures. Another idea could be
to perform a checkpoint before the start of the tests, to reduce the
possibility of another checkpoint occurring during them.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Fri, Mar 11, 2022 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 3/7/22 22:25, Tomas Vondra wrote:
> > >>
> > >> Interesting. I can think of one reason that might cause this - we log
> > >> the first sequence increment after a checkpoint. So if a checkpoint
> > >> happens in an unfortunate place, there'll be an extra WAL record. On
> > >> slow / busy machines that's quite possible, I guess.
> > >>
> > >
> > > I've tweaked the checkpoint_interval to make checkpoints more aggressive
> > > (set it to 1s), and it seems my hunch was correct - it produces failures
> > > exactly like this one. The best fix probably is to just disable decoding
> > > of sequences in those tests that are not aimed at testing sequence decoding.
> > >
> >
> > I've pushed a fix for this, adding "include-sequences=0" to a couple
> > test_decoding tests, which were failing with concurrent checkpoints.
> >
> > Unfortunately, I realized we have a similar issue in the "sequences"
> > tests too :-( Imagine you do a series of sequence increments, e.g.
> >
> >   SELECT nextval('s') FROM generate_series(1,100);
> >
> > If there's a concurrent checkpoint, this may add an extra WAL record,
> > affecting the decoded output (and also the data stored in the sequence
> > relation itself). Not sure what to do about this ...
> >
>
> I am also not sure what to do for it but maybe if in some way we can
> increase checkpoint timeout or other parameters for these tests then
> it would reduce the chances of such failures. The other idea could be
> to perform checkpoint before the start of tests to reduce the
> possibility of another checkpoint.
>

One more thing I noticed while checking the commit for this feature:
the include below seems to be out of order:
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -42,6 +42,7 @@
 #include "replication/reorderbuffer.h"
 #include "replication/snapbuild.h"
 #include "storage/standby.h"
+#include "commands/sequence.h"

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/11/22 12:34, Amit Kapila wrote:
> On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/7/22 22:25, Tomas Vondra wrote:
>>>>
>>>> Interesting. I can think of one reason that might cause this - we log
>>>> the first sequence increment after a checkpoint. So if a checkpoint
>>>> happens in an unfortunate place, there'll be an extra WAL record. On
>>>> slow / busy machines that's quite possible, I guess.
>>>>
>>>
>>> I've tweaked the checkpoint_interval to make checkpoints more aggressive
>>> (set it to 1s), and it seems my hunch was correct - it produces failures
>>> exactly like this one. The best fix probably is to just disable decoding
>>> of sequences in those tests that are not aimed at testing sequence decoding.
>>>
>>
>> I've pushed a fix for this, adding "include-sequences=0" to a couple
>> test_decoding tests, which were failing with concurrent checkpoints.
>>
>> Unfortunately, I realized we have a similar issue in the "sequences"
>> tests too :-( Imagine you do a series of sequence increments, e.g.
>>
>>   SELECT nextval('s') FROM generate_series(1,100);
>>
>> If there's a concurrent checkpoint, this may add an extra WAL record,
>> affecting the decoded output (and also the data stored in the sequence
>> relation itself). Not sure what to do about this ...
>>
> 
> I am also not sure what to do for it but maybe if in some way we can
> increase checkpoint timeout or other parameters for these tests then
> it would reduce the chances of such failures. The other idea could be
> to perform checkpoint before the start of tests to reduce the
> possibility of another checkpoint.
> 

Yeah, I had the same ideas, but I'm not sure I like any of them. I doubt
we want to make checkpoints extremely rare, and even if we do that it'll
still fail on slow machines (e.g. with valgrind, clobber cache etc.).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
Further review (based on 20220310 patch):

  doc/src/sgml/ref/create_publication.sgml       |   3 +

For the clauses added to the synopsis, descriptions should be added
below.  See attached patch for a start.

  src/backend/commands/sequence.c                |  79 ++

There is quite a bit of overlap between ResetSequence() and
ResetSequence2(), but I couldn't see a good way to combine them that
genuinely saves code and complexity.  So maybe it's ok.

Actually, ResetSequence2() is not really "reset", it's just "set".
Maybe pick a different function name.

  src/backend/commands/subscriptioncmds.c        | 272 +++++++

The code added in AlterSubscription_refresh() seems to be entirely
copy-and-paste from the tables case.  I think this could be combined
by concatenating the lists from fetch_table_list() and
fetch_sequence_list() and looping over it once.  The same also applies
to CreateSubscription(), although the code duplication is smaller
there.

This in turn means that fetch_table_list() and fetch_sequence_list()
can be combined, so that you don't actually need any extensive new
code in CreateSubscription() and AlterSubscription_refresh() for
sequences.  This could go on, you can combine more of the underlying
code, like pg_publication_tables and pg_publication_sequences and so
on.

  src/backend/replication/logical/proto.c        |  52 ++

The documentation of the added protocol message needs to be added to
the documentation.  See attached patch for a start.

The sequence message does not contain the sequence Oid, unlike the
relation message.  Would that be good to add?

  src/backend/replication/logical/worker.c       |  56 ++

Maybe the Asserts in apply_handle_sequence() should be elogs.  These
are checking what is sent over the network, so we don't want a
bad/evil peer able to trigger asserts.  And in non-assert builds these
conditions would be unchecked.

  src/backend/replication/pgoutput/pgoutput.c    |  82 +-

I find the logic in get_rel_sync_entry() confusing.  You add a section for

if (!publish && is_sequence)

but then shouldn't the code below that be something like

if (!publish && !is_sequence)

  src/bin/pg_dump/t/002_pg_dump.pl               |  38 +-

This adds a new publication "pub4", but the tests already contain a
"pub4".  I'm not sure why this even works, but perhaps the new one
should be "pub5", unless there is a deeper meaning.

  src/include/catalog/pg_publication_namespace.h |   3 +-

I don't like how the distinction between table and sequence is done
using a bool field.  That affects also the APIs in pg_publication.c
and publicationcmds.c especially.  There is a lot of unadorned "true"
and "false" being passed around that isn't very clear, and it all
appears to originate at this catalog.  I think we could use a char
field here that uses the relkind constants.  That would also make the
code in pg_publication.c etc. slightly clearer.


See attached patch for more small tweaks.

Your patch still contains a number of XXX and FIXME comments, which in 
my assessment are all more or less correct, so I didn't comment on those 
separately.

Other than that, this seems pretty good.

Earlier in the thread I commented on some aspects of the new grammar 
(e.g., do we need FOR ALL SEQUENCES?).  I think this would be useful to 
review again after all the new logical replication patches are in.  I 
don't want to hold up this patch for that at this point.
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/13/22 07:45, Peter Eisentraut wrote:
> Further review (based on 20220310 patch):
> 
>  doc/src/sgml/ref/create_publication.sgml       |   3 +
> 
> For the clauses added to the synopsis, descriptions should be added
> below.  See attached patch for a start.

Thanks. I'm not sure what other improvements you think this .sgml
file needs.

> 
>  src/backend/commands/sequence.c                |  79 ++
> 
> There is quite a bit of overlap between ResetSequence() and
> ResetSequence2(), but I couldn't see a good way to combine them that
> genuinely saves code and complexity.  So maybe it's ok.
> 
> Actually, ResetSequence2() is not really "reset", it's just "set".
> Maybe pick a different function name.

Yeah, good point. I think the functions are sufficiently different, and
attempting to remove the duplication is unlikely to be an improvement.
But you're right "ResetSequence2" is not a great name, so I've changed
it to "SetSequence".

> 
>  src/backend/commands/subscriptioncmds.c        | 272 +++++++
> 
> The code added in AlterSubscription_refresh() seems to be entirely
> copy-and-paste from the tables case.  I think this could be combined
> by concatenating the lists from fetch_table_list() and
> fetch_sequence_list() and looping over it once.  The same also applies
> to CreateSubscription(), although the code duplication is smaller
> there.
> 
> This in turn means that fetch_table_list() and fetch_sequence_list()
> can be combined, so that you don't actually need any extensive new
> code in CreateSubscription() and AlterSubscription_refresh() for
> sequences.  This could go on, you can combine more of the underlying
> code, like pg_publication_tables and pg_publication_sequences and so
> on.
> 

I've removed the duplicated code in both places, so that it processes
only a single list, which is a combination of tables and sequences
(using list_concat). For CreateSubscription it was trivial, because the
code was simple and a perfect copy. For _refresh there was a minor
difference, but I think it was actually entirely unnecessary when
processing the combined list. But this will need more testing.

I'm not sure about the last bit, though. How would you combine code for
pg_publication_tables and pg_publication_sequences, etc?

>  src/backend/replication/logical/proto.c        |  52 ++
> 
> The documentation of the added protocol message needs to be added to
> the documentation.  See attached patch for a start.
> 

OK. I've resolved the FIXME for LSN. Not sure what else is needed?

> The sequence message does not contain the sequence Oid, unlike the
> relation message.  Would that be good to add?

I don't think we need to do that. For relations we do that because it
serves as an identifier in RelationSyncCache, and it links the various
messages to it. For sequences we don't need that - the schema is fixed.

Or do you see a practical reason to add the OID?

> 
>  src/backend/replication/logical/worker.c       |  56 ++
> 
> Maybe the Asserts in apply_handle_sequence() should be elogs.  These
> are checking what is sent over the network, so we don't want a
> bad/evil peer able to trigger asserts.  And in non-assert builds these
> conditions would be unchecked.
>

I'll think about it, but AFAIK we don't really assume evil peers.


>  src/backend/replication/pgoutput/pgoutput.c    |  82 +-
> 
> I find the code in get_rel_sync_entry() confusing.  You add a section for
> 
> if (!publish && is_sequence)
> 
> but then shouldn't the code below that be something like
> 
> if (!publish && !is_sequence)
> 

Hmm, maybe. But I think there's actually a bigger issue - this does not
seem to be dealing with pg_publication_namespace.pnsequences correctly.
That is, we don't differentiate which schemas include tables and which
schemas include sequences. Interestingly, no tests fail. I'll take a
closer look tomorrow.

>  src/bin/pg_dump/t/002_pg_dump.pl               |  38 +-
> 
> This adds a new publication "pub4", but the tests already contain a
> "pub4".  I'm not sure why this even works, but perhaps the new one
> should be "pub5", unless there is a deeper meaning.
> 

I agree, pub5 it is. But it's interesting it does not fail even with the
duplicate name. Strange.

>  src/include/catalog/pg_publication_namespace.h |   3 +-
> 
> I don't like how the distinction between table and sequence is done
> using a bool field.  That affects also the APIs in pg_publication.c
> and publicationcmds.c especially.  There is a lot of unadorned "true"
> and "false" being passed around that isn't very clear, and it all
> appears to originate at this catalog.  I think we could use a char
> field here that uses the relkind constants.  That would also make the
> code in pg_publication.c etc. slightly clearer.
> 

I thought about using relkind, but it does not work all that nicely
because we have multiple relkinds for a table (because of partitioned
tables). So I found that confusing.

Maybe we should just use 'r' for any table, in this catalog?

> 
> See attached patch for more small tweaks.
> 
> Your patch still contains a number of XXX and FIXME comments, which in
> my assessment are all more or less correct, so I didn't comment on those
> separately.
> 

Yeah, I plan to look at those next.

> Other than that, this seems pretty good.
> 
> Earlier in the thread I commented on some aspects of the new grammar
> (e.g., do we need FOR ALL SEQUENCES?).  I think this would be useful to
> review again after all the new logical replication patches are in.  I
> don't want to hold up this patch for that at this point.

I'm not particularly attached to the grammar, but I don't see any reason
not to have mostly the same grammar/options as for tables.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

Attached is a rebased patch, addressing most of the remaining issues.
The main improvements are:


1) pg_publication_namespace.pntype and type checks

Originally, the patch used a pnsequences flag to distinguish which
entries were added by FOR ALL TABLES IN SCHEMA and which by FOR ALL
SEQUENCES IN SCHEMA. I've decided to replace this with a simple char
column, called pntype, where 't' means tables and 's' means sequences.
As explained before, relkind doesn't work well because of partitioned
tables. A char, with a function to map it to relkind values, works
fairly well.

I've revisited the question how to represent publications publishing the
same schema twice - once for tables, once for sequences. There were
proposals to represent this with a single row, i.e. turn pntype into an
array of char values. So it'd be either ['t'], ['s'] or ['s', 't']. I
spent some time working on that, but I've decided to keep the current
approach with two separate rows - it's easier to manage, lookup etc.


2) pg_get_object_address

I've updated the objectaddress code to consider pntype when looking up
rows, so each row in pg_publication_namespace gets the correct
ObjectAddress.


3) for all [tables | sequences]

The original patch did not allow creating publication for all tables and
all sequences at the same time. I've tweaked the grammar to allow this:

   CREATE PUBLICATION p FOR ALL list_of_types;

where list_of_types is an arbitrary combination of TABLES and SEQUENCES.
It's implemented in a slightly awkward way - partially in the grammar,
partially in publicationcmds.c. I suspect there's a cleaner way to do
this entirely in the grammar, but I haven't succeeded yet.
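Per the description above, the tweaked grammar would then accept statements like these (hypothetical session, shown only to illustrate the combinations):

```sql
CREATE PUBLICATION p1 FOR ALL TABLES;
CREATE PUBLICATION p2 FOR ALL SEQUENCES;
CREATE PUBLICATION p3 FOR ALL TABLES, SEQUENCES;
```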


4) prevent 'ADD TABLE sequence' and 'ADD SEQUENCE table'

It was possible to run "ADD TABLE" and pass it a sequence, which would
then fail to notice that the publication already includes all sequences
from the schema. I've added a check preventing that (and a similar one
for ADD SEQUENCE).


5) missing block in AlterTableNamespace to cross-check moving published
sequence to already published schema

A block of code was missing from AlterTableNamespace, checking that
we're not moving a sequence into a schema from which all sequences are
already published.


6) a couple comment fixes

Various comment improvements and fixes. At this point there are only a
couple of trivial FIXME/XXX comments remaining.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 20.03.22 23:55, Tomas Vondra wrote:
> Attached is a rebased patch, addressing most of the remaining issues.

This looks okay to me, if the two FIXMEs are addressed.  Remember to 
also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/21/22 14:05, Peter Eisentraut wrote:
> On 20.03.22 23:55, Tomas Vondra wrote:
>> Attached is a rebased patch, addressing most of the remaining issues.
> 
> This looks okay to me, if the two FIXMEs are addressed.  Remember to
> also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE.

Thanks. Do we want to use a different constant for the sequence message?
I've used 'X' for the WIP patch, but maybe there's a better value?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Peter Eisentraut
Date:
On 21.03.22 22:54, Tomas Vondra wrote:
> On 3/21/22 14:05, Peter Eisentraut wrote:
>> On 20.03.22 23:55, Tomas Vondra wrote:
>>> Attached is a rebased patch, addressing most of the remaining issues.
>>
>> This looks okay to me, if the two FIXMEs are addressed.  Remember to
>> also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE.
> 
> Thanks. Do we want to use a different constant for the sequence message?
> I've used 'X' for the WIP patch, but maybe there's a better value?

I would do small 's'.  Alternatively, 'Q'/'q' is still available, too.



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Attached is a rebased patch, addressing most of the remaining issues.
>

It appears that on the apply side, the patch always creates a new
relfilenode irrespective of whether the sequence message is
transactional or not. Is it required to create a new relfilenode for
non-transactional messages? If not that could be costly?

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Petr Jelinek
Date:
> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Attached is a rebased patch, addressing most of the remaining issues.
>>
>
> It appears that on the apply side, the patch always creates a new
> relfilenode irrespective of whether the sequence message is
> transactional or not. Is it required to create a new relfilenode for
> non-transactional messages? If not that could be costly?
>


That's a good catch, I think we should just write the page in the non-transactional case, no need to mess with
relnodes.


Petr


Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> > On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Hi,
> >>
> >> Attached is a rebased patch, addressing most of the remaining issues.
> >>
> >
> > It appears that on the apply side, the patch always creates a new
> > relfilenode irrespective of whether the sequence message is
> > transactional or not. Is it required to create a new relfilenode for
> > non-transactional messages? If not that could be costly?
> >
>
>
> That's a good catch, I think we should just write the page in the
> non-transactional case, no need to mess with relnodes.
>

What if the current node has also incremented from the existing
sequence? Basically, how will we deal with conflicts? It seems we will
overwrite the actions done on the existing node which means sequence
values can go back.

On looking a bit more closely, I think I see some more fundamental
problems here:

*  Don't we need some syncing mechanism between apply worker and
sequence sync worker so that apply worker skips the sequence changes
till the sync worker is finished, otherwise, there is a risk of one
overriding the values of the other?

* Currently, the patch uses one sync worker per sequence. It seems to
be a waste of resources considering apart from one additional process,
we need origin/slot to sync each sequence.

* Don't we need explicit privilege checking before applying sequence
data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for
tables?


-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Petr Jelinek
Date:

> On 23. 3. 2022, at 12:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
>>
>>> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Attached is a rebased patch, addressing most of the remaining issues.
>>>>
>>>
>>> It appears that on the apply side, the patch always creates a new
>>> relfilenode irrespective of whether the sequence message is
>>> transactional or not. Is it required to create a new relfilenode for
>>> non-transactional messages? If not that could be costly?
>>>
>>
>> That's a good catch, I think we should just write the page in the
>> non-transactional case, no need to mess with relnodes.
>>
>
> What if the current node has also incremented from the existing
> sequence? Basically, how will we deal with conflicts? It seems we will
> overwrite the actions done on the existing node which means sequence
> values can go back.
>

I think this is perfectly acceptable behavior, we are replicating state
from upstream, not reconciling state on downstream.

You can't really use the builtin sequences to implement a distributed
sequence via replication. If the user wants to write to both nodes, they
should not replicate the sequence value and instead offset the sequence
on each node so they produce different ranges; that's quite a common
approach. One day we might want to revisit adding support for custom
sequence AMs.

> * Currently, the patch uses one sync worker per sequence. It seems to
> be a waste of resources considering apart from one additional process,
> we need origin/slot to sync each sequence.
>

This is indeed wasteful but not something that I'd consider a blocker
for the patch personally.

--  
Petr

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/23/22 13:46, Petr Jelinek wrote:
> 
>> On 23. 3. 2022, at 12:50, Amit Kapila <amit.kapila16@gmail.com
>> <mailto:amit.kapila16@gmail.com>> wrote:
>>
>> On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek
>> <petr.jelinek@enterprisedb.com <mailto:petr.jelinek@enterprisedb.com>>
>> wrote:
>>>
>>>> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com
>>>> <mailto:amit.kapila16@gmail.com>> wrote:
>>>>
>>>> On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com
>>>> <mailto:tomas.vondra@enterprisedb.com>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Attached is a rebased patch, addressing most of the remaining issues.
>>>>>
>>>>
>>>> It appears that on the apply side, the patch always creates a new
>>>> relfilenode irrespective of whether the sequence message is
>>>> transactional or not. Is it required to create a new relfilenode for
>>>> non-transactional messages? If not that could be costly?
>>>>
>>>
>>>
>>> That's a good catch, I think we should just write the page in the
>>> non-transactional case, no need to mess with relnodes.
>>>
>>
>> What if the current node has also incremented from the existing
>> sequence? Basically, how will we deal with conflicts? It seems we will
>> overwrite the actions done on the existing node which means sequence
>> values can go back.
>>
> 
> 
> I think this is perfectly acceptable behavior, we are replicating state
> from upstream, not reconciling state on downstream.
> 
> You can't really use the builtin sequences to implement distributed
> sequence via replication. If user wants to write to both nodes they
> should not replicate the sequence value and instead offset the sequence
> on each node so they produce different ranges, that's quite common
> approach. One day we might want revisit adding support for custom
> sequence AMs.
> 

Exactly. Moreover it's about the same behavior as if you update table
data on the subscriber, and then an UPDATE gets replicated and
overwrites the local change.

Attached is a patch fixing the relfilenode issue - now we only allocate
a new relfilenode for the transactional case, and an in-place update
similar to a setval() otherwise. And thanks for noticing this.

> 
>> * Currently, the patch uses one sync worker per sequence. It seems to
>> be a waste of resources considering apart from one additional process,
>> we need origin/slot to sync each sequence.
>>
> 
> 
> This is indeed wasteful but not something that I'd consider blocker for
> the patch personally.
> 

Right, and the same argument can be made for tablesync of tiny tables
(which a sequence essentially is). I'm sure there are ways to improve
this, but that can be done later.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Greg Stark
Date:
Awesome to see this get committed, thanks Tomas.

Is there anything left or shall I update the CF entry to committed?



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
Hi,

Pushed, after going through the patch once more, addressed the remaining
FIXMEs, corrected a couple places in the docs and comments, etc. Minor
tweaks, nothing important.

I've been thinking about the grammar a bit more after pushing, and I
realized that maybe it'd be better to handle the FOR ALL TABLES /
SEQUENCES clause as PublicationObjSpec, not as a separate/special case.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/24/22 22:52, Greg Stark wrote:
> Awesome to see this get committed, thanks Tomas.
> 
> Is there anything left or shall I update the CF entry to committed?

Yeah, let's mark it as committed. I was waiting for some feedback from
the buildfarm - there are some failures, but they seem unrelated.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Pushed.
>

Some of the comments given by me [1] don't seem to be addressed or
responded to. Let me try to say again for the ease of discussion:

* Don't we need some syncing mechanism between apply worker and
sequence sync worker so that apply worker skips the sequence changes
till the sync worker is finished, otherwise, there is a risk of one
overriding the values of the other? See how we take care of this for a
table in should_apply_changes_for_rel() and its callers. If we don't
do this for sequences for some reason then probably a comment
somewhere is required.

* Don't we need explicit privilege checking before applying sequence
data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for
tables?

Few new comments:
=================
1. A simple test like the below crashes for me:
postgres=# create sequence s1;
CREATE SEQUENCE
postgres=# create sequence s2;
CREATE SEQUENCE
postgres=# create publication pub1 for sequence s1, s2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

2. In apply_handle_sequence() do we need AccessExclusiveLock for
non-transactional case?

3. In apply_handle_sequence(), I think for transactional case, we need
to skip the operation, if the skip lsn is set. See how we skip in
apply_handle_insert() and similar functions.

[1] - https://www.postgresql.org/message-id/CAA4eK1Jn-DttQ%3D4Pdh9YCe1w%2BzGbgC%2B0uR0sfg2RtkjiAPmB9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Masahiko Sawada
Date:
On Fri, Mar 25, 2022 at 6:59 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Pushed, after going through the patch once more, addressed the remaining
> FIXMEs, corrected a couple places in the docs and comments, etc. Minor
> tweaks, nothing important.
>

The commit updates tab-completion for ALTER PUBLICATION but seems not
to update for CREATE PUBLICATION. I've attached a patch for that.

Also, the commit adds a new pgoutput option "sequences":

+                else if (strcmp(defel->defname, "sequences") == 0)
+                {
+                        if (sequences_option_given)
+                                ereport(ERROR,
+                                                (errcode(ERRCODE_SYNTAX_ERROR),
+                                                 errmsg("conflicting or redundant options")));
+                        sequences_option_given = true;
+
+                        data->sequences = defGetBoolean(defel);
+                }

But as far as I read the changes, there is no use of this option, and
this code is not tested. Can we remove it, or is it for upcoming changes?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 3/25/22 05:01, Amit Kapila wrote:
> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Pushed.
>>
> 
> Some of the comments given by me [1] don't seem to be addressed or
> responded to. Let me try to say again for the ease of discussion:
> 

D'oh! I got distracted by Petr's response to that message, and missed
this part ...

> * Don't we need some syncing mechanism between apply worker and
> sequence sync worker so that apply worker skips the sequence changes
> till the sync worker is finished, otherwise, there is a risk of one
> overriding the values of the other? See how we take care of this for a
> table in should_apply_changes_for_rel() and its callers. If we don't
> do this for sequences for some reason then probably a comment
> somewhere is required.
> 

How would that happen? If we're effectively setting the sequence as a
side effect of inserting the data, then why should we even replicate the
sequence? We'll have the problem later too, no?

> * Don't we need explicit privilege checking before applying sequence
> data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for
> tables?
> 

So essentially something like TargetPrivilegesCheck in the worker? I
think you're probably right we need something like that.

> Few new comments:
> =================
> 1. A simple test like the below crashes for me:
> postgres=# create sequence s1;
> CREATE SEQUENCE
> postgres=# create sequence s2;
> CREATE SEQUENCE
> postgres=# create publication pub1 for sequence s1, s2;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 

Yeah, preprocess_pubobj_list seems to be a few bricks shy. I have a fix,
will push shortly.

> 2. In apply_handle_sequence() do we need AccessExclusiveLock for
> non-transactional case?
> 

Good catch. This lock was inherited from ResetSequence, but now that the
transactional case works differently, we probably don't need it.

> 3. In apply_handle_sequence(), I think for transactional case, we need
> to skip the operation, if the skip lsn is set. See how we skip in
> apply_handle_insert() and similar functions.
> 

Right.


Thanks for these reports!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/25/22 08:00, Masahiko Sawada wrote:
> On Fri, Mar 25, 2022 at 6:59 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Pushed, after going through the patch once more, addressed the remaining
>> FIXMEs, corrected a couple places in the docs and comments, etc. Minor
>> tweaks, nothing important.
>>
> 
> The commit updates tab-completion for ALTER PUBLICATION but seems not
> to update for CREATE PUBLICATION. I've attached a patch for that.
> 

Thanks. I'm pretty sure the patch did that, but it likely got lost in
one of the rebases due to a conflict. Too bad we don't have tests for
tab-complete. Will fix.

> Also, the commit add a new pgoutput option "sequences":
> 
> +                else if (strcmp(defel->defname, "sequences") == 0)
> +                {
> +                        if (sequences_option_given)
> +                                ereport(ERROR,
> +                                                (errcode(ERRCODE_SYNTAX_ERROR),
> +                                                 errmsg("conflicting or redundant options")));
> +                        sequences_option_given = true;
> +
> +                        data->sequences = defGetBoolean(defel);
> +                }
> 
> But as far as I read the changes, there is no use of this option, and this
> code is not tested. Can we remove it or is it for upcoming changes?
> 

pgoutput_sequence uses this

    if (!data->sequences)
        return;

This was inspired by what we do for logical messages, but maybe there's
an argument we don't need this, considering we have "sequence" action
and that a sequence has to be added to the publication. I don't think
there's any future patch relying on this (and it could add it back, if
needed).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
> On 3/25/22 05:01, Amit Kapila wrote:
> > On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Pushed.
> >>
> >
> > Some of the comments given by me [1] don't seem to be addressed or
> > responded to. Let me try to say again for the ease of discussion:
> >
>
> D'oh! I got distracted by Petr's response to that message, and missed
> this part ...
>
> > * Don't we need some syncing mechanism between apply worker and
> > sequence sync worker so that apply worker skips the sequence changes
> > till the sync worker is finished, otherwise, there is a risk of one
> > overriding the values of the other? See how we take care of this for a
> > table in should_apply_changes_for_rel() and its callers. If we don't
> > do this for sequences for some reason then probably a comment
> > somewhere is required.
> >
>
> How would that happen? If we're effectively setting the sequence as a
> side effect of inserting the data, then why should we even replicate the
> sequence?
>

I was talking just about sequence values here, considering a sequence
that is replicated based only on nextval. I think the problem is that
the apply worker might override what the copy has done if the apply
worker is behind the sequence sync worker, as both can run in parallel.
Let me try a theoretical example to explain this:

Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN
12000, the value of s1 becomes 20. Now, say copy decides to copy the
sequence value till LSN 12000 which means it will make the value as 20
on the subscriber, now, in parallel, apply worker can process LSN
10000 and make it again 10. Apply worker might end up redoing all
sequence operations along with some transactional ones where we
recreate the file. I am not sure what exact problem it can lead to but
I think we don't need to redo the work.

> We'll have the problem later too, no?
>
> > * Don't we need explicit privilege checking before applying sequence
> > data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for
> > tables?
> >
>
> So essentially something like TargetPrivilegesCheck in the worker?
>

Right.

Few more comments:
==================
1.
@@ -636,7 +704,7 @@ CreatePublication(ParseState *pstate,
CreatePublicationStmt *stmt)
                       get_database_name(MyDatabaseId));

    /* FOR ALL TABLES requires superuser */
-   if (stmt->for_all_tables && !superuser())
+   if (for_all_tables && !superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser to create FOR ALL TABLES
publication")));

Don't we need a similar check for 'for_all_schema' publications?

2.
+<varlistentry>
+<term>
+        Int8
+</term>
+<listitem>
+<para>
+                1 if the sequence update is transactions, 0 otherwise.

Shall we say transactional instead of transactions?

3.
+/*
+ * Determine object type given the object type set for a schema.
+ */
+char
+pub_get_object_type_for_relkind(char relkind)

Shouldn't it be 'relation' instead of 'schema' at the end of the sentence?

4.
@@ -1739,13 +1804,13 @@ get_rel_sync_entry(PGOutputData *data,
Relation relation)
    {
        Oid         schemaId = get_rel_namespace(relid);
        List       *pubids = GetRelationPublications(relid);
-
+       char        objectType =
pub_get_object_type_for_relkind(get_rel_relkind(relid));

A few lines after this we are again getting relkind which is not a big
deal but OTOH there doesn't seem to be a need to fetch the same thing
twice from the cache.

5.
+
+   /* Check that user is allowed to manipulate the publication tables. */
+   if (sequences && pubform->puballsequences)

/tables/sequences

6.
+apply_handle_sequence(StringInfo s)
{
...
+
+   relid = RangeVarGetRelid(makeRangeVar(seq.nspname,
+                                         seq.seqname, -1),
+                            RowExclusiveLock, false);
...
}

As here, we are using missing_ok, if the sequence doesn't exist, it
will give a message like: "ERROR:  relation "public.s1" does not
exist" whereas for tables we give a slightly more clear message like:
"ERROR:  logical replication target relation "public.t1" does not
exist". This is handled via logicalrep_rel_open().

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 3/25/22 12:21, Amit Kapila wrote:
> On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>> On 3/25/22 05:01, Amit Kapila wrote:
>>> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Pushed.
>>>>
>>>
>>> Some of the comments given by me [1] don't seem to be addressed or
>>> responded to. Let me try to say again for the ease of discussion:
>>>
>>
>> D'oh! I got distracted by Petr's response to that message, and missed
>> this part ...
>>
>>> * Don't we need some syncing mechanism between apply worker and
>>> sequence sync worker so that apply worker skips the sequence changes
>>> till the sync worker is finished, otherwise, there is a risk of one
>>> overriding the values of the other? See how we take care of this for a
>>> table in should_apply_changes_for_rel() and its callers. If we don't
>>> do this for sequences for some reason then probably a comment
>>> somewhere is required.
>>>
>>
>> How would that happen? If we're effectively setting the sequence as a
>> side effect of inserting the data, then why should we even replicate the
>> sequence?
>>
> 
> I was talking just about sequence values here, considering that some
> sequence is just replicating based on nextval. I think the problem is
> that apply worker might override what copy has done if an apply worker
> is behind the sequence sync worker as both can run in parallel. Let me
> try to take some theoretical example to explain this:
> 
> Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN
> 12000, the value of s1 becomes 20. Now, say copy decides to copy the
> sequence value till LSN 12000 which means it will make the value as 20
> on the subscriber, now, in parallel, apply worker can process LSN
> 10000 and make it again 10. Apply worker might end up redoing all
> sequence operations along with some transactional ones where we
> recreate the file. I am not sure what exact problem it can lead to but
> I think we don't need to redo the work.
> 
>  We'll have the problem later too, no?
>

Ah, I was confused why this would be an issue for sequences and not for
plain tables, but now I realize should_apply_changes_for_rel() is not
called in apply_handle_sequence(). Yes, that's probably a thinko.


>>> * Don't we need explicit privilege checking before applying sequence
>>> data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for
>>> tables?
>>>
>>
>> So essentially something like TargetPrivilegesCheck in the worker?
>>
> 
> Right.
> 

OK, will do.

> Few more comments:
> ==================
> 1.
> @@ -636,7 +704,7 @@ CreatePublication(ParseState *pstate,
> CreatePublicationStmt *stmt)
>                        get_database_name(MyDatabaseId));
> 
>     /* FOR ALL TABLES requires superuser */
> -   if (stmt->for_all_tables && !superuser())
> +   if (for_all_tables && !superuser())
>         ereport(ERROR,
>                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>                  errmsg("must be superuser to create FOR ALL TABLES
> publication")));
> 
> Don't we need a similar check for 'for_all_schema' publications?
> 

I think you mean "for_all_sequences", right?

> 2.
> +<varlistentry>
> +<term>
> +        Int8
> +</term>
> +<listitem>
> +<para>
> +                1 if the sequence update is transactions, 0 otherwise.
> 
> Shall we say transactional instead of transactions?
> 
> 3.
> +/*
> + * Determine object type given the object type set for a schema.
> + */
> +char
> +pub_get_object_type_for_relkind(char relkind)
> 
> Shouldn't it be 'relation' instead of 'schema' at the end of the sentence?
> 
> 4.
> @@ -1739,13 +1804,13 @@ get_rel_sync_entry(PGOutputData *data,
> Relation relation)
>     {
>         Oid         schemaId = get_rel_namespace(relid);
>         List       *pubids = GetRelationPublications(relid);
> -
> +       char        objectType =
> pub_get_object_type_for_relkind(get_rel_relkind(relid));
> 
> A few lines after this we are again getting relkind which is not a big
> deal but OTOH there doesn't seem to be a need to fetch the same thing
> twice from the cache.
> 
> 5.
> +
> +   /* Check that user is allowed to manipulate the publication tables. */
> +   if (sequences && pubform->puballsequences)
> 
> /tables/sequences
> 
> 6.
> +apply_handle_sequence(StringInfo s)
> {
> ...
> +
> +   relid = RangeVarGetRelid(makeRangeVar(seq.nspname,
> +                                         seq.seqname, -1),
> +                            RowExclusiveLock, false);
> ...
> }
> 
> As here, we are using missing_ok, if the sequence doesn't exist, it
> will give a message like: "ERROR:  relation "public.s1" does not
> exist" whereas for tables we give a slightly more clear message like:
> "ERROR:  logical replication target relation "public.t1" does not
> exist". This is handled via logicalrep_rel_open().
> 

Thanks, I'll look at rewording these comments and messages.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
vignesh C
Date:
On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Pushed, after going through the patch once more, addressed the remaining
> FIXMEs, corrected a couple places in the docs and comments, etc. Minor
> tweaks, nothing important.

While rebasing patch [1] I found a couple of comments:
static void
 ObjectsInPublicationToOids(List *pubobjspec_list, ParseState *pstate,
-    List **rels, List **schemas)
+    List **tables, List **sequences,
+    List **tables_schemas, List **sequences_schemas,
+    List **schemas)
 {
  ListCell   *cell;
  PublicationObjSpec *pubobj;
@@ -185,12 +194,23 @@ ObjectsInPublicationToOids(List
*pubobjspec_list, ParseState *pstate,
  switch (pubobj->pubobjtype)
  {
  case PUBLICATIONOBJ_TABLE:
- *rels = lappend(*rels, pubobj->pubtable);
+ *tables = lappend(*tables, pubobj->pubtable);
+ break;
+ case PUBLICATIONOBJ_SEQUENCE:
+ *sequences = lappend(*sequences, pubobj->pubtable);
  break;
  case PUBLICATIONOBJ_TABLES_IN_SCHEMA:
  schemaid = get_namespace_oid(pubobj->name, false);

  /* Filter out duplicates if user specifies "sch1, sch1" */
+ *tables_schemas = list_append_unique_oid(*tables_schemas, schemaid);
+ *schemas = list_append_unique_oid(*schemas, schemaid);
+ break;

Now that tables_schemas and sequences_schemas are being updated and used
in ObjectsInPublicationToOids, the schemas parameter is no longer used
after processing in ObjectsInPublicationToOids, so I felt we can remove
that parameter.

  /* ALTER PUBLICATION <name> ADD */
  else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD"))
- COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
+ COMPLETE_WITH("ALL TABLES IN SCHEMA", "ALL SEQUENCES IN SCHEMA",
"TABLE", "SEQUENCE");

Tab completion of alter publication for ADD and DROP is the same, we
could combine it.

Attached a patch for the same.
Thoughts?

[1] - https://www.postgresql.org/message-id/CALDaNm3%3DJrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh%3DtamA%40mail.gmail.com

Regards,
Vignesh

Attachments

Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
On 3/25/22 12:59, Tomas Vondra wrote:
> 
> On 3/25/22 12:21, Amit Kapila wrote:
>> On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>>
>>> On 3/25/22 05:01, Amit Kapila wrote:
>>>> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> Pushed.
>>>>>
>>>>
>>>> Some of the comments given by me [1] don't seem to be addressed or
>>>> responded to. Let me try to say again for the ease of discussion:
>>>>
>>>
>>> D'oh! I got distracted by Petr's response to that message, and missed
>>> this part ...
>>>
>>>> * Don't we need some syncing mechanism between apply worker and
>>>> sequence sync worker so that apply worker skips the sequence changes
>>>> till the sync worker is finished, otherwise, there is a risk of one
>>>> overriding the values of the other? See how we take care of this for a
>>>> table in should_apply_changes_for_rel() and its callers. If we don't
>>>> do this for sequences for some reason then probably a comment
>>>> somewhere is required.
>>>>
>>>
>>> How would that happen? If we're effectively setting the sequence as a
>>> side effect of inserting the data, then why should we even replicate the
>>> sequence?
>>>
>>
>> I was talking just about sequence values here, considering that some
>> sequence is just replicating based on nextval. I think the problem is
>> that apply worker might override what copy has done if an apply worker
>> is behind the sequence sync worker as both can run in parallel. Let me
>> try to take some theoretical example to explain this:
>>
>> Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN
>> 12000, the value of s1 becomes 20. Now, say copy decides to copy the
>> sequence value till LSN 12000 which means it will make the value as 20
>> on the subscriber, now, in parallel, apply worker can process LSN
>> 10000 and make it again 10. Apply worker might end up redoing all
>> sequence operations along with some transactional ones where we
>> recreate the file. I am not sure what exact problem it can lead to but
>> I think we don't need to redo the work.
>>
>>  We'll have the problem later too, no?
>>
> 
> Ah, I was confused why this would be an issue for sequences and not for
> plain tables, but now I realize should_apply_changes_for_rel() is not
> called in apply_handle_sequence. Yes, that's probably a thinko.
> 

Hmm, so fixing this might be a bit trickier than I expected.

Firstly, currently we only send nspname/relname in the sequence message,
not the remote OID or schema. The idea was that for sequences we don't
really need schema info, so this seemed OK.

But should_apply_changes_for_rel() needs a LogicalRepRelMapEntry, and to
create/maintain those records we need to send the schema.

Attached is a WIP patch that does that.

Two places need more work, I think:

1) maybe_send_schema needs ReorderBufferChange, but we don't have that
for sequences, we only have TXN. I created a simple wrapper, but maybe
we should just tweak maybe_send_schema to use TXN.

2) The transaction handling is a bit confusing. The non-transactional
increments won't have any explicit commit later, so we can't just rely
on begin_replication_step/end_replication_step. But I want to try
spending a bit more time on this.


But there's a more serious issue, I think. So far, we allowed this:

  BEGIN;
  CREATE SEQUENCE s2;
  ALTER PUBLICATION p ADD SEQUENCE s2;
  INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100);
  COMMIT;

and the behavior was that we replicated the changes. But with the patch
applied, that no longer happens, because should_apply_changes_for_rel
says the change should not be applied.

And after thinking about this, I think that's correct - we can't apply
changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed,
and we can't do that until the transaction commits.

So I guess that's correct, and the current behavior is a bug.

For a while I was thinking that maybe this means we don't need the
transactional behavior at all, but I think we do - we have to handle
ALTER SEQUENCE cases that are transactional.

Does that make sense, Amit?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
On 3/25/22 15:34, vignesh C wrote:
> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> Pushed, after going through the patch once more, addressed the remaining
>> FIXMEs, corrected a couple places in the docs and comments, etc. Minor
>> tweaks, nothing important.
> 
> While rebasing patch [1] I found a couple of comments:
> static void
>  ObjectsInPublicationToOids(List *pubobjspec_list, ParseState *pstate,
> -    List **rels, List **schemas)
> +    List **tables, List **sequences,
> +    List **tables_schemas, List **sequences_schemas,
> +    List **schemas)
>  {
>   ListCell   *cell;
>   PublicationObjSpec *pubobj;
> @@ -185,12 +194,23 @@ ObjectsInPublicationToOids(List
> *pubobjspec_list, ParseState *pstate,
>   switch (pubobj->pubobjtype)
>   {
>   case PUBLICATIONOBJ_TABLE:
> - *rels = lappend(*rels, pubobj->pubtable);
> + *tables = lappend(*tables, pubobj->pubtable);
> + break;
> + case PUBLICATIONOBJ_SEQUENCE:
> + *sequences = lappend(*sequences, pubobj->pubtable);
>   break;
>   case PUBLICATIONOBJ_TABLES_IN_SCHEMA:
>   schemaid = get_namespace_oid(pubobj->name, false);
> 
>   /* Filter out duplicates if user specifies "sch1, sch1" */
> + *tables_schemas = list_append_unique_oid(*tables_schemas, schemaid);
> + *schemas = list_append_unique_oid(*schemas, schemaid);
> + break;
> 
> Now that tables_schemas and sequences_schemas are being updated and used
> in ObjectsInPublicationToOids, the schemas parameter is no longer used
> after processing in ObjectsInPublicationToOids, so I felt we can remove
> that parameter.
> 

Thanks! That's a nice simplification, I'll get that pushed in a couple
minutes.

>   /* ALTER PUBLICATION <name> ADD */
>   else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD"))
> - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
> + COMPLETE_WITH("ALL TABLES IN SCHEMA", "ALL SEQUENCES IN SCHEMA",
> "TABLE", "SEQUENCE");
> 
> Tab completion of alter publication for ADD and DROP is the same, we
> could combine it.
> 

We could, but I find these combined rules harder to read, so I'll keep
the current tab-completion.

> Attached a patch for the same.
> Thoughts?

Thanks for taking a look! Appreciated.

> 
> [1] - https://www.postgresql.org/message-id/CALDaNm3%3DJrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh%3DtamA%40mail.gmail.com
> 

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:
Hi,

I've fixed most of the reported issues (or at least I think so), with
the exception of those in apply_handle_sequence function, i.e.:

1) properly coordinating with the tablesync worker

2) considering skip_lsn, skipping changes

3) missing privilege check, similar to TargetPrivilegesCheck

4) nicer error message if the sequence does not exist


The apply_handle_sequence stuff seems to be inter-related, so I plan to
deal with that in a single separate commit - the main part being the
tablesync coordination, per the fix I shared earlier today. But I need
time to think about that, I don't want to rush that.


Thanks for the feedback!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hmm, so fixing this might be a bit trickier than I expected.
>
> Firstly, currently we only send nspname/relname in the sequence message,
> not the remote OID or schema. The idea was that for sequences we don't
> really need schema info, so this seemed OK.
>
> But should_apply_changes_for_rel() needs a LogicalRepRelMapEntry, and to
> create/maintain those records we need to send the schema.
>
> Attached is a WIP patch that does that.
>
> Two places need more work, I think:
>
> 1) maybe_send_schema needs ReorderBufferChange, but we don't have that
> for sequences, we only have TXN. I created a simple wrapper, but maybe
> we should just tweak maybe_send_schema to use TXN.
>
> 2) The transaction handling is a bit confusing. The non-transactional
> increments won't have any explicit commit later, so we can't just rely
> on begin_replication_step/end_replication_step. But I want to try
> spending a bit more time on this.
>

I didn't understand what you want to say in point (2).

>
> But there's a more serious issue, I think. So far, we allowed this:
>
>   BEGIN;
>   CREATE SEQUENCE s2;
>   ALTER PUBLICATION p ADD SEQUENCE s2;
>   INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100);
>   COMMIT;
>
> and the behavior was that we replicated the changes. But with the patch
> applied, that no longer happens, because should_apply_changes_for_rel
> says the change should not be applied.
>
> And after thinking about this, I think that's correct - we can't apply
> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed,
> and we can't do that until the transaction commits.
>
> So I guess that's correct, and the current behavior is a bug.
>

Yes, I also think that is a bug.

> For a while I was thinking that maybe this means we don't need the
> transactional behavior at all, but I think we do - we have to handle
> ALTER SEQUENCE cases that are transactional.
>

I need some time to think about this. At all places, it is mentioned
as creating a sequence for transactional cases which at the very least
need some tweak.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 3/26/22 08:28, Amit Kapila wrote:
> On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hmm, so fixing this might be a bit trickier than I expected.
>>
>> Firstly, currently we only send nspname/relname in the sequence message,
>> not the remote OID or schema. The idea was that for sequences we don't
>> really need schema info, so this seemed OK.
>>
>> But should_apply_changes_for_rel() needs a LogicalRepRelMapEntry, and to
>> create/maintain those records we need to send the schema.
>>
>> Attached is a WIP patch that does that.
>>
>> Two places need more work, I think:
>>
>> 1) maybe_send_schema needs ReorderBufferChange, but we don't have that
>> for sequences, we only have TXN. I created a simple wrapper, but maybe
>> we should just tweak maybe_send_schema to use TXN.
>>
>> 2) The transaction handling is a bit confusing. The non-transactional
>> increments won't have any explicit commit later, so we can't just rely
>> on begin_replication_step/end_replication_step. But I want to try
>> spending a bit more time on this.
>>
> 
> I didn't understand what you want to say in point (2).
> 

My point is that apply_handle_sequence() either needs to use the same
transaction handling as other apply methods, or start (and commit) a
separate transaction for the "transactional" case.

Which means we can't use the begin_replication_step/end_replication_step
and the current code seems a bit complex. And I'm not sure it's quite
correct. So this place needs more work.

>>
>> But there's a more serious issue, I think. So far, we allowed this:
>>
>>   BEGIN;
>>   CREATE SEQUENCE s2;
>>   ALTER PUBLICATION p ADD SEQUENCE s2;
>>   INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100);
>>   COMMIT;
>>
>> and the behavior was that we replicated the changes. But with the patch
>> applied, that no longer happens, because should_apply_changes_for_rel
>> says the change should not be applied.
>>
>> And after thinking about this, I think that's correct - we can't apply
>> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed,
>> and we can't do that until the transaction commits.
>>
>> So I guess that's correct, and the current behavior is a bug.
>>
> 
> Yes, I also think that is a bug.
> 

OK

>> For a while I was thinking that maybe this means we don't need the
>> transactional behavior at all, but I think we do - we have to handle
>> ALTER SEQUENCE cases that are transactional.
>>
> 
> I need some time to think about this.

Understood.

> At all places, it is mentioned
> as creating a sequence for transactional cases which at the very least
> need some tweak.
> 

Which places?



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Sat, Mar 26, 2022 at 3:26 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/26/22 08:28, Amit Kapila wrote:
> >>
> >> 2) The transaction handling is a bit confusing. The non-transactional
> >> increments won't have any explicit commit later, so we can't just rely
> >> on begin_replication_step/end_replication_step. But I want to try
> >> spending a bit more time on this.
> >>
> >
> > I didn't understand what you want to say in point (2).
> >
>
> My point is that apply_handle_sequence() either needs to use the same
> transaction handling as other apply methods, or start (and commit) a
> separate transaction for the "transactional" case.
>
> Which means we can't use the begin_replication_step/end_replication_step
>

We already call CommitTransactionCommand after end_replication_step in a
few places in that file, so, as there is no explicit commit in the
non-transactional case, we can probably call CommitTransactionCommand
for it as well.

> and the current code seems a bit complex. And I'm not sure it's quite
> correct. So this place needs more work.
>

Agreed.

>
> >> For a while I was thinking that maybe this means we don't need the
> >> transactional behavior at all, but I think we do - we have to handle
> >> ALTER SEQUENCE cases that are transactional.
> >>
> >
> > I need some time to think about this.
>

While thinking about this, I think I see a problem with the
non-transactional handling of sequences. It seems that we will skip
sending non-transactional sequence change if it occurs before the
decoding has reached a consistent point but the surrounding commit
occurs after a consistent point is reached. In such cases, the
corresponding DMLs like inserts will be sent but sequence changes
won't be sent. For example (this scenario is based on
twophase_snapshot.spec),

Initial setup:
==============
Create table t1_seq(c1 int);
Create Sequence seq1;

Test Execution via multiple sessions (this test allows insert in
session-2 to happen before we reach a consistent point and commit
happens after a consistent point):
=======================================================================================================

Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);

Session-3:
Commit;

Session-2:
Commit 'foo'

Session-1:
SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1');

                     data
----------------------------------------------
 BEGIN
 table public.t1_seq: INSERT: c1[integer]:1
 table public.t1_seq: INSERT: c1[integer]:2
 table public.t1_seq: INSERT: c1[integer]:3
 table public.t1_seq: INSERT: c1[integer]:4
 table public.t1_seq: INSERT: c1[integer]:5
 table public.t1_seq: INSERT: c1[integer]:6


Now, if we normally try to decode such an insert, the result would be
something like:
                                     data
------------------------------------------------------------------------------
 sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
 sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
 sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
 sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
 BEGIN
 table public.t1_seq: INSERT: c1[integer]:1
 table public.t1_seq: INSERT: c1[integer]:2
 table public.t1_seq: INSERT: c1[integer]:3
 table public.t1_seq: INSERT: c1[integer]:4
 table public.t1_seq: INSERT: c1[integer]:5
 table public.t1_seq: INSERT: c1[integer]:6

This will create an inconsistent replica as sequence changes won't be
replicated. I thought about changing the snapshot handling of
non-transactional sequence changes to be similar to the transactional
ones, but that also won't work because it is only at commit that we
decide whether we can send the changes.

For the transactional case, as we are considering the create sequence
operation as transactional, we would unnecessarily queue them even
though that is not required. Basically, they don't need to be
considered transactional and we can simply ignore such messages like
other DDLs. But for that probably we need to distinguish Alter/Create
case which may or may not be straightforward. Now, queuing them is
probably harmless unless it causes the transaction to spill/stream.

I still couldn't think completely about cases where a mix of
transactional and non-transactional changes occur in the same
transaction as I think it somewhat depends on what we want to do about
the above cases.

> > At all places, it is mentioned
> > as creating a sequence for transactional cases which at the very least
> > need some tweak.
> >
>
> Which places?
>

In comments like:
a. When decoding sequences, we differentiate between sequences created
in a (running) transaction and sequences created in other (already
committed) transactions.
b. ... But for new sequences, we need to handle them in a transactional way, ..
c. ... Change needs to be handled as transactional, because the
sequence was created in a transaction that is still running ...

It seems all these places indicate a scenario of creating a sequence
whereas we want to do transactional stuff mainly for Alter.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Masahiko Sawada
Date:
On Sat, Mar 26, 2022 at 6:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/26/22 08:28, Amit Kapila wrote:
> > On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Hmm, so fixing this might be a bit trickier than I expected.
> >>
> >> Firstly, currently we only send nspname/relname in the sequence message,
> >> not the remote OID or schema. The idea was that for sequences we don't
> >> really need schema info, so this seemed OK.
> >>
> >> But should_apply_changes_for_rel() needs a LogicalRepRelMapEntry, and to
> >> create/maintain those records we need to send the schema.
> >>
> >> Attached is a WIP patch that does that.
> >>
> >> Two places need more work, I think:
> >>
> >> 1) maybe_send_schema needs ReorderBufferChange, but we don't have that
> >> for sequences, we only have TXN. I created a simple wrapper, but maybe
> >> we should just tweak maybe_send_schema to use TXN.
> >>
> >> 2) The transaction handling is a bit confusing. The non-transactional
> >> increments won't have any explicit commit later, so we can't just rely
> >> on begin_replication_step/end_replication_step. But I want to try
> >> spending a bit more time on this.
> >>
> >
> > I didn't understand what you want to say in point (2).
> >
>
> My point is that apply_handle_sequence() either needs to use the same
> transaction handling as other apply methods, or start (and commit) a
> separate transaction for the "transactional" case.
>
> Which means we can't use the begin_replication_step/end_replication_step
> and the current code seems a bit complex. And I'm not sure it's quite
> correct. So this place needs more work.
>
> >>
> >> But there's a more serious issue, I think. So far, we allowed this:
> >>
> >>   BEGIN;
> >>   CREATE SEQUENCE s2;
> >>   ALTER PUBLICATION p ADD SEQUENCE s2;
> >>   INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100);
> >>   COMMIT;
> >>
> >> and the behavior was that we replicated the changes. But with the patch
> >> applied, that no longer happens, because should_apply_changes_for_rel
> >> says the change should not be applied.
> >>
> >> And after thinking about this, I think that's correct - we can't apply
> >> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed,
> >> and we can't do that until the transaction commits.
> >>
> >> So I guess that's correct, and the current behavior is a bug.
> >>
> >
> > Yes, I also think that is a bug.
> >
>
> OK

I also think that this is a bug. Given this behavior is a bug and
newly-added sequence data should be replicated only after ALTER
SUBSCRIPTION ... REFRESH PUBLICATION, is there any case where the
sequence message applied on the subscriber is transactional?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: logical decoding and replication of sequences

From:
Amit Kapila
Date:
On Fri, Apr 1, 2022 at 10:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Mar 26, 2022 at 6:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > >>
> > >> But there's a more serious issue, I think. So far, we allowed this:
> > >>
> > >>   BEGIN;
> > >>   CREATE SEQUENCE s2;
> > >>   ALTER PUBLICATION p ADD SEQUENCE s2;
> > >>   INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100);
> > >>   COMMIT;
> > >>
> > >> and the behavior was that we replicated the changes. But with the patch
> > >> applied, that no longer happens, because should_apply_changes_for_rel
> > >> says the change should not be applied.
> > >>
> > >> And after thinking about this, I think that's correct - we can't apply
> > >> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed,
> > >> and we can't do that until the transaction commits.
> > >>
> > >> So I guess that's correct, and the current behavior is a bug.
> > >>
> > >
> > > Yes, I also think that is a bug.
> > >
> >
> > OK
>
> I also think that this is a bug. Given this behavior is a bug and
> newly-added sequence data should be replicated only after ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION, is there any case where the
> sequence message applied on the subscriber is transactional?
>

It could be required for Alter Sequence as that can also rewrite the
relfilenode. However, IIUC, I think there is a bigger problem with the
non-transactional sequence implementation as that can cause an
inconsistent replica. See the problem description and test case in my
previous email [1] (While thinking about this, I think I see a problem
with the non-transactional handling of sequences....). Can you please
once check that and let me know if I am missing something there? If
not, then I think we may need to first think of a solution for
non-transactional sequence handling.

[1] - https://www.postgresql.org/message-id/CAA4eK1KAFdQEULk%2B4C%3DieWA5UPSUtf_gtqKsFj9J9f2c%3D8hm4g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From:
Tomas Vondra
Date:

On 3/28/22 07:29, Amit Kapila wrote:
> ...
>
> While thinking about this, I think I see a problem with the
> non-transactional handling of sequences. It seems that we will skip
> sending non-transactional sequence change if it occurs before the
> decoding has reached a consistent point but the surrounding commit
> occurs after a consistent point is reached. In such cases, the
> corresponding DMLs like inserts will be sent but sequence changes
> won't be sent. For example (this scenario is based on
> twophase_snapshot.spec),
> 
> Initial setup:
> ==============
> Create table t1_seq(c1 int);
> Create Sequence seq1;
> 
> Test Execution via multiple sessions (this test allows insert in
> session-2 to happen before we reach a consistent point and commit
> happens after a consistent point):
> =======================================================================================================
> 
> Session-2:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-1:
> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
> 'test_decoding', false, true);
> 
> Session-3:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-2:
> Commit;
> Begin;
> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
> 
> Session-3:
> Commit;
> 
> Session-2:
> Commit 'foo'
> 
> Session-1:
> SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
> 'include-xids', 'false', 'skip-empty-xacts', '1');
> 
>                      data
> ----------------------------------------------
>  BEGIN
>  table public.t1_seq: INSERT: c1[integer]:1
>  table public.t1_seq: INSERT: c1[integer]:2
>  table public.t1_seq: INSERT: c1[integer]:3
>  table public.t1_seq: INSERT: c1[integer]:4
>  table public.t1_seq: INSERT: c1[integer]:5
>  table public.t1_seq: INSERT: c1[integer]:6
> 
> 
> Now, if we normally try to decode such an insert, the result would be
> something like:
>                                      data
> ------------------------------------------------------------------------------
>  sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>  sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>  BEGIN
>  table public.t1_seq: INSERT: c1[integer]:1
>  table public.t1_seq: INSERT: c1[integer]:2
>  table public.t1_seq: INSERT: c1[integer]:3
>  table public.t1_seq: INSERT: c1[integer]:4
>  table public.t1_seq: INSERT: c1[integer]:5
>  table public.t1_seq: INSERT: c1[integer]:6
> 
> This will create an inconsistent replica as sequence changes won't be
> replicated.

Hmm, that's interesting. I wonder if it can actually happen, though.
Have you been able to reproduce that, somehow?

> I thought about changing the snapshot handling of
> non-transactional sequence changes to be similar to the transactional
> ones, but that also won't work because it is only at commit that we
> decide whether we can send the changes.
> 
I wonder if there's some earlier LSN (similar to the consistent point)
which might be useful for this.

Or maybe we should queue even the non-transactional changes, not
per-transaction but in a global list, and then at each commit either
discard or inspect them (at that point we know the lowest LSN for all
transactions and the consistent point). Seems complex, though.

> For the transactional case, as we are considering the create sequence
> operation as transactional, we would unnecessarily queue them even
> though that is not required. Basically, they don't need to be
> considered transactional and we can simply ignore such messages like
> other DDLs. But for that probably we need to distinguish Alter/Create
> case which may or may not be straightforward. Now, queuing them is
> probably harmless unless it causes the transaction to spill/stream.
> 

I'm not sure I follow. Why would we queue them unnecessarily?

Also, there's the bug with decoding changes in transactions that create
the sequence and add it to a publication. I think the agreement was that
this behavior was incorrect, we should not decode changes until the
subscription is refreshed. Doesn't that mean there can't be any CREATE
case, just ALTER?

> I still couldn't think completely about cases where a mix of
> transactional and non-transactional changes occur in the same
> transaction as I think it somewhat depends on what we want to do about
> the above cases.
> 

Understood. I need to think about this too.

>>> At all places, it is mentioned
>>> as creating a sequence for transactional cases which at the very least
>>> need some tweak.
>>>
>>
>> Which places?
>>
> 
> In comments like:
> a. When decoding sequences, we differentiate between sequences created
> in a (running) transaction and sequences created in other (already
> committed) transactions.
> b. ... But for new sequences, we need to handle them in a transactional way, ..
> c. ... Change needs to be handled as transactional, because the
> sequence was created in a transaction that is still running ...
> 
> It seems all these places indicate a scenario of creating a sequence
> whereas we want to do transactional stuff mainly for Alter.
> 

Right, I'll think about how to clarify the comments.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:
On 4/1/22 17:02, Tomas Vondra wrote:
> 
> 
> On 3/28/22 07:29, Amit Kapila wrote:
>> ...
>>
>> While thinking about this, I think I see a problem with the
>> non-transactional handling of sequences. It seems that we will skip
>> sending non-transactional sequence change if it occurs before the
>> decoding has reached a consistent point but the surrounding commit
>> occurs after a consistent point is reached. In such cases, the
>> corresponding DMLs like inserts will be sent but sequence changes
>> won't be sent. For example (this scenario is based on
>> twophase_snapshot.spec),
>>
>> Initial setup:
>> ==============
>> Create table t1_seq(c1 int);
>> Create Sequence seq1;
>>
>> Test Execution via multiple sessions (this test allows insert in
>> session-2 to happen before we reach a consistent point and commit
>> happens after a consistent point):
>> =======================================================================================================
>>
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit 'foo'
>>
>> Session-1:
>> SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
>> 'include-xids', 'false', 'skip-empty-xacts', '1');
>>
>>                      data
>> ----------------------------------------------
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>>
>> Now, if we normally try to decode such an insert, the result would be
>> something like:
>>                                      data
>> ------------------------------------------------------------------------------
>>  sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>> This will create an inconsistent replica as sequence changes won't be
>> replicated.
> 
> Hmm, that's interesting. I wonder if it can actually happen, though.
> Have you been able to reproduce that, somehow?
> 
>> I thought about changing snapshot dealing of
>> non-transactional sequence changes similar to transactional ones but
>> that also won't work because it is only at commit we decide whether we
>> can send the changes.
>>
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
> 
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
> 
>> For the transactional case, as we are considering the create sequence
>> operation as transactional, we would unnecessarily queue them even
>> though that is not required. Basically, they don't need to be
>> considered transactional and we can simply ignore such messages like
>> other DDLs. But for that probably we need to distinguish Alter/Create
>> case which may or may not be straightforward. Now, queuing them is
>> probably harmless unless it causes the transaction to spill/stream.
>>
> 
> I'm not sure I follow. Why would we queue them unnecessarily?
> 
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean can't be any CREATE case,
> just ALTER?
> 

So, I investigated this a bit more, and I wrote a couple test_decoding
isolation tests (patch attached) demonstrating the issue. Actually, I
should say "issues" because it's a bit worse than you described ...

The whole problem is in this chunk of code in sequence_decode():


  /* Skip the change if already processed (per the snapshot). */
  if (transactional &&
      !SnapBuildProcessChange(builder, xid, buf->origptr))
      return;
  else if (!transactional &&
           (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
            SnapBuildXactNeedsSkip(builder, buf->origptr)))
      return;

  /* Queue the increment (or send immediately if not transactional). */
  snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
  ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
                             origin_id, target_node, transactional,
                             xlrec->created, tuplebuf);

With the script you described, the increment is non-transactional, so we
end up in the second branch, return and thus discard the increment.

But it's also possible the change is transactional, in which case only
the first branch can skip it. When it doesn't, we proceed to build the
snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is

   Assert(builder->state == SNAPBUILD_CONSISTENT);

and we're still not in a consistent snapshot, so it just crashes and
burns :-(

The sequences.spec file has two definitions of s2restart step, one empty
(resulting in non-transactional change), one with ALTER SEQUENCE (which
means the change will be transactional).


The really "funny" thing is this is not new code - this is an exact copy
from logicalmsg_decode(), and logical messages have all those issues
too. We may discard some messages, trigger the same Assert, etc. There's
a messages2.spec demonstrating this (s2message step defines whether the
message is transactional or not).

So I guess we need to fix both places, perhaps in a similar way. And one
of those will have to be backpatched (which may make it more complex).


The only option I see is reworking the decoding so that it does not need
the snapshot at all. We'll need to stash the changes just like any other
change, apply them at end of transaction, and the main difference
between transactional and non-transactional case would be what happens
at abort. Transactional changes would be discarded, non-transactional
would be applied anyway.

The challenge is that this reorders the sequence changes, so we'll need
to reconcile them somehow. One option would be to simply apply the
change with the highest LSN in the transaction, and then walk all other
in-progress transactions and discard changes for that sequence with a
lower LSN. Not sure how complex/expensive that would be, though. Another
problem is that not all increments are WAL-logged, of course; not sure
what to do about that.

Another option might be revisiting the approach proposed by Hannu in
September [1], i.e. tracking sequences touched in a transaction, and
then replicating the current state at that particular moment.



regards


[1]
https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/28/22 07:29, Amit Kapila wrote:
> > I thought about changing snapshot dealing of
> > non-transactional sequence changes similar to transactional ones but
> > that also won't work because it is only at commit we decide whether we
> > can send the changes.
> >
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
>
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
>

I couldn't follow '..discard inspect them ..'. Do you mean we inspect
them and discard whichever are not required? It seems here we are
talking about a new global ReorderBufferGlobal instead of
ReorderBufferTXN to collect these changes but we don't need only
consistent point LSN because we do send if the commit of containing
transaction is after consistent point LSN, so we need some transaction
information as well. I think it could bring new challenges.

> > For the transactional case, as we are considering the create sequence
> > operation as transactional, we would unnecessarily queue them even
> > though that is not required. Basically, they don't need to be
> > considered transactional and we can simply ignore such messages like
> > other DDLs. But for that probably we need to distinguish Alter/Create
> > case which may or may not be straightforward. Now, queuing them is
> > probably harmless unless it causes the transaction to spill/stream.
> >
>
> I'm not sure I follow. Why would we queue them unnecessarily?
>
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean can't be any CREATE case,
> just ALTER?
>

Yeah, but how will we distinguish them? Aren't they using the same
kind of WAL record?

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 4/1/22 17:02, Tomas Vondra wrote:
>
> The only option I see is reworking the decoding so that it does not need
> the snapshot at all. We'll need to stash the changes just like any other
> change, apply them at end of transaction, and the main difference
> between transactional and non-transactional case would be what happens
> at abort. Transactional changes would be discarded, non-transactional
> would be applied anyway.
>

I think in the above I am not following how we can make it work
without considering *snapshot at all* because based on that we would
have done the initial sync (copy_sequence) and if we don't follow that
later it can lead to inconsistency. I might be missing something here.

> The challenge is this reorders the sequence changes, so we'll need to
> reconcile them somehow. One option would be to simply (1) apply the
> change with the highest LSN in the transaction, and then walk all other
> in-progress transactions and changes for that sequence with a lower LSN.
> Not sure how complex/expensive that would be, though. Another problem is
> not all increments are WAL-logged, of course, not sure about that.
>
> Another option might be revisiting the approach proposed by Hannu in
> September [1], i.e. tracking sequences touched in a transaction, and
> then replicating the current state at that particular moment.
>

I'll think about that approach as well.


-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 4/2/22 12:35, Amit Kapila wrote:
> On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/28/22 07:29, Amit Kapila wrote:
>>> I thought about changing snapshot dealing of
>>> non-transactional sequence changes similar to transactional ones but
>>> that also won't work because it is only at commit we decide whether we
>>> can send the changes.
>>>
>> I wonder if there's some earlier LSN (similar to the consistent point)
>> which might be useful for this.
>>
>> Or maybe we should queue even the non-transactional changes, not
>> per-transaction but in a global list, and then at each commit either
>> discard inspect them (at that point we know the lowest LSN for all
>> transactions and the consistent point). Seems complex, though.
>>
> 
> I couldn't follow '..discard inspect them ..'. Do you mean we inspect
> them and discard whichever are not required? It seems here we are
> talking about a new global ReorderBufferGlobal instead of
> ReorderBufferTXN to collect these changes but we don't need only
> consistent point LSN because we do send if the commit of containing
> transaction is after consistent point LSN, so we need some transaction
> information as well. I think it could bring new challenges.
> 

Sorry for the gibberish. Yes, I meant to discard sequence changes that
are no longer needed, due to being "obsoleted" by the applied change. We
must not apply "older" changes (using LSN) because that would make the
sequence go backwards.

I'm not entirely sure whether the list of changes should be kept in TXN
or in the global reorderbuffer object - we need to track which TXN the
change belongs to (because of transactional changes) but we also need to
discard the unnecessary changes efficiently (and walking TXN might be
expensive).

But yes, I'm sure there will be challenges. One being that tracking just
the decoded WAL stuff is not enough, because nextval() may not generate
WAL. But we still need to make sure the increment is replicated.

What I think we might do is this:

- add a global list of decoded sequence increments to ReorderBuffer

- at each commit/abort, walk the list and consider all increments up to
the commit LSN that "match" (non-transactional increments match all
TXNs, transactional ones match only the current TXN)

- replicate the last "matching" status for each sequence, discard the
processed ones

We could probably optimize this by not tracking every single increment,
but merge them "per transaction", I think.

I'm sure this description is pretty rough and will need refining,
handling of various corner cases, etc.

>>> For the transactional case, as we are considering the create sequence
>>> operation as transactional, we would unnecessarily queue them even
>>> though that is not required. Basically, they don't need to be
>>> considered transactional and we can simply ignore such messages like
>>> other DDLs. But for that probably we need to distinguish Alter/Create
>>> case which may or may not be straightforward. Now, queuing them is
>>> probably harmless unless it causes the transaction to spill/stream.
>>>
>>
>> I'm not sure I follow. Why would we queue them unnecessarily?
>>
>> Also, there's the bug with decoding changes in transactions that create
>> the sequence and add it to a publication. I think the agreement was that
>> this behavior was incorrect, we should not decode changes until the
>> subscription is refreshed. Doesn't that mean can't be any CREATE case,
>> just ALTER?
>>
> 
> Yeah, but how will we distinguish them. Aren't they using the same
> kind of WAL record?
> 

Same WAL record, but the "created" flag which should distinguish these
two cases, IIRC.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 4/2/22 12:43, Amit Kapila wrote:
> On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 4/1/22 17:02, Tomas Vondra wrote:
>>
>> The only option I see is reworking the decoding so that it does not need
>> the snapshot at all. We'll need to stash the changes just like any other
>> change, apply them at end of transaction, and the main difference
>> between transactional and non-transactional case would be what happens
>> at abort. Transactional changes would be discarded, non-transactional
>> would be applied anyway.
>>
> 
> I think in the above I am not following how we can make it work
> without considering *snapshot at all* because based on that we would
> have done the initial sync (copy_sequence) and if we don't follow that
> later it can lead to inconsistency. I might be missing something here.
> 

Well, what I meant to say is that we can't consider the snapshot at this
phase of decoding. We'd still consider it later, at commit/abort time,
of course. I.e. it'd be fairly similar to what heap_decode/DecodeInsert
does, for example. AFAIK this does not build the snapshot anywhere.

>> The challenge is this reorders the sequence changes, so we'll need to
>> reconcile them somehow. One option would be to simply (1) apply the
>> change with the highest LSN in the transaction, and then walk all other
>> in-progress transactions and changes for that sequence with a lower LSN.
>> Not sure how complex/expensive that would be, though. Another problem is
>> not all increments are WAL-logged, of course, not sure about that.
>>
>> Another option might be revisiting the approach proposed by Hannu in
>> September [1], i.e. tracking sequences touched in a transaction, and
>> then replicating the current state at that particular moment.
>>
> 
> I'll think about that approach as well.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 4/2/22 13:51, Tomas Vondra wrote:
> 
> 
> On 4/2/22 12:35, Amit Kapila wrote:
>> On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 3/28/22 07:29, Amit Kapila wrote:
>>>> I thought about changing snapshot dealing of
>>>> non-transactional sequence changes similar to transactional ones but
>>>> that also won't work because it is only at commit we decide whether we
>>>> can send the changes.
>>>>
>>> I wonder if there's some earlier LSN (similar to the consistent point)
>>> which might be useful for this.
>>>
>>> Or maybe we should queue even the non-transactional changes, not
>>> per-transaction but in a global list, and then at each commit either
>>> discard inspect them (at that point we know the lowest LSN for all
>>> transactions and the consistent point). Seems complex, though.
>>>
>>
>> I couldn't follow '..discard inspect them ..'. Do you mean we inspect
>> them and discard whichever are not required? It seems here we are
>> talking about a new global ReorderBufferGlobal instead of
>> ReorderBufferTXN to collect these changes but we don't need only
>> consistent point LSN because we do send if the commit of containing
>> transaction is after consistent point LSN, so we need some transaction
>> information as well. I think it could bring new challenges.
>>
> 
> Sorry for the gibberish. Yes, I meant to discard sequence changes that
> are no longer needed, due to being "obsoleted" by the applied change. We
> must not apply "older" changes (using LSN) because that would make the
> sequence go backwards.
> 
> I'm not entirely sure whether the list of changes should be kept in TXN
> or in the global reorderbuffer object - we need to track which TXN the
> change belongs to (because of transactional changes) but we also need to
> discard the unnecessary changes efficiently (and walking TXN might be
> expensive).
> 
> But yes, I'm sure there will be challenges. One being that tracking just
> the decoded WAL stuff is not enough, because nextval() may not generate
> WAL. But we still need to make sure the increment is replicated.
> 
> What I think we might do is this:
> 
> - add a global list of decoded sequence increments to ReorderBuffer
> 
> - at each commit/abort walk the list, walk the list and consider all
> increments up to the commit LSN that "match" (non-transactional match
> all TXNs, transactional match only the current TXN)
> 
> - replicate the last "matching" status for each sequence, discard the
> processed ones
> 
> We could probably optimize this by not tracking every single increment,
> but merge them "per transaction", I think.
> 
> I'm sure this description is pretty rough and will need refining, handle
> various corner-cases etc.
> 
I did some experiments over the weekend, exploring how to rework the
sequence decoding in various ways. Let me share some WIP patches,
hopefully that can be useful for trying more stuff and moving this
discussion forward.

I tried two things - (1) accumulating sequence increments in global
array and then doing something with it, and (2) treating all sequence
increments as regular changes (in a TXN) and then doing something
special during the replay. Attached are two patchsets, one for each
approach.

Note: It's important to remember decoding of sequences is not the only
code affected by this. The logical messages have the same issue,
certainly when it comes to transactional vs. non-transactional stuff and
handling of snapshots. Even if the sequence decoding ends up being
reverted, we still need to fix that, somehow. And my feeling is the
solutions ought to be pretty similar in both cases.

Now, regarding the two approaches:

(1) accumulating sequences in global hash table

The main problem with regular sequence increments is that those need to
be non-transactional - a transaction may use a sequence without any
WAL-logging, if the WAL was written by an earlier transaction. The
problem is the earlier trasaction might have been rolled back, and thus
simply discarded by the logical decoding. But we still need to apply
that, in order not to lose the sequence increment.

The current code just applies those non-transactional increments right
after decoding the increment, but that does not work because we may not
have a snapshot at that point. And we only have the snapshot when within
a transaction (AFAICS) so this queues all changes and then applies the
changes later.

The changes need to be shared by all transactions, so queueing them in a
global structure works fairly well - otherwise we'd have to walk all
transactions in order to see if there are relevant sequence increments.

But some increments may be transactional, e.g. when the sequence is
created or altered in a transaction. To allow tracking this, this uses a
hash table, with relfilenode as a key.

There are a couple of issues with this, though. Firstly, with the
changes stashed outside transactions, they are not included in memory
accounting, not spilled to disk or streamed, etc. I guess fixing this is
possible, but
it's certainly not straightforward, because we mix increments from many
different transactions.

A bigger issue is that I'm not sure this actually handles the snapshots
correctly either.

The non-transactional increments affect all transactions, so when
ReorderBufferProcessSequences gets executed, it processes all of them,
no matter the source transaction. Can we be sure the snapshot in the
applying transaction is the same (or "compatible") as the snapshot in
the source transaction?

Transactional increments can be simply processed as regular changes, of
course, but one difference is that we always create the transaction
(while before we just triggered the apply callback). This is necessary
as now we drive all of this from ReorderBufferCommit(), and without the
transaction the increment could not be applied / there would be no
snapshot.

It does seem to work, though, although I haven't tested it much so far.
One annoying bit is that we have to always walk all sequences and
increments, for each change in the transaction. Which seems quite
expensive, although the number of in-progress increments should be
pretty low (roughly equal to the number of sequences). Or at least the
part we need to consider for a single change (i.e. between two LSNs).

So maybe this should work. The one part this does not handle at all is
aborted transactions. At the moment we just discard those, which means
(a) we fail to discard the transactional changes from the hash table,
and (b) we can't apply the non-transactional changes, because with the
changes we also discard the snapshots we need.

I wonder if we could use a different snapshot, though. Non-transactional
changes can't change the relfilenode, after all. Not sure. If not, the
only solution I can think of is processing even aborted transactions,
but skipping changes except those that update snapshots.

There's a serious problem with streaming, though - we don't know which
transaction will commit first, hence we can't decide whether to send the
sequence changes. This seems pretty fatal to me. So we'd have to stream
the sequence changes only at commit, and then do some of this work on
the worker (i.e. merge the sequence changes to the right place). That
seems pretty annoying.


(2) treating sequence change as regular changes

This adopts a different approach - instead of accumulating the sequence
increments in a global hash table, it treats them as regular changes.
Which solves the snapshot issue, and issues with spilling to disk,
streaming and so on.

But it has various other issues with handling concurrent transactions,
unfortunately, which probably make this approach infeasible:

* The non-transactional stuff has to be applied in the first transaction
that commits, not in the transaction that generated the WAL. That does
not work too well with this approach, because we have to walk changes in
all other transactions.

* Another serious issue seems to be streaming - if we already streamed
some of the changes, we can't iterate through them anymore.

Also, having to walk the transactions over and over for each change, to
apply relevant sequence increments, that's mighty expensive. The other
approach needs to do that too, but walking the global hash table seems
much cheaper.

The other issue is the handling of aborted transactions - we need to apply
sequence increments even from those transactions, of course. The other
approach has this issue too, though.


(3) tracking sequences touched by transaction

This is the approach proposed by Hannu Krosing. I haven't explored this
again yet, but I recall I wrote a PoC patch a couple months back.

It seems to me most of the problems stem from trying to derive sequence
state from decoded WAL changes, which is problematic because of the
non-transactional nature of sequences (i.e. WAL for one transaction
affects other transactions in non-obvious ways). And this approach
simply works around that entirely - instead of trying to deduce the
sequence state from WAL, we'd make sure to write the current sequence
state (or maybe just ID of the sequence) at commit time. Which should
eliminate most of the complexity / problems, I think.


I'm not really sure what to do about this. All of those reworks seems
like an extensive redesign of the patch, and considering the last CF is
already over ... not great.

However, even if we end up reverting this, we'll still have the same
problem with snapshots for logical messages.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Masahiko Sawada
Date:
On Sat, Apr 2, 2022 at 8:52 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 4/2/22 12:35, Amit Kapila wrote:
> > On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 3/28/22 07:29, Amit Kapila wrote:
> >>> I thought about changing snapshot dealing of
> >>> non-transactional sequence changes similar to transactional ones but
> >>> that also won't work because it is only at commit we decide whether we
> >>> can send the changes.
> >>>
> >> I wonder if there's some earlier LSN (similar to the consistent point)
> >> which might be useful for this.
> >>
> >> Or maybe we should queue even the non-transactional changes, not
> >> per-transaction but in a global list, and then at each commit either
> >> discard inspect them (at that point we know the lowest LSN for all
> >> transactions and the consistent point). Seems complex, though.
> >>
> >
> > I couldn't follow '..discard inspect them ..'. Do you mean we inspect
> > them and discard whichever are not required? It seems here we are
> > talking about a new global ReorderBufferGlobal instead of
> > ReorderBufferTXN to collect these changes but we don't need only
> > consistent point LSN because we do send if the commit of containing
> > transaction is after consistent point LSN, so we need some transaction
> > information as well. I think it could bring new challenges.
> >
>
> Sorry for the gibberish. Yes, I meant to discard sequence changes that
> are no longer needed, due to being "obsoleted" by the applied change. We
> must not apply "older" changes (using LSN) because that would make the
> sequence go backwards.

It's not related to this issue but I think that non-transactional
sequence changes could be resent in case of the subscriber crashes
because it doesn’t update replication origin LSN, is that right? If
so, while resending the sequence changes, the sequence value on the
subscriber can temporarily go backward.

>
> I'm not entirely sure whether the list of changes should be kept in TXN
> or in the global reorderbuffer object - we need to track which TXN the
> change belongs to (because of transactional changes) but we also need to
> discard the unnecessary changes efficiently (and walking TXN might be
> expensive).
>
> But yes, I'm sure there will be challenges. One being that tracking just
> the decoded WAL stuff is not enough, because nextval() may not generate
> WAL. But we still need to make sure the increment is replicated.
>
> What I think we might do is this:
>
> - add a global list of decoded sequence increments to ReorderBuffer
>
> - at each commit/abort walk the list, walk the list and consider all
> increments up to the commit LSN that "match" (non-transactional match
> all TXNs, transactional match only the current TXN)
>
> - replicate the last "matching" status for each sequence, discard the
> processed ones
>
> We could probably optimize this by not tracking every single increment,
> but merge them "per transaction", I think.
>
> I'm sure this description is pretty rough and will need refining, handle
> various corner-cases etc.
>
> >>> For the transactional case, as we are considering the create sequence
> >>> operation as transactional, we would unnecessarily queue them even
> >>> though that is not required. Basically, they don't need to be
> >>> considered transactional and we can simply ignore such messages like
> >>> other DDLs. But for that probably we need to distinguish Alter/Create
> >>> case which may or may not be straightforward. Now, queuing them is
> >>> probably harmless unless it causes the transaction to spill/stream.
> >>>
> >>
> >> I'm not sure I follow. Why would we queue them unnecessarily?
> >>
> >> Also, there's the bug with decoding changes in transactions that create
> >> the sequence and add it to a publication. I think the agreement was that
> >> this behavior was incorrect, we should not decode changes until the
> >> subscription is refreshed. Doesn't that mean can't be any CREATE case,
> >> just ALTER?
> >>
> >
> > Yeah, but how will we distinguish them. Aren't they using the same
> > kind of WAL record?
> >
>
> Same WAL record, but the "created" flag which should distinguish these
> two cases, IIRC.

Since the "created" flag only indicates that we created a new
relfilenode, it's true for both CREATE and ALTER.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Mon, Apr 4, 2022 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Apr 2, 2022 at 8:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
> It's not related to this issue but I think that non-transactional
> sequence changes could be resent in case of the subscriber crashes
> because it doesn’t update replication origin LSN, is that right? If
> so, while resending the sequence changes, the sequence value on the
> subscriber can temporarily go backward.
>

Yes, this can happen for non-transactional sequence changes, though
this is a different problem from what is happening on the decoding
side.

> > >> Also, there's the bug with decoding changes in transactions that create
> > >> the sequence and add it to a publication. I think the agreement was that
> > >> this behavior was incorrect, we should not decode changes until the
> > >> subscription is refreshed. Doesn't that mean there can't be any CREATE
> > >> case, just ALTER?
> > >>
> > >
> > > Yeah, but how will we distinguish them? Aren't they using the same
> > > kind of WAL record?
> > >
> >
> > Same WAL record, but the "created" flag should distinguish these two
> > cases, IIRC.
>
> Since the "created" flag indicates that we created a new relfilenode,
> it's true in both the CREATE and ALTER cases.
>

Yes, this is my understanding as well. So, we need something else.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 4/1/22 17:02, Tomas Vondra wrote:
>
> So, I investigated this a bit more, and I wrote a couple test_decoding
> isolation tests (patch attached) demonstrating the issue. Actually, I
> should say "issues" because it's a bit worse than you described ...
>
> The whole problem is in this chunk of code in sequence_decode():
>
>
>   /* Skip the change if already processed (per the snapshot). */
>   if (transactional &&
>       !SnapBuildProcessChange(builder, xid, buf->origptr))
>       return;
>   else if (!transactional &&
>            (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>             SnapBuildXactNeedsSkip(builder, buf->origptr)))
>       return;
>
>   /* Queue the increment (or send immediately if not transactional). */
>   snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
>   ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
>                              origin_id, target_node, transactional,
>                              xlrec->created, tuplebuf);
>
> With the script you described, the increment is non-transactional, so we
> end up in the second branch, return and thus discard the increment.
>
> But it's also possible the change is transactional, which can only
> trigger the first branch. But it does not, so we start building the
> snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is
>
>    Assert(builder->state == SNAPBUILD_CONSISTENT);
>
> and we're still not in a consistent snapshot, so it just crashes and
> burn :-(
>
> The sequences.spec file has two definitions of s2restart step, one empty
> (resulting in non-transactional change), one with ALTER SEQUENCE (which
> means the change will be transactional).
>
>
> The really "funny" thing is this is not new code - this is an exact copy
> from logicalmsg_decode(), and logical messages have all those issues
> too. We may discard some messages, trigger the same Assert, etc. There's
> a messages2.spec demonstrating this (s2message step defines whether the
> message is transactional or not).
>

It seems to me that the Assert in SnapBuildGetOrBuildSnapshot() is
wrong. It is required only for non-transactional logical messages. For
transactional messages, we decide at commit time whether the snapshot
has reached a consistent state and then decide whether to skip the
entire transaction or not. So, a possible fix for the Assert could be
to pass an additional parameter 'transactional' to
SnapBuildGetOrBuildSnapshot() and then assert only when it is false. I
have also checked the development thread for this work, and the Assert
appears to have been introduced for non-transactional cases only. See
email [1]: the new function and Assert were added for the
non-transactional case, and this problem got introduced later while
rearranging the code.

Now, for the non-transactional cases, I am not sure there is a
one-to-one mapping with the sequence case. The way sequences are dealt
with on the subscriber side (first we copy the initial data and then
replicate the incremental changes) looks more like how we deal with a
table and its incremental changes. There is some commonality with
non-transactional messages w.r.t. the case where we want sequence
changes to be sent even on rollbacks unless some DDL has happened for
them, but looking at the overall solution it doesn't appear that we can
treat them like messages. I think this is the reason we are facing the
other problems w.r.t. syncing sequences to subscribers, including the
problem reported by Sawada-San yesterday.

Now, the particular case where we won't send a non-transactional
logical message unless the snapshot is consistent could be considered
its intended behavior and documented better. I am not very sure about
this, as there is no example in core of the way sync for these messages
happens, but if someone outside core wants a different behavior and
presents the case, then we can probably try to enhance it. I feel the
same is not true for sequences, because it can cause the replica
(subscriber) to go out of sync with the master (publisher).

> So I guess we need to fix both places, perhaps in a similar way.
>

It depends, but I think for logical messages we should do the minimal
fix required for the Asserts and probably document the current behavior
a bit better, unless we think there is a case for making them behave
similarly to sequences.


[1] - https://www.postgresql.org/message-id/56D4B3AD.5000207%402ndquadrant.com

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Amit Kapila
Date:
On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I did some experiments over the weekend, exploring how to rework the
> sequence decoding in various ways. Let me share some WIP patches,
> hopefully that can be useful for trying more stuff and moving this
> discussion forward.
>
> I tried two things - (1) accumulating sequence increments in global
> array and then doing something with it, and (2) treating all sequence
> increments as regular changes (in a TXN) and then doing something
> special during the replay. Attached are two patchsets, one for each
> approach.
>
> Note: It's important to remember decoding of sequences is not the only
> code affected by this. The logical messages have the same issue,
> certainly when it comes to transactional vs. non-transactional stuff and
> handling of snapshots. Even if the sequence decoding ends up being
> reverted, we still need to fix that, somehow. And my feeling is the
> solutions ought to be pretty similar in both cases.
>
> Now, regarding the two approaches:
>
> (1) accumulating sequences in global hash table
>
> The main problem with regular sequence increments is that those need to
> be non-transactional - a transaction may use a sequence without any
> WAL-logging, if the WAL was written by an earlier transaction. The
> problem is the earlier transaction might have been rolled back, and thus
> simply discarded by the logical decoding. But we still need to apply
> that, in order not to lose the sequence increment.
>
> The current code just applies those non-transactional increments right
> after decoding the increment, but that does not work because we may not
> have a snapshot at that point. And we only have the snapshot when within
> a transaction (AFAICS) so this queues all changes and then applies the
> changes later.
>
> The changes need to be shared by all transactions, so queueing them in
> a global structure works fairly well - otherwise we'd have to walk all
> transactions, in order to see if there are relevant sequence increments.
>
> But some increments may be transactional, e.g. when the sequence is
> created or altered in a transaction. To allow tracking this, this uses a
> hash table, with relfilenode as a key.
>
> There's a couple issues with this, though. Firstly, stashing the changes
> outside transactions, it's not included in memory accounting, it's not
> spilled to disk or streamed, etc. I guess fixing this is possible, but
> it's certainly not straightforward, because we mix increments from many
> different transactions.
>
> A bigger issue is that I'm not sure this actually handles the snapshots
> correctly either.
>
> The non-transactional increments affect all transactions, so when
> ReorderBufferProcessSequences gets executed, it processes all of them,
> no matter the source transaction. Can we be sure the snapshot in the
> applying transaction is the same (or "compatible") as the snapshot in
> the source transaction?
>

I don't think we can assume that. I think it is possible that some
other transaction's WAL lies in between the start/end LSN of the txn
(which we decide to send) at a point where we may not yet have reached
a consistent state. Consider a case similar to the one shown in one of
my previous emails:
Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);

Session-3:
Commit;

Session-2:
Commit;

Here, we send changes (say an insert from txn 700) from session-2
because session-3's commit happens before it. Now, consider another
transaction parallel to txn 700 which generates some WAL related to
sequences but committed before session-3's commit. So, though its
changes will be in between the start/end LSN of txn 700, they shouldn't
be sent.

I have not tried this, and it may be solvable in some way, but I think
processing changes from other TXNs sounds risky in terms of snapshot
handling.

>
>
> (2) treating sequence change as regular changes
>
> This adopts a different approach - instead of accumulating the sequence
> increments in a global hash table, it treats them as regular changes.
> Which solves the snapshot issue, and issues with spilling to disk,
> streaming and so on.
>
> But it has various other issues with handling concurrent transactions,
> unfortunately, which probably make this approach infeasible:
>
> * The non-transactional stuff has to be applied in the first transaction
> that commits, not in the transaction that generated the WAL. That does
> not work too well with this approach, because we have to walk changes in
> all other transactions.
>

Why do you want to traverse other TXNs in this approach? Is it because
the current TXN might be using some value of a sequence which has
actually been WAL-logged in the other transaction, but that other
transaction has not been sent yet? I think if we don't send that, then
sequence columns in some replica tables may have values that the
sequence itself does not yet reflect, which sounds problematic. Is that
correct?

> * Another serious issue seems to be streaming - if we already streamed
> some of the changes, we can't iterate through them anymore.
>
> Also, having to walk the transactions over and over for each change, to
> apply relevant sequence increments, that's mighty expensive. The other
> approach needs to do that too, but walking the global hash table seems
> much cheaper.
>
> The other issue is the handling of aborted transactions - we need to apply
> sequence increments even from those transactions, of course. The other
> approach has this issue too, though.
>
>
> (3) tracking sequences touched by transaction
>
> This is the approach proposed by Hannu Krosing. I haven't explored this
> again yet, but I recall I wrote a PoC patch a couple months back.
>
> It seems to me most of the problems stem from trying to derive sequence
> state from decoded WAL changes, which is problematic because of the
> non-transactional nature of sequences (i.e. WAL for one transaction
> affects other transactions in non-obvious ways). And this approach
> simply works around that entirely - instead of trying to deduce the
> sequence state from WAL, we'd make sure to write the current sequence
> state (or maybe just ID of the sequence) at commit time. Which should
> eliminate most of the complexity / problems, I think.
>

That sounds promising but I haven't thought in detail about that approach.

>
> I'm not really sure what to do about this. All of those reworks seem
> like an extensive redesign of the patch, and considering the last CF is
> already over ... not great.
>

Yeah, I share the same feeling that even if we devise solutions to all
the known problems it requires quite some time to ensure everything is
correct.

-- 
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 4/5/22 12:06, Amit Kapila wrote:
> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I did some experiments over the weekend, exploring how to rework the
>> sequence decoding in various ways. Let me share some WIP patches,
>> hopefully that can be useful for trying more stuff and moving this
>> discussion forward.
>>
>> I tried two things - (1) accumulating sequence increments in global
>> array and then doing something with it, and (2) treating all sequence
>> increments as regular changes (in a TXN) and then doing something
>> special during the replay. Attached are two patchsets, one for each
>> approach.
>>
>> Note: It's important to remember decoding of sequences is not the only
>> code affected by this. The logical messages have the same issue,
>> certainly when it comes to transactional vs. non-transactional stuff and
>> handling of snapshots. Even if the sequence decoding ends up being
>> reverted, we still need to fix that, somehow. And my feeling is the
>> solutions ought to be pretty similar in both cases.
>>
>> Now, regarding the two approaches:
>>
>> (1) accumulating sequences in global hash table
>>
>> The main problem with regular sequence increments is that those need to
>> be non-transactional - a transaction may use a sequence without any
>> WAL-logging, if the WAL was written by an earlier transaction. The
>> problem is the earlier transaction might have been rolled back, and thus
>> simply discarded by the logical decoding. But we still need to apply
>> that, in order not to lose the sequence increment.
>>
>> The current code just applies those non-transactional increments right
>> after decoding the increment, but that does not work because we may not
>> have a snapshot at that point. And we only have the snapshot when within
>> a transaction (AFAICS) so this queues all changes and then applies the
>> changes later.
>>
>> The changes need to be shared by all transactions, so queueing them in
>> a global structure works fairly well - otherwise we'd have to walk all
>> transactions, in order to see if there are relevant sequence increments.
>>
>> But some increments may be transactional, e.g. when the sequence is
>> created or altered in a transaction. To allow tracking this, this uses a
>> hash table, with relfilenode as a key.
>>
>> There's a couple issues with this, though. Firstly, stashing the changes
>> outside transactions, it's not included in memory accounting, it's not
>> spilled to disk or streamed, etc. I guess fixing this is possible, but
>> it's certainly not straightforward, because we mix increments from many
>> different transactions.
>>
>> A bigger issue is that I'm not sure this actually handles the snapshots
>> correctly either.
>>
>> The non-transactional increments affect all transactions, so when
>> ReorderBufferProcessSequences gets executed, it processes all of them,
>> no matter the source transaction. Can we be sure the snapshot in the
>> applying transaction is the same (or "compatible") as the snapshot in
>> the source transaction?
>>
> 
> I don't think we can assume that. I think it is possible that some
> other transaction's WAL can be in-between start/end lsn of txn (which
> we decide to send) which may not finally reach a consistent state.
> Consider a case similar to shown in one of my previous emails:
> Session-2:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-1:
> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
> 'test_decoding', false, true);
> 
> Session-3:
> Begin;
> SELECT pg_current_xact_id();
> 
> Session-2:
> Commit;
> Begin;
> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
> 
> Session-3:
> Commit;
> 
> Session-2:
> Commit;
> 
> Here, we send changes (say insert from txn 700) from session-2 because
> session-3's commit happens before it. Now, consider another
> transaction parallel to txn 700 which generates some WAL related to
> sequences but it committed before session-3's commit. So though, its
> changes will be the in-between start/end LSN of txn 700 but those
> shouldn't be sent.
> 
> I have not tried this and also this may be solvable in some way but I
> think processing changes from other TXNs sounds risky to me in terms
> of snapshot handling.
> 

Yes, I know this can happen. I was only really thinking about what might
happen to the relfilenode of the sequence itself - and I don't think any
concurrent transaction could swoop in and change the relfilenode in any
meaningful way, due to locking.

But of course, if we expect/require to have a perfect snapshot for that
exact position in the transaction, this won't work. IMO the whole idea
that we can have non-transactional bits in naturally transactional
decoding seems a bit suspicious (at least in hindsight).

No matter what we do for sequences, though, this still affects logical
messages too. Not sure what to do there :-(

>>
>>
>> (2) treating sequence change as regular changes
>>
>> This adopts a different approach - instead of accumulating the sequence
>> increments in a global hash table, it treats them as regular changes.
>> Which solves the snapshot issue, and issues with spilling to disk,
>> streaming and so on.
>>
>> But it has various other issues with handling concurrent transactions,
>> unfortunately, which probably make this approach infeasible:
>>
>> * The non-transactional stuff has to be applied in the first transaction
>> that commits, not in the transaction that generated the WAL. That does
>> not work too well with this approach, because we have to walk changes in
>> all other transactions.
>>
> 
> Why do you want to traverse other TXNs in this approach? Is it because
> the current TXN might be using some value of sequence which has been
> actually WAL logged in the other transaction but that other
> transaction has not been sent yet? I think if we don't send that then
> probably replica sequences columns (in some tables) have some values
> but actually the sequence itself won't have still that value which
> sounds problematic. Is that correct?
> 

Well, how else would you get to sequence changes in the other TXNs?

Consider this:

T1: begin
T2: begin

T2: nextval('s') -> writes WAL for 32 values
T1: nextval('s') -> gets value without WAL

T1: commit
T2: commit

Now, if we commit T1 without "applying" the sequence change from T2, we
lose the sequence state. But we still write/replicate the value
generated from the sequence.

>> * Another serious issue seems to be streaming - if we already streamed
>> some of the changes, we can't iterate through them anymore.
>>
>> Also, having to walk the transactions over and over for each change, to
>> apply relevant sequence increments, that's mighty expensive. The other
>> approach needs to do that too, but walking the global hash table seems
>> much cheaper.
>>
>> The other issue is the handling of aborted transactions - we need to apply
>> sequence increments even from those transactions, of course. The other
>> approach has this issue too, though.
>>
>>
>> (3) tracking sequences touched by transaction
>>
>> This is the approach proposed by Hannu Krosing. I haven't explored this
>> again yet, but I recall I wrote a PoC patch a couple months back.
>>
>> It seems to me most of the problems stem from trying to derive sequence
>> state from decoded WAL changes, which is problematic because of the
>> non-transactional nature of sequences (i.e. WAL for one transaction
>> affects other transactions in non-obvious ways). And this approach
>> simply works around that entirely - instead of trying to deduce the
>> sequence state from WAL, we'd make sure to write the current sequence
>> state (or maybe just ID of the sequence) at commit time. Which should
>> eliminate most of the complexity / problems, I think.
>>
> 
> That sounds promising but I haven't thought in detail about that approach.
> 

So, here's a patch doing that. It's a reworked/improved version of the
patch [1] shared in November.

It seems to be working pretty nicely. The behavior is a little bit
different, of course, because we only replicate "committed" changes, so
if you do nextval() in an aborted transaction, that is not replicated.
Which I think is fine, because we make no durability guarantees for
aborted transactions in general.

But there are a couple issues too:

1) locking

We have to read the sequence change before the commit, but we must not
allow reordering (because then the state might go backwards again). I'm
not sure how serious an impact this could have on performance.

2) dropped sequences

I'm not sure what to do about sequences dropped in the transaction. The
patch simply attempts to read the current sequence state before the
commit, but if the sequence was dropped (in that transaction), that
can't happen. I'm not sure if that's OK or not.

3) WAL record

To replicate the stuff the patch uses a LogicalMessage, but I guess a
separate WAL record would be better. But that's a technical detail.


regards

[1]
https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com

>>
>> I'm not really sure what to do about this. All of those reworks seem
>> like an extensive redesign of the patch, and considering the last CF is
>> already over ... not great.
>>
> 
> Yeah, I share the same feeling that even if we devise solutions to all
> the known problems it requires quite some time to ensure everything is
> correct.
> 

True. Let's keep working on this for a bit more time and then we can
decide what to do.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 4/6/22 16:13, Tomas Vondra wrote:
> 
> 
> On 4/5/22 12:06, Amit Kapila wrote:
>> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I did some experiments over the weekend, exploring how to rework the
>>> sequence decoding in various ways. Let me share some WIP patches,
>>> hopefully that can be useful for trying more stuff and moving this
>>> discussion forward.
>>>
>>> I tried two things - (1) accumulating sequence increments in global
>>> array and then doing something with it, and (2) treating all sequence
>>> increments as regular changes (in a TXN) and then doing something
>>> special during the replay. Attached are two patchsets, one for each
>>> approach.
>>>
>>> Note: It's important to remember decoding of sequences is not the only
>>> code affected by this. The logical messages have the same issue,
>>> certainly when it comes to transactional vs. non-transactional stuff and
>>> handling of snapshots. Even if the sequence decoding ends up being
>>> reverted, we still need to fix that, somehow. And my feeling is the
>>> solutions ought to be pretty similar in both cases.
>>>
>>> Now, regarding the two approaches:
>>>
>>> (1) accumulating sequences in global hash table
>>>
>>> The main problem with regular sequence increments is that those need to
>>> be non-transactional - a transaction may use a sequence without any
>>> WAL-logging, if the WAL was written by an earlier transaction. The
>>> problem is the earlier transaction might have been rolled back, and thus
>>> simply discarded by the logical decoding. But we still need to apply
>>> that, in order not to lose the sequence increment.
>>>
>>> The current code just applies those non-transactional increments right
>>> after decoding the increment, but that does not work because we may not
>>> have a snapshot at that point. And we only have the snapshot when within
>>> a transaction (AFAICS) so this queues all changes and then applies the
>>> changes later.
>>>
>>> The changes need to be shared by all transactions, so queueing them in
>>> a global structure works fairly well - otherwise we'd have to walk all
>>> transactions, in order to see if there are relevant sequence increments.
>>>
>>> But some increments may be transactional, e.g. when the sequence is
>>> created or altered in a transaction. To allow tracking this, this uses a
>>> hash table, with relfilenode as a key.
>>>
>>> There's a couple issues with this, though. Firstly, stashing the changes
>>> outside transactions, it's not included in memory accounting, it's not
>>> spilled to disk or streamed, etc. I guess fixing this is possible, but
>>> it's certainly not straightforward, because we mix increments from many
>>> different transactions.
>>>
>>> A bigger issue is that I'm not sure this actually handles the snapshots
>>> correctly either.
>>>
>>> The non-transactional increments affect all transactions, so when
>>> ReorderBufferProcessSequences gets executed, it processes all of them,
>>> no matter the source transaction. Can we be sure the snapshot in the
>>> applying transaction is the same (or "compatible") as the snapshot in
>>> the source transaction?
>>>
>>
>> I don't think we can assume that. I think it is possible that some
>> other transaction's WAL can be in-between start/end lsn of txn (which
>> we decide to send) which may not finally reach a consistent state.
>> Consider a case similar to shown in one of my previous emails:
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit;
>>
>> Here, we send changes (say insert from txn 700) from session-2 because
>> session-3's commit happens before it. Now, consider another
>> transaction parallel to txn 700 which generates some WAL related to
>> sequences but it committed before session-3's commit. So though, its
>> changes will be the in-between start/end LSN of txn 700 but those
>> shouldn't be sent.
>>
>> I have not tried this and also this may be solvable in some way but I
>> think processing changes from other TXNs sounds risky to me in terms
>> of snapshot handling.
>>
> 
> Yes, I know this can happen. I was only really thinking about what might
> happen to the relfilenode of the sequence itself - and I don't think any
> concurrent transaction could swoop in and change the relfilenode in any
> meaningful way, due to locking.
> 
> But of course, if we expect/require to have a perfect snapshot for that
> exact position in the transaction, this won't work. IMO the whole idea
> that we can have non-transactional bits in naturally transactional
> decoding seems a bit suspicious (at least in hindsight).
> 
> No matter what we do for sequences, though, this still affects logical
> messages too. Not sure what to do there :-(
> 
>>>
>>>
>>> (2) treating sequence change as regular changes
>>>
>>> This adopts a different approach - instead of accumulating the sequence
>>> increments in a global hash table, it treats them as regular changes.
>>> Which solves the snapshot issue, and issues with spilling to disk,
>>> streaming and so on.
>>>
>>> But it has various other issues with handling concurrent transactions,
>>> unfortunately, which probably make this approach infeasible:
>>>
>>> * The non-transactional stuff has to be applied in the first transaction
>>> that commits, not in the transaction that generated the WAL. That does
>>> not work too well with this approach, because we have to walk changes in
>>> all other transactions.
>>>
>>
>> Why do you want to traverse other TXNs in this approach? Is it because
>> the current TXN might be using some value of sequence which has been
>> actually WAL logged in the other transaction but that other
>> transaction has not been sent yet? I think if we don't send that then
>> probably replica sequences columns (in some tables) have some values
>> but actually the sequence itself won't have still that value which
>> sounds problematic. Is that correct?
>>
> 
> Well, how else would you get to sequence changes in the other TXNs?
> 
> Consider this:
> 
> T1: begin
> T2: begin
> 
> T2: nextval('s') -> writes WAL for 32 values
> T1: nextval('s') -> gets value without WAL
> 
> T1: commit
> T2: commit
> 
> Now, if we commit T1 without "applying" the sequence change from T2, we
> lose the sequence state. But we still write/replicate the value
> generated from the sequence.
> 
>>> * Another serious issue seems to be streaming - if we already streamed
>>> some of the changes, we can't iterate through them anymore.
>>>
>>> Also, having to walk the transactions over and over for each change, to
>>> apply relevant sequence increments, that's mighty expensive. The other
>>> approach needs to do that too, but walking the global hash table seems
>>> much cheaper.
>>>
>>> The other issue is the handling of aborted transactions - we need to apply
>>> sequence increments even from those transactions, of course. The other
>>> approach has this issue too, though.
>>>
>>>
>>> (3) tracking sequences touched by transaction
>>>
>>> This is the approach proposed by Hannu Krosing. I haven't explored this
>>> again yet, but I recall I wrote a PoC patch a couple months back.
>>>
>>> It seems to me most of the problems stem from trying to derive sequence
>>> state from decoded WAL changes, which is problematic because of the
>>> non-transactional nature of sequences (i.e. WAL for one transaction
>>> affects other transactions in non-obvious ways). And this approach
>>> simply works around that entirely - instead of trying to deduce the
>>> sequence state from WAL, we'd make sure to write the current sequence
>>> state (or maybe just ID of the sequence) at commit time. Which should
>>> eliminate most of the complexity / problems, I think.
>>>
>>
>> That sounds promising but I haven't thought in detail about that approach.
>>
> 
> So, here's a patch doing that. It's a reworked/improved version of the
> patch [1] shared in November.
> 
> It seems to be working pretty nicely. The behavior is a little bit
> different, of course, because we only replicate "committed" changes, so
> if you do nextval() in an aborted transaction, that is not replicated. Which
> I think is fine, because we generally make no durability guarantees for
> aborted transactions in general.
> 
> But there are a couple issues too:
> 
> 1) locking
> 
> We have to read sequence change before the commit, but we must not allow
> reordering (because then the state might go backwards again). I'm not
> sure how serious an impact this could have on performance.
> 
> 2) dropped sequences
> 
> I'm not sure what to do about sequences dropped in the transaction. The
> patch simply attempts to read the current sequence state before the
> commit, but if the sequence was dropped (in that transaction), that
> can't happen. I'm not sure if that's OK or not.
> 
> 3) WAL record
> 
> To replicate the stuff the patch uses a LogicalMessage, but I guess a
> separate WAL record would be better. But that's a technical detail.
> 
> 
> regards
> 
> [1]
> https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com
> 
>>>
>>> I'm not really sure what to do about this. All of those reworks seem
>>> like an extensive redesign of the patch, and considering the last CF is
>>> already over ... not great.
>>>
>>
>> Yeah, I share the same feeling that even if we devise solutions to all
>> the known problems it requires quite some time to ensure everything is
>> correct.
>>
> 
> True. Let's keep working on this for a bit more time and then we can
> decide what to do.
> 

I've pushed a revert of all the commits related to this - decoding of
sequences and test_decoding / built-in replication changes. The approach
combining transactional and non-transactional behavior implemented by
the patch clearly has issues, and it seems foolish to hope we'll find a
simple fix. So the changes would have to be much more extensive, and
doing that after the last CF seems like an obviously bad idea.

Attached is a rebased patch, implementing the approach based on
WAL-logging sequences at commit time.
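To make the idea concrete, here is a toy model of that commit-time approach, as a self-contained C sketch with hypothetical names (not the patch's actual code): each transaction remembers which sequences it touched, and only a commit writes their current state out, so increments made in aborted transactions are never replicated even though the sequence itself does not roll back.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SEQS 8

typedef struct { int64_t last_value; bool is_called; } SeqState;

typedef struct {
    int       touched[MAX_SEQS];  /* "OIDs" of sequences touched in this xact */
    int       ntouched;
    SeqState *catalog;            /* current sequence state, indexed by OID */
    int       logged[MAX_SEQS];   /* OIDs whose state was "WAL-logged" at commit */
    int       nlogged;
} ToyXact;

static void toy_nextval(ToyXact *x, int seqoid)
{
    x->catalog[seqoid].last_value++;     /* advances immediately... */
    x->catalog[seqoid].is_called = true; /* ...even if we later abort */
    x->touched[x->ntouched++] = seqoid;
}

/* At commit, log the *current* state of every touched sequence. */
static void toy_commit(ToyXact *x)
{
    for (int i = 0; i < x->ntouched; i++)
        x->logged[x->nlogged++] = x->touched[i];
}

/* On abort, nothing is logged: no durability for aborted increments. */
static void toy_abort(ToyXact *x)
{
    x->ntouched = 0;
}
```

Note that the sequence state in the "catalog" still advances on abort, matching real sequence behavior; only the replication of that state is gated on commit.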


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences

From
Justin Pryzby
Date:
Some typos I found before the patch was reverted.

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a6ea6ff3fcf..d4bd8d41c4b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -834,9 +834,8 @@ typedef void (*LogicalDecodeSequenceCB) (struct LogicalDecodingContext *ctx,
       non-transactional increments, the transaction may be either NULL or not
       NULL, depending on if the transaction already has an XID assigned.
       The <parameter>sequence_lsn</parameter> has the WAL location of the
-      sequence update. <parameter>transactional</parameter> says if the
-      sequence has to be replayed as part of the transaction or directly.
-
+      sequence update. <parameter>transactional</parameter> indicates whether
+      the sequence has to be replayed as part of the transaction or directly.
       The <parameter>last_value</parameter>, <parameter>log_cnt</parameter> and
       <parameter>is_called</parameter> parameters describe the sequence change.
      </para>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 60866431db3..b71122cce5d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -927,7 +927,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
  * Treat the sequence increment as transactional?
  *
  * The hash table tracks all sequences created in in-progress transactions,
- * so we simply do a lookup (the sequence is identified by relfilende). If
+ * so we simply do a lookup (the sequence is identified by relfilenode). If
  * we find a match, the increment should be handled as transactional.
  */
 bool
@@ -2255,7 +2255,7 @@ ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
     tuple = &change->data.sequence.tuple->tuple;
     seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
 
-    /* Only ever called from ReorderBufferApplySequence, so transational. */
+    /* Only ever called from ReorderBufferApplySequence, so transactional. */
     if (streaming)
         rb->stream_sequence(rb, txn, change->lsn, relation, true,
                             seq->last_value, seq->log_cnt, seq->is_called);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a15ce9edb13..8193bfe6515 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5601,7 +5601,7 @@ RelationBuildPublicationDesc(Relation relation, PublicationDesc *pubdesc)
                                      GetSchemaPublications(schemaid, objType));
 
     /*
-     * If this is a partion (and thus a table), lookup all ancestors and track
+     * If this is a partition (and thus a table), lookup all ancestors and track
      * all publications them too.
      */
     if (relation->rd_rel->relispartition)



Re: logical decoding and replication of sequences

From
Zheng Li
Date:
> But of course, if we expect/require to have a perfect snapshot for that
> exact position in the transaction, this won't work. IMO the whole idea
> that we can have non-transactional bits in naturally transactional
> decoding seems a bit suspicious (at least in hindsight).
>
> No matter what we do for sequences, though, this still affects logical
> messages too. Not sure what to do there :-(

Hi, I spent some time trying to understand this problem while I was
evaluating its impact on the DDL replication in [1]. I think for DDL
we could always remove the
non-transactional bits since DDL will probably always be processed
transactionally.

I attempted to solve the problem for messages. Here is a potential
solution: keep track of the last decoded/acked non-transactional
message/operation LSN and use it to check whether a non-transactional
message record should be skipped during decoding. To do that I added
new fields ReplicationSlotPersistentData.non_xact_op_at,
XLogReaderState.NonXactOpRecPtr and
SnapBuild.start_decoding_nonxactop_at.
This is the end LSN of the last non-transactional message/operation
decoded/acked. I verified this approach solves the issue of missing
decoding of non-transactional messages under concurrency/before the
builder state reaches SNAPBUILD_CONSISTENT. Once the builder state
reaches SNAPBUILD_CONSISTENT, the new field
ReplicationSlotPersistentData.non_xact_op_at can be set to
ReplicationSlotPersistentData.confirmed_flush.
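The skip test described above can be sketched as follows; the field names are taken from the description in this mail and should be treated as assumptions, not as the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtrToy;

typedef struct {
    XLogRecPtrToy confirmed_flush;
    XLogRecPtrToy non_xact_op_at;  /* end LSN of last non-xact op decoded/acked */
} ToySlotData;

/* A non-transactional record is skipped iff it was already decoded/acked,
 * i.e. its end LSN is at or before the slot's recorded position. */
static bool toy_skip_non_xact(const ToySlotData *slot, XLogRecPtrToy rec_end)
{
    return rec_end <= slot->non_xact_op_at;
}

/* Once the snapshot builder reaches SNAPBUILD_CONSISTENT, the position
 * can be advanced to confirmed_flush, as suggested above. */
static void toy_on_consistent(ToySlotData *slot)
{
    slot->non_xact_op_at = slot->confirmed_flush;
}
```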

Similar to the sequence issue, here is the test case for logical messages:

Test concurrent execution in 3 sessions, arranged so that the
pg_logical_emit_message() in session-2 happens before we reach a
consistent point and its commit happens after the consistent point:

Session-2:

Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);

Session-3:

Begin;
SELECT pg_current_xact_id();

Session-2:

Commit;
Begin;
SELECT pg_logical_emit_message(true, 'test_decoding', 'msg1');
SELECT pg_logical_emit_message(false, 'test_decoding', 'msg2');

Session-3:

Commit;

Session-1: (at this point, the session will crash without the fix)

SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
'force-binary', '0', 'skip-empty-xacts', '1');
data
---------------------------------------------------------------------
message: transactional: 0 prefix: test_decoding, sz: 4 content:msg1

Session-2:

Commit;

Session-1:

SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL,
NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
data
---------------------------------------------------------------------
message: transactional: 1 prefix: test_decoding, sz: 4 content:msg2

I also tried the same approach on sequences (on a commit before the
revert of sequence replication) and it seems to be working but
I think it needs further testing.

Patch 0001-Intorduce-new-field-ReplicationSlotPersistentData.no.patch
applies on master which contains the fix for logical messages.

[1] https://www.postgresql.org/message-id/flat/CAAD30U+pVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo=OjdQGJ9g@mail.gmail.com

Thoughts?

With Regards,
Zheng

Attachments

Re: logical decoding and replication of sequences

From
Noah Misch
Date:
On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
> I've pushed a revert of all the commits related to this - decoding of
> sequences and test_decoding / built-in replication changes.

Two July buildfarm runs failed with PANIC during standby promotion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13

The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
reproduces this consistently.  "git bisect" then traced the regression to the
above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5).  The pg_ctl
test suite passes under this hack in all supported branches, and it passed on
v15 until that revert.  Would you investigate?

The buildfarm animal uses keep_error_builds.  From kept data directories, I
deduced these events:

- After the base backup, auto-analyze ran on the primary and wrote WAL.
- Standby streamed and wrote up to 0/301FFF.
- Standby received the promote signal.  Terminated streaming.  WAL page at 0/302000 remained all-zeros.
- Somehow, end-of-recovery became a PANIC.

Key portions from buildfarm logs:

=== good run standby2 log
2022-07-21 22:55:16.860 UTC [25034912:5] LOG:  received promote request
2022-07-21 22:55:16.878 UTC [26804682:2] FATAL:  terminating walreceiver process due to administrator command
2022-07-21 22:55:16.878 UTC [25034912:6] LOG:  invalid record length at 0/3000060: wanted 24, got 0
2022-07-21 22:55:16.878 UTC [25034912:7] LOG:  redo done at 0/3000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.42 s
2022-07-21 22:55:16.878 UTC [25034912:8] LOG:  selected new timeline ID: 2
2022-07-21 22:55:17.004 UTC [25034912:9] LOG:  archive recovery complete
2022-07-21 22:55:17.005 UTC [23724044:1] LOG:  checkpoint starting: force
2022-07-21 22:55:17.008 UTC [14549364:4] LOG:  database system is ready to accept connections
2022-07-21 22:55:17.093 UTC [23724044:2] LOG:  checkpoint complete: wrote 3 buffers (2.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.019 s, sync=0.001 s, total=0.089 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2022-07-21 22:55:17.143 UTC [27394418:1] [unknown] LOG:  connection received: host=[local]
2022-07-21 22:55:17.144 UTC [27394418:2] [unknown] LOG:  connection authorized: user=nm database=postgres application_name=003_promote.pl
2022-07-21 22:55:17.147 UTC [27394418:3] 003_promote.pl LOG:  statement: SELECT pg_is_in_recovery()
2022-07-21 22:55:17.148 UTC [27394418:4] 003_promote.pl LOG:  disconnection: session time: 0:00:00.005 user=nm database=postgres host=[local]
2022-07-21 22:55:58.301 UTC [14549364:5] LOG:  received immediate shutdown request
2022-07-21 22:55:58.337 UTC [14549364:6] LOG:  database system is shut down

=== failed run standby2 log, with my annotations
2022-07-19 05:28:22.136 UTC [7340406:5] LOG:  received promote request
2022-07-19 05:28:22.163 UTC [8519860:2] FATAL:  terminating walreceiver process due to administrator command
2022-07-19 05:28:22.166 UTC [7340406:6] LOG:  invalid magic number 0000 in log segment 000000010000000000000003, offset 131072
  New compared to the good run.  XLOG_PAGE_MAGIC didn't match.  This implies the WAL ended at a WAL page boundary.
2022-07-19 05:28:22.166 UTC [7340406:7] LOG:  redo done at 0/301F168 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.18 s
2022-07-19 05:28:22.166 UTC [7340406:8] LOG:  last completed transaction was at log time 2022-07-19 05:28:13.956716+00
  New compared to the good run.  The good run had no transactions to replay.  The bad run replayed records from an auto-analyze.
2022-07-19 05:28:22.166 UTC [7340406:9] PANIC:  invalid record length at 0/301F168: wanted 24, got 0
  More WAL overall in bad run, due to auto-analyze.  End of recovery wrongly considered a PANIC.
2022-07-19 05:28:22.583 UTC [8388800:4] LOG:  startup process (PID 7340406) was terminated by signal 6: IOT/Abort trap
2022-07-19 05:28:22.584 UTC [8388800:5] LOG:  terminating any other active server processes
2022-07-19 05:28:22.587 UTC [8388800:6] LOG:  shutting down due to startup process failure
2022-07-19 05:28:22.627 UTC [8388800:7] LOG:  database system is shut down

Let me know if I've left out details you want; I may be able to dig more out
of the buildfarm artifacts.

Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 8/7/22 02:36, Noah Misch wrote:
> On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
>> I've pushed a revert of all the commits related to this - decoding of
>> sequences and test_decoding / built-in replication changes.
> 
> Two July buildfarm runs failed with PANIC during standby promotion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13
> 
> The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
> reproduces this consistently.  "git bisect" then traced the regression to the
> above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5).  The pg_ctl
> test suite passes under this hack in all supported branches, and it passed on
> v15 until that revert.  Would you investigate?
> 
> The buildfarm animal uses keep_error_builds.  From kept data directories, I
> deduced these events:
> 
> - After the base backup, auto-analyze ran on the primary and wrote WAL.
> - Standby streamed and wrote up to 0/301FFF.
> - Standby received the promote signal.  Terminated streaming.  WAL page at 0/302000 remained all-zeros.
> - Somehow, end-of-recovery became a PANIC.
> 

I think it'd be really bizarre if this was due to the revert, as that
simply undoes minor WAL changes (and none of this should affect what
happens at WAL page boundary etc.). It just restores WAL as it was
before 0da92dc, nothing particularly complicated. I did go through all
of the changes again and I haven't spotted anything particularly
suspicious, but I'll give it another try tomorrow.

However, I did try bisecting this using the attached patch, and that
does not suggest the issue is in the revert commit. It actually fails
all the way back to 5dc0418fab2, and it starts working on 9553b4115f1.

    ...
    6392f2a0968 Try to silence "-Wmissing-braces" complaints in ...
 => 5dc0418fab2 Prefetch data referenced by the WAL, take II.
    9553b4115f1 Fix warning introduced in 5c279a6d350.
    ...

This is merely 10 commits before the revert, and it seems way more
related to WAL. Also, adding this to the two nodes in 003_standby.pl
makes the issue go away, it seems:

  $node_standby->append_conf('postgresql.conf',
      qq(recovery_prefetch = off));

I'd bet it's about WAL prefetching, not the revert, and the bisect was a
bit incorrect, because the commits are close and the failures happen to
be rare. (Presumably you first did the bisect and then wrote the patch
that reproduces this, right?)

Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From
Noah Misch
Date:
On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote:
> On 8/7/22 02:36, Noah Misch wrote:
> > On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
> >> I've pushed a revert of all the commits related to this - decoding of
> >> sequences and test_decoding / built-in replication changes.
> > 
> > Two July buildfarm runs failed with PANIC during standby promotion:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13
> > 
> > The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
> > reproduces this consistently.  "git bisect" then traced the regression to the
> > above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5).  The pg_ctl
> > test suite passes under this hack in all supported branches, and it passed on
> > v15 until that revert.  Would you investigate?
> > 
> > The buildfarm animal uses keep_error_builds.  From kept data directories, I
> > deduced these events:
> > 
> > - After the base backup, auto-analyze ran on the primary and wrote WAL.
> > - Standby streamed and wrote up to 0/301FFF.
> > - Standby received the promote signal.  Terminated streaming.  WAL page at 0/302000 remained all-zeros.
> > - Somehow, end-of-recovery became a PANIC.
> 
> I think it'd be really bizarre if this was due to the revert, as that
> simply undoes minor WAL changes (and none of this should affect what
> happens at WAL page boundary etc.). It just restores WAL as it was
> before 0da92dc, nothing particularly complicated. I did go through all
> of the changes again and I haven't spotted anything particularly
> suspicious, but I'll give it another try tomorrow.
> 
> However, I did try bisecting this using the attached patch, and that
> does not suggest the issue is in the revert commit. It actually fails
> all the way back to 5dc0418fab2, and it starts working on 9553b4115f1.
> 
>     ...
>     6392f2a0968 Try to silence "-Wmissing-braces" complaints in ...
>  => 5dc0418fab2 Prefetch data referenced by the WAL, take II.
>     9553b4115f1 Fix warning introduced in 5c279a6d350.
>     ...
> 
> This is merely 10 commits before the revert, and it seems way more
> related to WAL. Also, adding this to the two nodes in 003_standby.pl
> makes the issue go away, it seems:
> 
>   $node_standby->append_conf('postgresql.conf',
>       qq(recovery_prefetch = off));
> 
> I'd bet it's about WAL prefetching, not the revert, and the bisect was a
> bit incorrect, because the commits are close and the failures happen to
> be rare. (Presumably you first did the bisect and then wrote the patch
> that reproduces this, right?)

No.  I wrote the patch, then used the patch to drive the bisect.  With ten
iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10.  I've now
tried recovery_prefetch=off.  With that, the test passes 10/10 at 2c7ea57.
Given your observation of a failure at 5dc0418fab2, I agree with your
conclusion.  Whatever the role of 2c7ea57 in exposing the failure on my
machine, a root cause in WAL prefetching looks more likely.

> Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)



Re: logical decoding and replication of sequences

From
Thomas Munro
Date:
On Mon, Aug 8, 2022 at 7:12 AM Noah Misch <noah@leadboat.com> wrote:
> On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote:
> > I'd bet it's about WAL prefetching, not the revert, and the bisect was a
> > bit incorrect, because the commits are close and the failures happen to
> > be rare. (Presumably you first did the bisect and then wrote the patch
> > that reproduces this, right?)
>
> No.  I wrote the patch, then used the patch to drive the bisect.  With ten
> iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10.  I've now
> tried recovery_prefetch=off.  With that, the test passes 10/10 at 2c7ea57.
> Given your observation of a failure at 5dc0418fab2, I agree with your
> conclusion.  Whatever the role of 2c7ea57 in exposing the failure on my
> machine, a root cause in WAL prefetching looks more likely.
>
> > Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)

Thanks for the repro patch and bisection work.  Looking...



Re: logical decoding and replication of sequences

From
Thomas Munro
Date:
On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks for the repro patch and bisection work.  Looking...

I don't have the complete explanation yet, but it's something like
this.  We hit the following branch in xlogrecovery.c...

        if (StandbyMode &&
            !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
        {
            /*
             * Emit this error right now then retry this page immediately. Use
             * errmsg_internal() because the message was already translated.
             */
            if (xlogreader->errormsg_buf[0])
                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
                        (errmsg_internal("%s", xlogreader->errormsg_buf)));

            /* reset any error XLogReaderValidatePageHeader() might have set */
            xlogreader->errormsg_buf[0] = '\0';
            goto next_record_is_invalid;
        }

... but, even though there was a (suppressed) error, nothing
invalidates the reader's page buffer.  Normally,
XLogReadValidatePageHeader() failure or any other kind of error
encountered by xlogreader.c'd decoding logic would do that, but here
the read_page callback is directly calling the header validation.
Without prefetching, that doesn't seem to matter, but reading ahead
can cause us to have the problem page in our buffer at the wrong time,
and then not re-read it when we should.  Or something like that.

The attached patch, which simply moves the cache invalidation into
report_invalid_record(), so that it's reached by the above code and
everything else that reports an error, seems to fix the problem in
src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
applied, and passes check-world without it.  I need to look at this
some more, though, and figure out if it's the right fix.
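In miniature, the fix amounts to this (a toy reader state with simplified names, not the actual xlogreader.c fields): every error report also invalidates the cached page, so a suppressed header-validation error can no longer leave stale page contents behind.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t readPagePtr;       /* start of cached page, if any */
    int      readLen;           /* > 0 means the page cache is valid */
    char     errormsg_buf[64];
} ToyReader;

/* The patch moves cache invalidation here, so *every* caller that
 * reports an error also discards the cached page. */
static void toy_report_invalid_record(ToyReader *r, const char *msg)
{
    strncpy(r->errormsg_buf, msg, sizeof(r->errormsg_buf) - 1);
    r->readLen = 0;             /* invalidate the page cache */
}

static bool toy_cache_valid(const ToyReader *r)
{
    return r->readLen > 0;
}
```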

Attachments

Re: logical decoding and replication of sequences

From
Kyotaro Horiguchi
Date:
At Mon, 8 Aug 2022 18:15:46 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Thanks for the repro patch and bisection work.  Looking...
> 
> I don't have the complete explanation yet, but it's something like
> this.  We hit the following branch in xlogrecovery.c...
> 
>         if (StandbyMode &&
>             !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>         {
>             /*
>              * Emit this error right now then retry this page immediately. Use
>              * errmsg_internal() because the message was already translated.
>              */
>             if (xlogreader->errormsg_buf[0])
>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                         (errmsg_internal("%s", xlogreader->errormsg_buf)));
> 
>             /* reset any error XLogReaderValidatePageHeader() might have set */
>             xlogreader->errormsg_buf[0] = '\0';
>             goto next_record_is_invalid;
>         }
> 
> ... but, even though there was a (suppressed) error, nothing
> invalidates the reader's page buffer.  Normally,
> XLogReaderValidatePageHeader() failure or any other kind of error
> encountered by xlogreader.c's decoding logic would do that, but here
> the read_page callback is directly calling the header validation.
> Without prefetching, that doesn't seem to matter, but reading ahead
> can cause us to have the problem page in our buffer at the wrong time,
> and then not re-read it when we should.  Or something like that.
> 
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it.  I need to look at this
> some more, though, and figure out if it's the right fix.

If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
misses the chance to invalidate the reader state.  That state is not an
error while in StandbyMode.

In the repro case, XLogPageRead returns XLREAD_WOULDBLOCK after the
first failure.  This situation (of course) was not considered when
that code was introduced. If that function is going to return with
XLREAD_WOULDBLOCK while lastSourceFailed, it should be turned into
XLREAD_FAIL. So, the following also works.

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21088e78f6..9f242fe656 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3220,7 +3220,9 @@ retry:
                                             xlogreader->nonblocking))
         {
             case XLREAD_WOULDBLOCK:
-                return XLREAD_WOULDBLOCK;
+                if (!lastSourceFailed)
+                    return XLREAD_WOULDBLOCK;
+                /* Fall through.  */
             case XLREAD_FAIL:
                 if (readFile >= 0)
                     close(readFile);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical decoding and replication of sequences

From
Kyotaro Horiguchi
Date:
At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
> misses the chance to invalidate the reader state.  That state is not an
> error while in StandbyMode.

Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)

If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
would miss the chance to invalidate the reader state.  When XLogPageRead
is called in blocking mode while in StandbyMode (that is, the
traditional condition), the function continues retrying until it
succeeds, or returns XLREAD_FAIL if promotion is triggered.  In other
words, it was not supposed to return non-failure while the header
validation is failing while in standby mode.  But while in nonblocking
mode, the function can return non-failure with lastSourceFailed =
true, which seems wrong.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: logical decoding and replication of sequences

From
Michael Paquier
Date:
On Mon, Aug 08, 2022 at 06:15:46PM +1200, Thomas Munro wrote:
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it.  I need to look at this
> some more, though, and figure out if it's the right fix.

Thomas, where are you on this open item?  A potential PANIC at
promotion is bad.  One possible exit path would be to switch the
default of recovery_prefetch, though that's a kind of last-resort
option seen from here.
--
Michael

Attachments

Re: logical decoding and replication of sequences

From
Thomas Munro
Date:
On Tue, Aug 23, 2022 at 4:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Aug 08, 2022 at 06:15:46PM +1200, Thomas Munro wrote:
> > The attached patch that simply moves the cache invalidation into
> > report_invalid_record(), so that it's reached by the above code and
> > everything else that reports an error, seems to fix the problem in
> > src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> > applied, and passes check-world without it.  I need to look at this
> > some more, though, and figure out if it's the right fix.
>
> Thomas, where are you on this open item?  A potential PANIC at
> promotion is bad.  One possible exit path would be to switch the
> default of recovery_prefetch, though that's a kind of last-resort
> option seen from here.

I will get a fix committed this week -- I need to study
Horiguchi-san's analysis...



Re: logical decoding and replication of sequences

From
Robert Haas
Date:
On Tue, Aug 23, 2022 at 12:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 23, 2022 at 4:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Mon, Aug 08, 2022 at 06:15:46PM +1200, Thomas Munro wrote:
> > > The attached patch that simply moves the cache invalidation into
> > > report_invalid_record(), so that it's reached by the above code and
> > > everything else that reports an error, seems to fix the problem in
> > > src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> > > applied, and passes check-world without it.  I need to look at this
> > > some more, though, and figure out if it's the right fix.
> >
> > Thomas, where are you on this open item?  A potential PANIC at
> > promotion is bad.  One possible exit path would be to switch the
> > default of recovery_prefetch, though that's a kind of last-resort
> > option seen from here.
>
> I will get a fix committed this week -- I need to study
> Horiguchi-san's analysis...

Hi!

I haven't been paying attention to this thread, but my attention was
just drawn to it, and I'm wondering if the issue you're trying to
track down here is actually the same as what I reported yesterday
here:

https://www.postgresql.org/message-id/CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences

From
Thomas Munro
Date:
On Wed, Aug 24, 2022 at 3:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I haven't been paying attention to this thread, but my attention was
> just drawn to it, and I'm wondering if the issue you're trying to
> track down here is actually the same as what I reported yesterday
> here:
>
> https://www.postgresql.org/message-id/CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com

Summarising a chat we had about this:  Different bug, similar
ingredients.  Robert describes a screw-up in what is written, but here
we're talking about a cache invalidation bug while reading.



Re: logical decoding and replication of sequences

From
Thomas Munro
Date:
On Mon, Aug 8, 2022 at 8:56 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
> > misses the chance to invalidate the reader state.  That state is not an
> > error while in StandbyMode.
>
> Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)
>
> If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
> would miss the chance to invalidate the reader state.  When XLogPageRead
> is called in blocking mode while in StandbyMode (that is, the
> traditional condition), the function continues retrying until it
> succeeds, or returns XLREAD_FAIL if promotion is triggered.  In other
> words, it was not supposed to return non-failure while the header
> validation is failing while in standby mode.  But while in nonblocking
> mode, the function can return non-failure with lastSourceFailed =
> true, which seems wrong.

New ideas:

0001:  Instead of figuring out when to invalidate the cache, let's
just invalidate it before every read attempt.  It is only marked valid
after success (ie state->readLen > 0).  No need to worry about error
cases.

0002:  While here, I don't like xlogrecovery.c clobbering
xlogreader.c's internal error state, so I think we should have a
function for that with a documented purpose.  It was also a little
inconsistent that it didn't clear a flag (but not buggy AFAICS; kinda
wondering if I should just get rid of that flag, but that's for
another day).

0003:  Thinking about your comments above made me realise that I don't
really want XLogReadPage() to be internally retrying for obscure
failures while reading ahead.  I think I prefer to give up on
prefetching as soon as anything tricky happens, and deal with
complexities once recovery catches up to that point.  I am still
thinking about this point.

Here's the patch set I'm testing.

Attachments

Re: logical decoding and replication of sequences

From
Tomas Vondra
Date:

On 8/29/22 12:21, Thomas Munro wrote:
> On Mon, Aug 8, 2022 at 8:56 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>>> If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
>>> misses the chance to invalidate the reader state.  That state is not an
>>> error while in StandbyMode.
>>
>> Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)
>>
>> If WaitForWALToBecomeAvailable returned due to promotion, ReadPageInternal
>> would miss the chance to invalidate the reader state.  When XLogPageRead
>> is called in blocking mode while in StandbyMode (that is, the
>> traditional condition) , the function continues retrying until it
>> succeeds, or returns XLRAD_FAIL if promote is triggered.  In other
>> words, it was not supposed to return non-failure while the header
>> validation is failing while in standby mode.  But while in nonblocking
>> mode, the function can return non-failure with lastSourceFailed =
>> true, which seems wrong.
> 
> New ideas:
> 
> 0001:  Instead of figuring out when to invalidate the cache, let's
> just invalidate it before every read attempt.  It is only marked valid
> after success (ie state->readLen > 0).  No need to worry about error
> cases.
> 

Maybe I misunderstand how all this works, but won't this have a really
bad performance impact? If not, why do we need the cache at all?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences

From:
Thomas Munro
Date:
On Tue, Aug 30, 2022 at 6:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 8/29/22 12:21, Thomas Munro wrote:
> > 0001:  Instead of figuring out when to invalidate the cache, let's
> > just invalidate it before every read attempt.  It is only marked valid
> > after success (ie state->readLen > 0).  No need to worry about error
> > cases.
>
> Maybe I misunderstand how all this works, but won't this have a really
> bad performance impact. If not, why do we need the cache at all?

It's a bit confusing because there are several levels of "read".  The
cache remains valid as long as the caller of ReadPageInternal() keeps
asking for data that is in range (see early return after comment "/*
check whether we have all the requested data already */").  As soon as
the caller asks for something not in range, this patch marks the cache
invalid before calling the page_read() callback (= XLogPageRead()).
It is only marked valid again after that succeeds.  Here's a new
version with no code change, just a better commit message to try to
explain that more clearly.

Attachments