Discussion: logical decoding and replication of sequences, take 2


logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

Here's a rebased version of the patch adding logical decoding of
sequences. The previous attempt [1] ended up getting reverted, due to
running into issues with non-transactional nature of sequences when
decoding the existing WAL records. See [2] for details.

This patch uses a different approach, proposed by Hannu Krosing [3],
based on tracking sequences actually modified in each transaction, and
then WAL-logging the state at the end.

This does work, but I'm not very happy about WAL-logging all sequences
at the end. The "problem" is we have to re-read the current state of the
sequence from disk, because it might be concurrently updated by another
transaction.

Imagine two transactions, T1 and T2:

T1: BEGIN

T1: SELECT nextval('s') FROM generate_series(1,1000)

T2: BEGIN

T2: SELECT nextval('s') FROM generate_series(1,1000)

T2: COMMIT

T1: COMMIT

The expected outcome is that the sequence value is ~2000. We must not
blindly overwrite the state committed by T2 with the increments from T1.
So the patch simply reads the "current" state of the sequence at commit
time. Which is annoying, because it involves I/O, increases the commit
duration, etc.
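To illustrate with a toy model (hypothetical Python, not the patch's actual code): because the on-disk state is shared by all sessions, reading it at commit time naturally reflects T2's activity, while replaying T1's raw increments on top of T2's committed state would overshoot:

```python
# Toy model of a shared sequence; not the actual patch code.
seq = {"last_value": 0}   # the single on-disk state, shared by all sessions

def nextval_many(n):
    # consume n values from the shared sequence
    seq["last_value"] += n
    return seq["last_value"]

nextval_many(1000)                 # T1's increments
nextval_many(1000)                 # T2's increments
t2_logged = seq["last_value"]      # T2 logs the on-disk state at COMMIT
t1_logged = seq["last_value"]      # T1 logs the same shared state later

# Replaying the logged states in commit order (T2, then T1) ends at 2000,
replayed = t1_logged

# ...while replaying T1's raw increments on top of T2's committed state
# would yield 3000, a value the primary never produced.
overshoot = t2_logged + 1000
```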

On the other hand, this is likely cheaper than the other approach based
on WAL-logging every sequence increment (that would have to be careful
about obsoleted increments too, when applying them transactionally).


I wonder if we might deal with this by simply WAL-logging LSN of the
last change for each sequence (in the given xact), which would allow
discarding the "obsolete" changes quite easily I think. nextval() would
simply look at LSN in the page header.

And maybe we could then use the LSN to read the increment from the WAL
during decoding, instead of having to read it and WAL-log it during
commit. Essentially, we'd run a local XLogReader. Of course, we'd have
to be careful about checkpoints, not sure what to do about that.

Another idea that just occurred to me is that if we end up having to
read the sequence state during commit, maybe we could at least optimize
it somehow. For example we might track LSN of the last logged state for
each sequence (in shared memory or something), and the other sessions
could just skip the WAL-log if their "local" LSN is <= this LSN.
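A rough sketch of that shared-memory check (the names and the map structure are invented for illustration):

```python
# Hypothetical shared map: sequence OID -> LSN of its last logged state.
last_logged = {}
wal = []   # stand-in for the WAL stream

def log_sequence_state(seq_oid, local_lsn):
    """WAL-log the sequence state, unless another session already logged
    a state at the same or a newer LSN."""
    if local_lsn <= last_logged.get(seq_oid, -1):
        return False                     # newer state already logged: skip
    last_logged[seq_oid] = local_lsn
    wal.append((seq_oid, local_lsn))
    return True

log_sequence_state(16384, 100)   # logged
log_sequence_state(16384, 90)    # skipped, 90 <= 100
log_sequence_state(16384, 120)   # logged
```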


regards


[1]
https://www.postgresql.org/message-id/flat/d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com

[2]
https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

[3]
https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
I've been thinking about the two optimizations mentioned at the end a
bit more, so let me share my thoughts before I forget that:

On 8/18/22 23:10, Tomas Vondra wrote:
>
> ...
>
> And maybe we could then use the LSN to read the increment from the WAL
> during decoding, instead of having to read it and WAL-log it during
> commit. Essentially, we'd run a local XLogReader. Of course, we'd have
> to be careful about checkpoints, not sure what to do about that.
> 

I think logging just the LSN is workable.

I was worried about dealing with checkpoints, because imagine you do
nextval() on a sequence that was last WAL-logged a couple of checkpoints
back. Then you wouldn't be able to read that LSN (when decoding), because
the WAL might have been recycled. But that can't happen, because we
always force WAL-logging the first time nextval() is called after a
checkpoint. So we know the LSN is guaranteed to be available.

Of course, this would not reduce the number of WAL messages, because
we'd still log all sequences touched by the transaction. We wouldn't
need to read the state from disk, though, and we could ignore "old"
stuff in decoding (with LSN lower than the last LSN we decoded).

For frequently used sequences that seems like a win.


> Another idea that just occurred to me is that if we end up having to
> read the sequence state during commit, maybe we could at least optimize
> it somehow. For example we might track LSN of the last logged state for
> each sequence (in shared memory or something), and the other sessions
> could just skip the WAL-log if their "local" LSN is <= than this LSN.
> 

Tracking the last LSN for each sequence (in a SLRU or something) should
work too, I guess. In principle this just moves the skipping of "old"
increments from decoding to writing, so that we don't even have to write
those into WAL.

We don't even need persistence, nor to keep all the records, I think. If
you don't find a record for a given sequence, assume it wasn't logged
yet and just log it. Of course, it requires a bit of shared memory for
each sequence, say ~32B. Not sure about the overhead, but I'd bet if you
have many (~thousands) frequently used sequences, there'll be a lot of
other overhead making this irrelevant.

Of course, if we're doing the skipping when writing the WAL, maybe we
should just read the sequence state - we'd do the I/O, but only in a
fraction of the transactions, and we wouldn't need to read old WAL in
logical decoding.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

I noticed on cfbot the patch no longer applies, so here's a rebased
version. Most of the breakage was due to the column filtering reworks,
grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

I haven't looked into the optimizations / improvements I discussed in my
previous post (logging only LSN of the last WAL-logged increment),
because while fixing "make check-world" I ran into a more serious issue
that I think needs to be discussed first. And I suspect it might also
affect the feasibility of the LSN optimization.

So, what's the issue - the current solution is based on WAL-logging
state of all sequences incremented by the transaction at COMMIT. To do
that, we read the state from disk, and write that into WAL. However,
these WAL messages are not necessarily ordered consistently with the
COMMIT records, so something like this might happen:

1. transaction T1 increments sequence S
2. transaction T2 increments sequence S
3. both T1 and T2 start to COMMIT
4. T1 reads state of S from disk, writes it into WAL
5. transaction T3 increments sequence S
6. T2 reads state of S from disk, writes it into WAL
7. T2 writes COMMIT into WAL
8. T1 writes COMMIT into WAL

Because the apply order is determined by ordering of COMMIT records,
this means we'd apply the increments logged by T2, and then by T1. But
that undoes the increment by T3, and the sequence would go backwards.

The previous patch version addressed that by acquiring lock on the
sequence, holding it until transaction end. This effectively ensures the
order of sequence messages and COMMIT matches. But that's problematic
for a number of reasons:

1) throughput reduction, because the COMMIT records need to serialize

2) deadlock risk, if we happen to lock sequences in different order
   (in different transactions)

3) problem for prepared transactions - the sequences are locked and
   logged in PrepareTransaction, because we may not have seqhashtab
   beyond that point. This is a much worse variant of (1).

Note: I also wonder what happens if someone does DISCARD SEQUENCES. I
guess we'll forget the sequences, which is bad - so we'd have to invent
a separate cache that does not have this issue.


I realized (3) because one of the test_decoding TAP tests got stuck
exactly because of a sequence locked by a prepared transaction.

This patch simply releases the lock after writing the WAL message, but
that just makes it vulnerable to the reordering. And this would have
been true even with the LSN optimization.

However, I was thinking that maybe we could use the LSN of the WAL
message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because
*this* reflects the actual ordering of the sequence increments.

In the example above, we'd first apply the WAL message from T2 (because
that commits first). And then we'd get to apply T1, but the WAL message
has an older LSN, so we'd skip it.

But this requires us to remember the LSN of the already applied WAL
sequence messages, which could be tricky - we'd need to persist it in
some way because of restarts, etc. And we can't do this during decoding;
it has to happen on the apply side, I think, because of streaming and
aborts.
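A sketch of that apply-side guard (the persistence part is hand-waved and the names are invented):

```python
# Hypothetical subscriber-side bookkeeping: sequence OID -> last applied
# LSN. In reality this would have to survive restarts (e.g. in a catalog).
last_applied = {}
state = {}   # stand-in for the replica's sequence states

def apply_sequence_message(seq_oid, msg_lsn, value):
    """Apply a sequence-state message, unless an equal-or-newer one was
    already applied; the message LSN reflects the true increment order."""
    if msg_lsn <= last_applied.get(seq_oid, -1):
        return False                 # obsolete message: skip it
    last_applied[seq_oid] = msg_lsn
    state[seq_oid] = value
    return True

apply_sequence_message(16384, 120, 3)   # T2's message (T2 commits first)
apply_sequence_message(16384, 100, 2)   # T1's older message gets skipped
```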

The other option might be to make these messages non-transactional, in
which case we'd separate the ordering from COMMIT ordering, evading the
reordering problem.

That'd mean we'd ignore rollbacks (which seems fine), we could probably
optimize this by checking if the state actually changed, etc. But we'd
also need to deal with sequences created in the (still uncommitted)
transaction. But I'm also worried it might lead to the same issue with
non-transactional behaviors that forced revert in v15.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Ian Lawrence Barwick
Date:
On Sat, Nov 12, 2022 at 7:49 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I noticed on cfbot the patch no longer applies, so here's a rebased
> version. Most of the breakage was due to the column filtering reworks,
> grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

(...)

Hi

Thanks for the updated patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added them to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
    'tests': [
      't/001_basic.pl',
    ],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick



Re: logical decoding and replication of sequences, take 2

From:
Robert Haas
Date:
On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with sequences created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced revert in v15.

I think it might be a good idea to step back slightly from
implementation details and try to agree on a theoretical model of
what's happening here. Let's start by banishing the words
transactional and non-transactional from the conversation and talk
about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through
a series of states S1, S2, ..., Sn, where n keeps going up as new
state changes occur. The state, for our purposes here, is the contents
of the database as they could be observed by a user running SELECT
queries at some moment in time chosen by the user. For instance, if
the initial state of the database is S1, and then the user executes
BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
state that differs from S1 in that both of those rows are now part of
the database contents. There is no state where one of those rows is
visible and the other is not. That was never observable by the user,
except from within the transaction as it was executing, which we can
and should discount. I believe that the goal of logical replication is
to bring about a state of affairs where the set of states observable
on the standby is a subset of the states observable on the primary.
That is, if the primary goes from S1 to S2 to S3, the standby can do
the same thing, or it can go straight from S1 to S3 without ever
making it possible for the user to observe S2. Either is correct
behavior. But the standby cannot invent any new states that didn't
occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
S3, or something like that. It can only consolidate changes that
occurred separately on the primary, never split them up. Neither can
it reorder them.

Now, if you accept this as a reasonable definition of correctness,
then the next question is what consequences it has for transactional
and non-transactional behavior. If all behavior is transactional, then
we've basically got to replay each primary transaction in a single
standby transaction, and commit those transactions in the same order
that the corresponding primary transactions committed. We could
legally choose to merge a group of transactions that committed one
after the other on the primary into a single transaction on the
standby, and it might even be a good idea if they're all very tiny,
but it's not required. But if there are non-transactional things
happening, then there are changes that become visible at some time
other than at a transaction commit. For example, consider this
sequence of events, in which each "thing" that happens is
transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the user's point of view, there are 4 observable states:

S1: Initial state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus
things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a
separate transaction. That non-transactional operation ought to be
replicated before T2, which ought to be replicated before T1. Maybe
logical replication ought to treat it in exactly that way: as a
separate operation that needs to be replicated after any earlier
transactions that completed prior to the history shown here, but
before T2 or T1. Alternatively, you can merge the non-transactional
change into T2, i.e. the first transaction that committed after it
happened. But you can't merge it into T1, even though it happened in
T1. If you do that, then you're creating states on the standby that
never existed on the primary, which is wrong. You could argue that
this is just nitpicking: who cares if the change in the sequence value
doesn't get replicated at exactly the right moment? But I don't think
it's a technicality at all: I think if we don't make the operation
appear to happen at the same point in the sequence as it became
visible on the master, then there will be endless artifacts and corner
cases to the bottom of which we will never get. Just like if we
replicated the actual transactions out of order, chaos would ensue,
because there can be logical dependencies between them, so too can
there be logical dependencies between non-transactional operations, or
between a non-transactional operation and a transactional operation.

To make it more concrete, consider two sessions concurrently running this SQL:

insert into t1 select nextval('s1') from generate_series(1,1000000) g;

There are, in effect, 2000002 transaction-like things here. The
sequence gets incremented 2 million times, and then there are 2
commits that each insert a million rows. Perhaps the actual order of
events looks something like this:

1. nextval the sequence N times, where N >= 1 million
2. commit the first transaction, adding a million rows to t1
3. nextval the sequence 2 million - N times
4. commit the second transaction, adding another million rows to t1

Unless we replicate all of the nextval operations that occur in step 1
at the same time or prior to replicating the first transaction in step
2, we might end up making visible a state where the next value of the
sequence is less than the highest value present in the table, which
would be bad.

With that perhaps overly-long set of preliminaries, I'm going to move
on to talking about the implementation ideas which you mention. You
write that "the current solution is based on WAL-logging state of all
sequences incremented by the transaction at COMMIT" and then, it seems
to me, go on to demonstrate that it's simply incorrect. In my opinion,
the fundamental problem is that it doesn't look at the order that
things happened on the primary and do them in the same order on the
standby. Instead, it accepts that the non-transactional operations are
going to be replicated at the wrong time, and then tries to patch
around the issue by attempting to scrounge up the correct values at
some convenient point and use that data to compensate for our failure
to do the right thing at an earlier point. That doesn't seem like a
satisfying solution, and I think it will be hard to make it fully
correct.

Your alternative proposal says "The other option might be to make
these messages non-transactional, in which case we'd separate the
ordering from COMMIT ordering, evading the reordering problem." But I
don't think that avoids the reordering problem at all. Nor do I think
it's correct. I don't think you *can* separate the ordering of these
operations from the COMMIT ordering. They are, as I argue here,
essentially mini-commits that only bump the sequence value, and they
need to be replicated after the transactions that commit prior to the
sequence value bump and before those that commit afterward. If they
aren't handled that way, I don't think you're going to get fully
correct behavior.

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash. I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

In particular, I think it's likely that the "non-transactional
messages" that you mention earlier don't get applied at the point in
the commit sequence where they were found in the WAL. Not sure why
exactly, but perhaps the point at which we're reading WAL runs ahead
of the decoding per se, or something like that, and thus those
non-transactional messages arrive too early relative to the commit
ordering. Possibly that could be changed, and they could be buffered
until earlier commits are replicated. Or else, when we see a WAL
record for a non-transactional sequence operation, we could arrange to
bundle that operation into an "adjacent" replicated transaction i.e.
the transaction whose commit record occurs most nearly prior to, or
most nearly after, the WAL record for the operation itself. Or else,
we could create "virtual" transactions for such operations and make
sure those get replayed at the right point in the commit sequence. Or
else, I don't know, maybe something else. But I think the overall
picture is that we need to approach the problem by replicating changes
in WAL order, as a physical standby would do. Saying that a change is
"nontransactional" doesn't mean that it's exempt from ordering
requirements; rather, it means that that change has its own place in
that ordering, distinct from the transaction in which it occurred.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 11/16/22 22:05, Robert Haas wrote:
> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with sequences created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced revert in v15.
> 
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.
> 

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.
> 

I mostly agree, and in a way the last patch aims to do roughly this,
i.e. make sure that the state after each transaction matches the state a
user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of
"observable". Who cares if the sequence moved forward in a transaction
that was ultimately rolled back? No committed transaction should have
observed those values - in a way, the last "valid" state of the sequence
is the last value generated in a transaction that ultimately committed.

2) I think what matters more is that we never generate duplicate values.
That is, if you generate a value from a sequence, commit a transaction
and replicate it, then the logical standby should not generate the same
value from the sequence. This guarantee seems necessary for "failover"
to logical standby.

> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
> 
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
> 
> From the user's point of view, there are 4 observable states:
> 
> S1: Initial state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
> 
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat it in exactly that way: as a
> separate operation that needs to be replicated after any earlier
> transactions that completed prior to the history shown here, but
> before T2 or T1. Alternatively, you can merge the non-transactional
> change into T2, i.e. the first transaction that committed after it
> happened. But you can't merge it into T1, even though it happened in
> T1. If you do that, then you're creating states on the standby that
> never existed on the primary, which is wrong. You could argue that
> this is just nitpicking: who cares if the change in the sequence value
> doesn't get replicated at exactly the right moment? But I don't think
> it's a technicality at all: I think if we don't make the operation
> appear to happen at the same point in the sequence as it became
> visible on the master, then there will be endless artifacts and corner
> cases to the bottom of which we will never get. Just like if we
> replicated the actual transactions out of order, chaos would ensue,
> because there can be logical dependencies between them, so too can
> there be logical dependencies between non-transactional operations, or
> between a non-transactional operation and a transactional operation.
> 

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

  (a) What would you do with increments of sequences created/reset in a
      transaction? Can't apply those outside the transaction, because it
      might be rolled back (and that state is not visible on primary).

  (b) What about increments created before we have a proper snapshot?
      There may be transactions dependent on the increment. This is what
      ultimately led to revert of the patch.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

> To make it more concrete, consider two sessions concurrently running this SQL:
> 
> insert into t1 select nextval('s1') from generate_series(1,1000000) g;
> 
> There are, in effect, 2000002 transaction-like things here. The
> sequence gets incremented 2 million times, and then there are 2
> commits that each insert a million rows. Perhaps the actual order of
> events looks something like this:
> 
> 1. nextval the sequence N times, where N >= 1 million
> 2. commit the first transaction, adding a million rows to t1
> 3. nextval the sequence 2 million - N times
> 4. commit the second transaction, adding another million rows to t1
> 
> Unless we replicate all of the nextval operations that occur in step 1
> at the same time or prior to replicating the first transaction in step
> 2, we might end up making visible a state where the next value of the
> sequence is less than the highest value present in the table, which
> would be bad.
> 

Right, that's the "guarantee" I've mentioned above, more or less.

> With that perhaps overly-long set of preliminaries, I'm going to move
> on to talking about the implementation ideas which you mention. You
> write that "the current solution is based on WAL-logging state of all
> sequences incremented by the transaction at COMMIT" and then, it seems
> to me, go on to demonstrate that it's simply incorrect. In my opinion,
> the fundamental problem is that it doesn't look at the order that
> things happened on the primary and do them in the same order on the
> standby. Instead, it accepts that the non-transactional operations are
> going to be replicated at the wrong time, and then tries to patch
> around the issue by attempting to scrounge up the correct values at
> some convenient point and use that data to compensate for our failure
> to do the right thing at an earlier point. That doesn't seem like a
> satisfying solution, and I think it will be hard to make it fully
> correct.
> 

I understand what you're saying, but I'm not sure I agree with you.

Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence state S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2
and then the "old" S1. But as long as we're smart enough to "discard"
applying S1, I think that's acceptable - because it guarantees we'll not
generate duplicate values (with values in the committed transaction).

I'd also argue it does not actually generate invalid state, because once
we commit either transaction, S2 is what's visible.

Yes, if you do "SELECT * FROM sequence" you'll see some intermediate
state, but that's not how sequences are accessed. And you can't do
currval('s') from a transaction that never accessed the sequence.

And if it did, we'd write S2 (or whatever it saw) as part of its commit.

So I think the main issue of this approach is how to decide which
sequence states are obsolete and should be skipped.

> Your alternative proposal says "The other option might be to make
> these messages non-transactional, in which case we'd separate the
> ordering from COMMIT ordering, evading the reordering problem." But I
> don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

> Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you
mentioned earlier - treating the non-transactional actions as
independent transactions? Yes, we'd be batching them so that we'd not
see "intermediate" states, but those are not observed by anyone.

> I don't think you *can* separate the ordering of these
> operations from the COMMIT ordering. They are, as I argue here,
> essentially mini-commits that only bump the sequence value, and they
> need to be replicated after the transactions that commit prior to the
> sequence value bump and before those that commit afterward. If they
> aren't handled that way, I don't think you're going to get fully
> correct behavior.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think
it'd behave correctly. What example would not work?

> 
> I'm going to confess that I have no really specific idea how to
> implement that. I'm just not sufficiently familiar with this code.
> However, I suspect that the solution lies in changing things on the
> decoding side rather than in the WAL format. I feel like the
> information that we need in order to do the right thing must already
> be present in the WAL. If it weren't, then how could crash recovery
> work correctly, or physical replication? At any given moment, you can
> choose to promote a physical standby, and at that point the state you
> observe on the new primary had better be some state that existed on
> the primary at some point in its history. At any moment, you can
> unplug the primary, restart it, and run crash recovery, and if you do,
> you had better end up with some state that existed on the primary at
> some point shortly before the crash. I think that there are actually a
> few subtle inaccuracies in the last two sentences, because actually
> the order in which transactions become visible on a physical standby
> can differ from the order in which it happens on the primary, but I
> don't think that actually changes the picture much. The point is that
> the WAL is the definitive source of information about what happened
> and in what order it happened, and we use it in that way already in
> the context of physical replication, and of standbys. If logical
> decoding has a problem with some case that those systems handle
> correctly, the problem is with logical decoding, not the WAL format.
> 

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).
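As a toy model of that batching (illustrative Python only; the real logic lives in src/backend/commands/sequence.c, where SEQ_LOG_VALS is 32):

```python
SEQ_LOG_VALS = 32  # how many values each WAL record pre-logs

class ToySequence:
    """Toy model of nextval() WAL batching - not the server's C code."""
    def __init__(self):
        self.value = 0    # last value handed out in memory
        self.logged = 0   # value recorded by the last WAL record
        self.wal = []     # stand-in for the WAL stream

    def nextval(self):
        self.value += 1
        if self.value > self.logged:
            # Log a value SEQ_LOG_VALS ahead; the next 32 increments
            # then produce no WAL at all.
            self.logged = self.value + SEQ_LOG_VALS
            self.wal.append(self.logged)
        return self.value

s = ToySequence()
for _ in range(40):
    s.nextval()

assert s.value == 40       # 40 values actually handed out...
assert len(s.wal) == 2     # ...but only two WAL records written
assert s.wal[-1] == 66     # crash recovery replays 66: a "jump ahead"
```

This is exactly why "SELECT * FROM sequence" after a failover can show a value ahead of anything a transaction ever returned.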

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

There's also the issue with what snapshot to use when decoding these
transactional changes in logical decoding (see


> In particular, I think it's likely that the "non-transactional
> messages" that you mention earlier don't get applied at the point in
> the commit sequence where they were found in the WAL. Not sure why
> exactly, but perhaps the point at which we're reading WAL runs ahead
> of the decoding per se, or something like that, and thus those
> non-transactional messages arrive too early relative to the commit
> ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to,
so I can't quite respond to these comments. Perhaps you mean the
problems that killed the previous patch [1]?

[1]
https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com


> until earlier commits are replicated. Or else, when we see a WAL
> record for a non-transactional sequence operation, we could arrange to
> bundle that operation into an "adjacent" replicated transaction i.e.

IIRC moving stuff between transactions during decoding is problematic,
because of snapshots.

> the transaction whose commit record occurs most nearly prior to, or
> most nearly after, the WAL record for the operation itself. Or else,
> we could create "virtual" transactions for such operations and make
> sure those get replayed at the right point in the commit sequence. Or
> else, I don't know, maybe something else. But I think the overall
> picture is that we need to approach the problem by replicating changes
> in WAL order, as a physical standby would do. Saying that a change is
> "nontransactional" doesn't mean that it's exempt from ordering
> requirements; rather, it means that that change has its own place in
> that ordering, distinct from the transaction in which it occurred.
> 

But doesn't the approach with WAL-logging sequence state before COMMIT,
and then applying it independently in WAL-order, do pretty much this?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Andres Freund
Date:
Hi,


On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
>   (a) What would you do with increments of sequences created/reset in a
>       transaction? Can't apply those outside the transaction, because it
>       might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.



>   (b) What about increments created before we have a proper snapshot?
>       There may be transactions dependent on the increment. This is what
>       ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot?  Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.



> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

That approach seems quite wrong to me.


> > I'm going to confess that I have no really specific idea how to
> > implement that. I'm just not sufficiently familiar with this code.
> > However, I suspect that the solution lies in changing things on the
> > decoding side rather than in the WAL format. I feel like the
> > information that we need in order to do the right thing must already
> > be present in the WAL. If it weren't, then how could crash recovery
> > work correctly, or physical replication? At any given moment, you can
> > choose to promote a physical standby, and at that point the state you
> > observe on the new primary had better be some state that existed on
> > the primary at some point in its history. At any moment, you can
> > unplug the primary, restart it, and run crash recovery, and if you do,
> > you had better end up with some state that existed on the primary at
> > some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last
few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
flush.


> > I think that there are actually a
> > few subtle inaccuracies in the last two sentences, because actually
> > the order in which transactions become visible on a physical standby
> > can differ from the order in which it happens on the primary, but I
> > don't think that actually changes the picture much. The point is that
> > the WAL is the definitive source of information about what happened
> > and in what order it happened, and we use it in that way already in
> > the context of physical replication, and of standbys. If logical
> > decoding has a problem with some case that those systems handle
> > correctly, the problem is with logical decoding, not the WAL format.
> >
>
> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.

> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction"
aspect.


Even the case of a sequence that is renamed inside a transaction that did
*not* create / reset the sequence and then also triggers increment of the
sequence seems to be dealt with reasonably by processing sequence increments
outside a transaction - the old name will be used for the increments, replay
of the renaming transaction would then implement the rename in a hypothetical
DDL-replay future.


> There's also the issue with what snapshot to use when decoding these
> transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?


I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 11/17/22 03:43, Andres Freund wrote:
> Hi,
> 
> 
> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>> Well, yeah - we can either try to perform the stuff independently of the
>> transactions that triggered it, or we can try making it part of some of
>> the transactions. Each of those options has problems, though :-(
>>
>> The first version of the patch tried the first approach, i.e. decode the
>> increments and apply that independently. But:
>>
>>   (a) What would you do with increments of sequences created/reset in a
>>       transaction? Can't apply those outside the transaction, because it
>>       might be rolled back (and that state is not visible on primary).
> 
> I think a reasonable approach could be to actually perform different WAL
> logging for that case. It'll require a bit of machinery, but could actually
> result in *less* WAL logging overall, because we don't need to emit a WAL
> record for each SEQ_LOG_VALS sequence values.
> 

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence or at least a
new relfilenode) means we can't apply that independently of the
transaction (unlike regular increments). I'm not sure how a change to
WAL logging would make that go away.

> 
> 
>>   (b) What about increments created before we have a proper snapshot?
>>       There may be transactions dependent on the increment. This is what
>>       ultimately led to revert of the patch.
> 
> I don't understand this - why would we ever need to process those increments
> from before we have a snapshot?  Wouldn't they, by definition, be before the
> slot was active?
> 
> To me this is the rough equivalent of logical decoding not giving the initial
> state of all tables. You need some process outside of logical decoding to get
> that (obviously we have some support for that via the exported data snapshot
> during slot creation).
> 

Which is what already happens during tablesync, no? We more or less copy
sequences as if they were tables.

> I assume that part of the initial sync would have to be a new sequence
> synchronization step that reads all the sequence states on the publisher and
> ensures that the subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical replication replay
> support for sequences will have to be a bit careful about not decreasing the
> subscriber's sequence values - the standby initially will be ahead of the
> increments we'll see in the WAL. But that seems inevitable given the
> non-transactional nature of sequences.
> 

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go backwards (say, using page LSN and/or
LSN of the increment) is not there - but isn't that pretty much what I
proposed doing for "reconciling" the sequence state logged at COMMIT?

> 
>> This version of the patch tries to do the opposite thing - make sure
>> that the state after each commit matches what the transaction might have
>> seen (for sequences it accessed). It's imperfect, because it might log a
>> state generated "after" the sequence got accessed - it focuses on the
>> guarantee not to generate duplicate values.
> 
> That approach seems quite wrong to me.
> 

Why? Because it might log a state for the sequence as of COMMIT, when
the transaction accessed the sequence much earlier? That is, this may
happen:

T1: nextval('s') -> 1
T2: call nextval('s') 1000000x
T1: commit

and T1 will log sequence state ~1000001, give or take. I don't think
there's a way around that, given the non-transactional nature of
sequences. And I'm not convinced this is an issue, as it ensures
uniqueness of values generated on the subscriber. And I think it's
reasonable to replicate the sequence state as of the commit (because
that's what you'd see on the primary).
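That interleaving is easy to model (illustrative Python; the shared object stands in for the sequence's non-transactional on-disk state, and all names are made up):

```python
# Model of logging sequence state at COMMIT: the committing transaction
# re-reads the current shared state, which may include increments made
# by other, still-open transactions.

class SharedSequence:
    def __init__(self):
        self.value = 0
    def nextval(self):
        self.value += 1
        return self.value

seq = SharedSequence()
t1_touched = {'s'}          # sequences T1 accessed

assert seq.nextval() == 1   # T1 takes a single value

for _ in range(1_000_000):  # T2 burns a million values meanwhile
    seq.nextval()

# At T1's COMMIT we log the *current* state of every sequence it
# touched - far past the one value T1 itself consumed:
logged_at_commit = {name: seq.value for name in t1_touched}
assert logged_at_commit['s'] == 1_000_001
```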

> 
>>> I'm going to confess that I have no really specific idea how to
>>> implement that. I'm just not sufficiently familiar with this code.
>>> However, I suspect that the solution lies in changing things on the
>>> decoding side rather than in the WAL format. I feel like the
>>> information that we need in order to do the right thing must already
>>> be present in the WAL. If it weren't, then how could crash recovery
>>> work correctly, or physical replication? At any given moment, you can
>>> choose to promote a physical standby, and at that point the state you
>>> observe on the new primary had better be some state that existed on
>>> the primary at some point in its history. At any moment, you can
>>> unplug the primary, restart it, and run crash recovery, and if you do,
>>> you had better end up with some state that existed on the primary at
>>> some point shortly before the crash.
> 
> One minor exception here is that there's no real time bound to see the last
> few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
> flush.
> 

Right. Another issue is we ignore stuff that happened in aborted
transactions, so then nextval('s') in another transaction may not wait
for syncrep to confirm receiving that WAL. Which is a data loss case,
see [1]:

[1]
https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9%40enterprisedb.com

> 
>>> I think that there are actually a
>>> few subtle inaccuracies in the last two sentences, because actually
>>> the order in which transactions become visible on a physical standby
>>> can differ from the order in which it happens on the primary, but I
>>> don't think that actually changes the picture much. The point is that
>>> the WAL is the definitive source of information about what happened
>>> and in what order it happened, and we use it in that way already in
>>> the context of physical replication, and of standbys. If logical
>>> decoding has a problem with some case that those systems handle
>>> correctly, the problem is with logical decoding, not the WAL format.
>>>
>>
>> The problem lies in how we log sequences. If we wrote each individual
>> increment to WAL, it might work the way you propose (except for cases
>> with sequences created in a transaction, etc.). But that's not what we
>> do - we log sequence increments in batches of 32 values, and then only
>> modify the sequence relfilenode.
> 
>> This works for physical replication, because the WAL describes the
>> "next" state of the sequence (so if you do "SELECT * FROM sequence"
>> you'll not see the same state, and the sequence value may "jump ahead"
>> after a failover).
>>
>> But for logical replication this does not work, because the transaction
>> might depend on a state created (WAL-logged) by some other transaction.
>> And perhaps that transaction actually happened *before* we even built
>> the first snapshot for decoding :-/
> 
> I really can't follow the "depend on state ... by some other transaction"
> aspect.
> 

T1: nextval('s') -> writes WAL covering the next 32 increments
T2: nextval('s') -> no WAL generated, covered by T1's WAL

This is what I mean by "dependency" on state logged by another
transaction. It already causes problems with streaming replication (see
the reference to syncrep above), logical replication has the same issue.

> 
> Even the case of a sequence that is renamed inside a transaction that did
> *not* create / reset the sequence and then also triggers increment of the
> sequence seems to be dealt with reasonably by processing sequence increments
> outside a transaction - the old name will be used for the increments, replay
> of the renaming transaction would then implement the rename in a hypothetical
> DDL-replay future.
> 
> 
>> There's also the issue with what snapshot to use when decoding these
>> transactional changes in logical decoding (see
> 
> Incomplete parenthetical? Or were you referencing the next paragraph?
> 
> What are the transactional changes you're referring to here?
> 

Sorry, IIRC I merely wanted to mention/reference the snapshot issue in
the thread [2] that I ended up referencing in the next paragraph.


[2]
https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> 
> I did some skimming of the referenced thread about the reversal of the last
> approach, but I couldn't really understand what the fundamental issues were
> with the reverted implementation - it's a very long thread and references
> other threads.
> 

Yes, it's long/complex, but I intentionally linked to a specific message
which describes the issue ...

It's entirely possible there is a simple fix for the issue, and I just
got confused / unable to see the solution. The whole issue was due to
having a mix of transactional and non-transactional cases, similarly to
logical messages - and logicalmsg_decode() has the same issue, so maybe
let's talk about that for a moment.

See [3] and imagine you're dealing with a transactional message, but
you're still building a consistent snapshot. So the first branch applies:

    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;

but while the snapshot build is only at SNAPBUILD_FULL_SNAPSHOT (past
this check in SnapBuildProcessChange, yet not consistent):

    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
        return false;

the change is not skipped,
which however means logicalmsg_decode() does

    snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

    Assert(builder->state == SNAPBUILD_CONSISTENT);

The sequence decoding did almost the same thing, with the same issue.
Maybe the correct thing to do is to just ignore the change in this case?
Presumably it'd be replicated by tablesync. But we've been unable to
convince ourselves that's correct, or what snapshot to pass to
ReorderBufferQueueMessage/ReorderBufferQueueSequence.
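The failure mode can be modeled in a few lines (illustrative Python mirroring the quoted C; the state names and values are loosely based on snapbuild.h, not exact):

```python
# Between FULL_SNAPSHOT and CONSISTENT, a transactional change passes
# the SnapBuildProcessChange gate, but there is no consistent snapshot
# to hand out - which is where the Assert fires.

SNAPBUILD_START, SNAPBUILD_FULL_SNAPSHOT, SNAPBUILD_CONSISTENT = 0, 2, 4

def snapbuild_process_change(state):
    return state >= SNAPBUILD_FULL_SNAPSHOT

def snapbuild_get_or_build_snapshot(state):
    assert state == SNAPBUILD_CONSISTENT   # the Assert that crashes
    return "snapshot"

def decode_transactional_change(state):
    if not snapbuild_process_change(state):
        return "skipped"                   # change ignored outright: fine
    return snapbuild_get_or_build_snapshot(state)

assert decode_transactional_change(SNAPBUILD_START) == "skipped"
assert decode_transactional_change(SNAPBUILD_CONSISTENT) == "snapshot"

crashed = False
try:
    decode_transactional_change(SNAPBUILD_FULL_SNAPSHOT)
except AssertionError:
    crashed = True
assert crashed                             # the in-between state crashes
```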


[3]
https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/decode.c#L585


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> There's a couple of caveats, though:
>
> 1) Maybe we should focus more on "actually observed" state instead of
> "observable". Who cares if the sequence moved forward in a transaction
> that was ultimately rolled back? No committed transaction should have
> observer those values - in a way, the last "valid" state of the sequence
> is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one
that is making changes to things.

I said "observable" rather than "actually observed" because we neither
know nor care whether someone actually ran a SELECT statement at any
given moment in time, just what they would have seen if they did.

> 2) I think what matters more is that we never generate duplicate value.
> That is, if you generate a value from a sequence, commit a transaction
> and replicate it, then the logical standby should not generate the same
> value from the sequence. This guarantee seems necessary for "failover"
> to logical standby.

I think that matters, but I don't think it's sufficient. We need to
preserve the order in which things appear to happen, and which changes
are and are not atomic, not just the final result.

> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
>   (a) What would you do with increments of sequences created/reset in a
>       transaction? Can't apply those outside the transaction, because it
>       might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits,
it has to be replicated as part of the transaction. If I create a
sequence and then nextval it a bunch of times, I can't replicate that
by first creating the sequence, and then later, as a separate
operation, replicating the nextvals. If I do that, then there's an
intermediate state visible on the replica that was never visible on
the origin server. That's broken.

>   (b) What about increments created before we have a proper snapshot?
>       There may be transactions dependent on the increment. This is what
>       ultimately led to revert of the patch.

Whatever problem exists here is with the implementation, not the
concept. If you copy the initial state as it exists at some moment in
time to a replica, and then replicate all the changes that happen
afterward to that replica without messing up the order, the replica
WILL be in sync with the origin server. The things that happen before
you copy the initial state do not and cannot matter.

But what you're describing sounds like the changes aren't really
replicated in visibility order, and then it is easy to see how a
problem like this can happen. Because now, an operation that actually
became visible just before or just after the initial copy was taken
might be thought to belong on the other side of that boundary, and
then everything will break. And it sounds like that is what you are
describing.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like
it's trying to paper over the failure to do the replication properly
during the transaction by overwriting state at the end.

> Yes, this would mean we accept we may end up with something like this:
>
> 1: T1 logs sequence state S1
> 2: someone increments sequence
> 3: T2 logs sequence state S2
> 4: T2 commits
> 5: T1 commits
>
> which "inverts" the apply order of S1 vs. S2, because we first apply S2
> and then the "old" S1. But as long as we're smart enough to "discard"
> applying S1, I think that's acceptable - because it guarantees we'll not
> generate duplicate values (with values in the committed transaction).
>
> I'd also argue it does not actually generate invalid state, because once
> we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the
commit that immediately follows. However, I disagree with the idea of
discarding the second update on the grounds that it would make the
sequence go backward and we know that can't be right. That algorithm
works in the really specific case where the only operations are
increments. As soon as anyone does anything else to the sequence, such
an algorithm can no longer work. Nor can it work for objects that are
not sequences. The alternative strategy of replicating each change
exactly once and in the correct order works for all current and future
object types in all cases.

> > Your alternative proposal says "The other option might be to make
> > these messages non-transactional, in which case we'd separate the
> > ordering from COMMIT ordering, evading the reordering problem." But I
> > don't think that avoids the reordering problem at all.
>
> I don't understand why. Why would it not address the reordering issue?
>
> > Nor do I think it's correct.
>
> Nor do I understand this. I mean, isn't it essentially the option you
> mentioned earlier - treating the non-transactional actions as
> independent transactions? Yes, we'd be batching them so that we'd not
> see "intermediate" states, but those are not observed by anyone.

I don't think that batching them is a bad idea, in fact I think it's
necessary. But those batches still have to be applied at the right
time relative to the sequence of commits.

> I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
> you have something like this:
>
> 1: T1 does something and also increments a sequence
> 2: T1 logs state of the sequence (right before commit)
> 3: T1 writes COMMIT
>
> Now when we decode/apply this, we end up doing this:
>
> 1: decode all T1 changes, stash them
> 2: decode the sequence state and apply it separately
> 3: decode COMMIT, apply all T1 changes
>
> There might be other transactions interleaving with this, but I think
> it'd behave correctly. What example would not work?

What if one of the other transactions renames the sequence, or changes
the current value, or does basically anything to it other than
nextval?

> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
>
> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I agree that there's a problem here but I don't think that it's a huge
problem. I think that it's not QUITE right to think about what state
is visible on the primary. It's better to think about what state would
be visible on the primary if it crashed and restarted after writing
any given amount of WAL, or what would be visible on a physical
standby after replaying any given amount of WAL. If logical
replication mimics that, I think it's as correct as it needs to be. If
not, those other systems are broken, too.

So I think what should happen is that when we write a WAL record
saying that the sequence has been incremented by 32, that should be
logically replicated after all commits whose commit record precedes
that WAL record and before commits whose commit record follows that
WAL record. It is OK to merge the replication of that record into one
of either the immediately preceding or the immediately following
commit, but you can't do it as part of any other commit because then
you're changing the order of operations.

For instance, consider:

T1: BEGIN; INSERT; COMMIT;
T2: BEGIN; nextval('a_seq') causing a logged advancement to the sequence;
T3: BEGIN; nextval('b_seq') causing a logged advancement to the sequence;
T4: BEGIN; INSERT; COMMIT;
T2: COMMIT;
T3: COMMIT;

The sequence increments can be replicated as part of T1 or part of T4
or in between applying T1 and T4. They cannot be applied as part of T2
or T3. Otherwise, suppose T4 read the current value of one of those
sequences and included that value in the inserted row, and the target
table happened to be the sequence_value_at_end_of_period table. Then
imagine that after receiving the data for T4 and replicating it, the
primary server is hit by a meteor and the replica is promoted. Well,
it's now possible for some new transaction to get a lower value from
that sequence than what has already been written to the
sequence_value_at_end_of_period table, which will presumably break the
application.
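The constraint can be replayed with a small model (illustrative Python; `seq_advance` stands for the XLOG_SEQ_LOG record written mid-transaction by T2's nextval()):

```python
# WAL order from the example: T2's sequence-advance record falls between
# T1's and T4's commit records. The replica must apply it in that window;
# deferring it until T2's own commit breaks a failover right after T4.

wal_order = [
    ("commit", "T1"),
    ("seq_advance", 32),   # logged by T2's nextval(), mid-transaction
    ("commit", "T4"),      # T4 read and stored the current sequence value
    ("commit", "T2"),
]

def replay(events, apply_seq_in_wal_order):
    seq = 0                       # replica's idea of the sequence
    seq_when_t4_applied = None
    pending = 0
    for kind, payload in events:
        if kind == "seq_advance":
            if apply_seq_in_wal_order:
                seq += payload    # apply at its own place in WAL order
            else:
                pending = payload # defer to the owning transaction (T2)
        else:  # commit
            if payload == "T2" and not apply_seq_in_wal_order:
                seq += pending
            if payload == "T4":
                seq_when_t4_applied = seq
    return seq_when_t4_applied

# Applied in WAL order: T4's data never references unseen sequence values.
assert replay(wal_order, apply_seq_in_wal_order=True) == 32
# Deferred into T2's commit: T4 is applied while the sequence still reads
# 0, so promoting the replica here could hand out already-used values.
assert replay(wal_order, apply_seq_in_wal_order=False) == 0
```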

> > In particular, I think it's likely that the "non-transactional
> > messages" that you mention earlier don't get applied at the point in
> > the commit sequence where they were found in the WAL. Not sure why
> > exactly, but perhaps the point at which we're reading WAL runs ahead
> > of the decoding per se, or something like that, and thus those
> > non-transactional messages arrive too early relative to the commit
> > ordering. Possibly that could be changed, and they could be buffered
>
> I'm not sure which case of "non-transactional messages" this refers to,
> so I can't quite respond to these comments. Perhaps you mean the
> problems that killed the previous patch [1]?

In http://postgr.es/m/8bf1c518-b886-fe1b-5c42-09f9c663146d@enterprisedb.com
you said "The other option might be to make these messages
non-transactional". I was referring to that.

> > the transaction whose commit record occurs most nearly prior to, or
> > most nearly after, the WAL record for the operation itself. Or else,
> > we could create "virtual" transactions for such operations and make
> > sure those get replayed at the right point in the commit sequence. Or
> > else, I don't know, maybe something else. But I think the overall
> > picture is that we need to approach the problem by replicating changes
> > in WAL order, as a physical standby would do. Saying that a change is
> > "nontransactional" doesn't mean that it's exempt from ordering
> > requirements; rather, it means that that change has its own place in
> > that ordering, distinct from the transaction in which it occurred.
>
> But doesn't the approach with WAL-logging sequence state before COMMIT,
> and then applying it independently in WAL-order, do pretty much this?

I'm sort of repeating myself here, but: only if the only operations
that ever get performed on sequences are increments. Which is just not
true.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From:
Andres Freund
Date:
Hi,

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> On 11/17/22 03:43, Andres Freund wrote:
> > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> >> Well, yeah - we can either try to perform the stuff independently of the
> >> transactions that triggered it, or we can try making it part of some of
> >> the transactions. Each of those options has problems, though :-(
> >>
> >> The first version of the patch tried the first approach, i.e. decode the
> >> increments and apply that independently. But:
> >>
> >>   (a) What would you do with increments of sequences created/reset in a
> >>       transaction? Can't apply those outside the transaction, because it
> >>       might be rolled back (and that state is not visible on primary).
> >
> > I think a reasonable approach could be to actually perform different WAL
> > logging for that case. It'll require a bit of machinery, but could actually
> > result in *less* WAL logging overall, because we don't need to emit a WAL
> > record for each SEQ_LOG_VALS sequence values.
> >
>
> Could you elaborate? Hard to comment without knowing more ...
>
> My point was that stuff like this (creating a new sequence or at least a
> new relfilenode) means we can't apply that independently of the
> transaction (unlike regular increments). I'm not sure how a change to
> WAL logging would make that go away.

Different WAL logging would make it easy to handle that on the logical
decoding level. We don't need to emit WAL records each time a
created-in-this-toplevel-xact sequence gets incremented, as the
increments won't persist anyway if the surrounding xact aborts. We
already need to remember the filenode so it can be dropped at the end of
the transaction, so we could emit a single record for each sequence at
that point.


> >>   (b) What about increments created before we have a proper snapshot?
> >>       There may be transactions dependent on the increment. This is what
> >>       ultimately led to revert of the patch.
> >
> > I don't understand this - why would we ever need to process those increments
> > from before we have a snapshot?  Wouldn't they, by definition, be before the
> > slot was active?
> >
> > To me this is the rough equivalent of logical decoding not giving the initial
> > state of all tables. You need some process outside of logical decoding to get
> > that (obviously we have some support for that via the exported data snapshot
> > during slot creation).
> >
>
> Which is what already happens during tablesync, no? We more or less copy
> sequences as if they were tables.

I think you might have to copy sequences after tables, but I'm not sure. But
otherwise, yea.


> > I assume that part of the initial sync would have to be a new sequence
> > synchronization step that reads all the sequence states on the publisher and
> > ensures that the subscriber sequences are at the same point. There's a bit of
> > trickiness there, but it seems entirely doable. The logical replication replay
> > support for sequences will have to be a bit careful about not decreasing the
> > subscriber's sequence values - the standby initially will be ahead of the
> > increments we'll see in the WAL. But that seems inevitable given the
> > non-transactional nature of sequences.
> >
>
> See fetch_sequence_data / copy_sequence in the patch. The bit about
> ensuring the sequence does not go away (say, using page LSN and/or LSN
> of the increment) is not there, however isn't that pretty much what I
> proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

Creating a new relfilenode whenever a sequence is incremented seems like a
complete no-go to me. That increases sequence overhead by several orders of
magnitude and will lead to *awful* catalog bloat on the subscriber.


> >
> >> This version of the patch tries to do the opposite thing - make sure
> >> that the state after each commit matches what the transaction might have
> >> seen (for sequences it accessed). It's imperfect, because it might log a
> >> state generated "after" the sequence got accessed - it focuses on the
> >> guarantee not to generate duplicate values.
> >
> > That approach seems quite wrong to me.
> >
>
> Why? Because it might log a state for sequence as of COMMIT, when the
> transaction accessed the sequence much earlier?

Mainly because sequences aren't transactional, and trying to make them
so will require awful contortions.

While there are cases where we don't flush the WAL / wait for syncrep for
sequences, we do replicate their state correctly on physical replication. If
an LSN has been acknowledged as having been replicated, we won't just lose a
prior sequence increment after promotion, even if the transaction didn't [yet]
commit.

It's completely valid for an application to call nextval() in one transaction,
potentially even abort it, and then only use that sequence value in another
transaction.



> > I did some skimming of the referenced thread about the reversal of the last
> > approach, but I couldn't really understand what the fundamental issues were
> > with the reverted implementation - it's a very long thread and references
> > other threads.
> >
>
> Yes, it's long/complex, but I intentionally linked to a specific message
> which describes the issue ...
>
> It's entirely possible there is a simple fix for the issue, and I just
> got confused / unable to see the solution. The whole issue was due to
> having a mix of transactional and non-transactional cases, similarly to
> logical messages - and logicalmsg_decode() has the same issue, so maybe
> let's talk about that for a moment.
>
> See [3] and imagine you're dealing with a transactional message, but
> you're still building a consistent snapshot. So the first branch applies:
>
>     if (transactional &&
>         !SnapBuildProcessChange(builder, xid, buf->origptr))
>         return;
>
> but because we don't have a snapshot, SnapBuildProcessChange does this:
>
>     if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
>         return false;

In this case we'd just return without further work in logicalmsg_decode(). The
problematic case presumably is when we have a full snapshot but aren't yet
consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns
true. And we reach:

> which however means logicalmsg_decode() does
>
>     snapshot = SnapBuildGetOrBuildSnapshot(builder);
>
> which crashes, because it hits this assert:
>
>     Assert(builder->state == SNAPBUILD_CONSISTENT);

I think the problem here is just that we shouldn't even try to get a snapshot
in the transactional case - note that it's not even used in
ReorderBufferQueueMessage() for transactional message. The transactional case
needs to behave like a "normal" change - we might never decode the message if
the transaction ends up committing before we've reached a consistent point.


> The sequence decoding did almost the same thing, with the same issue.
> Maybe the correct thing to do is to just ignore the change in this case?

No, I don't think that'd be correct, the message | sequence needs to be queued
for the transaction. If the transaction ends up committing after we've reached
consistency, we'll get the correct snapshot from the base snapshot set in
SnapBuildProcessChange().

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 11/17/22 18:07, Andres Freund wrote:
> Hi,
> 
> On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
>> On 11/17/22 03:43, Andres Freund wrote:
>>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>>>> Well, yeah - we can either try to perform the stuff independently of the
>>>> transactions that triggered it, or we can try making it part of some of
>>>> the transactions. Each of those options has problems, though :-(
>>>>
>>>> The first version of the patch tried the first approach, i.e. decode the
>>>> increments and apply that independently. But:
>>>>
>>>>   (a) What would you do with increments of sequences created/reset in a
>>>>       transaction? Can't apply those outside the transaction, because it
>>>>       might be rolled back (and that state is not visible on primary).
>>>
>>> I think a reasonable approach could be to actually perform different WAL
>>> logging for that case. It'll require a bit of machinery, but could actually
>>> result in *less* WAL logging overall, because we don't need to emit a WAL
>>> record for each SEQ_LOG_VALS sequence values.
>>>
>>
>> Could you elaborate? Hard to comment without knowing more ...
>>
>> My point was that stuff like this (creating a new sequence or at least a
>> new relfilenode) means we can't apply that independently of the
>> transaction (unlike regular increments). I'm not sure how a change to
>> WAL logging would make that go away.
> 
> Different WAL logging would make it easy to handle that on the logical
> decoding level. We don't need to emit WAL records each time a
> created-in-this-toplevel-xact sequence gets incremented, as the increments
> won't persist anyway if the surrounding xact aborts. We already need to remember
> the filenode so it can be dropped at the end of the transaction, so we could
> emit a single record for each sequence at that point.
> 
> 
>>>>   (b) What about increments created before we have a proper snapshot?
>>>>       There may be transactions dependent on the increment. This is what
>>>>       ultimately led to revert of the patch.
>>>
>>> I don't understand this - why would we ever need to process those increments
>>> from before we have a snapshot?  Wouldn't they, by definition, be before the
>>> slot was active?
>>>
>>> To me this is the rough equivalent of logical decoding not giving the initial
>>> state of all tables. You need some process outside of logical decoding to get
>>> that (obviously we have some support for that via the exported data snapshot
>>> during slot creation).
>>>
>>
>> Which is what already happens during tablesync, no? We more or less copy
>> sequences as if they were tables.
> 
> I think you might have to copy sequences after tables, but I'm not sure. But
> otherwise, yea.
> 
> 
>>> I assume that part of the initial sync would have to be a new sequence
>>> synchronization step that reads all the sequence states on the publisher and
>>> ensures that the subscriber sequences are at the same point. There's a bit of
>>> trickiness there, but it seems entirely doable. The logical replication replay
>>> support for sequences will have to be a bit careful about not decreasing the
>>> subscriber's sequence values - the standby initially will be ahead of the
>>> increments we'll see in the WAL. But that seems inevitable given the
>>> non-transactional nature of sequences.
>>>
>>
>> See fetch_sequence_data / copy_sequence in the patch. The bit about
>> ensuring the sequence does not go away (say, using page LSN and/or LSN
>> of the increment) is not there, however isn't that pretty much what I
>> proposed doing for "reconciling" the sequence state logged at COMMIT?
> 
> Well, I think the approach of logging all sequence increments at commit is the
> wrong idea...
> 

But we're not logging all sequence increments, no?

We're logging the state for each sequence touched by the transaction,
but only once - if the transaction incremented the sequence 1000000
times, we'll still log it just once (at least for this particular purpose).

Yes, if transactions touch each sequence just once, then we're logging
individual increments.

The only more efficient solution would be to decode the existing WAL
(every ~32 increments), and perhaps also tracking which sequences were
accessed by a transaction. And then simply stashing the increments in a
global reorderbuffer hash table, and then applying only the last one at
commit time. This would require the transactional / non-transactional
behavior (I think), but perhaps we can make that work.

Or are you thinking about some other scheme?

> Creating a new relfilenode whenever a sequence is incremented seems like a
> complete no-go to me. That increases sequence overhead by several orders of
> magnitude and will lead to *awful* catalog bloat on the subscriber.
> 

You mean on the apply side? Yes, I agree this needs a better
approach; I've focused on the decoding side so far.

> 
>>>
>>>> This version of the patch tries to do the opposite thing - make sure
>>>> that the state after each commit matches what the transaction might have
>>>> seen (for sequences it accessed). It's imperfect, because it might log a
>>>> state generated "after" the sequence got accessed - it focuses on the
>>>> guarantee not to generate duplicate values.
>>>
>>> That approach seems quite wrong to me.
>>>
>>
>> Why? Because it might log a state for sequence as of COMMIT, when the
>> transaction accessed the sequence much earlier?
> 
> Mainly because sequences aren't transactional, and trying to make them
> so will require awful contortions.
> 
> While there are cases where we don't flush the WAL / wait for syncrep for
> sequences, we do replicate their state correctly on physical replication. If
> an LSN has been acknowledged as having been replicated, we won't just lose a
> prior sequence increment after promotion, even if the transaction didn't [yet]
> commit.
> 

True, I agree we should aim to achieve that.

> It's completely valid for an application to call nextval() in one transaction,
> potentially even abort it, and then only use that sequence value in another
> transaction.
> 

I don't quite agree with that - we make no promises about what happens
to sequence changes in aborted transactions. I don't think I've ever
seen an application using such a pattern either.

And I'd argue we already fail to uphold such a guarantee, because we
don't wait for syncrep if the sequence WAL happened in an aborted
transaction. So if you use the value elsewhere (outside PG), you may
lose it.

Anyway, I think the scheme I outlined above (with stashing decoded
increments, logged once every 32 values) and applying the latest
increment for all sequences at commit, would work.

> 
> 
>>> I did some skimming of the referenced thread about the reversal of the last
>>> approach, but I couldn't really understand what the fundamental issues were
>>> with the reverted implementation - it's a very long thread and references
>>> other threads.
>>>
>>
>> Yes, it's long/complex, but I intentionally linked to a specific message
>> which describes the issue ...
>>
>> It's entirely possible there is a simple fix for the issue, and I just
>> got confused / unable to see the solution. The whole issue was due to
>> having a mix of transactional and non-transactional cases, similarly to
>> logical messages - and logicalmsg_decode() has the same issue, so maybe
>> let's talk about that for a moment.
>>
>> See [3] and imagine you're dealing with a transactional message, but
>> you're still building a consistent snapshot. So the first branch applies:
>>
>>     if (transactional &&
>>         !SnapBuildProcessChange(builder, xid, buf->origptr))
>>         return;
>>
>> but because we don't have a snapshot, SnapBuildProcessChange does this:
>>
>>     if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
>>         return false;
> 
> In this case we'd just return without further work in logicalmsg_decode(). The
> problematic case presumably is when we have a full snapshot but aren't yet
> consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns
> true. And we reach:
> 
>> which however means logicalmsg_decode() does
>>
>>     snapshot = SnapBuildGetOrBuildSnapshot(builder);
>>
>> which crashes, because it hits this assert:
>>
>>     Assert(builder->state == SNAPBUILD_CONSISTENT);
> 
> I think the problem here is just that we shouldn't even try to get a snapshot
> in the transactional case - note that it's not even used in
> ReorderBufferQueueMessage() for transactional message. The transactional case
> needs to behave like a "normal" change - we might never decode the message if
> the transaction ends up committing before we've reached a consistent point.
> 
> 
>> The sequence decoding did almost the same thing, with the same issue.
>> Maybe the correct thing to do is to just ignore the change in this case?
> 
> No, I don't think that'd be correct, the message | sequence needs to be queued
> for the transaction. If the transaction ends up committing after we've reached
> consistency, we'll get the correct snapshot from the base snapshot set in
> SnapBuildProcessChange().
> 

Yeah, I think you're right. I looked at this again, with fresh mind, and
I came to the same conclusion. Roughly what the attached patch does.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Andres Freund
Date:
Hi,

On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote:
> On 11/17/22 18:07, Andres Freund wrote:
> > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> >> On 11/17/22 03:43, Andres Freund wrote:
> >>> I assume that part of the initial sync would have to be a new sequence
> >>> synchronization step that reads all the sequence states on the publisher and
> >>> ensures that the subscriber sequences are at the same point. There's a bit of
> >>> trickiness there, but it seems entirely doable. The logical replication replay
> >>> support for sequences will have to be a bit careful about not decreasing the
> >>> subscriber's sequence values - the standby initially will be ahead of the
> >>> increments we'll see in the WAL. But that seems inevitable given the
> >>> non-transactional nature of sequences.
> >>>
> >>
> >> See fetch_sequence_data / copy_sequence in the patch. The bit about
> >> ensuring the sequence does not go away (say, using page LSN and/or LSN
> >> of the increment) is not there, however isn't that pretty much what I
> >> proposed doing for "reconciling" the sequence state logged at COMMIT?
> >
> > Well, I think the approach of logging all sequence increments at commit is the
> > wrong idea...
> >
>
> But we're not logging all sequence increments, no?

I was imprecise - I meant streaming them out at commit.



> Yeah, I think you're right. I looked at this again, with fresh mind, and
> I came to the same conclusion. Roughly what the attached patch does.

To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in
decode.c instead of moving it to reorderbuffer.c. Perhaps we should add a
snapbuild.c helper similar to SnapBuildProcessChange() for non-transactional
changes that also gets a snapshot?

Could look something like

    Snapshot snapshot = NULL;

    if (message->transactional)
    {
        if (!SnapBuildProcessChange(builder, xid, buf->origptr))
            return;
    }
    else if (!SnapBuildProcessStateNonTx(builder, &snapshot))
        return;

    ...

Or perhaps we should just bite the bullet and add an argument to
SnapBuildProcessChange to deal with that?

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

Here's a rebased version of the sequence decoding patch.

0001 is a fix for the pre-existing issue in logicalmsg_decode,
attempting to build a snapshot before getting into a consistent state.
AFAICS this only affects assert-enabled builds and is otherwise
harmless, because we are not actually using the snapshot (apply gets a
valid snapshot from the transaction).

This is mostly the fix I shared in November, except that I kept the call
in decode.c (per comment from Andres). I haven't added any argument to
SnapBuildProcessChange because we may need to backpatch this (and it
didn't seem much simpler, IMHO).

0002 is a rebased version of the original approach, committed as
0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix
as 0001 (for the sequence messages), the primary reason for the revert.

The rebase was not quite straightforward, due to extensive changes in
how publications deal with tables/schemas, and so on. So this adopts
them, but other than that it behaves just like the original patch.

So this abandons the approach with COMMIT-time logging for sequences
accessed/modified by the transaction, proposed in response to the
revert. It seemed like a good (and simpler) alternative, but there were
far too many issues - higher overhead, ordering of records for
concurrent transactions, making it reliable, etc.

I think the main remaining question is what's the goal of this patch, or
rather what "guarantees" we expect from it - what we expect to see on
the replica after incrementing a sequence on the primary.

Robert described [1] a model and argued the standby should not "invent"
new states. It's a long / detailed explanation, and I'm not going to try
to shorten it here because that'd inevitably omit various details. So
better read it whole ...

Anyway, I don't think this approach (essentially treating most sequence
increments as non-transactional) breaks any consistency guarantees or
introduces any "new" states that would not be observable on the primary.
In a way, this treats non-transactional sequence increments as separate
transactions, and applies them directly. If you read the sequence in
between two commits, you might see any "intermediate" state of the
sequence - that's the nature of non-transactional changes.

We could "postpone" applying the decoded changes until the first next
commit, which might improve performance if a transaction is long enough
to cover many sequence increments. But that's more a performance
optimization than a matter of correctness, IMHO.

One caveat is that because of how WAL works for sequences, we're
actually decoding changes "ahead", so if you read the sequence on the
subscriber it'll actually seem to be slightly ahead (up to ~32 values).
This could be eliminated by setting SEQ_LOG_VALS to 0, which however
increases the sequence costs, of course.

This however brings me to the original question what's the purpose of
this patch - and that's essentially keeping sequences up to date to make
them usable after a failover. We can't generate values from the sequence
on the subscriber, because it'd just get overwritten. And from this
point of view, it's also fine that the sequence is slightly ahead,
because that's what happens after crash recovery anyway. And we're not
guaranteeing the sequences to be gap-less.


regards


[1]
https://www.postgresql.org/message-id/CA%2BTgmoaYG7672OgdwpGm5cOwy8_ftbs%3D3u-YMvR9fiJwQUzgrQ%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Robert Haas
Date:
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
> attempting to build a snapshot before getting into a consistent state.
> AFAICS this only affects assert-enabled builds and is otherwise
> harmless, because we are not actually using the snapshot (apply gets a
> valid snapshot from the transaction).
>
> This is mostly the fix I shared in November, except that I kept the call
> in decode.c (per comment from Andres). I haven't added any argument to
> SnapBuildProcessChange because we may need to backpatch this (and it
> didn't seem much simpler, IMHO).

I tend to associate transactional behavior with snapshots, so it looks
odd to see code that builds a snapshot only when the message is
non-transactional. I think that a more detailed comment spelling out
the reasoning would be useful here.

> This however brings me to the original question what's the purpose of
> this patch - and that's essentially keeping sequences up to date to make
> them usable after a failover. We can't generate values from the sequence
> on the subscriber, because it'd just get overwritten. And from this
> point of view, it's also fine that the sequence is slightly ahead,
> because that's what happens after crash recovery anyway. And we're not
> guaranteeing the sequences to be gap-less.

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.
Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
I think that needs to act as a barrier: non-transactional changes that
happened before that transaction must also be replicated before that
transaction is replicated, and those that happened after it must be
replayed after that transaction is replicated. Otherwise, at the very
least, there will be states visible
on the standby that were never visible on the origin server, and maybe
we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state.

Maybe you've already got this and similar cases totally correctly
handled, I'm not sure, just throwing it out there.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 1/10/23 20:52, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
>> attempting to build a snapshot before getting into a consistent state.
>> AFAICS this only affects assert-enabled builds and is otherwise
>> harmless, because we are not actually using the snapshot (apply gets a
>> valid snapshot from the transaction).
>>
>> This is mostly the fix I shared in November, except that I kept the call
>> in decode.c (per comment from Andres). I haven't added any argument to
>> SnapBuildProcessChange because we may need to backpatch this (and it
>> didn't seem much simpler, IMHO).
> 
> I tend to associate transactional behavior with snapshots, so it looks
> odd to see code that builds a snapshot only when the message is
> non-transactional. I think that a more detailed comment spelling out
> the reasoning would be useful here.
> 

I'll try adding a comment explaining this, but the reasoning is fairly
simple AFAICS:

1) We don't actually need to build the snapshot for transactional
changes, because if we end up applying the change, we'll use snapshot
provided/maintained by reorderbuffer.

2) But we don't know if we end up applying the change - it may happen
that this is one of the transactions we're still waiting on to finish,
or one we skipped, in which case the snapshot is kinda bogus anyway.
What "saved" us is that we'll not actually use the snapshot in the end.
It's just the assert that causes issues.

3) For non-transactional changes, we need a snapshot because we're going
to execute the callback right away. But in this case the code actually
protects against building inconsistent snapshots.

>> This however brings me to the original question what's the purpose of
>> this patch - and that's essentially keeping sequences up to date to make
>> them usable after a failover. We can't generate values from the sequence
>> on the subscriber, because it'd just get overwritten. And from this
>> point of view, it's also fine that the sequence is slightly ahead,
>> because that's what happens after crash recovery anyway. And we're not
>> guaranteeing the sequences to be gap-less.
> 
> I agree that it's fine for the sequence to be slightly ahead, but I
> think that it can't be too far ahead without causing problems. Suppose
> for example that transaction #1 creates a sequence. Transaction #2
> does nextval on the sequence a bunch of times and inserts rows into a
> table using the sequence values as the PK. It's fine if the nextval
> operations are replicated ahead of the commit of transaction #2 -- in
> fact I'd say it's necessary for correctness -- but they can't precede
> the commit of transaction #1, since then the sequence won't exist yet.

It's not clear to me how could that even happen. If transaction #1
creates a sequence, it's invisible for transaction #2. So how could it
do nextval() on it? #2 has to wait for #1 to commit before it can do
anything on the sequence, which enforces the correct ordering, no?

> Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
> I think that needs to act as a barrier: non-transactional changes that
> happened before that transaction must also be replicated before that
> transaction is replicated, and those that happened after it must be
> replayed after that transaction is replicated. Otherwise, at the very
> least, there will be states visible
> on the standby that were never visible on the origin server, and maybe
> we'll just straight up get the wrong answer. For instance:
> 
> 1. nextval, setting last_value to 3
> 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
> 3. nextval, setting last_value to 20
> 
> If 3 happens before 2, the sequence ends up in the wrong state.
> 
> Maybe you've already got this and similar cases totally correctly
> handled, I'm not sure, just throwing it out there.
> 

I believe this should behave correctly too, thanks to locking.

If a transaction does ALTER SEQUENCE, that locks the sequence, so only
that transaction can do stuff with it (and changes from that point on
are treated as transactional). And everyone else has to wait for that
transaction to commit.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Andres Freund
Date:
Hi,


Heikki, CCed you due to the point about 2c03216d8311 below.


On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote:
> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
> attempting to build a snapshot before getting into a consistent state.
> AFAICS this only affects assert-enabled builds and is otherwise
> harmless, because we are not actually using the snapshot (apply gets a
> valid snapshot from the transaction).

LGTM.


> 0002 is a rebased version of the original approach, committed as
> 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix
> as 0001 (for the sequence messages), the primary reason for the revert.
> 
> The rebase was not quite straightforward, due to extensive changes in
> how publications deal with tables/schemas, and so on. So this adopts
> them, but other than that it behaves just like the original patch.

This is a huge diff:
>  72 files changed, 4715 insertions(+), 612 deletions(-)

It'd be nice to split it to make review easier. Perhaps the sequence decoding
support could be split from the whole publication rigamarole?


> This does not include any changes to test_decoding and/or the built-in
> replication - those will be committed in separate patches.

Looks like that's not the case anymore?


> +/*
> + * Update the sequence state by modifying the existing sequence data row.
> + *
> + * This keeps the same relfilenode, so the behavior is non-transactional.
> + */
> +static void
> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)
> +{
> +    SeqTable    elm;
> +    Relation    seqrel;
> +    Buffer        buf;
> +    HeapTupleData seqdatatuple;
> +    Form_pg_sequence_data seq;
> +
> +    /* open and lock sequence */
> +    init_sequence(seqrelid, &elm, &seqrel);
> +
> +    /* lock page' buffer and read tuple */
> +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
> +
> +    /* check the comment above nextval_internal()'s equivalent call. */
> +    if (RelationNeedsWAL(seqrel))
> +    {
> +        GetTopTransactionId();
> +
> +        if (XLogLogicalInfoActive())
> +            GetCurrentTransactionId();
> +    }
> +
> +    /* ready to change the on-disk (or really, in-buffer) tuple */
> +    START_CRIT_SECTION();
> +
> +    seq->last_value = last_value;
> +    seq->is_called = is_called;
> +    seq->log_cnt = log_cnt;
> +
> +    MarkBufferDirty(buf);
> +
> +    /* XLOG stuff */
> +    if (RelationNeedsWAL(seqrel))
> +    {
> +        xl_seq_rec    xlrec;
> +        XLogRecPtr    recptr;
> +        Page        page = BufferGetPage(buf);
> +
> +        XLogBeginInsert();
> +        XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
> +
> +        xlrec.locator = seqrel->rd_locator;
> +        xlrec.created = false;
> +
> +        XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
> +        XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
> +
> +        recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
> +
> +        PageSetLSN(page, recptr);
> +    }
> +
> +    END_CRIT_SECTION();
> +
> +    UnlockReleaseBuffer(buf);
> +
> +    /* Clear local cache so that we don't think we have cached numbers */
> +    /* Note that we do not change the currval() state */
> +    elm->cached = elm->last;
> +
> +    relation_close(seqrel, NoLock);
> +}
> +
> +/*
> + * Update the sequence state by creating a new relfilenode.
> + *
> + * This creates a new relfilenode, to allow transactional behavior.
> + */
> +static void
> +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)
> +{
> +    SeqTable    elm;
> +    Relation    seqrel;
> +    Buffer        buf;
> +    HeapTupleData seqdatatuple;
> +    Form_pg_sequence_data seq;
> +    HeapTuple    tuple;
> +
> +    /* open and lock sequence */
> +    init_sequence(seq_relid, &elm, &seqrel);
> +
> +    /* lock page' buffer and read tuple */
> +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
> +
> +    /* Copy the existing sequence tuple. */
> +    tuple = heap_copytuple(&seqdatatuple);
> +
> +    /* Now we're done with the old page */
> +    UnlockReleaseBuffer(buf);
> +
> +    /*
> +     * Modify the copied tuple to update the sequence state (similar to what
> +     * ResetSequence does).
> +     */
> +    seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
> +    seq->last_value = last_value;
> +    seq->is_called = is_called;
> +    seq->log_cnt = log_cnt;
> +
> +    /*
> +     * Create a new storage file for the sequence - this is needed for the
> +     * transactional behavior.
> +     */
> +    RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);
> +
> +    /*
> +     * Ensure sequence's relfrozenxid is at 0, since it won't contain any
> +     * unfrozen XIDs.  Same with relminmxid, since a sequence will never
> +     * contain multixacts.
> +     */
> +    Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId);
> +    Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId);
> +
> +    /*
> +     * Insert the modified tuple into the new storage file. This does all the
> +     * necessary WAL-logging etc.
> +     */
> +    fill_seq_with_data(seqrel, tuple);
> +
> +    /* Clear local cache so that we don't think we have cached numbers */
> +    /* Note that we do not change the currval() state */
> +    elm->cached = elm->last;
> +
> +    relation_close(seqrel, NoLock);
> +}
> +
> +/*
> + * Set a sequence to a specified internal state.
> + *
> + * The change is made transactionally, so that on failure of the current
> + * transaction, the sequence will be restored to its previous state.
> + * We do that by creating a whole new relfilenode for the sequence; so this
> + * works much like the rewriting forms of ALTER TABLE.
> + *
> + * Caller is assumed to have acquired AccessExclusiveLock on the sequence,
> + * which must not be released until end of transaction.  Caller is also
> + * responsible for permissions checking.
> + */
> +void
> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called)
> +{
> +    if (transactional)
> +        SetSequence_transactional(seq_relid, last_value, log_cnt, is_called);
> +    else
> +        SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called);
> +}

That's a lot of duplication with existing code. There's no explanation why
SetSequence() as well as do_setval() exists.


>  /*
>   * Initialize a sequence's relation with the specified tuple as content
>   *
> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
>  
>      /* check the comment above nextval_internal()'s equivalent call. */
>      if (RelationNeedsWAL(rel))
> +    {
>          GetTopTransactionId();
>  
> +        if (XLogLogicalInfoActive())
> +            GetCurrentTransactionId();
> +    }

Is it actually possible to reach this without an xid already having been
assigned for the current xact?



> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions)
>       * It's sufficient to ensure the toplevel transaction has an xid, no need
>       * to assign xids subxacts, that'll already trigger an appropriate wait.
>       * (Have to do that here, so we're outside the critical section)
> +     *
> +     * We have to ensure we have a proper XID, which will be included in
> +     * the XLOG record by XLogRecordAssemble. Otherwise the first nextval()
> +     * in a subxact (without any preceding changes) would get XID 0, and it
> +     * would then be impossible to decide which top xact it belongs to.
> +     * It'd also trigger assert in DecodeSequence. We only do that with
> +     * wal_level=logical, though.
> +     *
> +     * XXX This might seem unnecessary, because if there's no XID the xact
> +     * couldn't have done anything important yet, e.g. it could not have
> +     * created a sequence. But that's incorrect, because of subxacts. The
> +     * current subtransaction might not have done anything yet (thus no XID),
> +     * but an earlier one might have created the sequence.
>       */

What about restricting this to the case you're mentioning,
i.e. subtransactions?


> @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions)
>          seq->log_cnt = 0;
>  
>          xlrec.locator = seqrel->rd_locator;

I realize this isn't from this patch, but:

Why do we include the locator in the record? We already have it via
XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page
via XLogInitBufferForRedo() during recovery.

Kinda looks like an oversight in 2c03216d8311




> +/*
> + * Handle sequence decode
> + *
> + * Decoding sequences is a bit tricky, because while most sequence actions
> + * are non-transactional (not subject to rollback), some need to be handled
> + * as transactional.
> + *
> + * By default, a sequence increment is non-transactional - we must not queue
> + * it in a transaction as other changes, because the transaction might get
> + * rolled back and we'd discard the increment. The downstream would not be
> + * notified about the increment, which is wrong.
> + *
> + * On the other hand, the sequence may be created in a transaction. In this
> + * case we *should* queue the change as other changes in the transaction,
> + * because we don't want to send the increments for unknown sequence to the
> + * plugin - it might get confused about which sequence it's related to etc.
> + */
> +void
> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> +{

> +    /* extract the WAL record, with "created" flag */
> +    xlrec = (xl_seq_rec *) XLogRecGetData(r);
> +
> +    /* XXX how could we have sequence change without data? */
> +    if(!datalen || !tupledata)
> +        return;

Yea, I think we should error out here instead, something has gone quite wrong
if this happens.


> +    tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
> +    DecodeSeqTuple(tupledata, datalen, tuplebuf);
> +
> +    /*
> +     * Should we handle the sequence increment as transactional or not?
> +     *
> +     * If the sequence was created in a still-running transaction, treat
> +     * it as transactional and queue the increments. Otherwise it needs
> +     * to be treated as non-transactional, in which case we send it to
> +     * the plugin right away.
> +     */
> +    transactional = ReorderBufferSequenceIsTransactional(ctx->reorder,
> +                                                         target_locator,
> +                                                         xlrec->created);

Why re-create this information during decoding, when we basically already have
it available on the primary? I think we already pay the price for that
tracking, which we e.g. use for doing a non-transactional truncate:

        /*
         * Normally, we need a transaction-safe truncation here.  However, if
         * the table was either created in the current (sub)transaction or has
         * a new relfilenumber in the current (sub)transaction, then we can
         * just truncate it in-place, because a rollback would cause the whole
         * table or the current physical file to be thrown away anyway.
         */
        if (rel->rd_createSubid == mySubid ||
            rel->rd_newRelfilelocatorSubid == mySubid)
        {
            /* Immediate, non-rollbackable truncation is OK */
            heap_truncate_one_rel(rel);
        }

Afaict we could do something similar for sequences, except that I think we
would just check if the sequence was created in the current transaction
(i.e. any of the fields are set).


> +/*
> + * A transactional sequence increment is queued to be processed upon commit
> + * and a non-transactional increment gets processed immediately.
> + *
> + * A sequence update may be both transactional and non-transactional. When
> + * created in a running transaction, treat it as transactional and queue
> + * the change in it. Otherwise treat it as non-transactional, so that we
> + * don't forget the increment in case of a rollback.
> + */
> +void
> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid,
> +                           Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id,
> +                           RelFileLocator rlocator, bool transactional, bool created,
> +                           ReorderBufferTupleBuf *tuplebuf)


> +        /*
> +         * Decoding needs access to syscaches et al., which in turn use
> +         * heavyweight locks and such. Thus we need to have enough state around to
> +         * keep track of those.  The easiest way is to simply use a transaction
> +         * internally.  That also allows us to easily enforce that nothing writes
> +         * to the database by checking for xid assignments.
> +         *
> +         * When we're called via the SQL SRF there's already a transaction
> +         * started, so start an explicit subtransaction there.
> +         */
> +        using_subtxn = IsTransactionOrTransactionBlock();

This duplicates a lot of the code from ReorderBufferProcessTXN(). But only
does so partially. It's hard to tell whether some of the differences are
intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()?

Maybe something like

void
ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals);

void
ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error);




Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Jan 11, 2023 at 1:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > I agree that it's fine for the sequence to be slightly ahead, but I
> > think that it can't be too far ahead without causing problems. Suppose
> > for example that transaction #1 creates a sequence. Transaction #2
> > does nextval on the sequence a bunch of times and inserts rows into a
> > table using the sequence values as the PK. It's fine if the nextval
> > operations are replicated ahead of the commit of transaction #2 -- in
> > fact I'd say it's necessary for correctness -- but they can't precede
> > the commit of transaction #1, since then the sequence won't exist yet.
>
> It's not clear to me how that could even happen. If transaction #1
> creates a sequence, it's invisible to transaction #2. So how could it
> do nextval() on it? #2 has to wait for #1 to commit before it can do
> anything with the sequence, which enforces the correct ordering, no?

Yeah, I meant if #1 had committed and then #2 started to do its thing.
I was worried that decoding might reach the nextval operations in
transaction #2 before it replayed #1.

This worry may be entirely based on me not understanding how this
actually works. Do we always apply a transaction as soon as we see the
commit record for it, before decoding any further?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Andres Freund
Date:
Hi,

On 2023-01-11 15:23:18 -0500, Robert Haas wrote:
> Yeah, I meant if #1 had committed and then #2 started to do its thing.
> I was worried that decoding might reach the nextval operations in
> transaction #2 before it replayed #1.
>
> This worry may be entirely based on me not understanding how this
> actually works. Do we always apply a transaction as soon as we see the
> commit record for it, before decoding any further?

Yes.

Otherwise we'd have a really hard time figuring out the correct historical
snapshot to use for subsequent transactions - they'd have been able to see the
catalog modifications made by the committing transaction.

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Jan 11, 2023 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-11 15:23:18 -0500, Robert Haas wrote:
> > Yeah, I meant if #1 had committed and then #2 started to do its thing.
> > I was worried that decoding might reach the nextval operations in
> > transaction #2 before it replayed #1.
> >
> > This worry may be entirely based on me not understanding how this
> > actually works. Do we always apply a transaction as soon as we see the
> > commit record for it, before decoding any further?
>
> Yes.
>
> Otherwise we'd have a really hard time figuring out the correct historical
> snapshot to use for subsequent transactions - they'd have been able to see the
> catalog modifications made by the committing transaction.

I wonder, then, what happens if somebody wants to do parallel apply.
That would seem to require some relaxation of this rule, but then
doesn't that break what this patch wants to do?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Andres Freund
Date:
Hi,

On 2023-01-11 15:41:45 -0500, Robert Haas wrote:
> I wonder, then, what happens if somebody wants to do parallel apply.  That
> would seem to require some relaxation of this rule, but then doesn't that
> break what this patch wants to do?

I don't think it'd pose a direct problem - presumably you'd only parallelize
applying changes, not committing the transactions containing them. You'd get a
lot of inconsistencies otherwise.

If you're thinking of decoding changes in parallel (rather than streaming out
large changes before commit when possible), you'd only be able to do that in
cases where transactions haven't performed catalog changes, I think. In which
case there'd also be no issue wrt transactional sequence changes.

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 1/11/23 21:58, Andres Freund wrote:
> Hi,
> 
> On 2023-01-11 15:41:45 -0500, Robert Haas wrote:
>> I wonder, then, what happens if somebody wants to do parallel apply.  That
>> would seem to require some relaxation of this rule, but then doesn't that
>> break what this patch wants to do?
> 
> I don't think it'd pose a direct problem - presumably you'd only parallelize
> applying changes, not committing the transactions containing them. You'd get a
> lot of inconsistencies otherwise.
> 

Right. It's the commit order that matters - as long as that's
maintained, the result should be consistent etc.

There's plenty of other hard problems, though - for example it's trivial
for the apply workers to apply the changes in the incorrect order
(contradicting commit order) and then deadlock. And the deadlock
detector may easily keep aborting the wrong worker (the oldest one),
so that replication grinds to a halt.

I was recently wondering how far we would get by just doing prefetch for
logical apply - instead of applying the changes right away, just do a
lookup on the replica identity values, and then a simple serial apply.
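That prefetch idea could be sketched roughly like this (toy Python model, entirely hypothetical names; a real implementation would warm shared buffers via index lookups on the replica identity, not probe a dict):

```python
# Toy sketch of "prefetch then serial apply": a read-only first pass touches
# each target row, then a second pass applies changes serially, so commit
# order is preserved and only the cheap lookups run concurrently-warm.
def apply_with_prefetch(changes, lookup, apply_one):
    for change in changes:   # pass 1: prefetch, no writes
        lookup(change["key"])
    for change in changes:   # pass 2: serial apply in commit order
        apply_one(change)

table = {1: "old"}
seen = []
apply_with_prefetch(
    [{"key": 1, "val": "new"}],
    lookup=lambda k: seen.append(table.get(k)),
    apply_one=lambda c: table.__setitem__(c["key"], c["val"]),
)
assert seen == ["old"] and table[1] == "new"
```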

> If you're thinking of decoding changes in parallel (rather than streaming out
> large changes before commit when possible), you'd only be able to do that in
> cases where transactions haven't performed catalog changes, I think. In which
> case there'd also be no issue wrt transactional sequence changes.
> 

Perhaps, although it's not clear to me how you would know that in
advance? I mean, you could start decoding changes in parallel, and then
find that one of the earlier transactions touched a catalog.

But maybe I misunderstand what "decoding" refers to - don't we need the
snapshot only in reorderbuffer? In which case all the other stuff could
be parallelized (not sure if that's really expensive).

Anyway, all of this is far out of scope of this patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 1/11/23 21:12, Andres Freund wrote:
> Hi,
> 
> 
> Heikki, CCed you due to the point about 2c03216d8311 below.
> 
> 
> On 2023-01-10 19:32:12 +0100, Tomas Vondra wrote:
>> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
>> attempting to build a snapshot before getting into a consistent state.
>> AFAICS this only affects assert-enabled builds and is otherwise
>> harmless, because we are not actually using the snapshot (apply gets a
>> valid snapshot from the transaction).
> 
> LGTM.
> 
> 
>> 0002 is a rebased version of the original approach, committed as
>> 0da92dc530 (and then reverted in 2c7ea57e56). This includes the same fix
>> as 0001 (for the sequence messages), the primary reason for the revert.
>>
>> The rebase was not quite straightforward, due to extensive changes in
>> how publications deal with tables/schemas, and so on. So this adopts
>> them, but other than that it behaves just like the original patch.
> 
> This is a huge diff:
>>  72 files changed, 4715 insertions(+), 612 deletions(-)
> 
> It'd be nice to split it to make review easier. Perhaps the sequence decoding
> support could be split from the whole publication rigamarole?
> 
> 
>> This does not include any changes to test_decoding and/or the built-in
>> replication - those will be committed in separate patches.
> 
> Looks like that's not the case anymore?
> 

Ah, right!  Now I realize I originally committed this in chunks, but
the revert was a single commit. And I just "reverted the revert" to
create this patch.

I'll definitely split this into smaller patches. This also explains the
obsolete commit message about test_decoding not being included, etc.

> 
>> +/*
>> + * Update the sequence state by modifying the existing sequence data row.
>> + *
>> + * This keeps the same relfilenode, so the behavior is non-transactional.
>> + */
>> +static void
>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)
>> +{
>> +    SeqTable    elm;
>> +    Relation    seqrel;
>> +    Buffer        buf;
>> +    HeapTupleData seqdatatuple;
>> +    Form_pg_sequence_data seq;
>> +
>> +    /* open and lock sequence */
>> +    init_sequence(seqrelid, &elm, &seqrel);
>> +
>> +    /* lock page' buffer and read tuple */
>> +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
>> +
>> +    /* check the comment above nextval_internal()'s equivalent call. */
>> +    if (RelationNeedsWAL(seqrel))
>> +    {
>> +        GetTopTransactionId();
>> +
>> +        if (XLogLogicalInfoActive())
>> +            GetCurrentTransactionId();
>> +    }
>> +
>> +    /* ready to change the on-disk (or really, in-buffer) tuple */
>> +    START_CRIT_SECTION();
>> +
>> +    seq->last_value = last_value;
>> +    seq->is_called = is_called;
>> +    seq->log_cnt = log_cnt;
>> +
>> +    MarkBufferDirty(buf);
>> +
>> +    /* XLOG stuff */
>> +    if (RelationNeedsWAL(seqrel))
>> +    {
>> +        xl_seq_rec    xlrec;
>> +        XLogRecPtr    recptr;
>> +        Page        page = BufferGetPage(buf);
>> +
>> +        XLogBeginInsert();
>> +        XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);
>> +
>> +        xlrec.locator = seqrel->rd_locator;
>> +        xlrec.created = false;
>> +
>> +        XLogRegisterData((char *) &xlrec, sizeof(xl_seq_rec));
>> +        XLogRegisterData((char *) seqdatatuple.t_data, seqdatatuple.t_len);
>> +
>> +        recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
>> +
>> +        PageSetLSN(page, recptr);
>> +    }
>> +
>> +    END_CRIT_SECTION();
>> +
>> +    UnlockReleaseBuffer(buf);
>> +
>> +    /* Clear local cache so that we don't think we have cached numbers */
>> +    /* Note that we do not change the currval() state */
>> +    elm->cached = elm->last;
>> +
>> +    relation_close(seqrel, NoLock);
>> +}
>> +
>> +/*
>> + * Update the sequence state by creating a new relfilenode.
>> + *
>> + * This creates a new relfilenode, to allow transactional behavior.
>> + */
>> +static void
>> +SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)
>> +{
>> +    SeqTable    elm;
>> +    Relation    seqrel;
>> +    Buffer        buf;
>> +    HeapTupleData seqdatatuple;
>> +    Form_pg_sequence_data seq;
>> +    HeapTuple    tuple;
>> +
>> +    /* open and lock sequence */
>> +    init_sequence(seq_relid, &elm, &seqrel);
>> +
>> +    /* lock page' buffer and read tuple */
>> +    seq = read_seq_tuple(seqrel, &buf, &seqdatatuple);
>> +
>> +    /* Copy the existing sequence tuple. */
>> +    tuple = heap_copytuple(&seqdatatuple);
>> +
>> +    /* Now we're done with the old page */
>> +    UnlockReleaseBuffer(buf);
>> +
>> +    /*
>> +     * Modify the copied tuple to update the sequence state (similar to what
>> +     * ResetSequence does).
>> +     */
>> +    seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
>> +    seq->last_value = last_value;
>> +    seq->is_called = is_called;
>> +    seq->log_cnt = log_cnt;
>> +
>> +    /*
>> +     * Create a new storage file for the sequence - this is needed for the
>> +     * transactional behavior.
>> +     */
>> +    RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);
>> +
>> +    /*
>> +     * Ensure sequence's relfrozenxid is at 0, since it won't contain any
>> +     * unfrozen XIDs.  Same with relminmxid, since a sequence will never
>> +     * contain multixacts.
>> +     */
>> +    Assert(seqrel->rd_rel->relfrozenxid == InvalidTransactionId);
>> +    Assert(seqrel->rd_rel->relminmxid == InvalidMultiXactId);
>> +
>> +    /*
>> +     * Insert the modified tuple into the new storage file. This does all the
>> +     * necessary WAL-logging etc.
>> +     */
>> +    fill_seq_with_data(seqrel, tuple);
>> +
>> +    /* Clear local cache so that we don't think we have cached numbers */
>> +    /* Note that we do not change the currval() state */
>> +    elm->cached = elm->last;
>> +
>> +    relation_close(seqrel, NoLock);
>> +}
>> +
>> +/*
>> + * Set a sequence to a specified internal state.
>> + *
>> + * The change is made transactionally, so that on failure of the current
>> + * transaction, the sequence will be restored to its previous state.
>> + * We do that by creating a whole new relfilenode for the sequence; so this
>> + * works much like the rewriting forms of ALTER TABLE.
>> + *
>> + * Caller is assumed to have acquired AccessExclusiveLock on the sequence,
>> + * which must not be released until end of transaction.  Caller is also
>> + * responsible for permissions checking.
>> + */
>> +void
>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called)
>> +{
>> +    if (transactional)
>> +        SetSequence_transactional(seq_relid, last_value, log_cnt, is_called);
>> +    else
>> +        SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called);
>> +}
> 
> That's a lot of duplication with existing code. There's no explanation why
> SetSequence() as well as do_setval() exists.
> 

Thanks, I'll look into this.

> 
>>  /*
>>   * Initialize a sequence's relation with the specified tuple as content
>>   *
>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
>>  
>>      /* check the comment above nextval_internal()'s equivalent call. */
>>      if (RelationNeedsWAL(rel))
>> +    {
>>          GetTopTransactionId();
>>  
>> +        if (XLogLogicalInfoActive())
>> +            GetCurrentTransactionId();
>> +    }
> 
> Is it actually possible to reach this without an xid already having been
> assigned for the current xact?
> 

I believe it is. That's probably how I found out this change was needed,
actually.

> 
> 
>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions)
>>       * It's sufficient to ensure the toplevel transaction has an xid, no need
>>       * to assign xids subxacts, that'll already trigger an appropriate wait.
>>       * (Have to do that here, so we're outside the critical section)
>> +     *
>> +     * We have to ensure we have a proper XID, which will be included in
>> +     * the XLOG record by XLogRecordAssemble. Otherwise the first nextval()
>> +     * in a subxact (without any preceding changes) would get XID 0, and it
>> +     * would then be impossible to decide which top xact it belongs to.
>> +     * It'd also trigger assert in DecodeSequence. We only do that with
>> +     * wal_level=logical, though.
>> +     *
>> +     * XXX This might seem unnecessary, because if there's no XID the xact
>> +     * couldn't have done anything important yet, e.g. it could not have
>> +     * created a sequence. But that's incorrect, because of subxacts. The
>> +     * current subtransaction might not have done anything yet (thus no XID),
>> +     * but an earlier one might have created the sequence.
>>       */
> 
> What about restricting this to the case you're mentioning,
> i.e. subtransactions?
> 

That might work, but I need to think about it a bit.

I don't think it'd save us much, though. I mean, the vast majority of
transactions (and subtransactions) calling nextval() will then do
something else that requires an XID. This just moves the XID assignment
a bit earlier, that's all.

> 
>> @@ -845,6 +1023,7 @@ nextval_internal(Oid relid, bool check_permissions)
>>          seq->log_cnt = 0;
>>  
>>          xlrec.locator = seqrel->rd_locator;
> 
> I realize this isn't from this patch, but:
> 
> Why do we include the locator in the record? We already have it via
> XLogRegisterBuffer(), no? And afaict we don't even use it, as we read the page
> via XLogInitBufferForRedo() during recovery.
> 
> Kinda looks like an oversight in 2c03216d8311
> 

I don't know, it's what the code did.

> 
> 
> 
>> +/*
>> + * Handle sequence decode
>> + *
>> + * Decoding sequences is a bit tricky, because while most sequence actions
>> + * are non-transactional (not subject to rollback), some need to be handled
>> + * as transactional.
>> + *
>> + * By default, a sequence increment is non-transactional - we must not queue
>> + * it in a transaction as other changes, because the transaction might get
>> + * rolled back and we'd discard the increment. The downstream would not be
>> + * notified about the increment, which is wrong.
>> + *
>> + * On the other hand, the sequence may be created in a transaction. In this
>> + * case we *should* queue the change as other changes in the transaction,
>> + * because we don't want to send the increments for unknown sequence to the
>> + * plugin - it might get confused about which sequence it's related to etc.
>> + */
>> +void
>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>> +{
> 
>> +    /* extract the WAL record, with "created" flag */
>> +    xlrec = (xl_seq_rec *) XLogRecGetData(r);
>> +
>> +    /* XXX how could we have sequence change without data? */
>> +    if(!datalen || !tupledata)
>> +        return;
> 
> Yea, I think we should error out here instead, something has gone quite wrong
> if this happens.
> 

OK

> 
>> +    tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
>> +    DecodeSeqTuple(tupledata, datalen, tuplebuf);
>> +
>> +    /*
>> +     * Should we handle the sequence increment as transactional or not?
>> +     *
>> +     * If the sequence was created in a still-running transaction, treat
>> +     * it as transactional and queue the increments. Otherwise it needs
>> +     * to be treated as non-transactional, in which case we send it to
>> +     * the plugin right away.
>> +     */
>> +    transactional = ReorderBufferSequenceIsTransactional(ctx->reorder,
>> +                                                         target_locator,
>> +                                                         xlrec->created);
> 
> Why re-create this information during decoding, when we basically already have
> it available on the primary? I think we already pay the price for that
> tracking, which we e.g. use for doing a non-transactional truncate:
> 
>         /*
>          * Normally, we need a transaction-safe truncation here.  However, if
>          * the table was either created in the current (sub)transaction or has
>          * a new relfilenumber in the current (sub)transaction, then we can
>          * just truncate it in-place, because a rollback would cause the whole
>          * table or the current physical file to be thrown away anyway.
>          */
>         if (rel->rd_createSubid == mySubid ||
>             rel->rd_newRelfilelocatorSubid == mySubid)
>         {
>             /* Immediate, non-rollbackable truncation is OK */
>             heap_truncate_one_rel(rel);
>         }
> 
> Afaict we could do something similar for sequences, except that I think we
> would just check if the sequence was created in the current transaction
> (i.e. any of the fields are set).
> 

Hmm, good point.

> 
>> +/*
>> + * A transactional sequence increment is queued to be processed upon commit
>> + * and a non-transactional increment gets processed immediately.
>> + *
>> + * A sequence update may be both transactional and non-transactional. When
>> + * created in a running transaction, treat it as transactional and queue
>> + * the change in it. Otherwise treat it as non-transactional, so that we
>> + * don't forget the increment in case of a rollback.
>> + */
>> +void
>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid,
>> +                           Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id,
>> +                           RelFileLocator rlocator, bool transactional, bool created,
>> +                           ReorderBufferTupleBuf *tuplebuf)
> 
> 
>> +        /*
>> +         * Decoding needs access to syscaches et al., which in turn use
>> +         * heavyweight locks and such. Thus we need to have enough state around to
>> +         * keep track of those.  The easiest way is to simply use a transaction
>> +         * internally.  That also allows us to easily enforce that nothing writes
>> +         * to the database by checking for xid assignments.
>> +         *
>> +         * When we're called via the SQL SRF there's already a transaction
>> +         * started, so start an explicit subtransaction there.
>> +         */
>> +        using_subtxn = IsTransactionOrTransactionBlock();
> 
> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only
> does so partially. It's hard to tell whether some of the differences are
> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()?
> 
> Maybe something like
> 
> void
> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals);
> 
> void
> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error);
> 

Thanks for the suggestion, I'll definitely consider that in the next
version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Andres Freund
Date:
Hi,

On 2023-01-11 22:30:42 +0100, Tomas Vondra wrote:
> On 1/11/23 21:58, Andres Freund wrote:
> > If you're thinking of decoding changes in parallel (rather than streaming out
> > large changes before commit when possible), you'd only be able to do that in
> > cases where transactions haven't performed catalog changes, I think. In which
> > case there'd also be no issue wrt transactional sequence changes.
> > 
> 
> Perhaps, although it's not clear to me how you would know that in
> advance? I mean, you could start decoding changes in parallel, and then
> find that one of the earlier transactions touched a catalog.

You could have a running count of in-progress catalog modifying transactions
and not allow parallelized processing when that's not 0.


> But maybe I misunderstand what "decoding" refers to - don't we need the
> snapshot only in reorderbuffer? In which case all the other stuff could
> be parallelized (not sure if that's really expensive).

Calling output functions is pretty expensive, so being able to call those in
parallel has some benefits. But I don't think we're there.


> Anyway, all of this is far out of scope of this patch.

Yea, clearly that's independent work. And I don't think relying on commit
order in one more place, i.e. for sequences, would make it harder.

Greetings,

Andres Freund



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

here's a slightly updated version - the main change is splitting the
patch into multiple parts, along the lines of the original patch
reverted in 2c7ea57e56ca5f668c32d4266e0a3e45b455bef5:

- basic sequence decoding infrastructure
- support in test_decoding
- support in built-in logical replication

The revert mentions a couple of additional parts, but those were mostly
fixes / improvements, and they are not merged into the three parts.


On 1/11/23 22:46, Tomas Vondra wrote:
> 
>>...
>>
>>> +/*
>>> + * Update the sequence state by modifying the existing sequence data row.
>>> + *
>>> + * This keeps the same relfilenode, so the behavior is non-transactional.
>>> + */
>>> +static void
>>> +SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)
>>> +{
>>> ...
>>>
>>> +void
>>> +SetSequence(Oid seq_relid, bool transactional, int64 last_value, int64 log_cnt, bool is_called)
>>> +{
>>> +    if (transactional)
>>> +        SetSequence_transactional(seq_relid, last_value, log_cnt, is_called);
>>> +    else
>>> +        SetSequence_non_transactional(seq_relid, last_value, log_cnt, is_called);
>>> +}
>>
>> That's a lot of duplication with existing code. There's no explanation why
>> SetSequence() as well as do_setval() exists.
>>
> 
> Thanks, I'll look into this.
> 

I haven't done anything about this yet. The functions are doing similar
things, but there's also a fair number of differences, so I haven't
found a good way to merge them yet.

>>
>>>  /*
>>>   * Initialize a sequence's relation with the specified tuple as content
>>>   *
>>> @@ -406,8 +560,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
>>>  
>>>      /* check the comment above nextval_internal()'s equivalent call. */
>>>      if (RelationNeedsWAL(rel))
>>> +    {
>>>          GetTopTransactionId();
>>>  
>>> +        if (XLogLogicalInfoActive())
>>> +            GetCurrentTransactionId();
>>> +    }
>>
>> Is it actually possible to reach this without an xid already having been
>> assigned for the current xact?
>>
> 
> I believe it is. That's probably how I found this change is needed,
> actually.
> 

I've added a comment explaining why this is needed. I don't think it's
worth trying to optimize this, because in plausible workloads we'd just
delay the work a little bit.

>>
>>
>>> @@ -806,10 +966,28 @@ nextval_internal(Oid relid, bool check_permissions)
>>>       * It's sufficient to ensure the toplevel transaction has an xid, no need
>>>       * to assign xids to subxacts, that'll already trigger an appropriate wait.
>>>       * (Have to do that here, so we're outside the critical section)
>>> +     *
>>> +     * We have to ensure we have a proper XID, which will be included in
>>> +     * the XLOG record by XLogRecordAssemble. Otherwise the first nextval()
>>> +     * in a subxact (without any preceding changes) would get XID 0, and it
>>> +     * would then be impossible to decide which top xact it belongs to.
>>> +     * It'd also trigger assert in DecodeSequence. We only do that with
>>> +     * wal_level=logical, though.
>>> +     *
>>> +     * XXX This might seem unnecessary, because if there's no XID the xact
>>> +     * couldn't have done anything important yet, e.g. it could not have
>>> +     * created a sequence. But that's incorrect, because of subxacts. The
>>> +     * current subtransaction might not have done anything yet (thus no XID),
>>> +     * but an earlier one might have created the sequence.
>>>       */
>>
>> What about restricting this to the case you're mentioning,
>> i.e. subtransactions?
>>
> 
> That might work, but I need to think about it a bit.
> 
> I don't think it'd save us much, though. I mean, vast majority of
> transactions (and subtransactions) calling nextval() will then do
> something else which requires a XID. This just moves the XID a bit,
> that's all.
> 

After thinking about this a bit more, I don't think the optimization is
worth it, for the reasons explained above.

>>
>>> +/*
>>> + * Handle sequence decode
>>> + *
>>> + * Decoding sequences is a bit tricky, because while most sequence actions
>>> + * are non-transactional (not subject to rollback), some need to be handled
>>> + * as transactional.
>>> + *
>>> + * By default, a sequence increment is non-transactional - we must not queue
>>> + * it in a transaction as other changes, because the transaction might get
>>> + * rolled back and we'd discard the increment. The downstream would not be
>>> + * notified about the increment, which is wrong.
>>> + *
>>> + * On the other hand, the sequence may be created in a transaction. In this
>>> + * case we *should* queue the change as other changes in the transaction,
>>> + * because we don't want to send the increments for unknown sequence to the
>>> + * plugin - it might get confused about which sequence it's related to etc.
>>> + */
>>> +void
>>> +sequence_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>> +{
>>
>>> +    /* extract the WAL record, with "created" flag */
>>> +    xlrec = (xl_seq_rec *) XLogRecGetData(r);
>>> +
>>> +    /* XXX how could we have sequence change without data? */
>>> +    if (!datalen || !tupledata)
>>> +        return;
>>
>> Yea, I think we should error out here instead, something has gone quite wrong
>> if this happens.
>>
> 
> OK
>

Done.

>>
>>> +    tuplebuf = ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
>>> +    DecodeSeqTuple(tupledata, datalen, tuplebuf);
>>> +
>>> +    /*
>>> +     * Should we handle the sequence increment as transactional or not?
>>> +     *
>>> +     * If the sequence was created in a still-running transaction, treat
>>> +     * it as transactional and queue the increments. Otherwise it needs
>>> +     * to be treated as non-transactional, in which case we send it to
>>> +     * the plugin right away.
>>> +     */
>>> +    transactional = ReorderBufferSequenceIsTransactional(ctx->reorder,
>>> +                                                         target_locator,
>>> +                                                         xlrec->created);
>>
>> Why re-create this information during decoding, when we basically already have
>> it available on the primary? I think we already pay the price for that
>> tracking, which we e.g. use for doing a non-transactional truncate:
>>
>>         /*
>>          * Normally, we need a transaction-safe truncation here.  However, if
>>          * the table was either created in the current (sub)transaction or has
>>          * a new relfilenumber in the current (sub)transaction, then we can
>>          * just truncate it in-place, because a rollback would cause the whole
>>          * table or the current physical file to be thrown away anyway.
>>          */
>>         if (rel->rd_createSubid == mySubid ||
>>             rel->rd_newRelfilelocatorSubid == mySubid)
>>         {
>>             /* Immediate, non-rollbackable truncation is OK */
>>             heap_truncate_one_rel(rel);
>>         }
>>
>> Afaict we could do something similar for sequences, except that I think we
>> would just check if the sequence was created in the current transaction
>> (i.e. any of the fields are set).
>>
> 
> Hmm, good point.
> 

But the rd_createSubid/rd_newRelfilelocatorSubid fields are available only
in the original transaction, not during decoding. So we'd have to do
this check there and add the result to the WAL record. Is that what you
had in mind?

>>
>>> +/*
>>> + * A transactional sequence increment is queued to be processed upon commit
>>> + * and a non-transactional increment gets processed immediately.
>>> + *
>>> + * A sequence update may be both transactional and non-transactional. When
>>> + * created in a running transaction, treat it as transactional and queue
>>> + * the change in it. Otherwise treat it as non-transactional, so that we
>>> + * don't forget the increment in case of a rollback.
>>> + */
>>> +void
>>> +ReorderBufferQueueSequence(ReorderBuffer *rb, TransactionId xid,
>>> +                           Snapshot snapshot, XLogRecPtr lsn, RepOriginId origin_id,
>>> +                           RelFileLocator rlocator, bool transactional, bool created,
>>> +                           ReorderBufferTupleBuf *tuplebuf)
>>
>>
>>> +        /*
>>> +         * Decoding needs access to syscaches et al., which in turn use
>>> +         * heavyweight locks and such. Thus we need to have enough state around to
>>> +         * keep track of those.  The easiest way is to simply use a transaction
>>> +         * internally.  That also allows us to easily enforce that nothing writes
>>> +         * to the database by checking for xid assignments.
>>> +         *
>>> +         * When we're called via the SQL SRF there's already a transaction
>>> +         * started, so start an explicit subtransaction there.
>>> +         */
>>> +        using_subtxn = IsTransactionOrTransactionBlock();
>>
>> This duplicates a lot of the code from ReorderBufferProcessTXN(). But only
>> does so partially. It's hard to tell whether some of the differences are
>> intentional. Could we de-duplicate that code with ReorderBufferProcessTXN()?
>>
>> Maybe something like
>>
>> void
>> ReorderBufferSetupXactEnv(ReorderBufferXactEnv *, bool process_invals);
>>
>> void
>> ReorderBufferTeardownXactEnv(ReorderBufferXactEnv *, bool is_error);
>>
> 
> Thanks for the suggestion, I'll definitely consider that in the next
> version of the patch.

I did look at the code a bit, but I'm not sure there really is a lot of
duplicated code - yes, we start/abort the (sub)transaction, setup and
tear down the snapshot, etc. Or what else would you put into the two new
functions?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
cfbot didn't like the rebased / split patch, and after looking at it I
believe it's a bug in parallel apply of large transactions (216a784829),
which seems to have changed the interpretation of in_remote_transaction and
in_streamed_transaction. I've reported the issue on that thread [1], but
here's a version with a temporary workaround so that we can continue
reviewing it.

regards

[1]
https://www.postgresql.org/message-id/984ff689-adde-9977-affe-cd6029e850be%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
vignesh C
Date:
On Mon, 16 Jan 2023 at 04:49, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> cfbot didn't like the rebased / split patch, and after looking at it I
> believe it's a bug in parallel apply of large transactions (216a784829),
> which seems to have changed interpretation of in_remote_transaction and
> in_streamed_transaction. I've reported the issue on that thread [1], but
> here's a version with a temporary workaround so that we can continue
> reviewing it.
>

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
17e72ec45d313b98bd90b95bc71b4cc77c2c89c3 ===
=== applying patch
./0001-Fix-snapshot-handling-in-logicalmsg_decode-20230116.patch
patching file src/backend/replication/logical/decode.c
patching file src/backend/replication/logical/reorderbuffer.c
=== applying patch ./0002-Logical-decoding-of-sequences-20230116.patch
patching file doc/src/sgml/logicaldecoding.sgml
Hunk #3 FAILED at 483.
Hunk #4 FAILED at 494.
Hunk #7 succeeded at 1252 (offset 4 lines).
2 out of 7 hunks FAILED -- saving rejects to file
doc/src/sgml/logicaldecoding.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_3823.log

Regards,
Vignesh



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

Here's a rebased patch, without the last bit which is now unnecessary
thanks to c981d9145dea.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: logical decoding and replication of sequences, take 2

From:
"Jonathan S. Katz"
Date:
Hi,

On 2/16/23 10:50 AM, Tomas Vondra wrote:
> Hi,
> 
> Here's a rebased patch, without the last bit which is now unnecessary
> thanks to c981d9145dea.

Thanks for continuing to work on this patch! I tested the latest version 
and have some feedback/clarifications.

I did some testing using a demo app based on a real-world app I had 
conjured up [1]. It uses integer sequences as surrogate keys.

In general things seemed to work, but I had a couple of 
observations/questions.

1. Sequence IDs after a "failover". I believe this is a design decision, 
but I noticed that after simulating a failover, the IDs were replicating 
from a higher value, e.g.

INSERT INTO room (name) VALUES ('room 1');
INSERT INTO room (name) VALUES ('room 2');
INSERT INTO room (name) VALUES ('room 3');
INSERT INTO room (name) VALUES ('room 4');

The values of room_id_seq on each instance:

instance 1:

  last_value | log_cnt | is_called
------------+---------+-----------
           4 |      29 | t

  instance 2:

   last_value | log_cnt | is_called
------------+---------+-----------
          33 |       0 | t

After the switchover on instance 2:

INSERT INTO room (name) VALUES ('room 5') RETURNING id;

  id
----
  34

I don't see this as an issue for most applications, but we should at 
least document the behavior somewhere.

2. Using with origin=none with nonconflicting sequences.

I modified the example in [1] to set up two schemas with non-conflicting 
sequences[2], e.g. on instance 1:

CREATE TABLE public.room (
     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1) 
PRIMARY KEY,
     name text NOT NULL
);

and instance 2:

CREATE TABLE public.room (
     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2) 
PRIMARY KEY,
     name text NOT NULL
);

I ran the following on instance 1:

INSERT INTO public.room (name) VALUES ('room 1-e');

This committed and successfully replicated.

However, when I ran the following on instance 2, I received a conflict 
error:

INSERT INTO public.room (name) VALUES ('room 1-w');

The conflict came further down the trigger chain, i.e. from a change in 
the `public.calendar` table:

2023-02-22 01:49:12.293 UTC [87235] ERROR:  duplicate key value violates 
unique constraint "calendar_pkey"
2023-02-22 01:49:12.293 UTC [87235] DETAIL:  Key (id)=(661) already exists.

After futzing with the logging and restarting, I was also able to 
reproduce a similar conflict with the same insert pattern into 'room'.

I did notice that the sequence values kept bouncing around between the 
servers. Without any activity, this is what "SELECT * FROM room_id_seq" 
would return with queries run ~4s apart:

  last_value | log_cnt | is_called
------------+---------+-----------
         131 |       0 | t

  last_value | log_cnt | is_called
------------+---------+-----------
          65 |       0 | t

The values varied more on "calendar". Again, with no additional write 
activity, these numbers kept fluctuating:

  last_value | log_cnt | is_called
------------+---------+-----------
         197 |       0 | t

  last_value | log_cnt | is_called
------------+---------+-----------
         461 |       0 | t

  last_value | log_cnt | is_called
------------+---------+-----------
         263 |       0 | t

  last_value | log_cnt | is_called
------------+---------+-----------
         527 |       0 | t

To handle this case for now, I adapted the schema to create sequences 
that were clearly independently named [3]. I did learn that I had to 
create sequences on both instances to support this behavior, e.g.:

-- instance 1
CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1;
CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2;
CREATE TABLE public.room (
     id int DEFAULT nextval('room_id_1_seq') PRIMARY KEY,
     name text NOT NULL
);

-- instance 2
CREATE SEQUENCE public.room_id_1_seq AS int INCREMENT BY 2 START WITH 1;
CREATE SEQUENCE public.room_id_2_seq AS int INCREMENT BY 2 START WITH 2;
CREATE TABLE public.room (
     id int DEFAULT nextval('room_id_2_seq') PRIMARY KEY,
     name text NOT NULL
);

After building out [3] this did work, but it was more tedious.

Is it possible to support IDENTITY columns (or serial columns) where the 
values of the sequence are set to different intervals on the 
publisher/subscriber?

Thanks,

Jonathan

[1] 
https://github.com/CrunchyData/postgres-realtime-demo/blob/main/examples/demo/demo1.sql
[2] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3
[3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3


Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 2/22/23 03:28, Jonathan S. Katz wrote:
> Hi,
> 
> On 2/16/23 10:50 AM, Tomas Vondra wrote:
>> Hi,
>>
>> Here's a rebased patch, without the last bit which is now unnecessary
>> thanks to c981d9145dea.
> 
> Thanks for continuing to work on this patch! I tested the latest version
> and have some feedback/clarifications.
> 

Thanks!

> I did some testing using a demo-app-based-on-a-real-world app I had
> conjured up[1]. This uses integer sequences as surrogate keys.
> 
> In general things seemed to work, but I had a couple of
> observations/questions.
> 
> 1. Sequence IDs after a "failover". I believe this is a design decision,
> but I noticed that after simulating a failover, the IDs were replicating
> from a higher value, e.g.
> 
> INSERT INTO room (name) VALUES ('room 1');
> INSERT INTO room (name) VALUES ('room 2');
> INSERT INTO room (name) VALUES ('room 3');
> INSERT INTO room (name) VALUES ('room 4');
> 
> The values of room_id_seq on each instance:
> 
> instance 1:
> 
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>           4 |      29 | t
> 
>  instance 2:
> 
>   last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
> 
> After the switchover on instance 2:
> 
> INSERT INTO room (name) VALUES ('room 5') RETURNING id;
> 
>  id
> ----
>  34
> 
> I don't see this as an issue for most applications, but we should at
> least document the behavior somewhere.
> 

Yes, this is due to how we WAL-log sequences. We don't log individual
increments, but every 32nd increment and we log the "future" sequence
state so that after a crash/recovery we don't generate duplicates.

So you do nextval() and it returns 1. But into WAL we record 32. And
there will be no WAL records until nextval reaches 32 and needs to
generate another batch.

And because logical replication relies on these WAL records, it inherits
this batching behavior with a "jump" on recovery/failover. IMHO it's OK,
it works for the "logical failover" use case and if you need gapless
sequences then regular sequences are not an issue anyway.

It's possible to reduce the jump a bit by reducing the batch size (from
32 to 0) so that every increment is logged. But it doesn't eliminate it
because of rollbacks.
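The batching described above can be sketched with a toy model (plain Python, not PostgreSQL source; the constant and field names are borrowed purely for illustration). It reproduces the jump from the example: four nextval() calls on the primary, then a failover landing on the pre-logged "future" value:

```python
# Toy model of sequence WAL batching: nextval() pre-logs a value 32
# increments ahead, and recovery/failover restarts from that future value.

SEQ_LOG_VALS = 32  # every 32nd increment gets a WAL record

class ToySequence:
    def __init__(self):
        self.last_value = 0   # in-memory state on the primary
        self.log_cnt = 0      # increments still covered by the last WAL record
        self.wal_value = 0    # value recorded in the most recent WAL record

    def nextval(self):
        if self.log_cnt == 0:
            # WAL-log a batch: record a value 32 ahead of what we return
            self.wal_value = self.last_value + 1 + SEQ_LOG_VALS
            self.log_cnt = SEQ_LOG_VALS
        self.last_value += 1
        self.log_cnt -= 1
        return self.last_value

    def recover(self):
        # After a crash (or on a subscriber fed by these WAL records)
        # only the pre-logged "future" value is known.
        self.last_value = self.wal_value
        self.log_cnt = 0

seq = ToySequence()
for _ in range(4):
    seq.nextval()      # returns 1..4 on the primary; WAL says 33
seq.recover()
print(seq.nextval())   # first value after failover: 34
```

This matches the observed behavior: four inserts on instance 1, last_value 33 on instance 2, and id 34 for the first insert after switchover.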

> 2. Using with origin=none with nonconflicting sequences.
> 
> I modified the example in [1] to set up two schemas with non-conflicting
> sequences[2], e.g. on instance 1:
> 
> CREATE TABLE public.room (
>     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1)
> PRIMARY KEY,
>     name text NOT NULL
> );
> 
> and instance 2:
> 
> CREATE TABLE public.room (
>     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2)
> PRIMARY KEY,
>     name text NOT NULL
> );
> 

Well, yeah. We don't support active-active logical replication (at least
not with the built-in). You can easily get into similar issues without
sequences.

Replicating a sequence overwrites the state of the sequence on the other
side, which may result in it generating duplicate values with the other
node, etc.
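A toy sketch of that failure mode (an illustrative simplification, not the actual replication code): two nodes using the interleaved odd/even sequence trick, where replicating the sequence state clobbers the other node's position and collapses both nodes onto the same series:

```python
# Why replicating sequence state breaks "non-conflicting" odd/even
# sequences: the replicated last_value overwrites the other node's state.

class Node:
    def __init__(self, start, increment):
        self.last_value = start - increment  # so the first nextval() == start
        self.increment = increment

    def nextval(self):
        self.last_value += self.increment
        return self.last_value

a = Node(start=1, increment=2)   # meant to generate 1, 3, 5, ...
b = Node(start=2, increment=2)   # meant to generate 2, 4, 6, ...

a_vals = [a.nextval(), a.nextval()]   # [1, 3]

# Bidirectional sequence replication copies A's state onto B,
# discarding B's own position:
b.last_value = a.last_value           # B's last_value is now 3

b_vals = [b.nextval(), b.nextval()]   # [5, 7] -- odd values, same series as A
```

After the overwrite, A's next nextval() also returns 5, a duplicate of B's value.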


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
"Jonathan S. Katz"
Date:
On 2/22/23 5:02 AM, Tomas Vondra wrote:
> 
> On 2/22/23 03:28, Jonathan S. Katz wrote:

>> Thanks for continuing to work on this patch! I tested the latest version
>> and have some feedback/clarifications.
>>
> 
> Thanks!

Also I should mention I've been testing with both async/sync logical 
replication. I didn't have any specific comments on either as it seemed 
to just work and behaviors aligned with existing expectations.

Generally it's been a good experience and it seems to be working. :) At 
this point I'm trying to understand the limitations and tripwires so we 
can guide users appropriately.

> Yes, this is due to how we WAL-log sequences. We don't log individual
> increments, but every 32nd increment and we log the "future" sequence
> state so that after a crash/recovery we don't generate duplicates.
> 
> So you do nextval() and it returns 1. But into WAL we record 32. And
> there will be no WAL records until nextval reaches 32 and needs to
> generate another batch.
> 
> And because logical replication relies on these WAL records, it inherits
> this batching behavior with a "jump" on recovery/failover. IMHO it's OK,
> it works for the "logical failover" use case and if you need gapless
> sequences then regular sequences are not an issue anyway.
> 
> It's possible to reduce the jump a bit by reducing the batch size (from
> 32 to 0) so that every increment is logged. But it doesn't eliminate it
> because of rollbacks.

I generally agree. I think it's mainly something we should capture in 
the user docs that there can be a jump on the subscriber side, so people 
are not surprised.

Interestingly, in systems that tend to have higher rates of failover 
(I'm thinking of a few distributed systems), this may cause int4 
sequences to exhaust numbers slightly (marginally?) more quickly. Likely 
not too big of an issue, but something to keep in mind.
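As a back-of-envelope sketch (the failover rate below is an assumption, purely for illustration), even an aggressively flaky setup burns only a negligible slice of the int4 range this way:

```python
# Each failover can skip at most one un-replayed batch (32 values by
# default), so the extra burn rate is tiny relative to the int4 range.

INT4_MAX = 2**31 - 1          # 2147483647
BATCH = 32                    # default increments per sequence WAL record

failovers_per_day = 10        # assumed, deliberately flaky setup
wasted_per_year = failovers_per_day * 365 * BATCH
print(wasted_per_year)                     # 116800 values/year
print(wasted_per_year / INT4_MAX * 100)    # ~0.005% of the int4 range
```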

>> 2. Using with origin=none with nonconflicting sequences.
>>
>> I modified the example in [1] to set up two schemas with non-conflicting
>> sequences[2], e.g. on instance 1:
>>
>> CREATE TABLE public.room (
>>      id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1)
>> PRIMARY KEY,
>>      name text NOT NULL
>> );
>>
>> and instance 2:
>>
>> CREATE TABLE public.room (
>>      id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2)
>> PRIMARY KEY,
>>      name text NOT NULL
>> );
>>
> 
> Well, yeah. We don't support active-active logical replication (at least
> not with the built-in). You can easily get into similar issues without
> sequences.

The "origin=none" feature lets you replicate tables bidirectionally. 
While it's not full "active-active", this is a starting point and a 
feature for v16. We'll definitely have users replicating data 
bidirectionally with this.

> Replicating a sequence overwrites the state of the sequence on the other
> side, which may result in it generating duplicate values with the other
> node, etc.

I understand that we don't currently support global sequences, but I am 
concerned there may be a tripwire here in the origin=none case given 
it's fairly common to use serial/GENERATED BY to set primary keys. And 
it's fairly trivial to set them to be nonconflicting, or at least give 
the user the appearance that they are nonconflicting.

From my high-level understanding of how sequences work, this sounds like 
it would be a lift to support the example in [1]. Or maybe the answer is 
that you can bidirectionally replicate the changes in the tables, but 
not sequences?

In any case, we should update the restrictions in [2] to state: while 
sequences can be replicated, there is additional work required if you 
are bidirectionally replicating tables that use sequences, esp. if used 
in a PK or a constraint. We can provide alternatives to how a user could 
set that up, i.e. not replicating the sequences or doing something like in [3].

Thanks,

Jonathan

[1] https://gist.github.com/jkatz/5c34bf1e401b3376dfe8e627fcd30af3
[2] 
https://www.postgresql.org/docs/devel/logical-replication-restrictions.html
[3] https://gist.github.com/jkatz/1599e467d55abec88ab487d8ac9dc7c3

Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 2/22/23 18:04, Jonathan S. Katz wrote:
> On 2/22/23 5:02 AM, Tomas Vondra wrote:
>>
>> On 2/22/23 03:28, Jonathan S. Katz wrote:
> 
>>> Thanks for continuing to work on this patch! I tested the latest version
>>> and have some feedback/clarifications.
>>>
>>
>> Thanks!
> 
> Also I should mention I've been testing with both async/sync logical
> replication. I didn't have any specific comments on either as it seemed
> to just work and behaviors aligned with existing expectations.
> 
> Generally it's been a good experience and it seems to be working. :) At
> this point I'm trying to understand the limitations and tripwires so we
> can guide users appropriately.
> 

Good to hear.

>> Yes, this is due to how we WAL-log sequences. We don't log individual
>> increments, but every 32nd increment and we log the "future" sequence
>> state so that after a crash/recovery we don't generate duplicates.
>>
>> So you do nextval() and it returns 1. But into WAL we record 32. And
>> there will be no WAL records until nextval reaches 32 and needs to
>> generate another batch.
>>
>> And because logical replication relies on these WAL records, it inherits
>> this batching behavior with a "jump" on recovery/failover. IMHO it's OK,
>> it works for the "logical failover" use case and if you need gapless
>> sequences then regular sequences are not an issue anyway.
>>
>> It's possible to reduce the jump a bit by reducing the batch size (from
>> 32 to 0) so that every increment is logged. But it doesn't eliminate it
>> because of rollbacks.
> 
> I generally agree. I think it's mainly something we should capture in
> the user docs that there can be a jump on the subscriber side, so people
> are not surprised.
> 
> Interestingly, in systems that tend to have higher rates of failover
> (I'm thinking of a few distributed systems), this may cause int4
> sequences to exhaust numbers slightly (marginally?) more quickly. Likely
> not too big of an issue, but something to keep in mind.
> 

IMHO the number of systems that would work fine with int4 sequences but
where this change results in the sequences being "exhausted" too quickly
is indistinguishable from 0. I don't think this is an issue.

>>> 2. Using with origin=none with nonconflicting sequences.
>>>
>>> I modified the example in [1] to set up two schemas with non-conflicting
>>> sequences[2], e.g. on instance 1:
>>>
>>> CREATE TABLE public.room (
>>>      id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1)
>>> PRIMARY KEY,
>>>      name text NOT NULL
>>> );
>>>
>>> and instance 2:
>>>
>>> CREATE TABLE public.room (
>>>      id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2)
>>> PRIMARY KEY,
>>>      name text NOT NULL
>>> );
>>>
>>
>> Well, yeah. We don't support active-active logical replication (at least
>> not with the built-in). You can easily get into similar issues without
>> sequences.
> 
> The "origin=none" feature lets you replicate tables bidirectionally.
> While it's not full "active-active", this is a starting point and a
> feature for v16. We'll definitely have users replicating data
> bidirectionally with this.
> 

Well, then the users need to use some other way to generate IDs, not
local sequences. Either some sort of distributed/global sequence, UUIDs
or something like that.

>> Replicating a sequence overwrites the state of the sequence on the other
>> side, which may result in it generating duplicate values with the other
>> node, etc.
> 
> I understand that we don't currently support global sequences, but I am
> concerned there may be a tripwire here in the origin=none case given
> it's fairly common to use serial/GENERATED BY to set primary keys. And
> it's fairly trivial to set them to be nonconflicting, or at least give
> the user the appearance that they are nonconflicting.
> 
> From my high-level understanding of how sequences work, this sounds like it
> would be a lift to support the example in [1]. Or maybe the answer is
> that you can bidirectionally replicate the changes in the tables, but
> not sequences?
> 

Yes, I don't think local sequences can work in such setups.

> In any case, we should update the restrictions in [2] to state: while
> sequences can be replicated, there is additional work required if you
> are bidirectionally replicating tables that use sequences, esp. if used
> in a PK or a constraint. We can provide alternatives to how a user could
> set that up, i.e. not replicates the sequences or do something like in [3].
> 

I agree. I see this as mostly a documentation issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
"Jonathan S. Katz"
Date:
On 2/23/23 7:56 AM, Tomas Vondra wrote:
> On 2/22/23 18:04, Jonathan S. Katz wrote:
>> On 2/22/23 5:02 AM, Tomas Vondra wrote:
>>>

>> Interestingly, in systems that tend to have higher rates of failover
>> (I'm thinking of a few distributed systems), this may cause int4
>> sequences to exhaust numbers slightly (marginally?) more quickly. Likely
>> not too big of an issue, but something to keep in mind.
>>
> 
> IMHO the number of systems that would work fine with int4 sequences but
> this change results in the sequences being "exhausted" too quickly is
> indistinguishable from 0. I don't think this is an issue.

I agree it's an edge case. I do think it's a number greater than 0, 
having seen some incredibly flaky setups, particularly in distributed 
systems. I would not worry about it, but only mentioned it to try and 
probe edge cases.

>>> Well, yeah. We don't support active-active logical replication (at least
>>> not with the built-in). You can easily get into similar issues without
>>> sequences.
>>
>> The "origin=none" feature lets you replicate tables bidirectionally.
>> While it's not full "active-active", this is a starting point and a
>> feature for v16. We'll definitely have users replicating data
>> bidirectionally with this.
>>
> 
> Well, then the users need to use some other way to generate IDs, not
> local sequences. Either some sort of distributed/global sequence, UUIDs
> or something like that.
[snip]

>> In any case, we should update the restrictions in [2] to state: while
>> sequences can be replicated, there is additional work required if you
>> are bidirectionally replicating tables that use sequences, esp. if used
>> in a PK or a constraint. We can provide alternatives to how a user could
>> set that up, i.e. not replicates the sequences or do something like in [3].
>>
> 
> I agree. I see this as mostly a documentation issue.

Great. I agree that users need other mechanisms to generate IDs, but we 
should ensure we document that. If needed, I'm happy to help with the 
docs here.

Thanks,

Jonathan

Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

here's a rebased patch to make cfbot happy, dropping the first part that
is now unnecessary thanks to 7fe1aa991b.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
John Naylor
Date:

On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> here's a rebased patch to make cfbot happy, dropping the first part that
> is now unnecessary thanks to 7fe1aa991b.

Hi Tomas,

I'm looking into doing some "in situ" testing, but for now I'll mention some minor nits I found:

0001

+ * so we simply do a lookup (the sequence is identified by relfilende). If

relfilenode? Or should it be called a relfilelocator, which is the parameter type? I see some other references to relfilenode in comments and commit message, and I'm not sure which need to be updated.

+ /* XXX Maybe check that we're still in the same top-level xact? */

Any ideas on what should happen here?

+ /* XXX how could we have sequence change without data? */
+ if(!datalen || !tupledata)
+ elog(ERROR, "sequence decode missing tuple data");

Since the ERROR is new based on feedback, we can get rid of XXX I think.

More generally, I associate XXX comments to highlight problems or unpleasantness in the code that don't quite rise to the level of FIXME, but are perhaps more serious than "NB:", "Note:", or "Important:"

+ * When we're called via the SQL SRF there's already a transaction

I see this was copied from existing code, but I found it confusing -- does this function have a stable name?

+ /* Only ever called from ReorderBufferApplySequence, so transational. */

Typo: transactional

0002

I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY -- not sure if it matters, but seems good for completeness.

Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530, which is a reverted commit -- I assume it intends to refer to the content of 0001?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: logical decoding and replication of sequences, take 2

From:
John Naylor
Date:
I tried a couple toy examples with various combinations of use styles.

Three with "automatic" reading from sequences:

create table test(i serial);
create table test(i int GENERATED BY DEFAULT AS IDENTITY);
create table test(i int default nextval('s1'));

...where s1 has some non-default parameters:

CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;

...and then two with explicit use of s1, one inserting the 'nextval' into a table with no default, and one with no table at all, just selecting from the sequence.

The last two seem to work similarly to the first three, so it seems like FOR ALL TABLES adds all sequences as well. Is that expected? The documentation for CREATE PUBLICATION mentions sequence options, but doesn't really say how these options should be used.

Here's the script:

# alter system set wal_level='logical';
# restart
# port 7777 is subscriber

echo
echo "PUB:"
psql -c "drop sequence if exists s1;"
psql -c "drop publication if exists pub1;"

echo
echo "SUB:"
psql -p 7777 -c "drop sequence if exists s1;"
psql -p 7777 -c "drop subscription if exists sub1 ;"

echo
echo "PUB:"
psql -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"

echo
echo "SUB:"
psql -p 7777 -c "CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=john application_name=sub1 port=5432' PUBLICATION pub1;"


echo
echo "PUB:"
psql -c "select nextval('s1');"
psql -c "select nextval('s1');"
psql -c "select * from s1;"

sleep 1

echo
echo "SUB:"
psql -p 7777 -c "select * from s1;"

psql -p 7777 -c "drop subscription sub1 ;"

psql -p 7777 -c "select nextval('s1');"
psql -p 7777 -c "select * from s1;"


...with the last two queries returning

 nextval
---------
      67
(1 row)

 last_value | log_cnt | is_called
------------+---------+-----------
         67 |      32 | t

So, I interpret that the decrement by 32 got logged here.

Also, running

CREATE PUBLICATION pub2 FOR ALL SEQUENCES WITH (publish = 'insert, update, delete, truncate, sequence');

...reports success, but do non-default values of "publish = ..." have an effect (or should they), or are these just ignored? It seems like these cases shouldn't be treated orthogonally.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 3/10/23 11:03, John Naylor wrote:
> 
> On Wed, Mar 1, 2023 at 1:02 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
>> here's a rebased patch to make cfbot happy, dropping the first part that
>> is now unnecessary thanks to 7fe1aa991b.
> 
> Hi Tomas,
> 
> I'm looking into doing some "in situ" testing, but for now I'll mention
> some minor nits I found:
> 
> 0001
> 
> + * so we simply do a lookup (the sequence is identified by relfilende). If
> 
> relfilenode? Or should it be called a relfilelocator, which is the
> parameter type? I see some other references to relfilenode in comments
> and commit message, and I'm not sure which need to be updated.
> 

Yeah, that's a leftover from the original patch, before the relfilenode
was renamed to relfilelocator.

> + /* XXX Maybe check that we're still in the same top-level xact? */
> 
> Any ideas on what should happen here?
> 

I don't recall why I added this comment, but I don't think there's
anything we need to do (so drop the comment).

> + /* XXX how could we have sequence change without data? */
> + if(!datalen || !tupledata)
> + elog(ERROR, "sequence decode missing tuple data");
> 
> Since the ERROR is new based on feedback, we can get rid of XXX I think.
> 
> More generally, I associate XXX comments to highlight problems or
> unpleasantness in the code that don't quite rise to the level of FIXME,
> but are perhaps more serious than "NB:", "Note:", or "Important:"
> 

Understood. I keep adding XXX in places where I have some open
questions, or something that may need to be improved (so kinda less
serious than a FIXME).

> + * When we're called via the SQL SRF there's already a transaction
> 
> I see this was copied from existing code, but I found it confusing --
> does this function have a stable name?
> 

What do you mean by "stable name"? It certainly is not exposed as a
user-callable SQL function, so I think this comment is misleading and
should be removed.

> + /* Only ever called from ReorderBufferApplySequence, so transational. */
> 
> Typo: transactional
> 
> 0002
> 
> I see a few SERIAL types in the tests but no GENERATED ... AS IDENTITY
> -- not sure if it matters, but seems good for completeness.
> 

That's a good point. Adding tests for GENERATED ... AS IDENTITY is a
good idea.

> Reminder for later: Patches 0002 and 0003 still refer to 0da92dc530,
> which is a reverted commit -- I assume it intends to refer to the
> content of 0001?
> 

Correct. That needs to be adjusted at commit time.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 3/14/23 08:30, John Naylor wrote:
> I tried a couple toy examples with various combinations of use styles.
> 
> Three with "automatic" reading from sequences:
> 
> create table test(i serial);
> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> create table test(i int default nextval('s1'));
> 
> ...where s1 has some non-default parameters:
> 
> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> 
> ...and then two with explicit use of s1, one inserting the 'nextval'
> into a table with no default, and one with no table at all, just
> selecting from the sequence.
> 
> The last two seem to work similarly to the first three, so it seems like
> FOR ALL TABLES adds all sequences as well. Is that expected?

Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
the sequence is actually added to the publication. I tracked this down
to a thinko in get_rel_sync_entry() which failed to check the object
type when puballtables or puballsequences was set.

Attached is a patch fixing this.

> The documentation for CREATE PUBLICATION mentions sequence options,
> but doesn't really say how these options should be used.

Good point. The idea is that we handle tables and sequences the same
way, i.e. if you specify 'sequence' then we'll replicate increments for
sequences explicitly added to the publication.

If this is not clear, the docs may need some improvements.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Masahiko Sawada
Date:
Hi,

On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > Three with "automatic" reading from sequences:
> >
> > create table test(i serial);
> > create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> > create table test(i int default nextval('s1'));
> >
> > ...where s1 has some non-default parameters:
> >
> > CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >
> > ...and then two with explicit use of s1, one inserting the 'nextval'
> > into a table with no default, and one with no table at all, just
> > selecting from the sequence.
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.
>
> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.
>

I'm late to this thread, but I have some questions and review comments.

Regarding sequence logical replication, it seems that changes to a
sequence created after CREATE SUBSCRIPTION are applied on the
subscriber even without a REFRESH PUBLICATION command on the
subscriber, which is different behavior than for tables. For example,
I set up both publisher and subscriber as follows:

1. On publisher
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551'
publication test_pub; -- port=5551 is the publisher

3. On publisher
create sequence s1;
select nextval('s1');

I got the error "ERROR:  relation "public.s1" does not exist on the
subscriber". Probably we need to do should_apply_changes_for_rel()
check in apply_handle_sequence().

If my understanding is correct, is there any case where the subscriber
needs to apply transactional sequence changes? The commit message of
0001 patch says:

    * Changes for sequences created in the same top-level transaction are
      treated as transactional, i.e. just like any other change from that
      transaction, and discarded in case of a rollback.

IIUC such sequences are not visible to the subscriber, so it cannot
subscribe to them until the commit.

---
I got an assertion failure. The reproducible steps are:

1. On publisher
alter system set logical_replication_mode = 'immediate';
select pg_reload_conf();
create publication test_pub for all sequences;

2. On subscriber
create subscription test_sub connection 'dbname=postgres port=5551'
publication test_pub with (streaming='parallel');

3. On publisher
begin;
create table bar (c int, d serial);
insert into bar(c) values (100);
commit;

I got the following assertion failure:

TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"),
File: "worker.c", Line: 1458, PID: 508056
postgres: logical replication parallel apply worker for subscription
16388 (ExceptionalCondition+0x9e)[0xb6c0af]
postgres: logical replication parallel apply worker for subscription
16388 [0x92f7fe]
postgres: logical replication parallel apply worker for subscription
16388 (apply_dispatch+0xed)[0x932925]
postgres: logical replication parallel apply worker for subscription
16388 [0x90d927]
postgres: logical replication parallel apply worker for subscription
16388 (ParallelApplyWorkerMain+0x34f)[0x90dd8d]
postgres: logical replication parallel apply worker for subscription
16388 (StartBackgroundWorker+0x1f3)[0x8e7b19]
postgres: logical replication parallel apply worker for subscription
16388 [0x8f1798]
postgres: logical replication parallel apply worker for subscription
16388 [0x8f1b53]
postgres: logical replication parallel apply worker for subscription
16388 [0x8f0bed]
postgres: logical replication parallel apply worker for subscription
16388 [0x8ecca4]
postgres: logical replication parallel apply worker for subscription
16388 (PostmasterMain+0x1246)[0x8ec6d7]
postgres: logical replication parallel apply worker for subscription
16388 [0x7bbe5c]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f69094cbcf3]
postgres: logical replication parallel apply worker for subscription
16388 (_start+0x2e)[0x49d15e]
2023-03-16 12:33:19.471 JST [507974] LOG:  background worker "logical
replication parallel worker" (PID 508056) was terminated by signal 6:
Aborted

seq.transactional is true and in_remote_transaction is false. It might
be an issue of the parallel apply feature rather than this patch.

---
There is no documentation about the new 'sequence' value of the
publish option in CREATE/ALTER PUBLICATION. It seems to be possible to
specify something like "CREATE PUBLICATION ... FOR ALL SEQUENCES WITH
(publish = 'truncate')" (i.e., not specifying 'sequence' value in the
publish option). How does logical replication work with this setting?
Nothing is replicated?

---
It seems that sequence replication doesn't work well together with the
ALTER SUBSCRIPTION ... SKIP command. IIUC, sequence changes are not
skipped even if they are transactional changes. The reproducible
steps are:

1. On both nodes
create table a (c int primary key);

2. On publisher
create publication hoge_pub for all sequences, tables

3. On subscriber
create subscription hoge_sub connection 'dbname=postgres port=5551'
publication hoge_pub;
insert into a values (1);

4. On publisher
begin;
create sequence s2;
insert into a values (nextval('s2'));
commit;

At step 4, applying INSERT conflicts with the existing row on the
subscriber. If I skip this transaction using ALTER SUBSCRIPTION ...
SKIP command, I got:

ERROR:  relation "public.s2" does not exist
CONTEXT:  processing remote data for replication origin "pg_16390"
during message type "BEGIN" in transaction 734, finished at 0/1751698

If I create the sequence s2 in advance on the subscriber, the sequence
change is applied on the subscriber.

If the subscriber doesn't need to apply transactional sequence changes
in the first place, this problem will disappear.

---
There are two typos in 0001 patch:

In the commit message:

   ensure the sequence record has a valid XID - until now the the increment

s/the the/the/

And,

+   /* Only ever called from ReorderBufferApplySequence, so transational. */

s/transational/transactional/

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Thu, Mar 16, 2023 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> >
> >
> > On 3/14/23 08:30, John Naylor wrote:
> ---
> I got an assertion failure. The reproducible steps are:
>
> 1. On publisher
> alter system set logical_replication_mode = 'immediate';
> select pg_reload_conf();
> create publication test_pub for all sequences;
>
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub with (streaming='parallel')
>
> 3. On publisher
> begin;
> create table bar (c int, d serial);
> insert into bar(c) values (100);
> commit;
>
> I got the following assertion failure:
>
> TRAP: failed Assert("(!seq.transactional) || in_remote_transaction"),
...
>
> seq.transactional is true and in_remote_transaction is false. It might
> be an issue of the parallel apply feature rather than this patch.
>

During parallel apply we didn't need to rely on in_remote_transaction,
so it was not set. I haven't checked the patch in detail but am
wondering: isn't it sufficient to instead check IsTransactionState()
and/or IsTransactionOrTransactionBlock()?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi!

On 3/16/23 08:38, Masahiko Sawada wrote:
> Hi,
> 
> On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>> On 3/14/23 08:30, John Naylor wrote:
>>> I tried a couple toy examples with various combinations of use styles.
>>>
>>> Three with "automatic" reading from sequences:
>>>
>>> create table test(i serial);
>>> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
>>> create table test(i int default nextval('s1'));
>>>
>>> ...where s1 has some non-default parameters:
>>>
>>> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
>>>
>>> ...and then two with explicit use of s1, one inserting the 'nextval'
>>> into a table with no default, and one with no table at all, just
>>> selecting from the sequence.
>>>
>>> The last two seem to work similarly to the first three, so it seems like
>>> FOR ALL TABLES adds all sequences as well. Is that expected?
>>
>> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
>> the sequence is actually added to the publication. I tracked this down
>> to a thinko in get_rel_sync_entry() which failed to check the object
>> type when puballtables or puballsequences was set.
>>
>> Attached is a patch fixing this.
>>
>>> The documentation for CREATE PUBLICATION mentions sequence options,
>>> but doesn't really say how these options should be used.
>> Good point. The idea is that we handle tables and sequences the same
>> way, i.e. if you specify 'sequence' then we'll replicate increments for
>> sequences explicitly added to the publication.
>>
>> If this is not clear, the docs may need some improvements.
>>
> 
> I'm late to this thread, but I have some questions and review comments.
> 
> Regarding sequence logical replication, it seems that changes of
> sequence created after CREATE SUBSCRIPTION are applied on the
> subscriber even without REFRESH PUBLICATION command on the subscriber.
> Which is a different behavior than tables. For example, I set both
> publisher and subscriber as follows:
> 
> 1. On publisher
> create publication test_pub for all sequences;
> 
> 2. On subscriber
> create subscription test_sub connection 'dbname=postgres port=5551'
> publication test_pub; -- port=5551 is the publisher
> 
> 3. On publisher
> create sequence s1;
> select nextval('s1');
> 
> I got the error "ERROR:  relation "public.s1" does not exist on the
> subscriber". Probably we need to do should_apply_changes_for_rel()
> check in apply_handle_sequence().
> 

Yes, you're right - the sequence handling should have been calling the
should_apply_changes_for_rel() etc.

The attached 0005 patch should fix that - I still need to test it a bit
more and maybe clean it up a bit, but hopefully it'll allow you to
continue the review.

I had to tweak the protocol a bit, so that this uses the same cache as
tables. I wonder if maybe we should make it even more similar, by
essentially treating sequences as tables with (last_value, log_cnt,
called) columns.

> If my understanding is correct, is there any case where the subscriber
> needs to apply transactional sequence changes? The commit message of
> 0001 patch says:
> 
>     * Changes for sequences created in the same top-level transaction are
>       treated as transactional, i.e. just like any other change from that
>       transaction, and discarded in case of a rollback.
> 
> IIUC such sequences are not visible to the subscriber, so it cannot
> subscribe to them until the commit.
> 

The comment is slightly misleading, as it talks about creation of
sequences, but it should be talking about relfilenodes. For example, if
you create a sequence, add it to publication, and then in a later
transaction you do

   ALTER SEQUENCE x RESTART

or something else that creates a new relfilenode, then the subsequent
increments are visible only in that transaction. We still need to
apply those on the subscriber, but only as part of the transaction,
because it might roll back.
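
To make the distinction concrete, here is a minimal sketch of the decision rule described above (the names and data structures are hypothetical, not the actual patch code):

```python
# Hedged sketch: a decoded sequence change is transactional iff the
# sequence's relfilenode was created by the same (still in-progress)
# top-level transaction, e.g. by CREATE SEQUENCE or ALTER SEQUENCE ...
# RESTART. Transactional changes are queued with the transaction and
# discarded on rollback; all other increments apply immediately.

def sequence_change_is_transactional(relfilenode, relfilenodes_created_in_xact):
    return relfilenode in relfilenodes_created_in_xact

# Example: the transaction ran ALTER SEQUENCE x RESTART, creating
# relfilenode 1002 (numbers are illustrative).
created_in_xact = {1002}
assert sequence_change_is_transactional(1002, created_in_xact)      # queue with xact
assert not sequence_change_is_transactional(1001, created_in_xact)  # apply immediately
```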

> ---
> I got an assertion failure. The reproducible steps are:
> 

I do believe this was due to a thinko in apply_handle_sequence, which
sometimes started a transaction and didn't terminate it correctly. I've
changed it to use begin_replication_step() etc., and it seems to be
working fine now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
vignesh C
Date:
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > Hi,
> >
> > On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >>
> >>
> >> On 3/14/23 08:30, John Naylor wrote:
> >>> I tried a couple toy examples with various combinations of use styles.
> >>>
> >>> Three with "automatic" reading from sequences:
> >>>
> >>> create table test(i serial);
> >>> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> >>> create table test(i int default nextval('s1'));
> >>>
> >>> ...where s1 has some non-default parameters:
> >>>
> >>> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >>>
> >>> ...and then two with explicit use of s1, one inserting the 'nextval'
> >>> into a table with no default, and one with no table at all, just
> >>> selecting from the sequence.
> >>>
> >>> The last two seem to work similarly to the first three, so it seems like
> >>> FOR ALL TABLES adds all sequences as well. Is that expected?
> >>
> >> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> >> the sequence is actually added to the publication. I tracked this down
> >> to a thinko in get_rel_sync_entry() which failed to check the object
> >> type when puballtables or puballsequences was set.
> >>
> >> Attached is a patch fixing this.
> >>
> >>> The documentation for CREATE PUBLICATION mentions sequence options,
> >>> but doesn't really say how these options should be used.
> >> Good point. The idea is that we handle tables and sequences the same
> >> way, i.e. if you specify 'sequence' then we'll replicate increments for
> >> sequences explicitly added to the publication.
> >>
> >> If this is not clear, the docs may need some improvements.
> >>
> >
> > I'm late to this thread, but I have some questions and review comments.
> >
> > Regarding sequence logical replication, it seems that changes of
> > sequence created after CREATE SUBSCRIPTION are applied on the
> > subscriber even without REFRESH PUBLICATION command on the subscriber.
> > Which is a different behavior than tables. For example, I set both
> > publisher and subscriber as follows:
> >
> > 1. On publisher
> > create publication test_pub for all sequences;
> >
> > 2. On subscriber
> > create subscription test_sub connection 'dbname=postgres port=5551'
> > publication test_pub; -- port=5551 is the publisher
> >
> > 3. On publisher
> > create sequence s1;
> > select nextval('s1');
> >
> > I got the error "ERROR:  relation "public.s1" does not exist on the
> > subscriber". Probably we need to do should_apply_changes_for_rel()
> > check in apply_handle_sequence().
> >
>
> Yes, you're right - the sequence handling should have been calling the
> should_apply_changes_for_rel() etc.
>
> The attached 0005 patch should fix that - I still need to test it a bit
> more and maybe clean it up a bit, but hopefully it'll allow you to
> continue the review.
>
> I had to tweak the protocol a bit, so that this uses the same cache as
> tables. I wonder if maybe we should make it even more similar, by
> essentially treating sequences as tables with (last_value, log_cnt,
> called) columns.
>
> > If my understanding is correct, is there any case where the subscriber
> > needs to apply transactional sequence changes? The commit message of
> > 0001 patch says:
> >
> >     * Changes for sequences created in the same top-level transaction are
> >       treated as transactional, i.e. just like any other change from that
> >       transaction, and discarded in case of a rollback.
> >
> > IIUC such sequences are not visible to the subscriber, so it cannot
> > subscribe to them until the commit.
> >
>
> The comment is slightly misleading, as it talks about creation of
> sequences, but it should be talking about relfilenodes. For example, if
> you create a sequence, add it to publication, and then in a later
> transaction you do
>
>    ALTER SEQUENCE x RESTART
>
> or something else that creates a new relfilenode, then the subsequent
> increments are visible only in that transaction. But we still need to
> apply those on the subscriber, but only as part of the transaction,
> because it might roll back.
>
> > ---
> > I got an assertion failure. The reproducible steps are:
> >
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

One of the patches does not apply on HEAD because of a recent commit;
we might have to rebase the patch:
git am 0005-fixup-syncing-refresh-sequences-20230316.patch
Applying: fixup syncing/refresh sequences
error: patch failed: src/backend/replication/pgoutput/pgoutput.c:711
error: src/backend/replication/pgoutput/pgoutput.c: patch does not apply
Patch failed at 0001 fixup syncing/refresh sequences

Regards,
Vignesh



Re: logical decoding and replication of sequences, take 2

From:
John Naylor
Date:
On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/14/23 08:30, John Naylor wrote:
> > I tried a couple toy examples with various combinations of use styles.
> >
> > Three with "automatic" reading from sequences:
> >
> > create table test(i serial);
> > create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> > create table test(i int default nextval('s1'));
> >
> > ...where s1 has some non-default parameters:
> >
> > CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >
> > ...and then two with explicit use of s1, one inserting the 'nextval'
> > into a table with no default, and one with no table at all, just
> > selecting from the sequence.
> >
> > The last two seem to work similarly to the first three, so it seems like
> > FOR ALL TABLES adds all sequences as well. Is that expected?
>
> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> the sequence is actually added to the publication. I tracked this down
> to a thinko in get_rel_sync_entry() which failed to check the object
> type when puballtables or puballsequences was set.
>
> Attached is a patch fixing this.

Okay, I can verify that with 0001-0006, sequences don't replicate unless specified. I do see an additional change that doesn't make sense: on the subscriber I no longer see a jump to the logged 32-value increment; I see the very next value:

# alter system set wal_level='logical';
# port 7777 is subscriber

echo
echo "PUB:"
psql -c "drop table if exists test;"
psql -c "drop publication if exists pub1;"

echo
echo "SUB:"
psql -p 7777 -c "drop table if exists test;"
psql -p 7777 -c "drop subscription if exists sub1 ;"

echo
echo "PUB:"
psql -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
psql -c "CREATE PUBLICATION pub2 FOR ALL SEQUENCES;"

echo
echo "SUB:"
psql -p 7777 -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost dbname=postgres application_name=sub1 port=5432' PUBLICATION pub1;"
psql -p 7777 -c "CREATE SUBSCRIPTION sub2 CONNECTION 'host=localhost dbname=postgres application_name=sub2 port=5432' PUBLICATION pub2;"

echo
echo "PUB:"
psql -c "insert into test default values;"
psql -c "insert into test default values;"
psql -c "select * from test;"
psql -c "select * from test_i_seq;"

sleep 1

echo
echo "SUB:"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"

psql -p 7777 -c "drop subscription sub1 ;"
psql -p 7777 -c "drop subscription sub2 ;"

psql -p 7777 -c "insert into test default values;"
psql -p 7777 -c "select * from test;"
psql -p 7777 -c "select * from test_i_seq;"

The last two queries on the subscriber show:

 i
---
 1
 2
 3
(3 rows)

 last_value | log_cnt | is_called
------------+---------+-----------
          3 |      30 | t
(1 row)

...whereas before with 0001-0003 I saw:

 i  
----
  1
  2
 34
(3 rows)

 last_value | log_cnt | is_called
------------+---------+-----------
         34 |      32 | t

> > The documentation for CREATE PUBLICATION mentions sequence options,
> > but doesn't really say how these options should be used.
> Good point. The idea is that we handle tables and sequences the same
> way, i.e. if you specify 'sequence' then we'll replicate increments for
> sequences explicitly added to the publication.
>
> If this is not clear, the docs may need some improvements.

Aside from docs, I'm not clear what some of the tests are doing:

+CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH (publish = 'sequence');
+RESET client_min_messages;
+ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert, sequence');

What does it mean to add 'insert' to a sequence publication?

Likewise, from a brief change in my test above, 'sequence' seems to be a noise word for table publications. I'm not fully read up on the background of this topic, but wanted to make sure I understood the design of the syntax.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: logical decoding and replication of sequences, take 2

From:
John Naylor
Date:
On Wed, Mar 15, 2023 at 7:00 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/10/23 11:03, John Naylor wrote:

> > + * When we're called via the SQL SRF there's already a transaction
> >
> > I see this was copied from existing code, but I found it confusing --
> > does this function have a stable name?
>
What do you mean by "stable name"? It certainly is not exposed as a
user-callable SQL function, so I think this comment is misleading and
should be removed.

Okay, I was just trying to think of why it was phrased this way...

--

Re: logical decoding and replication of sequences, take 2

From:
vignesh C
Date:
On Thu, 16 Mar 2023 at 21:55, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi!
>
> On 3/16/23 08:38, Masahiko Sawada wrote:
> > Hi,
> >
> > On Wed, Mar 15, 2023 at 9:52 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >>
> >>
> >> On 3/14/23 08:30, John Naylor wrote:
> >>> I tried a couple toy examples with various combinations of use styles.
> >>>
> >>> Three with "automatic" reading from sequences:
> >>>
> >>> create table test(i serial);
> >>> create table test(i int GENERATED BY DEFAULT AS IDENTITY);
> >>> create table test(i int default nextval('s1'));
> >>>
> >>> ...where s1 has some non-default parameters:
> >>>
> >>> CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
> >>>
> >>> ...and then two with explicit use of s1, one inserting the 'nextval'
> >>> into a table with no default, and one with no table at all, just
> >>> selecting from the sequence.
> >>>
> >>> The last two seem to work similarly to the first three, so it seems like
> >>> FOR ALL TABLES adds all sequences as well. Is that expected?
> >>
> >> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
> >> the sequence is actually added to the publication. I tracked this down
> >> to a thinko in get_rel_sync_entry() which failed to check the object
> >> type when puballtables or puballsequences was set.
> >>
> >> Attached is a patch fixing this.
> >>
> >>> The documentation for CREATE PUBLICATION mentions sequence options,
> >>> but doesn't really say how these options should be used.
> >> Good point. The idea is that we handle tables and sequences the same
> >> way, i.e. if you specify 'sequence' then we'll replicate increments for
> >> sequences explicitly added to the publication.
> >>
> >> If this is not clear, the docs may need some improvements.
> >>
> >
> > I'm late to this thread, but I have some questions and review comments.
> >
> > Regarding sequence logical replication, it seems that changes of
> > sequence created after CREATE SUBSCRIPTION are applied on the
> > subscriber even without REFRESH PUBLICATION command on the subscriber.
> > Which is a different behavior than tables. For example, I set both
> > publisher and subscriber as follows:
> >
> > 1. On publisher
> > create publication test_pub for all sequences;
> >
> > 2. On subscriber
> > create subscription test_sub connection 'dbname=postgres port=5551'
> > publication test_pub; -- port=5551 is the publisher
> >
> > 3. On publisher
> > create sequence s1;
> > select nextval('s1');
> >
> > I got the error "ERROR:  relation "public.s1" does not exist on the
> > subscriber". Probably we need to do should_apply_changes_for_rel()
> > check in apply_handle_sequence().
> >
>
> Yes, you're right - the sequence handling should have been calling the
> should_apply_changes_for_rel() etc.
>
> The attached 0005 patch should fix that - I still need to test it a bit
> more and maybe clean it up a bit, but hopefully it'll allow you to
> continue the review.
>
> I had to tweak the protocol a bit, so that this uses the same cache as
> tables. I wonder if maybe we should make it even more similar, by
> essentially treating sequences as tables with (last_value, log_cnt,
> called) columns.
>
> > If my understanding is correct, is there any case where the subscriber
> > needs to apply transactional sequence changes? The commit message of
> > 0001 patch says:
> >
> >     * Changes for sequences created in the same top-level transaction are
> >       treated as transactional, i.e. just like any other change from that
> >       transaction, and discarded in case of a rollback.
> >
> > IIUC such sequences are not visible to the subscriber, so it cannot
> > subscribe to them until the commit.
> >
>
> The comment is slightly misleading, as it talks about creation of
> sequences, but it should be talking about relfilenodes. For example, if
> you create a sequence, add it to publication, and then in a later
> transaction you do
>
>    ALTER SEQUENCE x RESTART
>
> or something else that creates a new relfilenode, then the subsequent
> increments are visible only in that transaction. But we still need to
> apply those on the subscriber, but only as part of the transaction,
> because it might roll back.
>
> > ---
> > I got an assertion failure. The reproducible steps are:
> >
>
> I do believe this was due to a thinko in apply_handle_sequence, which
> sometimes started transaction and didn't terminate it correctly. I've
> changed it to use the begin_replication_step() etc. and it seems to be
> working fine now.

A few comments:
1) One of the tests is failing for me; I have also seen the same
failure in CFBOT at [1]:
#   Failed test 'create sequence, advance it in rolled-back
transaction, but commit the create'
#   at t/030_sequences.pl line 152.
#          got: '1|0|f'
#     expected: '132|0|t'
t/030_sequences.pl ................. 5/? ?
#   Failed test 'advance the new sequence in a transaction and roll it back'
#   at t/030_sequences.pl line 175.
#          got: '1|0|f'
#     expected: '231|0|t'

#   Failed test 'advance sequence in a subtransaction'
#   at t/030_sequences.pl line 198.
#          got: '1|0|f'
#     expected: '330|0|t'
# Looks like you failed 3 tests of 6.

2) We could replace the below:
$node_publisher->wait_for_catchup('seq_sub');

# Wait for initial sync to finish as well
my $synced_query =
  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('s', 'r');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

with:
$node_subscriber->wait_for_subscription_sync;

3) We could change 030_sequences to 033_sequences.pl as 030 is already used:
diff --git a/src/test/subscription/t/030_sequences.pl
b/src/test/subscription/t/030_sequences.pl
new file mode 100644
index 00000000000..9ae3c03d7d1
--- /dev/null
+++ b/src/test/subscription/t/030_sequences.pl

4) Copyright year should be changed to 2023:
@@ -0,0 +1,202 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# This tests that sequences are replicated correctly by logical replication
+use strict;
+use warnings;

[1] - https://cirrus-ci.com/task/5032679352041472

Regards,
Vignesh



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 3/17/23 06:53, John Naylor wrote:
> On Wed, Mar 15, 2023 at 7:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
>>
>>
>>
>> On 3/14/23 08:30, John Naylor wrote:
>> > I tried a couple toy examples with various combinations of use styles.
>> >
>> > Three with "automatic" reading from sequences:
>> >
>> > create table test(i serial);
>> > create table test(i int GENERATED BY DEFAULT AS IDENTITY);
>> > create table test(i int default nextval('s1'));
>> >
>> > ...where s1 has some non-default parameters:
>> >
>> > CREATE SEQUENCE s1 START 100 MAXVALUE 100 INCREMENT BY -1;
>> >
>> > ...and then two with explicit use of s1, one inserting the 'nextval'
>> > into a table with no default, and one with no table at all, just
>> > selecting from the sequence.
>> >
>> > The last two seem to work similarly to the first three, so it seems like
>> > FOR ALL TABLES adds all sequences as well. Is that expected?
>>
>> Yeah, that's a bug - we shouldn't replicate the sequence changes, unless
>> the sequence is actually added to the publication. I tracked this down
>> to a thinko in get_rel_sync_entry() which failed to check the object
>> type when puballtables or puballsequences was set.
>>
>> Attached is a patch fixing this.
> 
> Okay, I can verify that with 0001-0006, sequences don't replicate unless
> specified. I do see an additional change that doesn't make sense: On the
> subscriber I no longer see a jump to the logged 32 increment, I see the
> very next value:
> 
> # alter system set wal_level='logical';
> # port 7777 is subscriber
> 
> echo
> echo "PUB:"
> psql -c "drop table if exists test;"
> psql -c "drop publication if exists pub1;"
> 
> echo
> echo "SUB:"
> psql -p 7777 -c "drop table if exists test;"
> psql -p 7777 -c "drop subscription if exists sub1 ;"
> 
> echo
> echo "PUB:"
> psql -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
> psql -c "CREATE PUBLICATION pub1 FOR ALL TABLES;"
> psql -c "CREATE PUBLICATION pub2 FOR ALL SEQUENCES;"
> 
> echo
> echo "SUB:"
> psql -p 7777 -c "create table test(i int GENERATED BY DEFAULT AS IDENTITY);"
> psql -p 7777 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost
> dbname=postgres application_name=sub1 port=5432' PUBLICATION pub1;"
> psql -p 7777 -c "CREATE SUBSCRIPTION sub2 CONNECTION 'host=localhost
> dbname=postgres application_name=sub2 port=5432' PUBLICATION pub2;"
> 
> echo
> echo "PUB:"
> psql -c "insert into test default values;"
> psql -c "insert into test default values;"
> psql -c "select * from test;"
> psql -c "select * from test_i_seq;"
> 
> sleep 1
> 
> echo
> echo "SUB:"
> psql -p 7777 -c "select * from test;"
> psql -p 7777 -c "select * from test_i_seq;"
> 
> psql -p 7777 -c "drop subscription sub1 ;"
> psql -p 7777 -c "drop subscription sub2 ;"
> 
> psql -p 7777 -c "insert into test default values;"
> psql -p 7777 -c "select * from test;"
> psql -p 7777 -c "select * from test_i_seq;"
> 
> The last two queries on the subscriber show:
> 
>  i
> ---
>  1
>  2
>  3
> (3 rows)
> 
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>           3 |      30 | t
> (1 row)
> 
> ...whereas before with 0001-0003 I saw:
> 
>  i  
> ----
>   1
>   2
>  34
> (3 rows)
> 
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          34 |      32 | t
> 

Oh, this is a silly thinko in how sequences are synced at the beginning
(or maybe a combination of two issues).

fetch_sequence_data() simply runs a select from the sequence

    SELECT last_value, log_cnt, is_called

but that's wrong, because that's the *current* state of the sequence at
the moment it's initially synced. To make this "correct" with respect
to the decoding, we'd need to deduce what the last WAL record was, so
something like

    last_value += log_cnt + 1

That should produce 34 again.
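
A sketch of that adjustment (the interpretation of log_cnt here follows the discussion above and assumes an increment-by-1 sequence; it is not verified against the server code):

```python
# Hedged sketch of the proposed fetch_sequence_data() fix: rather than
# using the value read by the SELECT directly, deduce the position implied
# by the last WAL record for the sequence. Assumption (from the discussion):
# log_cnt values are still prefetched and not yet WAL-logged, so the
# WAL-consistent value is last_value + log_cnt + 1.

def synced_last_value(last_value, log_cnt):
    return last_value + log_cnt + 1

# Subscriber state from the example above: (last_value=3, log_cnt=30)
assert synced_last_value(3, 30) == 34   # matches the 34 seen with 0001-0003
```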

FWIW the older patch has this issue too, I believe the difference is
merely due to a slightly different timing between the sync and decoding
the first insert. If you insert a sleep after the CREATE SUBSCRIPTION
commands, it should disappear.


This however made me realize the initial sync of sequences may not be
correct. The idea of tablesync is to sync the data in a REPEATABLE READ
transaction and then apply the decoded changes. But sequences are not
transactional in this way - if you select from a sequence, you'll
always see the latest data, even in REPEATABLE READ.

I wonder if this might result in losing some of the sequence increments,
and/or applying them in the wrong order (so that the sequence goes
backward for a while).


>> > The documentation for CREATE PUBLICATION mentions sequence options,
>> > but doesn't really say how these options should be used.
>> Good point. The idea is that we handle tables and sequences the same
>> way, i.e. if you specify 'sequence' then we'll replicate increments for
>> sequences explicitly added to the publication.
>>
>> If this is not clear, the docs may need some improvements.
> 
> Aside from docs, I'm not clear what some of the tests are doing:
> 
> +CREATE PUBLICATION testpub_forallsequences FOR ALL SEQUENCES WITH
> (publish = 'sequence');
> +RESET client_min_messages;
> +ALTER PUBLICATION testpub_forallsequences SET (publish = 'insert,
> sequence');
> 
> What does it mean to add 'insert' to a sequence publication?
> 

I don't recall why this particular test exists, but you can still add
tables to a "for all sequences" publication. IMO it's fine to allow adding
actions that are irrelevant for the currently published objects; we don't
have a cross-check to prevent that (how would you even do that, e.g. for
FOR ALL TABLES publications?).

> Likewise, from a brief change in my test above, 'sequence' seems to be a
> noise word for table publications. I'm not fully read up on the
> background of this topic, but wanted to make sure I understood the
> design of the syntax.
> 

I think it's fine, for the same reason as above.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 3/17/23 18:55, Tomas Vondra wrote:
> 
> ...
> 
> This however made me realize the initial sync of sequences may not be
> correct. I mean, the idea of tablesync is syncing the data in REPEATABLE
> READ transaction, and then applying decoded changes. But sequences are
> not transactional in this way - if you select from a sequence, you'll
> always see the latest data, even in REPEATABLE READ.
> 
> I wonder if this might result in losing some of the sequence increments,
> and/or applying them in the wrong order (so that the sequence goes
> backward for a while).
> 

Yeah, I think my suspicion was warranted - it's pretty easy to make the
sequence go backwards for a while by adding a sleep between the slot
creation and the copy_sequence() call, and increment the sequence in
between (enough to do some WAL logging).

The copy_sequence() then reads the current on-disk state (because of the
non-transactional nature w.r.t. REPEATABLE READ), applies it, and then
we start processing the WAL added since the slot creation. But those are
older, so stuff like this happens:

    21:52:54.147 CET [35404] WARNING:  copy_sequence 1222 0 1
    21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 990 0 1
    21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 1023 0 1
    21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 1056 0 1
    21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1089 0 1
    21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1122 0 1
    21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1155 0 1
    21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1188 0 1
    21:52:54.175 CET [35404] WARNING:  apply_handle_sequence 1221 0 1
    21:52:54.898 CET [35402] WARNING:  apply_handle_sequence 1254 0 1

Clearly, for sequences we can't quite rely on snapshots/slots, we need
to get the LSN to decide what changes to apply/skip from somewhere else.
I wonder if we can just ignore the queued changes in tablesync, but I
guess not - there can be queued increments after reading the sequence
state, and we need to apply those. But maybe we could use the page LSN
from the relfilenode - that should be the LSN of the last WAL record.

Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
use to read the sequence state ...
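
The skip logic that either variant would enable can be sketched like this (illustrative names and LSN values, not the actual tablesync code):

```python
# Hedged sketch of LSN-based filtering: record an LSN together with the
# copied sequence state, then skip decoded sequence changes at or below
# that LSN, since they are already reflected in the copied state and
# applying them would move the sequence backwards.

def apply_sequence_changes(copy_state, copy_lsn, decoded_changes):
    """copy_state: (last_value, log_cnt, is_called) read by copy_sequence().
    decoded_changes: [(lsn, state), ...] in WAL order."""
    state = copy_state
    for lsn, new_state in decoded_changes:
        if lsn <= copy_lsn:
            continue  # older than the copied state; skip instead of going backwards
        state = new_state
    return state

# Mimicking the log above: the copy saw 1222, with older increments
# (990..1221) still queued and one genuinely newer change (1254).
changes = [(lsn, (val, 0, True)) for lsn, val in
           [(100, 990), (101, 1023), (102, 1056), (108, 1221), (109, 1254)]]
final = apply_sequence_changes((1222, 0, True), copy_lsn=108, decoded_changes=changes)
assert final == (1254, 0, True)  # only the newer change is applied
```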


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/23 18:55, Tomas Vondra wrote:
> >
> > ...
> >
> > This however made me realize the initial sync of sequences may not be
> > correct. I mean, the idea of tablesync is syncing the data in REPEATABLE
> > READ transaction, and then applying decoded changes. But sequences are
> > not transactional in this way - if you select from a sequence, you'll
> > always see the latest data, even in REPEATABLE READ.
> >
> > I wonder if this might result in losing some of the sequence increments,
> > and/or applying them in the wrong order (so that the sequence goes
> > backward for a while).
> >
>
> Yeah, I think my suspicion was warranted - it's pretty easy to make the
> sequence go backwards for a while by adding a sleep between the slot
> creation and the copy_sequence() call, and increment the sequence in
> between (enough to do some WAL logging).
>
> The copy_sequence() then reads the current on-disk state (because of the
> non-transactional nature w.r.t. REPEATABLE READ), applies it, and then
> we start processing the WAL added since the slot creation. But those are
> older, so stuff like this happens:
>
>     21:52:54.147 CET [35404] WARNING:  copy_sequence 1222 0 1
>     21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 990 0 1
>     21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 1023 0 1
>     21:52:54.163 CET [35404] WARNING:  apply_handle_sequence 1056 0 1
>     21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1089 0 1
>     21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1122 0 1
>     21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1155 0 1
>     21:52:54.174 CET [35404] WARNING:  apply_handle_sequence 1188 0 1
>     21:52:54.175 CET [35404] WARNING:  apply_handle_sequence 1221 0 1
>     21:52:54.898 CET [35402] WARNING:  apply_handle_sequence 1254 0 1
>
> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> to get the LSN to decide what changes to apply/skip from somewhere else.
> I wonder if we can just ignore the queued changes in tablesync, but I
> guess not - there can be queued increments after reading the sequence
> state, and we need to apply those. But maybe we could use the page LSN
> from the relfilenode - that should be the LSN of the last WAL record.
>
> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> use to read the sequence state ...
>

What if some Alter Sequence is performed before the copy starts and
after the copy is finished, the containing transaction rolled back?
Won't it copy something which shouldn't have been copied?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 3/18/23 06:35, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> ...
>>
>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
>> to get the LSN to decide what changes to apply/skip from somewhere else.
>> I wonder if we can just ignore the queued changes in tablesync, but I
>> guess not - there can be queued increments after reading the sequence
>> state, and we need to apply those. But maybe we could use the page LSN
>> from the relfilenode - that should be the LSN of the last WAL record.
>>
>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
>> use to read the sequence state ...
>>
> 
> What if some Alter Sequence is performed before the copy starts and
> after the copy is finished, the containing transaction rolled back?
> Won't it copy something which shouldn't have been copied?
> 

That shouldn't be possible - the alter creates a new relfilenode and
it's invisible until commit. So either it gets committed (and then
replicated), or it remains invisible to the SELECT during sync.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/18/23 06:35, Amit Kapila wrote:
> > On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> ...
> >>
> >> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> >> to get the LSN to decide what changes to apply/skip from somewhere else.
> >> I wonder if we can just ignore the queued changes in tablesync, but I
> >> guess not - there can be queued increments after reading the sequence
> >> state, and we need to apply those. But maybe we could use the page LSN
> >> from the relfilenode - that should be the LSN of the last WAL record.
> >>
> >> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> >> use to read the sequence state ...
> >>
> >
> > What if some Alter Sequence is performed before the copy starts and
> > after the copy is finished, the containing transaction rolled back?
> > Won't it copy something which shouldn't have been copied?
> >
>
> That shouldn't be possible - the alter creates a new relfilenode and
> it's invisible until commit. So either it gets committed (and then
> replicated), or it remains invisible to the SELECT during sync.
>

Okay, however, we need to ensure that such a change will later be
replicated and also need to ensure that the required WAL doesn't get
removed.

Say, if we use your first idea of page LSN from the relfilenode, then
how do we ensure that the corresponding WAL doesn't get removed when
later the sync worker tries to start replication from that LSN? I am
imagining here the sync_sequence_slot will be created before
copy_sequence but even then it is possible that the sequence has not
been updated for a long time and the LSN location will be in the past
(as compared to the slot's LSN) which means the corresponding WAL
could be removed. Now, here we can't directly start using the slot's
LSN to stream changes because there is no correlation of it with the
LSN (page LSN of sequence's relfilnode) where we want to start
streaming.

Now, for the second idea which is to directly use
pg_current_wal_insert_lsn(), I think we won't be able to ensure that
the changes covered by in-progress transactions like the one with
Alter Sequence I have given example would be streamed later after the
initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
could be an LSN after the LSN associated with Alter Sequence but
before the corresponding xact's commit.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 3/20/23 04:42, Amit Kapila wrote:
> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/18/23 06:35, Amit Kapila wrote:
>>> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> ...
>>>>
>>>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
>>>> to get the LSN to decide what changes to apply/skip from somewhere else.
>>>> I wonder if we can just ignore the queued changes in tablesync, but I
>>>> guess not - there can be queued increments after reading the sequence
>>>> state, and we need to apply those. But maybe we could use the page LSN
>>>> from the relfilenode - that should be the LSN of the last WAL record.
>>>>
>>>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
>>>> use to read the sequence state ...
>>>>
>>>
>>> What if some Alter Sequence is performed before the copy starts and
>>> after the copy is finished, the containing transaction rolled back?
>>> Won't it copy something which shouldn't have been copied?
>>>
>>
>> That shouldn't be possible - the alter creates a new relfilenode and
>> it's invisible until commit. So either it gets committed (and then
>> replicated), or it remains invisible to the SELECT during sync.
>>
> 
> Okay, however, we need to ensure that such a change will later be
> replicated and also need to ensure that the required WAL doesn't get
> removed.
> 
> Say, if we use your first idea of page LSN from the relfilenode, then
> how do we ensure that the corresponding WAL doesn't get removed when
> later the sync worker tries to start replication from that LSN? I am
> imagining here the sync_sequence_slot will be created before
> copy_sequence but even then it is possible that the sequence has not
> been updated for a long time and the LSN location will be in the past
> (as compared to the slot's LSN) which means the corresponding WAL
> could be removed. Now, here we can't directly start using the slot's
> LSN to stream changes because there is no correlation of it with the
> LSN (page LSN of sequence's relfilnode) where we want to start
> streaming.
> 

I don't understand why we'd need WAL from before the slot is created,
which happens before copy_sequence so the sync will see a more recent
state (reflecting all changes up to the slot LSN).

I think the only "issue" are the WAL records after the slot LSN, or more
precisely deciding which of the decoded changes to apply.


> Now, for the second idea which is to directly use
> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> the changes covered by in-progress transactions like the one with
> Alter Sequence I have given example would be streamed later after the
> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
> could be an LSN after the LSN associated with Alter Sequence but
> before the corresponding xact's commit.

Yeah, I think you're right - the locking itself is not sufficient to
prevent this ordering of operations. copy_sequence would have to lock
the sequence exclusively, which seems a bit disruptive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
> On 3/20/23 04:42, Amit Kapila wrote:
> > On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 3/18/23 06:35, Amit Kapila wrote:
> >>> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
> >>> <tomas.vondra@enterprisedb.com> wrote:
> >>>>
> >>>> ...
> >>>>
> >>>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
> >>>> to get the LSN to decide what changes to apply/skip from somewhere else.
> >>>> I wonder if we can just ignore the queued changes in tablesync, but I
> >>>> guess not - there can be queued increments after reading the sequence
> >>>> state, and we need to apply those. But maybe we could use the page LSN
> >>>> from the relfilenode - that should be the LSN of the last WAL record.
> >>>>
> >>>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
> >>>> use to read the sequence state ...
> >>>>
> >>>
> >>> What if some Alter Sequence is performed before the copy starts and
> >>> after the copy is finished, the containing transaction rolled back?
> >>> Won't it copy something which shouldn't have been copied?
> >>>
> >>
> >> That shouldn't be possible - the alter creates a new relfilenode and
> >> it's invisible until commit. So either it gets committed (and then
> >> replicated), or it remains invisible to the SELECT during sync.
> >>
> >
> > Okay, however, we need to ensure that such a change will later be
> > replicated and also need to ensure that the required WAL doesn't get
> > removed.
> >
> > Say, if we use your first idea of page LSN from the relfilenode, then
> > how do we ensure that the corresponding WAL doesn't get removed when
> > later the sync worker tries to start replication from that LSN? I am
> > imagining here the sync_sequence_slot will be created before
> > copy_sequence but even then it is possible that the sequence has not
> > been updated for a long time and the LSN location will be in the past
> > (as compared to the slot's LSN) which means the corresponding WAL
> > could be removed. Now, here we can't directly start using the slot's
> > LSN to stream changes because there is no correlation of it with the
> > LSN (page LSN of sequence's relfilnode) where we want to start
> > streaming.
> >
>
> I don't understand why we'd need WAL from before the slot is created,
> which happens before copy_sequence so the sync will see a more recent
> state (reflecting all changes up to the slot LSN).
>

Imagine the following sequence of events:
1. Operation on a sequence seq-1 which requires WAL. Say, this is done
at LSN 1000.
2. Some other random operations on unrelated objects. This would
increase LSN to 2000.
3. Create a slot that uses current LSN 2000.
4. Copy sequence seq-1 where you will get the LSN value as 1000. Then
you will use LSN 1000 as a starting point to start replication in
sequence sync worker.

It is quite possible that WAL from LSN 1000 may no longer be present.
Now, we could perhaps use the slot's LSN in this case, but currently
that is not possible without some changes in the slot machinery. Even
if we somehow solve this, we have the below problem
where we can miss some concurrent activity.

> I think the only "issue" are the WAL records after the slot LSN, or more
> precisely deciding which of the decoded changes to apply.
>
>
> > Now, for the second idea which is to directly use
> > pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> > the changes covered by in-progress transactions like the one with
> > Alter Sequence I have given example would be streamed later after the
> > initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
> > could be an LSN after the LSN associated with Alter Sequence but
> > before the corresponding xact's commit.
>
> Yeah, I think you're right - the locking itself is not sufficient to
> prevent this ordering of operations. copy_sequence would have to lock
> the sequence exclusively, which seems bit disruptive.
>

Right, that doesn't sound like a good idea.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 3/20/23 12:00, Amit Kapila wrote:
> On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>> On 3/20/23 04:42, Amit Kapila wrote:
>>> On Sat, Mar 18, 2023 at 8:49 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 3/18/23 06:35, Amit Kapila wrote:
>>>>> On Sat, Mar 18, 2023 at 3:13 AM Tomas Vondra
>>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Clearly, for sequences we can't quite rely on snapshots/slots, we need
>>>>>> to get the LSN to decide what changes to apply/skip from somewhere else.
>>>>>> I wonder if we can just ignore the queued changes in tablesync, but I
>>>>>> guess not - there can be queued increments after reading the sequence
>>>>>> state, and we need to apply those. But maybe we could use the page LSN
>>>>>> from the relfilenode - that should be the LSN of the last WAL record.
>>>>>>
>>>>>> Or maybe we could simply add pg_current_wal_insert_lsn() into the SQL we
>>>>>> use to read the sequence state ...
>>>>>>
>>>>>
>>>>> What if some Alter Sequence is performed before the copy starts and
>>>>> after the copy is finished, the containing transaction rolled back?
>>>>> Won't it copy something which shouldn't have been copied?
>>>>>
>>>>
>>>> That shouldn't be possible - the alter creates a new relfilenode and
>>>> it's invisible until commit. So either it gets committed (and then
>>>> replicated), or it remains invisible to the SELECT during sync.
>>>>
>>>
>>> Okay, however, we need to ensure that such a change will later be
>>> replicated and also need to ensure that the required WAL doesn't get
>>> removed.
>>>
>>> Say, if we use your first idea of page LSN from the relfilenode, then
>>> how do we ensure that the corresponding WAL doesn't get removed when
>>> later the sync worker tries to start replication from that LSN? I am
>>> imagining here the sync_sequence_slot will be created before
>>> copy_sequence but even then it is possible that the sequence has not
>>> been updated for a long time and the LSN location will be in the past
>>> (as compared to the slot's LSN) which means the corresponding WAL
>>> could be removed. Now, here we can't directly start using the slot's
>>> LSN to stream changes because there is no correlation of it with the
>>> LSN (page LSN of sequence's relfilnode) where we want to start
>>> streaming.
>>>
>>
>> I don't understand why we'd need WAL from before the slot is created,
>> which happens before copy_sequence so the sync will see a more recent
>> state (reflecting all changes up to the slot LSN).
>>
> 
> Imagine the following sequence of events:
> 1. Operation on a sequence seq-1 which requires WAL. Say, this is done
> at LSN 1000.
> 2. Some other random operations on unrelated objects. This would
> increase LSN to 2000.
> 3. Create a slot that uses current LSN 2000.
> 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then
> you will use LSN 1000 as a starting point to start replication in
> sequence sync worker.
> 
> It is quite possible that WAL from LSN 1000 may not be present. Now,
> it may be possible that we use the slot's LSN in this case but
> currently, it may not be possible without some changes in the slot
> machinery. Even, if we somehow solve this, we have the below problem
> where we can miss some concurrent activity.
> 

I think the question is what the WAL-requiring operation at LSN 1000
would be. If it's just regular nextval(), then we *will* see it during
copy_sequence - sequences are not transactional in the MVCC sense.

If it's an ALTER SEQUENCE, I guess it might create a new relfilenode,
and then we might fail to apply this - that'd be bad.

I wonder if we'd actually allow the WAL to be discarded while building
the consistent snapshot, though. You're right, however, that we can't just decide
this based on LSN, we'd probably need to compare the relfilenodes too or
something like that ...

>> I think the only "issue" are the WAL records after the slot LSN, or more
>> precisely deciding which of the decoded changes to apply.
>>
>>
>>> Now, for the second idea which is to directly use
>>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
>>> the changes covered by in-progress transactions like the one with
>>> Alter Sequence I have given example would be streamed later after the
>>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
>>> could be an LSN after the LSN associated with Alter Sequence but
>>> before the corresponding xact's commit.
>>
>> Yeah, I think you're right - the locking itself is not sufficient to
>> prevent this ordering of operations. copy_sequence would have to lock
>> the sequence exclusively, which seems bit disruptive.
>>
> 
> Right, that doesn't sound like a good idea.
> 

Although, maybe we could use a less strict lock level? I mean, one that
allows nextval() to continue, but would conflict with ALTER SEQUENCE.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/20/23 12:00, Amit Kapila wrote:
> > On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >>
> >> I don't understand why we'd need WAL from before the slot is created,
> >> which happens before copy_sequence so the sync will see a more recent
> >> state (reflecting all changes up to the slot LSN).
> >>
> >
> > Imagine the following sequence of events:
> > 1. Operation on a sequence seq-1 which requires WAL. Say, this is done
> > at LSN 1000.
> > 2. Some other random operations on unrelated objects. This would
> > increase LSN to 2000.
> > 3. Create a slot that uses current LSN 2000.
> > 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then
> > you will use LSN 1000 as a starting point to start replication in
> > sequence sync worker.
> >
> > It is quite possible that WAL from LSN 1000 may not be present. Now,
> > it may be possible that we use the slot's LSN in this case but
> > currently, it may not be possible without some changes in the slot
> > machinery. Even, if we somehow solve this, we have the below problem
> > where we can miss some concurrent activity.
> >
>
> I think the question is what would be the WAL-requiring operation at LSN
> 1000. If it's just regular nextval(), then we *will* see it during
> copy_sequence - sequences are not transactional in the MVCC sense.
>
> If it's an ALTER SEQUENCE, I guess it might create a new relfilenode,
> and then we might fail to apply this - that'd be bad.
>
> I wonder if we'd allow actually discarding the WAL while building the
> consistent snapshot, though.
>

No, as soon as we reserve the WAL location, we update the slot's
minLSN (replicationSlotMinLSN) which would prevent the required WAL
from being removed.

> You're however right we can't just decide
> this based on LSN, we'd probably need to compare the relfilenodes too or
> something like that ...
>
> >> I think the only "issue" are the WAL records after the slot LSN, or more
> >> precisely deciding which of the decoded changes to apply.
> >>
> >>
> >>> Now, for the second idea which is to directly use
> >>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
> >>> the changes covered by in-progress transactions like the one with
> >>> Alter Sequence I have given example would be streamed later after the
> >>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
> >>> could be an LSN after the LSN associated with Alter Sequence but
> >>> before the corresponding xact's commit.
> >>
> >> Yeah, I think you're right - the locking itself is not sufficient to
> >> prevent this ordering of operations. copy_sequence would have to lock
> >> the sequence exclusively, which seems bit disruptive.
> >>
> >
> > Right, that doesn't sound like a good idea.
> >
>
> Although, maybe we could use a less strict lock level? I mean, one that
> allows nextval() to continue, but would conflict with ALTER SEQUENCE.
>

I don't know if that is a good idea, but are you imagining a special
interface/mechanism just for logical replication? As far as I can see,
you have used SELECT to fetch the sequence values.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 3/20/23 13:26, Amit Kapila wrote:
> On Mon, Mar 20, 2023 at 5:13 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/20/23 12:00, Amit Kapila wrote:
>>> On Mon, Mar 20, 2023 at 1:49 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>>
>>>> I don't understand why we'd need WAL from before the slot is created,
>>>> which happens before copy_sequence so the sync will see a more recent
>>>> state (reflecting all changes up to the slot LSN).
>>>>
>>>
>>> Imagine the following sequence of events:
>>> 1. Operation on a sequence seq-1 which requires WAL. Say, this is done
>>> at LSN 1000.
>>> 2. Some other random operations on unrelated objects. This would
>>> increase LSN to 2000.
>>> 3. Create a slot that uses current LSN 2000.
>>> 4. Copy sequence seq-1 where you will get the LSN value as 1000. Then
>>> you will use LSN 1000 as a starting point to start replication in
>>> sequence sync worker.
>>>
>>> It is quite possible that WAL from LSN 1000 may not be present. Now,
>>> it may be possible that we use the slot's LSN in this case but
>>> currently, it may not be possible without some changes in the slot
>>> machinery. Even, if we somehow solve this, we have the below problem
>>> where we can miss some concurrent activity.
>>>
>>
>> I think the question is what would be the WAL-requiring operation at LSN
>> 1000. If it's just regular nextval(), then we *will* see it during
>> copy_sequence - sequences are not transactional in the MVCC sense.
>>
>> If it's an ALTER SEQUENCE, I guess it might create a new relfilenode,
>> and then we might fail to apply this - that'd be bad.
>>
>> I wonder if we'd allow actually discarding the WAL while building the
>> consistent snapshot, though.
>>
> 
> No, as soon as we reserve the WAL location, we update the slot's
> minLSN (replicationSlotMinLSN) which would prevent the required WAL
> from being removed.
> 
>> You're however right we can't just decide
>> this based on LSN, we'd probably need to compare the relfilenodes too or
>> something like that ...
>>
>>>> I think the only "issue" are the WAL records after the slot LSN, or more
>>>> precisely deciding which of the decoded changes to apply.
>>>>
>>>>
>>>>> Now, for the second idea which is to directly use
>>>>> pg_current_wal_insert_lsn(), I think we won't be able to ensure that
>>>>> the changes covered by in-progress transactions like the one with
>>>>> Alter Sequence I have given example would be streamed later after the
>>>>> initial copy. Because the LSN returned by pg_current_wal_insert_lsn()
>>>>> could be an LSN after the LSN associated with Alter Sequence but
>>>>> before the corresponding xact's commit.
>>>>
>>>> Yeah, I think you're right - the locking itself is not sufficient to
>>>> prevent this ordering of operations. copy_sequence would have to lock
>>>> the sequence exclusively, which seems bit disruptive.
>>>>
>>>
>>> Right, that doesn't sound like a good idea.
>>>
>>
>> Although, maybe we could use a less strict lock level? I mean, one that
>> allows nextval() to continue, but would conflict with ALTER SEQUENCE.
>>
> 
> I don't know if that is a good idea but are you imagining a special
> interface/mechanism just for logical replication because as far as I
> can see you have used SELECT to fetch the sequence values?
> 

Not sure what the special mechanism would be. I don't think it could
read the sequence from somewhere else, and due to the lack of MVCC we'd
just read the same sequence data from the current relfilenode. Or what
else would it do?

The one thing we can't quite do at the moment is locking the sequence,
because LOCK is only supported for tables. So we could either provide a
function that just locks a sequence, or one that locks it and then
returns the current state (as if we did a SELECT).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 3/20/23 18:03, Tomas Vondra wrote:
> 
> ...
>>
>> I don't know if that is a good idea but are you imagining a special
>> interface/mechanism just for logical replication because as far as I
>> can see you have used SELECT to fetch the sequence values?
>>
> 
> Not sure what would the special mechanism be? I don't think it could
> read the sequence from somewhere else, and due the lack of MVCC we'd
> just read same sequence data from the current relfilenode. Or what else
> would it do?
> 

I was thinking about alternative ways to do this, but I couldn't think
of anything. The non-MVCC behavior of sequences means it's not really
possible to do this based on snapshots / slots or stuff like that ...

> The one thing we can't quite do at the moment is locking the sequence,
> because LOCK is only supported for tables. So we could either provide a
> function to lock a sequence, or locks it and then returns the current
> state (as if we did a SELECT).
> 

... so I took a stab at doing it like this. I didn't feel relaxing LOCK
restrictions to also allow locking sequences would be the right choice,
so I added a new function pg_sequence_lock_for_sync(). I wonder if we
could/should restrict this to logical replication use, somehow.

The interlock happens right after creating the slot - I was thinking
about doing it even before the slot gets created, but that's not
possible, because that installs a snapshot (so it has to be the first
command in the transaction). It acquires RowExclusiveLock, which is
enough to conflict with ALTER SEQUENCE, but allows nextval().

AFAICS this does the trick - if there's ALTER SEQUENCE, we'll wait for
it to complete. And copy_sequence() will read the resulting state, even
though this is REPEATABLE READ - remember, sequences are not subject to
that consistency.

The one anomaly I can think of is that the sequence might seem to go
"backwards" for a little bit during the sync. Imagine this sequence of
operations:

1) tablesync creates slot
2) S1 does ALTER SEQUENCE ... RESTART WITH 20 (gets lock)
3) S2 tries ALTER SEQUENCE ... RESTART WITH 100 (waits for lock)
4) tablesync requests lock
5) S1 does the thing, commits
6) S2 acquires lock, does the thing, commits
7) tablesync gets lock, reads current sequence state
8) tablesync decodes changes from S1 and S2, applies them

But I think this is fine - it's part of the catchup, and until that's
done the sync is not considered completed.


I merged the earlier "fixup" patches into the relevant parts, and left
two patches with new tweaks (deducing the correct "WAL" state from the
current state read by copy_sequence), and the interlock discussed here.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Masahiko Sawada
Date:
Hi,

On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I merged the earlier "fixup" patches into the relevant parts, and left
> two patches with new tweaks (deducing the correct "WAL" state from the
> current state read by copy_sequence), and the interlock discussed here.
>

Apart from that, how does a publication containing sequences work with
subscribers that are not able to handle sequence changes, e.g. when the
publisher's PostgreSQL version is newer than the subscriber's? As far as
I tested the latest patches, the subscriber (v15) errors out with the
error 'invalid logical replication message type "Q"' when receiving a
sequence change. I'm not sure it's sensible
behavior. I think we should instead either (1) deny starting the
replication if the subscriber isn't able to handle sequence changes
and the publication includes that, or (2) not send sequence changes to
such subscribers.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 3/27/23 03:32, Masahiko Sawada wrote:
> Hi,
> 
> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I merged the earlier "fixup" patches into the relevant parts, and left
>> two patches with new tweaks (deducing the correct "WAL" state from the
>> current state read by copy_sequence), and the interlock discussed here.
>>
> 
> Apart from that, how does the publication having sequences work with
> subscribers who are not able to handle sequence changes, e.g. in a
> case where PostgreSQL version of publication is newer than the
> subscriber? As far as I tested the latest patches, the subscriber
> (v15)  errors out with the error 'invalid logical replication message
> type "Q"' when receiving a sequence change. I'm not sure it's sensible
> behavior. I think we should instead either (1) deny starting the
> replication if the subscriber isn't able to handle sequence changes
> and the publication includes that, or (2) not send sequence changes to
> such subscribers.
> 

I agree the "invalid message" error is not great, but it's not clear to
me how we could do (1). The trouble is we don't really know if the
publication contains (or will contain) sequences. I mean, what would
happen if the replication starts and then someone adds a sequence?

For (2), I think that's not something we should do - silently discarding
some messages seems error-prone. If the publication includes sequences,
presumably the user wanted to replicate those. If they want to replicate
to an older subscriber, create a publication without sequences.

Perhaps the right solution would be to check if the subscriber supports
replication of sequences in the output plugin, while attempting to write
the "Q" message, and error out if the subscriber does not support it.

What do you think?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Masahiko Sawada
Date:
On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 3/27/23 03:32, Masahiko Sawada wrote:
> > Hi,
> >
> > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> I merged the earlier "fixup" patches into the relevant parts, and left
> >> two patches with new tweaks (deducing the correct "WAL" state from the
> >> current state read by copy_sequence), and the interlock discussed here.
> >>
> >
> > Apart from that, how does the publication having sequences work with
> > subscribers who are not able to handle sequence changes, e.g. in a
> > case where PostgreSQL version of publication is newer than the
> > subscriber? As far as I tested the latest patches, the subscriber
> > (v15)  errors out with the error 'invalid logical replication message
> > type "Q"' when receiving a sequence change. I'm not sure it's sensible
> > behavior. I think we should instead either (1) deny starting the
> > replication if the subscriber isn't able to handle sequence changes
> > and the publication includes that, or (2) not send sequence changes to
> > such subscribers.
> >
>
> I agree the "invalid message" error is not great, but it's not clear to
> me how to do either (1). The trouble is we don't really know if the
> publication contains (or will contain) sequences. I mean, what would
> happen if the replication starts and then someone adds a sequence?
>
> For (2), I think that's not something we should do - silently discarding
> some messages seems error-prone. If the publication includes sequences,
> presumably the user wanted to replicate those. If they want to replicate
> to an older subscriber, create a publication without sequences.
>
> Perhaps the right solution would be to check if the subscriber supports
> replication of sequences in the output plugin, while attempting to write
> the "Q" message. And error-out if the subscriber does not support it.

It might be related to this topic; do we need to bump the protocol
version? The commit 464824323e57d introduced new streaming callbacks
and bumped the protocol version. The same seems to be true for this
change, as it adds a sequence_cb callback.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 3/28/23 18:34, Masahiko Sawada wrote:
> On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>> On 3/27/23 03:32, Masahiko Sawada wrote:
>>> Hi,
>>>
>>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> I merged the earlier "fixup" patches into the relevant parts, and left
>>>> two patches with new tweaks (deducing the correct "WAL" state from the
>>>> current state read by copy_sequence), and the interlock discussed here.
>>>>
>>>
>>> Apart from that, how does the publication having sequences work with
>>> subscribers who are not able to handle sequence changes, e.g. in a
>>> case where PostgreSQL version of publication is newer than the
>>> subscriber? As far as I tested the latest patches, the subscriber
>>> (v15)  errors out with the error 'invalid logical replication message
>>> type "Q"' when receiving a sequence change. I'm not sure it's sensible
>>> behavior. I think we should instead either (1) deny starting the
>>> replication if the subscriber isn't able to handle sequence changes
>>> and the publication includes that, or (2) not send sequence changes to
>>> such subscribers.
>>>
>>
>> I agree the "invalid message" error is not great, but it's not clear to
>> me how to do either (1). The trouble is we don't really know if the
>> publication contains (or will contain) sequences. I mean, what would
>> happen if the replication starts and then someone adds a sequence?
>>
>> For (2), I think that's not something we should do - silently discarding
>> some messages seems error-prone. If the publication includes sequences,
>> presumably the user wanted to replicate those. If they want to replicate
>> to an older subscriber, create a publication without sequences.
>>
>> Perhaps the right solution would be to check if the subscriber supports
>> replication of sequences in the output plugin, while attempting to write
>> the "Q" message. And error-out if the subscriber does not support it.
> 
> It might be related to this topic; do we need to bump the protocol
> version? The commit 464824323e57d introduced new streaming callbacks
> and bumped the protocol version. I think the same seems to be true for
> this change as it adds sequence_cb callback.
> 

It's not clear to me what the exact behavior should be.

I mean, imagine we're opening a connection for logical replication, and
the subscriber does not handle sequences. What should the publisher do?

(Note: The correct commit hash is 464824323e57d.)

I don't think the streaming is a good match for sequences, because of a
couple important differences ...

Firstly, streaming determines *how* the changes are replicated, not what
gets replicated. It doesn't (silently) filter out "bad" events that the
subscriber doesn't know how to apply. If the subscriber does not know
how to deal with streamed xacts, it'll still get the same changes
exactly per the publication definition.

Secondly, the default value is "streaming=off", i.e. the subscriber has
to explicitly request streaming when opening the connection. And we
simply check it against the negotiated protocol version, i.e. the check
in pgoutput_startup() protects against a subscriber requesting protocol
v1 together with streaming=on.
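
For context, this is roughly what such a start command looks like on the
wire; the slot and publication names here are made up, and the exact
option set depends on the negotiated server version:

```sql
START_REPLICATION SLOT "mysub" LOGICAL 0/0 (
    proto_version '2',
    streaming 'on',
    publication_names '"mypub"'
)
```

The pgoutput plugin rejects streaming 'on' if proto_version is below 2,
which is the kind of per-option check being described above.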

I don't think we can/should do more checks at this point - we don't know
what's included in the requested publications at that point, and I doubt
it's worth adding because we certainly can't predict if the publication
will be altered to include/decode sequences in the future.


Speaking of precedents, TRUNCATE is probably a better one, because it's
a new action and it determines *what* the subscriber can handle. But
that does exactly the thing we do for sequences - if you open a
connection from PG10 subscriber (truncate was added in PG11), and the
publisher decodes a truncate, subscriber will do:

2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
   replication message type "T"
2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
   replication worker for subscription 16390 (PID 2357609) exited with
   exit code 1

I don't see why sequences should do anything else. If you need to
replicate to such subscriber, create a publication that does not have
'sequence' in the publish option ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Masahiko Sawada
Date:
On Wed, Mar 29, 2023 at 3:34 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/28/23 18:34, Masahiko Sawada wrote:
> > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >>
> >>
> >> On 3/27/23 03:32, Masahiko Sawada wrote:
> >>> Hi,
> >>>
> >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra
> >>> <tomas.vondra@enterprisedb.com> wrote:
> >>>>
> >>>> I merged the earlier "fixup" patches into the relevant parts, and left
> >>>> two patches with new tweaks (deducing the current "WAL" state from the
> >>>> current state read by copy_sequence), and the interlock discussed here.
> >>>>
> >>>
> >>> Apart from that, how does the publication having sequences work with
> >>> subscribers who are not able to handle sequence changes, e.g. in a
> >>> case where PostgreSQL version of publication is newer than the
> >>> subscriber? As far as I tested the latest patches, the subscriber
> >>> (v15)  errors out with the error 'invalid logical replication message
> >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible
> >>> behavior. I think we should instead either (1) deny starting the
> >>> replication if the subscriber isn't able to handle sequence changes
> >>> and the publication includes that, or (2) not send sequence changes to
> >>> such subscribers.
> >>>
> >>
> >> I agree the "invalid message" error is not great, but it's not clear to
> >> me how to do either (1). The trouble is we don't really know if the
> >> publication contains (or will contain) sequences. I mean, what would
> >> happen if the replication starts and then someone adds a sequence?
> >>
> >> For (2), I think that's not something we should do - silently discarding
> >> some messages seems error-prone. If the publication includes sequences,
> >> presumably the user wanted to replicate those. If they want to replicate
> >> to an older subscriber, create a publication without sequences.
> >>
> >> Perhaps the right solution would be to check if the subscriber supports
> >> replication of sequences in the output plugin, while attempting to write
> >> the "Q" message. And error-out if the subscriber does not support it.
> >
> > It might be related to this topic; do we need to bump the protocol
> > version? The commit 64824323e57d introduced new streaming callbacks
> > and bumped the protocol version. I think the same seems to be true for
> > this change as it adds sequence_cb callback.
> >
>
> It's not clear to me what should be the exact behavior?
>
> I mean, imagine we're opening a connection for logical replication, and
> the subscriber does not handle sequences. What should the publisher do?
>
> (Note: The correct commit hash is 464824323e57d.)

Thanks.

>
> I don't think the streaming is a good match for sequences, because of a
> couple important differences ...
>
> Firstly, streaming determines *how* the changes are replicated, not what
> gets replicated. It doesn't (silently) filter out "bad" events that the
> subscriber doesn't know how to apply. If the subscriber does not know
> how to deal with streamed xacts, it'll still get the same changes
> exactly per the publication definition.
>
> Secondly, the default value is "streaming=off", i.e. the subscriber has
> to explicitly request streaming when opening the connection. And we
> simply check it against the negotiated protocol version, i.e. the check
> in pgoutput_startup() protects against a subscriber requesting protocol
> v1 together with streaming=on.
>
> I don't think we can/should do more check at this point - we don't know
> what's included in the requested publications at that point, and I doubt
> it's worth adding because we certainly can't predict if the publication
> will be altered to include/decode sequences in the future.

True. That's a valid argument.

>
> Speaking of precedents, TRUNCATE is probably a better one, because it's
> a new action and it determines *what* the subscriber can handle. But
> that does exactly the thing we do for sequences - if you open a
> connection from PG10 subscriber (truncate was added in PG11), and the
> publisher decodes a truncate, subscriber will do:
>
> 2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
>    replication message type "T"
> 2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
>    replication worker for subscription 16390 (PID 2357609) exited with
>    exit code 1
>
> I don't see why sequences should do anything else. If you need to
> replicate to such subscriber, create a publication that does not have
> 'sequence' in the publish option ...
>

I hadn't checked the TRUNCATE case; yes, it's a good precedent for
sequence replication. So it seems we don't need to do anything.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/28/23 18:34, Masahiko Sawada wrote:
> > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> Apart from that, how does the publication having sequences work with
> >>> subscribers who are not able to handle sequence changes, e.g. in a
> >>> case where PostgreSQL version of publication is newer than the
> >>> subscriber? As far as I tested the latest patches, the subscriber
> >>> (v15)  errors out with the error 'invalid logical replication message
> >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible
> >>> behavior. I think we should instead either (1) deny starting the
> >>> replication if the subscriber isn't able to handle sequence changes
> >>> and the publication includes that, or (2) not send sequence changes to
> >>> such subscribers.
> >>>
> >>
> >> I agree the "invalid message" error is not great, but it's not clear to
> >> me how to do either (1). The trouble is we don't really know if the
> >> publication contains (or will contain) sequences. I mean, what would
> >> happen if the replication starts and then someone adds a sequence?
> >>
> >> For (2), I think that's not something we should do - silently discarding
> >> some messages seems error-prone. If the publication includes sequences,
> >> presumably the user wanted to replicate those. If they want to replicate
> >> to an older subscriber, create a publication without sequences.
> >>
> >> Perhaps the right solution would be to check if the subscriber supports
> >> replication of sequences in the output plugin, while attempting to write
> >> the "Q" message. And error-out if the subscriber does not support it.
> >
> > It might be related to this topic; do we need to bump the protocol
> > version? The commit 64824323e57d introduced new streaming callbacks
> > and bumped the protocol version. I think the same seems to be true for
> > this change as it adds sequence_cb callback.
> >
>
> It's not clear to me what should be the exact behavior?
>
> I mean, imagine we're opening a connection for logical replication, and
> the subscriber does not handle sequences. What should the publisher do?
>

I think deciding anything at the publisher would be tricky, but wouldn't
it be better if, by default, we disallowed connections from a subscriber
to the publisher when the publisher's version is higher, and then allowed
them only based on some subscription option? Or maybe, by default, allow
the connection to a higher version but disallow it based on an option.

>
> Speaking of precedents, TRUNCATE is probably a better one, because it's
> a new action and it determines *what* the subscriber can handle. But
> that does exactly the thing we do for sequences - if you open a
> connection from PG10 subscriber (truncate was added in PG11), and the
> publisher decodes a truncate, subscriber will do:
>
> 2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
>    replication message type "T"
> 2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
>    replication worker for subscription 16390 (PID 2357609) exited with
>    exit code 1
>
> I don't see why sequences should do anything else.
>

Is this behavior of TRUNCATE known or discussed previously? I can't
see any mention of this in the docs or commit message. I guess if we
want to follow such behavior it should be well documented so that it
won't be a surprise for users. I think we would face such cases in the
future as well. One of the similar cases we are discussing for DDL
replication where a higher version publisher could send some DDL
syntax that lower version subscribers won't support and will lead to
an error [1].

[1] -
https://www.postgresql.org/message-id/OS0PR01MB5716088E497BDCBCED7FC3DA94849%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 3/29/23 11:51, Amit Kapila wrote:
> On Wed, Mar 29, 2023 at 12:04 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/28/23 18:34, Masahiko Sawada wrote:
>>> On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> Apart from that, how does the publication having sequences work with
>>>>> subscribers who are not able to handle sequence changes, e.g. in a
>>>>> case where PostgreSQL version of publication is newer than the
>>>>> subscriber? As far as I tested the latest patches, the subscriber
>>>>> (v15)  errors out with the error 'invalid logical replication message
>>>>> type "Q"' when receiving a sequence change. I'm not sure it's sensible
>>>>> behavior. I think we should instead either (1) deny starting the
>>>>> replication if the subscriber isn't able to handle sequence changes
>>>>> and the publication includes that, or (2) not send sequence changes to
>>>>> such subscribers.
>>>>>
>>>>
>>>> I agree the "invalid message" error is not great, but it's not clear to
>>>> me how to do either (1). The trouble is we don't really know if the
>>>> publication contains (or will contain) sequences. I mean, what would
>>>> happen if the replication starts and then someone adds a sequence?
>>>>
>>>> For (2), I think that's not something we should do - silently discarding
>>>> some messages seems error-prone. If the publication includes sequences,
>>>> presumably the user wanted to replicate those. If they want to replicate
>>>> to an older subscriber, create a publication without sequences.
>>>>
>>>> Perhaps the right solution would be to check if the subscriber supports
>>>> replication of sequences in the output plugin, while attempting to write
>>>> the "Q" message. And error-out if the subscriber does not support it.
>>>
>>> It might be related to this topic; do we need to bump the protocol
>>> version? The commit 64824323e57d introduced new streaming callbacks
>>> and bumped the protocol version. I think the same seems to be true for
>>> this change as it adds sequence_cb callback.
>>>
>>
>> It's not clear to me what should be the exact behavior?
>>
>> I mean, imagine we're opening a connection for logical replication, and
>> the subscriber does not handle sequences. What should the publisher do?
>>
> 
> I think deciding anything at the publisher would be tricky but won't
> it be better if by default we disallow connection from subscriber to
> the publisher when the publisher's version is higher? And then allow
> it only based on some subscription option or maybe by default allow
> the connection to a higher version but based on option disallows the
> connection.
> 
>>
>> Speaking of precedents, TRUNCATE is probably a better one, because it's
>> a new action and it determines *what* the subscriber can handle. But
>> that does exactly the thing we do for sequences - if you open a
>> connection from PG10 subscriber (truncate was added in PG11), and the
>> publisher decodes a truncate, subscriber will do:
>>
>> 2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
>>    replication message type "T"
>> 2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
>>    replication worker for subscription 16390 (PID 2357609) exited with
>>    exit code 1
>>
>> I don't see why sequences should do anything else.
>>
> 
> Is this behavior of TRUNCATE known or discussed previously? I can't
> see any mention of this in the docs or commit message. I guess if we
> want to follow such behavior it should be well documented so that it
> won't be a surprise for users. I think we would face such cases in the
> future as well. One of the similar cases we are discussing for DDL
> replication where a higher version publisher could send some DDL
> syntax that lower version subscribers won't support and will lead to
> an error [1].
> 

I don't know where/how it's documented, TBH.

FWIW I agree the TRUNCATE-like behavior (failing on subscriber after
receiving unknown message type) is a bit annoying.

Perhaps it'd be reasonable to tie the "protocol version" to subscriber
capabilities, so that a protocol version guarantees what message types
the subscriber understands. So we could increment the protocol version,
check it in pgoutput_startup and then error-out in the sequence callback
if the subscriber version is too old.

That'd be nicer in the sense that we'd generate a nicer error message on
the publisher, not an "unknown message type" on the subscriber. That's
doable, the main problem being it'd be inconsistent with the TRUNCATE
behavior. OTOH that was introduced in PG11, which is the oldest version
still under support ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Peter Eisentraut
Date:
On 29.03.23 16:28, Tomas Vondra wrote:
> Perhaps it'd be reasonable to tie the "protocol version" to subscriber
> capabilities, so that a protocol version guarantees what message types
> the subscriber understands. So we could increment the protocol version,
> check it in pgoutput_startup and then error-out in the sequence callback
> if the subscriber version is too old.

That would make sense.

> That'd be nicer in the sense that we'd generate nicer error message on
> the publisher, not an "unknown message type" on the subscriber. That's
> doable, the main problem being it'd be inconsistent with the TRUNCATE
> behavior. OTOH that was introduced in PG11, which is the oldest version
> still under support ...

I think at the time TRUNCATE support was added, we didn't have a strong 
sense of how the protocol versioning would work or whether it would work 
at all, so doing nothing was the easiest way out.




Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 3/29/23 11:51, Amit Kapila wrote:
> >>
> >> It's not clear to me what should be the exact behavior?
> >>
> >> I mean, imagine we're opening a connection for logical replication, and
> >> the subscriber does not handle sequences. What should the publisher do?
> >>
> >
> > I think deciding anything at the publisher would be tricky but won't
> > it be better if by default we disallow connection from subscriber to
> > the publisher when the publisher's version is higher? And then allow
> > it only based on some subscription option or maybe by default allow
> > the connection to a higher version but based on option disallows the
> > connection.
> >
> >>
> >> Speaking of precedents, TRUNCATE is probably a better one, because it's
> >> a new action and it determines *what* the subscriber can handle. But
> >> that does exactly the thing we do for sequences - if you open a
> >> connection from PG10 subscriber (truncate was added in PG11), and the
> >> publisher decodes a truncate, subscriber will do:
> >>
> >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
> >>    replication message type "T"
> >> 2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
> >>    replication worker for subscription 16390 (PID 2357609) exited with
> >>    exit code 1
> >>
> >> I don't see why sequences should do anything else.
> >>
> >
> > Is this behavior of TRUNCATE known or discussed previously? I can't
> > see any mention of this in the docs or commit message. I guess if we
> > want to follow such behavior it should be well documented so that it
> > won't be a surprise for users. I think we would face such cases in the
> > future as well. One of the similar cases we are discussing for DDL
> > replication where a higher version publisher could send some DDL
> > syntax that lower version subscribers won't support and will lead to
> > an error [1].
> >
>
> I don't know where/how it's documented, TBH.
>
> FWIW I agree the TRUNCATE-like behavior (failing on subscriber after
> receiving unknown message type) is a bit annoying.
>
> Perhaps it'd be reasonable to tie the "protocol version" to subscriber
> capabilities, so that a protocol version guarantees what message types
> the subscriber understands. So we could increment the protocol version,
> check it in pgoutput_startup and then error-out in the sequence callback
> if the subscriber version is too old.
>
> That'd be nicer in the sense that we'd generate nicer error message on
> the publisher, not an "unknown message type" on the subscriber.
>

Agreed. So, we can probably formalize this rule: whenever a newer-version
publisher wants to send additional information which an old-version
subscriber won't be able to handle, the error should be raised at the
publisher by using the protocol version number.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Masahiko Sawada
Date:
On Thu, Mar 30, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 29, 2023 at 7:58 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 3/29/23 11:51, Amit Kapila wrote:
> > >>
> > >> It's not clear to me what should be the exact behavior?
> > >>
> > >> I mean, imagine we're opening a connection for logical replication, and
> > >> the subscriber does not handle sequences. What should the publisher do?
> > >>
> > >
> > > I think deciding anything at the publisher would be tricky but won't
> > > it be better if by default we disallow connection from subscriber to
> > > the publisher when the publisher's version is higher? And then allow
> > > it only based on some subscription option or maybe by default allow
> > > the connection to a higher version but based on option disallows the
> > > connection.
> > >
> > >>
> > >> Speaking of precedents, TRUNCATE is probably a better one, because it's
> > >> a new action and it determines *what* the subscriber can handle. But
> > >> that does exactly the thing we do for sequences - if you open a
> > >> connection from PG10 subscriber (truncate was added in PG11), and the
> > >> publisher decodes a truncate, subscriber will do:
> > >>
> > >> 2023-03-28 20:29:46.921 CEST [2357609] ERROR:  invalid logical
> > >>    replication message type "T"
> > >> 2023-03-28 20:29:46.922 CEST [2356534] LOG:  worker process: logical
> > >>    replication worker for subscription 16390 (PID 2357609) exited with
> > >>    exit code 1
> > >>
> > >> I don't see why sequences should do anything else.
> > >>
> > >
> > > Is this behavior of TRUNCATE known or discussed previously? I can't
> > > see any mention of this in the docs or commit message. I guess if we
> > > want to follow such behavior it should be well documented so that it
> > > won't be a surprise for users. I think we would face such cases in the
> > > future as well. One of the similar cases we are discussing for DDL
> > > replication where a higher version publisher could send some DDL
> > > syntax that lower version subscribers won't support and will lead to
> > > an error [1].
> > >
> >
> > I don't know where/how it's documented, TBH.
> >
> > FWIW I agree the TRUNCATE-like behavior (failing on subscriber after
> > receiving unknown message type) is a bit annoying.
> >
> > Perhaps it'd be reasonable to tie the "protocol version" to subscriber
> > capabilities, so that a protocol version guarantees what message types
> > the subscriber understands. So we could increment the protocol version,
> > check it in pgoutput_startup and then error-out in the sequence callback
> > if the subscriber version is too old.
> >
> > That'd be nicer in the sense that we'd generate nicer error message on
> > the publisher, not an "unknown message type" on the subscriber.
> >
>
> Agreed. So, we can probably formalize this rule such that whenever in
> a newer version publisher we want to send additional information which
> the old version subscriber won't be able to handle, the error should
> be raised at the publisher by using protocol version number.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 3/30/23 05:15, Masahiko Sawada wrote:
>
> ...
>
>>>
>>> Perhaps it'd be reasonable to tie the "protocol version" to subscriber
>>> capabilities, so that a protocol version guarantees what message types
>>> the subscriber understands. So we could increment the protocol version,
>>> check it in pgoutput_startup and then error-out in the sequence callback
>>> if the subscriber version is too old.
>>>
>>> That'd be nicer in the sense that we'd generate nicer error message on
>>> the publisher, not an "unknown message type" on the subscriber.
>>>
>>
>> Agreed. So, we can probably formalize this rule such that whenever in
>> a newer version publisher we want to send additional information which
>> the old version subscriber won't be able to handle, the error should
>> be raised at the publisher by using protocol version number.
> 
> +1
> 

OK, I took a stab at this, see the attached 0007 patch which bumps the
protocol version, and allows the subscriber to specify "sequences" when
starting the replication, similar to what we do for the two-phase stuff.

The patch essentially adds 'sequences' to the replication start command,
depending on the server version, but it can be overridden by "sequences"
subscription option. The patch is pretty small, but I wonder how much
smarter this should be ...
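
For illustration, a hypothetical usage sketch of the option described
above; the option name follows the 0007 patch and this is not released
syntax, with connection and object names made up:

```sql
-- Assumes the 0007 patch: 'sequences' defaults based on the remote
-- server version, but can be overridden explicitly.
CREATE SUBSCRIPTION seq_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION seq_pub
    WITH (sequences = off);
```

With "sequences = off" the subscriber would behave like an older release
that never requests sequence messages in the start command.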


I think there are about 4 cases that we need to consider

1) there are no sequences in the publication -> OK

2) publication with sequences, subscriber knows how to apply (and
specifies "sequences on" either automatically or explicitly) -> OK

3) publication with sequences, subscriber explicitly disabled them by
specifying "sequences off" in startup -> OK

4) publication with sequences, subscriber without sequence support (e.g.
older Postgres release) -> PROBLEM (?)


The reason why I think (4) may be a problem is that my opinion is we
shouldn't silently drop stuff that is meant to be part of the
publication. That is, if someone creates a publication and adds a
sequence to it, he wants to replicate the sequence.

But the current behavior is that the old subscriber connects without
specifying 'sequences on', so the publisher disables that and then simply
ignores sequence increments during decoding.

I think we might want to detect this and error out instead of just
skipping the change, but that needs to happen later, only when the
publication actually has any sequences ...

I don't want to over-think / over-engineer this, though, so I wonder
what your opinions on this are.

There's a couple XXX comments in the code, mostly about stuff I left out
when copying the two-phase stuff. For example, we store two-phase stuff
in the replication slot itself - I don't think we need to do that for
sequences, though.

Another thing is what to do about ALTER SUBSCRIPTION - at the moment it's
not possible to change the "sequences" option, but maybe we should allow
that? But then we'd need to re-sync all the sequences, somehow ...


Aside from that, I've also added 0005, which does the sync interlock in
a slightly different way - instead of a custom function for locking
sequence, it allows LOCK on sequences. Peter Eisentraut suggested doing
it like this, it's simpler, and I can't see what issues it might cause.
The patch should update LOCK documentation, I haven't done that yet.
Ultimately it should all be merged into 0003, of course.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
"Gregory Stark (as CFM)"
Date:
Fwiw the cfbot seems to have some failing tests with this patch:


[19:05:11.398] # Failed test 'initial test data replicated'
[19:05:11.398] # at t/030_sequences.pl line 75.
[19:05:11.398] # got: '1|0|f'
[19:05:11.398] # expected: '132|0|t'
[19:05:11.398]
[19:05:11.398] # Failed test 'advance sequence in rolled-back transaction'
[19:05:11.398] # at t/030_sequences.pl line 98.
[19:05:11.398] # got: '1|0|f'
[19:05:11.398] # expected: '231|0|t'
[19:05:11.398]
[19:05:11.398] # Failed test 'create sequence, advance it in
rolled-back transaction, but commit the create'
[19:05:11.398] # at t/030_sequences.pl line 152.
[19:05:11.398] # got: '1|0|f'
[19:05:11.398] # expected: '132|0|t'
[19:05:11.398]
[19:05:11.398] # Failed test 'advance the new sequence in a
transaction and roll it back'
[19:05:11.398] # at t/030_sequences.pl line 175.
[19:05:11.398] # got: '1|0|f'
[19:05:11.398] # expected: '231|0|t'
[19:05:11.398]
[19:05:11.398] # Failed test 'advance sequence in a subtransaction'
[19:05:11.398] # at t/030_sequences.pl line 198.
[19:05:11.398] # got: '1|0|f'
[19:05:11.398] # expected: '330|0|t'
[19:05:11.398] # Looks like you failed 5 tests of 6.



-- 
Gregory Stark
As Commitfest Manager



Re: logical decoding and replication of sequences, take 2

From
Alvaro Herrera
Date:
Patch 0002 is very annoying to scroll, and I realized that it's because
psql is writing 200kB of dashes in one of the test_decoding test cases.
I propose to set psql's printing format to 'unaligned' to avoid that,
which should cut the size of that patch to a tenth.

I wonder if there's a similar issue in 0003, but I didn't check.

It's annoying that git doesn't seem to have a way of reporting length of
longest lines.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
                                                  (Linus Torvalds)

Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 4/5/23 12:39, Alvaro Herrera wrote:
> Patch 0002 is very annoying to scroll, and I realized that it's because
> psql is writing 200kB of dashes in one of the test_decoding test cases.
> I propose to set psql's printing format to 'unaligned' to avoid that,
> which should cut the size of that patch to a tenth.
> 

Yeah, that's a good idea, I think. It shrunk the diff to ~90kB, which is
much better.

> I wonder if there's a similar issue in 0003, but I didn't check.
> 

I don't think so, there just seems to be enough code changes to generate
~260kB diff with all the context.

As for the cfbot failures reported by Greg, that turned out to be a
minor thinko in the protocol version negotiation, introduced by part
0008 (current part, after adding Alvaro's patch tweaking test output).
The subscriber failed to send 'sequences on' when starting the stream.
It also forgot to refresh the subscription after a sequence was added.

The attached patch version fixes all of this, but I think at this point
it's better to just postpone this for PG17 - if it was something we
could fix within a single release, maybe. But the replication protocol
is something we can't easily change after release, so if we find out the
versioning (and sequence negotiation) should work differently, we can't
change it. In fact, we'd be probably stuck with it until PG16 gets out
of support, not just until PG17 ...

I've thought about pushing at least the first two parts (adding the
sequence decoding infrastructure and test_decoding support), but I'm not
sure that's quite worth it without the built-in replication stuff.

Or we could push it and then tweak it after feature freeze, if we
conclude the protocol versioning should work differently. I recall we
made changes to the column and row filtering in PG15. But that seems
quite wrong, obviously.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Peter Eisentraut
Date:
On 02.04.23 19:46, Tomas Vondra wrote:
> OK, I took a stab at this, see the attached 0007 patch which bumps the
> protocol version, and allows the subscriber to specify "sequences" when
> starting the replication, similar to what we do for the two-phase stuff.
> 
> The patch essentially adds 'sequences' to the replication start command,
> depending on the server version, but it can be overridden by "sequences"
> subscription option. The patch is pretty small, but I wonder how much
> smarter this should be ...

I think this should actually be much simpler.

All the code needs to do is:

- Raise protocol version (4->5)  (Your patch does that.)

- pgoutput_sequence() checks whether the protocol version is >=5 and if 
not it raises an error.

- Subscriber uses old protocol if the remote end is an older PG version. 
  (Your patch does that.)

I don't see the need for the subscriber to toggle sequences explicitly 
or anything like that.




Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
Hi,
Sorry for jumping late in this thread.

I started experimenting with the functionality. Some of this may have
been discussed earlier, but given that the thread has been going on for
so long and has gone through several changes, revalidating the
functionality is useful.

I considered the following aspects:
Changes to the sequence on subscriber
-----------------------------------------------------
1. Since this is logical decoding, the logical replica is writable. So the
logically replicated sequence can be manipulated on the subscriber as
well. This implementation consolidates the changes on subscriber and
publisher rather than replicating the publisher state as is. That's
good. See the example command sequence below:
a. publisher calls nextval() - this sets the sequence state on
publisher as (1, 32, t) which is replicated to the subscriber.
b. subscriber calls nextval() once - this sets the sequence state on
subscriber as (34, 32, t)
c. subscriber calls nextval() 32 times - on-disk state of sequence
doesn't change on subscriber
d. subscriber calls nextval() 33 times - this sets the sequence state
on subscriber as (99, 0, t)
e. publisher calls nextval() 32 times - this sets the sequence state
on publisher as (33, 0, t)

The on-disk state on publisher at the end of e. is replicated to the
subscriber but the subscriber doesn't apply it. The state there is still
(99, 0, t). I think this is closer to how logical replication of
sequences should look. This is also good enough as long as we
expect the replication of sequences to be used for failover and
switchover.

But it might not help if we want to consolidate the INSERTs that use
nextval(). If we were to treat sequences as accumulating the
increments, we might be able to resolve the conflicts by adjusting the
column values considering the increments made on the subscriber. IIUC,
conflict resolution is not part of built-in logical replication. So we
may not want to go this route. But it is worth considering.

Implementation agnostic decoded change
--------------------------------------------------------
The current method of decoding and replicating sequences is tied to
the implementation - it replicates the sequence row as is. If the
implementation changes in the future, we might need to revise the decoded
representation of a sequence. I think only nextval() matters for a sequence,
so as long as we replicate enough information to calculate the
nextval we should be good. The current implementation does that by
replicating the log_value and is_called. is_called can be consolidated
into log_value itself. The implemented protocol thus requires two
extra values to be replicated. Those can be ignored right now, but
they might pose a problem in the future if some downstream starts using
them. We would be forced to provide fake but sane values even if a
future upstream implementation does not produce those values. Of
course, we can't predict future implementations well enough to decide
what an implementation-independent format would be. E.g. if a
pluggable storage were used to implement sequences, or if we got
around to implementing distributed sequences, their shape can't be
predicted right now. So a change in protocol seems unavoidable
whatever we do. But starting with the bare minimum might save us from
larger troubles. I think it's better to just replicate the nextval()
and craft the representation on the subscriber so that it produces that
nextval().

3. Primary key sequences
-----------------------------------
I have not experimented with this. But I think we will need to add the
sequences associated with the primary keys to the publications
publishing the owner tables. Otherwise, we will have problems with
failover. And it needs to be done automatically since a. the names of
these sequences are generated automatically b. publications with FOR
ALL TABLES will add tables automatically and start replicating the
changes. Users may not be able to intercept the replication activity
to make sure the associated sequences are also added to the publication.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
Patch set needs a rebase, PFA rebased patch-set.

The conflict was in commit "Add decoding of sequences to built-in
replication", in files tablesync.c and 002_pg_dump.pl.

On Thu, May 18, 2023 at 7:53 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Hi,
> Sorry for jumping late in this thread.
>
> I started experimenting with the functionality. Maybe something that
> was already discussed earlier. Given that the thread is being
> discussed for so long and has gone through several changes, revalidating the
> functionality is useful.
>
> I considered following aspects:
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, logical replica is writable. So the
> logically replicated sequence can be manipulated on the subscriber as
> well. This implementation consolidates the changes on subscriber and
> publisher rather than replicating the publisher state as is. That's
> good. See example command sequence below
> a. publisher calls nextval() - this sets the sequence state on
> publisher as (1, 32, t) which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on
> subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence
> doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state
> on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state
> on publisher as (33, 0, t)
>
> The on-disk state on publisher at the end of e. is replicated to the
> subscriber but subscriber doesn't apply it. The state there is still
> (99, 0, t). I think this is closer to how logical replication of
> sequence should look like. This is also good enough as long as we
> expect the replication of sequences to be used for failover and
> switchover.
>
> But it might not help if we want to consolidate the INSERTs that use
> nextvals(). If we were to treat sequences as accumulating the
> increments, we might be able to resolve the conflicts by adjusting the
> columns values considering the increments made on subscriber. IIUC,
> conflict resolution is not part of built-in logical replication. So we
> may not want to go this route. But worth considering.
>
> Implementation agnostic decoded change
> --------------------------------------------------------
> Current method of decoding and replicating the sequences is tied to
> the implementation - it replicates the sequence row as is. If the
> implementation changes in future, we might need to revise the decoded
> presentation of sequence. I think only nextval() matters for sequence.
> So as long as we are replicating information enough to calculate the
> nextval we should be good. Current implementation does that by
> replicating the log_value and is_called. is_called can be consolidated
> into log_value itself. The implemented protocol, thus requires two
> extra values to be replicated. Those can be ignored right now. But
> they might pose a problem in future, if some downstream starts using
> them. We will be forced to provide fake but sane values even if a
> future upstream implementation does not produce those values. Of
> course we can't predict the future implementation enough to decide
> what would be an implementation independent format. E.g. if a
> pluggable storage were to be used to implement sequences or if we come
> around implementing distributed sequences, their shape can't be
> predicted right now. So a change in protocol seems to be unavoidable
> whatever we do. But starting with bare minimum might save us from
> larger troubles. I think, it's better to just replicate the nextval()
> and craft the representation on subscriber so that it produces that
> nextval().
>
> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the
> sequences associated with the primary keys to the publications
> publishing the owner tables. Otherwise, we will have problems with the
> failover. And it needs to be done automatically since a. the names of
> these sequences are generated automatically b. publications with FOR
> ALL TABLES will add tables automatically and start replicating the
> changes. Users may not be able to intercept the replication activity
> to add the associated sequences are also added to the publication.
>
> --
> Best Wishes,
> Ashutosh Bapat



--
Best Wishes,
Ashutosh Bapat

Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 5/18/23 16:23, Ashutosh Bapat wrote:
> Hi,
> Sorry for jumping late in this thread.
> 
> I started experimenting with the functionality. Maybe something that
> was already discussed earlier. Given that the thread is being
> discussed for so long and has gone through several changes, revalidating the
> functionality is useful.
> 
> I considered following aspects:
> Changes to the sequence on subscriber
> -----------------------------------------------------
> 1. Since this is logical decoding, logical replica is writable. So the
> logically replicated sequence can be manipulated on the subscriber as
> well. This implementation consolidates the changes on subscriber and
> publisher rather than replicating the publisher state as is. That's
> good. See example command sequence below
> a. publisher calls nextval() - this sets the sequence state on
> publisher as (1, 32, t) which is replicated to the subscriber.
> b. subscriber calls nextval() once - this sets the sequence state on
> subscriber as (34, 32, t)
> c. subscriber calls nextval() 32 times - on-disk state of sequence
> doesn't change on subscriber
> d. subscriber calls nextval() 33 times - this sets the sequence state
> on subscriber as (99, 0, t)
> e. publisher calls nextval() 32 times - this sets the sequence state
> on publisher as (33, 0, t)
> 
> The on-disk state on publisher at the end of e. is replicated to the
> subscriber but subscriber doesn't apply it. The state there is still
> (99, 0, t). I think this is closer to how logical replication of
> sequence should look like. This is also good enough as long as we
> expect the replication of sequences to be used for failover and
> switchover.
> 

I'm really confused - are you describing what the patch is doing, or
what you think it should be doing? Because right now there's nothing
that'd "consolidate" the changes (in the sense of reconciling write
conflicts), and there's absolutely no way to do that.

So if the subscriber advances the sequence (which it technically can),
the subscriber state will eventually be discarded and overwritten
when the next increment gets decoded from WAL on the publisher.

There's no way to fix this with this type of sequence - it requires some
sort of global consensus (consensus on range assignment, locking or
whatever), which we don't have.

If the sequence is the only thing replicated, this may go unnoticed. But
chances are the user is also replicating the table with the PK populated by
the sequence, at which point it'll lead to constraint violations.

> But it might not help if we want to consolidate the INSERTs that use
> nextvals(). If we were to treat sequences as accumulating the
> increments, we might be able to resolve the conflicts by adjusting the
> columns values considering the increments made on subscriber. IIUC,
> conflict resolution is not part of built-in logical replication. So we
> may not want to go this route. But worth considering.

We can't just adjust values in columns that may be used externally.

> 
> Implementation agnostic decoded change
> --------------------------------------------------------
> Current method of decoding and replicating the sequences is tied to
> the implementation - it replicates the sequence row as is. If the
> implementation changes in future, we might need to revise the decoded
> presentation of sequence. I think only nextval() matters for sequence.
> So as long as we are replicating information enough to calculate the
> nextval we should be good. Current implementation does that by
> replicating the log_value and is_called. is_called can be consolidated
> into log_value itself. The implemented protocol, thus requires two
> extra values to be replicated. Those can be ignored right now. But
> they might pose a problem in future, if some downstream starts using
> them. We will be forced to provide fake but sane values even if a
> future upstream implementation does not produce those values. Of
> course we can't predict the future implementation enough to decide
> what would be an implementation independent format. E.g. if a
> pluggable storage were to be used to implement sequences or if we come
> around implementing distributed sequences, their shape can't be
> predicted right now. So a change in protocol seems to be unavoidable
> whatever we do. But starting with bare minimum might save us from
> larger troubles. I think, it's better to just replicate the nextval()
> and craft the representation on subscriber so that it produces that
> nextval().

Yes, I agree with this. It's probably better to replicate just the next
value, without the log_cnt / is_called fields (which are implementation
specific).

> 
> 3. Primary key sequences
> -----------------------------------
> I have not experimented with this. But I think we will need to add the
> sequences associated with the primary keys to the publications
> publishing the owner tables. Otherwise, we will have problems with the
> failover. And it needs to be done automatically since a. the names of
> these sequences are generated automatically b. publications with FOR
> ALL TABLES will add tables automatically and start replicating the
> changes. Users may not be able to intercept the replication activity
> to add the associated sequences are also added to the publication.
> 

Right, this idea was mentioned before, and I agree maybe we should
consider adding some of those "automatic" sequences automatically.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 5/18/23 16:23, Ashutosh Bapat wrote:
> > Hi,
> > Sorry for jumping late in this thread.
> >
> > I started experimenting with the functionality. Maybe something that
> > was already discussed earlier. Given that the thread is being
> > discussed for so long and has gone through several changes, revalidating the
> > functionality is useful.
> >
> > I considered following aspects:
> > Changes to the sequence on subscriber
> > -----------------------------------------------------
> > 1. Since this is logical decoding, logical replica is writable. So the
> > logically replicated sequence can be manipulated on the subscriber as
> > well. This implementation consolidates the changes on subscriber and
> > publisher rather than replicating the publisher state as is. That's
> > good. See example command sequence below
> > a. publisher calls nextval() - this sets the sequence state on
> > publisher as (1, 32, t) which is replicated to the subscriber.
> > b. subscriber calls nextval() once - this sets the sequence state on
> > subscriber as (34, 32, t)
> > c. subscriber calls nextval() 32 times - on-disk state of sequence
> > doesn't change on subscriber
> > d. subscriber calls nextval() 33 times - this sets the sequence state
> > on subscriber as (99, 0, t)
> > e. publisher calls nextval() 32 times - this sets the sequence state
> > on publisher as (33, 0, t)
> >
> > The on-disk state on publisher at the end of e. is replicated to the
> > subscriber but subscriber doesn't apply it. The state there is still
> > (99, 0, t). I think this is closer to how logical replication of
> > sequence should look like. This is also good enough as long as we
> > expect the replication of sequences to be used for failover and
> > switchover.
> >
>
> I'm really confused - are you describing what the patch is doing, or
> what you think it should be doing? Because right now there's nothing
> that'd "consolidate" the changes (in the sense of reconciling write
> conflicts), and there's absolutely no way to do that.
>
> So if the subscriber advances the sequence (which it technically can),
> the subscriber state will eventually be discarded and overwritten
> when the next increment gets decoded from WAL on the publisher.

I described what I observed in my experiments. My observation doesn't
agree with your description. I will revisit this when I review the
output plugin changes and the WAL receiver changes.

>
> Yes, I agree with this. It's probably better to replicate just the next
> value, without the log_cnt / is_called fields (which are implementation
> specific).

Ok. I will review the logic once you revise the patches.

>
> >
> > 3. Primary key sequences
> > -----------------------------------
> > I have not experimented with this. But I think we will need to add the
> > sequences associated with the primary keys to the publications
> > publishing the owner tables. Otherwise, we will have problems with the
> > failover. And it needs to be done automatically since a. the names of
> > these sequences are generated automatically b. publications with FOR
> > ALL TABLES will add tables automatically and start replicating the
> > changes. Users may not be able to intercept the replication activity
> > to add the associated sequences are also added to the publication.
> >
>
> Right, this idea was mentioned before, and I agree maybe we should
> consider adding some of those "automatic" sequences automatically.
>

Are you planning to add this in the same patch set or separately?

I reviewed 0001 and related parts of 0004 and 0008 in detail.

I have only one major change request, about
typedef struct xl_seq_rec
{
RelFileLocator locator;
+ bool created; /* creates a new relfilenode (CREATE/ALTER) */

I am not sure what the repercussions are of adding a member to an existing WAL
record. I didn't see any code which handles the old WAL format that doesn't
contain the "created" flag. IIUC, logical decoding may come across
a WAL record written in the old format after an upgrade and restart. Is
that not possible?

But I don't think it's necessary. We can add a
decoding routine for RM_SMGR_ID. The decoding routine will add the relfilelocator
in the XLOG_SMGR_CREATE record to the txn->sequences hash. The rest of the logic will
work as is. Of course we will add non-sequence relfilelocators as well, but that
should be fine. Creating a new relfilelocator shouldn't be a frequent
operation. If at all we are worried about that, we can add only the
relfilenodes associated with sequences to the hash table.

If this idea has been discussed earlier, please point me to the relevant
discussion.

Some other minor comments and nitpicks.

<function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
<function>stream_commit_cb</function>, and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_sequence_cb</function> and

Like the non-streaming counterpart, should we also mention what happens if those
callbacks are not defined? That applies to stream_message_cb and
stream_truncate_cb too.

+ /*
+ * Make sure the subtransaction has a XID assigned, so that the sequence
+ * increment WAL record is properly associated with it. This matters for
+ * increments of sequences created/altered in the transaction, which are
+ * handled as transactional.
+ */
+ if (XLogLogicalInfoActive())
+ GetCurrentTransactionId();

GetCurrentTransactionId() will also assign XIDs to all the parents, so it
doesn't seem necessary to call both GetTopTransactionId() and
GetCurrentTransactionId(). Calling only the latter should suffice. This applies
to all the calls to GetCurrentTransactionId().

+
+ memcpy(((char *) tuple->tuple.t_data),
+ data + sizeof(xl_seq_rec),
+ SizeofHeapTupleHeader);
+
+ memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+ data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+ datalen);

The memory chunks being copied in these memcpy calls are contiguous. Why don't
we use a single memcpy? For readability?

+ * If we don't have snapshot or we are just fast-forwarding, there is no
+ * point in decoding messages.

s/decoding messages/decoding sequence changes/

+ tupledata = XLogRecGetData(r);
+ datalen = XLogRecGetDataLen(r);
+ tuplelen = datalen - SizeOfHeapHeader - sizeof(xl_seq_rec);
+
+ /* extract the WAL record, with "created" flag */
+ xlrec = (xl_seq_rec *) XLogRecGetData(r);

I think we should set tupledata = xlrec + sizeof(xl_seq_rec) so that it points
to the actual tuple data. This will also simplify the calculations in
DecodeSeqTuple().
+/* entry for hash table we use to track sequences created in running xacts */

s/running/transaction being decoded/ ?

+
+ /* search the lookup table (we ignore the return value, found is enough) */
+ ent = hash_search(rb->sequences,
+ (void *) &rlocator,
+ created ? HASH_ENTER : HASH_FIND,
+ &found);

Misleading comment. We seem to be using the return value later.

+ /*
+ * When creating the sequence, remember the XID of the transaction
+ * that created it.
+ */
+ if (created)
+ ent->xid = xid;

Should we set ent->locator as well? The sequence won't get cleaned otherwise.

+
+ TeardownHistoricSnapshot(false);
+
+ AbortCurrentTransaction();

This call to AbortCurrentTransaction() in PG_TRY should be made only if this
block started the transaction?

+ PG_CATCH();
+ {
+ TeardownHistoricSnapshot(true);
+
+ AbortCurrentTransaction();

Shouldn't we do this only if this block started the transaction? And in that
case, wouldn't PG_RE_THROW take care of it?

+/*
+ * Helper function for ReorderBufferProcessTXN for applying sequences.
+ */
+static inline void
+ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change,
+ bool streaming)

Possibly we should find a way to call this function from
ReorderBufferQueueSequence() when processing a non-transactional sequence change.
It should probably absorb the logic common to both cases.

+
+ if (RelationIsLogicallyLogged(relation))
+ ReorderBufferApplySequence(rb, txn, relation, change, streaming);

This condition is not used in ReorderBufferQueueSequence() when processing
a non-transactional change there. Why?
+
+ if (len)
+ {
+ memcpy(data, &tup->tuple, sizeof(HeapTupleData));
+ data += sizeof(HeapTupleData);
+
+ memcpy(data, tup->tuple.t_data, len);
+ data += len;
+ }
+

We are just copying the sequence data. Shouldn't we copy the file locator as
well, or is that not needed once the change has been queued? Similarly for
ReorderBufferChangeSize() and ReorderBufferChangeSize().

+ /*
+ * relfilenode => XID lookup table for sequences created in a transaction
+ * (also includes altered sequences, which assigns new relfilenode)
+ */
+ HTAB *sequences;
+

Better renamed as seq_rel_locator or some such. Shouldn't this be part of
ReorderBufferTXN, which has similar transaction-specific hashes?

I will continue reviewing the remaining patches.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
Regarding the patchsets, I think we will need to rearrange the
commits. Right now 0004 has some parts that should have been in 0001.
Also, the logic to assign an XID to a subtransaction would be better as a separate
commit. That piece is independent of logical decoding of sequences.

On Fri, Jun 23, 2023 at 6:48 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 5/18/23 16:23, Ashutosh Bapat wrote:
> > > Hi,
> > > Sorry for jumping late in this thread.
> > >
> > > I started experimenting with the functionality. Maybe something that
> > > was already discussed earlier. Given that the thread is being
> > > discussed for so long and has gone through several changes, revalidating the
> > > functionality is useful.
> > >
> > > I considered following aspects:
> > > Changes to the sequence on subscriber
> > > -----------------------------------------------------
> > > 1. Since this is logical decoding, logical replica is writable. So the
> > > logically replicated sequence can be manipulated on the subscriber as
> > > well. This implementation consolidates the changes on subscriber and
> > > publisher rather than replicating the publisher state as is. That's
> > > good. See example command sequence below
> > > a. publisher calls nextval() - this sets the sequence state on
> > > publisher as (1, 32, t) which is replicated to the subscriber.
> > > b. subscriber calls nextval() once - this sets the sequence state on
> > > subscriber as (34, 32, t)
> > > c. subscriber calls nextval() 32 times - on-disk state of sequence
> > > doesn't change on subscriber
> > > d. subscriber calls nextval() 33 times - this sets the sequence state
> > > on subscriber as (99, 0, t)
> > > e. publisher calls nextval() 32 times - this sets the sequence state
> > > on publisher as (33, 0, t)
> > >
> > > The on-disk state on publisher at the end of e. is replicated to the
> > > subscriber but subscriber doesn't apply it. The state there is still
> > > (99, 0, t). I think this is closer to how logical replication of
> > > sequence should look like. This is also good enough as long as we
> > > expect the replication of sequences to be used for failover and
> > > switchover.
> > >
> >
> > I'm really confused - are you describing what the patch is doing, or
> > what you think it should be doing? Because right now there's nothing
> > that'd "consolidate" the changes (in the sense of reconciling write
> > conflicts), and there's absolutely no way to do that.
> >
> > So if the subscriber advances the sequence (which it technically can),
> > the subscriber state will eventually be discarded and overwritten
> > when the next increment gets decoded from WAL on the publisher.
>
> I described what I observed in my experiments. My observation doesn't
> agree with your description. I will revisit this when I review the
> output plugin changes and the WAL receiver changes.
>
> >
> > Yes, I agree with this. It's probably better to replicate just the next
> > value, without the log_cnt / is_called fields (which are implementation
> > specific).
>
> Ok. I will review the logic once you revise the patches.
>
> >
> > >
> > > 3. Primary key sequences
> > > -----------------------------------
> > > I have not experimented with this. But I think we will need to add the
> > > sequences associated with the primary keys to the publications
> > > publishing the owner tables. Otherwise, we will have problems with the
> > > failover. And it needs to be done automatically since a. the names of
> > > these sequences are generated automatically b. publications with FOR
> > > ALL TABLES will add tables automatically and start replicating the
> > > changes. Users may not be able to intercept the replication activity
> > > to add the associated sequences are also added to the publication.
> > >
> >
> > Right, this idea was mentioned before, and I agree maybe we should
> > consider adding some of those "automatic" sequences automatically.
> >
>
> Are you planning to add this in the same patch set or separately?
>
> I reviewed 0001 and related parts of 0004 and 0008 in detail.
>
> I have only one major change request, about
> typedef struct xl_seq_rec
> {
> RelFileLocator locator;
> + bool created; /* creates a new relfilenode (CREATE/ALTER) */
>
> I am not sure what are the repercussions of adding a member to an existing WAL
> record. I didn't see any code which handles the old WAL format which doesn't
> contain the "created" flag. IIUC, the logical decoding may come across
> a WAL record written in the old format after upgrade and restart. Is
> that not possible?
>
> But I don't think it's necessary. We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are worried about that, we can add only the
> relfilenodes associated with sequences to the hash table.
>
> If this idea has been discussed earlier, please point me to the relevant
> discussion.
>
> Some other minor comments and nitpicks.
>
> <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
> <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
> - are required, while <function>stream_message_cb</function> and
> + are required, while <function>stream_message_cb</function>,
> + <function>stream_sequence_cb</function> and
>
> Like the non-streaming counterpart, should we also mention what happens if those
> callbacks are not defined? That applies to stream_message_cb and
> stream_truncate_cb too.
> + /*
> + * Make sure the subtransaction has a XID assigned, so that the sequence
> + * increment WAL record is properly associated with it. This matters for
> + * increments of sequences created/altered in the transaction, which are
> + * handled as transactional.
> + */
> + if (XLogLogicalInfoActive())
> + GetCurrentTransactionId();
>
> GetCurrentTransactionId() will also assign xids to all the parents so it
> doesn't seem necessary to call both GetTopTransactionId() and
> GetCurrentTransactionId(). Calling only the latter should suffice. Applies to
> all the calls to GetCurrentTransactionId().
>
> +
> + memcpy(((char *) tuple->tuple.t_data),
> + data + sizeof(xl_seq_rec),
> + SizeofHeapTupleHeader);
> +
> + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
> + data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
> + datalen);
>
> The memory chunks being copied in these memcpy calls are contiguous. Why don't
> we use a single memcpy? For readability?
>
> + * If we don't have snapshot or we are just fast-forwarding, there is no
> + * point in decoding messages.
>
> s/decoding messages/decoding sequence changes/
>
> + tupledata = XLogRecGetData(r);
> + datalen = XLogRecGetDataLen(r);
> + tuplelen = datalen - SizeOfHeapHeader - sizeof(xl_seq_rec);
> +
> + /* extract the WAL record, with "created" flag */
> + xlrec = (xl_seq_rec *) XLogRecGetData(r);
>
> I think we should set tupledata = xlrec + sizeof(xl_seq_rec) so that it points
> to actual tuple data. This will also simplify the calculations in
> DecodeSeqTuple().
> +/* entry for hash table we use to track sequences created in running xacts */
>
> s/running/transaction being decoded/ ?
>
> +
> + /* search the lookup table (we ignore the return value, found is enough) */
> + ent = hash_search(rb->sequences,
> + (void *) &rlocator,
> + created ? HASH_ENTER : HASH_FIND,
> + &found);
>
> Misleading comment. We seem to be using the return value later.
>
> + /*
> + * When creating the sequence, remember the XID of the transaction
> + * that created it.
> + */
> + if (created)
> + ent->xid = xid;
>
> Should we set ent->locator as well? The sequence won't get cleaned otherwise.
>
> +
> + TeardownHistoricSnapshot(false);
> +
> + AbortCurrentTransaction();
>
> This call to AbortCurrentTransaction() in PG_TRY should be made only if this
> block started the transaction?
>
> + PG_CATCH();
> + {
> + TeardownHistoricSnapshot(true);
> +
> + AbortCurrentTransaction();
>
> Shouldn't we do this only if this block started the transaction? And in that
> case, wouldn't PG_RE_THROW take care of it?
>
> +/*
> + * Helper function for ReorderBufferProcessTXN for applying sequences.
> + */
> +static inline void
> +ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
> + Relation relation, ReorderBufferChange *change,
> + bool streaming)
>
> Possibly we should find a way to call this function from
> ReorderBufferQueueSequence() when processing non-transactional sequence change.
> It should probably absorb logic common to both the cases.
>
> +
> + if (RelationIsLogicallyLogged(relation))
> + ReorderBufferApplySequence(rb, txn, relation, change, streaming);
>
> This condition is not used in ReorderBufferQueueSequence() when processing
> non-transactional change there. Why?
> +
> + if (len)
> + {
> + memcpy(data, &tup->tuple, sizeof(HeapTupleData));
> + data += sizeof(HeapTupleData);
> +
> + memcpy(data, tup->tuple.t_data, len);
> + data += len;
> + }
> +
>
> We are just copying the sequence data. Shouldn't we copy the file locator as
> well or that's not needed once the change has been queued? Similarly for
> ReorderBufferChangeSize() and ReorderBufferChangeSize()
>
> + /*
> + * relfilenode => XID lookup table for sequences created in a transaction
> + * (also includes altered sequences, which assigns new relfilenode)
> + */
> + HTAB *sequences;
> +
>
> Better renamed as seq_rel_locator or some such. Shouldn't this be part of
> ReorderBufferTXN, which has similar transaction-specific hashes?
>
> I will continue reviewing the remaining patches.
>
> --
> Best Wishes,
> Ashutosh Bapat



--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
This is a review of the 0003 patch. Overall the patch looks good and helps
understand the decoding logic better.

+                                          data
+----------------------------------------------------------------------------------------
+ BEGIN
+ sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
+ COMMIT

Looking at this output, I am wondering how would this patch work with DDL
replication. I should have noticed this earlier, sorry. A sequence DDL has two
parts, changes to the catalogs and changes to the data file. Support for
replicating the data file changes is added by these patches. The catalog
changes will need to be supported by DDL replication patch. When applying the
DDL changes, there are two ways 1. just apply the catalog changes and let the
support added here apply the data changes. 2. Apply both the changes. If the
second route is chosen, all the "transactional" decoding and application
support added by this patch will need to be ripped out. That will make the
"transactional" field in the protocol will become useless. It has potential to
be waste bandwidth in future.

OTOH, I feel that waiting for the DDL replication patch set to be committed will
cause this patchset to be delayed for an unknown duration. That's undesirable
too.

One solution I see is to use Storage RMID WAL again. While decoding it we send
a message to the subscriber telling it that a new relfilenode is being
allocated to a sequence. The subscriber too then allocates new relfilenode to
the sequence. The sequence data changes are decoded without "transactional"
flag; but they are decoded as transactional or non-transactional using the same
logic as the current patch-set. The subscriber will always apply these changes
to the relfilenode associated with the sequence at that point in time. This
would have the same effect as the current patch-set. But then there is
potential that the DDL replication patchset will render the Storage decoding
useless, so it is not an option. But anyway, I will leave this comment here as
an alternative thought, considered and discarded. Also, it might trigger a
better idea.

What do you think?

+-- savepoint test on table with serial column
+BEGIN;
+CREATE TABLE test_table (a SERIAL, b INT);
+INSERT INTO test_table (b) VALUES (100);
+INSERT INTO test_table (b) VALUES (200);
+SAVEPOINT a;
+INSERT INTO test_table (b) VALUES (300);
+ROLLBACK TO SAVEPOINT a;

The third implicit nextval won't be logged, so whether the subtransaction is
rolled back or committed won't have much effect on the decoding. Adding a
subtransaction around the first INSERT itself might be useful to test that a
subtransaction rollback does not roll back the sequence changes.

After adding {'include_sequences', false} to the calls to
pg_logical_slot_get_changes() in other tests, the SQL statement has grown
beyond 80 characters; it needs to be split into multiple lines.

         }
+        else if (strcmp(elem->defname, "include-sequences") == 0)
+        {
+
+            if (elem->arg == NULL)
+                data->include_sequences = false;

By default include_sequences = true. Shouldn't it then be set to true here?

After looking at the option processing code in
pg_logical_slot_get_changes_guts(), it looks like an argument can never be
NULL. But I see we have checks for NULL values of other arguments so it's ok to
keep a NULL check here.

I will look at 0004 next.

-- 
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 6/26/23 15:18, Ashutosh Bapat wrote:
> This is review of 0003 patch. Overall the patch looks good and helps
> understand the decoding logic better.
> 
> +                                          data
> +----------------------------------------------------------------------------------------
> + BEGIN
> + sequence public.test_sequence: transactional:1 last_value: 1
> log_cnt: 0 is_called:0
> + COMMIT
> 
> Looking at this output, I am wondering how would this patch work with DDL
> replication. I should have noticed this earlier, sorry. A sequence DDL has two
> parts, changes to the catalogs and changes to the data file. Support for
> replicating the data file changes is added by these patches. The catalog
> changes will need to be supported by DDL replication patch. When applying the
> DDL changes, there are two ways 1. just apply the catalog changes and let the
> support added here apply the data changes. 2. Apply both the changes. If the
> second route is chosen, all the "transactional" decoding and application
> support added by this patch will need to be ripped out. That will make the
> "transactional" field in the protocol will become useless. It has potential to
> be waste bandwidth in future.
> 

I don't understand why it would need to be ripped out. Why would it make
the transactional behavior useless? Can you explain?

IMHO we replicate either changes (and then DDL replication does not
interfere with that), or DDL (and then this patch should not interfere).

> > OTOH, I feel that waiting for the DDL replication patch set to be committed will
> cause this patchset to be delayed for an unknown duration. That's undesirable
> too.
> 
> One solution I see is to use Storage RMID WAL again. While decoding it we send
> a message to the subscriber telling it that a new relfilenode is being
> allocated to a sequence. The subscriber too then allocates new relfilenode to
> the sequence. The sequence data changes are decoded without "transactional"
> flag; but they are decoded as transactional or non-transactional using the same
> logic as the current patch-set. The subscriber will always apply these changes
> > to the relfilenode associated with the sequence at that point in time. This
> would have the same effect as the current patch-set. But then there is
> potential that the DDL replication patchset will render the Storage decoding
> useless. So not an option. But anyway, I will leave this as a comment as an
> alternative thought and discarded. Also this might trigger a better idea.
> 
> What do you think?
> 


I don't understand what the problem with DDL is, so I can't judge how
this is supposed to solve it.

> +-- savepoint test on table with serial column
> +BEGIN;
> +CREATE TABLE test_table (a SERIAL, b INT);
> +INSERT INTO test_table (b) VALUES (100);
> +INSERT INTO test_table (b) VALUES (200);
> +SAVEPOINT a;
> +INSERT INTO test_table (b) VALUES (300);
> +ROLLBACK TO SAVEPOINT a;
> 
> The third implicit nextval won't be logged so whether subtransaction is rolled
> back or committed, it won't have much effect on the decoding. Adding
> subtransaction around the first INSERT itself might be useful to test that the
> subtransaction rollback does not rollback the sequence changes.
> 
> After adding {'include_sequences', false} to the calls to
> pg_logical_slot_get_changes() in other tests, the SQL statement has grown
> beyond 80 characters. Need to split it into multiple lines.
> 
>          }
> +        else if (strcmp(elem->defname, "include-sequences") == 0)
> +        {
> +
> +            if (elem->arg == NULL)
> +                data->include_sequences = false;
> 
> By default inlclude_sequences = true. Shouldn't then it be set to true here?
> 

I don't follow. Is this still related to the DDL replication, or are you
describing some new issue with savepoints?

> After looking at the option processing code in
> pg_logical_slot_get_changes_guts(), it looks like an argument can never be
> NULL. But I see we have checks for NULL values of other arguments so it's ok to
> keep a NULL check here.
> 
> I will look at 0004 next.
> 

OK

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 6/26/23 15:18, Ashutosh Bapat wrote:
> > This is review of 0003 patch. Overall the patch looks good and helps
> > understand the decoding logic better.
> >
> > +                                          data
> > +----------------------------------------------------------------------------------------
> > + BEGIN
> > + sequence public.test_sequence: transactional:1 last_value: 1
> > log_cnt: 0 is_called:0
> > + COMMIT
> >
> > Looking at this output, I am wondering how would this patch work with DDL
> > replication. I should have noticed this earlier, sorry. A sequence DDL has two
> > parts, changes to the catalogs and changes to the data file. Support for
> > replicating the data file changes is added by these patches. The catalog
> > changes will need to be supported by DDL replication patch. When applying the
> > DDL changes, there are two ways 1. just apply the catalog changes and let the
> > support added here apply the data changes. 2. Apply both the changes. If the
> > second route is chosen, all the "transactional" decoding and application
> > support added by this patch will need to be ripped out. That will make the
> > "transactional" field in the protocol will become useless. It has potential to
> > be waste bandwidth in future.
> >
>
> I don't understand why would it need to be ripped out. Why would it make
> the transactional behavior useless? Can you explain?
>
> IMHO we replicate either changes (and then DDL replication does not
> interfere with that), or DDL (and then this patch should not interfere).
>
> > OTOH, I feel that waiting for the DDL repliation patch set to be commtted will
> > cause this patchset to be delayed for an unknown duration. That's undesirable
> > too.
> >
> > One solution I see is to use Storage RMID WAL again. While decoding it we send
> > a message to the subscriber telling it that a new relfilenode is being
> > allocated to a sequence. The subscriber too then allocates new relfilenode to
> > the sequence. The sequence data changes are decoded without "transactional"
> > flag; but they are decoded as transactional or non-transactional using the same
> > logic as the current patch-set. The subscriber will always apply these changes
> > to the reflilenode associated with the sequence at that point in time. This
> > would have the same effect as the current patch-set. But then there is
> > potential that the DDL replication patchset will render the Storage decoding
> > useless. So not an option. But anyway, I will leave this as a comment as an
> > alternative thought and discarded. Also this might trigger a better idea.
> >
> > What do you think?
> >
>
>
> I don't understand what the problem with DDL is, so I can't judge how
> this is supposed to solve it.

I have not looked at the DDL replication patch in detail so I may be
missing something. IIUC, that patch replicates the DDL statement in
some form: parse tree or statement. But it doesn't replicate some or
all of the WAL records that the DDL execution generates.

Consider DDL "ALTER SEQUENCE test_sequence RESTART WITH 4000;". It
updates the catalogs with a new relfilenode and also the START VALUE.
It also writes to the new relfilenode. When publisher replicates the
DDL and the subscriber applies it, it will do the same - update the
catalogs and write to new relfilenode. We don't want the sequence data
to be replicated again when it's changed by a DDL. All the
transactional changes are associated with a DDL. Other changes to the
sequence data are non-transactional. So when replicating the sequence
data changes, the "transactional" field becomes useless. What I am
pointing at is: if we add the "transactional" field to the protocol today
and in the future DDL replication is implemented in a way that makes the
"transactional" field redundant, we have introduced a redundant field
which will eat a byte on the wire. Of course we can remove it by bumping
the protocol version, but that's some work.

Please note we will still need the code to determine whether a change
in sequence data is transactional or not, IOW whether it's associated
with DDL or not. So that code remains.

> >
> >          }
> > +        else if (strcmp(elem->defname, "include-sequences") == 0)
> > +        {
> > +
> > +            if (elem->arg == NULL)
> > +                data->include_sequences = false;
> >
> > By default inlclude_sequences = true. Shouldn't then it be set to true here?
> >
>
> I don't follow. Is this still related to the DDL replication, or are you
> describing some new issue with savepoints?

Not related to DDL replication. Not an issue with savepoints either.
Just a comment about that particular change. Sorry for not being clear.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
On Mon, Jun 26, 2023 at 8:35 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 6/26/23 15:18, Ashutosh Bapat wrote:

> > I will look at 0004 next.
> >
>
> OK


0004 is quite large. I think if we split this into two or even three parts,
e.g. 1. publication and subscription catalog handling and 2. built-in
replication protocol changes, it might be easier to review. But anyway, I have
given it one read. I have reviewed the parts which deal with the replication
proper in detail. I have *not* thoroughly reviewed the parts which deal with
the catalogs, pg_dump, describe and tab completion. Similarly tests. If those
parts need a thorough review, please let me know.

But before jumping into the comments, a weird scenario I tried. On the
publisher I created a table t1(a int, b int) and a sequence s and added both to
a publication. On the subscriber I swapped their names, i.e. created a table
s(a int, b int) and a sequence t1, and subscribed to the publication. The
subscription was created, and during replication it threw the errors "logical
replication target relation "public.t1" is missing replicated columns: "a",
"b"" and "logical replication target relation "public.s" is missing replicated
columns: "last_value", "log_cnt", "is_called"". I think it's good that it at
least threw an error. But it would be good if it detected that the reltypes
themselves are different and mentioned that in the error. Something like
"logical replication target "public.s" is not a sequence like source
"public.s"."

Comments on the patch itself.

I didn't find any mention of 'sequence' in the documentation of the publish
option in CREATE or ALTER PUBLICATION. Is something missing in the
documentation? But do
we really need to record "sequence" as an operation? Just adding the sequences
to the publication should be fine right? There's only one operation on
sequences, updating the sequence row.

+CREATE VIEW pg_publication_sequences AS
+    SELECT
+        P.pubname AS pubname,
+        N.nspname AS schemaname,
+        C.relname AS sequencename

If we reported the oid or regclass for sequences it might be easier to join the
view further. We don't have a reg* type for publications, so we could report
both the oid and the name of the publication.

+/*
+ * Update the sequence state by modifying the existing sequence data row.
+ *
+ * This keeps the same relfilenode, so the behavior is non-transactional.
+ */
+static void
+SetSequence_non_transactional(Oid seqrelid, int64 last_value, int64 log_cnt, bool is_called)

This function has some code similar to nextval but with the sequence
of operations (viz. changes to buffer, WAL insert and cache update) changed.
Given the comments in nextval_internal() the difference in sequence of
operations should not make a difference in the end result. But I think it will
be good to deduplicate the code to avoid confusion and also for ease of
maintenance.

+
+/*
+ * Update the sequence state by creating a new relfilenode.
+ *
+ * This creates a new relfilenode, to allow transactional behavior.
+ */
+static void
+SetSequence_transactional(Oid seq_relid, int64 last_value, int64 log_cnt, bool is_called)

Need some deduplication here as well. But the similarities with AlterSequence,
ResetSequence or DefineSequence are less.

@@ -730,9 +731,9 @@ CreateSubscription(ParseState *pstate,
CreateSubscriptionStmt *stmt,
     {
             /*
-             * Get the table list from publisher and build local table status
-             * info.
+             * Get the table and sequence list from publisher and build
+             * local relation sync status info.
              */
-            tables = fetch_table_list(wrconn, publications);
-            foreach(lc, tables)
+            relations = fetch_table_list(wrconn, publications);

Is it allowed to connect a newer subscriber to an old publisher? If yes, the
query to fetch sequences will throw an error since it won't find the catalog.

@@ -882,8 +886,10 @@ AlterSubscription_refresh(Subscription *sub, bool
copy_data,
-        /* Get the table list from publisher. */
+        /* Get the list of relations from publisher. */
         pubrel_names = fetch_table_list(wrconn, sub->publications);
+        pubrel_names = list_concat(pubrel_names,
+                                   fetch_sequence_list(wrconn,
+                                                       sub->publications));

Similarly here.

+void
+logicalrep_write_sequence(StringInfo out, Relation rel, TransactionId xid,
+
... snip ...
+    pq_sendint8(out, flags);
+    pq_sendint64(out, lsn);
... snip ...
+LogicalRepRelId
+logicalrep_read_sequence(StringInfo in, LogicalRepSequence *seqdata)
+{
... snip ...
+    /* XXX skipping flags and lsn */
+    pq_getmsgint(in, 1);
+    pq_getmsgint64(in);

We are ignoring these two fields on the WAL receiver side. I don't see such
fields being part of INSERT, UPDATE or DELETE messages. Should we just drop
those or do they have some future use? Two lsns are written by
OutputPrepareWrite() as prologue to the logical message. If this LSN
is one of them, it could be dropped anyway.


+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    appendStringInfo(&cmd, "SELECT last_value, log_cnt, is_called\n"
+                     "  FROM %s",
+                     quote_qualified_identifier(nspname, relname));

We are using an undocumented interface here. SELECT ... FROM <sequence> is not
documented. This code will break if we change the way a sequence is stored.
That is quite unlikely but not impossible. Ideally we should use one of the
methods documented at [1]. But none of them provide what is needed per your
comment in copy_sequence(), i.e. the state of the sequence as of the last WAL
record on that sequence. So I don't have any better idea than what's done in
the patch. Maybe we can use "nextval() + 32" as an approximation.

Some minor comments and nitpicks:

@@ -1958,12 +1958,14 @@ get_object_address_publication_schema(List
*object, bool missing_ok)

Need an update to the function prologue with the description of the third
element. Also the error message at the end of the function needs to mention the
object type.

-                appendStringInfo(&buffer, _("publication of schema %s
in publication %s"),
-                                 nspname, pubname);
+                appendStringInfo(&buffer, _("publication of schema %s
in publication %s type %s"),
+                                 nspname, pubname, objtype);

s/type/for object type/ ?


@@ -5826,18 +5842,24 @@ getObjectIdentityParts(const ObjectAddress *object,

                     break;
-                appendStringInfo(&buffer, "%s in publication %s",
-                                 nspname, pubname);
+                appendStringInfo(&buffer, "%s in publication %s type %s",
+                                 nspname, pubname, objtype);

s/type/object type/? ... in some other places as well?


+/*
+ * Check the character is a valid object type for schema publication.
+ *
+ * This recognizes either 't' for tables or 's' for sequences. Places that
+ * need to handle 'u' for unsupported relkinds need to do that explicitlyl

s/explicitlyl/explicitly/

+Datum
+pg_get_publication_sequences(PG_FUNCTION_ARGS)
+{
 ... snip ...
+        /*
+         * Publications support partitioned tables, although all changes are
+         * replicated using leaf partition identity and schema, so we only
+         * need those.
+         */

Not relevant here.

+        if (publication->allsequences)
+            sequences = GetAllSequencesPublicationRelations();
+        else
+        {
+            List       *relids,
+                       *schemarelids;
+
+            relids = GetPublicationRelations(publication->oid,
+                                             PUB_OBJTYPE_SEQUENCE,
+                                             publication->pubviaroot ?
+                                             PUBLICATION_PART_ROOT :
+                                             PUBLICATION_PART_LEAF);
+            schemarelids = GetAllSchemaPublicationRelations(publication->oid,
+                                                            PUB_OBJTYPE_SEQUENCE,
+                                                            publication->pubviaroot ?
+                                                            PUBLICATION_PART_ROOT :
+                                                            PUBLICATION_PART_LEAF);

I think we should just pass PUBLICATION_PART_ALL since that parameter is
irrelevant to sequences anyway. Otherwise this code would be confusing.

I think we should rename PublicationTable structure to PublicationRelation
since it can now contain information about a table or a sequence, both of which
are relations.

+/*
+ * Add or remove table to/from publication.

s/table/sequence/. This generally applies to all the code that works for
tables and was copied and modified for sequences.

@@ -18826,6 +18867,30 @@ preprocess_pubobj_list(List *pubobjspec_list,
core_yyscan_t yyscanner)
                         errmsg("invalid schema name"),
                         parser_errposition(pubobj->location));
         }
+        else if (pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_SCHEMA ||
+                 pubobj->pubobjtype == PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA)
+        {
+            /* WHERE clause is not allowed on a schema object */
+            if (pubobj->pubtable && pubobj->pubtable->whereClause)
+                ereport(ERROR,
+                        errcode(ERRCODE_SYNTAX_ERROR),
+                        errmsg("WHERE clause not allowed for schema"),
+                        parser_errposition(pubobj->location));

The grammar doesn't allow specifying a whereClause with the ALL TABLES IN
SCHEMA specification, but we have code to throw an error if that happens. We
also have similar code for ALL SEQUENCES IN SCHEMA. Should we add it for the
SEQUENCE specification as well?

+static void
+fetch_sequence_data(char *nspname, char *relname,
... snip ...
+    /* tablesync sets the sequences in non-transactional way */
+    SetSequence(RelationGetRelid(rel), false, last_value, log_cnt, is_called);

Why? In case of a regular table, if the sync fails, the table will retain its
state from before the sync. Similarly it would be expected that the sequence
retains its state from before the sync, no?

@@ -1467,10 +1557,21 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)

Now that it syncs sequences as well, should we rename this as
LogicalRepSyncRelationStart?

+static void
+apply_handle_sequence(StringInfo s)
... snip ...
+        /*
+         * Commit the per-stream transaction (we only do this when not in
+         * remote transaction, i.e. for non-transactional sequence updates.)
+         */
+        if (!in_remote_transaction)
+            CommitTransactionCommand();

I understand the purpose of the if block. It commits the transaction that was
started when applying a non-transactional sequence change. But I didn't
understand the term "per-stream transaction".

@@ -5683,8 +5686,15 @@ RelationBuildPublicationDesc(Relation relation,
PublicationDesc *pubdesc)

Thanks for the additional comments. Those are useful.

@@ -1716,28 +1716,19 @@ describeOneTableDetails(const char *schemaname,

I think these changes make it easy to print the publication description per the
code changes later. But maybe we should commit the refactoring patch
separately.

-DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_index, 6239, PublicationNamespacePnnspidPnpubidIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops));
+DECLARE_UNIQUE_INDEX(pg_publication_namespace_pnnspid_pnpubid_pntype_index, 8903, PublicationNamespacePnnspidPnpubidPntypeIndexId, on pg_publication_namespace using btree(pnnspid oid_ops, pnpubid oid_ops, pntype char_ops));

Why do we need a new OID? The old index should not be there in a cluster
created using this version and hence this OID will not be used.

[1] https://www.postgresql.org/docs/current/functions-sequence.html

Next I will review 0005.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
0005, 0006 and 0007 are all related to the initial sequence sync. [3]
resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
be reviewed in this response.

I followed the discussion starting [1] till [2]. The second one
mentions the interlock mechanism which has been implemented in 0005
and 0006. While I don't have an objection to allowing LOCKing a
sequence using the LOCK command, I am not sure whether it will
actually work or is even needed.

The problem described in [1] seems to be the same as the problem
described in [2]. In both cases we see the sequence moving backwards
during CATCHUP. At the end of catchup the sequence is in the right
state in both cases. [2] actually deems this behaviour OK, and I agree.
I am not sure whether we have solved anything using the interlocking, or
whether it's really needed.

I see that the idea of using an LSN to decide whether or not to apply
a change to sequence started in [4]. In [5] Tomas proposed to use page
LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas
agreed that LSN won't be sufficient. But I don't understand why. There
are three LSNs in the picture - restart LSN of sync slot,
confirmed_flush LSN of sync slot and page LSN of the sequence page
from where we read the initial state of the sequence. I think they can
be used with the following rules:
1. The publisher will not send any changes with LSN less than
confirmed_flush so we are good there.
2. Any non-transactional changes that happened between confirmed_flush
and page LSN should be discarded while syncing. They are already
visible to SELECT.
3. Any transactional changes with commit LSN between confirmed_flush
and page LSN should be discarded while syncing. They are already
visible to SELECT.
4. A DDL acquires a lock on sequence. Thus no other change to that
sequence can have an LSN between the LSN of the change made by DDL and
the commit LSN of that transaction. Only DDL changes to sequence are
transactional. Hence any transactional changes with commit LSN beyond
page LSN would not have been seen by the SELECT otherwise SELECT would
see the page LSN committed by that transaction. so they need to be
applied while syncing.
5. Any non-transactional changes beyond page LSN should be applied.
They are not seen by SELECT.

Am I missing something?

I don't have an idea how to get page LSN via a SQL query (while also
fetching data on that page). That may or may not be a challenge.

[1] https://www.postgresql.org/message-id/c2799362-9098-c7bf-c315-4d7975acafa3%40enterprisedb.com
[2] https://www.postgresql.org/message-id/2d4bee7b-31be-8b36-2847-a21a5d56e04f%40enterprisedb.com
[3] https://www.postgresql.org/message-id/f5a9d63d-a6fe-59a9-d1ed-38f6a5582c13%40enterprisedb.com
[4] https://www.postgresql.org/message-id/CAA4eK1KUYrXFq25xyjBKU1UDh7Dkzw74RXN1d3UAYhd4NzDcsg%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAA4eK1LiA8nV_ZT7gNHShgtFVpoiOvwoxNsmP_fryP%3DPsYPvmA%40mail.gmail.com
[6] https://www.postgresql.org/docs/current/storage-page-layout.html

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From:
Ashutosh Bapat
Date:
And the last patch 0008.

@@ -1180,6 +1194,13 @@ AlterSubscription(ParseState *pstate,
AlterSubscriptionStmt *stmt,
... snip ...
+                if (IsSet(opts.specified_opts, SUBOPT_SEQUENCES))
+                {
+                    values[Anum_pg_subscription_subsequences - 1] =
+                        BoolGetDatum(opts.sequences);
+                    replaces[Anum_pg_subscription_subsequences - 1] = true;
+                }
+

The list of allowed options set a few lines above this code does not contain
"sequences". Is this option missing there or this code is unnecessary? If we
intend to add "sequence" at a later time after a subscription is created, will
the sequences be synced after ALTER SUBSCRIPTION?

+    /*
+     * ignore sequences when not requested
+     *
+     * XXX Maybe we should differentiate between "callbacks not defined" or
+     * "subscriber disabled sequence replication" and "subscriber does not
+     * know about sequence replication" (e.g. old subscriber version).
+     *
+     * For the first two it'd be fine to bail out here, but for the last it

It's not clear which two you are talking about. Maybe that's because the
paragraph above is ambiguous. It is in the form of A or B and C, so it's not
clear which cases we are differentiating between: (A, B, C), ((A or B) and C) or (A or
(B and C)) or something else.

+     * might be better to continue and error out only when the sequence
+     * would be replicated (e.g. as part of the publication). We don't know
+     * that here, unfortunately.

Please see comments on changes to pgoutput_startup() below. We may
want to change the paragraph accordingly.

@@ -298,6 +298,20 @@ StartupDecodingContext(List *output_plugin_options,
      */
     ctx->reorder->update_progress_txn = update_progress_txn_cb_wrapper;

+    /*
+     * To support logical decoding of sequences, we require the sequence
+     * callback. We decide it here, but only check it later in the wrappers.
+     *
+     * XXX Isn't it wrong to define only one of those callbacks? Say we
+     * only define the stream_sequence_cb() - that may get strange results
+     * depending on what gets streamed. Either none or both?

I don't think the current condition is correct; it will consider sequence
changes to be streamed even when sequence_cb is not defined, and will actually
not send those. sequence_cb is needed to send sequence changes irrespective of
whether transaction streaming is supported.  But stream_sequence_cb is required
if other stream callbacks are available. Something like

if (ctx->callbacks.sequence_cb)
{
    if (ctx->streaming)
    {
        if (ctx->callbacks.stream_sequence_cb == NULL)
            ctx->sequences = false;
        else
            ctx->sequences = true;
    }
    else
        ctx->sequences = true;
}
else
    ctx->sequences = false;
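For what it's worth, the nested condition above collapses into a single boolean
expression; here's a sketch using stand-in booleans (the parameter names are
illustrative, not the real ctx fields):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the condition proposed above: sequence decoding needs
 * sequence_cb, and when streaming is enabled it additionally needs
 * stream_sequence_cb. The parameters stand in for the real
 * ctx->streaming flag and ctx->callbacks fields.
 */
static bool
sequences_supported(bool has_sequence_cb, bool streaming,
                    bool has_stream_sequence_cb)
{
    return has_sequence_cb && (!streaming || has_stream_sequence_cb);
}
```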

+     *
+     * XXX Shouldn't sequence be defined at slot creation time, similar
+     * to two_phase? Probably not.

I don't know why two_phase is defined at the slot creation time, so can't
comment on this. But looks like something we need to answer before committing
the patches.

+    /*
+     * We allow decoding of sequences when the option is given at the streaming
+     * start, provided the plugin supports all the callbacks for two-phase.

s/two-phase/sequences/

+     *
+     * XXX Similar behavior to the two-phase block below.

I think we need to describe sequence-specific behaviour instead of pointing to
the two-phase block: two-phase is part of the replication slot's on-disk state,
but sequence is not. Given that it's an XXX, I think you are planning to do that.

+     *
+     * XXX Shouldn't this error out if the callbacks are not defined?

Isn't this already being done in pgoutput_startup()? Should we remove this XXX.

+        /*
+         * Here, we just check whether the sequences decoding option is passed
+         * by plugin and decide whether to enable it at later point of time. It
+         * remains enabled if the previous start-up has done so. But we only
+         * allow the option to be passed in with sufficient version of the
+         * protocol, and when the output plugin supports it.
+         */
+        if (!data->sequences)
+            ctx->sequences_opt_given = false;
+        else if (data->protocol_version < LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("requested proto_version=%d does not support sequences, need %d or higher",
+                            data->protocol_version,
+                            LOGICALREP_PROTO_SEQUENCES_VERSION_NUM)));
+        else if (!ctx->sequences)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("sequences requested, but not supported by output plugin")));

If a given output plugin doesn't implement the callbacks but subscription
specifies sequences, the code will throw an error whether or not publication is
publishing sequences. Instead I think the behaviour should be the same as the case
when publication doesn't include sequences even if the publisher node has
sequences. In either case publisher (the plugin or the publication) doesn't want
to publish sequence data. So subscriber's request can be ignored.

What might be good is to throw an error if the publication publishes the
sequences but there are no callbacks - both the output plugin and the publication
are part of the publisher node, thus it's easy for users to set them up consistently.
GetPublicationRelations can be tweaked a bit to return just tables or sequences.
That along with publication's all sequences flag should tell us whether
publication publishes any sequences or not.

That ends my first round of reviews.

-- 
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Tue, Jun 27, 2023 at 11:30 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I have not looked at the DDL replication patch in detail so I may be
> missing something. IIUC, that patch replicates the DDL statement in
> some form: parse tree or statement. But it doesn't replicate the some
> or all WAL records that the DDL execution generates.
>

Yes, the DDL replication patch uses the parse tree and catalog
information to generate a deparsed form of DDL statement which is WAL
logged and used to replicate DDLs.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
Hi,

here's a rebased and significantly reworked version of this patch
series, based on the recent reviews and discussion. Let me go through
the main differences:


1) reorder the patches to have the "shortening" of test output first


2) merge the various "fix" patches in to the three main patches

   0002 - introduce sequence decoding infrastructure
   0003 - add sequences to test_decoding
   0004 - add sequences to built-in replication

I've kept those patches separate to make the evolution easier to follow
and discuss, but it was necessary to cleanup the patch series and make
it clearer what the current state is.


3) simplify the replicated state

As suggested by Ashutosh, it may not be a good idea to replicate the
(last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to
our internal implementation. Which may not be the right thing for other
plugins. So this new patch replicates just "value" which is pretty much
(last_value + log_cnt), representing the next value that should be safe
to generate on the subscriber (in case of a failover).
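A minimal sketch of that computation; the struct and field names here are
illustrative, not the actual sequence tuple layout:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified replicated state: instead of the internal
 * (last_value, log_cnt, is_called) tuple, replicate a single value
 * that is safe to generate on the subscriber after a failover.
 */
typedef struct SequenceState
{
    int64_t     last_value; /* last value covered by WAL */
    int64_t     log_cnt;    /* values still covered by that WAL record */
} SequenceState;

static int64_t
replicated_value(const SequenceState *seq)
{
    return seq->last_value + seq->log_cnt;
}
```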


4) simplify test_decoding code & tests

I realized I can ditch some of the test_decoding changes, because at
some point we chose to only include sequences in test_decoding when
explicitly requested. So the tests don't need to disable that, it's the
other way - one test needs to enable it.

This now also prints the single value, instead of the three values.


5) minor tweaks in the built-in replication

This adopts the relaxed LOCK code to allow locking sequences during the
initial sync, and also adopts the replication of a single value (this
affects the "apply" side of that change too).


6) simplified protocol versioning

The main open question I had was what to do about protocol versioning
for the built-in replication - how to decide whether the subscriber can
apply sequences, and what should happen if we decode sequence but the
subscriber does not support that.

I was not entirely sure we want to handle this by a simple version
check, because that maps capabilities to a linear scale, which seems
pretty limiting. That is, each protocol version just grows, and new
version number means support of a new capability - like replication of
two-phase commits, or sequences. Which is nice, but it does not allow
supporting just the later feature, for example - you can't skip one.
Which is why 2PC decoding has both a version and a subscription flag,
which allows exactly that ...

When discussing this off-list with Peter Eisentraut, he reminded me of
his old message in the thread:

https://www.postgresql.org/message-id/8046273f-ea88-5c97-5540-0ccd5d244fd4@enterprisedb.com

where he advocates for exactly this simplified behavior. So I took a
stab at it and 0005 should be doing that. I keep it as a separate patch
for now, to make the changes clearer, but ultimately it should be merged
into 0003 and 0004 parts.

It's not particularly complex change, it mostly ditches the subscription
option (which also means dropping columns from the pg_subscription catalog), and a
flag in the decoding context.

But the main change is in pgoutput_sequence(), where we check protocol_version
and error out if it's not the right version (instead of just ignoring
the sequence). AFAICS this behaves as expected - with PG15 subscriber, I
get an ERROR on the publisher side from the sequence callback.

But it now occurred to me we could do the same thing with the original
approach - allow the per-subscription "sequences" flag, but error out
when the subscriber did not enable that capability ...


Hopefully, I haven't forgotten to address any important point from the
reviews ...

The one thing I'm not really sure about is how it interferes with the
replication of DDL. But in principle, if it decodes DDL for ALTER
SEQUENCE, I don't see why it would be a problem that we then decode and
replicate the WAL for the sequence state. But if it is a problem, we
should be able to skip this WAL record with the initial sequence state
(which I think should be possible thanks to the "created" flag this
patch adds to the WAL record).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
Thanks for the updated patches. I haven't looked at the patches yet
but have some responses below.

On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

>
>
> 3) simplify the replicated state
>
> As suggested by Ashutosh, it may not be a good idea to replicate the
> (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to
> our internal implementation. Which may not be the right thing for other
> plugins. So this new patch replicates just "value" which is pretty much
> (last_value + log_cnt), representing the next value that should be safe
> to generate on the subscriber (in case of a failover).
>

Thanks. That will help.


> 5) minor tweaks in the built-in replication
>
> This adopts the relaxed LOCK code to allow locking sequences during the
> initial sync, and also adopts the replication of a single value (this
> affects the "apply" side of that change too).
>

I think the problem we are trying to solve with LOCK is not actually
getting solved. See [2]. Instead your earlier idea of using page LSN
looks better.

>
> 6) simplified protocol versioning

I had tested the cross-version logical replication with older set of
patches. Didn't see any unexpected behaviour then. I will test again.
>
> The one thing I'm not really sure about is how it interferes with the
> replication of DDL. But in principle, if it decodes DDL for ALTER
> SEQUENCE, I don't see why it would be a problem that we then decode and
> replicate the WAL for the sequence state. But if it is a problem, we
> should be able to skip this WAL record with the initial sequence state
> (which I think should be possible thanks to the "created" flag this
> patch adds to the WAL record).

I had suggested a solution in [1] to avoid adding a flag to the WAL
record. Did you consider it? If you considered it and rejected, I
would be interested in knowing reasons behind rejecting it. Let me
repeat here again:

```
We can add a
decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
as is. Of course we will add non-sequence relfilelocators as well but that
should be fine. Creating a new relfilelocator shouldn't be a frequent
operation. If at all we are worried about that, we can add only the
relfilenodes associated with sequences to the hash table.
```

If the DDL replication takes care of replicating and applying sequence
changes, I think we don't need the changes tracking "transactional"
sequence changes in this patch-set. That also makes a case for not
adding a new field to WAL which may not be used.

[1] https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAExHW5vHRgjWzi6zZbgCs97eW9U7xMtzXEQK%2BaepuzoGDsDNtg%40mail.gmail.com
--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 6/23/23 15:18, Ashutosh Bapat wrote:
> ...
>
> I reviewed 0001 and related parts of 0004 and 0008 in detail.
> 
> I have only one major change request, about
> typedef struct xl_seq_rec
> {
> RelFileLocator locator;
> + bool created; /* creates a new relfilenode (CREATE/ALTER) */
> 
> I am not sure what are the repercussions of adding a member to an existing WAL
> record. I didn't see any code which handles the old WAL format which doesn't
> contain the "created" flag. IIUC, the logical decoding may come across
> a WAL record written in the old format after upgrade and restart. Is
> that not possible?
> 

I don't understand why would adding a new field to xl_seq_rec be an
issue, considering it's done in a new major version. Sure, if you
generate WAL with old build, and start with a patched version, that
would break things. But that's true for many other patches, and it's
irrelevant for releases.

> But I don't think it's necessary. We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are worried about that, we can add only the
> relfilenodes associated with sequences to the hash table.
> 

Hmmmm, that might work. I feel a bit uneasy about having to keep all
relfilenodes, not just sequences ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/5/23 16:51, Ashutosh Bapat wrote:
> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> be reviewed in this response.
> 
> I followed the discussion starting [1] till [2]. The second one
> mentions the interlock mechanism which has been implemented in 0005
> and 0006. While I don't have an objection to allowing LOCKing a
> sequence using the LOCK command, I am not sure whether it will
> actually work or is even needed.
> 
> The problem described in [1] seems to be the same as the problem
> described in [2]. In both cases we see the sequence moving backwards
> during CATCHUP. At the end of catchup the sequence is in the right
> state in both the cases. [2] actually deems this behaviour OK. I also
> agree that the behaviour is ok. I am confused whether we have solved
> anything using interlocking and it's really needed.
> 
> I see that the idea of using an LSN to decide whether or not to apply
> a change to sequence started in [4]. In [5] Tomas proposed to use page
> LSN. Looking at [6], it actually seems like a good idea. In [7] Tomas
> agreed that LSN won't be sufficient. But I don't understand why. There
> are three LSNs in the picture - restart LSN of sync slot,
> confirmed_flush LSN of sync slot and page LSN of the sequence page
> from where we read the initial state of the sequence. I think they can
> be used with the following rules:
> 1. The publisher will not send any changes with LSN less than
> confirmed_flush so we are good there.
> 2. Any non-transactional changes that happened between confirmed_flush
> and page LSN should be discarded while syncing. They are already
> visible to SELECT.
> 3. Any transactional changes with commit LSN between confirmed_flush
> and page LSN should be discarded while syncing. They are already
> visible to SELECT.
> 4. A DDL acquires a lock on sequence. Thus no other change to that
> sequence can have an LSN between the LSN of the change made by DDL and
> the commit LSN of that transaction. Only DDL changes to sequence are
> transactional. Hence any transactional changes with commit LSN beyond
> page LSN would not have been seen by the SELECT otherwise SELECT would
> see the page LSN committed by that transaction. so they need to be
> applied while syncing.
> 5. Any non-transactional changes beyond page LSN should be applied.
> They are not seen by SELECT.
> 
> Am I missing something?
> 
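The five quoted rules boil down to a single comparison against the page LSN
that the initial SELECT read; here's a sketch (the typedef is a stand-in, not
the server's definition):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

/*
 * Rules 2-5 above: during sync, apply a change only if it is not
 * already reflected in the sequence page the initial SELECT read.
 * For transactional changes "lsn" is the commit LSN; for
 * non-transactional ones it is the LSN of the change itself.
 * Rule 1 (nothing below confirmed_flush is sent) is enforced by the
 * walsender, so it needs no check here.
 */
static bool
apply_during_sync(XLogRecPtr lsn, XLogRecPtr page_lsn)
{
    return lsn > page_lsn;
}
```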

Hmmm, I think you're onto something and the interlock may not be
actually necessary ...

IIRC there were two examples of the non-MVCC sequence behavior, leading
me to add the interlock.


1) going "backwards" during catchup

Sequences are not MVCC, and if there are increments between the slot
creation and the SELECT, the sequence will go backwards. But it will
ultimately end with the correct value. The LSN checks were an attempt to
prevent this.

I don't recall why I concluded this would not be sufficient (there's no
link for [7] in your message), but maybe it was related to the sequence
increments not being WAL-logged and thus not guaranteed to update the
page LSN, or something like that.

But if we agree we only guarantee consistency at the end of the catchup,
this does not matter - it's OK to go backwards as long as the sequence
ends with the correct value.


2) missing an increment because of ALTER SEQUENCE

My concern here was that we might have a transaction that does ALTER
SEQUENCE before the tablesync slot gets created, and the SELECT still
sees the old sequence state because we start decoding after the ALTER.

But now that I think about it again, this probably can't happen, because
the slot won't be created until the ALTER commits. So we shouldn't miss
anything.

I suspect I got confused by some other bug in the patch at that time,
leading me to a faulty conclusion.


I'll try removing the interlock, and make sure it actually works OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/13/23 16:24, Ashutosh Bapat wrote:
> Thanks for the updated patches. I haven't looked at the patches yet
> but have some responses below.
> 
> On Thu, Jul 13, 2023 at 12:35 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>>
>>
>> 3) simplify the replicated state
>>
>> As suggested by Ashutosh, it may not be a good idea to replicate the
>> (last_value, log_cnt, is_called) tuple, as that's pretty tightly tied to
>> our internal implementation. Which may not be the right thing for other
>> plugins. So this new patch replicates just "value" which is pretty much
>> (last_value + log_cnt), representing the next value that should be safe
>> to generate on the subscriber (in case of a failover).
>>
> 
> Thanks. That will help.
> 
> 
>> 5) minor tweaks in the built-in replication
>>
>> This adopts the relaxed LOCK code to allow locking sequences during the
>> initial sync, and also adopts the replication of a single value (this
>> affects the "apply" side of that change too).
>>
> 
> I think the problem we are trying to solve with LOCK is not actually
> getting solved. See [2]. Instead your earlier idea of using page LSN
> looks better.
> 

Thanks. I think you may be right, and the interlock may not be
necessary. I've responded to the linked threads, that's probably easier
to follow as it keeps the context.

>>
>> 6) simplified protocol versioning
> 
> I had tested the cross-version logical replication with older set of
> patches. Didn't see any unexpected behaviour then. I will test again.
>>

I think the question is what's the expected behavior. What behavior did
you expect/observe?

IIRC with the previous version of the patch, if you connected an old
subscriber (without sequence replication), it just ignored/skipped the
sequence increments and replicated the other changes.

The new patch detects that, and triggers ERROR on the publisher. And I
think that's the correct thing to do.

There was a lengthy discussion about making this more flexible (by not
tying this to "linear" protocol version) and/or permissive. I tried
doing that by doing similar thing to decoding of 2PC, which allows
choosing when creating a subscription.

But ultimately that just chooses where to throw an error - whether on
the publisher (in the output plugin callback) or on apply side (when
trying to apply change to non-existent sequence).

I still think it might be useful to have these "capabilities" orthogonal
to the protocol version, but it's a matter for a separate patch. It's
enough not to fail with "unknown message" on the subscriber.

>> The one thing I'm not really sure about is how it interferes with the
>> replication of DDL. But in principle, if it decodes DDL for ALTER
>> SEQUENCE, I don't see why it would be a problem that we then decode and
>> replicate the WAL for the sequence state. But if it is a problem, we
>> should be able to skip this WAL record with the initial sequence state
>> (which I think should be possible thanks to the "created" flag this
>> patch adds to the WAL record).
> 
> I had suggested a solution in [1] to avoid adding a flag to the WAL
> record. Did you consider it? If you considered it and rejected, I
> would be interested in knowing reasons behind rejecting it. Let me
> repeat here again:
> 
> ```
> We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are worried about that, we can add only the
> relfilenodes associated with sequences to the hash table.
> ```
> 

Thanks for reminding me. In principle I'm not against using the proposed
approach - tracking all relfilenodes created by a transaction, although
I don't think the new flag in xl_seq_rec is a problem, and it's probably
cheaper than having to decode all relfilenode creations.

> If the DDL replication takes care of replicating and applying sequence
> changes, I think we don't need the changes tracking "transactional"
> sequence changes in this patch-set. That also makes a case for not
> adding a new field to WAL which may not be used.
> 

Maybe, but the DDL replication patch is not there yet, and I'm not sure
it's a good idea to make this patch wait for a much larger/complex
patch. If the DDL replication patch gets committed, it may ditch this
part (assuming it happens in the same development cycle).

However, my impression was DDL replication would be optional. In which
case we still need to handle the transactional case, to support sequence
replication without DDL replication enabled.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/23/23 15:18, Ashutosh Bapat wrote:
> > ...
> >
> > I reviewed 0001 and related parts of 0004 and 0008 in detail.
> >
> > I have only one major change request, about
> > typedef struct xl_seq_rec
> > {
> > RelFileLocator locator;
> > + bool created; /* creates a new relfilenode (CREATE/ALTER) */
> >
> > I am not sure what are the repercussions of adding a member to an existing WAL
> > record. I didn't see any code which handles the old WAL format which doesn't
> > contain the "created" flag. IIUC, the logical decoding may come across
> > a WAL record written in the old format after upgrade and restart. Is
> > that not possible?
> >
>
> I don't understand why would adding a new field to xl_seq_rec be an
> issue, considering it's done in a new major version. Sure, if you
> generate WAL with old build, and start with a patched version, that
> would break things. But that's true for many other patches, and it's
> irrelevant for releases.

There are two issues
1. the name of the field "created" - what does created mean in a
"sequence status" WAL record? Consider following sequence of events
Begin;
CREATE SEQUENCE s;
select nextval('s') from generate_series(1, 1000);

...
commit

This is going to create 1000/32 WAL records with "created" = true. But
only the first one created the relfilenode. We might fix this little
annoyance by changing the name to "transactional".

2. Consider following scenario
v15 running logical decoding has restart_lsn before a "sequence
change" WAL record written in old format
stop the server
upgrade to v16
logical decoding will start from a restart_lsn pointing to a WAL record
written by v15. When it tries to read a "sequence change" WAL record it
won't be able to get the "created" flag.

Am I missing something here?

>
> > But I don't think it's necessary. We can add a
> > decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> > in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> > as is. Of course we will add non-sequence relfilelocators as well but that
> > should be fine. Creating a new relfilelocator shouldn't be a frequent
> > operation. If at all we are worried about that, we can add only the
> > relfilenodes associated with sequences to the hash table.
> >
>
> Hmmmm, that might work. I feel a bit uneasy about having to keep all
> relfilenodes, not just sequences ...

From the relfilenode it should be easy to get to the rel and then see if it's
a sequence. Only add relfilenodes for sequences.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

>
> >>
> >> 6) simplified protocol versioning
> >
> > I had tested the cross-version logical replication with older set of
> > patches. Didn't see any unexpected behaviour then. I will test again.
> >>
>
> I think the question is what's the expected behavior. What behavior did
> you expect/observe?

Let me run my test again and respond.

>
> IIRC with the previous version of the patch, if you connected an old
> subscriber (without sequence replication), it just ignored/skipped the
> sequence increments and replicated the other changes.

I liked that.

>
> The new patch detects that, and triggers ERROR on the publisher. And I
> think that's the correct thing to do.

With this behaviour users will never be able to set up logical
replication between old and new servers considering almost every setup
has sequences.

>
> There was a lengthy discussion about making this more flexible (by not
> tying this to "linear" protocol version) and/or permissive. I tried
> doing that by doing similar thing to decoding of 2PC, which allows
> choosing when creating a subscription.
>
> But ultimately that just chooses where to throw an error - whether on
> the publisher (in the output plugin callback) or on apply side (when
> trying to apply change to non-existent sequence).

I had some comments on throwing error in [1], esp. towards the end.

>
> I still think it might be useful to have these "capabilities" orthogonal
> to the protocol version, but it's a matter for a separate patch. It's
> enough not to fail with "unknown message" on the subscriber.

Yes, We should avoid breaking replication with "unknown message".

I also agree that improving things in this area can be done in a
separate patch, but as far as possible in this release itself.

> > If the DDL replication takes care of replicating and applying sequence
> > changes, I think we don't need the changes tracking "transactional"
> > sequence changes in this patch-set. That also makes a case for not
> > adding a new field to WAL which may not be used.
> >
>
> Maybe, but the DDL replication patch is not there yet, and I'm not sure
> it's a good idea to make this patch wait for a much larger/complex
> patch. If the DDL replication patch gets committed, it may ditch this
> part (assuming it happens in the same development cycle).
>
> However, my impression was DDL replication would be optional. In which
> case we still need to handle the transactional case, to support sequence
> replication without DDL replication enabled.

As I said before, I don't think this patchset needs to wait for DDL
replication patch. Let's hope that the latter lands in the same release
and straightens out the protocol instead of carrying it forever.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/14/23 09:34, Ashutosh Bapat wrote:
> On Thu, Jul 13, 2023 at 9:47 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>>
>>>>
>>>> 6) simplified protocol versioning
>>>
>>> I had tested the cross-version logical replication with older set of
>>> patches. Didn't see any unexpected behaviour then. I will test again.
>>>>
>>
>> I think the question is what's the expected behavior. What behavior did
>> you expect/observe?
> 
> Let me run my test again and respond.
> 
>>
>> IIRC with the previous version of the patch, if you connected an old
>> subscriber (without sequence replication), it just ignored/skipped the
>> sequence increments and replicated the other changes.
> 
> I liked that.
> 

I liked that too, initially (which is why I did it that way). But I
changed my mind, because it's likely to cause more harm than good.

>>
>> The new patch detects that, and triggers ERROR on the publisher. And I
>> think that's the correct thing to do.
> 
> With this behaviour users will never be able to setup logical
> replication between old and new servers considering almost every setup
> has sequences.
> 

That's not true.

Replication to older versions works fine as long as the publication does
not include sequences (which need to be added explicitly). If you have a
publication with sequences, you clearly want to replicate them, ignoring
it is just confusing "magic".

If you have a publication with sequences and still want to replicate to
an older server, create a new publication without sequences.

>>
>> There was a lengthy discussion about making this more flexible (by not
>> tying this to "linear" protocol version) and/or permissive. I tried
>> doing that by doing similar thing to decoding of 2PC, which allows
>> choosing when creating a subscription.
>>
>> But ultimately that just chooses where to throw an error - whether on
>> the publisher (in the output plugin callback) or on apply side (when
>> trying to apply change to non-existent sequence).
> 
> I had some comments on throwing error in [1], esp. towards the end.
> 

Yes. You said:

    If a given output plugin doesn't implement the callbacks but
    subscription specifies sequences, the code will throw an error
    whether or not publication is publishing sequences.

This refers to situation when the subscriber says "sequences" when
opening the connection. And this happens *in the plugin* which also
defines the callbacks, so I don't see how we could not have the
callbacks defined ...

Furthermore, the simplified protocol versioning does away with the
"sequences" option, so in that case this can't even happen.

>>
>> I still think it might be useful to have these "capabilities" orthogonal
>> to the protocol version, but it's a matter for a separate patch. It's
>> enough not to fail with "unknown message" on the subscriber.
> 
> Yes, We should avoid breaking replication with "unknown message".
> 
> I also agree that improving things in this area can be done in a
> separate patch, but as far as possible in this release itself.
> 
>>> If the DDL replication takes care of replicating and applying sequence
>>> changes, I think we don't need the changes tracking "transactional"
>>> sequence changes in this patch-set. That also makes a case for not
>>> adding a new field to WAL which may not be used.
>>>
>>
>> Maybe, but the DDL replication patch is not there yet, and I'm not sure
>> it's a good idea to make this patch wait for a much larger/complex
>> patch. If the DDL replication patch gets committed, it may ditch this
>> part (assuming it happens in the same development cycle).
>>
>> However, my impression was DDL replication would be optional. In which
>> case we still need to handle the transactional case, to support sequence
>> replication without DDL replication enabled.
> 
> As I said before, I don't think this patchset needs to wait for DDL
> replication patch. Let's hope that the later lands in the same release
> and straightens protocol instead of carrying it forever.
> 

OK, I agree with that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/14/23 08:51, Ashutosh Bapat wrote:
> On Thu, Jul 13, 2023 at 8:29 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 6/23/23 15:18, Ashutosh Bapat wrote:
>>> ...
>>>
>>> I reviewed 0001 and related parts of 0004 and 0008 in detail.
>>>
>>> I have only one major change request, about
>>> typedef struct xl_seq_rec
>>> {
>>> RelFileLocator locator;
>>> + bool created; /* creates a new relfilenode (CREATE/ALTER) */
>>>
>>> I am not sure what are the repercussions of adding a member to an existing WAL
>>> record. I didn't see any code which handles the old WAL format which doesn't
>>> contain the "created" flag. IIUC, the logical decoding may come across
>>> a WAL record written in the old format after upgrade and restart. Is
>>> that not possible?
>>>
>>
>> I don't understand why would adding a new field to xl_seq_rec be an
>> issue, considering it's done in a new major version. Sure, if you
>> generate WAL with old build, and start with a patched version, that
>> would break things. But that's true for many other patches, and it's
>> irrelevant for releases.
> 
> There are two issues
> 1. the name of the field "created" - what does "created" mean in a
> "sequence status" WAL record? Consider the following sequence of events:
> BEGIN;
> CREATE SEQUENCE s;
> SELECT nextval('s') FROM generate_series(1, 1000);
> 
> ...
> COMMIT;
> 
> This is going to create 1000/32 WAL records with "created" = true. But
> only the first one created the relfilenode. We might fix this little
> annoyance by changing the name to "transactional".
> 

I don't think that's true - this will create 1 record with
"created=true" (the one right after the CREATE SEQUENCE) and the rest
will have "created=false".

I realized I haven't modified seq_desc to show this flag, so I did that
in the updated patch version, which makes this easy to see.

And all of them need to be handled in a transactional way, because they
modify relfilenode visible only to that transaction. So calling the flag
"transactional" would be misleading, because the increments can be
transactional even with "created=false".
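To make the disputed point concrete, here is a toy Python model of the WAL record stream for the quoted scenario. SEQ_LOG_VALS really is 32 in PostgreSQL, but everything else here (the record representation, the helper name) is invented for illustration, and the model ignores the prefetch-cache carry-over between records:

```python
SEQ_LOG_VALS = 32  # PostgreSQL WAL-logs sequence state once per 32 nextval() calls

def wal_records_for(nextval_calls):
    """Toy model of the xl_seq_rec stream for CREATE SEQUENCE + N nextval() calls."""
    records = [{"created": True}]  # CREATE SEQUENCE logs the initial state
    for _ in range(0, nextval_calls, SEQ_LOG_VALS):
        records.append({"created": False})  # periodic state records from nextval()
    return records

records = wal_records_for(1000)
print(sum(r["created"] for r in records))  # 1: only the record after CREATE SEQUENCE
```

So roughly 1000/32 records follow, all with created=false, yet all of them still need transactional handling because they touch a relfilenode visible only to this transaction.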


> 2. Consider the following scenario:
> v15 running logical decoding has restart_lsn before a "sequence
> change" WAL record written in the old format
> stop the server
> upgrade to v16
> logical decoding will start from restart_lsn pointing to a WAL record
> written by v15. When it tries to read the "sequence change" WAL record it
> won't be able to get the "created" flag.
> 
> Am I missing something here?
> 

You're missing the fact that pg_upgrade does not copy replication slots,
so the restart_lsn does not matter.

(Yes, this is pretty annoying consequence of using pg_upgrade. And maybe
we'll improve that in the future - but I'm pretty sure we won't allow
decoding old WAL.)

>>
>>> But I don't think it's necessary. We can add a
>>> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
>>> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
>>> as is. Of course we will add non-sequence relfilelocators as well but that
>>> should be fine. Creating a new relfilelocator shouldn't be a frequent
>>> operation. If at all we are worried about that, we can add only the
>>> relfilenodes associated with sequences to the hash table.
>>>
>>
>> Hmmmm, that might work. I feel a bit uneasy about having to keep all
>> relfilenodes, not just sequences ...
> 
> From relfilenode it should be easy to get to rel and then see if it's
> a sequence. Only add relfilenodes for the sequence.
> 

Will try.

Attached is an updated version with pg_waldump printing the "created"
flag in seq_desc, and removing the unnecessary interlock. I've kept the
protocol changes in a separate commit for now.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

>
> >>
> >> The new patch detects that, and triggers ERROR on the publisher. And I
> >> think that's the correct thing to do.
> >
> > With this behaviour users will never be able to setup logical
> > replication between old and new servers considering almost every setup
> > has sequences.
> >
>
> That's not true.
>
> Replication to older versions works fine as long as the publication does
> not include sequences (which need to be added explicitly). If you have a
> publication with sequences, you clearly want to replicate them, ignoring
> it is just confusing "magic".

I was looking at it from a different angle. Publishers publish what
they want, subscribers choose what they want, and what gets replicated
is the intersection of these two sets. Both live happily.

But I am fine with that too. It's just that users need to create more
publications.

>
> If you have a publication with sequences and still want to replicate to
> an older server, create a new publication without sequences.
>

I tested the current patches with the subscriber at PG 14 and the publisher
at master + these patches. I created one table and a sequence on both
publisher and subscriber. I created two publications, one with the
sequence and the other without it. Both have the table in them. When the
subscriber subscribes to the publication with the sequence, the following
ERROR is repeated in the subscriber logs and nothing gets replicated
```
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG:  00000:
logical replication apply worker for subscription "sub5433" has
started
[2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION:
ApplyWorkerMain, worker.c:3169
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR:  08P01:
could not receive data from WAL stream: ERROR:  protocol version does
not support sequence replication
    CONTEXT:  slot "sub5433", output plugin "pgoutput", in the
sequence callback, associated LSN 0/1513718
[2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION:
libpqrcv_receive, libpqwalreceiver.c:818
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG:  00000:
background worker "logical replication worker" (PID 916293) exited
with exit code 1
[2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION:
LogChildExit, postmaster.c:3737
```

When the subscriber subscribes to the publication without sequence,
things work normally.

The cross-version replication is working as expected then.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Fri, Jul 14, 2023 at 4:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I don't think that's true - this will create 1 record with
> "created=true" (the one right after the CREATE SEQUENCE) and the rest
> will have "created=false".

I may have misread the code.

>
> I realized I haven't modified seq_desc to show this flag, so I did that
> in the updated patch version, which makes this easy to see.

Now I see it. Thanks for the clarification.

> >
> > Am I missing something here?
> >
>
> You're missing the fact that pg_upgrade does not copy replication slots,
> so the restart_lsn does not matter.
>
> (Yes, this is pretty annoying consequence of using pg_upgrade. And maybe
> we'll improve that in the future - but I'm pretty sure we won't allow
> decoding old WAL.)

Ah, I see. Thanks for correcting me.

> >>>
> >>
> >> Hmmmm, that might work. I feel a bit uneasy about having to keep all
> >> relfilenodes, not just sequences ...
> >
> > From relfilenode it should be easy to get to rel and then see if it's
> > a sequence. Only add relfilenodes for the sequence.
> >
>
> Will try.
>

Actually, adding all relfilenodes to the hash may not be that bad. There
shouldn't be many of those. So the extra step to look up the reltype may
not be necessary. What's your reason for uneasiness? But yeah, there's
a way to avoid that as well.

Should I wait for this before the second round of review?

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/14/23 15:50, Ashutosh Bapat wrote:
> On Fri, Jul 14, 2023 at 3:59 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>>
>>>>
>>>> The new patch detects that, and triggers ERROR on the publisher. And I
>>>> think that's the correct thing to do.
>>>
>>> With this behaviour users will never be able to setup logical
>>> replication between old and new servers considering almost every setup
>>> has sequences.
>>>
>>
>> That's not true.
>>
>> Replication to older versions works fine as long as the publication does
>> not include sequences (which need to be added explicitly). If you have a
>> publication with sequences, you clearly want to replicate them, ignoring
>> it is just confusing "magic".
> 
> I was looking at it from a different angle. Publishers publish what
> they want, subscribers choose what they want and what gets replicated
> is intersection of these two sets. Both live happily.
> 
> But I am fine with that too. It's just that users need to create more
> publications.
> 

I think you might make essentially the same argument about replicating
just some of the tables in the publication. That is, the publication has
tables t1 and t2, but subscriber only has t1. That will fail too, we
don't allow the subscriber to ignore changes for t2.

I think it'd be rather weird (and confusing) to do this differently for
different types of replicated objects.

>>
>> If you have a publication with sequences and still want to replicate to
>> an older server, create a new publication without sequences.
>>
> 
> I tested the current patches with subscriber at PG 14 and publisher at
> master + these patches. I created one table and a sequence on both
> publisher and subscriber. I created two publications, one with
> sequence and other without it. Both have the table in it. When the
> subscriber subscribes to the publication with sequence, following
> ERROR is repeated in the subscriber logs and nothing gets replicated
> ```
> [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOG:  00000:
> logical replication apply worker for subscription "sub5433" has
> started
> [2023-07-14 18:55:41.307 IST] [916293] [] [] [3/30:0] LOCATION:
> ApplyWorkerMain, worker.c:3169
> [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] ERROR:  08P01:
> could not receive data from WAL stream: ERROR:  protocol version does
> not support sequence replication
>     CONTEXT:  slot "sub5433", output plugin "pgoutput", in the
> sequence callback, associated LSN 0/1513718
> [2023-07-14 18:55:41.322 IST] [916293] [] [] [3/0:0] LOCATION:
> libpqrcv_receive, libpqwalreceiver.c:818
> [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOG:  00000:
> background worker "logical replication worker" (PID 916293) exited
> with exit code 1
> [2023-07-14 18:55:41.325 IST] [916213] [] [] [:0] LOCATION:
> LogChildExit, postmaster.c:3737
> ```
> 
> When the subscriber subscribes to the publication without sequence,
> things work normally.
> 
> The cross-version replication is working as expected then.
> 

Thanks for testing / confirming this! So, do we agree this behavior is
reasonable?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/14/23 16:02, Ashutosh Bapat wrote:
> ...
>>>>>
>>>>
>>>> Hmmmm, that might work. I feel a bit uneasy about having to keep all
>>>> relfilenodes, not just sequences ...
>>>
>>> From relfilenode it should be easy to get to rel and then see if it's
>>> a sequence. Only add relfilenodes for the sequence.
>>>
>>
>> Will try.
>>
> 
> Actually, adding all relfilenodes to hash may not be that bad. There
> shouldn't be many of those. So the extra step to lookup reltype may
> not be necessary. What's your reason for uneasiness? But yeah, there's
> a way to avoid that as well.
> 
> Should I wait for this before the second round of review?
> 

I don't think you have to wait - just ignore the part that changes the
WAL record, which is a pretty tiny bit of the patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
Here's a slightly improved version of the patch, fixing two minor issues
reported by cfbot:

- a compiler warning about fetch_sequence_data maybe not initializing a
variable (a false positive, but we silence the warning anyway)

- a missing "id" for an element in the SGML docs



regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

>
> Thanks for testing / confirming this! So, do we agree this behavior is
> reasonable?
>

This behaviour doesn't need any on-disk changes, and there's nothing in it
which prohibits us from changing it in the future. So I think it's good as
a v0. If required, we can add a protocol option to provide more
flexible behaviour.

One thing I am worried about is that the subscriber will get an error
only when a sequence change is decoded. All the prior changes will be
replicated and applied on the subscriber. Thus by the time the user
realises this mistake, they may have replicated data. At this point if
they want to subscribe to a publication without sequences they will
need to clean the already replicated data. But they may not be in a
position to know which is which esp when the subscriber has its own
data in those tables. Example,

publisher: create publication pub with sequences and tables
subscriber: subscribe to pub
publisher: modify data in tables and sequences
subscriber: replicates some data and errors out
publisher: delete some data from tables
publisher: create a publication pub_tab without sequences
subscriber: subscribe to pub_tab
subscriber: replicates the data but rows which were deleted on
publisher remain on the subscriber

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/18/23 15:52, Ashutosh Bapat wrote:
> On Fri, Jul 14, 2023 at 7:33 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>>
>> Thanks for testing / confirming this! So, do we agree this behavior is
>> reasonable?
>>
> 
> This behaviour doesn't need any on-disk changes, and there's nothing in it
> which prohibits us from changing it in the future. So I think it's good as
> a v0. If required, we can add a protocol option to provide more
> flexible behaviour.
> 

True, although "no on-disk changes" does not exactly mean we can just
change it at will. Essentially, once it gets released, the behavior is
somewhat fixed for the next ~5 years, until that release gets EOL. And
likely longer, because more features are likely to do the same thing.

That's essentially why the patch was reverted from PG16 - I was worried
the elaborate protocol versioning/negotiation was not the right thing.

> One thing I am worried about is that the subscriber will get an error
> only when a sequence change is decoded. All the prior changes will be
> replicated and applied on the subscriber. Thus by the time the user
> realises this mistake, they may have replicated data. At this point if
> they want to subscribe to a publication without sequences they will
> need to clean the already replicated data. But they may not be in a
> position to know which is which esp when the subscriber has its own
> data in those tables. Example,
> 
> publisher: create publication pub with sequences and tables
> subscriber: subscribe to pub
> publisher: modify data in tables and sequences
> subscriber: replicates some data and errors out
> publisher: delete some data from tables
> publisher: create a publication pub_tab without sequences
> subscriber: subscribe to pub_tab
> subscriber: replicates the data but rows which were deleted on
> publisher remain on the subscriber
> 

Sure, but I'd argue that's correct. If the replication stream has
something the subscriber can't apply, what else would you do? We had
exactly the same thing with TRUNCATE, for example (except that it failed
with "unknown message" on the subscriber).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> >>
> >
> > This behaviour doesn't need any on-disk changes, and there's nothing in it
> > which prohibits us from changing it in the future. So I think it's good as
> > a v0. If required, we can add a protocol option to provide more
> > flexible behaviour.
> >
>
> True, although "no on-disk changes" does not exactly mean we can just
> change it at will. Essentially, once it gets released, the behavior is
> somewhat fixed for the next ~5 years, until that release gets EOL. And
> likely longer, because more features are likely to do the same thing.
>
> That's essentially why the patch was reverted from PG16 - I was worried
> the elaborate protocol versioning/negotiation was not the right thing.

I agree that an elaborate protocol would pose roadblocks in the future. It's
better not to add that burden right now, esp. when the usage is not clear.

Here's the behaviour and extension matrix as I understand it, as of
the last set of patches.

Publisher PG 17, Subscriber PG 17 - changes to sequences are
replicated, downstream is capable of applying them

Publisher PG 16-, Subscriber PG 17 - changes to sequences are never replicated

Publisher PG 18+, Subscriber PG 17 - same as the 17/17 case. Any changes
in PG 18+ need to make sure that a PG 17 subscriber receives sequence
changes irrespective of changes in the protocol. That may pose some
maintenance burden but doesn't seem to be any harder than the usual
backward compatibility burden.

Moreover users can control whether changes to sequences get replicated
or not by controlling the objects contained in publication.

I don't see any downside to this. Looks all good. Please correct me if wrong.
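The matrix above can be sketched as a toy decision function. This is not PostgreSQL code; the helper name and the version cutoffs as function arguments are invented here, and the error string simply mirrors the ERROR observed in the earlier cross-version test:

```python
def sequences_replicated(pub_major, sub_major, pub_has_sequences):
    """Toy model of the cross-version behaviour matrix (invented helper)."""
    if not pub_has_sequences:
        return False  # publication without sequences: nothing to decode
    if pub_major < 17:
        return False  # an old publisher never decodes sequence changes
    if sub_major < 17:
        # matches the error seen when a PG 14 subscriber subscribed to a
        # publication with sequences
        raise RuntimeError("protocol version does not support sequence replication")
    return True
```

So the only failing combination is a sequence-publishing new publisher with an old subscriber, which users can avoid by creating a publication without sequences.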

>
> > One thing I am worried about is that the subscriber will get an error
> > only when a sequence change is decoded. All the prior changes will be
> > replicated and applied on the subscriber. Thus by the time the user
> > realises this mistake, they may have replicated data. At this point if
> > they want to subscribe to a publication without sequences they will
> > need to clean the already replicated data. But they may not be in a
> > position to know which is which esp when the subscriber has its own
> > data in those tables. Example,
> >
> > publisher: create publication pub with sequences and tables
> > subscriber: subscribe to pub
> > publisher: modify data in tables and sequences
> > subscriber: replicates some data and errors out
> > publisher: delete some data from tables
> > publisher: create a publication pub_tab without sequences
> > subscriber: subscribe to pub_tab
> > subscriber: replicates the data but rows which were deleted on
> > publisher remain on the subscriber
> >
>
> Sure, but I'd argue that's correct. If the replication stream has
> something the subscriber can't apply, what else would you do? We had
> exactly the same thing with TRUNCATE, for example (except that it failed
> with "unknown message" on the subscriber).

When the replication starts, the publisher knows what publication is
being used, and it also knows what protocol is being used. From the
publication it knows what objects will be replicated. So we could fail
before any changes are replicated, when executing the START_REPLICATION
command. According to [1], if an object is added to or removed from a
publication the subscriber is required to REFRESH SUBSCRIPTION, in
which case a fresh START_REPLICATION command will be sent. So we
should fail the START_REPLICATION command before sending any change
rather than when a change is being replicated. That's more
deterministic and easier to handle. Of course any changes that were sent
before ALTER PUBLICATION cannot be reverted, but that's expected.

Coming back to TRUNCATE, I don't think it's possible to know whether a
publication will send a truncate downstream or not. So we can't throw
an error before a TRUNCATE change is decoded.

Anyway, I think this behaviour should be documented. I didn't see this
mentioned in PUBLICATION or SUBSCRIPTION documentation.

[1] https://www.postgresql.org/docs/current/sql-alterpublication.html

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/19/23 07:42, Ashutosh Bapat wrote:
> On Wed, Jul 19, 2023 at 1:20 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>
>>> This behaviour doesn't need any on-disk changes, and there's nothing in it
>>> which prohibits us from changing it in the future. So I think it's good as
>>> a v0. If required, we can add a protocol option to provide more
>>> flexible behaviour.
>>>
>>
>> True, although "no on-disk changes" does not exactly mean we can just
>> change it at will. Essentially, once it gets released, the behavior is
>> somewhat fixed for the next ~5 years, until that release gets EOL. And
>> likely longer, because more features are likely to do the same thing.
>>
>> That's essentially why the patch was reverted from PG16 - I was worried
>> the elaborate protocol versioning/negotiation was not the right thing.
> 
> I agree that elaborate protocol would pose roadblocks in future. It's
> better not to add that burden right now, esp. when usage is not clear.
> 
> Here's the behaviour and extension matrix as I understand it, as of
> the last set of patches.
> 
> Publisher PG 17, Subscriber PG 17 - changes to sequences are
> replicated, downstream is capable of applying them
> 
> Publisher PG 16-, Subscriber PG 17 - changes to sequences are never replicated
> 
> Publisher PG 18+, Subscriber PG 17 - same as 17, 17 case. Any changes
> in PG 18+ need to make sure that PG 17 subscriber receives sequence
> changes irrespective of changes in protocol. That may pose some
> maintenance burden but doesn't seem to be any harder than usual
> backward compatibility burden.
> 
> Moreover users can control whether changes to sequences get replicated
> or not by controlling the objects contained in publication.
> 
> I don't see any downside to this. Looks all good. Please correct me if wrong.
> 

I think this is an accurate description of what the current patch does.
And I think it's a reasonable behavior.

My point is that if this gets released in PG17, it'll be difficult to
change, even if it does not change on-disk format.

>>
>>> One thing I am worried about is that the subscriber will get an error
>>> only when a sequence change is decoded. All the prior changes will be
>>> replicated and applied on the subscriber. Thus by the time the user
>>> realises this mistake, they may have replicated data. At this point if
>>> they want to subscribe to a publication without sequences they will
>>> need to clean the already replicated data. But they may not be in a
>>> position to know which is which esp when the subscriber has its own
>>> data in those tables. Example,
>>>
>>> publisher: create publication pub with sequences and tables
>>> subscriber: subscribe to pub
>>> publisher: modify data in tables and sequences
>>> subscriber: replicates some data and errors out
>>> publisher: delete some data from tables
>>> publisher: create a publication pub_tab without sequences
>>> subscriber: subscribe to pub_tab
>>> subscriber: replicates the data but rows which were deleted on
>>> publisher remain on the subscriber
>>>
>>
>> Sure, but I'd argue that's correct. If the replication stream has
>> something the subscriber can't apply, what else would you do? We had
>> exactly the same thing with TRUNCATE, for example (except that it failed
>> with "unknown message" on the subscriber).
> 
> When the replication starts, the publisher knows what publication is
> being used, it also knows what protocol is being used. From
> publication it knows what objects will be replicated. So we could fail
> before any changes are replicated when executing START_REPLICATION
> command. According to [1], if an object is added or removed from
> publication the subscriber is required to REFRESH SUBSCRIPTION in
> which case there will be fresh START_REPLICATION command sent. So we
> should fail the START_REPLICATION command before sending any change
> rather than when a change is being replicated. That's more
> deterministic and easy to handle. Of course any changes that were sent
> before ALTER PUBLICATION can not be reverted, but that's expected.
> 
> Coming back to TRUNCATE, I don't think it's possible to know whether a
> publication will send a truncate downstream or not. So we can't throw
> an error before TRUNCATE change is decoded.
> 
> Anyway, I think this behaviour should be documented. I didn't see this
> mentioned in PUBLICATION or SUBSCRIPTION documentation.
> 

I need to think about this behavior a bit more, and maybe check how
difficult implementing it would be.

I did however look at the proposed alternative to the "created" flag.
The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
The smgr_decode code needs a review (I'm not sure the
skipping/fast-forwarding part is correct), but it seems to be working
fine overall, although we need to ensure the WAL record has the correct XID.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/19/23 12:53, Tomas Vondra wrote:
> ...
> 
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
> The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
> 

cfbot reported two issues in the patch - compilation warning, due to
unused variable in sequence_decode, and a failing test in test_decoding.

The second issue happens because the relfilenode may be created before
we know the XID. The patch already ensures the WAL record with the
sequence data has an XID, but that happens later. And when the CREATE
record did not have the correct XID, that broke the logic deciding which
increments should be "transactional".

This forces us to assign the XID a bit earlier (it'd happen anyway, when
logging the increment). There's a bit of a drawback, because we don't
have the relation yet, so we can't do RelationNeedsWAL ...
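The failure mode can be illustrated with a toy decoder. All the names below are invented for illustration (the real logic lives in C in smgr_decode/sequence_decode); the point is only that a CREATE record logged without an XID cannot be associated with its transaction, so later changes to that relfilenode get misclassified:

```python
def classify(records):
    """Toy decoder. records: tuples of (kind, xid, relfilenode), where kind is
    'create' (XLOG_SMGR_CREATE) or 'seq_change'. Returns, for each sequence
    change, whether it was classified as transactional."""
    created = {}  # xid -> set of relfilenodes created by that transaction
    result = []
    for kind, xid, rloc in records:
        if kind == "create":
            if xid is not None:  # a CREATE record without an XID can't be tracked
                created.setdefault(xid, set()).add(rloc)
        elif kind == "seq_change":
            result.append((rloc, rloc in created.get(xid, set())))
    return result

# XID assigned before logging the CREATE record: correctly transactional
print(classify([("create", 1, "A"), ("seq_change", 1, "A")]))    # [('A', True)]
# CREATE record logged before the XID was assigned: wrongly non-transactional
print(classify([("create", None, "A"), ("seq_change", 1, "A")])) # [('A', False)]
```

Hence the need to call GetCurrentTransactionId() before the relfilenode creation is WAL-logged.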


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
Thanks Tomas for the updated patches.

Here are my comments on 0006 patch as well as 0002 patch.

On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I think this is an accurate description of what the current patch does.
> And I think it's a reasonable behavior.
>
> My point is that if this gets released in PG17, it'll be difficult to
> change, even if it does not change on-disk format.
>

Yes. I agree. And I don't see any problem even if we are not able to change it.

>
> I need to think about this behavior a bit more, and maybe check how
> difficult implementing it would be.

Ok.

In most of the comments and in documentation, there are some phrases
which do not look accurate.

A change to a sequence is being referred to as a "sequence increment". While
ascending sequences are common, PostgreSQL supports descending sequences as
well; the changes there will be decrements. But that's not the only case: a
sequence may be restarted with an older value, in which case the change could
be an increment or a decrement. I think the correct usage is 'changes to a
sequence' or 'sequence changes'.

A sequence being assigned a new relfilenode is referred to as the sequence
being created. This is confusing. When an existing sequence is ALTERed, we
will not "create" a new sequence, but we will "create" a new relfilenode and
"assign" it to that sequence.

PFA such edits in the 0002 and 0006 patches. Let me know if those look
correct. I think we need similar changes to the documentation and comments
in other places.

>
> I did however look at the proposed alternative to the "created" flag.
> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
> The smgr_decode code needs a review (I'm not sure the
> skipping/fast-forwarding part is correct), but it seems to be working
> fine overall, although we need to ensure the WAL record has the correct XID.
>

Briefly describing the patch: when decoding an XLOG_SMGR_CREATE WAL
record, it adds the relfilenode mentioned in it to the sequences hash.
When decoding a sequence change record, it checks whether the
relfilenode in the WAL record is in the hash table. If it is, the
sequence change is deemed transactional, otherwise non-transactional.
The change looks good to me. It simplifies the logic to decide whether
a sequence change is transactional or not.
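In Python pseudocode, the tracker described above looks roughly like this. This is a toy model, not the C implementation: the class and method names are invented, and the optional sequence-only filter reflects the suggestion upthread about not storing table relfilenodes:

```python
class TxnSequenceHash:
    """Toy per-transaction tracker: XLOG_SMGR_CREATE adds relfilenodes to a
    hash; a sequence change is transactional iff its relfilenode is there."""

    def __init__(self):
        self.relfilenodes = set()

    def on_smgr_create(self, rlocator, is_sequence):
        # optionally add only sequence relfilenodes, as suggested upthread
        if is_sequence:
            self.relfilenodes.add(rlocator)

    def change_is_transactional(self, rlocator):
        return rlocator in self.relfilenodes

txn = TxnSequenceHash()
txn.on_smgr_create("seq_1", is_sequence=True)
txn.on_smgr_create("table_1", is_sequence=False)  # filtered out
print(txn.change_is_transactional("seq_1"))  # True: relfilenode created in this txn
print(txn.change_is_transactional("seq_2"))  # False: pre-existing sequence
```

This replaces the per-record "created" flag: the decoder derives the same information from the XLOG_SMGR_CREATE records it has already seen in the transaction.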

In sequence_decode() we skip sequence changes when fast-forwarding.
Given that smgr_decode() exists only to supplement sequence_decode(), I
think it's correct to do the same in smgr_decode() as well. Similarly
for skipping when we don't have a full snapshot.

Some minor comments on 0006 patch

+    /* make sure the relfilenode creation is associated with the XID */
+    if (XLogLogicalInfoActive())
+        GetCurrentTransactionId();

I think this change is correct and is in line with similar changes in 0002. But
I looked at other places from where DefineRelation() is called. For regular
tables it is called from ProcessUtilitySlow() which in turn does not call
GetCurrentTransactionId(). I am wondering whether we are just discovering a
class of bugs caused by not associating an xid with a newly created
relfilenode.

+    /*
+     * If we don't have snapshot or we are just fast-forwarding, there is no
+     * point in decoding changes.
+     */
+    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+        ctx->fast_forward)
+        return;

This code block is repeated.

+void
+ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
+                               RelFileLocator rlocator)
+{
    ... snip ...
+
+    /* sequence changes require a transaction */
+    if (xid == InvalidTransactionId)
+        return;

IIUC, with your changes in DefineSequence() in this patch, this should not
happen. So this condition will never be true. But in case it happens, this code
will not add the relfilelocator to the hash table, and we will deem the
sequence change non-transactional. Isn't it better to just throw an error
and stop replication if that (ever) happens?

Also some comments on 0002 patch

@@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)

     /* check the comment above nextval_internal()'s equivalent call. */
     if (RelationNeedsWAL(rel))
+    {
         GetTopTransactionId();

+        /*
+         * Make sure the subtransaction has a XID assigned, so that the sequence
+         * increment WAL record is properly associated with it. This matters for
+         * increments of sequences created/altered in the transaction, which are
+         * handled as transactional.
+         */
+        if (XLogLogicalInfoActive())
+            GetCurrentTransactionId();
+    }
+

I think we should separately commit the changes which add a call to
GetCurrentTransactionId(). That looks like an existing bug/anomaly
which can stay irrespective of this patch.

+    /*
+     * To support logical decoding of sequences, we require the sequence
+     * callback. We decide it here, but only check it later in the wrappers.
+     *
+     * XXX Isn't it wrong to define only one of those callbacks? Say we
+     * only define the stream_sequence_cb() - that may get strange results
+     * depending on what gets streamed. Either none or both?
+     *
+     * XXX Shouldn't sequence be defined at slot creation time, similar
+     * to two_phase? Probably not.
+     */

Do you intend to keep these XXX's as is? My previous comments on this comment
block are in [1].

In fact, given that whether or not sequences are replicated is decided by the
protocol version, do we really need LogicalDecodingContext::sequences? Drawing
parallel with WAL messages, I don't think it's needed.

[1] https://www.postgresql.org/message-id/CAExHW5vScYKKb0RZoiNEPfbaQ60hihfuWeLuZF4JKrwPJXPcUw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/20/23 09:24, Ashutosh Bapat wrote:
> Thanks Tomas for the updated patches.
> 
> Here are my comments on 0006 patch as well as 0002 patch.
> 
> On Wed, Jul 19, 2023 at 4:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I think this is an accurate description of what the current patch does.
>> And I think it's a reasonable behavior.
>>
>> My point is that if this gets released in PG17, it'll be difficult to
>> change, even if it does not change on-disk format.
>>
> 
> Yes. I agree. And I don't see any problem even if we are not able to change it.
> 
>>
>> I need to think about this behavior a bit more, and maybe check how
>> difficult would be implementing it.
> 
> Ok.
> 
> In most of the comments and in documentation, there are some phrases
> which do not look accurate.
> 
> Change to a sequence is being refered to as "sequence increment". While
> ascending sequences are common, PostgreSQL supports descending sequences as
> well. The changes there will be decrements. But that's not the only case. A
> sequence may be restarted with an older value, in which case the change could
> increment or a decrement. I think correct usage is 'changes to sequence' or
> 'sequence changes'.
> 
> Sequence being assigned a new relfilenode is referred to as sequence
> being created. This is confusing. When an existing sequence is ALTERed, we
> will not "create" a new sequence but we will "create" a new relfilenode and
> "assign" it to that sequence.
> 
> PFA such edits in 0002 and 0006 patches. Let me know if those look
> correct. I think we
> need similar changes to the documentation and comments in other places.
> 

OK, I merged the changes into the patches, with some minor changes to
the wording etc.

>>
>> I did however look at the proposed alternative to the "created" flag.
>> The attached 0006 part ditches the flag with XLOG_SMGR_CREATE decoding.
>> The smgr_decode code needs a review (I'm not sure the
>> skipping/fast-forwarding part is correct), but it seems to be working
>> fine overall, although we need to ensure the WAL record has the correct XID.
>>
> 
> Briefly describing the patch: when decoding an XLOG_SMGR_CREATE WAL
> record, it adds the relfilenode mentioned in it to the sequences hash.
> When decoding a sequence change record, it checks whether the
> relfilenode in the WAL record is in the hash table. If it is, the sequence
> change is deemed transactional, otherwise non-transactional. The
> change looks good to me. It simplifies the logic to decide whether a
> sequence change is transactional or not.
> 

Right.

> In sequence_decode() we skip sequence changes when fast forwarding.
> Given that smgr_decode() is only to supplement sequence_decode(), I
> think it's correct to do the same in smgr_decode() as well. Similarly for
> skipping when we don't have a full snapshot.
> 

I don't follow, smgr_decode already checks ctx->fast_forward.

> Some minor comments on 0006 patch
> 
> +    /* make sure the relfilenode creation is associated with the XID */
> +    if (XLogLogicalInfoActive())
> +        GetCurrentTransactionId();
> 
> I think this change is correct and is inline with similar changes in 0002. But
> I looked at other places from where DefineRelation() is called. For regular
> tables it is called from ProcessUtilitySlow() which in turn does not call
> GetCurrentTransactionId(). I am wondering whether we are just discovering a
> class of bugs caused by not associating an xid with a newly created
> relfilenode.
> 

Not sure. Why would it be a bug?

> +    /*
> +     * If we don't have snapshot or we are just fast-forwarding, there is no
> +     * point in decoding changes.
> +     */
> +    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
> +        ctx->fast_forward)
> +        return;
> 
> This code block is repeated.
> 

Fixed.

> +void
> +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
> +                               RelFileLocator rlocator)
> +{
>     ... snip ...
> +
> +    /* sequence changes require a transaction */
> +    if (xid == InvalidTransactionId)
> +        return;
> 
> IIUC, with your changes in DefineSequence() in this patch, this should not
> happen. So this condition will never be true. But in case it happens, this code
> will not add the relfilelocator to the hash table and we will deem the
> sequence change as non-transactional. Isn't it better to just throw an error
> and stop replication if that (ever) happens?
> 

It can't happen for a sequence, but it may happen when creating a
non-sequence relfilenode. In effect, it's a way to skip (some)
unnecessary relfilenodes.

> Also some comments on 0002 patch
> 
> @@ -405,8 +405,19 @@ fill_seq_fork_with_data(Relation rel, HeapTuple
> tuple, ForkNumber forkNum)
> 
>      /* check the comment above nextval_internal()'s equivalent call. */
>      if (RelationNeedsWAL(rel))
> +    {
>          GetTopTransactionId();
> 
> +        /*
> +         * Make sure the subtransaction has a XID assigned, so that
> the sequence
> +         * increment WAL record is properly associated with it. This
> matters for
> +         * increments of sequences created/altered in the
> transaction, which are
> +         * handled as transactional.
> +         */
> +        if (XLogLogicalInfoActive())
> +            GetCurrentTransactionId();
> +    }
> +
> 
> I think we should separately commit the changes which add a call to
> GetCurrentTransactionId(). That looks like an existing bug/anomaly
> which can stay irrespective of this patch.
> 

Not sure, but I don't see this as a bug.

> +    /*
> +     * To support logical decoding of sequences, we require the sequence
> +     * callback. We decide it here, but only check it later in the wrappers.
> +     *
> +     * XXX Isn't it wrong to define only one of those callbacks? Say we
> +     * only define the stream_sequence_cb() - that may get strange results
> +     * depending on what gets streamed. Either none or both?
> +     *
> +     * XXX Shouldn't sequence be defined at slot creation time, similar
> +     * to two_phase? Probably not.
> +     */
> 
> Do you intend to keep these XXX's as is? My previous comments on this comment
> block are in [1].
> 
> In fact, given that whether or not sequences are replicated is decided by the
> protocol version, do we really need LogicalDecodingContext::sequences? Drawing
> parallel with WAL messages, I don't think it's needed.
> 

Right. We do that for two_phase because you can override that when
creating the subscription - sequences allowed that too initially, but
then we ditched that. So I don't think we need this.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
FWIW there's two questions related to the switch to XLOG_SMGR_CREATE.

1) Does smgr_decode() need to do the same block as sequence_decode()?

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I don't think it does. Also, we don't have any transactional flag here.
Or rather, everything is transactional ...


2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
I was thinking maybe we should have it in the transaction (because we
need to do cleanup at the end). It seems a bit inconvenient, because then
we'd need to either search htabs in all subxacts, or transfer the
entries to the top-level xact (otoh, we already do that with snapshots),
and cleanup on abort.

What do you think?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.
>

I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
even without the rest of the patches. Isn't it a separate improvement?

I see that origin filtering (origin=none) doesn't work with this
patch. You can see this by using the following statements:
Node-1:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create publication mypub for all sequences;
CREATE PUBLICATION

Node-2:
postgres=# create sequence s;
CREATE SEQUENCE
postgres=# create subscription mysub_sub connection '....' publication
mypub with (origin=none);
NOTICE:  created replication slot "mysub_sub" on publisher
CREATE SUBSCRIPTION
postgres=# create publication mypub_sub for all sequences;
CREATE PUBLICATION

Node-1:
create subscription mysub_pub connection '...' publication mypub_sub
with (origin=none);
NOTICE:  created replication slot "mysub_pub" on publisher
CREATE SUBSCRIPTION

SELECT nextval('s') FROM generate_series(1,100);

After that, you can check on the subscriber that sequences values are
overridden with older values:
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        100 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         67 |       0 | t
(1 row)

I haven't verified all the details but I think that is because we
don't set XLOG_INCLUDE_ORIGIN while logging sequence values.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Bharath Rupireddy
Date:
On Mon, Jul 24, 2023 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > OK, I merged the changes into the patches, with some minor changes to
> > the wording etc.
> >
>
> I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
> even without the rest of the patches. Isn't it a separate improvement?

+1. Yes, it can go separately. It would be even better if the test could
be modified to capture the toasted data into a psql variable before
inserting it into the table, and compare it with the output of
pg_logical_slot_get_changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> be reviewed in this response.
>
> I followed the discussion starting [1] till [2]. The second one
> mentions the interlock mechanism which has been implemented in 0005
> and 0006. While I don't have an objection to allowing LOCKing a
> sequence using the LOCK command, I am not sure whether it will
> actually work or is even needed.
>
> The problem described in [1] seems to be the same as the problem
> described in [2]. In both cases we see the sequence moving backwards
> during CATCHUP. At the end of catchup the sequence is in the right
> state in both the cases.
>

I think we could see backward sequence value even after the catchup
phase (after the sync worker is exited and or the state of rel is
marked as 'ready' in pg_subscription_rel). The point is that there is
no guarantee that we will process all the pending WAL before
considering the sequence state is 'SYNCDONE' and or 'READY'. For
example, after copy_sequence, I see values like:

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     166
(1 row)
postgres=# select nextval('s');
 nextval
---------
     167
(1 row)
postgres=# select currval('s');
 currval
---------
     167
(1 row)

Then during the catchup phase:
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         33 |       0 | t
(1 row)
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | s          | 0/1637700
(2 rows)

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16394 |   16390 | r          | 0/16374E8
   16394 |   16393 | r          | 0/1637700
(2 rows)

Here the sequence's relid is 16393. You can see the sequence state is marked as ready.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)

Even after that, as shown below, the value of the sequence is still not
caught up. Later, when the apply worker processes all the WAL, the
sequence state will catch up.

postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        165 |       0 | t
(1 row)

So, there will be a window where the sequence won't be caught up for a
certain period of time and any usage of it (even after the sync is
finished) during that time could result in inconsistent behaviour.

The other question is whether it is okay to allow the sequence to go
backwards even during the initial sync phase? The reason I am asking
this question is that for the time sequence value moves backwards, one
is allowed to use it on the subscriber which will result in using
out-of-sequence values. For example, immediately, after copy_sequence
the values look like this:
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
        133 |      32 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
     134
(1 row)
postgres=# select currval('s');
 currval
---------
     134
(1 row)

But then during the sync phase, it can go backwards and one is allowed
to use it on the subscriber:
postgres=# select * from s;
 last_value | log_cnt | is_called
------------+---------+-----------
         66 |       0 | t
(1 row)
postgres=# select nextval('s');
 nextval
---------
      67
(1 row)

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/24/23 12:40, Amit Kapila wrote:
> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
>> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
>> be reviewed in this response.
>>
>> I followed the discussion starting [1] till [2]. The second one
>> mentions the interlock mechanism which has been implemented in 0005
>> and 0006. While I don't have an objection to allowing LOCKing a
>> sequence using the LOCK command, I am not sure whether it will
>> actually work or is even needed.
>>
>> The problem described in [1] seems to be the same as the problem
>> described in [2]. In both cases we see the sequence moving backwards
>> during CATCHUP. At the end of catchup the sequence is in the right
>> state in both the cases.
>>
> 
> I think we could see backward sequence value even after the catchup
> phase (after the sync worker is exited and or the state of rel is
> marked as 'ready' in pg_subscription_rel). The point is that there is
> no guarantee that we will process all the pending WAL before
> considering the sequence state is 'SYNCDONE' and or 'READY'. For
> example, after copy_sequence, I see values like:
> 
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      166
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      167
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      167
> (1 row)
> 
> Then during the catchup phase:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> 
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | s          | 0/1637700
> (2 rows)
> 
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | r          | 0/1637700
> (2 rows)
> 
> Here the sequence's relid is 16393. You can see the sequence state is marked as ready.
> 
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> 
> Even after that, as shown below, the value of the sequence is still not
> caught up. Later, when the apply worker processes all the WAL, the
> sequence state will catch up.
> 
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> 
> So, there will be a window where the sequence won't be caught up for a
> certain period of time and any usage of it (even after the sync is
> finished) during that time could result in inconsistent behaviour.
> 

I'm rather confused about which node these queries are executed on.
Presumably some of it is on publisher, some on subscriber?

Can you create a reproducer (a TAP test demonstrating this)? I guess it
might require adding some sleeps to hit the right timing ...

> The other question is whether it is okay to allow the sequence to go
> backwards even during the initial sync phase? The reason I am asking
> this question is that for the time sequence value moves backwards, one
> is allowed to use it on the subscriber which will result in using
> out-of-sequence values. For example, immediately, after copy_sequence
> the values look like this:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |      32 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      134
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      134
> (1 row)
> 
> But then during the sync phase, it can go backwards and one is allowed
> to use it on the subscriber:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>       67
> (1 row)
> 

Well, as for going back during the sync phase, I think the agreement was
that's acceptable, as we don't make guarantees about that. The question
is what's the state at the end of the sync (which I think leads to the
first part of your message).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/24/23 08:31, Amit Kapila wrote:
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> OK, I merged the changes into the patches, with some minor changes to
>> the wording etc.
>>
> 
> I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
> even without the rest of the patches. Isn't it a separate improvement?
> 

True.

> I see that origin filtering (origin=none) doesn't work with this
> patch. You can see this by using the following statements:
> Node-1:
> postgres=# create sequence s;
> CREATE SEQUENCE
> postgres=# create publication mypub for all sequences;
> CREATE PUBLICATION
> 
> Node-2:
> postgres=# create sequence s;
> CREATE SEQUENCE
> postgres=# create subscription mysub_sub connection '....' publication
> mypub with (origin=none);
> NOTICE:  created replication slot "mysub_sub" on publisher
> CREATE SUBSCRIPTION
> postgres=# create publication mypub_sub for all sequences;
> CREATE PUBLICATION
> 
> Node-1:
> create subscription mysub_pub connection '...' publication mypub_sub
> with (origin=none);
> NOTICE:  created replication slot "mysub_pub" on publisher
> CREATE SUBSCRIPTION
> 
> SELECT nextval('s') FROM generate_series(1,100);
> 
> After that, you can check on the subscriber that sequences values are
> overridden with older values:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          67 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         100 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          67 |       0 | t
> (1 row)
> 
> I haven't verified all the details but I think that is because we
> don't set XLOG_INCLUDE_ORIGIN while logging sequence values.
> 

Hmmm, yeah. I guess we'll need to set XLOG_INCLUDE_ORIGIN with
wal_level=logical.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Alvaro Herrera
Date:
On 2023-Jul-20, Tomas Vondra wrote:

> From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas.vondra@postgresql.org>
> Date: Wed, 5 Apr 2023 22:49:41 +0200
> Subject: [PATCH 1/6] Make test_decoding ddl.out shorter
> 
> Some of the test_decoding test output was extremely wide, because it
> deals with toasted values, and the aligned mode causes psql to produce
> 200kB of dashes. Turn that off temporarily using \pset to avoid it.

Do you mind if I get this one pushed later today?  Or feel free to push
it yourself, if you want.  It's an annoying patch to keep seeing posted
over and over, with no further value.  

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> >
> > PFA such edits in 0002 and 0006 patches. Let me know if those look
> > correct. I think we
> > need similar changes to the documentation and comments in other places.
> >
>
> OK, I merged the changes into the patches, with some minor changes to
> the wording etc.

Thanks.


>
> > In sequence_decode() we skip sequence changes when fast forwarding.
> > Given that smgr_decode() is only to supplement sequence_decode(), I
> > think it's correct to do the same in smgr_decode() as well. Similarly for
> > skipping when we don't have a full snapshot.
> >
>
> I don't follow, smgr_decode already checks ctx->fast_forward.

In your earlier email you seemed to expressed some doubts about the
change skipping code in smgr_decode(). To that, I gave my own
perspective of why the change skipping code in smgr_decode() is
correct. I think smgr_decode() is doing the right thing. No change
required there.

>
> > Some minor comments on 0006 patch
> >
> > +    /* make sure the relfilenode creation is associated with the XID */
> > +    if (XLogLogicalInfoActive())
> > +        GetCurrentTransactionId();
> >
> > I think this change is correct and is inline with similar changes in 0002. But
> > I looked at other places from where DefineRelation() is called. For regular
> > tables it is called from ProcessUtilitySlow() which in turn does not call
> > GetCurrentTransactionId(). I am wondering whether we are just discovering a
> > class of bugs caused by not associating an xid with a newly created
> > relfilenode.
> >
>
> Not sure. Why would it be a bug?

This discussion is unrelated to sequence decoding but let me add it
here. If we don't know the transaction ID that created a relfilenode,
we wouldn't know whether to roll back that creation if the transaction
gets rolled back during recovery. But maybe that doesn't matter since
the relfilenode is not visible in any of the catalogs, so it just lies
there unused.


>
> > +void
> > +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
> > +                               RelFileLocator rlocator)
> > +{
> >     ... snip ...
> > +
> > +    /* sequence changes require a transaction */
> > +    if (xid == InvalidTransactionId)
> > +        return;
> >
> > IIUC, with your changes in DefineSequence() in this patch, this should not
> > happen. So this condition will never be true. But in case it happens, this code
> > will not add the relfilelocator to the hash table and we will deem the
> > sequence change as non-transactional. Isn't it better to just throw an error
> > and stop replication if that (ever) happens?
> >
>
> It can't happen for a sequence, but it may happen when creating a
> non-sequence relfilenode. In effect, it's a way to skip (some)
> unnecessary relfilenodes.

Ah! The comment is correct but cryptic. I didn't read it to mean this.

> > +    /*
> > +     * To support logical decoding of sequences, we require the sequence
> > +     * callback. We decide it here, but only check it later in the wrappers.
> > +     *
> > +     * XXX Isn't it wrong to define only one of those callbacks? Say we
> > +     * only define the stream_sequence_cb() - that may get strange results
> > +     * depending on what gets streamed. Either none or both?
> > +     *
> > +     * XXX Shouldn't sequence be defined at slot creation time, similar
> > +     * to two_phase? Probably not.
> > +     */
> >
> > Do you intend to keep these XXX's as is? My previous comments on this comment
> > block are in [1].

This comment remains unanswered.

> >
> > In fact, given that whether or not sequences are replicated is decided by the
> > protocol version, do we really need LogicalDecodingContext::sequences? Drawing
> > parallel with WAL messages, I don't think it's needed.
> >
>
> Right. We do that for two_phase because you can override that when
> creating the subscription - sequences allowed that too initially, but
> then we ditched that. So I don't think we need this.

Then we should just remove that member and its references.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/24/23 13:14, Alvaro Herrera wrote:
> On 2023-Jul-20, Tomas Vondra wrote:
> 
>> From 809d60be7e636b8505027ad87bcb9fc65224c47b Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas.vondra@postgresql.org>
>> Date: Wed, 5 Apr 2023 22:49:41 +0200
>> Subject: [PATCH 1/6] Make test_decoding ddl.out shorter
>>
>> Some of the test_decoding test output was extremely wide, because it
>> deals with toasted values, and the aligned mode causes psql to produce
>> 200kB of dashes. Turn that off temporarily using \pset to avoid it.
> 
> Do you mind if I get this one pushed later today?  Or feel free to push
> it yourself, if you want.  It's an annoying patch to keep seeing posted
> over and over, with no further value.  
> 

Feel free to push. It's your patch, after all.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Jul 20, 2023 at 10:19 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> FWIW there's two questions related to the switch to XLOG_SMGR_CREATE.
>
> 1) Does smgr_decode() need to do the same block as sequence_decode()?
>
>     /* Skip the change if already processed (per the snapshot). */
>     if (transactional &&
>         !SnapBuildProcessChange(builder, xid, buf->origptr))
>         return;
>     else if (!transactional &&
>              (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
>               SnapBuildXactNeedsSkip(builder, buf->origptr)))
>         return;
>
> I don't think it does. Also, we don't have any transactional flag here.
> Or rather, everything is transactional ...

Right.

>
>
> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> I was thinking maybe we should have it in the transaction (because we
> need to do cleanup at the end). It seems a bit inconvenient, because then
> we'd need to either search htabs in all subxacts, or transfer the
> entries to the top-level xact (otoh, we already do that with snapshots),
> and cleanup on abort.
>
> What do you think?

A hash table per transaction seems a saner design. Adding it to the
top-level transaction should be fine. The entry will contain an XID
anyway. If we add it to every subtransaction, we will need to search the
hash table in each of the subtransactions when deciding whether a
sequence change is transactional or not. The top-level transaction is a
reasonable trade-off.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/24/23 08:31, Amit Kapila wrote:
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> OK, I merged the changes into the patches, with some minor changes to
>> the wording etc.
>>
> 
> I think we can do 0001-Make-test_decoding-ddl.out-shorter-20230720
> even without the rest of the patches. Isn't it a separate improvement?
> 
> I see that origin filtering (origin=none) doesn't work with this
> patch. You can see this by using the following statements:
> Node-1:
> postgres=# create sequence s;
> CREATE SEQUENCE
> postgres=# create publication mypub for all sequences;
> CREATE PUBLICATION
> 
> Node-2:
> postgres=# create sequence s;
> CREATE SEQUENCE
> postgres=# create subscription mysub_sub connection '....' publication
> mypub with (origin=none);
> NOTICE:  created replication slot "mysub_sub" on publisher
> CREATE SUBSCRIPTION
> postgres=# create publication mypub_sub for all sequences;
> CREATE PUBLICATION
> 
> Node-1:
> create subscription mysub_pub connection '...' publication mypub_sub
> with (origin=none);
> NOTICE:  created replication slot "mysub_pub" on publisher
> CREATE SUBSCRIPTION
> 
> SELECT nextval('s') FROM generate_series(1,100);
> 
> After that, you can check on the subscriber that sequences values are
> overridden with older values:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          67 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         100 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          67 |       0 | t
> (1 row)
> 
> I haven't verified all the details but I think that is because we
> don't set XLOG_INCLUDE_ORIGIN while logging sequence values.
> 

Good point. Attached is a patch that adds XLOG_INCLUDE_ORIGIN to
sequence changes. I considered doing that only for wal_level=logical,
but we don't do that elsewhere. Also, I didn't do that for smgr_create,
because we don't actually replicate that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 7/24/23 14:53, Ashutosh Bapat wrote:
> On Thu, Jul 20, 2023 at 8:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>>>
>>> PFA such edits in 0002 and 0006 patches. Let me know if those look
>>> correct. I think we
>>> need similar changes to the documentation and comments in other places.
>>>
>>
>> OK, I merged the changes into the patches, with some minor changes to
>> the wording etc.
> 
> Thanks.
> 
> 
>>
>>> In sequence_decode() we skip sequence changes when fast forwarding.
>>> Given that smgr_decode() is only to supplement sequence_decode(), I
>>> think it's correct to do the same in smgr_decode() as well. Simillarly
>>> skipping when we don't have full snapshot.
>>>
>>
>> I don't follow, smgr_decode already checks ctx->fast_forward.
> 
> In your earlier email you seemed to express some doubts about the
> change-skipping code in smgr_decode(). To that, I gave my own
> perspective on why the change-skipping code in smgr_decode() is
> correct. I think smgr_decode() is doing the right thing; no change
> is required there.
> 

I think that was referring to the skipping we do for logical messages:

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!message->transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

I concluded we don't need to do that here.

>>
>>> Some minor comments on 0006 patch
>>>
>>> +    /* make sure the relfilenode creation is associated with the XID */
>>> +    if (XLogLogicalInfoActive())
>>> +        GetCurrentTransactionId();
>>>
>>> I think this change is correct and is inline with similar changes in 0002. But
>>> I looked at other places from where DefineRelation() is called. For regular
>>> tables it is called from ProcessUtilitySlow() which in turn does not call
>>> GetCurrentTransactionId(). I am wondering whether we are just discovering a
>>> class of bugs caused by not associating an xid with a newly created
>>> relfilenode.
>>>
>>
>> Not sure. Why would it be a bug?
> 
> This discussion is unrelated to sequence decoding but let me add it
> here. If we don't know the transaction ID that created a relfilenode,
> we wouldn't know whether to roll back that creation if the transaction
> gets rolled back during recovery. But maybe that doesn't matter since
> the relfilenode is not visible in any of the catalogs, so it just lies
> there unused.
> 

I think that's unrelated to this patch.

> 
>>
>>> +void
>>> +ReorderBufferAddRelFileLocator(ReorderBuffer *rb, TransactionId xid,
>>> +                               RelFileLocator rlocator)
>>> +{
>>>     ... snip ...
>>> +
>>> +    /* sequence changes require a transaction */
>>> +    if (xid == InvalidTransactionId)
>>> +        return;
>>>
>>> IIUC, with your changes in DefineSequence() in this patch, this should not
>>> happen. So this condition will never be true. But in case it happens, this code
>>> will not add the relfilelocation to the hash table and we will deem the
>>> sequence change as non-transactional. Isn't it better to just throw an error
>>> and stop replication if that (ever) happens?
>>>
>>
>> It can't happen for sequence, but it may happen when creating a
>> non-sequence relfilenode. In a way, it's a way to skip (some)
>> unnecessary relfilenodes.
> 
> Ah! The comment is correct but cryptic. I didn't read it to mean this.
> 

OK, I'll improve the comment.

>>> +    /*
>>> +     * To support logical decoding of sequences, we require the sequence
>>> +     * callback. We decide it here, but only check it later in the wrappers.
>>> +     *
>>> +     * XXX Isn't it wrong to define only one of those callbacks? Say we
>>> +     * only define the stream_sequence_cb() - that may get strange results
>>> +     * depending on what gets streamed. Either none or both?
>>> +     *
>>> +     * XXX Shouldn't sequence be defined at slot creation time, similar
>>> +     * to two_phase? Probably not.
>>> +     */
>>>
>>> Do you intend to keep these XXX's as is? My previous comments on this comment
>>> block are in [1].
> 
> This comment remains unanswered.
> 

I think the conclusion was we don't need to do that. I forgot to remove
the comment, though.

>>>
>>> In fact, given that whether or not sequences are replicated is decided by the
>>> protocol version, do we really need LogicalDecodingContext::sequences? Drawing
>>> parallel with WAL messages, I don't think it's needed.
>>>
>>
>> Right. We do that for two_phase because you can override that when
>> creating the subscription - sequences allowed that too initially, but
>> then we ditched that. So I don't think we need this.
> 
> Then we should just remove that member and its references.
> 

The member is still needed - it says whether the plugin has callbacks
for sequence decoding or not (just like we have a flag for streaming,
for example). I see the XXX comment in sequence_decode() is no longer
needed, we rely on protocol versioning.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Alvaro Herrera
Date:
On 2023-Jul-24, Tomas Vondra wrote:

> On 7/24/23 13:14, Alvaro Herrera wrote:

> > Do you mind if I get this one pushed later today?  Or feel free to push
> > it yourself, if you want.  It's an annoying patch to keep seeing posted
> > over and over, with no further value.  
> 
> Feel free to push. It's your patch, after all.

Thanks, done.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
            https://twitter.com/thingskatedid/status/1456027786158776329



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/24/23 12:40, Amit Kapila wrote:
> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
>> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
>> be reviewed in this response.
>>
>> I followed the discussion starting [1] till [2]. The second one
>> mentions the interlock mechanism which has been implemented in 0005
>> and 0006. While I don't have an objection to allowing LOCKing a
>> sequence using the LOCK command, I am not sure whether it will
>> actually work or is even needed.
>>
>> The problem described in [1] seems to be the same as the problem
>> described in [2]. In both cases we see the sequence moving backwards
>> during CATCHUP. At the end of catchup the sequence is in the right
>> state in both the cases.
>>
> 
> I think we could see a backward sequence value even after the catchup
> phase (after the sync worker has exited and/or the state of the rel is
> marked as 'ready' in pg_subscription_rel). The point is that there is
> no guarantee that we will process all the pending WAL before
> considering the sequence state as 'SYNCDONE' and/or 'READY'. For
> example, after copy_sequence, I see values like:
> 
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      166
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      167
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      167
> (1 row)
> 
> Then during the catchup phase:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
> (1 row)
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> 
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | s          | 0/1637700
> (2 rows)
> 
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
>    16394 |   16390 | r          | 0/16374E8
>    16394 |   16393 | r          | 0/1637700
> (2 rows)
> 
> Here Sequence relid id 16393. You can see sequence state is marked as ready.
> 

Right, but "READY" just means the apply caught up to the LSN where the
sync finished ...

> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> 
> Even after that, see below the value of the sequence is still not
> caught up. Later, when the apply worker processes all the WAL, the
> sequence state will be caught up.
> 

And how is this different from what tablesync does for tables? For that
'r' also does not mean it's fully caught up, IIRC. What matters is
whether the sequence can go back from this moment on. And I don't think it
can, because that would require replaying changes from before we did
copy_sequence ...

> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         165 |       0 | t
> (1 row)
> 
> So, there will be a window where the sequence won't be caught up for a
> certain period of time and any usage of it (even after the sync is
> finished) during that time could result in inconsistent behaviour.
> 
> The other question is whether it is okay to allow the sequence to go
> backwards even during the initial sync phase? The reason I am asking
> this question is that during the time the sequence value moves backwards, one
> is allowed to use it on the subscriber which will result in using
> out-of-sequence values. For example, immediately, after copy_sequence
> the values look like this:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>         133 |      32 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>      134
> (1 row)
> postgres=# select currval('s');
>  currval
> ---------
>      134
> (1 row)
> 
> But then during the sync phase, it can go backwards and one is allowed
> to use it on the subscriber:
> postgres=# select * from s;
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          66 |       0 | t
> (1 row)
> postgres=# select nextval('s');
>  nextval
> ---------
>       67
> (1 row)
> 

As I wrote earlier, I think the agreement was we make no guarantees
about what happens during the sync.

Also, not sure what you mean by "no one is allowed to use it on
subscriber" - that is only allowed after a failover/switchover, after
sequence sync completes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 12:40, Amit Kapila wrote:
> > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Even after that, see below the value of the sequence is still not
> > caught up. Later, when the apply worker processes all the WAL, the
> > sequence state will be caught up.
> >
>
> And how is this different from what tablesync does for tables? For that
> 'r' also does not mean it's fully caught up, IIRC. What matters is
> whether the sequence since this moment can go back. And I don't think it
> can, because that would require replaying changes from before we did
> copy_sequence ...
>

For sequences, it is quite possible that we replay WAL from before the
copy_sequence whereas the same is not true for tables (w.r.t
copy_table()). This is because for tables we have a kind of interlock
w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
basically, the walsender corresponding to tablesync worker in
publisher won't send any WAL before that LSN whereas the same is not
true for sequences. Also, even if apply worker can receive WAL before
copy_table, it won't apply that as that would be behind the LSN1 and
the same is not true for sequences. So, for tables, we will never go
back to a state before the copy_table() but for sequences, we can go
back to a state before copy_sequence().
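The interlock described above can be sketched as two LSN filters
(hypothetical helper functions, not the actual walsender/apply-worker
code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

/* Publisher side: the walsender for a tablesync worker streams nothing
 * from before the slot-creation LSN (LSN1). */
bool
walsender_would_send(XLogRecPtr record_lsn, XLogRecPtr slot_lsn)
{
    return record_lsn >= slot_lsn;
}

/* Subscriber side: the apply worker skips anything at or behind the LSN
 * already reflected in the copied data. For tables the copy happens
 * exactly at LSN1, so the two filters line up; for sequences (as posted)
 * the copy happens at a later LSN, so pre-copy changes can still slip
 * through and be replayed. */
bool
apply_would_replay(XLogRecPtr record_lsn, XLogRecPtr applied_upto)
{
    return record_lsn > applied_upto;
}
```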

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Mon, Jul 24, 2023 at 4:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 12:40, Amit Kapila wrote:
> > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >>
> >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> >> be reviewed in this response.
> >>
> >> I followed the discussion starting [1] till [2]. The second one
> >> mentions the interlock mechanism which has been implemented in 0005
> >> and 0006. While I don't have an objection to allowing LOCKing a
> >> sequence using the LOCK command, I am not sure whether it will
> >> actually work or is even needed.
> >>
> >> The problem described in [1] seems to be the same as the problem
> >> described in [2]. In both cases we see the sequence moving backwards
> >> during CATCHUP. At the end of catchup the sequence is in the right
> >> state in both the cases.
> >>
> >
> > I think we could see backward sequence value even after the catchup
> > phase (after the sync worker is exited and or the state of rel is
> > marked as 'ready' in pg_subscription_rel). The point is that there is
> > no guarantee that we will process all the pending WAL before
> > considering the sequence state is 'SYNCDONE' and or 'READY'. For
> > example, after copy_sequence, I see values like:
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > ------------+---------+-----------
> >         165 |       0 | t
> > (1 row)
> > postgres=# select nextval('s');
> >  nextval
> > ---------
> >      166
> > (1 row)
> > postgres=# select nextval('s');
> >  nextval
> > ---------
> >      167
> > (1 row)
> > postgres=# select currval('s');
> >  currval
> > ---------
> >      167
> > (1 row)
> >
> > Then during the catchup phase:
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > ------------+---------+-----------
> >          33 |       0 | t
> > (1 row)
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > ------------+---------+-----------
> >          66 |       0 | t
> > (1 row)
> >
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn
> > ---------+---------+------------+-----------
> >    16394 |   16390 | r          | 0/16374E8
> >    16394 |   16393 | s          | 0/1637700
> > (2 rows)
> >
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn
> > ---------+---------+------------+-----------
> >    16394 |   16390 | r          | 0/16374E8
> >    16394 |   16393 | r          | 0/1637700
> > (2 rows)
> >
> > Here Sequence relid id 16393. You can see sequence state is marked as ready.
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > ------------+---------+-----------
> >          66 |       0 | t
> > (1 row)
> >
> > Even after that, see below the value of the sequence is still not
> > caught up. Later, when the apply worker processes all the WAL, the
> > sequence state will be caught up.
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > ------------+---------+-----------
> >         165 |       0 | t
> > (1 row)
> >
> > So, there will be a window where the sequence won't be caught up for a
> > certain period of time and any usage of it (even after the sync is
> > finished) during that time could result in inconsistent behaviour.
> >
>
> I'm rather confused about which node these queries are executed on.
> Presumably some of it is on publisher, some on subscriber?
>

These are all on the subscriber.

> Can you create a reproducer (a TAP test demonstrating this)? I guess it
> might require adding some sleeps to hit the right timing ...
>

I have used the debugger to reproduce this as it needs quite some
coordination. I just wanted to see whether the sequence can go backward
and fail to catch up completely before the sequence state is marked
'ready'. On the publisher side, I created a publication with a table
and a sequence. Then did the following steps:
SELECT nextval('s') FROM generate_series(1,50);
insert into t1 values(1);
SELECT nextval('s') FROM generate_series(51,150);

Then on the subscriber side with some debugging aid, I could find the
values in the sequence shown in the previous email. Sorry, I haven't
recorded each and every step but, if you think it helps, I can again
try to reproduce it and share the steps.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/25/23 08:28, Amit Kapila wrote:
> On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/24/23 12:40, Amit Kapila wrote:
>>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>
>>> Even after that, see below the value of the sequence is still not
>>> caught up. Later, when the apply worker processes all the WAL, the
>>> sequence state will be caught up.
>>>
>>
>> And how is this different from what tablesync does for tables? For that
>> 'r' also does not mean it's fully caught up, IIRC. What matters is
>> whether the sequence since this moment can go back. And I don't think it
>> can, because that would require replaying changes from before we did
>> copy_sequence ...
>>
> 
> For sequences, it is quite possible that we replay WAL from before the
> copy_sequence whereas the same is not true for tables (w.r.t
> copy_table()). This is because for tables we have a kind of interlock
> w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> basically, the walsender corresponding to tablesync worker in
> publisher won't send any WAL before that LSN whereas the same is not
> true for sequences. Also, even if apply worker can receive WAL before
> copy_table, it won't apply that as that would be behind the LSN1 and
> the same is not true for sequences. So, for tables, we will never go
> back to a state before the copy_table() but for sequences, we can go
> back to a state before copy_sequence().
> 

Right. I think the important detail is that during sync we have three
important LSNs

- LSN1 where the slot is created
- LSN2 where the copy happens
- LSN3 where we consider the sync completed

For tables, LSN1 == LSN2, because the data is copied using the
snapshot from the temporary slot. And (LSN1 <= LSN3).

But for sequences, the copy happens after the slot creation, possibly
with (LSN1 < LSN2). And because LSN3 comes from the main subscription
(which may be a bit behind, for whatever reason), it may happen that

   (LSN1 < LSN3 < LSN2)

The sync then ends at LSN3, but that means all sequence changes between
LSN3 and LSN2 will be applied "again", making the sequence go backwards.

IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).
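A numeric sketch of the above (hypothetical helpers, not actual code):
with LSN1 < LSN3 < LSN2, any change between LSN3 and LSN2 is older than
the copied state, so replaying it moves the sequence back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

/* A decoded sequence change from before copy_sequence() is already
 * reflected in the copied state; replaying it would move the sequence
 * backwards. */
bool
change_moves_sequence_back(XLogRecPtr change_lsn, XLogRecPtr copy_lsn)
{
    return change_lsn < copy_lsn;
}

/* The proposed fix: only treat the sync as complete at an LSN at or
 * beyond the copy (LSN3 >= LSN2), so catchup cannot stop while such
 * changes are still pending. */
bool
sync_endpoint_is_safe(XLogRecPtr sync_done_lsn, XLogRecPtr copy_lsn)
{
    return sync_done_lsn >= copy_lsn;
}
```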


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Right. I think the important detail is that during sync we have three
> important LSNs
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the data is completed using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
>    (LSN1 < LSN3 < LSN2)
>
> The the sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again" making the sequence go away.
>
> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Back in this thread, an approach was proposed to use the page LSN (LSN2
above) to make sure that no change before LSN2 is applied on the
subscriber. The approach was discussed in emails around [1] and later
discarded for no clear reason. I think that approach has some merit.

[1]
https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/25/23 15:18, Ashutosh Bapat wrote:
>
> ...
>
>> But for sequences, the copy happens after the slot creation, possibly
>> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
>> (which may be a bit behind, for whatever reason), it may happen that
>>
>>    (LSN1 < LSN3 < LSN2)
>>
>> The the sync ends at LSN3, but that means all sequence changes between
>> LSN3 and LSN2 will be applied "again" making the sequence go away.
>>
>> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).
> 

Do you agree this scheme would be correct?

> Back in this thread, an approach to use page LSN (LSN2 above) to make
> sure that no change before LSN2 is applied on subscriber. The approach
> was discussed in emails around [1] and discarded later for no reason.
> I think that approach has some merit.
> 
> [1]
https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d
> 

That doesn't seem to be the correct link ... IIRC the page LSN was
discussed as a way to skip changes up to the point when the COPY was
done. I believe it might work with the scheme I described above too.

The trouble is we don't have an interface to select both the sequence
state and the page LSN. It's probably not hard to add (extend the
read_seq_tuple() to also return the LSN, and adding a SQL function), but
I don't think it'd add much value, compared to just getting the current
insert LSN after the COPY.

Yes, the current LSN may be a bit higher, so we may need to apply a
couple of changes to get into the "ready" state. But we read it right
after copy_sequence(), so how much can happen in between?

Also, we can get into a similar state anyway - the main subscription can
get ahead, at which point the sync has to catch up to it.

The attached patch (part 0007) does it this way. Can you check whether
you can still reproduce the "backwards" movement with this version?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 7/24/23 14:57, Ashutosh Bapat wrote:
> ...
> 
>>
>>
>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
>> I was thinking maybe we should have it in the transaction (because we
>> need to do cleanup at the end). It seem a bit inconvenient, because then
>> we'd need to either search htabs in all subxacts, or transfer the
>> entries to the top-level xact (otoh, we already do that with snapshots),
>> and cleanup on abort.
>>
>> What do you think?
> 
> Hash table per transaction seems saner design. Adding it to the top
> level transaction should be fine. The entry will contain an XID
> anyway. If we add it to every subtransaction we will need to search
> hash table in each of the subtransactions when deciding whether a
> sequence change is transactional or not. Top transaction is a
> reasonable trade off.
> 

It's not clear to me what design you're proposing, exactly.

If we track it in top-level transactions, then we'd need to copy the data
whenever a transaction is assigned as a child, and perhaps also remove
it when there's a subxact abort.

And we'd still need to search the hashes in all top-level transactions on
every sequence increment - in principle we can't have an increment for a
sequence created in another in-progress transaction, but maybe it's just
not assigned yet.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
Here's a somewhat cleaned up version of the patch series, with some of
the smaller "rework" patches (protocol versioning, origins, smgr_create,
...) merged into the appropriate part. I've kept the bit adding separate
tablesync LSN.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 08:28, Amit Kapila wrote:
> > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 7/24/23 12:40, Amit Kapila wrote:
> >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> >>> <ashutosh.bapat.oss@gmail.com> wrote:
> >>>
> >>> Even after that, see below the value of the sequence is still not
> >>> caught up. Later, when the apply worker processes all the WAL, the
> >>> sequence state will be caught up.
> >>>
> >>
> >> And how is this different from what tablesync does for tables? For that
> >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> >> whether the sequence since this moment can go back. And I don't think it
> >> can, because that would require replaying changes from before we did
> >> copy_sequence ...
> >>
> >
> > For sequences, it is quite possible that we replay WAL from before the
> > copy_sequence whereas the same is not true for tables (w.r.t
> > copy_table()). This is because for tables we have a kind of interlock
> > w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> > basically, the walsender corresponding to tablesync worker in
> > publisher won't send any WAL before that LSN whereas the same is not
> > true for sequences. Also, even if apply worker can receive WAL before
> > copy_table, it won't apply that as that would be behind the LSN1 and
> > the same is not true for sequences. So, for tables, we will never go
> > back to a state before the copy_table() but for sequences, we can go
> > back to a state before copy_sequence().
> >
>
> Right. I think the important detail is that during sync we have three
> important LSNs
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the data is completed using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
>    (LSN1 < LSN3 < LSN2)
>
> The the sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again" making the sequence go away.
>

Yeah, the problem is as you explained, but an additional minor point is
that for sequences we also end up applying the WAL between LSN1 and
LSN3, which makes it go backwards. The ideal way is
that sequences on subscribers never go backward in a way that is
visible to users. I will share my thoughts after studying your
proposal in a later email.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 7/25/23 08:28, Amit Kapila wrote:
> > > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > >>
> > >> On 7/24/23 12:40, Amit Kapila wrote:
> > >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > >>> <ashutosh.bapat.oss@gmail.com> wrote:
> > >>>
> > >>> Even after that, see below the value of the sequence is still not
> > >>> caught up. Later, when the apply worker processes all the WAL, the
> > >>> sequence state will be caught up.
> > >>>
> > >>
> > >> And how is this different from what tablesync does for tables? For that
> > >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> > >> whether the sequence since this moment can go back. And I don't think it
> > >> can, because that would require replaying changes from before we did
> > >> copy_sequence ...
> > >>
> > >
> > > For sequences, it is quite possible that we replay WAL from before the
> > > copy_sequence whereas the same is not true for tables (w.r.t
> > > copy_table()). This is because for tables we have a kind of interlock
> > > w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> > > basically, the walsender corresponding to tablesync worker in
> > > publisher won't send any WAL before that LSN whereas the same is not
> > > true for sequences. Also, even if apply worker can receive WAL before
> > > copy_table, it won't apply that as that would be behind the LSN1 and
> > > the same is not true for sequences. So, for tables, we will never go
> > > back to a state before the copy_table() but for sequences, we can go
> > > back to a state before copy_sequence().
> > >
> >
> > Right. I think the important detail is that during sync we have three
> > important LSNs
> >
> > - LSN1 where the slot is created
> > - LSN2 where the copy happens
> > - LSN3 where we consider the sync completed
> >
> > For tables, LSN1 == LSN2, because the data is completed using the
> > snapshot from the temporary slot. And (LSN1 <= LSN3).
> >
> > But for sequences, the copy happens after the slot creation, possibly
> > with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> > (which may be a bit behind, for whatever reason), it may happen that
> >
> >    (LSN1 < LSN3 < LSN2)
> >
> > The the sync ends at LSN3, but that means all sequence changes between
> > LSN3 and LSN2 will be applied "again" making the sequence go away.
> >
>
> Yeah, the problem is something as you explained but an additional
> minor point is that for sequences we also do end up applying the WAL
> between LSN1 and LSN3 which makes it go backwards.
>

I was reading this email thread and found the email by Andres [1]
which seems to me to say the same thing: "I assume that part of the
initial sync would have to be a new sequence synchronization step that
reads all the sequence states on the publisher and ensures that the
subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical
replication replay support for sequences will have to be a bit careful
about not decreasing the subscriber's sequence values - the standby
initially will be ahead of the
increments we'll see in the WAL.". Now, IIUC this means that even
before the sequence is marked as SYNCDONE, it shouldn't go backward.
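The (LSN1 < LSN3 < LSN2) scenario quoted above can be sketched as a small simulation (plain Python with purely hypothetical LSNs and values, not actual PostgreSQL code): the copy reflects the publisher state at LSN2, but apply resumes at LSN3, so records between LSN3 and LSN2 are replayed again.

```python
# Hypothetical simulation: copy_sequence() captures the state as of
# copy_lsn (LSN2), but catchup replays everything past sync_done_lsn
# (LSN3), re-applying the records between LSN3 and LSN2.

def sync_and_catchup(wal, copy_lsn, sync_done_lsn):
    """wal is a list of (lsn, sequence_value) records in LSN order."""
    # the copy captures the publisher state as of copy_lsn
    value = max(v for lsn, v in wal if lsn <= copy_lsn)
    observed = [value]
    # catchup replays every record past sync_done_lsn
    for lsn, v in wal:
        if lsn > sync_done_lsn:
            value = v
            observed.append(value)
    return observed

wal = [(10, 100), (20, 200), (30, 300)]
observed = sync_and_catchup(wal, copy_lsn=30, sync_done_lsn=10)
# observed == [300, 200, 300]: the subscriber sees the sequence drop
# from 300 back to 200 before recovering
```
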

[1]: https://www.postgresql.org/message-id/20221117024357.ljjme6v75mny2j6u%40awork3.anarazel.de

With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 7/26/23 09:27, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> ...
>>
> 
> I was reading this email thread and found the email by Andres [1]
> which seems to me to say the same thing: "I assume that part of the
> initial sync would have to be a new sequence synchronization step that
> reads all the sequence states on the publisher and ensures that the
> subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical
> replication replay support for sequences will have to be a bit careful
> about not decreasing the subscriber's sequence values - the standby
> initially will be ahead of the
> increments we'll see in the WAL.". Now, IIUC this means that even
> before the sequence is marked as SYNCDONE, it shouldn't go backward.
> 

Well, I could argue that's more an opinion, and I'm not sure it really
contradicts the idea that the sequence should not go backwards only
after the sync completes.

Anyway, I was thinking about this a bit more, and it seems it's not as
difficult to use the page LSN to ensure sequences don't go backwards.
The 0005 change does that, by:

1) adding pg_sequence_state, that returns both the sequence state and
   the page LSN

2) copy_sequence returns the page LSN

3) tablesync then sets this LSN as origin_startpos (which for tables is
   just the LSN of the replication slot)

AFAICS this makes it work - we start decoding at the page LSN, so that
we skip the increments that could lead to the sequence going backwards.
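A rough model of that flow (the function names, LSNs and values here are purely illustrative, not the actual patch API):

```python
# Sketch of the 0005 approach: copy_sequence() also returns the page LSN
# of the state it read, tablesync uses that LSN as origin_startpos, and
# decoding then skips every increment at or before it.

def copy_sequence(wal):
    """Return (state, page_lsn) for the newest logged sequence state."""
    page_lsn, state = max(wal)
    return state, page_lsn

def catchup(wal, origin_startpos, value):
    # only increments past origin_startpos are decoded and applied
    for lsn, v in wal:
        if lsn > origin_startpos:
            value = v
    return value

wal = [(10, 100), (20, 200), (30, 300)]
state, origin_startpos = copy_sequence(wal)
final = catchup(wal, origin_startpos, state)
# final == 300, and no decoded increment ever moved the value backwards
```
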


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/26/23 09:27, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
>

While studying the changes for this proposal and related areas, I have
a few comments:
1. I think you need to advance the origin if it is changed due to
copy_sequence(), otherwise, if the sync worker restarts after
SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
value.

2. Between the time of SYNCDONE and READY state, the patch can skip
applying non-transactional sequence changes even if it should apply
it. The reason is that during that state change
should_apply_changes_for_rel() decides whether to apply change based
on the value of remote_final_lsn which won't be set for
non-transactional change. I think we need to send the start LSN of a
non-transactional record and then use that as remote_final_lsn for
such a change.

3. For non-transactional sequence change apply, we don't set
replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
we are doing in apply_handle_commit_internal() before calling
CommitTransactionCommand(). So, that can lead to the origin moving
backwards after restart which will lead to requesting and applying the
same changes again and for that period of time sequence can go
backwards. This needs some more thought as to what is the correct
behaviour/solution for this.

4. BTW, while checking this behaviour, I noticed that the initial sync
worker for sequence mentions the table in the LOG message: "LOG:
logical replication table synchronization worker for subscription
"mysub", table "s" has finished". Won't it be better here to refer to
it as a sequence?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >>
> >>
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seems a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and cleanup on abort.
> >>
> >> What do you think?
> >
> > Hash table per transaction seems saner design. Adding it to the top
> > level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. Top transaction is a
> > reasonable trade off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need to copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, esp. with your changes to assign xid, we will always know
the top level transaction when a sequence is assigned a relfilenode.
So the relfilenodes will always get added to the correct hash directly.
I didn't imagine a case where we will need to copy the hash table from
sub-transaction to top transaction. If that's true, yes it's
inconvenient.

As to the abort, don't we already remove entries on subtxn abort?
Having per transaction hash table doesn't seem to change anything
much.

>
> And we'd still need to search the hashes in all top-level transactions on
> every sequence increment - in principle we can't have an increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on sequence when changing its relfilenode. The
sequence whose relfilenode is being changed can not be accessed by any
concurrent transaction. So I am not able to understand what you are
trying to say.

I think per (top level) transaction hash table is cleaner design. It
puts the hash table where it should be. But if that makes code
difficult, current design works too.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 7/28/23 11:42, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/26/23 09:27, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>>
> 
> While studying the changes for this proposal and related areas, I have
> a few comments:
> 1. I think you need to advance the origin if it is changed due to
> copy_sequence(), otherwise, if the sync worker restarts after
> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> value.
> 

True, we want to restart at the new origin_startpos.

> 2. Between the time of SYNCDONE and READY state, the patch can skip
> applying non-transactional sequence changes even if it should apply
> it. The reason is that during that state change
> should_apply_changes_for_rel() decides whether to apply change based
> on the value of remote_final_lsn which won't be set for
> non-transactional change. I think we need to send the start LSN of a
> non-transactional record and then use that as remote_final_lsn for
> such a change.

Good catch. remote_final_lsn is set in apply_handle_begin, but that
won't happen for sequences. We're already sending the LSN, but
logicalrep_read_sequence ignores it - it should be enough to add it to
LogicalRepSequence and then set it in apply_handle_sequence().

> 
> 3. For non-transactional sequence change apply, we don't set
> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> we are doing in apply_handle_commit_internal() before calling
> CommitTransactionCommand(). So, that can lead to the origin moving
> backwards after restart which will lead to requesting and applying the
> same changes again and for that period of time sequence can go
> backwards. This needs some more thought as to what is the correct
> behaviour/solution for this.
> 

I think saying "origin moves backwards" is a bit misleading. AFAICS the
origin position is not actually moving backwards, it's more that we
don't (and can't) move it forwards for each non-transactional change. So
yeah, we may re-apply those, and IMHO that's expected - the sequence is
allowed to be "ahead" on the subscriber.

I don't see a way to improve this, except maybe having a separate LSN
for non-transactional changes (for each origin).

> 4. BTW, while checking this behaviour, I noticed that the initial sync
> worker for sequence mentions the table in the LOG message: "LOG:
> logical replication table synchronization worker for subscription
> "mysub", table "s" has finished". Won't it be better here to refer to
> it as a sequence?
> 

Thanks, I'll fix that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, that returns both the sequence state and
>    the page LSN
>
> 2) copy_sequence returns the page LSN
>
> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>    just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we  skip the increments that could lead to the sequence going backwards.
>

I like this design very much. It makes things simpler than complex.
Thanks for doing this.

I am wondering whether we could reuse pg_sequence_last_value() instead
of adding a new function. But the name of the function doesn't leave
much space for expanding its functionality. So we are good with a new
one. Probably some code deduplication.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date

On 7/28/23 14:35, Ashutosh Bapat wrote:
> On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/24/23 14:57, Ashutosh Bapat wrote:
>>> ...
>>>
>>>>
>>>>
>>>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
>>>> I was thinking maybe we should have it in the transaction (because we
>>>> need to do cleanup at the end). It seems a bit inconvenient, because then
>>>> we'd need to either search htabs in all subxacts, or transfer the
>>>> entries to the top-level xact (otoh, we already do that with snapshots),
>>>> and cleanup on abort.
>>>>
>>>> What do you think?
>>>
>>> Hash table per transaction seems saner design. Adding it to the top
>>> level transaction should be fine. The entry will contain an XID
>>> anyway. If we add it to every subtransaction we will need to search
>>> hash table in each of the subtransactions when deciding whether a
>>> sequence change is transactional or not. Top transaction is a
>>> reasonable trade off.
>>>
>>
>> It's not clear to me what design you're proposing, exactly.
>>
>> If we track it in top-level transactions, then we'd need to copy the data
>> whenever a transaction is assigned as a child, and perhaps also remove
>> it when there's a subxact abort.
> 
> I thought, esp. with your changes to assign xid, we will always know
> the top level transaction when a sequence is assigned a relfilenode.
> So the relfilenodes will always get added to the correct hash directly.
> I didn't imagine a case where we will need to copy the hash table from
> sub-transaction to top transaction. If that's true, yes it's
> inconvenient.
> 

Well, it's a matter of efficiency.

To check if a sequence change is transactional, we need to check if it's
for a relfilenode created in the current transaction (it can't be for a
relfilenode created in a concurrent top-level transaction, due to MVCC).

If you don't copy the entries into the top-level xact, you have to walk
all subxacts and search all of those, for each sequence change. And
there may be quite a few of both subxacts and sequence changes ...

I wonder if we need to search the other top-level xacts, but we probably
need to do that. Because it might be a subxact without an assignment, or
something like that.

> As to the abort, don't we already remove entries on subtxn abort?
> Having per transaction hash table doesn't seem to change anything
> much.
> 

What entries are we removing? My point is that if we copy the entries to
the top-level xact, we probably need to remove them on abort. Or we
could leave them in the top-level xact hash.

>>
>> And we'd still need to search the hashes in all top-level transactions on
>> every sequence increment - in principle we can't have an increment for a
>> sequence created in another in-progress transaction, but maybe it's just
>> not assigned yet.
> 
> We hold a strong lock on sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed can not be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
> 

How do you know the subxact has already been recognized as such? It may
be treated as top-level transaction for a while, until the assignment.

> I think per (top level) transaction hash table is cleaner design. It
> puts the hash table where it should be. But if that makes code
> difficult, current design works too.
> 


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/28/23 11:42, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 7/26/23 09:27, Amit Kapila wrote:
> >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >>
> >
> > While studying the changes for this proposal and related areas, I have
> > a few comments:
> > 1. I think you need to advance the origin if it is changed due to
> > copy_sequence(), otherwise, if the sync worker restarts after
> > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> > value.
> >
>
> True, we want to restart at the new origin_startpos.
>
> > 2. Between the time of SYNCDONE and READY state, the patch can skip
> > applying non-transactional sequence changes even if it should apply
> > it. The reason is that during that state change
> > should_apply_changes_for_rel() decides whether to apply change based
> > on the value of remote_final_lsn which won't be set for
> > non-transactional change. I think we need to send the start LSN of a
> > non-transactional record and then use that as remote_final_lsn for
> > such a change.
>
> Good catch. remote_final_lsn is set in apply_handle_begin, but that
> won't happen for sequences. We're already sending the LSN, but
> logicalrep_read_sequence ignores it - it should be enough to add it to
> LogicalRepSequence and then set it in apply_handle_sequence().
>

As per my understanding, the LSN sent is EndRecPtr of record which is
the beginning of the next record (means current_record_end + 1). For
comparing the current record, we use the start_position of the record
as we do when we use the remote_final_lsn via apply_handle_begin().
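In other words (a hypothetical sketch, not the actual filtering code), comparing against the sent EndRecPtr can give a different answer than comparing against the record's start position, and the start position is what matches the range tablesync already covered:

```python
# A record spans [start_lsn, end_rec_ptr): EndRecPtr points just past the
# record, i.e. at the beginning of the next one. Filtering out a change
# that the sync already covered must therefore use the start position.

SYNC_POINT = 100  # changes at or before this LSN were covered by the sync

def should_apply(record_position, sync_point):
    # mirrors the "apply only changes past the sync point" rule
    return record_position > sync_point

# a record starting at LSN 95 whose EndRecPtr is 105
start_lsn, end_rec_ptr = 95, 105

assert should_apply(end_rec_ptr, SYNC_POINT)    # wrong: would re-apply it
assert not should_apply(start_lsn, SYNC_POINT)  # right: already synced
```
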

> >
> > 3. For non-transactional sequence change apply, we don't set
> > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> > we are doing in apply_handle_commit_internal() before calling
> > CommitTransactionCommand(). So, that can lead to the origin moving
> > backwards after restart which will lead to requesting and applying the
> > same changes again and for that period of time sequence can go
> > backwards. This needs some more thought as to what is the correct
> > behaviour/solution for this.
> >
>
> I think saying "origin moves backwards" is a bit misleading. AFAICS the
> origin position is not actually moving backwards, it's more that we
> don't (and can't) move it forwards for each non-transactional change. So
> yeah, we may re-apply those, and IMHO that's expected - the sequence is
> allowed to be "ahead" on the subscriber.
>

But, if this happens then for a period of time the sequence will go
backwards relative to what one would have observed before restart.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date

On 7/28/23 14:44, Ashutosh Bapat wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>> The 0005 change does that, by:
>>
>> 1) adding pg_sequence_state, that returns both the sequence state and
>>    the page LSN
>>
>> 2) copy_sequence returns the page LSN
>>
>> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>>    just the LSN of the replication slot)
>>
>> AFAICS this makes it work - we start decoding at the page LSN, so that
>> we  skip the increments that could lead to the sequence going backwards.
>>
> 
> I like this design very much. It makes things simpler than complex.
> Thanks for doing this.
> 

I agree it seems simpler. It'd be good to try testing / reviewing it a
bit more, so that it doesn't misbehave in some way.

> I am wondering whether we could reuse pg_sequence_last_value() instead
> of adding a new function. But the name of the function doesn't leave
> much space for expanding its functionality. So we are good with a new
> one. Probably some code deduplication.
> 

I don't think we should do that, the pg_sequence_last_value() function
is meant to do something different. And I don't think making it also do
what pg_sequence_state() does would make anything simpler.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date

On 7/29/23 06:54, Amit Kapila wrote:
> On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/28/23 11:42, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 7/26/23 09:27, Amit Kapila wrote:
>>>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> Anyway, I was thinking about this a bit more, and it seems it's not as
>>>> difficult to use the page LSN to ensure sequences don't go backwards.
>>>>
>>>
>>> While studying the changes for this proposal and related areas, I have
>>> a few comments:
>>> 1. I think you need to advance the origin if it is changed due to
>>> copy_sequence(), otherwise, if the sync worker restarts after
>>> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
>>> value.
>>>
>>
>> True, we want to restart at the new origin_startpos.
>>
>>> 2. Between the time of SYNCDONE and READY state, the patch can skip
>>> applying non-transactional sequence changes even if it should apply
>>> it. The reason is that during that state change
>>> should_apply_changes_for_rel() decides whether to apply change based
>>> on the value of remote_final_lsn which won't be set for
>>> non-transactional change. I think we need to send the start LSN of a
>>> non-transactional record and then use that as remote_final_lsn for
>>> such a change.
>>
>> Good catch. remote_final_lsn is set in apply_handle_begin, but that
>> won't happen for sequences. We're already sending the LSN, but
>> logicalrep_read_sequence ignores it - it should be enough to add it to
>> LogicalRepSequence and then set it in apply_handle_sequence().
>>
> 
> As per my understanding, the LSN sent is EndRecPtr of record which is
> the beginning of the next record (means current_record_end + 1). For
> comparing the current record, we use the start_position of the record
> as we do when we use the remote_final_lsn via apply_handle_begin().
> 
>>>
>>> 3. For non-transactional sequence change apply, we don't set
>>> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
>>> we are doing in apply_handle_commit_internal() before calling
>>> CommitTransactionCommand(). So, that can lead to the origin moving
>>> backwards after restart which will lead to requesting and applying the
>>> same changes again and for that period of time sequence can go
>>> backwards. This needs some more thought as to what is the correct
>>> behaviour/solution for this.
>>>
>>
>> I think saying "origin moves backwards" is a bit misleading. AFAICS the
>> origin position is not actually moving backwards, it's more that we
>> don't (and can't) move it forwards for each non-transactional change. So
>> yeah, we may re-apply those, and IMHO that's expected - the sequence is
>> allowed to be "ahead" on the subscriber.
>>
> 
> But, if this happens then for a period of time the sequence will go
> backwards relative to what one would have observed before restart.
> 

That is true, but is it really a problem? This whole sequence decoding
thing was meant to allow logical failover - make sure that after switch
to the subscriber, the sequences don't generate duplicate values. From
this POV, the sequence going backwards (back to the confirmed origin
position) is not an issue - it's still far enough (ahead of publisher).

Is that great / ideal? No, I agree with that. But it was considered
acceptable and good enough for the failover use case ...

The only idea how to improve that is we could keep the non-transactional
changes (instead of applying them immediately), and then apply them on
the nearest "commit". That'd mean it's subject to the position tracking,
and the sequence would not go backwards, I think.

So every time we decode a commit, we'd check if we decoded any sequence
changes since the last commit, and merge them (a bit like a subxact).

This would however also mean sequence changes from rolled-back xacts may
not be replicated. I think that'd be fine, but IIRC Andres suggested
it's a valid use case.
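A structural sketch of that idea (hypothetical names and a deliberately simplified model, not patch code): buffer the non-transactional changes instead of applying them immediately, and flush them with the next decoded commit, so they become subject to the normal origin-position tracking.

```python
# Sketch of "apply at the nearest commit": deferred sequence changes are
# merged into the next decoded commit, a bit like a fake subxact, so the
# origin position advances past them.

class Decoder:
    def __init__(self):
        self.pending_seq = []  # non-transactional changes since last commit
        self.applied = []      # what reached the subscriber
        self.origin_lsn = 0    # confirmed replication origin position

    def sequence_change(self, lsn, value):
        self.pending_seq.append((lsn, value))  # defer, don't apply yet

    def commit(self, commit_lsn):
        # flush buffered changes up to this commit's LSN
        for lsn, value in self.pending_seq:
            if lsn <= commit_lsn:
                self.applied.append(value)
        self.pending_seq = [p for p in self.pending_seq if p[0] > commit_lsn]
        self.origin_lsn = commit_lsn  # tracking now covers those changes

d = Decoder()
d.sequence_change(5, 100)
d.sequence_change(8, 200)
d.commit(10)
# d.applied == [100, 200], d.origin_lsn == 10, nothing left pending
```

The obvious gap this model makes visible: if no commit ever follows (only rollbacks), the buffered changes are never flushed, which is exactly the "changes from rolled-back xacts may not be replicated" caveat above.
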


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 7/28/23 14:35, Ashutosh Bapat wrote:
>
> ...
>
> We hold a strong lock on sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed can not be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
> 
> I think per (top level) transaction hash table is cleaner design. It
> puts the hash table where it should be. But if that makes code
> difficult, current design works too.
> 

I was thinking about switching to the per-txn hash, so here's a patch
adopting that approach (in part 0006). I can't say it's much simpler,
but maybe it can be simplified a bit. Most of the complexity comes from
assignments maybe happening with a delay, so it's hard to say what's a
top-level xact.

The patch essentially does this:

1) the HTAB is moved to ReorderBufferTXN

2) after decoding SGMR_CREATE, we add an entry to the current TXN and
(for subtransactions) to the parent TXN (even the copy references the
subxact)

3) when processing an assignment, we copy the HTAB entries from the
subxact to the parent

4) after a subxact abort, we remove the HTAB entries from the parent

5) while searching for the relfilenode, we only scan the HTAB in the
top-level xacts (this is possible due to the copying)

This could work without the copy in parent HTAB, but then we'd have to
scan all the transactions for every increment. And there may be many
lookups and many (sub)transactions, but only a small number of new
relfilenodes. So it seems like a good tradeoff.
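The five steps above can be modeled with a toy structure (illustrative names only, not the actual ReorderBuffer code): each transaction tracks the relfilenodes it created, assignment copies the entries into the parent, and the transactional-ness check then only probes top-level xacts.

```python
# Conceptual model of the 0006 approach: per-transaction relfilenode
# tracking, with entries copied to the parent so lookups stay cheap.

class Txn:
    def __init__(self):
        self.relfilenodes = set()  # stands in for the HTAB in the TXN
        self.parent = None

def create_relfilenode(txn, node):
    txn.relfilenodes.add(node)
    if txn.parent is not None:     # known subxact: copy into parent too
        txn.parent.relfilenodes.add(node)

def assign(subxact, parent):
    subxact.parent = parent
    parent.relfilenodes |= subxact.relfilenodes  # copy on assignment

def abort_subxact(subxact):
    if subxact.parent is not None:  # drop the copied entries again
        subxact.parent.relfilenodes -= subxact.relfilenodes

def is_transactional(top_level_txns, node):
    # thanks to the copies, only top-level xacts need to be scanned
    return any(node in t.relfilenodes for t in top_level_txns)

top, sub = Txn(), Txn()
create_relfilenode(sub, 1234)  # relfilenode created in a subxact
assign(sub, top)               # the assignment arrives later
assert is_transactional([top], 1234)
abort_subxact(sub)
assert not is_transactional([top], 1234)
```
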

If we could convince ourselves the subxact has to be already assigned
while decoding the sequence change, then we could simply search only the
current transaction (and the parent). But I've been unable to convince
myself that's guaranteed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 7/29/23 14:38, Tomas Vondra wrote:
>
> ...
>
> The only idea how to improve that is we could keep the non-transactional
> changes (instead of applying them immediately), and then apply them on
> the nearest "commit". That'd mean it's subject to the position tracking,
> and the sequence would not go backwards, I think.
> 
> So every time we decode a commit, we'd check if we decoded any sequence
> changes since the last commit, and merge them (a bit like a subxact).
> 
> This would however also mean sequence changes from rolled-back xacts may
> not be replictated. I think that'd be fine, but IIRC Andres suggested
> it's a valid use case.
> 

I wasn't sure how difficult would this approach be, so I experimented
with this today, and it's waaaay more complicated than I thought. In
fact, I'm not even sure how to do that ...

The part 0008 is a WIP patch where ReorderBufferQueueSequence does not
apply the non-transactional changes immediately, and instead adds the
changes to a top-level list. And then ReorderBufferCommit adds a fake
subxact with all sequence changes up to the commit LSN.

The challenging part is snapshot management - when applying the changes
immediately, we can simply build and use the current snapshot. But with
0008 it's not that simple - we don't even know which transaction the
sequence change will get "injected" into. In fact, we don't even know if
the parent transaction will have a snapshot (if it only does nextval()
it may seem empty). I was thinking maybe we could "keep" the snapshots
for non-transactional changes, but I suspect it might confuse the main
transaction in some way.

I'm still not convinced this behavior would actually be desirable ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/28/23 14:44, Ashutosh Bapat wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >> The 0005 change does that, by:
> >>
> >> 1) adding pg_sequence_state, that returns both the sequence state and
> >>    the page LSN
> >>
> >> 2) copy_sequence returns the page LSN
> >>
> >> 3) tablesync then sets this LSN as origin_startpos (which for tables is
> >>    just the LSN of the replication slot)
> >>
> >> AFAICS this makes it work - we start decoding at the page LSN, so that
> >> we  skip the increments that could lead to the sequence going backwards.
> >>
> >
> > I like this design very much. It makes things simpler than complex.
> > Thanks for doing this.
> >
>
> I agree it seems simpler. It'd be good to try testing / reviewing it a
> bit more, so that it doesn't misbehave in some way.
>

Yeah, I also think this needs a review. This is a sort of new concept
where we don't use the LSN of the slot (for cases where copy returned
a larger value of LSN) or a full_snapshot created corresponding to the
sync slot by Walsender. For the case of the table, we build a full
snapshot because we use that for copying the table but why do we need
to build that for copying the sequence especially when we directly
copy it from the sequence relation without caring for any snapshot?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date

On 7/31/23 11:25, Amit Kapila wrote:
> On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/28/23 14:44, Ashutosh Bapat wrote:
>>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Anyway, I was thinking about this a bit more, and it seems it's not as
>>>> difficult to use the page LSN to ensure sequences don't go backwards.
>>>> The 0005 change does that, by:
>>>>
>>>> 1) adding pg_sequence_state, that returns both the sequence state and
>>>>    the page LSN
>>>>
>>>> 2) copy_sequence returns the page LSN
>>>>
>>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>>>>    just the LSN of the replication slot)
>>>>
>>>> AFAICS this makes it work - we start decoding at the page LSN, so that
>>>> we  skip the increments that could lead to the sequence going backwards.
>>>>
>>>
>>> I like this design very much. It makes things simpler than complex.
>>> Thanks for doing this.
>>>
>>
>> I agree it seems simpler. It'd be good to try testing / reviewing it a
>> bit more, so that it doesn't misbehave in some way.
>>
> 
> Yeah, I also think this needs a review. This is a sort of new concept
> where we don't use the LSN of the slot (for cases where copy returned
> a larger value of LSN) or a full_snapshot created corresponding to the
> sync slot by Walsender. For the case of the table, we build a full
> snapshot because we use that for copying the table but why do we need
> to build that for copying the sequence especially when we directly
> copy it from the sequence relation without caring for any snapshot?
> 

We need the slot to decode/apply changes during catchup. The main
subscription may get ahead, and we need to ensure the WAL is not
discarded or something like that. This applies even if the initial sync
step does not use the slot/snapshot directly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/31/23 11:25, Amit Kapila wrote:
> > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 7/28/23 14:44, Ashutosh Bapat wrote:
> >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> >>> <tomas.vondra@enterprisedb.com> wrote:
> >>>>
> >>>> Anyway, I was thinking about this a bit more, and it seems it's not as
> >>>> difficult to use the page LSN to ensure sequences don't go backwards.
> >>>> The 0005 change does that, by:
> >>>>
> >>>> 1) adding pg_sequence_state, that returns both the sequence state and
> >>>>    the page LSN
> >>>>
> >>>> 2) copy_sequence returns the page LSN
> >>>>
> >>>> 3) tablesync then sets this LSN as origin_startpos (which for tables is
> >>>>    just the LSN of the replication slot)
> >>>>
> >>>> AFAICS this makes it work - we start decoding at the page LSN, so that
> >>>> we  skip the increments that could lead to the sequence going backwards.
> >>>>
> >>>
> >>> I like this design very much. It makes things simpler than complex.
> >>> Thanks for doing this.
> >>>
> >>
> >> I agree it seems simpler. It'd be good to try testing / reviewing it a
> >> bit more, so that it doesn't misbehave in some way.
> >>
> >
> > Yeah, I also think this needs a review. This is a sort of new concept
> > where we don't use the LSN of the slot (for cases where copy returned
> > a larger value of LSN) or a full_snapshot created corresponding to the
> > sync slot by Walsender. For the case of the table, we build a full
> > snapshot because we use that for copying the table but why do we need
> > to build that for copying the sequence especially when we directly
> > copy it from the sequence relation without caring for any snapshot?
> >
>
> We need the slot to decode/apply changes during catchup. The main
> subscription may get ahead, and we need to ensure the WAL is not
> discarded or something like that. This applies even if the initial sync
> step does not use the slot/snapshot directly.
>

AFAIK, none of these needs a full_snapshot (see usage of
SnapBuild->building_full_snapshot). The full_snapshot tracks both
catalog and non-catalog xacts in the snapshot; we need to track the
non-catalog ones because we want to copy the table using that
snapshot. It is relatively expensive to build a full snapshot, and we
don't do that unless it is required. For the current usage of this
patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 8/1/23 04:59, Amit Kapila wrote:
> On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/31/23 11:25, Amit Kapila wrote:
>>> ...
>>>
>>> Yeah, I also think this needs a review. This is a sort of new concept
>>> where we don't use the LSN of the slot (for cases where copy returned
>>> a larger value of LSN) or a full_snapshot created corresponding to the
>>> sync slot by Walsender. For the case of the table, we build a full
>>> snapshot because we use that for copying the table but why do we need
>>> to build that for copying the sequence especially when we directly
>>> copy it from the sequence relation without caring for any snapshot?
>>>
>>
>> We need the slot to decode/apply changes during catchup. The main
>> subscription may get ahead, and we need to ensure the WAL is not
>> discarded or something like that. This applies even if the initial sync
>> step does not use the slot/snapshot directly.
>>
> 
> AFAIK, none of these needs a full_snapshot (see usage of
> SnapBuild->building_full_snapshot). The full_snapshot tracks both
> catalog and non-catalog xacts in the snapshot where we require to
> track non-catalog ones because we want to copy the table using that
> snapshot. It is relatively expensive to build a full snapshot and we
> don't do that unless it is required. For the current usage of this
> patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.
> 

Yeah, you may be right we don't need a full snapshot, because we don't
need to export it. We however still need a snapshot, and it wasn't clear
to me whether you suggest we don't need the slot / snapshot at all.

Anyway, I think this is "just" a matter of efficiency, not correctness.
IMHO there are bigger questions regarding the "going back" behavior
after apply restart.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Anyway, I think this is "just" a matter of efficiency, not correctness.
> IMHO there are bigger questions regarding the "going back" behavior
> after apply restart.


sequence_decode() has the following code:

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

This means that if the subscription restarts, the upstream will *not*
send any non-transactional sequence changes with LSN prior to the LSN
specified by START_REPLICATION command. That should avoid replicating
all the non-transactional sequence changes since
ReplicationSlot::restart_lsn if the subscription restarts.

But in apply_handle_sequence(), we do not update the
replorigin_session_origin_lsn with LSN of the non-transactional
sequence change when it's applied. This means that if a subscription
restarts while it is half way through applying a transaction, those
changes will be replicated again. This will move the sequence
backward. If the subscription keeps restarting again and again while
applying that transaction, we will see the sequence "rubber banding"
[1] on subscription. So until the transaction is completely applied,
the other users of the sequence may see duplicate values during this
time. I think this is undesirable.

But I am not able to find a case where this can lead to conflicting
values after failover. If there's only one transaction which is
repeatedly being applied, the rows which use sequence values were
never committed so there's no conflicting value present on the
subscription. The same reasoning can be extended to multiple in-flight
transactions. If another transaction (T2) uses the sequence values
changed by in-flight transaction T1 and if T2 commits before T1, the
sequence changes used by T2 must have LSNs before commit of T2 and
thus they will never be replicated. (See example below).

T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q1
T2
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q2
COMMIT;
T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q3
COMMIT;

So I am not able to imagine a case when a sequence going backward can
cause conflicting values.

But whether or not that's the case, downstream should not request (and
hence receive) any changes that have been already applied (and
committed) downstream as a principle. I think a way to achieve this is
to update the replorigin_session_origin_lsn so that a sequence change
applied once is not requested (and hence sent) again.

[1] https://en.wikipedia.org/wiki/Rubber_banding

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 8/11/23 08:32, Ashutosh Bapat wrote:
> On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Anyway, I think this is "just" a matter of efficiency, not correctness.
>> IMHO there are bigger questions regarding the "going back" behavior
>> after apply restart.
> 
> 
> sequence_decode() has the following code
> /* Skip the change if already processed (per the snapshot). */
> if (transactional &&
> !SnapBuildProcessChange(builder, xid, buf->origptr))
> return;
> else if (!transactional &&
> (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
> SnapBuildXactNeedsSkip(builder, buf->origptr)))
> return;
> 
> This means that if the subscription restarts, the upstream will *not*
> send any non-transactional sequence changes with LSN prior to the LSN
> specified by START_REPLICATION command. That should avoid replicating
> all the non-transactional sequence changes since
> ReplicationSlot::restart_lsn if the subscription restarts.
> 

Ah, right, I got confused and mixed up restart_lsn and the LSN passed in
the START_REPLICATION command. Thanks for the details, I think this
works fine.

> But in apply_handle_sequence(), we do not update the
> replorigin_session_origin_lsn with LSN of the non-transactional
> sequence change when it's applied. This means that if a subscription
> restarts while it is half way through applying a transaction, those
> changes will be replicated again. This will move the sequence
> backward. If the subscription keeps restarting again and again while
> applying that transaction, we will see the sequence "rubber banding"
> [1] on subscription. So until the transaction is completely applied,
> the other users of the sequence may see duplicate values during this
> time. I think this is undesirable.
> 

Well, but as I said earlier, this is not expected to support using the
sequence on the subscriber until after the failover, so there's no real
risk of "duplicate values". Yes, you might select the data from the
sequence directly, but that would have all sorts of issues even without
replication - users are required to use nextval/currval and so on.

> But I am not able to find a case where this can lead to conflicting
> values after failover. If there's only one transaction which is
> repeatedly being applied, the rows which use sequence values were
> never committed so there's no conflicting value present on the
> subscription. The same reasoning can be extended to multiple in-flight
> transactions. If another transaction (T2) uses the sequence values
> changed by in-flight transaction T1 and if T2 commits before T1, the
> sequence changes used by T2 must have LSNs before commit of T2 and
> thus they will never be replicated. (See example below).
> 
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q1
> T2
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q2
> COMMIT;
> T1
> insert into t1 select nextval('seq'), ... from generate_series(1, 100); -- Q3
> COMMIT;
> 
> So I am not able to imagine a case when a sequence going backward can
> cause conflicting values.

Right, I agree this "rubber banding" can happen. But as long as we don't
go back too far (before the last applied commit) I think that'd be fine. We
only need to make guarantees about committed transactions, and I don't
think we need to worry about this too much ...

> 
> But whether or not that's the case, downstream should not request (and
> hence receive) any changes that have been already applied (and
> committed) downstream as a principle. I think a way to achieve this is
> to update the replorigin_session_origin_lsn so that a sequence change
> applied once is not requested (and hence sent) again.
> 

I guess we could update the origin, per attached 0004. We don't have
timestamp to set replorigin_session_origin_timestamp, but it seems we
don't need that.

The attached patch merges the earlier improvements, except for the part
that experimented with adding a "fake" transaction (which turned out to
have a number of difficult issues).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> >
> > But whether or not that's the case, downstream should not request (and
> > hence receive) any changes that have been already applied (and
> > committed) downstream as a principle. I think a way to achieve this is
> > to update the replorigin_session_origin_lsn so that a sequence change
> > applied once is not requested (and hence sent) again.
> >
>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we
> don't need that.
>
> The attached patch merges the earlier improvements, except for the part
> that experimented with adding a "fake" transaction (which turned out to
> have a number of difficult issues).

0004 looks good to me. But I need to review the impact of not setting
replorigin_session_origin_timestamp.

What fake transaction experiment are you talking about?

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > >
> > > But whether or not that's the case, downstream should not request (and
> > > hence receive) any changes that have been already applied (and
> > > committed) downstream as a principle. I think a way to achieve this is
> > > to update the replorigin_session_origin_lsn so that a sequence change
> > > applied once is not requested (and hence sent) again.
> > >
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me.


+ {
  CommitTransactionCommand();
+
+ /*
+ * Update origin state so we don't try applying this sequence
+ * change in case of crash.
+ *
+ * XXX We don't have replorigin_session_origin_timestamp, but we
+ * can just leave that set to 0.
+ */
+ replorigin_session_origin_lsn = seq.lsn;

IIUC, your proposal is to update the replorigin_session_origin_lsn, so
that after restart, it doesn't use some prior origin LSN to start with
which can in turn lead the sequence to go backward. If so, it should
be updated before calling CommitTransactionCommand() as we are doing
in apply_handle_commit_internal(). If that is not the intention then
it is not clear to me how updating replorigin_session_origin_lsn after
commit is helpful.

>
> But I need to review the impact of not setting
> replorigin_session_origin_timestamp.
>

This may not have a direct impact on built-in replication, as I think
we don't rely on it yet, but we need to think of out-of-core solutions.
I am not sure if I understood your proposal as per my previous comment
but once you clarify the same, I'll also try to think on the same.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > >
> > > > But whether or not that's the case, downstream should not request (and
> > > > hence receive) any changes that have been already applied (and
> > > > committed) downstream as a principle. I think a way to achieve this is
> > > > to update the replorigin_session_origin_lsn so that a sequence change
> > > > applied once is not requested (and hence sent) again.
> > > >
> > >
> > > I guess we could update the origin, per attached 0004. We don't have
> > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > don't need that.
> > >
> > > The attached patch merges the earlier improvements, except for the part
> > > that experimented with adding a "fake" transaction (which turned out to
> > > have a number of difficult issues).
> >
> > 0004 looks good to me.
>
>
> + {
>   CommitTransactionCommand();
> +
> + /*
> + * Update origin state so we don't try applying this sequence
> + * change in case of crash.
> + *
> + * XXX We don't have replorigin_session_origin_timestamp, but we
> + * can just leave that set to 0.
> + */
> + replorigin_session_origin_lsn = seq.lsn;
>
> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
> that after restart, it doesn't use some prior origin LSN to start with
> which can in turn lead the sequence to go backward. If so, it should
> be updated before calling CommitTransactionCommand() as we are doing
> in apply_handle_commit_internal(). If that is not the intention then
> it is not clear to me how updating replorigin_session_origin_lsn after
> commit is helpful.
>

typedef struct ReplicationState
{
...
    /*
     * Location of the latest commit from the remote side.
     */
    XLogRecPtr    remote_lsn;

This is the variable that will be updated with the value of
replorigin_session_origin_lsn. This means we will now track some
arbitrary LSN location of the remote side in this variable. The above
comment makes me wonder if there is anything we are missing or if it
is just a matter of updating this comment because before the patch we
always adhere to what is written in the comment.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> >
> > The attached patch merges the earlier improvements, except for the part
> > that experimented with adding a "fake" transaction (which turned out to
> > have a number of difficult issues).
>
> 0004 looks good to me. But I need to review the impact of not setting
> replorigin_session_origin_timestamp.

I think it will be good to set replorigin_session_origin_timestamp = 0
explicitly so as not to pick up a garbage value. The timestamp is
written to the commit record. Beyond that I don't see any use of it.
It is further passed downstream if there is cascaded logical
replication setup. But I don't see it being used. So it should be fine
to leave it 0. I don't think we can use logically replicated sequences
in a multi-master environment where the timestamp may be used to
resolve conflicts. Such a setup would require distributed sequence
management, which cannot be achieved by logical replication alone.

In short, I didn't find any hazard in leaving the
replorigin_session_origin_timestamp as 0.

--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > > >
> > > > >
> > > > > But whether or not that's the case, downstream should not request (and
> > > > > hence receive) any changes that have been already applied (and
> > > > > committed) downstream as a principle. I think a way to achieve this is
> > > > > to update the replorigin_session_origin_lsn so that a sequence change
> > > > > applied once is not requested (and hence sent) again.
> > > > >
> > > >
> > > > I guess we could update the origin, per attached 0004. We don't have
> > > > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > > > don't need that.
> > > >
> > > > The attached patch merges the earlier improvements, except for the part
> > > > that experimented with adding a "fake" transaction (which turned out to
> > > > have a number of difficult issues).
> > >
> > > 0004 looks good to me.
> >
> >
> > + {
> >   CommitTransactionCommand();
> > +
> > + /*
> > + * Update origin state so we don't try applying this sequence
> > + * change in case of crash.
> > + *
> > + * XXX We don't have replorigin_session_origin_timestamp, but we
> > + * can just leave that set to 0.
> > + */
> > + replorigin_session_origin_lsn = seq.lsn;
> >
> > IIUC, your proposal is to update the replorigin_session_origin_lsn, so
> > that after restart, it doesn't use some prior origin LSN to start with
> > which can in turn lead the sequence to go backward. If so, it should
> > be updated before calling CommitTransactionCommand() as we are doing
> > in apply_handle_commit_internal(). If that is not the intention then
> > it is not clear to me how updating replorigin_session_origin_lsn after
> > commit is helpful.
> >
>
> typedef struct ReplicationState
> {
> ...
>     /*
>      * Location of the latest commit from the remote side.
>      */
>     XLogRecPtr    remote_lsn;
>
> This is the variable that will be updated with the value of
> replorigin_session_origin_lsn. This means we will now track some
> arbitrary LSN location of the remote side in this variable. The above
> comment makes me wonder if there is anything we are missing or if it
> is just a matter of updating this comment because before the patch we
> always adhere to what is written in the comment.

I don't think we are missing anything. This value is used to track the
remote LSN up to which all the commits from upstream have been applied
locally. Since a non-transactional sequence change is like a single
WAL record transaction, its LSN acts as the LSN of the mini-commit.
So it should be fine to update remote_lsn with the sequence WAL record's
end LSN. That's what the patches do. I don't see any hazard. But you
are right, we need to update the comments, here and also at other places
like replorigin_session_advance(), which uses remote_commit as the name
of the argument that gets assigned to ReplicationState::remote_lsn.

--
Best Wishes,
Ashutosh Bapat



RE: logical decoding and replication of sequences, take 2

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, August 16, 2023 10:27 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

> 
> 
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we don't
> need that.
> 
> The attached patch merges the earlier improvements, except for the part that
> experimented with adding a "fake" transaction (which turned out to have a
> number of difficult issues).

I tried to test the patch and found a crash when calling
pg_logical_slot_get_changes() to consume sequence changes.

Steps:
----
create table t1_seq(a int);
create sequence seq1;
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1');
----

The backtrace is attached as bt.txt.

Best Regards,
Hou zj

Attachments

Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

I was reading through 0001, I noticed this comment in
ReorderBufferSequenceIsTransactional() function

+ * To decide if a sequence change should be handled as transactional or applied
+ * immediately, we track (sequence) relfilenodes created by each transaction.
+ * We don't know if the current sub-transaction was already assigned to the
+ * top-level transaction, so we need to check all transactions.

It says "We don't know if the current sub-transaction was already
assigned to the top-level transaction, so we need to check all
transactions". But IIRC as part of the streaming of in-progress
transactions we have ensured that whenever we are logging the first
change by any subtransaction we include the top transaction ID in it.

Refer this code

LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *record)
{
...
/*
* If the top-level xid is valid, we need to assign the subxact to the
* top-level xact. We need to do this for all records, hence we do it
* before the switch.
*/
if (TransactionIdIsValid(txid))
{
ReorderBufferAssignChild(ctx->reorder,
txid,
XLogRecGetXid(record),
buf.origptr);
}
}

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
>
> I was reading through 0001, I noticed this comment in
> ReorderBufferSequenceIsTransactional() function
>
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
>
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
>
> Refer this code
>
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *record)
> {
> ...
> /*
> * If the top-level xid is valid, we need to assign the subxact to the
> * top-level xact. We need to do this for all records, hence we do it
> * before the switch.
> */
> if (TransactionIdIsValid(txid))
> {
> ReorderBufferAssignChild(ctx->reorder,
> txid,
> XLogRecGetXid(record),
> buf.origptr);
> }
> }

Some more comments

1.
ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
are mostly duplicates: the first one just confirms whether the
relfilelocator was created in the transaction, while the other also
returns the XID. I think these two could easily be merged so that we
can avoid the duplicated code.

2.
/*
+ * ReorderBufferTransferSequencesToParent
+ * Copy the relfilenode entries to the parent after assignment.
+ */
+static void
+ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
+    ReorderBufferTXN *txn,
+    ReorderBufferTXN *subtxn)

If we agree with my comment in the previous email (i.e. the first WAL
record by a subxid will always include the topxid), then we do not need
this function at all; we can always add the relfilelocator directly to
the top transaction and never need to transfer it.

That is all I have for now from the first pass of 0001; later I will do
a more detailed review and will look into the other patches as well.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: logical decoding and replication of sequences, take 2

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 15, 2023 11:11 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> On Wednesday, August 16, 2023 10:27 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
> Hi,
> 
> >
> >
> > I guess we could update the origin, per attached 0004. We don't have
> > timestamp to set replorigin_session_origin_timestamp, but it seems we
> > don't need that.
> >
> > The attached patch merges the earlier improvements, except for the
> > part that experimented with adding a "fake" transaction (which turned
> > out to have a number of difficult issues).
> 
> I tried to test the patch and found a crash when calling
> pg_logical_slot_get_changes() to consume sequence changes.

Oh, after confirming again, I realize it's my fault that my build environment
was not clean. This case passed after rebuilding. Sorry for the noise.

Best Regards,
Hou zj

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 9/22/23 13:24, Dilip Kumar wrote:
> On Wed, Sep 20, 2023 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>
>> I was reading through 0001, I noticed this comment in
>> ReorderBufferSequenceIsTransactional() function
>>
>> + * To decide if a sequence change should be handled as transactional or applied
>> + * immediately, we track (sequence) relfilenodes created by each transaction.
>> + * We don't know if the current sub-transaction was already assigned to the
>> + * top-level transaction, so we need to check all transactions.
>>
>> It says "We don't know if the current sub-transaction was already
>> assigned to the top-level transaction, so we need to check all
>> transactions". But IIRC as part of the streaming of in-progress
>> transactions we have ensured that whenever we are logging the first
>> change by any subtransaction we include the top transaction ID in it.
>>
>> Refer this code
>>
>> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
>> XLogReaderState *record)
>> {
>> ...
>> /*
>> * If the top-level xid is valid, we need to assign the subxact to the
>> * top-level xact. We need to do this for all records, hence we do it
>> * before the switch.
>> */
>> if (TransactionIdIsValid(txid))
>> {
>> ReorderBufferAssignChild(ctx->reorder,
>> txid,
>> XLogRecGetXid(record),
>> buf.origptr);
>> }
>> }
> 
> Some more comments
> 
> 1.
> ReorderBufferSequenceIsTransactional and ReorderBufferSequenceGetXid
> are duplicated except the first one is just confirming whether
> relfilelocator was created in the transaction or not and the other is
> returning the XID as well so I think these two could be easily merged
> so that we can avoid duplicate codes.
> 

Right. The attached patch modifies the IsTransactional function to also
return the XID, and removes the GetXid one. It feels a bit weird because
now the IsTransactional function is called even in places where we know
the change is transactional, but it's true the two separate functions
duplicated a bit of code.

> 2.
> /*
> + * ReorderBufferTransferSequencesToParent
> + * Copy the relfilenode entries to the parent after assignment.
> + */
> +static void
> +ReorderBufferTransferSequencesToParent(ReorderBuffer *rb,
> +    ReorderBufferTXN *txn,
> +    ReorderBufferTXN *subtxn)
> 
> If we agree with my comment in the previous email (i.e. the first WAL
> by a subxid will always include topxid) then we do not need this
> function at all and always add relfilelocator directly to the top
> transaction and we never need to transfer.
> 

Good point! I don't recall why I thought this was necessary. I suspect
it was before I added the GetCurrentTransactionId() calls to ensure the
subxact has a XID. I replaced the ReorderBufferTransferSequencesToParent
call with an assert that the relfilenode hash table is empty, and I've
been unable to trigger any failures.

> That is all I have for now while first pass of 0001, later I will do a
> more detailed review and will look into other patches also.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 9/20/23 11:53, Dilip Kumar wrote:
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
> 
> I was reading through 0001, I noticed this comment in
> ReorderBufferSequenceIsTransactional() function
> 
> + * To decide if a sequence change should be handled as transactional or applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
> 
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the streaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
> 

Yeah, that's a stale comment - the actual code only searches through the
top-level ones (and thus relies on the immediate assignment). As I
wrote in the earlier response, I suspect this code originates from
before I added the GetCurrentTransactionId() calls.

That being said, I do wonder why with the immediate assignments we still
need the bit in ReorderBufferAssignChild that says:

    /*
     * We already saw this transaction, but initially added it to the
     * list of top-level txns.  Now that we know it's not top-level,
     * remove it from there.
     */
    dlist_delete(&subtxn->node);

I don't think that affects this patch, but it's a bit confusing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 9/13/23 15:18, Ashutosh Bapat wrote:
> On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>>> <ashutosh.bapat.oss@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>>>
>>>>>> But whether or not that's the case, downstream should not request (and
>>>>>> hence receive) any changes that have been already applied (and
>>>>>> committed) downstream as a principle. I think a way to achieve this is
>>>>>> to update the replorigin_session_origin_lsn so that a sequence change
>>>>>> applied once is not requested (and hence sent) again.
>>>>>>
>>>>>
>>>>> I guess we could update the origin, per attached 0004. We don't have
>>>>> timestamp to set replorigin_session_origin_timestamp, but it seems we
>>>>> don't need that.
>>>>>
>>>>> The attached patch merges the earlier improvements, except for the part
>>>>> that experimented with adding a "fake" transaction (which turned out to
>>>>> have a number of difficult issues).
>>>>
>>>> 0004 looks good to me.
>>>
>>>
>>> + {
>>>   CommitTransactionCommand();
>>> +
>>> + /*
>>> + * Update origin state so we don't try applying this sequence
>>> + * change in case of crash.
>>> + *
>>> + * XXX We don't have replorigin_session_origin_timestamp, but we
>>> + * can just leave that set to 0.
>>> + */
>>> + replorigin_session_origin_lsn = seq.lsn;
>>>
>>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
>>> that after restart, it doesn't use some prior origin LSN to start with
>>> which can in turn lead the sequence to go backward. If so, it should
>>> be updated before calling CommitTransactionCommand() as we are doing
>>> in apply_handle_commit_internal(). If that is not the intention then
>>> it is not clear to me how updating replorigin_session_origin_lsn after
>>> commit is helpful.
>>>
>>
>> typedef struct ReplicationState
>> {
>> ...
>>     /*
>>      * Location of the latest commit from the remote side.
>>      */
>>     XLogRecPtr    remote_lsn;
>>
>> This is the variable that will be updated with the value of
>> replorigin_session_origin_lsn. This means we will now track some
>> arbitrary LSN location of the remote side in this variable. The above
>> comment makes me wonder if there is anything we are missing or if it
>> is just a matter of updating this comment because before the patch we
>> always adhere to what is written in the comment.
> 
> I don't think we are missing anything. This value is used to track the
> remote LSN up to which all the commits from upstream have been applied
> locally. Since a non-transactional sequence change is like a single
> WAL record transaction, its LSN acts as the LSN of the mini-commit.
> So it should be fine to update remote_lsn with sequence WAL record's
> end LSN. That's what the patches do. I don't see any hazard. But you
> are right, we need to update comments. Here and also at other places
> like
> replorigin_session_advance() which uses remote_commit as name of the
> argument which gets assigned to ReplicationState::remote_lsn.
> 

I agree - updating the replorigin_session_origin_lsn shouldn't break
anything. As you write, it's essentially a "mini-commit" and the commit
order remains the same.

I'm not sure about resetting replorigin_session_origin_timestamp to 0
though. It's not something we rely on very much (it may not correlate
with the commit order, etc.). But why should we set it to 0? We don't do
that for regular commits, right? And IMO it makes sense to just use the
timestamp of the last commit before the sequence change.

FWIW I've left this in a separate commit, but I'll merge that into 0002
in the next patch version.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
On 7/25/23 12:20, Amit Kapila wrote:
> ...
>
> I have used the debugger to reproduce this as it needs quite some
> coordination. I just wanted to see if the sequence can go backward and
> didn't catch up completely before the sequence state is marked
> 'ready'. On the publisher side, I created a publication with a table
> and a sequence. Then did the following steps:
> SELECT nextval('s') FROM generate_series(1,50);
> insert into t1 values(1);
> SELECT nextval('s') FROM generate_series(51,150);
> 
> Then on the subscriber side with some debugging aid, I could find the
> values in the sequence shown in the previous email. Sorry, I haven't
> recorded each and every step but, if you think it helps, I can again
> try to reproduce it and share the steps.
> 

Amit, can you try to reproduce this backwards movement with the latest
version of the patch? I have tried triggering that (mis)behavior, but I
haven't been successful so far. I'm hesitant to declare it resolved, as
it's dependent on timing etc. and you mentioned it required quite some
coordination.


Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Thu, Oct 12, 2023 at 9:03 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/25/23 12:20, Amit Kapila wrote:
> > ...
> >
> > I have used the debugger to reproduce this as it needs quite some
> > coordination. I just wanted to see if the sequence can go backward and
> > didn't catch up completely before the sequence state is marked
> > 'ready'. On the publisher side, I created a publication with a table
> > and a sequence. Then did the following steps:
> > SELECT nextval('s') FROM generate_series(1,50);
> > insert into t1 values(1);
> > SELECT nextval('s') FROM generate_series(51,150);
> >
> > Then on the subscriber side with some debugging aid, I could find the
> > values in the sequence shown in the previous email. Sorry, I haven't
> > recorded each and every step but, if you think it helps, I can again
> > try to reproduce it and share the steps.
> >
>
> Amit, can you try to reproduce this backwards movement with the latest
> version of the patch?
>

I lost touch with this patch but IIRC the quoted problem per se
shouldn't occur after the idea to use page LSN instead of slot's LSN
for synchronization between sync and apply worker.

--
With Regards,
Amit Kapila.



RE: logical decoding and replication of sequences, take 2

From:
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, October 12, 2023 11:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>

Hi,

I have been reviewing the patch set, and here are some initial comments.

1.

I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional
sequence change in ReorderBufferQueueChange().

2.

ReorderBufferSequenceIsTransactional

It seems we call the above function once in sequence_decode() and call it again
in ReorderBufferQueueSequence(), would it better to avoid the second call as
the hashtable search looks not cheap.

3.

The patch cleans up the sequence hash table when a transaction COMMITs or ABORTs
(via ReorderBufferAbort() and ReorderBufferReturnTXN()), while it doesn't seem
to destroy the hash table when the transaction is PREPAREd. It's not a big
problem, but would it be better to release the memory earlier by destroying the
table on prepare?

4.

+pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
...
+    /* output BEGIN if we haven't yet, but only for the transactional case */
+    if (transactional)
+    {
+        if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+        {
+            pg_output_begin(ctx, data, txn, false);
+        }
+        txndata->xact_wrote_changes = true;
+    }

I think we should call pg_output_stream_start() instead of pg_output_begin()
for streaming sequence changes.

5.
+    /*
+     * Schema should be sent using the original relation because it
+     * also sends the ancestor's relation.
+     */
+    maybe_send_schema(ctx, txn, relation, relentry);

The comment seems a bit misleading here, I think it was used for the partition
logic in pgoutput_change().

Best Regards,
Hou zj

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi!

On 10/24/23 13:31, Zhijie Hou (Fujitsu) wrote:
> On Thursday, October 12, 2023 11:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
> 
> Hi,
> 
> I have been reviewing the patch set, and here are some initial comments.
> 
> 1.
> 
> I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional
> sequence change in ReorderBufferQueueChange().
> 

True. It's unlikely for a transaction to contain only sequence increments
and still be large enough to get streamed, and any other changes would
cause the flag to be set anyway. But it's certainly more correct to set
the flag even for sequence changes.

The updated patch modifies ReorderBufferQueueChange to do this.

> 2.
> 
> ReorderBufferSequenceIsTransactional
> 
> It seems we call the above function once in sequence_decode() and call it again
> in ReorderBufferQueueSequence(), would it better to avoid the second call as
> the hashtable search looks not cheap.
> 

In principle yes, but I don't think it's worth it - I doubt the overhead
is going to be measurable.

Based on earlier reviews I tried to reduce the code duplication (there
used to be two separate functions doing the lookup), and I did consider
doing just one call in sequence_decode() and passing the XID to
ReorderBufferQueueSequence() - determining the XID is the only purpose
of the call there. But it didn't seem nice/worth it.

> 3.
> 
> The patch cleans up the sequence hash table when a transaction COMMITs or ABORTs
> (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while it doesn't seem
> to destroy the hash table when the transaction is PREPAREd. It's not a big
> problem, but would it be better to release the memory earlier by destroying the
> table on prepare?
> 

I think you're right. I added the sequence cleanup to a couple places,
right before cleanup of the transaction. I wonder if we should simply
call ReorderBufferSequenceCleanup() from ReorderBufferCleanupTXN().

> 4.
> 
> +pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> ...
> +    /* output BEGIN if we haven't yet, but only for the transactional case */
> +    if (transactional)
> +    {
> +        if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
> +        {
> +            pg_output_begin(ctx, data, txn, false);
> +        }
> +        txndata->xact_wrote_changes = true;
> +    }
> 
> I think we should call pg_output_stream_start() instead of pg_output_begin()
> for streaming sequence changes.
> 

Good catch! Fixed.

> 5.
> +    /*
> +     * Schema should be sent using the original relation because it
> +     * also sends the ancestor's relation.
> +     */
> +    maybe_send_schema(ctx, txn, relation, relentry);
> 
> The comment seems a bit misleading here, I think it was used for the partition
> logic in pgoutput_change().

True. I've removed the comment.


Attached is an updated patch, with all those tweaks/fixes.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:
Hi,

I've been cleaning up the first two patches to get them committed soon
(adding the decoding infrastructure + test_decoding), cleaning up stale
comments, updating commit messages, etc. I think it's ready to go,
but it's too late here, so I plan to go over it once more tomorrow and
then likely push. But if someone wants to take a look, I'd welcome that.

The one issue I found during this cleanup is that the patch was missing
the changes introduced by 29d0a77fa660 for decoding of other stuff.

  commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
  Author: Amit Kapila <akapila@postgresql.org>
  Date:   Thu Oct 26 06:54:16 2023 +0530

      Migrate logical slots to the new node during an upgrade.
      ...

I fixed that, but perhaps someone might want to double check ...


0003 is here just for completeness - that's the part adding sequences to
built-in replication. I haven't done much with it, it needs some cleanup
too to get it committable. I don't intend to push that right after
0001+0002, though.


While going over 0001, I realized there might be an optimization for
ReorderBufferSequenceIsTransactional. As coded in 0001, it always
searches through all top-level transactions, and if there's many of them
that might be expensive, even if very few of them have any relfilenodes
in the hash table. It's still linear search, and it needs to happen for
each sequence change.

But can the relfilenode even be in some other top-level transaction? How
could it be - our transaction would not see it, and wouldn't be able to
generate the sequence change. So we should be able to simply check *our*
transaction (or if it's a subxact, the top-level transaction). Either
it's there (and it's transactional change), or not (and then it's
non-transactional change). The 0004 does this.

This of course hinges on when exactly the transactions get created, and
assignments processed. For example if this would fire before the txn
gets assigned to the top-level one, this would break. I don't think this
can happen thanks to the immediate logging of assignments, but I'm too
tired to think about it now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I've been cleaning up the first two patches to get them committed soon
> (adding the decoding infrastructure + test_decoding), cleaning up stale
> comments, updating commit messages etc. And I think it's ready to go,
> but it's too late over, so I plan going over once more tomorrow and then
> likely push. But if someone wants to take a look, I'd welcome that.
>
> The one issue I found during this cleanup is that the patch was missing
> the changes introduced by 29d0a77fa660 for decoding of other stuff.
>
>   commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
>   Author: Amit Kapila <akapila@postgresql.org>
>   Date:   Thu Oct 26 06:54:16 2023 +0530
>
>       Migrate logical slots to the new node during an upgrade.
>       ...
>
> I fixed that, but perhaps someone might want to double check ...
>
>
> 0003 is here just for completeness - that's the part adding sequences to
> built-in replication. I haven't done much with it, it needs some cleanup
> too to get it committable. I don't intend to push that right after
> 0001+0002, though.
>
>
> While going over 0001, I realized there might be an optimization for
> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> searches through all top-level transactions, and if there's many of them
> that might be expensive, even if very few of them have any relfilenodes
> in the hash table. It's still linear search, and it needs to happen for
> each sequence change.
>
> But can the relfilenode even be in some other top-level transaction? How
> could it be - our transaction would not see it, and wouldn't be able to
> generate the sequence change. So we should be able to simply check *our*
> transaction (or if it's a subxact, the top-level transaction). Either
> it's there (and it's transactional change), or not (and then it's
> non-transactional change).
>

I also think the relfilenode should be part of either the current
top-level xact or one of its subxact, so looking at all the top-level
transactions for each change doesn't seem advisable.

> The 0004 does this.
>
> This of course hinges on when exactly the transactions get created, and
> assignments processed. For example if this would fire before the txn
> gets assigned to the top-level one, this would break. I don't think this
> can happen thanks to the immediate logging of assignments, but I'm too
> tired to think about it now.
>

This needs some thought because I think we can't guarantee the
association till we reach the point where we can actually decode the
xact. See comments in AssertTXNLsnOrder() [1].

I noticed few minor comments while reading the patch:
1.
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.

Instead of '.. logical message', shouldn't we say sequence change message?

2.
+ /*
+ * If we found an entry with matchine relfilenode,

typo (matchine)

3.
+      Note that this may not the value obtained by the process updating the
+      process, but the future sequence value written to WAL (typically about
+      32 values ahead).

/may not the value/may not be the value

[1] -
/*
* Skip the verification if we don't reach the LSN at which we start
* decoding the contents of transactions yet because until we reach the
* LSN, we could have transactions that don't have the association between
* the top-level transaction and subtransaction yet and consequently have
* the same LSN.  We don't guarantee this association until we try to
* decode the actual contents of transaction.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > While going over 0001, I realized there might be an optimization for
> > ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> > searches through all top-level transactions, and if there's many of them
> > that might be expensive, even if very few of them have any relfilenodes
> > in the hash table. It's still linear search, and it needs to happen for
> > each sequence change.
> >
> > But can the relfilenode even be in some other top-level transaction? How
> > could it be - our transaction would not see it, and wouldn't be able to
> > generate the sequence change. So we should be able to simply check *our*
> > transaction (or if it's a subxact, the top-level transaction). Either
> > it's there (and it's transactional change), or not (and then it's
> > non-transactional change).
> >
>
> I also think the relfilenode should be part of either the current
> top-level xact or one of its subxact, so looking at all the top-level
> transactions for each change doesn't seem advisable.
>
> > The 0004 does this.
> >
> > This of course hinges on when exactly the transactions get created, and
> > assignments processed. For example if this would fire before the txn
> > gets assigned to the top-level one, this would break. I don't think this
> > can happen thanks to the immediate logging of assignments, but I'm too
> > tired to think about it now.
> >
>
> This needs some thought because I think we can't guarantee the
> association till we reach the point where we can actually decode the
> xact. See comments in AssertTXNLsnOrder() [1].
>

I am wondering that instead of building the infrastructure to know
whether a particular change is transactional on the decoding side,
can't we have some flag in the WAL record to note whether the change
is transactional or not? I have discussed this point with my colleague
Kuroda-San and we thought that it may be worth exploring whether we
can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
determine if the sequence is created/changed in the current
subtransaction and then record that in the WAL record. This means we
need additional information in WAL records like XLOG_SEQ_LOG, but we can
probably do it only when wal_level is logical.

One minor point:
It'd also
+ * trigger assert in DecodeSequence.

I don't see DecodeSequence() in the patch. Which exact assert/function
are you referring to here?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 11/27/23 11:13, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> While going over 0001, I realized there might be an optimization for
>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>> searches through all top-level transactions, and if there's many of them
>>> that might be expensive, even if very few of them have any relfilenodes
>>> in the hash table. It's still linear search, and it needs to happen for
>>> each sequence change.
>>>
>>> But can the relfilenode even be in some other top-level transaction? How
>>> could it be - our transaction would not see it, and wouldn't be able to
>>> generate the sequence change. So we should be able to simply check *our*
>>> transaction (or if it's a subxact, the top-level transaction). Either
>>> it's there (and it's transactional change), or not (and then it's
>>> non-transactional change).
>>>
>>
>> I also think the relfilenode should be part of either the current
>> top-level xact or one of its subxact, so looking at all the top-level
>> transactions for each change doesn't seem advisable.
>>
>>> The 0004 does this.
>>>
>>> This of course hinges on when exactly the transactions get created, and
>>> assignments processed. For example if this would fire before the txn
>>> gets assigned to the top-level one, this would break. I don't think this
>>> can happen thanks to the immediate logging of assignments, but I'm too
>>> tired to think about it now.
>>>
>>
>> This needs some thought because I think we can't guarantee the
>> association till we reach the point where we can actually decode the
>> xact. See comments in AssertTXNLsnOrder() [1].
>>

I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
which says:

  /*
   * Skip the verification if we don't reach the LSN at which we start
   * decoding the contents of transactions yet because until we reach
   * the LSN, we could have transactions that don't have the association
   * between the top-level transaction and subtransaction yet and
   * consequently have the same LSN.  We don't guarantee this
   * association until we try to decode the actual contents of
   * transaction. The ordering of the records prior to the
   * start_decoding_at LSN should have been checked before the restart.
   */

But doesn't this say that after we actually start decoding / stop
skipping, we should have seen the assignment? We're already decoding
transaction contents (because sequence change *is* part of xact, even if
we decide to replay it in the non-transactional way).

> 
> I am wondering that instead of building the infrastructure to know
> whether a particular change is transactional on the decoding side,
> can't we have some flag in the WAL record to note whether the change
> is transactional or not? I have discussed this point with my colleague
> Kuroda-San and we thought that it may be worth exploring whether we
> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
> determine if the sequence is created/changed in the current
> subtransaction and then record that in WAL record. By this, we need to
> have additional information in the WAL record like XLOG_SEQ_LOG but we
> can probably do it only with wal_level as logical.
> 

I may not understand the proposal exactly, but it's not enough to know
if it was created in the same subxact. It might have been created in
some earlier subxact in the same top-level xact.

FWIW I think one of the earlier patch versions did something like this,
by adding a "created" flag in the xlog record. And we concluded doing
this on the decoding side is a better solution.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/27/23 11:13, Amit Kapila wrote:
> > On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
> >> <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> While going over 0001, I realized there might be an optimization for
> >>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
> >>> searches through all top-level transactions, and if there's many of them
> >>> that might be expensive, even if very few of them have any relfilenodes
> >>> in the hash table. It's still linear search, and it needs to happen for
> >>> each sequence change.
> >>>
> >>> But can the relfilenode even be in some other top-level transaction? How
> >>> could it be - our transaction would not see it, and wouldn't be able to
> >>> generate the sequence change. So we should be able to simply check *our*
> >>> transaction (or if it's a subxact, the top-level transaction). Either
> >>> it's there (and it's transactional change), or not (and then it's
> >>> non-transactional change).
> >>>
> >>
> >> I also think the relfilenode should be part of either the current
> >> top-level xact or one of its subxact, so looking at all the top-level
> >> transactions for each change doesn't seem advisable.
> >>
> >>> The 0004 does this.
> >>>
> >>> This of course hinges on when exactly the transactions get created, and
> >>> assignments processed. For example if this would fire before the txn
> >>> gets assigned to the top-level one, this would break. I don't think this
> >>> can happen thanks to the immediate logging of assignments, but I'm too
> >>> tired to think about it now.
> >>>
> >>
> >> This needs some thought because I think we can't guarantee the
> >> association till we reach the point where we can actually decode the
> >> xact. See comments in AssertTXNLsnOrder() [1].
> >>
>
> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
> which says:
>
>   /*
>    * Skip the verification if we don't reach the LSN at which we start
>    * decoding the contents of transactions yet because until we reach
>    * the LSN, we could have transactions that don't have the association
>    * between the top-level transaction and subtransaction yet and
>    * consequently have the same LSN.  We don't guarantee this
>    * association until we try to decode the actual contents of
>    * transaction. The ordering of the records prior to the
>    * start_decoding_at LSN should have been checked before the restart.
>    */
>
> But doesn't this say that after we actually start decoding / stop
> skipping, we should have seen the assignment? We're already decoding
> transaction contents (because sequence change *is* part of xact, even if
> we decide to replay it in the non-transactional way).
>

It means that the association is only guaranteed after the
start_decoding_at point. We haven't determined that we are past
start_decoding_at by the time the patch computes the transactional flag.

> >
> > I am wondering that instead of building the infrastructure to know
> > whether a particular change is transactional on the decoding side,
> > can't we have some flag in the WAL record to note whether the change
> > is transactional or not? I have discussed this point with my colleague
> > Kuroda-San and we thought that it may be worth exploring whether we
> > can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
> > determine if the sequence is created/changed in the current
> > subtransaction and then record that in WAL record. By this, we need to
> > have additional information in the WAL record like XLOG_SEQ_LOG but we
> > can probably do it only with wal_level as logical.
> >
>
> I may not understand the proposal exactly, but it's not enough to know
> if it was created in the same subxact. It might have been created in
> some earlier subxact in the same top-level xact.
>

We should be able to detect even some earlier subxact or top-level
xact based on rd_createSubid/rd_newRelfilelocatorSubid.

> FWIW I think one of the earlier patch versions did something like this,
> by adding a "created" flag in the xlog record. And we concluded doing
> this on the decoding side is a better solution.
>

oh, I thought it would be much simpler than what we are doing on the
decoding side. Can you please point me to the email discussion where
this is concluded or share the reason?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From:
Amit Kapila
Date:
On Mon, Nov 27, 2023 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
>
> > FWIW I think one of the earlier patch versions did something like this,
> > by adding a "created" flag in the xlog record. And we concluded doing
> > this on the decoding side is a better solution.
> >
>
> oh, I thought it would be much simpler than what we are doing on the
> decoding-side. Can you please point me to the email discussion where
> this is concluded or share the reason?
>

I'll check the thread about this point by myself as well but if by
chance you remember it then kindly share it.

--
With Regards,
Amit Kapila.



RE: logical decoding and replication of sequences, take 2

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Tomas,

> > >
> > > I am wondering that instead of building the infrastructure to know
> > > whether a particular change is transactional on the decoding side,
> > > can't we have some flag in the WAL record to note whether the change
> > > is transactional or not? I have discussed this point with my colleague
> > > Kuroda-San and we thought that it may be worth exploring whether we
> > > can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
> > > determine if the sequence is created/changed in the current
> > > subtransaction and then record that in WAL record. By this, we need to
> > > have additional information in the WAL record like XLOG_SEQ_LOG but we
> > > can probably do it only with wal_level as logical.
> > >
> >
> > I may not understand the proposal exactly, but it's not enough to know
> > if it was created in the same subxact. It might have been created in
> > some earlier subxact in the same top-level xact.
> >
> 
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.

Here is a small PoC patchset to help your understanding. Please see attached
files.

0001 and 0002 were not changed, and the previous 0004 was renumbered to 0003.
(For now, I focused only on test_decoding, because it is only for evaluation purposes.)

0004 is the part we really wanted to show: an is_transactional flag is added to the
WAL record, storing whether the operation is transactional. To distinguish the cases,
rd_createSubid and rd_newRelfilelocatorSubid are used. According to their comments,
they hold a valid value only when the relation was created or changed within the
transaction. Also, sequences_hash was not needed anymore, so it and the related
functions were removed.

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

Re: logical decoding and replication of sequences, take 2

From:
Tomas Vondra
Date:

On 11/27/23 12:11, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/27/23 11:13, Amit Kapila wrote:
>>> On Mon, Nov 27, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Mon, Nov 27, 2023 at 6:41 AM Tomas Vondra
>>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> While going over 0001, I realized there might be an optimization for
>>>>> ReorderBufferSequenceIsTransactional. As coded in 0001, it always
>>>>> searches through all top-level transactions, and if there's many of them
>>>>> that might be expensive, even if very few of them have any relfilenodes
>>>>> in the hash table. It's still linear search, and it needs to happen for
>>>>> each sequence change.
>>>>>
>>>>> But can the relfilenode even be in some other top-level transaction? How
>>>>> could it be - our transaction would not see it, and wouldn't be able to
>>>>> generate the sequence change. So we should be able to simply check *our*
>>>>> transaction (or if it's a subxact, the top-level transaction). Either
>>>>> it's there (and it's transactional change), or not (and then it's
>>>>> non-transactional change).
>>>>>
>>>>
>>>> I also think the relfilenode should be part of either the current
>>>> top-level xact or one of its subxact, so looking at all the top-level
>>>> transactions for each change doesn't seem advisable.
>>>>
>>>>> The 0004 does this.
>>>>>
>>>>> This of course hinges on when exactly the transactions get created, and
>>>>> assignments processed. For example if this would fire before the txn
>>>>> gets assigned to the top-level one, this would break. I don't think this
>>>>> can happen thanks to the immediate logging of assignments, but I'm too
>>>>> tired to think about it now.
>>>>>
>>>>
>>>> This needs some thought because I think we can't guarantee the
>>>> association till we reach the point where we can actually decode the
>>>> xact. See comments in AssertTXNLsnOrder() [1].
>>>>
>>
>> I suppose you mean the comment before the SnapBuildXactNeedsSkip call,
>> which says:
>>
>>   /*
>>    * Skip the verification if we don't reach the LSN at which we start
>>    * decoding the contents of transactions yet because until we reach
>>    * the LSN, we could have transactions that don't have the association
>>    * between the top-level transaction and subtransaction yet and
>>    * consequently have the same LSN.  We don't guarantee this
>>    * association until we try to decode the actual contents of
>>    * transaction. The ordering of the records prior to the
>>    * start_decoding_at LSN should have been checked before the restart.
>>    */
>>
>> But doesn't this say that after we actually start decoding / stop
>> skipping, we should have seen the assignment? We're already decoding
>> transaction contents (because sequence change *is* part of xact, even if
>> we decide to replay it in the non-transactional way).
>>
> 
> It means to say that the assignment is decided after start_decoding_at
> point. We haven't decided that we are past start_decoding_at by the
> time the patch is computing the transactional flag.
> 

Ah, I see. We're deciding if the change is transactional before calling
SnapBuildXactNeedsSkip. That's a bit unfortunate.

>>>
>>> I am wondering that instead of building the infrastructure to know
>>> whether a particular change is transactional on the decoding side,
>>> can't we have some flag in the WAL record to note whether the change
>>> is transactional or not? I have discussed this point with my colleague
>>> Kuroda-San and we thought that it may be worth exploring whether we
>>> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
>>> determine if the sequence is created/changed in the current
>>> subtransaction and then record that in WAL record. By this, we need to
>>> have additional information in the WAL record like XLOG_SEQ_LOG but we
>>> can probably do it only with wal_level as logical.
>>>
>>
>> I may not understand the proposal exactly, but it's not enough to know
>> if it was created in the same subxact. It might have been created in
>> some earlier subxact in the same top-level xact.
>>
> 
> We should be able to detect even some earlier subxact or top-level
> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
> 

Interesting. I admit I haven't considered using these fields before, so
I need to familiarize with it a bit, and try if it'd work.

>> FWIW I think one of the earlier patch versions did something like this,
>> by adding a "created" flag in the xlog record. And we concluded doing
>> this on the decoding side is a better solution.
>>
> 
> oh, I thought it would be much simpler than what we are doing on the
> decoding-side. Can you please point me to the email discussion where
> this is concluded or share the reason?
> 

I think the discussion started around [1], and then in a bunch of
following messages (search for "relfilenode").

regards


[1]
https://www.postgresql.org/message-id/CAExHW5v_vVqkhF4ehST9EzpX1L3bemD1S%2BkTk_-ZVu_ir-nKDw%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 11/27/23 13:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Amit, Tomas,
> 
>>>>
>>>> I am wondering that instead of building the infrastructure to know
>>>> whether a particular change is transactional on the decoding side,
>>>> can't we have some flag in the WAL record to note whether the change
>>>> is transactional or not? I have discussed this point with my colleague
>>>> Kuroda-San and we thought that it may be worth exploring whether we
>>>> can use rd_createSubid/rd_newRelfilelocatorSubid in RelationData to
>>>> determine if the sequence is created/changed in the current
>>>> subtransaction and then record that in WAL record. By this, we need to
>>>> have additional information in the WAL record like XLOG_SEQ_LOG but we
>>>> can probably do it only with wal_level as logical.
>>>>
>>>
>>> I may not understand the proposal exactly, but it's not enough to know
>>> if it was created in the same subxact. It might have been created in
>>> some earlier subxact in the same top-level xact.
>>>
>>
>> We should be able to detect even some earlier subxact or top-level
>> xact based on rd_createSubid/rd_newRelfilelocatorSubid.
> 
> Here is a small PoC patchset to help your understanding. Please see attached
> files.
> 
> 0001, 0002 were not changed, and 0004 was reassigned to 0003.
> (For now, I focused only on test_decoding, because it is only for evaluation purposes.)
> 
> 0004 is what we really wanted to say. is_transactional is added to the WAL record, and it stores
> whether the operation is transactional. To distinguish the status, rd_createSubid and
> rd_newRelfilelocatorSubid are used. According to the comment, they hold a valid value
> only when the relation was changed within the current transaction.
> Also, sequences_hash was not needed anymore, so it and its related functions were removed.
> 
> What do you think?
> 

I think it's a very nice idea, assuming it maintains the current
behavior. It makes a lot of code unnecessary, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
Hi,

I spent a bit of time looking at the proposed change, and unfortunately
logging just the boolean flag does not work. A good example is this bit
from a TAP test added by the patch for built-in replication (which was
not included with the WIP patch):

  BEGIN;
  ALTER SEQUENCE s RESTART WITH 1000;
  SAVEPOINT sp1;
  INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
  ROLLBACK TO sp1;
  COMMIT;

This is expected to produce:

  1131|0|t

but produces

  1000|0|f

instead. The reason is very simple - as implemented, the patch simply
checks if the relfilenode is from the same top-level transaction, which
it is, and sets the flag to "true". So we know the sequence changes need
to be queued and replayed as part of this transaction.

But then during decoding, we still queue the changes into the subxact,
which then aborts, and the changes are discarded. That is not how it's
supposed to work, because the new relfilenode is still valid, someone
might do nextval() and commit. And the nextval() may not get WAL-logged,
so we'd lose this.

What I guess we might do is log not just a boolean flag, but the XID of
the subtransaction that created the relfilenode. And then during
decoding we'd queue the changes into this subtransaction ...

0006 in the attached patch series does this, and it seems to fix the TAP
test failure. I left it at the end, to make it easier to run tests
without the patch applied.

There's a couple open questions, though.

- I'm not sure it's a good idea to log XIDs of subxacts into WAL like
this. I think it'd be OK, and there are other records that do that (like
RunningXacts or commit record), but maybe I'm missing something.

- We need the actual XID, not just the SubTransactionId. I wrote
SubTransactionGetXid() to do this, but I have not worked with subxacts
much, so it'd be better if someone checked it's dealing with XID and
FullTransactionId correctly.

- I'm a bit concerned how this will perform with deeply nested
subtransactions. SubTransactionGetXid() does pretty much a linear
search, which might be somewhat expensive. And it's a cost put on
everyone who writes WAL, not just the decoding process. Maybe we should
at least limit this to wal_level=logical?

- seq_decode() then uses this XID (for transactional changes) instead of
the XID logged in the record itself. I think that's fine - it's the TXN
where we want to queue the change, after all, right?

- (unrelated) I also noticed that maybe ReorderBufferQueueSequence()
should always expect a valid XID. The code seems to suggest people can
pass InvalidTransactionId in the non-transactional case, but that's not
true because the rb->sequence() then fails.


The attached patches should also fix all the typos reported by Amit
earlier today.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Peter Smith
Date
FWIW, here are some more minor review comments for v20231127-3-0001

======
doc/src/sgml/logicaldecoding.sgml

1.
+      The <parameter>txn</parameter> parameter contains meta information about
+      the transaction the sequence change is part of. Note however that for
+      non-transactional updates, the transaction may be NULL, depending on
+      if the transaction already has an XID assigned.
+      The <parameter>sequence_lsn</parameter> has the WAL location of the
+      sequence update. <parameter>transactional</parameter> says if the
+      sequence has to be replayed as part of the transaction or directly.

/says if/specifies whether/

======
src/backend/commands/sequence.c

2. DecodeSeqTuple

+ memcpy(((char *) tuple->tuple.t_data),
+    data + sizeof(xl_seq_rec),
+    SizeofHeapTupleHeader);
+
+ memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
+    data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
+    datalen);

Maybe I am misreading but isn't this just copying 2 contiguous pieces
of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
achieve the same?

======
.../replication/logical/reorderbuffer.c

3.
+ *   To decide if a sequence change is transactional, we maintain a hash
+ *   table of relfilenodes created in each (sub)transactions, along with
+ *   the XID of the (sub)transaction that created the relfilenode. The
+ *   entries from substransactions are copied to the top-level transaction
+ *   to make checks cheaper. The hash table gets cleaned up when the
+ *   transaction completes (commit/abort).

/substransactions/subtransactions/

~~~

4.
+ * A naive approach would be to just loop through all transactions and check
+ * each of them, but there may be (easily thousands) of subtransactions, and
+ * the check happens for each sequence change. So this could be very costly.

/may be (easily thousands) of/may be (easily thousands of)/

~~~

5. ReorderBufferSequenceCleanup

+ while ((ent = (ReorderBufferSequenceEnt *)
hash_seq_search(&scan_status)) != NULL)
+ {
+ (void) hash_search(txn->toptxn->sequences_hash,
+    (void *) &ent->rlocator,
+    HASH_REMOVE, NULL);
+ }

Typically, other HASH_REMOVE code I saw would check result for NULL to
give elog(ERROR, "hash table corrupted");

~~~

6. ReorderBufferQueueSequence

+ if (xid != InvalidTransactionId)
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

How about using the macro: TransactionIdIsValid

~~~

7. ReorderBufferQueueSequence

+ if (reloid == InvalidOid)
+ elog(ERROR, "could not map filenode \"%s\" to relation OID",
+ relpathperm(rlocator,
+ MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

8.
+ /*
+ * Calculate the first value of the next batch (at which point we
+ * generate and decode another WAL record.
+ */

Missing ')'

~~~

9. ReorderBufferAddRelFileLocator

+ /*
+ * We only care about sequence relfilenodes for now, and those always have
+ * a XID. So if there's no XID, don't bother adding them to the hash.
+ */
+ if (xid == InvalidTransactionId)
+ return;

How about using the macro: TransactionIdIsValid

~~~

10. ReorderBufferProcessTXN

+ if (reloid == InvalidOid)
+ elog(ERROR, "could not map filenode \"%s\" to relation OID",
+ relpathperm(change->data.sequence.locator,
+ MAIN_FORKNUM));

How about using the macro: OidIsValid

~~~

11. ReorderBufferChangeSize

+ if (tup)
+ {
+ sz += sizeof(HeapTupleData);
+ len = tup->tuple.t_len;
+ sz += len;
+ }

Why is the 'sz' increment split into 2 parts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I spent a bit of time looking at the proposed change, and unfortunately
> logging just the boolean flag does not work. A good example is this bit
> from a TAP test added by the patch for built-in replication (which was
> not included with the WIP patch):
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART WITH 1000;
>   SAVEPOINT sp1;
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>   ROLLBACK TO sp1;
>   COMMIT;
>
> This is expected to produce:
>
>   1131|0|t
>
> but produces
>
>   1000|0|f
>
> instead. The reason is very simple - as implemented, the patch simply
> checks if the relfilenode is from the same top-level transaction, which
> it is, and sets the flag to "true". So we know the sequence changes need
> to be queued and replayed as part of this transaction.
>
> But then during decoding, we still queue the changes into the subxact,
> which then aborts, and the changes are discarded. That is not how it's
> supposed to work, because the new relfilenode is still valid, someone
> might do nextval() and commit. And the nextval() may not get WAL-logged,
> so we'd lose this.
>
> What I guess we might do is log not just a boolean flag, but the XID of
> the subtransaction that created the relfilenode. And then during
> decoding we'd queue the changes into this subtransaction ...
>
> 0006 in the attached patch series does this, and it seems to fix the TAP
> test failure. I left it at the end, to make it easier to run tests
> without the patch applied.
>

Offhand, I don't have any better idea than what you have suggested for
the problem but this needs some thoughts including the questions asked
by you. I'll spend some time on it and respond back.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
On 11/28/23 12:32, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I spent a bit of time looking at the proposed change, and unfortunately
>> logging just the boolean flag does not work. A good example is this bit
>> from a TAP test added by the patch for built-in replication (which was
>> not included with the WIP patch):
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART WITH 1000;
>>   SAVEPOINT sp1;
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>>   ROLLBACK TO sp1;
>>   COMMIT;
>>
>> This is expected to produce:
>>
>>   1131|0|t
>>
>> but produces
>>
>>   1000|0|f
>>
>> instead. The reason is very simple - as implemented, the patch simply
>> checks if the relfilenode is from the same top-level transaction, which
>> it is, and sets the flag to "true". So we know the sequence changes need
>> to be queued and replayed as part of this transaction.
>>
>> But then during decoding, we still queue the changes into the subxact,
>> which then aborts, and the changes are discarded. That is not how it's
>> supposed to work, because the new relfilenode is still valid, someone
>> might do nextval() and commit. And the nextval() may not get WAL-logged,
>> so we'd lose this.
>>
>> What I guess we might do is log not just a boolean flag, but the XID of
>> the subtransaction that created the relfilenode. And then during
>> decoding we'd queue the changes into this subtransaction ...
>>
>> 0006 in the attached patch series does this, and it seems to fix the TAP
>> test failure. I left it at the end, to make it easier to run tests
>> without the patch applied.
>>
> 
> Offhand, I don't have any better idea than what you have suggested for
> the problem but this needs some thoughts including the questions asked
> by you. I'll spend some time on it and respond back.
> 

I've been experimenting with the idea of logging the XID, and for a
moment I was worried it can't actually work, because subtransactions may
not be nested in a simple linear way, but form a tree. What if the
sequence was altered in a different branch (a sibling subxact), not in
the immediate parent? In that case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.

I've been thinking about cases like this:

   BEGIN;
   CREATE SEQUENCE s;        # XID 1000
   SELECT alter_sequence();  # XID 1001
   SAVEPOINT s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   ROLLBACK TO s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   COMMIT;

The XID values are what the sequence wal record will reference, assuming
that the main transaction XID is 1000.

Initially, I thought it's wrong that the nextval() calls reference XID
of the main transaction, because the last relfilenode comes from 1001,
which is the subxact created by alter_sequence() thanks to the exception
handling block. And that's where the approach in reorderbuffer would
queue the changes.

But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid.

This also means that SubTransactionGetXid() can't actually fail, because
the ID has to reference an active subtransaction in the current stack.
I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high up, but I guess
we might have another field, caching the XID. It'd need to be updated
only in AtEOSubXact_cleanup, and at that point we know it's the
immediate parent, so it'd be pretty cheap I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
Hi,

I have been hacking on the improvements outlined in my preceding
e-mail, but I have some bad news - I ran into an issue that I don't
know how to solve :-(

Consider this transaction:

  BEGIN;
  ALTER SEQUENCE s RESTART 1000;

  SAVEPOINT s1;
  ALTER SEQUENCE s RESTART 2000;
  ROLLBACK TO s1;

  INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
  COMMIT;

If you try this with the approach relying on rd_newRelfilelocatorSubid
and rd_createSubid, it fails like this on the subscriber:

  ERROR:  could not map filenode "base/5/16394" to relation OID

This happens because ReorderBufferQueueSequence tries to do this in the
non-transactional branch:

  reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);

and the relfilenode is the one created by the first ALTER. But this is
obviously wrong - the changes should have been treated as transactional,
because they are tied to the first ALTER. So how did we get there?

Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
resets the two fields to InvalidSubTransactionId. Which means the
rollback in the above transaction also forgets about the first ALTER.
Now that I look at the RelationData comments, it actually describes
exactly this situation:

  *
  * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
  * the most-recent relfilenumber change has survived into or zero if
  * not changed in the current transaction (or we have forgotten
  * changing it).  This field is accurate when non-zero, but it can be
  * zero when a relation has multiple new relfilenumbers within a
  * single transaction, with one of them occurring in a subsequently
  * aborted subtransaction, e.g.
  *    BEGIN;
  *    TRUNCATE t;
  *    SAVEPOINT save;
  *    TRUNCATE t;
  *    ROLLBACK TO save;
  *    -- rd_newRelfilelocatorSubid is now forgotten
  *

The root of this problem is that we'd need some sort of "history" for
the field, so that when a subxact aborts, we can restore the previous
value. But we obviously don't have that, and I doubt we want to add that
to relcache - for example, it'd either need to impose some limit on the
history (and thus a failure when we reach the limit), or it'd need to
handle histories of arbitrary length.

At this point I don't see a solution for this, which means the best way
forward with the sequence decoding patch seems to be the original
approach, on the decoding side.

I'm attaching the patch with 0005 and 0006, adding two simple tests (no
other changes compared to yesterday's version).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date

On 11/27/23 23:06, Peter Smith wrote:
> FWIW, here are some more minor review comments for v20231127-3-0001
> 
> ======
> doc/src/sgml/logicaldecoding.sgml
> 
> 1.
> +      The <parameter>txn</parameter> parameter contains meta information about
> +      the transaction the sequence change is part of. Note however that for
> +      non-transactional updates, the transaction may be NULL, depending on
> +      if the transaction already has an XID assigned.
> +      The <parameter>sequence_lsn</parameter> has the WAL location of the
> +      sequence update. <parameter>transactional</parameter> says if the
> +      sequence has to be replayed as part of the transaction or directly.
> 
> /says if/specifies whether/
> 

Will fix.

> ======
> src/backend/commands/sequence.c
> 
> 2. DecodeSeqTuple
> 
> + memcpy(((char *) tuple->tuple.t_data),
> +    data + sizeof(xl_seq_rec),
> +    SizeofHeapTupleHeader);
> +
> + memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
> +    data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
> +    datalen);
> 
> Maybe I am misreading but isn't this just copying 2 contiguous pieces
> of data? Won't a single memcpy of (SizeofHeapTupleHeader + datalen)
> achieve the same?
> 

You're right, will fix. I think the code looked different before; it got
simplified, and I didn't notice this can be a single memcpy().

> ======
> .../replication/logical/reorderbuffer.c
> 
> 3.
> + *   To decide if a sequence change is transactional, we maintain a hash
> + *   table of relfilenodes created in each (sub)transactions, along with
> + *   the XID of the (sub)transaction that created the relfilenode. The
> + *   entries from substransactions are copied to the top-level transaction
> + *   to make checks cheaper. The hash table gets cleaned up when the
> + *   transaction completes (commit/abort).
> 
> /substransactions/subtransactions/
> 

Will fix.

> ~~~
> 
> 4.
> + * A naive approach would be to just loop through all transactions and check
> + * each of them, but there may be (easily thousands) of subtransactions, and
> + * the check happens for each sequence change. So this could be very costly.
> 
> /may be (easily thousands) of/may be (easily thousands of)/
> 
> ~~~

Thanks. I've reworded this to

  ... may be many (easily thousands of) subtransactions ...

> 
> 5. ReorderBufferSequenceCleanup
> 
> + while ((ent = (ReorderBufferSequenceEnt *)
> hash_seq_search(&scan_status)) != NULL)
> + {
> + (void) hash_search(txn->toptxn->sequences_hash,
> +    (void *) &ent->rlocator,
> +    HASH_REMOVE, NULL);
> + }
> 
> Typically, other HASH_REMOVE code I saw would check result for NULL to
> give elog(ERROR, "hash table corrupted");
> 

Good point, I'll add the error check

> ~~~
> 
> 6. ReorderBufferQueueSequence
> 
> + if (xid != InvalidTransactionId)
> + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> 
> How about using the macro: TransactionIdIsValid
> 

Actually, as I wrote in some other message, I think the check is not
necessary. Or rather, it should be an assert that the XID is valid. And
yeah, the macro is a good idea.

> ~~~
> 
> 7. ReorderBufferQueueSequence
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(rlocator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

I chose to keep this consistent with other places in reorderbuffer, and
all of them use the equality check.

> ~~~
> 
> 8.
> + /*
> + * Calculate the first value of the next batch (at which point we
> + * generate and decode another WAL record.
> + */
> 
> Missing ')'
> 

Will fix.

> ~~~
> 
> 9. ReorderBufferAddRelFileLocator
> 
> + /*
> + * We only care about sequence relfilenodes for now, and those always have
> + * a XID. So if there's no XID, don't bother adding them to the hash.
> + */
> + if (xid == InvalidTransactionId)
> + return;
> 
> How about using the macro: TransactionIdIsValid
> 

Will change.

> ~~~
> 
> 10. ReorderBufferProcessTXN
> 
> + if (reloid == InvalidOid)
> + elog(ERROR, "could not map filenode \"%s\" to relation OID",
> + relpathperm(change->data.sequence.locator,
> + MAIN_FORKNUM));
> 
> How about using the macro: OidIsValid
> 

Same as the other Oid check - consistency.

> ~~~
> 
> 11. ReorderBufferChangeSize
> 
> + if (tup)
> + {
> + sz += sizeof(HeapTupleData);
> + len = tup->tuple.t_len;
> + sz += len;
> + }
> 
> Why is the 'sz' increment split into 2 parts?
> 

Because the other branches in ReorderBufferChangeSize do it that way.
You're right it might be coded on a single line.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date
Hi!

Considering my findings about issues with the rd_newRelfilelocatorSubid
field and how it makes that approach impossible, I decided to rip out
those patches, and go back to the approach where reorderbuffer tracks
new relfilenodes. This means the open questions I listed two days ago
disappear, because all of that was about the alternative approach.

I've also added a couple more tests into 034_sequences.pl, testing the
basic cases with subtransactions that roll back (or not), etc. The
attached patch also addresses the review comments by Peter Smith.

The one remaining open question is ReorderBufferSequenceIsTransactional
and whether it can do better than searching through all top-level
transactions. The idea of 0002 was to only search the current top-level
xact, but Amit pointed out we can't rely on seeing the assignment until
we know we're in a consistent snapshot.

I'm yet to try doing some tests to measure how expensive this lookup can
be in practice. But let's assume it's measurable and significant enough
to matter. I wonder if we could salvage this optimization somehow. I'm
thinking about three options:

1) Could ReorderBufferSequenceIsTransactional check the snapshot is
already consistent etc. and use the optimized variant (looking only at
the same top-level xact) in that case? And if not, fallback to the
search of all top-level xacts. In practice, the full search would be
used only for a short initial period.

2) We could also make ReorderBufferSequenceIsTransactional always check
the same top-level transaction first and then fall back, no matter
whether the snapshot is consistent or not. The problem is this doesn't
really optimize the common case where there are no new relfilenodes, so
we won't find a match in the top-level xact, and will always search
everything anyway.

3) Alternatively, we could maintain a global hash table, instead of only
in the top-level transaction. So there'd always be two copies: one in
the xact itself, and one in the global hash. (Currently there's either
one copy, in the current top-level xact, or two, in the subxact and the
top-level xact.)

I kinda like (3), because it just works and doesn't require the snapshot
being consistent etc.


Opinions?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date
On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I have been hacking on improving the improvements outlined in my
> preceding e-mail, but I have some bad news - I ran into an issue that I
> don't know how to solve :-(
>
> Consider this transaction:
>
>   BEGIN;
>   ALTER SEQUENCE s RESTART 1000;
>
>   SAVEPOINT s1;
>   ALTER SEQUENCE s RESTART 2000;
>   ROLLBACK TO s1;
>
>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>   COMMIT;
>
> If you try this with the approach relying on rd_newRelfilelocatorSubid
> and rd_createSubid, it fails like this on the subscriber:
>
>   ERROR:  could not map filenode "base/5/16394" to relation OID
>
> This happens because ReorderBufferQueueSequence tries to do this in the
> non-transactional branch:
>
>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>
> and the relfilenode is the one created by the first ALTER. But this is
> obviously wrong - the changes should have been treated as transactional,
> because they are tied to the first ALTER. So how did we get there?
>
> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
> resets the two fields to InvalidSubTransactionId. Which means the
> rollback in the above transaction also forgets about the first ALTER.
> Now that I look at the RelationData comments, it actually describes
> exactly this situation:
>
>   *
>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>   * the most-recent relfilenumber change has survived into or zero if
>   * not changed in the current transaction (or we have forgotten
>   * changing it).  This field is accurate when non-zero, but it can be
>   * zero when a relation has multiple new relfilenumbers within a
>   * single transaction, with one of them occurring in a subsequently
>   * aborted subtransaction, e.g.
>   *    BEGIN;
>   *    TRUNCATE t;
>   *    SAVEPOINT save;
>   *    TRUNCATE t;
>   *    ROLLBACK TO save;
>   *    -- rd_newRelfilelocatorSubid is now forgotten
>   *
>
> The root of this problem is that we'd need some sort of "history" for
> the field, so that when a subxact aborts, we can restore the previous
> value. But we obviously don't have that, and I doubt we want to add that
> to relcache - for example, it'd either need to impose some limit on the
> history (and thus a failure when we reach the limit), or it'd need to
> handle histories of arbitrary length.
>

Yeah, I think that would be really tricky and we may not want to go there.

> At this point I don't see a solution for this, which means the best way
> forward with the sequence decoding patch seems to be the original
> approach, on the decoding side.
>

One thing that worries me about that approach is that it can suck with
the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
records. We have previously fixed some such workloads in logical
decoding where decoding a transaction containing truncation of a table
with a lot of partitions (1000 or more) used to take a very long time.
Don't we face performance issues in such scenarios?

How do we see this work w.r.t to some sort of global sequences? There
is some recent discussion where I have raised a similar point [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 11/29/23 14:42, Amit Kapila wrote:
> On Wed, Nov 29, 2023 at 2:59 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I have been hacking on improving the improvements outlined in my
>> preceding e-mail, but I have some bad news - I ran into an issue that I
>> don't know how to solve :-(
>>
>> Consider this transaction:
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART 1000;
>>
>>   SAVEPOINT s1;
>>   ALTER SEQUENCE s RESTART 2000;
>>   ROLLBACK TO s1;
>>
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,40);
>>   COMMIT;
>>
>> If you try this with the approach relying on rd_newRelfilelocatorSubid
>> and rd_createSubid, it fails like this on the subscriber:
>>
>>   ERROR:  could not map filenode "base/5/16394" to relation OID
>>
>> This happens because ReorderBufferQueueSequence tries to do this in the
>> non-transactional branch:
>>
>>   reloid = RelidByRelfilenumber(rlocator.spcOid, rlocator.relNumber);
>>
>> and the relfilenode is the one created by the first ALTER. But this is
>> obviously wrong - the changes should have been treated as transactional,
>> because they are tied to the first ALTER. So how did we get there?
>>
>> Well, the whole problem is that in case of abort, AtEOSubXact_cleanup
>> resets the two fields to InvalidSubTransactionId. Which means the
>> rollback in the above transaction also forgets about the first ALTER.
>> Now that I look at the RelationData comments, it actually describes
>> exactly this situation:
>>
>>   *
>>   * rd_newRelfilelocatorSubid is the ID of the highest subtransaction
>>   * the most-recent relfilenumber change has survived into or zero if
>>   * not changed in the current transaction (or we have forgotten
>>   * changing it).  This field is accurate when non-zero, but it can be
>>   * zero when a relation has multiple new relfilenumbers within a
>>   * single transaction, with one of them occurring in a subsequently
>>   * aborted subtransaction, e.g.
>>   *    BEGIN;
>>   *    TRUNCATE t;
>>   *    SAVEPOINT save;
>>   *    TRUNCATE t;
>>   *    ROLLBACK TO save;
>>   *    -- rd_newRelfilelocatorSubid is now forgotten
>>   *
>>
>> The root of this problem is that we'd need some sort of "history" for
>> the field, so that when a subxact aborts, we can restore the previous
>> value. But we obviously don't have that, and I doubt we want to add that
>> to relcache - for example, it'd either need to impose some limit on the
>> history (and thus a failure when we reach the limit), or it'd need to
>> handle histories of arbitrary length.
>>
> 
> Yeah, I think that would be really tricky and we may not want to go there.
> 
>> At this point I don't see a solution for this, which means the best way
>> forward with the sequence decoding patch seems to be the original
>> approach, on the decoding side.
>>
> 
> One thing that worries me about that approach is that it can suck with
> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
> records. We have previously fixed some such workloads in logical
> decoding where decoding a transaction containing truncation of a table
> with a lot of partitions (1000 or more) used to take a very long time.
> Don't we face performance issues in such scenarios?
> 

I don't think we do, really. We will have to decode the SMGR records and
add the relfilenodes to the hash table(s), but I don't think that affects
the lookup performance too much. What I think might be a problem is if we
have many top-level transactions, especially if those transactions do
something that creates a relfilenode. Because then we'll have to do a
hash_search for each of them, and that might be measurable even if each
lookup is O(1). And we do the lookup for every sequence change ...

> How do we see this work w.r.t to some sort of global sequences? There
> is some recent discussion where I have raised a similar point [1].
> 
> [1] - https://www.postgresql.org/message-id/CAA4eK1JF%3D4_Eoq7FFjHSe98-_ooJ5QWd0s2_pj8gR%2B_dvwKxvA%40mail.gmail.com
> 

I think those are very different things, even though called "sequences".
AFAIK solutions like snowflakeID or UUIDs don't require replication of
any shared state (that's kinda the whole point), so I don't see why
would it need some special support in logical decoding.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 11/29/23 15:41, Tomas Vondra wrote:
> ...
>>
>> One thing that worries me about that approach is that it can suck with
>> the workload that has a lot of DDLs that create XLOG_SMGR_CREATE
>> records. We have previously fixed some such workloads in logical
>> decoding where decoding a transaction containing truncation of a table
>> with a lot of partitions (1000 or more) used to take a very long time.
>> Don't we face performance issues in such scenarios?
>>
> 
> I don't think we do, really. We will have to decode the SMGR records and
> add the relfilenodes to the hash table(s), but I don't think that affects
> the lookup performance too much. What I think might be a problem is if we
> have many top-level transactions, especially if those transactions do
> something that creates a relfilenode. Because then we'll have to do a
> hash_search for each of them, and that might be measurable even if each
> lookup is O(1). And we do the lookup for every sequence change ...
> 

I did some micro-benchmarking today, trying to identify cases where this
would cause unexpected problems, either due to having to maintain all
the relfilenodes, or due to having to do hash lookups for every sequence
change. But I think it's fine, mostly ...

I did all the following tests with 64 clients. I may try more, but even
with this there should be a fair number of concurrent transactions, which
determines the number of top-level transactions in the reorderbuffer. I'll
try with more clients tomorrow, but I don't think it'll change much.

The test is fairly simple - run a particular number of transactions
(might be 1000 * 64, or more). And then measure how long it takes to
decode the changes using test_decoding.

Now, the various workloads I tried:

1) "good case" - small OLTP transactions, a couple nextval('s') calls

  begin;
  insert into t values (1);
  select nextval('s');
  insert into t values (1);
  commit;

This is pretty fine, the sequence part of reorderbuffer is really not
measurable, it's like 1% of the total CPU time. Which is expected,
because we only WAL-log every 32nd increment or so.
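For illustration, that batched WAL-logging can be sketched in Python. SEQ_LOG_VALS is the real constant in PostgreSQL's sequence.c; the rest of the structure here is made up for the sketch:

```python
# Sketch of why only ~1 in 32 nextval() calls writes WAL: the sequence
# pre-logs a batch of values ahead, and further calls consume the batch.
SEQ_LOG_VALS = 32  # matches the constant in src/backend/commands/sequence.c

class Sequence:
    def __init__(self):
        self.last_value = 0
        self.log_cnt = 0       # values still covered by the last WAL record
        self.wal_records = 0   # stand-in for actual WAL writes

    def nextval(self):
        if self.log_cnt == 0:
            # pre-log SEQ_LOG_VALS values ahead, so that after a crash
            # we never hand out a value that was already returned
            self.wal_records += 1
            self.log_cnt = SEQ_LOG_VALS
        self.log_cnt -= 1
        self.last_value += 1
        return self.last_value

s = Sequence()
for _ in range(1000):
    s.nextval()
print(s.wal_records)  # 32, i.e. 1000/32 rounded up
```

So in workload (1), with only a couple of nextval() calls per transaction, the vast majority of increments never even reach the decoder.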

2) "good case" - same as (1), but with enough nextval calls to always
write WAL

  begin;
  insert into t values (1);
  select nextval('s') from generate_series(1,40);
  insert into t values (1);
  commit;

Here sequences are more measurable, around 15% of CPU time, but most
of that comes from AbortCurrentTransaction() in the non-transactional
branch of ReorderBufferQueueSequence. I don't think there's a way around
that, and it's entirely unrelated to relfilenodes. The function checking
if the change is transactional (ReorderBufferSequenceIsTransactional) is
less than 1% of the profile - and this is the version that always walks
all top-level transactions.
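To make that cost concrete, here is a loose Python model of the "walk all top-level transactions" check (hypothetical structures, not the actual reorderbuffer code):

```python
# Each top-level transaction tracks the relfilenodes created anywhere in
# its subtransaction tree; a sequence change is "transactional" if its
# relfilenode was created by one of the in-progress transactions.
toplevel_xacts = {
    # xid -> relfilenodes created in that transaction (incl. subxacts)
    100: {"base/5/16394"},
    101: set(),
    102: {"base/5/16401"},
}

def sequence_is_transactional(relfilenode):
    # one O(1) hash probe per top-level xact, so the total cost grows
    # with the number of concurrent top-level transactions -- and this
    # runs for every decoded sequence change
    return any(relfilenode in nodes for nodes in toplevel_xacts.values())

print(sequence_is_transactional("base/5/16394"))  # True: created in xact 100
print(sequence_is_transactional("base/5/12345"))  # False: pre-existing sequence
```

This is the lookup whose per-change overhead stays under 1% here, but which scales with the number of concurrent top-level transactions.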

3) "bad case" - small transactions that generate a lot of relfilenodes

  select alter_sequence();

where the function is defined like this (I did create 1000 sequences
before the test):

  CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
  DECLARE
      v INT;
  BEGIN
      v := 1 + (random() * 999)::int;
      execute format('alter sequence s%s restart with 1000', v);
      perform nextval('s');
  END;
  $$ LANGUAGE plpgsql;

This performs terribly, but it's entirely unrelated to sequences.
Current master has exactly the same problem, if transactions do DDL.
Like this, for example:

  CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
  DECLARE
      v INT;
  BEGIN
      v := 1 + (random() * 999)::int;
      execute format('create table t%s (a int)', v);
      execute format('drop table t%s', v);
      insert into t values (1);
  END;
  $$ LANGUAGE plpgsql;

This has the same impact on master. The perf report shows this:

  --98.06%--pg_logical_slot_get_changes_guts
       |
        --97.88%--LogicalDecodingProcessRecord
             |
             --97.56%--xact_decode
                  |
                   --97.51%--DecodeCommit
                        |
                        |--91.92%--SnapBuildCommitTxn
                        |     |
                        |      --91.65%--SnapBuildBuildSnapshot
                        |           |
                        |           --91.14%--pg_qsort

The sequence decoding is maybe ~1%. The reason why SnapBuildBuildSnapshot
takes so long is because:

-----------------
  Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
                                      at snapbuild.c:498
  498        + sizeof(TransactionId) *   builder->committed.xcnt
  (gdb) p builder->committed.xcnt
  $4 = 11532
-----------------

And with each iteration it grows by 1. That looks quite weird, possibly
a bug worth fixing, but unrelated to this patch. I can't investigate
this more at the moment, not sure when/if I'll get to that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Peter Smith
Date:
On Wed, Nov 29, 2023 at 11:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 11/27/23 23:06, Peter Smith wrote:
> > FWIW, here are some more minor review comments for v20231127-3-0001
> >
> > ======
> > .../replication/logical/reorderbuffer.c
> >
> > 3.
> > + *   To decide if a sequence change is transactional, we maintain a hash
> > + *   table of relfilenodes created in each (sub)transactions, along with
> > + *   the XID of the (sub)transaction that created the relfilenode. The
> > + *   entries from substransactions are copied to the top-level transaction
> > + *   to make checks cheaper. The hash table gets cleaned up when the
> > + *   transaction completes (commit/abort).
> >
> > /substransactions/subtransactions/
> >
>
> Will fix.

FYI - I think this typo still exists in the patch v20231128-0001.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> 3) "bad case" - small transactions that generate a lot of relfilenodes
>
>   select alter_sequence();
>
> where the function is defined like this (I did create 1000 sequences
> before the test):
>
>   CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
>   DECLARE
>       v INT;
>   BEGIN
>       v := 1 + (random() * 999)::int;
>       execute format('alter sequence s%s restart with 1000', v);
>       perform nextval('s');
>   END;
>   $$ LANGUAGE plpgsql;
>
> This performs terribly, but it's entirely unrelated to sequences.
> Current master has exactly the same problem, if transactions do DDL.
> Like this, for example:
>
>   CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
>   DECLARE
>       v INT;
>   BEGIN
>       v := 1 + (random() * 999)::int;
>       execute format('create table t%s (a int)', v);
>       execute format('drop table t%s', v);
>       insert into t values (1);
>   END;
>   $$ LANGUAGE plpgsql;
>
> This has the same impact on master. The perf report shows this:
>
>   --98.06%--pg_logical_slot_get_changes_guts
>        |
>         --97.88%--LogicalDecodingProcessRecord
>              |
>              --97.56%--xact_decode
>                   |
>                    --97.51%--DecodeCommit
>                         |
>                         |--91.92%--SnapBuildCommitTxn
>                         |     |
>                         |      --91.65%--SnapBuildBuildSnapshot
>                         |           |
>                         |           --91.14%--pg_qsort
>
> The sequence decoding is maybe ~1%. The reason why SnapBuildBuildSnapshot
> takes so long is because:
>
> -----------------
>   Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
>                                       at snapbuild.c:498
>   498        + sizeof(TransactionId) *   builder->committed.xcnt
>   (gdb) p builder->committed.xcnt
>   $4 = 11532
> -----------------
>
> And with each iteration it grows by 1.
>

Can we somehow avoid this either by keeping DDL-related xacts open or
aborting them? Also, will it make any difference to use setval as
do_setval() seems to be logging each time?

If possible, can you share the scripts? Kuroda-San has access to the
performance machine, he may be able to try it as well.

--
With Regards,
Amit Kapila.



RE: logical decoding and replication of sequences, take 2

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Tomas,

> I did some micro-benchmarking today, trying to identify cases where this
> would cause unexpected problems, either due to having to maintain all
> the relfilenodes, or due to having to do hash lookups for every sequence
> change. But I think it's fine, mostly ...
>

I also ran performance tests (especially case 3). First of all, there are
some differences from your setup.

1. patch 0002 was reverted because it had an issue. So this test checks whether
   the refactoring around ReorderBufferSequenceIsTransactional is really needed.
2. per comments from Amit, I also measured the abort case, in which
   alter_sequence() is called but the transaction is aborted.
3. I measured while varying the number of clients {8, 16, 32, 64, 128}. In all
   cases, each client executed 1000 transactions. The performance machine has
   128 cores, so the result for 128 clients might be saturated.
4. a short sleep (0.1s) was added in alter_sequence(), between
   "alter sequence" and nextval(), because while testing I found that the
   transaction is too short to run in parallel. I think this is reasonable
   because ReorderBufferSequenceIsTransactional() might perform worse as
   parallelism increases.

I attached perf to one backend process and executed pg_logical_slot_get_changes().
The attached txt file shows which functions occupied CPU time, in particular
under pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional().
Here are my observations about them.

* In case of commit, as you said, SnapBuildCommitTxn() seems dominant for the
  8-64 client cases.
* For the (commit, 128 clients) case, however, ReorderBufferRestoreChanges()
  wastes a lot of time. I think this is because the changes exceed
  logical_decoding_work_mem and are spilled to disk, so we do not need to
  analyze this case further.
* In case of abort, CPU time used by ReorderBufferSequenceIsTransactional()
  grows linearly with the number of clients. This means we need to think of
  some solution to avoid the overhead of
  ReorderBufferSequenceIsTransactional().

```
8 clients  3.73% occupied time
16 7.26%
32 15.82%
64 29.14%
128 46.27%
```

* In case of abort, I also checked CPU time used by ReorderBufferAddRelFileLocator(),
  but it does not seem to depend much on the number of clients.

```
8 clients 3.66% occupied time
16 6.94%
32 4.65%
64 5.39%
128 3.06%
```

As a next step, I plan to run the case which uses the setval() function,
because it generates more WAL than plain nextval().
What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 11/30/23 12:56, Amit Kapila wrote:
> On Thu, Nov 30, 2023 at 5:28 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> 3) "bad case" - small transactions that generate a lot of relfilenodes
>>
>>   select alter_sequence();
>>
>> where the function is defined like this (I did create 1000 sequences
>> before the test):
>>
>>   CREATE OR REPLACE FUNCTION alter_sequence() RETURNS void AS $$
>>   DECLARE
>>       v INT;
>>   BEGIN
>>       v := 1 + (random() * 999)::int;
>>       execute format('alter sequence s%s restart with 1000', v);
>>       perform nextval('s');
>>   END;
>>   $$ LANGUAGE plpgsql;
>>
>> This performs terribly, but it's entirely unrelated to sequences.
>> Current master has exactly the same problem, if transactions do DDL.
>> Like this, for example:
>>
>>   CREATE OR REPLACE FUNCTION create_table() RETURNS void AS $$
>>   DECLARE
>>       v INT;
>>   BEGIN
>>       v := 1 + (random() * 999)::int;
>>       execute format('create table t%s (a int)', v);
>>       execute format('drop table t%s', v);
>>       insert into t values (1);
>>   END;
>>   $$ LANGUAGE plpgsql;
>>
>> This has the same impact on master. The perf report shows this:
>>
>>   --98.06%--pg_logical_slot_get_changes_guts
>>        |
>>         --97.88%--LogicalDecodingProcessRecord
>>              |
>>              --97.56%--xact_decode
>>                   |
>>                    --97.51%--DecodeCommit
>>                         |
>>                         |--91.92%--SnapBuildCommitTxn
>>                         |     |
>>                         |      --91.65%--SnapBuildBuildSnapshot
>>                         |           |
>>                         |           --91.14%--pg_qsort
>>
>> The sequence decoding is maybe ~1%. The reason why SnapBuildBuildSnapshot
>> takes so long is because:
>>
>> -----------------
>>   Breakpoint 1, SnapBuildBuildSnapshot (builder=0x21f60f8)
>>                                       at snapbuild.c:498
>>   498        + sizeof(TransactionId) *   builder->committed.xcnt
>>   (gdb) p builder->committed.xcnt
>>   $4 = 11532
>> -----------------
>>
>> And with each iteration it grows by 1.
>>
> 
> Can we somehow avoid this either by keeping DDL-related xacts open or
> aborting them?
I'm not sure why the snapshot builder does this, i.e. why we end up
accumulating that many xids, and I didn't have time to look closer. So I
don't know if this would be a solution or not.

> Also, will it make any difference to use setval as
> do_setval() seems to be logging each time?
> 

I think that's pretty much what case (2) does, as it calls nextval()
enough times for each transaction to generate WAL. But I don't think this
is a very sensible benchmark - it's an extreme case, but practical cases
are far closer to case (1) because sequences are intermixed with other
activity. No one really does just nextval() calls.

> If possible, can you share the scripts? Kuroda-San has access to the
> performance machine, he may be able to try it as well.
> 

Sure, attached. But it's a very primitive script, nothing fancy.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 12/1/23 12:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
> 
>> I did some micro-benchmarking today, trying to identify cases where this
>> would cause unexpected problems, either due to having to maintain all
>> the relfilenodes, or due to having to do hash lookups for every sequence
>> change. But I think it's fine, mostly ...
>>
> 
> I did also performance tests (especially case 3). First of all, there are some
> variants from yours.
> 
> 1. patch 0002 was reverted because it has an issue. So this test checks whether
>    refactoring around ReorderBufferSequenceIsTransactional seems really needed.

FWIW I also did the benchmarks without the 0002 patch, for the same
reason. I forgot to mention that.

> 2. per comments from Amit, I also measured the abort case. In this case, the
>    alter_sequence() is called but the transaction is aborted.
> 3. I measured with changing number of clients {8, 16, 32, 64, 128}. In any cases,
>    clients executed 1000 transactions. The performance machine has 128 core so that
>    result for 128 clients might be saturated.
> 4. a short sleep (0.1s) was added in alter_sequence(), especially between
>    "alter sequence" and nextval(). Because while testing, I found that the
>    transaction is too short to execute in parallel. I think it is reasonable
>    because ReorderBufferSequenceIsTransactional() might be worse when the parallelism
>    is increased.
> 
> I attached one backend process via perf and executed pg_slot_logical_get_changes().
> Attached txt file shows which function occupied CPU time, especially from
> pg_logical_slot_get_changes_guts() and ReorderBufferSequenceIsTransactional().
> Here are my observations about them.
> 
> * In case of commit, as you said, SnapBuildCommitTxn() seems dominant for 8-64
>   clients case.
> * For (commit, 128 clients) case, however, ReorderBufferRestoreChanges() waste
>   many times. I think this is because changes exceed logical_decoding_work_mem,
>   so we do not have to analyze anymore.
> * In case of abort, CPU time used by ReorderBufferSequenceIsTransactional() is linearly
>   longer. This means that we need to think some solution to avoid the overhead by
>   ReorderBufferSequenceIsTransactional().
> 
> ```
> 8 clients  3.73% occupied time
> 16 7.26%
> 32 15.82%
> 64 29.14%
> 128 46.27%
> ```

Interesting, so what exactly does the transaction do? Anyway, I don't
think this is very surprising - I believe it behaves like this because
of having to search in many hash tables (one in each toplevel xact). And
I think the solution I explained before (maintaining a single toplevel
hash, instead of many per-top-level hashes) would address this.

FWIW I find this case interesting, but not very practical, because no
practical workload has that many aborts.

> 
> * In case of abort, I also checked CPU time used by ReorderBufferAddRelFileLocator(), but
>   it seems not so depends on the number of clients.
> 
> ```
> 8 clients 3.66% occupied time
> 16 6.94%
> 32 4.65%
> 64 5.39%
> 128 3.06%
> ```
> 
> As next step, I've planned to run the case which uses setval() function, because it
> generates more WALs than normal nextval();
> How do you think?
> 

Sure, although I don't think it's much different from the test selecting
40 values from the sequence (in each transaction). That generates about
the same amount of WAL.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: logical decoding and replication of sequences, take 2

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Tomas,

> > I did also performance tests (especially case 3). First of all, there are some
> > variants from yours.
> >
> > 1. patch 0002 was reverted because it has an issue. So this test checks whether
> >    refactoring around ReorderBufferSequenceIsTransactional seems really
> needed.
> 
> FWIW I also did the benchmarks without the 0002 patch, for the same
> reason. I forgot to mention that.

Oh, good news. So your benchmark results are quite meaningful.

> 
> Interesting, so what exactly does the transaction do?

It is quite simple - PSA the script file. It was executed with 64 concurrent
clients. The definition of alter_sequence() is the same as yours.
(I used a plain bash script to run them, but your approach may be smarter.)

> Anyway, I don't
> think this is very surprising - I believe it behaves like this because
> of having to search in many hash tables (one in each toplevel xact). And
> I think the solution I explained before (maintaining a single toplevel
> hash, instead of many per-top-level hashes).

Agreed. I can benchmark again once we decide on a new approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/3/23 13:55, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
> 
>>> I did also performance tests (especially case 3). First of all, there are some
>>> variants from yours.
>>>
>>> 1. patch 0002 was reverted because it has an issue. So this test checks whether
>>>    refactoring around ReorderBufferSequenceIsTransactional seems really
>> needed.
>>
>> FWIW I also did the benchmarks without the 0002 patch, for the same
>> reason. I forgot to mention that.
> 
> Oh, good news. So your bench markings are quite meaningful.
> 
>>
>> Interesting, so what exactly does the transaction do?
> 
> It is quite simple - PSA the script file. It was executed with 64 multiplicity.
> The definition of alter_sequence() is same as you said.
> (I did use normal bash script for running them, but your approach may be smarter)
> 
>> Anyway, I don't
>> think this is very surprising - I believe it behaves like this because
>> of having to search in many hash tables (one in each toplevel xact). And
>> I think the solution I explained before (maintaining a single toplevel
>> hash, instead of many per-top-level hashes).
> 
> Agreed. And I can benchmark again for new ones, maybe when we decide new
> approach.
> 

Thanks for the script. Are you also measuring the time it takes to
decode this using test_decoding?

FWIW I did more comprehensive suite of tests over the weekend, with a
couple more variations. I'm attaching the updated scripts, running it
should be as simple as

  ./run.sh BRANCH TRANSACTIONS RUNS

so perhaps

  ./run.sh master 1000 3

to do 3 runs with 1000 transactions per client. And it'll run a bunch of
combinations hard-coded in the script, and write the timings into a CSV
file (with "master" in each row).

I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I
did this with current master, the basic patch (without the 0002 part),
and then with the optimized approach (single global hash table, see the
0004 part). That's what master / patched / optimized in the results is.

Interestingly enough, the i5 handled this much faster, it seems to be
better in single-core tasks. The xeon is still running, so the results
for "optimized" only have one run (out of 3), but shouldn't change much.

Attached is also a table summarizing this, and visualizing the timing
change (vs. master) in the last couple columns. Green is "faster" than
master (but we don't really expect that), and "red" means slower than
master (the more red, the slower).

The results are grouped by script (see the attached .tgz), with either
32 or 96 clients (which does affect the timing, but not between master
and patch). Some executions have no pg_sleep() calls, some have 0.001
wait (but that doesn't seem to make much difference).

Overall, I'd group the results into about three groups:

1) good cases [nextval, nextval-40, nextval-abort]

These are cases that slow down a bit, but the slowdown is mostly within
reasonable bounds (we're making the decoding do more stuff, so it'd
be a bit silly to require that extra work to have no impact). And I do
think this is reasonable, because this is pretty much an extreme / worst
case behavior. People don't really do just nextval() calls, without
doing anything else. Not to mention doing aborts for 100% transactions.

So in practice this is going to be within noise (and in those cases the
results even show speedup, which seems a bit surprising). It's somewhat
dependent on CPU too - on xeon there's hardly any regression.


2) nextval-40-abort

Here the slowdown is clear, but I'd argue it generally falls in the same
group as (1). Yes, I'd be happier if it didn't behave like this, but if
someone can show me a practical workload affected by this ...


3) irrelevant cases [all the alters taking insane amounts of time]

I absolutely refuse to care about these extreme cases where decoding
100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on
xeon). If this was a problem for some practical workload, we'd have
already heard about it I guess. And even if there was such workload, it
wouldn't be up to this patch to fix that. There's clearly something
misbehaving in the snapshot builder.


I was hopeful the global hash table would be an improvement, but that
doesn't seem to be the case. I haven't done much profiling yet, but I'd
guess most of the overhead is due to ReorderBufferQueueSequence()
starting and aborting a transaction in the non-transactional case. Which
is unfortunate, but I don't know if there's a way to optimize that.

Some time ago I floated the idea of maybe "queuing" the sequence changes
and only replay them on the next commit, somehow. But we ran into
problems with which snapshot to use that I didn't know how to solve.
Maybe we should try again. The idea is we'd queue the non-transactional
changes somewhere (can't be in the transaction, because we must keep
them even if it aborts), and then "inject" them into the next commit.
That'd mean we wouldn't do the separate start/abort for each change.
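Purely as an illustration of that queuing idea (hypothetical names, and deliberately ignoring the unsolved snapshot question), it could look roughly like:

```python
# Non-transactional sequence changes are kept in a queue that is not tied
# to any transaction, so aborts do not discard them; the next commit
# replays ("injects") them all at once, avoiding a start/abort per change.
pending_seq_changes = []   # lives outside any transaction
applied = []               # stand-in for the rb->sequence() callback

def queue_sequence_change(change):
    pending_seq_changes.append(change)

def on_abort():
    # queued sequence changes deliberately survive the abort
    pass

def on_commit():
    # replay everything queued so far, under the committing
    # transaction's snapshot (the hard, unsolved part)
    applied.extend(pending_seq_changes)
    pending_seq_changes.clear()

queue_sequence_change(("s", 1000))
on_abort()                          # the change is not lost
queue_sequence_change(("s", 1033))
on_commit()
print(applied)  # [('s', 1000), ('s', 1033)]
```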


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/3/23 18:52, Tomas Vondra wrote:
> ...
> 
> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replay them on the next commit, somehow. But we did ran into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.
> 

Another idea is that maybe we could somehow inform ReorderBuffer whether
the output plugin is even interested in sequences. That'd help with
cases where we don't even want/need to replicate sequences, e.g. because
the publication does not specify (publish=sequence).

What happens now in that case is we call ReorderBufferQueueSequence(),
it does the whole dance with starting/aborting the transaction, calls
rb->sequence() which just does "meh" and doesn't do anything. Maybe we
could just short-circuit this by asking the output plugin somehow.

In an extreme case the plugin may not even specify the sequence
callbacks, and we're still doing all of this.
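As a sketch of that short-circuit (illustrative names only, not the actual output-plugin API):

```python
# If the output plugin registered no sequence callback, skip the whole
# start/abort dance instead of doing the work and then discarding it.
class OutputPlugin:
    def __init__(self, sequence_cb=None):
        self.sequence_cb = sequence_cb  # None = plugin ignores sequences

expensive_work = []  # records each time the costly txn dance happens

def queue_sequence(plugin, change):
    if plugin.sequence_cb is None:
        return                          # short-circuit: nothing to do
    expensive_work.append("start/abort dance")
    plugin.sequence_cb(change)

received = []
uninterested = OutputPlugin()                            # e.g. no publish=sequence
interested = OutputPlugin(lambda c: received.append(c))

queue_sequence(uninterested, ("s", 1000))  # no work done at all
queue_sequence(interested, ("s", 1000))
print(len(expensive_work), received)  # 1 [('s', 1000)]
```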


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Thanks for the script. Are you also measuring the time it takes to
> decode this using test_decoding?
>
> FWIW I did more comprehensive suite of tests over the weekend, with a
> couple more variations. I'm attaching the updated scripts, running it
> should be as simple as
>
>   ./run.sh BRANCH TRANSACTIONS RUNS
>
> so perhaps
>
>   ./run.sh master 1000 3
>
> to do 3 runs with 1000 transactions per client. And it'll run a bunch of
> combinations hard-coded in the script, and write the timings into a CSV
> file (with "master" in each row).
>
> I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I
> did this with current master, the basic patch (without the 0002 part),
> and then with the optimized approach (single global hash table, see the
> 0004 part). That's what master / patched / optimized in the results is.
>
> Interestingly enough, the i5 handled this much faster, it seems to be
> better in single-core tasks. The xeon is still running, so the results
> for "optimized" only have one run (out of 3), but shouldn't change much.
>
> Attached is also a table summarizing this, and visualizing the timing
> change (vs. master) in the last couple columns. Green is "faster" than
> master (but we don't really expect that), and "red" means slower than
> master (the more red, the slower).
>
> The results are grouped by script (see the attached .tgz), with either
> 32 or 96 clients (which does affect the timing, but not between master
> and patch). Some executions have no pg_sleep() calls, some have a 0.001s
> wait (but that doesn't seem to make much difference).
>
> Overall, I'd group the results into about three groups:
>
> 1) good cases [nextval, nextval-40, nextval-abort]
>
> These are cases that slow down a bit, but the slowdown is mostly within
> reasonable bounds (we're making the decoding do more stuff, so it'd
> be a bit silly to require that extra work to make no impact). And I do
> think this is reasonable, because this is pretty much an extreme / worst
> case behavior. People don't really do just nextval() calls, without
> doing anything else. Not to mention doing aborts for 100% transactions.
>
> So in practice this is going to be within the noise (and in those cases the
> results even show speedup, which seems a bit surprising). It's somewhat
> dependent on CPU too - on xeon there's hardly any regression.
>
>
> 2) nextval-40-abort
>
> Here the slowdown is clear, but I'd argue it generally falls in the same
> group as (1). Yes, I'd be happier if it didn't behave like this, but if
> someone can show me a practical workload affected by this ...
>
>
> 3) irrelevant cases [all the alters taking insane amounts of time]
>
> I absolutely refuse to care about these extreme cases where decoding
> 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on
> xeon). If this was a problem for some practical workload, we'd have
> already heard about it I guess. And even if there was such workload, it
> wouldn't be up to this patch to fix that. There's clearly something
> misbehaving in the snapshot builder.
>
>
> I was hopeful the global hash table would be an improvement, but that
> doesn't seem to be the case. I haven't done much profiling yet, but I'd
> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactional case. Which
> is unfortunate, but I don't know if there's a way to optimize that.
>

Before discussing the alternative ideas you shared, let me try to
clarify my understanding so that we are on the same page. I see two
observations based on the testing and discussion we had: (a) for
non-transactional cases, the overhead observed is mainly due to
starting/aborting a transaction for each change; (b) for transactional
cases, we see overhead due to traversing all the top-level txns and
checking the hash table for each one to find whether a change is
transactional.

Am I missing something?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/5/23 13:17, Amit Kapila wrote:
> ...
>> I was hopeful the global hash table would be an improvement, but that
>> doesn't seem to be the case. I haven't done much profiling yet, but I'd
>> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactional case. Which
>> is unfortunate, but I don't know if there's a way to optimize that.
>>
> 
> Before discussing the alternative ideas you shared, let me try to
> clarify my understanding so that we are on the same page. I see two
> observations based on the testing and discussion we had (a) for
> non-transactional cases, the overhead observed is mainly due to
> starting/aborting a transaction for each change;

Yes, I believe that's true. See the attached profiles for nextval.sql
and nextval-40.sql from master and optimized build (with the global
hash), and also a perf-diff. I only include the top 1000 lines for each
profile, that should be enough.

master - current master without patches applied
optimized - master + sequence decoding with global hash table

For nextval, there's almost no difference in the profile. Decoding the
other changes (inserts) is the dominant part, as we only log sequences
every 32 increments.

For nextval-40, the main increase is likely due to this part

  |--11.09%--seq_decode
  |     |
  |     |--9.25%--ReorderBufferQueueSequence
  |     |     |
  |     |     |--3.56%--AbortCurrentTransaction
  |     |     |    |
  |     |     |     --3.53%--AbortSubTransaction
  |     |     |        |
  |     |     |        |--0.95%--AtSubAbort_Portals
  |     |     |        |          |
  |     |     |        |           --0.83%--hash_seq_search
  |     |     |        |
  |     |     |         --0.83%--ResourceOwnerReleaseInternal
  |     |     |
  |     |     |--2.06%--BeginInternalSubTransaction
  |     |     |          |
  |     |     |           --1.10%--CommitTransactionCommand
  |     |     |                     |
  |     |     |                      --1.07%--StartSubTransaction
  |     |     |
  |     |     |--1.28%--CleanupSubTransaction
  |     |     |          |
  |     |     |           --0.64%--AtSubCleanup_Portals
  |     |     |                     |
  |     |     |                      --0.55%--hash_seq_search
  |     |     |
  |     |      --0.67%--RelidByRelfilenumber

So yeah, that's the transaction stuff in ReorderBufferQueueSequence.
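The pattern those profile frames point at can be modeled with a toy counter (the function names mimic the real ones but the bodies are of course placeholders): every non-transactional change pays for a full subtransaction start/abort cycle, so the cost scales linearly with the number of decoded increments.

```c
#include <assert.h>

/* Toy counters standing in for BeginInternalSubTransaction() and
 * AbortCurrentTransaction(); the real calls do resource-owner, portal
 * and catalog bookkeeping, which is where the profiled time goes. */
static int begin_calls = 0;
static int abort_calls = 0;

static void begin_internal_subxact(void) { begin_calls++; }
static void abort_current_xact(void)     { abort_calls++; }

/* Model of the non-transactional path in ReorderBufferQueueSequence():
 * every decoded increment pays for one full subtransaction cycle. */
static void
queue_nontransactional_changes(int nchanges)
{
    int i;

    for (i = 0; i < nchanges; i++)
    {
        begin_internal_subxact();
        /* ... look up the sequence, call rb->sequence() ... */
        abort_current_xact();
    }
}
```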

There's also a perf-diff, comparing individual functions.

> (b) for transactional
> cases, we see overhead due to traversing all the top-level txns and
> checking the hash table for each one to find whether a change is
> transactional.
> 

Not really, no. As I explained in my preceding e-mail, this check makes
almost no difference - I did expect it to matter, but it doesn't. And I
was a bit disappointed the global hash table didn't move the needle.

Most of the time is spent in

    78.81%     0.00%  postgres  postgres  [.] DecodeCommit (inlined)
      |
      ---DecodeCommit (inlined)
         |
         |--72.65%--SnapBuildCommitTxn
         |     |
         |      --72.61%--SnapBuildBuildSnapshot
         |            |
         |             --72.09%--pg_qsort
         |                    |
         |                    |--66.24%--pg_qsort
         |                    |          |

And there's almost no difference between master and build with sequence
decoding - see the attached diff-alter-sequence.perf, comparing the two
branches (perf diff -c delta-abs).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>

> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replay them on the next commit, somehow. But we did run into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.

Why can't we use the same concept as
SnapBuildDistributeNewCatalogSnapshot()? I mean, we keep queuing the
non-transactional changes (with some base snapshot taken before the
first change), and whenever there is any catalog change, we also queue a
new snapshot change in the queue of non-transactional sequence changes,
so that while sending them downstream we can switch the historic
snapshot whenever necessary.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/5/23 13:17, Amit Kapila wrote:
>
> > (b) for transactional
> > cases, we see overhead due to traversing all the top-level txns and
> > checking the hash table for each one to find whether a change is
> > transactional.
> >
>
> Not really, no. As I explained in my preceding e-mail, this check makes
> almost no difference - I did expect it to matter, but it doesn't. And I
> was a bit disappointed the global hash table didn't move the needle.
>
> Most of the time is spent in
>
>     78.81%     0.00%  postgres  postgres  [.] DecodeCommit (inlined)
>       |
>       ---DecodeCommit (inlined)
>          |
>          |--72.65%--SnapBuildCommitTxn
>          |     |
>          |      --72.61%--SnapBuildBuildSnapshot
>          |            |
>          |             --72.09%--pg_qsort
>          |                    |
>          |                    |--66.24%--pg_qsort
>          |                    |          |
>
> And there's almost no difference between master and build with sequence
> decoding - see the attached diff-alter-sequence.perf, comparing the two
> branches (perf diff -c delta-abs).
>

I think in this case the commit time predominates, which hides the
overhead. We didn't investigate in detail whether that can be improved,
but if we look at a similar case with aborts [1], it shows the overhead
of ReorderBufferSequenceIsTransactional(). I understand that aborts
won't be frequent and it is a sort of unrealistic test, but it still
helps to show that there is overhead in
ReorderBufferSequenceIsTransactional(). Now, I am not sure if we can
ignore that case because theoretically, the overhead can increase based
on the number of top-level transactions.

[1]:
https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >

I was also wondering what happens if the sequence changes are
transactional but the snap builder state changes to
SNAPBUILD_FULL_SNAPSHOT in between the processing of smgr_decode() and
seq_decode(). In that case the RelFileLocator will not have been added
to the hash table, and during seq_decode() we will consider the change
non-transactional.  I haven't fully analyzed what the real problem in
this case is, but have we considered it? What happens if a transaction
containing both ALTER SEQUENCE and nextval() gets aborted, but the
nextval() has been considered non-transactional because the
smgr_decode() changes were not processed, the snap builder state not
yet being SNAPBUILD_FULL_SNAPSHOT?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
>
> > Some time ago I floated the idea of maybe "queuing" the sequence changes
> > and only replay them on the next commit, somehow. But we did run into
> > problems with which snapshot to use, that I didn't know how to solve.
> > Maybe we should try again. The idea is we'd queue the non-transactional
> > changes somewhere (can't be in the transaction, because we must keep
> > them even if it aborts), and then "inject" them into the next commit.
> > That'd mean we wouldn't do the separate start/abort for each change.
>
> Why can't we use the same concept of
> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> non-transactional changes (have some base snapshot before the first
> change), and whenever there is any catalog change, queue new snapshot
> change also in the queue of the non-transactional sequence change so
> that while sending it to downstream whenever it is necessary we will
> change the historic snapshot?
>

Oh, do you mean maintain different historic snapshots and then switch
based on the change we are processing? I guess the other thing we need
to consider is the order of processing the changes if we maintain
separate queues that need to be processed.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/3/23 18:52, Tomas Vondra wrote:
> > ...
> >
>
> Another idea is that maybe we could somehow inform ReorderBuffer whether
> the output plugin even is interested in sequences. That'd help with
> cases where we don't even want/need to replicate sequences, e.g. because
> the publication does not specify (publish=sequence).
>
> What happens now in that case is we call ReorderBufferQueueSequence(),
> it does the whole dance with starting/aborting the transaction, calls
> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
> could just short-circuit this by asking the output plugin somehow.
>
> In an extreme case the plugin may not even specify the sequence
> callbacks, and we're still doing all of this.
>

We could explore this, but I guess it won't solve the problem we are
facing in cases where all sequences are published and the plugin has
specified the sequence callbacks. I think it would add some overhead
for this check in the positive cases where we decide to send the
changes anyway.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Why can't we use the same concept of
> > SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> > non-transactional changes (have some base snapshot before the first
> > change), and whenever there is any catalog change, queue new snapshot
> > change also in the queue of the non-transactional sequence change so
> > that while sending it to downstream whenever it is necessary we will
> > change the historic snapshot?
> >
>
> Oh, do you mean maintain different historic snapshots and then switch
> based on the change we are processing? I guess the other thing we need
> to consider is the order of processing the changes if we maintain
> separate queues that need to be processed.

I mean we will not specifically maintain the historic changes, but if
there is any catalog change where we are pushing the snapshot to all
the transactions' change queues, at the same time we will push this
snapshot into the non-transactional sequence queue as well.  I am not
sure what the problem with the ordering is, because we will be
queueing all non-transactional sequence changes in a separate queue in
the order they arrive, and as soon as we process the next commit we
will process all the non-transactional changes at that time.  Do you
see an issue with that?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/6/23 10:05, Dilip Kumar wrote:
> On Wed, Dec 6, 2023 at 11:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
> 
> I was also wondering what happens if the sequence changes are
> transactional but somehow the snap builder state changes to
> SNAPBUILD_FULL_SNAPSHOT in between processing of the smgr_decode() and
> the seq_decode() which means RelFileLocator will not be added to the
> hash table and during the seq_decode() we will consider the change as
> non-transactional.  I haven't fully analyzed what the real problem in
> this case is, but have we considered this case? What happens if
> the transaction having both ALTER SEQUENCE and nextval() gets aborted
> but the nextval() has been considered as non-transactional because
> smgr_decode() changes were not processed because snap builder state
> was not yet SNAPBUILD_FULL_SNAPSHOT.
> 

Yes, if something like this happens, that'd be a problem:

1) decoding starts, with

   SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT

2) transaction that creates a new relfilenode gets decoded, but we skip
   it because we don't have the correct snapshot

3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT

4) we decode sequence change from nextval() for the sequence

This would lead to us attempting to apply sequence change for a
relfilenode that's not visible yet (and may even get aborted).
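The scenario above can be reduced to a small simulation (a hypothetical simplification; the real code uses SnapBuild states and a dynahash table, not the toy names used here):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplification: two snapshot-builder states and a tiny
 * linear-scan "hash" of relfilenodes created by in-progress xacts. */
enum { SNAPBUILD_START = 0, SNAPBUILD_FULL_SNAPSHOT = 1 };

#define MAX_TRACKED 16
static unsigned tracked[MAX_TRACKED];
static int ntracked = 0;

/* smgr_decode(): record a newly created relfilenode, but only once the
 * snapshot is at least FULL -- mirroring the skip described above. */
static void
smgr_decode(int snapstate, unsigned relfilenode)
{
    if (snapstate < SNAPBUILD_FULL_SNAPSHOT)
        return;                 /* record skipped: this is the hazard */
    tracked[ntracked++] = relfilenode;
}

/* seq_decode(): a change is transactional iff its relfilenode is tracked. */
static bool
seq_change_is_transactional(unsigned relfilenode)
{
    int i;

    for (i = 0; i < ntracked; i++)
        if (tracked[i] == relfilenode)
            return true;
    return false;
}
```

If the smgr record falls before the snapshot reaches FULL and the nextval() record falls after, the change is misclassified as non-transactional.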

But can this even happen? Can we start decoding in the middle of a
transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
which is also skipped until SNAPBUILD_FULL_SNAPSHOT? Or logical
messages, where we also call the output plugin in non-transactional cases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/6/23 12:05, Dilip Kumar wrote:
> On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>> Why can't we use the same concept of
>>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
>>> non-transactional changes (have some base snapshot before the first
>>> change), and whenever there is any catalog change, queue new snapshot
>>> change also in the queue of the non-transactional sequence change so
>>> that while sending it to downstream whenever it is necessary we will
>>> change the historic snapshot?
>>>
>>
>> Oh, do you mean maintain different historic snapshots and then switch
>> based on the change we are processing? I guess the other thing we need
>> to consider is the order of processing the changes if we maintain
>> separate queues that need to be processed.
> 
> I mean we will not specifically maintain the historic changes, but if
> there is any catalog change where we are pushing the snapshot to all
> the transaction's change queue, at the same time we will push this
> snapshot in the non-transactional sequence queue as well.  I am not
> sure what is the problem with the ordering? because we will be
> queueing all non-transactional sequence changes in a separate queue in
> the order they arrive and as soon as we process the next commit we
> will process all the non-transactional changes at that time.  Do you
> see issue with that?
> 

Isn't this (in principle) the idea of queuing the non-transactional
changes and then applying them on the next commit? Yes, I didn't get
very far with that, but I got stuck exactly on tracking which snapshot
to use, so if there's a way to do that, that'd fix my issue.

Also, would this mean we don't need to track the relfilenodes, if we're
able to query the catalog? Would we be able to check if the relfilenode
was created by the current xact?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/6/23 11:19, Amit Kapila wrote:
> On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/3/23 18:52, Tomas Vondra wrote:
>>> ...
>>>
>>
>> Another idea is that maybe we could somehow inform ReorderBuffer whether
>> the output plugin even is interested in sequences. That'd help with
>> cases where we don't even want/need to replicate sequences, e.g. because
>> the publication does not specify (publish=sequence).
>>
>> What happens now in that case is we call ReorderBufferQueueSequence(),
>> it does the whole dance with starting/aborting the transaction, calls
>> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
>> could just short-circuit this by asking the output plugin somehow.
>>
>> In an extreme case the plugin may not even specify the sequence
>> callbacks, and we're still doing all of this.
>>
> 
> We could explore this but I guess it won't solve the problem we are
> facing in cases where all sequences are published and plugin has
> specified the sequence callbacks. I think it would add some overhead
> of this check in positive cases where we decide to anyway do send the
> changes.

Well, the idea is the check would be very simple (essentially just a
boolean flag somewhere), so not really measurable.

And if the plugin requests decoding sequences, I guess it's natural it
may have a bit of overhead. It needs to do more things, after all. It
needs to be acceptable, ofc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/6/23 09:56, Amit Kapila wrote:
> On Tue, Dec 5, 2023 at 10:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/5/23 13:17, Amit Kapila wrote:
>>
>>> (b) for transactional
>>> cases, we see overhead due to traversing all the top-level txns and
>>> checking the hash table for each one to find whether a change is
>>> transactional.
>>>
>>
>> Not really, no. As I explained in my preceding e-mail, this check makes
>> almost no difference - I did expect it to matter, but it doesn't. And I
>> was a bit disappointed the global hash table didn't move the needle.
>>
>> Most of the time is spent in
>>
>>     78.81%     0.00%  postgres  postgres  [.] DecodeCommit (inlined)
>>       |
>>       ---DecodeCommit (inlined)
>>          |
>>          |--72.65%--SnapBuildCommitTxn
>>          |     |
>>          |      --72.61%--SnapBuildBuildSnapshot
>>          |            |
>>          |             --72.09%--pg_qsort
>>          |                    |
>>          |                    |--66.24%--pg_qsort
>>          |                    |          |
>>
>> And there's almost no difference between master and build with sequence
>> decoding - see the attached diff-alter-sequence.perf, comparing the two
>> branches (perf diff -c delta-abs).
>>
> 
> I think in this the commit time predominates which hides the overhead.
> We didn't investigate in detail if that can be improved but if we see
> a similar case of abort [1], it shows the overhead of
> ReorderBufferSequenceIsTransactional(). I understand that aborts won't
> be frequent and it is sort of unrealistic test but still helps to show
> that there is overhead in ReorderBufferSequenceIsTransactional(). Now,
> I am not sure if we can ignore that case because theoretically, the
> overhead can increase based on the number of top-level transactions.
> 
> [1]:
https://www.postgresql.org/message-id/TY3PR01MB9889D457278B254CA87D1325F581A%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> 

But those profiles were with the "old" patch, with one hash table per
top-level transaction. I see nothing like that with the patch [1] that
replaces that with a single global hash table. With that patch, the
ReorderBufferSequenceIsTransactional() took ~0.5% in any tests I did.

What did have bigger impact is this:

    46.12%   1.47%  postgres [.] pg_logical_slot_get_changes_guts
      |
      |--45.12%--pg_logical_slot_get_changes_guts
      |    |
      |    |--42.34%--LogicalDecodingProcessRecord
      |    |    |
      |    |    |--12.82%--xact_decode
      |    |    |    |
      |    |    |    |--9.46%--DecodeAbort (inlined)
      |    |    |    |   |
      |    |    |    |   |--8.44%--ReorderBufferCleanupTXN
      |    |    |    |   |   |
      |    |    |    |   |   |--3.25%--ReorderBufferSequenceCleanup (in)
      |    |    |    |   |   |   |
      |    |    |    |   |   |   |--1.59%--hash_seq_search
      |    |    |    |   |   |   |
      |    |    |    |   |   |   |--0.80%--hash_search_with_hash_value
      |    |    |    |   |   |   |
      |    |    |    |   |   |    --0.59%--hash_search
      |    |    |    |   |   |              hash_bytes

I guess that could be optimized, but it's also a direct consequence of
the huge number of aborts for transactions that create a relfilenode. For
any other workload this will be negligible.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Dec 6, 2023 at 7:20 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/6/23 11:19, Amit Kapila wrote:
> > On Sun, Dec 3, 2023 at 11:56 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 12/3/23 18:52, Tomas Vondra wrote:
> >>> ...
> >>>
> >>
> >> Another idea is that maybe we could somehow inform ReorderBuffer whether
> >> the output plugin even is interested in sequences. That'd help with
> >> cases where we don't even want/need to replicate sequences, e.g. because
> >> the publication does not specify (publish=sequence).
> >>
> >> What happens now in that case is we call ReorderBufferQueueSequence(),
> >> it does the whole dance with starting/aborting the transaction, calls
> >> rb->sequence() which just does "meh" and doesn't do anything. Maybe we
> >> could just short-circuit this by asking the output plugin somehow.
> >>
> >> In an extreme case the plugin may not even specify the sequence
> >> callbacks, and we're still doing all of this.
> >>
> >
> > We could explore this but I guess it won't solve the problem we are
> > facing in cases where all sequences are published and plugin has
> > specified the sequence callbacks. I think it would add some overhead
> > of this check in positive cases where we decide to anyway do send the
> > changes.
>
> Well, the idea is the check would be very simple (essentially just a
> boolean flag somewhere), so not really measurable.
>
> And if the plugin requests decoding sequences, I guess it's natural it
> may have a bit of overhead. It needs to do more things, after all. It
> needs to be acceptable, ofc.
>

I agree with you that if it can be done cheaply or without a
measurable overhead then it would be a good idea and could serve other
purposes as well. For example, see the discussion in [1]. I had in mind
more of what the patch in [1] is doing, where it needs to start/stop a
xact, do relcache access, etc., which it seems can add some overhead if
done for each change, though I haven't measured it so I can't be sure.

[1] - https://www.postgresql.org/message-id/CAGfChW5Qo2SrjJ7rU9YYtZbRaWv6v-Z8MJn%3DdQNx4uCSqDEOHA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well.  I am not
> > sure what is the problem with the ordering? because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time.  Do you
> > see issue with that?
> >
>
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit?

Yes, it is.

> Yes, I didn't get
> very far with that, but I got stuck exactly on tracking which snapshot
> to use, so if there's a way to do that, that'd fix my issue.

Thinking more about the snapshot issue: do we even need to bother about
changing the snapshot at all while streaming the non-transactional
sequence changes, or can we send all the non-transactional changes with
a single snapshot? The snapshot logically gets changed due to these 2
events. Case 1: when any transaction that has done a catalog operation
gets committed (this changes the global snapshot). Case 2: when, within
a transaction, there is some catalog change (this just updates the
'curcid' in the base snapshot of the transaction).

Now, if we are thinking of streaming all the non-transactional
sequence changes right before the next commit, then we are not bothered
about case 1 at all, because all the changes we have queued so far are
from before this commit.  And if we come to case 2: if we are
performing a catalog change on the sequence itself, then the following
changes on the same sequence will be considered transactional; and if
the changes are just on some other catalog (not relevant to our
sequence operation), then we also should not be worried about the
command_id change, because the visibility of the catalog lookup for our
sequence will be unaffected by it.

In short, I am trying to say that we can safely queue the
non-transactional sequence changes and stream them based on the
snapshot we got when we decoded the first change, and as long as we are
planning to stream just before the next commit (or the next in-progress
stream), we don't ever need to update the snapshot.
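Under those assumptions, the queue-and-replay idea might look like this toy model (all names are hypothetical); the point is that an abort discards the transaction's own changes but leaves the non-transactional queue intact:

```c
#include <assert.h>

/* Toy model of queuing non-transactional sequence changes outside any
 * ReorderBufferTXN and replaying them at the next commit; all names
 * are hypothetical. */
#define QCAP 64
static long long queued[QCAP];
static int nqueued = 0;

static long long applied[QCAP];
static int napplied = 0;

/* A decoded non-transactional change goes to the shared queue. */
static void
queue_sequence_change(long long value)
{
    queued[nqueued++] = value;
}

/* Aborting a transaction discards its own changes, but NOT this queue. */
static void
on_abort(void)
{
}

/* The next commit replays all queued changes, in arrival order, using a
 * single snapshot taken when the first change was decoded. */
static void
on_commit(void)
{
    int i;

    for (i = 0; i < nqueued; i++)
        applied[napplied++] = queued[i];
    nqueued = 0;
}
```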

> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?

I think by querying the catalog and checking the xmin we should be
able to figure that out, but isn't that costlier than looking up the
relfilenode in the hash?  Just for identifying whether the changes are
transactional or non-transactional you would have to query the
catalog; that means for each change, before we decide whether to add
it to the transaction's change queue or the non-transactional change
queue, we would have to query the catalog, i.e. you would have to
start/stop a transaction.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Dec 6, 2023 at 7:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/6/23 12:05, Dilip Kumar wrote:
> > On Wed, Dec 6, 2023 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >>> Why can't we use the same concept of
> >>> SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
> >>> non-transactional changes (have some base snapshot before the first
> >>> change), and whenever there is any catalog change, queue new snapshot
> >>> change also in the queue of the non-transactional sequence change so
> >>> that while sending it to downstream whenever it is necessary we will
> >>> change the historic snapshot?
> >>>
> >>
> >> Oh, do you mean maintain different historic snapshots and then switch
> >> based on the change we are processing? I guess the other thing we need
> >> to consider is the order of processing the changes if we maintain
> >> separate queues that need to be processed.
> >
> > I mean we will not specifically maintain the historic changes, but if
> > there is any catalog change where we are pushing the snapshot to all
> > the transaction's change queue, at the same time we will push this
> > snapshot in the non-transactional sequence queue as well.  I am not
> > sure what is the problem with the ordering?
> >

Currently, we set up the historic snapshot before starting a
transaction to process the changes, and then apply updates to it
while processing the changes for the transaction. Now, while
processing this new queue of non-transactional sequence messages, we
probably need a separate snapshot and updates to it. So, either we
need some sort of switching between snapshots, or we do it in different
transactions.

> > because we will be
> > queueing all non-transactional sequence changes in a separate queue in
> > the order they arrive and as soon as we process the next commit we
> > will process all the non-transactional changes at that time.  Do you
> > see issue with that?
> >
>
> Isn't this (in principle) the idea of queuing the non-transactional
> changes and then applying them on the next commit? Yes, I didn't get
> very far with that, but I got stuck exactly on tracking which snapshot
> to use, so if there's a way to do that, that'd fix my issue.
>
> Also, would this mean we don't need to track the relfilenodes, if we're
> able to query the catalog? Would we be able to check if the relfilenode
> was created by the current xact?
>

I thought this new mechanism was for processing a queue of
non-transactional sequence changes. The tracking of relfilenodes is to
distinguish between transactional and non-transactional messages, so I
think we probably still need that.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

От
Dilip Kumar
Дата:
On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Yes, if something like this happens, that'd be a problem:
>
> 1) decoding starts, with
>
>    SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
>
> 2) transaction that creates a new relfilenode gets decoded, but we skip
>    it because we don't have the correct snapshot
>
> 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
>
> 4) we decode sequence change from nextval() for the sequence
>
> This would lead to us attempting to apply sequence change for a
> relfilenode that's not visible yet (and may even get aborted).
>
> But can this even happen? Can we start decoding in the middle of a
> transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> messages, where we also call the output plugin in non-transactional cases.

It's not a problem for logical messages, because whether the message is
transactional or non-transactional is decided when the message itself
is WAL-logged.  But here our problem starts with deciding whether the
change is transactional or non-transactional: if we insert the
'relfilenode' into the hash, then subsequent sequence changes in the
same transaction are considered transactional, otherwise
non-transactional.  And XLOG_HEAP2_NEW_CID just changes
snapshot->curcid, which only affects the catalog visibility of upcoming
operations in the same transaction. That's not an issue, because if
some of the changes of this transaction were seen while the snapbuild
state was < SNAPBUILD_FULL_SNAPSHOT, then this transaction has to be
committed before the state changes to SNAPBUILD_CONSISTENT_SNAPSHOT,
i.e. the commit LSN of this transaction is going to be <
start_decoding_at.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

От
Tomas Vondra
Дата:
Hi,

There's been a lot discussed over the past month or so, and it's become
difficult to get a good idea what's the current state - what issues
remain to be solved, what's unrelated to this patch, and how to move it
forward. Long-running threads tend to be confusing, so I had a short
call with Amit to discuss the current state yesterday, and to make sure
we're on the same page. I believe it was very helpful, and I've promised
to post a short summary of the call - issues, what we agreed seems like
a path forward, etc.

Obviously, I might have misunderstood something, in which case Amit can
correct me. And I'd certainly welcome opinions from others.

In general, we discussed three areas - desirability of the feature,
correctness and performance. I believe a brief summary of the agreement
would be this:

- desirability of the feature: Random IDs (UUIDs etc.) are likely a much
better solution for distributed (esp. active-active) systems. But there
are important use cases that are likely to keep using regular sequences
(online upgrades of single-node instances, existing systems, ...).

- correctness: There's one possible correctness issue, when the snapshot
changes to FULL between the record creating a sequence relfilenode and a
record advancing that sequence. This needs to be verified/reproduced, and fixed.

- performance issues: We've agreed the case with a lot of aborts (when
DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've
discussed whether the overhead with many sequence changes (nextval-40)
is acceptable, and/or how to improve it.

Next, I'll go over these points in more details, with my understanding
of what the challenges are, possible solutions etc. Most of this was
discussed/agreed on the call, but some are ideas I had only after the
call when writing this summary.


1) desirability of the feature

Firstly, do we actually want/need this feature? I believe that's very
much a question of what use cases we're targeting.

If we only focus on distributed databases (particularly those with
multiple active nodes), then we probably agree that the right solution
is to not use sequences (~generators of incrementing values) but UUIDs
or similar random identifiers (better not call them sequences, there's
not much sequential about it). The huge advantage is this does not
require replicating any state between the nodes, so logical decoding can
simply ignore them and replicate just the generated values. I don't
think there's any argument about that. If I were building such a
distributed system, I'd certainly use such random IDs.

The question is what to do about the other use cases - online upgrades
relying on logical decoding, failovers to logical replicas, and so on.
Or what to do about existing systems that can't be easily changed to use
different/random identifiers. Those are not really distributed systems
and therefore don't quite need random IDs.

Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can
easily lead to massive write amplification, for example. There are
variants like UUIDv7 that reduce the impact, but there are other trade-offs.

My takeaway from this is there's still value in having this feature.


2) correctness

The only correctness issue I'm aware of is the question what happens
when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding
the relfilenode creation and the sequence increment, pointed out by
Dilip in [1].

If this happens (and while I don't have a reproducer, I also don't have
a very clear idea why it couldn't happen), it breaks how the patch
decides between transactional and non-transactional sequence changes.

So this seems like a fatal flaw - it definitely needs to be solved. I
don't have a good idea how to do that, unfortunately. The problem is the
dependency on an earlier record, and that this needs to be evaluated
immediately (in the decode phase). Logical messages don't have the same
issue because the "transactional" flag does not depend on earlier stuff,
and other records are not interpreted until apply/commit, when we know
everything relevant was decoded.

I don't know what the solution is. Either we find a way to make sure not
to lose/skip the smgr record, or we need to rethink how we determine the
transactional flag (perhaps even try again adding it to the WAL record,
but we didn't find a way to do that earlier).


3) performance issues

We have discussed two cases - "ddl-abort" and "nextval-40".

The "ddl-abort" is when the workload does a lot of DDL and then aborts
them, leading to profiles dominated by DecodeCommit. The agreement here
is that while this is a valid issue and we should try fixing it, it's
unrelated to this patch. The issue exists even on master. So in the
context of this patch we can ignore this issue.

The "nextval-40" applies to workloads doing a lot of regular sequence
changes. We only decode/apply changes written to WAL, and that happens
only for every 32 increments or so. The test was with a very simple
transaction (just a sequence advance that writes WAL + a 1-row insert),
which means it's pretty much a worst-case impact. For larger transactions,
it's going to be hardly measurable. Also, this only measured decoding,
not apply (which also will make this less significant).

Most of the overhead comes from ReorderBufferQueueSequence() starting
and then aborting a transaction, per the profile in [2]. This only
happens in the non-transactional case, but we expect that to be the
common case in practice.
Anyway, let's say we want to mitigate this overhead. I think there are
three ways to do that:


a) find a way to not have to apply sequence changes immediately, but
queue them until the next commit

This would give a chance to combine multiple sequence changes into a
single "replay change", reducing the overhead. There are a couple of
problems with this, though. Firstly, it can't help OLTP workloads,
because the transactions are short, so sequence changes are unlikely to
combine. It's also not clear how expensive this would be - could it be
expensive enough to outweigh the benefits?

All of this assumes it can be implemented - we don't have such a patch
yet. I was speculating about something like this earlier, but I haven't
managed to make it work. Doesn't mean it's impossible, ofc.


b) provide a way for the output plugin to skip sequence decoding early

The way the decoding is coded now, ReorderBufferQueueSequence does all
the expensive dance even if the output plugin does not implement the
sequence callbacks.

Maybe we should have a way to allow skipping all of this early, right at
the beginning of ReorderBufferQueueSequence (and thus before we even try
to start/abort the transaction).

Ofc, this is not a perfect solution either - it won't help workloads
that actually need/want sequence decoding but where the decoding still
has significant overhead, nor plugins that choose to support decoding
sequences in general. For example, the built-in output
plugin would certainly support sequences - and the overhead would still
be there (even if no sequences are added to the publication).


c) instruct people to increase the sequence cache from 32 to 1024

This would reduce the number of WAL messages that need to be decoded and
replayed, reducing the overhead proportionally. Of course, this also
means the sequence will "jump forward" more in case of crash or failover
to the logical replica, but I think that's acceptable tradeoff. People
should not expect sequences to be gap-less anyway.

Considering nextval-40 is pretty much a worst-case behavior, I think
this might actually be an acceptable solution/workaround.


regards

[1]
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/0bc34f71-7745-dc16-d765-5ba1f0776a3f%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

От
Amit Kapila
Дата:
On Thu, Dec 7, 2023 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 6, 2023 at 7:09 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > Yes, if something like this happens, that'd be a problem:
> >
> > 1) decoding starts, with
> >
> >    SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT
> >
> > 2) transaction that creates a new relfilenode gets decoded, but we skip
> >    it because we don't have the correct snapshot
> >
> > 3) snapshot changes to SNAPBUILD_FULL_SNAPSHOT
> >
> > 4) we decode sequence change from nextval() for the sequence
> >
> > This would lead to us attempting to apply sequence change for a
> > relfilenode that's not visible yet (and may even get aborted).
> >
> > But can this even happen? Can we start decoding in the middle of a
> > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > messages, where we also call the output plugin in non-transactional cases.
>
> It's not a problem for logical messages because whether the message is
> transaction or non-transactional is decided while WAL logs the message
> itself.  But here our problem starts with deciding whether the change
> is transactional vs non-transactional, because if we insert the
> 'relfilenode' in hash then the subsequent sequence change in the same
> transaction would be considered transactional otherwise
> non-transactional.
>

It is correct that we can make a wrong decision about whether a change
is transactional or non-transactional when sequence DDL happens before
the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
after that state. However, one thing to note here is that we won't try
to stream such a change because for non-transactional cases we don't
proceed unless the snapshot is in a consistent state. Now, if the
decision had been correct, then we would probably have queued the
sequence change and discarded it at commit.

One way we deviate here is that for non-sequence transactional
cases (including logical messages), we immediately start queuing the
changes as soon as we reach the SNAPBUILD_FULL_SNAPSHOT state (provided
SnapBuildProcessChange() returns true, which is quite possible) and
take the final decision at commit/prepare/abort time. However, that
won't be the case for sequences, because determining the transactional
case depends on one of the prior records. Now, I am not
completely sure at this stage whether such a deviation can cause any
problem or whether we are okay to have such a deviation for
sequences.

--
With Regards,
Amit Kapila.



RE: logical decoding and replication of sequences, take 2

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state.

I found a workload which the decoder classifies wrongly.

# Prerequisite

Apply the attached patch for inspecting the sequence status. It can be applied atop the v20231203 patch set.
Also, a table and a sequence must be defined:

```
CREATE TABLE foo (var int);
CREATE SEQUENCE s;
```

# Workload

Then, you can execute concurrent transactions from three clients like below:

Client-1

BEGIN;
INSERT INTO foo VALUES (1);

            Client-2

            SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

                        Client-3

                        BEGIN;
                        ALTER SEQUENCE s MAXVALUE 5000;
COMMIT;
                        SAVEPOINT s1;
                        SELECT setval('s', 2000);
                        ROLLBACK;

            SELECT pg_logical_slot_get_changes('slot', 'test_decoding');

# Result and analysis

At first, the lines below are output in the log. They show that WAL records
for ALTER SEQUENCE were decoded but skipped because the snapshot was still being built.

```
...
LOG:  logical decoding found initial starting point at 0/154D238
DETAIL:  Waiting for transactions (approximately 1) older than 741 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: smgr_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: skipped
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_BUILDING_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: skipped
...
```

Note that the above `seq_decode...` line was not output by `setval()`; it was
emitted by the ALTER SEQUENCE statement. Below is the call stack for inserting the WAL record.

```
XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
fill_seq_fork_with_data
fill_seq_with_data
AlterSequence
```

Then, the subsequent lines look like the following. This means that the snapshot
has become FULL and `setval()` is wrongly regarded as non-transactional.

```
LOG:  logical decoding found initial consistent point at 0/154D658
DETAIL:  Waiting for transactions (approximately 1) older than 742 to end.
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: seq_decode. snapshot is SNAPBUILD_FULL_SNAPSHOT
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: the sequence is non-transactional
STATEMENT:  SELECT * FROM pg_create_logical_replication_slot('slot', 'test_decoding');
LOG:  XXX: not consistent: skipped
```

The change is then discarded by the code below, because the snapshot has not
yet become CONSISTENT. Had it been treated as transactional, we would have
queued this change, though the transaction would be skipped at commit.

```
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;
```

But anyway, we found a case where we make a wrong decision. This example
is lucky - it does not produce wrong output - but I'm not sure all such cases are.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: logical decoding and replication of sequences, take 2

От
Dilip Kumar
Дата:
On Wed, Dec 13, 2023 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > > But can this even happen? Can we start decoding in the middle of a
> > > transaction? How come this wouldn't affect e.g. XLOG_HEAP2_NEW_CID,
> > > which is also skipped until SNAPBUILD_FULL_SNAPSHOT. Or logical
> > > messages, where we also call the output plugin in non-transactional cases.
> >
> > It's not a problem for logical messages because whether the message is
> > transaction or non-transactional is decided while WAL logs the message
> > itself.  But here our problem starts with deciding whether the change
> > is transactional vs non-transactional, because if we insert the
> > 'relfilenode' in hash then the subsequent sequence change in the same
> > transaction would be considered transactional otherwise
> > non-transactional.
> >
>
> It is correct that we can make a wrong decision about whether a change
> is transactional or non-transactional when sequence DDL happens before
> the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> after that state. However, one thing to note here is that we won't try
> to stream such a change because for non-transactional cases we don't
> proceed unless the snapshot is in a consistent state. Now, if the
> decision had been correct then we would probably have queued the
> sequence change and discarded at commit.
>
> One thing that we deviate here is that for non-sequence transactional
> cases (including logical messages), we immediately start queuing the
> changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> SnapBuildProcessChange() returns true which is quite possible) and
> take final decision at commit/prepare/abort time. However, that won't
> be the case for sequences because of the dependency of determining
> transactional cases on one of the prior records. Now, I am not
> completely sure at this stage if such a deviation can cause any
> problem and or whether we are okay to have such a deviation for
> sequences.

Okay, so this particular scenario that I raised is somehow saved: I
mean, although we are treating a transactional sequence operation as
non-transactional, we also know that if some of the changes for a
transaction are skipped because the snapshot was not FULL, then that
transaction cannot be streamed, because it has to be committed before
the snapshot becomes CONSISTENT (based on the snapshot state change
machinery).  By the same logic - the snapshot is not consistent - the
non-transactional sequence changes are also skipped.  But the only
thing that makes me a bit uncomfortable is that even though the result
is not wrong, we have made some wrong intermediate decisions, i.e.
considered a transactional change non-transactional.

One solution to this issue is to add the sequence relids to the hash
even if the snapshot state has not reached FULL - that hash is only
maintained for deciding whether the sequence was changed in that
transaction or not.  So not adding such relids to the hash seems like
the root cause of the issue.  Honestly, I haven't analyzed this idea in
detail - how easy it would be to add only these changes to the hash and
what the other dependencies are - but this seems like a worthwhile
direction IMHO.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

От
Ashutosh Bapat
Дата:
On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> > >
> >
> > It is correct that we can make a wrong decision about whether a change
> > is transactional or non-transactional when sequence DDL happens before
> > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > after that state. However, one thing to note here is that we won't try
> > to stream such a change because for non-transactional cases we don't
> > proceed unless the snapshot is in a consistent state. Now, if the
> > decision had been correct then we would probably have queued the
> > sequence change and discarded at commit.
> >
> > One thing that we deviate here is that for non-sequence transactional
> > cases (including logical messages), we immediately start queuing the
> > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > SnapBuildProcessChange() returns true which is quite possible) and
> > take final decision at commit/prepare/abort time. However, that won't
> > be the case for sequences because of the dependency of determining
> > transactional cases on one of the prior records. Now, I am not
> > completely sure at this stage if such a deviation can cause any
> > problem and or whether we are okay to have such a deviation for
> > sequences.
>
> Okay, so this particular scenario that I raised is somehow saved, I
> mean although we are considering transactional sequence operation as
> non-transactional we also know that if some of the changes for a
> transaction are skipped because the snapshot was not FULL that means
> that transaction can not be streamed because that transaction has to
> be committed before snapshot become CONSISTENT (based on the snapshot
> state change machinery).  Ideally based on the same logic that the
> snapshot is not consistent the non-transactional sequence changes are
> also skipped.  But the only thing that makes me a bit uncomfortable is
> that even though the result is not wrong we have made some wrong
> intermediate decisions i.e. considered transactional change as
> non-transactions.
>
> One solution to this issue is that, even if the snapshot state does
> not reach FULL just add the sequence relids to the hash, I mean that
> hash is only maintained for deciding whether the sequence is changed
> in that transaction or not.  So no adding such relids to hash seems
> like a root cause of the issue.  Honestly, I haven't analyzed this
> idea in detail about how easy it would be to add only these changes to
> the hash and what are the other dependencies, but this seems like a
> worthwhile direction IMHO.

I also thought about the same solution. I tried this solution as the
attached patch on top of Hayato's diagnostic changes. The following log
messages are seen in the server error log. They indicate that the
sequence change was correctly deemed a transactional change (see the
"XXX: the sequence is transactional" line below).
2023-12-14 12:12:50.550 IST [321229] ERROR: relation
"pg_replication_slot" does not exist at character 15
2023-12-14 12:12:50.550 IST [321229] STATEMENT: select * from
pg_replication_slot;
2023-12-14 12:12:57.289 IST [321229] LOG: logical decoding found
initial starting point at 0/1598D50
2023-12-14 12:12:57.289 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 759 to end.
2023-12-14 12:12:57.289 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: smgr_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.551 IST [321229] LOG: XXX: skipped
2023-12-14 12:13:49.551 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:13:49.552 IST [321229] LOG: logical decoding found
initial consistent point at 0/1599170
2023-12-14 12:13:49.552 IST [321229] DETAIL: Waiting for transactions
(approximately 1) older than 760 to end.
2023-12-14 12:13:49.552 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: seq_decode. snapshot is
SNAPBUILD_FULL_SNAPSHOT
2023-12-14 12:14:55.591 IST [321230] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.591 IST [321229] LOG: XXX: the sequence is transactional
2023-12-14 12:14:55.591 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 12:14:55.813 IST [321229] LOG: logical decoding found
consistent point at 0/15992E8
2023-12-14 12:14:55.813 IST [321229] DETAIL: There are no running transactions.
2023-12-14 12:14:55.813 IST [321229] STATEMENT: SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');

It looks like the solution works. This is the only place where we
process a change before the snapshot reaches FULL, but it is also the
only record which affects the decision whether to queue a following
change, so it should be ok. The sequence hashes are separate for each
transaction and they are cleaned up when processing the COMMIT record.
So I think we don't have any side effects of adding a relfilenode to
the sequence hash even though the snapshot is not FULL.



As a side note
1. the prologue of ReorderBufferSequenceCleanup() mentions only abort,
but this function will be called for COMMIT as well. Prologue needs to
be fixed.
2. Now that sequence hashes are per transaction, do we need
ReorderBufferTXN in ReorderBufferSequenceEnt?


--
Best Wishes,
Ashutosh Bapat



Re: logical decoding and replication of sequences, take 2

От
Dilip Kumar
Дата:
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction can not be streamed because that transaction has to
> > be committed before snapshot become CONSISTENT (based on the snapshot
> > state change machinery).  Ideally based on the same logic that the
> > snapshot is not consistent the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong we have made some wrong
> > intermediate decisions i.e. considered transactional change as
> > non-transactions.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So no adding such relids to hash seems
> > like a root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
> I also thought about the same solution. I tried this solution as the
> attached patch on top of Hayato's diagnostic changes.

I think you forgot to attach the patch.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

От
Amit Kapila
Дата:
On Thu, Dec 14, 2023 at 12:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > >
> > >
> > > It is correct that we can make a wrong decision about whether a change
> > > is transactional or non-transactional when sequence DDL happens before
> > > the SNAPBUILD_FULL_SNAPSHOT state and the sequence operation happens
> > > after that state. However, one thing to note here is that we won't try
> > > to stream such a change because for non-transactional cases we don't
> > > proceed unless the snapshot is in a consistent state. Now, if the
> > > decision had been correct then we would probably have queued the
> > > sequence change and discarded at commit.
> > >
> > > One thing that we deviate here is that for non-sequence transactional
> > > cases (including logical messages), we immediately start queuing the
> > > changes as soon as we reach SNAPBUILD_FULL_SNAPSHOT state (provided
> > > SnapBuildProcessChange() returns true which is quite possible) and
> > > take final decision at commit/prepare/abort time. However, that won't
> > > be the case for sequences because of the dependency of determining
> > > transactional cases on one of the prior records. Now, I am not
> > > completely sure at this stage if such a deviation can cause any
> > > problem and or whether we are okay to have such a deviation for
> > > sequences.
> >
> > Okay, so this particular scenario that I raised is somehow saved, I
> > mean although we are considering transactional sequence operation as
> > non-transactional we also know that if some of the changes for a
> > transaction are skipped because the snapshot was not FULL that means
> > that transaction can not be streamed because that transaction has to
> > be committed before snapshot become CONSISTENT (based on the snapshot
> > state change machinery).  Ideally based on the same logic that the
> > snapshot is not consistent the non-transactional sequence changes are
> > also skipped.  But the only thing that makes me a bit uncomfortable is
> > that even though the result is not wrong we have made some wrong
> > intermediate decisions, i.e. considered a transactional change as
> > non-transactional.
> >
> > One solution to this issue is that, even if the snapshot state does
> > not reach FULL just add the sequence relids to the hash, I mean that
> > hash is only maintained for deciding whether the sequence is changed
> > in that transaction or not.  So not adding such relids to the hash
> > seems like the root cause of the issue.  Honestly, I haven't analyzed this
> > idea in detail about how easy it would be to add only these changes to
> > the hash and what are the other dependencies, but this seems like a
> > worthwhile direction IMHO.
>
>
...
> It looks like the solution works. But this is the only place where we
> process a change before the SNAPSHOT reaches FULL. But this is also the
> only record which affects a decision to queue (or not) a following change.
> So it should be ok. The sequence_hash'es are separate for each
> transaction and they are cleaned when processing the COMMIT record.
>


But it is possible that even commit or abort also happens before the
snapshot reaches full state in which case the hash table will have
stale or invalid (for aborts) entries. That will probably be cleaned
at a later point by running_xact records. Now, I think in theory, it
is possible that the same RelFileLocator can again be allocated before
we clean up the existing entry which can probably confuse the system.
It might or might not be a problem in practice but I think the more
assumptions we add for sequences, the more difficult it will become to
ensure its correctness.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I think you forgot to attach the patch.

Sorry. Here it is.

On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> It looks like the solution works. But this is the only place where we
> process a change before SNAPSHOT reaches FULL. But this is also the
> only record which affects a decision to queue/not a following change.
> So it should be ok. The sequence_hash'es as separate for each
> transaction and they are cleaned when processing COMMIT record.
> >
>
> But it is possible that even commit or abort also happens before the
> snapshot reaches full state in which case the hash table will have
> stale or invalid (for aborts) entries. That will probably be cleaned
> at a later point by running_xact records.

Why would cleaning wait till running_xact records? Won't the txn entry
itself be removed when processing the commit/abort record? At the same
time the sequence hash will be cleaned as well.

> Now, I think in theory, it
> is possible that the same RelFileLocator can again be allocated before
> we clean up the existing entry which can probably confuse the system.

How? The transaction that allocated it the first time would be cleaned
up before the allocation happens a second time. So it shouldn't matter.

--
Best Wishes,
Ashutosh Bapat

Attachments

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I think you forgot to attach the patch.
>
> Sorry. Here it is.
>
> On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > It looks like the solution works. But this is the only place where we
> > process a change before SNAPSHOT reaches FULL. But this is also the
> > only record which affects a decision to queue/not a following change.
> > So it should be ok. The sequence_hash'es as separate for each
> > transaction and they are cleaned when processing COMMIT record.
> > >
> >
> > But it is possible that even commit or abort also happens before the
> > snapshot reaches full state in which case the hash table will have
> > stale or invalid (for aborts) entries. That will probably be cleaned
> > at a later point by running_xact records.
>
> Why would cleaning wait till running_xact records? Won't txn entry
> itself be removed when processing commit/abort record? At the same the
> sequence hash will be cleaned as well.
>
> > Now, I think in theory, it
> > is possible that the same RelFileLocator can again be allocated before
> > we clean up the existing entry which can probably confuse the system.
>
> How? The transaction allocating the first time would be cleaned before
> it happens the second time. So shouldn't matter.
>

It can only be cleaned if we process it but xact_decode won't allow us
to process it and I don't think it would be a good idea to add another
hack for sequences here. See below code:

xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
    SnapBuild  *builder = ctx->snapshot_builder;
    ReorderBuffer *reorder = ctx->reorder;
    XLogReaderState *r = buf->record;
    uint8       info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;

    /*
     * If the snapshot isn't yet fully built, we cannot decode anything, so
     * bail out.
     */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
        return;


--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Ashutosh Bapat
Date:
On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 2:45 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Dec 14, 2023 at 12:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I think you forgot to attach the patch.
> >
> > Sorry. Here it is.
> >
> > On Thu, Dec 14, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > It looks like the solution works. But this is the only place where we
> > > process a change before SNAPSHOT reaches FULL. But this is also the
> > > only record which affects a decision to queue/not a following change.
> > > So it should be ok. The sequence_hash'es as separate for each
> > > transaction and they are cleaned when processing COMMIT record.
> > > >
> > >
> > > But it is possible that even commit or abort also happens before the
> > > snapshot reaches full state in which case the hash table will have
> > > stale or invalid (for aborts) entries. That will probably be cleaned
> > > at a later point by running_xact records.
> >
> > Why would cleaning wait till running_xact records? Won't txn entry
> > itself be removed when processing commit/abort record? At the same the
> > sequence hash will be cleaned as well.
> >
> > > Now, I think in theory, it
> > > is possible that the same RelFileLocator can again be allocated before
> > > we clean up the existing entry which can probably confuse the system.
> >
> > How? The transaction allocating the first time would be cleaned before
> > it happens the second time. So shouldn't matter.
> >
>
> It can only be cleaned if we process it but xact_decode won't allow us
> to process it and I don't think it would be a good idea to add another
> hack for sequences here. See below code:
>
> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> {
> SnapBuild  *builder = ctx->snapshot_builder;
> ReorderBuffer *reorder = ctx->reorder;
> XLogReaderState *r = buf->record;
> uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>
> /*
> * If the snapshot isn't yet fully built, we cannot decode anything, so
> * bail out.
> */
> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
> return;

That may be true for a transaction which is decoded, but I think all
the transactions which are added to ReorderBuffer should be cleaned up
once they have been processed irrespective of whether they are
decoded/sent downstream or not. In this case I see the sequence hash
being cleaned up for the sequence related transaction in Hayato's
reproducer. See attached patch with a diagnostic change and the output
below (notice sequence cleanup called on transaction 767).
2023-12-14 21:06:36.756 IST [386957] LOG:  logical decoding found
initial starting point at 0/15B2F68
2023-12-14 21:06:36.756 IST [386957] DETAIL:  Waiting for transactions
(approximately 1) older than 767 to end.
2023-12-14 21:06:36.756 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: smgr_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: seq_decode. snapshot
is SNAPBUILD_BUILDING_SNAPSHOT
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.679 IST [386957] LOG:  XXX: skipped
2023-12-14 21:07:05.679 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:05.710 IST [386957] LOG:  logical decoding found
initial consistent point at 0/15B3388
2023-12-14 21:07:05.710 IST [386957] DETAIL:  Waiting for transactions
(approximately 1) older than 768 to end.
2023-12-14 21:07:05.710 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:39.292 IST [386298] LOG:  checkpoint starting: time
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: seq_decode. snapshot
is SNAPBUILD_FULL_SNAPSHOT
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  XXX: the sequence is transactional
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  sequence cleanup called on
transaction 767
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');
2023-12-14 21:07:40.919 IST [386957] LOG:  logical decoding found
consistent point at 0/15B3518
2023-12-14 21:07:40.919 IST [386957] DETAIL:  There are no running transactions.
2023-12-14 21:07:40.919 IST [386957] STATEMENT:  SELECT
pg_create_logical_replication_slot('slot', 'test_decoding');

We see similar output when pg_logical_slot_get_changes() is called.

I haven't found the code path from where the sequence cleanup gets
called. But it's being called. Am I missing something?

--
Best Wishes,
Ashutosh Bapat

Attachments

Re: logical decoding and replication of sequences, take 2

From
"Euler Taveira"
Date:
On Thu, Dec 14, 2023, at 12:44 PM, Ashutosh Bapat wrote:
> I haven't found the code path from where the sequence cleanup gets
> called. But it's being called. Am I missing something?

ReorderBufferCleanupTXN.


--
Euler Taveira

Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It can only be cleaned if we process it but xact_decode won't allow us
> > to process it and I don't think it would be a good idea to add another
> > hack for sequences here. See below code:
> >
> > xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> > {
> > SnapBuild  *builder = ctx->snapshot_builder;
> > ReorderBuffer *reorder = ctx->reorder;
> > XLogReaderState *r = buf->record;
> > uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
> >
> > /*
> > * If the snapshot isn't yet fully built, we cannot decode anything, so
> > * bail out.
> > */
> > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
> > return;
>
> That may be true for a transaction which is decoded, but I think all
> the transactions which are added to ReorderBuffer should be cleaned up
> once they have been processed irrespective of whether they are
> decoded/sent downstream or not. In this case I see the sequence hash
> being cleaned up for the sequence related transaction in Hayato's
> reproducer.
>

It was because the test you are using was not designed to show the
problem I mentioned. In this case, the rollback was after a full
snapshot state was reached.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Christophe Pettus
Date:
Hi,

I wanted to hop in here on one particular issue:

> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> better solution for distributed (esp. active-active) systems. But there
> are important use cases that are likely to keep using regular sequences
> (online upgrades of single-node instances, existing systems, ...).

+1.

Right now, the lack of sequence replication is a rather large foot-gun on logical replication upgrades.  Copying the
sequences over during the cutover period is doable, of course, but:

(a) There's no out-of-the-box tooling that does it, so everyone has to write some scripts just for that one function.
(b) It's one more thing that extends the cutover window.

I don't think it is a good idea to make it mandatory: for example, there's a strong use case for replicating a table
but not a sequence associated with it.  But it's definitely a missing feature in logical replication.


Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/19/23 13:54, Christophe Pettus wrote:
> Hi,
> 
> I wanted to hop in here on one particular issue:
> 
>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>> better solution for distributed (esp. active-active) systems. But there
>> are important use cases that are likely to keep using regular sequences
>> (online upgrades of single-node instances, existing systems, ...).
> 
> +1.
> 
> Right now, the lack of sequence replication is a rather large 
> foot-gun on logical replication upgrades.  Copying the sequences
> over during the cutover period is doable, of course, but:
> 
> (a) There's no out-of-the-box tooling that does it, so everyone has
> to write some scripts just for that one function.
>
> (b) It's one more thing that extends the cutover window.
> 

I agree it's an annoying gap for this use case. But if this is the only
use cases, maybe a better solution would be to provide such tooling
instead of adding it to the logical decoding?

It might seem a bit strange if most data is copied by replication
directly, while sequences need special handling, ofc.

> I don't think it is a good idea to make it mandatory: for example, 
> there's a strong use case for replicating a table but not a sequence 
> associated with it.  But it's definitely a missing feature in
> logical replication.

I don't think the plan was to make replication of sequences mandatory,
certainly not with the built-in replication. If you don't add sequences
to the publication, the sequence changes will be skipped.

But it still needs to be part of the decoding, which adds overhead for
all logical decoding uses, even if the sequence changes end up being
discarded. That's somewhat annoying, especially considering sequences
are fairly common part of the WAL stream.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 12/15/23 03:33, Amit Kapila wrote:
> On Thu, Dec 14, 2023 at 9:14 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Thu, Dec 14, 2023 at 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> It can only be cleaned if we process it but xact_decode won't allow us
>>> to process it and I don't think it would be a good idea to add another
>>> hack for sequences here. See below code:
>>>
>>> xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>> {
>>> SnapBuild  *builder = ctx->snapshot_builder;
>>> ReorderBuffer *reorder = ctx->reorder;
>>> XLogReaderState *r = buf->record;
>>> uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
>>>
>>> /*
>>> * If the snapshot isn't yet fully built, we cannot decode anything, so
>>> * bail out.
>>> */
>>> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>>> return;
>>
>> That may be true for a transaction which is decoded, but I think all
>> the transactions which are added to ReorderBuffer should be cleaned up
>> once they have been processed irrespective of whether they are
>> decoded/sent downstream or not. In this case I see the sequence hash
>> being cleaned up for the sequence related transaction in Hayato's
>> reproducer.
>>
> 
> It was because the test you are using was not designed to show the
> problem I mentioned. In this case, the rollback was after a full
> snapshot state was reached.
> 

Right, I haven't tried to reproduce this, but it very much looks like
the entry would not be removed if the xact aborts/commits before the
snapshot reaches FULL state.

I suppose one way to deal with this would be to first check if an entry
for the same relfilenode exists. If it does, the original transaction
must have terminated, but we haven't cleaned it up yet - in which case
we can just "move" the relfilenode to the new one.

However, can't that happen even with full snapshots? I mean, let's say a
transaction creates a relfilenode and terminates without writing an
abort record (surely that's possible, right?). And then another xact
comes and generates the same relfilenode (presumably that's unlikely,
but perhaps possible?). Aren't we in pretty much the same situation,
until the next RUNNING_XACTS cleans up the hash table?


I think tracking all relfilenodes would fix the original issue (with
treating some changes as transactional), and the tweak that "moves" the
relfilenode to the new xact would fix this other issue too.

That being said, I feel a bit uneasy about it, for similar reasons as
Amit. If we start processing records before full snapshot, that seems
like moving the assumptions a bit. For example it means we'd create
ReorderBufferTXN entries for cases that'd have skipped before. OTOH this
is (or should be) only a very temporary period while starting the
replication, I believe.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
Hi,

Here's new version of this patch series. It rebases the 2023/12/03
version, and there's a couple improvements to address the performance
and correctness questions.

Since the 2023/12/03 version was posted, there were a couple off-list
discussions with several people - with Amit, as mentioned in [1], and
then also internally and at pgconf.eu.

My personal (very brief) takeaway from these discussions is this:

1) desirability: We want a built-in way to handle sequences in logical
replication. I think everyone agrees this is not a way to do distributed
sequences in an active-active setups, but that there are other use cases
that need this feature - typically upgrades / logical failover.

Multiple approaches were discussed (support in logical replication or a
separate tool to be executed on the logical replica). Both might work,
people usually end up with some sort of custom tool anyway. But it's
cumbersome, and the consensus seems the logical rep feature is better.


2) performance: There was concern about the performance impact, and that
it affects everyone, including those who don't replicate sequences (as
the overhead is mostly incurred before calls to output plugin etc.).

I do agree with this, but I don't think sequences can be decoded in a
much cheaper way. There was a proposal [2] that maybe we could batch the
non-transactional sequences changes in the "next" transaction, and
distribute them similarly to SnapBuildDistributeNewCatalogSnapshot()
distributes catalog snapshots.

But I doubt that'd actually work. Or more precisely - if we can make the
code work, I think it would not solve the issue for some common cases.
Consider for example a case with many concurrent top-level transactions,
making this quite expensive. And I'd bet sequence changes are far more
common than catalog changes.

However, I think we ultimately agreed that the overhead is acceptable if
it only applies to use cases that actually need to decode sequences. So
if there was a way to skip sequence decoding when not necessary, that
would work. Unfortunately, that can't be based on simply checking which
callbacks are defined by the output plugin, because e.g. pgoutput needs
to handle both cases (so the callbacks need to be defined). Nor can it
be determined based on what's included in the publication (as that's not
available that early).

The agreement was that the best way is to have a CREATE SUBSCRIPTION
option that would instruct the upstream to decode sequences. By default
this option is 'off' (because that's the no-overhead case), but it can
be enabled for each subscription.

This is what 0005 implements, and interestingly enough, this is what an
earlier version [3] from 2023/04/02 did.

This means that if you add a sequence to the publication, but leave
"sequences=off" in CREATE SUBSCRIPTION, the sequence won't be replicated
after all. That may seem a bit surprising, and I don't like it, but I
don't think there's a better way to do this.


3) correctness: The last point is about making "transactional" flag
correct when the snapshot state changes mid-transaction, originally
pointed out by Dilip [4]. Per [5] this however happens to work
correctly, because while we identify the change as 'non-transactional'
(which is incorrect), we immediately throw it away (so we don't try to
apply it, which would error-out).

One option would be to document/describe this in the comments, per 0006.
This means that when ReorderBufferSequenceIsTransactional() returns
true, it's correct. But if it returns 'false', it means 'maybe'. I agree
it seems a bit strange, but with the extra comments I think it's OK. It
simply means that if we get transactional=false incorrectly, we're
guaranteed to not process it. Maybe we could rename the function to make
this clear from the name.

The other solution proposed in the thread [6] was to always decode the
relfilenode, and add it to the hash table. 0007 does this, and it works.
But I agree this seems possibly worse than 0006 - it means we may be
adding entries to the hash table, and it's not clear when exactly we'll
clean them up etc. It'd be the only place processing stuff before the
snapshot reaches FULL.

I personally would go with 0006, i.e. just explaining why doing it this
way is correct.


regards


[1]
https://www.postgresql.org/message-id/12822961-b7de-9d59-dd27-2e3dc3980c7e%40enterprisedb.com

[2]
https://www.postgresql.org/message-id/CAFiTN-vm3-bGfm-uJdzRLERMHozW8xjZHu4rdmtWR-rP-SJYMQ%40mail.gmail.com

[3]
https://www.postgresql.org/message-id/1f96b282-cb90-8302-cee8-7b3f5576a31c%40enterprisedb.com

[4]
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com

[5]
https://www.postgresql.org/message-id/CAA4eK1LFise9iN%2BNN%3Dagrk4prR1qD%2BebvzNjKAWUog2%2Bhy3HxQ%40mail.gmail.com

[6]
https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> 1) desirability: We want a built-in way to handle sequences in logical
> replication. I think everyone agrees this is not a way to do distributed
> sequences in an active-active setups, but that there are other use cases
> that need this feature - typically upgrades / logical failover.

Yeah. I find it extremely hard to take seriously the idea that this
isn't a valuable feature. How else are you supposed to do a logical
failover without having your entire application break?

> 2) performance: There was concern about the performance impact, and that
> it affects everyone, including those who don't replicate sequences (as
> the overhead is mostly incurred before calls to output plugin etc.).
>
> The agreement was that the best way is to have a CREATE SUBSCRIPTION
> option that would instruct the upstream to decode sequences. By default
> this option is 'off' (because that's the no-overhead case), but it can
> be enabled for each subscription.

Seems reasonable, at least unless and until we come up with something better.

> 3) correctness: The last point is about making "transactional" flag
> correct when the snapshot state changes mid-transaction, originally
> pointed out by Dilip [4]. Per [5] this however happens to work
> correctly, because while we identify the change as 'non-transactional'
> (which is incorrect), we immediately throw it away (so we don't try to
> apply it, which would error-out).

I've said this before, but I still find this really scary. It's
unclear to me that we can simply classify updates as transactional or
non-transactional and expect things to work. If it's possible, I hope
we have a really good explanation somewhere of how and why it's
possible. If we do, can somebody point me to it so I can read it?

To be possibly slightly more clear about my concern, I think the scary
case is where we have transactional and non-transactional things
happening to the same sequence in close temporal proximity, either
within the same session or across two or more sessions.  If a
non-transactional change can get reordered ahead of some transactional
change upon which it logically depends, or behind some transactional
change that logically depends on it, then we have trouble. I also
wonder if there are any cases where the same operation is partly
transactional and partly non-transactional.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in an active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
> 
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
> 
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
> 
> Seems reasonable, at least unless and until we come up with something better.
> 
>> 3) correctness: The last point is about making "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it away (so we don't try to
>> apply it, which would error-out).
> 
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
> 

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there
should be a separate README for this - I haven't found anything like
that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions.  If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
> 

I certainly understand this concern, and to some extent I even share it.
Having to differentiate between transactional and non-transactional
changes certainly confused me more than once. It's especially confusing,
because the decoding implicitly changes the perceived ordering/atomicity
of the events.

That being said, I don't think it gets reordered the way you're concerned
about. The "transactionality" is determined by relfilenode change, so
how could the reordering happen? We'd have to misidentify change in
either direction - and for nontransactional->transactional change that's
clearly not possible. There has to be a new relfilenode in that xact.

In the other direction (transactional->nontransactional), it can happen
if we fail to decode the relfilenode record. Which is what we discussed
earlier, but came to the conclusion that it actually works OK.

Of course, there might be bugs. I spent quite a bit of effort reviewing
and testing this, but there still might be something wrong. But I think
that applies to any feature.

What would be worse is some sort of thinko in the approach in general. I
don't have a good answer to that, unfortunately - I think it works, but
how would I know for sure? We explored multiple alternative approaches
and all of them crashed and burned ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I did try to explain how this works (and why) in a couple places:
>
> 1) the commit message
> 2) reorderbuffer header comment
> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>
> It's possible this does not meet your expectations, ofc. Maybe there
> should be a separate README for this - I haven't found anything like
> that for logical decoding in general, which is why I did (1)-(3).

I read over these and I do think they answer a bunch of questions, but
I don't think they answer all of the questions.

Suppose T1 creates a sequence and commits. Then T2 calls nextval().
Then T3 drops the sequence. According to the commit message, T2's
change will be "replayed immediately after decoding". But it's
essential to replay T2's change after we replay T1 and before we
replay T3, and the comments don't explain why that's guaranteed.

The answer might be "locks". If we always replay a transaction
immediately when we see its commit record, then in the example above
we're fine, because the commit record for the transaction that creates
the sequence must precede the nextval() call, since the sequence won't
be visible until the transaction commits, and also because T1 holds a
lock on it at that point sufficient to hedge out nextval. And the
nextval record must precede the point where T3 takes an exclusive lock
on the sequence.

Note, however, that this chain of reasoning critically depends on us
never delaying application of a transaction. If we might reach T1's
commit record and say "hey, let's hold on to this for a bit and replay
it after we've decoded some more," everything immediately breaks,
unless we also delay application of T2's non-transactional update in
such a way that it's still guaranteed to happen after T1. I wonder if
this kind of situation would be a problem for a future parallel-apply
feature. It wouldn't work, for example, to hand T1 and T3 off (in that
order) to a separate apply process but handle T2's "non-transactional"
message directly, because it might handle that message before the
application of T1 got completed.

This also seems to depend on every transactional operation that might
affect a future non-transactional operation holding a lock that would
conflict with that non-transactional operation. For example, if ALTER
SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
the semantics of that would be -- would T1's second nextval() see the
sequence restart, or what? But if the effect of T1's second nextval
does depend in some way on the ALTER SEQUENCE operation which precedes
it in the WAL stream, then we might have some trouble here, because
both nextvals precede the commit of T2. Fortunately, this sequence of
events is foreclosed by locking.

But I did find one somewhat-similar case in which that's not so.

S1: create table withseq (a bigint generated always as identity);
S1: begin;
S2: select nextval('withseq_a_seq');
S1: alter table withseq set unlogged;
S2: select nextval('withseq_a_seq');

I think this is a bug in the code that supports owned sequences rather
than a problem that this patch should have to do something about. When
a sequence is flipped between logged and unlogged directly, we take a
stronger lock than we do here when it's done in this indirect way.
Also, I'm not quite sure if it would pose a problem for sequence
decoding anyway: it changes the relfilenode, but not the value. But
this is the *kind* of problem that could make the approach unsafe:
supposedly transactional changes being interleaved with supposedly
non-transactional changes, in such a way that the non-transactional
changes might get applied at the wrong time relative to the
transactional changes.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 1/26/24 15:39, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 12:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I did try to explain how this works (and why) in a couple places:
>>
>> 1) the commit message
>> 2) reorderbuffer header comment
>> 3) ReorderBufferSequenceIsTransactional comment (and nearby)
>>
>> It's possible this does not meet your expectations, ofc. Maybe there
>> should be a separate README for this - I haven't found anything like
>> that for logical decoding in general, which is why I did (1)-(3).
> 
> I read over these and I do think they answer a bunch of questions, but
> I don't think they answer all of the questions.
> 
> Suppose T1 creates a sequence and commits. Then T2 calls nextval().
> Then T3 drops the sequence. According to the commit message, T2's
> change will be "replayed immediately after decoding". But it's
> essential to replay T2's change after we replay T1 and before we
> replay T3, and the comments don't explain why that's guaranteed.
> 
> The answer might be "locks". If we always replay a transaction
> immediately when we see its commit record then in the example above
> we're fine, because the commit record for the transaction that creates
> the sequence must precede the nextval() call, since the sequence won't
> be visible until the transaction commits, and also because T1 holds a
> lock on it at that point sufficient to lock out nextval. And the
> nextval record must precede the point where T3 takes an exclusive lock
> on the sequence.
> 

Right, locks + apply in commit order gives us this guarantee (I can't
think of a case where it wouldn't be the case).

> Note, however, that this chain of reasoning critically depends on us
> never delaying application of a transaction. If we might reach T1's
> commit record and say "hey, let's hold on to this for a bit and replay
> it after we've decoded some more," everything immediately breaks,
> unless we also delay application of T2's non-transactional update in
> such a way that it's still guaranteed to happen after T1. I wonder if
> this kind of situation would be a problem for a future parallel-apply
> feature. It wouldn't work, for example, to hand T1 and T3 off (in that
> order) to a separate apply process but handle T2's "non-transactional"
> message directly, because it might handle that message before the
> application of T1 got completed.
> 

Doesn't the whole logical replication critically depend on the commit
order? If you decide to arbitrarily reorder/delay the transactions, all
kinds of really bad things can happen. That's a generic problem, it
applies to all kinds of objects, not just sequences - a parallel apply
would need to detect this sort of dependencies (e.g. INSERT + DELETE of
the same key), and do something about it.

Similar for sequences, where the important event is allocation of a new
relfilenode.

If anything, it's easier for sequences, because the relfilenode tracking
gives us an explicit (and easy) way to detect these dependencies between
transactions.

> This also seems to depend on every transactional operation that might
> affect a future non-transactional operation holding a lock that would
> conflict with that non-transactional operation. For example, if ALTER
> SEQUENCE .. RESTART WITH didn't take a strong lock on the sequence,
> then you could have: T1 does nextval, T2 does ALTER SEQUENCE RESTART
> WITH, T1 does nextval again, T1 commits, T2 commits. It's unclear what
> the semantics of that would be -- would T1's second nextval() see the
> sequence restart, or what? But if the effect of T1's second nextval
> does depend in some way on the ALTER SEQUENCE operation which precedes
> it in the WAL stream, then we might have some trouble here, because
> both nextvals precede the commit of T2. Fortunately, this sequence of
> events is foreclosed by locking.
> 

I don't quite follow :-(

AFAIK this theory hinges on not having the right lock, but I believe
ALTER SEQUENCE does obtain the lock (at least in cases that assign a new
relfilenode). Which means such reordering should not be possible,
because nextval() in other transactions will then wait until commit. And
all nextval() calls in the same transaction will be treated as
transactional.

So I think this works OK. If something does not lock the sequence in a
way that would prevent other xacts from doing nextval() on it, it's not
a change that assigns a new relfilenode - and so it does not switch the
sequence into transactional mode.

> But I did find one somewhat-similar case in which that's not so.
> 
> S1: create table withseq (a bigint generated always as identity);
> S1: begin;
> S2: select nextval('withseq_a_seq');
> S1: alter table withseq set unlogged;
> S2: select nextval('withseq_a_seq');
> 
> I think this is a bug in the code that supports owned sequences rather
> than a problem that this patch should have to do something about. When
> a sequence is flipped between logged and unlogged directly, we take a
> stronger lock than we do here when it's done in this indirect way.

Yes, I think this is a bug in handling of owned sequences - from the
moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions
generate duplicate values (until S1 commits, at which point the values
generated in S2 get "forgotten").

It seems we end up updating both relfilenodes, which is clearly wrong.

Seems like a bug independent of the decoding, IMO.

> Also, I'm not quite sure if it would pose a problem for sequence
> decoding anyway: it changes the relfilenode, but not the value. But
> this is the *kind* of problem that could make the approach unsafe:
> supposedly transactional changes being interleaved with supposedly
> non-transactional changes, in such a way that the non-transactional
> changes might get applied at the wrong time relative to the
> transactional changes.
> 

I'm not sure what you mean by "changes relfilenode, not value" but I
suspect it might break the sequence decoding - or at least confuse it. I
haven't checked what exactly happens when we change logged/unlogged for
a sequence, but I assume it does change the relfilenode, which already
is a change of a value - we WAL-log the new sequence state, at least.
But it should be treated as "transactional" in the transaction that did
the ALTER TABLE, because it created the relfilenode.

However, I'm not sure this is a valid argument against the sequence
decoding patch. If something does not acquire the correct lock, it's not
surprising something else breaks, if it relies on the lock.

Of course, I understand you're trying to make a broader point - that if
something like this could happen in "correct" case, it'd be a problem.

But I don't think that's possible. The whole "transactional" thing is
determined by having a new relfilenode for the sequence, and I can't
imagine a case where we could assign a new relfilenode without a lock.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Right, locks + apply in commit order gives us this guarantee (I can't
> think of a case where it wouldn't be the case).

I couldn't find any cases of inadequate locking other than the one I mentioned.

> Doesn't the whole logical replication critically depend on the commit
> order? If you decide to arbitrarily reorder/delay the transactions, all
> kinds of really bad things can happen. That's a generic problem, it
> applies to all kinds of objects, not just sequences - a parallel apply
> would need to detect this sort of dependencies (e.g. INSERT + DELETE of
> the same key), and do something about it.

Yes, but here I'm not just talking about the commit order. I'm talking
about the order of applying non-transactional operations relative to
commits.

Consider:

T1: CREATE SEQUENCE s;
T2: BEGIN;
T2: SELECT nextval('s');
T3: SELECT nextval('s');
T2: ALTER SEQUENCE s INCREMENT 2;
T2: SELECT nextval('s');
T2: COMMIT;

The commit order is T1 < T3 < T2, but T3 makes no transactional
changes, so the commit order is really just T1 < T2. But it's
completely wrong to say that all we need to do is apply T1 before we
apply T2. The correct order of application is:

1. T1.
2. T2's first nextval
3. T3's nextval
4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
the subsequent nextval)

In other words, the fact that some sequence changes are
non-transactional creates ordering hazards that don't exist if there
are no non-transactional changes. So in that way, sequences are
different from table modifications, where applying the transactions in
order of commit is all we need to do. Here we need to apply the
transactions in order of commit and also apply the non-transactional
changes at the right point in the sequence. Consider the following
alternative apply sequence:

1. T1.
2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
the subsequent nextval)
3. T3's nextval
4. T2's first nextval

That's still in commit order. It's also wrong.
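The required ordering can be sketched as a toy model (not PostgreSQL code; the record names and structure are invented for illustration): transactional changes are buffered and applied at their transaction's commit record, while non-transactional sequence changes act like tiny transactions applied immediately, at the LSN where they appear in the WAL stream.

```python
# The WAL stream for the scenario above: T2's first nextval and T3's
# nextval are logged before T2's ALTER SEQUENCE assigns a new
# relfilenode, so they are non-transactional; T2's later changes are
# transactional.
WAL = [
    ("T1", "change", "T1: CREATE SEQUENCE s"),
    ("T1", "commit", None),
    ("T2", "seq_nontxn", "T2: first nextval"),
    ("T3", "seq_nontxn", "T3: nextval"),
    ("T2", "change", "T2: ALTER SEQUENCE s INCREMENT 2"),
    ("T2", "change", "T2: nextval after ALTER (transactional)"),
    ("T2", "commit", None),
]

def decode_and_apply(stream):
    pending = {}   # xid -> buffered transactional changes
    applied = []   # what reaches the subscriber, in order
    for xid, kind, payload in stream:
        if kind == "seq_nontxn":
            applied.append(payload)               # tiny xact: apply at this LSN
        elif kind == "change":
            pending.setdefault(xid, []).append(payload)
        elif kind == "commit":
            applied.extend(pending.pop(xid, []))  # apply at commit, in order
    return applied
```

Running `decode_and_apply(WAL)` yields exactly the apply order 1-4 listed above; deferring the non-transactional changes to any later point (as a naive parallel apply might) breaks that ordering.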

Imagine that you commit this patch and someone later wants to do
parallel logical apply. So every time they finish decoding a
transaction, they stick it in a queue to be applied by the next
available worker. But, non-transactional changes are very simple, so
we just directly apply those in the main process. Well, kaboom! But
now this can happen with the above example.

1. Decode T1. Add to queue for apply.
2. Before the (idle) apply worker has a chance to pull T1 out of the
queue, decode the first nextval and try to apply it.

Oops. We're trying to apply a modification to a sequence that hasn't
been created yet. I'm not saying that this kind of hypothetical is a
reason not to commit the patch. But it seems like we're not on the
same page about what the ordering requirements are here. I'm just
making the argument that those non-transactional operations actually
act like mini-transactions. They need to happen at the right time
relative to the real transactions. A non-transactional operation needs
to be applied after any transactions that commit before it is logged,
and before any transactions that commit after it's logged.

> Yes, I think this is a bug in handling of owned sequences - from the
> moment the "ALTER TABLE ... SET UNLOGGED" is executed, the two sessions
> generate duplicate values (until the S1 is committed, at which point the
> values generated in S2 get "forgotten").
>
> It seems we end up updating both relfilenodes, which is clearly wrong.
>
> Seems like a bug independent of the decoding, IMO.

Yeah.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 2/13/24 17:37, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 1:07 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Right, locks + apply in commit order gives us this guarantee (I can't
>> think of a case where it wouldn't be the case).
> 
> I couldn't find any cases of inadequate locking other than the one I mentioned.
> 
>> Doesn't the whole logical replication critically depend on the commit
>> order? If you decide to arbitrarily reorder/delay the transactions, all
>> kinds of really bad things can happen. That's a generic problem, it
>> applies to all kinds of objects, not just sequences - a parallel apply
>> would need to detect this sort of dependencies (e.g. INSERT + DELETE of
>> the same key), and do something about it.
> 
> Yes, but here I'm not just talking about the commit order. I'm talking
> about the order of applying non-transactional operations relative to
> commits.
> 
> Consider:
> 
> T1: CREATE SEQUENCE s;
> T2: BEGIN;
> T2: SELECT nextval('s');
> T3: SELECT nextval('s');
> T2: ALTER SEQUENCE s INCREMENT 2;
> T2: SELECT nextval('s');
> T2: COMMIT;
> 

It's not clear to me if you're talking about nextval() that happens to
generate WAL, or nextval() covered by WAL generated by a previous call.

I'm going to assume it's the former, i.e. nextval() that generated WAL
describing the *next* sequence chunk, because without WAL there's
nothing to apply and therefore no issue with T3 ordering.

The way I think about non-transactional sequence changes is as if they
were tiny transactions that happen "fully" (including commit) at the LSN
where the change is logged.


> The commit order is T1 < T3 < T2, but T3 makes no transactional
> changes, so the commit order is really just T1 < T2. But it's
> completely wrong to say that all we need to do is apply T1 before we
> apply T2. The correct order of application is:
> 
> 1. T1.
> 2. T2's first nextval
> 3. T3's nextval
> 4. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 

Is that quite true? If T3 generated WAL (for the nextval call), it will
be applied at that particular LSN. AFAIK that guarantees it happens
after the first T2 change (which is also non-transactional) and before
the transactional T2 change (because that creates a new relfilenode).

> In other words, the fact that some sequence changes are
> non-transactional creates ordering hazards that don't exist if there
> are no non-transactional changes. So in that way, sequences are
> different from table modifications, where applying the transactions in
> order of commit is all we need to do. Here we need to apply the
> transactions in order of commit and also apply the non-transactional
> changes at the right point in the sequence. Consider the following
> alternative apply sequence:
> 
> 1. T1.
> 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> the subsequent nextval)
> 3. T3's nextval
> 4. T2's first nextval
> 
> That's still in commit order. It's also wrong.
> 

Yes, this would be wrong. Thankfully the apply is not allowed to reorder
the changes like this, because that's not what "non-transactional" means
in this context.

It does not mean we can arbitrarily reorder the changes, it only means
the changes are applied as if they were independent transactions (but in
the same order as they were executed originally). Both with respect to
the other non-transactional changes, and to "commits" of other stuff.

(for serial apply, at least)

> Imagine that you commit this patch and someone later wants to do
> parallel logical apply. So every time they finish decoding a
> transaction, they stick it in a queue to be applied by the next
> available worker. But, non-transactional changes are very simple, so
> we just directly apply those in the main process. Well, kaboom! But
> now this can happen with the above example.
> 
> 1. Decode T1. Add to queue for apply.
> 2. Before the (idle) apply worker has a chance to pull T1 out of the
> queue, decode the first nextval and try to apply it.
> 
> Oops. We're trying to apply a modification to a sequence that hasn't
> been created yet. I'm not saying that this kind of hypothetical is a
> reason not to commit the patch. But it seems like we're not on the
> same page about what the ordering requirements are here. I'm just
> making the argument that those non-transactional operations actually
> act like mini-transactions. They need to happen at the right time
> relative to the real transactions. A non-transactional operation needs
> to be applied after any transactions that commit before it is logged,
> and before any transactions that commit after it's logged.
> 

How is this issue specific to sequences? AFAIK this is a general problem
with transactions that depend on each other. Consider for example this:

T1: INSERT INTO t (id) VALUES (1);
T2: DELETE FROM t WHERE id = 1;

If you parallelize this in a naive way, maybe T2 gets applied before T1.
In which case the DELETE won't find the row yet.

There's different ways to address this. You can detect this type of
conflicts (e.g. when a DELETE that doesn't find a match), drain the
apply queue and retry the transaction. Or you may compare keysets of the
transactions and make sure the apply waits until the conflicting one
gets fully applied first.

AFAIK for sequences it's not any different, except the key we'd have to
compare is the sequence itself.
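The keyset comparison mentioned above could be sketched like this (purely hypothetical; no such mechanism exists in the patch, and all names are invented): before a parallel-apply leader applies a non-transactional sequence change directly, it checks whether any queued, not-yet-applied transaction touches the same sequence, and if so it must drain those first.

```python
# Queue of decoded-but-unapplied transactions, each with the set of
# keys (tables/sequences) it touches.
queued = [
    {"xid": "T1", "keys": {"seq:s"}, "applied": False},    # created sequence s
    {"xid": "T4", "keys": {"table:t"}, "applied": False},  # unrelated
]

def conflicting_xacts(change_key, queue):
    """Transactions that must be fully applied before a
    non-transactional change on change_key may be applied."""
    return [tx["xid"] for tx in queue
            if not tx["applied"] and change_key in tx["keys"]]
```

Here `conflicting_xacts("seq:s", queued)` returns `["T1"]`, so the leader would have to wait for T1; for a sequence nothing in the queue touches, the list is empty and the change can be applied immediately.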


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> The way I think about non-transactional sequence changes is as if they
> were tiny transactions that happen "fully" (including commit) at the LSN
> where the change is logged.

100% this.

> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally). Both with respect to
> the other non-transactional changes, and to "commits" of other stuff.

Right, this is very important and I agree completely.

I'm feeling more confident about this now that I heard you say that
stuff -- this is really the key issue I've been worried about since I
first looked at this, and I wasn't sure that you were in agreement,
but it sounds like you are. I think we should (a) fix the locking bug
I found (but that can be independent of this patch) and (b) make sure
that this patch documents the points from the quoted material above so
that everyone who reads the code (and maybe tries to enhance it) is
clear on what the assumptions are.

(I haven't checked whether it documents that stuff or not. I'm just
saying it should, because I think it's a subtlety that someone might
miss.)

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
On 2/15/24 05:16, Robert Haas wrote:
> On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The way I think about non-transactional sequence changes is as if they
>> were tiny transactions that happen "fully" (including commit) at the LSN
>> where the change is logged.
> 
> 100% this.
> 
>> It does not mean we can arbitrarily reorder the changes, it only means
>> the changes are applied as if they were independent transactions (but in
>> the same order as they were executed originally). Both with respect to
>> the other non-transactional changes, and to "commits" of other stuff.
> 
> Right, this is very important and I agree completely.
> 
> I'm feeling more confident about this now that I heard you say that
> stuff -- this is really the key issue I've been worried about since I
> first looked at this, and I wasn't sure that you were in agreement,
> but it sounds like you are. I think we should (a) fix the locking bug
> I found (but that can be independent of this patch) and (b) make sure
> that this patch documents the points from the quoted material above so
> that everyone who reads the code (and maybe tries to enhance it) is
> clear on what the assumptions are.
> 
> (I haven't checked whether it documents that stuff or not. I'm just
> saying it should, because I think it's a subtlety that someone might
> miss.)
> 

Thanks for thinking about these issues with reordering events. It's
good that we seem to be in agreement and that you feel more confident
about this. I'll check if there's a good place to document this.

For me, the part that I feel most uneasy about is the decoding while the
snapshot is still being built (and can flip to consistent snapshot
between the relfilenode creation and sequence change, confusing the
logic that decides which changes are transactional).

It seems "a bit weird" that we either keep the "simple" logic that may
end up with incorrect "non-transactional" result, but happens to then
work fine because we immediately discard the change.

But it still feels better than the alternative, which requires us to
start decoding stuff (relfilenode creation) before the snapshot we are
building is consistent, which we didn't do before - or at least not in
this particular way. While I don't have a practical example where it
would cause trouble now, I have a nagging feeling it might easily cause
trouble in the future by making some new features harder to implement.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> For me, the part that I feel most uneasy about is the decoding while the
> snapshot is still being built (and can flip to consistent snapshot
> between the relfilenode creation and sequence change, confusing the
> logic that decides which changes are transactional).
>
> It seems "a bit weird" that we either keep the "simple" logic that may
> end up with incorrect "non-transactional" result, but happens to then
> work fine because we immediately discard the change.
>
> But it still feels better than the alternative, which requires us to
> start decoding stuff (relfilenode creation) before the snapshot we
> are building is consistent, which we didn't do before - or at least
> not in this particular way. While I don't have a practical example where it
> would cause trouble now, I have a nagging feeling it might easily cause
> trouble in the future by making some new features harder to implement.

I don't understand the issues here well enough to comment. Is there a
good write-up someplace I can read to understand the design here?

Is the rule that changes are transactional if and only if the current
transaction has assigned a new relfilenode to the sequence?

Why does the logic get confused if the state of the snapshot changes?

My naive reaction is that it kinda sounds like you're relying on two
different mistakes cancelling each other out, and that might be a bad
idea, because maybe there's some situation where they don't. But I
don't understand the issue well enough to have an educated opinion at
this point.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades.  Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
> >
>
> I agree it's an annoying gap for this use case. But if this is the only
> use cases, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.
>

One difference between the logical replication of tables and sequences
is that with synchronous_commit (and synchronous_standby_names) we can
guarantee that after a failover we know whether a transaction's data
has been replicated, whereas for sequences we can't guarantee that,
because of their non-transactional nature. Say there are two
transactions T1 and T2: it is possible that T1's entire table data and
sequence data are committed and replicated, but of T2 only the
sequence data is replicated. So, after failover to the logical
subscriber in such a case, if one routes T2 to the new node again (as
it was not successful previously), it would needlessly perform the
sequence changes again. I don't know how much that matters, but that
would probably be the difference between the replication of tables and
sequences.

I agree with your point above that for upgrades some tool like
pg_copysequence, providing a way to copy sequence data from the
publisher to subscribers, would suffice.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Is the rule that changes are transactional if and only if the current
> transaction has assigned a new relfilenode to the sequence?

Yes, that's the rule.

> Why does the logic get confused if the state of the snapshot changes?

The rule itself doesn't change, but the way this identification is
implemented in the decoding gets confused and treats a transactional
change as non-transactional. Whether a sequence change is
transactional is identified based on what WAL we have decoded from the
particular transaction, and whether we decode a particular WAL record
depends upon the snapshot state (it's about what we decode, not
necessarily what we send). So if the snapshot state changes
mid-transaction, that means we haven't decoded the WAL record which
created a new relfilenode, but we will decode the WAL record which is
operating on the sequence. So here we will assume the change is
non-transactional whereas it was transactional, because we did not
decode some of the changes of the transaction which we rely on for
identifying whether it is transactional or not.
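The rule described above can be reduced to a toy model (invented names, not the actual reorderbuffer code): a sequence change is treated as transactional iff we decoded this transaction creating the sequence's relfilenode, so if decoding only starts paying attention mid-transaction, the create record was never decoded and the later change is misclassified.

```python
def is_transactional(xid, relfilenode, created):
    """created: set of (xid, relfilenode) pairs decoded so far from
    records that assigned a new relfilenode."""
    return (xid, relfilenode) in created

# Full decode: we decoded xid 100 assigning relfilenode 5001, so its
# subsequent sequence change is classified as transactional.
full = is_transactional(100, 5001, {(100, 5001)})      # True

# Partial decode: the relfilenode-creation record predates the point
# where we started decoding, so it was skipped; the very same change
# now looks non-transactional.
partial = is_transactional(100, 5001, set())           # False
```

As the thread concludes, the misclassification in the partial-decode case is harmless only because such a change is also guaranteed to be skipped: the snapshot cannot have reached SNAPBUILD_CONSISTENT while that transaction is still running.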


> My naive reaction is that it kinda sounds like you're relying on two
> different mistakes cancelling each other out, and that might be a bad
> idea, because maybe there's some situation where they don't. But I
> don't understand the issue well enough to have an educated opinion at
> this point.

I would say the first one is a mistake of identifying a transactional
change as non-transactional during decoding, and that mistake happens
only when we decode the transaction partially. But we never stream
partially decoded transactions downstream, which means that even
though we have made a mistake in decoding it, we are not streaming it,
so our mistake does not get converted into a real problem. But again,
I agree there is a temporary wrong decision, and if we tried to do
something else based on this decision then it could be an issue.

You might be interested in more detail [1] where I first reported this
problem and also [2] where we concluded why this is not creating a
real problem.

[1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Tue, Feb 20, 2024 at 1:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> You might be interested in more detail [1] where I first reported this
> problem and also [2] where we concluded why this is not creating a
> real problem.
>
> [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

Thanks. Dilip and I just spent a lot of time talking this through on a
call. One of the key bits of logic is here:

+ /* Skip the change if already processed (per the snapshot). */
+ if (transactional &&
+ !SnapBuildProcessChange(builder, xid, buf->origptr))
+ return;
+ else if (!transactional &&
+ (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
+   SnapBuildXactNeedsSkip(builder, buf->origptr)))
+ return;

As a stylistic note, I think this would be more clear if it were
written if (transactional) { if (!SnapBuildProcessChange()) return; }
else { if (something else) return; }.

Now, on to correctness. It's possible for us to identify a
transactional change as non-transactional if smgr_decode() was called
for the relfilenode before SNAPBUILD_FULL_SNAPSHOT was reached. In
that case, if !SnapBuildProcessChange() would have been true, then we
need SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
SnapBuildXactNeedsSkip(builder, buf->origptr) to also be true.
Otherwise, we'll process this change when we wouldn't have otherwise.
But Dilip made an argument to me about this which seems correct to me.
snapbuild.h says that SNAPBUILD_CONSISTENT is reached only when we
find a point where all transactions that were running at the time we
reached SNAPBUILD_FULL_SNAPSHOT have finished. So if this transaction
is one for which we incorrectly identified the sequence change as
non-transactional, then we cannot be in the SNAPBUILD_CONSISTENT state
yet, so SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT will be
true, hence the whole "or" condition will be true and we'll return. So
far, so good.

I think, anyway. I haven't comprehensively verified that the comment
in snapbuild.h accurately reflects what the code actually does. But if
it does, then presumably we shouldn't see a record for which we might
have mistakenly identified a change as non-transactional after
reaching SNAPBUILD_CONSISTENT, which seems to be good enough to
guarantee that the mistake won't matter.

However, the logic in smgr_decode() doesn't only care about the
snapshot state. It also cares about the fast-forward flag:

+ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+ ctx->fast_forward)
+ return;

Let's say fast_forward is true. Then smgr_decode() is going to skip
recording anything about the relfilenode, so we'll identify all
sequence changes as non-transactional. But look at how this case is
handled in seq_decode():

+ if (ctx->fast_forward)
+ {
+ /*
+ * We need to set processing_required flag to notify the sequence
+ * change existence to the caller. Usually, the flag is set when
+ * either the COMMIT or ABORT records are decoded, but this must be
+ * turned on here because the non-transactional logical message is
+ * decoded without waiting for these records.
+ */
+ if (!transactional)
+ ctx->processing_required = true;
+
+ return;
+ }

This seems suspicious. Why are we testing the transactional flag here
if it's guaranteed to be false? My guess is that the person who wrote
this code thought that the flag would be accurate even in this case,
but that doesn't seem to be true. So this case probably needs some
more thought.

It's definitely not great that this logic is so complicated; it's
really hard to verify that all the tests match up well enough to keep
us out of trouble.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:

> Let's say fast_forward is true. Then smgr_decode() is going to skip
> recording anything about the relfilenode, so we'll identify all
> sequence changes as non-transactional. But look at how this case is
> handled in seq_decode():
>
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

It appears that the 'processing_required' flag was introduced as part
of supporting upgrades for logical replication slots. Its purpose is
to determine whether a slot is fully caught up, meaning that there are
no pending decodable changes left before it can be upgraded.

So now, if some change was transactional but we identified it as
non-transactional, we will set 'ctx->processing_required = true;'.
We temporarily set this flag incorrectly, but even if the change had
been correctly identified in the first place, the flag would have
been set to true in the DecodeTXNNeedSkip() function anyway, regardless
of whether the transaction is committed or aborted. As a result, the
flag would eventually be set to 'true', and the behavior would align
with the intended logic.

But I am wondering why this flag is always set to true in
DecodeTXNNeedSkip() irrespective of commit or abort, given that
aborted transactions are not supposed to be replayed. If my
observation is correct that this shouldn't be set to true for an
aborted transaction, then we have a problem with sequences, where we
are identifying transactional changes as non-transactional changes,
because for transactional changes this should depend on the commit
status.

On another thought, can there be a situation where we have wrongly
identified a change as non-transactional and set this flag, and the
commit/abort record never appeared in the WAL and so was never
decoded? That could also lead to an incorrect decision during the
upgrade.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:

On 2/20/24 06:54, Amit Kapila wrote:
> On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/19/23 13:54, Christophe Pettus wrote:
>>> Hi,
>>>
>>> I wanted to hop in here on one particular issue:
>>>
>>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
>>>> better solution for distributed (esp. active-active) systems. But there
>>>> are important use cases that are likely to keep using regular sequences
>>>> (online upgrades of single-node instances, existing systems, ...).
>>>
>>> +1.
>>>
>>> Right now, the lack of sequence replication is a rather large
>>> foot-gun on logical replication upgrades.  Copying the sequences
>>> over during the cutover period is doable, of course, but:
>>>
>>> (a) There's no out-of-the-box tooling that does it, so everyone has
>>> to write some scripts just for that one function.
>>>
>>> (b) It's one more thing that extends the cutover window.
>>>
>>
>> I agree it's an annoying gap for this use case. But if this is the only
>> use cases, maybe a better solution would be to provide such tooling
>> instead of adding it to the logical decoding?
>>
>> It might seem a bit strange if most data is copied by replication
>> directly, while sequences need special handling, ofc.
>>
> 
> One difference between the logical replication of tables and sequences
> is that we can guarantee with synchronous_commit (and
> synchronous_standby_names) that after failover transactions data is
> replicated or not whereas for sequences we can't guarantee that
> because of their non-transactional nature. Say, there are two
> transactions T1 and T2, it is possible that T1's entire table data and
> sequence data are committed and replicated but T2's sequence data is
> replicated. So, after failover to logical subscriber in such a case if
> one routes T2 again to the new node as it was not successful
> previously then it would needlessly perform the sequence changes
> again. I don't how much that matters but that would probably be the
> difference between the replication of tables and sequences.
> 

I don't quite follow what the problem with synchronous_commit is :-(

For sequences, we log the changes ahead, i.e. even if nextval() did not
write anything into WAL, it's still safe because these changes are
covered by the WAL generated some time ago (up to ~32 values back). And
that's certainly subject to synchronous_commit, right?

There certainly are issues with sequences and syncrep:

https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com

but that's unrelated to logical replication.

FWIW I don't think we'd re-apply sequence changes needlessly, because
the worker does update the origin after applying non-transactional
changes. So after the replication gets restarted, we'd skip what we
already applied, no?

But maybe there is an issue and I'm just not getting it. Could you maybe
share an example of T1/T2, with a replication restart and what you think
would happen?

> I agree with your point above that for upgrades some tool like
> pg_copysequence where we can provide a way to copy sequence data to
> subscribers from the publisher would suffice the need.
> 

Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
another tool users would need to use.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Tue, Feb 20, 2024 at 5:39 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/20/24 06:54, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 12/19/23 13:54, Christophe Pettus wrote:
> >>> Hi,
> >>>
> >>> I wanted to hop in here on one particular issue:
> >>>
> >>>> On Dec 12, 2023, at 02:01, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>>> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >>>> better solution for distributed (esp. active-active) systems. But there
> >>>> are important use cases that are likely to keep using regular sequences
> >>>> (online upgrades of single-node instances, existing systems, ...).
> >>>
> >>> +1.
> >>>
> >>> Right now, the lack of sequence replication is a rather large
> >>> foot-gun on logical replication upgrades.  Copying the sequences
> >>> over during the cutover period is doable, of course, but:
> >>>
> >>> (a) There's no out-of-the-box tooling that does it, so everyone has
> >>> to write some scripts just for that one function.
> >>>
> >>> (b) It's one more thing that extends the cutover window.
> >>>
> >>
> >> I agree it's an annoying gap for this use case. But if this is the only
> >> use cases, maybe a better solution would be to provide such tooling
> >> instead of adding it to the logical decoding?
> >>
> >> It might seem a bit strange if most data is copied by replication
> >> directly, while sequences need special handling, ofc.
> >>
> >
> > One difference between the logical replication of tables and sequences
> > is that we can guarantee with synchronous_commit (and
> > synchronous_standby_names) that after failover transactions data is
> > replicated or not whereas for sequences we can't guarantee that
> > because of their non-transactional nature. Say, there are two
> > transactions T1 and T2, it is possible that T1's entire table data and
> > sequence data are committed and replicated but T2's sequence data is
> > replicated. So, after failover to logical subscriber in such a case if
> > one routes T2 again to the new node as it was not successful
> > previously then it would needlessly perform the sequence changes
> > again. I don't how much that matters but that would probably be the
> > difference between the replication of tables and sequences.
> >
>
> I don't quite follow what the problem with synchronous_commit is :-(
>
> For sequences, we log the changes ahead, i.e. even if nextval() did not
> write anything into WAL, it's still safe because these changes are
> covered by the WAL generated some time ago (up to ~32 values back). And
> that's certainly subject to synchronous_commit, right?
>
> There certainly are issues with sequences and syncrep:
>
> https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com
>
> but that's unrelated to logical replication.
>
> FWIW I don't think we'd re-apply sequence changes needlessly, because
> the worker does update the origin after applying non-transactional
> changes. So after the replication gets restarted, we'd skip what we
> already applied, no?
>

It will work for restarts, but I was trying to discuss what happens
after the publisher node goes down and we fail over to the subscriber
node and make it the primary node. After that, all unfinished
transactions will be re-routed to the new primary. Consider a
theoretical case where we send sequence changes of yet-uncommitted
transactions directly from WAL buffers (something like 91f2cae7a4
does for physical replication) and then the primary or publisher node
immediately crashes. After failover to the subscriber node, the
application will re-route unfinished transactions to the new primary.
In such a situation, I think there is a chance that we will update
the sequence value when the new primary has already received/applied
that update via replication. This is what I was saying: there is
probably a difference between tables and sequences, because for
tables such a replicated change would be rolled back. Having said
that, this is probably no different from what would happen in the
case of physical replication.

> But maybe there is an issue and I'm just not getting it. Could you maybe
> share an example of T1/T2, with a replication restart and what you think
> would happen?
>
> > I agree with your point above that for upgrades some tool like
> > pg_copysequence where we can provide a way to copy sequence data to
> > subscribers from the publisher would suffice the need.
> >
>
> Perhaps. Unfortunately it doesn't quite work for failovers, and it's yet
> another tool users would need to use.
>

But can a logical replica be used for failover? We don't have any way
to replicate/sync the slots on subscribers, nor do we have a
mechanism to replicate existing publications. I think if we want to
achieve failover to a logical subscriber we need to replicate/sync
the required logical and physical slots to the subscribers. I haven't
thought this through completely, so there are probably more things
to consider before allowing logical subscribers to be used as
failover candidates.

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Amit Kapila
Date:
On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 2/13/24 17:37, Robert Haas wrote:
>
> > In other words, the fact that some sequence changes are
> > non-transactional creates ordering hazards that don't exist if there
> > are no non-transactional changes. So in that way, sequences are
> > different from table modifications, where applying the transactions in
> > order of commit is all we need to do. Here we need to apply the
> > transactions in order of commit and also apply the non-transactional
> > changes at the right point in the sequence. Consider the following
> > alternative apply sequence:
> >
> > 1. T1.
> > 2. T2's transactional changes (i.e. the ALTER SEQUENCE INCREMENT and
> > the subsequent nextval)
> > 3. T3's nextval
> > 4. T2's first nextval
> >
> > That's still in commit order. It's also wrong.
> >
>
> Yes, this would be wrong. Thankfully the apply is not allowed to reorder
> the changes like this, because that's not what "non-transactional" means
> in this context.
>
> It does not mean we can arbitrarily reorder the changes, it only means
> the changes are applied as if they were independent transactions (but in
> the same order as they were executed originally).
>

In this regard, I have another scenario in mind where the apply order
could differ for changes in the same transaction. For
example,

Transaction T1
Begin;
Insert ..
Insert ..
nextval .. --consider this generates WAL
..
Insert ..
nextval .. --consider this generates WAL

In this case, if the nextval operations are applied in a different
order (i.e. before the Inserts) then there could be some
inconsistency. Say it doesn't follow the above order during apply:
then a trigger fired for each row insert on both pub and sub that
refers to the current sequence value to make some decision could
behave differently on the publisher and subscriber. If this is not
how the patch will behave, then fine; otherwise, isn't this something
we should be worried about?

--
With Regards,
Amit Kapila.



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 3:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Let's say fast_forward is true. Then smgr_decode() is going to skip
> > recording anything about the relfilenode, so we'll identify all
> > sequence changes as non-transactional. But look at how this case is
> > handled in seq_decode():
> >
> > + if (ctx->fast_forward)
> > + {
> > + /*
> > + * We need to set processing_required flag to notify the sequence
> > + * change existence to the caller. Usually, the flag is set when
> > + * either the COMMIT or ABORT records are decoded, but this must be
> > + * turned on here because the non-transactional logical message is
> > + * decoded without waiting for these records.
> > + */
> > + if (!transactional)
> > + ctx->processing_required = true;
> > +
> > + return;
> > + }
>
> It appears that the 'processing_required' flag was introduced as part
> of supporting upgrades for logical replication slots. Its purpose is
> to determine whether a slot is fully caught up, meaning that there are
> no pending decodable changes left before it can be upgraded.
>
> So now if some change was transactional but we have identified it as
> non-transaction then we will mark this flag  'ctx->processing_required
> = true;' so we temporarily set this flag incorrectly, but even if the
> flag would have been correctly identified initially, it would have
> been set again to true in the DecodeTXNNeedSkip() function regardless
> of whether the transaction is committed or aborted. As a result, the
> flag would eventually be set to 'true', and the behavior would align
> with the intended logic.
>
> But I am wondering why this flag is always set to true in
> DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> aborted transactions are not supposed to be replayed?  So if my
> observation is correct that for the aborted transaction, this
> shouldn't be set to true then we have a problem with sequence where we
> are identifying the transactional changes as non-transaction changes
> because now for transactional changes this should depend upon commit
> status.

I have checked this case with Amit Kapila.  It seems that in cases
where we have sent a prepared transaction or streamed an in-progress
transaction, we would need to send the abort as well, and for that
reason we are setting 'ctx->processing_required' to true, so that if
these WAL records are not streamed we do not allow upgrading such slots.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > But I am wondering why this flag is always set to true in
> > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> > aborted transactions are not supposed to be replayed?  So if my
> > observation is correct that for the aborted transaction, this
> > shouldn't be set to true then we have a problem with sequence where we
> > are identifying the transactional changes as non-transaction changes
> > because now for transactional changes this should depend upon commit
> > status.
>
> I have checked this case with Amit Kapila.  So it seems in the cases
> where we have sent the prepared transaction or streamed in-progress
> transaction we would need to send the abort also, and for that reason,
> we are setting 'ctx->processing_required' as true so that if these
> WALs are not streamed we do not allow upgrade of such slots.

I don't find this explanation clear enough for me to understand.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> > > aborted transactions are not supposed to be replayed?  So if my
> > > observation is correct that for the aborted transaction, this
> > > shouldn't be set to true then we have a problem with sequence where we
> > > are identifying the transactional changes as non-transaction changes
> > > because now for transactional changes this should depend upon commit
> > > status.
> >
> > I have checked this case with Amit Kapila.  So it seems in the cases
> > where we have sent the prepared transaction or streamed in-progress
> > transaction we would need to send the abort also, and for that reason,
> > we are setting 'ctx->processing_required' as true so that if these
> > WALs are not streamed we do not allow upgrade of such slots.
>
> I don't find this explanation clear enough for me to understand.


Explanation about why we set 'ctx->processing_required' to true from
DecodeCommit as well as DecodeAbort:

--------------------------------------------------------------------------------------------------------------------------------------------------
For upgrading logical replication slots, it's essential to ensure
these slots are completely synchronized with the subscriber.  To
verify that, we process all the pending WAL in 'fast_forward' mode to
find whether there is any decodable WAL or not.  In short, any WAL
type that we stream to the subscriber in normal (non-fast_forward)
mode is considered decodable, and that includes the abort WAL.
That's the reason why at the end of a transaction (commit/abort) we
need to set 'ctx->processing_required' to true, i.e. some decodable
WAL exists, so we cannot upgrade this slot.

Why the below check is safe?
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider a transactional change as
non-transactional and mark this flag as true.  But what would have
happened if we had identified it correctly as transactional?
In that case we wouldn't have set the flag here, but we would
have set it while processing DecodeAbort/DecodeCommit, so the
net effect would be the same, no?  You may ask what happens if the
Abort/Commit record never appears in the WAL, but this flag is
specifically for the upgrade case, and in that case we have to do a
clean shutdown, so it may not be an issue.  But in the future, if we
try to use 'ctx->processing_required' for something else where a
clean shutdown is not guaranteed, then this flag can be set incorrectly.

I am not arguing that this is a perfect design but I am just making a
point about why it would work.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Robert Haas
Date:
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So the problem is that we might consider the transaction change as
> non-transaction and mark this flag as true.

But it's not "might" right? It's absolutely 100% certain that we will
consider that transaction's changes as non-transactional ... because
when we're in fast-forward mode, the table of new relfilenodes is not
built, and so whenever we check whether any transaction made a new
relfilenode for this sequence, the answer will be no.

> But what would have
> happened if we would have identified it correctly as transactional?
> In such cases, we wouldn't have set this flag here but then we would
> have set this while processing the DecodeAbort/DecodeCommit, so the
> net effect would be the same no?  You may question what if the
> Abort/Commit WAL never appears in the WAL, but this flag is
> specifically for the upgrade case, and in that case we have to do a
> clean shutdown so may not be an issue.  But in the future, if we try
> to use 'ctx->processing_required' for something else where the clean
> shutdown is not guaranteed then this flag can be set incorrectly.
>
> I am not arguing that this is a perfect design but I am just making a
> point about why it would work.

Even if this argument is correct (and I don't know if it is), the code
and comments need some updating. We should not be testing a flag that
is guaranteed false with comments that make it sound like the value of
the flag is trustworthy when it isn't.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Dilip Kumar
Date:
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > So the problem is that we might consider the transaction change as
> > non-transaction and mark this flag as true.
>
> But it's not "might" right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have
> > happened if we would have identified it correctly as transactional?
> > In such cases, we wouldn't have set this flag here but then we would
> > have set this while processing the DecodeAbort/DecodeCommit, so the
> > net effect would be the same no?  You may question what if the
> > Abort/Commit WAL never appears in the WAL, but this flag is
> > specifically for the upgrade case, and in that case we have to do a
> > clean shutdown so may not be an issue.  But in the future, if we try
> > to use 'ctx->processing_required' for something else where the clean
> > shutdown is not guaranteed then this flag can be set incorrectly.
> >
> > I am not arguing that this is a perfect design but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: logical decoding and replication of sequences, take 2

From
Tomas Vondra
Date:
Hi,

Let me share a bit of an update regarding this patch and PG17. I have
discussed this patch and how to move it forward with a couple hackers
(both within EDB and outside), and my takeaway is that the patch is not
quite baked yet, not enough to make it into PG17 :-(

There are two main reasons / concerns leading to this conclusion:

* correctness of the decoding part

There are (were) doubts about decoding during startup, before the
snapshot gets consistent, when we can make "temporarily incorrect"
decisions about whether a change is transactional. While the behavior
is ultimately correct (we treat all changes as non-transactional and
discard them), it seems "dirty", and it’s unclear to me whether it
might cause more serious issues down the line (not necessarily bugs,
but perhaps making it harder to implement future changes).

* handling of sequences in built-in replication

Per the patch, sequences need to be added to the publication explicitly.
But there were suggestions we might (should) add certain sequences
automatically - e.g. sequences backing SERIAL/BIGSERIAL columns, etc.
I’m not sure we really want to do that, and so far I assumed we would
start with the manual approach and move to automatic addition in the
future. But the agreement seems to be it would be a pretty significant
"breaking change", and something we probably don’t want to do.


If someone has an opinion on either of the two issues (either
way), I'd like to hear it.


Obviously, I'm not particularly happy about this outcome. And I'm also
somewhat cautious because this patch was already committed+reverted in
PG16 cycle, and doing the same thing in PG17 is not on my wish list.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company