Discussion: logical decoding and replication of sequences
Hi,

One of the existing limitations of logical decoding / replication is that it does not care about sequences. The annoying consequence is that after a failover to a logical replica, all the table data may be replicated but the sequences are still at their initial values, requiring some custom solution that moves the sequences forward enough to prevent duplicates.

There have been attempts to address this in the past, most recently [1], but none of them got in due to various issues. This is an attempt, based on [1] (but with many significant parts added or reworked), aiming to deal with this. The primary purpose of sharing it is getting feedback and opinions on the design decisions. It's still a WIP - it works fine AFAICS, but some of the bits may be a bit hackish.

The overall goal is to have the same sequence data on the primary and the logical replica, or something sufficiently close to that, so that the replica does not generate duplicate values after a failover. This patch does a couple of basic things:

1) It extends logical decoding to handle sequences, adding a new callback similar to what we have for messages. There's a bit of complexity with transactional and non-transactional behavior - more about that later.

2) It extends test_decoding to support this new callback, printing the sequence increments (the decoded WAL records).

3) It extends built-in replication to support sequences, so publications may contain both tables and sequences, sequence data is synced when creating subscriptions, etc.


transactional vs. non-transactional
-----------------------------------

The first part (extending logical decoding) is simple in principle. We simply decode the sequence updates, but then comes a challenge - should we treat them transactionally and stash them in the reorder buffer, or pass them to the output plugin right away? For messages, this can be specified as a flag when adding the message, so the user can decide depending on the message's purpose.
For sequences, all we do is nextval(), and the right behavior depends on the context in which it's used - we can't just pick one of the two approaches. Consider this, for example:

  CREATE SEQUENCE s;
  BEGIN;
  SELECT nextval('s') FROM generate_series(1,1000) s(i);
  ROLLBACK;

If we handled this "transactionally", we'd stash the "nextval" increment into the transaction and then discard it due to the rollback, so the output plugin (and the replica) would never see it. So this is an argument for non-transactional behavior. On the other hand, consider this:

  CREATE SEQUENCE s;
  BEGIN;
  ALTER SEQUENCE s RESTART WITH 2000;
  SELECT nextval('s') FROM generate_series(1,1000) s(i);
  ROLLBACK;

In this case the ALTER creates a new relfilenode, and the ROLLBACK does discard it, including the effects of the nextval calls. So here we should treat it transactionally - stash the increment(s) in the transaction and just discard it all on rollback. A somewhat similar example is this:

  BEGIN;
  CREATE SEQUENCE s;
  SELECT nextval('s') FROM generate_series(1,1000) s(i);
  COMMIT;

Again, the decoded nextval needs to be handled transactionally, because otherwise it's going to be very difficult for custom plugins to combine this with DDL replication.

So the patch does a fairly simple thing:

1) By default, sequences are treated non-transactionally, i.e. sent to the output plugin right away.

2) We track sequences created in running (sub)transactions, and those are handled transactionally. This includes the ALTER SEQUENCE cases, which create a new relfilenode, which is used as an identifier.

It's a bit more complex because of cases like this:

  BEGIN;
  CREATE SEQUENCE s;
  SAVEPOINT a;
  SELECT nextval('s') FROM generate_series(1,1000) s(i);
  ROLLBACK TO a;
  COMMIT;

Here we must not discard the nextval changes - this is handled by always stashing the nextval changes in the subxact where the sequence relfilenode was created. The tracking is a bit cumbersome - there's a hash table mapping each relfilenode to the XID in which it was created.
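To make the tracking idea concrete, here is a toy model of the decision logic - plain Python, not the patch's C code, and all names are made up for illustration. Increments on a relfilenode created by a still-running transaction are stashed with that transaction; everything else is emitted immediately:

```python
class SequenceTracker:
    """Toy model of the relfilenode -> XID tracking described above."""

    def __init__(self):
        self.created_in = {}   # relfilenode -> XID that created it
        self.stashed = {}      # XID -> increments stashed with that xact
        self.emitted = []      # increments sent straight to the output plugin

    def record_create(self, relfilenode, xid):
        # CREATE SEQUENCE / ALTER SEQUENCE: a new relfilenode appears
        self.created_in[relfilenode] = xid
        self.stashed.setdefault(xid, [])

    def decode_increment(self, relfilenode, last_value):
        xid = self.created_in.get(relfilenode)
        if xid is not None:
            # transactional: wait for commit/abort of the creating xact
            self.stashed[xid].append(last_value)
        else:
            # non-transactional: pass to the output plugin right away
            self.emitted.append(last_value)

    def commit(self, xid):
        self.emitted.extend(self.stashed.pop(xid, []))
        self.created_in = {rf: x for rf, x in self.created_in.items() if x != xid}

    def abort(self, xid):
        self.stashed.pop(xid, None)  # discarded, like any rolled-back change
        self.created_in = {rf: x for rf, x in self.created_in.items() if x != xid}


t = SequenceTracker()
t.decode_increment(100, 33)    # pre-existing sequence: emitted immediately
t.record_create(200, 1)        # CREATE SEQUENCE inside xact 1
t.decode_increment(200, 33)    # stashed with xact 1
t.abort(1)                     # ROLLBACK discards the stashed increment
print(t.emitted)               # [33]
```

The subtransaction case above corresponds to stashing with the XID that created the relfilenode, not the XID doing the nextval() - which is why ROLLBACK TO a savepoint does not lose the increments.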
AFAIK that works, but it might be an issue with many sequences created in running transactions. Not sure.


detecting sequence creation
---------------------------

Detecting that a sequence (or rather the relfilenode) was created is done by adding a "created" flag into xl_seq_rec, and setting it to "true" in the first WAL record after the creation. There might be some other way, but this seemed simple enough.


applying the sequence (ResetSequence2)
--------------------------------------

The decoding pretty much just extracts last_value, log_cnt and is_called from the sequence, and passes them to the output plugin. On the replica we extract those from the message, and write them to the local sequence using a new ResetSequence2 function.

It's possible we don't really need log_cnt and is_called. After all, log_cnt is zero most of the time anyway, and the worst thing that could happen if we ignore it is that we skip a couple of values (which seems fine).


syncing sequences in a subscription
-----------------------------------

After creating a subscription, the sequences get synchronized just like tables. This part is a bit hacked together, and there's definitely room for improvement - e.g. a new bgworker is started for each sequence, as we simply treat both tables and sequences as "relations". But all we need to do for sequences is copy (last_value, log_cnt, is_called) and call ResetSequence2, so maybe we could sync all sequences in a single worker, or something like that.


new "sequence" publication action
---------------------------------

The publications now have a new "sequence" publication action, which is enabled by default. This determines whether the publication decodes sequences or what.


FOR ALL SEQUENCES
-----------------

It should be possible to create FOR ALL SEQUENCES publications, just like we have FOR ALL TABLES. But this produces shift/reduce conflicts in the grammar, and I didn't bother dealing with that. So for now it's required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...


no streaming support yet
------------------------

There's no support for streaming of in-progress transactions yet, but it should be trivial to add.


GetCurrentTransactionId() in nextval
------------------------------------

There's a somewhat annoying behavior of nextval() - if you do this:

  BEGIN;
  CREATE SEQUENCE s;
  SAVEPOINT a;
  SELECT nextval('s') FROM generate_series(1,100) s(i);
  COMMIT;

then the WAL record for nextval (right after the savepoint) will have XID 0 (easy to see in pg_waldump). That's kinda strange, and it causes problems in DecodeSequence() when calling SnapBuildProcessChange(builder, xid, buf->origptr) for transactional changes, because that expects a valid XID. Fixing this required adding GetCurrentTransactionId() to nextval() and two other functions, which so far were only doing:

  if (RelationNeedsWAL(seqrel))
      GetTopTransactionId();

I'm not sure if this has some particularly bad consequences.

regards

[1] https://www.postgresql.org/message-id/flat/1710ed7e13b.cd7177461430746.3372264562543607781%40highgo.ca

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Hi, Seems the cfbot was not entirely happy with the patch on some platforms, so here's a fixed version. There was a bogus call to ensure_transaction function (which does not exist at all) and a silly bug in transaction management in apply_handle_sequence. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
A rebased patch, addressing a minor bitrot due to 4daa140a2f5. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/23/21 4:14 PM, Tomas Vondra wrote: > A rebased patch, addressing a minor bitrot due to 4daa140a2f5. > Meh, forgot to attach the patch as usual, of course ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Wed, Jun 23, 2021 at 7:55 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 6/23/21 4:14 PM, Tomas Vondra wrote: > > A rebased patch, addressing a minor bitrot due to 4daa140a2f5. > > > > Meh, forgot to attach the patch as usual, of course ... The patch does not apply on Head anymore, could you rebase and post a patch. I'm changing the status to "Waiting for Author". Regards, Vignesh
On 08.06.21 00:28, Tomas Vondra wrote:
>
> new "sequence" publication action
> ---------------------------------
>
> The publications now have a new "sequence" publication action, which is
> enabled by default. This determines whether the publication decodes
> sequences or what.
>
>
> FOR ALL SEQUENCES
> -----------------
>
> It should be possible to create FOR ALL SEQUENCES publications, just
> like we have FOR ALL TABLES. But this produces shift/reduce conflicts
> in the grammar, and I didn't bother dealing with that. So for now it's
> required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...

I have been thinking about these DDL-level issues a bit. The most common use case will be to have a bunch of tables with implicit sequences, and you just want to replicate them from here to there without too much effort. So ideally an implicit sequence should be replicated by default if the table is part of a publication (unless sequences are turned off by the publication option).

We already have support for things like that in GetPublicationRelations(), where a partitioned table is expanded to include the actual partitions. I think that logic could be reused. So in general I would have GetPublicationRelations() include sequences and not have GetPublicationSequenceRelations() at all. Then sequences could also be sent by pg_publication_tables(), maybe with an added relkind column. And then you also don't need so much duplicate DDL code, if you just consider everything a relation. For example, there doesn't seem to be an actual need to have fetch_sequence_list() and the subsequent processing on the subscriber side. It does the same thing as fetch_table_list(), so it might as well just all be one thing. We do, however, probably need some checking that we don't replicate tables into sequences or vice versa.

We probably also don't need a separate FOR ALL SEQUENCES option. What users really want is a "for everything" option. We could think about renaming or alternative syntax, but in principle I think FOR ALL TABLES should include sequences by default.

Tests under src/test/subscription/ are needed.

I'm not sure why test_decoding needs a skip-sequences option. The source code says it's for backward compatibility, but I don't see why we need that.
On 7/20/21 5:30 PM, Peter Eisentraut wrote:
> On 08.06.21 00:28, Tomas Vondra wrote:
> [...]
>
> I have been thinking about these DDL-level issues a bit. The most
> common use case will be to have a bunch of tables with implicit
> sequences, and you just want to replicate them from here to there
> without too much effort. So ideally an implicit sequence should be
> replicated by default if the table is part of a publication (unless
> sequences are turned off by the publication option).

Agreed, that seems like a reasonable approach.

> We already have support for things like that in
> GetPublicationRelations(), where a partitioned table is expanded to
> include the actual partitions. I think that logic could be reused. So
> in general I would have GetPublicationRelations() include sequences and
> don't have GetPublicationSequenceRelations() at all. [...] For example,
> there doesn't seem to be an actual need to have fetch_sequence_list()
> and subsequent processing on the subscriber side. It does the same
> thing as fetch_table_list(), so it might as well just all be one thing.

Not sure. I agree with replicating implicit sequences by default, without having to add them to the publication. But I think we should allow adding other sequences too, and I think some of this code and differentiation from tables will be needed. FWIW I'm not claiming there are no duplicate parts - I've mostly copy-pasted the table-handling code for sequences, and I'll look into reusing some of it.

> We do, however, probably need some checking that we don't replicate
> tables to sequences or vice versa.

True. I haven't tried doing such silly things yet.

> We probably also don't need a separate FOR ALL SEQUENCES option. What
> users really want is a "for everything" option. We could think about
> renaming or alternative syntax, but in principle I think FOR ALL TABLES
> should include sequences by default.
>
> Tests under src/test/subscription/ are needed.

Yeah, true. At the moment there are just tests in test_decoding, mostly because the previous patch versions did not add support for sequences to built-in replication. Will fix.

> I'm not sure why test_decoding needs a skip-sequences option. The
> source code says it's for backward compatibility, but I don't see why we
> need that.

Hmmm, I'm a bit baffled by skip-sequences too. I think Cary added it to limit changes in the test_decoding tests, while the misleading comment about backwards compatibility comes from me.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Here's a rebased version of the patch, no other changes. I think the crucial aspect of this that needs discussion/feedback the most is the transactional vs. non-transactional behavior. All the other questions are less important / cosmetic. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi,

Here's an updated version of this patch - rebased to current master, and fixing some of the issues raised in Peter's review.

Mainly, this adds a TAP test to src/test/subscription, focusing on testing the various situations with transactional and non-transactional behavior (with subtransactions and various ROLLBACK variants).

This new TAP test however uncovered an issue with wait_for_catchup(), because that uses pg_current_wal_lsn() to wait for replication of all the changes. But that does not work when the sequence gets advanced in a transaction that is then aborted, e.g. like this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,100);
  ROLLBACK;

The root cause is that pg_current_wal_lsn() uses LogwrtResult.Write, which is updated by XLogFlush() - but only in RecordTransactionCommit. That makes sense, because only the committed stuff is "visible". But the non-transactional behavior changes this, because now some of the changes from aborted transactions may need to be replicated. Which means wait_for_catchup() ends up not waiting for the sequence change. One option would be adding XLogFlush() to RecordTransactionAbort(), but my guess is we don't do that intentionally (even though aborts should be fairly rare in most workloads).

I'm not entirely sure changing this (replicating changes from aborted xacts) is a good idea. Maybe it'd be better to replicate this "lazily" - instead of replicating the advances right away, we might remember which sequences were advanced in the transaction, and then replicate the current state of those sequences at commit time. The idea is that if an increment is "invisible" we probably don't need to replicate it; it's fine to replicate the next "visible" increment. So for example, given:

  BEGIN;
  SELECT nextval('s');
  ROLLBACK;

  BEGIN;
  SELECT nextval('s');
  COMMIT;

we don't need to replicate the first change, but we need to replicate the second one.
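The lazy scheme can be sketched as a toy model (Python for illustration only, not the patch code; the 32-value cache size is just the default sequence behavior): the transaction remembers which sequences it advanced, commit logs their current state once, and rollback logs nothing, so an aborted advance is covered by the next committed record.

```python
class LazySequenceLog:
    """Toy model: log sequence state once at COMMIT, not per advance."""

    def __init__(self):
        self.current = {}      # sequence name -> current on-disk last_value
        self.touched = set()   # sequences advanced by the open transaction
        self.wal = []          # (sequence, last_value) records written at commit

    def nextval(self, seq, cache=32):
        # on-disk state advances by a whole cache chunk
        self.current[seq] = self.current.get(seq, 0) + cache
        self.touched.add(seq)

    def commit(self):
        # one record per touched sequence, with the *current* state
        for seq in sorted(self.touched):
            self.wal.append((seq, self.current[seq]))
        self.touched.clear()

    def rollback(self):
        # nothing logged; the on-disk advance itself is not undone
        self.touched.clear()


log = LazySequenceLog()
log.nextval('s')
log.rollback()         # aborted: no WAL record for this advance
log.nextval('s')
log.commit()           # committed: one record covering both advances
print(log.wal)         # [('s', 64)]
```

Note how the single committed record at 64 makes the earlier aborted advance irrelevant - which is exactly the "replicate the next visible increment" argument.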
The trouble is we don't decode individual sequence advances, just those that update the on-disk sequence tuple (so every 32 values or so). So we'd need to remember the first increment, in a way that is (a) persistent across restarts and (b) shared by all backends.

The other challenge seems to be ordering of the changes - at the moment we have no issues with this, because increments on the same sequence are replicated immediately, in WAL order. But postponing them to commit time would affect this order.

I've also briefly looked at the code duplication - there's a couple of functions in the patch that I shamelessly copy-pasted and tweaked to handle sequences instead of tables:

publicationcmds.c
-----------------

1) OpenTableList/CloseTableList -> OpenSequenceList/CloseSequenceList

Trivial differences, trivial to get rid of - the only difference is pretty much just table_open vs. relation_open.

2) AlterPublicationTables -> AlterPublicationSequences

This is a bit more complicated, because the tables also handle inheritance (which I think does not apply to sequences). Other than that, it's calling the functions from (1).

subscriptioncmds.c
------------------

1) fetch_table_list, fetch_sequence_list

Minimal differences, essentially just the catalog name.

2) AlterSubscription_refresh

A lot of duplication, but the code handling tables and sequences is almost exactly the same and can be reused fairly easily (moved to a separate function, called for tables and then for sequences).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Tue, Jul 20, 2021 at 5:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I think the crucial aspect of this that needs discussion/feedback the
> most is the transactional vs. non-transactional behavior. All the other
> questions are less important / cosmetic.

Yeah, it seems really tricky to me to get this right. The hard part is, I think, mostly figuring out what the right behavior really is. DDL in PostgreSQL is transactional. Non-DDL operations on sequences are non-transactional. If a given transaction does only one of those things, it seems clear enough what to do, but when the same (sub)transaction does both, it gets messy.

I'd be tempted to think about something like:

1. When a transaction performs only non-transactional operations on sequences, they are emitted immediately.

2. If a transaction performs transactional operations on sequences, the decoded operations acquire a dependency on the transaction and cannot be emitted until that transaction is fully decoded. When commit or abort of that XID is reached, emit the postponed non-transactional operations at that point.

I think this is similar to what you've designed, but I'm not sure that it's exactly equivalent. I think in particular that it may be better to insist that all of these operations are non-transactional and that the debate is only about when they can be sent, rather than trying to sort of convert them into an equivalent series of transactional operations. That approach seems confusing, especially in the case where some (sub)transactions abort. But this is just my $0.02.

--
Robert Haas
EDB: http://www.enterprisedb.com
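The two rules in the preceding message can be sketched as a toy model (Python for illustration, not actual decoder code; all names invented): every sequence operation is eventually emitted, but operations depending on an in-progress transactional change are postponed until that XID finishes - whether by commit or abort.

```python
class DependencyEmitter:
    """Toy model: all sequence ops are non-transactional; only the
    *timing* of their emission depends on pending transactional DDL."""

    def __init__(self):
        self.pending_ddl = {}  # sequence -> XID of in-progress DDL on it
        self.postponed = {}    # XID -> ops waiting for that xact to finish
        self.out = []          # what the output plugin sees, in order

    def ddl(self, seq, xid):
        # transactional operation (e.g. CREATE/ALTER SEQUENCE) in xact `xid`
        self.pending_ddl[seq] = xid
        self.postponed.setdefault(xid, [])

    def increment(self, seq, last_value):
        xid = self.pending_ddl.get(seq)
        if xid is not None:
            # rule 2: acquires a dependency on the DDL transaction
            self.postponed[xid].append((seq, last_value))
        else:
            # rule 1: no dependency, emit immediately
            self.out.append((seq, last_value))

    def finish(self, xid):
        # commit *or* abort: either way, release the postponed ops
        self.out.extend(self.postponed.pop(xid, []))
        self.pending_ddl = {s: x for s, x in self.pending_ddl.items() if x != xid}


e = DependencyEmitter()
e.increment('s', 33)   # no pending DDL: emitted right away
e.ddl('t', 7)          # CREATE SEQUENCE t inside xact 7
e.increment('t', 33)   # postponed behind xact 7
e.finish(7)            # commit (or abort) of xact 7 releases it
print(e.out)           # [('s', 33), ('t', 33)]
```

The model glosses over the awkward case where the DDL transaction aborts and the sequence never comes to exist - which is part of why this is tricky to get exactly right.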
On 30.07.21 20:26, Tomas Vondra wrote:
> Here's an updated version of this patch - rebased to current master,
> and fixing some of the issues raised in Peter's review.

This patch needs an update, as various conflicts have arisen now.

As was discussed before, it might be better to present a separate patch for just the logical decoding part for now, since the replication and DDL stuff has the potential to conflict heavily with other patches being discussed right now. It looks like cutting this patch in two should be doable easily.

I looked through the test cases in test_decoding again. It all looks pretty sensible. If anyone can think of any other tricky or dubious cases, we can add them there. It's easiest to discuss these things with concrete test cases rather than in theory.

One slightly curious issue is that this can make sequence values go backwards, when seen by the logical decoding consumer, like in the test case:

+ BEGIN
+ sequence: public.test_sequence transactional: 1 created: 1 last_value: 1, log_cnt: 0 is_called: 0
+ COMMIT
+ sequence: public.test_sequence transactional: 0 created: 0 last_value: 33, log_cnt: 0 is_called: 1
+ BEGIN
+ sequence: public.test_sequence transactional: 1 created: 1 last_value: 4, log_cnt: 0 is_called: 1
+ COMMIT
+ sequence: public.test_sequence transactional: 0 created: 0 last_value: 334, log_cnt: 0 is_called: 1

I suppose that's okay, since it's not really the intention that someone is concurrently consuming sequence values on the subscriber. Maybe something for the future. Fixing that would require changing the way transactional sequence DDL updates these values, so it's not directly the job of the decoding to address this.

A small thing I found: Maybe the text that test_decoding produces for sequences can be made to look more consistent with the one for tables. For example, in

+ BEGIN
+ sequence: public.test_table_a_seq transactional: 1 created: 1 last_value: 1, log_cnt: 0 is_called: 0
+ sequence: public.test_table_a_seq transactional: 1 created: 0 last_value: 33, log_cnt: 0 is_called: 1
+ table public.test_table: INSERT: a[integer]:1 b[integer]:100
+ table public.test_table: INSERT: a[integer]:2 b[integer]:200
+ COMMIT

note how the punctuation is different.
Hi,

On 9/23/21 12:27 PM, Peter Eisentraut wrote:
> On 30.07.21 20:26, Tomas Vondra wrote:
>> Here's an updated version of this patch - rebased to current master,
>> and fixing some of the issues raised in Peter's review.
>
> This patch needs an update, as various conflicts have arisen now.
>
> As was discussed before, it might be better to present a separate patch
> for just the logical decoding part for now, since the replication and
> DDL stuff has the potential to conflict heavily with other patches being
> discussed right now. It looks like cutting this patch in two should be
> doable easily.

Attached is the rebased patch, split into three parts:

1) basic decoding infrastructure (decoding, reorderbuffer etc.)
2) support for sequences in test_decoding
3) support for sequences in built-in replication (catalogs, syntax, ...)

The last part is the largest one - I'm sure we'll have discussions about the grammar, adding sequences automatically, etc. But as you said, let's focus on the first part, which deals with the required decoding stuff.

I've added a couple of comments, explaining how we track sequences, why we need the XID in nextval(), etc. I've also added streaming support.

> I looked through the test cases in test_decoding again. It all looks
> pretty sensible. If anyone can think of any other tricky or dubious
> cases, we can add them there. It's easiest to discuss these things with
> concrete test cases rather than in theory.
>
> One slightly curious issue is that this can make sequence values go
> backwards, when seen by the logical decoding consumer, like in the test
> case:
> [...]
> I suppose that's okay, since it's not really the intention that someone
> is concurrently consuming sequence values on the subscriber. Maybe
> something for the future. Fixing that would require changing the way
> transactional sequence DDL updates these values, so it's not directly
> the job of the decoding to address this.

Yeah, that's due to how ALTER SEQUENCE does things, and I agree redoing that seems well out of scope for this patch. What seems a bit suspicious is that some of the ALTER SEQUENCE changes have "created: 1" - it's probably correct, though, because those ALTER SEQUENCE statements can be rolled back, so we see it as if a new sequence were created. The flag name might be a bit confusing, though.

> A small thing I found: Maybe the text that test_decoding produces for
> sequences can be made to look more consistent with the one for tables.
> For example, in
> [...]
> note how the punctuation is different.

I did tweak this a bit, hopefully it's more consistent.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Just a note on some design decisions:

> 1) By default, sequences are treated non-transactionally, i.e. sent to
> the output plugin right away.

If our aim is just to make sure that all user-visible data in *transactional* tables is consistent with the sequence state, then one much simplified approach could be to track the results of nextval() calls in a transaction and, at COMMIT, put the latest sequence value in WAL (or just track the sequences affected and put the latest sequence state in WAL at commit - which needs an extra read of the sequence, but protects against race conditions with parallel transactions which get rolled back later).

This avoids sending redundant changes for multiple nextval() calls (like loading a million-row table with a sequence-generated id column).

And one can argue that we can safely ignore anything in ROLLBACKED transactions. This is assuming that even if we did advance the sequence past the last value sent by the latest COMMITTED transaction, it does not matter for database consistency. It can matter if customers just call nextval() in rolled-back transactions and somehow expect these values to be replicated, based on reasoning along the lines of "sequences are not transactional - so rollbacks should not matter".

Or we may get away without most of the in-detail sequence tracking on the source if we just keep track of the xmin of the sequence and send the sequence info over at commit if it == current_transaction_id?

-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.

On Fri, Sep 24, 2021 at 9:16 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> [...]
On 9/25/21 22:05, Hannu Krosing wrote: > Just a note for some design decisions > >> 1) By default, sequences are treated non-transactionally, i.e. sent to the output plugin right away. > > If our aim is just to make sure that all user-visible data in > *transactional* tables is consistent with sequence state then one > very much simplified approach to this could be to track the results of > nextval() calls in a transaction at COMMIT put the latest sequence > value in WAL (or just track the sequences affected and put the latest > sequence state in WAL at commit which needs extra read of sequence but > protects against race conditions with parallel transactions which get > rolled back later) > Not sure. TBH I feel rather uneasy about adding more stuff in COMMIT. > This avoids sending redundant changes for multiple nextval() calls > (like loading a million-row table with sequence-generated id column) > Yeah, it'd be nice to have to optimize this a bit, somehow. But I'd bet it's a negligible amount of data / changes, compared to the table. > And one can argue that we can safely ignore anything in ROLLBACKED > sequences. This is assuming that even if we did advance the sequence > paste the last value sent by the latest COMMITTED transaction it does > not matter for database consistency. > I don't think we can ignore aborted (ROLLBACK) transactions, in the sense that you can't just discard the increments. Imagine you have this sequence of transactions: BEGIN; SELECT nextval('s'); -- allocates new chunk of values ROLLBACK; BEGIN; SELECT nextval('s'); -- returns one of the cached values COMMIT; If you ignore the aborted transaction, then the sequence increment won't be replicated -- but that's wrong, because user now has a visible sequence value from that chunk. So I guess we'd have to maintain a cache of sequences incremented in the current session, do nothing in aborted transactions (i.e. keep the contents but don't log anything) and log/reset at commit. 
I wonder if multiple sessions make this even more problematic (e.g. due to a session just disconnecting mid-transaction, without writing the abort record at all). But AFAICS that's not an issue, because the other session has a separate cache for the sequence.

> It can matter if customers just call nextval() in rolled-back
> transactions and somehow expect these values to be replicated based on
> reasoning along the lines of "sequences are not transactional - so
> rollbacks should not matter".

I don't think we guarantee anything for data in transactions that did not commit, so this seems like a non-issue. I.e. we don't need to go out of our way to guarantee something we never promised.

> Or we may get away without most in-detail sequence tracking on the
> source if we just keep track of the xmin of the sequence and send the
> sequence info over at commit if it == current_transaction_id ?

Not sure I understand this proposal. Can you explain?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I've spent a bit of time exploring the alternative approach outlined by Hannu, i.e. tracking sequences accessed by the transaction, and logging the final state just once at COMMIT.

Attached is an experimental version of the patch series doing that - 0001 does the original approach (decoding the sequence updates from WAL) and then 0002 reworks it to this alternative solution. The 0003 and 0004 parts stay mostly the same, except for minor fixes. Some of the tests in 0003/0004 fail, because 0002 changes the semantics in various ways (more about that later).

The original approach (0001) may seem complex at first, but in principle it just decodes changes to the sequence relation, and either stashes them into the transaction (just like other changes) or applies them right away. I'd say that's the most complicated part - deciding whether the change is transactional or not.

0002 reworks that so that it doesn't decode the existing WAL records, but tracks sequences which have been modified (updated the on-disk state) and then accessed in the current transaction. And then at COMMIT time we write a new WAL message with info about the sequence.

I realized we already cache sequences for each session - seqhashtab in sequence.c. It doesn't have any concept of a transaction, but it seems fairly easy to make that possible. I did this by adding two flags:

- needs_log - the session advanced the sequence (on disk)
- touched - the current xact called nextval() etc.

The idea is that what matters is updates to the on-disk state, so whenever we do that we set needs_log. But it only matters when the changes are made visible in a committed transaction. Consider for example this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,10000) s(i);
ROLLBACK;

SELECT nextval('s');

The first nextval() call certainly sets both flags to true, at least for default sequences caching 32 values.
But the values are not confirmed to the user because of the rollback - this resets the 'touched' flag, but leaves 'needs_log' set to true. And then the next nextval() - which may easily be served just from the cache - sets touched=true again, and logs the sequence state at (implicit) commit. Which resets both flags again.

The logging/cleanup happens in AtEOXact_Sequences(), which gets called before commit/abort. This walks all cached sequences and writes the state for those with both flags true (or resets the flags on abort). The cache also keeps info about the last "sequence state" in the session, which is then used when writing it into WAL.

To write the sequence state into WAL, I've added a new WAL record xl_logical_sequence to RM_LOGICALMSG_ID, next to xl_logical_message. It's a bit arbitrary, maybe it should be part of RM_SEQ_ID, but it does the trick. I don't think this is the main issue and it's easy enough to move it elsewhere if needed.

So, that seems fairly straightforward and it may reduce the number of replication messages for large transactions. Unfortunately, it's not much simpler compared to the first approach - the amount of code is about the same, and there's a bunch of other issues.

The main issue seems to be about ordering. Consider multiple sessions all advancing the sequence. With the "old" approach this was naturally ordered - the order in which the increments were written to WAL made sense. But the sessions may advance the sequences in one order and then commit in a different order, which mixes up the updates. Consider for example this scenario with two concurrent transactions:

T1: nextval('s') -> allocates values [1,32]
T2: nextval('s') -> allocates values [33,64]
T2: commit -> logs [33,64]
T1: commit -> logs [1,32]

The result is that the sequence on the replica diverged, because it replayed the increments in the opposite order.

I can think of two ways to fix this. Firstly, we could "merge" the increments in some smart way, e.g.
by discarding values considered "stale" (like decrements). But that seems pretty fragile, because the sequence may be altered in various ways, reset, etc. And it seems more like transferring responsibility to someone else instead of actually solving the issue. The other fix is simply reading the current sequence state from disk at commit and logging that (instead of the values cached from the last increment). But I'm rather skeptical about doing such things right before COMMIT. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
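The commit-order divergence described in that message is easy to demonstrate with a toy replay loop. This is purely illustrative: the replica naively applies the commit-time "last_value" records in the order they were logged, which is the commit order, not the allocation order:

```python
# Toy replay: each commit logs the committing session's cached
# last_value; the replica applies records strictly in commit order,
# with no merging or "is this stale?" check.
def replay(commit_ordered_last_values):
    replica_last_value = 0
    for last_value in commit_ordered_last_values:
        replica_last_value = last_value   # naive apply
    return replica_last_value

# Allocation order: T1 took [1,32], then T2 took [33,64].
# If T2 commits first, the replica ends up behind the primary:
assert replay([64, 32]) == 32   # diverged: replica thinks 32 is current
# With commit order matching allocation order, the result is correct:
assert replay([32, 64]) == 64
```

Any fix then has to either merge the records (discard the "stale" 32) or re-read the on-disk state at commit, which are exactly the two options discussed above.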
Attachments
Hi,

Here's a slightly improved version of the patch, fixing a couple of issues I failed to notice. It also addresses a couple of the issues described in the last message, although mostly to show what would need to be done.

1) Handle sequences dropped in the xact by calling try_relation_open, and just doing nothing if not found. Otherwise we'd get a failure in reorderbuffer, when decoding the change.

2) Fixed nextval_internal() to log the correct last_value (the one we write into WAL).

3) Reread the sequence state in AtEOXact_Sequences, to deal with the ordering issue described before. This makes (2) somewhat pointless, because we just read whatever is on disk at that point. But having both makes it easier to experiment / see what'd happen.

4) Log the state in DefineSequence() - without this we'd not have the initial sequence state in the WAL, because only nextval/setval etc. do the logging. The old approach (decoding the sequence tuple) does not have this issue.

Change (3) alters the behavior in a somewhat strange way. Consider this case with two concurrent transactions:

T1: BEGIN;
T2: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: SELECT nextval('s') FROM generate_series(1,100) s(i);
T1: COMMIT;
T2: COMMIT;

The result is that both transactions have used the same sequence, and so will re-read the state from disk. But at that point the state is exactly the same, so we'll log the same thing twice.

There's a much deeper issue, though. The current patch only logs the sequence if the session generated WAL when incrementing the sequence (which happens every 32 values). But other sessions may already use values from this range, so consider for example this:

T1: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: BEGIN;
T2: SELECT nextval('s');
T2: COMMIT;
T1: ROLLBACK;

Which unfortunately means T2 already used a value, but the increment may not be logged at that time (or ever).
This seems like a fatal issue, because it means we need to log *all* sequences the transaction touches, not just those that wrote the increment to WAL. That might still work for large transactions consuming many sequence values, but it's pretty inefficient for small OLTP transactions that only need one or two values from the sequence. So I think just decoding the sequence tuples is a better solution - for large transactions (consuming many values from the sequence) it may be more expensive (i.e. send more records to replica). But I doubt that matters too much - it's likely negligible compared to other data for large transactions. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
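The fatal issue above -- a transaction consuming sequence values without itself having written the sequence's WAL record -- can be sketched in Python. The flag names follow the needs_log/touched scheme described earlier in the thread; the shared-state model itself is an illustrative simplification (real sessions each have their own cache):

```python
# Two sessions draw from one sequence; one WAL record covers a
# 32-value chunk and is written by whichever session crosses the
# chunk boundary.  Commit-time logging is keyed on needs_log && touched.
class SharedSequence:
    CHUNK = 32
    def __init__(self):
        self.on_disk_last = 0   # last value covered by a WAL record
        self.next_val = 1

class Session:
    def __init__(self, seq):
        self.seq = seq
        self.needs_log = False  # this session advanced the on-disk state
        self.touched = False    # current xact called nextval()

    def nextval(self):
        if self.seq.next_val > self.seq.on_disk_last:
            # this session crosses the boundary and writes the WAL record
            self.seq.on_disk_last = self.seq.next_val + SharedSequence.CHUNK - 1
            self.needs_log = True
        self.touched = True
        v = self.seq.next_val
        self.seq.next_val += 1
        return v

    def commit_logs_sequence(self):
        return self.needs_log and self.touched

seq = SharedSequence()
t1, t2 = Session(seq), Session(seq)

t1.nextval()     # T1 writes the WAL record covering [1,32]
v = t2.nextval() # T2 consumes value 2 from that range ...
assert v == 2
assert t2.commit_logs_sequence() is False  # ... yet would log nothing
# If T1 now rolls back, no committed transaction ever logs the state
# covering T2's value -- hence "log *all* touched sequences" is needed.
```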
Attachments
On 22.11.21 01:47, Tomas Vondra wrote: > So I think just decoding the sequence tuples is a better solution - for > large transactions (consuming many values from the sequence) it may be > more expensive (i.e. send more records to replica). But I doubt that > matters too much - it's likely negligible compared to other data for > large transactions. I agree that the original approach is better. It was worth trying out this alternative, but it seems quite complicated. I note that a lot of additional code had to be added around several areas of the code, whereas the original patch really just touched the logical decoding code, as it should. This doesn't prevent anyone from attempting to optimize things somehow in the future, but for now let's move forward with the simple approach.
> On 22. 11. 2021, at 16:44, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 22.11.21 01:47, Tomas Vondra wrote:
>> So I think just decoding the sequence tuples is a better solution - for large transactions (consuming many values from the sequence) it may be more expensive (i.e. send more records to replica). But I doubt that matters too much - it's likely negligible compared to other data for large transactions.
>
> I agree that the original approach is better. It was worth trying out this alternative, but it seems quite complicated. I note that a lot of additional code had to be added around several areas of the code, whereas the original patch really just touched the logical decoding code, as it should. This doesn't prevent anyone from attempting to optimize things somehow in the future, but for now let's move forward with the simple approach.

+1

--
Petr Jelinek
Hi, On 2021-09-25 22:05:43 +0200, Hannu Krosing wrote: > If our aim is just to make sure that all user-visible data in > *transactional* tables is consistent with sequence state then one > very much simplified approach to this could be to track the results of > nextval() calls in a transaction at COMMIT put the latest sequence > value in WAL (or just track the sequences affected and put the latest > sequence state in WAL at commit which needs extra read of sequence but > protects against race conditions with parallel transactions which get > rolled back later) I think this is a bad idea. It's architecturally more complicated and prevents use cases because sequence values aren't guaranteed to be as new as on the original system. You'd need to track all sequence use somehow *even if there is no relevant WAL generated* in a transaction. There's simply no evidence of sequence use in a transaction if that transaction uses a previously logged sequence value. Greetings, Andres Freund
On 11/23/21 02:01, Andres Freund wrote: > Hi, > > On 2021-09-25 22:05:43 +0200, Hannu Krosing wrote: >> If our aim is just to make sure that all user-visible data in >> *transactional* tables is consistent with sequence state then one >> very much simplified approach to this could be to track the results of >> nextval() calls in a transaction at COMMIT put the latest sequence >> value in WAL (or just track the sequences affected and put the latest >> sequence state in WAL at commit which needs extra read of sequence but >> protects against race conditions with parallel transactions which get >> rolled back later) > > I think this is a bad idea. It's architecturally more complicated and prevents > use cases because sequence values aren't guaranteed to be as new as on the > original system. You'd need to track all sequence use somehow *even if there > is no relevant WAL generated* in a transaction. There's simply no evidence of > sequence use in a transaction if that transaction uses a previously logged > sequence value. > Not quite. We already have a cache of all sequences used by a session (see seqhashtab in sequence.c), and it's not that hard to extend it to per-transaction tracking. That's what the last two versions do, mostly. But there are various issues with that approach, described in my last message(s). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I have checked the 0001 and 0003 patches. (I think we have dismissed the approach in 0002 for now. And let's talk about 0004 later.)

I have attached a few fixup patches, described further below.

# 0001

The argument "create" for fill_seq_with_data() is always true (and patch 0002 removes it). So I'm not sure if it's needed. If it is, it should be documented somewhere.

About the comment you added in nextval_internal(): It's a good explanation, so I would leave it in. I also agree with the conclusion of the comment that the current solution is reasonable. We probably don't need the same comment again in fill_seq_with_data() and again in do_setval(). Note that both of the latter functions already point to nextval_internal() for other comments, so the same can be relied upon here as well.

The order of the new fields sequence_cb and stream_sequence_cb is a bit inconsistent compared to truncate_cb and stream_truncate_cb. Maybe take another look to make the order more uniform.

Some documentation needs to be added to the Logical Decoding chapter. I have attached a patch that I think catches all the places that need to be updated, with some details left for you to fill in. ;-) The ordering of some of the documentation chunks reflects the order in which the callbacks appear in the header files, which might not be optimal; see above.

In reorderbuffer.c, you left a comment about how to access a sequence tuple. There is an easier way, using Form_pg_sequence_data, which is how sequence.c also does it. See attached patch.

# 0003

The tests added in 0003 don't pass for me. It appears that you might have forgotten to update the expected files after you added some tests or something. The cfbot shows the same. See attached patch for a correction, but do check what your intent was.

As mentioned before, we probably don't need the skip-sequences option in test_decoding.
Attachments
On 12/7/21 15:38, Peter Eisentraut wrote: > I have checked the 0001 and 0003 patches. (I think we have dismissed > the approach in 0002 for now. And let's talk about 0004 later.) > Right, I think that's correct. > I have attached a few fixup patches, described further below. > > # 0001 > > The argument "create" for fill_seq_with_data() is always true (and > patch 0002 removes it). So I'm not sure if it's needed. If it is, it > should be documented somewhere. > Good point. I think it could be removed, but IIRC it'll be needed when adding sequence decoding to the built-in replication, and that patch is meant to be just an implementation of the API, without changing WAL. OTOH I don't see it in the last version of that patch (in ResetSequence2 call) so maybe it's not really needed. I've kept it for now, but I'll investigate. > About the comment you added in nextval_internal(): It's a good > explanation, so I would leave it in. I also agree with the > conclusion of the comment that the current solution is reasonable. We > probably don't need the same comment again in fill_seq_with_data() and > again in do_setval(). Note that both of the latter functions already > point to nextval_interval() for other comments, so the same can be > relied upon here as well. > True. I moved it a bit in nextval_internal() and removed the other copies. The existing references should be enough. > The order of the new fields sequence_cb and stream_sequence_cb is a > bit inconsistent compared to truncate_cb and stream_truncate_cb. > Maybe take another look to make the order more uniform. > You mean in OutputPluginCallbacks? I'd actually argue it's the truncate callbacks that are inconsistent - in the regular section truncate_cb is right before commit_cb, while in the streaming section it's at the end. Or what order do you think would be better? I'm fine with changing it. > Some documentation needs to be added to the Logical Decoding chapter. 
> I have attached a patch that I think catches all the places that need > to be updated, with some details left for you to fill in. ;-) The > ordering of the some of the documentation chunks reflects the order in > which the callbacks appear in the header files, which might not be > optimal; see above. > Thanks. I added a bit about the callbacks being optional and what the parameters mean (only for sequence_cb, as the stream_ callbacks generally don't copy that bit). > In reorderbuffer.c, you left a comment about how to access a sequence > tuple. There is an easier way, using Form_pg_sequence_data, which is > how sequence.c also does it. See attached patch. > Yeah, that looks much nicer. > # 0003 > > The tests added in 0003 don't pass for me. It appears that you might > have forgotten to update the expected files after you added some tests > or something. The cfbot shows the same. See attached patch for a > correction, but do check what your intent was. > Yeah. I suspect I removed the expected results due to the experimental rework. I haven't noticed that because some of the tests for the built-in replication are expected to fail. Attached is an updated version of the first two parts (infrastructure and test_decoding), with all your fixes merged. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 08.12.21 01:23, Tomas Vondra wrote: >> The argument "create" for fill_seq_with_data() is always true (and >> patch 0002 removes it). So I'm not sure if it's needed. If it is, it >> should be documented somewhere. >> > > Good point. I think it could be removed, but IIRC it'll be needed when > adding sequence decoding to the built-in replication, and that patch is > meant to be just an implementation of the API, without changing WAL. > > OTOH I don't see it in the last version of that patch (in ResetSequence2 > call) so maybe it's not really needed. I've kept it for now, but I'll > investigate. Ok, please check. If it is needed, perhaps then we need a way for test_decoding() to simulate it, for testing. But perhaps it's not needed. >> The order of the new fields sequence_cb and stream_sequence_cb is a >> bit inconsistent compared to truncate_cb and stream_truncate_cb. >> Maybe take another look to make the order more uniform. >> > > You mean in OutputPluginCallbacks? I'd actually argue it's the truncate > callbacks that are inconsistent - in the regular section truncate_cb is > right before commit_cb, while in the streaming section it's at the end. Ok, that makes sense. Then leave yours. When the question about fill_seq_with_data() is resolved, I have no more comments on this part.
Hi,

here's an updated version of the patches, dealing with almost all of the issues (at least in the 0001 and 0002 parts). The main changes:

1) I've removed the 'created' flag from fill_seq_with_data, as discussed. I don't think it's needed by any of the parts (not even 0003, AFAICS). We still need it in xl_seq_rec, though.

2) The GetCurrentTransactionId() calls added to sequence.c are made only with wal_level=logical, to minimize the overhead.

There's still one remaining problem, that I already explained in [1]. The problem is that with this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK;

The root cause is that pg_current_wal_lsn() uses LogwrtResult.Write, which is updated by XLogFlush() - but only in RecordTransactionCommit. Which makes sense, because only the committed stuff is "visible".

But the non-transactional behavior of sequence decoding disagrees with this, because now some of the changes from aborted transactions may be replicated. Which means the wait_for_catchup() ends up not waiting for the sequence change to be replicated. This is an issue for tests in patch 0003, at least.

My concern is that this actually affects other places waiting for things getting replicated :-/

I have outlined some ideas on how to deal with this in [1], but we got sidetracked by exploring the alternative approach.

regards

[1] https://www.postgresql.org/message-id/3d6df331-5532-6848-eb45-344b265e0238%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > here's an updated version of the patches, dealing with almost all of the > issues (at least in the 0001 and 0002 parts). The main changes: > > 1) I've removed the 'created' flag from fill_seq_with_data, as > discussed. I don't think it's needed by any of the parts (not even 0003, > AFAICS). We still need it in xl_seq_rec, though. > > 2) GetCurrentTransactionId() added to sequence.c are called only with > wal_level=logical, to minimize the overhead. > > > There's still one remaining problem, that I already explained in [1]. > The problem is that with this: > > BEGIN; > SELECT nextval('s') FROM generate_series(1,100); > ROLLBACK; > > > The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, > which is updated by XLogFlush() - but only in RecordTransactionCommit. > Which makes sense, because only the committed stuff is "visible". > > But the non-transactional behavior of sequence decoding disagrees with > this, because now some of the changes from aborted transactions may be > replicated. Which means the wait_for_catchup() ends up not waiting for > the sequence change to be replicated. This is an issue for tests in > patch 0003, at least. > > My concern is this actually affects other places waiting for things > getting replicated :-/ > By any chance, will this impact synchronous replication as well which waits for commits to be replicated? How is this patch dealing with prepared transaction case where at prepare we will send transactional changes and then later if rollback prepared happens then the publisher will rollback changes related to new relfilenode but subscriber would have already replayed the updated seqval change which won't be rolled back? -- With Regards, Amit Kapila.
On 14.12.21 02:31, Tomas Vondra wrote: > There's still one remaining problem, that I already explained in [1]. > The problem is that with this: > > BEGIN; > SELECT nextval('s') FROM generate_series(1,100); > ROLLBACK; > > > The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, > which is updated by XLogFlush() - but only in RecordTransactionCommit. > Which makes sense, because only the committed stuff is "visible". > > But the non-transactional behavior of sequence decoding disagrees with > this, because now some of the changes from aborted transactions may be > replicated. Which means the wait_for_catchup() ends up not waiting for > the sequence change to be replicated. I can't think of a reason why this might be a problem.
On 12/15/21 14:20, Amit Kapila wrote: > On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> here's an updated version of the patches, dealing with almost all of the >> issues (at least in the 0001 and 0002 parts). The main changes: >> >> 1) I've removed the 'created' flag from fill_seq_with_data, as >> discussed. I don't think it's needed by any of the parts (not even 0003, >> AFAICS). We still need it in xl_seq_rec, though. >> >> 2) GetCurrentTransactionId() added to sequence.c are called only with >> wal_level=logical, to minimize the overhead. >> >> >> There's still one remaining problem, that I already explained in [1]. >> The problem is that with this: >> >> BEGIN; >> SELECT nextval('s') FROM generate_series(1,100); >> ROLLBACK; >> >> >> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, >> which is updated by XLogFlush() - but only in RecordTransactionCommit. >> Which makes sense, because only the committed stuff is "visible". >> >> But the non-transactional behavior of sequence decoding disagrees with >> this, because now some of the changes from aborted transactions may be >> replicated. Which means the wait_for_catchup() ends up not waiting for >> the sequence change to be replicated. This is an issue for tests in >> patch 0003, at least. >> >> My concern is this actually affects other places waiting for things >> getting replicated :-/ >> > > By any chance, will this impact synchronous replication as well which > waits for commits to be replicated? > Physical or logical replication? Physical is certainly not replicated. For logical replication, it's more complicated. 
> How is this patch dealing with the prepared transaction case, where at
> prepare we will send transactional changes, and then later if a
> rollback prepared happens the publisher will roll back changes related
> to the new relfilenode, but the subscriber would have already replayed
> the updated seqval change, which won't be rolled back?

I'm not sure what exact scenario you are describing, but in general we don't roll back sequence changes anyway, so this should not cause any divergence between the publisher and subscriber.

Or are you talking about something like ALTER SEQUENCE? I think that should trigger the transactional behavior for the new relfilenode, so the subscriber won't see those changes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/15/21 14:51, Tomas Vondra wrote: > On 12/15/21 14:20, Amit Kapila wrote: >> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> Hi, >>> >>> here's an updated version of the patches, dealing with almost all of the >>> issues (at least in the 0001 and 0002 parts). The main changes: >>> >>> 1) I've removed the 'created' flag from fill_seq_with_data, as >>> discussed. I don't think it's needed by any of the parts (not even 0003, >>> AFAICS). We still need it in xl_seq_rec, though. >>> >>> 2) GetCurrentTransactionId() added to sequence.c are called only with >>> wal_level=logical, to minimize the overhead. >>> >>> >>> There's still one remaining problem, that I already explained in [1]. >>> The problem is that with this: >>> >>> BEGIN; >>> SELECT nextval('s') FROM generate_series(1,100); >>> ROLLBACK; >>> >>> >>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, >>> which is updated by XLogFlush() - but only in RecordTransactionCommit. >>> Which makes sense, because only the committed stuff is "visible". >>> >>> But the non-transactional behavior of sequence decoding disagrees with >>> this, because now some of the changes from aborted transactions may be >>> replicated. Which means the wait_for_catchup() ends up not waiting for >>> the sequence change to be replicated. This is an issue for tests in >>> patch 0003, at least. >>> >>> My concern is this actually affects other places waiting for things >>> getting replicated :-/ >>> >> >> By any chance, will this impact synchronous replication as well which >> waits for commits to be replicated? >> > > Physical or logical replication? Physical is certainly not replicated. > > For logical replication, it's more complicated. > Apologies, sent too early ... 
I think it's more complicated for logical sync replication, because of a scenario like this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,100);  <-- writes WAL
ROLLBACK;

SELECT nextval('s');

The first transaction advances the sequence enough to generate WAL, which we do every 32 values. But it's rolled back, so it does not update LogwrtResult.Write, because that happens only at commit. And then the nextval() generates a value from the sequence without generating WAL, so it doesn't update the LSN either (IIRC).

That'd mean sync replication may not wait for this change to reach the subscriber.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
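The LSN mismatch in that scenario can be modelled with a toy Python sketch. The two counters below stand in for the WAL insert position and for what pg_current_wal_lsn() reports (advanced by XLogFlush() at RecordTransactionCommit); everything else is an illustrative simplification:

```python
wal_insert_lsn = 0   # where new WAL records are inserted
flushed_lsn = 0      # what pg_current_wal_lsn() reports

def write_record():
    global wal_insert_lsn
    wal_insert_lsn += 1
    return wal_insert_lsn

def commit():
    # XLogFlush() at commit advances the reported position
    global flushed_lsn
    flushed_lsn = wal_insert_lsn

write_record()                    # some earlier transaction's change ...
commit()                          # ... which commits normally

seq_record_lsn = write_record()   # T1: sequence increment hits WAL
# T1: ROLLBACK -- no commit-time flush, the reported LSN stays put.
# Later, SELECT nextval('s') is served from the already-logged range,
# so it writes no new WAL and advances nothing either.

assert flushed_lsn == 1 and seq_record_lsn == 2
assert seq_record_lsn > flushed_lsn
# Anything waiting for the replica to reach pg_current_wal_lsn()
# (e.g. wait_for_catchup) can return before the sequence record
# has reached the subscriber.
```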
Looking at 0003, On 2021-Dec-14, Tomas Vondra wrote: > diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml > index bb4ef5e5e22..4d166ad3f9c 100644 > --- a/doc/src/sgml/ref/alter_publication.sgml > +++ b/doc/src/sgml/ref/alter_publication.sgml > @@ -31,7 +31,9 @@ ALTER PUBLICATION <replaceable class="parameter">name</replaceable> RENAME TO <r > <phrase>where <replaceable class="parameter">publication_object</replaceable> is one of:</phrase> > > TABLE [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ] [, ... ] > + SEQUENCE <replaceable class="parameter">sequence_name</replaceable> [ * ] [, ... ] > ALL TABLES IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ] > + ALL SEQUENCE IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ] Note that this says ALL SEQUENCE; I think it should be ALL SEQUENCES. > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y > index 3d4dd43e47b..f037c17985b 100644 > --- a/src/backend/parser/gram.y > +++ b/src/backend/parser/gram.y > @@ -9762,6 +9762,26 @@ PublicationObjSpec: ... > + | ALL SEQUENCE IN_P SCHEMA ColId > + { > + $$ = makeNode(PublicationObjSpec); > + $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_SCHEMA; > + $$->name = $5; > + $$->location = @5; > + } > + | ALL SEQUENCES IN_P SCHEMA CURRENT_SCHEMA > + { > + $$ = makeNode(PublicationObjSpec); > + $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_CUR_SCHEMA; > + $$->location = @5; > + } And here you have ALL SEQUENCE in one spot and ALL SEQUENCES in the other. BTW I think these enum values should use the plural too, PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA (not SEQUENCE). I suppose you copied from PUBLICATIONOBJ_TABLE_IN_CUR_SCHEMA, but that too seems to be a mistake: should be PUBLICATIONOBJ_TABLES_IN_CUR_SCHEMA. 
> @@ -10097,6 +10117,12 @@ UnlistenStmt: > } > ; > > +/* > + * FIXME > + * > + * opt_publication_for_sequences and publication_for_sequences should be > + * copies for sequences > + */ Not sure if this FIXME is relevant or should just be removed. > @@ -10105,6 +10131,12 @@ UnlistenStmt: > * BEGIN / COMMIT / ROLLBACK > * (also older versions END / ABORT) > * > + * ALTER PUBLICATION name ADD SEQUENCE sequence [, sequence2] > + * > + * ALTER PUBLICATION name DROP SEQUENCE sequence [, sequence2] > + * > + * ALTER PUBLICATION name SET SEQUENCE sequence [, sequence2] > + * This comment addition seems misplaced? > diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c > index 2f412ca3db3..e30bf7b1b55 100644 > --- a/src/bin/psql/tab-complete.c > +++ b/src/bin/psql/tab-complete.c > @@ -1647,13 +1647,13 @@ psql_completion(const char *text, int start, int end) > COMPLETE_WITH("ADD", "DROP", "OWNER TO", "RENAME TO", "SET"); > /* ALTER PUBLICATION <name> ADD */ > else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD")) > - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); > + COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); > /* ALTER PUBLICATION <name> DROP */ > else if (Matches("ALTER", "PUBLICATION", MatchAny, "DROP")) > - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); > + COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); > /* ALTER PUBLICATION <name> SET */ > else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET")) > - COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE"); > + COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); I think you should also add "ALL SEQUENCES IN SCHEMA" to these lists. > else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA")) ... and perhaps make this "ALL", "TABLES|SEQUENCES", "IN", "SCHEMA". -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Nunca confiaré en un traidor. Ni siquiera si el traidor lo he creado yo" (Barón Vladimir Harkonnen)
On 12/15/21 17:42, Alvaro Herrera wrote: > Looking at 0003, > > On 2021-Dec-14, Tomas Vondra wrote: > >> diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml >> index bb4ef5e5e22..4d166ad3f9c 100644 >> --- a/doc/src/sgml/ref/alter_publication.sgml >> +++ b/doc/src/sgml/ref/alter_publication.sgml >> @@ -31,7 +31,9 @@ ALTER PUBLICATION <replaceable class="parameter">name</replaceable> RENAME TO <r >> <phrase>where <replaceable class="parameter">publication_object</replaceable> is one of:</phrase> >> >> TABLE [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ] [, ... ] >> + SEQUENCE <replaceable class="parameter">sequence_name</replaceable> [ * ] [, ... ] >> ALL TABLES IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ] >> + ALL SEQUENCE IN SCHEMA { <replaceable class="parameter">schema_name</replaceable> | CURRENT_SCHEMA } [, ... ] > > Note that this says ALL SEQUENCE; I think it should be ALL SEQUENCES. > >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y >> index 3d4dd43e47b..f037c17985b 100644 >> --- a/src/backend/parser/gram.y >> +++ b/src/backend/parser/gram.y >> @@ -9762,6 +9762,26 @@ PublicationObjSpec: > ... >> + | ALL SEQUENCE IN_P SCHEMA ColId >> + { >> + $$ = makeNode(PublicationObjSpec); >> + $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_SCHEMA; >> + $$->name = $5; >> + $$->location = @5; >> + } >> + | ALL SEQUENCES IN_P SCHEMA CURRENT_SCHEMA >> + { >> + $$ = makeNode(PublicationObjSpec); >> + $$->pubobjtype = PUBLICATIONOBJ_SEQUENCE_IN_CUR_SCHEMA; >> + $$->location = @5; >> + } > > And here you have ALL SEQUENCE in one spot and ALL SEQUENCES in the > other. > > BTW I think these enum values should use the plural too, > PUBLICATIONOBJ_SEQUENCES_IN_CUR_SCHEMA (not SEQUENCE). I suppose you > copied from PUBLICATIONOBJ_TABLE_IN_CUR_SCHEMA, but that too seems to be > a mistake: should be PUBLICATIONOBJ_TABLES_IN_CUR_SCHEMA. 
> > >> @@ -10097,6 +10117,12 @@ UnlistenStmt: >> } >> ; >> >> +/* >> + * FIXME >> + * >> + * opt_publication_for_sequences and publication_for_sequences should be >> + * copies for sequences >> + */ > > Not sure if this FIXME is relevant or should just be removed. > >> @@ -10105,6 +10131,12 @@ UnlistenStmt: >> * BEGIN / COMMIT / ROLLBACK >> * (also older versions END / ABORT) >> * >> + * ALTER PUBLICATION name ADD SEQUENCE sequence [, sequence2] >> + * >> + * ALTER PUBLICATION name DROP SEQUENCE sequence [, sequence2] >> + * >> + * ALTER PUBLICATION name SET SEQUENCE sequence [, sequence2] >> + * > > This comment addition seems misplaced? > >> diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c >> index 2f412ca3db3..e30bf7b1b55 100644 >> --- a/src/bin/psql/tab-complete.c >> +++ b/src/bin/psql/tab-complete.c >> @@ -1647,13 +1647,13 @@ psql_completion(const char *text, int start, int end) >> COMPLETE_WITH("ADD", "DROP", "OWNER TO", "RENAME TO", "SET"); >> /* ALTER PUBLICATION <name> ADD */ >> else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD")) >> - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); >> + COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); >> /* ALTER PUBLICATION <name> DROP */ >> else if (Matches("ALTER", "PUBLICATION", MatchAny, "DROP")) >> - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); >> + COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); >> /* ALTER PUBLICATION <name> SET */ >> else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET")) >> - COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE"); >> + COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE|SEQUENCE"); > > I think you should also add "ALL SEQUENCES IN SCHEMA" to these lists. > > >> else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA")) > > ... and perhaps make this "ALL", "TABLES|SEQUENCES", "IN", "SCHEMA". > Thanks for the review. 
I'm aware 0003 is still incomplete and subject to change - it's certainly not meant for commit yet. The current 0003 patch is sufficient for testing the infrastructure, but we need to figure out how to make it easier to use, what to do with implicit sequences and similar things. Peter had some ideas in [1]. [1] https://www.postgresql.org/message-id/359bf6d0-413d-292a-4305-e99eeafead39%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 15, 2021 at 7:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 12/15/21 14:20, Amit Kapila wrote: > > On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Hi, > >> > >> here's an updated version of the patches, dealing with almost all of the > >> issues (at least in the 0001 and 0002 parts). The main changes: > >> > >> 1) I've removed the 'created' flag from fill_seq_with_data, as > >> discussed. I don't think it's needed by any of the parts (not even 0003, > >> AFAICS). We still need it in xl_seq_rec, though. > >> > >> 2) GetCurrentTransactionId() added to sequence.c are called only with > >> wal_level=logical, to minimize the overhead. > >> > >> > >> There's still one remaining problem, that I already explained in [1]. > >> The problem is that with this: > >> > >> BEGIN; > >> SELECT nextval('s') FROM generate_series(1,100); > >> ROLLBACK; > >> > >> > >> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, > >> which is updated by XLogFlush() - but only in RecordTransactionCommit. > >> Which makes sense, because only the committed stuff is "visible". > >> > >> But the non-transactional behavior of sequence decoding disagrees with > >> this, because now some of the changes from aborted transactions may be > >> replicated. Which means the wait_for_catchup() ends up not waiting for > >> the sequence change to be replicated. This is an issue for tests in > >> patch 0003, at least. > >> > >> My concern is this actually affects other places waiting for things > >> getting replicated :-/ > >> > > > > By any chance, will this impact synchronous replication as well which > > waits for commits to be replicated? > > > > Physical or logical replication? > logical replication. > Physical is certainly not replicated. > > For logical replication, it's more complicated. 
> > > How is this patch dealing with prepared transaction case where at > > prepare we will send transactional changes and then later if rollback > > prepared happens then the publisher will rollback changes related to > > new relfilenode but subscriber would have already replayed the updated > > seqval change which won't be rolled back? > > > > I'm not sure what exact scenario you are describing, but in general we > don't rollback sequence changes anyway, so this should not cause any > divergence between the publisher and subscriber. > > Or are you talking about something like ALTER SEQUENCE? I think that > should trigger the transactional behavior for the new relfilenode, so > the subscriber won't see that changes. > Yeah, something like Alter Sequence and nextval combination. I see that it will be handled because of the transactional behavior for the new relfilenode as for applying each sequence change, a new relfilenode is created. I think on apply side, the patch always creates a new relfilenode irrespective of whether the sequence message is transactional or not. Do we need to do that for the non-transactional messages as well? -- With Regards, Amit Kapila.
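[Editorial illustration: the ALTER SEQUENCE plus nextval() case discussed above can be sketched as follows. This is a hedged sketch of the behavior described in the thread, not committed syntax or a definitive test; it assumes an existing sequence s and max_prepared_transactions > 0.]

```sql
-- ALTER SEQUENCE assigns the sequence a new relfilenode, so the increments
-- that follow it inside the same transaction are decoded transactionally,
-- i.e. stashed in the reorder buffer and discarded on rollback:
BEGIN;
ALTER SEQUENCE s RESTART WITH 1000;
SELECT nextval('s');            -- logged against the new relfilenode
PREPARE TRANSACTION 'tx1';
ROLLBACK PREPARED 'tx1';        -- subscriber never applies these increments
```

Increments of the sequence's old relfilenode outside such a transaction remain non-transactional and are not rolled back, which is why no divergence arises between publisher and subscriber.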
On 12/16/21 12:59, Amit Kapila wrote: > On Wed, Dec 15, 2021 at 7:21 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 12/15/21 14:20, Amit Kapila wrote: >>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> Hi, >>>> >>>> here's an updated version of the patches, dealing with almost all of the >>>> issues (at least in the 0001 and 0002 parts). The main changes: >>>> >>>> 1) I've removed the 'created' flag from fill_seq_with_data, as >>>> discussed. I don't think it's needed by any of the parts (not even 0003, >>>> AFAICS). We still need it in xl_seq_rec, though. >>>> >>>> 2) GetCurrentTransactionId() added to sequence.c are called only with >>>> wal_level=logical, to minimize the overhead. >>>> >>>> >>>> There's still one remaining problem, that I already explained in [1]. >>>> The problem is that with this: >>>> >>>> BEGIN; >>>> SELECT nextval('s') FROM generate_series(1,100); >>>> ROLLBACK; >>>> >>>> >>>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, >>>> which is updated by XLogFlush() - but only in RecordTransactionCommit. >>>> Which makes sense, because only the committed stuff is "visible". >>>> >>>> But the non-transactional behavior of sequence decoding disagrees with >>>> this, because now some of the changes from aborted transactions may be >>>> replicated. Which means the wait_for_catchup() ends up not waiting for >>>> the sequence change to be replicated. This is an issue for tests in >>>> patch 0003, at least. >>>> >>>> My concern is this actually affects other places waiting for things >>>> getting replicated :-/ >>>> >>> >>> By any chance, will this impact synchronous replication as well which >>> waits for commits to be replicated? >>> >> >> Physical or logical replication? >> > > logical replication. > >> Physical is certainly not replicated. >> >> For logical replication, it's more complicated. 
>> >>> How is this patch dealing with prepared transaction case where at >>> prepare we will send transactional changes and then later if rollback >>> prepared happens then the publisher will rollback changes related to >>> new relfilenode but subscriber would have already replayed the updated >>> seqval change which won't be rolled back? >>> >> >> I'm not sure what exact scenario you are describing, but in general we >> don't rollback sequence changes anyway, so this should not cause any >> divergence between the publisher and subscriber. >> >> Or are you talking about something like ALTER SEQUENCE? I think that >> should trigger the transactional behavior for the new relfilenode, so >> the subscriber won't see that changes. >> > > Yeah, something like Alter Sequence and nextval combination. I see > that it will be handled because of the transactional behavior for the > new relfilenode as for applying each sequence change, a new > relfilenode is created. Right. > I think on apply side, the patch always creates a new relfilenode > irrespective of whether the sequence message is transactional or not. > Do we need to do that for the non-transactional messages as well? > Good question. I don't think that's necessary, I'll see if we can simply update the tuple (mostly just like redo). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/15/21 14:51, Tomas Vondra wrote: > On 12/15/21 14:20, Amit Kapila wrote: >> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> Hi, >>> >>> here's an updated version of the patches, dealing with almost all of the >>> issues (at least in the 0001 and 0002 parts). The main changes: >>> >>> 1) I've removed the 'created' flag from fill_seq_with_data, as >>> discussed. I don't think it's needed by any of the parts (not even 0003, >>> AFAICS). We still need it in xl_seq_rec, though. >>> >>> 2) GetCurrentTransactionId() added to sequence.c are called only with >>> wal_level=logical, to minimize the overhead. >>> >>> >>> There's still one remaining problem, that I already explained in [1]. >>> The problem is that with this: >>> >>> BEGIN; >>> SELECT nextval('s') FROM generate_series(1,100); >>> ROLLBACK; >>> >>> >>> The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, >>> which is updated by XLogFlush() - but only in RecordTransactionCommit. >>> Which makes sense, because only the committed stuff is "visible". >>> >>> But the non-transactional behavior of sequence decoding disagrees with >>> this, because now some of the changes from aborted transactions may be >>> replicated. Which means the wait_for_catchup() ends up not waiting for >>> the sequence change to be replicated. This is an issue for tests in >>> patch 0003, at least. >>> >>> My concern is this actually affects other places waiting for things >>> getting replicated :-/ >>> >> >> By any chance, will this impact synchronous replication as well which >> waits for commits to be replicated? >> > > Physical or logical replication? Physical is certainly not replicated. > Actually, I take that back. It does affect physical (sync) replication just as well, and I think it might be considered a data loss issue. I started a new thread to discuss that, so that it's not buried in this thread about logical replication. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Here's a rebased version of the patch series. I decided to go back to the version from 2021/12/14, which does not include the changes to WAL logging. So this has the same issue with nextval() now waiting for WAL flush (and/or sync replica), as demonstrated by occasional failures of the TAP test in 0003, but my reasoning is this: 1) This is a preexisting issue, affecting sequences in general. It's not related to this patch, really. The fix will be independent of this, and there's little reason to block the decoding until that happens. 2) We've discussed a couple ways to fix this in [1] - logging individual sequence increments, flushing everything right away, waiting for page LSN, etc. My opinion is we'll use some form of waiting for page LSN, but no matter what fix we use it'll have almost no impact on this patch. There might be minor changes to the test, but that's about it. 3) There are ways to stabilize the tests even without that - it's enough to generate a little bit of WAL / get XID, not just nextval(). But that's only for 0003 which I don't intend to commit yet, and 0001/0002 have no problems at all. regards [1] https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9@enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 26.01.22 03:16, Tomas Vondra wrote: > Here's a rebased version of the patch series. I decided to go back to > the version from 2021/12/14, which does not include the changes to WAL > logging. I notice that test_decoding still has skip-sequences enabled by default, "for backward compatibility". I think we had concluded in previous discussions that we don't need that. I would remove the option altogether.
I would not remove it altogether; there are plenty of consumers of this extension's output in the wild (even if I think it's unfortunate) that might not be interested in sequences, but changing it to off by default certainly makes sense. -- Petr Jelinek > On 26. 1. 2022, at 11:09, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 26.01.22 03:16, Tomas Vondra wrote: >> Here's a rebased version of the patch series. I decided to go back to the version from 2021/12/14, which does not include the changes to WAL logging. > > I notice that test_decoding still has skip-sequences enabled by default, "for backward compatibility". I think we had concluded in previous discussions that we don't need that. I would remove the option altogether. >
On 1/26/22 14:01, Petr Jelinek wrote: > I would not remove it altogether, there is plenty of consumers of > this extension's output in the wild (even if I think it's > unfortunate) that might not be interested in sequences, but changing > it to off by default certainly makes sense. Indeed. Attached is an updated patch series, with 0003 switching it to false by default (and fixing the fallout). For commit I'll merge that into 0002, of course. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 27.01.22 00:32, Tomas Vondra wrote: > > On 1/26/22 14:01, Petr Jelinek wrote: >> I would not remove it altogether, there is plenty of consumers of this >> extension's output in the wild (even if I think it's >> unfortunate) that might not be interested in sequences, but changing >> it to off by default certainly makes sense. > > Indeed. Attached is an updated patch series, with 0003 switching it to > false by default (and fixing the fallout). For commit I'll merge that > into 0002, of course. (could be done in separate patches too IMO) test_decoding.c uses %zu several times for int64 values, which is not correct. This should use INT64_FORMAT or %lld with a cast to (long long int). I don't know, maybe it's worth commenting somewhere how the expected values in contrib/test_decoding/expected/sequence.out come about. Otherwise, it's quite a puzzle to determine where the 3830 comes from, for example. I think the skip-sequences option is better turned around into a positive name like include-sequences. There is a mix of "skip" and "include" options in test_decoding, but there are more "include" ones right now. In the 0003 patch, a few files have been missed, apparently, so the tests don't fully pass. See attached patch. I haven't looked fully through the 0004 patch, but I notice that there was a confusing mix of FOR ALL SEQUENCE and FOR ALL SEQUENCES. I have corrected that in the other attached patch. Other than the mentioned cosmetic issues, I think 0001-0003 are ready to go.
Attachments
On 1/27/22 17:05, Peter Eisentraut wrote: > On 27.01.22 00:32, Tomas Vondra wrote: >> >> On 1/26/22 14:01, Petr Jelinek wrote: >>> I would not remove it altogether, there is plenty of consumers of >>> this extension's output in the wild (even if I think it's >>> unfortunate) that might not be interested in sequences, but changing >>> it to off by default certainly makes sense. >> >> Indeed. Attached is an updated patch series, with 0003 switching it to >> false by default (and fixing the fallout). For commit I'll merge that >> into 0002, of course. > > (could be done in separate patches too IMO) > > test_decoding.c uses %zu several times for int64 values, which is not > correct. This should use INT64_FORMAT or %lld with a cast to (long long > int). > Good point - INT64_FORMAT seems better. Also, the formatting was not quite right (missing space after the colon), so I fixed that too. > I don't know, maybe it's worth commenting somewhere how the expected > values in contrib/test_decoding/expected/sequence.out come about. > Otherwise, it's quite a puzzle to determine where the 3830 comes from, > for example. > Yeah, that's probably a good idea - I had to think about the expected output repeatedly, so an explanation would help. I'll do that in the next version of the patch. > I think the skip-sequences options is better turned around into a > positive name like include-sequences. There is a mix of "skip" and > "include" options in test_decoding, but there are more "include" ones > right now. > Hmmm. I don't see much difference between skip-sequences and include-sequences, but I don't feel very strongly about it either so I switched that to include-sequences (which defaults to true). > In the 0003, a few files have been missed, apparently, so the tests > don't fully pass. See attached patch. > D'oh! I'd swear I've fixed those too. > I haven't looked fully through the 0004 patch, but I notice that there > was a confusing mix of FOR ALL SEQUENCE and FOR ALL SEQUENCES.
> I have corrected that in the other attached patch. > > Other than the mentioned cosmetic issues, I think 0001-0003 are ready to go. Thanks. I think we'll have time to look at 0004 more closely once the initial parts get committed. Attached is a rebased/squashed version of the patches, with all the fixes discussed here. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
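[Editorial illustration: the include-sequences option discussed above could be exercised roughly like this. This is a hedged sketch against the patched test_decoding from this thread, assuming wal_level=logical; the exact text emitted for sequence changes is defined by the patch and may differ.]

```sql
-- Create a slot using the test_decoding output plugin.
SELECT pg_create_logical_replication_slot('seq_slot', 'test_decoding');

CREATE SEQUENCE s;
SELECT nextval('s');

-- Per the discussion above, 'include-sequences' defaults to true after the
-- rename from skip-sequences; shown explicitly here for clarity:
SELECT data
  FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL,
                                   'include-sequences', '1');

SELECT pg_drop_replication_slot('seq_slot');
```

With the option set to '0', consumers that are not interested in sequences would see the pre-patch output.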
I've polished & pushed the first part adding sequence decoding infrastructure etc. Attached are the two remaining parts. I plan to wait a day or two and then push the test_decoding part. The last part (for built-in replication) will need more work and maybe rethinking the grammar etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 2/10/22 19:17, Tomas Vondra wrote: > I've polished & pushed the first part adding sequence decoding > infrastructure etc. Attached are the two remaining parts. > > I plan to wait a day or two and then push the test_decoding part. The > last part (for built-in replication) will need more work and maybe > rethinking the grammar etc. > I've pushed the second part, adding sequences to test_decoding. Here's the remaining part, rebased, with a small tweak in the TAP test to eliminate the issue with not waiting for sequence increments. I've kept the tweak in a separate patch, so that we can throw it away easily if we happen to resolve the issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 2/12/22 01:34, Tomas Vondra wrote: > On 2/10/22 19:17, Tomas Vondra wrote: >> I've polished & pushed the first part adding sequence decoding >> infrastructure etc. Attached are the two remaining parts. >> >> I plan to wait a day or two and then push the test_decoding part. The >> last part (for built-in replication) will need more work and maybe >> rethinking the grammar etc. >> > > I've pushed the second part, adding sequences to test_decoding. > > Here's the remaining part, rebased, with a small tweak in the TAP test > to eliminate the issue with not waiting for sequence increments. I've > kept the tweak in a separate patch, so that we can throw it away easily > if we happen to resolve the issue. > Hmm, cfbot was not happy about this, so here's a version fixing the elog() format issue reported by CirrusCI/mingw by ditching the log message. It was useful for debugging, but otherwise just noise. I'm a bit puzzled about the macOS failure, though. It seems as if the test does not wait for the subscriber long enough, but this is with the tweaked test variant, so it should not have the rollback issue. And I haven't seen this failure on any other machine. Regarding adding decoding of sequences to the built-in replication, there are a couple of questions that we need to discuss first before cleaning up the code etc. Most of them are related to syntax and handling of various sequence variants. 1) Firstly, what about implicit sequences? That is, if you create a table with a SERIAL or BIGSERIAL column, that'll have a sequence attached. Should those sequences be added to the publication when the table gets added? Or should we require adding them separately? Or should that be specified in the grammar, somehow? Should we have INCLUDING SEQUENCES for ALTER PUBLICATION ... ADD TABLE ...? I think we shouldn't require replicating the sequence, because who knows what the schema is on the subscriber? We want to allow differences, so maybe the sequence is not there.
I'd start with just adding them separately, because that just seems simpler, but maybe there are good reasons to support adding them in ADD TABLE. 2) Should it be possible to add sequences that are also associated with a serial column, without the table being replicated too? I'd say yes, if people want to do that - I don't think it can cause any issues, and it's possible to just use a sequence directly for non-serial columns anyway. Which is the same thing, but we can't detect that. 3) What about sequences for UNLOGGED tables? At the moment we don't allow sequences to be UNLOGGED (Peter revived his patch [1], but that's not committed yet). Again, I'd say it's up to the user to decide which sequences are replicated - it's similar to (2). 4) I wonder if we actually want FOR ALL SEQUENCES. On the one hand it'd be symmetrical with FOR ALL TABLES, which is the other object type we can replicate. So it'd seem reasonable to handle them in a similar way. But it causes a shift/reduce error in the grammar, so it'll need some changes. regards [1] https://www.postgresql.org/message-id/8da92c1f-9117-41bc-731b-ce1477a77d69@enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
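[Editorial illustration: to make question 1 above concrete, here is a hedged sketch of the "add them separately" approach. The ADD SEQUENCE syntax is the grammar proposed by this patch series, not committed PostgreSQL; the table and publication names are assumptions.]

```sql
-- A SERIAL column implicitly creates an owned sequence, here named t_id_seq.
CREATE TABLE t (id serial PRIMARY KEY, val text);

-- Replicating the table does not, under this approach, pull in the sequence:
CREATE PUBLICATION p FOR TABLE t;

-- The owned sequence must be added explicitly, using the patch's syntax:
ALTER PUBLICATION p ADD SEQUENCE t_id_seq;
```

This keeps the decision with the user, which matters when the subscriber's schema differs and the sequence may not exist there at all.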
On 2/12/22 20:58, Tomas Vondra wrote: > On 2/12/22 01:34, Tomas Vondra wrote: >> On 2/10/22 19:17, Tomas Vondra wrote: >>> I've polished & pushed the first part adding sequence decoding >>> infrastructure etc. Attached are the two remaining parts. >>> >>> I plan to wait a day or two and then push the test_decoding part. The >>> last part (for built-in replication) will need more work and maybe >>> rethinking the grammar etc. >>> >> >> I've pushed the second part, adding sequences to test_decoding. >> >> Here's the remaining part, rebased, with a small tweak in the TAP test >> to eliminate the issue with not waiting for sequence increments. I've >> kept the tweak in a separate patch, so that we can throw it away easily >> if we happen to resolve the issue. >> > > Hmm, cfbot was not happy about this, so here's a version fixing the > elog() format issue reported by CirrusCI/mingw by ditching the log > message. It was useful for debugging, but otherwise just noise. > There was another elog() making mingw unhappy, so here's a fix for that. This should also fix an issue on the macOS machine. This is a thinko in the tests, because wait_for_catchup() may not wait for all the sequence increments after a rollback. The default mode is "write" which uses pg_current_wal_lsn(), and that may be a bit stale after a rollback. Doing a simple insert after the rollback fixes this (using other LSN, like pg_current_wal_insert_lsn() would work too, but it'd cause long waits in the test). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
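[Editorial illustration: the wait_for_catchup() thinko and the workaround described above can be sketched in SQL. The sequence s and table t are assumed objects; in the actual TAP test the equivalent statements are issued from Perl.]

```sql
BEGIN;
SELECT nextval('s') FROM generate_series(1, 100);
ROLLBACK;
-- The increments above are WAL-logged (and decoded non-transactionally),
-- but pg_current_wal_lsn() reports the *written* position, which may not
-- have advanced past them after a rollback, so wait_for_catchup() in its
-- default "write" mode can return before the sequence changes replicate.
-- Any small committed write pushes the written LSN past the increments:
INSERT INTO t VALUES (1);
```

Using pg_current_wal_insert_lsn() instead would also cover the rolled-back increments, but as noted above it would cause long waits in the test.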
On 13.02.22 14:10, Tomas Vondra wrote: > Here's the remaining part, rebased, with a small tweak in the TAP test > to eliminate the issue with not waiting for sequence increments. I've > kept the tweak in a separate patch, so that we can throw it away easily > if we happen to resolve the issue. This basically looks fine to me. You have identified a few XXX and FIXME spots that should be addressed. Here are a few more comments: * general Handling of owned sequences, as previously discussed. This would probably be a localized change somewhere in get_rel_sync_entry(), so it doesn't affect the overall structure of the patch. pg_dump support is missing. Some psql \dxxx support should probably be there. Check where existing publication-table relationships are displayed. * src/backend/catalog/system_views.sql + LATERAL pg_get_publication_sequences(P.pubname) GPT, The GPT presumably stood for "get publication tables", so should be changed. (There might be a few more copy-and-paste occasions like this around. I have not written down all of them.) * src/backend/commands/publicationcmds.c This adds a new publication option called "sequence". I would have expected it to be named "sequences", but that's just my opinion. But in any case, the option is not documented at all. Some of the new functions added in this file are almost exact duplicates of the analogous functions for tables. For example, PublicationAddSequences()/PublicationDropSequences() are almost exactly the same as PublicationAddTables()/PublicationDropTables(). This could be handled with less duplication by just adding an ObjectType argument to the existing functions. * src/backend/commands/sequence.c Could use some refactoring of ResetSequence()/ResetSequence2(). Maybe call the latter ResetSequenceData() and have the former call it internally. * src/backend/commands/subscriptioncmds.c Also lots of duplication between tables and sequences in this file. 
* src/backend/replication/logical/tablesync.c The comment says it needs sequence support, but there appears to be support for initial sequence syncing. What exactly is missing here? * src/test/subscription/t/028_sequences.pl Change to use done_testing() (see 549ec201d6132b7c7ee11ee90a4e02119259ba5b).
On 2/15/22 10:00, Peter Eisentraut wrote: > On 13.02.22 14:10, Tomas Vondra wrote: >> Here's the remaining part, rebased, with a small tweak in the TAP test >> to eliminate the issue with not waiting for sequence increments. I've >> kept the tweak in a separate patch, so that we can throw it away easily >> if we happen to resolve the issue. > > This basically looks fine to me. You have identified a few XXX and > FIXME spots that should be addressed. > > Here are a few more comments: > > * general > > Handling of owned sequences, as previously discussed. This would > probably be a localized change somewhere in get_rel_sync_entry(), so it > doesn't affect the overall structure of the patch. > So you're suggesting not to track owned sequences in pg_publication_rel explicitly, and handle them dynamically in the output plugin? So when calling get_rel_sync_entry on the sequence, we'd check if it's owned by a table that is replicated. We'd want a way to enable/disable this for each publication, but that makes the lookups more complicated. Also, we'd probably need the same logic elsewhere (e.g. in psql, when listing sequences in a publication). I'm not sure we want this complexity; maybe we should simply deal with this in ALTER PUBLICATION and track all sequences (owned or not) in pg_publication_rel. > pg_dump support is missing. > Will fix. > Some psql \dxxx support should probably be there. Check where existing > publication-table relationships are displayed. > Yeah, I noticed this while adding regression tests. Currently, \dRp+ shows something like this:

Publication testpub_fortbl
 Owner | All tables | Inserts | Updates ...
--------------------------+------------+---------+--------- ...
 regress_publication_user | f | t | t ...
Tables:
 "pub_test.testpub_nopk"
 "public.testpub_tbl1"

or

Publication testpub5_forschema
 Owner | All tables | Inserts | Updates | ...
--------------------------+------------+---------+---------+- ...
 regress_publication_user | f | t | t | ...
Tables from schemas:
 "CURRENT_SCHEMA"
 "public"

I wonder if we should copy the same block for sequences, so

Publication testpub_fortbl
 Owner | All tables | Inserts | Updates ...
--------------------------+------------+---------+--------- ...
 regress_publication_user | f | t | t ...
Tables:
 "pub_test.testpub_nopk"
 "public.testpub_tbl1"
Sequences:
 "pub_test.sequence_s1"
 "public.sequence_s2"

And same for "tables from schemas" etc. > * src/backend/catalog/system_views.sql > > + LATERAL pg_get_publication_sequences(P.pubname) GPT, > > The GPT presumably stood for "get publication tables", so should be > changed. > > (There might be a few more copy-and-paste occasions like this around. I > have not written down all of them.) > Will fix. > * src/backend/commands/publicationcmds.c > > This adds a new publication option called "sequence". I would have > expected it to be named "sequences", but that's just my opinion. > > But in any case, the option is not documented at all. > > Some of the new functions added in this file are almost exact duplicates > of the analogous functions for tables. For example, > PublicationAddSequences()/PublicationDropSequences() are almost > exactly the same as PublicationAddTables()/PublicationDropTables(). This > could be handled with less duplication by just adding an ObjectType > argument to the existing functions. > Yes, I noticed that too, and I'll review this later, after making sure the behavior is correct. > * src/backend/commands/sequence.c > > Could use some refactoring of ResetSequence()/ResetSequence2(). Maybe > call the latter ResetSequenceData() and have the former call it internally. > Will check. > * src/backend/commands/subscriptioncmds.c > > Also lots of duplication between tables and sequences in this file. > Same as the case above. > * src/backend/replication/logical/tablesync.c > > The comment says it needs sequence support, but there appears to be > support for initial sequence syncing. What exactly is missing here?
> I think that's just obsolete comment. > * src/test/subscription/t/028_sequences.pl > > Change to use done_testing() (see > 549ec201d6132b7c7ee11ee90a4e02119259ba5b). Will fix. Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the same time? That seems like a reasonable thing people might want. The patch probably also needs to modify pg_publication_namespace to track whether the schema is FOR TABLES IN SCHEMA or FOR SEQUENCES. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Feb 19, 2022 at 7:48 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the > same time? That seems like a reasonable thing people might want. > +1. That seems reasonable to me as well. I think the same will apply to FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA. -- With Regards, Amit Kapila.
Hi,

Here's an updated version of the patch, fixing most of the issues from reviews so far. There's still a couple of FIXME comments, but I think those are minor and/or straightforward to deal with.

The main improvements:

1) Elimination of a lot of redundant code - one function handling tables, and an almost exact copy handling sequences. Now a single function handles both, possibly with a "sequences" flag to tweak the behavior.

2) A couple of other functions gained a "sequences" flag, determining which objects are "interesting". For example PublicationAddSchemas needs to know whether it's FOR ALL SEQUENCES or FOR ALL TABLES IN SCHEMA. I don't think we can just use relkind here easily, because for tables we need to handle various types of tables (regular, partitioned, ...).

3) I also renamed a couple of functions with "tables" in the name, which are now used for sequences too. So for example OpenTablesList() is renamed to OpenRelationList() and so on.

4) Addition of a number of regression tests to "publication.sql", which showed a lot of issues, mostly related to not distinguishing tables and sequences when handling "FOR ALL TABLES [IN SCHEMA]" and "FOR ALL SEQUENCES [IN SCHEMA]".

5) Proper tracking of FOR ALL [TABLES|SEQUENCES] IN SCHEMA in a catalog. The pg_publication_namespace catalog gained a pnsequences flag, which determines which case it is. So for example if you do

    ALTER PUBLICATION p ADD ALL TABLES IN SCHEMA s;
    ALTER PUBLICATION p ADD ALL SEQUENCES IN SCHEMA s;

there will be two rows in the catalog, one with 't' and one with 'f' in the new column.

I'm not sure this is the best way to track this - maybe it'd be better to have two flags and keep a single row. Or maybe we should have an array of relkinds (but that has the issue with tables having multiple relkinds mentioned before). Multiple rows make it more convenient to add/remove publication schemas - with a single row it'd be necessary to either insert a new row or update an existing one when adding the schema, and similarly for dropping it. But maybe there are reasons / precedent to design this differently ...

6) I'm not entirely sure the object_address changes (handling of the pnsequences flag) are correct.

7) This updates psql to do roughly the same thing as for tables, so \dRp now lists sequences added either directly or through a schema, so you might get a footer like this:

    \dRp+ testpub_mix
    ...
    Tables:
        "public.testpub_tbl1"
    Tables from schemas:
        "pub_test"
    Sequences:
        "public.testpub_seq1"
    Sequences from schemas:
        "pub_test"

Maybe it's a bit too verbose, though. It also adds "All sequences" and "Sequences" columns into the publication description, but I don't think that can be done much differently.

FWIW I had to switch the describeOneTableDetails() chunk dealing with sequences from printQuery() to printTable() in order to handle dynamic footers.

There's also a change in \dn+ because a schema may be included in one publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication with "FOR ALL TABLES IN SCHEMA". So I modified the output to

    \dn+ pub_test1
    ...
    Publications:
        "testpub_schemas" (sequences)
        "testpub_schemas" (tables)

But maybe it'd be better to aggregate this into a single line like

    \dn+ pub_test1
    ...
    Publications:
        "testpub_schemas" (tables, sequences)

Opinions?

8) There's a shortcoming in the current grammar, because you can specify either

    CREATE PUBLICATION p FOR ALL TABLES;

or

    CREATE PUBLICATION p FOR ALL SEQUENCES;

but it's not possible to do

    CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES;

which seems like a fairly reasonable thing users might want to do. The problem is that "FOR ALL TABLES" (and the same for sequences) is hard-coded in the grammar, not defined as a PublicationObjSpec. This also means you can't do

    ALTER PUBLICATION p ADD ALL TABLES;

AFAICS there are two ways to fix this - adding the combinations into the definition of CreatePublicationStmt, or adding FOR ALL TABLES (and sequences) to PublicationObjSpec.

9) Another grammar limitation is that we don't cross-check the relkind, so for example

    ALTER PUBLICATION p ADD TABLE sequence;

might actually work. Should be easy to fix, though.

10) Added pg_dump support (including tests). I'll add more tests, to check more grammar combinations.

11) I need to test more grammar combinations in the TAP test too, to verify the output plugin interprets the stuff correctly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
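To illustrate the trade-off in item (5): with one catalog row per (schema, pnsequences) combination, adding or dropping ALL TABLES IN SCHEMA or ALL SEQUENCES IN SCHEMA maps to a plain insert or delete. Here is a toy Python model of that design - an illustration only, not the actual catalog code, and the names are made up:

```python
# Toy model of the pg_publication_namespace design discussed above.
# The catalog is modeled as a set of (schema, pnsequences) rows.

class PublicationNamespaces:
    def __init__(self):
        self.rows = set()  # {(schema_name, pnsequences_flag)}

    def add(self, schema, sequences):
        # With one row per (schema, flag), adding is a plain insert;
        # a single-row-with-two-flags design would need
        # insert-or-update logic here instead.
        self.rows.add((schema, sequences))

    def drop(self, schema, sequences):
        # Likewise, dropping is a plain delete, not an update.
        self.rows.discard((schema, sequences))

pub = PublicationNamespaces()
pub.add("s", sequences=False)  # ALTER PUBLICATION p ADD ALL TABLES IN SCHEMA s
pub.add("s", sequences=True)   # ALTER PUBLICATION p ADD ALL SEQUENCES IN SCHEMA s
assert pub.rows == {("s", False), ("s", True)}  # two catalog rows, 'f' and 't'

pub.drop("s", sequences=True)  # drop only the ALL SEQUENCES entry
assert pub.rows == {("s", False)}  # the "tables" row is untouched
```

With the alternative single-row design, the same drop would have to update the remaining flag and then delete the row only once both flags are cleared.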
Attachments
On 2/19/22 06:33, Amit Kapila wrote: > On Sat, Feb 19, 2022 at 7:48 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Do we need to handle both FOR ALL TABLES and FOR ALL SEQUENCES at the >> same time? That seems like a reasonable thing people might want. >> > > +1. That seems reasonable to me as well. I think the same will apply > to FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA. > It already works for "IN SCHEMA" because that's handled as a publication object, but FOR ALL TABLES and FOR ALL SEQUENCES are defined directly in CreatePublicationStmt. Which also means you can't do ALTER PUBLICATION and change it to FOR ALL TABLES. Which is a bit annoying, but OK. It's a bit weird FOR ALL TABLES is mentioned in docs for ALTER PUBLICATION as if it was supported. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 23, 2022 at 4:54 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > 7) This updates psql to do roughly the same thing as for tables, so \dRp > now list sequences added either directly or through schema, so you might > get footer like this: > > \dRp+ testpub_mix > ... > Tables: > "public.testpub_tbl1" > Tables from schemas: > "pub_test" > Sequences: > "public.testpub_seq1" > Sequences from schemas: > "pub_test" > > Maybe it's a bit too verbose, though. It also addes "All sequences" and > "Sequences" columns into the publication description, but I don't think > that can be done much differently. > > FWIW I had to switch the describeOneTableDetails() chunk dealing with > sequences from printQuery() to printTable() in order to handle dynamic > footers. > > There's also a change in \dn+ because a schema may be included in one > publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication > with "FOR ALL TABLES IN SCHEMA". So I modified the output to > > \dn+ pub_test1 > ... > Publications: > "testpub_schemas" (sequences) > "testpub_schemas" (tables) > > But maybe it'd be better to aggregate this into a single line like > > \dn+ pub_test1 > ... > Publications: > "testpub_schemas" (tables, sequences) > > Opinions? > I think the second one (aggregated) might be slightly better as that will lead to a lesser number of lines when there are multiple such publications but it should be okay if you and others prefer first. > 8) There's a shortcoming in the current grammar, because you can specify > either > > CREATE PUBLICATION p FOR ALL TABLES; > > or > > CREATE PUBLICATION p FOR ALL SEQUENCES; > > but it's not possible to do > > CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES; > > which seems like a fairly reasonable thing users might want to do. > Isn't it better to support this with a syntax as indicated by Tom in one of his earlier emails on this topic [1]? 
IIUC, it would be as follows: CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES; > The problem is that "FOR ALL TABLES" (and same for sequences) is > hard-coded in the grammar, not defined as PublicationObjSpec. This also > means you can't do > > ALTER PUBLICATION p ADD ALL TABLES; > > AFAICS there are two ways to fix this - adding the combinations into the > definition of CreatePublicationStmt, or adding FOR ALL TABLES (and > sequences) to PublicationObjSpec. > I can imagine that adding to PublicationObjSpec will look compatible with existing code but maybe another way will also be okay. [1] - https://www.postgresql.org/message-id/877603.1629120678%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On 2/23/22 12:10, Amit Kapila wrote: > On Wed, Feb 23, 2022 at 4:54 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> 7) This updates psql to do roughly the same thing as for tables, so \dRp >> now list sequences added either directly or through schema, so you might >> get footer like this: >> >> \dRp+ testpub_mix >> ... >> Tables: >> "public.testpub_tbl1" >> Tables from schemas: >> "pub_test" >> Sequences: >> "public.testpub_seq1" >> Sequences from schemas: >> "pub_test" >> >> Maybe it's a bit too verbose, though. It also addes "All sequences" and >> "Sequences" columns into the publication description, but I don't think >> that can be done much differently. >> >> FWIW I had to switch the describeOneTableDetails() chunk dealing with >> sequences from printQuery() to printTable() in order to handle dynamic >> footers. >> >> There's also a change in \dn+ because a schema may be included in one >> publication as "FOR ALL SEQUENCES IN SCHEMA" and in another publication >> with "FOR ALL TABLES IN SCHEMA". So I modified the output to >> >> \dn+ pub_test1 >> ... >> Publications: >> "testpub_schemas" (sequences) >> "testpub_schemas" (tables) >> >> But maybe it'd be better to aggregate this into a single line like >> >> \dn+ pub_test1 >> ... >> Publications: >> "testpub_schemas" (tables, sequences) >> >> Opinions? >> > > I think the second one (aggregated) might be slightly better as that > will lead to a lesser number of lines when there are multiple such > publications but it should be okay if you and others prefer first. > Maybe, but I don't think it's very common to have that many schemas added to the same publication. And it probably does not make much difference whether you have 1000 or 2000 items in the list - either both are acceptable or unacceptable, I think. But I plan to look at this a bit more. 
>> 8) There's a shortcoming in the current grammar, because you can specify >> either >> >> CREATE PUBLICATION p FOR ALL TABLES; >> >> or >> >> CREATE PUBLICATION p FOR ALL SEQUENCES; >> >> but it's not possible to do >> >> CREATE PUBLICATION p FOR ALL TABLES AND FOR ALL SEQUENCES; >> >> which seems like a fairly reasonable thing users might want to do. >> > > Isn't it better to support this with a syntax as indicated by Tom in > one of his earlier emails on this topic [1]? IIUC, it would be as > follows: > > CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES; > Yes. That's mostly what I meant by adding this to PublicationObjSpec. >> The problem is that "FOR ALL TABLES" (and same for sequences) is >> hard-coded in the grammar, not defined as PublicationObjSpec. This also >> means you can't do >> >> ALTER PUBLICATION p ADD ALL TABLES; >> >> AFAICS there are two ways to fix this - adding the combinations into the >> definition of CreatePublicationStmt, or adding FOR ALL TABLES (and >> sequences) to PublicationObjSpec. >> > > I can imagine that adding to PublicationObjSpec will look compatible > with existing code but maybe another way will also be okay. > I think just hard-coding this into CreatePublicationStmt would work, but it'll be an issue once/if we start adding more options. I'm not sure if it makes sense to replicate other relkinds, but maybe DDL? I'll try tweaking PublicationObjSpec, and we'll see. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote:
> Maybe, but I don't think it's very common to have that many schemas
> added to the same publication. And it probably does not make much
> difference whether you have 1000 or 2000 items in the list - either
> both are acceptable or unacceptable, I think.
Wouldn't it confuse users? Hey, duplicate publication. How? Wait. Doh.
> I think just hard-coding this into CreatePublicationStmt would work, but
> it'll be an issue once/if we start adding more options. I'm not sure if
> it makes sense to replicate other relkinds, but maybe DDL?
Materialized view? As you mentioned DDL, maybe we can use the CREATE
PUBLICATION syntax to select which DDL commands we want to replicate.
On 2/23/22 18:33, Euler Taveira wrote: > On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote: >> Maybe, but I don't think it's very common to have that many >> schemas added to the same publication. And it probably does not >> make much difference whether you have 1000 or 2000 items in the >> list - either both are acceptable or unacceptable, I think. > > Wouldn't it confuse users? Hey, duplicate publication. How? Wait. > Doh. > I don't follow. Duplicate publications? This talks about rows in pg_publication_namespace, not pg_publication. >> I think just hard-coding this into CreatePublicationStmt would >> work, but it'll be an issue once/if we start adding more options. >> I'm not sure if it makes sense to replicate other relkinds, but >> maybe DDL? > > Materialized view? As you mentioned DDL, maybe we can use the CREATE > PUBLICATION syntax to select which DDL commands we want to > replicate. > Well, yeah. But that doesn't really say whether to hard-code it into the CREATE PUBLICATION syntax or cover it by PublicationObjSpec. Hard-coding it also means we can't ALTER the publication to be FOR ALL TABLES etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 23, 2022, at 4:18 PM, Tomas Vondra wrote:
On 2/23/22 18:33, Euler Taveira wrote:
> On Wed, Feb 23, 2022, at 1:07 PM, Tomas Vondra wrote:
>> Maybe, but I don't think it's very common to have that many
>> schemas added to the same publication. And it probably does not
>> make much difference whether you have 1000 or 2000 items in the
>> list - either both are acceptable or unacceptable, I think.
>
> Wouldn't it confuse users? Hey, duplicate publication. How? Wait.
> Doh.

I don't follow. Duplicate publications? This talks about rows in pg_publication_namespace, not pg_publication.
I was referring to
Publications:
"testpub_schemas" (sequences)
"testpub_schemas" (tables)
versus
Publications:
"testpub_schemas" (tables, sequences)
I prefer the latter.
On 23.02.22 12:10, Amit Kapila wrote:
> Isn't it better to support this with a syntax as indicated by Tom in
> one of his earlier emails on this topic [1]? IIUC, it would be as
> follows:
>
> CREATE PUBLICATION p FOR ALL TABLES, ALL SEQUENCES;

I don't think there is any point in supporting this. What FOR ALL TABLES was really supposed to mean was "everything you can get your hands on". I think we should just redefine FOR ALL TABLES to mean that, or maybe replace it with a different syntax. If you want to exclude sequences for some reason, there is already a publication option for that. And FOR ALL SEQUENCES by itself doesn't make any sense in practice.

Are there any other object types besides tables and sequences that we might want to logically replicate in the future, and whose possible syntax we should think about? I can't think of anything.
On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 2/10/22 19:17, Tomas Vondra wrote:
> > I've polished & pushed the first part adding sequence decoding
> > infrastructure etc. Attached are the two remaining parts.
> >
> > I plan to wait a day or two and then push the test_decoding part. The
> > last part (for built-in replication) will need more work and maybe
> > rethinking the grammar etc.
> >
>
> I've pushed the second part, adding sequences to test_decoding.
>

test_decoding has been failing randomly in the last few days. I am not completely sure but the failures might be related to this work. Two of them appear to be due to the same reason:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07

TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, PID: 35013)
0   postgres   0x00593de0 ExceptionalCondition + 160\\0

Another:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48

--- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out	2022-02-14 20:19:14.000000000 +0000
+++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out	2022-02-16 07:42:18.000000000 +0000
@@ -126,6 +126,7 @@
 table public.replication_example: INSERT: id[integer]:4 somedata[integer]:3 text[character varying]:null testcolumn1[integer]:null
 table public.replication_example: INSERT: id[integer]:5 somedata[integer]:4 text[character varying]:null testcolumn1[integer]:2 testcolumn2[integer]:1
 COMMIT
+ sequence public.replication_example_id_seq: transactional:0 last_value: 38 log_cnt: 0 is_called:1
 BEGIN
 table public.replication_example: INSERT: id[integer]:6 somedata[integer]:5 text[character varying]:null testcolumn1[integer]:3 testcolumn2[integer]:null
 COMMIT
@@ -133,7 +134,7 @@
 table public.replication_example: INSERT: id[integer]:7 somedata[integer]:6 text[character varying]:null testcolumn1[integer]:4 testcolumn2[integer]:null
 table public.replication_example: INSERT: id[integer]:8 somedata[integer]:7 text[character varying]:null testcolumn1[integer]:5 testcolumn2[integer]:null testcolumn3[integer]:1
 COMMIT
-(15 rows)
+(16 rows)

--
With Regards,
Amit Kapila.
On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 2/10/22 19:17, Tomas Vondra wrote: > > > I've polished & pushed the first part adding sequence decoding > > > infrastructure etc. Attached are the two remaining parts. > > > > > > I plan to wait a day or two and then push the test_decoding part. The > > > last part (for built-in replication) will need more work and maybe > > > rethinking the grammar etc. > > > > > > > I've pushed the second part, adding sequences to test_decoding. > > > > The test_decoding is failing randomly in the last few days. I am not > completely sure but they might be related to this work. The two of > these appears to be due to the same reason: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 > > TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: > "reorderbuffer.c", Line: 1173, PID: 35013) > 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 > While reviewing the code for this, I noticed that in sequence_decode(), we don't call ReorderBufferProcessXid to register the first known lsn in WAL for the current xid. The similar functions logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid even if they decide not to queue or send the change. Is there a reason for not doing the same here? However, I am not able to deduce any scenario where lack of this will lead to such an Assertion failure. Any thoughts? -- With Regards, Amit Kapila.
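For context, the assertion in question checks that transactions in the reorder buffer stay ordered by the first LSN observed for each xid, and registering that first LSN is exactly what ReorderBufferProcessXid does. A toy sketch of the invariant - Python for illustration only; the real code is C in reorderbuffer.c:

```python
# Toy model of ReorderBufferProcessXid's role: each xid must be
# registered at the *first* WAL record seen for it, even when that
# record itself is not queued (as heap_decode/logicalmsg_decode do).

class ReorderBuffer:
    def __init__(self):
        self.first_lsn = {}  # xid -> first LSN observed for that xid

    def process_xid(self, xid, lsn):
        # Remember the LSN only on first sight of this xid.
        self.first_lsn.setdefault(xid, lsn)

    def check_ordering(self, xids):
        # Mimics the "prev_first_lsn < cur_txn->first_lsn" assertion:
        # transactions listed in commit order must have strictly
        # increasing first LSNs.
        lsns = [self.first_lsn[x] for x in xids]
        assert all(a < b for a, b in zip(lsns, lsns[1:]))

rb = ReorderBuffer()
rb.process_xid(100, lsn=10)    # xid 100 first seen at LSN 10
rb.process_xid(101, lsn=12)    # xid 101 first seen at LSN 12
rb.process_xid(100, lsn=15)    # later record for xid 100: no effect
rb.check_ordering([100, 101])  # 10 < 12, the assertion holds
```

If a decode routine skips registration, the xid's first LSN is recorded later than it should be, which is the kind of misordering the assertion is meant to catch.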
On Tue, Mar 1, 2022 at 10:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > > > > > > On 2/10/22 19:17, Tomas Vondra wrote: > > > > I've polished & pushed the first part adding sequence decoding > > > > infrastructure etc. Attached are the two remaining parts. > > > > > > > > I plan to wait a day or two and then push the test_decoding part. The > > > > last part (for built-in replication) will need more work and maybe > > > > rethinking the grammar etc. > > > > > > > > > > I've pushed the second part, adding sequences to test_decoding. > > > > > > > The test_decoding is failing randomly in the last few days. I am not > > completely sure but they might be related to this work. The two of > > these appears to be due to the same reason: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 > > > > TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: > > "reorderbuffer.c", Line: 1173, PID: 35013) > > 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 > > FYI, it looks like the same assertion has failed again on the same build-farm machine [1] > > While reviewing the code for this, I noticed that in > sequence_decode(), we don't call ReorderBufferProcessXid to register > the first known lsn in WAL for the current xid. The similar functions > logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid > even if they decide not to queue or send the change. Is there a reason > for not doing the same here? However, I am not able to deduce any > scenario where lack of this will lead to such an Assertion failure. > Any thoughts? 
> ------ [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-03%2023%3A14%3A26 Kind Regards, Peter Smith. Fujitsu Australia
On 3/1/22 12:53, Amit Kapila wrote: > On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> On 2/10/22 19:17, Tomas Vondra wrote: >>>> I've polished & pushed the first part adding sequence decoding >>>> infrastructure etc. Attached are the two remaining parts. >>>> >>>> I plan to wait a day or two and then push the test_decoding part. The >>>> last part (for built-in replication) will need more work and maybe >>>> rethinking the grammar etc. >>>> >>> >>> I've pushed the second part, adding sequences to test_decoding. >>> >> >> The test_decoding is failing randomly in the last few days. I am not >> completely sure but they might be related to this work. The two of >> these appears to be due to the same reason: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 >> >> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: >> "reorderbuffer.c", Line: 1173, PID: 35013) >> 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 >> > > While reviewing the code for this, I noticed that in > sequence_decode(), we don't call ReorderBufferProcessXid to register > the first known lsn in WAL for the current xid. The similar functions > logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid > even if they decide not to queue or send the change. Is there a reason > for not doing the same here? However, I am not able to deduce any > scenario where lack of this will lead to such an Assertion failure. > Any thoughts? > Thanks, that seems like an omission. Will fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/28/22 12:46, Amit Kapila wrote: > On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 2/10/22 19:17, Tomas Vondra wrote: >>> I've polished & pushed the first part adding sequence decoding >>> infrastructure etc. Attached are the two remaining parts. >>> >>> I plan to wait a day or two and then push the test_decoding part. The >>> last part (for built-in replication) will need more work and maybe >>> rethinking the grammar etc. >>> >> >> I've pushed the second part, adding sequences to test_decoding. >> > > The test_decoding is failing randomly in the last few days. I am not > completely sure but they might be related to this work. The two of > these appears to be due to the same reason: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 > > TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: > "reorderbuffer.c", Line: 1173, PID: 35013) > 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 > This might be related to the issue reported by Amit, i.e. that sequence_decode does not call ReorderBufferProcessXid(). If this keeps failing, we'll have to add some extra debug info (logging LSN etc.), at least temporarily. It'd be valuable to inspect the WAL too. 
> Another: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48 > > --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out > 2022-02-14 20:19:14.000000000 +0000 > +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out > 2022-02-16 07:42:18.000000000 +0000 > @@ -126,6 +126,7 @@ > table public.replication_example: INSERT: id[integer]:4 > somedata[integer]:3 text[character varying]:null > testcolumn1[integer]:null > table public.replication_example: INSERT: id[integer]:5 > somedata[integer]:4 text[character varying]:null > testcolumn1[integer]:2 testcolumn2[integer]:1 > COMMIT > + sequence public.replication_example_id_seq: transactional:0 > last_value: 38 log_cnt: 0 is_called:1 > BEGIN > table public.replication_example: INSERT: id[integer]:6 > somedata[integer]:5 text[character varying]:null > testcolumn1[integer]:3 testcolumn2[integer]:null > COMMIT > @@ -133,7 +134,7 @@ > table public.replication_example: INSERT: id[integer]:7 > somedata[integer]:6 text[character varying]:null > testcolumn1[integer]:4 testcolumn2[integer]:null > table public.replication_example: INSERT: id[integer]:8 > somedata[integer]:7 text[character varying]:null > testcolumn1[integer]:5 testcolumn2[integer]:null > testcolumn3[integer]:1 > COMMIT > - (15 rows) > + (16 rows) > Interesting. I can think of one reason that might cause this - we log the first sequence increment after a checkpoint. So if a checkpoint happens in an unfortunate place, there'll be an extra WAL record. On slow / busy machines that's quite possible, I guess. I wonder if these two issues might be related ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
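The checkpoint hunch can be made concrete: a sequence normally WAL-logs only every SEQ_LOG_VALS (32) increments, but the first increment after a checkpoint is always logged again, so checkpoint timing changes how many sequence records end up in the decoded output. A simplified Python model - an illustration only, not the actual sequence.c logic:

```python
# Toy model of why a concurrent checkpoint can add an extra decoded
# sequence record. A nextval() emits a WAL record only when the
# pre-logged allowance (log_cnt) is exhausted, but a checkpoint
# invalidates the allowance, forcing the next increment to log.

SEQ_LOG_VALS = 32  # values pre-logged ahead per WAL record

def wal_records(n_nextvals, checkpoint_after=None):
    records, log_cnt = 0, 0
    for i in range(1, n_nextvals + 1):
        if log_cnt == 0:
            records += 1           # emit WAL record, pre-log ahead
            log_cnt = SEQ_LOG_VALS
        log_cnt -= 1
        if checkpoint_after == i:
            log_cnt = 0            # checkpoint: must re-log next time
    return records

assert wal_records(40) == 2                      # logs at increments 1 and 33
assert wal_records(40, checkpoint_after=5) == 3  # checkpoint adds a record
```

So the same test workload can decode to 15 or 16 rows depending on where a background checkpoint lands, which matches the mandrill diff above.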
On 3/7/22 17:39, Tomas Vondra wrote: > > > On 3/1/22 12:53, Amit Kapila wrote: >> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >>> >>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 2/10/22 19:17, Tomas Vondra wrote: >>>>> I've polished & pushed the first part adding sequence decoding >>>>> infrastructure etc. Attached are the two remaining parts. >>>>> >>>>> I plan to wait a day or two and then push the test_decoding part. The >>>>> last part (for built-in replication) will need more work and maybe >>>>> rethinking the grammar etc. >>>>> >>>> >>>> I've pushed the second part, adding sequences to test_decoding. >>>> >>> >>> The test_decoding is failing randomly in the last few days. I am not >>> completely sure but they might be related to this work. The two of >>> these appears to be due to the same reason: >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 >>> >>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: >>> "reorderbuffer.c", Line: 1173, PID: 35013) >>> 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 >>> >> >> While reviewing the code for this, I noticed that in >> sequence_decode(), we don't call ReorderBufferProcessXid to register >> the first known lsn in WAL for the current xid. The similar functions >> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid >> even if they decide not to queue or send the change. Is there a reason >> for not doing the same here? However, I am not able to deduce any >> scenario where lack of this will lead to such an Assertion failure. >> Any thoughts? >> > > Thanks, that seems like an omission. Will fix. > I've pushed this simple fix. Not sure it'll fix the assert failures on skink/locust, though. Given the lack of information it'll be difficult to verify. 
So let's wait a bit. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/7/22 17:53, Tomas Vondra wrote: > On 2/28/22 12:46, Amit Kapila wrote: >> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> On 2/10/22 19:17, Tomas Vondra wrote: >>>> I've polished & pushed the first part adding sequence decoding >>>> infrastructure etc. Attached are the two remaining parts. >>>> >>>> I plan to wait a day or two and then push the test_decoding part. The >>>> last part (for built-in replication) will need more work and maybe >>>> rethinking the grammar etc. >>>> >>> >>> I've pushed the second part, adding sequences to test_decoding. >>> >> >> The test_decoding is failing randomly in the last few days. I am not >> completely sure but they might be related to this work. The two of >> these appears to be due to the same reason: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 >> >> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: >> "reorderbuffer.c", Line: 1173, PID: 35013) >> 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 >> > > This might be related to the issue reported by Amit, i.e. that > sequence_decode does not call ReorderBufferProcessXid(). If this keeps > failing, we'll have to add some extra debug info (logging LSN etc.), at > least temporarily. It'd be valuable to inspect the WAL too. 
> >> Another: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48 >> >> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out >> 2022-02-14 20:19:14.000000000 +0000 >> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out >> 2022-02-16 07:42:18.000000000 +0000 >> @@ -126,6 +126,7 @@ >> table public.replication_example: INSERT: id[integer]:4 >> somedata[integer]:3 text[character varying]:null >> testcolumn1[integer]:null >> table public.replication_example: INSERT: id[integer]:5 >> somedata[integer]:4 text[character varying]:null >> testcolumn1[integer]:2 testcolumn2[integer]:1 >> COMMIT >> + sequence public.replication_example_id_seq: transactional:0 >> last_value: 38 log_cnt: 0 is_called:1 >> BEGIN >> table public.replication_example: INSERT: id[integer]:6 >> somedata[integer]:5 text[character varying]:null >> testcolumn1[integer]:3 testcolumn2[integer]:null >> COMMIT >> @@ -133,7 +134,7 @@ >> table public.replication_example: INSERT: id[integer]:7 >> somedata[integer]:6 text[character varying]:null >> testcolumn1[integer]:4 testcolumn2[integer]:null >> table public.replication_example: INSERT: id[integer]:8 >> somedata[integer]:7 text[character varying]:null >> testcolumn1[integer]:5 testcolumn2[integer]:null >> testcolumn3[integer]:1 >> COMMIT >> - (15 rows) >> + (16 rows) >> > > Interesting. I can think of one reason that might cause this - we log > the first sequence increment after a checkpoint. So if a checkpoint > happens in an unfortunate place, there'll be an extra WAL record. On > slow / busy machines that's quite possible, I guess. > I've tweaked the checkpoint_interval to make checkpoints more aggressive (set it to 1s), and it seems my hunch was correct - it produces failures exactly like this one. The best fix probably is to just disable decoding of sequences in those tests that are not aimed at testing sequence decoding. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/7/22 22:25, Tomas Vondra wrote: > > > On 3/7/22 17:53, Tomas Vondra wrote: >> On 2/28/22 12:46, Amit Kapila wrote: >>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> On 2/10/22 19:17, Tomas Vondra wrote: >>>>> I've polished & pushed the first part adding sequence decoding >>>>> infrastructure etc. Attached are the two remaining parts. >>>>> >>>>> I plan to wait a day or two and then push the test_decoding part. The >>>>> last part (for built-in replication) will need more work and maybe >>>>> rethinking the grammar etc. >>>>> >>>> >>>> I've pushed the second part, adding sequences to test_decoding. >>>> >>> >>> The test_decoding is failing randomly in the last few days. I am not >>> completely sure but they might be related to this work. The two of >>> these appears to be due to the same reason: >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 >>> >>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: >>> "reorderbuffer.c", Line: 1173, PID: 35013) >>> 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 >>> >> >> This might be related to the issue reported by Amit, i.e. that >> sequence_decode does not call ReorderBufferProcessXid(). If this keeps >> failing, we'll have to add some extra debug info (logging LSN etc.), at >> least temporarily. It'd be valuable to inspect the WAL too. 
>> >>> Another: >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2022-02-16%2006%3A21%3A48 >>> >>> --- /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/expected/rewrite.out >>> 2022-02-14 20:19:14.000000000 +0000 >>> +++ /home/nm/farm/xlc32/HEAD/pgsql.build/contrib/test_decoding/results/rewrite.out >>> 2022-02-16 07:42:18.000000000 +0000 >>> @@ -126,6 +126,7 @@ >>> table public.replication_example: INSERT: id[integer]:4 >>> somedata[integer]:3 text[character varying]:null >>> testcolumn1[integer]:null >>> table public.replication_example: INSERT: id[integer]:5 >>> somedata[integer]:4 text[character varying]:null >>> testcolumn1[integer]:2 testcolumn2[integer]:1 >>> COMMIT >>> + sequence public.replication_example_id_seq: transactional:0 >>> last_value: 38 log_cnt: 0 is_called:1 >>> BEGIN >>> table public.replication_example: INSERT: id[integer]:6 >>> somedata[integer]:5 text[character varying]:null >>> testcolumn1[integer]:3 testcolumn2[integer]:null >>> COMMIT >>> @@ -133,7 +134,7 @@ >>> table public.replication_example: INSERT: id[integer]:7 >>> somedata[integer]:6 text[character varying]:null >>> testcolumn1[integer]:4 testcolumn2[integer]:null >>> table public.replication_example: INSERT: id[integer]:8 >>> somedata[integer]:7 text[character varying]:null >>> testcolumn1[integer]:5 testcolumn2[integer]:null >>> testcolumn3[integer]:1 >>> COMMIT >>> - (15 rows) >>> + (16 rows) >>> >> >> Interesting. I can think of one reason that might cause this - we log >> the first sequence increment after a checkpoint. So if a checkpoint >> happens in an unfortunate place, there'll be an extra WAL record. On >> slow / busy machines that's quite possible, I guess. >> > > I've tweaked the checkpoint_interval to make checkpoints more aggressive > (set it to 1s), and it seems my hunch was correct - it produces failures > exactly like this one. 
The best fix probably is to just disable decoding > of sequences in those tests that are not aimed at testing sequence decoding. > I've pushed a fix for this, adding "include-sequences=0" to a couple test_decoding tests, which were failing with concurrent checkpoints. Unfortunately, I realized we have a similar issue in the "sequences" tests too :-( Imagine you do a series of sequence increments, e.g. SELECT nextval('s') FROM generate_series(1,100); If there's a concurrent checkpoint, this may add an extra WAL record, affecting the decoded output (and also the data stored in the sequence relation itself). Not sure what to do about this ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
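For context, a minimal sketch of how sequence decoding shows up in test_decoding, and of the `include-sequences` option the fix above adds (slot and sequence names here are made up; the output line mirrors the format shown in the rewrite.out diff earlier in the thread, with an illustrative value):

```sql
-- decode sequence increments with test_decoding
SELECT pg_create_logical_replication_slot('seq_slot', 'test_decoding');
CREATE SEQUENCE s;
SELECT nextval('s');

-- with sequence decoding enabled, the slot emits a line like:
--   sequence public.s: transactional:0 last_value: 33 log_cnt: 0 is_called:1
SELECT data FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL,
       'include-sequences', '1');

-- tests not aimed at sequences can opt out, avoiding the nondeterministic
-- extra record logged for the first increment after a checkpoint
SELECT data FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL,
       'include-sequences', '0');
```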
On 3/7/22 22:11, Tomas Vondra wrote: > > > On 3/7/22 17:39, Tomas Vondra wrote: >> >> >> On 3/1/22 12:53, Amit Kapila wrote: >>> On Mon, Feb 28, 2022 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> On Sat, Feb 12, 2022 at 6:04 AM Tomas Vondra >>>> <tomas.vondra@enterprisedb.com> wrote: >>>>> >>>>> On 2/10/22 19:17, Tomas Vondra wrote: >>>>>> I've polished & pushed the first part adding sequence decoding >>>>>> infrastructure etc. Attached are the two remaining parts. >>>>>> >>>>>> I plan to wait a day or two and then push the test_decoding part. The >>>>>> last part (for built-in replication) will need more work and maybe >>>>>> rethinking the grammar etc. >>>>>> >>>>> >>>>> I've pushed the second part, adding sequences to test_decoding. >>>>> >>>> >>>> The test_decoding is failing randomly in the last few days. I am not >>>> completely sure but they might be related to this work. The two of >>>> these appears to be due to the same reason: >>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 >>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-17%2015%3A17%3A07 >>>> >>>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: >>>> "reorderbuffer.c", Line: 1173, PID: 35013) >>>> 0 postgres 0x00593de0 ExceptionalCondition + 160\\0 >>>> >>> >>> While reviewing the code for this, I noticed that in >>> sequence_decode(), we don't call ReorderBufferProcessXid to register >>> the first known lsn in WAL for the current xid. The similar functions >>> logicalmsg_decode() or heap_decode() do call ReorderBufferProcessXid >>> even if they decide not to queue or send the change. Is there a reason >>> for not doing the same here? However, I am not able to deduce any >>> scenario where lack of this will lead to such an Assertion failure. >>> Any thoughts? >>> >> >> Thanks, that seems like an omission. Will fix. >> > > I've pushed this simple fix. 
Not sure it'll fix the assert failures on > skink/locust, though. Given the lack of information it'll be difficult > to verify. So let's wait a bit. > I've done about 5000 runs of 'make check' in test_decoding, on two rpi machines (one armv7, one aarch64). Not a single assert failure :-( How come skink/locust hit that in just a couple runs? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 9, 2022 at 4:14 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/7/22 22:11, Tomas Vondra wrote: > > > > I've pushed this simple fix. Not sure it'll fix the assert failures on > > skink/locust, though. Given the lack of information it'll be difficult > > to verify. So let's wait a bit. > > > > I've done about 5000 runs of 'make check' in test_decoding, on two rpi > machines (one armv7, one aarch64). Not a single assert failure :-( > > How come skink/locust hit that in just a couple runs? > Is it failed after you pushed a fix? I don't think so or am I missing something? I feel even if doesn't occur again it would have been better if we had some theory on how it occurred in the first place because that would make us feel more confident that we won't have any related problem left. -- With Regards, Amit Kapila.
On 3/9/22 12:41, Amit Kapila wrote: > On Wed, Mar 9, 2022 at 4:14 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/7/22 22:11, Tomas Vondra wrote: >>> >>> I've pushed this simple fix. Not sure it'll fix the assert failures on >>> skink/locust, though. Given the lack of information it'll be difficult >>> to verify. So let's wait a bit. >>> >> >> I've done about 5000 runs of 'make check' in test_decoding, on two rpi >> machines (one armv7, one aarch64). Not a single assert failure :-( >> >> How come skink/locust hit that in just a couple runs? >> > > Is it failed after you pushed a fix? I don't think so or am I missing > something? I feel even if doesn't occur again it would have been > better if we had some theory on how it occurred in the first place > because that would make us feel more confident that we won't have any > related problem left. > I don't think it failed yet - we have to wait a bit longer to make any conclusions, though. On skink it failed only twice over 1 month. I agree it'd be nice to have some theory, but I really don't have one. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 23.02.22 00:24, Tomas Vondra wrote: > Here's an updated version of the patch, fixing most of the issues from > reviews so far. There's still a couple FIXME comments, but I think those > are minor and/or straightforward to deal with. This patch needs a rebase because of a conflict in expected/publication.out. In addition, see the attached fixup patch to get the pg_dump tests passing (and some other stuff). 028_sequences.pl should be renamed to 029, since there is now another 028. In psql, the output of \dRp and \dRp+ is inconsistent. The former shows All tables | All sequences | Inserts | Updates | Deletes | Truncates | Sequences | Via root the latter shows All tables | All sequences | Inserts | Updates | Deletes | Sequences | Truncates | Via root I think the first order is the best one. That's all for now, I'll come back with more reviewing later.
Attachments
On 3/10/22 12:07, Peter Eisentraut wrote: > On 23.02.22 00:24, Tomas Vondra wrote: >> Here's an updated version of the patch, fixing most of the issues from >> reviews so far. There's still a couple FIXME comments, but I think those >> are minor and/or straightforward to deal with. > > This patch needs a rebase because of a conflict in > expected/publication.out. In addition, see the attached fixup patch to > get the pg_dump tests passing (and some other stuff). > OK, rebased patch attached. > 028_sequences.pl should be renamed to 029, since there is now another 028. > Renamed. > In psql, the output of \dRp and \dRp+ is inconsistent. The former shows > > All tables | All sequences | Inserts | Updates | Deletes | Truncates | > Sequences | Via root > > the latter shows > > All tables | All sequences | Inserts | Updates | Deletes | Sequences | > Truncates | Via root > > I think the first order is the best one. > Good idea, I've tweaked the code to use the former order. > That's all for now, I'll come back with more reviewing later. thanks -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/7/22 22:25, Tomas Vondra wrote: > >> > >> Interesting. I can think of one reason that might cause this - we log > >> the first sequence increment after a checkpoint. So if a checkpoint > >> happens in an unfortunate place, there'll be an extra WAL record. On > >> slow / busy machines that's quite possible, I guess. > >> > > > > I've tweaked the checkpoint_interval to make checkpoints more aggressive > > (set it to 1s), and it seems my hunch was correct - it produces failures > > exactly like this one. The best fix probably is to just disable decoding > > of sequences in those tests that are not aimed at testing sequence decoding. > > > > I've pushed a fix for this, adding "include-sequences=0" to a couple > test_decoding tests, which were failing with concurrent checkpoints. > > Unfortunately, I realized we have a similar issue in the "sequences" > tests too :-( Imagine you do a series of sequence increments, e.g. > > SELECT nextval('s') FROM generate_sequences(1,100); > > If there's a concurrent checkpoint, this may add an extra WAL record, > affecting the decoded output (and also the data stored in the sequence > relation itself). Not sure what to do about this ... > I am also not sure what to do for it but maybe if in some way we can increase checkpoint timeout or other parameters for these tests then it would reduce the chances of such failures. The other idea could be to perform checkpoint before the start of tests to reduce the possibility of another checkpoint. -- With Regards, Amit Kapila.
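A sketch of the second idea, as I understand it — the sequence test would checkpoint explicitly before its body, so the "first increment after a checkpoint" record lands in a known place (as the follow-up notes, this only shrinks the race window, it cannot close it):

```sql
-- run at the start of the sequence test, before the interesting part
CHECKPOINT;

-- now the first nextval() after the checkpoint is the only "extra" WAL
-- record; on a fast machine no further checkpoint should intervene
-- before this statement finishes
SELECT nextval('s') FROM generate_series(1, 100);
```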
On Fri, Mar 11, 2022 at 5:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/7/22 22:25, Tomas Vondra wrote: > > >> > > >> Interesting. I can think of one reason that might cause this - we log > > >> the first sequence increment after a checkpoint. So if a checkpoint > > >> happens in an unfortunate place, there'll be an extra WAL record. On > > >> slow / busy machines that's quite possible, I guess. > > >> > > > > > > I've tweaked the checkpoint_interval to make checkpoints more aggressive > > > (set it to 1s), and it seems my hunch was correct - it produces failures > > > exactly like this one. The best fix probably is to just disable decoding > > > of sequences in those tests that are not aimed at testing sequence decoding. > > > > > > > I've pushed a fix for this, adding "include-sequences=0" to a couple > > test_decoding tests, which were failing with concurrent checkpoints. > > > > Unfortunately, I realized we have a similar issue in the "sequences" > > tests too :-( Imagine you do a series of sequence increments, e.g. > > > > SELECT nextval('s') FROM generate_sequences(1,100); > > > > If there's a concurrent checkpoint, this may add an extra WAL record, > > affecting the decoded output (and also the data stored in the sequence > > relation itself). Not sure what to do about this ... > > > > I am also not sure what to do for it but maybe if in some way we can > increase checkpoint timeout or other parameters for these tests then > it would reduce the chances of such failures. The other idea could be > to perform checkpoint before the start of tests to reduce the > possibility of another checkpoint. 
> One more thing, I notice while checking the commit for this feature is that the below include seems to be out of order: --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -42,6 +42,7 @@ #include "replication/reorderbuffer.h" #include "replication/snapbuild.h" #include "storage/standby.h" +#include "commands/sequence.h" -- With Regards, Amit Kapila.
On 3/11/22 12:34, Amit Kapila wrote: > On Tue, Mar 8, 2022 at 11:59 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/7/22 22:25, Tomas Vondra wrote: >>>> >>>> Interesting. I can think of one reason that might cause this - we log >>>> the first sequence increment after a checkpoint. So if a checkpoint >>>> happens in an unfortunate place, there'll be an extra WAL record. On >>>> slow / busy machines that's quite possible, I guess. >>>> >>> >>> I've tweaked the checkpoint_interval to make checkpoints more aggressive >>> (set it to 1s), and it seems my hunch was correct - it produces failures >>> exactly like this one. The best fix probably is to just disable decoding >>> of sequences in those tests that are not aimed at testing sequence decoding. >>> >> >> I've pushed a fix for this, adding "include-sequences=0" to a couple >> test_decoding tests, which were failing with concurrent checkpoints. >> >> Unfortunately, I realized we have a similar issue in the "sequences" >> tests too :-( Imagine you do a series of sequence increments, e.g. >> >> SELECT nextval('s') FROM generate_sequences(1,100); >> >> If there's a concurrent checkpoint, this may add an extra WAL record, >> affecting the decoded output (and also the data stored in the sequence >> relation itself). Not sure what to do about this ... >> > > I am also not sure what to do for it but maybe if in some way we can > increase checkpoint timeout or other parameters for these tests then > it would reduce the chances of such failures. The other idea could be > to perform checkpoint before the start of tests to reduce the > possibility of another checkpoint. > Yeah, I had the same ideas, but I'm not sure I like any of them. I doubt we want to make checkpoints extremely rare, and even if we do that it'll still fail on slow machines (e.g. with valgrind, clobber cache etc.). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Further review (based on 20220310 patch): doc/src/sgml/ref/create_publication.sgml | 3 + For the clauses added to the synopsis, descriptions should be added below. See attached patch for a start. src/backend/commands/sequence.c | 79 ++ There is quite a bit of overlap between ResetSequence() and ResetSequence2(), but I couldn't see a good way to combine them that genuinely saves code and complexity. So maybe it's ok. Actually, ResetSequence2() is not really "reset", it's just "set". Maybe pick a different function name. src/backend/commands/subscriptioncmds.c | 272 +++++++ The code added in AlterSubscription_refresh() seems to be entirely copy-and-paste from the tables case. I think this could be combined by concatenating the lists from fetch_table_list() and fetch_sequence_list() and looping over it once. The same also applies to CreateSubscription(), although the code duplication is smaller there. This in turn means that fetch_table_list() and fetch_sequence_list() can be combined, so that you don't actually need any extensive new code in CreateSubscription() and AlterSubscription_refresh() for sequences. This could go on, you can combine more of the underlying code, like pg_publication_tables and pg_publication_sequences and so on. src/backend/replication/logical/proto.c | 52 ++ The documentation of the added protocol message needs to be added to the documentation. See attached patch for a start. The sequence message does not contain the sequence Oid, unlike the relation message. Would that be good to add? src/backend/replication/logical/worker.c | 56 ++ Maybe the Asserts in apply_handle_sequence() should be elogs. These are checking what is sent over the network, so we don't want a bad/evil peer able to trigger asserts. And in non-assert builds these conditions would be unchecked. src/backend/replication/pgoutput/pgoutput.c | 82 +- I find the code in get_rel_sync_entry() confusing. 
You add a section for if (!publish && is_sequence) but then shouldn't the code below that be something like if (!publish && !is_sequence) src/bin/pg_dump/t/002_pg_dump.pl | 38 +- This adds a new publication "pub4", but the tests already contain a "pub4". I'm not sure why this even works, but perhaps the new one should be "pub5", unless there is a deeper meaning. src/include/catalog/pg_publication_namespace.h | 3 +- I don't like how the distinction between table and sequence is done using a bool field. That affects also the APIs in pg_publication.c and publicationcmds.c especially. There is a lot of unadorned "true" and "false" being passed around that isn't very clear, and it all appears to originate at this catalog. I think we could use a char field here that uses the relkind constants. That would also make the code in pg_publication.c etc. slightly clearer. See attached patch for more small tweaks. Your patch still contains a number of XXX and FIXME comments, which in my assessment are all more or less correct, so I didn't comment on those separately. Other than that, this seems pretty good. Earlier in the thread I commented on some aspects of the new grammar (e.g., do we need FOR ALL SEQUENCES?). I think this would be useful to review again after all the new logical replication patches are in. I don't want to hold up this patch for that at this point.
Attachments
On 3/13/22 07:45, Peter Eisentraut wrote: > Further review (based on 20220310 patch): > > doc/src/sgml/ref/create_publication.sgml | 3 + > > For the clauses added to the synopsis, descriptions should be added > below. See attached patch for a start. Thanks. I'm not sure what other improvements you think this .sgml file needs? > > src/backend/commands/sequence.c | 79 ++ > > There is quite a bit of overlap between ResetSequence() and > ResetSequence2(), but I couldn't see a good way to combine them that > genuinely saves code and complexity. So maybe it's ok. > > Actually, ResetSequence2() is not really "reset", it's just "set". > Maybe pick a different function name. Yeah, good point. I think the functions are sufficiently different, and attempting to remove the duplication is unlikely to be an improvement. But you're right "ResetSequence2" is not a great name, so I've changed it to "SetSequence". > > src/backend/commands/subscriptioncmds.c | 272 +++++++ > > The code added in AlterSubscription_refresh() seems to be entirely > copy-and-paste from the tables case. I think this could be combined > by concatenating the lists from fetch_table_list() and > fetch_sequence_list() and looping over it once. The same also applies > to CreateSubscription(), although the code duplication is smaller > there. > > This in turn means that fetch_table_list() and fetch_sequence_list() > can be combined, so that you don't actually need any extensive new > code in CreateSubscription() and AlterSubscription_refresh() for > sequences. This could go on, you can combine more of the underlying > code, like pg_publication_tables and pg_publication_sequences and so > on. > I've removed the duplicated code in both places, so that it processes only a single list, which is a combination of tables and sequences (using list_concat). For CreateSubscription it was trivial, because the code was simple and a perfect copy. 
For _refresh there was a minor difference, but I think it was actually entirely unnecessary when processing the combined list. But this will need more testing. I'm not sure about the last bit, though. How would you combine code for pg_publication_tables and pg_publication_sequences, etc? > src/backend/replication/logical/proto.c | 52 ++ > > The documentation of the added protocol message needs to be added to > the documentation. See attached patch for a start. > OK. I've resolved the FIXME for LSN. Not sure what else is needed? > The sequence message does not contain the sequence Oid, unlike the > relation message. Would that be good to add? I don't think we need to do that. For relations we do that because it serves as an identifier in RelationSyncCache, and it links the various messages to it. For sequences we don't need that - the schema is fixed. Or do you see a practical reason to add the OID? > > src/backend/replication/logical/worker.c | 56 ++ > > Maybe the Asserts in apply_handle_sequence() should be elogs. These > are checking what is sent over the network, so we don't want a > bad/evil peer able to trigger asserts. And in non-assert builds these > conditions would be unchecked. > I'll think about it, but AFAIK we don't really assume evil peers. > src/backend/replication/pgoutput/pgoutput.c | 82 +- > > I find the code in get_rel_sync_entry() confusing. You add a section for > > if (!publish && is_sequence) > > but then shouldn't the code below that be something like > > if (!publish && !is_sequence) > Hmm, maybe. But I think there's actually a bigger issue - this does not seem to be dealing with pg_publication_namespace.pnsequences correctly. That is, we don't differentiate which schemas include tables and which schemas include sequences. Interestingly, no tests fail. I'll take a closer look tomorrow. > src/bin/pg_dump/t/002_pg_dump.pl | 38 +- > > This adds a new publication "pub4", but the tests already contain a > "pub4". 
I'm not sure why this even works, but perhaps the new one > shold be "pub5", unless there is a deeper meaning. > I agree, pub5 it is. But it's interesting it does not fail even with the duplicate name. Strange. > src/include/catalog/pg_publication_namespace.h | 3 +- > > I don't like how the distinction between table and sequence is done > using a bool field. That affects also the APIs in pg_publication.c > and publicationcmds.c especially. There is a lot of unadorned "true" > and "false" being passed around that isn't very clear, and it all > appears to originate at this catalog. I think we could use a char > field here that uses the relkind constants. That would also make the > code in pg_publication.c etc. slightly clearer. > I thought about using relkind, but it does not work all that nicely because we have multiple relkinds for a table (because of partitioned tables). So I found that confusing. Maybe we should just use 'r' for any table, in this catalog? > > See attached patch for more small tweaks. > > Your patch still contains a number of XXX and FIXME comments, which in > my assessment are all more or less correct, so I didn't comment on those > separately. > Yeah, I plan to look at those next. > Other than that, this seems pretty good. > > Earlier in the thread I commented on some aspects of the new grammar > (e.g., do we need FOR ALL SEQUENCES?). I think this would be useful to > review again after all the new logical replication patches are in. I > don't want to hold up this patch for that at this point. I'm not particularly attached to the grammar, but I don't see any reason not to have mostly the same grammar/options as for tables. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi, Attached is a rebased patch, addressing most of the remaining issues. The main improvements are: 1) pg_publication_namespace.pntype and type checks Originally, the patch used the pnsequences flag to distinguish which entries were added by FOR ALL TABLES IN SCHEMA and FOR ALL SEQUENCES IN SCHEMA. I've decided to replace this with a simple char column, called pntype, where 't' means tables and 's' sequences. As explained before, relkind doesn't work well because of partitioned tables. A char, with a function to match it to relkind values, works fairly well. I've revisited the question of how to represent publications publishing the same schema twice - once for tables, once for sequences. There were proposals to represent this with a single row, i.e. turn pntype into an array of char values. So it'd be either ['t'], ['s'] or ['s', 't']. I spent some time working on that, but I've decided to keep the current approach with two separate rows - it's easier to manage, lookup etc. 2) pg_get_object_address I've updated the objectaddress code to consider pntype when looking up the pntype value, so each row in pg_publication_namespace gets the correct ObjectAddress. 3) for all [tables | sequences] The original patch did not allow creating a publication for all tables and all sequences at the same time. I've tweaked the grammar to allow this: CREATE PUBLICATION p FOR ALL list_of_types; where list_of_types is an arbitrary combination of TABLES and SEQUENCES. It's implemented in a slightly awkward way - partially in the grammar, partially in publicationcmds.c. I suspect there's a (cleaner) way to do this entirely in the grammar but I haven't succeeded yet. 4) prevent 'ADD TABLE sequence' and 'ADD SEQUENCE table' It was possible to do "ADD TABLE" and pass it a sequence, which would fail to notice if the publication already includes all sequences from the schema. I've added a check preventing that (and a similar one for ADD SEQUENCE). 
5) missing block in AlterTableNamespace to cross-check moving published sequence to already published schema A block of code was missing from AlterTableNamespace, checking that we're not moving a sequence into a schema that is already published (all the sequences from it). 6) a couple comment fixes Various comment improvements and fixes. At this point there's a couple trivial FIXME/XXX comments remaining. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
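As I read the reworked grammar described in the message above, it would accept combinations like the following (illustrative only; publication, table, sequence and schema names are made up):

```sql
-- all tables and all sequences in one publication
CREATE PUBLICATION p1 FOR ALL TABLES, SEQUENCES;

-- sequences only
CREATE PUBLICATION p2 FOR ALL SEQUENCES;

-- explicit objects mixed with a schema-level sequence entry
CREATE PUBLICATION p3 FOR TABLE t1, SEQUENCE s1,
                          ALL SEQUENCES IN SCHEMA myschema;
```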
On 20.03.22 23:55, Tomas Vondra wrote: > Attached is a rebased patch, addressing most of the remaining issues. This looks okay to me, if the two FIXMEs are addressed. Remember to also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE.
On 3/21/22 14:05, Peter Eisentraut wrote: > On 20.03.22 23:55, Tomas Vondra wrote: >> Attached is a rebased patch, addressing most of the remaining issues. > > This looks okay to me, if the two FIXMEs are addressed. Remember to > also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE. Thanks. Do we want to use a different constant for the sequence message? I've used 'X' for the WIP patch, but maybe there's a better value? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21.03.22 22:54, Tomas Vondra wrote: > On 3/21/22 14:05, Peter Eisentraut wrote: >> On 20.03.22 23:55, Tomas Vondra wrote: >>> Attached is a rebased patch, addressing most of the remaining issues. >> >> This looks okay to me, if the two FIXMEs are addressed. Remember to >> also update protocol.sgml if you change LOGICAL_REP_MSG_SEQUENCE. > > Thanks. Do we want to use a different constant for the sequence message? > I've used 'X' for the WIP patch, but maybe there's a better value? I would do small 's'. Alternatively, 'Q'/'q' is still available, too.
On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > Attached is a rebased patch, addressing most of the remaining issues. > It appears that on the apply side, the patch always creates a new relfilenode irrespective of whether the sequence message is transactional or not. Is it required to create a new relfilenode for non-transactional messages? If not that could be costly? -- With Regards, Amit Kapila.
> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> Attached is a rebased patch, addressing most of the remaining issues. >> > > It appears that on the apply side, the patch always creates a new > relfilenode irrespective of whether the sequence message is > transactional or not. Is it required to create a new relfilenode for > non-transactional messages? If not that could be costly? > That's a good catch, I think we should just write the page in the non-transactional case, no need to mess with relnodes. Petr
On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek <petr.jelinek@enterprisedb.com> wrote: > > > On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Hi, > >> > >> Attached is a rebased patch, addressing most of the remaining issues. > >> > > > > It appears that on the apply side, the patch always creates a new > > relfilenode irrespective of whether the sequence message is > > transactional or not. Is it required to create a new relfilenode for > > non-transactional messages? If not that could be costly? > > > > > That's a good catch, I think we should just write the page in the non-transactional case, no need to mess with relnodes. > What if the current node has also incremented from the existing sequence? Basically, how will we deal with conflicts? It seems we will overwrite the actions done on the existing node which means sequence values can go back. On looking a bit more closely, I think I see some more fundamental problems here: * Don't we need some syncing mechanism between apply worker and sequence sync worker so that apply worker skips the sequence changes till the sync worker is finished, otherwise, there is a risk of one overriding the values of the other? * Currently, the patch uses one sync worker per sequence. It seems to be a waste of resources considering apart from one additional process, we need origin/slot to sync each sequence. * Don't we need explicit privilege checking before applying sequence data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for tables? -- With Regards, Amit Kapila.
> On 23. 3. 2022, at 12:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
>>
>>> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Attached is a rebased patch, addressing most of the remaining issues.
>>>>
>>>
>>> It appears that on the apply side, the patch always creates a new
>>> relfilenode irrespective of whether the sequence message is
>>> transactional or not. Is it required to create a new relfilenode for
>>> non-transactional messages? If not that could be costly?
>>>
>>
>> That's a good catch, I think we should just write the page in the
>> non-transactional case, no need to mess with relnodes.
>>
>
> What if the current node has also incremented from the existing
> sequence? Basically, how will we deal with conflicts? It seems we will
> overwrite the actions done on the existing node which means sequence
> values can go back.
>

I think this is perfectly acceptable behavior, we are replicating state from upstream, not reconciling state on downstream.

You can't really use the builtin sequences to implement distributed sequence via replication. If user wants to write to both nodes they should not replicate the sequence value and instead offset the sequence on each node so they produce different ranges, that's quite common approach. One day we might want revisit adding support for custom sequence AMs.

> * Currently, the patch uses one sync worker per sequence. It seems to
> be a waste of resources considering apart from one additional process,
> we need origin/slot to sync each sequence.
>

This is indeed wasteful but not something that I'd consider blocker for the patch personally.

--
Petr
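The offsetting approach Petr mentions is straightforward to sketch in SQL: give each writable node the same increment but a different start, so the nodes draw from disjoint value ranges (the two-node setup and names are illustrative):

```sql
-- on node A (hands out 1, 3, 5, ...):
CREATE SEQUENCE ids START WITH 1 INCREMENT BY 2;

-- on node B (hands out 2, 4, 6, ...):
CREATE SEQUENCE ids START WITH 2 INCREMENT BY 2;

-- neither sequence is added to a publication, so the nodes never
-- overwrite each other's counters and never hand out the same id
```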
On 3/23/22 13:46, Petr Jelinek wrote: > >> On 23. 3. 2022, at 12:50, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Mar 22, 2022 at 5:41 PM Petr Jelinek >> <petr.jelinek@enterprisedb.com> >> wrote: >>> >>>> On 22. 3. 2022, at 13:09, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> On Mon, Mar 21, 2022 at 4:25 AM Tomas Vondra >>>> <tomas.vondra@enterprisedb.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> Attached is a rebased patch, addressing most of the remaining issues. >>>>> >>>> >>>> It appears that on the apply side, the patch always creates a new >>>> relfilenode irrespective of whether the sequence message is >>>> transactional or not. Is it required to create a new relfilenode for >>>> non-transactional messages? If not that could be costly? >>>> >>> >>> >>> That's a good catch, I think we should just write the page in the >>> non-transactional case, no need to mess with relnodes. >>> >> >> What if the current node has also incremented from the existing >> sequence? Basically, how will we deal with conflicts? It seems we will >> overwrite the actions done on the existing node which means sequence >> values can go back. >> > > > I think this is perfectly acceptable behavior, we are replicating state > from upstream, not reconciling state on downstream. > > You can't really use the builtin sequences to implement distributed > sequence via replication. If user wants to write to both nodes they > should not replicate the sequence value and instead offset the sequence > on each node so they produce different ranges, that's quite common > approach. One day we might want revisit adding support for custom > sequence AMs. > Exactly. Moreover it's about the same behavior as if you update table data on the subscriber, and then an UPDATE gets replicated and overwrites the local change. 
Attached is a patch fixing the relfilenode issue - now we only allocate a new relfilenode for the transactional case, and do an in-place update, similar to setval(), otherwise. And thanks for noticing this. > >> * Currently, the patch uses one sync worker per sequence. It seems to >> be a waste of resources considering apart from one additional process, >> we need origin/slot to sync each sequence. >> > > > This is indeed wasteful but not something that I'd consider blocker for > the patch personally. > Right, and the same argument can be made for tablesync of tiny tables (which a sequence essentially is). I'm sure there are ways to improve this, but that can be done later. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Awesome to see this get committed, thanks Tomas. Is there anything left or shall I update the CF entry to committed?
Hi, Pushed, after going through the patch once more, addressed the remaining FIXMEs, corrected a couple places in the docs and comments, etc. Minor tweaks, nothing important. I've been thinking about the grammar a bit more after pushing, and I realized that maybe it'd be better to handle the FOR ALL TABLES / SEQUENCES clause as PublicationObjSpec, not as a separate/special case. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/24/22 22:52, Greg Stark wrote: > Awesome to see this get committed, thanks Tomas. > > Is there anything left or shall I update the CF entry to committed? Yeah, let's mark it as committed. I was waiting for some feedback from the buildfarm - there are some failures, but they seem unrelated. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Pushed. > Some of the comments given by me [1] don't seem to be addressed or responded to. Let me try to say again for the ease of discussion: * Don't we need some syncing mechanism between apply worker and sequence sync worker so that apply worker skips the sequence changes till the sync worker is finished, otherwise, there is a risk of one overriding the values of the other? See how we take care of this for a table in should_apply_changes_for_rel() and its callers. If we don't do this for sequences for some reason then probably a comment somewhere is required. * Don't we need explicit privilege checking before applying sequence data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for tables? Few new comments: ================= 1. A simple test like the below crashes for me: postgres=# create sequence s1; CREATE SEQUENCE postgres=# create sequence s2; CREATE SEQUENCE postgres=# create publication pub1 for sequence s1, s2; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. 2. In apply_handle_sequence() do we need AccessExclusiveLock for non-transactional case? 3. In apply_handle_sequence(), I think for transactional case, we need to skip the operation, if the skip lsn is set. See how we skip in apply_handle_insert() and similar functions. [1] - https://www.postgresql.org/message-id/CAA4eK1Jn-DttQ%3D4Pdh9YCe1w%2BzGbgC%2B0uR0sfg2RtkjiAPmB9g%40mail.gmail.com -- With Regards, Amit Kapila.
On Fri, Mar 25, 2022 at 6:59 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > Pushed, after going through the patch once more, addressed the remaining > FIXMEs, corrected a couple places in the docs and comments, etc. Minor > tweaks, nothing important. > The commit updates tab-completion for ALTER PUBLICATION but seems not to update for CREATE PUBLICATION. I've attached a patch for that. Also, the commit adds a new pgoutput option "sequences": + else if (strcmp(defel->defname, "sequences") == 0) + { + if (sequences_option_given) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + sequences_option_given = true; + + data->sequences = defGetBoolean(defel); + } But as far as I read the changes, there is no use of this option, and this code is not tested. Can we remove it or is it for upcoming changes? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachments
On 3/25/22 05:01, Amit Kapila wrote: > On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Pushed. >> > > Some of the comments given by me [1] don't seem to be addressed or > responded to. Let me try to say again for the ease of discussion: > D'oh! I got distracted by Petr's response to that message, and missed this part ... > * Don't we need some syncing mechanism between apply worker and > sequence sync worker so that apply worker skips the sequence changes > till the sync worker is finished, otherwise, there is a risk of one > overriding the values of the other? See how we take care of this for a > table in should_apply_changes_for_rel() and its callers. If we don't > do this for sequences for some reason then probably a comment > somewhere is required. > How would that happen? If we're effectively setting the sequence as a side effect of inserting the data, then why should we even replicate the sequence? We'll have the problem later too, no? > * Don't we need explicit privilege checking before applying sequence > data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for > tables? > So essentially something like TargetPrivilegesCheck in the worker? I think you're probably right we need something like that. > Few new comments: > ================= > 1. A simple test like the below crashes for me: > postgres=# create sequence s1; > CREATE SEQUENCE > postgres=# create sequence s2; > CREATE SEQUENCE > postgres=# create publication pub1 for sequence s1, s2; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > Yeah, preprocess_pubobj_list seems to be a few bricks shy. I have a fix, will push shortly. > 2. In apply_handle_sequence() do we need AccessExclusiveLock for > non-transactional case? > Good catch. 
This lock was inherited from ResetSequence, but now that the transactional case works differently, we probably don't need it. > 3. In apply_handle_sequence(), I think for transactional case, we need > to skip the operation, if the skip lsn is set. See how we skip in > apply_handle_insert() and similar functions. > Right. Thanks for these reports! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/25/22 08:00, Masahiko Sawada wrote: > On Fri, Mar 25, 2022 at 6:59 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> Pushed, after going through the patch once more, addressed the remaining >> FIXMEs, corrected a couple places in the docs and comments, etc. Minor >> tweaks, nothing important. >> > > The commit updates tab-completion for ALTER PUBLICATION but seems not > to update for CREATE PUBLICATION. I've attached a patch for that. > Thanks. I'm pretty sure the patch did that, but it likely got lost in one of the rebases due to a conflict. Too bad we don't have tests for tab-complete. Will fix. > Also, the commit add a new pgoutput option "sequences": > > + else if (strcmp(defel->defname, "sequences") == 0) > + { > + if (sequences_option_given) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("conflicting > or redundant options"))); > + sequences_option_given = true; > + > + data->sequences = defGetBoolean(defel); > + } > > But as far as I read changes, there is no use of this option, and this > code is not tested. Can we remove it or is it for upcoming changes? > pgoutput_sequence uses this if (!data->sequences) return; This was inspired by what we do for logical messages, but maybe there's an argument we don't need this, considering we have "sequence" action and that a sequence has to be added to the publication. I don't think there's any future patch relying on this (and it could add it back, if needed). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
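For what it's worth, the gating that the quoted `if (!data->sequences) return;` check performs can be illustrated with a small Python sketch (hypothetical names, not the actual C code in pgoutput.c):

```python
# Toy sketch of option-gating an output-plugin callback, in the spirit of
# the `if (!data->sequences) return;` check quoted above. Purely
# illustrative Python; the real logic is C in pgoutput.c.
def make_sequence_callback(plugin_options):
    send_sequences = plugin_options.get("sequences", True)

    def pgoutput_sequence(change, out):
        # If the subscriber asked for sequences=false, emit nothing.
        if not send_sequences:
            return
        out.append("sequence " + change)

    return pgoutput_sequence

messages = []
make_sequence_callback({"sequences": False})("public.s1: last_value 10", messages)
assert messages == []   # gated off, nothing emitted

make_sequence_callback({"sequences": True})("public.s1: last_value 10", messages)
assert messages == ["sequence public.s1: last_value 10"]
```

This also shows why the option is only useful if some consumer actually sets it to false — otherwise the check is dead code, which is the point being raised here.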
On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > On 3/25/22 05:01, Amit Kapila wrote: > > On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Pushed. > >> > > > > Some of the comments given by me [1] don't seem to be addressed or > > responded to. Let me try to say again for the ease of discussion: > > > > D'oh! I got distracted by Petr's response to that message, and missed > this part ... > > > * Don't we need some syncing mechanism between apply worker and > > sequence sync worker so that apply worker skips the sequence changes > > till the sync worker is finished, otherwise, there is a risk of one > > overriding the values of the other? See how we take care of this for a > > table in should_apply_changes_for_rel() and its callers. If we don't > > do this for sequences for some reason then probably a comment > > somewhere is required. > > > > How would that happen? If we're effectively setting the sequence as a > side effect of inserting the data, then why should we even replicate the > sequence? > I was talking just about sequence values here, considering that some sequence is just replicating based on nextval. I think the problem is that apply worker might override what copy has done if an apply worker is behind the sequence sync worker as both can run in parallel. Let me try to take some theoretical example to explain this: Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN 12000, the value of s1 becomes 20. Now, say copy decides to copy the sequence value till LSN 12000 which means it will make the value as 20 on the subscriber, now, in parallel, apply worker can process LSN 10000 and make it again 10. Apply worker might end up redoing all sequence operations along with some transactional ones where we recreate the file. I am not sure what exact problem it can lead to but I think we don't need to redo the work. We'll have the problem later too, no? 
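The race described above can be condensed into a toy Python model (an illustration of the argument, not PostgreSQL code): without LSN-based gating the apply worker briefly moves the already-synced sequence backwards, while should_apply_changes_for_rel-style gating skips the stale change.

```python
def replay(changes, synced_value, synced_lsn=None):
    """Replay decoded sequence changes on the subscriber, starting from
    the state installed by the sequence sync worker. If synced_lsn is
    given, skip changes already covered by the initial sync (the gating
    that should_apply_changes_for_rel provides for tables)."""
    history = [synced_value]
    for lsn, new_value in changes:
        if synced_lsn is not None and lsn <= synced_lsn:
            continue   # already reflected by the sync snapshot
        history.append(new_value)
    return history

# Publisher WAL, per the example: s1 is 10 at LSN 10000, 20 at LSN 12000.
changes = [(10000, 10), (12000, 20)]

# The sync worker copied state as of LSN 12000, so the subscriber starts
# at 20, but the apply worker is behind and replays LSN 10000 first:
assert replay(changes, 20) == [20, 10, 20]   # value transiently goes backwards

# With gating, both already-synced changes are skipped:
assert replay(changes, 20, synced_lsn=12000) == [20]
```

The redo is at best wasted work, and any reader between the two applies can observe the regressed value.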
> > > * Don't we need explicit privilege checking before applying sequence > > data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for > > tables? > > > > So essentially something like TargetPrivilegesCheck in the worker? > Right. Few more comments: ================== 1. @@ -636,7 +704,7 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt) get_database_name(MyDatabaseId)); /* FOR ALL TABLES requires superuser */ - if (stmt->for_all_tables && !superuser()) + if (for_all_tables && !superuser()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to create FOR ALL TABLES publication"))); Don't we need a similar check for 'for_all_schema' publications? 2. +<varlistentry> +<term> + Int8 +</term> +<listitem> +<para> + 1 if the sequence update is transactions, 0 otherwise. Shall we say transactional instead of transactions? 3. +/* + * Determine object type given the object type set for a schema. + */ +char +pub_get_object_type_for_relkind(char relkind) Shouldn't it be 'relation' instead of 'schema' at the end of the sentence? 4. @@ -1739,13 +1804,13 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) { Oid schemaId = get_rel_namespace(relid); List *pubids = GetRelationPublications(relid); - + char objectType = pub_get_object_type_for_relkind(get_rel_relkind(relid)); A few lines after this we are again getting relkind which is not a big deal but OTOH there doesn't seem to be a need to fetch the same thing twice from the cache. 5. + + /* Check that user is allowed to manipulate the publication tables. */ + if (sequences && pubform->puballsequences) /tables/sequences 6. +apply_handle_sequence(StringInfo s) { ... + + relid = RangeVarGetRelid(makeRangeVar(seq.nspname, + seq.seqname, -1), + RowExclusiveLock, false); ... 
} As here we are passing missing_ok as false, so if the sequence doesn't exist, it will give a message like: "ERROR: relation "public.s1" does not exist", whereas for tables we give a slightly clearer message like: "ERROR: logical replication target relation "public.t1" does not exist". This is handled via logicalrep_rel_open(). -- With Regards, Amit Kapila.
On 3/25/22 12:21, Amit Kapila wrote: > On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> >> On 3/25/22 05:01, Amit Kapila wrote: >>> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> Pushed. >>>> >>> >>> Some of the comments given by me [1] don't seem to be addressed or >>> responded to. Let me try to say again for the ease of discussion: >>> >> >> D'oh! I got distracted by Petr's response to that message, and missed >> this part ... >> >>> * Don't we need some syncing mechanism between apply worker and >>> sequence sync worker so that apply worker skips the sequence changes >>> till the sync worker is finished, otherwise, there is a risk of one >>> overriding the values of the other? See how we take care of this for a >>> table in should_apply_changes_for_rel() and its callers. If we don't >>> do this for sequences for some reason then probably a comment >>> somewhere is required. >>> >> >> How would that happen? If we're effectively setting the sequence as a >> side effect of inserting the data, then why should we even replicate the >> sequence? >> > > I was talking just about sequence values here, considering that some > sequence is just replicating based on nextval. I think the problem is > that apply worker might override what copy has done if an apply worker > is behind the sequence sync worker as both can run in parallel. Let me > try to take some theoretical example to explain this: > > Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN > 12000, the value of s1 becomes 20. Now, say copy decides to copy the > sequence value till LSN 12000 which means it will make the value as 20 > on the subscriber, now, in parallel, apply worker can process LSN > 10000 and make it again 10. Apply worker might end up redoing all > sequence operations along with some transactional ones where we > recreate the file. 
I am not sure what exact problem it can lead to but > I think we don't need to redo the work. > > We'll have the problem later too, no? > Ah, I was confused why this would be an issue for sequences and not for plain tables, but now I realize should_apply_changes_for_rel() is not called in apply_handle_sequence(). Yes, that's probably a thinko. >>> * Don't we need explicit privilege checking before applying sequence >>> data as we do in commit a2ab9c06ea15fbcb2bfde570986a06b37f52bcca for >>> tables? >>> >> >> So essentially something like TargetPrivilegesCheck in the worker? >> > > Right. > OK, will do. > Few more comments: > ================== > 1. > @@ -636,7 +704,7 @@ CreatePublication(ParseState *pstate, > CreatePublicationStmt *stmt) > get_database_name(MyDatabaseId)); > > /* FOR ALL TABLES requires superuser */ > - if (stmt->for_all_tables && !superuser()) > + if (for_all_tables && !superuser()) > ereport(ERROR, > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > errmsg("must be superuser to create FOR ALL TABLES > publication"))); > > Don't we need a similar check for 'for_all_schema' publications? > I think you mean "for_all_sequences", right? > 2. > +<varlistentry> > +<term> > + Int8 > +</term> > +<listitem> > +<para> > + 1 if the sequence update is transactions, 0 otherwise. > > Shall we say transactional instead of transactions? > > 3. > +/* > + * Determine object type given the object type set for a schema. > + */ > +char > +pub_get_object_type_for_relkind(char relkind) > > Shouldn't it be 'relation' instead of 'schema' at the end of the sentence? > > 4.
> @@ -1739,13 +1804,13 @@ get_rel_sync_entry(PGOutputData *data, > Relation relation) > { > Oid schemaId = get_rel_namespace(relid); > List *pubids = GetRelationPublications(relid); > - > + char objectType = > pub_get_object_type_for_relkind(get_rel_relkind(relid)); > > A few lines after this we are again getting relkind which is not a big > deal but OTOH there doesn't seem to be a need to fetch the same thing > twice from the cache. > > 5. > + > + /* Check that user is allowed to manipulate the publication tables. */ > + if (sequences && pubform->puballsequences) > > /tables/sequences > > 6. > +apply_handle_sequence(StringInfo s) > { > ... > + > + relid = RangeVarGetRelid(makeRangeVar(seq.nspname, > + seq.seqname, -1), > + RowExclusiveLock, false); > ... > } > > As here, we are using missing_ok, if the sequence doesn't exist, it > will give a message like: "ERROR: relation "public.s1" does not > exist" whereas for tables we give a slightly more clear message like: > "ERROR: logical replication target relation "public.t1" does not > exist". This is handled via logicalrep_rel_open(). > Thanks, I'll look at rewording these comments and messages. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > Pushed, after going through the patch once more, addressed the remaining > FIXMEs, corrected a couple places in the docs and comments, etc. Minor > tweaks, nothing important. While rebasing patch [1] I found a couple of comments: static void ObjectsInPublicationToOids(List *pubobjspec_list, ParseState *pstate, - List **rels, List **schemas) + List **tables, List **sequences, + List **tables_schemas, List **sequences_schemas, + List **schemas) { ListCell *cell; PublicationObjSpec *pubobj; @@ -185,12 +194,23 @@ ObjectsInPublicationToOids(List *pubobjspec_list, ParseState *pstate, switch (pubobj->pubobjtype) { case PUBLICATIONOBJ_TABLE: - *rels = lappend(*rels, pubobj->pubtable); + *tables = lappend(*tables, pubobj->pubtable); + break; + case PUBLICATIONOBJ_SEQUENCE: + *sequences = lappend(*sequences, pubobj->pubtable); break; case PUBLICATIONOBJ_TABLES_IN_SCHEMA: schemaid = get_namespace_oid(pubobj->name, false); /* Filter out duplicates if user specifies "sch1, sch1" */ + *tables_schemas = list_append_unique_oid(*tables_schemas, schemaid); + *schemas = list_append_unique_oid(*schemas, schemaid); + break; Now tables_schemas and sequence_schemas are being updated and used in ObjectsInPublicationToOids, schema parameter is no longer being used after processing in ObjectsInPublicationToOids, I felt we can remove that parameter. /* ALTER PUBLICATION <name> ADD */ else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD")) - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); + COMPLETE_WITH("ALL TABLES IN SCHEMA", "ALL SEQUENCES IN SCHEMA", "TABLE", "SEQUENCE"); Tab completion of alter publication for ADD and DROP is the same, we could combine it. Attached a patch for the same. Thoughts? [1] - https://www.postgresql.org/message-id/CALDaNm3%3DJrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh%3DtamA%40mail.gmail.com Regards, Vignesh
Attachments
On 3/25/22 12:59, Tomas Vondra wrote: > > On 3/25/22 12:21, Amit Kapila wrote: >> On Fri, Mar 25, 2022 at 3:56 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> >>> On 3/25/22 05:01, Amit Kapila wrote: >>>> On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra >>>> <tomas.vondra@enterprisedb.com> wrote: >>>>> >>>>> Pushed. >>>>> >>>> >>>> Some of the comments given by me [1] don't seem to be addressed or >>>> responded to. Let me try to say again for the ease of discussion: >>>> >>> >>> D'oh! I got distracted by Petr's response to that message, and missed >>> this part ... >>> >>>> * Don't we need some syncing mechanism between apply worker and >>>> sequence sync worker so that apply worker skips the sequence changes >>>> till the sync worker is finished, otherwise, there is a risk of one >>>> overriding the values of the other? See how we take care of this for a >>>> table in should_apply_changes_for_rel() and its callers. If we don't >>>> do this for sequences for some reason then probably a comment >>>> somewhere is required. >>>> >>> >>> How would that happen? If we're effectively setting the sequence as a >>> side effect of inserting the data, then why should we even replicate the >>> sequence? >>> >> >> I was talking just about sequence values here, considering that some >> sequence is just replicating based on nextval. I think the problem is >> that apply worker might override what copy has done if an apply worker >> is behind the sequence sync worker as both can run in parallel. Let me >> try to take some theoretical example to explain this: >> >> Assume, at LSN 10000, the value of sequence s1 is 10. Then by LSN >> 12000, the value of s1 becomes 20. Now, say copy decides to copy the >> sequence value till LSN 12000 which means it will make the value as 20 >> on the subscriber, now, in parallel, apply worker can process LSN >> 10000 and make it again 10. 
Apply worker might end up redoing all >> sequence operations along with some transactional ones where we >> recreate the file. I am not sure what exact problem it can lead to but >> I think we don't need to redo the work. >> >> We'll have the problem later too, no? >> > > Ah, I was confused why this would be an issue for sequences and not for > plain tables, but now I realize should_apply_changes_for_rel() is not called in > apply_handle_sequence(). Yes, that's probably a thinko. > Hmm, so fixing this might be a bit trickier than I expected. Firstly, currently we only send nspname/relname in the sequence message, not the remote OID or schema. The idea was that for sequences we don't really need schema info, so this seemed OK. But should_apply_changes_for_rel() needs LogicalRepRelMapEntry, and to create/maintain those records we need to send the schema. Attached is a WIP patch that does that. Two places need more work, I think: 1) maybe_send_schema needs ReorderBufferChange, but we don't have that for sequences, we only have TXN. I created a simple wrapper, but maybe we should just tweak maybe_send_schema to use TXN. 2) The transaction handling in apply_handle_sequence() is a bit confusing. The non-transactional increments won't have any explicit commit later, so we can't just rely on begin_replication_step/end_replication_step. But I want to try spending a bit more time on this. But there's a more serious issue, I think. So far, we allowed this: BEGIN; CREATE SEQUENCE s2; ALTER PUBLICATION p ADD SEQUENCE s2; INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100); COMMIT; and the behavior was that we replicated the changes. But with the patch applied, that no longer happens, because should_apply_changes_for_rel says the change should not be applied. And after thinking about this, I think that's correct - we can't apply changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed, and we can't do that until the transaction commits.
So I guess that's correct, and the current behavior is a bug. For a while I was thinking that maybe this means we don't need the transactional behavior at all, but I think we do - we have to handle ALTER SEQUENCE cases that are transactional. Does that make sense, Amit? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
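As a rough mental model of the transactional vs. non-transactional split discussed above (a hypothetical Python sketch, not the reorderbuffer code): an increment is treated transactionally when the sequence's relfilenode was created by a still-running transaction — e.g. by CREATE SEQUENCE or a rewriting ALTER SEQUENCE — and must therefore be queued and emitted only if that transaction commits; everything else can go to the output plugin immediately.

```python
# Hypothetical sketch: classify a decoded sequence increment by whether
# its relfilenode was created by a transaction that is still in progress
# (state the decoder would track while reading WAL).
relfilenodes_created_in_progress = set()

def classify_increment(relfilenode):
    if relfilenode in relfilenodes_created_in_progress:
        return "transactional"       # queue in reorder buffer, send at commit
    return "non-transactional"       # pass to the output plugin right away

# A running transaction did ALTER SEQUENCE, assigning (say) relfilenode 1234:
relfilenodes_created_in_progress.add(1234)

assert classify_increment(1234) == "transactional"
assert classify_increment(42) == "non-transactional"
```

This is exactly why the transactional path cannot be dropped: an aborted ALTER SEQUENCE must not leak its increments to the subscriber.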
On 3/25/22 15:34, vignesh C wrote: > On Fri, Mar 25, 2022 at 3:29 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> Pushed, after going through the patch once more, addressed the remaining >> FIXMEs, corrected a couple places in the docs and comments, etc. Minor >> tweaks, nothing important. > > While rebasing patch [1] I found a couple of comments: > static void > ObjectsInPublicationToOids(List *pubobjspec_list, ParseState *pstate, > - List **rels, List **schemas) > + List **tables, List **sequences, > + List **tables_schemas, List **sequences_schemas, > + List **schemas) > { > ListCell *cell; > PublicationObjSpec *pubobj; > @@ -185,12 +194,23 @@ ObjectsInPublicationToOids(List > *pubobjspec_list, ParseState *pstate, > switch (pubobj->pubobjtype) > { > case PUBLICATIONOBJ_TABLE: > - *rels = lappend(*rels, pubobj->pubtable); > + *tables = lappend(*tables, pubobj->pubtable); > + break; > + case PUBLICATIONOBJ_SEQUENCE: > + *sequences = lappend(*sequences, pubobj->pubtable); > break; > case PUBLICATIONOBJ_TABLES_IN_SCHEMA: > schemaid = get_namespace_oid(pubobj->name, false); > > /* Filter out duplicates if user specifies "sch1, sch1" */ > + *tables_schemas = list_append_unique_oid(*tables_schemas, schemaid); > + *schemas = list_append_unique_oid(*schemas, schemaid); > + break; > > Now tables_schemas and sequence_schemas are being updated and used in > ObjectsInPublicationToOids, schema parameter is no longer being used > after processing in ObjectsInPublicationToOids, I felt we can remove > that parameter. > Thanks! That's a nice simplification, I'll get that pushed in a couple minutes. > /* ALTER PUBLICATION <name> ADD */ > else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD")) > - COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE"); > + COMPLETE_WITH("ALL TABLES IN SCHEMA", "ALL SEQUENCES IN SCHEMA", > "TABLE", "SEQUENCE"); > > Tab completion of alter publication for ADD and DROP is the same, we > could combine it. 
> We could, but I find these combined rules harder to read, so I'll keep the current tab-completion. > Attached a patch for the same. > Thoughts? Thanks for taking a look! Appreciated. > > [1] - https://www.postgresql.org/message-id/CALDaNm3%3DJrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh%3DtamA%40mail.gmail.com > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I've fixed most of the reported issues (or at least I think so), with the exception of those in the apply_handle_sequence() function, i.e.: 1) properly coordinating with the tablesync worker 2) considering skip_lsn, skipping changes 3) missing privilege check, similar to TargetPrivilegesCheck 4) nicer error message if the sequence does not exist The apply_handle_sequence stuff seems to be inter-related, so I plan to deal with that in a single separate commit - the main part being the tablesync coordination, per the fix I shared earlier today. But I need time to think about that, I don't want to rush that. Thanks for the feedback! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hmm, so fixing this might be a bit trickier than I expected. > > Firstly, currently we only send nspname/relname in the sequence message, > not the remote OID or schema. The idea was that for sequences we don't > really need schema info, so this seemed OK. > > But should_apply_changes_for_rel() needs LogicalRepRelMapEntry, and to > create/maintain that those records we need to send the schema. > > Attached is a WIP patch does that. > > Two places need more work, I think: > > 1) maybe_send_schema needs ReorderBufferChange, but we don't have that > for sequences, we only have TXN. I created a simple wrapper, but maybe > we should just tweak maybe_send_schema to use TXN. > > 2) The transaction handling in is a bit confusing. The non-transactional > increments won't have any explicit commit later, so we can't just rely > on begin_replication_step/end_replication_step. But I want to try > spending a bit more time on this. > I didn't understand what you want to say in point (2). > > But there's a more serious issue, I think. So far, we allowed this: > > BEGIN; > CREATE SEQUENCE s2; > ALTER PUBLICATION p ADD SEQUENCE s2; > INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100); > COMMIT; > > and the behavior was that we replicated the changes. But with the patch > applied, that no longer happens, because should_apply_changes_for_rel > says the change should not be applied. > > And after thinking about this, I think that's correct - we can't apply > changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed, > and we can't do that until the transaction commits. > > So I guess that's correct, and the current behavior is a bug. > Yes, I also think that is a bug. > For a while I was thinking that maybe this means we don't need the > transactional behavior at all, but I think we do - we have to handle > ALTER SEQUENCE cases that are transactional. 
> I need some time to think about this. At all places, it is mentioned as creating a sequence for transactional cases, which at the very least needs some tweaking. -- With Regards, Amit Kapila.
On 3/26/22 08:28, Amit Kapila wrote: > On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Hmm, so fixing this might be a bit trickier than I expected. >> >> Firstly, currently we only send nspname/relname in the sequence message, >> not the remote OID or schema. The idea was that for sequences we don't >> really need schema info, so this seemed OK. >> >> But should_apply_changes_for_rel() needs LogicalRepRelMapEntry, and to >> create/maintain that those records we need to send the schema. >> >> Attached is a WIP patch does that. >> >> Two places need more work, I think: >> >> 1) maybe_send_schema needs ReorderBufferChange, but we don't have that >> for sequences, we only have TXN. I created a simple wrapper, but maybe >> we should just tweak maybe_send_schema to use TXN. >> >> 2) The transaction handling in is a bit confusing. The non-transactional >> increments won't have any explicit commit later, so we can't just rely >> on begin_replication_step/end_replication_step. But I want to try >> spending a bit more time on this. >> > > I didn't understand what you want to say in point (2). > My point is that apply_handle_sequence() either needs to use the same transaction handling as other apply methods, or start (and commit) a separate transaction for the "transactional" case. Which means we can't use the begin_replication_step/end_replication_step and the current code seems a bit complex. And I'm not sure it's quite correct. So this place needs more work. >> >> But there's a more serious issue, I think. So far, we allowed this: >> >> BEGIN; >> CREATE SEQUENCE s2; >> ALTER PUBLICATION p ADD SEQUENCE s2; >> INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100); >> COMMIT; >> >> and the behavior was that we replicated the changes. But with the patch >> applied, that no longer happens, because should_apply_changes_for_rel >> says the change should not be applied.
>> >> And after thinking about this, I think that's correct - we can't apply >> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed, >> and we can't do that until the transaction commits. >> >> So I guess that's correct, and the current behavior is a bug. >> > > Yes, I also think that is a bug. > OK >> For a while I was thinking that maybe this means we don't need the >> transactional behavior at all, but I think we do - we have to handle >> ALTER SEQUENCE cases that are transactional. >> > > I need some time to think about this. Understood. > At all places, it is mentioned > as creating a sequence for transactional cases which at the very least > need some tweak. > Which places? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 26, 2022 at 3:26 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/26/22 08:28, Amit Kapila wrote: > >> > >> 2) The transaction handling in is a bit confusing. The non-transactional > >> increments won't have any explicit commit later, so we can't just rely > >> on begin_replication_step/end_replication_step. But I want to try > >> spending a bit more time on this. > >> > > > > I didn't understand what you want to say in point (2). > > > > My point is that handle_apply_sequence() either needs to use the same > transaction handling as other apply methods, or start (and commit) a > separate transaction for the "transactional" case. > > Which means we can't use the begin_replication_step/end_replication_step > We already call CommitTransactionCommand after end_replication_step at a few places in that file, so, as there is no explicit commit in the non-transactional case, we can probably call CommitTransactionCommand for it. > and the current code seems a bit complex. And I'm not sure it's quite correct. So this place needs more work. > Agreed. > > >> For a while I was thinking that maybe this means we don't need the > >> transactional behavior at all, but I think we do - we have to handle > >> ALTER SEQUENCE cases that are transactional. > >> > > > > I need some time to think about this. > While thinking about this, I think I see a problem with the non-transactional handling of sequences. It seems that we will skip sending non-transactional sequence changes if they occur before the decoding has reached a consistent point but the surrounding commit occurs after a consistent point is reached. In such cases, the corresponding DMLs like inserts will be sent but sequence changes won't be sent.
For example (this scenario is based on twophase_snapshot.spec),

Initial setup:
==============
Create table t1_seq(c1 int);
Create Sequence seq1;

Test Execution via multiple sessions (this test allows insert in
session-2 to happen before we reach a consistent point and commit
happens after a consistent point):
=======================================================================

Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);

Session-3:
Commit;

Session-2:
Commit 'foo'

Session-1:
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1');

data
----------------------------------------------
BEGIN
table public.t1_seq: INSERT: c1[integer]:1
table public.t1_seq: INSERT: c1[integer]:2
table public.t1_seq: INSERT: c1[integer]:3
table public.t1_seq: INSERT: c1[integer]:4
table public.t1_seq: INSERT: c1[integer]:5
table public.t1_seq: INSERT: c1[integer]:6

Now, if we normally try to decode such an insert, the result would be
something like:

data
------------------------------------------------------------------------------
sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
BEGIN
table public.t1_seq: INSERT: c1[integer]:1
table public.t1_seq: INSERT: c1[integer]:2
table public.t1_seq: INSERT: c1[integer]:3
table public.t1_seq: INSERT: c1[integer]:4
table public.t1_seq: INSERT: c1[integer]:5
table public.t1_seq: INSERT: c1[integer]:6

This will create an inconsistent replica as sequence
changes won't be replicated. I thought about changing the snapshot handling of non-transactional sequence changes to be similar to the transactional ones, but that also won't work, because it is only at commit that we decide whether we can send the changes. For the transactional case, as we are considering the create sequence operation as transactional, we would unnecessarily queue them even though that is not required. Basically, they don't need to be considered transactional and we can simply ignore such messages like other DDLs. But for that we probably need to distinguish the Alter/Create case, which may or may not be straightforward. Now, queuing them is probably harmless unless it causes the transaction to spill/stream. I still couldn't think completely about cases where a mix of transactional and non-transactional changes occur in the same transaction, as I think it somewhat depends on what we want to do about the above cases. > > At all places, it is mentioned > > as creating a sequence for transactional cases which at the very least > > need some tweak. > > > > Which places? > In comments like: a. When decoding sequences, we differentiate between sequences created in a (running) transaction and sequences created in other (already committed) transactions. b. ... But for new sequences, we need to handle them in a transactional way, .. c. ... Change needs to be handled as transactional, because the sequence was created in a transaction that is still running ... It seems all these places indicate a scenario of creating a sequence whereas we want to do transactional stuff mainly for Alter. -- With Regards, Amit Kapila.
On Sat, Mar 26, 2022 at 6:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 3/26/22 08:28, Amit Kapila wrote: > > On Fri, Mar 25, 2022 at 10:20 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> Hmm, so fixing this might be a bit trickier than I expected. > >> > >> Firstly, currently we only send nspname/relname in the sequence message, > >> not the remote OID or schema. The idea was that for sequences we don't > >> really need schema info, so this seemed OK. > >> > >> But should_apply_changes_for_rel() needs LogicalRepRelMapEntry, and to > >> create/maintain that those records we need to send the schema. > >> > >> Attached is a WIP patch does that. > >> > >> Two places need more work, I think: > >> > >> 1) maybe_send_schema needs ReorderBufferChange, but we don't have that > >> for sequences, we only have TXN. I created a simple wrapper, but maybe > >> we should just tweak maybe_send_schema to use TXN. > >> > >> 2) The transaction handling in is a bit confusing. The non-transactional > >> increments won't have any explicit commit later, so we can't just rely > >> on begin_replication_step/end_replication_step. But I want to try > >> spending a bit more time on this. > >> > > > > I didn't understand what you want to say in point (2). > > > > My point is that handle_apply_sequence() either needs to use the same > transaction handling as other apply methods, or start (and commit) a > separate transaction for the "transactional" case. > > Which means we can't use the begin_replication_step/end_replication_step > and the current code seems a bit complex. And I'm not sure it's quite > correct. So this place needs more work. > > >> > >> But there's a more serious issue, I think. So far, we allowed this: > >> > >> BEGIN; > >> CREATE SEQUENCE s2; > >> ALTER PUBLICATION p ADD SEQUENCE s2; > >> INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100); > >> COMMIT; > >> > >> and the behavior was that we replicated the changes. 
But with the patch > >> applied, that no longer happens, because should_apply_changes_for_rel > >> says the change should not be applied. > >> > >> And after thinking about this, I think that's correct - we can't apply > >> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed, > >> and we can't do that until the transaction commits. > >> > >> So I guess that's correct, and the current behavior is a bug. > >> > > > > Yes, I also think that is a bug. > > > > OK I also think that this is a bug. Given this behavior is a bug and newly-added sequence data should be replicated only after ALTER SUBSCRIPTION ... REFRESH PUBLICATION, is there any case where the sequence message applied on the subscriber is transactional? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Apr 1, 2022 at 10:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Mar 26, 2022 at 6:56 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > >> > > >> But there's a more serious issue, I think. So far, we allowed this: > > >> > > >> BEGIN; > > >> CREATE SEQUENCE s2; > > >> ALTER PUBLICATION p ADD SEQUENCE s2; > > >> INSERT INTO seq_test SELECT nextval('s2') FROM generate_series(1,100); > > >> COMMIT; > > >> > > >> and the behavior was that we replicated the changes. But with the patch > > >> applied, that no longer happens, because should_apply_changes_for_rel > > >> says the change should not be applied. > > >> > > >> And after thinking about this, I think that's correct - we can't apply > > >> changes until ALTER SUBSCRIPTION ... REFRESH PUBLICATION gets executed, > > >> and we can't do that until the transaction commits. > > >> > > >> So I guess that's correct, and the current behavior is a bug. > > >> > > > > > > Yes, I also think that is a bug. > > > > > > > OK > > I also think that this is a bug. Given this behavior is a bug and > newly-added sequence data should be replicated only after ALTER > SUBSCRIPTION ... REFRESH PUBLICATION, is there any case where the > sequence message applied on the subscriber is transactional? > It could be required for Alter Sequence as that can also rewrite the relfilenode. However, IIUC, I think there is a bigger problem with non-transactional sequence implementation as that can cause inconsistent replica. See the problem description and test case in my previous email [1] (While thinking about this, I think I see a problem with the non-transactional handling of sequences....). Can you please once check that and let me know if I am missing something there? If not, then I think we may need to first think of a solution for non-transactional sequence handling. 
[1] - https://www.postgresql.org/message-id/CAA4eK1KAFdQEULk%2B4C%3DieWA5UPSUtf_gtqKsFj9J9f2c%3D8hm4g%40mail.gmail.com -- With Regards, Amit Kapila.
On 3/28/22 07:29, Amit Kapila wrote: > ... > > While thinking about this, I think I see a problem with the > non-transactional handling of sequences. It seems that we will skip > sending non-transactional sequence change if it occurs before the > decoding has reached a consistent point but the surrounding commit > occurs after a consistent point is reached. In such cases, the > corresponding DMLs like inserts will be sent but sequence changes > won't be sent. For example (this scenario is based on > twophase_snapshot.spec), > > Initial setup: > ============== > Create table t1_seq(c1 int); > Create Sequence seq1; > > Test Execution via multiple sessions (this test allows insert in > session-2 to happen before we reach a consistent point and commit > happens after a consistent point): > ======================================================================================================= > > Session-2: > Begin; > SELECT pg_current_xact_id(); > > Session-1: > SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', > 'test_decoding', false, true); > > Session-3: > Begin; > SELECT pg_current_xact_id(); > > Session-2: > Commit; > Begin; > INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100); > > Session-3: > Commit; > > Session-2: > Commit 'foo' > > Session-1: > SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, > 'include-xids', 'false', 'skip-empty-xacts', '1'); > > data > ---------------------------------------------- > BEGIN > table public.t1_seq: INSERT: c1[integer]:1 > table public.t1_seq: INSERT: c1[integer]:2 > table public.t1_seq: INSERT: c1[integer]:3 > table public.t1_seq: INSERT: c1[integer]:4 > table public.t1_seq: INSERT: c1[integer]:5 > table public.t1_seq: INSERT: c1[integer]:6 > > > Now, if we normally try to decode such an insert, the result would be > something like: > data > ------------------------------------------------------------------------------ > sequence public.seq1: transactional:0 last_value: 
33 log_cnt: 0 is_called:1 > sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1 > sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1 > sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1 > BEGIN > table public.t1_seq: INSERT: c1[integer]:1 > table public.t1_seq: INSERT: c1[integer]:2 > table public.t1_seq: INSERT: c1[integer]:3 > table public.t1_seq: INSERT: c1[integer]:4 > table public.t1_seq: INSERT: c1[integer]:5 > table public.t1_seq: INSERT: c1[integer]:6 > > This will create an inconsistent replica as sequence changes won't be > replicated. Hmm, that's interesting. I wonder if it can actually happen, though. Have you been able to reproduce that, somehow? > I thought about changing snapshot dealing of > non-transactional sequence changes similar to transactional ones but > that also won't work because it is only at commit we decide whether we > can send the changes. > I wonder if there's some earlier LSN (similar to the consistent point) which might be useful for this. Or maybe we should queue even the non-transactional changes, not per-transaction but in a global list, and then at each commit either discard inspect them (at that point we know the lowest LSN for all transactions and the consistent point). Seems complex, though. > For the transactional case, as we are considering the create sequence > operation as transactional, we would unnecessarily queue them even > though that is not required. Basically, they don't need to be > considered transactional and we can simply ignore such messages like > other DDLs. But for that probably we need to distinguish Alter/Create > case which may or may not be straightforward. Now, queuing them is > probably harmless unless it causes the transaction to spill/stream. > I'm not sure I follow. Why would we queue them unnecessarily? Also, there's the bug with decoding changes in transactions that create the sequence and add it to a publication. 
I think the agreement was that this behavior was incorrect, we should not decode changes until the subscription is refreshed. Doesn't that mean there can't be any CREATE case, just ALTER? > I still couldn't think completely about cases where a mix of > transactional and non-transactional changes occur in the same > transaction as I think it somewhat depends on what we want to do about > the above cases. > Understood. I need to think about this too. >>> At all places, it is mentioned >>> as creating a sequence for transactional cases which at the very least >>> need some tweak. >>> >> >> Which places? >> > > In comments like: > a. When decoding sequences, we differentiate between sequences created > in a (running) transaction and sequences created in other (already > committed) transactions. > b. ... But for new sequences, we need to handle them in a transactional way, .. > c. ... Change needs to be handled as transactional, because the > sequence was created in a transaction that is still running ... > > It seems all these places indicate a scenario of creating a sequence > whereas we want to do transactional stuff mainly for Alter. > Right, I'll think about how to clarify the comments. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/1/22 17:02, Tomas Vondra wrote: > > > On 3/28/22 07:29, Amit Kapila wrote: >> ... >> >> While thinking about this, I think I see a problem with the >> non-transactional handling of sequences. It seems that we will skip >> sending non-transactional sequence change if it occurs before the >> decoding has reached a consistent point but the surrounding commit >> occurs after a consistent point is reached. In such cases, the >> corresponding DMLs like inserts will be sent but sequence changes >> won't be sent. For example (this scenario is based on >> twophase_snapshot.spec), >> >> Initial setup: >> ============== >> Create table t1_seq(c1 int); >> Create Sequence seq1; >> >> Test Execution via multiple sessions (this test allows insert in >> session-2 to happen before we reach a consistent point and commit >> happens after a consistent point): >> ======================================================================================================= >> >> Session-2: >> Begin; >> SELECT pg_current_xact_id(); >> >> Session-1: >> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', >> 'test_decoding', false, true); >> >> Session-3: >> Begin; >> SELECT pg_current_xact_id(); >> >> Session-2: >> Commit; >> Begin; >> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100); >> >> Session-3: >> Commit; >> >> Session-2: >> Commit 'foo' >> >> Session-1: >> SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, >> 'include-xids', 'false', 'skip-empty-xacts', '1'); >> >> data >> ---------------------------------------------- >> BEGIN >> table public.t1_seq: INSERT: c1[integer]:1 >> table public.t1_seq: INSERT: c1[integer]:2 >> table public.t1_seq: INSERT: c1[integer]:3 >> table public.t1_seq: INSERT: c1[integer]:4 >> table public.t1_seq: INSERT: c1[integer]:5 >> table public.t1_seq: INSERT: c1[integer]:6 >> >> >> Now, if we normally try to decode such an insert, the result would be >> something like: >> data >> 
------------------------------------------------------------------------------ >> sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1 >> sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1 >> sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1 >> sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1 >> BEGIN >> table public.t1_seq: INSERT: c1[integer]:1 >> table public.t1_seq: INSERT: c1[integer]:2 >> table public.t1_seq: INSERT: c1[integer]:3 >> table public.t1_seq: INSERT: c1[integer]:4 >> table public.t1_seq: INSERT: c1[integer]:5 >> table public.t1_seq: INSERT: c1[integer]:6 >> >> This will create an inconsistent replica as sequence changes won't be >> replicated. > > Hmm, that's interesting. I wonder if it can actually happen, though. > Have you been able to reproduce that, somehow? > >> I thought about changing snapshot dealing of >> non-transactional sequence changes similar to transactional ones but >> that also won't work because it is only at commit we decide whether we >> can send the changes. >> > I wonder if there's some earlier LSN (similar to the consistent point) > which might be useful for this. > > Or maybe we should queue even the non-transactional changes, not > per-transaction but in a global list, and then at each commit either > discard inspect them (at that point we know the lowest LSN for all > transactions and the consistent point). Seems complex, though. > >> For the transactional case, as we are considering the create sequence >> operation as transactional, we would unnecessarily queue them even >> though that is not required. Basically, they don't need to be >> considered transactional and we can simply ignore such messages like >> other DDLs. But for that probably we need to distinguish Alter/Create >> case which may or may not be straightforward. Now, queuing them is >> probably harmless unless it causes the transaction to spill/stream. 
>>
>
> I'm not sure I follow. Why would we queue them unnecessarily?
>
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean there can't be any CREATE
> case, just ALTER?
>

So, I investigated this a bit more, and I wrote a couple of test_decoding
isolation tests (patch attached) demonstrating the issue. Actually, I
should say "issues" because it's a bit worse than you described ...

The whole problem is in this chunk of code in sequence_decode():

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

    /* Queue the increment (or send immediately if not transactional). */
    snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
    ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
                               origin_id, target_node, transactional,
                               xlrec->created, tuplebuf);

With the script you described, the increment is non-transactional, so we
end up in the second branch, return and thus discard the increment.

But it's also possible the change is transactional, in which case only
the first branch applies. When that branch does not trigger, we proceed
to build the snapshot. But the first thing SnapBuildGetOrBuildSnapshot
does is

    Assert(builder->state == SNAPBUILD_CONSISTENT);

and we're still not in a consistent snapshot, so it just crashes and
burns :-(

The sequences.spec file has two definitions of the s2restart step, one
empty (resulting in a non-transactional change), one with ALTER SEQUENCE
(which means the change will be transactional).

The really "funny" thing is this is not new code - this is an exact copy
from logicalmsg_decode(), and logical messages have all those issues too.
We may discard some messages, trigger the same Assert, etc. There's a messages2.spec demonstrating this (s2message step defines whether the message is transactional or not). So I guess we need to fix both places, perhaps in a similar way. And one of those will have to be backpatched (which may make it more complex). The only option I see is reworking the decoding so that it does not need the snapshot at all. We'll need to stash the changes just like any other change, apply them at end of transaction, and the main difference between transactional and non-transactional case would be what happens at abort. Transactional changes would be discarded, non-transactional would be applied anyway. The challenge is this reorders the sequence changes, so we'll need to reconcile them somehow. One option would be to simply (1) apply the change with the highest LSN in the transaction, and then walk all other in-progress transactions and discard changes for that sequence with a lower LSN. Not sure how complex/expensive that would be, though. Another problem is not all increments are WAL-logged, of course, not sure about that. Another option might be revisiting the approach proposed by Hannu in September [1], i.e. tracking sequences touched in a transaction, and then replicating the current state at that particular moment. regards [1] https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 3/28/22 07:29, Amit Kapila wrote: > > I thought about changing snapshot dealing of > > non-transactional sequence changes similar to transactional ones but > > that also won't work because it is only at commit we decide whether we > > can send the changes. > > > I wonder if there's some earlier LSN (similar to the consistent point) > which might be useful for this. > > Or maybe we should queue even the non-transactional changes, not > per-transaction but in a global list, and then at each commit either > discard inspect them (at that point we know the lowest LSN for all > transactions and the consistent point). Seems complex, though. > I couldn't follow '..discard inspect them ..'. Do you mean we inspect them and discard whichever are not required? It seems here we are talking about a new global ReorderBufferGlobal instead of ReorderBufferTXN to collect these changes, but we don't need only the consistent point LSN, because we send changes only if the commit of the containing transaction is after the consistent point LSN, so we need some transaction information as well. I think it could bring new challenges. > > For the transactional case, as we are considering the create sequence > > operation as transactional, we would unnecessarily queue them even > > though that is not required. Basically, they don't need to be > > considered transactional and we can simply ignore such messages like > > other DDLs. But for that probably we need to distinguish Alter/Create > > case which may or may not be straightforward. Now, queuing them is > > probably harmless unless it causes the transaction to spill/stream. > > > > I'm not sure I follow. Why would we queue them unnecessarily? > > Also, there's the bug with decoding changes in transactions that create > the sequence and add it to a publication.
I think the agreement was that > this behavior was incorrect, we should not decode changes until the > subscription is refreshed. Doesn't that mean there can't be any CREATE case, > just ALTER? > Yeah, but how will we distinguish them? Aren't they using the same kind of WAL record? -- With Regards, Amit Kapila.
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 4/1/22 17:02, Tomas Vondra wrote: > > The only option I see is reworking the decoding so that it does not need > the snapshot at all. We'll need to stash the changes just like any other > change, apply them at end of transaction, and the main difference > between transactional and non-transactional case would be what happens > at abort. Transactional changes would be discarded, non-transactional > would be applied anyway. > I think in the above I am not following how we can make it work without considering *snapshot at all* because based on that we would have done the initial sync (copy_sequence) and if we don't follow that later it can lead to inconsistency. I might be missing something here. > The challenge is this reorders the sequence changes, so we'll need to > reconcile them somehow. One option would be to simply (1) apply the > change with the highest LSN in the transaction, and then walk all other > in-progress transactions and changes for that sequence with a lower LSN. > Not sure how complex/expensive that would be, though. Another problem is > not all increments are WAL-logged, of course, not sure about that. > > Another option might be revisiting the approach proposed by Hannu in > September [1], i.e. tracking sequences touched in a transaction, and > then replicating the current state at that particular moment. > I'll think about that approach as well. -- With Regards, Amit Kapila.
On 4/2/22 12:35, Amit Kapila wrote: > On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 3/28/22 07:29, Amit Kapila wrote: >>> I thought about changing snapshot dealing of >>> non-transactional sequence changes similar to transactional ones but >>> that also won't work because it is only at commit we decide whether we >>> can send the changes. >>> >> I wonder if there's some earlier LSN (similar to the consistent point) >> which might be useful for this. >> >> Or maybe we should queue even the non-transactional changes, not >> per-transaction but in a global list, and then at each commit either >> discard inspect them (at that point we know the lowest LSN for all >> transactions and the consistent point). Seems complex, though. >> > > I couldn't follow '..discard inspect them ..'. Do you mean we inspect > them and discard whichever are not required? It seems here we are > talking about a new global ReorderBufferGlobal instead of > ReorderBufferTXN to collect these changes but we don't need only > consistent point LSN because we do send if the commit of containing > transaction is after consistent point LSN, so we need some transaction > information as well. I think it could bring new challenges. > Sorry for the gibberish. Yes, I meant to discard sequence changes that are no longer needed, due to being "obsoleted" by the applied change. We must not apply "older" changes (using LSN) because that would make the sequence go backwards. I'm not entirely sure whether the list of changes should be kept in TXN or in the global reorderbuffer object - we need to track which TXN the change belongs to (because of transactional changes) but we also need to discard the unnecessary changes efficiently (and walking TXN might be expensive). But yes, I'm sure there will be challenges. One being that tracking just the decoded WAL stuff is not enough, because nextval() may not generate WAL. 
But we still need to make sure the increment is replicated.

What I think we might do is this:

- add a global list of decoded sequence increments to ReorderBuffer

- at each commit/abort, walk the list and consider all increments up to
the commit LSN that "match" (non-transactional increments match all
TXNs, transactional ones match only the current TXN)

- replicate the last "matching" state for each sequence, discard the
processed ones

We could probably optimize this by not tracking every single increment,
but merging them "per transaction", I think.

I'm sure this description is pretty rough and will need refining,
handling various corner-cases etc.

>>> For the transactional case, as we are considering the create sequence
>>> operation as transactional, we would unnecessarily queue them even
>>> though that is not required. Basically, they don't need to be
>>> considered transactional and we can simply ignore such messages like
>>> other DDLs. But for that probably we need to distinguish Alter/Create
>>> case which may or may not be straightforward. Now, queuing them is
>>> probably harmless unless it causes the transaction to spill/stream.
>>>
>>
>> I'm not sure I follow. Why would we queue them unnecessarily?
>>
>> Also, there's the bug with decoding changes in transactions that create
>> the sequence and add it to a publication. I think the agreement was that
>> this behavior was incorrect, we should not decode changes until the
>> subscription is refreshed. Doesn't that mean there can't be any CREATE
>> case, just ALTER?
>>
>
> Yeah, but how will we distinguish them. Aren't they using the same
> kind of WAL record?
>

Same WAL record, but there's the "created" flag, which should distinguish
these two cases, IIRC.

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/2/22 12:43, Amit Kapila wrote: > On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 4/1/22 17:02, Tomas Vondra wrote: >> >> The only option I see is reworking the decoding so that it does not need >> the snapshot at all. We'll need to stash the changes just like any other >> change, apply them at end of transaction, and the main difference >> between transactional and non-transactional case would be what happens >> at abort. Transactional changes would be discarded, non-transactional >> would be applied anyway. >> > > I think in the above I am not following how we can make it work > without considering *snapshot at all* because based on that we would > have done the initial sync (copy_sequence) and if we don't follow that > later it can lead to inconsistency. I might be missing something here. > Well, what I meant to say is that we can't consider the snapshot at this phase of decoding. We'd still consider it later, at commit/abort time, of course. I.e. it'd be fairly similar to what heap_decode/DecodeInsert does, for example. AFAIK this does not build the snapshot anywhere. >> The challenge is this reorders the sequence changes, so we'll need to >> reconcile them somehow. One option would be to simply (1) apply the >> change with the highest LSN in the transaction, and then walk all other >> in-progress transactions and changes for that sequence with a lower LSN. >> Not sure how complex/expensive that would be, though. Another problem is >> not all increments are WAL-logged, of course, not sure about that. >> >> Another option might be revisiting the approach proposed by Hannu in >> September [1], i.e. tracking sequences touched in a transaction, and >> then replicating the current state at that particular moment. >> > > I'll think about that approach as well. > Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/2/22 13:51, Tomas Vondra wrote: > > > On 4/2/22 12:35, Amit Kapila wrote: >> On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> On 3/28/22 07:29, Amit Kapila wrote: >>>> I thought about changing snapshot dealing of >>>> non-transactional sequence changes similar to transactional ones but >>>> that also won't work because it is only at commit we decide whether we >>>> can send the changes. >>>> >>> I wonder if there's some earlier LSN (similar to the consistent point) >>> which might be useful for this. >>> >>> Or maybe we should queue even the non-transactional changes, not >>> per-transaction but in a global list, and then at each commit either >>> discard inspect them (at that point we know the lowest LSN for all >>> transactions and the consistent point). Seems complex, though. >>> >> >> I couldn't follow '..discard inspect them ..'. Do you mean we inspect >> them and discard whichever are not required? It seems here we are >> talking about a new global ReorderBufferGlobal instead of >> ReorderBufferTXN to collect these changes but we don't need only >> consistent point LSN because we do send if the commit of containing >> transaction is after consistent point LSN, so we need some transaction >> information as well. I think it could bring new challenges. >> > > Sorry for the gibberish. Yes, I meant to discard sequence changes that > are no longer needed, due to being "obsoleted" by the applied change. We > must not apply "older" changes (using LSN) because that would make the > sequence go backwards. > > I'm not entirely sure whether the list of changes should be kept in TXN > or in the global reorderbuffer object - we need to track which TXN the > change belongs to (because of transactional changes) but we also need to > discard the unnecessary changes efficiently (and walking TXN might be > expensive). > > But yes, I'm sure there will be challenges. 
One being that tracking just > the decoded WAL stuff is not enough, because nextval() may not generate > WAL. But we still need to make sure the increment is replicated. > > What I think we might do is this: > > - add a global list of decoded sequence increments to ReorderBuffer > > - at each commit/abort walk the list, walk the list and consider all > increments up to the commit LSN that "match" (non-transactional match > all TXNs, transactional match only the current TXN) > > - replicate the last "matching" status for each sequence, discard the > processed ones > > We could probably optimize this by not tracking every single increment, > but merge them "per transaction", I think. > > I'm sure this description is pretty rough and will need refining, handle > various corner-cases etc. > I did some experiments over the weekend, exploring how to rework the sequence decoding in various ways. Let me share some WIP patches, hopefully that can be useful for trying more stuff and moving this discussion forward. I tried two things - (1) accumulating sequence increments in global array and then doing something with it, and (2) treating all sequence increments as regular changes (in a TXN) and then doing something special during the replay. Attached are two patchsets, one for each approach. Note: It's important to remember decoding of sequences is not the only code affected by this. The logical messages have the same issue, certainly when it comes to transactional vs. non-transactional stuff and handling of snapshots. Even if the sequence decoding ends up being reverted, we still need to fix that, somehow. And my feeling is the solutions ought to be pretty similar in both cases. Now, regarding the two approaches: (1) accumulating sequences in global hash table The main problem with regular sequence increments is that those need to be non-transactional - a transaction may use a sequence without any WAL-logging, if the WAL was written by an earlier transaction. 
The problem is the earlier transaction might have been rolled back, and thus simply discarded by the logical decoding. But we still need to apply that, in order not to lose the sequence increment. The current code just applies those non-transactional increments right after decoding the increment, but that does not work because we may not have a snapshot at that point. And we only have the snapshot when within a transaction (AFAICS), so this queues all changes and then applies them later. The changes need to be shared by all transactions, so queueing them in a global list works fairly well - otherwise we'd have to walk all transactions in order to see if there are relevant sequence increments. But some increments may be transactional, e.g. when the sequence is created or altered in a transaction. To allow tracking this, the patch uses a hash table, with relfilenode as a key. There's a couple of issues with this, though. Firstly, with the changes stashed outside transactions, they are not included in memory accounting, not spilled to disk or streamed, etc. I guess fixing this is possible, but it's certainly not straightforward, because we mix increments from many different transactions. A bigger issue is that I'm not sure this actually handles the snapshots correctly either. The non-transactional increments affect all transactions, so when ReorderBufferProcessSequences gets executed, it processes all of them, no matter the source transaction. Can we be sure the snapshot in the applying transaction is the same (or "compatible") as the snapshot in the source transaction? Transactional increments can simply be processed as regular changes, of course, but one difference is that we always create the transaction (while before we just triggered the apply callback). This is necessary as now we drive all of this from ReorderBufferCommit(), and without the transaction the increment would not be applied / there would be no snapshot.
It does seem to work, though, although I haven't tested it much so far. One annoying bit is that we always have to walk all sequences and increments, for each change in the transaction. Which seems quite expensive, although the number of in-progress increments should be pretty low (roughly equal to the number of sequences). Or at least the part we need to consider for a single change (i.e. between two LSNs). So maybe this should work. The one part this does not handle at all is aborted transactions. At the moment we just discard those, which means (a) we fail to discard the transactional changes from the hash table, and (b) we can't apply the non-transactional changes, because with the changes we also discard the snapshots we need. I wonder if we could use a different snapshot, though. Non-transactional changes can't change the relfilenode, after all. Not sure. If not, the only solution I can think of is processing even aborted transactions, but skipping all changes except those that update snapshots. There's a serious problem with streaming, though - we don't know which transaction will commit first, hence we can't decide whether to send the sequence changes. This seems pretty fatal to me. So we'd have to stream the sequence changes only at commit, and then do some of this work on the worker (i.e. merge the sequence changes to the right place). That seems pretty annoying. (2) treating sequence changes as regular changes This adopts a different approach - instead of accumulating the sequence increments in a global hash table, it treats them as regular changes. Which solves the snapshot issue, and issues with spilling to disk, streaming and so on. But it has various other issues with handling concurrent transactions, unfortunately, which probably make this approach infeasible: * The non-transactional stuff has to be applied in the first transaction that commits, not in the transaction that generated the WAL.
That does not work too well with this approach, because we have to walk changes in all other transactions. * Another serious issue seems to be streaming - if we already streamed some of the changes, we can't iterate through them anymore. Also, having to walk the transactions over and over for each change, to apply relevant sequence increments, that's mighty expensive. The other approach needs to do that too, but walking the global hash table seems much cheaper. The other issue is the handling of aborted transactions - we need to apply sequence increments even from those transactions, of course. The other approach has this issue too, though. (3) tracking sequences touched by transaction This is the approach proposed by Hannu Krosing. I haven't explored this again yet, but I recall I wrote a PoC patch a couple months back. It seems to me most of the problems stem from trying to derive sequence state from decoded WAL changes, which is problematic because of the non-transactional nature of sequences (i.e. WAL for one transaction affects other transactions in non-obvious ways). And this approach simply works around that entirely - instead of trying to deduce the sequence state from WAL, we'd make sure to write the current sequence state (or maybe just ID of the sequence) at commit time. Which should eliminate most of the complexity / problems, I think. I'm not really sure what to do about this. All of those reworks seem like an extensive redesign of the patch, and considering the last CF is already over ... not great. However, even if we end up reverting this, we'll still have the same problem with snapshots for logical messages. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Sat, Apr 2, 2022 at 8:52 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 4/2/22 12:35, Amit Kapila wrote: > > On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> > >> On 3/28/22 07:29, Amit Kapila wrote: > >>> I thought about changing snapshot dealing of > >>> non-transactional sequence changes similar to transactional ones but > >>> that also won't work because it is only at commit we decide whether we > >>> can send the changes. > >>> > >> I wonder if there's some earlier LSN (similar to the consistent point) > >> which might be useful for this. > >> > >> Or maybe we should queue even the non-transactional changes, not > >> per-transaction but in a global list, and then at each commit either > >> discard inspect them (at that point we know the lowest LSN for all > >> transactions and the consistent point). Seems complex, though. > >> > > > > I couldn't follow '..discard inspect them ..'. Do you mean we inspect > > them and discard whichever are not required? It seems here we are > > talking about a new global ReorderBufferGlobal instead of > > ReorderBufferTXN to collect these changes but we don't need only > > consistent point LSN because we do send if the commit of containing > > transaction is after consistent point LSN, so we need some transaction > > information as well. I think it could bring new challenges. > > > > Sorry for the gibberish. Yes, I meant to discard sequence changes that > are no longer needed, due to being "obsoleted" by the applied change. We > must not apply "older" changes (using LSN) because that would make the > sequence go backwards. It's not related to this issue, but I think that non-transactional sequence changes could be resent in case the subscriber crashes, because it doesn't update the replication origin LSN - is that right? If so, while resending the sequence changes, the sequence value on the subscriber can temporarily go backward.
> > I'm not entirely sure whether the list of changes should be kept in TXN > or in the global reorderbuffer object - we need to track which TXN the > change belongs to (because of transactional changes) but we also need to > discard the unnecessary changes efficiently (and walking TXN might be > expensive). > > But yes, I'm sure there will be challenges. One being that tracking just > the decoded WAL stuff is not enough, because nextval() may not generate > WAL. But we still need to make sure the increment is replicated. > > What I think we might do is this: > > - add a global list of decoded sequence increments to ReorderBuffer > > - at each commit/abort walk the list, walk the list and consider all > increments up to the commit LSN that "match" (non-transactional match > all TXNs, transactional match only the current TXN) > > - replicate the last "matching" status for each sequence, discard the > processed ones > > We could probably optimize this by not tracking every single increment, > but merge them "per transaction", I think. > > I'm sure this description is pretty rough and will need refining, handle > various corner-cases etc. > > >>> For the transactional case, as we are considering the create sequence > >>> operation as transactional, we would unnecessarily queue them even > >>> though that is not required. Basically, they don't need to be > >>> considered transactional and we can simply ignore such messages like > >>> other DDLs. But for that probably we need to distinguish Alter/Create > >>> case which may or may not be straightforward. Now, queuing them is > >>> probably harmless unless it causes the transaction to spill/stream. > >>> > >> > >> I'm not sure I follow. Why would we queue them unnecessarily? > >> > >> Also, there's the bug with decoding changes in transactions that create > >> the sequence and add it to a publication. 
I think the agreement was that > >> this behavior was incorrect, we should not decode changes until the > >> subscription is refreshed. Doesn't that mean can't be any CREATE case, > >> just ALTER? > >> > > > > Yeah, but how will we distinguish them. Aren't they using the same > > kind of WAL record? > > > > Same WAL record, but the "created" flag which should distinguish these > two cases, IIRC. Since the "created" flag indicates that we created a new relfilenode, it's true for both CREATE and ALTER. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Apr 4, 2022 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Apr 2, 2022 at 8:52 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > It's not related to this issue but I think that non-transactional > sequence changes could be resent in case of the subscriber crashes > because it doesn’t update replication origin LSN, is that right? If > so, while resending the sequence changes, the sequence value on the > subscriber can temporarily go backward. > Yes, this can happen for the non-transactional sequence changes though this is a different problem than what is happening on the decoding side. > > >> Also, there's the bug with decoding changes in transactions that create > > >> the sequence and add it to a publication. I think the agreement was that > > >> this behavior was incorrect, we should not decode changes until the > > >> subscription is refreshed. Doesn't that mean can't be any CREATE case, > > >> just ALTER? > > >> > > > > > > Yeah, but how will we distinguish them. Aren't they using the same > > > kind of WAL record? > > > > > > > Same WAL record, but the "created" flag which should distinguish these > > two cases, IIRC. > > Since the "created" flag indicates that we created a new relfilenode > so it's true when both CREATE and ALTER. > Yes, this is my understanding as well. So, we need something else. -- With Regards, Amit Kapila.
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 4/1/22 17:02, Tomas Vondra wrote: > > So, I investigated this a bit more, and I wrote a couple test_decoding > isolation tests (patch attached) demonstrating the issue. Actually, I > should say "issues" because it's a bit worse than you described ... > > The whole problem is in this chunk of code in sequence_decode(): > > > /* Skip the change if already processed (per the snapshot). */ > if (transactional && > !SnapBuildProcessChange(builder, xid, buf->origptr)) > return; > else if (!transactional && > (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || > SnapBuildXactNeedsSkip(builder, buf->origptr))) > return; > > /* Queue the increment (or send immediately if not transactional). */ > snapshot = SnapBuildGetOrBuildSnapshot(builder, xid); > ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr, > origin_id, target_node, transactional, > xlrec->created, tuplebuf); > > With the script you described, the increment is non-transactional, so we > end up in the second branch, return and thus discard the increment. > > But it's also possible the change is transactional, which can only > trigger the first branch. But it does not, so we start building the > snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is > > Assert(builder->state == SNAPBUILD_CONSISTENT); > > and we're still not in a consistent snapshot, so it just crashes and > burn :-( > > The sequences.spec file has two definitions of s2restart step, one empty > (resulting in non-transactional change), one with ALTER SEQUENCE (which > means the change will be transactional). > > > The really "funny" thing is this is not new code - this is an exact copy > from logicalmsg_decode(), and logical messages have all those issues > too. We may discard some messages, trigger the same Assert, etc. 
There's > a messages2.spec demonstrating this (s2message step defines whether the > message is transactional or not). > It seems to me that the Assert in SnapBuildGetOrBuildSnapshot() is wrong. It is required only for non-transactional logical messages. For transactional message(s), we decide at commit time whether the snapshot has reached a consistent state and then decide whether to skip the entire transaction or not. So, the possible fix for the Assert could be that we pass an additional parameter 'transactional' to SnapBuildGetOrBuildSnapshot() and then assert only when it is false. I have also checked the development thread for this work and it appears to be introduced for non-transactional cases only. See email[1], this new function and Assert were for the non-transactional case and later, while rearranging the code, this problem got introduced. Now, for the non-transactional cases, I am not sure if there is a one-to-one mapping with the sequence case. The way sequences are dealt with on the subscriber-side (first we copy initial data and then replicate the incremental changes) appears more like the way we deal with a table and its incremental changes. There is some commonality with non-transactional messages w.r.t. the case where we want sequence changes to be sent even on rollbacks unless some DDL has happened for them, but if we see the overall solution it doesn't appear that we can use it similarly to messages. I think this is the reason we are facing the other problems w.r.t. syncing sequences to subscribers, including the problem reported by Sawada-San yesterday. Now, the particular case where we won't send a non-transactional logical message unless the snapshot is consistent could be considered its behavior and may be documented better. I am not very sure about this as there is no example of the way sync for these messages happens in the core but if someone outside the core wants a different behavior and presents the case then we can probably try to enhance it.
I feel the same is not true for sequences because it can cause the replica (subscriber) to go out of sync with the master (publisher). > So I guess we need to fix both places, perhaps in a similar way. > It depends, but I think for logical messages we should do the minimal fix required for the Asserts and probably document the current behavior a bit better, unless we think there is a case to make it behave similarly to sequences. [1] - https://www.postgresql.org/message-id/56D4B3AD.5000207%402ndquadrant.com -- With Regards, Amit Kapila.
On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I did some experiments over the weekend, exploring how to rework the > sequence decoding in various ways. Let me share some WIP patches, > hopefully that can be useful for trying more stuff and moving this > discussion forward. > > I tried two things - (1) accumulating sequence increments in global > array and then doing something with it, and (2) treating all sequence > increments as regular changes (in a TXN) and then doing something > special during the replay. Attached are two patchsets, one for each > approach. > > Note: It's important to remember decoding of sequences is not the only > code affected by this. The logical messages have the same issue, > certainly when it comes to transactional vs. non-transactional stuff and > handling of snapshots. Even if the sequence decoding ends up being > reverted, we still need to fix that, somehow. And my feeling is the > solutions ought to be pretty similar in both cases. > > Now, regarding the two approaches: > > (1) accumulating sequences in global hash table > > The main problem with regular sequence increments is that those need to > be non-transactional - a transaction may use a sequence without any > WAL-logging, if the WAL was written by an earlier transaction. The > problem is the earlier trasaction might have been rolled back, and thus > simply discarded by the logical decoding. But we still need to apply > that, in order not to lose the sequence increment. > > The current code just applies those non-transactional increments right > after decoding the increment, but that does not work because we may not > have a snapshot at that point. And we only have the snapshot when within > a transaction (AFAICS) so this queues all changes and then applies the > changes later. 
> > The changes need to be shared by all transactions, so queueing them in a > global works fairly well - otherwise we'd have to walk all transactions, > in order to see if there are relevant sequence increments. > > But some increments may be transactional, e.g. when the sequence is > created or altered in a transaction. To allow tracking this, this uses a > hash table, with relfilenode as a key. > > There's a couple issues with this, though. Firstly, stashing the changes > outside transactions, it's not included in memory accounting, it's not > spilled to disk or streamed, etc. I guess fixing this is possible, but > it's certainly not straightforward, because we mix increments from many > different transactions. > > A bigger issue is that I'm not sure this actually handles the snapshots > correctly either. > > The non-transactional increments affect all transactions, so when > ReorderBufferProcessSequences gets executed, it processes all of them, > no matter the source transaction. Can we be sure the snapshot in the > applying transaction is the same (or "compatible") as the snapshot in > the source transaction? > I don't think we can assume that. I think it is possible that some other transaction's WAL falls in between the start/end LSN of the txn (which we decide to send) and that the other transaction may not finally reach a consistent state. Consider a case similar to the one shown in one of my previous emails:

Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);

Session-3:
Commit;

Session-2:
Commit;

Here, we send changes (say the insert from txn 700) from session-2 because session-3's commit happens before it. Now, consider another transaction parallel to txn 700 which generates some WAL related to sequences but committed before session-3's commit.
So even though its changes will be in between the start/end LSN of txn 700, those shouldn't be sent. I have not tried this, and it may be solvable in some way, but I think processing changes from other TXNs sounds risky to me in terms of snapshot handling. > > > (2) treating sequence change as regular changes > > This adopts a different approach - instead of accumulating the sequence > increments in a global hash table, it treats them as regular changes. > Which solves the snapshot issue, and issues with spilling to disk, > streaming and so on. > > But it has various other issues with handling concurrent transactions, > unfortunately, which probably make this approach infeasible: > > * The non-transactional stuff has to be applied in the first transaction > that commits, not in the transaction that generated the WAL. That does > not work too well with this approach, because we have to walk changes in > all other transactions. > Why do you want to traverse other TXNs in this approach? Is it because the current TXN might be using some value of the sequence which was actually WAL-logged in the other transaction, but that other transaction has not been sent yet? I think if we don't send that, then the replica's sequence columns (in some tables) may have values that the sequence itself doesn't yet cover, which sounds problematic. Is that correct? > * Another serious issue seems to be streaming - if we already streamed > some of the changes, we can't iterate through them anymore. > > Also, having to walk the transactions over and over for each change, to > apply relevant sequence increments, that's mighty expensive. The other > approach needs to do that too, but walking the global hash table seems > much cheaper. > > The other issue this handling of aborted transactions - we need to apply > sequence increments even from those transactions, of course. The other > approach has this issue too, though.
> > > (3) tracking sequences touched by transaction > > This is the approach proposed by Hannu Krosing. I haven't explored this > again yet, but I recall I wrote a PoC patch a couple months back. > > It seems to me most of the problems stems from trying to derive sequence > state from decoded WAL changes, which is problematic because of the > non-transactional nature of sequences (i.e. WAL for one transaction > affects other transactions in non-obvious ways). And this approach > simply works around that entirely - instead of trying to deduce the > sequence state from WAL, we'd make sure to write the current sequence > state (or maybe just ID of the sequence) at commit time. Which should > eliminate most of the complexity / problems, I think. > That sounds promising but I haven't thought in detail about that approach. > > I'm not really sure what to do about this. All of those reworks seems > like an extensive redesign of the patch, and considering the last CF is > already over ... not great. > Yeah, I share the same feeling that even if we devise solutions to all the known problems it requires quite some time to ensure everything is correct. -- With Regards, Amit Kapila.
On 4/5/22 12:06, Amit Kapila wrote: > On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I did some experiments over the weekend, exploring how to rework the >> sequence decoding in various ways. Let me share some WIP patches, >> hopefully that can be useful for trying more stuff and moving this >> discussion forward. >> >> I tried two things - (1) accumulating sequence increments in global >> array and then doing something with it, and (2) treating all sequence >> increments as regular changes (in a TXN) and then doing something >> special during the replay. Attached are two patchsets, one for each >> approach. >> >> Note: It's important to remember decoding of sequences is not the only >> code affected by this. The logical messages have the same issue, >> certainly when it comes to transactional vs. non-transactional stuff and >> handling of snapshots. Even if the sequence decoding ends up being >> reverted, we still need to fix that, somehow. And my feeling is the >> solutions ought to be pretty similar in both cases. >> >> Now, regarding the two approaches: >> >> (1) accumulating sequences in global hash table >> >> The main problem with regular sequence increments is that those need to >> be non-transactional - a transaction may use a sequence without any >> WAL-logging, if the WAL was written by an earlier transaction. The >> problem is the earlier trasaction might have been rolled back, and thus >> simply discarded by the logical decoding. But we still need to apply >> that, in order not to lose the sequence increment. >> >> The current code just applies those non-transactional increments right >> after decoding the increment, but that does not work because we may not >> have a snapshot at that point. And we only have the snapshot when within >> a transaction (AFAICS) so this queues all changes and then applies the >> changes later. 
>> >> The changes need to be shared by all transactions, so queueing them in a >> global works fairly well - otherwise we'd have to walk all transactions, >> in order to see if there are relevant sequence increments. >> >> But some increments may be transactional, e.g. when the sequence is >> created or altered in a transaction. To allow tracking this, this uses a >> hash table, with relfilenode as a key. >> >> There's a couple issues with this, though. Firstly, stashing the changes >> outside transactions, it's not included in memory accounting, it's not >> spilled to disk or streamed, etc. I guess fixing this is possible, but >> it's certainly not straightforward, because we mix increments from many >> different transactions. >> >> A bigger issue is that I'm not sure this actually handles the snapshots >> correctly either. >> >> The non-transactional increments affect all transactions, so when >> ReorderBufferProcessSequences gets executed, it processes all of them, >> no matter the source transaction. Can we be sure the snapshot in the >> applying transaction is the same (or "compatible") as the snapshot in >> the source transaction? >> > > I don't think we can assume that. I think it is possible that some > other transaction's WAL can be in-between start/end lsn of txn (which > we decide to send) which may not finally reach a consistent state. > Consider a case similar to shown in one of my previous emails: > Session-2: > Begin; > SELECT pg_current_xact_id(); > > Session-1: > SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', > 'test_decoding', false, true); > > Session-3: > Begin; > SELECT pg_current_xact_id(); > > Session-2: > Commit; > Begin; > INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100); > > Session-3: > Commit; > > Session-2: > Commit; > > Here, we send changes (say insert from txn 700) from session-2 because > session-3's commit happens before it. 
Now, consider another > transaction parallel to txn 700 which generates some WAL related to > sequences but it committed before session-3's commit. So though, its > changes will be the in-between start/end LSN of txn 700 but those > shouldn't be sent. > > I have not tried this and also this may be solvable in some way but I > think processing changes from other TXNs sounds risky to me in terms > of snapshot handling. > Yes, I know this can happen. I was only really thinking about what might happen to the relfilenode of the sequence itself - and I don't think any concurrent transaction could swoop in and change the relfilenode in any meaningful way, due to locking. But of course, if we expect/require to have a perfect snapshot for that exact position in the transaction, this won't work. IMO the whole idea that we can have non-transactional bits in naturally transactional decoding seems a bit suspicious (at least in hindsight). No matter what we do for sequences, though, this still affects logical messages too. Not sure what to do there :-( >> >> >> (2) treating sequence change as regular changes >> >> This adopts a different approach - instead of accumulating the sequence >> increments in a global hash table, it treats them as regular changes. >> Which solves the snapshot issue, and issues with spilling to disk, >> streaming and so on. >> >> But it has various other issues with handling concurrent transactions, >> unfortunately, which probably make this approach infeasible: >> >> * The non-transactional stuff has to be applied in the first transaction >> that commits, not in the transaction that generated the WAL. That does >> not work too well with this approach, because we have to walk changes in >> all other transactions. >> > > Why do you want to traverse other TXNs in this approach? 
Is it because > the current TXN might be using some value of sequence which has been > actually WAL logged in the other transaction but that other > transaction has not been sent yet? I think if we don't send that then > probably replica sequences columns (in some tables) have some values > but actually the sequence itself won't have still that value which > sounds problematic. Is that correct? > Well, how else would you get to sequence changes in the other TXNs? Consider this:

T1: begin
T2: begin
T2: nextval('s') -> writes WAL for 32 values
T1: nextval('s') -> gets value without WAL
T1: commit
T2: commit

Now, if we commit T1 without "applying" the sequence change from T2, we lose the sequence state. But we still write/replicate the value generated from the sequence. >> * Another serious issue seems to be streaming - if we already streamed >> some of the changes, we can't iterate through them anymore. >> >> Also, having to walk the transactions over and over for each change, to >> apply relevant sequence increments, that's mighty expensive. The other >> approach needs to do that too, but walking the global hash table seems >> much cheaper. >> >> The other issue this handling of aborted transactions - we need to apply >> sequence increments even from those transactions, of course. The other >> approach has this issue too, though. >> >> >> (3) tracking sequences touched by transaction >> >> This is the approach proposed by Hannu Krosing. I haven't explored this >> again yet, but I recall I wrote a PoC patch a couple months back. >> >> It seems to me most of the problems stems from trying to derive sequence >> state from decoded WAL changes, which is problematic because of the >> non-transactional nature of sequences (i.e. WAL for one transaction >> affects other transactions in non-obvious ways).
>> And this approach
>> simply works around that entirely - instead of trying to deduce the
>> sequence state from WAL, we'd make sure to write the current sequence
>> state (or maybe just ID of the sequence) at commit time. Which should
>> eliminate most of the complexity / problems, I think.
>>
>
> That sounds promising but I haven't thought in detail about that approach.
>

So, here's a patch doing that. It's a reworked/improved version of the patch [1] shared in November.

It seems to be working pretty nicely. The behavior is a little bit different, of course, because we only replicate "committed" changes, so if you do nextval() in an aborted transaction, that is not replicated. Which I think is fine, because we make no durability guarantees for aborted transactions in general.

But there are a couple issues too:

1) locking

We have to read the sequence change before the commit, but we must not allow reordering (because then the state might go backwards again). I'm not sure how serious an impact this could have on performance.

2) dropped sequences

I'm not sure what to do about sequences dropped in the transaction. The patch simply attempts to read the current sequence state before the commit, but if the sequence was dropped (in that transaction), that can't happen. I'm not sure if that's OK or not.

3) WAL record

To replicate this, the patch uses a LogicalMessage, but I guess a separate WAL record would be better. But that's a technical detail.

regards

[1] https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com

>>
>> I'm not really sure what to do about this. All of those reworks seem
>> like an extensive redesign of the patch, and considering the last CF is
>> already over ... not great.
>>
>
> Yeah, I share the same feeling that even if we devise solutions to all
> the known problems it requires quite some time to ensure everything is
> correct.
>

True.
Let's keep working on this for a bit more time and then we can decide what to do.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
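The T1/T2 interleaving described in the message above can be sketched as a toy model. This is purely illustrative — the class, the WAL list, and the 32-value batch constant are stand-ins for PostgreSQL's sequence caching behavior, not actual internals:

```python
# Toy model of the T1/T2 scenario: nextval() WAL-logs the sequence state
# only once per batch of 32 values; later calls consume values without
# writing any WAL of their own. All names here are invented.

SEQ_BATCH = 32

class Sequence:
    def __init__(self):
        self.last_value = 0      # last value handed out
        self.logged_upto = 0     # highest value covered by the last WAL record

    def nextval(self, wal):
        self.last_value += 1
        if self.last_value > self.logged_upto:
            # batch exhausted: WAL-log a state 32 values ahead
            self.logged_upto = self.last_value + SEQ_BATCH - 1
            wal.append(("seq_log", self.logged_upto))
        return self.last_value

wal = []
seq = Sequence()

v_t2 = seq.nextval(wal)   # T2: writes WAL covering values 1..32
v_t1 = seq.nextval(wal)   # T1: gets a value without writing WAL

# A decoder that only applies sequence WAL when T2 itself commits would
# replicate T1's row containing v_t1 == 2, yet leave the replica's
# sequence behind it -- duplicates after failover.
replicated_seq_state = 0  # nothing applied from T2 yet
assert v_t1 > replicated_seq_state  # the inconsistency
# Applying T2's increment regardless of T2's fate closes the gap:
replicated_seq_state = max(s for kind, s in wal if kind == "seq_log")
assert replicated_seq_state >= v_t1
```

The point of the sketch is that T1's second assertion only holds once T2's WAL-logged increment is applied independently of T2's commit or abort.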
On 4/6/22 16:13, Tomas Vondra wrote: > > > On 4/5/22 12:06, Amit Kapila wrote: >> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> I did some experiments over the weekend, exploring how to rework the >>> sequence decoding in various ways. Let me share some WIP patches, >>> hopefully that can be useful for trying more stuff and moving this >>> discussion forward. >>> >>> I tried two things - (1) accumulating sequence increments in global >>> array and then doing something with it, and (2) treating all sequence >>> increments as regular changes (in a TXN) and then doing something >>> special during the replay. Attached are two patchsets, one for each >>> approach. >>> >>> Note: It's important to remember decoding of sequences is not the only >>> code affected by this. The logical messages have the same issue, >>> certainly when it comes to transactional vs. non-transactional stuff and >>> handling of snapshots. Even if the sequence decoding ends up being >>> reverted, we still need to fix that, somehow. And my feeling is the >>> solutions ought to be pretty similar in both cases. >>> >>> Now, regarding the two approaches: >>> >>> (1) accumulating sequences in global hash table >>> >>> The main problem with regular sequence increments is that those need to >>> be non-transactional - a transaction may use a sequence without any >>> WAL-logging, if the WAL was written by an earlier transaction. The >>> problem is the earlier trasaction might have been rolled back, and thus >>> simply discarded by the logical decoding. But we still need to apply >>> that, in order not to lose the sequence increment. >>> >>> The current code just applies those non-transactional increments right >>> after decoding the increment, but that does not work because we may not >>> have a snapshot at that point. And we only have the snapshot when within >>> a transaction (AFAICS) so this queues all changes and then applies the >>> changes later. 
>>> >>> The changes need to be shared by all transactions, so queueing them in a >>> global works fairly well - otherwise we'd have to walk all transactions, >>> in order to see if there are relevant sequence increments. >>> >>> But some increments may be transactional, e.g. when the sequence is >>> created or altered in a transaction. To allow tracking this, this uses a >>> hash table, with relfilenode as a key. >>> >>> There's a couple issues with this, though. Firstly, stashing the changes >>> outside transactions, it's not included in memory accounting, it's not >>> spilled to disk or streamed, etc. I guess fixing this is possible, but >>> it's certainly not straightforward, because we mix increments from many >>> different transactions. >>> >>> A bigger issue is that I'm not sure this actually handles the snapshots >>> correctly either. >>> >>> The non-transactional increments affect all transactions, so when >>> ReorderBufferProcessSequences gets executed, it processes all of them, >>> no matter the source transaction. Can we be sure the snapshot in the >>> applying transaction is the same (or "compatible") as the snapshot in >>> the source transaction? >>> >> >> I don't think we can assume that. I think it is possible that some >> other transaction's WAL can be in-between start/end lsn of txn (which >> we decide to send) which may not finally reach a consistent state. >> Consider a case similar to shown in one of my previous emails: >> Session-2: >> Begin; >> SELECT pg_current_xact_id(); >> >> Session-1: >> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', >> 'test_decoding', false, true); >> >> Session-3: >> Begin; >> SELECT pg_current_xact_id(); >> >> Session-2: >> Commit; >> Begin; >> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100); >> >> Session-3: >> Commit; >> >> Session-2: >> Commit; >> >> Here, we send changes (say insert from txn 700) from session-2 because >> session-3's commit happens before it. 
Now, consider another >> transaction parallel to txn 700 which generates some WAL related to >> sequences but it committed before session-3's commit. So though, its >> changes will be the in-between start/end LSN of txn 700 but those >> shouldn't be sent. >> >> I have not tried this and also this may be solvable in some way but I >> think processing changes from other TXNs sounds risky to me in terms >> of snapshot handling. >> > > Yes, I know this can happen. I was only really thinking about what might > happen to the relfilenode of the sequence itself - and I don't think any > concurrent transaction could swoop in and change the relfilenode in any > meaningful way, due to locking. > > But of course, if we expect/require to have a perfect snapshot for that > exact position in the transaction, this won't work. IMO the whole idea > that we can have non-transactional bits in naturally transactional > decoding seems a bit suspicious (at least in hindsight). > > No matter what we do for sequences, though, this still affects logical > messages too. Not sure what to do there :-( > >>> >>> >>> (2) treating sequence change as regular changes >>> >>> This adopts a different approach - instead of accumulating the sequence >>> increments in a global hash table, it treats them as regular changes. >>> Which solves the snapshot issue, and issues with spilling to disk, >>> streaming and so on. >>> >>> But it has various other issues with handling concurrent transactions, >>> unfortunately, which probably make this approach infeasible: >>> >>> * The non-transactional stuff has to be applied in the first transaction >>> that commits, not in the transaction that generated the WAL. That does >>> not work too well with this approach, because we have to walk changes in >>> all other transactions. >>> >> >> Why do you want to traverse other TXNs in this approach? 
Is it because >> the current TXN might be using some value of sequence which has been >> actually WAL logged in the other transaction but that other >> transaction has not been sent yet? I think if we don't send that then >> probably replica sequences columns (in some tables) have some values >> but actually the sequence itself won't have still that value which >> sounds problematic. Is that correct? >> > > Well, how else would you get to sequence changes in the other TXNs? > > Consider this: > > T1: begin > T2: begin > > T2: nextval('s') -> writes WAL for 32 values > T1: nextval('s') -> gets value without WAL > > T1: commit > T2: commit > > Now, if we commit T1 without "applying" the sequence change from T2, we > loose the sequence state. But we still write/replicate the value > generated from the sequence. > >>> * Another serious issue seems to be streaming - if we already streamed >>> some of the changes, we can't iterate through them anymore. >>> >>> Also, having to walk the transactions over and over for each change, to >>> apply relevant sequence increments, that's mighty expensive. The other >>> approach needs to do that too, but walking the global hash table seems >>> much cheaper. >>> >>> The other issue this handling of aborted transactions - we need to apply >>> sequence increments even from those transactions, of course. The other >>> approach has this issue too, though. >>> >>> >>> (3) tracking sequences touched by transaction >>> >>> This is the approach proposed by Hannu Krosing. I haven't explored this >>> again yet, but I recall I wrote a PoC patch a couple months back. >>> >>> It seems to me most of the problems stems from trying to derive sequence >>> state from decoded WAL changes, which is problematic because of the >>> non-transactional nature of sequences (i.e. WAL for one transaction >>> affects other transactions in non-obvious ways). 
And this approach >>> simply works around that entirely - instead of trying to deduce the >>> sequence state from WAL, we'd make sure to write the current sequence >>> state (or maybe just ID of the sequence) at commit time. Which should >>> eliminate most of the complexity / problems, I think. >>> >> >> That sounds promising but I haven't thought in detail about that approach. >> > > So, here's a patch doing that. It's a reworked/improved version of the > patch [1] shared in November. > > It seems to be working pretty nicely. The behavior is a little bit > different, of course, because we only replicate "committed" changes, so > if you do nextval() in aborted transaction that is not replicated. Which > I think is fine, because we generally make no durability guarantees for > aborted transactions in general. > > But there are a couple issues too: > > 1) locking > > We have to read sequence change before the commit, but we must not allow > reordering (because then the state might go backwards again). I'm not > sure how serious impact could this have on performance. > > 2) dropped sequences > > I'm not sure what to do about sequences dropped in the transaction. The > patch simply attempts to read the current sequence state before the > commit, but if the sequence was dropped (in that transaction), that > can't happen. I'm not sure if that's OK or not. > > 3) WAL record > > To replicate the stuff the patch uses a LogicalMessage, but I guess a > separate WAL record would be better. But that's a technical detail. > > > regards > > [1] > https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com > >>> >>> I'm not really sure what to do about this. All of those reworks seems >>> like an extensive redesign of the patch, and considering the last CF is >>> already over ... not great. 
>>>
>>
>> Yeah, I share the same feeling that even if we devise solutions to all
>> the known problems it requires quite some time to ensure everything is
>> correct.
>>
>
> True. Let's keep working on this for a bit more time and then we can
> decide what to do.
>

I've pushed a revert of all the commits related to this - decoding of sequences and the test_decoding / built-in replication changes.

The approach combining transactional and non-transactional behavior implemented by the patch clearly has issues, and it seems foolish to hope we'll find a simple fix. So the changes would have to be much more extensive, and doing that after the last CF seems like an obviously bad idea.

Attached is a rebased patch, implementing the approach based on WAL-logging sequences at commit time.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
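The global hash table of approach (1), quoted in the message above, can be sketched roughly as follows. All names are invented for illustration; the real reorder buffer is far more involved (snapshots, spilling, streaming are all ignored here):

```python
# Sketch of approach (1): an increment is transactional only if the
# sequence's relfilenode was created by a still-in-progress transaction;
# everything else is queued globally and applied by the first transaction
# that commits, so it survives even if the originating transaction aborts.

class ReorderBufferSketch:
    def __init__(self):
        self.created_in_txn = {}   # relfilenode -> creating xid
        self.txn_changes = {}      # xid -> transactional sequence changes
        self.global_queue = []     # non-transactional increments

    def sequence_created(self, xid, relfilenode):
        self.created_in_txn[relfilenode] = xid

    def queue_increment(self, xid, relfilenode, state):
        if relfilenode in self.created_in_txn:
            # e.g. CREATE/ALTER SEQUENCE in this txn: replay with the txn
            self.txn_changes.setdefault(xid, []).append((relfilenode, state))
        else:
            # plain nextval(): must be applied even if `xid` aborts
            self.global_queue.append((relfilenode, state))

    def commit(self, xid):
        applied = self.global_queue + self.txn_changes.pop(xid, [])
        self.global_queue = []     # drained by the first committing txn
        self.created_in_txn = {k: v for k, v in self.created_in_txn.items()
                               if v != xid}
        return applied

rb = ReorderBufferSketch()
rb.sequence_created(xid=500, relfilenode=16384)  # T500 creates a sequence
rb.queue_increment(500, 16384, 33)               # transactional increment
rb.queue_increment(501, 16400, 32)               # non-transactional increment
assert rb.commit(502) == [(16400, 32)]           # T502 applies T501's increment
assert rb.commit(500) == [(16384, 33)]
```

The snapshot problem discussed in the thread is exactly what this sketch glosses over: the committing transaction applies increments that originated under a different snapshot.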
Some typos I found before the patch was reverted.

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a6ea6ff3fcf..d4bd8d41c4b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -834,9 +834,8 @@ typedef void (*LogicalDecodeSequenceCB) (struct LogicalDecodingContext *ctx,
      non-transactional increments, the transaction may be either NULL or not
      NULL, depending on if the transaction already has an XID assigned.
      The <parameter>sequence_lsn</parameter> has the WAL location of the
-     sequence update. <parameter>transactional</parameter> says if the
-     sequence has to be replayed as part of the transaction or directly.
-
+     sequence update. <parameter>transactional</parameter> indicates whether
+     the sequence has to be replayed as part of the transaction or directly.
      The <parameter>last_value</parameter>, <parameter>log_cnt</parameter> and
      <parameter>is_called</parameter> parameters describe the sequence change.
     </para>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 60866431db3..b71122cce5d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -927,7 +927,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
  * Treat the sequence increment as transactional?
  *
  * The hash table tracks all sequences created in in-progress transactions,
- * so we simply do a lookup (the sequence is identified by relfilende). If
+ * so we simply do a lookup (the sequence is identified by relfilenode). If
  * we find a match, the increment should be handled as transactional.
  */
 bool
@@ -2255,7 +2255,7 @@ ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	tuple = &change->data.sequence.tuple->tuple;
 	seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
 
-	/* Only ever called from ReorderBufferApplySequence, so transational. */
+	/* Only ever called from ReorderBufferApplySequence, so transactional. */
 	if (streaming)
 		rb->stream_sequence(rb, txn, change->lsn, relation, true,
 							seq->last_value, seq->log_cnt, seq->is_called);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a15ce9edb13..8193bfe6515 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5601,7 +5601,7 @@ RelationBuildPublicationDesc(Relation relation, PublicationDesc *pubdesc)
 									GetSchemaPublications(schemaid, objType));
 
 	/*
-	 * If this is a partion (and thus a table), lookup all ancestors and track
+	 * If this is a partition (and thus a table), lookup all ancestors and track
 	 * all publications them too.
 	 */
 	if (relation->rd_rel->relispartition)
> But of course, if we expect/require to have a perfect snapshot for that
> exact position in the transaction, this won't work. IMO the whole idea
> that we can have non-transactional bits in naturally transactional
> decoding seems a bit suspicious (at least in hindsight).
>
> No matter what we do for sequences, though, this still affects logical
> messages too. Not sure what to do there :-(

Hi,

I spent some time trying to understand this problem while I was evaluating its impact on the DDL replication in [1]. I think for DDL we could always remove the non-transactional bits, since DDL will probably always be processed transactionally.

I attempted to solve the problem for messages. Here is a potential solution: keep track of the last decoded/acked non-transactional message/operation LSN and use it to check whether a non-transactional message record should be skipped during decoding. To do that, I added new fields ReplicationSlotPersistentData.non_xact_op_at, XLogReaderState.NonXactOpRecPtr and SnapBuild.start_decoding_nonxactop_at; these hold the end LSN of the last non-transactional message/operation decoded/acked.

I verified that this approach solves the issue of missing decoding of non-transactional messages under concurrency / before the builder state reaches SNAPBUILD_CONSISTENT. Once the builder state reaches SNAPBUILD_CONSISTENT, the new field ReplicationSlotPersistentData.non_xact_op_at can be set to ReplicationSlotPersistentData.confirmed_flush.
Similar to the sequence issue, here is the test case for logical messages. It tests concurrent execution in 3 sessions, where pg_logical_emit_message in session-2 happens before we reach a consistent point and the commit happens after a consistent point:

Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
SELECT pg_logical_emit_message(true, 'test_decoding', 'msg1');
SELECT pg_logical_emit_message(false, 'test_decoding', 'msg2');

Session-3:
Commit;

Session-1: (at this point, the session will crash without the fix)
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
                               data
---------------------------------------------------------------------
 message: transactional: 0 prefix: test_decoding, sz: 4 content:msg1

Session-2:
Commit;

Session-1:
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
                               data
---------------------------------------------------------------------
 message: transactional: 1 prefix: test_decoding, sz: 4 content:msg2

I also tried the same approach on sequences (on a commit before the revert of sequence replication) and it seems to be working, but I think it needs further testing. Patch 0001-Intorduce-new-field-ReplicationSlotPersistentData.no.patch, which contains the fix for logical messages, applies on master.

[1] https://www.postgresql.org/message-id/flat/CAAD30U+pVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo=OjdQGJ9g@mail.gmail.com

Thoughts?

With Regards,
Zheng
Attachments
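The LSN-tracking idea in the message above can be sketched as follows. The names are paraphrased from the description (non_xact_op_at in the patch), not taken from the actual patch, and transactional messages are left to the usual confirmed_flush handling:

```python
# Sketch: remember the end LSN of the last non-transactional message
# already decoded, and skip records at or below it, so a decoding restart
# neither loses nor duplicates non-transactional messages.

def decode_stream(records, non_xact_op_at=0):
    """records: list of (lsn, transactional, payload) tuples."""
    out = []
    for lsn, transactional, payload in records:
        if not transactional:
            if lsn <= non_xact_op_at:
                continue             # already sent before the restart
            non_xact_op_at = lsn     # advance the slot's persistent marker
        out.append(payload)
    return out, non_xact_op_at

records = [(10, False, "msg1"), (20, True, "msg2"), (30, False, "msg3")]
first, marker = decode_stream(records)           # initial decode
assert first == ["msg1", "msg2", "msg3"] and marker == 30
# after a restart, replaying the same WAL skips the non-transactional ones
replay, _ = decode_stream(records, non_xact_op_at=marker)
assert replay == ["msg2"]
```

In the toy model the transactional message is re-emitted on replay; in the real system that deduplication is what confirmed_flush already provides.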
On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
> I've pushed a revert af all the commits related to this - decoding of
> sequences and test_decoding / built-in replication changes.

Two July buildfarm runs failed with PANIC during standby promotion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13

The attached patch hacks things so an ordinary x86_64 GNU/Linux machine reproduces this consistently. "git bisect" then traced the regression to the above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5). The pg_ctl test suite passes under this hack in all supported branches, and it passed on v15 until that revert. Would you investigate?

The buildfarm animal uses keep_error_builds. From kept data directories, I deduced these events:

- After the base backup, auto-analyze ran on the primary and wrote WAL.
- Standby streamed and wrote up to 0/301FFF.
- Standby received the promote signal. Terminated streaming. WAL page at 0/302000 remained all-zeros.
- Somehow, end-of-recovery became a PANIC.
Key portions from buildfarm logs:

=== good run standby2 log
2022-07-21 22:55:16.860 UTC [25034912:5] LOG: received promote request
2022-07-21 22:55:16.878 UTC [26804682:2] FATAL: terminating walreceiver process due to administrator command
2022-07-21 22:55:16.878 UTC [25034912:6] LOG: invalid record length at 0/3000060: wanted 24, got 0
2022-07-21 22:55:16.878 UTC [25034912:7] LOG: redo done at 0/3000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.42 s
2022-07-21 22:55:16.878 UTC [25034912:8] LOG: selected new timeline ID: 2
2022-07-21 22:55:17.004 UTC [25034912:9] LOG: archive recovery complete
2022-07-21 22:55:17.005 UTC [23724044:1] LOG: checkpoint starting: force
2022-07-21 22:55:17.008 UTC [14549364:4] LOG: database system is ready to accept connections
2022-07-21 22:55:17.093 UTC [23724044:2] LOG: checkpoint complete: wrote 3 buffers (2.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.019 s, sync=0.001 s, total=0.089 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2022-07-21 22:55:17.143 UTC [27394418:1] [unknown] LOG: connection received: host=[local]
2022-07-21 22:55:17.144 UTC [27394418:2] [unknown] LOG: connection authorized: user=nm database=postgres application_name=003_promote.pl
2022-07-21 22:55:17.147 UTC [27394418:3] 003_promote.pl LOG: statement: SELECT pg_is_in_recovery()
2022-07-21 22:55:17.148 UTC [27394418:4] 003_promote.pl LOG: disconnection: session time: 0:00:00.005 user=nm database=postgres host=[local]
2022-07-21 22:55:58.301 UTC [14549364:5] LOG: received immediate shutdown request
2022-07-21 22:55:58.337 UTC [14549364:6] LOG: database system is shut down

=== failed run standby2 log, with my annotations
2022-07-19 05:28:22.136 UTC [7340406:5] LOG: received promote request
2022-07-19 05:28:22.163 UTC [8519860:2] FATAL: terminating walreceiver process due to administrator command
2022-07-19 05:28:22.166 UTC [7340406:6] LOG: invalid magic number 0000 in log segment 000000010000000000000003, offset 131072

New compared to the good run. XLOG_PAGE_MAGIC didn't match. This implies the WAL ended at a WAL page boundary.

2022-07-19 05:28:22.166 UTC [7340406:7] LOG: redo done at 0/301F168 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.18 s
2022-07-19 05:28:22.166 UTC [7340406:8] LOG: last completed transaction was at log time 2022-07-19 05:28:13.956716+00

New compared to the good run. The good run had no transactions to replay. The bad run replayed records from an auto-analyze.

2022-07-19 05:28:22.166 UTC [7340406:9] PANIC: invalid record length at 0/301F168: wanted 24, got 0

More WAL overall in bad run, due to auto-analyze. End of recovery wrongly considered a PANIC.

2022-07-19 05:28:22.583 UTC [8388800:4] LOG: startup process (PID 7340406) was terminated by signal 6: IOT/Abort trap
2022-07-19 05:28:22.584 UTC [8388800:5] LOG: terminating any other active server processes
2022-07-19 05:28:22.587 UTC [8388800:6] LOG: shutting down due to startup process failure
2022-07-19 05:28:22.627 UTC [8388800:7] LOG: database system is shut down

Let me know if I've left out details you want; I may be able to dig more out of the buildfarm artifacts.
Attachments
On 8/7/22 02:36, Noah Misch wrote:
> On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote:
>> I've pushed a revert af all the commits related to this - decoding of
>> sequences and test_decoding / built-in replication changes.
>
> Two July buildfarm runs failed with PANIC during standby promotion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13
>
> The attached patch hacks things so an ordinary x86_64 GNU/Linux machine
> reproduces this consistently. "git bisect" then traced the regression to the
> above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5). The pg_ctl
> test suite passes under this hack in all supported branches, and it passed on
> v15 until that revert. Would you investigate?
>
> The buildfarm animal uses keep_error_builds. From kept data directories, I
> deduced these events:
>
> - After the base backup, auto-analyze ran on the primary and wrote WAL.
> - Standby streamed and wrote up to 0/301FFF.
> - Standby received the promote signal. Terminated streaming. WAL page at 0/302000 remained all-zeros.
> - Somehow, end-of-recovery became a PANIC.

I think it'd be really bizarre if this was due to the revert, as that simply undoes minor WAL changes (and none of this should affect what happens at a WAL page boundary etc.). It just restores WAL as it was before 0da92dc, nothing particularly complicated. I did go through all of the changes again and I haven't spotted anything particularly suspicious, but I'll give it another try tomorrow.

However, I did try bisecting this using the attached patch, and that does not suggest the issue is in the revert commit. It actually fails all the way back to 5dc0418fab2, and it starts working on 9553b4115f1.

...
   6392f2a0968 Try to silence "-Wmissing-braces" complaints in ...
=> 5dc0418fab2 Prefetch data referenced by the WAL, take II.
   9553b4115f1 Fix warning introduced in 5c279a6d350.
...

This is merely 10 commits before the revert, and it seems way more related to WAL. Also, adding this to the two nodes in 003_standby.pl makes the issue go away, it seems:

    $node_standby->append_conf('postgresql.conf',
                               qq(recovery_prefetch = off));

I'd bet it's about WAL prefetching, not the revert, and the bisect was a bit incorrect, because the commits are close and the failures happen to be rare. (Presumably you first did the bisect and then wrote the patch that reproduces this, right?)

Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote: > On 8/7/22 02:36, Noah Misch wrote: > > On Thu, Apr 07, 2022 at 08:34:50PM +0200, Tomas Vondra wrote: > >> I've pushed a revert af all the commits related to this - decoding of > >> sequences and test_decoding / built-in replication changes. > > > > Two July buildfarm runs failed with PANIC during standby promotion: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-19%2004%3A13%3A18 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-07-31%2011%3A33%3A13 > > > > The attached patch hacks things so an ordinary x86_64 GNU/Linux machine > > reproduces this consistently. "git bisect" then traced the regression to the > > above revert commit (2c7ea57e56ca5f668c32d4266e0a3e45b455bef5). The pg_ctl > > test suite passes under this hack in all supported branches, and it passed on > > v15 until that revert. Would you investigate? > > > > The buildfarm animal uses keep_error_builds. From kept data directories, I > > deduced these events: > > > > - After the base backup, auto-analyze ran on the primary and wrote WAL. > > - Standby streamed and wrote up to 0/301FFF. > > - Standby received the promote signal. Terminated streaming. WAL page at 0/302000 remained all-zeros. > > - Somehow, end-of-recovery became a PANIC. > > I think it'd be really bizarre if this was due to the revert, as that > simply undoes minor WAL changes (and none of this should affect what > happens at WAL page boundary etc.). It just restores WAL as it was > before 0da92dc, nothing particularly complicated. I did go through all > of the changes again and I haven't spotted anything particularly > suspicious, but I'll give it another try tomorrow. > > However, I did try bisecting this using the attached patch, and that > does not suggest the issue is in the revert commit. It actually fails > all the way back to 5dc0418fab2, and it starts working on 9553b4115f1. > > ... 
> 6392f2a0968 Try to silence "-Wmissing-braces" complaints in ... > => 5dc0418fab2 Prefetch data referenced by the WAL, take II. > 9553b4115f1 Fix warning introduced in 5c279a6d350. > ... > > This is merely 10 commits before the revert, and it seems way more > related to WAL. Also, adding this to the two nodes in 003_standby.pl > makes the issue go away, it seems: > > $node_standby->append_conf('postgresql.conf', > qq(recovery_prefetch = off)); > > I'd bet it's about WAL prefetching, not the revert, and the bisect was a > bit incorrect, because the commits are close and the failures happen to > be rare. (Presumably you first did the bisect and then wrote the patch > that reproduces this, right?) No. I wrote the patch, then used the patch to drive the bisect. With ten iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10. I've now tried recovery_prefetch=off. With that, the test passes 10/10 at 2c7ea57. Given your observation of a failure at 5dc0418fab2, I agree with your conclusion. Whatever the role of 2c7ea57 in exposing the failure on my machine, a root cause in WAL prefetching looks more likely. > Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-)
On Mon, Aug 8, 2022 at 7:12 AM Noah Misch <noah@leadboat.com> wrote: > On Sun, Aug 07, 2022 at 03:18:52PM +0200, Tomas Vondra wrote: > > I'd bet it's about WAL prefetching, not the revert, and the bisect was a > > bit incorrect, because the commits are close and the failures happen to > > be rare. (Presumably you first did the bisect and then wrote the patch > > that reproduces this, right?) > > No. I wrote the patch, then used the patch to drive the bisect. With ten > iterations, commit 2c7ea57 passes 0/10, while 2c7ea57^ passes 10/10. I've now > tried recovery_prefetch=off. With that, the test passes 10/10 at 2c7ea57. > Given your observation of a failure at 5dc0418fab2, I agree with your > conclusion. Whatever the role of 2c7ea57 in exposing the failure on my > machine, a root cause in WAL prefetching looks more likely. > > > Adding Thomas Munro to the thread, he's the WAL prefetching expert ;-) Thanks for the repro patch and bisection work. Looking...
On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks for the repro patch and bisection work. Looking...

I don't have the complete explanation yet, but it's something like this. We hit the following branch in xlogrecovery.c...

        if (StandbyMode &&
            !XLogReaderValidatePageHeader(xlogreader,
                                          targetPagePtr, readBuf))
        {
            /*
             * Emit this error right now then retry this page immediately. Use
             * errmsg_internal() because the message was already translated.
             */
            if (xlogreader->errormsg_buf[0])
                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
                        (errmsg_internal("%s", xlogreader->errormsg_buf)));

            /* reset any error XLogReaderValidatePageHeader() might have set */
            xlogreader->errormsg_buf[0] = '\0';

            goto next_record_is_invalid;
        }

... but, even though there was a (suppressed) error, nothing invalidates the reader's page buffer. Normally, XLogReaderValidatePageHeader() failure or any other kind of error encountered by xlogreader.c's decoding logic would do that, but here the read_page callback is directly calling the header validation. Without prefetching, that doesn't seem to matter, but reading ahead can cause us to have the problem page in our buffer at the wrong time, and then not re-read it when we should. Or something like that.

The attached patch, which simply moves the cache invalidation into report_invalid_record() so that it's reached by the above code and everything else that reports an error, seems to fix the problem in src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch applied, and passes check-world without it. I need to look at this some more, though, and figure out if it's the right fix.
Attachments
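The stale-buffer failure mode described above can be modeled with a toy page reader. This illustrates only the general caching bug — a validation error path that forgets to drop the cached page — and is not xlogreader.c code:

```python
# Toy model: the reader caches the last page it read. If header validation
# fails but the cache is kept, a later retry is served the stale
# (all-zeros) page even after the real page has arrived.

class PageReader:
    def __init__(self, storage):
        self.storage = storage     # pageno -> bytes; mutable (WAL arrives later)
        self.cached = None         # (pageno, data) or None

    def read_page(self, pageno, invalidate_on_error=True):
        if self.cached and self.cached[0] == pageno:
            data = self.cached[1]  # served from cache, no re-read
        else:
            data = self.storage[pageno]
            self.cached = (pageno, data)
        if data == b"\x00":        # stand-in for header validation failure
            if invalidate_on_error:
                self.cached = None # the fix: error reporting drops the cache
            return None
        return data

storage = {2: b"\x00"}             # page not received yet
buggy = PageReader(dict(storage))
buggy.read_page(2, invalidate_on_error=False)  # error, cache kept
buggy.storage[2] = b"ok"                       # WAL arrives meanwhile
assert buggy.read_page(2, invalidate_on_error=False) is None  # stale!

fixed = PageReader(dict(storage))
fixed.read_page(2)                 # error, cache invalidated
fixed.storage[2] = b"ok"
assert fixed.read_page(2) == b"ok" # retry re-reads the page
```

Moving the invalidation into the common error-reporting path, as the patch does, corresponds to making every failing read take the `invalidate_on_error=True` branch.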
At Mon, 8 Aug 2022 18:15:46 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Thanks for the repro patch and bisection work. Looking...
>
> I don't have the complete explanation yet, but it's something like
> this. We hit the following branch in xlogrecovery.c...
>
>         if (StandbyMode &&
>             !XLogReaderValidatePageHeader(xlogreader,
>                                           targetPagePtr, readBuf))
>         {
>             /*
>              * Emit this error right now then retry this page immediately. Use
>              * errmsg_internal() because the message was already translated.
>              */
>             if (xlogreader->errormsg_buf[0])
>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                         (errmsg_internal("%s", xlogreader->errormsg_buf)));
>
>             /* reset any error XLogReaderValidatePageHeader() might have set */
>             xlogreader->errormsg_buf[0] = '\0';
>
>             goto next_record_is_invalid;
>         }
>
> ... but, even though there was a (suppressed) error, nothing
> invalidates the reader's page buffer. Normally,
> XLogReaderValidatePageHeader() failure or any other kind of error
> encountered by xlogreader.c's decoding logic would do that, but here
> the read_page callback is directly calling the header validation.
> Without prefetching, that doesn't seem to matter, but reading ahead
> can cause us to have the problem page in our buffer at the wrong time,
> and then not re-read it when we should. Or something like that.
>
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it. I need to look at this
> some more, though, and figure out if it's the right fix.

If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal misses the chance to invalidate the reader state.
That state is not an error while in StandbyMode. In the repro case, XLogPageRead returns XLREAD_WOULDBLOCK after the first failure. This situation (of course) was not considered when that code was introduced. If that function is going to return with XLREAD_WOULDBLOCK while lastSourceFailed, it should be turned into XLREAD_FAIL. So, the following also works.

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21088e78f6..9f242fe656 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3220,7 +3220,9 @@ retry:
 							   xlogreader->nonblocking))
 	{
 		case XLREAD_WOULDBLOCK:
-			return XLREAD_WOULDBLOCK;
+			if (!lastSourceFailed)
+				return XLREAD_WOULDBLOCK;
+			/* Fall through. */
 		case XLREAD_FAIL:
 			if (readFile >= 0)
 				close(readFile);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInteral
> misses the chance to inavlidate reader-state. That state is not an
> error while in StandbyMode.

Mmm... maybe I wanted to say the following (still, I'm not sure the rewrite works):

If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal would miss the chance to invalidate the reader state. When XLogPageRead is called in blocking mode while in StandbyMode (that is, the traditional condition), the function continues retrying until it succeeds, or returns XLREAD_FAIL if promotion is triggered. In other words, it was not supposed to return non-failure while the header validation is failing in standby mode. But in nonblocking mode, the function can return non-failure with lastSourceFailed = true, which seems wrong.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Aug 08, 2022 at 06:15:46PM +1200, Thomas Munro wrote:
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it. I need to look at this
> some more, though, and figure out if it's the right fix.

Thomas, where are you on this open item? A potential PANIC at promotion is bad. One possible exit path would be to switch the default of recovery_prefetch, though that's a kind of last-resort option seen from here.

-- 
Michael
On Tue, Aug 23, 2022 at 4:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thomas, where are you on this open item? A potential PANIC at
> promotion is bad. One possible exit path would be to switch the
> default of recovery_prefetch, though that's a kind of last-resort
> option seen from here.

I will get a fix committed this week -- I need to study Horiguchi-san's analysis...
On Tue, Aug 23, 2022 at 12:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 23, 2022 at 4:21 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Thomas, where are you on this open item? A potential PANIC at
> > promotion is bad. [...]
>
> I will get a fix committed this week -- I need to study
> Horiguchi-san's analysis...

Hi! I haven't been paying attention to this thread, but my attention was just drawn to it, and I'm wondering if the issue you're trying to track down here is actually the same as what I reported yesterday here:

https://www.postgresql.org/message-id/CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 24, 2022 at 3:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I haven't been paying attention to this thread, but my attention was
> just drawn to it, and I'm wondering if the issue you're trying to
> track down here is actually the same as what I reported yesterday
> here:
>
> https://www.postgresql.org/message-id/CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com

Summarising a chat we had about this: different bug, similar ingredients. Robert describes a screw-up in what is written, but here we're talking about a cache invalidation bug while reading.
On Mon, Aug 8, 2022 at 8:56 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Mmm... Maybe I wanted to say: (Still I'm not sure the rewrite works..)
>
> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInteral
> would miss the chance to invalidate reader-state. When XLogPageRead
> is called in blocking mode while in StandbyMode (that is, the
> traditional condition), the function continues retrying until it
> succeeds, or returns XLRAD_FAIL if promote is triggered. In other
> words, it was not supposed to return non-failure while the header
> validation is failing while in standby mode. But while in nonblocking
> mode, the function can return non-failure with lastSourceFailed =
> true, which seems wrong.

New ideas:

0001: Instead of figuring out when to invalidate the cache, let's just invalidate it before every read attempt. It is only marked valid after success (ie state->readLen > 0). No need to worry about error cases.

0002: While here, I don't like xlogrecovery.c clobbering xlogreader.c's internal error state, so I think we should have a function for that with a documented purpose. It was also a little inconsistent that it didn't clear a flag (but not buggy AFAICS; kinda wondering if I should just get rid of that flag, but that's for another day).

0003: Thinking about your comments above made me realise that I don't really want XLogPageRead() to be internally retrying for obscure failures while reading ahead. I think I prefer to give up on prefetching as soon as anything tricky happens, and deal with complexities once recovery catches up to that point. I am still thinking about this point.

Here's the patch set I'm testing.
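Idea 0001 can be illustrated with a minimal stand-in for the reader state. Everything here (PageCache, read_page, the callback type) is invented for the sketch and is not the real xlogreader API; the point is only the ordering: invalidate pessimistically before each read attempt, and mark the cache valid again only once the read callback reports success.

```c
#include <assert.h>

/* Hypothetical model of idea 0001 (not PostgreSQL code). */
typedef struct PageCache
{
    int valid;                  /* analogue of state->readLen > 0 */
    int page;                   /* which page is buffered */
} PageCache;

/* Read callback: returns bytes read, or -1 on failure. */
typedef int (*page_read_cb) (int page);

static int
read_page(PageCache *c, int page, page_read_cb cb)
{
    int     readlen;

    c->valid = 0;               /* invalidate up front: no error path can
                                 * forget to do it */
    readlen = cb(page);
    if (readlen <= 0)
        return -1;              /* cache simply stays invalid */

    c->page = page;
    c->valid = 1;               /* valid only after success */
    return readlen;
}

/* Two stub callbacks for demonstration. */
static int fail_cb(int page) { (void) page; return -1; }
static int ok_cb(int page)   { (void) page; return 8192; }
```

Because the only place that sets `valid = 1` is the success path, there is no per-error-branch bookkeeping to get wrong, which is the whole attraction of 0001.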
On 8/29/22 12:21, Thomas Munro wrote:
> New ideas:
>
> 0001: Instead of figuring out when to invalidate the cache, let's
> just invalidate it before every read attempt. It is only marked valid
> after success (ie state->readLen > 0). No need to worry about error
> cases.

Maybe I misunderstand how all this works, but won't this have a really bad performance impact? If not, why do we need the cache at all?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 30, 2022 at 6:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 8/29/22 12:21, Thomas Munro wrote:
> > 0001: Instead of figuring out when to invalidate the cache, let's
> > just invalidate it before every read attempt. It is only marked valid
> > after success (ie state->readLen > 0). No need to worry about error
> > cases.
>
> Maybe I misunderstand how all this works, but won't this have a really
> bad performance impact. If not, why do we need the cache at all?

It's a bit confusing because there are several levels of "read". The cache remains valid as long as the caller of ReadPageInternal() keeps asking for data that is in range (see early return after comment "/* check whether we have all the requested data already */"). As soon as the caller asks for something not in range, this patch marks the cache invalid before calling the page_read() callback (= XLogPageRead()). It is only marked valid again after that succeeds.

Here's a new version with no code change, just a better commit message to try to explain that more clearly.
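Thomas's answer — that in-range requests short-circuit before the up-front invalidation ever runs — can be sketched with a toy version of the "check whether we have all the requested data already" early return. As before, all names and the fixed page size here are illustrative, not the real ReadPageInternal() logic:

```c
#include <assert.h>

/* Toy model of the ReadPageInternal() cache-hit path (hypothetical names). */
typedef struct ToyCache
{
    int start;                  /* start offset of cached data */
    int len;                    /* 0 means invalid */
    int misses;                 /* how often the read callback was hit */
} ToyCache;

#define TOY_PAGE_SIZE 8192

static int
toy_read_internal(ToyCache *c, int off, int want)
{
    /* Cache hit: early return, the cache is never invalidated. */
    if (c->len > 0 && off >= c->start && off + want <= c->start + c->len)
        return want;

    /* Cache miss: invalidate first, then (pretend to) re-read a page. */
    c->len = 0;
    c->misses++;
    c->start = off - off % TOY_PAGE_SIZE;
    c->len = TOY_PAGE_SIZE;
    return want;
}
```

So the pessimistic invalidation of idea 0001 only costs anything on a cache miss, where the page had to be re-read anyway; the hot path of repeated in-range requests never reaches it.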