logical decoding and replication of sequences
От | Tomas Vondra |
---|---|
Тема | logical decoding and replication of sequences |
Дата | |
Msg-id | d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com обсуждение исходный текст |
Ответы |
Re: logical decoding and replication of sequences
(Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: logical decoding and replication of sequences (Peter Eisentraut <peter.eisentraut@enterprisedb.com>) |
Список | pgsql-hackers |
Hi, One of the existing limitations of logical decoding / replication is that it does no care about sequences. The annoying consequence is that after a failover to logical replica, all the table data may be replicated but the sequences are still at the initial values, requiring some custom solution that moves the sequences forward enough to prevent duplicities. There have been attempts to address this in the past, most recently [1], but none of them got in due to various issues. This is an attempt, based on [1] (but with many significant parts added or reworked), aiming to deal with this. The primary purpose of sharing it is getting feedback and opinions on the design decisions. It's still a WIP - it works fine AFAICS, but some of the bits may be a bit hackish. The overall goal is to have the same sequence data on the primary and logical replica, or something sufficiently close to that, so that the replica after a failover does not generate duplicate values. This patch does a couple basic things: 1) extends the logical decoding to handle sequences. It adds a new callback, similarly to what we have for messages. There's a bit of complexity with transactional and non-transactional behavior, more about that later 2) extends test_decoding to support this new callback, printing the sequence increments (the decoded WAL records) 3) extends built-in replication to support sequences, so publications may contain both tables and sequences, etc., sequences data sync when creating subscriptions, etc. transactional vs. non-transactional ----------------------------------- The first part (extending logical decoding) is simple in principle. We simply decode the sequence updates, but then comes a challenge - should we just treat it transactionally and stash it in reorder buffer, or just pass it to the output plugin right-away? For messages, this can be specified as a flag when adding the message, so the user can decide depending on the message purpose. For sequences, all we do is nextval() and it depends on the context in which it's used, we can't just pick one of those approaches. Consider this, for example: CREATE SEQUENCE s; BEGIN; SELECT nextval('s') FROM generate_series(1,1000) s(i); ROLLBACK; If we handle this "transactionally", we'd stash the "nextval" increment into the transaction, and then discard it due to the rollback, so the output plugin (and replica) would never get it. So this is an argument for non-transactional behavior. On the other hand, consider this: CREATE SEQUENCE s; BEGIN; ALTER SEQUENCE s RESTART WITH 2000; SELECT nextval('s') FROM generate_series(1,1000) s(i); ROLLBACK; In this case the ALTER creates a new relfilenode, and the ROLLBACK does discard it including the effects of the nextval calls. So here we should treat it transactionally, stash the increment(s) in the transaction and just discard it all on rollback. A somewhat similar example is this BEGIN; CREATE SEQUENCE s; SELECT nextval('s') FROM generate_series(1,1000) s(i); COMMIT; Again - the decoded nextval needs to be handled transactionally, because otherwise it's going to be very difficult for custom plugins to combine this with DDL replication. So the patch does a fairly simple thing: 1) By default, sequences are treated non-transactionally, i.e. sent to the output plugin right away. 2) We track sequences created in running (sub)transactions, and those are handled transactionally. This includes ALTER SEQUENCE cases, which create a new relfilenode, which is used as an identifier. It's a bit more complex, because of cases like this: BEGIN; CREATE SEQUENCE s; SAVEPOINT a; SELECT nextval('s') FROM generate_series(1,1000) s(i); ROLLBACK TO a; COMMIT; because we must not discard the nextval changes - this is handled by always stashing the nextval changes to the subxact where the sequence relfilenode was created. The tracking is a bit cumbersome - there's a hash table with relfilenode mapped to XID in which it was created. AFAIK that works, but might be an issue with many sequences created in running transactions. Not sure. detecting sequence creation --------------------------- Detection that a sequence (or rather the relfilenode) was created is done by adding a "created" flag into the xl_seq_rec, and setting it to "true" in the first WAL record after the creation. There might be some other way, but this seemed simple enough. applying the sequence (ResetSequence2) -------------------------------------- The decoding pretty much just extracts log_value, log_cnt and is_called from the sequence, and passes them to the output plugin. On the replica we extract those from the message, and write them to the local sequence using a new ResetSequence2 function. It's possible we don't really need log_cnt and is_called. After all, log_cnt is zero most of the time anyway, and the worst thing that could happen if we ignore it is we skip a couple values (which seems fine). syncing sequences in a subscription ----------------------------------- After creating a subscription, the sequences get syncronized just like tables. This part ia a bit hacked together, and there's definitely room for improvement - e.g. a new bgworker is started for each sequence, as we simply treat both tabels and sequences as "relation". But all we need to do for sequences is copying the (last_value, log_cnt, is_called) and calling ResetSequence2, so maybe we could sync all sequences in a single worker, or something like that. new "sequence" publication action --------------------------------- The publications now have a new "sequence" publication action, which is enabled by default. This determines whether the publication decodes sequences or what. FOR ALL SEQUENCES ----------------- It should be possible to create FOR ALL SEQUENCES publications, just like we have FOR ALL TABLES. But this produces shift/reduce conflicts in the grammar, and I didn't bother dealing with that. So for now it's required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ... no streaming support yet ------------------------ There's no supoprt for streaming of in-progress transactions yet, but should be trivial to add. GetCurrentTransactionId() in nextval ------------------------------------ There's a bit annoying behavior of nextval() - if you do this: BEGIN; CREATE SEQUENCE s; SAVEPOINT a; SELECT nextval('s') FROM generate_series(1,100) s(i); COMMIT; then the WAL record for nextval (right after the savepoint) will have XID 0 (easy to see in pg_waldump). That's kinda strange, and it causes problems in DecodeSequence() when calling SnapBuildProcessChange(builder, xid, buf->origptr) for transactional changes, because that expects a valid XID. Fixing this required adding GetCurrentTransactionId() to nextval() and two other functions, which were only doing if (RelationNeedsWAL(seqrel)) GetTopTransactionId(); so far. I'm not sure if this has some particularly bad consequences. regards [1] https://www.postgresql.org/message-id/flat/1710ed7e13b.cd7177461430746.3372264562543607781%40highgo.ca -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
В списке pgsql-hackers по дате отправления: