Обсуждение: [PATCH] Logical decoding support for sequence advances
Вложения
Hi allAttached is a patch against 9.6 to add support for informing logical decoding plugins of the new sequence last_value when sequence advance WAL records are processed during decoding.
Вложения
Attached a slightly updated version. It just has less spam in the regression tests, by adding a new option to test_decoding to show sequences, which it doesn't enable except in sequence specific tests.
On 2015-12-14 16:19:33 +0800, Craig Ringer wrote: > On 14 December 2015 at 11:28, Craig Ringer <craig@2ndquadrant.com> wrote: > > > Hi all > > > > Attached is a patch against 9.6 to add support for informing logical > > decoding plugins of the new sequence last_value when sequence advance WAL > > records are processed during decoding. > > > > > Attached a slightly updated version. It just has less spam in the > regression tests, by adding a new option to test_decoding to show > sequences, which it doesn't enable except in sequence specific tests. I think this is quite the wrong approach. You're calling the logical decoding callback directly from decode.c, circumventing reorderbuffer.c. While you don't need the actual reordering, you *do* need the snapshot integration. As you since noticed you can't just look into the sequence tuple and be happy with that, you need to do pg_class lookups et al. Furhtermore callbacks can validly expect to have a snapshot set. At the very least what you need to do is to SetupHistoricSnapshot(), then lookup the actual identity of the sequence using RelidByRelfilenode() and only then you can call the callback. > Needed to make logical replication based failover possible. While it'll make it easier, I think it's certainly quite possible to do so without this feature if you accept some sane restrictions. If you e.g. define to only handle serial columns (i.e. sequences that used/owned by exactly one table & column), you can do a 'good enough' emulation the output plugin. Take something like BDR's bdr_relcache.c (something which you're going to end up needing for any nontrivial replication solution). Everytime you build a cache entry you look up whether there's an owned by sequence. When decoding an insert or update to that relation, also emit an 'increase this sequence to at least XXX' message. While it's not perfect, this has the big advantage of being doable with 9.4. Greetings, Andres Freund
On 2015-12-15 13:17, Andres Freund wrote: > On 2015-12-14 16:19:33 +0800, Craig Ringer wrote: > >> Needed to make logical replication based failover possible. > > While it'll make it easier, I think it's certainly quite possible to do > so without this feature if you accept some sane restrictions. If you > e.g. define to only handle serial columns (i.e. sequences that > used/owned by exactly one table & column), you can do a 'good enough' > emulation the output plugin. > > Take something like BDR's bdr_relcache.c (something which you're going > to end up needing for any nontrivial replication solution). Everytime > you build a cache entry you look up whether there's an owned by > sequence. When decoding an insert or update to that relation, also emit > an 'increase this sequence to at least XXX' message. > > While it's not perfect, this has the big advantage of being doable with 9.4. > I don't think that approach alone is good enough. It might be ok for selective replication where the replication is driven by tables anyway, but in general and especially for failover it's not good enough to tell user that we handle some sequences and they have to fix the rest manually. That's not much different than fixing them all in practice as you script it anyway. However, if it was combined with something like what londiste does, which is to check sequences periodically and send last_value + some reasonable buffer, it could work well in most cases. In this scenario your method would be used for checking that sequence is close to going over buffer and firing the periodic send sooner than it would be if it was purely time based. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2015-12-15 13:46:29 +0100, Petr Jelinek wrote: > I don't think that approach alone is good enough. It might be ok for > selective replication where the replication is driven by tables anyway, but > in general and especially for failover it's not good enough to tell user > that we handle some sequences and they have to fix the rest manually. I think it solves roughly 80-90% of all usages of sequences. That's a significant improvement over the status quo. I'm not saying it's perfect, just that it's applicable to 9.4, and might be good enough initially. I'm not arguing against adding sequence decoding here. > That's not much different than fixing them all in practice as you > script it anyway. If you can easily script it, it's just the same type (sequences owned by a single column), everything else starts to be a bit more complicated anyway. Andres
On 2015-12-15 13:51, Andres Freund wrote: > On 2015-12-15 13:46:29 +0100, Petr Jelinek wrote: >> I don't think that approach alone is good enough. It might be ok for >> selective replication where the replication is driven by tables anyway, but >> in general and especially for failover it's not good enough to tell user >> that we handle some sequences and they have to fix the rest manually. > > I think it solves roughly 80-90% of all usages of sequences. That's a > significant improvement over the status quo. > > I'm not saying it's perfect, just that it's applicable to 9.4, and might > be good enough initially. And I am saying that I think more can and should be done even for 9.4/5. > >> That's not much different than fixing them all in practice as you >> script it anyway. > > If you can easily script it, it's just the same type (sequences owned by > a single column), everything else starts to be a bit more complicated anyway. > Well, there is some difference between scripting it for general use-case and scripting it with domain knowledge, but I see what you mean. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I think this is quite the wrong approach. You're calling the logical
decoding callback directly from decode.c, circumventing
reorderbuffer.c. While you don't need the actual reordering, you *do*
need the snapshot integration.
As you since noticed you can't just
look into the sequence tuple and be happy with that, you need to do
pg_class lookups et al. Furhtermore callbacks can validly expect to have
a snapshot set.
At the very least what you need to do is to SetupHistoricSnapshot(),
then lookup the actual identity of the sequence using
RelidByRelfilenode() and only then you can call the callback.
> Needed to make logical replication based failover possible.
While it'll make it easier, I think it's certainly quite possible to do
so without this feature if you accept some sane restrictions. If you
e.g. define to only handle serial columns (i.e. sequences that
used/owned by exactly one table & column), you can do a 'good enough'
emulation the output plugin.
Take something like BDR's bdr_relcache.c (something which you're going
to end up needing for any nontrivial replication solution). Everytime
you build a cache entry you look up whether there's an owned by
sequence. When decoding an insert or update to that relation, also emit
an 'increase this sequence to at least XXX' message.
On 15 December 2015 at 20:17, Andres Freund <andres@anarazel.de> wrote:
I think this is quite the wrong approach. You're calling the logical
decoding callback directly from decode.c, circumventing
reorderbuffer.c. While you don't need the actual reordering, you *do*
need the snapshot integration.Yeah. I thought I could get away without it because I could use Form_pg_sequence.sequence_name to bypass any catalog lookups at all, but per upthread that field should be ignored and can't be used.As you since noticed you can't just
look into the sequence tuple and be happy with that, you need to do
pg_class lookups et al. Furhtermore callbacks can validly expect to have
a snapshot set.Good point. Just because I don't need one doesn't mean others won't, and all the other callbacks do.I'll have a read over reorderbuffer.c and see if I can integrate this properly.At the very least what you need to do is to SetupHistoricSnapshot(),
then lookup the actual identity of the sequence using
RelidByRelfilenode() and only then you can call the callback.Yeah. That means it's safe for the callback to do a syscache lookup for the sequence name by oid.
Sequences advances aren't part of a transaction (though they always happen during one) and proceed whether the xact that happened to trigger them commits or rolls back. So it doesn't make sense to add them to a reorder buffer for an xact. We want to send it immediately, not associate it with some arbitrary xact that just happened to use the last value in the cache that might not commit for ages.
But then when do we send it? If sent at the moment of decoding the advance from WAL then it'll always be sent to the client between the commit of one xact and the begin of another, because it'll be sent while we're reading xlog and populating reorder buffers. For advance of an already-created sequence advance that's what we want, but CREATE SEQUENCE, ALTER SEQUENCE etc also log RM_SEQ_ID XLOG_SEQ_LOG operations. The xact that created the sequence isn't committed yet so sending the advance to the downstream will lead to attempting to advance a sequence that doesn't yet exist. Similarly ALTER SEQUENCE emits a new XLOG_SEQ_LOG with the updated Form_pg_sequence. All fine for physical replication but rather more challenging for logical replication since sometimes we want the sequence change to be associated with a particular xact and sometimes not.
So the reorder buffer has to keep track of sequences. It must check to see if a catalog change is a sequence creation and if so mark the sequence as pending, keeping a copy of the Form_pg_sequence that's updated to the latest version as the xact proceeds and writes updates to the sequence. On commit the sequence advance is replayed at the end of the xact using the snapshot of the newly committed xact; on rollback it's discarded since the sequence never became visible to anyone else. We can safely assert that that sequence will not be updated by any other xact until this one commits.
In reorderbuffer.c, there's a test for relation->rd_rel->relkind == RELKIND_SEQUENCE that skips changes to sequence relations. We'll want to replicate sequence catalog updates as DDL via event triggers and deparse so that's OK, but I think we'll need to make a note of sequence creation here to mark new sequences as uncommitted.
If we see a sequence change that isn't for an uncommitted newly-created sequence we make an immediate call through the reorder buffer to send the sequence update info to the client. That's OK even if it's something like ALTER SEQUENCE ... INCREMENT 10 RENAME TO othername; . The ALTER ... RENAME part gets sent with the xact that did it when/if it commits since it's part of the pg_class tuple for the sequence; the INCREMENT gets sent immediately since it's part of the Form_pg_sequence. That matches what happens locally, and it means there's no need to keep track of every sequence, only new ones.
On commit or rollback of an xact that creates a sequence we pop the sequence oid from the ordered list of uncommitted sequence oids that must be kept by the decoding session.
If we land up decoding the same WAL again we might send sequence updates that temporarily move a sequence backwards then advance it again when we replay the newer updates. That's the same as hot standby and seems fine. You obviously won't be able to safely get new values from sequences that're replicated from an upstream on the downstream anyway - and I anticipate that logical decoding receivers will probably want to use seqam etc later to make those sequences read-only until a failover event.
Sound reasonable?
* keep track of the last-committed xact's snapshot in the decoding session
* decode relation create/drop for RELKIND_SEQUENCE and add/remove entries to a list in the decoding session. It shouldn't get big enough to need a hash map (?). Each entry points to the reorder buffer associated with the xact that created the sequence.
* when decoding a XLOG_SEQ_LOG, check if the sequence is uncommitted in the decoding state. If it is, replace the prior copy of the sequence state in the reorder buffer for the xact that created the sequence with this one. Otherwise invoke the sequence callback immediately.
* In the sequence decoding callback look up the last committed xact's snapshot and establish that as the historical snapshot, then get the sequence name from the syscache and invoke the client's callback.
* On commit of an xact whose reorder buffer has one or more newly created sequences, invoke the sequence decode callback for each sequence last, just before sending the commit.
I'm not sure if I'll get to this for 9.6 given that it's a whole bunch more than "just decode the sequence advance when you see it in the xlogs". We'll see.
On 29/02/16 03:23, Craig Ringer wrote: > On 17 December 2015 at 10:08, Craig Ringer <craig@2ndquadrant.com > <mailto:craig@2ndquadrant.com>> wrote: > > On 15 December 2015 at 20:17, Andres Freund <andres@anarazel.de > <mailto:andres@anarazel.de>> wrote: > > > I think this is quite the wrong approach. You're calling the logical > decoding callback directly from decode.c, circumventing > reorderbuffer.c. While you don't need the actual reordering, you > *do* > need the snapshot integration. > > > Yeah. I thought I could get away without it because I could use > Form_pg_sequence.sequence_name to bypass any catalog lookups at all, > but per upthread that field should be ignored and can't be used. > > As you since noticed you can't just > look into the sequence tuple and be happy with that, you need to do > pg_class lookups et al. Furhtermore callbacks can validly expect > to have > a snapshot set. > > > Good point. Just because I don't need one doesn't mean others won't, > and all the other callbacks do. > > I'll have a read over reorderbuffer.c and see if I can integrate > this properly. > > At the very least what you need to do is to SetupHistoricSnapshot(), > then lookup the actual identity of the sequence using > RelidByRelfilenode() and only then you can call the callback. > > > Yeah. That means it's safe for the callback to do a syscache lookup > for the sequence name by oid. > > > I've revisited this patch in an attempt to get a corrected version ready > for the next CF. > > Turns out it's not that simple. > > > Sequences advances aren't part of a transaction (though they always > happen during one) and proceed whether the xact that happened to trigger > them commits or rolls back. So it doesn't make sense to add them to a > reorder buffer for an xact. We want to send it immediately, not > associate it with some arbitrary xact that just happened to use the last > value in the cache that might not commit for ages. > > > But then when do we send it? If sent at the moment of decoding the > advance from WAL then it'll always be sent to the client between the > commit of one xact and the begin of another, because it'll be sent while > we're reading xlog and populating reorder buffers. For advance of an > already-created sequence advance that's what we want, but CREATE > SEQUENCE, ALTER SEQUENCE etc also log RM_SEQ_ID XLOG_SEQ_LOG operations. > The xact that created the sequence isn't committed yet so sending the > advance to the downstream will lead to attempting to advance a sequence > that doesn't yet exist. Similarly ALTER SEQUENCE emits a new > XLOG_SEQ_LOG with the updated Form_pg_sequence. All fine for physical > replication but rather more challenging for logical replication since > sometimes we want the sequence change to be associated with a particular > xact and sometimes not. > > > So the reorder buffer has to keep track of sequences. It must check to > see if a catalog change is a sequence creation and if so mark the > sequence as pending, keeping a copy of the Form_pg_sequence that's > updated to the latest version as the xact proceeds and writes updates to > the sequence. On commit the sequence advance is replayed at the end of > the xact using the snapshot of the newly committed xact; on rollback > it's discarded since the sequence never became visible to anyone else. > We can safely assert that that sequence will not be updated by any other > xact until this one commits. > > > In reorderbuffer.c, there's a test for relation->rd_rel->relkind == > RELKIND_SEQUENCE that skips changes to sequence relations. We'll want to > replicate sequence catalog updates as DDL via event triggers and deparse > so that's OK, but I think we'll need to make a note of sequence creation > here to mark new sequences as uncommitted. > > > If we see a sequence change that isn't for an uncommitted newly-created > sequence we make an immediate call through the reorder buffer to send > the sequence update info to the client. That's OK even if it's something > like ALTER SEQUENCE ... INCREMENT 10 RENAME TO othername; . The ALTER > ... RENAME part gets sent with the xact that did it when/if it commits > since it's part of the pg_class tuple for the sequence; the INCREMENT > gets sent immediately since it's part of the Form_pg_sequence. That > matches what happens locally, and it means there's no need to keep track > of every sequence, only new ones. > > > On commit or rollback of an xact that creates a sequence we pop the > sequence oid from the ordered list of uncommitted sequence oids that > must be kept by the decoding session. > > > If we land up decoding the same WAL again we might send sequence updates > that temporarily move a sequence backwards then advance it again when we > replay the newer updates. That's the same as hot standby and seems fine. > You obviously won't be able to safely get new values from sequences > that're replicated from an upstream on the downstream anyway - and I > anticipate that logical decoding receivers will probably want to use > seqam etc later to make those sequences read-only until a failover event. > > > Sound reasonable? > I wonder if it would be acceptable to create new info flag for RM_SEQ_ID that would behave just like XLOG_SEQ_LOG but would be used only for the nontransactional updates (nextval) so that decoding could easily differentiate between transactional and non-transactional update of sequence and then just either call the callback immediately or add the change to reorder buffer based on that. The redo code could just have simple OR expression to behave same with both of the info flags. Seems like simpler solution than building all the tracking code on the decoding side to me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek wrote: > I wonder if it would be acceptable to create new info flag for RM_SEQ_ID > that would behave just like XLOG_SEQ_LOG but would be used only for the > nontransactional updates (nextval) so that decoding could easily > differentiate between transactional and non-transactional update of sequence > and then just either call the callback immediately or add the change to > reorder buffer based on that. The redo code could just have simple OR > expression to behave same with both of the info flags. > > Seems like simpler solution than building all the tracking code on the > decoding side to me. Given the mess in Craig's description, the new info flag sounds a much more reasonable approach to me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29/02/16 03:23, Craig Ringer wrote:
Sound reasonable?
I wonder if it would be acceptable to create new info flag for RM_SEQ_ID that would behave just like XLOG_SEQ_LOG but would be used only for the nontransactional updates (nextval) so that decoding could easily differentiate between transactional and non-transactional update of sequence and then just either call the callback immediately or add the change to reorder buffer based on that. The redo code could just have simple OR expression to behave same with both of the info flags.
Seems like simpler solution than building all the tracking code on the decoding side to me.
On 02/03/16 08:05, Craig Ringer wrote: > On 1 March 2016 at 05:30, Petr Jelinek <petr@2ndquadrant.com > <mailto:petr@2ndquadrant.com>> wrote: > > > On 29/02/16 03:23, Craig Ringer wrote: > > Sound reasonable? > > > I wonder if it would be acceptable to create new info flag for > RM_SEQ_ID that would behave just like XLOG_SEQ_LOG but would be used > only for the nontransactional updates (nextval) so that decoding > could easily differentiate between transactional and > non-transactional update of sequence and then just either call the > callback immediately or add the change to reorder buffer based on > that. The redo code could just have simple OR expression to behave > same with both of the info flags. > > > That's much cleaner than trying to keep track of sequence creations and > really pretty harmless. I'll give that a go and see how it looks. > > Seems like simpler solution than building all the tracking code on > the decoding side to me. > > > +1 > Except this won't work for sequences that have been created in same transaction as the nextval()/setval() was called because in those cases we don't want to decode the advancement of sequence until the end of transaction and we can't map the relfilenode to sequence without going through reorder buffer in those cases either -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 02/03/16 08:05, Craig Ringer wrote:On 1 March 2016 at 05:30, Petr Jelinek <petr@2ndquadrant.com
<mailto:petr@2ndquadrant.com>> wrote:
I wonder if it would be acceptable to create new info flag for
RM_SEQ_ID that would behave just like XLOG_SEQ_LOG but would be used
only for the nontransactional updates (nextval) so that decoding
could easily differentiate between transactional and
non-transactional update of sequence and then just either call the
callback immediately or add the change to reorder buffer based on
that. The redo code could just have simple OR expression to behave
same with both of the info flags.
That's much cleaner than trying to keep track of sequence creations and
really pretty harmless. I'll give that a go and see how it looks.
Seems like simpler solution than building all the tracking code on
the decoding side to me.
+1
Except this won't work for sequences that have been created in same transaction as the nextval()/setval() was called because in those cases we don't want to decode the advancement of sequence until the end of transaction and we can't map the relfilenode to sequence without going through reorder buffer in those cases either