Discussion: WIP: Failover Slots
Failover Slots
If logical decoding is taking place from a master and we fail over to a new master, we would like to continue logical decoding from the new master. To facilitate this, we introduce the concept of "Failover Slots": slots that generate WAL, so that any downstream standby can continue to produce logical decoding output from the named plugin.
In the current patch, any slot defined on a master will generate WAL, leading to a pending-slot being present on all standby nodes. When a standby is promoted, the slot becomes usable and will have the properties as of the last fsync on the master.
Failover slots are not accessible until promotion of the standby. Logical slots on a standby look like a significant step beyond this and will not be part of this patch.
Internal design is fairly clear, using a new rmgr. No real problems emerged so far.
Patch is WIP, posted for comment, so you can see where I'm going.
I'm expecting to have a working version including timeline following for 9.6.
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
TL;DR: this doesn't work yet, working on it.
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.
No objection to the concept, but I think the new behavior needs to be
optional. I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.
On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It's also going to be necessary to handle what happens when a new failover
> slot (physical or logical) is created on the master where it conflicts with
> the name of a non-failover physical slot that was created on the replica. In
> this case I am inclined to terminate any walsender backend for the replica's
> slot with a conflict with recovery, remove its slot and replace it with a
> failover slot. The failover slot does not permit replay while in recovery so
> if the booted-off client reconnects it'll get an ERROR saying it can't
> replay from a failover slot. It should be pretty clear to the admin what's
> happening between the conflict with recovery and the failover slot error.
> There could still be an issue if the client persistently keeps retrying and
> successfully reconnects after replica promotion but I don't see that much
> can be done about that. The documentation will need to address the need to
> try to avoid name conflicts between slots created on replicas and failover
> slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable. It occurred to me to wonder if it might be better to propagate
logical slots partially or entirely outside the WAL stream, because with
this design you will end up with the logical slots on every replica,
including cascaded replicas, and I'm not sure that's what we want. Then
again, I'm also not sure it isn't what we want.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-01-22 11:40:24 -0500, Robert Haas wrote:
> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want. Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...
On Fri, Jan 22, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...

For this to work, I feel like the nodes need names, and a directory that
tells them how to reach each other.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h. It's maybe slightly overdocumented, but given the
complicated way the timeline reading works it's probably warranted.

0002:

+ /*
+  * No way to wait for the process since it's not a child
+  * of ours and there's no latch to set, so poll.
+  *
+  * We're checking this without any locks held, but
+  * we'll recheck when we attempt to drop the slot.
+  */
+ while (slot->in_use && slot->active_pid == active_pid
+        && max_sleep_micros > 0)
+ {
+     usleep(micros_per_sleep);
+     max_sleep_micros -= micros_per_sleep;
+ }

Not sure I buy this; what about postmaster crashes and fast shutdown
requests etc.? Also you usleep for 10s, which is quite long. I'd prefer
the classic wait-for-latch with timeout and pg crash check here. And even
if we go with usleep, it should be 1s not 10s, and pg_usleep instead of
usleep.

0003:
There are a lot of documentation improvements here that are not related
to failover slots or timeline following; it might be a good idea to split
those into a separate patch as they are separately useful IMHO.

Other than that it looks good to me.

About other things discussed in this thread: yes, it makes sense in
certain situations to handle this outside of WAL, and that does require
notions of nodes, etc. That being said, the timeline following is needed
even if this is handled outside of WAL. And once timeline following is
in, the slots can be handled by the replication solution itself, which is
good. But I think failover slots are still a good thing to have - they
provide an HA solution for anything that uses slots, and that includes
physical replication as well. If a specific logical replication solution
wants to handle it for some reason itself outside of WAL, it can create a
non-failover slot, so in my opinion we ideally need both types of slots
(and that's exactly what this patch gives us).
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
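Petr's objection above (a bare usleep() loop ignores postmaster death and fast-shutdown requests, and sleeps far too long per iteration) can be illustrated standalone. This is only a hedged sketch with made-up names, not the patch's code; in the real server the loop body would be a pg_usleep() or, better, a WaitLatch() call with a timeout and postmaster-death wake event:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared slot state the loop watches. */
struct demo_slot
{
    bool in_use;
    int  active_pid;
};

/*
 * Bounded wait for another process to release the slot. Unlike a blind
 * usleep() loop, each iteration also checks a shutdown flag (standing in
 * for a postmaster-death / fast-shutdown check) and a total timeout.
 * Returns 1 if released, 0 on timeout, -1 on shutdown request.
 */
static int
wait_for_slot_release(struct demo_slot *slot, int active_pid,
                      const bool *shutdown_requested, long timeout_ms)
{
    const long step_ms = 100;   /* short steps, not 10s per sleep */
    long       waited = 0;

    while (slot->in_use && slot->active_pid == active_pid)
    {
        if (*shutdown_requested)
            return -1;          /* bail out instead of polling on */
        if (waited >= timeout_ms)
            return 0;
        /* real code would sleep ~step_ms here (pg_usleep / WaitLatch) */
        waited += step_ms;
    }
    return 1;
}
```

With a latch available, the sleeping side would instead block in a WaitLatch() call using the WL_LATCH_SET, WL_TIMEOUT and WL_POSTMASTER_DEATH wake events and be woken promptly, which is the "classic" pattern Petr refers to.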
On 16 Feb 2016, at 09:11, Craig Ringer <craig@2ndquadrant.com> wrote:
> Revision attached. There was a file missing from the patch too.
What it's doing is calling pg_basebackup first to initialize the replica, and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR: requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me. The actual command causing the failure was an attempt to clone the replica using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream --dbname="host=localhost port=5432 user=replicator"

I checked the same command against the git master without the patches applied and could not reproduce this problem there.
On the code level, I have no comments on 0001; it's well documented and I have no questions about the approach, although I might not be knowledgeable enough to judge the specifics of the implementation.
slots.c:294

elog(LOG, "persistency is %i", (int)slot->data.persistency);
Should be changed to DEBUG?
slot.c:468
Why did you drop "void" as a parameter type of ReplicationSlotDropAcquired?
walsender.c: 1509 at PhysicalConfirmReceivedLocation
I've noticed a comment stating that we don't need to call ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot changes. Don't we need to call it now, the same way it's done for the logical slots in logical.c, at LogicalConfirmReceivedLocation?
Updated patch
Attachments
- 0001-Allow-logical-slots-to-follow-timeline-switches.patch
- 0002-Allow-replication-slots-to-follow-failover.patch
- 0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch
- 0004-Add-the-UI-and-for-failover-slots.patch
- 0005-Document-failover-slots.patch
- 0006-Add-failover-to-pg_replication_slots.patch
- 0007-not-for-inclusion-Test-script-for-failover-slots.patch
On 23 Feb 2016, at 11:30, Craig Ringer <craig@2ndquadrant.com> wrote:
> Updated patch... attached
>
> I've split it up a bit more too, so it's easier to tell what change is
> for what and fixed the issues mentioned by Oleksii. I've also removed
> some unrelated documentation changes. Patch 0001, timeline switches for
> logical decoding, is unchanged since the last post.
"There's some things missing to allow this:"
I think it should be "There are some things missing to allow this:"

logical.c:93
"we need we would need"

slot.c:889
"and there's no latch to set, so poll" - clearly there is a latch used in the code below.
Also, slot.c:301 emits an error message for an attempt to create a failover slot on the replica after acquiring and releasing the locks and getting the shared memory slot, even though all the data to check for this condition is available right at the beginning of the function. Shouldn’t it avoid the extra work if it’s not needed?
I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,
Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, which currently produces WAL, and this violates the assumption at xlog.c:8492:

if (shutdown && checkPoint.redo != ProcLastRecPtr)
    ereport(PANIC,
            (errmsg("concurrent transaction log activity while database system is shutting down")));
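The invariant being tripped can be modelled in a few lines. In this toy sketch (all names hypothetical, positions are fake LSNs, not the real xlog.c implementation), a shutdown checkpoint requires that nothing else inserts WAL once the redo pointer is fixed; a slot checkpoint that emits its own record breaks exactly that:

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t wal_insert_pos = 0x1000;    /* toy WAL insert pointer */

/* Every WAL record insertion advances the insert pointer. */
static uint64_t
toy_xlog_insert(void)
{
    uint64_t lsn = wal_insert_pos;
    wal_insert_pos += 0x28;
    return lsn;
}

/*
 * Toy shutdown checkpoint: fix the redo pointer, run the slot
 * checkpoint, then verify no concurrent WAL appeared. Mirrors the
 * shape of the xlog.c check only.
 */
static bool
toy_shutdown_checkpoint(bool slot_checkpoint_writes_wal)
{
    uint64_t redo = wal_insert_pos; /* where the checkpoint record will go */

    if (slot_checkpoint_writes_wal)
        (void) toy_xlog_insert();   /* CheckpointReplicationSlots() emitting WAL */

    /* analogous to: shutdown && checkPoint.redo != ProcLastRecPtr -> PANIC */
    return redo == wal_insert_pos;  /* false means the PANIC would fire */
}
```

Running it with `slot_checkpoint_writes_wal = true` reproduces the failure mode Oleksii reports: the slot checkpoint's record lands after the redo pointer, so the shutdown check fails.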
I really want to focus on the first patch, timeline following for logical slots. That part is much less invasive and is useful stand-alone. I'll move it to a separate CF entry and post it to a separate thread as I think it needs consideration independently of failover slots.
There are a couple of incorrect comments
Attachments
- 0001-Allow-replication-slots-to-follow-failover.patch
- 0002-Update-decoding_failover-tests-for-failover-slots.patch
- 0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch
- 0004-Add-the-UI-and-for-failover-slots.patch
- 0005-Document-failover-slots.patch
- 0006-Add-failover-to-pg_replication_slots.patch
- 0007-Introduce-TAP-recovery-tests-for-failover-slots.patch
Here's a new failover slots rev, addressing the issues Oleksii Kliukin raised and adding a bunch of TAP tests.
Attachments
- 0001-Allow-replication-slots-to-follow-failover.patch
- 0002-Update-decoding_failover-tests-for-failover-slots.patch
- 0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch
- 0004-Add-the-UI-and-for-failover-slots.patch
- 0005-Document-failover-slots.patch
- 0006-Add-failover-to-pg_replication_slots.patch
- 0007-Introduce-TAP-recovery-tests-for-failover-slots.patch
On 17 Mar 2016, at 09:34, Craig Ringer <craig@2ndquadrant.com> wrote:

OK, here's the latest failover slots patch, rebased on top of today's master plus, in order:

- Dirty replication slots when confirm_lsn is changed
- logical decoding timeline following

The full tree is at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want to avoid the fiddling around required to apply the patch series.
Thank you for the update. I've got some rejects when applying 0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty replication slots when confirm_lsn is changed" changes. I think it should be rebased against master (might be a consequence of the "logical slots follow the timeline" patch being committed).
I'd like to get failover slots in place for 9.6 since they're fairly self-contained and meet an immediate need: allowing replication using slots (physical or logical) to follow a failover event.
On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:
> I'd like to get failover slots in place for 9.6 since they're fairly
> self-contained and meet an immediate need: allowing replication using
> slots (physical or logical) to follow a failover event.

I'm a bit confused about this now. We seem to have timeline following, yet no failover slot. How do we now follow a failover event?
There are many and varied users of logical decoding now and a fix is critically important for 9.6.
Do all decoding plugins need to write their own support code??
Please explain how we cope without this, so if a problem remains we can fix by the freeze.
A few thoughts on failover slots vs the alternative of pushing catalog_xmin up to the master via a replica's slot and creating independent slots on replicas.

Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are transparent for failover. This is great especially for things that aren't complex replication schemes, that just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail down to the standby's own replicas) because they have to write WAL to advance the slot position. They'd have to send the slot position update "up" to the master then wait to replay it. Not a disaster, though they'd do extra work on reconnect until a restart_lsn update replayed. Would require a whole new feedback-like message on the rep protocol, and couldn't work at all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up approach to allow replay from failover slots on cascading standbys in 9.7+. There'd be no way to propagate the creation of failover slots "down" the replication hierarchy that way, especially to archive standbys like failover slots will do. So it'd be semantically different and couldn't re-use the FS UI. We'd be stuck with failover slots even if we also did the other way later.

+ Will work for recovery of a master PITR-restored up to the latest recovery point


Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica independently of the master. The replica is required to connect to the master via a slot. We send feedback to the master to advance the replica's slot on the master to the confirmed_lsn of the most-behind slot on the replica, therefore pinning master's catalog_xmin where needed. Or we just send a new feedback message type that directly sets a catalog_xmin on the replica's physical slot in the master. Slots are _not_ cloned from master to replica automatically.

- More complicated for applications to use. They have to create a slot on each replica that might be failed over to as well as the master, and have to advance all those slots to stop the master from suffering severe catalog bloat. (But see note below.)

- Applications must be able to connect to failover-candidate standbys and know where they are; it's not automagically handled via WAL. (But see note below.)

- Applications need reconfiguration whenever a standby is rebuilt, moved, etc. (But see note below.)

- Cannot work at all for archive-based replication; requires a slot from replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys unsupported at the moment, by giving us a way to pin the master's catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6


Note: I think the application complexity issues can be solved - to a degree - by having the replicas run a bgworker based helper that connects to the master and clones the master's slots then advances them automatically.


Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to application developers who'd like to use logical decoding. It makes integration of logical replication with existing HA systems much harder. It means we need really solid, performant, well-tested and mature logical rep based HA before we can take logical rep seriously, which is a long way out given that we can't do decoding of in-progress xacts, ddl, sequences, etc etc.

Some kind of physical HA for logical slots is needed and will be needed for some time. Logical rep will be great for selective replication, replication over WAN, filtered/transformed replication etc. Physical rep is great for knowing you'll get exactly the same thing on the replica that you have on the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursuing the catalog_xmin push-up idea; in both cases we don't commit anything in 9.6.
On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
> ...
>
> Nice summary
>
> Failover slots are optional. And they work on master.
>
> While the other approach could also work, it will work later and still
> require a slot on the master.
>
>
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
>
>
> We don't need to add this to core. Each plugin can independently write its
> own failover code. Works, but doesn't seem like the right approach for open
> source.
>
>
> => I think we should add Failover Slots to 9.6.
Simon, please don't take this personally, because of the other ongoing
thread. I don't think this is commit-ready.

For one, I think this is architecturally the wrong choice. But even
leaving that fact aside, and considering this a temporary solution (which
we can't easily remove), there appears to have been very little
code-level review (one early review from Petr in [1], two by Oleksii
mostly focusing on error messages [2] [3]). The whole patch was submitted
late to the 9.6 cycle.
Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
  idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf
That's from a ~5 minute skim, of one patch in the series.
[1] http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com
On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:
> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
>   raw, and not discussed
> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>   idea
> * I doubt that the archive based switches around StartupReplicationSlots
>   do what they intend. Afaics that'll not work correctly for basebackups
>   taken with -X, without recovery.conf
if (!RecoveryInProgress())
{
    /* first check whether there's something to write out */
    SpinLockAcquire(&slot->mutex);
    was_dirty = slot->dirty;
    slot->just_dirtied = false;
    SpinLockRelease(&slot->mutex);

    /* and don't do anything if there's nothing to write */
    if (!was_dirty)
        return;
}
... though I think what I really should've done there is just always dirty the slot in the redo functions.
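The pattern Craig is defending, a single lock acquisition around one consistent read of the dirty flags, can be shown standalone with a C11 atomic_flag standing in for the slot spinlock. The names here are illustrative, not the real slot.c code:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_slot
{
    atomic_flag mutex;          /* toy spinlock standing in for slot->mutex */
    bool        dirty;
    bool        just_dirtied;
};

static void spin_lock(atomic_flag *m)   { while (atomic_flag_test_and_set(m)) /* spin */ ; }
static void spin_unlock(atomic_flag *m) { atomic_flag_clear(m); }

/*
 * Decide whether a flush is needed. just_dirtied is cleared under the
 * lock, so if another process dirties the slot while we're writing it
 * out, that shows up as just_dirtied == true on the next pass.
 */
static bool
slot_needs_flush(struct demo_slot *slot)
{
    bool was_dirty;

    spin_lock(&slot->mutex);
    was_dirty = slot->dirty;
    slot->just_dirtied = false;
    spin_unlock(&slot->mutex);

    return was_dirty;
}
```

Taken on its own, the conditional branch acquires and releases the lock exactly once per call, which is the property Craig points out above.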
Are there any plans to submit a new design/version for v11?

On 8 April 2016 at 07:13, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>> raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>> idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>> do what they intend. Afaics that'll not work correctly for basebackups
>> taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> SpinLockRelease(&slot->mutex);
>
> /* and don't do anything if there's nothing to write */
> if (!was_dirty)
> return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.
On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> No. The whole approach seems to have been bounced from core. I don't agree
> and continue to think this functionality is desirable but I don't get to
> make that call.
I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that. Suppose we did
something like this:
- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.
So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above. It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.
Thoughts?
On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> - When a standby connects to a master, it can optionally supply a list
>> of slot names that it cares about.
>
> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> people right now who want the ability to promote a PITR-recovered snapshot
> into place of a logical replication master and have downstream peers replay
> from it. It's more complex than that, as there's a resync process required
> to recover changes the failed node had sent to other peers but isn't
> available in the WAL archive, but that's the gist.
>
> If you have a 5TB database do you want to run an extra replica or two
> because PostgreSQL can't preserve slots without a running, live replica?
> Your SAN snapshots + WAL archiving have been fine for everything else so
> far.
OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately. I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is. How would
you imagine using any version of this feature in a PITR scenario? If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.
Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point?
But I
can't quite follow what you're thinking about. Can you explain
further?
> Requiring live replication connections could also be an issue for service
> interruptions, surely? Unless you persist needed knowledge in the physical
> replication slot used by the standby to master connection, so the master can
> tell the difference between "downstream went away for while but will come
> back" and "downstream is gone forever, toss out its resources."
I don't think the master needs to retain any resources on behalf of
the failover slot. If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately. When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.
> Also, what about cascading? Lots of "pull" model designs I've looked at tend
> to fall down in cascaded environments. For that matter so do failover slots,
> but only for the narrower restriction of not being able to actually decode
> from a failover-enabled slot on a standby, they still work fine in terms of
> cascading down to leaf nodes.
I don't see the problem. The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby. Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.
Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master.
Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like. With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.
I don't want to come on too strong here. I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge.
If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons. I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok". Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.
On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining. (Whether I should instead be imagining
something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that. I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are). Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally.

I'd like to push back a bit on the underlying assumption, though: I
don't think that there was ever an agreed-upon design on this list for
failover slots before the first patch showed up.
Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slot). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1.

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.
> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.
What about slots not listed by the upstream that are currently in use?

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.
> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.
I'm not sure that this actually prevents the problem you describe. It
also seems really complicated. Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core. In a review of this
> thread I don't see anyone saying otherwise. The debate has really
> been about the right way of implementing that.
Given that I presumably was one of the people pushing back more
strongly: I agree with that. Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.
I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to
submitted/merged to core.
I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axis I can think of are:
- How do we want to deal with cascaded setups, do slots have to be
available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
slots do *NOT* provide proper PITR support, there's not enough
interlock easily available (you'd have to save slots at the end, then
increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
standbys, and failover slots. I'm probably a lot closer to zero than
Craig is.
- How much divergence are we going to accept between infrastructure for
logical failover, and logical failover via failover slots (or however
we're naming this)? Again, I'm probably a lot closer to zero than
Craig is.
I don't want to block failover slots on decoding on standby just because decoding on standby would be nice to have.
(It's something I'd thought about for BDR failover, but never applied to failover slots: the problem of detecting or preventing divergence when the logical client is ahead of physical receive at the time the physical standby is promoted.)