Discussion: WIP: Failover Slots


WIP: Failover Slots

From
Simon Riggs
Date:

Failover Slots

If Logical Decoding is taking place from a master and we fail over and create a new master, we would like to continue the Logical Decoding from the new master. To facilitate this, we introduce the concept of "Failover Slots": slots whose state is written to WAL, so that any downstream standby, once promoted, can continue to produce logical decoding output from the same named slot and plugin.

In the current patch, any slot defined on a master will generate WAL, leading to a pending-slot being present on all standby nodes. When a standby is promoted, the slot becomes usable and will have the properties as of the last fsync on the master. 

Failover slots are not accessible until promotion of the standby. Logical slots usable from a standby look like a significant step beyond this and will not be part of this patch.

The internal design is fairly clear, using a new rmgr. No real problems have emerged so far.

Patch is WIP, posted for comment, so you can see where I'm going. 

I'm expecting to have a working version including timeline following for 9.6.


--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 2 January 2016 at 08:50, Simon Riggs <simon@2ndquadrant.com> wrote:
 

Patch is WIP, posted for comment, so you can see where I'm going. 



I've applied this on a branch of master and posted it, with some comment editorialization, as https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . The tree will be subject to rebasing.

At present the patch does not appear to work. No slots are visible in the replica's pg_replication_slots before or after promotion and no slot information is written to the xlog according to pg_xlogdump:

$ ~/pg/96/bin/pg_xlogdump -r ReplicationSlot 000000010000000000000001 000000010000000000000003
pg_xlogdump: FATAL:  error in WAL record at 0/301DDE0: invalid record length at 0/301DE10

so it's very much a WIP. I've read through it and think the idea makes sense, it's just still missing some pieces...



So. Initial review comments.



This looks pretty incomplete:

+ if (found_duplicate)
+ {
+ LWLockRelease(ReplicationSlotAllocationLock);
+
+ /*
+ * Do something nasty to the sinful duplicants, but
+ * take with locking.
+ */
+
+ LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+ }

... and I'm not sure I understand how the possibility of a duplicate slot can arise in the first place, since you cannot create a slot on a read replica. This seems unnecessary.



I'm not sure I understand why, in ReplicationSlotRedoCreate, it's especially desirable to prevent blocking iteration over pg_replication_slots or acquiring a slot. When redoing a slot create isn't that exactly what we should do? This looks like it's been copied & pasted verbatim from CheckPointReplicationSlots . There it makes sense, since the slots may be in active use. During redo it seems reasonable to just acquire ReplicationSlotControlLock.
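Roughly what I have in mind - and this is a sketch only, the record layout and details are assumed rather than taken from the patch - is that during redo nothing else creates slots concurrently, so taking ReplicationSlotControlLock exclusively while installing the slot in shared memory should be enough. A real version would of course also have to write the slot out to pg_replslot and fsync it:

static void
ReplicationSlotRedoCreate(ReplicationSlotPersistentData *xlrec)
{
	int		i;

	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[i];

		if (slot->in_use)
			continue;

		/* install the persistent state carried in the WAL record */
		memcpy(&slot->data, xlrec, sizeof(ReplicationSlotPersistentData));
		slot->in_use = true;
		slot->active_pid = 0;	/* nobody holds the redone slot */
		break;
	}
	LWLockRelease(ReplicationSlotControlLock);
}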



I'm not a fan of the ReplicationSlotInWAL typedef for ReplicationSlotPersistentData. Especially as it's only used in the redo functions but *not* when xlog is written out. I'd like to just replace it.


Purely for convenient testing there's a shell script in the tree - https://github.com/2ndQuadrant/postgres/blob/dev/failover-slots/failover-slot-test.sh . Assuming a patched 9.6 in $HOME/pg/96 it'll do a run-through of the patch. I'll attempt to convert this to use the new test infrastructure, but needed something ASAP for development. Posted in case it's useful to others.

Now it's time to look into why this doesn't seem to be generating any xlog when by rights it seems like it should. Also into exactly at what point we purge existing slots on startup / promotion of a read replica.


TL;DR: this doesn't work yet, working on it.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 20 January 2016 at 21:02, Craig Ringer <craig@2ndquadrant.com> wrote:
 
TL;DR: this doesn't work yet, working on it.

Nothing is logged on slot creation because ReplicationSlot->data.failover is never true. Once that's fixed by - for now - making all slots failover slots, there's a crash in XLogInsert because of the use of reserved bits in the XLogInsert info argument. Fix pushed.
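For reference, since the low bits of the info byte are reserved by XLogInsert itself (XLR_INFO_MASK), the rmgr-specific constants have to live in the high nibble, along these lines (the constant and rmgr names here are placeholders, not necessarily what the patch uses):

/* xl_info values for the slot rmgr; only XLR_RMGR_INFO_MASK (0xF0) is ours */
#define XLOG_REPLSLOT_CREATE	0x00
#define XLOG_REPLSLOT_UPDATE	0x10
#define XLOG_REPLSLOT_DROP		0x20

	XLogBeginInsert();
	XLogRegisterData((char *) &slot->data, sizeof(ReplicationSlotPersistentData));
	XLogInsert(RM_REPLSLOT_ID, XLOG_REPLSLOT_CREATE);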

I also noticed that slot drops seem to be logged whether or not the slot is a failover slot. Pushed a fix for that.

The WAL writing is now working. I've made improvements to the rmgr xlogdump support to make it clearer what's written.

Slots are still not visible on the replica so there's work to do tracing redo, promotion, slot handling after starting from a basebackup, etc. The patch is still very much WIP.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Robert Haas
Date:
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional.  I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Failover Slots

From
Simon Riggs
Date:
On 21 January 2016 at 16:31, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional.  I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

I was unsure myself; but making them optional seems reasonable. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 22 January 2016 at 00:31, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional.  I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

Yeah,  I think that's the idea too. For one thing we'll want to allow non-failover slots to continue to be usable from a streaming replica, but we must ERROR if anyone attempts to attach to and replay from a failover slot via a replica since we can't write WAL there. Both kinds are needed.

There's a 'failover' bool member in the slot persistent data for that reason. It's not (yet) exposed via the UI.

I presume we'll want to:

* add a new default-false argument is_failover_slot or similar to pg_create_logical_replication_slot and pg_create_physical_replication_slot

* Add a new optional flag argument FAILOVER to CREATE_REPLICATION_SLOT in both its LOGICAL and PHYSICAL forms.

... and will be adding that to this patch, barring syntax objections etc.


It's also going to be necessary to handle what happens when a new failover slot (physical or logical) is created on the master where it conflicts with the name of a non-failover physical slot that was created on the replica. In this case I am inclined to terminate any walsender backend for the replica's slot with a conflict with recovery, remove its slot and replace it with a failover slot. The failover slot does not permit replay while in recovery so if the booted-off client reconnects it'll get an ERROR saying it can't replay from a failover slot. It should be pretty clear to the admin what's happening between the conflict with recovery and the failover slot error. There could still be an issue if the client persistently keeps retrying and successfully reconnects after replica promotion but I don't see that much can be done about that. The documentation will need to address the need to try to avoid name conflicts between slots created on replicas and failover slots on the master.

I'll be working on those docs, on the name conflict handling, and on the syntax during my coming flight. Toddler permitting.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Robert Haas
Date:
On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It's also going to be necessary to handle what happens when a new failover
> slot (physical or logical) is created on the master where it conflicts with
> the name of a non-failover physical slot that was created on the replica. In
> this case I am inclined to terminate any walsender backend for the replica's
> slot with a conflict with recovery, remove its slot and replace it with a
> failover slot. The failover slot does not permit replay while in recovery so
> if the booted-off client reconnects it'll get an ERROR saying it can't
> replay from a failover slot. It should be pretty clear to the admin what's
> happening between the conflict with recovery and the failover slot error.
> There could still be an issue if the client persistently keeps retrying and
> successfully reconnects after replica promotion but I don't see that much
> can be done about that. The documentation will need to address the need to
> try to avoid name conflicts between slots created on replicas and failover
> slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable.  It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want.  Then again, I'm also not sure it isn't what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Failover Slots

From
Andres Freund
Date:
On 2016-01-22 11:40:24 -0500, Robert Haas wrote:
> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want.  Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...



Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 23 January 2016 at 00:40, Robert Haas <robertmhaas@gmail.com> wrote:
 
It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want.  Then again, I'm also not sure it isn't what we want.

I think it's the most sensible default if there's only going to be one choice to start with. It's consistent with what we do elsewhere with replicas so there won't be any surprises.   Failover slots are a fairly simple feature that IMO just makes slots behave more like you might expect them to do in the first place.

I'm pretty hesitant to start making cascaded replicas different to each other just for slots. There are lots of other things where variation between replicas would be lovely - the most obvious of which is omitting some databases from some replicas. Right now we have a single data stream, WAL, that goes to every replica. If we're going to change that I'd really like to address it in a way that'll meet future needs like selective physical replication too. I also think we'd want to deal with the problem of identifying and labeling nodes to do a good job of selective replication of slots.

I'd like to get failover slots in place for 9.6 since they're fairly self-contained and meet an immediate need: allowing replication using slots (physical or logical) to follow a failover event.

After that I want to add logical decoding support for slot creation/drop/advance. So a logical decoding plugin can mirror logical slot state on another node. It wouldn't be useful for physical slots, of course, but it'd allow failover between logical replicas where we can do cool things like replicate just a subset of DBs/tables/etc. (A mapping of upstream to downstream LSNs would be required, but hopefully not too hard). That's post-9.6 work and separate to failover slots, though dependent on them for the ability to decode slot changes from WAL.

Down the track I'd very much like to be less dependent on forcing everything though WAL with a single timeline. I agree with Andres that being able to use failover slots on replicas would be good, and that it's not possible when we use WAL to track the info. I just think it's a massive increase in complexity and scope and I'd really like to be able to follow failover the simple way first, then enhance it.

Nothing stops us changing failover slots in 9.7+ from using WAL to some other mechanism that we design carefully at that time. We can extend the walsender command set for physical rep at a major version bump with no major issues, including adding to the streaming rep protocol. There's lots to figure out though, including how to maintain slot changes in a strict ordering with WAL, how to store and read the info, etc.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 23 January 2016 at 00:51, Andres Freund <andres@anarazel.de> wrote:
 
Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

Yeah. So you could have a read-replica that has a slot and it has child nodes you can fail over to, but you don't have to have the slot on the master.

I don't personally find that to be a particularly compelling thing that says "we must have this" ... but maybe I'm not seeing the full significance/advantages.
 
I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...

That's part of it. Also the mechanism by which we actually replicate them - protocol additions for the walsender protocol, how to reliably send something that doesn't have an LSN, etc. It might be fairly simple, I haven't thought about it deeply, but I'd rather not go there until the basics are in place.

BTW, I'm keeping a working tree at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . Subject to rebasing, history not clean. It has a test script in it that'll go away before patch posting.

Current state needs work to ensure that on-disk and in-memory representations are kept in sync, but is getting there.



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
Hi all

Here's v2 of the failover slots patch. It replicates a logical slot to a physical streaming replica downstream, keeping the slots in sync. After the downstream is promoted a client can replay from the logical slot.

UI to allow creation of non-failover slots is pending.

There's more testing to do to cover all the corners: drop slots, drop and re-create, name conflicts between downstream !failover slots and upstream failover slots, etc.

There's also a known bug where WAL isn't correctly retained for a slot that was created before a basebackup, which I'll fix in a revision shortly.

I'm interested in ideas on how to better test this.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachments

Re: WIP: Failover Slots

From
Craig Ringer
Date:
Hi all

Here's v3 of failover slots.

It doesn't add the UI yet, but it's now functionally complete except for timeline following for logical slots, and I have a plan for that.

Attachments

Re: WIP: Failover Slots

From
Robert Haas
Date:
On Fri, Jan 22, 2016 at 11:51 AM, Andres Freund <andres@anarazel.de> wrote:
> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...

For this to work, I feel like the nodes need names, and a directory
that tells them how to reach each other.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Failover Slots

From
Craig Ringer
Date:
Hi all

I've attached the completed failover slots patch.

There were quite a few details to address so this is split into a three patch series. I've also attached the test script I've been using.

We need failover slots to allow a logical decoding client to follow physical failover, allowing HA "above" a logical decoding client.

For reviewers there's some additional explanation for why a few of the changes are made the way they are on the wiki: https://wiki.postgresql.org/wiki/Failover_slots . See "Patch notes".

The tagged tree for this submission is at https://github.com/2ndQuadrant/postgres/tree/failover-slots-v4 .

I intend to backport this to 9.4 and 9.5 (though of course not for mainline submission!).


Patch 1: add support for timeline following in logical decoding.

This is necessary to make failover slots useful. Otherwise decoding from a slot will fail after failover because the server tries to read WAL with ThisTimeLineID but the needed archives are on a historical timeline. While the smallest part of the patch series, this was the most complex.



Patch 2: Failover slots core

* Add WAL logging and redo for failover slots

* copy pg_replslot/ in pg_basebackup

* drop non-failover slots on archive recovery startup

* expand the amount of WAL copied by pg_basebackup so failover slots are usable after restore

* if a failover slot is created on the primary with the same name as an
  existing non-failover slot on replica(s), kill any client connected to the
  replica's slot and drop the replica's slot during redo

* Adds a new backup label entry MIN FAILOVER SLOT LSN to generated
  backup label files if failover slots are present. This allows utilities like
  pgbarman, omnipitr, etc to know to retain more WAL to preserve the
  function of failover slots.

* Return a lower LSN from pg_start_backup() and BASE_BACKUP
  if needed to ensure that tools copy the extra WAL required by failover
  slots during a base backup.

Relies on timeline following for logical decoding slots to be useful.

Does not add UI (function arguments, walsender syntax, changes to views, etc) to expose failover slots to users. They can only be used by extensions that call ReplicationSlotCreate directly.
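For anyone wanting to try that from an extension: the stock signature is ReplicationSlotCreate(name, db_specific, persistency), and I'm assuming here, purely for illustration (check the patch for the real API), that the failover variant simply grows an extra flag:

	/* hypothetical extension call; the trailing 'failover' argument is an
	 * assumption about the patched API, not part of stock 9.6 */
	ReplicationSlotCreate("my_failover_slot",
						  true,			/* db_specific, i.e. a logical slot */
						  RS_EPHEMERAL,	/* persistency; made persistent later */
						  true);		/* failover: WAL-log this slot */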



Patch 3: User interface for failover slots

The 3rd patch adds the UI to expose failover slots to the user:

- A 'failover' boolean argument, default false, to pg_create_physical_replication_slot(...) and pg_create_logical_replication_slot(...)
- A new FAILOVER option to PG_CREATE_REPLICATION_SLOT on the walsender protocol
- A new 'failover' boolean column in pg_catalog.pg_replication_slots
- SGML documentation changes for the new options and for failover slots in general

Limited tests are also added in this patch since not much of this is really testable by pg_regress. I've attached my local test script in case it's of interest/use to anyone.




--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: WIP: Failover Slots

From
Petr Jelinek
Date:
Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h. It's maybe slightly overdocumented but given the
complicated way the timeline reading works it's probably warranted.

0002:
+                /*
+                 * No way to wait for the process since it's not a child
+                 * of ours and there's no latch to set, so poll.
+                 *
+                 * We're checking this without any locks held, but
+                 * we'll recheck when we attempt to drop the slot.
+                 */
+                while (slot->in_use && slot->active_pid == active_pid
+                        && max_sleep_micros > 0)
+                {
+                    usleep(micros_per_sleep);
+                    max_sleep_micros -= micros_per_sleep;
+                }

Not sure I buy this, what about postmaster crashes and fast shutdown
requests etc. Also you do usleep for 10s which is quite long. I'd
prefer the classic wait for latch with timeout and pg crash check
here. And even if we go with usleep, then it should be 1s not 10s and
pg_usleep instead of usleep.
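i.e. the usual pattern, roughly (a sketch only; the surrounding retry loop and limits are assumed):

	int		rc;

	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
				   1000L);		/* sleep at most 1s per iteration */
	if (rc & WL_POSTMASTER_DEATH)
		proc_exit(1);			/* postmaster died, bail out */
	ResetLatch(MyLatch);

	/* then recheck slot->in_use / slot->active_pid and loop or give up */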

0003:
There is a lot of documentation improvements here that are not related
to failover slots or timeline following, it might be good idea to
split those into separate patch as they are separately useful IMHO.

Other than that it looks good to me.

About other things discussed in this thread. Yes it makes sense in
certain situations to handle this outside of WAL and that does require
notions of nodes, etc. That being said, the timeline following is
needed even if this is handled outside of WAL. And once timeline
following is in, the slots can be handled by the replication solution
itself which is good. But I think the failover slots are still a good
thing to have - it provides HA solution for anything that uses slots,
and that includes physical replication as well. If the specific
logical replication solution wants to handle it for some reason itself
outside of WAL, it can create non-failover slot so in my opinion we
ideally need both types of slots (and that's exactly what this patch
gives us).

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Failover Slots

From
Craig Ringer
Date:


On 16 February 2016 at 01:23, Petr Jelinek <petr@2ndquadrant.com> wrote:
Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h.

I don't see the broken indentation. Not sure what you mean.

 
+                while (slot->in_use && slot->active_pid == active_pid
+                        && max_sleep_micros > 0)
+                {
+                    usleep(micros_per_sleep);
+                    max_sleep_micros -= micros_per_sleep;
+                }

Not sure I buy this, what about postmaster crashes and fast shutdown
requests etc.

Yeah. I was thinking - incorrectly - that I couldn't use a latch during recovery. 

Revision attached. There was a file missing from the patch too.
 
0003:
There is a lot of documentation improvements here that are not related
to failover slots or timeline following, it might be good idea to
split those into separate patch as they are separately useful IMHO.

Yeah, probably worth doing. We'll see how this patch goes.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: WIP: Failover Slots

From
Oleksii Kliukin
Date:
Hi,

On 16 Feb 2016, at 09:11, Craig Ringer <craig@2ndquadrant.com> wrote:



Revision attached. There was a file missing from the patch too.


All attached patches apply normally. I only took a look at the first 2, but also tried to run Patroni with the modified version to check whether the basic replication works.

What it’s doing is calling pg_basebackup first to initialize the replica, and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR:  requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream --dbname="host=localhost port=5432  user=replicator"

I checked the same command against the  git master without the patches applied and could not reproduce this problem there.

On the code level, I have no comments on 0001, it’s well documented and I have no questions about the approach, although I might be not too knowledgable to judge the specifics of the implementation.

On the 0002, there are a few rough edges:


slots.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

slot.c:468
Why did you drop "void" as a parameter type of ReplicationSlotDropAcquired?

walsender.c: 1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot changes. Don’t we need to call it now, the same way it’s done for the logical slots in logical.c at LogicalConfirmReceivedLocation?

Kind regards,
--
Oleksii

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 22 February 2016 at 23:39, Oleksii Kliukin <alexk@hintbits.com> wrote:
 
What it’s doing is calling pg_basebackup first to initialize the replica, and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR:  requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream --dbname="host=localhost port=5432  user=replicator"

I checked the same command against the  git master without the patches applied and could not reproduce this problem there.

That's a bug. In testing whether we need to return a lower LSN for minimum WAL for BASE_BACKUP it failed to properly test for InvalidXLogRecPtr . Good catch.
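The fix is essentially a guard along these lines (the variable names here are illustrative, not the actual patch):

	/* only lower the base backup's starting LSN if slots actually need it */
	if (!XLogRecPtrIsInvalid(slots_min_lsn) && slots_min_lsn < startptr)
		startptr = slots_min_lsn;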
 
On the code level, I have no comments on 0001, it’s well documented and I have no questions about the approach, although I might be not too knowledgable to judge the specifics of the implementation.

The first patch is the most important IMO, and the one I think needs the most thought since it's ... well, timelines aren't simple.
 
slots.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

That's an escapee log statement I thought I'd already rebased out. Well spotted, fixed.
 

slot.c:468
Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?

That's an editing error on my part that I'll reverse. Since the prototype declares (void) it doesn't matter, but it's a pointless change. Fixed.
 
walsender.c: 1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot changes. Don’t we need to call it now, the same way it’s done for the logical slots in logical.c at LogicalConfirmReceivedLocation?

No, it's safe here. All we must ensure is that a slot is advanced on the replica when it's advanced on the master. For physical slots even that's a weak requirement, we just have to stop them from falling *too* far behind and causing too much xlog retention. For logical slots we should ensure we advance the slot on the replica before any vacuum activity that might remove catalog tuples still needed by that slot gets replayed. Basically the difference is that logical slots keep track of the catalog xmin too, so they have (slightly) stricter requirements.

This patch doesn't touch either of those functions except for renaming ReplicationSlotsComputeRequiredLSN to ReplicationSlotsUpdateRequiredLSN . Which, by the way, I really don't like doing, but I couldn't figure out a name to give the function that computes-and-returns the required LSN that wouldn't be even more confusing in the face of having a ReplicationSlotsComputeRequiredLSN function as well. Ideas welcome.

Updated patch 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
 
Updated patch 

... attached

I've split it up a bit more too, so it's easier to tell what change is for what and fixed the issues mentioned by Oleksii. I've also removed some unrelated documentation changes.

Patch 0001, timeline switches for logical decoding, is unchanged since the last post. 


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: WIP: Failover Slots

From
Oleksii Kliukin
Date:

On 23 Feb 2016, at 11:30, Craig Ringer <craig@2ndquadrant.com> wrote:

 
Updated patch 

... attached

I've split it up a bit more too, so it's easier to tell what change is for what and fixed the issues mentioned by Oleksii. I've also removed some unrelated documentation changes.

Patch 0001, timeline switches for logical decoding, is unchanged since the last post. 

Thank you, I read the user-interface part now, looks good to me.

I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,

Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, which currently produces WAL, and this violates the assumption at line xlog.c:8492

	if (shutdown && checkPoint.redo != ProcLastRecPtr)
		ereport(PANIC,
				(errmsg("concurrent transaction log activity while database system is shutting down")));


There are a couple of incorrect comments

logical.c: 90
"There's some things missing to allow this:" - I think it should be "There are some things missing to allow this:"

logical.c:93
"we need we would need”

slot.c:889
"and there's no latch to set, so poll” - clearly there is a latch used in the code below.

Also,  slot.c:301 emits an error message for an attempt to create a failover slot on the replica after acquiring and releasing the locks and getting the shared memory slot, even though all the data to check for this condition is available right at the beginning of the function. Shouldn’t it avoid the extra work if it’s not needed? 



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
<0001-Allow-logical-slots-to-follow-timeline-switches.patch><0002-Allow-replication-slots-to-follow-failover.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-not-for-inclusion-Test-script-for-failover-slots.patch>

Kind regards,
--
Oleksii

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 24 February 2016 at 03:53, Oleksii Kliukin <alexk@hintbits.com> wrote:
 

I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,


Odd that I didn't see that in my testing. Thanks very much for this. I concur with your explanation.

Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, which currently produces WAL, and this violates the assumption at line xlog.c:8492

	if (shutdown && checkPoint.redo != ProcLastRecPtr)
		ereport(PANIC,
				(errmsg("concurrent transaction log activity while database system is shutting down")));

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots during a shutdown checkpoint. We're using WAL to move data to the replicas but we don't really need it for local redo and correctness on the master. The trouble is that we do of course redo failover slot updates on the master and we don't really want a slot to go backwards vs its on-disk state before a crash. That's not too harmful in itself, but it could lead to us losing a catalog_xmin increase for a slot, so that the slot thinks catalog tuples that have actually been vacuumed away are still readable.

CheckpointReplicationSlots notes that:

 * This needn't actually be part of a checkpoint, but it's a convenient
 * location.

... and I suspect the answer there is simply to move the slot checkpoint to occur prior to the WAL checkpoint rather than during it. I'll investigate.


I really want to focus on the first patch, timeline following for logical slots. That part is much less invasive and is useful stand-alone. I'll move it to a separate CF entry and post it to a separate thread as I think it needs consideration independently of failover slots.


(BTW, the slot docs promise that slots will replay a change exactly once, but this is not correct and the client must keep track of replay position. I'll post a patch to correct it separately).
 
There are a couple of incorrect comments

Thanks, will amend.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 24 February 2016 at 18:02, Craig Ringer <craig@2ndquadrant.com> wrote:
 
I really want to focus on the first patch, timeline following for logical slots. That part is much less invasive and is useful stand-alone. I'll move it to a separate CF entry and post it to a separate thread as I think it needs consideration independently of failover slots.

Just an update on the failover slots status: I've moved timeline following for logical slots into its own patch set and CF entry and added a bunch of tests.


Some perl TAP test framework enhancements were needed for that; they're mostly committed now with a few pending.


Once some final changes are made to the tests for timeline following I'll address the checkpoint issue in failover slots by doing the checkpoint of slots at the start of a checkpoint/restartpoint, while we can still write WAL. Per the comments in CheckPointReplicationSlots it's mostly done in a checkpoint currently for convenience.

Then I'll write some TAP tests for failover slots and submit an updated patch for them, by which time hopefully timeline following for logical slots will be committed.

In other words this patch isn't dead, the foundations are just being rebased out from under it.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
Here's a new failover slots rev, addressing the issues Oleksii Kliukin raised and adding a bunch of TAP tests.

In particular, for the checkpoint issue I landed up moving CheckPointReplicationSlots to occur at the start of a checkpoint, before writing WAL is prohibited. As the comments note it's just a convenient place and time to do it anyway. That means it has to be called separately at a restartpoint, but I don't think that's a biggie.
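In rough terms the ordering now looks like this (a sketch of the placement only, not a diff of the patch):

	/* in CreateCheckPoint() / CreateRestartPoint(): flush dirty slot state
	 * first, while inserting WAL is still allowed; for failover slots this
	 * may itself write slot-update records */
	CheckPointReplicationSlots();

	/* ... only afterwards establish the redo pointer and write the
	 * checkpoint record, so a shutdown checkpoint no longer sees slot WAL
	 * as concurrent activity */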

The tests for this took me quite a while, much (much) longer than the code changes.

I split the patch up a bit more too so individual changes are more logically grouped and clearer. I expect it'd be mostly or entirely squashed for commit.

Attachments

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 15 March 2016 at 21:40, Craig Ringer <craig@2ndquadrant.com> wrote:
Here's a new failover slots rev, addressing the issues Oleksii Kliukin raised and adding a bunch of TAP tests.

Ahem, just found an issue here. I'll need to send another revision.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:

Re: WIP: Failover Slots

From
Oleksii Kliukin
Date:
Hi,

On 17 Mar 2016, at 09:34, Craig Ringer <craig@2ndquadrant.com> wrote:

OK, here's the latest failover slots patch, rebased on top of today's master plus, in order:

- Dirty replication slots when confirm_lsn is changed

- logical decoding timeline following

The full tree is at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want to avoid the fiddling around required to apply the patch series.


<0001-Allow-replication-slots-to-follow-failover.patch><0002-Update-decoding_failover-tests-for-failover-slots.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-Introduce-TAP-recovery-tests-for-failover-slots.patch>


Thank you for the update. I’ve got some rejects when applying the 0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty replication slots when confirm_lsn is changed" changes. I think it should be rebased against master (might be a consequence of the "logical slots follow the timeline" patch being committed).

patch -p1 <~/git/pg/patches/failover-slots/v6/0001-Allow-replication-slots-to-follow-failover.patch
patching file src/backend/access/rmgrdesc/Makefile
Hunk #1 FAILED at 10.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/rmgrdesc/Makefile.rej
patching file src/backend/access/rmgrdesc/replslotdesc.c
patching file src/backend/access/transam/rmgr.c
Hunk #1 succeeded at 25 (offset 1 line).
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 6351 (offset 3 lines).
Hunk #2 succeeded at 8199 (offset 14 lines).
Hunk #3 succeeded at 8645 (offset 14 lines).
Hunk #4 succeeded at 8718 (offset 14 lines).
patching file src/backend/commands/dbcommands.c
patching file src/backend/replication/basebackup.c
patching file src/backend/replication/logical/decode.c
Hunk #1 FAILED at 143.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/replication/logical/decode.c.rej
patching file src/backend/replication/logical/logical.c
patching file src/backend/replication/slot.c
patching file src/backend/replication/slotfuncs.c
patching file src/backend/replication/walsender.c
patching file src/bin/pg_xlogdump/replslotdesc.c
patching file src/bin/pg_xlogdump/rmgrdesc.c
Hunk #1 succeeded at 27 (offset 1 line).
patching file src/include/access/rmgrlist.h
Hunk #1 FAILED at 45.
1 out of 1 hunk FAILED -- saving rejects to file src/include/access/rmgrlist.h.rej
patching file src/include/replication/slot.h
patching file src/include/replication/slot_xlog.h
can't find file to patch at input line 1469
Perhaps you used the wrong -p or --strip option?



--
Oleksii

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 5 April 2016 at 04:19, Oleksii Kliukin <alexk@hintbits.com> wrote:
 
Thank you for the update. I’ve got some rejects when applying the 0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty replication slots when confirm_lsn is changed" changes. I think it should be rebased against master (might be a consequence of the "logical slots follow the timeline" patch being committed).

I'll rebase it on top of the new master after timeline following for logical slots got committed and follow up shortly.

That said, I've marked this patch 'returned with feedback' in the CF. It should possibly actually be 'rejected' given the discussion on the logical decoding timeline following thread, which points heavily at a different approach to solving this problem in 9.7.

That doesn't mean nobody can pick it up if they think it's valuable and want to run with it, but we're very close to feature freeze now.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Simon Riggs
Date:
On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:
 
I'd like to get failover slots in place for 9.6 since they're fairly self-contained and meet an immediate need: allowing replication using slots (physical or logical) to follow a failover event.

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now follow a failover event?

There are many and varied users of logical decoding now and a fix is critically important for 9.6.

Do all decoding plugins need to write their own support code??

Please explain how we cope without this, so if a problem remains we can fix by the freeze.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
On 6 April 2016 at 17:43, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 January 2016 at 14:25, Craig Ringer <craig@2ndquadrant.com> wrote:
 
I'd like to get failover slots in place for 9.6 since they're fairly self-contained and meet an immediate need: allowing replication using slots (physical or logical) to follow a failover event.

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now follow a failover event? 
 

There are many and varied users of logical decoding now and a fix is critically important for 9.6.

I agree with you, but I haven't been able to convince enough people of that.
 
 Do all decoding plugins need to write their own support code??

We'll be able to write a bgworker based extension that handles it by running in the standby. So no, I don't think so.
 
Please explain how we cope without this, so if a problem remains we can fix by the freeze.

The TL;DR: Create a slot on the master to hold catalog_xmin where the replica needs it. Advance it using client or bgworker on replica based on the catalog_xmin of the oldest slot on the replica. Copy slot state from the master using an extension that keeps the slots on the replica reasonably up to date.

All of this is ugly workaround for not having true slot failover support. I'm not going to pretend it's nice, or anything that should go anywhere near core. Petr outlined the approach we want to take for core in 9.7 on the logical timeline following thread.

 
Details:

Logical decoding on a slot can follow timeline switches now - or rather, the xlogreader knows how to follow timeline switches, and the read page callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas, so they're lost on failover promotion.

Nor can a client just create a backup slot for its self on the replica to be ready for failover:

- it has no way to create a new slot at a consistent point on the replica since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since decoding isn't permitted on a replica, so it can't just decode from the replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still needed by the slot's catalog_xmin since catalog_xmin isn't propagated from standby to master.

So we have to help the client out. To do so, we have a function/worker/whatever on the replica that grabs the slot state from the master and copies it to the replica, and we have to hold the master's catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical slot on the master, maintained by a function/bgworker/whatever on the replica. It gets advanced so that its restart_lsn and catalog_xmin are those of the oldest slot on the replica. We can do that by requesting replay on it up to the confirmed_lsn of the lowest confirmed_lsn on the replica. Ugly, but workable. Or we can abuse the infrastructure more deeply by simply setting the catalog_xmin and restart_lsn on the slot directly, but I'd rather not.

Just copying slot state is pretty simple too, as at the C level you can create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which will appear to work fine but will be subtly broken. Since the replica is an identical copy of the master we know that a slot state that was valid on the master at a given xlog insert lsn is also valid on the replica at the same replay lsn, but we've got no reliable way to ensure that when the master updates a slot at LSN A/B the replica also updates the slot at replay of LSN A/B. That's what failover slots did. Without that we need to use some external channel - but there's no way to capture knowledge of "at exactly LSN A/B, master saved a new copy of slot X" since we can't hook ReplicationSlotSave(). At least we *can* now inject slot state updates as generic WAL messages though, so we can ensure they happen at exactly the desired point in replay.

As Andres explained on the timeline following thread it's not safe for the slot on the replica to be behind the state the slot on the master was at the same LSN. At least unless we can protect catalog_xmin via some other mechanism so we can make sure no catalogs still needed by the slots on the replica are vacuumed away. It's vital that the catalog_xmin of any slots on the replica be >= the catalog_xmin the master had for the lowest catalog_xmin of any of its slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice that a slot has changed we'll dump it into xlog via the generic xlog mechanism to be applied on the replica, much like failover slots. The slot update might arrive a bit late on the replica, but that's OK because we're holding catalog_xmin pinned on the master using the dummy slot.

I don't like it, but I don't have anything better for 9.6.

I'd really like to be able to build a more solid proof of concept that tests this with a lagging replica, but -ENOTIME before FF. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Failover Slots

From
Craig Ringer
Date:
A few thoughts on failover slots vs the alternative of pushing catalog_xmin up to the master via a replica's slot and creating independent slots on replicas.


Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are transparent for failover. This is great especially for things that aren't complex replication schemes, that just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail down to the standby's own replicas) because they have to write WAL to advance the slot position. They'd have to send the slot position update "up" to the master then wait to replay it. Not a disaster, though they'd do extra work on reconnect until a restart_lsn update replayed. Would require a whole new feedback-like message on the rep protocol, and couldn't work at all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up approach to allow replay from failover slots on cascading standbys in 9.7+. There'd be no way to propagate the creation of failover slots "down" the replication hierarchy that way, especially to archive standbys like failover slots will do. So it'd be semantically different and couldn't re-use the FS UI. We'd be stuck with failover slots even if we also did the other way later.

+ Will work for recovery of a master PITR-restored up to the latest recovery point


Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica independently of the master. The replica is required to connect to the master via a slot. We send feedback to the master to advance the replica's slot on the master to the confirmed_lsn of the most-behind slot on the replica, therefore pinning master's catalog_xmin where needed. Or we just send a new feedback message type that directly sets a catalog_xmin on the replica's physical slot in the master. Slots are _not_ cloned from master to replica automatically.

- More complicated for applications to use. They have to create a slot on each replica that might be failed over to as well as the master and have to advance all those slots to stop the master from suffering severe catalog bloat. (But see note below).

- Applications must be able to connect to failover-candidate standbys and know where they are, it's not automagically handled via WAL. (But see note below).

- Applications need reconfiguration whenever a standby is rebuilt, moved, etc. (But see note below).

- Cannot work at all for archive-based replication, requires a slot from replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys unsupported at the moment by giving us a way to pin the master's catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6


Note: I think the application complexity issues can be solved - to a degree - by having the replicas run a bgworker based helper that connects to the master and clones the master's slots then advances them automatically.


Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to application developers who'd like to use logical decoding. It makes integration of logical replication with existing HA systems much harder. It means we need really solid, performant, well-tested and mature logical rep based HA before we can take logical rep seriously, which is a long way out given that we can't do decoding of in-progress xacts, ddl, sequences, .... etc etc.

Some kind of physical HA for logical slots is needed and will be needed for some time. Logical rep will be great for selective replication, replication over WAN, filtered/transformed replication etc. Physical rep is great for knowing you'll get exactly the same thing on the replica that you have on the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursuing the catalog_xmin push-up idea; in both cases we don't commit anything in 9.6.

Re: WIP: Failover Slots

From
Simon Riggs
Date:
On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
 ...

Nice summary

Failover slots are optional. And they work on master. 

While the other approach could also work, it will work later and still require a slot on the master.


=> I don't see why having Failover Slots in 9.6 would prevent us from having something else later, if someone else writes it.


We don't need to add this to core. Each plugin can independently write its own failover code. Works, but doesn't seem like the right approach for open source.


=> I think we should add Failover Slots to 9.6.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: Failover Slots

From
Andres Freund
Date:
On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
>  ...
> 
> Nice summary
> 
> Failover slots are optional. And they work on master.
> 
> While the other approach could also work, it will work later and still
> require a slot on the master.
> 
> 
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
> 
> 
> We don't need to add this to core. Each plugin can independently write is
> own failover code. Works, but doesn't seem like the right approach for open
> source.
> 
> 
> => I think we should add Failover Slots to 9.6.

Simon, please don't take this personal; because of the other ongoing
thread.

I don't think this is commit-ready. For one I think this is
architecturally the wrong choice. But even leaving that fact aside, and
considering this a temporary solution (we can't easily remove), there
appears to have been very little code level review (one early from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty raw, and not discussed
* Taking spinlocks dependant on InRecovery() seems like a seriously bad idea
* I doubt that the archive based switches around StartupReplicationSlots
  do what they intend. Afaics that'll not work correctly for basebackups
  taken with -X, without recovery.conf
 

That's from a ~5 minute skim, of one patch in the series.


[1] http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com

Greetings,

Andres Freund



Re: WIP: Failover Slots

От
Simon Riggs
Дата:
On 6 April 2016 at 15:17, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer <craig@2ndquadrant.com> wrote:
>  ...
>
> Nice summary
>
> Failover slots are optional. And they work on master.
>
> While the other approach could also work, it will work later and still
> require a slot on the master.
>
>
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
>
>
> We don't need to add this to core. Each plugin can independently write is
> own failover code. Works, but doesn't seem like the right approach for open
> source.
>
>
> => I think we should add Failover Slots to 9.6.

Simon, please don't take this personally; I say this because of the other ongoing
thread.

Thanks for the review. Rational technical comments are exactly why we are here and they are always welcome.
 
For one I think this is architecturally the wrong choice. 
But even leaving that fact aside, and
considering this a temporary solution (we can't easily remove),

As I observed above, the alternate solution doesn't sound particularly good either, but the main point is that we wouldn't need to remove this one; the two can coexist happily. I would add that I did think of the alternate solution previously as well; this one seemed simpler, which is always key for me in code aimed at robustness.
 
there appears to have been very little code level review

That is potentially fixable. At this point I don't claim it is committable; I only say it is important and the alternate solution is not significantly better, so if the patch can be beaten into shape we should commit it.

I will spend some time on this and see if we have something viable. Which will be posted here for discussion, as would have happened even before our other discussions.
 
Thanks for the points below

(one early from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
  raw, and not discussed
* Taking spinlocks dependant on InRecovery() seems like a seriously bad
  idea
* I doubt that the archive based switches around StartupReplicationSlots
  do what they intend. Afaics that'll not work correctly for basebackups
  taken with -X, without recovery.conf

That's from a ~5 minute skim, of one patch in the series.


[1] http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: Failover Slots

От
Craig Ringer
Дата:
On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:
 
Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
  raw, and not discussed
* Taking spinlocks dependant on InRecovery() seems like a seriously bad
  idea
* I doubt that the archive based switches around StartupReplicationSlots
  do what they intend. Afaics that'll not work correctly for basebackups
  taken with -X, without recovery.conf


Thanks for looking at it. Most of those are my errors. I think this is pretty dead at least for 9.6, so I'm mostly following up in the hopes of learning about a couple of those mistakes. 

Good catch with -X without a recovery.conf. Since it wouldn't be recognised as a promotion and wouldn't increment the timeline, copied non-failover slots wouldn't get removed. I've never liked that logic at all anyway, I just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug support only." So that's just a dumb mistake on my part, and I should've added "alreadyLocked" parameters. (Ugly, but works).
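
For illustration, here is a minimal sketch of the "alreadyLocked" approach (the function name, and which lock is involved, are invented for illustration; this is not code from the patch):

#include "postgres.h"
#include "replication/slot.h"
#include "storage/lwlock.h"

/*
 * Sketch: the caller states whether it already holds the lock, instead of
 * the callee guessing via LWLockHeldByMe(), which is meant for assertions.
 */
static void
save_slot_guts(ReplicationSlot *slot, bool already_locked)
{
    if (!already_locked)
        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    /* ... serialize slot->data and fsync it here ... */

    if (!already_locked)
        LWLockRelease(ReplicationSlotControlLock);
}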

But why would it be a bad idea to conditionally take a code path that acquires a spinlock based on whether RecoveryInProgress()? It's not testing RecoveryInProgress() more than once and doing the acquire and release based on separate tests, which would be a problem. I don't really get the problem with:

if (!RecoveryInProgress())
{
    /* first check whether there's something to write out */
    SpinLockAcquire(&slot->mutex);
    was_dirty = slot->dirty;
    slot->just_dirtied = false;
    SpinLockRelease(&slot->mutex);

    /* and don't do anything if there's nothing to write */
    if (!was_dirty)
        return;
}
... though I think what I really should've done there is just always dirty the slot in the redo functions.
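
(A rough sketch of "always dirty the slot in the redo functions", using the existing dirty/just_dirtied fields of ReplicationSlot; illustrative only:

SpinLockAcquire(&slot->mutex);
slot->just_dirtied = true;
slot->dirty = true;
SpinLockRelease(&slot->mutex);

so the save path above can keep its normal dirty check, in recovery or not.)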

 


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

От
Thom Brown
Дата:
On 8 April 2016 at 07:13, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>>   raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>>   idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>>   do what they intend. Afaics that'll not work correctly for basebackups
>>   taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> SpinLockRelease(&slot->mutex);
>
>         /* and don't do anything if there's nothing to write */
>         if (!was_dirty)
>             return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

Thanks

Thom



Re: [HACKERS] WIP: Failover Slots

От
Craig Ringer
Дата:
On 26 July 2017 at 00:16, Thom Brown <thom@linux.com> wrote:
On 8 April 2016 at 07:13, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2016 at 22:17, Andres Freund <andres@anarazel.de> wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>>   raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>>   idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>>   do what they intend. Afaics that'll not work correctly for basebackups
>>   taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> SpinLockRelease(&slot->mutex);
>
>         /* and don't do anything if there's nothing to write */
>         if (!was_dirty)
>             return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

No. The whole approach seems to have been bounced from core. I don't agree and continue to think this functionality is desirable but I don't get to make that call.

If time permits I will attempt to update the logical decoding on standby patchset instead, and if possible add support for fast-forward logical decoding that does the minimum required to correctly maintain a slot's catalog_xmin and restart_lsn when advanced. But this won't be usable directly for failover like failover slots are, it'll require each application to keep track of standbys and maintain slots on them too. Or we'll need some kind of extension/helper to sync slot state.

In the mean time, 2ndQuadrant maintains an on-disk-compatible version of failover slots that's available for support customers.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

От
Robert Haas
Дата:
On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> No. The whole approach seems to have been bounced from core. I don't agree
> and continue to think this functionality is desirable but I don't get to
> make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core.  In a review of this
thread I don't see anyone saying otherwise.  The debate has really
been about the right way of implementing that.  Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above.  It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Failover Slots

От
Craig Ringer
Дата:
On 3 August 2017 at 04:35, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> No. The whole approach seems to have been bounced from core. I don't agree
> and continue to think this functionality is desirable but I don't get to
> make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core.  In a review of this
thread I don't see anyone saying otherwise.  The debate has really
been about the right way of implementing that.  Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.

Wouldn't that immediately exclude use for PITR and snapshot recovery? I have people right now who want the ability to promote a PITR-recovered snapshot into the place of a logical replication master and have downstream peers replay from it. It's more complex than that, as there's a resync process required to recover changes the failed node had sent to other peers but that aren't available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two because PostgreSQL can't preserve slots without a running, live replica? Your SAN snapshots + WAL archiving have been fine for everything else so far.

Requiring live replication connections could also be an issue for service interruptions, surely?  Unless you persist needed knowledge in the physical replication slot used by the standby to master connection, so the master can tell the difference between "downstream went away for while but will come back" and "downstream is gone forever, toss out its resources."

That's exactly what the catalog_xmin hot_standby_feedback patches in Pg10 do, but they can only tell the master about the oldest resources needed by any existing slot on the replica. Not which slots. And they have the same issues with needing a live, running replica.

Also, what about cascading? Lots of "pull" model designs I've looked at tend to fall down in cascaded environments. For that matter so do failover slots, but only for the narrower restriction of not being able to actually decode from a failover-enabled slot on a standby; they still work fine in terms of cascading down to leaf nodes.

- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

That's pretty much what I expect to have to do for clients to work on unpatched Pg10, probably using a separate bgworker and normal libpq connections to the upstream since we don't have hooks to extend the walsender/walreceiver.

It can work now that the catalog_xmin hot_standby_feedback patches are in, but it'd require some low-level slot state setting that I know Andres is not a fan of. So I expect to carry on relying on an out-of-tree failover slots patch for Pg 10.

 
So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above.  It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

Right. So the standby must be running and in active communication. It needs some way to know the master has confirmed slot creation and it can rely on the slot's resources really being reserved by the master. That turns out to be quite hard, per the decoding on standby patches. There needs to be some way to tell the master a standby has gone away forever and to drop its dependent slots, so you're not stuck wondering "is slot xxyz from standby abc that we lost in that crash?". Standbys need to cope with having created a slot, only to find out there's a name collision with master.

For all those reasons, I just extended hot_standby_feedback to report catalog_xmin separately to upstreams instead, so the existing physical slot serves all these needs. And it's part of the picture, but there's no way to get slot position change info from the master back down onto the replicas so the replicas can advance any of their own slots and, via feedback, free up master resources. That's where the bgworker hack to query pg_replication_slots comes in. Seems complex, full of restrictions, and fragile to me compared to just expecting the master to do it.

The only objection I personally understood and accepted re failover slots was that it'd be impossible to create a failover slot on a standby and have that standby "sub-tree" support failover to leaf nodes. Which is true, but instead we have nothing, and no viable-looking roadmap toward anything users can benefit from. So I don't think that's the worst restriction in the world.

I do not understand why logical replication slots are exempt from our usual policy that anything that works on the master should be expected to work on failover to a standby. Is there anything persistent across crash for which that's not the case, except grandfathered-in hash indexes? We're hardly going to say "hey, it's ok to forget about prepared xacts when you fail over to a standby" yet this problem with failover and slots in logical decoding and replication is the same sort of showstopper issue for users who use the functionality.

In the medium term I've given up making progress with getting something simple and usable into user hands on this. A tweaked version of failover slots is being carried as an out-of-tree on-disk-format-compatible patch instead, and it's meeting customer needs very well. I've done my dash here and moved on to other things where I can make more progress.

I'd like to continue working on logical decoding on standby support for pg11 too, but even if we can get that in place it'll only work for reachable, online standbys. Every application that uses logical decoding will have to maintain a directory of standbys (which it has no way to ask the master for) and advance their slots via extra walsender connections. They'll do a bunch of unnecessary work decoding WAL they don't need to just to throw the data away. It won't help for PITR and snapshot use cases at all. So for now I'm not able to allocate much priority to that.

I'd love to get failover slots in, I still think it's the simplest and best way to do what users need. It doesn't stop us progressing with decoding on standby or paint us into any corners.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

От
Robert Haas
Дата:
On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> - When a standby connects to a master, it can optionally supply a list
>> of slot names that it cares about.
>
> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> people right now who want the ability to promote a PITR-recovered snapshot
> into place of a logical replication master and have downstream peers replay
> from it. It's more complex than that, as there's a resync process required
> to recover changes the failed node had sent to other peers but isn't
> available in the WAL archive, but that's the gist.
>
> If you have a 5TB database do you want to run an extra replica or two
> because PostgreSQL can't preserve slots without a running, live replica?
> Your SAN snapshots + WAL archiving have been fine for everything else so
> far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately.  I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is.  How would
you imagine using any version of this feature in a PITR scenario?  If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.
Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point?  But I
can't quite follow what you're thinking about.  Can you explain
further?

> Requiring live replication connections could also be an issue for service
> interruptions, surely?  Unless you persist needed knowledge in the physical
> replication slot used by the standby to master connection, so the master can
> tell the difference between "downstream went away for while but will come
> back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot.  If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately.  When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

> Also, what about cascading? Lots of "pull" model designs I've looked at tend
> to fall down in cascaded environments. For that matter so do failover slots,
> but only for the narrower restriction of not being able to actually decode
> from a failover-enabled slot on a standby, they still work fine in terms of
> cascading down to leaf nodes.

I don't see the problem.  The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby.  Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master.  Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like.  With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.

I don't want to come on too strong here.  I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge.  If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons.  I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok".  Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Failover Slots

От
Craig Ringer
Дата:


On 9 August 2017 at 23:42, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> - When a standby connects to a master, it can optionally supply a list
>> of slot names that it cares about.
>
> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> people right now who want the ability to promote a PITR-recovered snapshot
> into place of a logical replication master and have downstream peers replay
> from it. It's more complex than that, as there's a resync process required
> to recover changes the failed node had sent to other peers but isn't
> available in the WAL archive, but that's the gist.
>
> If you have a 5TB database do you want to run an extra replica or two
> because PostgreSQL can't preserve slots without a running, live replica?
> Your SAN snapshots + WAL archiving have been fine for everything else so
> far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately.  I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is.  How would
you imagine using any version of this feature in a PITR scenario?  If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.


I've realised that it's possible to work around it in app-space anyway. You create a new slot on a node before you snapshot it, and you don't drop this slot until you discard the snapshot. The existence of this slot ensures that any WAL generated by the node (and replayed by PITR after restore) cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or have some other safeguard in place, which we need for logical decoding on standby to be safe anyway, then we can fail gracefully if the user does something dumb.

So no need to care about this.

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example, in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin filter for "originating from me", for xacts from the divergence lsn up to the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume replication

To enable that to be possible, since we can't rewind slots once confirmed advanced, maintain a backup slot on the peer corresponding to the point-in-time at which a snapshot was taken.

For most other situations there is little benefit vs just re-creating the slot before you permit user-initiated write xacts to begin on the restored node.

I can accept an argument that "we" as pgsql-hackers do not consider this something worth caring about, should that be the case. It's niche enough that you could argue it doesn't have to be supportable in stock postgres.


Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point?

That can work, but must be done in lock-step. You have to pause apply on both ends for long enough to snapshot both, otherwise the replication origins on one end get out of sync with the slots on another.

Interesting, but I really hope nobody's going to need to do it.
 
But I
can't quite follow what you're thinking about.  Can you explain
further?

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes quite hard to see the outside perspective, so thanks for your patience.
 

> Requiring live replication connections could also be an issue for service
> interruptions, surely?  Unless you persist needed knowledge in the physical
> replication slot used by the standby to master connection, so the master can
> tell the difference between "downstream went away for while but will come
> back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot.  If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately.  When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

OK, so you're envisioning that every slot on a downstream has a mirror slot on the upstream, and that is how the master retains the needed resources.
 
> Also, what about cascading? Lots of "pull" model designs I've looked at tend
> to fall down in cascaded environments. For that matter so do failover slots,
> but only for the narrower restriction of not being able to actually decode
> from a failover-enabled slot on a standby, they still work fine in terms of
> cascading down to leaf nodes.

I don't see the problem.  The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby.  Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1 correspondence, and using globally unique slot names to manage state.

If slot names collide, you presumably fail with "er, don't do that then". Or scramble data horribly. Both of which we certainly have precedent for in Pg (see, e.g., what happens if two snapshots of the same node are in archive recovery and promote to the same timeline, then start archiving to the same destination...). So not a showstopper.

I'm pretty OK with that.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master.

Yes, that's my main hesitation with the current failover slots, as mentioned in the prior message.
 
Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like.  With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.

Definitely agreed. 

Different standbys don't know about each other, so it's the user's job to ensure uniqueness, using slot name as a key.


I don't want to come on too strong here.  I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge. 

The flip side is that I've also been staring at the problem, on and off, for WAY too long. So other perspectives can be really valuable.

 
If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons.  I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok".  Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

I mostly agree there. We could have relatively easily converted WAL-based failover slots to something else in a major version bump, and that's why I wanted to get them in place for 9.6 and then later for pg10. Because people were (and are) constantly asking me and others who work on logical replication tools why it doesn't work, and a 90% solution that doesn't paint us into a corner seemed just fine.

I'm quite happy to find a better one. But I cannot spend a lot of time writing something to have it completely knocked back because the scope just got increased again and now it has to do more, so it needs another rewrite.

So, how should this look if we're using the streaming rep protocol?

How about:

A "failover slot" is identified by a field in the slot struct and exposed in pg_replication_slots. It can be null (not a failover slot). It can indicate that the slot was created locally and is "owned" by this node; all downstreams should mirror it. It can also indicate that it is a mirror of an upstream, in which case clients may not replay from it until it's promoted to an owned slot and ceases to be mirrored. Attempts to replay from a mirrored slot just ERROR and will do so even once decoding on standby is supported.

This promotion happens automatically if a standby is promoted to a master, and can also be done manually via sql function call or walsender command to allow for an internal promotion within a cascading replica chain.

When a replica connects to an upstream it asks via a new walsender msg "send me the state of all your failover slots". Any local mirror slots are updated. If they are not listed by the upstream they are known deleted, and the mirror slots are deleted on the downstream.

The upstream walsender then sends periodic slot state updates while connected, so replicas can advance their mirror slots, and in turn send hot_standby_feedback that gets applied to the physical replication slot used by the standby, freeing resources held for the slots on the master.
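
To make the shape of those periodic updates concrete, the per-slot payload would presumably carry something like this (struct and field names are hypothetical, purely for illustration; no such protocol message exists today):

#include "postgres.h"
#include "access/xlogdefs.h"

/* Hypothetical walsender sub-protocol payload, one entry per failover slot. */
typedef struct FailoverSlotUpdate
{
    char          slotname[NAMEDATALEN];   /* globally unique slot name */
    XLogRecPtr    restart_lsn;             /* oldest WAL the slot still needs */
    XLogRecPtr    confirmed_flush;         /* position confirmed by the decoding client */
    TransactionId catalog_xmin;            /* oldest catalog rows the slot still needs */
} FailoverSlotUpdate;

The replica would apply each update to its local mirror slot, and its normal hot_standby_feedback on the physical slot it uses then lets the upstream release WAL and catalog rows the mirrored slots no longer need.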


There's one big hole left here. When we create a slot on a cascading leaf or inner node, it takes time for hot_standby_feedback to propagate the needed catalog_xmin "up" the chain. Until the master has set the needed catalog_xmin on the physical slot for the closest branch, the inner node's slot's catalog_xmin can only be tentative pending confirmation. That's what a whole bunch of gruesomeness in the decoding on standby patch was about.

One possible solution to this is to also mirror slots "up", as you alluded to: when you create an "owned" slot on a replica, it tells the master at connect time / slot creation time "I have this slot X, please copy it up the tree". The slot gets copied "up" to the master via cascading layers with a different failover slot type indicating it's an up-mirror. Decoding clients aren't allowed to replay from an up-mirror slot and it cannot be promoted like a down-mirror slot can, it's only there for resource retention. A node knows its owned slot is safe to actually use, and is fully created, when it sees the walsender report it in the list of failover slots from the master during a slot state update.

This imposes some restrictions:

* failover slot names must be globally unique or things go "kaboom"
* if a replica goes away, its up-mirror slots stay dangling until the admin manually cleans them up

Tolerable, IMO. But we could fix the latter by requiring that failover slots only be enabled when the replica uses a physical slot to talk to the upstream. The up-mirror failover slots then get coupled to the physical slot by an extra field in the slot struct holding the name of the owning physical slot. Dropping that physical slot cascade-drops all up-mirror slots automatically. Admins are prevented from dropping up-mirror slots manually, which protects against screwups.

We could even fix the naming, maybe, with some kind of qualified naming based on the physical slot, but it's not worth the complexity.

It sounds a bit more complex than your sketch, but I think the 4 failover-kinds are necessary to support this. We'll have:

* not a failover slot, purely local

* a failover slot owned by this node (will be usable for decoding on standby once supported)

* an up-mirror slot, not promoteable, resource retention only, linked to a physical slot for a given replica

* a down-mirror slot, promoteable, not linked to a physical slot; this is the true "failover slot"'s representation on a replica.
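
As a rough sketch, those four kinds could be represented by a single field on the persistent slot data, along these lines (enum name and values invented for illustration, not taken from the patch):

/* Sketch: possible representation of the four failover kinds above. */
typedef enum FailoverSlotKind
{
    FS_NONE,        /* purely local slot, not mirrored anywhere */
    FS_OWNED,       /* created here; mirrored by downstreams */
    FS_UP_MIRROR,   /* resource retention only; tied to a physical slot */
    FS_DOWN_MIRROR  /* copy of an upstream slot; promoteable on failover */
} FailoverSlotKind;

Up-mirror slots would additionally record the name of the physical slot they are coupled to, so dropping that physical slot can cascade-drop them as described above.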

Thoughts? Feels pretty viable to me.

Thanks for the new perspective.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

От
Robert Haas
Дата:
On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining.  (Whether I should instead be
imagining something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that.  I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are).  Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally.  I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up.  Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a  field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slots). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.

> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe.  It
also seems really complicated.  Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Failover Slots

От
Craig Ringer
Дата:
On 11 August 2017 at 01:02, Robert Haas <robertmhaas@gmail.com> wrote:
 
Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

Agreed, patches materializing doesn't mean they should be committed, and there wasn't prior design discussion on this.

It can be hard to elicit it without a patch, but clearly not always; we're doing a good job of it here.
 
> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

Yes, it'll also need to send a list of its local owned and up-mirrored failover slots to the upstream so the upstream can create them or update their state.

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe.  It
also seems really complicated.  Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).


It probably sounds more complex than it is. A slot is created tentatively and marked not ready to actually use yet when created on a standby. It flows "up" to the master where it's created as permanent/ready. The permanent/ready state flows back down to the creator. 

When we see a temp slot become permanent we copy the restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case the master had to advance them from our tentative values when it created the slot. After that, slot state updates only flow "out" from the owner: up the tree for up-mirror slots, down the tree for down-mirror slots.

Diagram may help. I focused only on the logical slot created on standby case, since I think we're happy with the rest already and I don't want to complicate it.

GMail will probably HTMLize this, sorry:


                          Phys rep          Phys rep
                          using phys        using
                          slot "B"          phys slot "C"
                +-------+         +--------+         +-------+
 T              |  A    <^--------+ B      <---------+ C     |
 I              |       |         |        |         |       |
 M              +-------+         +--------+         +-------+
 E                 |                  |                  |
 |                 |                  |                  |CREATEs
 |                 |                  |                  |logical slot X
 v                 |                  |                  |("owned")
                   |                  |                  |as temp slot
                   |                  +<-----------------+
                   |                  |Creates upmirror  |
                   |                  |slot "X" linked   |
                   |                  |to phys slot "C"  |
                   |                  |marked temp       |
                   | <----------------+                  |
                   |Creates upmirror  |                  | <--------------------------+   +-----------------+
                   |slot "X" linked   |                  |   Attempt to decode from "X"   |                 |
                   |to phys slot "B"  |                  |                                | CLIENT          |
                   |marked permanent  |                  |  +------------------------->   |                 |
                   +----------------> |                  |   ERROR: slot X still being    +-----------------+
                   |                  |Sees upmirror     |   created on master, not ready
                   |                  |slot "X" in       |
                   |                  |list from "A",    |
                   |                  |marks it          |
                   |                  |permanent and     |
                   |                  |copies state      |
                   |                  +----------------> |
                   |                  |                  |Sees upmirror slot
                   |                  |                  |"X" on "B" got marked
                   |                  |                  |permanent (because it
                   |                  |                  |appears in B's slot
                   |                  |                  |listings),
                   |                  |                  |marks permanent on C.
                   |                  |                  |Copies state.
                   |                  |                  |
                   |                  |                  |Slot "X" now persistent
                   |                  |                  |and (when decoding on standby
                   |                  |                  |supported) can be used for decoding
                   |                  |                  |on standby.
                   +                  +                  +



To actually use the slot once decoding on standby is supported: a decoding client on "C" can consume xacts and cause slot "X" to advance catalog_xmin, confirmed_flush_lsn, etc. walreceiver on "C" will tell walsender on "B" about the new slot state, and it'll get synced up-tree, then B will tell A.

Since slot is already marked permanent, state won't get copied back down-tree, that only happens once when slot is first fully created on master.

Some node "D" can exist as a phys rep of "C". If C fails and is replaced with D, admin can promote the down-mirror slot on "D" to an owned slot.


Make sense?
 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

От
Andres Freund
Дата:
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.


> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above.  It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.


I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axis I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
  Craig is.
 


Greetings,

Andres Freund



Re: [HACKERS] WIP: Failover Slots

От
Craig Ringer
Дата:
On 12 August 2017 at 08:03, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.

How could it be incompatible? The idea here is to make physical failover transparent to logical decoding clients. That's not meant to sound confrontational; I mean that I can't personally see any way it would be, and I could use your ideas.

I understand that it might be *different* and you'd like to see more closely aligned approaches that work more similarly. For which we first need to know more clearly how logical failover will look. But it's hard not to also see this as delaying and blocking until your preferred approach via pure logical rep and logical failover gets in, and physical failover can be dismissed with "we don't need that anymore". I'm sure that's not your intent, I just struggle not to see it that way anyway when there's always another reason not to proceed to solve this problem because of a loosely related development effort on another problem.

I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axis I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?

Personally, I don't care either way.
 
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)

Interesting. I haven't fully understood this, but think I see what you're getting at.

As outlined in the prior mail, I'd like to have working PITR with logical slots but think it's pretty niche as it can't work usefully without plenty of co-operation from the rest of the logical replication software in use. You can't just restore and resume normal operations. So I don't think it's worth making it a priority.

It's possible to make PITR safe with slots by blocking further advance of catalog_xmin on the running master for the life of the PITR base backup using a slot for retention. There's plenty of room for operator error until/unless we add something like catalog_xmin advance xlog'ing, but it can be done now with external tools if you're careful. Details in the prior mail.

I don't think PITR for logical slots is important given there's a workaround and it's not simple to actually do anything with it if you have it.
 
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.

They're different things to me, but I think you're asking "to what extent should failover slots functionality be implemented strictly on top of decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client can expect a slot it creates on a master (or physical streaming replica doing decoding on standby) to continue to exist. The client can ignore physical HA and promotions of the master, which can continue to be managed using normal postgres tools. It's the same as, say, an XA transaction manager expecting that if your master dies and you fail over to a standby, the TM shouldn't have to have been doing special housekeeping on the promotion candidate before promotion in order for 2PC to continue to work. It Just Works.

Logical decoding on standby is useful with, or without, failover slots, as you can use it to extract data from a replica, and now that decoding timeline following is in, a decoding connection on a replica will survive promotion to master.

But in addition to its main purpose of allowing logical decoding from a standby server to offload work, it can be used to implement client-managed support for failover to physical replicas. For this, the client must have an inventory of promotion-candidates of the master and their connstrings so it can maintain slots on them too. The client must be able to connect to all promotion-candidates and advance their slots via decoding along with the master slots it's actually replaying from. If a client isn't "told" about a promotion candidate, decoding will break when we fail over. If a client cannot connect to a promotion candidate, catalog_xmin will fall behind on master until the replica is discarded (and its physical slot dropped) or the client regains access. Every different logical decoding client application must implement all this logic and management separately.

It may be possible to implement failover-slots-like functionality on top of decoding on standby in an app-transparent way, by having the replica monitor slot states on the master and self-advance its own slots via a loopback decoding connection. Or the master could maintain an inventory of replicas and make decoding connections to them, advancing their slots after the master's slots are advanced by an app. But either way, why would we want to do this? Why actually decode WAL and use the logical decoding machinery when we *know* the state of the system, because only the master is writeable?

The way I see it, to provide failover slots functionality we'd land up with something quite similar to what Robert and I just discussed, but the slot advance would be implemented using decoding (on standby) instead of directly setting slot state. What benefit does that offer?

I don't want to block failover slots on decoding on standby just because decoding on standby would be nice to have.
 
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
  Craig is.


We don't have logical failover, let alone mature, tested logical failover that covers most of Pg's available functionality. Nor much of a design for it AFAIK. There is no logical failover to diverge from, and I don't want to block physical failover support on that.

But, putting that aside to look at the details of how logical failover might work, what sort of commonality do you expect to see? Physical failover is by WAL replication using archive recovery/streaming, managed via recovery.conf, with unilateral promotion by trigger file/command. The admin is expected to ensure that any clients and cascading replicas get redirected to the promoted node and the old one is fenced - and we don't care if that's done by IP redirection or connstring updates or what. Per the proposal Robert and I discussed, logical slots will be managed by having the walsender/walreceiver exchange slot state information that cascades up/down the replication tree via mirror slot creations.

How's logical replica promotion going to work? Here's one possible way, of many: the promotion-candidate logical replica consumes an unfiltered xact stream that contains changes from all nodes, not just its immediate upstream. Downstreams of the master can maintain direct connections to the promotion candidate and manage their own slots directly, sending flush confirmations for slots on the promotion candidate as they see their decoding sessions on the replica decode commits for LSNs the clients sent flush confirmations to the master for. On promotion, the master's downstreams would be reconfigured to connect to the node-id of the newly promoted master and would begin decoding from it in catchup mode, where they receive the commits from the old master via the new master, until they reach the new master's end-of-wal at time of promotion. With some tweaks like a logical WAL message recording the moment of promotion, it's not that different to the client-managed physical failover model.

It can also be converted to a more transparent failover-slots like model by having the promotion candidate physical replica clone slots from its upstream, but advance them by loopback decoding - not necessarily actual network loopback. It'd use a filter that discards data and only sees the commit XIDs + LSNs. It'd send confirmations on the slots when the local slot processed a commit for which the upstream's copy of the slot had a confirmation for that lsn. On promotion, replicas would connect with new replorigins (0) and let decoding start at the slot positions on the replica. The master->replica slot state reporting can be done via the walsender too, just as proposed for the physical case, though no replica->master reporting would be needed for logical failover.

So despite my initial expectations they can be moderately similar in broad structure. But I don't think there's going to be much actual code overlap beyond minor things like both wanting a way to query slot state on the upstream. Both *could* use decoding on standby to advance slot positions, but for the physical case that's just a slower (and unfinished) way to do what we already have, whereas it's necessary for logical failover.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] WIP: Failover Slots

From
Craig Ringer
Date:


On 14 August 2017 at 11:56, Craig Ringer <craig@2ndquadrant.com> wrote:

I don't want to block failover slots on decoding on standby just because decoding on standby would be nice to have.

However, during discussion with Thomas Munro a point has come up that does block failover slots as currently envisioned: silent timeline divergence. It's a solid reason why the current design and implementation are insufficient to solve the problem. This issue exists both with the original failover slots and with the model Robert and I were discussing.

Say a decoding client has replayed from master up to commit of xid 42 at 1/1000 and confirmed flush, then a failover slots standby of the master is promoted. The standby has only received WAL from the failed master up to 1/500 with most recent xid 20. Now the standby does some other new xacts, pushing xid up to 30 at 1/1000 then continuing to insert until xid 50 at lsn 1/2000.

Then the logical client reconnects. It will connect to the failover slot fine and start replay, but it'll ask for replay to start at 1/1000. The standby will happily fast-forward the slot (as it should) and start replay after 1/1000.

But now we have silent divergence in timelines. The logical replica has received and committed xacts 20...42 at lsn 1/500 through 1/1000, but these are not present on the promoted master. And the replica has skipped over the new-master's xids 20...30 with lsns 1/500 through 1/1000, so they're present on the new master but not the replica.

IMO, this shows that not including the timeline in replication origins was a bit of a mistake, since we'd trivially detect this if they were included - but it's a bit late now.  And anyway, detection would just mean logical rep would break, which doesn't help much.

The simplest, but rather limited, fix is to require that failover candidates be in synchronous_standby_names, and to delay ReorderBufferCommit sending the actual commit message until all peers in s_s_n confirm flush of the commit lsn. But that's not much good if you want sync rep for your logical connections too, and it's generally a hack.
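
For illustration, the configuration half of that limited fix is trivial ('candidate1' is an assumed application_name; the hold-back in ReorderBufferCommit would still need a server-side change):

  ALTER SYSTEM SET synchronous_standby_names = 'candidate1';
  SELECT pg_reload_conf();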

A more general solution requires that masters be told which peers are failover candidates, so they can ensure ordering between logical decoding and physical failover candidates. Which effectively adds another kind of sync rep, where we do "wait for physical failover candidates to flush, and only then allow logical decoding". This actually seems pretty practical with the design Robert and I discussed, but it's definitely an expansion in scope.

Alternately, we could require the decoding clients to keep an eye on the flush/replay positions of all failover candidates and delay commit+confirm of decoded xacts until the upstream's failover candidates have received and flushed up to that lsn. That starts to look a lot like a decoding-on-standby based model for logical failover, where the downstream maintains slots on each failover candidate upstream.
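
A rough sketch of that client-side check, assuming the failover candidates' application_names are known to the client (on 9.6 the column is flush_location rather than flush_lsn):

  -- The client would send a flush confirmation for a commit decoded at
  -- :commit_lsn only once every failover candidate has flushed past it.
  -- A disconnected candidate simply has no row here, so the client also
  -- needs to check it sees the expected number of rows:
  SELECT application_name, flush_lsn >= :'commit_lsn'::pg_lsn AS flushed
  FROM pg_stat_replication
  WHERE application_name IN ('candidate1', 'candidate2');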

So yeah. More work needed here. Even if we suddenly decided the original failover slots model was OK, it's not sufficient to fully solve the problem.

(It's something I'd thought about for BDR failover, but never applied to failover slots: the problem of detecting or preventing divergence when the logical client is ahead of physical receive at the time the physical standby is promoted.)

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services