Discussion: Physical replication slot advance is not persistent


Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
Hi Hackers,

I have accidentally noticed that pg_replication_slot_advance only 
changes the in-memory state of the slot when its type is physical. The new 
value does not survive a restart.

Reproduction steps:

1) Create a new slot and remember its restart_lsn

SELECT pg_create_physical_replication_slot('slot1', true);
SELECT * from pg_replication_slots;

2) Generate some dummy WAL

CHECKPOINT;
SELECT pg_switch_wal();
CHECKPOINT;
SELECT pg_switch_wal();

3) Advance the slot to the value of pg_current_wal_insert_lsn()

SELECT pg_replication_slot_advance('slot1', '0/160001A0');

4) Check that restart_lsn has been updated

SELECT * from pg_replication_slots;

5) Restart the server and check restart_lsn again. It will be back to the 
same value as in step 1.


I dug into the code, and it happens because of this if statement:

     /* Update the on disk state when lsn was updated. */
     if (XLogRecPtrIsInvalid(endlsn))
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }

Actually, endlsn is always a valid LSN after the execution of 
replication slot advance guts. It works for logical slots only by 
chance, since there is an implicit ReplicationSlotMarkDirty() call 
inside LogicalConfirmReceivedLocation.

Attached is a small patch, which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
and now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr if
no-op.

What do you think?


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. CCed Simon and Michael, as they were the last to seriously touch the 
pg_replication_slot_advance code.


Attachments

Re: Physical replication slot advance is not persistent

From
Kyotaro Horiguchi
Date:
At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in
> I dug into the code, and it happens because of this if statement:
>
>     /* Update the on disk state when lsn was updated. */
>     if (XLogRecPtrIsInvalid(endlsn))
>     {
>         ReplicationSlotMarkDirty();
>         ReplicationSlotsComputeRequiredXmin(false);
>         ReplicationSlotsComputeRequiredLSN();
>         ReplicationSlotSave();
>     }

Yes, it seems just broken.

> Attached is a small patch, which fixes this bug. I have tried to
> stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
> and now pg_logical_replication_slot_advance and
> pg_physical_replication_slot_advance return InvalidXLogRecPtr if
> no-op.
>
> What do you think?

I think we shouldn't change the definition of
pg_*_replication_slot_advance since the result is user-facing.

The functions return an invalid value only when the slot had an
invalid value and failed to move the position. I think that happens
only for uninitialized slots.

Anyway, what we should do there is dirty the slot when the operation
can be assumed to have succeeded.

As a result, I think what is needed there is just checking whether the
returned lsn is equal to or larger than moveto. Doesn't the following
change work?

-    if (XLogRecPtrIsInvalid(endlsn))
+    if (moveto <= endlsn)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
On 25.12.2019 07:03, Kyotaro Horiguchi wrote:
> At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in
>> I dug into the code, and it happens because of this if statement:
>>
>>      /* Update the on disk state when lsn was updated. */
>>      if (XLogRecPtrIsInvalid(endlsn))
>>      {
>>          ReplicationSlotMarkDirty();
>>          ReplicationSlotsComputeRequiredXmin(false);
>>          ReplicationSlotsComputeRequiredLSN();
>>          ReplicationSlotSave();
>>      }
> Yes, it seems just broken.
>
>> Attached is a small patch, which fixes this bug. I have tried to
>> stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
>> and now pg_logical_replication_slot_advance and
>> pg_physical_replication_slot_advance return InvalidXLogRecPtr if
>> no-op.
>>
>> What do you think?
> I think we shouldn't change the definition of
> pg_*_replication_slot_advance since the result is user-facing.

Yes, that was my main concern too. OK.

> The functions return an invalid value only when the slot had an
> invalid value and failed to move the position. I think that happens
> only for uninitialized slots.
>
> Anyway, what we should do there is dirty the slot when the operation
> can be assumed to have succeeded.
>
> As a result, I think what is needed there is just checking whether the
> returned lsn is equal to or larger than moveto. Doesn't the following
> change work?
>
> -    if (XLogRecPtrIsInvalid(endlsn))
> +    if (moveto <= endlsn)

Yep, it helps with physical replication slot persistence after advance, 
but the whole validation (moveto <= endlsn) does not make sense to me. 
The value of moveto should be >= minlsn == confirmed_flush / 
restart_lsn, while endlsn == retlsn is also always initialized with 
confirmed_flush / restart_lsn. Thus, your condition seems to be true in 
any case, even in the no-op case, which is what we intended to catch.

Actually, if we do not want to change pg_*_replication_slot_advance, we 
can just add a straightforward validation that either confirmed_flush or 
restart_lsn changed after the slot advance guts have run. It will be a 
little bit bulky, but much clearer, and it will never be affected by 
changes to the pg_*_replication_slot_advance logic.
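To illustrate what I mean by bulky, a rough sketch inside 
pg_replication_slot_advance could look like this (names follow the existing 
slotfuncs.c code; this is only a sketch, not an actual patch):

     XLogRecPtr  prev_restart_lsn = MyReplicationSlot->data.restart_lsn;
     XLogRecPtr  prev_confirmed_flush = MyReplicationSlot->data.confirmed_flush;

     if (OidIsValid(MyReplicationSlot->data.database))
         endlsn = pg_logical_replication_slot_advance(moveto);
     else
         endlsn = pg_physical_replication_slot_advance(moveto);

     /* Flush the slot only if its on-disk position actually changed. */
     if (MyReplicationSlot->data.restart_lsn != prev_restart_lsn ||
         MyReplicationSlot->data.confirmed_flush != prev_confirmed_flush)
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }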


Another weird part I have found is this assignment inside 
pg_logical_replication_slot_advance:

/* Initialize our return value in case we don't do anything */
retlsn = MyReplicationSlot->data.confirmed_flush;

It looks redundant, since later we do the same assignment, which should 
be reachable in any case.

I will recheck everything again and try to come up with something during 
this week.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
On 25.12.2019 16:51, Alexey Kondratov wrote:
> On 25.12.2019 07:03, Kyotaro Horiguchi wrote:
>> As a result, I think what is needed there is just checking whether the
>> returned lsn is equal to or larger than moveto. Doesn't the following
>> change work?
>>
>> -    if (XLogRecPtrIsInvalid(endlsn))
>> +    if (moveto <= endlsn)
>
> Yep, it helps with physical replication slot persistence after 
> advance, but the whole validation (moveto <= endlsn) does not make 
> sense to me. The value of moveto should be >= minlsn == 
> confirmed_flush / restart_lsn, while endlsn == retlsn is also always 
> initialized with confirmed_flush / restart_lsn. Thus, your condition 
> seems to be true in any case, even in the no-op case, which is what we 
> intended to catch.
>
> I will recheck everything again and try to come up with something 
> during this week.

If I understand it correctly, we already keep the previous slot position in 
minlsn, so we just have to compare endlsn with minlsn and treat 
endlsn <= minlsn as a no-op, without flushing the slot state.
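In pg_replication_slot_advance that boils down to something like the 
following (a sketch only; the attached patch is the authoritative version):

     /* Update the on disk state only when the slot actually moved forward. */
     if (endlsn > minlsn)
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }
     else
         elog(DEBUG1, "slot was not advanced, nothing to flush");  /* message text is illustrative */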

Attached is a patch that does this, so it fixes the bug without 
affecting any user-facing behavior. Detailed comment section and DEBUG 
output are also added. What do you think now?

I have also forgotten to mention that all versions down to 11.0 should 
be affected by this bug.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachments

Re: Physical replication slot advance is not persistent

From
Kyotaro Horiguchi
Date:
At Wed, 25 Dec 2019 20:28:04 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in 
> > Yep, it helps with physical replication slot persistence after
> > advance, but the whole validation (moveto <= endlsn) does not make
> > sense to me. The value of moveto should be >= minlsn ==
> > confirmed_flush / restart_lsn, while endlsn == retlsn is also always
> > initialized with confirmed_flush / restart_lsn. Thus, your condition
> > seems to be true in any case, even in the no-op case, which is what we
> > intended to catch.
...
> If I understand it correctly, we already keep the previous slot position in
> minlsn, so we just have to compare endlsn with minlsn and treat
> endlsn <= minlsn as a no-op, without flushing the slot state.

I think you're right about the condition. (endlsn cannot be less than
minlsn, though) But I came to think that we shouldn't use locations in
that decision.

> Attached is a patch that does this, so it fixes the bug without
> affecting any user-facing behavior. Detailed comment section and DEBUG
> output are also added. What do you think now?
> 
> I have also forgotten to mention that all versions down to 11.0 should
> be affected by this bug.

pg_replication_slot_advance is the only caller of
pg_logical/physical_replication_slot_advance, so there's no apparent
rule about who does what regarding dirtying and other housekeeping
calculations like the *ComputeRequired*() functions, but the current shape
seems kind of inconsistent between logical and physical.

I think pg_logical/physical_replication_slot_advance should dirty the
slot if they actually changed anything, and
pg_replication_slot_advance should do the housekeeping if the slot
has been dirtied.  (Otherwise the caller function should dirty the
slot in lieu of the two.)

The attached does that.

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 264b187aac17f0ce10e8748ea86fc967d08a0eb0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 26 Dec 2019 17:17:25 +0900
Subject: [PATCH] Make sure to save updated physical slot.

Back to 11, changes of a physical replication slot made by
pg_replication_slot_advance don't survive a restart. The cause was that
pg_physical_replication_slot_advance forgot to dirty the slot when
updated. Fix it.
---
 src/backend/replication/slot.c      | 19 +++++++++++++++++++
 src/backend/replication/slotfuncs.c | 10 ++++++++--
 src/include/replication/slot.h      |  1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 21ae8531b3..40149d0d99 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -672,6 +672,25 @@ ReplicationSlotMarkDirty(void)
     SpinLockRelease(&slot->mutex);
 }
 
+/*
+ * Return if the currently acquired slot is dirty.
+ */
+bool
+ReplicationSlotIsDirty(void)
+{
+    bool dirty;
+
+    ReplicationSlot *slot = MyReplicationSlot;
+
+    Assert(MyReplicationSlot != NULL);
+
+    SpinLockAcquire(&slot->mutex);
+    dirty = MyReplicationSlot->dirty;
+    SpinLockRelease(&slot->mutex);
+
+    return dirty;
+}
+
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6683fc3f9b..7b08ed691b 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,13 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
         MyReplicationSlot->data.restart_lsn = moveto;
         SpinLockRelease(&MyReplicationSlot->mutex);
         retlsn = moveto;
+
+        /*
+         * We don't need to dirty the slot only for the above change, but dirty
+         * this slot for the same reason with
+         * pg_logical_replication_slot_advance.
+         */
+        ReplicationSlotMarkDirty();
     }
 
     return retlsn;
@@ -574,9 +581,8 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
     nulls[0] = false;
 
     /* Update the on disk state when lsn was updated. */
-    if (XLogRecPtrIsInvalid(endlsn))
+    if (ReplicationSlotIsDirty())
     {
-        ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3a5763fb07..f76b84571f 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -189,6 +189,7 @@ extern void ReplicationSlotRelease(void);
 extern void ReplicationSlotCleanup(void);
 extern void ReplicationSlotSave(void);
 extern void ReplicationSlotMarkDirty(void);
+extern bool ReplicationSlotIsDirty(void);
 
 /* misc stuff */
 extern bool ReplicationSlotValidateName(const char *name, int elevel);
-- 
2.23.0


Re: Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
On 26.12.2019 11:33, Kyotaro Horiguchi wrote:
> At Wed, 25 Dec 2019 20:28:04 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in
>>> Yep, it helps with physical replication slot persistence after
>>> advance, but the whole validation (moveto <= endlsn) does not make
>>> sense to me. The value of moveto should be >= minlsn ==
>>> confirmed_flush / restart_lsn, while endlsn == retlsn is also always
>>> initialized with confirmed_flush / restart_lsn. Thus, your condition
>>> seems to be true in any case, even in the no-op case, which is what we
>>> intended to catch.
> ...
>> If I understand it correctly, we already keep the previous slot position in
>> minlsn, so we just have to compare endlsn with minlsn and treat
>> endlsn <= minlsn as a no-op, without flushing the slot state.
> I think you're right about the condition. (endlsn cannot be less than
> minlsn, though) But I came to think that we shouldn't use locations in
> that decision.
>
>> Attached is a patch that does this, so it fixes the bug without
>> affecting any user-facing behavior. Detailed comment section and DEBUG
>> output are also added. What do you think now?
>>
>> I have also forgotten to mention that all versions down to 11.0 should
>> be affected by this bug.
> pg_replication_slot_advance is the only caller of
> pg_logical/physical_replication_slot_advance, so there's no apparent
> rule about who does what regarding dirtying and other housekeeping
> calculations like the *ComputeRequired*() functions, but the current shape
> seems kind of inconsistent between logical and physical.
>
> I think pg_logical/physical_replication_slot_advance should dirty the
> slot if they actually changed anything, and
> pg_replication_slot_advance should do the housekeeping if the slot
> has been dirtied.  (Otherwise the caller function should dirty the
> slot in lieu of the two.)
>
> The attached does that.

Both approaches look fine to me: my last patch, with as minimal an 
intervention as possible, and your refactoring. I think that it is the 
right direction to let everyone who modifies slot->data also mark the 
slot as dirty.

I found one comment section in your code rather misleading:

+        /*
+         * We don't need to dirty the slot only for the above change, but dirty
+         * this slot for the same reason with
+         * pg_logical_replication_slot_advance.
+         */

We just modified MyReplicationSlot->data, which is "On-Disk data of a 
replication slot, preserved across restarts.", so it definitely should 
be marked as dirty in its own right, not because 
pg_logical_replication_slot_advance does the same.

Also, I think that using this transient variable in 
ReplicationSlotIsDirty is not necessary; MyReplicationSlot is already a 
pointer to the slot in shared memory.

+    ReplicationSlot *slot = MyReplicationSlot;
+
+    Assert(MyReplicationSlot != NULL);
+
+    SpinLockAcquire(&slot->mutex);
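A sketch of the simplified helper I have in mind (illustrative only; the 
attached diff is what I actually propose):

bool
ReplicationSlotIsDirty(void)
{
    bool        dirty;

    Assert(MyReplicationSlot != NULL);

    SpinLockAcquire(&MyReplicationSlot->mutex);
    dirty = MyReplicationSlot->dirty;
    SpinLockRelease(&MyReplicationSlot->mutex);

    return dirty;
}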

Otherwise it looks fine to me, so attached is the same diff with these 
proposed corrections.

Another concern is that ReplicationSlotIsDirty is added with only one 
user. It also cannot be used by SaveSlotToPath due to the 
simultaneous usage of both flags dirty and just_dirtied there.

In that way, I think that we should call ReplicationSlotSave 
unconditionally in pg_replication_slot_advance, so the slot will be 
saved or not automatically based on the slot->dirty flag. At the same 
time, ReplicationSlotsComputeRequiredXmin and 
ReplicationSlotsComputeRequiredLSN should be called by anyone who 
modifies the xmin and LSN fields in the slot. Otherwise we are 
getting some leaky abstractions.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachments

Re: Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
On 2019-12-26 16:35, Alexey Kondratov wrote:
> 
> Another concern is that ReplicationSlotIsDirty is added with only
> one user. It also cannot be used by SaveSlotToPath due to the
> simultaneous usage of both flags dirty and just_dirtied there.
> 
> In that way, I think that we should call ReplicationSlotSave
> unconditionally in pg_replication_slot_advance, so the slot will be
> saved or not automatically based on the slot->dirty flag. At the same
> time, ReplicationSlotsComputeRequiredXmin and
> ReplicationSlotsComputeRequiredLSN should be called by anyone who
> modifies the xmin and LSN fields in the slot. Otherwise we are
> getting some leaky abstractions.
> 

It seems that there was even a race in the order of actions inside 
pg_replication_slot_advance; it did the following:

- ReplicationSlotMarkDirty();
- ReplicationSlotsComputeRequiredXmin(false);
- ReplicationSlotsComputeRequiredLSN();
- ReplicationSlotSave();

1) Mark the slot as dirty, which actually does nothing immediately beyond 
setting the dirty flag;
2) Compute the new global required LSN;
3) Flush the slot state to disk.

If the old WAL gets recycled and a crash then happens between 
steps 2) and 3), we restart with the old value of restart_lsn, but 
without the required WAL. I do not know how to reproduce this properly 
without gdb and a power-off, so the chance is pretty low, but it could 
still happen.

Logical slots were again not affected, since there is a proper 
order of operations (with comments) and the slot flushing routines are 
inside LogicalConfirmReceivedLocation.

Thus, in the attached patch I have decided not to perform slot 
flushing in pg_replication_slot_advance at all and to do it in 
pg_physical_replication_slot_advance instead, as is done in 
LogicalConfirmReceivedLocation.
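Roughly, the ordering inside pg_physical_replication_slot_advance then 
becomes the following (a sketch only, mirroring the 
LogicalConfirmReceivedLocation pattern; the attached patch may differ in 
details):

     if (startlsn < moveto)
     {
         SpinLockAcquire(&MyReplicationSlot->mutex);
         MyReplicationSlot->data.restart_lsn = moveto;
         SpinLockRelease(&MyReplicationSlot->mutex);
         retlsn = moveto;

         /* Make the new restart_lsn durable before any WAL can be removed. */
         ReplicationSlotMarkDirty();
         ReplicationSlotSave();

         /* Only now let the global required LSN move forward. */
         ReplicationSlotsComputeRequiredLSN();
     }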

Since this bugfix has not moved forward during the week, I will put it 
on the 2020-01 commitfest. Kyotaro, if you do not object, I will add you 
as a reviewer, as you have already given a lot of feedback; thank you for 
that!


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachments

Re: Physical replication slot advance is not persistent

From
Kyotaro Horiguchi
Date:
Hello.

At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in 
> On 2019-12-26 16:35, Alexey Kondratov wrote:
> > Another concern is that ReplicationSlotIsDirty is added with only
> > one user. It also cannot be used by SaveSlotToPath due to the
> > simultaneous usage of both flags dirty and just_dirtied there.
> > In that way, I think that we should call ReplicationSlotSave
> > unconditionally in pg_replication_slot_advance, so the slot will be
> > saved or not automatically based on the slot->dirty flag. At the same
> > time, ReplicationSlotsComputeRequiredXmin and
> > ReplicationSlotsComputeRequiredLSN should be called by anyone who
> > modifies the xmin and LSN fields in the slot. Otherwise we are
> > getting some leaky abstractions.

Sounds reasonable.

> It seems that there was even a race in the order of actions inside
> pg_replication_slot_advance; it did the following:
> 
> - ReplicationSlotMarkDirty();
> - ReplicationSlotsComputeRequiredXmin(false);
> - ReplicationSlotsComputeRequiredLSN();
> - ReplicationSlotSave();
> 
> 1) Mark the slot as dirty, which actually does nothing immediately beyond
> setting the dirty flag;
> 2) Compute the new global required LSN;
> 3) Flush the slot state to disk.
> 
> If the old WAL gets recycled and a crash then happens between
> steps 2) and 3), we restart with the old value of restart_lsn, but
> without the required WAL. I do not know how to reproduce this properly
> without gdb and a power-off, so the chance is pretty low, but it could
> still happen.

In the first place, we advance the required LSN for every reply message but
save slot data only at checkpoint on physical replication.  Such a
strict guarantee seems too much.

Or we might need to save dirty slots just before the required LSN goes
into the next segment, but it would be a separate issue.

> Logical slots were again not affected, since there is a proper
> order of operations (with comments) and the slot flushing routines are
> inside LogicalConfirmReceivedLocation.

copy_replication_slot doesn't follow that, but the function can get into
a similar situation from a slightly different cause. If the required LSN
had been advanced by a move of the original slot before the function
recomputes the required LSN, there could be a case where the new slot
is missing a required WAL segment. But that is a different issue, too.

> Thus, in the attached patch I have decided not to perform slot
> flushing in pg_replication_slot_advance at all and to do it in
> pg_physical_replication_slot_advance instead, as is done in
> LogicalConfirmReceivedLocation.

That causes a logical slot not to be saved when only confirmed_flush
was changed. (I'm not sure whether the slot would be saved twice if
something other than confirmed_flush had been changed.)

> Since this bugfix has not moved forward during the week, I will put
> it on the 2020-01 commitfest. Kyotaro, if you do not object, I will add
> you as a reviewer, as you have already given a lot of feedback; thank
> you for that!

I'm fine with that.


+        /* Compute global required LSN if restart_lsn was changed */
+        if (updated_restart)
+            ReplicationSlotsComputeRequiredLSN();
..
-            ReplicationSlotsComputeRequiredLSN();


It seems intentional, for performance reasons, based on the same
thought as the comment in PhysicalConfirmReceivedLocation.

I think we shouldn't touch the paths used by the replication protocol.
Shouldn't we rather focus on how we make a change of a replication slot
from the SQL interface persistent?  It seems to me that generally we don't
need to save dirty slots other than at checkpoint, but the SQL function
seems to want the change to be saved immediately.

As a result, please find the attached, which follows only the
first paragraph cited above.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683e2a..1aacb937c7 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
         MyReplicationSlot->data.restart_lsn = moveto;
         SpinLockRelease(&MyReplicationSlot->mutex);
         retlsn = moveto;
+
+        ReplicationSlotMarkDirty();
+
+        /* We moved restart_lsn, update the global value. */
+        ReplicationSlotsComputeRequiredLSN();
     }
 
     return retlsn;
@@ -573,14 +578,8 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
     values[0] = NameGetDatum(&MyReplicationSlot->data.name);
     nulls[0] = false;
 
-    /* Update the on disk state when lsn was updated. */
-    if (XLogRecPtrIsInvalid(endlsn))
-    {
-        ReplicationSlotMarkDirty();
-        ReplicationSlotsComputeRequiredXmin(false);
-        ReplicationSlotsComputeRequiredLSN();
-        ReplicationSlotSave();
-    }
+    /* update the on disk state if needed. */
+    ReplicationSlotSave();
 
     ReplicationSlotRelease();


Re: Physical replication slot advance is not persistent

From
Alexey Kondratov
Date:
On 09.01.2020 09:36, Kyotaro Horiguchi wrote:
> Hello.
>
> At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov <a.kondratov@postgrespro.ru> wrote in
>> On 2019-12-26 16:35, Alexey Kondratov wrote:
>>> Another concern is that ReplicationSlotIsDirty is added with only
>>> one user. It also cannot be used by SaveSlotToPath due to the
>>> simultaneous usage of both flags dirty and just_dirtied there.
>>> In that way, I think that we should call ReplicationSlotSave
>>> unconditionally in pg_replication_slot_advance, so the slot will be
>>> saved or not automatically based on the slot->dirty flag. At the same
>>> time, ReplicationSlotsComputeRequiredXmin and
>>> ReplicationSlotsComputeRequiredLSN should be called by anyone who
>>> modifies the xmin and LSN fields in the slot. Otherwise we are
>>> getting some leaky abstractions.
> Sounds reasonable.

Great, so it seems that we have reached some agreement about who should 
mark the slot as dirty, at least for now.

>
>> If the old WAL gets recycled and a crash then happens between
>> steps 2) and 3), we restart with the old value of restart_lsn, but
>> without the required WAL. I do not know how to reproduce this properly
>> without gdb and a power-off, so the chance is pretty low, but it could
>> still happen.
> In the first place, we advance the required LSN for every reply message but
> save slot data only at checkpoint on physical replication.  Such a
> strict guarantee seems too much.
>
> ...
>
> I think we shouldn't touch the paths used by the replication protocol.
> Shouldn't we rather focus on how we make a change of a replication slot
> from the SQL interface persistent?  It seems to me that generally we don't
> need to save dirty slots other than at checkpoint, but the SQL function
> seems to want the change to be saved immediately.
>
> As a result, please find the attached, which follows only the
> first paragraph cited above.

OK, I have definitely overthought that, thanks. This looks like a 
minimal subset of changes that actually solves the bug. I would only 
prefer to keep some additional comments (something like the attached); 
otherwise, after half a year it will again be unclear why we save the 
slot unconditionally here.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachments

Re: Physical replication slot advance is not persistent

From
Michael Paquier
Date:
On Thu, Jan 16, 2020 at 08:09:09PM +0300, Alexey Kondratov wrote:
> OK, I have definitely overthought that, thanks. This looks like a minimal
> subset of changes that actually solves the bug. I would only prefer to keep
> some additional comments (something like the attached); otherwise, after half
> a year it will again be unclear why we save the slot unconditionally here.

Since this email, Andres has sent an email that did not reach the
community lists, but where all the participants of this thread were in
CC.  Here is a summary of the points raised (please correct me if that
does not sound right to you, Andres):
1) The slot advancing has to mark the slot as dirty, but should we
make the change persistent at the end of the function or should we
wait for a checkpoint to do the work, meaning that any update done to
the slot would be lost if a crash occurs in-between?  Note that we
have this commit in slotfuncs.c for
pg_logical_replication_slot_advance():
 * Dirty the slot so it's written out at the next checkpoint.
 * We'll still lose its position on crash, as documented, but it's
 * better than always losing the position even on clean restart.

This comment refers to the documentation for the logical decoding
section (see logicaldecoding-replication-slots in
logicaldecoding.sgml), and even if nothing can be done until the slot
advance function reaches its hand, we ought to make the data
persistent if we can.

The original commit that introduced slot advancing is 9c7d06d.  Here
is the thread, where this point was not really mentioned by the way:
https://www.postgresql.org/message-id/5c26ff40-8452-fb13-1bea-56a0338a809a@2ndquadrant.com

2) pg_replication_slot_advance() includes this code, which is broken:
    /* Update the on disk state when lsn was updated. */
    if (XLogRecPtrIsInvalid(endlsn))
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
        ReplicationSlotSave();
    }
Here the deal is that endlsn, aka the LSN where the slot has been
advanced (or its current position if no progress has been done) never
gets to be set to InvalidXLogRecPtr as of f731cfa, and that this work
should be done only when endlsn has done some progress.  It seems to
me that this should have been the opposite to begin with in 9c7d06d,
aka do the save if endlsn is valid.

3) The amount of testing related to slot advancing could be better
with cluster-wide operations.

@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
    MyReplicationSlot->data.restart_lsn = moveto;
    SpinLockRelease(&MyReplicationSlot->mutex);
    retlsn = moveto;
+
+   ReplicationSlotMarkDirty();
+
+   /* We moved restart_lsn, update the global value. */
+   ReplicationSlotsComputeRequiredLSN();
I think that the proposed patch is missing a call to
ReplicationSlotsComputeRequiredXmin() here for physical slots.

So, I have been looking at this patch by myself, and updated it so that
the extra slot save is done only if any advancing has been done, on
top of the other computations that had better be around for
consistency.  The patch includes TAP tests for physical and logical
slots' durability across restarts.

Thoughts?
--
Michael

Attachments

Re: Physical replication slot advance is not persistent

From
a.kondratov@postgrespro.ru
Date:
On 20 Jan 2020, 09:45 +0300, Michael Paquier <michael@paquier.xyz> wrote:

> So, I have been looking at this patch by myself, and updated it so that
> the extra slot save is done only if any advancing has been done, on
> top of the other computations that had better be around for
> consistency. The patch includes TAP tests for physical and logical
> slots' durability across restarts.
>
> Thoughts?


I still think that this extra check of whether any advance just happened or 
not just adds extra complexity. We could use slot dirtiness for the same 
purpose, and the slot saving routines check it automatically. Anyway, the 
approach of adding a new flag should resolve this bug as well, of course, 
and maybe it will be a bit more transparent and explicit.

I just eyeballed your patch and it looks fine at first glance, except for 
the logical slot advance part:

        retlsn = MyReplicationSlot->data.confirmed_flush;
+       *advance_done = true;

        /* free context, call shutdown callback */
        FreeDecodingContext(ctx);

I am not sure that this is the right place to set the advance_done flag to 
true. Wouldn't it be set to true even in the no-op case, when 
LogicalConfirmReceivedLocation was never executed? Probably we should set 
the flag near the LogicalConfirmReceivedLocation call instead?


--
Alexey Kondratov

Re: Physical replication slot advance is not persistent

From
Andres Freund
Date:
Hi,

On 2020-01-20 15:45:40 +0900, Michael Paquier wrote:
> On Thu, Jan 16, 2020 at 08:09:09PM +0300, Alexey Kondratov wrote:
> > OK, I have definitely overthought that, thanks. This looks like a minimal
> > subset of changes that actually solves the bug. I would only prefer to keep
> > some additional comments (something like the attached); otherwise, after half
> > a year it will again be unclear why we save the slot unconditionally here.
> 
> Since this email, Andres has sent an email that did not reach the
> community lists, but where all the participants of this thread were in
> CC.

Ugh, that was an accident.


> Here is a summary of the points raised (please correct me if that
> does not sound right to you, Andres):

> 1) The slot advancing has to mark the slot as dirty, but should we
> make the change persistent at the end of the function or should we
> wait for a checkpoint to do the work, meaning that any update done to
> the slot would be lost if a crash occurs in-between?  Note that we
> have this comment in slotfuncs.c for
> pg_logical_replication_slot_advance():
>  * Dirty the slot so it's written out at the next checkpoint.
>  * We'll still lose its position on crash, as documented, but it's
>  * better than always losing the position even on clean restart.
> 
> This comment refers to the documentation for the logical decoding
> section (see logicaldecoding-replication-slots in
> logicaldecoding.sgml), and even if nothing can be done until the slot
> advance function reaches its hand, we ought to make the data
> persistent if we can.

That doesn't really seem like a meaningful reference, because the
concerns between constantly streaming out changes (where we don't want
to fsync every single transaction), and doing so in a manual advance
through an sql function, seem different.


> 3) The amount of testing related to slot advancing could be better
> with cluster-wide operations.
> 
> @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
>     MyReplicationSlot->data.restart_lsn = moveto;
>     SpinLockRelease(&MyReplicationSlot->mutex);
>     retlsn = moveto;
> +
> +   ReplicationSlotMarkDirty();
> +
> +   /* We moved restart_lsn, update the global value. */
> +   ReplicationSlotsComputeRequiredLSN();
> I think that the proposed patch is missing a call to
> ReplicationSlotsComputeRequiredXmin() here for physical slots.

Hm. It seems ok to include, but I don't think omitting it currently has
negative effects?


> > So, I have been looking at this patch by myself, and updated it so that
> the extra slot save is done only if any advancing has been done, on
> top of the other computations that had better be around for
> consistency.

Hm, I don't necessarily think that's necessary.


> The patch includes TAP tests for physical and logical slots'
> durability across restarts.

Cool!


> diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
> index bb69683e2a..af3e114fc9 100644
> --- a/src/backend/replication/slotfuncs.c
> +++ b/src/backend/replication/slotfuncs.c
> @@ -359,17 +359,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>   * checkpoints.
>   */
>  static XLogRecPtr
> -pg_physical_replication_slot_advance(XLogRecPtr moveto)
> +pg_physical_replication_slot_advance(XLogRecPtr moveto, bool *advance_done)
>  {
>      XLogRecPtr    startlsn = MyReplicationSlot->data.restart_lsn;
>      XLogRecPtr    retlsn = startlsn;
>  
> +    *advance_done = false;
> +
>      if (startlsn < moveto)
>      {
>          SpinLockAcquire(&MyReplicationSlot->mutex);
>          MyReplicationSlot->data.restart_lsn = moveto;
>          SpinLockRelease(&MyReplicationSlot->mutex);
>          retlsn = moveto;
> +        *advance_done = true;
>      }
>  
>      return retlsn;

Hm. Why did you choose not to use endlsn as before (except being
broken), or something? It seems quite conceivable somebody is using
these functions in an extension.




> +# Test physical slot advancing and its durability.  Create a new slot on
> +# the primary, not used by any of the standbys. This reserves WAL at creation.
> +my $phys_slot = 'phys_slot';
> +$node_master->safe_psql('postgres',
> +    "SELECT pg_create_physical_replication_slot('$phys_slot', true);");
> +$node_master->psql('postgres', "
> +    CREATE TABLE tab_phys_slot (a int);
> +    INSERT INTO tab_phys_slot VALUES (generate_series(1,10));");
> +my $psql_rc = $node_master->psql('postgres',
> +    "SELECT pg_replication_slot_advance('$phys_slot', 'FF/FFFFFFFF');");
> +is($psql_rc, '0', 'slot advancing works with physical slot');

Hm. I think testing this with real LSNs is a better idea. What if the
node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
I know, but still. I.e. why not get the current LSN after the INSERT,
and assert that the slot, after the restart, is that?


Greetings,

Andres Freund



Re: Physical replication slot advance is not persistent

From
Kyotaro Horiguchi
Date:
Thanks for looking at this.

At Mon, 20 Jan 2020 11:00:14 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Here is a summary of the points raised (please correct me if that
> > does not sound right to you, Andres):
> 
> > 1) The slot advancing has to mark the slot as dirty, but should we
> > make the change persistent at the end of the function or should we
> > wait for a checkpoint to do the work, meaning that any update done to
> > the slot would be lost if a crash occurs in-between?  Note that we
> > have this comment in slotfuncs.c for
> > pg_logical_replication_slot_advance():
> >  * Dirty the slot so it's written out at the next checkpoint.
> >  * We'll still lose its position on crash, as documented, but it's
> >  * better than always losing the position even on clean restart.
> > 
> > This comment refers to the documentation for the logical decoding
> > section (see logicaldecoding-replication-slots in
> > logicaldecoding.sgml), and even if nothing can be done until the slot
> > advance function reaches its hand, we ought to make the data
> > persistent if we can.
> 
> That doesn't really seem like a meaningful reference, because the
> concerns between constantly streaming out changes (where we don't want
> to fsync every single transaction), and doing so in a manual advance
> through an sql function, seem different.

Yes, that is the reason I didn't suggest not to save the file there.
I don't have a clear opinion on it, but I agree that users expect that
any changes they make from the SQL interface should survive
crash recovery.

> > 3) The amount of testing related to slot advancing could be better
> > with cluster-wide operations.
> > 
> > @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
> >     MyReplicationSlot->data.restart_lsn = moveto;
> >     SpinLockRelease(&MyReplicationSlot->mutex);
> >     retlsn = moveto;
> > +
> > +   ReplicationSlotMarkDirty();
> > +
> > +   /* We moved restart_lsn, update the global value. */
> > +   ReplicationSlotsComputeRequiredLSN();
> > I think that the proposed patch is missing a call to
> > ReplicationSlotsComputeRequiredXmin() here for physical slots.
> 
> Hm. It seems ok to include, but I don't think omitting it currently has
> negative effects?

I think not. It is updated sooner or later when replication proceeds
and a reply message is received.

> > So, I have been looking at this patch by myself, and updated it so that
> > the extra slot save is done only if any advancing has been done, on
> > top of the other computations that had better be around for
> > consistency.
> 
> Hm, I don't necessarily think that's necessary.

On the other hand, there is no negative effect from the extra saving of
the file as long as the SQL function itself is not called extremely
frequently. If I read Andres's comment above correctly, I agree not to
add complexity to suppress the "needless" saving of the file.

> > The patch includes TAP tests for physical and logical slots'
> > durability across restarts.
> 
> Cool!
> 
> 
> > diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
> > index bb69683e2a..af3e114fc9 100644
> > --- a/src/backend/replication/slotfuncs.c
> > +++ b/src/backend/replication/slotfuncs.c
> > @@ -359,17 +359,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
> >   * checkpoints.
> >   */
> >  static XLogRecPtr
> > -pg_physical_replication_slot_advance(XLogRecPtr moveto)
> > +pg_physical_replication_slot_advance(XLogRecPtr moveto, bool *advance_done)
> >  {
> >      XLogRecPtr    startlsn = MyReplicationSlot->data.restart_lsn;
> >      XLogRecPtr    retlsn = startlsn;
> >  
> > +    *advance_done = false;
> > +
> >      if (startlsn < moveto)
> >      {
> >          SpinLockAcquire(&MyReplicationSlot->mutex);
> >          MyReplicationSlot->data.restart_lsn = moveto;
> >          SpinLockRelease(&MyReplicationSlot->mutex);
> >          retlsn = moveto;
> > +        *advance_done = true;
> >      }
> >  
> >      return retlsn;
> 
> Hm. Why did you choose not to use endlsn as before (except being
> broken), or something? It seems quite conceivable somebody is using
> these functions in an extension.
> 
> 
> 
> 
> > +# Test physical slot advancing and its durability.  Create a new slot on
> > +# the primary, not used by any of the standbys. This reserves WAL at creation.
> > +my $phys_slot = 'phys_slot';
> > +$node_master->safe_psql('postgres',
> > +    "SELECT pg_create_physical_replication_slot('$phys_slot', true);");
> > +$node_master->psql('postgres', "
> > +    CREATE TABLE tab_phys_slot (a int);
> > +    INSERT INTO tab_phys_slot VALUES (generate_series(1,10));");
> > +my $psql_rc = $node_master->psql('postgres',
> > +    "SELECT pg_replication_slot_advance('$phys_slot', 'FF/FFFFFFFF');");
> > +is($psql_rc, '0', 'slot advancing works with physical slot');
> 
> Hm. I think testing this with real LSNs is a better idea. What if the
> node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
> I know, but still. I.e. why not get the current LSN after the INSERT,
> and assert that the slot, after the restart, is that?

+1.


(continuation of (3))
At Mon, 20 Jan 2020 15:45:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in
(@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto))
> +   /* We moved restart_lsn, update the global value. */
> +   ReplicationSlotsComputeRequiredLSN();
> I think that the proposed patch is missing a call to
> ReplicationSlotsComputeRequiredXmin() here for physical slots.

No. pg_physical_replication_slot_advance doesn't make an advance of
effective_(catalog)_xmin, so it is just useless there. It would be necessary
if it were in pg_replication_slot_advance, its caller.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Physical replication slot advance is not persistent

From
Craig Ringer
Date:
On Fri, 17 Jan 2020 at 01:09, Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> > I think we shouldn't touch the paths used by the replication protocol.
> > Shouldn't we rather focus on how we make a change of a replication slot
> > from the SQL interface persistent?  It seems to me that generally we don't
> > need to save dirty slots other than at checkpoint, but the SQL function
> > seems to want the change to be saved immediately.

PLEASE do not make the streaming replication interface force flushes!

The replication interface should not immediately flush changes to the
slot replay position on advance. It should be marked dirty and left to
be flushed by the next checkpoint. Doing otherwise potentially
introduces a lot of unnecessary fsync()s and may have an unpleasant
impact on performance.

Clients of the replication protocol interface should be doing their
own position tracking on the client side. They should not ever be
relying on the server side restart position for correctness, since it
can go backwards on crash and restart. Any that do rely on it are
incorrect. I should propose a docs change that explains how the server
and client restart position tracking interacts on both phy and logical
rep since it's not really covered right now and naïve client
implementations will be wrong.

I don't really care if the SQL interface forces an immediate flush
since it's never going to have good performance anyway.

It's already impossible to write a strictly correct and crash safe
client with the SQL interface. Adding forced flushing won't make that
any better or worse.

The SQL interface advances the slot restart position and marks the
slot dirty *before the client has confirmed receipt of the data and
flushed it to disk*. So there's a data loss window. If the client
disconnects or crashes before all the data from that function call is
safely flushed to disk it may lose the data, then be unable to fetch
it again from the server because of the restart_lsn position advance.

Really, we should add a "no_advance_position" option to the SQL
interface, then expect the client to call a second function that
explicitly advances the restart_lsn and confirmed_flush_lsn. Otherwise
no SQL interface client can be crashsafe.



Re: Physical replication slot advance is not persistent

From
Michael Paquier
Date:
On Tue, Jan 21, 2020 at 09:44:12AM +0800, Craig Ringer wrote:
> PLEASE do not make the streaming replication interface force flushes!

Yeah, that's a bad idea.  FWIW, my understanding is that this was
only proposed in v3, and it has been discarded:
https://www.postgresql.org/message-id/175c2760666a78205e053207794c0f8f@postgrespro.ru

> The replication interface should not immediately flush changes to the
> slot replay position on advance. It should be marked dirty and left to
> be flushed by the next checkpoint. Doing otherwise potentially
> introduces a lot of unnecessary fsync()s and may have an unpleasant
> impact on performance.

Some portions of the advancing code tell a different story.  It seems
to me that the intention behind the first implementation of slot
advancing was to get things flushed if any advancing was done.  The
check doing that is actually broken from the start, but that's another
story.  Could you check with Petr what the intention was here, or draw
his attention to this thread?  He is the original author of the
feature, so his input would be nice to have.

> Clients of the replication protocol interface should be doing their
> own position tracking on the client side. They should not ever be
> relying on the server side restart position for correctness, since it
> can go backwards on crash and restart. Any that do rely on it are
> incorrect. I should propose a docs change that explains how the server
> and client restart position tracking interacts on both phy and logical
> rep since it's not really covered right now and naïve client
> implementations will be wrong.
>
> I don't really care if the SQL interface forces an immediate flush
> since it's never going to have good performance anyway.

Okay, the flush could be optional as well, but that's a different
discussion.  The docs of logical decoding mention that slot data may
go backwards in the event of a crash.  If you have improvements for
that, surely that's welcome.

> The SQL interface advances the slot restart position and marks the
> slot dirty *before the client has confirmed receipt of the data and
> flushed it to disk*. So there's a data loss window. If the client
> disconnects or crashes before all the data from that function call is
> safely flushed to disk it may lose the data, then be unable to fetch
> it again from the server because of the restart_lsn position advance.

Well, you have the same class of problems with, for example, synchronous
replication.  The only state a client can be sure of is that it
received a confirmation that the operation happened and completed.
Any other state tells that the operation may have happened.  Or not.
Now, being sure that the new data of the slot has been flushed by the
time the advancing function is done, once the client got the confirmation
that the work is done, is a property which could be interesting to some
classes of applications.

> Really, we should add a "no_advance_position" option to the SQL
> interface, then expect the client to call a second function that
> explicitly advances the restart_lsn and confirmed_flush_lsn. Otherwise
> no SQL interface client can be crashsafe.

Hm.  Could you elaborate on what you mean here?  I am not sure I
understand.  Note that calling the advance function multiple times has
no further effect, and the result returned is the actual restart_lsn (or
confirmed_flush_lsn of course).
--
Michael

Attachments

Re: Physical replication slot advance is not persistent

From
Michael Paquier
Date:
On Mon, Jan 20, 2020 at 11:00:14AM -0800, Andres Freund wrote:
> On 2020-01-20 15:45:40 +0900, Michael Paquier wrote:
>> 1) The slot advancing has to mark the slot as dirty, but should we
>> make the change persistent at the end of the function or should we
>> wait for a checkpoint to do the work, meaning that any update done to
>> the slot would be lost if a crash occurs in-between?  Note that we
>> have this comment in slotfuncs.c for
>> pg_logical_replication_slot_advance():
>>  * Dirty the slot so it's written out at the next checkpoint.
>>  * We'll still lose its position on crash, as documented, but it's
>>  * better than always losing the position even on clean restart.
>>
>> This comment refers to the documentation for the logical decoding
>> section (see logicaldecoding-replication-slots in
>> logicaldecoding.sgml), and even if nothing can be done until the slot
>> advance function reaches its hand, we ought to make the data
>> persistent if we can.
>
> That doesn't really seem like a meaningful reference, because the
> concerns between constantly streaming out changes (where we don't want
> to fsync every single transaction), and doing so in a manual advance
> through an sql function, seem different.

No disagreement with that; still, it is the only reference we have in
the docs about that.  I think that we should take the occasion to
update the docs of the advancing functions according to what we
think is the best choice: should the slot information be flushed at
the end of the function, or at the follow-up checkpoint?

>> 3) The amount of testing related to slot advancing could be better
>> with cluster-wide operations.
>>
>> @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
>>     MyReplicationSlot->data.restart_lsn = moveto;
>>     SpinLockRelease(&MyReplicationSlot->mutex);
>>     retlsn = moveto;
>> +
>> +   ReplicationSlotMarkDirty();
>> +
>> +   /* We moved restart_lsn, update the global value. */
>> +   ReplicationSlotsComputeRequiredLSN();
>> I think that the proposed patch is missing a call to
>> ReplicationSlotsComputeRequiredXmin() here for physical slots.
>
> Hm. It seems ok to include, but I don't think omitting it currently has
> negative effects?

effective_xmin can be used by WAL senders with physical slots.  It
seems safer in the long term to include it, IMO.

>>  static XLogRecPtr
>> -pg_physical_replication_slot_advance(XLogRecPtr moveto)
>> +pg_physical_replication_slot_advance(XLogRecPtr moveto, bool *advance_done)
>>  {
>>      XLogRecPtr    startlsn = MyReplicationSlot->data.restart_lsn;
>>      XLogRecPtr    retlsn = startlsn;
>>
>> +    *advance_done = false;
>> +
>>      if (startlsn < moveto)
>>      {
>>          SpinLockAcquire(&MyReplicationSlot->mutex);
>>          MyReplicationSlot->data.restart_lsn = moveto;
>>          SpinLockRelease(&MyReplicationSlot->mutex);
>>          retlsn = moveto;
>> +        *advance_done = true;
>>      }
>>
>>      return retlsn;
>
> Hm. Why did you choose not to use endlsn as before (except being
> broken), or something?

When doing repetitive calls of the advancing functions, the advancing
happens in the first call and the next ones do nothing, so if no
update is done there is no point in flushing the slot information.

> It seems quite conceivable somebody is using these functions in an
> extension.

Not sure I get that, pg_physical_replication_slot_advance and
pg_logical_replication_slot_advance are static in slotfuncs.c.

> Hm. I think testing this with real LSNs is a better idea. What if the
> node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
> I know, but still. I.e. why not get the current LSN after the INSERT,
> and assert that the slot, after the restart, is that?

Sure.  If not disabling autovacuum in the test, we'd just need to make
sure that the advancing is at least ahead of the INSERT position.

Anyway, I am still not sure whether we should go down the road of just
marking the slot as dirty if any advancing is done and letting the
follow-up checkpoint do the work, whether the advancing function should
do the slot flush itself, or whether we choose one and make the other an
optional choice (not for back-branches, obviously).  Based on my reading
of the code, my guess is that a flush should happen at the end of the
advancing function.
--
Michael

Attachments

Re: Physical replication slot advance is not persistent

From
Craig Ringer
Date:
On Tue, 21 Jan 2020 at 11:06, Michael Paquier <michael@paquier.xyz> wrote:

> > The replication interface should not immediately flush changes to the
> > slot replay position on advance. It should be marked dirty and left to
> > be flushed by the next checkpoint. Doing otherwise potentially
> > introduces a lot of unnecessary fsync()s and may have an unpleasant
> > impact on performance.
>
> Some portions of the advancing code tell a different story.  It seems
> to me that the intention behind the first implementation of slot
> advancing was to get things flushed if any advancing was done.

Taking a step back here, I have no concerns with the proposed changes to
pg_replication_slot_advance(). Disregard my comments about safety with
the SQL interface for the purposes of this thread; they apply only to
logical slots and are really unrelated to
pg_replication_slot_advance().

Re your comment above: For slot advances in general the flush to disk
is done lazily for performance reasons, but I think you meant
pg_replication_slot_advance() specifically.

pg_replication_slot_advance() doesn't appear to make any promises as
to immediate durability either way. It updates the required LSN
immediately with ReplicationSlotsComputeRequiredLSN(), so it
theoretically marks WAL as removable before the slot is flushed. But I
don't think we'll ever actually remove any WAL segments until a
checkpoint, at which point we'll also flush any dirty slots, so it
doesn't really matter. For logical slots the lsn and xmin are both
protected by the effective/actual tracking logic and can't advance
until the slot is flushed.

The app might be surprised if the slot goes backwards after a
pg_replication_slot_advance() call followed by a server crash, though.

> The
> check doing that is actually broken from the start, but that's another
> story.  Could you check with Petr what was the intention here or drag
> his attention to this thread?  He is the original author of the
> feature.  So his output would be nice to have.

I'll ask him. He's pretty bogged down at the moment though, and I've done a
lot of work in this area too. (See e.g. the catalog_xmin in hot
standby feedback changes.)

> > The SQL interface advances the slot restart position and marks the
> > slot dirty *before the client has confirmed receipt of the data and
> > flushed it to disk*. So there's a data loss window. If the client
> > disconnects or crashes before all the data from that function call is
> > safely flushed to disk it may lose the data, then be unable to fetch
> > it again from the server because of the restart_lsn position advance.
>
> Well, you have the same class of problems with for example synchronous
> replication.  The only state a client can be sure of is that it
> received a confirmation that the operation happens and completed.
> Any other state tells that the operation may have happened.  Or not.
> Now, being sure that the data of the new slot has been flushed once
> the advancing function is done once the client got the confirmation
> that the work is done is a property which could be interesting to some
> class of applications.

That's what we already provide for the streaming interface for slot access.

I agree there's no need to shove a fix to the SQL interface for
phys/logical slots into this same discussion. I'm just trying to make
sure we don't "fix" a "bug" that's actually an important part of the
design by trying to fix a perceived-missing flush in the streaming
interface too. I am not at all confident that the test coverage for
this is sufficient right now, since we lack a good way to make
postgres delay various lazy internal activity to let us reliably
examine intermediate states in a race-free way, so I'm not sure tests
would catch it.

> > Really, we should add a "no_advance_position" option to the SQL
> > interface, then expect the client to call a second function that
> > explicitly advances the restart_lsn and confirmed_flush_lsn. Otherwise
> > no SQL interface client can be crashsafe.
>
> Hm.  Could you elaborate on what you mean here?  I am not sure I
> understand.  Note that calling the advance function multiple times has
> no further effect, and the result returned is the actual restart_lsn (or
> confirmed_flush_lsn of course).

I've probably confused things a bit here. I don't mind whether or
not pg_replication_slot_advance() forces an immediate flush; I think
there are reasonable arguments in both directions.

In the above I was talking about how pg_logical_slot_get_changes()
presently advances the slot position immediately, so if the client
loses its connection before reading and flushing all the data it may
be unable to recover. And while pg_logical_slot_peek_changes() lets
the app read the data w/o advancing the slot, it has to then do a
separate pg_replication_slot_advance() which has to do the decoding
work again. I'd like to improve that, but I didn't intend to confuse
or sidetrack this discussion in the process. Sorry.

We don't have a SQL-level interface for reading physical WAL so there
are no corresponding concerns there.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise



Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Tue, Jan 21, 2020 at 02:07:30PM +0900, Michael Paquier wrote:
> On Mon, Jan 20, 2020 at 11:00:14AM -0800, Andres Freund wrote:
>> Hm. I'm think testing this with real LSNs is a better idea. What if the
>> node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
>> I know, but still. I.e. why not get the current LSN after the INSERT,
>> and assert that the slot, after the restart, is that?
>
> Sure.  If not disabling autovacuum in the test, we'd just need to make
> sure if that advancing is at least ahead of the INSERT position.

Actually, as the advancing happens only up to this position we just
need to make sure that the LSN reported by the slot is the same as the
position advanced to.  I have switched the test to just do that
instead of using a fake LSN.

> Anyway, I am still not sure if we should got down the road to just
> mark the slot as dirty if any advancing is done and let the follow-up
> checkpoint to the work, if the advancing function should do the slot
> flush, or if we choose one and make the other an optional choice (not
> for back-branches, obviously.  Based on my reading of the code, my
> guess is that a flush should happen at the end of the advancing
> function.

I have been chewing on this point for a couple of days, and as we may
actually crash between the moment the slot is marked as dirty and the
moment the slot information is made consistent, there is still a risk
of the slot going backwards even if the slot information is saved.
The window is much narrower, but well, the docs of logical decoding
mention that this risk exists.  And the patch becomes much simpler
without changing the behavior that has been in place since the feature
was introduced for logical slots.  There could be a point in having a
new option to flush the slot information, or actually a separate
function to flush the slot information, but let's keep that for a
future possibility.

So attached is an updated patch which addresses the problem just by
marking a physical slot as dirty if any advancing is done.  Some
documentation is added about the fact that an advance is persistent
only at the follow-up checkpoint.  And the tests are fixed to not use
a fake LSN but instead advance to the latest LSN position produced.
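
To give an idea of the shape of the change (a simplified sketch, not
the exact hunk from the attached patch), the physical advancing path
now just dirties the slot after moving restart_lsn:

static XLogRecPtr
pg_physical_replication_slot_advance(XLogRecPtr moveto)
{
    XLogRecPtr  startlsn = MyReplicationSlot->data.restart_lsn;
    XLogRecPtr  retlsn = startlsn;

    if (startlsn < moveto)
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.restart_lsn = moveto;
        SpinLockRelease(&MyReplicationSlot->mutex);
        retlsn = moveto;

        /*
         * Dirty the slot so that its new position is written out at the
         * next checkpoint; the advance itself is not flushed here.
         */
        ReplicationSlotMarkDirty();
    }

    return retlsn;
}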

Any objections?
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Craig Ringer
Дата:
On Tue, 28 Jan 2020 at 16:01, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jan 21, 2020 at 02:07:30PM +0900, Michael Paquier wrote:
> > On Mon, Jan 20, 2020 at 11:00:14AM -0800, Andres Freund wrote:
> >> Hm. I'm think testing this with real LSNs is a better idea. What if the
> >> node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
> >> I know, but still. I.e. why not get the current LSN after the INSERT,
> >> and assert that the slot, after the restart, is that?
> >
> > Sure.  If not disabling autovacuum in the test, we'd just need to make
> > sure if that advancing is at least ahead of the INSERT position.
>
> Actually, as the advancing happens only up to this position we just
> need to make sure that the LSN reported by the slot is the same as the
> position advanced to.  I have switched the test to just do that
> instead of using a fake LSN.
>
> > Anyway, I am still not sure if we should got down the road to just
> > mark the slot as dirty if any advancing is done and let the follow-up
> > checkpoint to the work, if the advancing function should do the slot
> > flush, or if we choose one and make the other an optional choice (not
> > for back-branches, obviously.  Based on my reading of the code, my
> > guess is that a flush should happen at the end of the advancing
> > function.
>
> I have been chewing on this point for a couple of days, and as we may
> actually crash between the moment the slot is marked as dirty and the
> moment the slot information is made consistent, we still have a risk
> to have the slot go backwards even if the slot information is saved.
> The window is much narrower, but well, the docs of logical decoding
> mention that this risk exists.  And the patch becomes much more
> simple without changing the actual behavior present since the feature
> has been introduced for logical slots.  There could be a point in
> having a new option to flush the slot information, or actually a
> separate function to flush the slot information, but let's keep that
> for a future possibility.
>
> So attached is an updated patch which addresses the problem just by
> marking a physical slot as dirty if any advancing is done.  Some
> documentation is added about the fact that an advance is persistent
> only at the follow-up checkpoint.  And the tests are fixed to not use
> a fake LSN but instead advance to the latest LSN position produced.
>
> Any objections?

LGTM. Thankyou.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise



Re: Physical replication slot advance is not persistent

От
Kyotaro Horiguchi
Дата:
At Tue, 28 Jan 2020 17:45:47 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in 
> On Tue, 28 Jan 2020 at 16:01, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Tue, Jan 21, 2020 at 02:07:30PM +0900, Michael Paquier wrote:
> > > On Mon, Jan 20, 2020 at 11:00:14AM -0800, Andres Freund wrote:
> > >> Hm. I'm think testing this with real LSNs is a better idea. What if the
> > >> node actually already is past FF/FFFFFFFF at this point? Quite unlikely,
> > >> I know, but still. I.e. why not get the current LSN after the INSERT,
> > >> and assert that the slot, after the restart, is that?
> > >
> > > Sure.  If not disabling autovacuum in the test, we'd just need to make
> > > sure if that advancing is at least ahead of the INSERT position.
> >
> > Actually, as the advancing happens only up to this position we just
> > need to make sure that the LSN reported by the slot is the same as the
> > position advanced to.  I have switched the test to just do that
> > instead of using a fake LSN.
> >
> > > Anyway, I am still not sure if we should got down the road to just
> > > mark the slot as dirty if any advancing is done and let the follow-up
> > > checkpoint to the work, if the advancing function should do the slot
> > > flush, or if we choose one and make the other an optional choice (not
> > > for back-branches, obviously.  Based on my reading of the code, my
> > > guess is that a flush should happen at the end of the advancing
> > > function.
> >
> > I have been chewing on this point for a couple of days, and as we may
> > actually crash between the moment the slot is marked as dirty and the
> > moment the slot information is made consistent, we still have a risk
> > to have the slot go backwards even if the slot information is saved.
> > The window is much narrower, but well, the docs of logical decoding
> > mention that this risk exists.  And the patch becomes much more
> > simple without changing the actual behavior present since the feature
> > has been introduced for logical slots.  There could be a point in
> > having a new option to flush the slot information, or actually a
> > separate function to flush the slot information, but let's keep that
> > for a future possibility.
> >
> > So attached is an updated patch which addresses the problem just by
> > marking a physical slot as dirty if any advancing is done.  Some
> > documentation is added about the fact that an advance is persistent
> > only at the follow-up checkpoint.  And the tests are fixed to not use
> > a fake LSN but instead advance to the latest LSN position produced.
> >
> > Any objections?
> 
> LGTM. Thankyou.

I agree not to save slots immediately. The code is written as described
above. The TAP test is correct.

But the doc part looks a bit too detailed to me. Couldn't we explain
that without the word 'dirty'?

-        and it will not be moved beyond the current insert location.  Returns
-        name of the slot and real position to which it was advanced to.
+        and it will not be moved beyond the current insert location. Returns
+        name of the slot and real position to which it was advanced to. The
+        updated slot is marked as dirty if any advancing is done, with its
+        information being written out at the follow-up checkpoint. In the
+        event of a crash, the slot may return to an earlier position.

and it will not be moved beyond the current insert location. Returns
name of the slot and real position to which it was advanced to. The
information of the updated slot is scheduled to be written out at the
follow-up checkpoint if any advancing is done. In the event of a
crash, the slot may return to an earlier position.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Physical replication slot advance is not persistent

От
Alexey Kondratov
Дата:
On 28.01.2020 15:14, Kyotaro Horiguchi wrote:
> At Tue, 28 Jan 2020 17:45:47 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in
>> On Tue, 28 Jan 2020 at 16:01, Michael Paquier <michael@paquier.xyz> wrote:
>>> So attached is an updated patch which addresses the problem just by
>>> marking a physical slot as dirty if any advancing is done.  Some
>>> documentation is added about the fact that an advance is persistent
>>> only at the follow-up checkpoint.  And the tests are fixed to not use
>>> a fake LSN but instead advance to the latest LSN position produced.
>>>
>>> Any objections?
>> LGTM. Thankyou.
> I agree not to save slots immediately. The code is wrtten as described
> above. The TAP test is correct.

+1, removing this broken saving code path from
pg_replication_slot_advance and marking the slot as dirty looks good to
me. It solves the issue and does not add any unnecessary complexity.

>
> But the doc part looks a bit too detailed to me. Couldn't we explain
> that without the word 'dirty'?
>
> -        and it will not be moved beyond the current insert location.  Returns
> -        name of the slot and real position to which it was advanced to.
> +        and it will not be moved beyond the current insert location. Returns
> +        name of the slot and real position to which it was advanced to. The
> +        updated slot is marked as dirty if any advancing is done, with its
> +        information being written out at the follow-up checkpoint. In the
> +        event of a crash, the slot may return to an earlier position.
>
> and it will not be moved beyond the current insert location. Returns
> name of the slot and real position to which it was advanced to. The
> information of the updated slot is scheduled to be written out at the
> follow-up checkpoint if any advancing is done. In the event of a
> crash, the slot may return to an earlier position.

I just searched through the *.sgml files; we already use the terms
'dirty' and 'flush' for writing out pages during checkpoints. Here we
are trying to describe a very similar process, but in relation to
replication slots, so it looks fine to me. At the same time, the term
'schedule' is used for VACUUM, constraint checks or checkpoints
themselves.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Tue, Jan 28, 2020 at 06:06:06PM +0300, Alexey Kondratov wrote:
> On 28.01.2020 15:14, Kyotaro Horiguchi wrote:
>> I agree not to save slots immediately. The code is wrtten as described
>> above. The TAP test is correct.
>
> +1, removing this broken saving code path from pg_replication_slot_advance
> and marking slot as dirty looks good to me. It solves the issue and does not
> add any unnecessary complexity.

Ok, good.  So I am seeing no objections on that part :D

>> But the doc part looks a bit too detailed to me. Couldn't we explain
>> that without the word 'dirty'?
>>
>> -        and it will not be moved beyond the current insert location.  Returns
>> -        name of the slot and real position to which it was advanced to.
>> +        and it will not be moved beyond the current insert location. Returns
>> +        name of the slot and real position to which it was advanced to. The
>> +        updated slot is marked as dirty if any advancing is done, with its
>> +        information being written out at the follow-up checkpoint. In the
>> +        event of a crash, the slot may return to an earlier position.
>>
>> and it will not be moved beyond the current insert location. Returns
>> name of the slot and real position to which it was advanced to. The
>> information of the updated slot is scheduled to be written out at the
>> follow-up checkpoint if any advancing is done. In the event of a
>> crash, the slot may return to an earlier position.
>
> Just searched through the *.sgml files, we already use terms 'dirty' and
> 'flush' applied to writing out pages during checkpoints. Here we are trying
> to describe the very similar process, but in relation to replication slots,
> so it looks fine for me. In the same time, the term 'schedule' is used for
> VACUUM, constraint check or checkpoint itself.

Honestly, I was a bit on the fence about the term "dirty" when typing
this paragraph, so I kind of agree with Horiguchi-san's point that it
could be confusing when applied to replication slots, because there is
no other reference in the docs about the link between the two
concepts.  So, I would go for a simpler sentence for the first part,
keeping the second sentence intact:
"The information of the updated slot is written out at the follow-up
checkpoint if any advancing is done.  In the event of a crash, the
slot may return to an earlier position."
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Kyotaro Horiguchi
Дата:
At Wed, 29 Jan 2020 15:45:56 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Jan 28, 2020 at 06:06:06PM +0300, Alexey Kondratov wrote:
> > On 28.01.2020 15:14, Kyotaro Horiguchi wrote:
> >> But the doc part looks a bit too detailed to me. Couldn't we explain
> >> that without the word 'dirty'?
..
> >> and it will not be moved beyond the current insert location. Returns
> >> name of the slot and real position to which it was advanced to. The
> >> information of the updated slot is scheduled to be written out at the
> >> follow-up checkpoint if any advancing is done. In the event of a
> >> crash, the slot may return to an earlier position.
> > 
> > Just searched through the *.sgml files, we already use terms 'dirty' and
> > 'flush' applied to writing out pages during checkpoints. Here we are trying
> > to describe the very similar process, but in relation to replication slots,
> > so it looks fine for me. In the same time, the term 'schedule' is used for
> > VACUUM, constraint check or checkpoint itself.
> 
> Honestly, I was a bit on the fence for the term "dirty" when typing
> this paragraph, so I kind of agree with Horiguchi-san's point that it
> could be confusing when applied to replication slots, because there is
> no other reference in the docs about the link between the two
> concepts.  So, I would go for a more simplified sentence for the first
> part, keeping the second sentence intact:
> "The information of the updated slot is written out at the follow-up
> checkpoint if any advancing is done.  In the event of a crash, the
> slot may return to an earlier position."

Looks perfect.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote:
> Looks perfect.

Thanks Horiguchi-san and others.  Applied and back-patched down to
11.
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Alexey Kondratov
Дата:
On 30.01.2020 05:19, Michael Paquier wrote:
> On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote:
>> Looks perfect.
> Thanks Horiguchi-san and others.  Applied and back-patched down to
> 11.

Great! Thanks for getting this done.

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Physical replication slot advance is not persistent

От
Andres Freund
Дата:
Hi,

On 2020-01-28 17:01:14 +0900, Michael Paquier wrote:
> So attached is an updated patch which addresses the problem just by
> marking a physical slot as dirty if any advancing is done.  Some
> documentation is added about the fact that an advance is persistent
> only at the follow-up checkpoint.  And the tests are fixed to not use
> a fake LSN but instead advance to the latest LSN position produced.

>  
> -    /* Update the on disk state when lsn was updated. */
> -    if (XLogRecPtrIsInvalid(endlsn))
> -    {
> -        ReplicationSlotMarkDirty();
> -        ReplicationSlotsComputeRequiredXmin(false);
> -        ReplicationSlotsComputeRequiredLSN();
> -        ReplicationSlotSave();
> -    }
> -

I am quite confused by the wholesale removal of these lines. That wasn't
in previous versions of the patch. As far as I can tell, not calling
ReplicationSlotsComputeRequiredLSN() for the physical slot leads to the
global minimum LSN never being advanced, and thus WAL reserved by the
slot not being removable.  Only if there's some independent call to
ReplicationSlotsComputeRequiredLSN() will XLogSetReplicationSlotMinimumLSN()
be called, allowing for slots to advance.

I realize this stuff has been broken since the introduction in
9c7d06d6068 (due to the above being if (XLogRecPtrIsInvalid()) rather
than if (!XLogRecPtrIsInvalid())), but this seems to make it even worse?


I find it really depressing how much obviously untested stuff gets
added in this area.

Greetings,

Andres Freund



Re: Physical replication slot advance is not persistent

От
Alexey Kondratov
Дата:
On 2020-06-09 20:19, Andres Freund wrote:
> Hi,
> 
> On 2020-01-28 17:01:14 +0900, Michael Paquier wrote:
>> So attached is an updated patch which addresses the problem just by
>> marking a physical slot as dirty if any advancing is done.  Some
>> documentation is added about the fact that an advance is persistent
>> only at the follow-up checkpoint.  And the tests are fixed to not use
>> a fake LSN but instead advance to the latest LSN position produced.
> 
>> 
>> -    /* Update the on disk state when lsn was updated. */
>> -    if (XLogRecPtrIsInvalid(endlsn))
>> -    {
>> -        ReplicationSlotMarkDirty();
>> -        ReplicationSlotsComputeRequiredXmin(false);
>> -        ReplicationSlotsComputeRequiredLSN();
>> -        ReplicationSlotSave();
>> -    }
>> -
> 
> I am quite confused by the wholesale removal of these lines. That 
> wasn't
> in previous versions of the patch. As far as I can tell not calling
> ReplicationSlotsComputeRequiredLSN() for the physical slot leads to the
> global minimum LSN never beeing advanced, and thus WAL reserved by the
> slot not being removable.  Only if there's some independent call to
> ReplicationSlotsComputeRequiredLSN() XLogSetReplicationSlotMinimumLSN()
> will be called, allowing for slots to advance.
> 
> I realize this stuff has been broken since the introduction in
> 9c7d06d6068 (due to the above being if (XLogRecPtrIsInvalid()) rather
> than if (!XLogRecPtrIsInvalid()) , but this seems to make it even 
> worse?
> 

Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside
pg_physical_replication_slot_advance() in v5 of the patch:

@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
          MyReplicationSlot->data.restart_lsn = moveto;
          SpinLockRelease(&MyReplicationSlot->mutex);
          retlsn = moveto;
+
+        ReplicationSlotMarkDirty();
+
+        /* We moved retart_lsn, update the global value. */
+        ReplicationSlotsComputeRequiredLSN();

But it was dropped later on and we did not notice that.

I think that adding it back as per attached will be enough.

> 
> I find it really depressing how much obviously untested stuff gets
> added in this area.
> 

Prior to this patch pg_replication_slot_advance was not being tested at
all. Unfortunately, the added tests turned out not to be enough to cover
all cases. It seems that the whole machinery of WAL retention and
trimming is worth testing more thoroughly.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Вложения

Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Tue, Jun 09, 2020 at 09:01:13PM +0300, Alexey Kondratov wrote:
> Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside
> pg_physical_replication_slot_advance() in the v5 of the patch:
>
> But later it has been missed and we have not noticed that.
>
> I think that adding it back as per attached will be enough.

[ scratches head... ]
Indeed, this part got it wrong, and we would likely have to rely on a
WAL sender to do this calculation once a new flush location is received,
but that may not happen in some cases.  It feels more natural to do
that in the location where the slot is marked as dirty, and there
is no need to move around an extra check to see if the slot has
actually been advanced or not.  Or we could just call the routine once
any advancing is attempted?  That would be unnecessary if no advancing
is done.

> > I find it really depressing how much obviously untested stuff gets
> > added in this area.
>
> Prior to this patch pg_replication_slot_advance was not being tested
> at all.
> Unfortunately, added tests appeared to be not enough to cover all
> cases. It
> seems that the whole machinery of WAL holding and trimming is worth
> to be
> tested more thoroughly.

I think that it would be interesting if we had a SQL representation of
the contents of XLogCtlData (wanted that a couple of times).  Now we
are actually limited to using a checkpoint and checking that past segments
are getting recycled by looking at the contents of pg_wal.  Doing that
here does not cause the existing tests to be much more expensive as we
only need one extra call to pg_switch_wal(), mostly.  Please see the
attached.
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Kyotaro Horiguchi
Дата:
At Wed, 10 Jun 2020 15:53:53 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Jun 09, 2020 at 09:01:13PM +0300, Alexey Kondratov wrote:
> > Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside
> > pg_physical_replication_slot_advance() in the v5 of the patch:
> >
> > But later it has been missed and we have not noticed that.
> >
> > I think that adding it back as per attached will be enough.

Sure.

> [ scratches head... ]
> Indeed, this part gets wrong and we would have to likely rely on a WAL
> sender to do this calculation once a new flush location is received,
> but that may not happen in some cases.  It feels more natural to do
> that in the location where the slot is marked as dirty, and there 
> is no need to move around an extra check to see if the slot has
> actually been advanced or not.  Or we could just call the routine once
> any advancing is attempted?  That would be unnecessary if no advancing
> is done.

We don't call the function that frequently. I don't think it can be a
problem to update replicationSlotMinLSN every time advancing is attempted.

> > > I find it really depressing how much obviously untested stuff gets
> > > added in this area.
> >
> > Prior to this patch pg_replication_slot_advance was not being tested
> > at all.
> > Unfortunately, added tests appeared to be not enough to cover all
> > cases. It
> > seems that the whole machinery of WAL holding and trimming is worth
> > to be
> > tested more thoroughly.
> 
> I think that it would be interesting if we had a SQL representation of
> the contents of XLogCtlData (wanted that a couple of times).  Now we
> are actually limited to use a checkpoint and check that past segments
> are getting recycled by looking at the contents of pg_wal.  Doing that
> here does not cause the existing tests to be much more expensive as we
> only need one extra call to pg_switch_wal(), mostly.  Please see the
> attached.

The test in the patch looks fine and worked well for me.

Using a smaller wal_segment_size (1MB worked for me) reduces the cost
of the check, but I'm not sure it's worth doing.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Physical replication slot advance is not persistent

От
Alexey Kondratov
Дата:
On 2020-06-10 11:38, Kyotaro Horiguchi wrote:
> At Wed, 10 Jun 2020 15:53:53 +0900, Michael Paquier
> <michael@paquier.xyz> wrote in
>> > > I find it really depressing how much obviously untested stuff gets
>> > > added in this area.
>> >
>> > Prior to this patch pg_replication_slot_advance was not being tested
>> > at all.
>> > Unfortunately, added tests appeared to be not enough to cover all
>> > cases. It
>> > seems that the whole machinery of WAL holding and trimming is worth
>> > to be
>> > tested more thoroughly.
>> 
>> I think that it would be interesting if we had a SQL representation of
>> the contents of XLogCtlData (wanted that a couple of times).  Now we
>> are actually limited to use a checkpoint and check that past segments
>> are getting recycled by looking at the contents of pg_wal.  Doing that
>> here does not cause the existing tests to be much more expensive as we
>> only need one extra call to pg_switch_wal(), mostly.  Please see the
>> attached.
> 
> The test in the patch looks fine to me and worked well for me.
> 
> Using smaller wal_segment_size (1(MB) worked for me) reduces the cost
> of the check, but I'm not sure it's worth doing.
> 

The new test reproduces this issue well. I left it running on repeat for
a couple of hours and it seems to be stable.

I just noted that we do not need to keep $phys_restart_lsn_pre:

my $phys_restart_lsn_pre = $node_master->safe_psql('postgres',
    "SELECT restart_lsn from pg_replication_slots WHERE slot_name = '$phys_slot';"
);
chomp($phys_restart_lsn_pre);

we can safely use the $current_lsn passed to
pg_replication_slot_advance(), since restart_lsn is set to exactly that
value there. It may make the test a bit simpler as well.


-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
> New test reproduces this issue well. Left it running for a couple of hours
> in repeat and it seems to be stable.

Thanks for testing.  I have been thinking about the minimum xmin and
LSN computations on advancing, and actually I have switched the
recomputing to be called at the end of pg_replication_slot_advance().
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.
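
In code terms, the tail of pg_replication_slot_advance() now ends up
looking roughly like this (a sketch rather than a quote from the
patch):

    /* ... per-type advancing, physical or logical, happens above ... */

    /*
     * Recompute the global horizons after any advancing attempt, so that
     * WAL and tuples retained for the old position can be released
     * without waiting for a checkpoint or a WAL sender to do it.
     */
    ReplicationSlotsComputeRequiredXmin(false);
    ReplicationSlotsComputeRequiredLSN();

    ReplicationSlotRelease();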

> we can safely use $current_lsn used for pg_replication_slot_advance(), since
> reatart_lsn is set as is there. It may make the test a bit simpler as well.

We could do that.  But I found the direct comparison of
pg_replication_slots.restart_lsn before and after the restart cleaner,
so I have kept it.
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Tue, Jun 16, 2020 at 04:27:27PM +0900, Michael Paquier wrote:
> We could do that.  Now I found cleaner the direct comparison of
> pg_replication_slots.restart before and after the restart.  So I have
> kept it.

And done.  There were conflicts in 001_stream_rep.pl for 11 and 12, but
I have reworked the patch on those branches to keep the diffs with the
other branches to a minimum.  This part additionally needed to stop
standby_1 before running the last part of the test, to be able to drop
its physical slot on the primary.
--
Michael

Вложения

Re: Physical replication slot advance is not persistent

От
Alexey Kondratov
Дата:
On 2020-06-16 10:27, Michael Paquier wrote:
> On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
>> New test reproduces this issue well. Left it running for a couple of 
>> hours
>> in repeat and it seems to be stable.
> 
> Thanks for testing.  I have been thinking about the minimum xmin and
> LSN computations on advancing, and actually I have switched the
> recomputing to be called at the end of pg_replication_slot_advance().
> This may be a waste if no advancing is done, but it could also be an
> advantage to enforce a recalculation of the thresholds for each
> function call.  And that's more consistent with the slot copy, drop
> and creation.
> 

Sorry for the slightly late response, but I see a couple of issues with
this modified version of the patch, in addition to the wasted call when
no advancing is done, which you already mentioned:

1. Both ReplicationSlotsComputeRequiredXmin() and
ReplicationSlotsComputeRequiredLSN() may have already been done in
LogicalConfirmReceivedLocation() if it was a logical slot. It may be
fine and almost costless to do it twice, but it looks untidy to me.

2. It seems that we do not need ReplicationSlotsComputeRequiredXmin() at
all if it was a physical slot, since we do not modify xmin in
pg_physical_replication_slot_advance(), do we?

That's why I wanted (somewhere around v5 of the patch in this thread) to
move all dirtying and recomputing calls to the places where xmin / lsn
slot modifications are actually done:
pg_physical_replication_slot_advance() and
LogicalConfirmReceivedLocation(). LogicalConfirmReceivedLocation()
already does this, so we only needed to teach
pg_physical_replication_slot_advance() to do the same.

However, I just noted that LogicalConfirmReceivedLocation() only does
ReplicationSlotsComputeRequiredLSN() if the updated_xmin flag was set,
which looks wrong from my perspective, since the updated_xmin and
updated_restart flags are set separately.
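
To illustrate what I mean, a hypothetical sketch (not a quote from the
attached patch) would recompute each horizon under its own flag:

    /* Hypothetical sketch: decouple the two recomputations. */
    if (updated_xmin)
        ReplicationSlotsComputeRequiredXmin(false);

    if (updated_restart)
        ReplicationSlotsComputeRequiredLSN();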

That way, I would solve this all as per attached, which works well for
me, but it definitely deserves more thorough testing.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Вложения

Re: Physical replication slot advance is not persistent

От
Amit Kapila
Дата:
On Fri, Jun 19, 2020 at 12:16 AM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 2020-06-16 10:27, Michael Paquier wrote:
> > On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
> >> New test reproduces this issue well. Left it running for a couple of
> >> hours
> >> in repeat and it seems to be stable.
> >
> > Thanks for testing.  I have been thinking about the minimum xmin and
> > LSN computations on advancing, and actually I have switched the
> > recomputing to be called at the end of pg_replication_slot_advance().
> > This may be a waste if no advancing is done, but it could also be an
> > advantage to enforce a recalculation of the thresholds for each
> > function call.  And that's more consistent with the slot copy, drop
> > and creation.
> >
>
> Sorry for a bit late response, but I see a couple of issues with this
> modified version of the patch in addition to the waste call if no
> advancing is done, mentioned by you:
>
> 1. Both ReplicationSlotsComputeRequiredXmin() and
> ReplicationSlotsComputeRequiredLSN() may have already been done in the
> LogicalConfirmReceivedLocation() if it was a logical slot.
>

I think it is not done in all cases; see the else part in
LogicalConfirmReceivedLocation.

LogicalConfirmReceivedLocation
{
    ..
    else
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.confirmed_flush = lsn;
        SpinLockRelease(&MyReplicationSlot->mutex);
    }
    ..
}

>
> However, just noted that LogicalConfirmReceivedLocation() only does
> ReplicationSlotsComputeRequiredLSN() if updated_xmin flag was set, which
> looks wrong from my perspective, since updated_xmin and updated_restart
> flags are set separately.
>

I see your point, but it is better to back such a change with a test case.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Physical replication slot advance is not persistent

От
Michael Paquier
Дата:
On Thu, Jul 09, 2020 at 04:12:49PM +0530, Amit Kapila wrote:
> On Fri, Jun 19, 2020 at 12:16 AM Alexey Kondratov
> <a.kondratov@postgrespro.ru> wrote:
>> 1. Both ReplicationSlotsComputeRequiredXmin() and
>> ReplicationSlotsComputeRequiredLSN() may have already been done in the
>> LogicalConfirmReceivedLocation() if it was a logical slot.
>>
>
> I think it is not done in all cases, see the else part in
> LogicalConfirmReceivedLocation.
>
> LogicalConfirmReceivedLocation
> {
> ..
> else
> {
> SpinLockAcquire(&MyReplicationSlot->mutex);
> MyReplicationSlot->data.confirmed_flush = lsn;
> SpinLockRelease(&MyReplicationSlot->mutex);
> }
> ..
> }

Thanks Amit, and sorry for the late catchup.  The choice of computing
the minimum LSN and xmin across all slots at the end of
pg_replication_slot_advance() is deliberate.  That's more consistent
with the slot creation, copy and drop for one, and that was also the
intention of the original code (actually a no-op as introduced by
9c7d06d).  This also brings an interesting property to the advancing
routines to be able to enforce a recomputation without having to wait
for a checkpoint or a WAL sender to do so.  So, while there may be
cases where we don't need this recomputation to happen, and there may
be cases where it may be a waste, the code simplicity and consistency
are IMO reasons enough to keep this code as it is now.
--
Michael

Вложения