Re: persist logical slots to disk during shutdown checkpoint

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: persist logical slots to disk during shutdown checkpoint
Дата
Msg-id CAA4eK1Kvf0vEdBDKqrDtraZraSkZp+cNb_2TqahyTXXhtkDY8A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: persist logical slots to disk during shutdown checkpoint  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: persist logical slots to disk during shutdown checkpoint  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
On Wed, Sep 13, 2023 at 10:57 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Sep 12, 2023 at 03:15:44PM +0530, Amit Kapila wrote:
> >>
> >> - Not sure that there is any point to mention the other code paths in
> >> the tree where ReplicationSlotSave() can be called, and a slot can be
> >> saved in other processes than just WAL senders (like slot
> >> manipulations in normal backends, for one).  This was the last
> >> sentence in v10.
> >
> > The point was the earlier sentence is no longer true and keeping it as
> > it is could be wrong or at least misleading. For example, earlier it
> > is okay to say, "This needn't actually be part of a checkpoint, ..."
> > but now that is no longer true as we want to invoke this at the time
> > of shutdown checkpoint for correctness. If we want to be precise, we
>
> How so?
>

Consider if we move this call to bgwriter (aka flushing slots is no
longer part of a checkpoint), Would that be okay? Previously, I think
it was okay but not now. I see an argument to keep that as it is as
well because we have already mentioned the special shutdown checkpoint
case. By the way, I have changed this because Ashutosh felt it is no
longer correct to keep the first sentence as it is. See his email[1]
(Relying on the first sentence, ...). It happened previously as well
that different reviewers working in this area have different views on
this sort of thing. I am trying my best to address the review
comments, especially from experienced hackers but personally, I feel
this is a minor nitpick and isn't worth too much argument, either way,
should be okay.

>
> >> +           if (s->data.invalidated == RS_INVAL_NONE &&
> >> +               s->data.confirmed_flush != s->last_saved_confirmed_flush)
> >>
> >> Actually this is incorrect, no?  Shouldn't we make sure that the
> >> confirmed_flush is strictly higher than the last saved LSN?
> >
> > I can't see why it is incorrect. Do you see how (in what scenario) it
> > could go wrong? As per my understanding, confirmed_flush LSN will
> > always be greater than equal to last_saved_confirmed_flush but we
> > don't want to ensure that point here because we just want if the
> > latest value is not the same then we should mark the slot dirty and
> > flush it as that will be location we have ensured to update before
> > walsender shutdown. I think it is better to add an assert if you are
> > worried about any such case and we had thought of adding it as well
> > but then didn't do it because we don't have matching asserts to ensure
> > that we never assign prior LSN value to consfirmed_flush LSN.
>
> Because that's just safer in the long run, and I don't see why we
> cannot just do that?  Imagine, for instance, that a bug doing an
> incorrect manipulation of a logical slot's data does an incorrect
> computation of this field, and that we finish with in-memory data
> that's older than what was previously saved.  The code may cause a
> flush at an incorrect, past, position.  That's just an assumption from
> my side, of course.
>

If you are worried about such bugs, it would be better to have an
Assert as suggested previously rather than greater than check because
we will at least catch such bugs otherwise it can go unnoticed or in
the worst case will lead to unknown consequences. I am saying this
because if there are such bugs (or got introduced later) then the slot
can be flushed with a prior confirmed_flush location even from other
code paths. Just for reference, we don't have any check ensuring that
confirmed_flush LSN can move backward in function
LogicalConfirmReceivedLocation(), see and also another place where we
update it:
else
{
SpinLockAcquire(&MyReplicationSlot->mutex);
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}

As other places don't have an assert, I didn't add one here but we can
add one here.

> > I don't want to argue on such a point because it is a little bit of a
> > matter of personal choice but I find this comment unclear. It seems to
> > read that confirmed_flush LSN is some LSN position which is where we
> > flushed the slot's data and that is not true. I found the last comment
> > in the patch sent by me: "This value tracks the last confirmed_flush
> > LSN flushed which is used during a shutdown checkpoint to decide if
> > logical's slot data should be forcibly flushed or not." which I feel
> > we agreed upon is better.
>
> Okay, fine by me here.
>

Thanks, will change once we agree on the remaining points.

--
With Regards,
Amit Kapila.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Have better wording for snapshot file reading failure
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: pg_upgrade and logical replication