Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From Amit Kapila
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date
Msg-id CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ@mail.gmail.com
In reply to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (hubert depesz lubaczewski <depesz@depesz.com>)
Responses Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
List pgsql-bugs
On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
> > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > 4. wal sender restarts for some reason (or server crashed).
> > > >
> > > > I don't think walsender alone restarting should change anything, but
> > > > crash-restart obviously would.
> > >
> > > Right. I've confirmed this scenario is possible to happen with crash-restart.
> >
> > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
> > then we know what the problem likely is and can fix it. If not...
> >
> >
> > Just to confirm, the problem has "newly" occurred after you've upgraded to
> > 12.12? I couldn't quite tell from the thread.
>
> No crashes that I can find any track of.
>

I think this problem could arise when walsender exits due to some
error like "terminating walsender process due to replication timeout".
Here is the theory I came up with:

1. Initially the restart_lsn is updated to 1039D/83825958. This will
allow all files till 000000010001039D00000082 to be removed.
2. Next the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
3. walsender restarts due to replication timeout.
4. After restart, it starts reading WAL from 1039D/83825958 as that
was restart_lsn.
5. walsender gets a message to update write, flush, apply, etc. As
part of that, it invokes
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
replicationSlotMinLSN will also be computed to the same value, allowing
removal of all files older than 000000010001039D0000008A. This will
allow removing 000000010001039D00000083, 000000010001039D00000084,
etc.
7. Now, we get a new slot->candidate_restart_lsn of 1039D/83825958.
Remember from step 1, we are still reading WAL from that location.
8. At the next update for write, flush, etc. as part of processing
standby reply message, we will invoke
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This
updates restart_lsn to 1039D/83825958.
9. walsender restarts due to replication timeout.
10. After restart, it starts reading WAL from 1039D/83825958 as that
was restart_lsn and gets an error "requested WAL segment
000000010001039D00000083 has already been removed" because that
segment was removed in step 6.
11. After that, walsender will keep getting the same error on every
restart, as the restart_lsn never advances.
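
To make the LSN-to-file mapping in the steps above easier to check, here is a small sketch that converts an LSN into a WAL segment file name. It assumes timeline 1 and the default 16 MB segment size (neither is stated in this thread, other than the timeline implied by the file name in the error message), and the helper names parse_lsn and wal_file_name are made up for this illustration.

```python
# Assumption: default wal_segment_size of 16 MB; adjust if the cluster differs.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Turn an LSN written as 'HI/LO' (hex) into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(lsn: str, timeline: int = 1) -> str:
    """Name of the WAL segment file that contains the given LSN.

    The name is three 8-digit hex fields: timeline, then the segment
    number split into a high part and a low part.
    """
    segno = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE  # 256 with 16 MB segments
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)


if __name__ == "__main__":
    print(wal_file_name("1039D/83825958"))
    print(wal_file_name("1039D/8B5773D8"))
```

Under those assumptions, 1039D/83825958 falls in segment 000000010001039D00000083, which is the file named in the error message, and 1039D/8B5773D8 falls in 000000010001039D0000008B.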

What do you think?

If this diagnosis is correct, I think we need to clear
candidate_restart_lsn and friends during ReplicationSlotRelease().
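
As a rough illustration of why clearing the candidate on release would help, the following toy model replays the steps of the theory above. It is not PostgreSQL code: ToySlot, its methods, and replay_scenario are hypothetical names, the model keeps only restart_lsn, one candidate value, and a "WAL removed below this point" marker, and it ignores whatever additional checks the real server applies.

```python
from dataclasses import dataclass
from typing import Optional

OLD_LSN = 0x1039D83825958  # 1039D/83825958
NEW_LSN = 0x1039D8B5773D8  # 1039D/8B5773D8


@dataclass
class ToySlot:
    """Hypothetical, heavily simplified stand-in for a replication slot."""
    restart_lsn: int
    candidate_restart_lsn: Optional[int] = None
    wal_removed_below: int = 0  # WAL before this position is gone

    def set_candidate(self, lsn: int) -> None:
        self.candidate_restart_lsn = lsn

    def confirm_received(self) -> None:
        # Stand-in for the standby reply path: promote the candidate.
        if self.candidate_restart_lsn is not None:
            self.restart_lsn = self.candidate_restart_lsn
            self.candidate_restart_lsn = None

    def checkpoint(self) -> None:
        # WAL older than the slot's restart_lsn may be removed.
        self.wal_removed_below = max(self.wal_removed_below, self.restart_lsn)

    def release(self, clear_candidates: bool) -> None:
        # The proposed fix: forget in-memory candidates when releasing.
        if clear_candidates:
            self.candidate_restart_lsn = None


def replay_scenario(clear_on_release: bool) -> bool:
    """Return True if the WAL needed after the last restart still exists."""
    slot = ToySlot(restart_lsn=OLD_LSN)      # step 1
    slot.set_candidate(NEW_LSN)              # step 2
    slot.release(clear_on_release)           # step 3: walsender exits
    read_from = slot.restart_lsn             # step 4: restart, read from here
    slot.confirm_received()                  # steps 5-6: stale candidate used
    slot.checkpoint()                        # step 6: old WAL removed
    slot.set_candidate(read_from)            # step 7
    slot.confirm_received()                  # step 8
    slot.release(clear_on_release)           # step 9: walsender exits again
    return slot.restart_lsn >= slot.wal_removed_below  # step 10


if __name__ == "__main__":
    print("without clearing:", replay_scenario(False))
    print("with clearing:   ", replay_scenario(True))
```

In this model, the run without clearing ends with restart_lsn pointing at WAL that has already been removed, while the run that clears the candidate on release does not.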

-- 
With Regards,
Amit Kapila.


