Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Masahiko Sawada
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Msg-id: CAD21AoBHMCEDcV4eBtSVvDrCN4SrMXanX5N9+L-E+4OWXYY0ew@mail.gmail.com
In reply to: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-bugs
On Tue, Nov 22, 2022 at 12:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> > > I think this problem could arise when walsender exits due to some
> > > error like "terminating walsender process due to replication timeout".
> > > Here is the theory I came up with:
> > >
> > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> > > allow all files till 000000000001039D00000082 to be removed.
> > > 2. Next the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
> > > 3. walsender restarts due to replication timeout.
> > > 4. After restart, it starts reading WAL from 1039D/83825958 as that
> > > was restart_lsn.
> > > 5. walsender gets a message to update write, flush, apply, etc. As
> > > part of that, it invokes
> > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> > > replicationSlotMinLSN will also be computed to the same value,
> > > allowing all files older than 000000000001039D0000008A to be removed.
> > > This will allow removing 000000000001039D00000083,
> > > 000000010001039D00000084, etc.
> >
> > This would require that the client acknowledged an LSN that we haven't
> > sent out, no?  Shouldn't the
> >   MyReplicationSlot->candidate_restart_valid <= lsn
> > from LogicalConfirmReceivedLocation() have prevented this from happening
> > unless the client acknowledges up to candidate_restart_valid?
> >
>
> We would have sent the higher LSN prior to the restart. Let me share a
> snippet of the log which I have extracted from a larger log that depesz
> shared with me offlist at my request:
>
> 2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,received replication command:
> IDENTIFY_SYSTEM,,,,,,,,,focal14
> 2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,"received replication command:
> START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50
> (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14
> 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10
> 21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot
> ""focal14""","Streaming transactions committing after 1039D/911A8DB0,
> reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT
> ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1',
> publication_names '""focal14""')",,,focal14
> 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,sending replication
> keepalive,,,,,,,,,focal14
> 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10
> 21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at
> 1039D/83825958,Logical decoding will begin using saved
> snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL
> 1039D/91F41B50 (proto_version '1', publication_names
> '""focal14""')",,,focal14
> 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush
> 1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10
> 21:03:23.482052+00,,,,,,,,,focal14
> 2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart:
> 1,,,,,,,,,focal14
>
> Here, we can see that even though we have started reading WAL from
> 1039D/83825958, the confirmed_flush location is 1039D/911A8DB0. Then
> we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50)
> probably for a keepalive message. The flush LSN here is higher than
> candidate_restart_valid. The value of candidate_restart_valid before
> walsender restart was 1039D/9170B010 and candidate_restart_lsn was
> 1039D/8B5773D8.
>
> >
> > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> > > Remember from step 1, we are still reading WAL from that location.
> >
> > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
> > in that case, because of the
> >         /* don't overwrite if have a newer restart lsn */
> > check.
> >
>
> But later in LogicalIncreaseRestartDecodingForSlot(), we have the
> below check which can allow overriding it.
> /*
> * Only increase if the previous values have been applied, otherwise we
> * might never end up updating if the receiver acks too slowly. A missed
> * value here will just cause some extra effort after reconnecting.
> */
> if (slot->candidate_restart_valid == InvalidXLogRecPtr)

I also think we took this override path. After updating the restart_lsn,
we accept the given values as candidate_restart_lsn and
candidate_restart_valid, which were 1039D/83825958 and 1039D/8B5773D8
respectively.
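
To make that concrete, here is a small standalone sketch (this is not
the actual code in logical.c; only the branch quoted above is modeled,
the earlier checks in the real function that Andres quoted are omitted,
and offer_restart() is a purely illustrative name):

/*
 * Toy model of the quoted branch: once a confirmation has consumed the
 * previous candidates (clearing candidate_restart_valid), the branch
 * simply stores whatever pair the restarted decoder offers.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define MAKE_LSN(hi, lo) (((XLogRecPtr) (hi) << 32) | (XLogRecPtr) (lo))

static XLogRecPtr candidate_restart_lsn = InvalidXLogRecPtr;
static XLogRecPtr candidate_restart_valid = InvalidXLogRecPtr;

static void
offer_restart(XLogRecPtr current_lsn, XLogRecPtr restart_lsn)
{
    /* "Only increase if the previous values have been applied ..." */
    if (candidate_restart_valid == InvalidXLogRecPtr)
    {
        candidate_restart_valid = current_lsn;
        candidate_restart_lsn = restart_lsn;
    }
}

int
main(void)
{
    /* step 7: decoding restarted from 1039D/83825958 offers it again */
    offer_restart(MAKE_LSN(0x1039D, 0x8B5773D8), MAKE_LSN(0x1039D, 0x83825958));

    printf("candidate_restart_lsn = %X/%X, candidate_restart_valid = %X/%X\n",
           (unsigned) (candidate_restart_lsn >> 32), (unsigned) candidate_restart_lsn,
           (unsigned) (candidate_restart_valid >> 32), (unsigned) candidate_restart_valid);
    return 0;
}

This prints candidate_restart_lsn = 1039D/83825958 with
candidate_restart_valid = 1039D/8B5773D8, i.e. exactly the values we
ended up with.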

Having analyzed the server logs that depesz also shared with me offlist,
this scenario basically makes sense to me. But please note that
000000000001039D0000008A and older files were removed while there was
no running walsender. The walsender that restarted decoding from
1039D/83825958 (i.e., step 4) exited due to wal_sender_timeout at
21:04:56, and no walsender was launched until 000000000001039D0000008A
and older files were removed at 21:06:07. So I think the scenario needs
to change a bit:

6. The restart_lsn was updated to 1039D/8B5773D8 and
replicationSlotMinLSN was also computed to the same value. But no WAL
files were removed at this point because no checkpoint ran between
steps 6 and 7. This was around 21:03:23.
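
As a rough standalone sketch of that timing dependency (the names here
are purely illustrative, this is not the checkpointer code): advancing
the minimum by itself removes nothing; the cutoff is only taken when a
checkpoint actually runs and reads the minimum at that moment.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define MAKE_LSN(hi, lo) (((XLogRecPtr) (hi) << 32) | (XLogRecPtr) (lo))
#define WAL_SEGSZ ((XLogRecPtr) 16 * 1024 * 1024)  /* default wal_segment_size */

static XLogRecPtr slot_min_lsn;   /* stands in for replicationSlotMinLSN */
static uint64_t oldest_segno;     /* oldest WAL segment still on disk */

static void
checkpoint(void)
{
    /* only here do old segments go away, based on the minimum seen now */
    uint64_t cutoff = slot_min_lsn / WAL_SEGSZ;

    if (cutoff > oldest_segno)
        oldest_segno = cutoff;
    printf("after checkpoint: oldest remaining segno 0x%llX\n",
           (unsigned long long) oldest_segno);
}

int
main(void)
{
    oldest_segno = MAKE_LSN(0x1039D, 0x83825958) / WAL_SEGSZ;  /* 0x1039D83 */

    /* step 6 (~21:03:23): the minimum advances to 1039D/8B5773D8 ... */
    slot_min_lsn = MAKE_LSN(0x1039D, 0x8B5773D8);

    /* ... but nothing is removed until the next checkpoint (21:06:07) */
    checkpoint();   /* cutoff becomes 0x1039D8B */
    return 0;
}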

7. We got a new slot->candidate_restart_lsn of 1039D/83825958 and
candidate_restart_valid of 1039D/8B5773D8. This can happen for the
reason Amit mentioned above.

8. The restart_lsn was updated to 1039D/83825958, but we didn't update
replicationSlotMinLSN since slot->data.catalog_xmin was not updated at
the same time (see LogicalConfirmReceivedLocation()). Then the
walsender exited again due to wal_sender_timeout. At this point, note
that the slot's restart_lsn was 1039D/83825958 whereas
replicationSlotMinLSN was 1039D/8B5773D8. This was around 21:04:56.
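
To show the resulting divergence as a small standalone model (again,
this is not LogicalConfirmReceivedLocation() itself, only the behaviour
described in this step; confirm_received() is an illustrative name):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define MAKE_LSN(hi, lo) (((XLogRecPtr) (hi) << 32) | (XLogRecPtr) (lo))

static XLogRecPtr slot_restart_lsn;       /* slot->data.restart_lsn */
static XLogRecPtr replicationSlotMinLSN;  /* shared minimum used for WAL removal */

static void
confirm_received(XLogRecPtr candidate_restart, bool catalog_xmin_advanced)
{
    slot_restart_lsn = candidate_restart;          /* restart_lsn is moved */

    /* the minimum is recomputed only when catalog_xmin advanced as well */
    if (catalog_xmin_advanced)
        replicationSlotMinLSN = slot_restart_lsn;
}

int
main(void)
{
    /* state after step 6 */
    slot_restart_lsn = MAKE_LSN(0x1039D, 0x8B5773D8);
    replicationSlotMinLSN = MAKE_LSN(0x1039D, 0x8B5773D8);

    /* step 8: candidates from step 7 are applied, catalog_xmin unchanged */
    confirm_received(MAKE_LSN(0x1039D, 0x83825958), false);

    printf("restart_lsn %X/%X vs replicationSlotMinLSN %X/%X\n",
           (unsigned) (slot_restart_lsn >> 32), (unsigned) slot_restart_lsn,
           (unsigned) (replicationSlotMinLSN >> 32), (unsigned) replicationSlotMinLSN);
    return 0;
}

This leaves restart_lsn at 1039D/83825958 while replicationSlotMinLSN
stays at 1039D/8B5773D8, which is exactly the inconsistent state the
checkpointer later acted on.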

9. Until 000000000001039D0000008A and older files were removed at
21:06:07, a client had checked the pg_replication_slots view every
minute, and it had shown 1039D/83825958 in the restart_lsn column. On
the other hand, at 21:06:07 the checkpointer removed the WAL segment
files 000000000001039D0000008A and older, based on
replicationSlotMinLSN (1039D/8B5773D8).
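
For reference, mapping the two LSNs to segment file names (a minimal
sketch assuming the default 16MB wal_segment_size and timeline 1) shows
why a cutoff taken from 1039D/8B5773D8 lets everything up to the ...8A
segment go, while resuming decoding from 1039D/83825958 still needed
the ...83 segment:

#include <stdint.h>
#include <stdio.h>

#define WAL_SEGSZ        ((uint64_t) 16 * 1024 * 1024)       /* default 16MB */
#define SEGS_PER_XLOGID  (UINT64_C(0x100000000) / WAL_SEGSZ)
#define MAKE_LSN(hi, lo) (((uint64_t) (hi) << 32) | (uint64_t) (lo))

static void
print_segment(const char *what, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEGSZ;

    printf("%-13s %X/%X -> %08X%08X%08X\n", what,
           (unsigned) (lsn >> 32), (unsigned) lsn,
           1u,                                    /* timeline (assumed) */
           (unsigned) (segno / SEGS_PER_XLOGID),
           (unsigned) (segno % SEGS_PER_XLOGID));
}

int
main(void)
{
    /* replicationSlotMinLSN the checkpointer used at 21:06:07 */
    print_segment("slot minimum:", MAKE_LSN(0x1039D, 0x8B5773D8));

    /* restart_lsn the slot actually had, i.e. where decoding must resume */
    print_segment("restart_lsn:", MAKE_LSN(0x1039D, 0x83825958));
    return 0;
}

This prints ...0001039D0000008B for the slot minimum and
...0001039D00000083 for restart_lsn, so removing everything below the
...8B segment discards WAL that the slot still needed.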

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


