Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From Andres Freund
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Msg-id 20221121200836.wov46biwtramawmq@alap3.anarazel.de
In reply to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
List pgsql-bugs
Hi,

On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> I think this problem could arise when walsender exits due to some
> error like "terminating walsender process due to replication timeout".
> Here is the theory I came up with:
> 
> 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> allow all files till 000000000001039D00000082 to be removed.
> 2. Next the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
> 3. walsender restarts due to replication timeout.
> 4. After restart, it starts reading WAL from 1039D/83825958 as that
> was restart_lsn.
> 5. walsender gets a message to update write, flush, apply, etc. As
> part of that, it invokes
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> replicationSlotMinLSN will also be computed to the same value, allowing
> the removal of all files older than 000000000001039D0000008A. This will
> allow removing 000000000001039D00000083, 000000010001039D00000084,
> etc.
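
For readers following the segment names in steps 1 and 6, here is a minimal sketch of how an LSN maps to a WAL file name. It assumes the default 16 MB segment size and takes the timeline as a parameter. The helper names (sketch_make_lsn, sketch_wal_file_name) are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: default 16 MB WAL segment size. */
#define SKETCH_WAL_SEGMENT_SIZE (UINT64_C(16) * 1024 * 1024)

/* Build a 64-bit LSN from the "hi/lo" halves as written in the thread. */
static uint64_t
sketch_make_lsn(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}

/*
 * Hypothetical helper: write the 24-character name of the WAL segment
 * that contains lsn into buf (which must hold at least 25 bytes).
 */
static void
sketch_wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    /* Number of segments per 4 GB "xlogid"; 256 with 16 MB segments. */
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / SKETCH_WAL_SEGMENT_SIZE;
    uint64_t segno = lsn / SKETCH_WAL_SEGMENT_SIZE;

    assert(buflen >= 25);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_xlogid),
             (uint32_t) (segno % segs_per_xlogid));
}
```

Under these assumptions, with the timeline field set to 0 as in most of the quoted names, 1039D/83825958 falls in segment 000000000001039D00000083. That is why files up to ...82 are removable while restart_lsn stays there.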

This would require that the client acknowledged an LSN that we haven't
sent out, no? Shouldn't the
  MyReplicationSlot->candidate_restart_valid <= lsn
check in LogicalConfirmReceivedLocation() have prevented this from happening
unless the client acknowledges up to candidate_restart_valid?
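
To make the guard concrete, here is a simplified model of it. This is a sketch only, not the PostgreSQL source. The SketchSlot type and sketch_confirm_received() are hypothetical stand-ins that keep just the three fields discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;
#define SKETCH_INVALID_LSN ((SketchLsn) 0)

/* Hypothetical, stripped-down stand-in for a replication slot. */
typedef struct SketchSlot
{
    SketchLsn restart_lsn;
    SketchLsn candidate_restart_lsn;
    SketchLsn candidate_restart_valid;
} SketchSlot;

/*
 * Model of the guard: restart_lsn only advances to the candidate once
 * the client has confirmed an LSN at or beyond candidate_restart_valid.
 * Returns true if restart_lsn was advanced.
 */
static bool
sketch_confirm_received(SketchSlot *slot, SketchLsn confirmed_lsn)
{
    if (slot->candidate_restart_valid != SKETCH_INVALID_LSN &&
        slot->candidate_restart_valid <= confirmed_lsn)
    {
        assert(slot->candidate_restart_lsn != SKETCH_INVALID_LSN);

        slot->restart_lsn = slot->candidate_restart_lsn;
        slot->candidate_restart_lsn = SKETCH_INVALID_LSN;
        slot->candidate_restart_valid = SKETCH_INVALID_LSN;
        return true;
    }
    return false;
}
```

In this model, a confirmation below candidate_restart_valid leaves restart_lsn at 1039D/83825958. Only a confirmation at or past that point moves it to 1039D/8B5773D8.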


> 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> Remember from step 1, we are still reading WAL from that location.

I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
in that case, because of the
    /* don't overwrite if have a newer restart lsn */
check.
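
A simplified model of that check, again only a sketch and not the PostgreSQL source. SketchSlot and sketch_propose_restart() are hypothetical, and the real function has further branches that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;
#define SKETCH_INVALID_LSN ((SketchLsn) 0)

/* Hypothetical, stripped-down stand-in for a replication slot. */
typedef struct SketchSlot
{
    SketchLsn restart_lsn;
    SketchLsn candidate_restart_lsn;
    SketchLsn candidate_restart_valid;
} SketchSlot;

/*
 * Model of proposing a new restart point while decoding at current_lsn.
 * Returns true if the candidate pair was recorded.
 */
static bool
sketch_propose_restart(SketchSlot *slot, SketchLsn current_lsn,
                       SketchLsn restart_lsn)
{
    assert(current_lsn != SKETCH_INVALID_LSN);
    assert(restart_lsn != SKETCH_INVALID_LSN);

    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->restart_lsn)
        return false;

    /* Simplification: record a candidate only if none is pending. */
    if (slot->candidate_restart_valid == SKETCH_INVALID_LSN)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
        return true;
    }
    return false;
}
```

In this model, once restart_lsn is 1039D/8B5773D8 (after step 6), proposing 1039D/83825958 as in step 7 is rejected by the first check and the candidate fields stay untouched.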


> If this diagnosis is correct, I think we need to clear
> candidate_restart_lsn and friends during ReplicationSlotRelease().

Possible, but I don't quite see it yet.

Greetings,

Andres Freund


