Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From Amit Kapila
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date
Msg-id CAA4eK1JmV90sqy9si9jWEovHruvA5EP8FbnjNTQFOmHmy0Z3vw@mail.gmail.com
In reply to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-bugs
On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t          10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
> > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t          10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t          21748      \N    3241443528    1039D/83825958  1039D/955633D0
> > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t          21748      \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
>
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?
>
> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.
>
> I've tried to reproduce this issue with a similar setting in a loop
> but could not.
>
> Given that you have the log "attempting to remove WAL segments older
> than log file 000000000001039D0000008A", the last WAL file we need to
> retain calculated by KeepLogSeg() was wrong.
>
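
To make the arithmetic concrete, here is a minimal sketch (not
PostgreSQL source) that maps an LSN in the %X/%X text form to its WAL
segment and file name. It assumes the default 16MB segment size, and
the helper names parse_lsn, lsn_to_segno and wal_file_name are made up
for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: default 16MB WAL segments (wal_segment_size). */
#define WAL_SEGMENT_SIZE (UINT64_C(16) * 1024 * 1024)

/* Parse the "hi/lo" hex form, e.g. "1039D/83825958", into a 64-bit LSN. */
bool
parse_lsn(const char *text, uint64_t *lsn)
{
    uint32_t hi, lo;

    assert(text != NULL && lsn != NULL);
    if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
        return false;
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* Number of the WAL segment that contains the given LSN. */
uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/* File name: timeline, then the high and low parts of the segment number. */
void
wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    snprintf(buf, len, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_xlogid),
             (uint32_t) (segno % segs_per_xlogid));
}
```

For the slot's restart_lsn of 1039D/83825958 and timeline 0 (used
here only to match the format of the log message; the file on disk
carries the real timeline), this gives 000000000001039D00000083. That
is below the 000000000001039D0000008A cutoff in the log, so the
segment the slot still needed was eligible for removal.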
> KeepLogSeg() retreated the last segment that we need to retain while
> considering replication slots' minimum LSN (and wal_keep_segments) but
> it seemed not to work fine. The possibilities are (1) we got the wrong
> slots' minimum LSN that is newer than a slot's LSN (ie, retreated
> logSegNo but it's still newer than required WAL files) or (2) we got
> an invalid slots' minimum LSN (ie, didn't retreat at all). If we
> know what the RedoRecPtr was, we can calculate _logSegNo that has not
> been retreated by KeepLogSeg().
>

The other possibility is that the slot is not considered for minimum
LSN computation (say in_use flag is temporarily set to false).
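
As a rough illustration of that possibility, the sketch below models
the minimum-LSN scan in a simplified form. It is not the server code:
SlotSketch and min_required_lsn are hypothetical names, and only the
two fields that matter here are modelled. A slot whose in_use flag
reads false is skipped, so its restart_lsn no longer holds back the
result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_LSN UINT64_C(0)     /* stand-in for InvalidXLogRecPtr */

/* Hypothetical, trimmed-down slot: only what this sketch needs. */
typedef struct SlotSketch
{
    bool        in_use;
    uint64_t    restart_lsn;
} SlotSketch;

/* Oldest restart_lsn over slots that are in use, or INVALID_LSN if none. */
uint64_t
min_required_lsn(const SlotSketch *slots, size_t nslots)
{
    uint64_t    min_lsn = INVALID_LSN;

    assert(slots != NULL || nslots == 0);
    for (size_t i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use)
            continue;           /* invisible to the computation */
        if (slots[i].restart_lsn == INVALID_LSN)
            continue;           /* nothing reserved by this slot */
        if (min_lsn == INVALID_LSN || slots[i].restart_lsn < min_lsn)
            min_lsn = slots[i].restart_lsn;
    }
    return min_lsn;
}
```

With one slot at 1039D/83825958 and a second, made-up slot at a newer
LSN, the result is 1039D/83825958 while both are in use. If the first
slot's in_use is false at the moment of the scan, the result jumps to
the newer LSN and the older WAL is no longer protected.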

> BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:
>
>         if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
>             MyReplicationSlot->candidate_restart_valid <= lsn)
>         {
>             Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
>
>             MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
>             MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
>             MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
>             updated_restart = true;
>         }
>
> Is it worth changing the assertion to elog(ERROR) so that we can
> detect that an invalid value is set to restart_lsn?
>

This is worth considering, but note that the caller already has an
assertion against invalid values being assigned to the
candidate_restart_* variables. Also, wouldn't it be shown as invalid
for restart_lsn in pg_replication_slots?
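
For discussion, the proposed change could look roughly like the
standalone model below. It is only a sketch: SlotState,
AdvanceResult and advance_restart_lsn are hypothetical names, and the
ADVANCE_INVALID_CANDIDATE return stands in for where the server would
raise elog(ERROR). The point is that the check also runs in builds
without assertions enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_LSN UINT64_C(0)     /* stand-in for InvalidXLogRecPtr */

/* Hypothetical subset of the slot fields touched by the quoted code. */
typedef struct SlotState
{
    uint64_t    restart_lsn;
    uint64_t    candidate_restart_valid;
    uint64_t    candidate_restart_lsn;
} SlotState;

typedef enum
{
    ADVANCE_NONE,               /* candidate not yet confirmed */
    ADVANCE_DONE,               /* restart_lsn was updated */
    ADVANCE_INVALID_CANDIDATE   /* would be elog(ERROR) in the server */
} AdvanceResult;

AdvanceResult
advance_restart_lsn(SlotState *slot, uint64_t confirmed_lsn)
{
    assert(slot != NULL);

    if (slot->candidate_restart_valid == INVALID_LSN ||
        slot->candidate_restart_valid > confirmed_lsn)
        return ADVANCE_NONE;

    /* Runtime check instead of Assert(): refuse to store an invalid LSN. */
    if (slot->candidate_restart_lsn == INVALID_LSN)
        return ADVANCE_INVALID_CANDIDATE;

    slot->restart_lsn = slot->candidate_restart_lsn;
    slot->candidate_restart_lsn = INVALID_LSN;
    slot->candidate_restart_valid = INVALID_LSN;
    return ADVANCE_DONE;
}
```

In this model an invalid candidate leaves restart_lsn untouched and
is reported to the caller, rather than being silently copied into the
slot when assertions are compiled out.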

-- 
With Regards,
Amit Kapila.


