Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Дата
Msg-id CAA4eK1+Xm-z+BAb0nV1tUU52TD+9-x+gZ6W3yVQK37Sw=koB5w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Список pgsql-bugs
On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> > > <depesz@depesz.com> wrote:
> > > >
> > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > > > <depesz@depesz.com> wrote:
> > > > > >
> > > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
     10414       \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
     10414       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
     21748       \N    3241443528    1039D/83825958  1039D/955633D0 
> > > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
     21748       \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > >
> > > > > It happened twice that the walsender process was active for at least
> > > > > one minute while not advancing restart_lsn and exited. Could you share
> > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > > > can find out what the wal sender processes were doing?
> > > >
> > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > > > lot of logs to check if I don't leak any sensitive info.
> > > >
> > > > But, if I'll remove lines that are generated by walsender to pg12
> > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > > > only ~ 3MB. Still a bit too much, as there can be some queries with
> > > > sensitive info.
> > > >

We are primarily interested in LOGs that display lsn values. Can you
try to get that or if possible remove sensitive queries information
and then send the remaining LOGs?

> > > > Perhaps I can grep something out?
> > > >
> > > > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > > > please share it as well.
> > > >
> > > > Found one in the time frame you mentioned:
> > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10
21:03:23UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current
candidate1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" 
> > > >
> > >
> > > Thanks!
> > >
> > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > 000000000001039D0000008A.
> > >
> > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > "current_lsn(1039D/9170B010) <
> > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > part and called LogicalConfirmReceivedLocation():
> > >
> > >     else if (current_lsn <= slot->data.confirmed_flush)
> > >     {
> > >         slot->candidate_restart_valid = current_lsn;
> > >         slot->candidate_restart_lsn = restart_lsn;
> > >
> > >         /* our candidate can directly be used */
> > >         updated_lsn = true;
> > >     }
> > >
> >
> > If this would have been executed in
> > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > the above DEBUG messages "current candidate 1039D/83825958, current
> > after 1039D/8B5773D8" should be the same as proposed and after
> > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > something?
>
> Oh, you're right.
>
> Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> to or greater than 1039D/8B5773D8 at that time but
> slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
>

Yes, that is weird but it had been a bit obvious if the same LOG would
have printed slot->data.restart_lsn. This means that somehow slot's
'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
figure out yet how that can happen but if that happens then the slot's
restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
don't check if slot's candidate_restart_lsn is lesser than its
restart_lsn before assigning the same in line
MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;. I think that should be
checked. Note that we call LogicalConfirmReceivedLocation() can be
called from ProcessStandbyReplyMessage(), so once the wrong
candidate_restart_lsn is set, it can be assigned to restart_lsn from
other code paths as well.

I am not able to think how 'candidate_restart_lsn' can be set to an
LSN value prior to 'restart_lsn'. For that, we probably need more
LOGs. I hope that depesz can share more information from LOGs as that
might give us some clues.

--
With Regards,
Amit Kapila.



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.