Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Дата
Msg-id 20221017.155307.3782702886583901.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Список pgsql-bugs
At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> > At 2:20:48pm there was:
> > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could
notreceive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
 
> >
> > But these errors happened earlier too (minutes/hours earlier).
> >
> > But then, 3 minutes later I got:
> >
> > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive
datafrom WAL stream: SSL SYSCALL error: EOF detected
 
> > server closed the connection unexpectedly
> >         This probably means the server terminated abnormally
> >         before or while processing the request.",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical
replicationapply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
 
> > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker
""logicalreplication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
 
> > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive
datafrom WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been
removed",,,,,,,,,"","logicalreplication worker",,0
 
> > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker
""logicalreplication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
 
> >
> >
> > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile
onpg12/bionic.
 
> >
> 
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGS around the code that removes the WAL files. For example,
> we can try to print the value of minimumslotLSN (keep) and logSegNo in
> KeepLogSeg().

What I can see from the information at hand is that walsender is
killed (not died) twice for some reasons then some "reserved" WAL
files are lost.  I don't see the reason for the second walsender (pid
29207) being killed but if the "has already been removed" log line
above is the first one, the file removal is probably happend AT
14:23:10.  (log is needed to know that.)

Anyway, I didn't come up with a possibility other than of someone else
having removed those files..  For example, virus scan softwares.  If
any such software kills processes that is grabbing the files that the
software identified as malicious and is going to remove, that could be
happen?

> > So, 4 files are missing.
> >
> > These were archived properly, and I tried to restore them from archive, and put
> > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > slot with the same message about "requested WAL segment
> > 000000010000CA0A00000049 has already been removed"
> >
> 
> I think you are seeing this behavior because we update the
> lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> then we use that to give the error you are seeing.

lastRemovedSegNo is updated once per one segment of removal.  Four
files are lost in this case.


> > I know that at this moment we're hosed. This replication was removed,
> > everythbing cleared (after all, we accumulated already 300G of wal, and
> > we can't extend the FS forever, so we needed to cleanup the situation).
> >
> > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
> > and the fact that it did means that either we did something wrong (what,
> > though?), or there is some kind of bug in Pg.
> >
> > Anyone has any idea on how to investigate/fix that?
> >
> 
> As suggested above, I feel we need to add some LOGS to see what is
> going on unless you have a reproducible testcase.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Julien Rouhaud
Дата:
Сообщение: Re: BUG #17636: terminating connection because of crash of another server process
Следующее
От: Julien Rouhaud
Дата:
Сообщение: Re: BUG #17645: Strange queries that stuck in database system