Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Дата
Msg-id CAA4eK1JSTC9uJ_CTTFxMZV6wSr2hkCX4bwtruUp9wX_K8JLo1Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (hubert depesz lubaczewski <depesz@depesz.com>)
Ответы Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Список pgsql-bugs
On Fri, Nov 11, 2022 at 8:20 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
> > Hi,
> > just an update. Due to procedures we will upgrade to 12.12 at the
> > beginning of November. Then, I'll try to replicate the problem again,
> > and we'll see how this will pan out.
> > For next test will do my best to use debug2 log level.
>
> So. I have an update.
>
> Out of 8 servers that we prepared upgrade for, one failed with the
> same/similar problem.
> Unfortunately I don't know if this is the same server as before. Sorry.
>
> Specifically, I see now, in logs on focal/pg14 side:
>
> #v+
> 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive
datafrom WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,,,,"","logicalreplication worker",,0 
> #v-
>
> On source (bionic/pg12), I see:
>
> #v+
> =# select * from pg_replication_slots ;
>  slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │
restart_lsn  │ confirmed_flush_lsn 
>
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
>  focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │     [null] │ [null] │   3241443528 │
1039D/83825958│ 1039D/96453F38 
> (1 row)
> #v-
>
> but pg, like previously, lost some wal files, this time 8 of them:
>
> #v+
> =$ sudo ls -l ~postgres/12/main/pg_wal | head
> total 224403644
> -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
> ...
> #v-
>
> This time we had debug2 enabled on both sides.
>
> Looks like we lost segments:
> 000000010001039D00000083, 000000010001039D00000084,
> 000000010001039D00000085, 000000010001039D00000086,
> 000000010001039D00000087, 000000010001039D00000088,
> 000000010001039D00000089, and 000000010001039D0000008A.
>
> It looks that first time replica logged the error was:
>
> #v+
> 2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive
datafrom WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,,,,"","logicalreplication worker",,0 
> #v-
>
> In Pg logs on source I see:
>
> #v+
> 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000082""",,,,,,,,,"" 
> 2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000083""",,,,,,,,,"" 
> 2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000084""",,,,,,,,,"" 
> 2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000085""",,,,,,,,,"" 
> 2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000086""",,,,,,,,,"" 
> 2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000087""",,,,,,,,,"" 
> 2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000088""",,,,,,,,,"" 
> 2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000089""",,,,,,,,,"" 
> 2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D0000008A""",,,,,,,,,"" 
> 2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10
21:06:16UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> 2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10
21:06:21UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> 2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10
21:06:26UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> #v-
>
> So, the whole set of missing files was recycled at the same time.
>
> One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1
minute.Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp.
Lookslike this: 
>
> #v+
> timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active
active_pid xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn 
...
...
> 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0 
...

I think from the above two it is clear that the slot with restart_lsn
1039D/83825958  is active and we seem to have recycled corresponding
segment.

>
> Now, I can keep this situation up for some time, if more debug will be helpful,
> but I don't know how long (disk space issues).
>
> Is there anything in here that helps? Any other info I can dig out?
>

You may have the LOG for "attempting to remove WAL segments older than
log file %s", if so, please share. Also on similar lines, I think we
need to add some LOGs before calling RemoveOldXlogFiles() to know
where our computation to remove files goes off. Something, like the
attached, would be helpful but note I have added this quickly on my
working branch to show you what I have in mind. You may need small
changes based on the branch you are working on.

--
With Regards,
Amit Kapila.

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #17687: Session timezone change does not play well with prepared statements
Следующее
От: hubert depesz lubaczewski
Дата:
Сообщение: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.