Re: prevent immature WAL streaming

From: Tom Lane
Subject: Re: prevent immature WAL streaming
Msg-id: 45597.1637694259@sss.pgh.pa.us
In response to: Re: prevent immature WAL streaming (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses: Re: prevent immature WAL streaming (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List: pgsql-hackers

We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     | l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.
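
To make the arithmetic behind that guess concrete, here is a small
sketch (not PostgreSQL code) of how a page-boundary LSN relates to the
first record on that page.  The page size (8192) and the short page
header sizes (24 bytes with 8-byte alignment, 20 bytes with 4-byte
alignment) are assumptions about default builds, not something stated
above; the helper names are made up for illustration, and the sketch
assumes the page is not the first in its WAL segment.

```python
# Illustrative sketch only; all constants are assumptions about a
# default PostgreSQL build, and the function names are hypothetical.

XLOG_BLCKSZ = 8192  # assumed WAL page size

# Assumed size of the short WAL page header, keyed by maximum alignment.
SHORT_PAGE_HEADER_SIZE = {8: 24, 4: 20}


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Convert a 64-bit integer back into the 'hi/lo' hex notation."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def first_record_on_page(page_start, maxalign=8):
    """Return the LSN just past the short page header of a WAL page."""
    if page_start % XLOG_BLCKSZ != 0:
        raise ValueError("not a page boundary")
    return page_start + SHORT_PAGE_HEADER_SIZE[maxalign]


boundary = parse_lsn("0/1FFE000")
print(format_lsn(first_record_on_page(boundary)))              # 0/1FFE018
print(format_lsn(first_record_on_page(boundary, maxalign=4)))  # 0/1FFE014
```

Under those assumptions, 0/1FFE018 is just 0/1FFE000 plus the page
header, and the lone 0/1FFE014 report would be consistent with a build
that pads the header to 20 bytes instead.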

            regards, tom lane


