Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Robert Haas
Subject: Re: Assertion failure with summarize_wal enabled during pg_createsubscriber
Message-ID: CA+TgmobLaJTxCHgdh04rfsUMEhP_ceDbiF0M=gtw5jG4q_zPbg@mail.gmail.com
In reply to: Re: Assertion failure with summarize_wal enabled during pg_createsubscriber (Michael Paquier <michael@paquier.xyz>)
Responses: Re: Assertion failure with summarize_wal enabled during pg_createsubscriber
List: pgsql-hackers

On Mon, Jul 1, 2024 at 2:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> Nope. So, Open Item, here we go.

Some initial investigation:

In this test case, pg_createsubscriber, during the "starting the subscriber"
section of its work, starts up the database in the "sub" directory as
a standby. It enters standby mode, begins redo, and is then promoted,
selecting timeline 2. The WAL summarizer is supposed to end
summarization at the point at which timeline 1 ended and then resume
summarizing from the beginning of timeline 2. But instead, it fails an
assertion:

                    Assert(switchpoint >= state->EndRecPtr);

This assertion is trying to verify that, when a new timeline is
spawned, we don't read past the switchpoint on the original timeline.
Here, we have apparently done that. In one test, I got switchpoint =
0/51000510, state->EndRecPtr = 0/51000600. According to pg_waldump, on
timeline 1, we have this record at that LSN:

rmgr: Heap        len (rec/tot):     54/    54, tx:    2313637, lsn: 0/51000510, prev 0/510004D0, desc: DELETE xmax: 2313637, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/6104 blk 0

And on timeline 2, we have this at that LSN:

rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/51000510, prev 0/510004D0, desc: CHECKPOINT_SHUTDOWN redo 0/51000510; tli 2; prev tli 1; fpw true; xid 0:2313638; oid 24576; multi 1; offset 0; oldest xid 730 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
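
To make the failed comparison concrete, here is a minimal Python sketch that
checks the two values from my test by hand. parse_lsn is a hypothetical helper
written for this illustration, not a PostgreSQL function; it only assumes the
usual hi/lo hexadecimal notation for LSNs.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (hex) into a single 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Values observed in the failing test run.
switchpoint = parse_lsn("0/51000510")
end_rec_ptr = parse_lsn("0/51000600")

# Mirrors the condition in Assert(switchpoint >= state->EndRecPtr).
assertion_holds = switchpoint >= end_rec_ptr

# How far past the switchpoint the reader had already gone, in bytes.
overshoot = end_rec_ptr - switchpoint

print(assertion_holds, overshoot)
```

In other words, the reader on timeline 1 had already consumed 0xF0 bytes of
WAL beyond the point where timeline 2 branched off.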

It appears that pg_createsubscriber writes recovery settings that include:

recovery_target_timeline = 'latest'
recovery_target_inclusive = true
recovery_target_lsn = '%X/%X'

...where %X/%X represents a valid LSN.
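
For readers unfamiliar with the %X/%X notation, the sketch below shows how a
64-bit LSN maps onto it: the high 32 bits, a slash, then the low 32 bits, both
in uppercase hex. format_lsn and recovery_settings are hypothetical names used
only for this illustration, and the LSN passed in is made up; the actual
target LSN from the test run is not shown above.

```python
def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN as 'hi/lo', each 32-bit half in uppercase hex."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def recovery_settings(target_lsn: int) -> str:
    """Build the three settings quoted above for a given target LSN."""
    return "\n".join([
        "recovery_target_timeline = 'latest'",
        "recovery_target_inclusive = true",
        "recovery_target_lsn = '%s'" % format_lsn(target_lsn),
    ])


# Illustrative value only, not taken from the failing test.
print(recovery_settings(0x0000000100000028))
```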

I think the problem here is that the WAL summarizer believes that when
a new timeline appears, it should pick up from where the old timeline
ended. And here, that doesn't happen: the new timeline branches off
before the end of the old timeline, because of the recovery target.
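
One way to picture the mismatch is as a lookup against the timeline history.
The following is a toy model in Python, not PostgreSQL's actual code: the
History type and timeline_of are assumptions made for illustration, with each
timeline owning the LSN range from its begin (inclusive) to its end
(exclusive).

```python
from typing import List, Optional, Tuple

# Each entry is (timeline_id, begin_lsn, end_lsn); end_lsn None means "still open".
History = List[Tuple[int, int, Optional[int]]]


def timeline_of(lsn: int, history: History) -> int:
    """Return the timeline that owns the given LSN according to the history."""
    for tli, begin, end in history:
        if begin <= lsn and (end is None or lsn < end):
            return tli
    raise ValueError("LSN not covered by history")


SWITCHPOINT = 0x51000510  # 0/51000510, where timeline 2 branched off

history: History = [
    (1, 0, SWITCHPOINT),     # timeline 1 is only valid up to the switchpoint
    (2, SWITCHPOINT, None),  # timeline 2 owns everything from there on
]

print(timeline_of(0x510004D0, history))  # the "prev" record
print(timeline_of(0x51000510, history))  # the switchpoint itself
print(timeline_of(0x51000600, history))  # where the reader had got to
```

Under this model, the DELETE record that timeline 1's WAL contains at
0/51000510 is not part of the history that leads to timeline 2, even though it
is physically present in the timeline 1 WAL. A reader that followed timeline 1
up to 0/51000600 has therefore consumed WAL that the new timeline never
inherited.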

I'm not yet sure what should be done about this. The obvious answer is
"remove the assertion," and maybe that is all we need to do. However,
I'm not quite sure what the actual behavior will be if we just do
that, so I think more investigation is needed. I'll keep looking at
this, although given the US holiday I may not have results until next
week.

--
Robert Haas
EDB: http://www.enterprisedb.com


