Thread: Assertion failure with summarize_wal enabled during pg_createsubscriber

Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Fujii Masao

Hi,

In HEAD, I encountered the following assertion failure when I enabled summarize_wal
and ran pg_createsubscriber.

2024-07-01 14:42:15.697 JST [19195] LOG:  database system is ready to accept connections
TRAP: failed Assert("switchpoint >= state->EndRecPtr"), File: "walsummarizer.c", Line: 1382, PID: 19200
0   postgres                            0x0000000105c46c5d ExceptionalCondition + 189
1   postgres                            0x000000010590b1e4 summarizer_read_local_xlog_page + 340
2   postgres                            0x00000001054e401e ReadPageInternal + 542
3   postgres                            0x00000001054e24c0 XLogDecodeNextRecord + 464
4   postgres                            0x00000001054e2283 XLogReadAhead + 67
5   postgres                            0x00000001054e2185 XLogReadRecord + 53
6   postgres                            0x000000010590a3ab SummarizeWAL + 1115
7   postgres                            0x000000010590963a WalSummarizerMain + 1242
8   postgres                            0x00000001058fd10a postmaster_child_launch + 234
9   postgres                            0x000000010590133d StartChildProcess + 29
10  postgres                            0x0000000105904582 MaybeStartWalSummarizer + 82
11  postgres                            0x0000000105901af1 ServerLoop + 1153
12  postgres                            0x00000001059007ca PostmasterMain + 6554
13  postgres                            0x00000001057a3782 main + 818
14  dyld                                0x00007ff80e5e2366 start + 1942
2024-07-01 14:42:15.912 JST [19195] LOG:  WAL summarizer process (PID 19200) was terminated by signal 6: Abort trap: 6
2024-07-01 14:42:15.913 JST [19195] LOG:  terminating any other active server processes

Here are the steps to reproduce this issue.

--------------------------------
initdb -D pub
cat <<EOF >> pub/postgresql.conf
wal_level = 'logical'
summarize_wal = on
EOF
pg_ctl -D pub start
pgbench -i
pgbench -T 600 &
pg_basebackup -D sub -c fast -R
pg_createsubscriber -d postgres -D sub -p 5433 -P "port=5432"
--------------------------------

Is this a known issue?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Michael Paquier

On Mon, Jul 01, 2024 at 02:54:56PM +0900, Fujii Masao wrote:
> In HEAD, I encountered the following assertion failure when I enabled summarize_wal
> and ran pg_createsubscriber.
>
> Is this a known issue?

Nope. So, Open Item, here we go.
--
Michael

Attachments

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Robert Haas

On Mon, Jul 1, 2024 at 2:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> Nope. So, Open Item, here we go.

Some initial investigation:

In this test case, pg_createsubscriber, during the "starting the subscriber"
section of its work, starts up the database in the "sub" directory as
a standby. It enters standby mode, begins redo, and is then promoted,
selecting timeline 2. The WAL summarizer is supposed to end
summarization at the point at which timeline 1 ended and then resume
summarizing from the beginning of timeline 2. But instead, it fails an
assertion:

                    Assert(switchpoint >= state->EndRecPtr);

This assertion is trying to verify that, when a new timeline is
spawned, we don't read past the switchpoint on the original timeline.
Here, we have apparently done that. In one test, I got switchpoint =
0/51000510, state->EndRecPtr = 0/51000600. According to pg_waldump, on
timeline 1, we have this record at that LSN:

rmgr: Heap        len (rec/tot):     54/    54, tx:    2313637, lsn: 0/51000510, prev 0/510004D0, desc: DELETE xmax: 2313637, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/6104 blk 0

And on timeline 2, we have this at that LSN:

rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/51000510, prev 0/510004D0, desc: CHECKPOINT_SHUTDOWN redo 0/51000510; tli 2; prev tli 1; fpw true; xid 0:2313638; oid 24576; multi 1; offset 0; oldest xid 730 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
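
The failing comparison can be checked by hand. The following is only a sketch in Python, not server code: it parses the two LSNs reported above into 64-bit integers (high 32 bits, slash, low 32 bits, both in hex) and evaluates the same condition as the assertion. The helper name parse_lsn is made up for illustration and is not a PostgreSQL function.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'XXXXXXXX/XXXXXXXX' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values observed in the test run described above.
switchpoint = parse_lsn("0/51000510")
end_rec_ptr = parse_lsn("0/51000600")

# Same condition as Assert(switchpoint >= state->EndRecPtr).
assertion_holds = switchpoint >= end_rec_ptr

# How far EndRecPtr lies beyond the switch point, in bytes.
overshoot = end_rec_ptr - switchpoint

print(assertion_holds)  # False
print(overshoot)        # 240 (0xF0)
```

So in that run EndRecPtr was 240 bytes past the switch point, which is exactly the situation the assertion rules out.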

It appears that pg_createsubscriber writes a recovery configuration that includes:

recovery_target_timeline = 'latest'
recovery_target_inclusive = true
recovery_target_lsn = '%X/%X'

...where %X/%X represents a valid LSN.
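
For readers unfamiliar with the notation, %X/%X is the conventional textual form of an LSN: the upper 32 bits and the lower 32 bits, each printed in hexadecimal. The sketch below, in Python, shows how such a value would be rendered into the three settings listed above. The function names are hypothetical, and the LSN passed in is simply one that appears earlier in this thread, not necessarily the target the tool actually used.

```python
def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the conventional %X/%X form."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def render_recovery_settings(target_lsn: int) -> str:
    """Build the three recovery settings quoted above for a given target LSN."""
    lines = [
        "recovery_target_timeline = 'latest'",
        "recovery_target_inclusive = true",
        "recovery_target_lsn = '%s'" % format_lsn(target_lsn),
    ]
    return "\n".join(lines)


# Illustrative LSN only (assumption): reused from the pg_waldump output above.
print(render_recovery_settings(0x51000510))
```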

I think the problem here is that the WAL summarizer believes that when
a new timeline appears, it should pick up from where the old timeline
ended. And here, that doesn't happen: the new timeline branches off
before the end of the old timeline, because of the recovery target.
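
To make the branching concrete, here is a toy illustration in Python. It does not mirror anything in walsummarizer.c, and the function name is hypothetical. It takes the two record start positions visible in the pg_waldump output above (0/510004D0 and 0/51000510) and splits them around the switch point: records that start before it are common to both timelines, while a record that starts at or after it exists only on the old timeline.

```python
from typing import List, Tuple


def split_at_switchpoint(
    record_starts: List[int], switchpoint: int
) -> Tuple[List[int], List[int]]:
    """Split old-timeline record start LSNs into (shared, old_only).

    A record starting before the switch point is part of the history of
    both timelines; one starting at or after it belongs to the old
    timeline alone.
    """
    shared = [lsn for lsn in record_starts if lsn < switchpoint]
    old_only = [lsn for lsn in record_starts if lsn >= switchpoint]
    return shared, old_only


# Record start LSNs on timeline 1, taken from the pg_waldump output above.
timeline1_records = [0x510004D0, 0x51000510]
switchpoint = 0x51000510

shared, old_only = split_at_switchpoint(timeline1_records, switchpoint)
print([hex(lsn) for lsn in shared])    # ['0x510004d0']
print([hex(lsn) for lsn in old_only])  # ['0x51000510']
```

In this model the DELETE record at 0/51000510 is old-only: timeline 2 has a CHECKPOINT_SHUTDOWN record at the same LSN. A reader that has already consumed that DELETE on timeline 1 is therefore past the point where timeline 2 begins.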

I'm not yet sure what should be done about this. The obvious answer is
"remove the assertion," and maybe that is all we need to do. However,
I'm not quite sure what the actual behavior will be if we just do
that, so I think more investigation is needed. I'll keep looking at
this, although given the US holiday I may not have results until next
week.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Robert Haas

On Wed, Jul 3, 2024 at 1:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the problem here is that the WAL summarizer believes that when
> a new timeline appears, it should pick up from where the old timeline
> ended. And here, that doesn't happen: the new timeline branches off
> before the end of the old timeline, because of the recovery target.
>
> I'm not yet sure what should be done about this. The obvious answer is
> "remove the assertion," and maybe that is all we need to do. However,
> I'm not quite sure what the actual behavior will be if we just do
> that, so I think more investigation is needed. I'll keep looking at
> this, although given the US holiday I may not have results until next
> week.

Here is a draft patch that attempts to fix this problem. I'm not
certain that it's completely correct, but it does seem to fix the
reported issue.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments