Discussion: Assertion failure with summarize_wal enabled during pg_createsubscriber
Hi,

In HEAD, I encountered the following assertion failure when I enabled summarize_wal and ran pg_createsubscriber.

2024-07-01 14:42:15.697 JST [19195] LOG: database system is ready to accept connections
TRAP: failed Assert("switchpoint >= state->EndRecPtr"), File: "walsummarizer.c", Line: 1382, PID: 19200
0 postgres 0x0000000105c46c5d ExceptionalCondition + 189
1 postgres 0x000000010590b1e4 summarizer_read_local_xlog_page + 340
2 postgres 0x00000001054e401e ReadPageInternal + 542
3 postgres 0x00000001054e24c0 XLogDecodeNextRecord + 464
4 postgres 0x00000001054e2283 XLogReadAhead + 67
5 postgres 0x00000001054e2185 XLogReadRecord + 53
6 postgres 0x000000010590a3ab SummarizeWAL + 1115
7 postgres 0x000000010590963a WalSummarizerMain + 1242
8 postgres 0x00000001058fd10a postmaster_child_launch + 234
9 postgres 0x000000010590133d StartChildProcess + 29
10 postgres 0x0000000105904582 MaybeStartWalSummarizer + 82
11 postgres 0x0000000105901af1 ServerLoop + 1153
12 postgres 0x00000001059007ca PostmasterMain + 6554
13 postgres 0x00000001057a3782 main + 818
14 dyld 0x00007ff80e5e2366 start + 1942
2024-07-01 14:42:15.912 JST [19195] LOG: WAL summarizer process (PID 19200) was terminated by signal 6: Abort trap: 6
2024-07-01 14:42:15.913 JST [19195] LOG: terminating any other active server processes

Here are the steps to reproduce this issue.

--------------------------------
initdb -D pub
cat <<EOF >> pub/postgresql.conf
wal_level = 'logical'
summarize_wal = on
EOF
pg_ctl -D pub start
pgbench -i
pgbench -T 600 &
pg_basebackup -D sub -c fast -R
pg_createsubscriber -d postgres -D sub -p 5433 -P "port=5432"
--------------------------------

Is this the known issue?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Assertion failure with summarize_wal enabled during pg_createsubscriber
From
Michael Paquier
Date:
On Mon, Jul 01, 2024 at 02:54:56PM +0900, Fujii Masao wrote:
> In HEAD, I encountered the following assertion failure when I enabled summarize_wal
> and ran pg_createsubscriber.
>
> Is this the known issue?

Nope. So, Open Item, here we go.
--
Michael
On Mon, Jul 1, 2024 at 2:08 AM Michael Paquier <michael@paquier.xyz> wrote:
> Nope. So, Open Item, here we go.

Some initial investigation:

In this test case, pg_createsubscriber, during the "starting the subscriber" section of its work, starts up the database in the "sub" directory as a standby. It enters standby mode, begins redo, and is then promoted, selecting timeline 2. The WAL summarizer is supposed to end summarization at the point at which timeline 1 ended and then resume summarizing from the beginning of timeline 2. But instead, it fails an assertion:

Assert(switchpoint >= state->EndRecPtr);

This assertion is trying to verify that, when a new timeline is spawned, we don't read past the switchpoint on the original timeline. Here, we have apparently done that. In one test, I got switchpoint = 0/51000510, state->EndRecPtr = 0/51000600. According to pg_waldump, on timeline 1, we have this record at that LSN:

rmgr: Heap len (rec/tot): 54/ 54, tx: 2313637, lsn: 0/51000510, prev 0/510004D0, desc: DELETE xmax: 2313637, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/6104 blk 0

And on timeline 2, we have this at that LSN:

rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/51000510, prev 0/510004D0, desc: CHECKPOINT_SHUTDOWN redo 0/51000510; tli 2; prev tli 1; fpw true; xid 0:2313638; oid 24576; multi 1; offset 0; oldest xid 730 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

It appears that pg_createsubscriber creates a recovery.conf that includes:

recovery_target_timeline = 'latest'
recovery_target_inclusive = true
recovery_target_lsn = '%X/%X'

...where %X/%X represents a valid LSN.

I think the problem here is that the WAL summarizer believes that when a new timeline appears, it should pick up from where the old timeline ended. And here, that doesn't happen: the new timeline branches off before the end of the old timeline, because of the recovery target.

I'm not yet sure what should be done about this. The obvious answer is "remove the assertion," and maybe that is all we need to do. However, I'm not quite sure what the actual behavior will be if we just do that, so I think more investigation is needed. I'll keep looking at this, although given the US holiday I may not have results until next week.

--
Robert Haas
EDB: http://www.enterprisedb.com
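
[Editorial illustration] To make the failing comparison concrete, here is a minimal C sketch that parses the two LSNs quoted in the analysis and evaluates the same condition as the assertion. It is not PostgreSQL code: the `Lsn` typedef and the helpers `parse_lsn` and `reader_within_switchpoint` are hypothetical names introduced only for this example, and the sketch assumes that an LSN printed as `hi/lo` is the 64-bit value formed from its two 32-bit hexadecimal halves.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 64-bit WAL position; not PostgreSQL's own type. */
typedef uint64_t Lsn;

/*
 * Parse the textual "hi/lo" form (two hexadecimal halves) into a 64-bit value.
 * Returns false if the string does not have exactly that shape.
 */
bool
parse_lsn(const char *text, Lsn *out)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *out = ((Lsn) hi << 32) | (Lsn) lo;
    return true;
}

/*
 * Same condition as Assert(switchpoint >= state->EndRecPtr): true when the
 * reader has not consumed WAL beyond the point where the old timeline was
 * left for the new one.
 */
bool
reader_within_switchpoint(Lsn switchpoint, Lsn end_rec_ptr)
{
    return switchpoint >= end_rec_ptr;
}
```

With switchpoint = 0/51000510 and EndRecPtr = 0/51000600, `reader_within_switchpoint` returns false: the reader is 0xF0 bytes past the switchpoint on timeline 1, which matches the reported assertion failure.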
On Wed, Jul 3, 2024 at 1:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the problem here is that the WAL summarizer believes that when
> a new timeline appears, it should pick up from where the old timeline
> ended. And here, that doesn't happen: the new timeline branches off
> before the end of the old timeline, because of the recovery target.
>
> I'm not yet sure what should be done about this. The obvious answer is
> "remove the assertion," and maybe that is all we need to do. However,
> I'm not quite sure what the actual behavior will be if we just do
> that, so I think more investigation is needed. I'll keep looking at
> this, although given the US holiday I may not have results until next
> week.

Here is a draft patch that attempts to fix this problem. I'm not certain that it's completely correct, but it does seem to fix the reported issue.

--
Robert Haas
EDB: http://www.enterprisedb.com