Re: BUG #15346: Replica fails to start after the crash

Поиск
Список
Период
Сортировка
От Alexander Kukushkin
Тема Re: BUG #15346: Replica fails to start after the crash
Дата
Msg-id CAFh8B=mYGP1J8ueHKU40Dr4bNBUPJj1rTbZ6Wpg4gR30uCRunA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #15346: Replica fails to start after the crash  (Alexander Kukushkin <cyberdemn@gmail.com>)
Ответы Re: BUG #15346: Replica fails to start after the crash  (Michael Paquier <michael@paquier.xyz>)
Re: BUG #15346: Replica fails to start after the crash  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-bugs
Hi,


btree_xlog_delete_get_latestRemovedXid is callled only when the
postgres is already running InHotStandby
And what this function is actually doing, it reads index page from
disk and iterates over IndexTuple records to get corresponding
HeapPages.
The problem is that xlrec->nitems = 182 in my case, but on the index
page there are only 56 tuples and the rest of the page is filled with
zeros:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2  0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3  0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010
"WAL contains references to invalid pages") at elog.c:1378
#4  0x000055937fbd4e8f in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5  0x000055937fbd58e7 in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
#7  0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at
nbtxlog.c:688
#8  0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9  0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2,
argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16,
argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582              * Loop through the deleted index items to obtain the
TransactionId from
583              * the heap items they point to.
584              */
585             unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587             for (i = 0; i < xlrec->nitems; i++)
588             {
589                     /*
590                      * Identify the index tuple about to be deleted
591                      */
(gdb) l
592                     iitemid = PageGetItemId(ipage, unused[i]);
593                     itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595                     /*
596                      * Locate the heap page that the index tuple points at
597                      */
598                     hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
600                     if (!BufferIsValid(hbuffer))
601                     {
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Therefore iitemid points to the beginning of the index page, what
doesn't make sense.

Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!

(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum =
3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special =
8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp =
0x7f06ed9e9598}

Index page LSN: AB3/56BF3B68
Current LSN: AB3/50323E78

Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.


Regards,
--
Alexander Kukushkin


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #15352: postgresql FDW error "ERROR: ORDER BY position 0 is notin select list"