Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Поиск
Список
Период
Сортировка
От ocean_li_996
Тема Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
Дата
Msg-id 5f2487c9.13435.18e2e656942.Coremail.ocean_li_996@163.com
обсуждение исходный текст
Ответ на RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Ответы RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Список pgsql-bugs
Dear Hayato:

At 2024-03-08 15:35:16, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
>Dear Haiyang, Alexander,
>
>Below part describes the second failure. The issue would be occurred when:
>
>1) Logical decoding starts from the middle of a sub-transaction, and
>2) NEW_CID record is initially decoded in the sub-transaction, and
>3) An arbitrary changes are decoded in the sub-transaction.
>...

## Issue 1
Thanks for your reproducer. I have investigated this issue. The scenario that caused the issue
is indeed as you described above. I had not realized before that a serialized snapshot file from
one replication slot could be utilized by another replication slot to achieve a consistent state.
This issue would not arise with only one replication slot, as SnapBuildXactNeedsSkip ensures that.

Using the spec test case you provided (without any fix patch), I got another Stuck trace on REL_14_STABLE :
```
#0 0x00007f93604e0277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55 #1 0x00007f93604e1968 in __GI_abort () at abort.c:90 #2 0x0000000000b084da in ExceptionalCondition (conditionName=0xcc94c8 "prev_first_lsn < cur_txn->first_lsn", errorType=0xcc92c4 "FailedAssertion", fileName=0xcc9310 "reorderbuffer.c", lineNumber=916) at assert.c:69 #3 0x00000000008df064 in AssertTXNLsnOrder (rb=0x2a48590) at reorderbuffer.c:916 #4 0x00000000008de9a1 in ReorderBufferTXNByXid (rb=0x2a48590, xid=756, create=true, is_new=0x0, lsn=24749416, create_as_top=true) at reorderbuffer.c:669 #5 0x00000000008e2d00 in ReorderBufferAddNewTupleCids (rb=0x2a48590, xid=756, lsn=24749416, node=..., tid=..., cmin=1, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3198 #6 0x00000000008e7ed7 in SnapBuildProcessNewCid (builder=0x2a44570, xid=757, lsn=24749416, xlrec=0x2a16d38) at snapbuild.c:823 #7 0x00000000008d14b3 in DecodeHeap2Op (ctx=0x2a324f0, buf=0x7ffd886a4bc0) at decode.c:471 #8 0x00000000008d0bfa in LogicalDecodingProcessRecord (ctx=0x2a324f0, record=0x2a328f0) at decode.c:151 #9 0x00000000008d8688 in pg_logical_slot_get_changes_guts (fcinfo=0x2a206b0, confirm=true, binary=false) at logicalfuncs.c:296 #10 0x00000000008d87bd in pg_logical_slot_get_changes (fcinfo=0x2a206b0) at logicalfuncs.c:365
```

Then, I think the simplest process would be:
```
ReorderBufferProcessXid(xid = 757, lsn) -> A ReorderBufferTXN for subtransaciton (757) was generated. The first_lsn was the head of NEW_CID record. DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID) SnapBuildProcessNewCid() ReorderBufferAddNewTupleCids(..., xlrec->top_xid = 756,...) ReorderBufferTXNByXid(xid = 756, lsn) -> A ReorderBufferTXN for top-transaciton (756) was generated. The first_lsn was the head of NEW_CID record. It caused an Assertion failure.
```
> I think the straightforward fix is to associate them to top-sub relationship,
> and attached patch did it.
LGFM, I think it is suitable assign subtransaction after calling ReorderBufferAddNewTupleCids().

## Issue 2
Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

The approach in [3] is also LGFM.

[1] https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0@postgresql.org
[2] https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com 
[3] https://www.postgresql.org/message-id/CAA4eK1KpW5pHMwMp9hfXYvOeEU5Rcbhoc7FxtBOGPgKeyYLDmA@mail.gmail.com 

Best Regards,
Haiyang Li
Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #18386: Slow connection access after 'vacuum full pg_attribute'
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #18387: Erroneous permission checks and/or misleading error messages with refresh materialized view