Обсуждение: test_decoding assertion failure for the loss of top-sub transaction relationship
test_decoding assertion failure for the loss of top-sub transaction relationship
От
"osumi.takamichi@fujitsu.com"
Дата:
Hi, hackers I've met an assertion failure of logical decoding with below scenario on HEAD. --- <preparation> create table tab1 (val integer); select 'init' from pg_create_logical_replication_slot('regression_slot', 'test_decoding'); <session1> begin; savepoint sp1; insert into tab1 values (1); <session2> checkpoint; -- for RUNNING_XACT select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); <session1> truncate tab1; -- for NEW_CID commit; begin; insert into tab1 values (3); <session2> checkpoint; select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); <session1> commit; <session2> select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); --- Here, it's not a must but is advisable to make LOG_SNAPSHOT_INTERVAL_MS bigger so that we can issue RUNNING_XACT according to our checkpoint commands explicitly. In the above scenario, the first checkpoint generates RUNNING_XACT after the wal record (for ReorderBufferAssignChild) that associates sub transaction with its top transaction. This means that once we restart from RUNNING_XACT, we lose the association between top transaction and sub transaction and then we can't mark the top transaction as catalog modifying transaction by decoding NEW_CID (written after RUNNING_XACT), if the sub transaction changes the catalog. Therefore, this leads to the failure for the assert that can check the consistency that when one sub transaction modifies the catalog, its top transaction should be marked so as well. I feel we need to remember the relationship between top transaction and sub transaction in the serialized snapshot even before changing catalog at decoding RUNNING_XACT, so that we can keep track of the association after the restart. What do you think ? The stack call of this failure and related information is below. (gdb) bt #0 0x00007f2632588387 in raise () from /lib64/libc.so.6 #1 0x00007f2632589a78 in abort () from /lib64/libc.so.6 #2 0x0000000000b3eba1 in ExceptionalCondition (conditionName=0xd137e0 "!needs_snapshot || needs_timetravel", errorType=0xd130c5 "FailedAssertion", fileName=0xd130b9 "snapbuild.c", lineNumber=1116) at assert.c:69 #3 0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1, subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116 #4 0x00000000008fa420 in DecodeCommit (ctx=0x23e0108, buf=0x7fff4a1f9220, parsed=0x7fff4a1f9020, xid=728, two_phase=false) at decode.c:630 #5 0x00000000008f9953 in xact_decode (ctx=0x23e0108, buf=0x7fff4a1f9220) at decode.c:216 #6 0x00000000008f967d in LogicalDecodingProcessRecord (ctx=0x23e0108, record=0x23e04a0) at decode.c:119 #7 0x0000000000900b63 in pg_logical_slot_get_changes_guts (fcinfo=0x23d80a8, confirm=true, binary=false) at logicalfuncs.c:271 #8 0x0000000000900ca0 in pg_logical_slot_get_changes (fcinfo=0x23d80a8) at logicalfuncs.c:338 ... (gdb) frame 3 #3 0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1, subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116 1116 Assert(!needs_snapshot || needs_timetravel); (gdb) list 1111 { 1112 /* record that we cannot export a general snapshot anymore */ 1113 builder->committed.includes_all_transactions = false; 1114 } 1115 1116 Assert(!needs_snapshot || needs_timetravel); 1117 1118 /* 1119 * Adjust xmax of the snapshot builder, we only do that for committed, 1120 * catalog modifying, transactions, everything else isn't interesting for Best Regards, Takamichi Osumi
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Hi Hackers, > Therefore, this leads to the failure for the assert that can check > the consistency that when one sub transaction modifies the catalog, > its top transaction should be marked so as well. > > I feel we need to remember the relationship between top transaction and sub > transaction > in the serialized snapshot even before changing catalog at decoding > RUNNING_XACT, > so that we can keep track of the association after the restart. What do you think ? PSA patch that fixes the failure. This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored. The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES. Thanks to off-list discussion with Osumi-san. Best Regards, Hayato Kuroda FUJITSU LIMITED
Вложения
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Kyotaro Horiguchi
Дата:
Good catch, and thanks for the patch! At Fri, 2 Sep 2022 01:08:04 +0000, "kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote in > PSA patch that fixes the failure. > This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored. > The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES. A commit record has all subtransaction ids and SnapBuildCommitTxn() already checks if every one has catalog changes before checking the top transaction's catalog changes. So, no need to record top-sub transaction relationship to serialized snapshots. If any of the subtransactions has catalog changes, the commit contains catalog changes. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Dilip Kumar
Дата:
On Fri, Sep 2, 2022 at 6:38 AM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Hi Hackers, > > > Therefore, this leads to the failure for the assert that can check > > the consistency that when one sub transaction modifies the catalog, > > its top transaction should be marked so as well. > > > > I feel we need to remember the relationship between top transaction and sub > > transaction > > in the serialized snapshot even before changing catalog at decoding > > RUNNING_XACT, > > so that we can keep track of the association after the restart. What do you think ? > > PSA patch that fixes the failure. > This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored. > The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES. It seems that SnapBuildCommitTxn() is already taking care of adding the top transaction to the committed transaction if any subtransaction has the catalog changes, it has just missed setting the flag so I think just setting the flag like this should be sufficient no? diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 1ff2c12..ee3f695 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -1086,6 +1086,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, else if (sub_needs_timetravel) { /* track toplevel txn as well, subxact alone isn't meaningful */ + needs_timetravel = true; SnapBuildAddCommittedTxn(builder, xid); } else if (needs_timetravel) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Kyotaro Horiguchi
Дата:
At Fri, 2 Sep 2022 10:59:56 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Fri, Sep 2, 2022 at 6:38 AM kuroda.hayato@fujitsu.com > <kuroda.hayato@fujitsu.com> wrote: > > > > Hi Hackers, > > > > > Therefore, this leads to the failure for the assert that can check > > > the consistency that when one sub transaction modifies the catalog, > > > its top transaction should be marked so as well. > > > > > > I feel we need to remember the relationship between top transaction and sub > > > transaction > > > in the serialized snapshot even before changing catalog at decoding > > > RUNNING_XACT, > > > so that we can keep track of the association after the restart. What do you think ? > > > > PSA patch that fixes the failure. > > This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored. > > The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES. > > It seems that SnapBuildCommitTxn() is already taking care of adding > the top transaction to the committed transaction if any subtransaction > has the catalog changes, it has just missed setting the flag so I > think just setting the flag like this should be sufficient no? Oops! That's right. > diff --git a/src/backend/replication/logical/snapbuild.c > b/src/backend/replication/logical/snapbuild.c > index 1ff2c12..ee3f695 100644 > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -1086,6 +1086,7 @@ SnapBuildCommitTxn(SnapBuild *builder, > XLogRecPtr lsn, TransactionId xid, > else if (sub_needs_timetravel) > { > /* track toplevel txn as well, subxact alone isn't meaningful */ > + needs_timetravel = true; > SnapBuildAddCommittedTxn(builder, xid); > } > else if (needs_timetravel) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Horiguchi-san, Dilip, Thank you for replying! > > It seems that SnapBuildCommitTxn() is already taking care of adding > > the top transaction to the committed transaction if any subtransaction > > has the catalog changes, it has just missed setting the flag so I > > think just setting the flag like this should be sufficient no? > > Oops! That's right. Basically I agreed, but I was not sure the message "found top level transaction..." should be output or not. It may be useful even if one of sub transactions contains the change. How about following? diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index bf72ad45ec..a630522907 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, } } - /* if top-level modified catalog, it'll need a snapshot */ - if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)) + /* + * if top-level or one of sub modified catalog, it'll need a snapshot. + * + * Normally the second check is not needed because the relation between + * top-sub transactions is tracked on the ReorderBuffer layer, and the top + * transaction is marked as containing catalog changes if its children are. + * But in some cases the relation may be missed, in which case only the sub + * transaction may be marked as containing catalog changes. + */ + if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo) + || sub_needs_timetravel) { elog(DEBUG2, "found top level transaction %u, with catalog changes", xid); @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, needs_timetravel = true; SnapBuildAddCommittedTxn(builder, xid); } - else if (sub_needs_timetravel) - { - /* track toplevel txn as well, subxact alone isn't meaningful */ - SnapBuildAddCommittedTxn(builder, xid); - } else if (needs_timetravel) { elog(DEBUG2, "forced transaction %u to do timetravel", xid); Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Dilip Kumar
Дата:
On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Dear Horiguchi-san, Dilip, > > Thank you for replying! > > > > It seems that SnapBuildCommitTxn() is already taking care of adding > > > the top transaction to the committed transaction if any subtransaction > > > has the catalog changes, it has just missed setting the flag so I > > > think just setting the flag like this should be sufficient no? > > > > Oops! That's right. > > Basically I agreed, but I was not sure the message "found top level transaction..." > should be output or not. It may be useful even if one of sub transactions contains the change. > > How about following? > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > index bf72ad45ec..a630522907 100644 > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > } > } > > - /* if top-level modified catalog, it'll need a snapshot */ > - if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)) > + /* > + * if top-level or one of sub modified catalog, it'll need a snapshot. > + * > + * Normally the second check is not needed because the relation between > + * top-sub transactions is tracked on the ReorderBuffer layer, and the top > + * transaction is marked as containing catalog changes if its children are. > + * But in some cases the relation may be missed, in which case only the sub > + * transaction may be marked as containing catalog changes. > + */ > + if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo) > + || sub_needs_timetravel) > { > elog(DEBUG2, "found top level transaction %u, with catalog changes", > xid); > @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > needs_timetravel = true; > SnapBuildAddCommittedTxn(builder, xid); > } > - else if (sub_needs_timetravel) > - { > - /* track toplevel txn as well, subxact alone isn't meaningful */ > - SnapBuildAddCommittedTxn(builder, xid); > - } > else if (needs_timetravel) > { > elog(DEBUG2, "forced transaction %u to do timetravel", xid); Yeah, I am fine with this as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Kyotaro Horiguchi
Дата:
At Fri, 2 Sep 2022 11:27:23 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com > <kuroda.hayato@fujitsu.com> wrote: > > How about following? > > > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > > index bf72ad45ec..a630522907 100644 > > --- a/src/backend/replication/logical/snapbuild.c > > +++ b/src/backend/replication/logical/snapbuild.c > > @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > > } > > } > > > > - /* if top-level modified catalog, it'll need a snapshot */ > > - if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)) > > + /* > > + * if top-level or one of sub modified catalog, it'll need a snapshot. > > + * > > + * Normally the second check is not needed because the relation between > > + * top-sub transactions is tracked on the ReorderBuffer layer, and the top > > + * transaction is marked as containing catalog changes if its children are. > > + * But in some cases the relation may be missed, in which case only the sub > > + * transaction may be marked as containing catalog changes. > > + */ > > + if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo) > > + || sub_needs_timetravel) > > { > > elog(DEBUG2, "found top level transaction %u, with catalog changes", > > xid); > > @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > > needs_timetravel = true; > > SnapBuildAddCommittedTxn(builder, xid); > > } > > - else if (sub_needs_timetravel) > > - { > > - /* track toplevel txn as well, subxact alone isn't meaningful */ > > - SnapBuildAddCommittedTxn(builder, xid); > > - } > > else if (needs_timetravel) > > { > > elog(DEBUG2, "forced transaction %u to do timetravel", xid); > > Yeah, I am fine with this as well. I'm basically fine, too. But this is a bug that needs back-patching back to 10. This change changes the condition for the DEBUG2 message. So we need to add an awkward if() condition for the DEBUG2 message. Looking that the messages have different debug-level, I doubt there have been a chance they are useful. If we remove the two DEBUGx messages, I'm fine with the change. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Amit Kapila
Дата:
On Fri, Sep 2, 2022 at 12:25 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 2 Sep 2022 11:27:23 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in > > On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com > > <kuroda.hayato@fujitsu.com> wrote: > > > How about following? > > > > > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > > > index bf72ad45ec..a630522907 100644 > > > --- a/src/backend/replication/logical/snapbuild.c > > > +++ b/src/backend/replication/logical/snapbuild.c > > > @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > > > } > > > } > > > > > > - /* if top-level modified catalog, it'll need a snapshot */ > > > - if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)) > > > + /* > > > + * if top-level or one of sub modified catalog, it'll need a snapshot. > > > + * > > > + * Normally the second check is not needed because the relation between > > > + * top-sub transactions is tracked on the ReorderBuffer layer, and the top > > > + * transaction is marked as containing catalog changes if its children are. > > > + * But in some cases the relation may be missed, in which case only the sub > > > + * transaction may be marked as containing catalog changes. > > > + */ > > > + if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo) > > > + || sub_needs_timetravel) > > > { > > > elog(DEBUG2, "found top level transaction %u, with catalog changes", > > > xid); > > > @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > > > needs_timetravel = true; > > > SnapBuildAddCommittedTxn(builder, xid); > > > } > > > - else if (sub_needs_timetravel) > > > - { > > > - /* track toplevel txn as well, subxact alone isn't meaningful */ > > > - SnapBuildAddCommittedTxn(builder, xid); > > > - } > > > else if (needs_timetravel) > > > { > > > elog(DEBUG2, "forced transaction %u to do timetravel", xid); > > > > Yeah, I am fine with this as well. > > I'm basically fine, too. But this is a bug that needs back-patching > back to 10. > I have not verified but I think we need to backpatch this till 14 because prior to that in DecodeCommit, we use to set the top-level txn as having catalog changes based on the if there are invalidation messages in the commit record. So, in the current scenario shared by Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be marked as having catalog changes. > This change changes the condition for the DEBUG2 message. > So we need to add an awkward if() condition for the DEBUG2 message. > Looking that the messages have different debug-level, I doubt there > have been a chance they are useful. If we remove the two DEBUGx > messages, I'm fine with the change. > I think these DEBUG2 messages could be useful, so instead of removing these, I suggest we should follow Dilip's proposed fix and maybe add a new DEBUG2 message on the lines of (("forced transaction %u to do timetravel due to one of its subtransaction", xid) in the else if (sub_needs_timetravel) condition if we think that will be useful too but I am fine leaving the addition of new DEBUG2 message. -- With Regards, Amit Kapila.
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
> > I'm basically fine, too. But this is a bug that needs back-patching > > back to 10. > > > > I have not verified but I think we need to backpatch this till 14 > because prior to that in DecodeCommit, we use to set the top-level txn > as having catalog changes based on the if there are invalidation > messages in the commit record. So, in the current scenario shared by > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be > marked as having catalog changes. I and Osumi-san are now investigating that, so please wait further reports and patches. > > This change changes the condition for the DEBUG2 message. > > So we need to add an awkward if() condition for the DEBUG2 message. > > Looking that the messages have different debug-level, I doubt there > > have been a chance they are useful. If we remove the two DEBUGx > > messages, I'm fine with the change. > > > > I think these DEBUG2 messages could be useful, so instead of removing > these, I suggest we should follow Dilip's proposed fix and maybe add a > new DEBUG2 message on the lines of (("forced transaction %u to do > timetravel due to one of its subtransaction", xid) in the else if > (sub_needs_timetravel) condition if we think that will be useful too > but I am fine leaving the addition of new DEBUG2 message. I agreed both that DEBUG2 messages are still useful but we should not change the condition for output. So I prefer the idea suggested by Amit. Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Amit Kapila
Дата:
On Fri, Sep 2, 2022 at 6:26 AM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > > > I've met an assertion failure of logical decoding with below scenario on HEAD. > > --- > <preparation> > create table tab1 (val integer); > select 'init' from pg_create_logical_replication_slot('regression_slot', 'test_decoding'); > > <session1> > begin; > savepoint sp1; > insert into tab1 values (1); > > <session2> > checkpoint; -- for RUNNING_XACT > select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); > > <session1> > truncate tab1; -- for NEW_CID > commit; > begin; > insert into tab1 values (3); > BTW, if I just change the truncate statement to "Analyze tab1" in your entire test then I am getting a different assertion failure: postgres.exe!ExceptionalCondition(const char * conditionName, const char * errorType, const char * fileName, int lineNumber) Line 70 C postgres.exe!AssertTXNLsnOrder(ReorderBuffer * rb) Line 902 C postgres.exe!ReorderBufferTXNByXid(ReorderBuffer * rb, unsigned int xid, bool create, bool * is_new, unsigned __int64 lsn, bool create_as_top) Line 681 C postgres.exe!ReorderBufferAddNewTupleCids(ReorderBuffer * rb, unsigned int xid, unsigned __int64 lsn, RelFileLocator locator, ItemPointerData tid, unsigned int cmin, unsigned int cmax, unsigned int combocid) Line 3188 C postgres.exe!SnapBuildProcessNewCid(SnapBuild * builder, unsigned int xid, unsigned __int64 lsn, xl_heap_new_cid * xlrec) Line 823 C postgres.exe!heap2_decode(LogicalDecodingContext * ctx, XLogRecordBuffer * buf) Line 408 C postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext * ctx, XLogReaderState * record) Line 119 C postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData * fcinfo, bool confirm, bool binary) Line 274 C postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData * fcinfo) Line 339 C This is matching with call stack we see intermittently in the BF [1][2]. The difference with your scenario is that the Truncate statement generates an additional WAL XLOG_STANDBY_LOCK prior to XLOG_HEAP2_NEW_CID. I think we can fix this in the below ways: a. Assert(prev_first_lsn <= cur_txn->first_lsn); -- Explain in comments that it is possible when subtransaction and transaction are not previously logged as it happened in this scenario b. track txn of prev_first_lsn (say as prev_txn) and check if prev_txn's toptxn is the same as cur_txn or cur_txn's toptxn is the same as the prev_txn then perform assert mentioned in (a) else, keep the current Assert. It seems (b) will be more robust. Thoughts? Note: I added Sawada-San as sometime back we had an offlist discussion on this intermittent BF failure but we were not able to reach the exact test which can show this failure. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-08-20%2002%3A45%3A34 [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09 -- With Regards, Amit Kapila.
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Dear hackers, > I agreed both that DEBUG2 messages are still useful but we should not > change the condition for output. So I prefer the idea suggested by Amit. PSA newer patch, which contains the fix and test. > > I have not verified but I think we need to backpatch this till 14 > > because prior to that in DecodeCommit, we use to set the top-level txn > > as having catalog changes based on the if there are invalidation > > messages in the commit record. So, in the current scenario shared by > > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be > > marked as having catalog changes. > > I and Osumi-san are now investigating that, so please wait further reports and > patches. We investigated it about older versions, and in some versions *another stack-trace* has been found. About PG10-13, indeed, the failure was not occurred. In these versions transactions are regarded as that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag. This flag will be set if the transaction has invalidation messages. When sub transaction changes system catalogs and user commits, all invalidation messages allocated in sub transaction will be transferred to top transaction. Therefore both transactions will be marked as containing catalog changes. About PG14 and 15, however, another stack-trace has been found. While executing the same workload, we got followings at the same SQL statement; ``` (gdb) backtrace #0 0x00007fa78c6dc387 in raise () from /lib64/libc.so.6 #1 0x00007fa78c6dda78 in abort () from /lib64/libc.so.6 #2 0x0000000000b16680 in ExceptionalCondition (conditionName=0xcd3ab0 "txn->ninvalidations == 0", errorType=0xcd3284 "FailedAssertion", fileName=0xcd32d0 "reorderbuffer.c", lineNumber=2936) at assert.c:69 #3 0x00000000008e9e70 in ReorderBufferForget (rb=0x12b5b10, xid=735, lsn=24125384) at reorderbuffer.c:2936 #4 0x00000000008d9493 in DecodeCommit (ctx=0x12a2d20, buf=0x7ffe08b236b0, parsed=0x7ffe08b23510, xid=734, two_phase=false)at decode.c:733 #5 0x00000000008d8962 in DecodeXactOp (ctx=0x12a2d20, buf=0x7ffe08b236b0) at decode.c:279 #6 0x00000000008d85e2 in LogicalDecodingProcessRecord (ctx=0x12a2d20, record=0x12a30e0) at decode.c:142 #7 0x00000000008dfef2 in pg_logical_slot_get_changes_guts (fcinfo=0x129acb0, confirm=true, binary=false) at logicalfuncs.c:296 #8 0x00000000008e002f in pg_logical_slot_get_changes (fcinfo=0x129acb0) at logicalfuncs.c:365 ... (gdb) frame 4 #4 0x00000000008d9493 in DecodeCommit (ctx=0x14cfd20, buf=0x7ffc638b0ca0, parsed=0x7ffc638b0b00, xid=734, two_phase=false)at decode.c:733 733 ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr); (gdb) list 728 */ 729 if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id)) 730 { 731 for (i = 0; i < parsed->nsubxacts; i++) 732 { 733 ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr); 734 } 735 ReorderBufferForget(ctx->reorder, xid, buf->origptr); 736 737 return; (gdb) frame 3 #3 0x00000000008e9e70 in ReorderBufferForget (rb=0x14e2b10, xid=735, lsn=24125152) at reorderbuffer.c:2936 2936 Assert(txn->ninvalidations == 0); (gdb) list 2931 */ 2932 if (txn->base_snapshot != NULL && txn->ninvalidations > 0) 2933 ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, 2934 txn->invalidations); 2935 else 2936 Assert(txn->ninvalidations == 0); 2937 2938 /* remove potential on-disk data, and deallocate */ 2939 ReorderBufferCleanupTXN(rb, txn); 2940 } (gdb) print *txn $1 = {txn_flags = 3, xid = 735, toplevel_xid = 734, gid = 0x0, first_lsn = 24113488, final_lsn = 24125152, end_lsn = 0, toptxn= 0x14ecb98, restart_decoding_lsn = 24113304, origin_id = 0, origin_lsn = 0, commit_time = 0, base_snapshot = 0x0, base_snapshot_lsn= 0, base_snapshot_node = {prev = 0x14ecc00, next = 0x14e2b28}, snapshot_now = 0x0, command_id = 4294967295, nentries = 5, nentries_mem= 5, changes = {head = {prev = 0x14eecf8, next = 0x14eeb18}}, tuplecids = {head = {prev = 0x14ecb10, next = 0x14ecb10}}, ntuplecids= 0, tuplecid_hash = 0x0, toast_hash = 0x0, subtxns = {head = {prev = 0x14ecb38, next = 0x14ecb38}}, nsubtxns = 0, ninvalidations= 3, invalidations = 0x14e2d28, node = {prev = 0x14ecc68, next = 0x14ecc68}, size = 452, total_size = 452, concurrent_abort= false, output_plugin_private = 0x0} ``` In these versions DecodeCommit() said OK. However, we have met another failure because the ReorderBufferTXN of the sub transaction had invalidation messages but it did not have base_snapshot. I thought that this failure was occurred the only the base_snapshot of the sub transaction is transferred via ReorderBufferTransferSnapToParent() when a transaction is assigned as child, but its invalidation messages are not. I was not sure what's the proper way to fix it. The solution I've thought at first was transporting all invalidations from sub to top like ReorderBufferTransferSnapToParent(), but I do not know its side effect. Moreover, how do we deal with ReorderBufferChange? Should we transfer them too? If so, how about the ordering of changes? Alternative solustion was just remove the assertion, but was it OK? How do you think? I want to hear your comments and suggestions... Best Regards, Hayato Kuroda FUJITSU LIMITED
Вложения
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
> I was not sure what's the proper way to fix it. > The solution I've thought at first was transporting all invalidations from sub to top > like ReorderBufferTransferSnapToParent(), > but I do not know its side effect. Moreover, how do we deal with > ReorderBufferChange? > Should we transfer them too? If so, how about the ordering of changes? > Alternative solustion was just remove the assertion, but was it OK? PSA the PoC patch for discussion. In this patch only invalidation messages are transported, changes hold by subtxn are ignored. This can be passed the reported workload. Best Regards, Hayato Kuroda FUJITSU LIMITED
Вложения
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Amit Kapila
Дата:
On Tue, Sep 6, 2022 at 1:17 PM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Dear hackers, > > > I agreed both that DEBUG2 messages are still useful but we should not > > change the condition for output. So I prefer the idea suggested by Amit. > > PSA newer patch, which contains the fix and test. > > > > I have not verified but I think we need to backpatch this till 14 > > > because prior to that in DecodeCommit, we use to set the top-level txn > > > as having catalog changes based on the if there are invalidation > > > messages in the commit record. So, in the current scenario shared by > > > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be > > > marked as having catalog changes. > > > > I and Osumi-san are now investigating that, so please wait further reports and > > patches. > > We investigated it about older versions, and in some versions *another stack-trace* has been found. > > > About PG10-13, indeed, the failure was not occurred. > In these versions transactions are regarded as > that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag. > This flag will be set if the transaction has invalidation messages. > > When sub transaction changes system catalogs and user commits, > all invalidation messages allocated in sub transaction will be transferred to top transaction. > Therefore both transactions will be marked as containing catalog changes. > > > About PG14 and 15, however, another stack-trace has been found. > While executing the same workload, we got followings at the same SQL statement; > Did you get this new assertion failure after you applied the patch for the first failure? Because otherwise, how can you reach it with the same test case? About patch: else if (sub_needs_timetravel) { - /* track toplevel txn as well, subxact alone isn't meaningful */ + elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransaction", + xid); + needs_timetravel = true; SnapBuildAddCommittedTxn(builder, xid); Why did you remove the above comment? I think it still makes sense to retain it. -- With Regards, Amit Kapila.
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Amit, Thanks for giving comments! > Did you get this new assertion failure after you applied the patch for > the first failure? Because otherwise, how can you reach it with the > same test case? The first failure is occurred only in the HEAD, so I did not applied the first patch to REL14 and REL15. This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different from the HEAD one. In order versions SnapBuildXidSetCatalogChanges() has been added. In the function a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts, and after that the relation between sub-top transactions is assigned based on the parsed->subxact. The marking avoids the first failure, but the assignment triggers new failure. > About patch: > else if (sub_needs_timetravel) > { > - /* track toplevel txn as well, subxact alone isn't meaningful */ > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of > its subtransaction", > + xid); > + needs_timetravel = true; > SnapBuildAddCommittedTxn(builder, xid); > > Why did you remove the above comment? I think it still makes sense to retain it. Fixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
Вложения
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Masahiko Sawada
Дата:
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > Thanks for giving comments! > > > Did you get this new assertion failure after you applied the patch for > > the first failure? Because otherwise, how can you reach it with the > > same test case? > > The first failure is occurred only in the HEAD, so I did not applied the first patch > to REL14 and REL15. > This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different > from the HEAD one. > In order versions SnapBuildXidSetCatalogChanges() has been added. In the function > a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts, > and after that the relation between sub-top transactions is assigned based on the parsed->subxact. > The marking avoids the first failure, but the assignment triggers new failure. > > > > About patch: > > else if (sub_needs_timetravel) > > { > > - /* track toplevel txn as well, subxact alone isn't meaningful */ > > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of > > its subtransaction", > > + xid); > > + needs_timetravel = true; > > SnapBuildAddCommittedTxn(builder, xid); > > > > Why did you remove the above comment? I think it still makes sense to retain it. > > Fixed. Here are some review comments for v2 patch: +# Test that we can force the top transaction to do timetravel when one of sub +# transactions needs that. This is necessary when we restart decoding from RUNNING_XACT +# without the wal to associate subtransaction to its top transaction. I don't think the second sentence is necessary. --- The last decoding +# starts from the first checkpoint and NEW_CID of "s0_truncate" doesn't mark the top +# transaction as catalog modifying transaction. In this scenario, the enforcement sets +# needs_timetravel to true even if the top transaction is regarded as that it does not +# have catalog changes and thus the decoding works without a contradition that one +# subtransaction needed timetravel while its top transaction didn't. I don't understand the last sentence, probably it's a long sentence. How about the following description? # Test that we can handle the case where only subtransaction is marked as containing # catalog changes. The last decoding starts from NEW_CID generated by "s0_truncate" and # marks only the subtransaction as containing catalog changes but we don't create the # association between top-level transaction and subtransaction yet. When decoding the # commit record of the top-level transaction, we must force the top-level transaction # to do timetravel since one of its subtransactions is marked as containing catalog changes. --- + elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransaction", + xid); + needs_timetravel = true; I think "one of its subtransaction" should be "one of its subtransactions". Regards, -- Masahiko Sawada
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Sawada-san, Thank you for reviewing HEAD patch! PSA v3 patch. > +# Test that we can force the top transaction to do timetravel when one of sub > +# transactions needs that. This is necessary when we restart decoding > from RUNNING_XACT > +# without the wal to associate subtransaction to its top transaction. > > I don't think the second sentence is necessary. > > --- > The last decoding > +# starts from the first checkpoint and NEW_CID of "s0_truncate" > doesn't mark the top > +# transaction as catalog modifying transaction. In this scenario, the > enforcement sets > +# needs_timetravel to true even if the top transaction is regarded as > that it does not > +# have catalog changes and thus the decoding works without a > contradition that one > +# subtransaction needed timetravel while its top transaction didn't. > > I don't understand the last sentence, probably it's a long sentence. > > How about the following description? > > # Test that we can handle the case where only subtransaction is marked > as containing > # catalog changes. The last decoding starts from NEW_CID generated by > "s0_truncate" and > # marks only the subtransaction as containing catalog changes but we > don't create the > # association between top-level transaction and subtransaction yet. > When decoding the > # commit record of the top-level transaction, we must force the > top-level transaction > # to do timetravel since one of its subtransactions is marked as > containing catalog changes. Seems good, I replaced all of comments to yours. > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of > its subtransaction", > + xid); > + needs_timetravel = true; > > I think "one of its subtransaction" should be "one of its subtransactions". Fixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
Вложения
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Masahiko Sawada
Дата:
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, > > Thanks for giving comments! > > > Did you get this new assertion failure after you applied the patch for > > the first failure? Because otherwise, how can you reach it with the > > same test case? > > The first failure is occurred only in the HEAD, so I did not applied the first patch > to REL14 and REL15. > This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different > from the HEAD one. > In order versions SnapBuildXidSetCatalogChanges() has been added. In the function > a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts, > and after that the relation between sub-top transactions is assigned based on the parsed->subxact. > The marking avoids the first failure, but the assignment triggers new failure. > FYI, as I just replied to the related thread[1], the assertion failure in REL14 and REL15 can be fixed by the patch proposed there. So I'd like to see how the discussion goes. Regardless of this proposed fix, the patch proposed by Kuroda-san is required for HEAD, REL14, and REL15, in order to fix the assertion failure in SnapBuildCommitTxn(). Regards, [1] https://www.postgresql.org/message-id/CAD21AoA1gV9pfu8hoXpTQBWH8uEMRg_F_MKM%2BU3Sr0HnyH4AUQ%40mail.gmail.com -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: test_decoding assertion failure for the loss of top-sub transaction relationship
От
"kuroda.hayato@fujitsu.com"
Дата:
Dear Sawada-san, > FYI, as I just replied to the related thread[1], the assertion failure > in REL14 and REL15 can be fixed by the patch proposed there. So I'd > like to see how the discussion goes. Regardless of this proposed fix, > the patch proposed by Kuroda-san is required for HEAD, REL14, and > REL15, in order to fix the assertion failure in SnapBuildCommitTxn(). I understood that my patches for REL14 and REL15 might be not needed. I will check the thread later. Thanks! Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
От
Masahiko Sawada
Дата:
On Wed, Oct 12, 2022 at 3:35 PM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > FYI, as I just replied to the related thread[1], the assertion failure > > in REL14 and REL15 can be fixed by the patch proposed there. So I'd > > like to see how the discussion goes. Regardless of this proposed fix, > > the patch proposed by Kuroda-san is required for HEAD, REL14, and > > REL15, in order to fix the assertion failure in SnapBuildCommitTxn(). > > I understood that my patches for REL14 and REL15 might be not needed. No, sorry for confusing you. I meant that even if we agreed with the patch I proposed there, your patch is still required to fix the issue. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com