Обсуждение: Race conditions in logical decoding
A stress test [1] for the REPACK patch [1] revealed data corruption. Eventually I found out that the problem is in postgres core. In particular, it can happen that a COMMIT record is decoded, but before the commit could be recorded in CLOG, a snapshot that takes the commit into account is created and even used. Visibility checks then work incorrectly until the CLOG gets updated. In logical replication, the consequences are not only wrong data on the subscriber, but also corrutped table on publisher - this is due to incorrectly set commit hint bits. Attached is a spec file that demonstrates the issue. I did not add it to Makefile because I don't expect the current version to be merged (see the commit message for details. I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), but it made at least one regression test (subscription/t/010_truncate.pl) stuck - probably a deadlock. I can spend more time on it, but maybe someone can come up with a good idea sooner than me. [1] https://www.postgresql.org/message-id/CADzfLwU78as45To9a%3D-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng%40mail.gmail.com [2] https://commitfest.postgresql.org/patch/5117/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
Вложения
Antonin Houska <ah@cybertec.at> wrote: > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > but it made at least one regression test (subscription/t/010_truncate.pl) > stuck - probably a deadlock. I can spend more time on it, but maybe someone > can come up with a good idea sooner than me. Attached here is what I consider a possible fix - simply wait for the CLOG update before building a new snapshot. Unfortunately I have no idea right now how to test it using the isolation tester. With the fix, the additional waiting makes the current test block. (And if a step is added that unblock the session, it will not reliably catch failure to wait.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
Вложения
Hi, On 2026-01-20 09:30:33 +0100, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > > but it made at least one regression test (subscription/t/010_truncate.pl) > > stuck - probably a deadlock. I can spend more time on it, but maybe someone > > can come up with a good idea sooner than me. > > Attached here is what I consider a possible fix - simply wait for the CLOG > update before building a new snapshot. I don't think that's enough - during non-timetravel visibility semantics, you can only look at the clog if the transaction isn't marked as in-progress in the procarray. ISTM that we need to do that here too? Greetings, Andres
Hello, Andres.
On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray. ISTM that we need to do that here too?
Do you mean replace
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray. ISTM that we need to do that here too?
Do you mean replace
> if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
to
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
?
If so, yes, it feels correct to me.
Mikhail.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote: > Hello, Andres. > > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > > I don't think that's enough - during non-timetravel visibility semantics, you > > can only look at the clog if the transaction isn't marked as in-progress in > > the procarray. ISTM that we need to do that here too? > > Do you mean replace > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > to > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) This way the synchronous replication gets stuck, as it did when I tried to use XactLockTableWait(): subscriber cannot confirm replication of certain LSN because publisher is not able to even finalize the commit (due to the waiting for the subscriber's confirmation), and therefore publisher it's not able to decode the data and send it to the subscriber. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2026-Jan-22, Antonin Houska wrote: > Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote: > > > Hello, Andres. > > > > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > > > I don't think that's enough - during non-timetravel visibility semantics, you > > > can only look at the clog if the transaction isn't marked as in-progress in > > > the procarray. ISTM that we need to do that here too? > > > > Do you mean replace > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > > to > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) > > This way the synchronous replication gets stuck, as it did when I tried to use > XactLockTableWait(): subscriber cannot confirm replication of certain LSN > because publisher is not able to even finalize the commit (due to the waiting > for the subscriber's confirmation), and therefore publisher it's not able to > decode the data and send it to the subscriber. The layering here is wild, but if I understand it correctly, these XIDs are all added to an array by SnapBuildAddCommittedTxn(), which in turn is only called by SnapBuildCommitTxn(), which is only called by DecodeCommit(), which is only called by xact_decode() when it sees a XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL. Crucially, RecordTransactionCommit() writes the WAL first, then marks everything as committed in CLOG, and finally does the waiting for the synchronous replica to ACK the commit if necessary. However, the transaction is only removed from procarray after RecordTransactionCommit has returned. This means that DecodeCommit() could add a transaction to the SnapBuilder (that needs to be waited for) while that transaction is still shown as running in ProcArray. This sounds problematic in itself, so I'm wondering whether we should do anything (namely, wait) on DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it up by waiting after the fact. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2026-Jan-22, Álvaro Herrera wrote: > On 2026-Jan-22, Antonin Houska wrote: > > > Do you mean replace > > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > > > to > > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) > > > > This way the synchronous replication gets stuck, as it did when I tried to use > > XactLockTableWait(): subscriber cannot confirm replication of certain LSN > > because publisher is not able to even finalize the commit (due to the waiting > > for the subscriber's confirmation), and therefore publisher it's not able to > > decode the data and send it to the subscriber. BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause a deadlock in the same way, is that they are using essentially the same mechanism. The former uses the Lock object on the transaction, which is released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in CommitTransaction) after RecordTransactionCommit() has returned -- that is, after the wait on a synchronous replica has happened. XactLockTableWait does an _additional_ test for TransactionIdIsInProgress, but that should be pretty much innocuous at that point. One thing that I just realized I don't know, is what exactly are the two pieces that are deadlocking. I mean, one is this side that's decoding commit. But how/why is that other side, the one trying to mark the transaction as committed, waiting on the commit decoding? Maybe there's something that we need to do to _that_ side to prevent the blockage, so that we can use TransactionIdIsInProgress() here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2026-Jan-20, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > > but it made at least one regression test (subscription/t/010_truncate.pl) > > stuck - probably a deadlock. I can spend more time on it, but maybe someone > > can come up with a good idea sooner than me. > > Attached here is what I consider a possible fix - simply wait for the CLOG > update before building a new snapshot. > > Unfortunately I have no idea right now how to test it using the isolation > tester. With the fix, the additional waiting makes the current test > block. (And if a step is added that unblock the session, it will not reliably > catch failure to wait.) > > -- > Antonin Houska > Web: https://www.cybertec-postgresql.com > > @@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder) > snapshot->xmin = builder->xmin; > snapshot->xmax = builder->xmax; > > + /* > + * Although it's very unlikely, it's possible that a commit WAL record was > + * decoded but CLOG is not aware of the commit yet. Should the CLOG update > + * be delayed even more, visibility checks that use this snapshot could > + * work incorrectly. Therefore we check the CLOG status here. > + */ > + while (true) > + { > + bool found = false; > + > + for (int i = 0; i < builder->committed.xcnt; i++) > + { > + /* > + * XXX Is it worth remembering the XIDs that appear to be > + * committed per CLOG and skipping them in the next iteration of > + * the outer loop? Not sure it's worth the effort - a single > + * iteration is enough in most cases. > + */ > + if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > + { > + found = true; > + > + /* > + * Wait a bit before going to the next iteration of the outer > + * loop. The race conditions we address here is pretty rare, > + * so we shouldn't need to wait too long. > + */ > + (void) WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + 10L, > + WAIT_EVENT_SNAPBUILD_CLOG); > + ResetLatch(MyLatch); > + > + break; > + } > + } > + > + if (!found) > + break; > + } I think this algorithm is strange -- if you do have to wait more than once for one transaction, it would lead to doing the TransactionIdDidCommit again times for _all_ transactions by starting the inner loop from scratch, which sounds really wasteful. Why not nest the for() loops the other way around? Something like this perhaps, for (int i = 0; i < builder->committed.xcnt; i++) { for (;;) { if (TransactionIdDidCommit(builder->committed.xip[i])) break; else { (void) WaitLatch(MyLatch, WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH, 10L, WAIT_EVENT_SNAPBUILD_CLOG); ResetLatch(MyLatch); } CHECK_FOR_INTERRUPTS(); } } This way you wait repeatedly for one transaction until it is marked committed; and once it does, you don't test it again. I also wondered if it would make sense to get rid of the memcpy, given that we're doing so much work per xid anyway it won't make any visible difference (I believe), and do the copy per XID there, like if (TransactionIdDidCommit(builder->committed.xip[i])) { snapshot->xip[i] = builder->committed.xip[i]; break; } else ... -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Álvaro Herrera <alvherre@kurilemu.de> wrote:
> I think this algorithm is strange -- if you do have to wait more than
> once for one transaction, it would lead to doing the
> TransactionIdDidCommit again times for _all_ transactions by starting
> the inner loop from scratch, which sounds really wasteful. Why not nest
> the for() loops the other way around?
I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.
> Something like this perhaps,
>
> for (int i = 0; i < builder->committed.xcnt; i++)
> {
> for (;;)
> {
> if (TransactionIdDidCommit(builder->committed.xip[i]))
> break;
> else
> {
> (void) WaitLatch(MyLatch,
> WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
> 10L,
> WAIT_EVENT_SNAPBUILD_CLOG);
> ResetLatch(MyLatch);
> }
> CHECK_FOR_INTERRUPTS();
> }
> }
>
> This way you wait repeatedly for one transaction until it is marked
> committed; and once it does, you don't test it again.
Sure, that's much beter. Thanks.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Andres Freund <andres@anarazel.de> wrote: > > Attached here is what I consider a possible fix - simply wait for the CLOG > > update before building a new snapshot. > > I don't think that's enough - during non-timetravel visibility semantics, you > can only look at the clog if the transaction isn't marked as in-progress in > the procarray. ISTM that we need to do that here too? I understand that CLOG must be up-to-date by the time the snapshot is used for visibility checks, but I think that - from the snapshot user POV - what matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG". For procarray-based snapshots, this consistency is ensured by 1) not removing the XID from procarray until the status is set in CLOG and 2) getting the list of running transactions from procarray. Thus if an MVCC snapshot does not have particular XID in its "xip" array, it implies that it's no longer in procarray and therefore it's been marked in CLOG. As for logical decoding based snapshots (whether HISTORIC_MVCC or those converted eventually to regular MVCC), we currently do not check if CLOG is consistent with the transaction list in snapshot->xip. What I proposed is that we enforce this consistency by checking CLOG (and possibly waiting) before we finalize the snapshot. Thus the snapshot user can safely assume that the snapshot->xip array is consistent with CLOG, as if the snapshot was based on procarray. Or is there another issue with the CLOG itself? I thought about wraparound (i.e. getting the XID status from a CLOG slot which is still being used by old transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place during XID freezing). Concurrent access to the slot should neither be a problem since only a single byte (which is atomic) needs to be fetched during the XID status check. Another hypothetical problem that occurs to me is memory access ordering, i.e. one backend creates and exports the snapshot and another one imports it before it can see the CLOG update. It's hard to imagine though. Or are there other concerns? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2026-Jan-23, Antonin Houska wrote:
> > This way you wait repeatedly for one transaction until it is marked
> > committed; and once it does, you don't test it again.
>
> Sure, that's much beter. Thanks.
Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration. I think you would
end up doing less tests overall. Something like this
/*
* Although it's very unlikely, it's possible that a commit WAL record was
* decoded but CLOG is not aware of the commit yet. Should the CLOG update
* be delayed even more, visibility checks that use this snapshot could
* work incorrectly. Therefore we check the CLOG status here.
*/
{
TransactionId *stillrunning;
int nstillrunning = builder->committed.xcnt;
stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
nstillrunning = builder->committed.xcnt;
memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);
for (;;)
{
int next = 0;
if (nstillrunning == 0)
break;
for (int i = 0; i < nstillrunning; i++)
{
if (!TransactionIdDidCommit(stillrunning[i]))
stillrunning[next++] = stillrunning[i];
}
if (next == 0)
break;
nstillrunning = next;
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
pfree(stillrunning);
}
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
Álvaro Herrera <alvherre@kurilemu.de> wrote: > On 2026-Jan-23, Antonin Houska wrote: > > > > This way you wait repeatedly for one transaction until it is marked > > > committed; and once it does, you don't test it again. > > > > Sure, that's much beter. Thanks. > > Actually, I wonder if it would make sense to sleep just once after > testing all the transactions for whether they are marked committed (not > once per transaction); and after sleeping, we only test again those that > were not marked committed in the previous iteration. I think you would > end up doing less tests overall. Something like this I suppose that TransactionIdDidCommit() returns false pretty rarely, so one iteration is sufficient in almost all cases. Besides that, I preferred simpler code because it's easier to test (It's not trivial to have debugger reach the conditions.) However it's just my preference - no real objections to your approach. -- Antonin Houska Web: https://www.cybertec-postgresql.com