Обсуждение: Race conditions in logical decoding
A stress test [1] for the REPACK patch [1] revealed data corruption. Eventually I found out that the problem is in postgres core. In particular, it can happen that a COMMIT record is decoded, but before the commit could be recorded in CLOG, a snapshot that takes the commit into account is created and even used. Visibility checks then work incorrectly until the CLOG gets updated. In logical replication, the consequences are not only wrong data on the subscriber, but also corrutped table on publisher - this is due to incorrectly set commit hint bits. Attached is a spec file that demonstrates the issue. I did not add it to Makefile because I don't expect the current version to be merged (see the commit message for details. I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), but it made at least one regression test (subscription/t/010_truncate.pl) stuck - probably a deadlock. I can spend more time on it, but maybe someone can come up with a good idea sooner than me. [1] https://www.postgresql.org/message-id/CADzfLwU78as45To9a%3D-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng%40mail.gmail.com [2] https://commitfest.postgresql.org/patch/5117/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
Вложения
Antonin Houska <ah@cybertec.at> wrote: > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > but it made at least one regression test (subscription/t/010_truncate.pl) > stuck - probably a deadlock. I can spend more time on it, but maybe someone > can come up with a good idea sooner than me. Attached here is what I consider a possible fix - simply wait for the CLOG update before building a new snapshot. Unfortunately I have no idea right now how to test it using the isolation tester. With the fix, the additional waiting makes the current test block. (And if a step is added that unblock the session, it will not reliably catch failure to wait.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
Вложения
Hi, On 2026-01-20 09:30:33 +0100, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > > but it made at least one regression test (subscription/t/010_truncate.pl) > > stuck - probably a deadlock. I can spend more time on it, but maybe someone > > can come up with a good idea sooner than me. > > Attached here is what I consider a possible fix - simply wait for the CLOG > update before building a new snapshot. I don't think that's enough - during non-timetravel visibility semantics, you can only look at the clog if the transaction isn't marked as in-progress in the procarray. ISTM that we need to do that here too? Greetings, Andres
Hello, Andres.
On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray. ISTM that we need to do that here too?
Do you mean replace
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray. ISTM that we need to do that here too?
Do you mean replace
> if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
to
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
?
If so, yes, it feels correct to me.
Mikhail.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote: > Hello, Andres. > > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > > I don't think that's enough - during non-timetravel visibility semantics, you > > can only look at the clog if the transaction isn't marked as in-progress in > > the procarray. ISTM that we need to do that here too? > > Do you mean replace > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > to > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) This way the synchronous replication gets stuck, as it did when I tried to use XactLockTableWait(): subscriber cannot confirm replication of certain LSN because publisher is not able to even finalize the commit (due to the waiting for the subscriber's confirmation), and therefore publisher it's not able to decode the data and send it to the subscriber. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2026-Jan-22, Antonin Houska wrote: > Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote: > > > Hello, Andres. > > > > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > > > I don't think that's enough - during non-timetravel visibility semantics, you > > > can only look at the clog if the transaction isn't marked as in-progress in > > > the procarray. ISTM that we need to do that here too? > > > > Do you mean replace > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > > to > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) > > This way the synchronous replication gets stuck, as it did when I tried to use > XactLockTableWait(): subscriber cannot confirm replication of certain LSN > because publisher is not able to even finalize the commit (due to the waiting > for the subscriber's confirmation), and therefore publisher it's not able to > decode the data and send it to the subscriber. The layering here is wild, but if I understand it correctly, these XIDs are all added to an array by SnapBuildAddCommittedTxn(), which in turn is only called by SnapBuildCommitTxn(), which is only called by DecodeCommit(), which is only called by xact_decode() when it sees a XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL. Crucially, RecordTransactionCommit() writes the WAL first, then marks everything as committed in CLOG, and finally does the waiting for the synchronous replica to ACK the commit if necessary. However, the transaction is only removed from procarray after RecordTransactionCommit has returned. This means that DecodeCommit() could add a transaction to the SnapBuilder (that needs to be waited for) while that transaction is still shown as running in ProcArray. This sounds problematic in itself, so I'm wondering whether we should do anything (namely, wait) on DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it up by waiting after the fact. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2026-Jan-22, Álvaro Herrera wrote: > On 2026-Jan-22, Antonin Houska wrote: > > > Do you mean replace > > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > > > to > > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i]))) > > > > This way the synchronous replication gets stuck, as it did when I tried to use > > XactLockTableWait(): subscriber cannot confirm replication of certain LSN > > because publisher is not able to even finalize the commit (due to the waiting > > for the subscriber's confirmation), and therefore publisher it's not able to > > decode the data and send it to the subscriber. BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause a deadlock in the same way, is that they are using essentially the same mechanism. The former uses the Lock object on the transaction, which is released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in CommitTransaction) after RecordTransactionCommit() has returned -- that is, after the wait on a synchronous replica has happened. XactLockTableWait does an _additional_ test for TransactionIdIsInProgress, but that should be pretty much innocuous at that point. One thing that I just realized I don't know, is what exactly are the two pieces that are deadlocking. I mean, one is this side that's decoding commit. But how/why is that other side, the one trying to mark the transaction as committed, waiting on the commit decoding? Maybe there's something that we need to do to _that_ side to prevent the blockage, so that we can use TransactionIdIsInProgress() here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2026-Jan-20, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait() > > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()), > > but it made at least one regression test (subscription/t/010_truncate.pl) > > stuck - probably a deadlock. I can spend more time on it, but maybe someone > > can come up with a good idea sooner than me. > > Attached here is what I consider a possible fix - simply wait for the CLOG > update before building a new snapshot. > > Unfortunately I have no idea right now how to test it using the isolation > tester. With the fix, the additional waiting makes the current test > block. (And if a step is added that unblock the session, it will not reliably > catch failure to wait.) > > -- > Antonin Houska > Web: https://www.cybertec-postgresql.com > > @@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder) > snapshot->xmin = builder->xmin; > snapshot->xmax = builder->xmax; > > + /* > + * Although it's very unlikely, it's possible that a commit WAL record was > + * decoded but CLOG is not aware of the commit yet. Should the CLOG update > + * be delayed even more, visibility checks that use this snapshot could > + * work incorrectly. Therefore we check the CLOG status here. > + */ > + while (true) > + { > + bool found = false; > + > + for (int i = 0; i < builder->committed.xcnt; i++) > + { > + /* > + * XXX Is it worth remembering the XIDs that appear to be > + * committed per CLOG and skipping them in the next iteration of > + * the outer loop? Not sure it's worth the effort - a single > + * iteration is enough in most cases. > + */ > + if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i]))) > + { > + found = true; > + > + /* > + * Wait a bit before going to the next iteration of the outer > + * loop. The race conditions we address here is pretty rare, > + * so we shouldn't need to wait too long. > + */ > + (void) WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + 10L, > + WAIT_EVENT_SNAPBUILD_CLOG); > + ResetLatch(MyLatch); > + > + break; > + } > + } > + > + if (!found) > + break; > + } I think this algorithm is strange -- if you do have to wait more than once for one transaction, it would lead to doing the TransactionIdDidCommit again times for _all_ transactions by starting the inner loop from scratch, which sounds really wasteful. Why not nest the for() loops the other way around? Something like this perhaps, for (int i = 0; i < builder->committed.xcnt; i++) { for (;;) { if (TransactionIdDidCommit(builder->committed.xip[i])) break; else { (void) WaitLatch(MyLatch, WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH, 10L, WAIT_EVENT_SNAPBUILD_CLOG); ResetLatch(MyLatch); } CHECK_FOR_INTERRUPTS(); } } This way you wait repeatedly for one transaction until it is marked committed; and once it does, you don't test it again. I also wondered if it would make sense to get rid of the memcpy, given that we're doing so much work per xid anyway it won't make any visible difference (I believe), and do the copy per XID there, like if (TransactionIdDidCommit(builder->committed.xip[i])) { snapshot->xip[i] = builder->committed.xip[i]; break; } else ... -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Álvaro Herrera <alvherre@kurilemu.de> wrote:
> I think this algorithm is strange -- if you do have to wait more than
> once for one transaction, it would lead to doing the
> TransactionIdDidCommit again times for _all_ transactions by starting
> the inner loop from scratch, which sounds really wasteful. Why not nest
> the for() loops the other way around?
I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.
> Something like this perhaps,
>
> for (int i = 0; i < builder->committed.xcnt; i++)
> {
> for (;;)
> {
> if (TransactionIdDidCommit(builder->committed.xip[i]))
> break;
> else
> {
> (void) WaitLatch(MyLatch,
> WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
> 10L,
> WAIT_EVENT_SNAPBUILD_CLOG);
> ResetLatch(MyLatch);
> }
> CHECK_FOR_INTERRUPTS();
> }
> }
>
> This way you wait repeatedly for one transaction until it is marked
> committed; and once it does, you don't test it again.
Sure, that's much beter. Thanks.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Andres Freund <andres@anarazel.de> wrote: > > Attached here is what I consider a possible fix - simply wait for the CLOG > > update before building a new snapshot. > > I don't think that's enough - during non-timetravel visibility semantics, you > can only look at the clog if the transaction isn't marked as in-progress in > the procarray. ISTM that we need to do that here too? I understand that CLOG must be up-to-date by the time the snapshot is used for visibility checks, but I think that - from the snapshot user POV - what matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG". For procarray-based snapshots, this consistency is ensured by 1) not removing the XID from procarray until the status is set in CLOG and 2) getting the list of running transactions from procarray. Thus if an MVCC snapshot does not have particular XID in its "xip" array, it implies that it's no longer in procarray and therefore it's been marked in CLOG. As for logical decoding based snapshots (whether HISTORIC_MVCC or those converted eventually to regular MVCC), we currently do not check if CLOG is consistent with the transaction list in snapshot->xip. What I proposed is that we enforce this consistency by checking CLOG (and possibly waiting) before we finalize the snapshot. Thus the snapshot user can safely assume that the snapshot->xip array is consistent with CLOG, as if the snapshot was based on procarray. Or is there another issue with the CLOG itself? I thought about wraparound (i.e. getting the XID status from a CLOG slot which is still being used by old transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place during XID freezing). Concurrent access to the slot should neither be a problem since only a single byte (which is atomic) needs to be fetched during the XID status check. Another hypothetical problem that occurs to me is memory access ordering, i.e. one backend creates and exports the snapshot and another one imports it before it can see the CLOG update. It's hard to imagine though. Or are there other concerns? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2026-Jan-23, Antonin Houska wrote:
> > This way you wait repeatedly for one transaction until it is marked
> > committed; and once it does, you don't test it again.
>
> Sure, that's much beter. Thanks.
Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration. I think you would
end up doing less tests overall. Something like this
/*
* Although it's very unlikely, it's possible that a commit WAL record was
* decoded but CLOG is not aware of the commit yet. Should the CLOG update
* be delayed even more, visibility checks that use this snapshot could
* work incorrectly. Therefore we check the CLOG status here.
*/
{
TransactionId *stillrunning;
int nstillrunning = builder->committed.xcnt;
stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
nstillrunning = builder->committed.xcnt;
memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);
for (;;)
{
int next = 0;
if (nstillrunning == 0)
break;
for (int i = 0; i < nstillrunning; i++)
{
if (!TransactionIdDidCommit(stillrunning[i]))
stillrunning[next++] = stillrunning[i];
}
if (next == 0)
break;
nstillrunning = next;
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
pfree(stillrunning);
}
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
Álvaro Herrera <alvherre@kurilemu.de> wrote: > On 2026-Jan-23, Antonin Houska wrote: > > > > This way you wait repeatedly for one transaction until it is marked > > > committed; and once it does, you don't test it again. > > > > Sure, that's much beter. Thanks. > > Actually, I wonder if it would make sense to sleep just once after > testing all the transactions for whether they are marked committed (not > once per transaction); and after sleeping, we only test again those that > were not marked committed in the previous iteration. I think you would > end up doing less tests overall. Something like this I suppose that TransactionIdDidCommit() returns false pretty rarely, so one iteration is sufficient in almost all cases. Besides that, I preferred simpler code because it's easier to test (It's not trivial to have debugger reach the conditions.) However it's just my preference - no real objections to your approach. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hi, I realized that I hadn't posted the patch version I described somewhere downthread. Here it is, as 0001. While thinking about it for posting just now, I wondered if it would work to consider that any transaction whose commit record has been decoded but which nevertheless gets a true return from TransactionIdIsInProgress(), just is not committed yet and so should be omitted from the xip array of the snapshot. That is, if it's still-running, then we just don't copy it into the output snapshot. That's implemented as 0002 here. This seems somehow less controversial, as we don't have to test TransactionIdDidCommit() for a transaction that we haven't seen as not-running per PGPROC; and it should also be faster, because we don't have to wait for anybody to commit. However it gives me pause that perhaps the snapshot would not be fully correct. (Indeed there are a few failing tests in the subscription suite). Failing other ideas, I think we should just go with 0001. We'd need more commentary on why is TransactionIdDidCommit() OK, when we haven't scanned PGPROC for that xid, though. Thanks, -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Вложения
Álvaro Herrera <alvherre@kurilemu.de> wrote: > I realized that I hadn't posted the patch version I described somewhere > downthread. Here it is, as 0001. > > While thinking about it for posting just now, I wondered if it would > work to consider that any transaction whose commit record has been > decoded but which nevertheless gets a true return from > TransactionIdIsInProgress(), just is not committed yet and so should be > omitted from the xip array of the snapshot. That is, if it's > still-running, then we just don't copy it into the output snapshot. > That's implemented as 0002 here. This seems somehow less controversial, > as we don't have to test TransactionIdDidCommit() for a transaction that > we haven't seen as not-running per PGPROC; and it should also be faster, > because we don't have to wait for anybody to commit.However it gives > me pause that perhaps the snapshot would not be fully correct. (Indeed > there are a few failing tests in the subscription suite). I recall that in some of the patches for REPACK enhancements (snapshot switching, MVCC-safety, etc.) for v20 I had a problem with TransactionIdIsInProgress(). In particular, I tried to add a flag like PROC_IN_VACUUM, to limit the REPACK's impact on VACUUM xmin horizon. The problem was that TransactionIdIsInProgress() compares the xid to RecentXmin before accessing CLOG. IIRC VACUUM's RecentXmin skipped the xid of REPACK and considered its xid not in progress anymore, but since the transaction wasn't committed yet per CLOG, VACCUM incorrectly considered it aborted (per HeapTupleSatisfiesVacuumHorizon). Thus I imagine that with 0002, transaction having PROC_IN_SAFE_IC set might be added to the snapshot's list of committed transactions although its still in progress. > Failing other ideas, I think we should just go with 0001. We'd need more > commentary on why is TransactionIdDidCommit() OK, when we haven't > scanned PGPROC for that xid, though. Mihail already told me that I should consider adding this patch to the CF. I said that I'm aware of its importance (because REPACK probably exposes the bug more than logical replication does) and that I'll remind you if you happen to forget about it. However I thought that fixes of existing bugs are not subject to feature freeze, so I did not bring it up yet. -- Antonin Houska Web: https://www.cybertec-postgresql.com