Обсуждение: Race conditions in logical decoding

Поиск
Список
Период
Сортировка

Race conditions in logical decoding

От
Antonin Houska
Дата:
A stress test [1] for the REPACK patch [1] revealed data
corruption. Eventually I found out that the problem is in postgres core. In
particular, it can happen that a COMMIT record is decoded, but before the
commit could be recorded in CLOG, a snapshot that takes the commit into
account is created and even used. Visibility checks then work incorrectly
until the CLOG gets updated.

In logical replication, the consequences are not only wrong data on the
subscriber, but also corrutped table on publisher - this is due to incorrectly
set commit hint bits.

Attached is a spec file that demonstrates the issue. I did not add it to
Makefile because I don't expect the current version to be merged (see the
commit message for details.

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

[1] https://www.postgresql.org/message-id/CADzfLwU78as45To9a%3D-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng%40mail.gmail.com
[2] https://commitfest.postgresql.org/patch/5117/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Вложения

Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Antonin Houska <ah@cybertec.at> wrote:

> I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> but it made at least one regression test (subscription/t/010_truncate.pl)
> stuck - probably a deadlock. I can spend more time on it, but maybe someone
> can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Вложения

Re: Race conditions in logical decoding

От
Andres Freund
Дата:
Hi,

On 2026-01-20 09:30:33 +0100, Antonin Houska wrote:
> Antonin Houska <ah@cybertec.at> wrote:
> 
> > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> > but it made at least one regression test (subscription/t/010_truncate.pl)
> > stuck - probably a deadlock. I can spend more time on it, but maybe someone
> > can come up with a good idea sooner than me.
> 
> Attached here is what I consider a possible fix - simply wait for the CLOG
> update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray.  ISTM that we need to do that here too?

Greetings,

Andres



Re: Race conditions in logical decoding

От
Mihail Nikalayeu
Дата:
Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray.  ISTM that we need to do that here too?

Do you mean replace
> if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
to
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
?

If so, yes, it feels correct to me.

Mikhail.

Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> Hello, Andres.
>
> On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think that's enough - during non-timetravel visibility semantics, you
> > can only look at the clog if the transaction isn't marked as in-progress in
> > the procarray.  ISTM that we need to do that here too?
>
> Do you mean replace
> > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> to
> > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-22, Antonin Houska wrote:

> Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> 
> > Hello, Andres.
> > 
> > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> > > I don't think that's enough - during non-timetravel visibility semantics, you
> > > can only look at the clog if the transaction isn't marked as in-progress in
> > > the procarray.  ISTM that we need to do that here too?
> > 
> > Do you mean replace
> > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> > to
> > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))
> 
> This way the synchronous replication gets stuck, as it did when I tried to use
> XactLockTableWait(): subscriber cannot confirm replication of certain LSN
> because publisher is not able to even finalize the commit (due to the waiting
> for the subscriber's confirmation), and therefore publisher it's not able to
> decode the data and send it to the subscriber.

The layering here is wild, but if I understand it correctly, these XIDs
are all added to an array by SnapBuildAddCommittedTxn(), which in turn
is only called by SnapBuildCommitTxn(), which is only called by
DecodeCommit(), which is only called by xact_decode() when it sees a
XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL.

Crucially, RecordTransactionCommit() writes the WAL first, then marks
everything as committed in CLOG, and finally does the waiting for
the synchronous replica to ACK the commit if necessary.  However, the
transaction is only removed from procarray after RecordTransactionCommit
has returned.

This means that DecodeCommit() could add a transaction to the
SnapBuilder (that needs to be waited for) while that transaction is
still shown as running in ProcArray.  This sounds problematic in itself,
so I'm wondering whether we should do anything (namely, wait) on
DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it
up by waiting after the fact.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-22, Álvaro Herrera wrote:

> On 2026-Jan-22, Antonin Houska wrote:

> > > Do you mean replace
> > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> > > to
> > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))
> > 
> > This way the synchronous replication gets stuck, as it did when I tried to use
> > XactLockTableWait(): subscriber cannot confirm replication of certain LSN
> > because publisher is not able to even finalize the commit (due to the waiting
> > for the subscriber's confirmation), and therefore publisher it's not able to
> > decode the data and send it to the subscriber.

BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause
a deadlock in the same way, is that they are using essentially the same
mechanism.  The former uses the Lock object on the transaction, which is
released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in
CommitTransaction) after RecordTransactionCommit() has returned -- that
is, after the wait on a synchronous replica has happened.

XactLockTableWait does an _additional_ test for
TransactionIdIsInProgress, but that should be pretty much innocuous at
that point.


One thing that I just realized I don't know, is what exactly are the two
pieces that are deadlocking.  I mean, one is this side that's decoding
commit.  But how/why is that other side, the one trying to mark the
transaction as committed, waiting on the commit decoding?  Maybe there's
something that we need to do to _that_ side to prevent the blockage, so
that we can use TransactionIdIsInProgress() here.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-20, Antonin Houska wrote:

> Antonin Houska <ah@cybertec.at> wrote:
> 
> > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> > but it made at least one regression test (subscription/t/010_truncate.pl)
> > stuck - probably a deadlock. I can spend more time on it, but maybe someone
> > can come up with a good idea sooner than me.
> 
> Attached here is what I consider a possible fix - simply wait for the CLOG
> update before building a new snapshot.
> 
> Unfortunately I have no idea right now how to test it using the isolation
> tester. With the fix, the additional waiting makes the current test
> block. (And if a step is added that unblock the session, it will not reliably
> catch failure to wait.)
> 
> -- 
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
> 

> @@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
>      snapshot->xmin = builder->xmin;
>      snapshot->xmax = builder->xmax;
>  
> +    /*
> +     * Although it's very unlikely, it's possible that a commit WAL record was
> +     * decoded but CLOG is not aware of the commit yet. Should the CLOG update
> +     * be delayed even more, visibility checks that use this snapshot could
> +     * work incorrectly. Therefore we check the CLOG status here.
> +     */
> +    while (true)
> +    {
> +        bool    found = false;
> +
> +        for (int i = 0; i < builder->committed.xcnt; i++)
> +        {
> +            /*
> +             * XXX Is it worth remembering the XIDs that appear to be
> +             * committed per CLOG and skipping them in the next iteration of
> +             * the outer loop? Not sure it's worth the effort - a single
> +             * iteration is enough in most cases.
> +             */
> +            if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> +            {
> +                found = true;
> +
> +                /*
> +                 * Wait a bit before going to the next iteration of the outer
> +                 * loop. The race conditions we address here is pretty rare,
> +                 * so we shouldn't need to wait too long.
> +                 */
> +                (void) WaitLatch(MyLatch,
> +                                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +                                 10L,
> +                                 WAIT_EVENT_SNAPBUILD_CLOG);
> +                ResetLatch(MyLatch);
> +
> +                break;
> +            }
> +        }
> +
> +        if (!found)
> +            break;
> +    }

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful.  Why not nest
the for() loops the other way around?  Something like this perhaps,

    for (int i = 0; i < builder->committed.xcnt; i++)
    {
        for (;;)
        {
            if (TransactionIdDidCommit(builder->committed.xip[i]))
                break;
            else
            {
                (void) WaitLatch(MyLatch,
                                 WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
                                 10L,
                                 WAIT_EVENT_SNAPBUILD_CLOG);
                ResetLatch(MyLatch);
            }
            CHECK_FOR_INTERRUPTS();
        }
    }

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

I also wondered if it would make sense to get rid of the memcpy, given
that we're doing so much work per xid anyway it won't make any visible
difference (I believe), and do the copy per XID there, like

            if (TransactionIdDidCommit(builder->committed.xip[i]))
        {
            snapshot->xip[i] = builder->committed.xip[i];
                break;
        }
            else
            ...


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> I think this algorithm is strange -- if you do have to wait more than
> once for one transaction, it would lead to doing the
> TransactionIdDidCommit again times for _all_ transactions by starting
> the inner loop from scratch, which sounds really wasteful.  Why not nest
> the for() loops the other way around?

I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.

> Something like this perhaps,
>
>     for (int i = 0; i < builder->committed.xcnt; i++)
>     {
>         for (;;)
>         {
>             if (TransactionIdDidCommit(builder->committed.xip[i]))
>                 break;
>             else
>             {
>                 (void) WaitLatch(MyLatch,
>                                  WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
>                                  10L,
>                                  WAIT_EVENT_SNAPBUILD_CLOG);
>                 ResetLatch(MyLatch);
>             }
>             CHECK_FOR_INTERRUPTS();
>         }
>     }
>
> This way you wait repeatedly for one transaction until it is marked
> committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Andres Freund <andres@anarazel.de> wrote:

> > Attached here is what I consider a possible fix - simply wait for the CLOG
> > update before building a new snapshot.
>
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray.  ISTM that we need to do that here too?

I understand that CLOG must be up-to-date by the time the snapshot is used for
visibility checks, but I think that - from the snapshot user POV - what
matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG".

For procarray-based snapshots, this consistency is ensured by 1) not removing
the XID from procarray until the status is set in CLOG and 2) getting the list
of running transactions from procarray. Thus if an MVCC snapshot does not have
particular XID in its "xip" array, it implies that it's no longer in procarray
and therefore it's been marked in CLOG.

As for logical decoding based snapshots (whether HISTORIC_MVCC or those
converted eventually to regular MVCC), we currently do not check if CLOG is
consistent with the transaction list in snapshot->xip. What I proposed is that
we enforce this consistency by checking CLOG (and possibly waiting) before we
finalize the snapshot. Thus the snapshot user can safely assume that the
snapshot->xip array is consistent with CLOG, as if the snapshot was based on
procarray.

Or is there another issue with the CLOG itself? I thought about wraparound
(i.e. getting the XID status from a CLOG slot which is still being used by old
transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place
during XID freezing). Concurrent access to the slot should neither be a
problem since only a single byte (which is atomic) needs to be fetched during
the XID status check.

Another hypothetical problem that occurs to me is memory access ordering,
i.e. one backend creates and exports the snapshot and another one imports it
before it can see the CLOG update. It's hard to imagine though.

Or are there other concerns?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-23, Antonin Houska wrote:

> > This way you wait repeatedly for one transaction until it is marked
> > committed; and once it does, you don't test it again.
> 
> Sure, that's much beter. Thanks.

Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration.  I think you would
end up doing less tests overall.  Something like this

    /*
     * Although it's very unlikely, it's possible that a commit WAL record was
     * decoded but CLOG is not aware of the commit yet. Should the CLOG update
     * be delayed even more, visibility checks that use this snapshot could
     * work incorrectly. Therefore we check the CLOG status here.
     */
    {
        TransactionId        *stillrunning;
        int        nstillrunning = builder->committed.xcnt;

        stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
        nstillrunning = builder->committed.xcnt;
        memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);

        for (;;)
        {
            int        next = 0;

            if (nstillrunning == 0)
                break;
            for (int i = 0; i < nstillrunning; i++)
            {
                if (!TransactionIdDidCommit(stillrunning[i]))
                    stillrunning[next++] = stillrunning[i];
            }
            if (next == 0)
                break;
            nstillrunning = next;
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             10L,
                             WAIT_EVENT_SNAPBUILD_CLOG);
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
        pfree(stillrunning);
    }

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> On 2026-Jan-23, Antonin Houska wrote:
>
> > > This way you wait repeatedly for one transaction until it is marked
> > > committed; and once it does, you don't test it again.
> >
> > Sure, that's much beter. Thanks.
>
> Actually, I wonder if it would make sense to sleep just once after
> testing all the transactions for whether they are marked committed (not
> once per transaction); and after sleeping, we only test again those that
> were not marked committed in the previous iteration.  I think you would
> end up doing less tests overall.  Something like this

I suppose that TransactionIdDidCommit() returns false pretty rarely, so one
iteration is sufficient in almost all cases. Besides that, I preferred simpler
code because it's easier to test (It's not trivial to have debugger reach the
conditions.) However it's just my preference - no real objections to your
approach.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
Hi,

I realized that I hadn't posted the patch version I described somewhere
downthread.  Here it is, as 0001.

While thinking about it for posting just now, I wondered if it would
work to consider that any transaction whose commit record has been
decoded but which nevertheless gets a true return from
TransactionIdIsInProgress(), just is not committed yet and so should be
omitted from the xip array of the snapshot.  That is, if it's
still-running, then we just don't copy it into the output snapshot.
That's implemented as 0002 here.  This seems somehow less controversial,
as we don't have to test TransactionIdDidCommit() for a transaction that
we haven't seen as not-running per PGPROC; and it should also be faster,
because we don't have to wait for anybody to commit.  However it gives
me pause that perhaps the snapshot would not be fully correct.  (Indeed
there are a few failing tests in the subscription suite).

Failing other ideas, I think we should just go with 0001.  We'd need more
commentary on why is TransactionIdDidCommit() OK, when we haven't
scanned PGPROC for that xid, though.

Thanks,

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")

Вложения

Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> I realized that I hadn't posted the patch version I described somewhere
> downthread.  Here it is, as 0001.
>
> While thinking about it for posting just now, I wondered if it would
> work to consider that any transaction whose commit record has been
> decoded but which nevertheless gets a true return from
> TransactionIdIsInProgress(), just is not committed yet and so should be
> omitted from the xip array of the snapshot.  That is, if it's
> still-running, then we just don't copy it into the output snapshot.
> That's implemented as 0002 here.  This seems somehow less controversial,
> as we don't have to test TransactionIdDidCommit() for a transaction that
> we haven't seen as not-running per PGPROC; and it should also be faster,
> because we don't have to wait for anybody to commit.However it gives
> me pause that perhaps the snapshot would not be fully correct.  (Indeed
> there are a few failing tests in the subscription suite).

I recall that in some of the patches for REPACK enhancements (snapshot
switching, MVCC-safety, etc.) for v20 I had a problem with
TransactionIdIsInProgress(). In particular, I tried to add a flag like
PROC_IN_VACUUM, to limit the REPACK's impact on VACUUM xmin horizon.

The problem was that TransactionIdIsInProgress() compares the xid to
RecentXmin before accessing CLOG. IIRC VACUUM's RecentXmin skipped the xid of
REPACK and considered its xid not in progress anymore, but since the
transaction wasn't committed yet per CLOG, VACCUM incorrectly considered it
aborted (per HeapTupleSatisfiesVacuumHorizon).

Thus I imagine that with 0002, transaction having PROC_IN_SAFE_IC set might be
added to the snapshot's list of committed transactions although its still in
progress.

> Failing other ideas, I think we should just go with 0001.  We'd need more
> commentary on why is TransactionIdDidCommit() OK, when we haven't
> scanned PGPROC for that xid, though.

Mihail already told me that I should consider adding this patch to the CF. I
said that I'm aware of its importance (because REPACK probably exposes the bug
more than logical replication does) and that I'll remind you if you happen to
forget about it. However I thought that fixes of existing bugs are not subject
to feature freeze, so I did not bring it up yet.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com