Обсуждение: Race conditions in logical decoding

Поиск
Список
Период
Сортировка

Race conditions in logical decoding

От
Antonin Houska
Дата:
A stress test [1] for the REPACK patch [1] revealed data
corruption. Eventually I found out that the problem is in postgres core. In
particular, it can happen that a COMMIT record is decoded, but before the
commit could be recorded in CLOG, a snapshot that takes the commit into
account is created and even used. Visibility checks then work incorrectly
until the CLOG gets updated.

In logical replication, the consequences are not only wrong data on the
subscriber, but also corrutped table on publisher - this is due to incorrectly
set commit hint bits.

Attached is a spec file that demonstrates the issue. I did not add it to
Makefile because I don't expect the current version to be merged (see the
commit message for details.

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

[1] https://www.postgresql.org/message-id/CADzfLwU78as45To9a%3D-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng%40mail.gmail.com
[2] https://commitfest.postgresql.org/patch/5117/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Вложения

Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Antonin Houska <ah@cybertec.at> wrote:

> I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> but it made at least one regression test (subscription/t/010_truncate.pl)
> stuck - probably a deadlock. I can spend more time on it, but maybe someone
> can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Вложения

Re: Race conditions in logical decoding

От
Andres Freund
Дата:
Hi,

On 2026-01-20 09:30:33 +0100, Antonin Houska wrote:
> Antonin Houska <ah@cybertec.at> wrote:
> 
> > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> > but it made at least one regression test (subscription/t/010_truncate.pl)
> > stuck - probably a deadlock. I can spend more time on it, but maybe someone
> > can come up with a good idea sooner than me.
> 
> Attached here is what I consider a possible fix - simply wait for the CLOG
> update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray.  ISTM that we need to do that here too?

Greetings,

Andres



Re: Race conditions in logical decoding

От
Mihail Nikalayeu
Дата:
Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray.  ISTM that we need to do that here too?

Do you mean replace
> if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
to
> if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))
?

If so, yes, it feels correct to me.

Mikhail.

Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> Hello, Andres.
>
> On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think that's enough - during non-timetravel visibility semantics, you
> > can only look at the clog if the transaction isn't marked as in-progress in
> > the procarray.  ISTM that we need to do that here too?
>
> Do you mean replace
> > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> to
> > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-22, Antonin Houska wrote:

> Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> 
> > Hello, Andres.
> > 
> > On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
> > > I don't think that's enough - during non-timetravel visibility semantics, you
> > > can only look at the clog if the transaction isn't marked as in-progress in
> > > the procarray.  ISTM that we need to do that here too?
> > 
> > Do you mean replace
> > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> > to
> > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))
> 
> This way the synchronous replication gets stuck, as it did when I tried to use
> XactLockTableWait(): subscriber cannot confirm replication of certain LSN
> because publisher is not able to even finalize the commit (due to the waiting
> for the subscriber's confirmation), and therefore publisher it's not able to
> decode the data and send it to the subscriber.

The layering here is wild, but if I understand it correctly, these XIDs
are all added to an array by SnapBuildAddCommittedTxn(), which in turn
is only called by SnapBuildCommitTxn(), which is only called by
DecodeCommit(), which is only called by xact_decode() when it sees a
XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL.

Crucially, RecordTransactionCommit() writes the WAL first, then marks
everything as committed in CLOG, and finally does the waiting for
the synchronous replica to ACK the commit if necessary.  However, the
transaction is only removed from procarray after RecordTransactionCommit
has returned.

This means that DecodeCommit() could add a transaction to the
SnapBuilder (that needs to be waited for) while that transaction is
still shown as running in ProcArray.  This sounds problematic in itself,
so I'm wondering whether we should do anything (namely, wait) on
DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it
up by waiting after the fact.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-22, Álvaro Herrera wrote:

> On 2026-Jan-22, Antonin Houska wrote:

> > > Do you mean replace
> > > > if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> > > to
> > > > if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||
!TransactionIdDidCommit(builder->committed.xip[i])))
> > 
> > This way the synchronous replication gets stuck, as it did when I tried to use
> > XactLockTableWait(): subscriber cannot confirm replication of certain LSN
> > because publisher is not able to even finalize the commit (due to the waiting
> > for the subscriber's confirmation), and therefore publisher it's not able to
> > decode the data and send it to the subscriber.

BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause
a deadlock in the same way, is that they are using essentially the same
mechanism.  The former uses the Lock object on the transaction, which is
released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in
CommitTransaction) after RecordTransactionCommit() has returned -- that
is, after the wait on a synchronous replica has happened.

XactLockTableWait does an _additional_ test for
TransactionIdIsInProgress, but that should be pretty much innocuous at
that point.


One thing that I just realized I don't know, is what exactly are the two
pieces that are deadlocking.  I mean, one is this side that's decoding
commit.  But how/why is that other side, the one trying to mark the
transaction as committed, waiting on the commit decoding?  Maybe there's
something that we need to do to _that_ side to prevent the blockage, so
that we can use TransactionIdIsInProgress() here.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-20, Antonin Houska wrote:

> Antonin Houska <ah@cybertec.at> wrote:
> 
> > I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> > from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> > but it made at least one regression test (subscription/t/010_truncate.pl)
> > stuck - probably a deadlock. I can spend more time on it, but maybe someone
> > can come up with a good idea sooner than me.
> 
> Attached here is what I consider a possible fix - simply wait for the CLOG
> update before building a new snapshot.
> 
> Unfortunately I have no idea right now how to test it using the isolation
> tester. With the fix, the additional waiting makes the current test
> block. (And if a step is added that unblock the session, it will not reliably
> catch failure to wait.)
> 
> -- 
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
> 

> @@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
>      snapshot->xmin = builder->xmin;
>      snapshot->xmax = builder->xmax;
>  
> +    /*
> +     * Although it's very unlikely, it's possible that a commit WAL record was
> +     * decoded but CLOG is not aware of the commit yet. Should the CLOG update
> +     * be delayed even more, visibility checks that use this snapshot could
> +     * work incorrectly. Therefore we check the CLOG status here.
> +     */
> +    while (true)
> +    {
> +        bool    found = false;
> +
> +        for (int i = 0; i < builder->committed.xcnt; i++)
> +        {
> +            /*
> +             * XXX Is it worth remembering the XIDs that appear to be
> +             * committed per CLOG and skipping them in the next iteration of
> +             * the outer loop? Not sure it's worth the effort - a single
> +             * iteration is enough in most cases.
> +             */
> +            if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
> +            {
> +                found = true;
> +
> +                /*
> +                 * Wait a bit before going to the next iteration of the outer
> +                 * loop. The race conditions we address here is pretty rare,
> +                 * so we shouldn't need to wait too long.
> +                 */
> +                (void) WaitLatch(MyLatch,
> +                                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +                                 10L,
> +                                 WAIT_EVENT_SNAPBUILD_CLOG);
> +                ResetLatch(MyLatch);
> +
> +                break;
> +            }
> +        }
> +
> +        if (!found)
> +            break;
> +    }

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful.  Why not nest
the for() loops the other way around?  Something like this perhaps,

    for (int i = 0; i < builder->committed.xcnt; i++)
    {
        for (;;)
        {
            if (TransactionIdDidCommit(builder->committed.xip[i]))
                break;
            else
            {
                (void) WaitLatch(MyLatch,
                                 WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
                                 10L,
                                 WAIT_EVENT_SNAPBUILD_CLOG);
                ResetLatch(MyLatch);
            }
            CHECK_FOR_INTERRUPTS();
        }
    }

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

I also wondered if it would make sense to get rid of the memcpy, given
that we're doing so much work per xid anyway it won't make any visible
difference (I believe), and do the copy per XID there, like

            if (TransactionIdDidCommit(builder->committed.xip[i]))
        {
            snapshot->xip[i] = builder->committed.xip[i];
                break;
        }
            else
            ...


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> I think this algorithm is strange -- if you do have to wait more than
> once for one transaction, it would lead to doing the
> TransactionIdDidCommit again times for _all_ transactions by starting
> the inner loop from scratch, which sounds really wasteful.  Why not nest
> the for() loops the other way around?

I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.

> Something like this perhaps,
>
>     for (int i = 0; i < builder->committed.xcnt; i++)
>     {
>         for (;;)
>         {
>             if (TransactionIdDidCommit(builder->committed.xip[i]))
>                 break;
>             else
>             {
>                 (void) WaitLatch(MyLatch,
>                                  WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
>                                  10L,
>                                  WAIT_EVENT_SNAPBUILD_CLOG);
>                 ResetLatch(MyLatch);
>             }
>             CHECK_FOR_INTERRUPTS();
>         }
>     }
>
> This way you wait repeatedly for one transaction until it is marked
> committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Andres Freund <andres@anarazel.de> wrote:

> > Attached here is what I consider a possible fix - simply wait for the CLOG
> > update before building a new snapshot.
>
> I don't think that's enough - during non-timetravel visibility semantics, you
> can only look at the clog if the transaction isn't marked as in-progress in
> the procarray.  ISTM that we need to do that here too?

I understand that CLOG must be up-to-date by the time the snapshot is used for
visibility checks, but I think that - from the snapshot user POV - what
matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG".

For procarray-based snapshots, this consistency is ensured by 1) not removing
the XID from procarray until the status is set in CLOG and 2) getting the list
of running transactions from procarray. Thus if an MVCC snapshot does not have
particular XID in its "xip" array, it implies that it's no longer in procarray
and therefore it's been marked in CLOG.

As for logical decoding based snapshots (whether HISTORIC_MVCC or those
converted eventually to regular MVCC), we currently do not check if CLOG is
consistent with the transaction list in snapshot->xip. What I proposed is that
we enforce this consistency by checking CLOG (and possibly waiting) before we
finalize the snapshot. Thus the snapshot user can safely assume that the
snapshot->xip array is consistent with CLOG, as if the snapshot was based on
procarray.

Or is there another issue with the CLOG itself? I thought about wraparound
(i.e. getting the XID status from a CLOG slot which is still being used by old
transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place
during XID freezing). Concurrent access to the slot should neither be a
problem since only a single byte (which is atomic) needs to be fetched during
the XID status check.

Another hypothetical problem that occurs to me is memory access ordering,
i.e. one backend creates and exports the snapshot and another one imports it
before it can see the CLOG update. It's hard to imagine though.

Or are there other concerns?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Race conditions in logical decoding

От
Álvaro Herrera
Дата:
On 2026-Jan-23, Antonin Houska wrote:

> > This way you wait repeatedly for one transaction until it is marked
> > committed; and once it does, you don't test it again.
> 
> Sure, that's much beter. Thanks.

Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration.  I think you would
end up doing less tests overall.  Something like this

    /*
     * Although it's very unlikely, it's possible that a commit WAL record was
     * decoded but CLOG is not aware of the commit yet. Should the CLOG update
     * be delayed even more, visibility checks that use this snapshot could
     * work incorrectly. Therefore we check the CLOG status here.
     */
    {
        TransactionId        *stillrunning;
        int        nstillrunning = builder->committed.xcnt;

        stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
        nstillrunning = builder->committed.xcnt;
        memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);

        for (;;)
        {
            int        next = 0;

            if (nstillrunning == 0)
                break;
            for (int i = 0; i < nstillrunning; i++)
            {
                if (!TransactionIdDidCommit(stillrunning[i]))
                    stillrunning[next++] = stillrunning[i];
            }
            if (next == 0)
                break;
            nstillrunning = next;
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             10L,
                             WAIT_EVENT_SNAPBUILD_CLOG);
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
        pfree(stillrunning);
    }

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"



Re: Race conditions in logical decoding

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> On 2026-Jan-23, Antonin Houska wrote:
>
> > > This way you wait repeatedly for one transaction until it is marked
> > > committed; and once it does, you don't test it again.
> >
> > Sure, that's much beter. Thanks.
>
> Actually, I wonder if it would make sense to sleep just once after
> testing all the transactions for whether they are marked committed (not
> once per transaction); and after sleeping, we only test again those that
> were not marked committed in the previous iteration.  I think you would
> end up doing less tests overall.  Something like this

I suppose that TransactionIdDidCommit() returns false pretty rarely, so one
iteration is sufficient in almost all cases. Besides that, I preferred simpler
code because it's easier to test (It's not trivial to have debugger reach the
conditions.) However it's just my preference - no real objections to your
approach.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com