Обсуждение: BUG #17116: Assert failed in SerialSetActiveSerXmin() on commit of parallelized serializable transaction

Поиск
Список
Период
Сортировка

BUG #17116: Assert failed in SerialSetActiveSerXmin() on commit of parallelized serializable transaction

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17116
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.3
Operating system:   Ubuntu 20.04
Description:

The isolation test "serializable-parallel-2" modified as follows:
# Exercise the case where a read-only serializable transaction has
# SXACT_FLAG_RO_SAFE set in a parallel query.

setup
{
    CREATE TABLE foo AS SELECT generate_series(1, 10)::int a;
    ALTER TABLE foo SET (parallel_workers = 2);
}

teardown
{
    DROP TABLE foo;
}

session s1
setup         { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
step s1r    { SELECT * FROM foo; }
step s1c    { COMMIT; }

session s2
setup        {
              BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
              SET parallel_setup_cost = 0;
              SET parallel_tuple_cost = 0;
            }
step s2r1    { SELECT * FROM foo; }
step s2r2    { SELECT * FROM foo; }
step s2c    { COMMIT; }

session s3
setup         { BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; }
step s3r    { SELECT * FROM foo; }
step s3c    { COMMIT; }

session s4
setup        {
              BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
              SET parallel_setup_cost = 0;
              SET parallel_tuple_cost = 0;
            }
step s4r1    { SELECT * FROM foo; }
step s4r2    { SELECT * FROM foo; }
step s4c    { COMMIT; }

permutation s1r s3r s2r1 s4r1 s1c s2r2 s3c s4r2 s4c s2c

leads to a failed assertion with the following stacktrace:
Core was generated by `postgres: law isolation_regression [local] COMMIT
                        '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f21de1dd859 in __GI_abort () at abort.c:79
#2  0x0000555e1740b1ac in ExceptionalCondition (
    conditionName=conditionName@entry=0x555e17581ce8
"!TransactionIdIsValid(serialControl->tailXid) || TransactionIdFollows(xid,
serialControl->tailXid)", errorType=errorType@entry=0x555e17466028
"FailedAssertion", 
    fileName=fileName@entry=0x555e17581789 "predicate.c",
lineNumber=lineNumber@entry=1056) at assert.c:67
#3  0x0000555e172d3c5c in SerialSetActiveSerXmin (xid=6187) at
predicate.c:1056
#4  0x0000555e172d3ecf in SetNewSxactGlobalXmin () at predicate.c:3309
#5  0x0000555e172d811c in ReleasePredicateLocks (isCommit=false,
isCommit@entry=true, 
    isReadOnlySafe=isReadOnlySafe@entry=false) at predicate.c:3697
#6  0x0000555e1743fbb0 in ResourceOwnerReleaseInternal (owner=<optimized
out>, 
    phase=phase@entry=RESOURCE_RELEASE_LOCKS, isCommit=isCommit@entry=true,
isTopLevel=isTopLevel@entry=true)
    at resowner.c:569
#7  0x0000555e17440166 in ResourceOwnerRelease (owner=<optimized out>,
phase=phase@entry=RESOURCE_RELEASE_LOCKS, 
    isCommit=isCommit@entry=true, isTopLevel=isTopLevel@entry=true) at
resowner.c:484
#8  0x0000555e1700a156 in CommitTransaction () at xact.c:2227
#9  0x0000555e1700af28 in CommitTransactionCommand () at xact.c:2974
#10 0x0000555e172e319d in finish_xact_command () at postgres.c:2662
#11 0x0000555e172e5972 in exec_simple_query
(query_string=query_string@entry=0x555e18e17350 "COMMIT;")
    at postgres.c:1264
#12 0x0000555e172e7b06 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x555e18e42ad8, dbname=<optimized out>, 
    username=<optimized out>) at postgres.c:4339
#13 0x0000555e17253165 in BackendRun (port=port@entry=0x555e18e3bb10) at
postmaster.c:4526
#14 0x0000555e172562c0 in BackendStartup (port=port@entry=0x555e18e3bb10) at
postmaster.c:4210
#15 0x0000555e17256507 in ServerLoop () at postmaster.c:1739
#16 0x0000555e17257a30 in PostmasterMain (argc=8, argv=<optimized out>) at
postmaster.c:1412
#17 0x0000555e171a0a4f in main (argc=8, argv=0x555e18e119e0) at main.c:210

(Discovered by running multiple installcheck-world's in parallel.)
Reproduced on REL_12_STABLE..master. It looks like the offending commit is
47a338cf.


On Thu, Jul 22, 2021 at 11:23 PM PG Bug reporting form
<noreply@postgresql.org> wrote:           '.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f21de1dd859 in __GI_abort () at abort.c:79
> #2  0x0000555e1740b1ac in ExceptionalCondition (
>     conditionName=conditionName@entry=0x555e17581ce8
> "!TransactionIdIsValid(serialControl->tailXid) || TransactionIdFollows(xid,
> serialControl->tailXid)", errorType=errorType@entry=0x555e17466028
> "FailedAssertion",
>     fileName=fileName@entry=0x555e17581789 "predicate.c",
> lineNumber=lineNumber@entry=1056) at assert.c:67

Thanks.  Repro'd here.  Not immediately sure what's happening here,
but I will look into it next week.



On Sat, Jul 24, 2021 at 12:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jul 22, 2021 at 11:23 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:           '.
> > #2  0x0000555e1740b1ac in ExceptionalCondition (
> >     conditionName=conditionName@entry=0x555e17581ce8
> > "!TransactionIdIsValid(serialControl->tailXid) || TransactionIdFollows(xid,
> > serialControl->tailXid)", errorType=errorType@entry=0x555e17466028
> > "FailedAssertion",
> >     fileName=fileName@entry=0x555e17581789 "predicate.c",
> > lineNumber=lineNumber@entry=1056) at assert.c:67
>
> Thanks.  Repro'd here.  Not immediately sure what's happening here,
> but I will look into it next week.

Yeah, I goofed commit 47a338cf.  It could double-release a sort of
reference count.  I think the reason I didn't notice and nothing bad
happened is that the global xmin tracking is effectively
self-correcting in this rare case.  Once the count reaches zero, we
recompute the count by linear search for the new oldest xmin
(something on my list of things to try to improve, but in this case
the saving grace).

Here's a draft patch that includes your test spec.  Without the
included change to predicate.c, it hits the assertion.  With it, it
doesn't, and pg_locks shows no leaked locks after the isolation checks
run (that was the problem 47a338cf fixed before 12 was released, but
it fixed it a little too far in the opposite direction, fortunately
without ill effect AFAICS besides wasted CPU cycles).

I'll see if I can think of some more ways to test this logic, and some
way to make the coding a little clearer.

Вложения
Hello Thomas,

26.07.2021 10:36, Thomas Munro wrote:
> On Sat, Jul 24, 2021 at 12:56 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Thu, Jul 22, 2021 at 11:23 PM PG Bug reporting form
>> <noreply@postgresql.org> wrote:           '.
>>> #2  0x0000555e1740b1ac in ExceptionalCondition (
>>>     conditionName=conditionName@entry=0x555e17581ce8
>>> "!TransactionIdIsValid(serialControl->tailXid) || TransactionIdFollows(xid,
>>> serialControl->tailXid)", errorType=errorType@entry=0x555e17466028
>>> "FailedAssertion",
>>>     fileName=fileName@entry=0x555e17581789 "predicate.c",
>>> lineNumber=lineNumber@entry=1056) at assert.c:67
>> Thanks.  Repro'd here.  Not immediately sure what's happening here,
>> but I will look into it next week.
> Here's a draft patch that includes your test spec.  Without the
> included change to predicate.c, it hits the assertion.  With it, it
> doesn't, and pg_locks shows no leaked locks after the isolation checks
> run (that was the problem 47a338cf fixed before 12 was released, but
> it fixed it a little too far in the opposite direction, fortunately
> without ill effect AFAICS besides wasted CPU cycles).

Thanks! I'm not sure, whether this is related to the fix, but my
parallel installcheck script (attached) catches another assert with the
patch applied:

Core was generated by `postgres: postgres regress33 127.0.0.1(48582) COM'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ff5251c7535 in __GI_abort () at abort.c:79
#2  0x000055b16d72aa32 in ExceptionalCondition (
    conditionName=conditionName@entry=0x55b16d89ccd0
"!SxactIsPartiallyReleased(MySerializableXact)",
    errorType=errorType@entry=0x55b16d78301d "FailedAssertion",
fileName=fileName@entry=0x55b16d89bbc9 "predicate.c",
    lineNumber=lineNumber@entry=4859) at assert.c:67
#3  0x000055b16d5f8a3a in PreCommit_CheckForSerializationFailure () at
predicate.c:4859
#4  0x000055b16d34227d in CommitTransaction () at xact.c:2140
#5  0x000055b16d3435b5 in CommitTransactionCommand () at xact.c:3085
#6  0x000055b16d602369 in finish_xact_command () at postgres.c:2662
#7  0x000055b16d604a23 in finish_xact_command () at postgres.c:2660
#8  exec_simple_query (query_string=0x55b16e9c6980 "COMMIT;") at
postgres.c:1264
#9  0x000055b16d6060d9 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x55b16e9f66a8, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4339
#10 0x000055b16d57936f in BackendRun (port=0x55b16e9f40b0,
port=0x55b16e9f40b0) at postmaster.c:4526
#11 BackendStartup (port=0x55b16e9f40b0) at postmaster.c:4210
#12 ServerLoop () at postmaster.c:1739
#13 0x000055b16d57a36c in PostmasterMain (argc=3, argv=0x55b16e9c1320)
at postmaster.c:1412
#14 0x000055b16d28bf5b in main (argc=3, argv=0x55b16e9c1320) at main.c:210

---
/tmp/t/ic01-33/src/test/isolation/expected/serializable-parallel-2.out
2021-07-26 22:04:06.206541268 +0300
+++
/tmp/t/ic01-33/src/test/isolation/output_iso/results/serializable-parallel-2.out      
2021-07-26 22:08:04.883339810 +0300
@@ -47,4 +47,6 @@
 10
 (10 rows)

-step s2c: COMMIT;
+lock wait query failed: server closed the connection unexpectedly
+       This probably means the server terminated abnormally
+       before or while processing the request.

If time permits, I will compose (till the weekend) yet another isolation
test to demonstrate this fail. But maybe the script can be useful as is.
(To simplify things I left only serializable-parallel* tests in
isolation_schedule.)

Best regards,
Alexander

Вложения
On Tue, Jul 27, 2021 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Thanks! I'm not sure, whether this is related to the fix, but my
> parallel installcheck script (attached) catches another assert with the
> patch applied:

> #2  0x000055b16d72aa32 in ExceptionalCondition (
>     conditionName=conditionName@entry=0x55b16d89ccd0
> "!SxactIsPartiallyReleased(MySerializableXact)",
>     errorType=errorType@entry=0x55b16d78301d "FailedAssertion",
> fileName=fileName@entry=0x55b16d89bbc9 "predicate.c",
>     lineNumber=lineNumber@entry=4859) at assert.c:67

Reproduced here using a similar approach.  It doesn't fail with this
new draft patch.  The main thing I have learned is that
serializable-parallel-2.sql isn't really testing a very interesting
code path.  I'll try to come up with a better test.

Вложения
Hello Thomas,
27.07.2021 14:17, Thomas Munro wrote:
> Reproduced here using a similar approach.  It doesn't fail with this
> new draft patch.  The main thing I have learned is that
> serializable-parallel-2.sql isn't really testing a very interesting
> code path.  I'll try to come up with a better test.
Thanks! With the new fix applied, my multiple installcheck test survives
100+ iterations. (Except for read-only-anomaly-3, that causes another
assertion fail but I believe that it's not related to the initial issue
and deserves another bug report.)

Best regards,
Alexander



On Wed, Jul 28, 2021 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Thanks! With the new fix applied, my multiple installcheck test survives
> 100+ iterations.

Thanks.

Here's an updated version.  No code change, but now with a change to
serializable-parallel-2.sql so that the second assertion you mentioned
fails on unpatched master.  The intention of that test was to exercise
the SXACT_FLAG_RO_SAFE optimisation in a parallel query, and it did
that, but in a sort of degenerate way: sequential scans acquire
predicate locks *before entering parallel mode*, so the optimisation
is triggered immediately.  This version triggers the optimisation in a
parallel worker, and then the final cleanup happens in the leader,
which is the interesting case.

> (Except for read-only-anomaly-3, that causes another
> assertion fail but I believe that it's not related to the initial issue
> and deserves another bug report.)

Huh.  I tried and failed to find that one with concurrent runs.  I'll
wait for your next report before I do anything, just in case there's a
connection.

Вложения
On Wed, Jul 28, 2021 at 5:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > (Except for read-only-anomaly-3, that causes another
> > assertion fail but I believe that it's not related to the initial issue
> > and deserves another bug report.)
>
> Huh.  I tried and failed to find that one with concurrent runs.  I'll
> wait for your next report before I do anything, just in case there's a
> connection.

This is indeed an unrelated issue, and much older.  After many
kilowatt hours, I reproduced it here and worked out that it comes from
GetSafeSnapshot() not expecting possibleUnsafeConflicts to be empty
when WritableSxactCount == 0.  More soon.



On Fri, Jul 30, 2021 at 9:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 5:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Jul 28, 2021 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > (Except for read-only-anomaly-3, that causes another
> > > assertion fail but I believe that it's not related to the initial issue
> > > and deserves another bug report.)

The problem is that if we don't take the fast exit here:

    if (XactReadOnly && PredXact->WritableSxactCount == 0)
    {
        ReleasePredXact(sxact);
        LWLockRelease(SerializableXactHashLock);
        return snapshot;
    }

... then we search for writable SSI transactions here:

        for (othersxact = FirstPredXact();
             othersxact != NULL;
             othersxact = NextPredXact(othersxact))
        {
            if (!SxactIsCommitted(othersxact)
                && !SxactIsDoomed(othersxact)
                && !SxactIsReadOnly(othersxact))
            {
                SetPossibleUnsafeConflict(sxact, othersxact);
            }
        }

... but that can find nothing if all writers happen to be doomed.
WritableSxactCount doesn't count read-only or committed transactions
so we don't have to worry about those confusing us, but it does count
doomed transactions.  Having an empty list here is mostly fine, except
that nobody will ever set our RO_SAFE flag, and in the case of
DEFERRABLE, the assertion that someone has set it will fail.  This is
the case since:

===
commit bdaabb9b22caa71021754d3967b4032b194d9880
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Fri Jul 8 00:36:30 2011 +0300

    There's a small window wherein a transaction is committed but not yet
    on the finished list, and we shouldn't flag it as a potential conflict
    if so. We can also skip adding a doomed transaction to the list of
    possible conflicts because we know it won't commit.

    Dan Ports and Kevin Grittner.
===

I see three fixes:

1.  Check if the list is empty, and if so, set our own
SXACT_FLAG_RO_SAFE.  Then the assertion in GetSafeSnapshot() will
pass, and also non-DEFERRABLE callers will eventually see the flag and
opt out.

2.  Check if the list is empty, and if so, opt out immediately (as we
do when WriteableSxactCount == 0).  This would be strictly better than
#1, because it happens sooner while we still have the lock.  On the
other hand, we'd have to undo more effects, and that involves writing
more fragile and very rarely run code.

3.  Just delete the assertion in GetSafeSnapshot().  This is
attractively simple but it means that READ ONLY non-DEFERRABLE
transactions arbitrarily miss out on the RO_SAFE optimisation in this
(admittedly rare) case.

The attached 0002 has idea #1, which I prefer for back-patching.  I
think #2 might be a good idea if we take the filtering logic further
so that it becomes more likely that you get an empty list (for
example, why don't we skip transactions that are running in a
different database?), but then that'd be new code, not something we
back-patch.  Thoughts?

If someone is wanting to reproduce this case, try removing everything
from isolation_schedule except read-only-anomaly* and
read-write-unique*, and then run something like this while you go out
for lunch.  It should work as far back as 11 (before that, those tests
didn't exist and the isolation tester couldn't run them due to lack of
handling for DEFERRABLE blocking):

rounds=1000
concurrency=100
output=/tmp/junk
for i in `seq $rounds` ; do
  echo "=== round $i ==="
  for c in `seq $concurrency` ; do
    mkdir -p $output/results_${i}_${c}
    (make -C src/test/isolation installcheck \
       EXTRA_REGRESS_OPTS="--dbname=regress_${c}
--outputdir=$output/results_${i}_${c}" || echo FAIL) \
         > junk/out.${i}.${c}.log 2>&1 &
     pids[${c}]=$!
  done
  for c in `seq $concurrency` ; do
    wait ${pids[${c}]}
  done
done

Вложения
Hello Thomas,
30.07.2021 08:48, Thomas Munro wrote:
>>>> (Except for read-only-anomaly-3, that causes another
>>>> assertion fail but I believe that it's not related to the initial issue
>>>> and deserves another bug report.)
> The problem is that if we don't take the fast exit here:
You wouldn't mind if I file a separate bug report regarding the last
problem (not related to the initial bug report)?
Now I can propose a minimal reproduction for it, that can ease testing
and choosing the most appropriate fix.
As to v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ.patch
you posted before, I believe it fixes the issues as expected and may be
committed at least to allow for advanced testing.

Best regards,
Alexander



On Mon, Jan 10, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> You wouldn't mind if I file a separate bug report regarding the last
> problem (not related to the initial bug report)?
> Now I can propose a minimal reproduction for it, that can ease testing
> and choosing the most appropriate fix.

Thanks!  Yeah, please go ahead.

> As to v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ.patch
> you posted before, I believe it fixes the issues as expected and may be
> committed at least to allow for advanced testing.

Cool, I'll do that in the next few days.  Need to swap SERIALIZABLE
back into my neurons after some time away...



15.01.2022 08:41, Thomas Munro wrote:
> On Mon, Jan 10, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> You wouldn't mind if I file a separate bug report regarding the last
>> problem (not related to the initial bug report)?
>> Now I can propose a minimal reproduction for it, that can ease testing
>> and choosing the most appropriate fix.
> Thanks!  Yeah, please go ahead.
Thank you! Done:
https://www.postgresql.org/message-id/17368-98a4f99e8e4b4402%40postgresql.org

Best regards,
Alexander



Hi Thomas,

15.01.2022 08:41, Thomas Munro wrote:
> On Mon, Jan 10, 2022 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> As to v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ.patch
>> you posted before, I believe it fixes the issues as expected and may be
>> committed at least to allow for advanced testing.
> Cool, I'll do that in the next few days.  Need to swap SERIALIZABLE
> back into my neurons after some time away...

I've caught one of asserts we discussed here again (in a different setup)
and would like to continue working on the fixes.
I've rechecked 
v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ.patch
— it's still in a good shape, needs only minor corrections for the
master branch and can be applied as-is to REL_12_STABLE .. REL_15_STABLE.
The modified test serializable-parallel-2 and new test
serializable-parallel-3 still trigger asserts with unpatched predicate.c
and make check-world passes with the complete patch applied.
So if that patch requires attention of a seasoned reviewer, let's
send it to the commitfest (I can register the patch if you don't mind)?
Otherwise, maybe just commit it and move to other bugs...

Best regards,
Alexander
Вложения
On Sun, Mar 5, 2023 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've caught one of asserts we discussed here again (in a different setup)
> and would like to continue working on the fixes.
> I've rechecked
> v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ.patch
> — it's still in a good shape, needs only minor corrections for the
> master branch and can be applied as-is to REL_12_STABLE .. REL_15_STABLE.
> The modified test serializable-parallel-2 and new test
> serializable-parallel-3 still trigger asserts with unpatched predicate.c
> and make check-world passes with the complete patch applied.

Yeah, I see the same here.

> So if that patch requires attention of a seasoned reviewer, let's
> send it to the commitfest (I can register the patch if you don't mind)?
> Otherwise, maybe just commit it and move to other bugs...

Pushed.  Thanks!



06.03.2023 06:52, Thomas Munro wrote:
>> So if that patch requires attention of a seasoned reviewer, let's
>> send it to the commitfest (I can register the patch if you don't mind)?
>> Otherwise, maybe just commit it and move to other bugs...
> Pushed.  Thanks!
Thank you for the fix!
(It makes at least parallel installcheck procedure simpler.)

Best regards,
Alexander