Discussion: BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()


BUG #17072: Assert for clogGroupNext failed due to a race condition in TransactionGroupUpdateXidStatus()

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17072
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.3
Operating system:   Ubuntu 20.04
Description:

When running multiple installchecks (100-200) in a loop, I get the
following failure:

Core was generated by `postgres: postgres regress119 127.0.0.1(60410) in'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1c7203d535 in __GI_abort () at abort.c:79
#2  0x00005645576b41a2 in ExceptionalCondition (
    conditionName=conditionName@entry=0x564557840fc8
"pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO", 
    errorType=errorType@entry=0x56455771001d "FailedAssertion",
fileName=fileName@entry=0x56455774676d "proc.c", 
    lineNumber=lineNumber@entry=498) at assert.c:67
#3  0x000056455757b062 in InitProcess () at proc.c:498
#4  0x00005645575893df in PostgresMain (argc=1,
argv=argv@entry=0x5645581d5820, dbname=<optimized out>, 
    username=0x5645581d5168 "postgres") at postgres.c:4393
#5  0x00005645574e6f91 in BackendRun (port=0x5645581c73d0,
port=0x5645581c73d0) at postmaster.c:5784
#6  BackendStartup (pool=pool@entry=0x0, port=port@entry=0x5645581c73d0) at
postmaster.c:5440
#7  0x00005645574e769b in ServerLoop () at postmaster.c:2403
#8  0x00005645574e8e4b in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1870
#9  0x00005645571db177 in main (argc=3, argv=0x56455819ad30) at main.c:211

(gdb) print MyProc->clogGroupNext
$1 = {value = 985}
(gdb) print &MyProc->clogGroupNext
$2 = (pg_atomic_uint32 *) 0x7f1c64806a0c

With some diagnostic output added, I got the following in the server log:
  2021-06-24 16:44:16.018 MSK|postgres|regress184|60d48c2d.ea21|WARNING:
TransactionGroupUpdateXidStatus: 819, 2147483647
...
2021-06-24 16:44:16.018 MSK|postgres|regress54|60d48c27.df96|WARNING:
TransactionGroupUpdateXidStatus: 982, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dd8ec, 974
2021-06-24 16:44:16.018 MSK|postgres|regress44|60d48c2a.e5a7|WARNING:
TransactionGroupUpdateXidStatus: 981, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64880824, 974
2021-06-24 16:44:16.021 MSK|postgres|regress32|60d48c2f.ee50|WARNING:
TransactionGroupUpdateXidStatus: after while: 974
2021-06-24 16:44:16.018 MSK|postgres|regress144|60d48c2e.ebc5|WARNING:
TransactionGroupUpdateXidStatus: 963, 2147483647
  2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64806a0c, 2147483647
2021-06-24 16:44:16.018 MSK|postgres|regress149|60d48c2a.e674|WARNING:
TransactionGroupUpdateXidStatus: 799, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648610e4, 944
2021-06-24 16:44:16.018 MSK|postgres|regress114|60d48c29.e3e1|WARNING:
TransactionGroupUpdateXidStatus: 816, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dd8ec, 944
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648e95a4, 2147483647
2021-06-24 16:44:16.018 MSK|postgres|regress147|60d48c2f.efb7|ERROR:  cannot
change routine kind
2021-06-24 16:44:16.018 MSK|postgres|regress147|60d48c2f.efb7|DETAIL:
"functest1" is a function.
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64880824, 944
2021-06-24 16:44:16.021 MSK|postgres|regress50|60d48c2b.e710|ERROR:
relation "does_not_exist" does not exist
2021-06-24 16:44:16.020 MSK|postgres|regress22|60d48c2f.efed|WARNING:
TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dc3f4, 2147483647
2021-06-24 16:44:16.020 MSK|postgres|regress36|60d48c2f.eff6|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO:
0x7f1c64828b3c, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648e02dc, 2147483647
2021-06-24 16:44:16.020 MSK|postgres|regress11|60d48c2f.ef8e|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c647f6e6c, 792
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648daefc, 2147483647
  2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64806a0c, 944
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING:
TransactionGroupUpdateXidStatus: after while: 944
2021-06-24 16:44:16.020 MSK|postgres|regress155|60d48c2f.efe1|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648ad114, 792
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dd8ec, 888
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648e95a4, 888
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64880824, 888
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dc3f4, 888
2021-06-24 16:44:16.021 MSK|postgres|regress144|60d48c2e.ebc5|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648c358c, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648e95a4, 985
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING:
TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress149|60d48c2a.e674|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c647ec6ac, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648daefc, 985
2021-06-24 16:44:16.021 MSK|postgres|regress22|60d48c2f.efed|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO:
0x7f1c648610e4, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress36|60d48c2f.eff6|WARNING:
TransactionGroupUpdateXidStatus: end
  2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64806a0c, 985
2021-06-24 16:44:16.021 MSK|postgres|regress114|60d48c29.e3e1|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64802b24, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress11|60d48c2f.ef8e|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c647f6e6c, 985
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dd8ec, 985
2021-06-24 16:44:16.021 MSK|postgres|regress50|60d48c2b.e710|ERROR:  syntax
error at or near "arg" at character 22
2021-06-24 16:44:16.021 MSK|postgres|regress155|60d48c2f.efe1|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648ad114, 985
2021-06-24 16:44:16.021 MSK|postgres|regress72|60d48c2d.ea64|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c64880824, 985
2021-06-24 16:44:16.021 MSK|postgres|regress54|60d48c27.df96|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 nextidx:
0x7f1c648dc3f4, 985
2021-06-24 16:44:16.021 MSK|postgres|regress186|60d48c2f.eea9|WARNING:
TransactionGroupUpdateXidStatus: after while: 985
2021-06-24 16:44:16.021 MSK|postgres|regress34|60d48c2f.ee8b|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO:
0x7f1c648e95a4, 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress144|60d48c2e.ebc5|WARNING:
TransactionGroupUpdateXidStatus: after while: 2147483647
2021-06-24 16:44:16.021 MSK|postgres|regress44|60d48c2a.e5a7|WARNING:
TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress22|60d48c2f.efed|WARNING:
TransactionGroupUpdateXidStatus: pg_atomic_write_u32 INVALID_PGPROCNO:
0x7f1c648aa724, 2147483647
  2021-06-24 16:44:16.021 MSK|postgres|regress184|60d48c2d.ea21|WARNING:
TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress194|60d48c2f.eed2|WARNING:
TransactionGroupUpdateXidStatus: return true (944)
2021-06-24 16:44:16.021 MSK|postgres|regress11|60d48c2f.ef8e|WARNING:
TransactionGroupUpdateXidStatus: return false (963)
2021-06-24 16:44:16.021 MSK|postgres|regress33|60d48c2f.efdc|WARNING:
TransactionGroupUpdateXidStatus: return false (963)

The offending process (the one that left a "valid" clogGroupNext) is
60d48c2d.ea21. It looks like it got from pg_atomic_compare_exchange_u32()
the nextidx value that was written into clogGroupFirst by process
60d48c2e.ebc5, and exited just after that.


On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The offending process (the one that left a "valid" clogGroupNext) is
> 60d48c2d.ea21. It looks like it got from pg_atomic_compare_exchange_u32()
> the nextidx value that was written into clogGroupFirst by process
> 60d48c2e.ebc5, and exited just after that.
>

Your analysis seems to be in the right direction. Can you try setting
clogGroupNext to INVALID_PGPROCNO
(pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before
we return false in the first while(true) loop in
TransactionGroupUpdateXidStatus()?

I think this should be reproducible on all branches from HEAD till
v11. Have you tried in any other branch? I'll also try to reproduce
it.

-- 
With Regards,
Amit Kapila.



24.06.2021 19:00, PG Bug reporting form wrote:
> When running multiple installchecks (100-200) in a loop, I get the
> following failure:
If it could be of any use, please look at the draft script that executes
multiple installchecks on one instance.
(Sorry for the dirty play with the makefile and its neighbors; it's not
intended for production use.)
I started developing this script to discover a hardware or software
issue on a powerful server (192 cores, 4 TB RAM, ...).
A single installcheck is a needle in a haystack for such hardware, so I
tried to run multiple instances with multiple installchecks each.
Some tests are not ready to run in such a configuration, but most of the
tests work and bring some findings.

Best regards,
Alexander

Attachments
Hello Amit,
25.06.2021 12:55, Amit Kapila wrote:
> On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> The offending process (the one that left a "valid" clogGroupNext) is
>> 60d48c2d.ea21. It looks like it got from pg_atomic_compare_exchange_u32()
>> the nextidx value that was written into clogGroupFirst by process
>> 60d48c2e.ebc5, and exited just after that.
>>
> Your analysis seems to be in the right direction. Can you try setting
> clogGroupNext to INVALID_PGPROCNO
> (pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before
> we return false in the first while(true) loop in
> TransactionGroupUpdateXidStatus()?
With this modification the assert is not triggered; all 100 iterations
pass fine (triple-checked).
> I think this should be reproducible on all branches from HEAD till
> v11. Have you tried in any other branch? I'll also try to reproduce
> it.
I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
master.

Best regards,
Alexander



On Fri, Jun 25, 2021 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Amit,
> 25.06.2021 12:55, Amit Kapila wrote:
> > On Fri, Jun 25, 2021 at 12:20 AM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> >> The offending process (the one that left a "valid" clogGroupNext) is
> >> 60d48c2d.ea21. It looks like it got from pg_atomic_compare_exchange_u32()
> >> the nextidx value that was written into clogGroupFirst by process
> >> 60d48c2e.ebc5, and exited just after that.
> >>
> > Your analysis seems to be in the right direction. Can you try setting
> > clogGroupNext to INVALID_PGPROCNO
> > (pg_atomic_write_u32(&proc->clogGroupNext, INVALID_PGPROCNO);) before
> > we return false in the first while(true) loop in
> > TransactionGroupUpdateXidStatus()?
> With this modification the assert is not triggered; all 100 iterations
> pass fine (triple-checked).
>

Okay, please find the patch for the same attached.

> > I think this should be reproducible on all branches from HEAD till
> > v11. Have you tried in any other branch? I'll also try to reproduce
> > it.
> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
> master.
>

Please see if you can verify whether the attached patch fixes it in all
the branches. I have also reproduced the problem in a slightly different
way, using a debugger, with three sessions trying to commit at the same
time. After the first session became the first group member, I allowed
the second session to check whether it could become a member and stopped
it via the debugger just before it became one. Then I allowed the first
session to complete its transaction and let the third session become the
group leader (i.e., the first member of a new group). After that, when
the second session tried to become a member, it noticed that the leader
had changed and tried again to join the new leader. At that point I
forced the second session, via the debugger, to return false and perform
the commit by itself. Finally, disconnecting and reconnecting the second
session produced the assertion failure you reported. The attached patch
fixes the assertion failure.

-- 
With Regards,
Amit Kapila.

Attachments
Hello Amit,
26.06.2021 12:57, Amit Kapila wrote:
>
>>> I think this should be reproducible on all branches from HEAD till
>>> v11. Have you tried in any other branch? I'll also try to reproduce
>>> it.
>> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
>> master.
>>
> Please see if you can verify whether the attached fixes it in all the
> branches?
Yes, the attached patch fixes the issue in all the branches.
Thanks for the fix!

Best regards,
Alexander



On Mon, Jun 28, 2021 at 12:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Amit,
> 26.06.2021 12:57, Amit Kapila wrote:
> >
> >>> I think this should be reproducible on all branches from HEAD till
> >>> v11. Have you tried in any other branch? I'll also try to reproduce
> >>> it.
> >> I've reproduced it on REL_11_STABLE, REL_12_STABLE, REL_13_STABLE, and
> >> master.
> >>
> > Please see if you can verify whether the attached fixes it in all the
> > branches?
> Yes, the attached patch fixes the issue in all the branches.
> Thanks for the fix!
>

Pushed, thanks for the report and all the testing.

-- 
With Regards,
Amit Kapila.