Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Дата
Msg-id d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
обсуждение исходный текст
Ответы Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-bugs
29.10.2021 16:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17257
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14.0
> Operating system:   Ubuntu 20.04
> Description:        
>
> When running concurrent installchecks (x100) I've observed the autovacuum
> process hanging:
>
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)

The reproducing script is:
###
export PGDATABASE=regression
createdb regression

echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/pseudo-autovacuum.sql

( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/inherit.sql >>psql1.log 2>&1; coredumpctl
--no-pager >/dev/null 2>&1 && break; done" ) &
( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/vacuum.sql >>psql2.log 2>&1; coredumpctl --no-pager
>/dev/null 2>&1 && break; done" ) &
pgbench -n -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >pgbench.log 2>&1 &

wait
###

(I've set:
autovacuum=off
fsync=off
in postgresql.conf)

It leads to:
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-10-29 16:15:59 MSK  2337121  1000  1000   6 present  
.../usr/local/pgsql/bin/postgres

real    2m19,425s
user    4m26,078s
sys     0m31,658s

Core was generated by `postgres: law regression [local]
VACUUM                                       '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f3d17e88859 in __GI_abort () at abort.c:79
#2  0x0000557358fbecb8 in ExceptionalCondition
(conditionName=conditionName@entry=0x557359027544 "numretries < 100",
    errorType=errorType@entry=0x55735901800b "FailedAssertion",
fileName=0x7ffef5bbfa70 "\231\354\373XsU",
    fileName@entry=0x557359027520 "vacuumlazy.c",
lineNumber=lineNumber@entry=1726) at assert.c:69
#3  0x0000557358ba0bdb in lazy_scan_prune (vacrel=0x55735a508d60,
buf=16184, blkno=270, page=0x7f3d16ef6f80 "",
    vistest=0x5573592dd540 <GlobalVisCatalogRels>,
prunestate=0x7ffef5bc0fb0) at vacuumlazy.c:1823
#4  0x0000557358ba38c5 in lazy_scan_heap (aggressive=false,
params=0x7ffef5bc13a0, vacrel=<optimized out>)
    at vacuumlazy.c:1384
#5  heap_vacuum_rel (rel=0x7f3d183ebec8, params=0x7ffef5bc13a0,
bstrategy=<optimized out>) at vacuumlazy.c:638
#6  0x0000557358cec785 in table_relation_vacuum (bstrategy=<optimized
out>, params=0x7ffef5bc13a0, rel=0x7f3d183ebec8)
    at ../../../src/include/access/tableam.h:1678
#7  vacuum_rel (relid=1259, relation=<optimized out>,
params=params@entry=0x7ffef5bc13a0) at vacuum.c:2034
#8  0x0000557358cedf9e in vacuum (relations=0x55735a57bea8,
params=0x7ffef5bc13a0, bstrategy=<optimized out>,
    isTopLevel=<optimized out>) at vacuum.c:475
#9  0x0000557358cee4f5 in ExecVacuum
(pstate=pstate@entry=0x55735a4ab730, vacstmt=vacstmt@entry=0x55735a48a408,
    isTopLevel=isTopLevel@entry=true) at vacuum.c:268
#10 0x0000557358e9de4e in standard_ProcessUtility (pstmt=0x55735a48a758,
    queryString=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;",
    readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x55735a48a848,
    qc=0x7ffef5bc1700) at utility.c:858
#11 0x0000557358e9c411 in PortalRunUtility
(portal=portal@entry=0x55735a4ed040, pstmt=pstmt@entry=0x55735a48a758,
    isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848,
    qc=qc@entry=0x7ffef5bc1700) at pquery.c:1155
#12 0x0000557358e9c54d in PortalRunMulti
(portal=portal@entry=0x55735a4ed040, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848, altdest=altdest@entry=0x55735a48a848,
    qc=qc@entry=0x7ffef5bc1700) at pquery.c:1312
#13 0x0000557358e9cbe9 in PortalRun (portal=portal@entry=0x55735a4ed040,
count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55735a48a848,
    altdest=altdest@entry=0x55735a48a848, qc=0x7ffef5bc1700) at pquery.c:788
#14 0x0000557358e9895b in exec_simple_query (
    query_string=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;")
    at postgres.c:1214
#15 0x0000557358e9a561 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffef5bc1b70, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4486
#16 0x0000557358e064dd in BackendRun (port=0x55735a4aeb00,
port=0x55735a4aeb00) at postmaster.c:4506
#17 BackendStartup (port=0x55735a4aeb00) at postmaster.c:4228
#18 ServerLoop () at postmaster.c:1745
#19 0x0000557358e07481 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#20 0x0000557358b352e2 in main (argc=3, argv=0x55735a4838a0) at main.c:209

Sometimes the server crashes other way as described in bug #17255.

And I've also seen the following crash (maybe it's yet another bug, but
I can't explore it now):
Core was generated by `postgres: law regression [local]
EXPLAIN                                   '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
230                             datum = heap_getattr(tuple,
Anum_pg_class_relpartbound,
(gdb) bt
#0  RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
#1  RelationGetPartitionDesc (rel=0x7f09b87a27e8,
omit_detached=<optimized out>) at partdesc.c:110
#2  0x000055ec12169032 in PartitionDirectoryLookup (pdir=0x55ec12ee7660,
rel=rel@entry=0x7f09b87a27e8)
    at partdesc.c:425
#3  0x000055ec12158b35 in set_relation_partition_info
(relation=0x7f09b87a27e8, rel=0x55ec12f337c0,
    root=0x55ec12effae0) at plancat.c:2197
#4  get_relation_info (root=<optimized out>, relationObjectId=<optimized
out>, inhparent=<optimized out>,
    rel=<optimized out>) at plancat.c:472
#5  0x000055ec1215d26f in build_simple_rel
(root=root@entry=0x55ec12effae0, relid=5,
    parent=parent@entry=0x55ec12ee7330) at relnode.c:307
#6  0x000055ec1214f495 in expand_partitioned_rtentry
(root=0x55ec12effae0, relinfo=0x55ec12ee7330,
    parentrte=0x55ec12e006d0, parentRTindex=1, parentrel=0x7f09b87a0630,
top_parentrc=0x0, lockmode=1) at inherit.c:398
#7  0x000055ec1214f67e in expand_inherited_rtentry
(root=root@entry=0x55ec12effae0, rel=0x55ec12ee7330,
    rte=0x55ec12e006d0, rti=rti@entry=1) at inherit.c:143
#8  0x000055ec1212eb8d in add_other_rels_to_query
(root=root@entry=0x55ec12effae0) at initsplan.c:163
#9  0x000055ec12131e3f in query_planner (root=root@entry=0x55ec12effae0,
    qp_callback=qp_callback@entry=0x55ec12132b20 <standard_qp_callback>,
qp_extra=qp_extra@entry=0x7ffcffe4e890)
    at planmain.c:264
#10 0x000055ec121371b7 in grouping_planner (root=<optimized out>,
tuple_fraction=<optimized out>) at planner.c:1442
#11 0x000055ec12139d0e in subquery_planner
(glob=glob@entry=0x55ec12ef3eb8, parse=parse@entry=0x55ec12e005b8,
    parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=false,
tuple_fraction=tuple_fraction@entry=0)
    at planner.c:1019
#12 0x000055ec1213a383 in standard_planner (parse=0x55ec12e005b8,
query_string=<optimized out>, cursorOptions=2048,
    boundParams=<optimized out>) at planner.c:400
#13 0x000055ec1220d7f8 in pg_plan_query (querytree=0x55ec12e005b8,
querytree@entry=0x7ffcffe4ead0,
    query_string=query_string@entry=0x55ec12dff520 "explain (costs off)
select * from range_list_parted;",
    cursorOptions=cursorOptions@entry=2048,
boundParams=boundParams@entry=0x0) at postgres.c:847
#14 0x000055ec12025fef in ExplainOneQuery (query=0x7ffcffe4ead0,
cursorOptions=2048, into=0x0, es=0x55ec12ef3bd0,
    queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", params=0x0, queryEnv=0x0)
    at explain.c:397
#15 0x000055ec1202679f in ExplainQuery (pstate=0x55ec12ef3a70,
stmt=0x55ec12e003d8, params=0x0, dest=0x55ec12ef39d8)
    at ../../../src/include/nodes/nodes.h:604
#16 0x000055ec122132c9 in standard_ProcessUtility (pstmt=0x55ec12e01058,
    queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", readOnlyTree=<optimized out>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55ec12ef39d8, qc=0x7ffcffe4ed60)
    at utility.c:862
#17 0x000055ec122116cf in PortalRunUtility
(portal=portal@entry=0x55ec12e63060, pstmt=0x55ec12e01058,
    isTopLevel=<optimized out>,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x55ec12ef39d8,
    qc=qc@entry=0x7ffcffe4ed60) at pquery.c:1155
#18 0x000055ec12211bb0 in FillPortalStore (portal=0x55ec12e63060,
isTopLevel=<optimized out>)
    at ../../../src/include/nodes/nodes.h:604
#19 0x000055ec12211ebd in PortalRun (portal=portal@entry=0x55ec12e63060,
count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55ec12e01148,
    altdest=altdest@entry=0x55ec12e01148, qc=0x7ffcffe4ef60) at pquery.c:760
#20 0x000055ec1220dcf5 in exec_simple_query (
    query_string=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;") at postgres.c:1214
#21 0x000055ec1220f886 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffcffe4f4b0, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4486
#22 0x000055ec1218071a in BackendRun (port=<optimized out>,
port=<optimized out>) at postmaster.c:4506
#23 BackendStartup (port=<optimized out>) at postmaster.c:4228
#24 ServerLoop () at postmaster.c:1745
#25 0x000055ec12181682 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55ec12df9ce0) at postmaster.c:1417
#26 0x000055ec11eca3a0 in main (argc=3, argv=0x55ec12df9ce0) at main.c:209

Best regards,
Alexander

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Следующее
От: Dmitry Dolgov
Дата:
Сообщение: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum