suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)

From Tomas Vondra
Subject suspicious lockup on widowbird in AdvanceXLInsertBuffer (could it be due to 6a2275b8953?)
Date
Msg-id 67f7132d-3923-47a6-9de2-5b7d86ddb73f@vondra.me
List pgsql-hackers
Hi,

I have noticed one of my buildfarm machines - widowbird - did not report
any results since February 17. And it seems to be stuck somewhere in
amcheck:

$ ps ax | grep postgres
1180067 ?        Ss     0:02 /mnt/data/buildfarm/buildroot/HEAD/inst/bin/postgres -D data-C
1180069 ?        Ss     0:00 postgres: checkpointer
1180070 ?        Ss     0:00 postgres: background writer
1180072 ?        Ss     0:00 postgres: walwriter
1180073 ?        Ss     0:01 postgres: autovacuum launcher
1180074 ?        Ss     0:00 postgres: logical replication launcher
1180107 ?        Ss     0:05 postgres: buildfarm contrib_regression_amcheck [local] INSERT
1180111 ?        Ss     0:00 postgres: autovacuum worker
1180134 ?        Ss     0:00 postgres: autovacuum worker
1180135 ?        Ss     0:00 postgres: autovacuum worker
1374029 pts/0    S+     0:00 grep --color=auto postgres

So there's PID 1180107, executing an insert, but not progressing. The
backtrace looks like this (first couple lines, full backtrace attached):

#0  0x0000007fa64b8ddc in __GI_epoll_pwait (epfd=5, events=0x55ad6285a8, maxevents=1, timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1  0x0000007fa64b8fe8 in epoll_wait (epfd=<optimized out>, events=<optimized out>, maxevents=<optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:32
#2  0x000000558f043588 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ff8ed4e18, cur_timeout=-1, set=0x55ad628540) at latch.c:1571
#3  WaitEventSetWait (set=0x55ad628540, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ff8ed4e18, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217781) at latch.c:1519
#4  0x000000558f043778 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217781) at latch.c:538
#5  0x000000558f052274 in ConditionVariableTimedSleep (cv=0x7f9ac9deb0, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217781) at condition_variable.c:163
#6  0x000000558f05286c in ConditionVariableTimedSleep (wait_event_info=134217781, timeout=-1, cv=<optimized out>) at condition_variable.c:135
#7  0x000000558ed2fc90 in AdvanceXLInsertBuffer (upto=upto@entry=608174080, tli=tli@entry=1, opportunistic=opportunistic@entry=false) at xlog.c:2224
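
The raw integers in the frames above (wait_event_info, wakeEvents, upto) can be decoded by hand. The Python sketch below is not part of the original report. It assumes the wait-class values and WL_* flag bits as they appear in the PostgreSQL headers (utils/wait_classes.h and storage/latch.h) around the time of this thread, so check them against the tree being tested. The helper names are hypothetical.

```python
# Assumed constants, transcribed from PostgreSQL headers; verify against the
# source tree under test before relying on them.
WAIT_CLASSES = {
    0x01000000: "LWLock",
    0x03000000: "Lock",
    0x04000000: "BufferPin",
    0x05000000: "Activity",
    0x06000000: "Client",
    0x07000000: "Extension",
    0x08000000: "IPC",
    0x09000000: "Timeout",
    0x0A000000: "IO",
}

WAKE_EVENT_FLAGS = [
    (1 << 0, "WL_LATCH_SET"),
    (1 << 1, "WL_SOCKET_READABLE"),
    (1 << 2, "WL_SOCKET_WRITEABLE"),
    (1 << 3, "WL_TIMEOUT"),
    (1 << 4, "WL_POSTMASTER_DEATH"),
    (1 << 5, "WL_EXIT_ON_PM_DEATH"),
]


def decode_wait_event_info(value):
    """Split wait_event_info into (class name, event ID within the class)."""
    class_id = value & 0xFF000000   # top byte selects the wait class
    event_id = value & 0x0000FFFF   # low 16 bits select the event
    return WAIT_CLASSES.get(class_id, hex(class_id)), event_id


def decode_wake_events(mask):
    """List the WL_* flag names set in a wakeEvents bitmask."""
    return [name for bit, name in WAKE_EVENT_FLAGS if mask & bit]


def format_lsn(lsn):
    """Render a 64-bit WAL position in the usual hi/lo hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


print(decode_wait_event_info(134217781))  # ('IPC', 53)
print(decode_wake_events(33))             # ['WL_LATCH_SET', 'WL_EXIT_ON_PM_DEATH']
print(format_lsn(608174080))              # 0/24400000
```

Under these assumptions, wait_event_info=134217781 is wait class IPC with event ID 53 (the name behind that ID is generated per version, so it is not mapped here), wakeEvents=33 is WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, and upto=608174080 corresponds to LSN 0/24400000.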

So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
interesting fact is it's testing 75dfde13639, which is just a couple
commits after 6a2275b895:

    commit 6a2275b8953a4462d44daf001bdd60b3d48f0946
    Author: Alexander Korotkov <akorotkov@postgresql.org>
    Date:   Mon Feb 17 04:19:01 2025 +0200

    Get rid of WALBufMappingLock

    Allow multiple backends to initialize WAL buffers concurrently.
    This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in
    parallel without taking a single LWLock in exclusive mode.

    ...

which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
last (successful) run on widowbird was on eaf502747b, which already
includes 6a2275b895, so maybe it's unrelated.

Is there something else I could collect from the stuck instance, before
I restart it?

regards

-- 
Tomas Vondra
