TruncateMultiXact() bugs

From: Heikki Linnakangas
Subject: TruncateMultiXact() bugs
Msg-id: ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
Replies: Re: TruncateMultiXact() bugs
List: pgsql-hackers

I was performing tests around multixid wraparound, when I ran into this 
assertion:

> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
 
> postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
> postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
> postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
> postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
> postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
> postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
> postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
> postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
> postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
> postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
> postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
> postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
> postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
> postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
> postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
> 2024-06-14 13:11:02.025 EEST [920971] LOG:  server process (PID 920981) was terminated by signal 6: Aborted
> 2024-06-14 13:11:02.025 EEST [920971] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13407 (to prevent wraparound)
 

The attached Python script reproduces this pretty reliably. It's a 
reduced version of a larger test script I was working on; it could 
probably be simplified further for this particular issue.

Looking at the code, it's pretty clear how it happens:

1. TruncateMultiXact does START_CRIT_SECTION();

2. In the critical section, it calls PerformMembersTruncation() -> 
SlruDeleteSegment() -> SlruInternalDeleteSegment() -> 
RegisterSyncRequest() -> ForwardSyncRequest()

3. If the fsync request queue is full, it calls 
CompactCheckpointerRequestQueue(), which calls palloc0. Pallocs are not 
allowed in a critical section.

A straightforward fix is to add a check to 
CompactCheckpointerRequestQueue() to bail out without compacting if 
it's called in a critical section. That would cover any other cases like 
this, where RegisterSyncRequest() is called in a critical section. I 
haven't searched for other cases like this.
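
To illustrate the shape of that check, here is a minimal standalone sketch. It is not the attached patch and not PostgreSQL code: crit_section_count, request_queue, checked_alloc0(), compact_queue() and forward_sync_request() are hypothetical stand-ins for CritSectionCount, the checkpointer request queue, palloc0(), CompactCheckpointerRequestQueue() and ForwardSyncRequest(), and the duplicate detection is a simple quadratic scan chosen for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define QUEUE_CAPACITY 4

/* Hypothetical stand-ins for CritSectionCount and the shared request queue. */
int crit_section_count = 0;
int request_queue[QUEUE_CAPACITY];   /* each entry is a segment number */
int num_requests = 0;

/* Mimics the rule behind the assertion: no allocation in a critical section. */
void *checked_alloc0(size_t size)
{
    assert(crit_section_count == 0);
    return calloc(1, size);
}

/*
 * Drop duplicate requests, keeping the last copy of each.
 * Returns true if at least one slot was freed.
 */
bool compact_queue(void)
{
    bool *skip;
    bool  freed;
    int   kept = 0;

    /* The guard under discussion: bail out instead of allocating. */
    if (crit_section_count > 0)
        return false;

    skip = checked_alloc0(sizeof(bool) * QUEUE_CAPACITY);
    if (skip == NULL)
        return false;

    /* Mark every entry that has an identical entry later in the queue. */
    for (int i = 0; i < num_requests; i++)
        for (int j = i + 1; j < num_requests; j++)
            if (request_queue[i] == request_queue[j])
            {
                skip[i] = true;
                break;
            }

    /* Slide the surviving entries to the front. */
    for (int i = 0; i < num_requests; i++)
        if (!skip[i])
            request_queue[kept++] = request_queue[i];

    free(skip);
    freed = kept < num_requests;
    num_requests = kept;
    return freed;
}

/* Returns false when the queue is full and could not be compacted. */
bool forward_sync_request(int segno)
{
    if (num_requests >= QUEUE_CAPACITY && !compact_queue())
        return false;
    request_queue[num_requests++] = segno;
    return true;
}
```

In this sketch, a caller inside a critical section that hits a full queue simply gets false back and has to retry later, rather than tripping the allocation assertion.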

But wait, there is more!

After applying that fix in CompactCheckpointerRequestQueue(), the test 
script often gets stuck. There's a deadlock between the checkpointer 
and the autovacuum backend trimming the SLRUs:

1. TruncateMultiXact does this:

         MyProc->delayChkptFlags |= DELAY_CHKPT_START;

2. It then makes that call to PerformMembersTruncation() and 
RegisterSyncRequest(). If it cannot queue the request, it sleeps a 
little and retries. But the checkpointer is stuck waiting for the 
autovacuum backend, because of delayChkptFlags, and will never clear the 
queue.

To fix this, I propose adding AbsorbSyncRequests() calls to the 
wait-loops in CreateCheckPoint().
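
The deadlock and the effect of absorbing requests while waiting can be modelled with a small single-threaded simulation. This is only a sketch under stated assumptions, not the attached patch: SimState, backend_step(), checkpointer_wait_step() and run_simulation() are hypothetical names, the backend and the checkpointer are interleaved one step at a time instead of running as separate processes, and "absorbing" is reduced to emptying a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 2
#define MAX_TICKS 100

/* Hypothetical model of the shared state involved in the deadlock. */
typedef struct
{
    int  queued;        /* requests sitting in the shared queue */
    int  pending;       /* requests the backend still has to register */
    bool delay_chkpt;   /* models the backend's DELAY_CHKPT_START flag */
} SimState;

/* Backend: try to queue one request; clear the flag once all are queued. */
void backend_step(SimState *s)
{
    assert(s->queued <= QUEUE_CAPACITY);
    if (s->pending > 0 && s->queued < QUEUE_CAPACITY)
    {
        s->queued++;
        s->pending--;
    }
    if (s->pending == 0)
        s->delay_chkpt = false;
}

/* Checkpointer: one iteration of the loop that waits for the flag to clear. */
void checkpointer_wait_step(SimState *s, bool absorb_while_waiting)
{
    if (absorb_while_waiting)
        s->queued = 0;          /* models absorbing the queued requests */
}

/*
 * Returns the number of ticks until the checkpointer sees the flag cleared,
 * or -1 if it is still waiting after MAX_TICKS (the deadlock).
 */
int run_simulation(bool absorb_while_waiting)
{
    SimState s = { .queued = QUEUE_CAPACITY, .pending = 3, .delay_chkpt = true };

    for (int tick = 0; tick < MAX_TICKS; tick++)
    {
        if (!s.delay_chkpt)
            return tick;
        checkpointer_wait_step(&s, absorb_while_waiting);
        backend_step(&s);
    }
    return -1;
}
```

With absorb_while_waiting set to false the queue stays full, the backend never makes progress, and run_simulation() returns -1; with it set to true the backend drains its pending requests and the wait ends.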


The attached patch fixes both of those issues.

I can't help thinking that TruncateMultiXact() should perhaps not have 
such a long critical section. TruncateCLOG() doesn't do that. But it was 
added for good reasons in commit 4f627f897367, and this fix seems 
appropriate for the stable branches anyway, even if we come up with 
something better for master.

-- 
Heikki Linnakangas
Neon (https://neon.tech)