Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Дата
Msg-id 20230725033946.mstudugbav7yghui@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-bugs
Hi,

Thanks for finding / debugging this issue!

On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> I've reproduced the issue in my environment with the provided script.
> The crashed process is not a parallel vacuum worker actually but a
> parallel worker for rebuilding the index. The scenario seems that when
> detecting a deadlock, the process removes itself from the wait queue
> by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> aborting the transaction. With commit 5764f611e, in
> RemoveFromWaitQueue() we remove the process from the wait queue using
> dclist_delete_from():
> 
>     /* Remove proc from lock's wait queue */
>     dclist_delete_from(&waitLock->waitProcs, &proc->links);
> :
>     /* Clean up the proc's own state, and pass it the ok/fail signal */
>     proc->waitLock = NULL;
>     proc->waitProcLock = NULL;
>     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
> 
>  However, since dclist_delete_from() doesn't clear proc->links, in
> LockErrorCleanup(), dlist_node_is_detached() still returns false:
> 
>     if (!dlist_node_is_detached(&MyProc->links))
>     {
>         /* We could not have been granted the lock yet */
>         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
>     }

Indeed :(


> leading to calling RemoveFromWaitQueue() again. I think we should use
> dclist_delete_from_thoroughly() instead. With the attached patch, the
> issue doesn't happen in my environment.

Yep. Do you want to push that fix, or should I?


> Another thing I noticed is that the Assert(waitLock) in
> RemoveFromWaitQueue() is useless actually, since we access *waitLock
> before that:
> 
> I think we should fix it as well. This fix is also included in the
> attached patch.

Don't really have an opinion on that. It's been this way for longer, afaict.


Greetings,

Andres Freund



В списке pgsql-bugs по дате отправления:

Предыдущее
От: "Jonathan S. Katz"
Дата:
Сообщение: Re: BUG #18018: Homebrew link is broken
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker