Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Msg-id CA+hUKGJnqmVybjbm6V4Ca6ci0r0pMLe-H9LCdL-BFWH02ig0Fw@mail.gmail.com
In reply to: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Andres Freund <andres@anarazel.de>)
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> So I did that - same configure options as the buildfarm client, and a
> 'make check' (with only tests up to the 'join' suite, because that's
> where it got stuck before). And it took only ~15 runs (~1h) to hit this
> again on dikkop.

That's good news.

> I managed to collect the fstat/procstat stuff Thomas asked for, and the
> backtraces - attached. I still have the core files, in case we look at
> something. As before, running gcore on the second worker (29081) gets
> this unstuck - it sends some signal that apparently wakes it up.

Thanks!  As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command; it should be -i, not
-j.  Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process?  Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither a compiler barrier nor a memory barrier.
That doesn't seem right: we might see a value of latch->is_set from
before "waiting" was true, and yet the signal handler might also have
run while "waiting" was false, so the self-pipe doesn't save us,
despite the length of the comment about that.  Can you reproduce it
with this change?

--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
                 * ordering, so that we cannot miss seeing is_set if a notification
                 * has already been queued.
                 */
+               pg_memory_barrier();
                if (set->latch && set->latch->is_set)
                {
                        occurred_events->fd = PGINVALID_SOCKET;
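
To make the ordering requirement concrete, here is a minimal stand-alone sketch of the store-then-load pattern in portable C11.  It is not PostgreSQL code: every sketch_* name is hypothetical, the notifier collapses SetLatch() and the signal handler into a single function for brevity, and atomic_thread_fence(memory_order_seq_cst) merely stands in for pg_memory_barrier().  The point is that each side stores its own flag, issues a full barrier, and only then loads the other side's flag, so at least one of them must observe the other's store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the latch state; not PostgreSQL's definitions. */
static atomic_bool sketch_waiting;   /* plays the role of "waiting" */
static atomic_bool sketch_is_set;    /* plays the role of latch->is_set */
static atomic_int  sketch_wakeups;   /* counts simulated self-pipe writes */

/* Notifier: publish is_set, full barrier, then check whether to wake. */
static void
sketch_notify(void)
{
    atomic_store_explicit(&sketch_is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&sketch_waiting, memory_order_relaxed))
        atomic_fetch_add_explicit(&sketch_wakeups, 1, memory_order_relaxed);
}

/*
 * Waiter: publish waiting, full barrier, then check is_set.  Without the
 * fence, the load of is_set could be ordered before the store to waiting.
 * Returns true if the latch was already seen as set.
 */
static bool
sketch_prepare_to_wait(void)
{
    atomic_store_explicit(&sketch_waiting, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sketch_is_set, memory_order_relaxed);
}

/* Reset all state between scenarios. */
static void
sketch_reset(void)
{
    atomic_store(&sketch_waiting, false);
    atomic_store(&sketch_is_set, false);
    atomic_store(&sketch_wakeups, 0);
}

/* Single-threaded walk through the two sequential interleavings. */
static void
sketch_demo(void)
{
    /* Notifier runs first: the waiter sees is_set, no wakeup is needed. */
    sketch_reset();
    sketch_notify();
    assert(sketch_prepare_to_wait());
    assert(atomic_load(&sketch_wakeups) == 0);

    /* Waiter runs first: it misses is_set, so the notifier must wake it. */
    sketch_reset();
    assert(!sketch_prepare_to_wait());
    sketch_notify();
    assert(atomic_load(&sketch_wakeups) == 1);
}
```

sketch_demo() only exercises the two sequential orderings, so it shows the intended protocol but cannot reproduce the race itself.  The lost wakeup would be the case where the waiter misses is_set and the wakeup count also stays at zero, which is what the barrier is meant to rule out.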


