Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Msg-id: CA+hUKG+O-PZOM1f9nSMGQ-3f3b_3F-jJ28Xt+WM9271zkZz4yg@mail.gmail.com
In reply to: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Alexander Lakhin <exclusion@gmail.com>)
Responses: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Alexander Lakhin <exclusion@gmail.com>)
List: pgsql-hackers

On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> It takes less than 10 minutes on average for me. I checked
> REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined
> forcefully) — they all are affected.
> I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
> undefined manually). And surprisingly for me, I could not reproduce it on
> master and REL_16_STABLE.
> `git bisect` for this behavior change pointed at 7389aad63 (though maybe it
> just greatly decreased probability of the failure; I'm going to double-check
> this).
> In particular, that commit changed this:
> -    /*
> -     * Ignore SIGURG for now.  Child processes may change this (see
> -     * InitializeLatchSupport), but they will not receive any such signals
> -     * until they wait on a latch.
> -     */
> -    pqsignal_pm(SIGURG, SIG_IGN);   /* ignored */
> -#endif
> +    /* This may configure SIGURG, depending on platform. */
> +    InitializeLatchSupport();
> +    InitProcessLocalLatch();
>
> With debugging logging added I see (on 7389aad63~1) that one process
> really sends SIGURG to another, and the latter reaches poll(), but it
> just got no signal, its signal handler not called and poll() just waits...

Thanks for working so hard on this, Alexander.  That is a surprising
discovery!  So changes to the signal handler arrangements in the
*postmaster* before the child was forked affected this?

> So it looks like the ARM weak memory model is not the root cause of the
> issue. But as far as I can see, it's still specific to FreeBSD (but not
> specific to a compiler — I used gcc and clang with the same success).

Idea:  FreeBSD 13 introduced a new mechanism called sigfastblock[1],
which lets system libraries control signal blocking with atomic memory
tricks in a word of user space memory.  I have no particular theory
for why it would be going wrong here (I don't expect us to be using
any of the stuff that would use it, though I don't understand it in
detail so that doesn't say much), but it occurred to me that all
reports so far have been on 13.x or 14.  I wonder...  If you have a
good fast recipe for reproducing this, could you also try it on
FreeBSD 12.4?

[1] https://man.freebsd.org/cgi/man.cgi?query=sigfastblock&sektion=2&manpath=FreeBSD+13.0-current


