Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Msg-id: ee0e1ae4-ff12-7d56-72a8-a70e492d6287@gmail.com
In reply to: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

Hi Thomas,

08.09.2023 22:39, Thomas Munro wrote:
>> With debugging logging added I see (on 7389aad63~1) that one process
>> really sends SIGURG to another, and the latter reaches poll(), but it
>> just got no signal, it's signal handler not called and poll() just waits...
> Thanks for working so hard on this Alexander.  That is a surprising
> discovery!  So changes to the signal handler arrangements in the
> *postmaster* before the child was forked affected this?

Yes, I think we are dealing with something like that. I can try to narrow
this down to the minimal change that affects reproduction of the issue, but
maybe that's not so important.
Perhaps we should now think about escalating the problem to the FreeBSD
developers? I wonder what kind of reproducer they would find acceptable:
only a standalone C program, or would a script that compiles/installs
postgres and runs our test do too?

>> So it looks like the ARM weak memory model is not the root cause of the
>> issue. But as far as I can see, it's still specific to FreeBSD (but not
>> specific to a compiler — I used gcc and clang with the same success).
> Idea:  FreeBSD 13 introduced a new mechanism called sigfastblock[1],
> which lets system libraries control signal blocking with atomic memory
> tricks in a word of user space memory.  I have no particular theory
> for why it would be going wrong here (I don't expect us to be using
> any of the stuff that would use it, though I don't understand it in
> detail so that doesn't say much), but it occurred to me that all
> reports so far have been on 13.x or 14.  I wonder...  If you have a
> good fast recipe for reproducing this, could you also try it on
> FreeBSD 12.4?

It was a lucky guess! I ran the reproducer on
FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212
and got the same results as on FreeBSD 14:
REL_12_STABLE - failed on iteration 3
REL_15_STABLE - failed on iteration 1
REL_16_STABLE - 10 iterations with no failure

But on FreeBSD 12.4-RELEASE r372781:
REL_12_STABLE - 20 iterations with no failure
REL_15_STABLE - 20 iterations with no failure

BTW, I also retested 7389aad63 on FreeBSD 14 and got no failures in 100
iterations.

Best regards,
Alexander
