Discussion: lockup in parallel hash join on dikkop (freebsd 14.0-current)


lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra
Hi,

I received an alert that dikkop (my rpi4 buildfarm animal running FreeBSD 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE when building the hash table in a parallel
hash join, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves signals etc.

Anyway, in 'ps ax' I saw this:

94545  -  Ss       0:03.39 postgres: buildfarm regression [local] SELECT
94627  -  Is       0:00.03 postgres: parallel worker for PID 94545
94628  -  Is       0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

    select final > 1 as multibatch
          from hash_join_batches(
        $$
          select count(*) from join_foo
            left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
            on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
        $$);

This started on 2023-01-20 23:23:18.125, and the next log (after I did
the gdb stuff), is from 2023-01-26 20:05:16.751. Quite a bit of time.

It seems all three processes are doing WaitEventSetWait, either through
a ConditionVariable or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
more. But I see there's nodeHash and parallelism, and I recall there's a
lot of gotchas due to how the backends cooperate when building the hash
table, etc. Thomas, any idea what might be wrong?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tom Lane
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I received an alert that dikkop (my rpi4 buildfarm animal running FreeBSD 14)
> did not report any results for a couple of days, and it seems it got into
> an infinite loop in REL_11_STABLE when building the hash table in a parallel
> hash join, or something like that.

> It seems to be progressing now, probably because I attached gdb to the
> workers to get backtraces, which involves signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

            regards, tom lane



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> > I received an alert that dikkop (my rpi4 buildfarm animal running FreeBSD 14)
> > did not report any results for a couple of days, and it seems it got into
> > an infinite loop in REL_11_STABLE when building the hash table in a parallel
> > hash join, or something like that.
>
> > It seems to be progressing now, probably because I attached gdb to the
> > workers to get backtraces, which involves signals etc.
>
> That reminds me of cases that I saw several times on my now-deceased
> animal florican:
>
> https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
>
> There's clearly something rotten somewhere in there, but whether
> it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in latch code and not anything higher
(I mean, not in condition variables, barriers, or parallel hash join)
because I saw a similar hang in the shm_mq stuff which uses the latch
API directly.  Note that 13 switched to kqueue but still used the
self-pipe, and 14 switched to a signal event, and this hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.
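
For reference, a minimal sketch of the self-pipe pattern in question
(simplified, not the actual latch.c code; the pipe is assumed to have been
created at startup with both ends set O_NONBLOCK, and is_set here stands in
for latch->is_set):

#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int  selfpipe[2];                /* pipe(); both ends O_NONBLOCK */
static volatile sig_atomic_t is_set;    /* stands in for latch->is_set */

/* SIGUSR1 handler: wake any pending poll() with an async-signal-safe write */
static void
sigusr1_handler(int signo)
{
    char        dummy = 0;

    (void) write(selfpipe[1], &dummy, 1);
}

/*
 * Waiter: poll() cannot sleep through a signal that arrived between the
 * is_set check and the poll() call, because the handler queued a byte.
 */
static void
wait_for_latch(void)
{
    struct pollfd pfd = {.fd = selfpipe[0], .events = POLLIN};
    char        buf[16];

    while (!is_set)
    {
        (void) poll(&pfd, 1, -1);   /* wakes on readable pipe or EINTR */
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;                       /* drain, then re-check is_set */
    }
}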



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> > > I received an alert that dikkop (my rpi4 buildfarm animal running FreeBSD 14)
> > > did not report any results for a couple of days, and it seems it got into
> > > an infinite loop in REL_11_STABLE when building the hash table in a parallel
> > > hash join, or something like that.
> >
> > > It seems to be progressing now, probably because I attached gdb to the
> > > workers to get backtraces, which involves signals etc.
> >
> > That reminds me of cases that I saw several times on my now-deceased
> > animal florican:
> >
> > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
> >
> > There's clearly something rotten somewhere in there, but whether
> > it's our bug or FreeBSD's isn't clear.
>
> And if it's ours, it's possibly in latch code and not anything higher
> (I mean, not in condition variables, barriers, or parallel hash join)
> because I saw a similar hang in the shm_mq stuff which uses the latch
> API directly.  Note that 13 switched to kqueue but still used the
> self-pipe, and 14 switched to a signal event, and this hasn't been
> reported in those releases or later, which makes the poll() code path
> a key suspect.

Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but
13 did it the same way as 11 + 12.  So between 12 and 13 we have just
the poll -> kqueue change.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
check loops, on the same FBSD 13.1 machine as elver, which has failed
like this once before, I haven't been able to reproduce this on
REL_12_STABLE.  Not really sure how to chase this, but if you see this
situation again, I'd be interested to see the output of fstat -p PID
(shows bytes in pipes) and procstat -j PID (shows pending signals) for
all PIDs involved (before connecting a debugger or doing anything else
that might make the syscall return with EINTR, after which we know it
continues happily because it then sees latch->is_set next time around
the loop).  If poll() is not returning when there are bytes ready to
read from the self-pipe, which fstat can show, I think that'd indicate
a kernel bug.  Likewise if procstat -j shows signals pending but
somehow the process is still blocked in the syscall.  Otherwise, it
might indicate a compiler or postgres bug, but I don't have any
particular theories.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-01-27 22:23:58 +1300, Thomas Munro wrote:
> After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
> check loops, on the same FBSD 13.1 machine as elver, which has failed
> like this once before, I haven't been able to reproduce this on
> REL_12_STABLE.

Did you use the same compiler / compilation flags as when elver hit it?
Clearly Tomas' case was with at least some optimizations enabled.

Were it not for you saying that you hit this on elver (amd64), I'd find it
interesting that we see the failure on an arm host, which has a weaker
memory ordering model than x86.

IIUC elver previously hit this on 12?

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> Were it not for you saying that you hit this on elver (amd64), I'd find it
> interesting that we see the failure on an arm host, which has a weaker
> memory ordering model than x86.

I also saw it on florican, which is/was an i386 machine using clang and
pretty standard build options other than
    'CFLAGS' => '-msse2 -O2',
so I think this isn't too much about machine architecture or compiler
flags.

Machine speed might matter though.  elver is a good deal faster than
florican was, and dikkop is slower yet.  I gather Thomas has seen this
only once on elver, but I saw it maybe a dozen times over a couple of
years on florican, and now dikkop has hit it after not so many runs.

            regards, tom lane



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
> I also saw it on florican, which is/was an i386 machine using clang and
> pretty standard build options other than
>     'CFLAGS' => '-msse2 -O2',
> so I think this isn't too much about machine architecture or compiler
> flags.

Ah. Florican dropped off the BF status page and I was too lazy to look
deeper. You have a penchant for odd architectures, so it didn't seem too crazy
:)


> Machine speed might matter though.  elver is a good deal faster than
> florican was, and dikkop is slower yet.  I gather Thomas has seen this
> only once on elver, but I saw it maybe a dozen times over a couple of
> years on florican, and now dikkop has hit it after not so many runs.

Re-reading the old thread, it is interesting that you tried hard to reproduce
it outside of the BF, without success:
https://postgr.es/m/2398828.1646000688%40sss.pgh.pa.us

Such problems are quite annoying. Last time I hit such a case was
https://postgr.es/m/20220325052654.3xpbmntatyofau2w%40alap3.anarazel.de
but I can't see anything like that being the issue here.

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Sat, Jan 28, 2023 at 4:42 PM Andres Freund <andres@anarazel.de> wrote:
> Did you use the same compiler / compilation flags as when elver hit it?
> Clearly Tomas' case was with at least some optimizations enabled.

I did use the same compiler version and optimisation level, clang
llvmorg-13.0.0-0-gd7b669b3a303 at -O2.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 1/28/23 05:53, Andres Freund wrote:
> Hi,
> 
> On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
>> I also saw it on florican, which is/was an i386 machine using clang and
>> pretty standard build options other than
>>     'CFLAGS' => '-msse2 -O2',
>> so I think this isn't too much about machine architecture or compiler
>> flags.
> 
> Ah. Florican dropped off the BF status page and I was too lazy to look
> deeper. You have a penchant for odd architectures, so it didn't seem too crazy
> :)
> 
> 
>> Machine speed might matter though.  elver is a good deal faster than
>> florican was, and dikkop is slower yet.  I gather Thomas has seen this
>> only once on elver, but I saw it maybe a dozen times over a couple of
>> years on florican, and now dikkop has hit it after not so many runs.
> 
> Re-reading the old thread, it is interesting that you tried hard to reproduce
> it outside of the BF, without success:
> https://postgr.es/m/2398828.1646000688%40sss.pgh.pa.us
> 
> Such problems are quite annoying. Last time I hit such a case was
> https://postgr.es/m/20220325052654.3xpbmntatyofau2w%40alap3.anarazel.de
> but I can't see anything like that being the issue here.
> 

FWIW I'll wait for dikkop to finish the current buildfarm run (it's
currently chewing on HEAD) and then will try to do runs of the 'joins'
test in a loop. That's where dikkop got stuck before.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra
On 1/28/23 13:05, Tomas Vondra wrote:
>
> FWIW I'll wait for dikkop to finish the current buildfarm run (it's
> currently chewing on HEAD) and then will try to do runs of the 'joins'
> test in a loop. That's where dikkop got stuck before.
>

So I did that - same configure options as the buildfarm client, and a
'make check' (with only tests up to the 'join' suite, because that's
where it got stuck before). And it took only ~15 runs (~1h) to hit this
again on dikkop.

As before, there are three processes - leader + 2 workers, but the query
is different - this time it's this one:

  -- A couple of other hash join tests unrelated to work_mem management.
  -- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
  savepoint settings;
  set local max_parallel_workers_per_gather = 2;
  set local work_mem = '4MB';
  set local parallel_leader_participation = off;
  select * from hash_join_batches(
  $$
    select count(*) from simple r join simple s using (id);
  $$);

I managed to collect the fstat/procstat stuff Thomas asked for, and the
backtraces - attached. I still have the core files, in case we want to
look at something. As before, running gcore on the second worker (29081)
gets this unstuck - it sends some signal that apparently wakes it up.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> So I did that - same configure options as the buildfarm client, and a
> 'make check' (with only tests up to the 'join' suite, because that's
> where it got stuck before). And it took only ~15 runs (~1h) to hit this
> again on dikkop.

That's good news.

> I managed to collect the fstat/procstat stuff Thomas asked for, and the
> backtraces - attached. I still have the core files, in case we want to look at
> something. As before, running gcore on the second worker (29081) gets
> this unstuck - it sends some signal that apparently wakes it up.

Thanks!  As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command, it should be -i, not
-j.  Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process?  Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither compiler nor memory barrier, which doesn't
seem right because we might see a value of latch->is_set from before
"waiting" was true, and yet the signal handler might also have run
while "waiting" was false so the self-pipe doesn't save us, despite
the length of the comment about that.  Can you reproduce it with this
change?

--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
                 * ordering, so that we cannot miss seeing is_set if a notification
                 * has already been queued.
                 */
+               pg_memory_barrier();
                if (set->latch && set->latch->is_set)
                {
                        occurred_events->fd = PGINVALID_SOCKET;



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 1/29/23 18:26, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> So I did that - same configure options as the buildfarm client, and a
>> 'make check' (with only tests up to the 'join' suite, because that's
>> where it got stuck before). And it took only ~15 runs (~1h) to hit this
>> again on dikkop.
> 
> That's good news.
> 
>> I managed to collect the fstat/procstat stuff Thomas asked for, and the
>> backtraces - attached. I still have the core files, in case we look at
>> something. As before, running gcore on the second worker (29081) gets
>> this unstuck - it sends some signal that apparently wakes it up.
> 
> Thanks!  As expected, no bytes in the pipe for any of those processes.
> Unfortunately I gave the wrong procstat command, it should be -i, not
> -j.  Does "procstat -i /path/to/core | grep USR1" show P (pending) for
> that stuck process?  Silly question really, I don't really expect
> poll() to be misbehaving in such a basic way.
> 

It shows "--C" for all three processes, which should mean "will be caught".

> I was talking to Andres on IM about this yesterday and he pointed out
> a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
> tell the signal handler to write to the self-pipe) and then reads
> latch->is_set with neither compiler nor memory barrier, which doesn't
> seem right because we might see a value of latch->is_set from before
> "waiting" was true, and yet the signal handler might also have run
> while "waiting" was false so the self-pipe doesn't save us, despite
> the length of the comment about that.  Can you reproduce it with this
> change?
> 

Will do, but I'll wait for another lockup to see how frequent it
actually is. I'm now at ~90 runs total, and it didn't happen again yet.
So hitting it after 15 runs might have been a bit of luck.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-01-30 06:26:02 +1300, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > So I did that - same configure options as the buildfarm client, and a
> > 'make check' (with only tests up to the 'join' suite, because that's
> > where it got stuck before). And it took only ~15 runs (~1h) to hit this
> > again on dikkop.
> 
> That's good news.

Indeed.

As annoying as it is, it might be worth reproing it once or twice more, just
to have a feeling for how long we need to run to have confidence in a fix.


> I was talking to Andres on IM about this yesterday and he pointed out
> a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
> tell the signal handler to write to the self-pipe) and then reads
> latch->is_set with neither compiler nor memory barrier, which doesn't
> seem right because we might see a value of latch->is_set from before
> "waiting" was true, and yet the signal handler might also have run
> while "waiting" was false so the self-pipe doesn't save us, despite
> the length of the comment about that.  Can you reproduce it with this
> change?
> 
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
>                  * ordering, so that we cannot miss seeing is_set if a notification
>                  * has already been queued.
>                  */
> +               pg_memory_barrier();
>                 if (set->latch && set->latch->is_set)
>                 {
>                         occurred_events->fd = PGINVALID_SOCKET;

I think we need a barrier in SetLatch(), after is_set = true. We have that in
some of the newer branches (due to the maybe_sleeping logic), but not in the
older branches.
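
For illustration, a simplified sketch of where such a barrier would sit,
abbreviated from the shape of SetLatch() in the old branches (waiting and
sendSelfPipeByte() are internals of the latch module):

void
SetLatch(volatile Latch *latch)
{
    pid_t       owner_pid;

    /* quick exit if already set */
    if (latch->is_set)
        return;

    latch->is_set = true;

    /*
     * Proposed addition: make the store to is_set visible to other
     * processes before reading owner_pid and deciding whether a wakeup
     * is needed, pairing with a barrier on the waiting side.
     */
    pg_memory_barrier();

    owner_pid = latch->owner_pid;
    if (owner_pid == 0)
        return;                     /* latch has no owner to wake */
    else if (owner_pid == MyProcPid)
    {
        if (waiting)
            sendSelfPipeByte();     /* wake our own pending poll() */
    }
    else
        kill(owner_pid, SIGUSR1);   /* wake the owning process */
}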

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
> Will do, but I'll wait for another lockup to see how frequent it
> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
> So hitting it after 15 runs might have been a bit of luck.

Was there a difference in how much load there was on the machine between
"reproduced in 15 runs" and "not reproed in 90"?  If indeed lack of barriers
is related to the issue, an increase in context switches could substantially
change the behaviour (in both directions).  More process context
switches can amount to "probabilistic barriers", because a context
switch acts as a memory barrier.  At the same time they can make it
more likely that the relatively narrow window in WaitEventSetWait() is
hit, or lead to larger delays processing signals.

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 1/29/23 18:53, Andres Freund wrote:
> Hi,
> 
> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
>> Will do, but I'll wait for another lockup to see how frequent it
>> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
>> So hitting it after 15 runs might have been a bit of luck.
> 
> Was there a difference in how much load there was on the machine between
> "reproduced in 15 runs" and "not reproed in 90"?  If indeed lack of barriers
> is related to the issue, an increase in context switches could substantially
> change the behaviour (in both directions).  More process context
> switches can amount to "probabilistic barriers", because a context
> switch acts as a memory barrier.  At the same time they can make it
> more likely that the relatively narrow window in WaitEventSetWait() is
> hit, or lead to larger delays processing signals.
> 

No. The only thing the machine is doing is

  while /usr/bin/true; do
    make check
  done

I can't reduce the workload further, because the "join" test is in a
separate parallel group (I cut down parallel_schedule). I could make the
machine busier, of course.

However, the other lockup I saw was when using serial_schedule, so I
guess lower concurrency makes it more likely.

But who knows ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Mon, Jan 30, 2023 at 7:08 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> However, the other lockup I saw was when using serial_schedule, so I
> guess lower concurrency makes it more likely.

FWIW "psql db -f src/test/regress/sql/join_hash.sql | cat" also works
(I mean, it's self-contained and doesn't need anything else from make
check; pipe to cat just disables the pager); that's how I've been
trying (and failing) to reproduce this on various computers.  I also
did a lot of "make -C src/test/modules/test_shm_mq installcheck" loops,
at the same time, because that's where my animal hung.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> out-of-order hazard

I've been trying to understand how that could happen, but my CPU-fu is
weak.  Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was.  We have A
B, and if the CPU sees no dependency and decides to execute B A
(pipelined), shouldn't an interrupt either wait for the whole
schemozzle to commit first (if not in a hurry), or nuke it, handle the
IPI and restart, or something?  After an hour of reviewing random
slides from classes on out-of-order execution and reorder buffers and
the like, I think the term for making sure that interrupts run with
the illusion of in-order execution maintained is "precise
interrupts", and it is expected in all modern architectures, after the
early OoO pioneers lost their minds trying to program without it.  I
guess generally you want that because it would otherwise run your
interrupt handler in a completely uncertain environment, and
specifically in this case it would reach our signal handler which
reads A's output (waiting) and writes to B's input (is_set), so B IPI
A surely shouldn't be allowed?

As for compiler barriers, I see that elver's compiler isn't reordering the code.

Maybe it's a much dumber sort of concurrency problem: stale cache
line due to missing barrier, but... commit db0f6cad488 made us also
set our own latch (a second time) when someone sets our latch in
releases 9.something to 13.  Which should mean that we're guaranteed
to see is_set = true in the scenario described, because we'll clobber
it ourselves if we have to, for good measure.

If our secondary SetLatch() sees it's already set and decides not to
set it, then it's possible that the code we interrupted was about to
run ResetLatch(), but any code doing that must next check its expected
exit condition (or it has a common-or-garden latch protocol bug, as
has been discovered from time to time in the tree...).
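
For reference, the common-or-garden protocol in question, per the comments
in latch.h: reset first, then re-check the condition, then sleep, so a
racing SetLatch() costs at most one spurious iteration and can never be
lost.  Here work_available() is a hypothetical exit condition:

for (;;)
{
    ResetLatch(MyLatch);

    if (work_available())           /* hypothetical exit condition */
        break;

    (void) WaitLatch(MyLatch, WL_LATCH_SET, -1 /* no timeout */,
                     0 /* wait_event_info */);
}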

/me wanders away with a renewed fear of computers and the vast
complexities they hide



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > out-of-order hazard
>
> I've been trying to understand how that could happen, but my CPU-fu is
> weak.  Let me try to write an argument for why it can't happen, so
> that later I can look back at how stupid and naive I was.  We have A
> B, and if the CPU sees no dependency and decides to execute B A
> (pipelined), shouldn't an interrupt either wait for the whole
> schemozzle to commit first (if not in a hurry), or nuke it, handle the
> IPI and restart, or something?

In a core local view, yes, I think so. But I don't think that's how it can
work on multi-core, and even more so, multi-socket machines. Imagine how it'd
influence latency if every interrupt on any CPU prevented all out-of-order
execution on every CPU.


> After an hour of reviewing random
> slides from classes on out-of-order execution and reorder buffers and
> the like, I think the term for making sure that interrupts run with
> the illusion of in-order execution maintained is "precise
> interrupts", and it is expected in all modern architectures, after the
> early OoO pioneers lost their minds trying to program without it.  I
> guess generally you want that because it would otherwise run your
> interrupt handler in a completely uncertain environment, and
> specifically in this case it would reach our signal handler which
> reads A's output (waiting) and writes to B's input (is_set), so B IPI
> A surely shouldn't be allowed?

Userspace signals aren't delivered synchronously during hardware interrupts
afaik - and I don't think they even possibly could be (after all the process
possibly isn't scheduled).

I think what you're talking about with precise interrupts above is purely
about the single-core view, and mostly about hardware interrupts for faults
etc. The CPU will unwind state from speculatively executed code etc on
interrupt, sure - but I think that's separate from guaranteeing that you can't
have stale cache contents *due to work by another CPU*.


I'm not even sure that userspace signals are generally delivered via an
immediate hardware interrupt, or whether they're processed at the next
scheduler tick. After all, we know that multiple signals are coalesced, which
certainly isn't compatible with synchronous execution. But it could be that
that just happens when the target of a signal is not currently scheduled.


> Maybe it's a much dumber sort of concurrency problem: stale cache
> line due to missing barrier, but... commit db0f6cad488 made us also
> set our own latch (a second time) when someone sets our latch in
> releases 9.something to 13.

But this part does indeed put a crimp on some potential theories.

TBH, I'd be in favor of just adding the barriers for good measure, even if we
don't know if it's a live bug today - it seems incredibly fragile.

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Mon, Jan 30, 2023 at 6:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:
> > On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > out-of-order hazard
> >
> > I've been trying to understand how that could happen, but my CPU-fu is
> > weak.  Let me try to write an argument for why it can't happen, so
> > that later I can look back at how stupid and naive I was.  We have A
> > B, and if the CPU sees no dependency and decides to execute B A
> > (pipelined), shouldn't an interrupt either wait for the whole
> > schemozzle to commit first (if not in a hurry), or nuke it, handle the
> > IPI and restart, or something?
>
> In a core local view, yes, I think so. But I don't think that's how it can
> work on multi-core, and even more so, multi-socket machines. Imagine how it'd
> influence latency if every interrupt on any CPU prevented all out-of-order
> execution on every CPU.

Good.  Yeah, I was talking only about a single thread/core.

> > After an hour of reviewing random
> > slides from classes on out-of-order execution and reorder buffers and
> > the like, I think the term for making sure that interrupts run with
> > the illusion of in-order execution maintained is "precise
> > interrupts", and it is expected in all modern architectures, after the
> > early OoO pioneers lost their minds trying to program without it.  I
> > guess generally you want that because it would otherwise run your
> > interrupt handler in a completely uncertain environment, and
> > specifically in this case it would reach our signal handler which
> > reads A's output (waiting) and writes to B's input (is_set), so B IPI
> > A surely shouldn't be allowed?
>
> Userspace signals aren't delivered synchronously during hardware interrupts
> afaik - and I don't think they even possibly could be (after all the process
> possibly isn't scheduled).

Yeah, they're not synchronous and the target might not even be
running.  BUT if a suitable thread is running then AFAICT an IPI is
delivered to that sucker to get it running the handler ASAP, at least
on the three OSes I looked at.  (See breadcrumbs below).

> I think what you're talking about with precise interrupts above is purely
> about the single-core view, and mostly about hardware interrupts for faults
> etc. The CPU will unwind state from speculatively executed code etc on
> interrupt, sure - but I think that's separate from guaranteeing that you can't
> have stale cache contents *due to work by another CPU*.

Yeah.  I get the cache problem, a separate issue that does indeed look
pretty dodgy.  I guess I wrote my email out-of-order: at the end I
speculated that cache coherency probably can't explain this failure at
least in THAT bit of the source, because of that funky extra
self-SetLatch().  I just got spooked by the mention of out-of-order
execution and I wanted to chase it down and straighten out my
understanding.

> I'm not even sure that userspace signals are generally delivered via an
> immediate hardware interrupt, or whether they're processed at the next
> scheduler tick. After all, we know that multiple signals are coalesced, which
> certainly isn't compatible with synchronous execution. But it could be that
> that just happens when the target of a signal is not currently scheduled.

FreeBSD: By default, they are when possible, eg if the process is
currently running a suitable thread.  You can set sysctl
kern.smp.forward_signal_enabled=0 to turn that off, and then it works
more like the way you imagined (checking for pending signals at
various arbitrary times, not sure).  See tdsigwakeup() ->
forward_signal() -> ipi_cpu().

Linux: Well it certainly smells approximately similar.  See
signal_wake_up_state() -> kick_process() -> smp_send_reschedule() ->
smp_cross_call() -> __ipi_send_mask().  The comment for kick_process()
explains that it's using the scheduler IPI to get signals handled
ASAP.

Darwin: ... -> cpu_signal() -> something that talks about IPIs

Coalescing happens not only at the pending-signal level (an invention
of the OS); for the inter-processor wakeups there is also interrupt
coalescing.  It's latches all the way down.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra
On 1/29/23 19:08, Tomas Vondra wrote:
> 
> 
> On 1/29/23 18:53, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
>>> Will do, but I'll wait for another lockup to see how frequent it
>>> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
>>> So hitting it after 15 runs might have been a bit of luck.
>>
>> Was there a difference in how much load there was on the machine between
>> "reproduced in 15 runs" and "not reproed in 90"?  If indeed lack of barriers
>> is related to the issue, an increase in context switches could substantially
>> change the behaviour (in both directions).  More process context
>> switches can amount to "probabilistic barriers", because a context
>> switch acts as a memory barrier.  At the same time they can make it
>> more likely that the relatively narrow window in WaitEventSetWait() is
>> hit, or lead to larger delays processing signals.
>>
> 
> No. The only thing the machine is doing is
> 
>   while /usr/bin/true; do
>     make check
>   done
> 
> I can't reduce the workload further, because the "join" test is in a
> separate parallel group (I cut down parallel_schedule). I could make the
> machine busier, of course.
> 
> However, the other lockup I saw was when using serial_schedule, so I
> guess lower concurrency makes it more likely.
> 

FWIW the machine is now on run ~2700 without any further lockups :-/

Seems it was quite lucky we hit it twice in a handful of attempts.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Andres Freund
Hi,

On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote:
> > No. The only thing the machine is doing is
> > 
> >   while /usr/bin/true; do
> >     make check
> >   done
> > 
> > I can't reduce the workload further, because the "join" test is in a
> > separate parallel group (I cut down parallel_schedule). I could make the
> > machine busier, of course.
> > 
> > However, the other lockup I saw was when using serial_schedule, so I
> > guess lower concurrency makes it more likely.
> > 
> 
> FWIW the machine is now on run ~2700 without any further lockups :-/
> 
> Seems it was quite lucky we hit it twice in a handful of attempts.

Did you cut down the workload before you reproduced it the first time, or
after? It's quite possible that it's not reproducible in isolation.

Greetings,

Andres Freund



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 2/6/23 20:20, Andres Freund wrote:
> Hi,
> 
> On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote:
>>> No. The only thing the machine is doing is
>>>
>>>   while /usr/bin/true; do
>>>     make check
>>>   done
>>>
>>> I can't reduce the workload further, because the "join" test is in a
>>> separate parallel group (I cut down parallel_schedule). I could make the
>>> machine busier, of course.
>>>
>>> However, the other lockup I saw was when using serial_schedule, so I
>>> guess lower concurrency makes it more likely.
>>>
>>
>> FWIW the machine is now on run ~2700 without any further lockups :-/
>>
>> Seems it was quite lucky we hit it twice in a handful of attempts.
> 
> Did you cut down the workload before you reproduced it the first time, or
> after? It's quite possible that it's not reproducible in isolation.
> 

No, I left the workload as it was for the first lockup, so `make check`
runs everything as is up until the "join" test suite.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> No, I left the workload as it was for the first lockup, so `make check`
> runs everything as is up until the "join" test suite.

Wait, shouldn't that be join_hash?



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra
On 2/7/23 00:48, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> No, I left the workload as it was for the first lockup, so `make check`
>> runs everything as is up until the "join" test suite.
> 
> Wait, shouldn't that be join_hash?

No, because join_hash does not exist on 11 (it was added in 12). Also,
it actually locked up like this - that's the lockup I reported on 28/1.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 2/7/23 00:48, Thomas Munro wrote:
> > On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> No, I left the workload as it was for the first lockup, so `make check`
> >> runs everything as is up until the "join" test suite.
> >
> > Wait, shouldn't that be join_hash?
>
> No, because join_hash does not exist on 11 (it was added in 12). Also,
> it actually locked up like this - that's the lockup I reported on 28/1.

Oh, good.  I had been trying to repro with 12 here and forgot that you
were looking at 11...



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 2/7/23 01:09, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 2/7/23 00:48, Thomas Munro wrote:
>>> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> No, I left the workload as it was for the first lockup, so `make check`
>>>> runs everything as is up until the "join" test suite.
>>>
>>> Wait, shouldn't that be join_hash?
>>
>> No, because join_hash does not exist on 11 (it was added in 12). Also,
>> it actually locked up like this - that's the lockup I reported on 28/1.
> 
> Oh, good.  I had been trying to repro with 12 here and forgot that you
> were looking at 11...

FYI it happened again, on a regular run of regression tests (I gave up
on trying to reproduce this - after some initial hits I didn't hit it in
a couple of thousand tries so I just added the machine back to the buildfarm).

Anyway, same symptoms - lockup in join_hash on PG11, leader waiting on
WaitLatch and both workers waiting on BarrierArriveAndWait. I forgot
that running gdb on the second worker gets it unstuck, so I haven't
been able to collect more info.

What else do you think would be useful to collect next time?

It's hard to draw conclusions due to the low probability of the issue,
but it's pretty weird this only ever happened on 11 so far.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra
Hi,

I have another case of this on dikkop (on 11 again). Is there anything
else we'd want to try? Or maybe someone would want access to the machine
and do some investigation directly?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I have another case of this on dikkop (on 11 again). Is there anything
> else we'd want to try? Or maybe someone would want access to the machine
> and do some investigation directly?

Sounds interesting -- I'll ping you off-list.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Thu, Aug 31, 2023 at 2:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > I have another case of this on dikkop (on 11 again). Is there anything
> > else we'd want to try? Or maybe someone would want access to the machine
> > and do some investigation directly?

Hmm.  No conclusion tonight but I think it's weird.  We have:

bsd@generic:/mnt/data/buildfarm $ ps x -O wchan | grep 52663
52663 select  -  Is   0:07.40 postgres: bsd regression [local] SELECT (postgres)
52731 select  -  Is   0:00.09 postgres: parallel worker for PID 52663 (postgres)
52732 select  -  Is   0:00.06 postgres: parallel worker for PID 52663 (postgres)
81525 piperd  0  S+   0:00.01 grep 52663

wchan=select means sleeping in poll()/select().

bsd@generic:/mnt/data/buildfarm $ procstat -i 52732 | grep USR1
52732 postgres         USR1     P-C
bsd@generic:/mnt/data/buildfarm $ procstat -j 52732 | grep USR1
52732 100121 postgres         USR1     --

We have a signal that is pending and not blocked, so I don't
immediately know why poll() hasn't returned control.



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Hello Thomas,

31.08.2023 14:15, Thomas Munro wrote:

> We have a signal that is pending and not blocked, so I don't
> immediately know why poll() hasn't returned control.

When I worked at the Postgres Pro company, we observed a similar lockup
under rather specific conditions (we used an Elbrus CPU and the specific
Elbrus compiler (lcc), based on EDG).
I managed to reproduce that lockup and Anton Voloshin investigated it.
The issue was caused by the compiler optimization in WaitEventSetWait():
     waiting = true;
...
     while (returned_events == 0)
     {
...
         if (set->latch && set->latch->is_set)
         {
...
             break;
         }

In that case, the compiler decided that it could place the read
"set->latch->is_set" before the write "waiting = true".
(Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
issue for us.)
I can't provide more details for now, but maybe you could look at the binary
code generated on the target platform to confirm or reject my guess.
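
In terms of the fragment above, the fix amounted to this placement (sketch):

    waiting = true;
    pg_compiler_barrier();      /* keep the later read of set->latch->is_set
                                 * from being hoisted above this store */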

Best regards,
Alexander



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra

On 9/1/23 10:00, Alexander Lakhin wrote:
> Hello Thomas,
> 
> 31.08.2023 14:15, Thomas Munro wrote:
> 
>> We have a signal that is pending and not blocked, so I don't
>> immediately know why poll() hasn't returned control.
> 
> When I worked at the Postgres Pro company, we observed a similar lockup
> under rather specific conditions (we used an Elbrus CPU and the specific
> Elbrus compiler (lcc), based on EDG).
> I managed to reproduce that lockup and Anton Voloshin investigated it.
> The issue was caused by the compiler optimization in WaitEventSetWait():
>     waiting = true;
> ...
>     while (returned_events == 0)
>     {
> ...
>         if (set->latch && set->latch->is_set)
>         {
> ...
>             break;
>         }
> 
> In that case, the compiler decided that it could place the read
> "set->latch->is_set" before the write "waiting = true".
> (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
> issue for us.)
> I can't provide more details for now, but maybe you could look at the
> binary code generated on the target platform to confirm or reject my guess.
> 

Hmmm, I'm not very good at reading the binary code, but here's what
objdump produced for WaitEventSetWait. Maybe someone will see what the
issue is.

I thought about maybe just adding the barrier in the code, but then how
would we know it was the issue and that this fixed it? It happens so
rarely we can't draw any conclusions from a couple of runs of tests.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Hello Tomas,

01.09.2023 16:00, Tomas Vondra wrote:
> Hmmm, I'm not very good at reading the binary code, but here's what
> objdump produced for WaitEventSetWait. Maybe someone will see what the
> issue is.

At first glance, I can't see anything suspicious in the disassembly.
IIUC, waiting = true is represented there as:
   805c38: b902ad18      str     w24, [x8, #684] // pgstat_report_wait_start(): proc->wait_event_info = wait_event_info;
// end of pgstat_report_wait_start(wait_event_info);

   805c3c: b0ffdb09      adrp    x9, 0x366000 <dsm_segment_address+0x24>
   805c40: b0ffdb0a      adrp    x10, 0x366000 <dsm_segment_address+0x28>
   805c44: f0000eeb      adrp    x11, 0x9e4000 <PMSignalShmemInit+0x4>

   805c48: 52800028      mov     w8, #1 // true
   805c4c: 52800319      mov     w25, #24
   805c50: 5280073a      mov     w26, #57
   805c54: fd446128      ldr     d8, [x9, #2240]
   805c58: 90000d7b      adrp    x27, 0x9b1000 <ModifyWaitEvent+0xb0>
   805c5c: fd415949      ldr     d9, [x10, #688]
   805c60: f9071d68      str     x8, [x11, #3640] // waiting = true (x8 = w8)
So there are two simple mov's and two load operations performed in parallel,
but I don't think it's similar to what we had in that case.

> I thought about maybe just adding the barrier in the code, but then how
> would we know it's the issue and this fixed it? It happens so rarely we
> can't make any conclusions from a couple runs of tests.

Probably I could construct a reproducer for the lockup if I had access
to such a machine for a day or two.

Best regards,
Alexander



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Robert Haas
On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
> issue for us.)

Maybe it'd be worth trying something stronger, like
pg_memory_barrier(). A compiler barrier doesn't prevent the CPU from
reordering loads and stores as it goes, and ARM64 has weak memory
ordering.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Hello Robert,

01.09.2023 23:21, Robert Haas wrote:
> On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin<exclusion@gmail.com>  wrote:
>> (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
>> issue for us.)
> Maybe it'd be worth trying something stronger, like
> pg_memory_barrier(). A compiler barrier doesn't prevent the CPU from
> reordering loads and stores as it goes, and ARM64 has weak memory
> ordering.

Indeed, thank you for the tip!
So maybe here we are dealing not with a compiler optimization but with a CPU one.
The wider code fragment is:
   805c48: 52800028      mov     w8, #1 // true
   805c4c: 52800319      mov     w25, #24
   805c50: 5280073a      mov     w26, #57
   805c54: fd446128      ldr     d8, [x9, #2240]
   805c58: 90000d7b      adrp    x27, 0x9b1000 <ModifyWaitEvent+0xb0>
   805c5c: fd415949      ldr     d9, [x10, #688]
   805c60: f9071d68      str     x8, [x11, #3640] // waiting = true (x8 = w8)
   805c64: f90003f3      str     x19, [sp]
   805c68: 14000010      b       0x805ca8 <WaitEventSetWait+0x108>

   805ca8: f9400a88      ldr     x8, [x20, #16] // if (set->latch && set->latch->is_set)
   805cac: b4000068      cbz     x8, 0x805cb8 <WaitEventSetWait+0x118>
   805cb0: f9400108      ldr     x8, [x8]
   805cb4: b5001248      cbnz    x8, 0x805efc <WaitEventSetWait+0x35c>
   805cb8: f9401280      ldr     x0, [x20, #32]

If the CPU can delay the write to the variable waiting
(str x8, [x11, #3640]), held in its internal form like
"store 1 to [address]", until 805cb0 or a later instruction, then we can
get the behavior discussed. Something like that is shown in the ARM
documentation:
https://developer.arm.com/documentation/102336/0100/Memory-ordering?lang=en
I'll try to test this guess on the target machine...

Best regards,
Alexander



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
I agree that the code lacks barriers.  I haven't been able to figure
out how any reordering could cause this hang, though, because in these
old branches procsignal_sigusr1_handler is used for latch wakeups, and
it also calls SetLatch(MyLatch) itself, right at the end.  That is,
SetLatch() gets called twice, first in the waker process and then
again in the awoken process, so it should be impossible for the latter
not to see MyLatch->is_set == true after procsignal_sigusr1_handler
completes.
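
To spell out the shape being described, an abbreviated sketch along the
lines of procsignal.c in those branches (the real handler checks more
PROCSIG_* reasons):

static void
procsignal_sigusr1_handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    if (CheckProcSignal(PROCSIG_PARALLEL_MESSAGE))
        HandleParallelMessageInterrupt();

    /* ... other PROCSIG_* reasons elided ... */

    /* second SetLatch(), in the awoken process itself */
    SetLatch(MyLatch);

    errno = save_errno;
}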

That made me think the handler didn't run, which is consistent with
procstat -i showing it as pending ('P').  Which made me start to
suspect a kernel bug, unless we can explain what we did to block it...

But... perhaps I am confused about that and did something wrong when
looking into it.  It's hard to investigate when you aren't allowed to
take core files or connect a debugger (both will reliably trigger
EINTR).



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Hello,

03.09.2023 00:00, Alexander Lakhin wrote:
> I'll try to test this guess on the target machine...
>

I got access to dikkop thanks to Tomas Vondra, and started reproducing the
issue. It was rather difficult to catch the lockup, as Tomas and Tom
noticed before. I tried to use stress-ng to affect reproduction somehow
and gradually managed to reduce the join_hash test to just a single
query (if I were venturous enough, I could have reduced the test to it in
just one step, but I chose to move little by little). In the end I
constructed a reproducing script that could catch the lockup on that ARM
machine within an hour on average.
Today I've booted FreeBSD-14.0-ALPHA4 on my secondary AMD (Ryzen 7 3700X)
machine and managed to reproduce the issue with that script (see
attachment). (I run the script just after `gmake && gmake check`.)
To reproduce it faster on my machine, I run it on tmpfs and use parallel:
for ((i=1;i<=4;i++)); do echo "ITERATION $i"; parallel --halt now,fail=1 -j8 --linebuffer --tag time .../repro+.sh {} ::: 01 02 03 04 || break; done

It produces the lockup as follows:
ITERATION 1
...
02      waiting for server to shut down.... done
02      server stopped
02
02      real    3m2.420s
02      user    0m2.896s
02      sys     0m1.685s
04      TIMEOUT on iteration 448
04
04      real    3m16.212s
04      user    0m2.322s
04      sys     0m1.904s
...
psql -p 15435 regression -c "SELECT * FROM pg_stat_activity;"
   pid  | leader_pid | wait_event_type |     wait_event     | state  |  backend_type   | query
  53696 |            | IPC             | HashBuildHashOuter | active | client backend  | explain (analyze)  select count(*) from simple r join simple s using (id);
  53894 |      53696 | IPC             | HashBuildHashOuter | active | parallel worker | explain (analyze)  select count(*) from simple r join simple s using (id);
  53896 |      53696 | IPC             | HashBuildHashInner | active | parallel worker | explain (analyze)  select count(*) from simple r join simple s using (id);
(other columns trimmed for readability)

procstat -i 53896
53896 postgres         URG      P-C

tail server04.log
2023-09-08 18:44:27.777 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from simple r join simple s using (id);
2023-09-08 18:44:27.786 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from simple r join simple s using (id);
2023-09-08 18:44:27.795 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from simple r join simple s using (id);
2023-09-08 18:45:38.685 UTC|[unknown]|[unknown]|66915|64fb6bd2.10563|LOG:  connection received: host=[local]
2023-09-08 18:45:38.685 UTC|user|regression|66915|64fb6bd2.10563|LOG:  connection authorized: user=user database=regression application_name=psql
2023-09-08 18:45:38.686 UTC|user|regression|66915|64fb6bd2.10563|LOG:  statement: SELECT * FROM pg_stat_activity;

It takes less than 10 minutes on average for me. I checked
REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE
forcefully undefined) — they are all affected.
I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
undefined manually). And surprisingly to me, I could not reproduce it on
master and REL_16_STABLE.
`git bisect` for this behavior change pointed at 7389aad63 (though maybe it
just greatly decreased the probability of the failure; I'm going to
double-check this).
In particular, that commit changed this:
-    /*
-     * Ignore SIGURG for now.  Child processes may change this (see
-     * InitializeLatchSupport), but they will not receive any such signals
-     * until they wait on a latch.
-     */
-    pqsignal_pm(SIGURG, SIG_IGN);   /* ignored */
-#endif
+    /* This may configure SIGURG, depending on platform. */
+    InitializeLatchSupport();
+    InitProcessLocalLatch();

With debug logging added I see (on 7389aad63~1) that one process
really does send SIGURG to another, and the latter reaches poll(), but it
just gets no signal; its signal handler is not called and poll() just waits...

So it looks like the ARM weak memory model is not the root cause of the
issue. But as far as I can see, it's still specific to FreeBSD (but not
specific to a compiler — I used gcc and clang with the same result).

Best regards,
Alexander
Attachments

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> It takes less than 10 minutes on average for me. I checked
> REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined
> forcefully) — they all are affected.
> I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
> undefined manually). And surprisingly for me, I could not reproduce it on
> master and REL_16_STABLE.
> `git bisect` for this behavior change pointed at 7389aad63 (though maybe it
> just greatly decreased the probability of the failure; I'm going to
> double-check this).
> this).
> In particular, that commit changed this:
> -    /*
> -     * Ignore SIGURG for now.  Child processes may change this (see
> -     * InitializeLatchSupport), but they will not receive any such signals
> -     * until they wait on a latch.
> -     */
> -    pqsignal_pm(SIGURG, SIG_IGN);   /* ignored */
> -#endif
> +    /* This may configure SIGURG, depending on platform. */
> +    InitializeLatchSupport();
> +    InitProcessLocalLatch();
>
> With debug logging added I see (on 7389aad63~1) that one process
> really does send SIGURG to another, and the latter reaches poll(), but it
> just gets no signal; its signal handler is not called and poll() just waits...

Thanks for working so hard on this Alexander.  That is a surprising
discovery!  So changes to the signal handler arrangements in the
*postmaster* before the child was forked affected this?

> So it looks like the ARM weak memory model is not the root cause of the
> issue. But as far as I can see, it's still specific to FreeBSD (but not
> specific to a compiler — I used gcc and clang with the same result).

Idea:  FreeBSD 13 introduced a new mechanism called sigfastblock[1],
which lets system libraries control signal blocking with atomic memory
tricks in a word of user space memory.  I have no particular theory
for why it would be going wrong here (I don't expect us to be using
any of the stuff that would use it, though I don't understand it in
detail so that doesn't say much), but it occurred to me that all
reports so far have been on 13.x or 14.  I wonder...  If you have a
good fast recipe for reproducing this, could you also try it on
FreeBSD 12.4?

[1] https://man.freebsd.org/cgi/man.cgi?query=sigfastblock&sektion=2&manpath=FreeBSD+13.0-current



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin
Hi Thomas,

08.09.2023 22:39, Thomas Munro wrote:
>> With debug logging added I see (on 7389aad63~1) that one process
>> really does send SIGURG to another, and the latter reaches poll(), but it
>> just gets no signal; its signal handler is not called and poll() just waits...
> Thanks for working so hard on this Alexander.  That is a surprising
> discovery!  So changes to the signal handler arrangements in the
> *postmaster* before the child was forked affected this?

Yes, I think we are dealing with something like that. I can try to deduce a
minimal change that affects reproducing the issue, but maybe it's not that
important.
Perhaps we should now think about escalating the problem to the FreeBSD
developers? I wonder what kind of reproducer they would find acceptable: a
standalone C program only, or would a script that compiles/installs postgres
and runs our test do too?

>> So it looks like the ARM weak memory model is not the root cause of the
>> issue. But as far as I can see, it's still specific to FreeBSD (but not
>> specific to a compiler — I used gcc and clang with the same result).
> Idea:  FreeBSD 13 introduced a new mechanism called sigfastblock[1],
> which lets system libraries control signal blocking with atomic memory
> tricks in a word of user space memory.  I have no particular theory
> for why it would be going wrong here (I don't expect us to be using
> any of the stuff that would use it, though I don't understand it in
> detail so that doesn't say much), but it occurred to me that all
> reports so far have been on 13.x or 14.  I wonder...  If you have a
> good fast recipe for reproducing this, could you also try it on
> FreeBSD 12.4?

That was a lucky guess! I checked the reproduction on
FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212
and got the same results as on FreeBSD 14:
REL_12_STABLE - failed on iteration 3
REL_15_STABLE - failed on iteration 1
REL_16_STABLE - 10 iterations with no failure

But on FreeBSD 12.4-RELEASE r372781:
REL_12_STABLE - 20 iterations with no failure
REL_15_STABLE - 20 iterations with no failure

BTW, I also retested 7389aad63 on FreeBSD 14 and got no failure for 100
iterations.

Best regards,
Alexander



Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro
On Sat, Sep 9, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Yes, I think we deal with something like that. I can try to deduce a minimum
> change that affects reproducing the issue, but may be it's not that important.
> Perhaps we now should think of escalating the problem to FreeBSD developers?
> I wonder, what kind of reproducer they find acceptable. A standalone C
> program only or maybe a script that compiles/installs postgres and runs
> our test will do too?

We discussed this a bit off-list and I am following up on that.  My
guess is that this will turn out to be a bad interaction between that
optimisation and our (former) habit of forking background workers from
inside a signal handler, but let's see...

FTR, if someone is annoyed by this and just wants their buildfarm
animal not to hang on REL_12_STABLE: via Alexander's later experiments
we learned that setting sysctl kern.sigfastblock_fetch_always=1 fixes
the problem.