Обсуждение: Assertion failure with barriers in parallel hash join

Поиск
Список
Период
Сортировка

Assertion failure with barriers in parallel hash join

От
Michael Paquier
Дата:
Hi all,

prion, that uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE,
has just failed with an interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-09-29%2005%3A24%3A11

The assertion failure happens in a parallel worker when attempting to
attach a barrier:
#2  0x00000000009027d2 in ExceptionalCondition
(conditionName=conditionName@entry=0xa80846 "!barrier->static_party",
errorType=errorType@entry=0x955e22 "FailedAssertion",
fileName=fileName@entry=0xa807a8
"/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/barrier.c",
lineNumber=lineNumber@entry=218) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/utils/error/assert.c:67
#3  0x00000000007b6b1f in BarrierAttach
(barrier=barrier@entry=0x7f73c9d76008) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/barrier.c:218
#4  0x0000000000682ebf in ExecParallelHashJoinNewBatch
(hjstate=<optimized out>, hjstate=<optimized out>) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/executor/nodeHashjoin.c:1132
#5  ExecHashJoinImpl (parallel=true, pstate=0x1248d88) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/executor/nodeHashjoin.c:560
#6  ExecParallelHashJoin (pstate=0x1248d88) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/executor/nodeHashjoin.c:607
#7  0x0000000000666d48 in ExecProcNodeInstr (node=0x1248d88) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/executor/execProcnode.c:466
#8  0x000000000065f322 in ExecProcNode (node=0x1248d88) at
/home/ec2-user/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/include/executor/executor.h:245

Thanks,
--
Michael

Вложения

Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> #2  0x00000000009027d2 in ExceptionalCondition
> (conditionName=conditionName@entry=0xa80846 "!barrier->static_party",

> #4  0x0000000000682ebf in ExecParallelHashJoinNewBatch

Thanks.  Ohhh.  I think I see how that condition was reached and what
to do about it, but I'll need to look more closely.  I'm away on
vacation right now, and will update in a couple of days when I'm back
at a real computer.



Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> > #2  0x00000000009027d2 in ExceptionalCondition
> > (conditionName=conditionName@entry=0xa80846 "!barrier->static_party",
>
> > #4  0x0000000000682ebf in ExecParallelHashJoinNewBatch
>
> Thanks.  Ohhh.  I think I see how that condition was reached and what
> to do about it, but I'll need to look more closely.  I'm away on
> vacation right now, and will update in a couple of days when I'm back
> at a real computer.

Here's a throw-away patch to add some sleeps that trigger the problem,
and a first draft fix.  I'll do some more testing of this next week
and see if I can simplify it.

Вложения

Re: Assertion failure with barriers in parallel hash join

От
Melanie Plageman
Дата:

On Thu, Oct 1, 2020 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> > #2  0x00000000009027d2 in ExceptionalCondition
> > (conditionName=conditionName@entry=0xa80846 "!barrier->static_party",
>
> > #4  0x0000000000682ebf in ExecParallelHashJoinNewBatch
>
> Thanks.  Ohhh.  I think I see how that condition was reached and what
> to do about it, but I'll need to look more closely.  I'm away on
> vacation right now, and will update in a couple of days when I'm back
> at a real computer.

Here's a throw-away patch to add some sleeps that trigger the problem,
and a first draft fix.  I'll do some more testing of this next week
and see if I can simplify it.

I was just taking a look at the patch and noticed the commit message
says:

> With unlucky timing and parallel_leader_participation off...

Is parallel_leader_participation being off required to reproduce the
issue?

Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
On Tue, Oct 13, 2020 at 12:15 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, Oct 1, 2020 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's a throw-away patch to add some sleeps that trigger the problem,
>> and a first draft fix.  I'll do some more testing of this next week
>> and see if I can simplify it.
>
> I was just taking a look at the patch and noticed the commit message
> says:
>
> > With unlucky timing and parallel_leader_participation off...
>
> Is parallel_leader_participation being off required to reproduce the
> issue?

Yeah, because otherwise the leader detaches last so the problem doesn't arise.



Re: Assertion failure with barriers in parallel hash join

От
Melanie Plageman
Дата:
On Wed, Mar 17, 2021 at 8:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > According to BF animal elver there is something wrong with this
> > commit.  Looking into it.
>
> Assertion failure reproduced here and understood, but unfortunately
> it'll take some more time to fix this.  I've reverted the commit for
> now to unbreak the ~5 machines that are currently showing red in the
> build farm.

So, I think the premise of the patch
v6-0001-Fix-race-condition-in-parallel-hash-join-batch-cl.patch makes
sense: freeing the hashtable memory should happen atomically with
updating the flag that indicates to all others that the memory is not to
be accessed any longer.

As was likely reported by the buildfarm leading to you reverting the
patch, when the inner side is empty and we dump out before advancing the
build barrier past PHJ_BUILD_HASHING_OUTER, we trip the new Asserts
you've added in ExecHashTableDetach().

Assert(!pstate ||
            BarrierPhase(&pstate->build_barrier) >= PHJ_BUILD_RUNNING);

Hmm.

Maybe if the inner side is empty, we can advance the build barrier to
the end?
We help batch 0 along like this in ExecParallelHashJoinSetUpBatches().

But, I'm not sure we can expect the process executing this code to be
attached to the build barrier, can we?

@@ -296,7 +304,19 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
                                 * outer relation.
                                 */
                                if (hashtable->totalTuples == 0 &&
!HJ_FILL_OUTER(node))
+                               {
+                                       if (parallel)
+                                       {
+                                               Barrier    *build_barrier;
+
+                                               build_barrier =
¶llel_state->build_barrier;
+                                               while
(BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
+
BarrierArriveAndWait(build_barrier, 0);
+                                               BarrierDetach(build_barrier);
+                                       }
+
                                        return NULL;
+                               }



Re: Assertion failure with barriers in parallel hash join

От
Melanie Plageman
Дата:
On Wed, Mar 17, 2021 at 8:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > According to BF animal elver there is something wrong with this
> > commit.  Looking into it.
>
> Assertion failure reproduced here and understood, but unfortunately
> it'll take some more time to fix this.  I've reverted the commit for
> now to unbreak the ~5 machines that are currently showing red in the
> build farm.

Also, silly question: why couldn't we just set the pstate->batches pointer to
InvalidDsaPointer before doing dsa_free() (of course saving the pointer
so that we can actually do the freeing with it)? Is there still a race?



Re: Assertion failure with barriers in parallel hash join

От
Melanie Plageman
Дата:
On Wed, Mar 31, 2021 at 6:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Mar 17, 2021 at 8:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > According to BF animal elver there is something wrong with this
> > > commit.  Looking into it.
> >
> > Assertion failure reproduced here and understood, but unfortunately
> > it'll take some more time to fix this.  I've reverted the commit for
> > now to unbreak the ~5 machines that are currently showing red in the
> > build farm.
>
> So, I think the premise of the patch
> v6-0001-Fix-race-condition-in-parallel-hash-join-batch-cl.patch makes
> sense: freeing the hashtable memory should happen atomically with
> updating the flag that indicates to all others that the memory is not to
> be accessed any longer.
>
> As was likely reported by the buildfarm leading to you reverting the
> patch, when the inner side is empty and we dump out before advancing the
> build barrier past PHJ_BUILD_HASHING_OUTER, we trip the new Asserts
> you've added in ExecHashTableDetach().
>
> Assert(!pstate ||
>             BarrierPhase(&pstate->build_barrier) >= PHJ_BUILD_RUNNING);
>
> Hmm.
>
> Maybe if the inner side is empty, we can advance the build barrier to
> the end?
> We help batch 0 along like this in ExecParallelHashJoinSetUpBatches().
>
> But, I'm not sure we can expect the process executing this code to be
> attached to the build barrier, can we?
>
> @@ -296,7 +304,19 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
>                                  * outer relation.
>                                  */
>                                 if (hashtable->totalTuples == 0 &&
> !HJ_FILL_OUTER(node))
> +                               {
> +                                       if (parallel)
> +                                       {
> +                                               Barrier    *build_barrier;
> +
> +                                               build_barrier =
> ¶llel_state->build_barrier;
> +                                               while
> (BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
> +
> BarrierArriveAndWait(build_barrier, 0);
> +                                               BarrierDetach(build_barrier);
> +                                       }
> +
>                                         return NULL;
> +                               }

I've attached a new version of the patch which contains my suggested fix
for the problem with the empty inner optimization.

If you apply Thomas' injected sleeps original bug repro patch and use
the following DDL and query, you can reproduce the issue with the empty
inner optimization present in his v2 patch. Then, if you apply my
attached v3 patch, you can see that we no longer trip the assert.

drop table if exists empty_simple;
create table empty_simple(id int, col2 text);
update pg_class set reltuples = 10000 where relname = 'empty_simple';

drop table if exists simple;
create table simple as
  select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
analyze simple;

set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set enable_hashjoin = on;
set parallel_leader_participation to off;
set work_mem = '4MB';
set enable_parallel_hash = on;
select count(*) from empty_simple join simple using (id);

- Melanie

Вложения

Re: Assertion failure with barriers in parallel hash join

От
David Geier
Дата:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi all,

We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps, which
reliablytrigger it. Interestingly, the bug does not only manifest as failing assertion, but also as segmentation fault;
inbuilds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the
proposedpatch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion nor the
segmentationfault still occur.
 

I have also glanced at the code and the implementation looks fine. However, I'm not an expert for the fairly involved
hashjoin state machine.
 
There seems to be no need for additional documentation.

For completeness here is the stack trace of the segmentation fault.
Like the stack trace from the assertion failure initially shared by Michael and also encountered by us, the stack trace
ofthe segmentation fault also contains ExecParallelHashJoinNewBatch().
 

#9  | Source "/opt/src/backend/executor/execMain.c", line 361, in standard_ExecutorRun
    | Source "/opt/src/backend/executor/execMain.c", line 1551, in ExecutePlan
      Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x657e4d]
#8  | Source "/opt/src/backend/executor/nodeAgg.c", line 2179, in ExecAgg
      Source "/opt/src/backend/executor/nodeAgg.c", line 2364, in agg_retrieve_direct [0x66ba60]
#7  | Source "/opt/src/backend/executor/nodeAgg.c", line 581, in fetch_input_tuple
      Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x66d585]
#6  | Source "/opt/src/backend/executor/nodeHashjoin.c", line 607, in ExecParallelHashJoin
    | Source "/opt/src/backend/executor/nodeHashjoin.c", line 560, in ExecHashJoinImpl
      Source "/opt/src/backend/executor/nodeHashjoin.c", line 1132, in ExecParallelHashJoinNewBatch [0x67a89b]
#5  | Source "/opt/src/backend/storage/ipc/barrier.c", line 242, in BarrierAttach
      Source "/opt/src/include/storage/s_lock.h", line 228, in tas [0x7c2a1b]
#4    Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7f4db364841f, in __funlockfile

--
David Geier
(SericeNow)

The new status of this patch is: Ready for Committer

Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
On Thu, Jun 2, 2022 at 9:31 PM David Geier <geidav.pg@gmail.com> wrote:
> We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps, which
reliablytrigger it. Interestingly, the bug does not only manifest as failing assertion, but also as segmentation fault;
inbuilds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the
proposedpatch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion nor the
segmentationfault still occur. 

Thanks for the report, testing, and for creating the CF entry.

I assume you are using parallel_leader_participation=off, and the
reason we haven't heard more about this is because few people do that.
By coincidence I was just about to restart a bunch of hash join
projects and have been paging the topic area back into my brain, so
I'll start with another round of testing/analysis of this bug/patch
next week.



Re: Assertion failure with barriers in parallel hash join

От
David Geier
Дата:
Hi Thomas,

Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that.
That would be great. Anything else I can help with to get this patch in?

Thanks!

--
David
(ServiceNow)

On Fri, 3 Jun 2022 at 00:06, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jun 2, 2022 at 9:31 PM David Geier <geidav.pg@gmail.com> wrote:
> We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps, which reliably trigger it. Interestingly, the bug does not only manifest as failing assertion, but also as segmentation fault; in builds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion nor the segmentation fault still occur.

Thanks for the report, testing, and for creating the CF entry.

I assume you are using parallel_leader_participation=off, and the
reason we haven't heard more about this is because few people do that.
By coincidence I was just about to restart a bunch of hash join
projects and have been paging the topic area back into my brain, so
I'll start with another round of testing/analysis of this bug/patch
next week.

Re: Assertion failure with barriers in parallel hash join

От
David Geier
Дата:

Hi Thomas,

Can we make progress with this patch in the current commit fest, or discuss what is still missing to bring this in?

Thanks!

--
David Geier
(ServiceNow)

On 6/6/22 17:01, David Geier wrote:
Hi Thomas,

Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that.
That would be great. Anything else I can help with to get this patch in?

Thanks!

--
David
(ServiceNow)

On Fri, 3 Jun 2022 at 00:06, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jun 2, 2022 at 9:31 PM David Geier <geidav.pg@gmail.com> wrote:
> We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps, which reliably trigger it. Interestingly, the bug does not only manifest as failing assertion, but also as segmentation fault; in builds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion nor the segmentation fault still occur.

Thanks for the report, testing, and for creating the CF entry.

I assume you are using parallel_leader_participation=off, and the
reason we haven't heard more about this is because few people do that.
By coincidence I was just about to restart a bunch of hash join
projects and have been paging the topic area back into my brain, so
I'll start with another round of testing/analysis of this bug/patch
next week.

Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
On Thu, Nov 17, 2022 at 8:01 PM David Geier <geidav.pg@gmail.com> wrote:
> Can we make progress with this patch in the current commit fest, or discuss what is still missing to bring this in?

Hi David,
Sorry for the delay.  I'll aim to get this done in the next few days.



Re: Assertion failure with barriers in parallel hash join

От
David Geier
Дата:
Thanks!

Please let me know if I can help out, e.g. with re-testing.

--
David Geier
(ServiceNow)

On 11/17/22 08:28, Thomas Munro wrote:
> On Thu, Nov 17, 2022 at 8:01 PM David Geier <geidav.pg@gmail.com> wrote:
>> Can we make progress with this patch in the current commit fest, or discuss what is still missing to bring this in?
> Hi David,
> Sorry for the delay.  I'll aim to get this done in the next few days.



Re: Assertion failure with barriers in parallel hash join

От
Thomas Munro
Дата:
Pushed and back-patched, with minor comment tweaks.  Apologies for
taking so long.