Discussion: [sqlsmith] Parallel worker crash on seqscan

[sqlsmith] Parallel worker crash on seqscan

From
Andreas Seltenreich
Date:
Hi,

the following query appears to reliably crash parallel workers on master
as of 0832f2d.

--8<---------------cut here---------------start------------->8---
set max_parallel_workers_per_gather to 2;
set force_parallel_mode to 1;

select subq.context from pg_settings,
    lateral (select context from pg_opclass limit 1) as subq
limit 1;
--8<---------------cut here---------------end--------------->8---

Backtrace of a worker below.

regards,
Andreas

Core was generated by `postgres: bgworker: parallel worker for PID 27448    '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MakeExpandedObjectReadOnlyInternal (d=0) at expandeddatum.c:100
100        if (!VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(d)))
(gdb) bt
#0  MakeExpandedObjectReadOnlyInternal (d=0) at expandeddatum.c:100
#1  0x0000563b0c9a4e98 in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffdd20400ac, itemIsDone=0x563b0e6a8b50,
    isnull=0x563b0e6a8ae0 "", values=0x563b0e6a8ac0, econtext=0x563b0e6a7db8, targetlist=0x563b0e6a8498) at execQual.c:5491
#2  ExecProject (projInfo=projInfo@entry=0x563b0e6a8368, isDone=isDone@entry=0x7ffdd20400ac) at execQual.c:5710
#3  0x0000563b0c9a514f in ExecScan (node=node@entry=0x563b0e6a8038, accessMtd=accessMtd@entry=0x563b0c9bc910 <SeqNext>,
    recheckMtd=recheckMtd@entry=0x563b0c9bc900 <SeqRecheck>) at execScan.c:220
#4  0x0000563b0c9bc9c3 in ExecSeqScan (node=node@entry=0x563b0e6a8038) at nodeSeqscan.c:127
#5  0x0000563b0c99d6e8 in ExecProcNode (node=node@entry=0x563b0e6a8038) at execProcnode.c:419
#6  0x0000563b0c9995be in ExecutePlan (dest=0x563b0e67da40, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x563b0e6a8038,
    estate=0x563b0e6a77f8) at execMain.c:1567
#7  standard_ExecutorRun (queryDesc=0x563b0e6a2828, direction=<optimized out>, count=0) at execMain.c:338
#8  0x0000563b0c99c911 in ParallelQueryMain (seg=<optimized out>, toc=0x7f55173aa000) at execParallel.c:745
#9  0x0000563b0c898b02 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1108
#10 0x0000563b0ca49cad in StartBackgroundWorker () at bgworker.c:726
#11 0x0000563b0ca55770 in do_start_bgworker (rw=0x563b0e621080) at postmaster.c:5535
#12 maybe_start_bgworker () at postmaster.c:5710
#13 0x0000563b0ca56238 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:4959
#14 <signal handler called>
#15 0x00007f5516788293 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#16 0x0000563b0c818249 in ServerLoop () at postmaster.c:1665
#17 0x0000563b0ca577e2 in PostmasterMain (argc=3, argv=0x563b0e5fa490) at postmaster.c:1309
#18 0x0000563b0c819f6d in main (argc=3, argv=0x563b0e5fa490) at main.c:228



Re: [sqlsmith] Parallel worker crash on seqscan

From
Ashutosh Sharma
Date:
Hi,

> the following query appears to reliably crash parallel workers on master
> as of 0832f2d.
>
> --8<---------------cut here---------------start------------->8---
> set max_parallel_workers_per_gather to 2;
> set force_parallel_mode to 1;
>
> select subq.context from pg_settings,
>     lateral (select context from pg_opclass limit 1) as subq
> limit 1;
> --8<---------------cut here---------------end--------------->8---

As suggested, I have tried to reproduce this issue on *0832f2d* commit but
could not reproduce it. I also tried it on the latest commit in master
branch and could not reproduce here as well. Amit (included in this email
thread) has also tried it once and he was also not able to reproduce it.
Could you please let me know if there is something more that needs to be
done in order to reproduce it other than what you have shared above?
Thanks.

With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com

Re: [sqlsmith] Parallel worker crash on seqscan

From
Robert Haas
Date:
On Sun, Nov 20, 2016 at 8:53 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> the following query appears to reliably crash parallel workers on master
> as of 0832f2d.
>
> --8<---------------cut here---------------start------------->8---
> set max_parallel_workers_per_gather to 2;
> set force_parallel_mode to 1;
>
> select subq.context from pg_settings,
>     lateral (select context from pg_opclass limit 1) as subq
> limit 1;
> --8<---------------cut here---------------end--------------->8---
>
> Backtrace of a worker below.

Based on the backtrace, I wonder if this might be a regression
introduced by Tom's recent commit
9a00f03e479c2d4911eed5b4bd1994315d409938, "Improve speed of aggregates
that use array_append as transition function.", which adds additional
cases where expanded datums can be used.  In theory, a value passed
from the leader to the workers ought to have gone through
datumSerialize() which contains code to flatten expanded
representations, but it's possible that code is broken, or maybe the
problematic code path is something else altogether.  I'm just
suspicious about the fact that the failure is in
MakeExpandedObjectReadOnlyInternal().

Then again, that might just be a coincidence. The other thing that's
weird here is that the Datum being passed is apparently a NULL
pointer, which neither MakeExpandedObjectReadOnly() nor
MakeExpandedObjectReadOnlyInternal() are prepared to deal with.  I
don't know whether it's wrong for a NULL pointer to occur here in the
first place or whether it's wrong for those functions not to be able
to deal with it if it does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Then again, that might just be a coincidence. The other thing that's
> weird here is that the Datum being passed is apparently a NULL
> pointer, which neither MakeExpandedObjectReadOnly() nor
> MakeExpandedObjectReadOnlyInternal() are prepared to deal with.  I
> don't know whether it's wrong for a NULL pointer to occur here in the
> first place or whether it's wrong for those functions not to be able
> to deal with it if it does.

The former.  MakeExpandedObjectReadOnly does contain a null check,
so something has passed it a zero Datum along with a claim that the
Datum is not null.

Like Ashutosh, I can't reproduce the crash, so it's hard to speculate much
further.  I am wondering if 13671b4b2 somehow fixed it, though I don't see
how.
        regards, tom lane
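
Editor's note: the distinction Tom draws above (the guarded entry point checks the null flag, the internal routine does not) can be illustrated with a toy model. This is only a sketch and not the PostgreSQL source: the `Datum` typedef, `ToyVarlena`, and every `toy_` function are made up for illustration. It assumes that varlena types are marked by `typlen == -1`, and it shows how a zero value paired with a "not null" claim slips past the guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* toy stand-in, not PostgreSQL's definition */

/* Toy varlena header; one byte plays the role of the flag byte. */
typedef struct ToyVarlena
{
    uint8_t header;             /* 1 means "read-write expanded object" here */
} ToyVarlena;

/*
 * Unguarded routine (hypothetical): dereferences d unconditionally.
 * The caller must guarantee that d is a valid pointer.
 */
static bool
toy_is_read_write(Datum d)
{
    const ToyVarlena *p = (const ToyVarlena *) d;

    assert(p != NULL);          /* the precondition violated in the backtrace */
    return p->header == 1;
}

/*
 * Guarded entry point (hypothetical): only inspects d when the value is
 * claimed to be non-null and the type is varlena (typlen == -1).
 */
static bool
toy_guarded_is_read_write(Datum d, bool isnull, int typlen)
{
    if (isnull || typlen != -1)
        return false;
    return toy_is_read_write(d);
}

/*
 * Detects the inconsistent pair from the crash: a zero Datum together
 * with a claim that the value is not null.  The guard above cannot
 * catch this case, because it trusts the isnull flag.
 */
static bool
toy_pair_is_inconsistent(Datum d, bool isnull, int typlen)
{
    return !isnull && typlen == -1 && d == 0;
}
```

In this model, `toy_guarded_is_read_write(0, true, -1)` returns safely, whereas a call with `d = 0` and `isnull = false` reaches the dereference. That matches the reading above: the fault lies with whatever produced the inconsistent pair, not with the guard.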



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
I wrote:
> Like Ashutosh, I can't reproduce the crash, so it's hard to speculate much
> further.

Ah-hah: now I can.  The recipe lacks these important steps:

set parallel_setup_cost TO 0;
set parallel_tuple_cost TO 0;

That changes the plan to

 Limit  (cost=0.00..0.06 rows=1 width=64)
   ->  Nested Loop  (cost=0.00..57.25 rows=1000 width=64)
         ->  Function Scan on pg_show_all_settings a  (cost=0.00..10.00 rows=1000 width=64)
         ->  Limit  (cost=0.00..0.03 rows=1 width=32)
               ->  Gather  (cost=0.00..3.54 rows=130 width=32)
                     Workers Planned: 2
                     ->  Parallel Seq Scan on pg_opclass  (cost=0.00..3.54 rows=54 width=32)

so what we've got is a case where a parameter computed by the FunctionScan
(in the master) would need to be passed into the parallel workers at
runtime.  Do we have code for that at all?  If so where is it?
        regards, tom lane



Re: [sqlsmith] Parallel worker crash on seqscan

From
Robert Haas
Date:
On Mon, Nov 21, 2016 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Like Ashutosh, I can't reproduce the crash, so it's hard to speculate much
>> further.
>
> Ah-hah: now I can.  The recipe lacks these important steps:
>
> set parallel_setup_cost TO 0;
> set parallel_tuple_cost TO 0;
>
> That changes the plan to
>
>  Limit  (cost=0.00..0.06 rows=1 width=64)
>    ->  Nested Loop  (cost=0.00..57.25 rows=1000 width=64)
>          ->  Function Scan on pg_show_all_settings a  (cost=0.00..10.00 rows=1000 width=64)
>          ->  Limit  (cost=0.00..0.03 rows=1 width=32)
>                ->  Gather  (cost=0.00..3.54 rows=130 width=32)
>                      Workers Planned: 2
>                      ->  Parallel Seq Scan on pg_opclass  (cost=0.00..3.54 rows=54 width=32)
>
> so what we've got is a case where a parameter computed by the FunctionScan
> (in the master) would need to be passed into the parallel workers at
> runtime.  Do we have code for that at all?  If so where is it?

No, that's not supposed to happen.  That's why we have this:

    /*
     * We can't pass Params to workers at the moment either, so they are also
     * parallel-restricted.
     */
    else if (IsA(node, Param))
    {
        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
            return true;
    }

Maybe it's checking the quals but not recursing into the tlist?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
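
Editor's note: Robert's question above, whether the check covers the quals but not the target list, can be made concrete with a toy expression walker. This is a minimal sketch under stated assumptions and not planner code: `ToyNode`, `toy_contains_param`, and `toy_scan_is_worker_safe` are hypothetical names, and the tree is reduced to binary operators. Later messages in the thread trace the actual bug to a different place, so the sketch only illustrates the rule that a Param anywhere in a node's expressions keeps it out of a worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyNodeTag
{
    TOY_CONST,
    TOY_VAR,
    TOY_PARAM,
    TOY_OP
} ToyNodeTag;

typedef struct ToyNode
{
    ToyNodeTag  tag;
    const struct ToyNode *left;     /* children, used only by TOY_OP */
    const struct ToyNode *right;
} ToyNode;

/* Recursively report whether an expression contains a Param anywhere. */
static bool
toy_contains_param(const ToyNode *node)
{
    if (node == NULL)
        return false;
    if (node->tag == TOY_PARAM)
        return true;
    return toy_contains_param(node->left) || toy_contains_param(node->right);
}

/*
 * A scan may run in a worker only if neither its target list nor its
 * quals reference a Param.  Walking the quals alone would miss a Param
 * that appears only in the target list.
 */
static bool
toy_scan_is_worker_safe(const ToyNode *const *tlist, size_t ntlist,
                        const ToyNode *const *quals, size_t nquals)
{
    assert(tlist != NULL || ntlist == 0);
    assert(quals != NULL || nquals == 0);

    for (size_t i = 0; i < ntlist; i++)
        if (toy_contains_param(tlist[i]))
            return false;
    for (size_t i = 0; i < nquals; i++)
        if (toy_contains_param(quals[i]))
            return false;
    return true;
}
```

With a target list holding a single `TOY_PARAM` node and quals that contain none, `toy_scan_is_worker_safe` returns false, which is the outcome the planner should have reached for the subquery in this thread.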



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 21, 2016 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> so what we've got is a case where a parameter computed by the FunctionScan
>> (in the master) would need to be passed into the parallel workers at
>> runtime.  Do we have code for that at all?  If so where is it?

> No, that's not supposed to happen.

OK, that makes this a planner failure: we should not have allowed this
query to become parallelized.

> Maybe it's checking the quals but not recursing into the tlist?

It seems like maybe searching for individual Params is the wrong thing.
Why are we allowing it to generate a parameterized Gather path at all?
Given the lack of any way to transmit runtime param values to the worker,
I can't see how that would ever work.
        regards, tom lane



Re: [sqlsmith] Parallel worker crash on seqscan

From
Robert Haas
Date:
On Mon, Nov 21, 2016 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Nov 21, 2016 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> so what we've got is a case where a parameter computed by the FunctionScan
>>> (in the master) would need to be passed into the parallel workers at
>>> runtime.  Do we have code for that at all?  If so where is it?
>
>> No, that's not supposed to happen.
>
> OK, that makes this a planner failure: we should not have allowed this
> query to become parallelized.
>
>> Maybe it's checking the quals but not recursing into the tlist?
>
> It seems like maybe searching for individual Params is the wrong thing.
> Why are we allowing it to generate a parameterized Gather path at all?
> Given the lack of any way to transmit runtime param values to the worker,
> I can't see how that would ever work.

Hmm, so you're thinking it could be the job of generate_gather_paths()
to make sure we don't do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 21, 2016 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It seems like maybe searching for individual Params is the wrong thing.
>> Why are we allowing it to generate a parameterized Gather path at all?
>> Given the lack of any way to transmit runtime param values to the worker,
>> I can't see how that would ever work.

> Hmm, so you're thinking it could be the job of generate_gather_paths()
> to make sure we don't do that?

Actually, the Gather path *isn't* parameterized.  The problem here is
that we're planning an unflattened subquery, and the only thing that
is parallel-unsafe is that there is an outer Param in its toplevel tlist,
and we're somehow deciding that we can stick that unsafe tlist into (and
beneath) the Gather node.  So something rotten in that area, but I've not
quite found it yet.
        regards, tom lane



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
I wrote:
> Actually, the Gather path *isn't* parameterized.  The problem here is
> that we're planning an unflattened subquery, and the only thing that
> is parallel-unsafe is that there is an outer Param in its toplevel tlist,
> and we're somehow deciding that we can stick that unsafe tlist into (and
> beneath) the Gather node.  So something rotten in that area, but I've not
> quite found it yet.

Hah: not where I thought it was at all.  The problem seems to be down to
the optimization I put into is_parallel_safe() awhile back to skip testing
anything if we previously found the entire querytree to be parallel-safe.
Well, the raw query tree *is* parallel-safe.  It's only when we inject
some Params that we have a parallel hazard.  So that optimization is too
optimistic :-(
        regards, tom lane



Re: [sqlsmith] Parallel worker crash on seqscan

From
Robert Haas
Date:
On Mon, Nov 21, 2016 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Actually, the Gather path *isn't* parameterized.  The problem here is
>> that we're planning an unflattened subquery, and the only thing that
>> is parallel-unsafe is that there is an outer Param in its toplevel tlist,
>> and we're somehow deciding that we can stick that unsafe tlist into (and
>> beneath) the Gather node.  So something rotten in that area, but I've not
>> quite found it yet.
>
> Hah: not where I thought it was at all.  The problem seems to be down to
> the optimization I put into is_parallel_safe() awhile back to skip testing
> anything if we previously found the entire querytree to be parallel-safe.
> Well, the raw query tree *is* parallel-safe.  It's only when we inject
> some Params that we have a parallel hazard.  So that optimization is too
> optimistic :-(

That sucks.  Any idea how we might salvage it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 21, 2016 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hah: not where I thought it was at all.  The problem seems to be down to
>> the optimization I put into is_parallel_safe() awhile back to skip testing
>> anything if we previously found the entire querytree to be parallel-safe.
>> Well, the raw query tree *is* parallel-safe.  It's only when we inject
>> some Params that we have a parallel hazard.  So that optimization is too
>> optimistic :-(

> That sucks.  Any idea how we might salvage it?

I just disabled it by checking to see if any Params have been created.
It might be possible to be more refined, but I dunno that adding more
bookkeeping would pay for itself.
        regards, tom lane
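
Editor's note: the shortcut and the fix described above can be modelled in a few lines. This is a sketch of the idea only, not the committed change: `ToyPlannerState`, `ToyExpr`, and all function names are hypothetical, and the full expression walk is reduced to a single flag. It assumes the planner records one up-front verdict for the raw query tree and counts the Params it creates afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical planner-wide state. */
typedef struct ToyPlannerState
{
    bool    whole_query_safe;   /* verdict from scanning the raw query tree */
    int     params_created;     /* Params injected by the planner since then */
} ToyPlannerState;

/* Hypothetical expression; the real check walks a whole tree. */
typedef struct ToyExpr
{
    bool    contains_param;
} ToyExpr;

/* Stand-in for the full (expensive) hazard walk. */
static bool
toy_walk_is_safe(const ToyExpr *expr)
{
    assert(expr != NULL);
    return !expr->contains_param;
}

/*
 * Too optimistic: trusts the verdict on the raw query tree even though
 * Params may have been injected into expressions after it was computed.
 */
static bool
toy_is_parallel_safe_shortcut(const ToyPlannerState *st, const ToyExpr *expr)
{
    if (st->whole_query_safe)
        return true;
    return toy_walk_is_safe(expr);
}

/*
 * Conservative variant: the shortcut applies only while no Params have
 * been created; otherwise fall back to the full walk.
 */
static bool
toy_is_parallel_safe_fixed(const ToyPlannerState *st, const ToyExpr *expr)
{
    if (st->whole_query_safe && st->params_created == 0)
        return true;
    return toy_walk_is_safe(expr);
}
```

For a state with `whole_query_safe = true` and `params_created = 1`, and an expression that contains a Param, the shortcut variant answers "safe" while the fixed variant answers "unsafe". The cost of the conservative variant is that every query with at least one planner-created Param pays for the full walk, which is the trade-off Tom mentions when he doubts that finer bookkeeping would pay for itself.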



Re: [sqlsmith] Parallel worker crash on seqscan

From
Andreas Seltenreich
Date:
Hi,

Ashutosh Sharma writes:

>> the following query appears to reliably crash parallel workers on master
>> as of 0832f2d.

> As suggested, I have tried to reproduce this issue on *0832f2d* commit but
> could not reproduce it.

as Tom pointed out earlier, I had secretly set parallel_setup_cost and
parallel_tuple_cost to 0.  I assumed these were irrelevant when
force_parallel_mode is on.  I'll do less assuming and more testing on a
vanilla install on future reports.

Sorry for the inconvenience,
Andreas



Re: [sqlsmith] Parallel worker crash on seqscan

From
Robert Haas
Date:
On Mon, Nov 21, 2016 at 5:14 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Ashutosh Sharma writes:
>>> the following query appears to reliably crash parallel workers on master
>>> as of 0832f2d.
>
>> As suggested, I have tried to reproduce this issue on *0832f2d* commit but
>> could not reproduce it.
>
> as Tom pointed out earlier, I had secretly set parallel_setup_cost and
> parallel_tuple_cost to 0.  I assumed these were irrelevant when
> force_parallel_mode is on.  I'll do less assuming and more testing on a
> vanilla install on future reports.
>
> Sorry for the inconvenience,

Don't sweat it!  Your sqlsmith fuzz testing has been extremely valuable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Parallel worker crash on seqscan

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Don't sweat it!  Your sqlsmith fuzz testing has been extremely valuable.

That might be the understatement of the year.  I don't know how long
it would have taken us to find these things in the field.  Thanks!
        regards, tom lane