Обсуждение: [HACKERS] parallelize queries containing subplans

Поиск
Список
Период
Сортировка

[HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
Currently, queries that have references to SubPlans or
AlternativeSubPlans are considered parallel-restricted.  I think we
can lift this restriction in many cases especially when SubPlans are
parallel-safe.  To make this work, we need to propagate the
parallel-safety information from path node to plan node and the same
could be easily done while creating a plan.  Another option could be
that instead of propagating parallel-safety information from path to
plan, we can find out from the plan if it is parallel-safe (doesn't
contain any parallel-aware node) by traversing whole plan tree, but I
think it is a waste of cycles.  Once we have parallel-safety
information in the plan, we can use that for detection of
parallel-safe expressions in max_parallel_hazard_walker().  Finally,
we can pass all the subplans to workers during plan serialization in
ExecSerializePlan().  This will enable workers to execute subplans
that are referred in parallel part of the plan.  Now, we might be able
to optimize it such that we pass only subplans that are referred in
parallel portion of plan, but I am not sure if it is worth the trouble
because it is one-time cost and much lesser than other things we do
(like creating
dsm, launching workers).

Attached patch implements the above idea.  This will enable
parallelism for queries containing un-correlated subplans, an example
of which is as follows:

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set min_parallel_relation_size=50;

create table t1 (i int, j int, k int);
create table t2 (i int, j int, k int);

insert into t1 values (generate_series(1,10)*random(),
generate_series(5,50)*random(),
generate_series(8,80)*random());
insert into t2 values (generate_series(4,10)*random(),
generate_series(5,90)*random(),
generate_series(2,10)*random());


Plan without Patch
-----------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on t1  (cost=110.84..411.72 rows=8395 width=12)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Seq Scan on t2  (cost=0.00..104.50 rows=2537 width=4)
           Filter: (i = ANY ('{1,2,3}'::integer[]))
(5 rows)

Plan with Patch
------------------------
postgres=# explain select * from t1 where t1.i not in (select t2.i
from t2 where t2.i in (1,2,3));
                               QUERY PLAN
-------------------------------------------------------------------------
 Gather  (cost=110.84..325.30 rows=8395 width=12)
   Workers Planned: 1
   ->  Parallel Seq Scan on t1  (cost=110.84..325.30 rows=4938 width=12)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on t2  (cost=0.00..104.50 rows=2537 width=4)
                 Filter: (i = ANY ('{1,2,3}'::integer[]))
(7 rows)

We have observed that Q-16 in TPC-H have been improved with the patch
and the analysis of same will be shared by my colleague Rafia.

Now, we can further extend this to parallelize queries containing
correlated subplans like below:

explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..831049.09 rows=8395 width=12)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
           Filter: (i = t1.i)
(5 rows)

In the above query, Filter on t2 (i = t1.i) generates Param node which
is a parallel-restricted node, so such queries won't be able to use
parallelism even with the patch.  I think we can mark Params which
refer to same level as parallel-safe and I think we have this
information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
when we replace correlation vars (SS_replace_correlation_vars).  The
reason why it is not advisable to mark Params that don't refer to same
query level as parallel-safe is that can lead to plans like below:

Foo
-> Gather
  -> Bar
       SubPlan 1 (SubPlan refers to Foo)

Now, this problem is tricky because we need to pass all such Params
each time we invoke Gather.  That also is doable with much more
effort, but I am not sure if it is worth because all of the use cases
I have seen in TPC-H (Q-2) or TPC-DS (Q-6) always uses SubPlans that
refer to same query level.

Yet, another useful enhancement in this area could be to consider both
parallel and non-parallel paths for subplans.  As of now, we consider
the cheapest/best path and form subplan from it, but it is quite
possible that instead of choosing parallel path (in case it is
cheapest) at subplan level, the non-parallel path at subplan level
could be beneficial when upper plan can use parallelism.  I think this
will be a separate project in itself if we want to do this and based
on my study of TPC-H and TPC-DS queries, I am confident that this will
be helpful in certain queries at higher scale factors.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Rafia Sabih
Дата:
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Currently, queries that have references to SubPlans or
> AlternativeSubPlans are considered parallel-restricted.  I think we
> can lift this restriction in many cases especially when SubPlans are
> parallel-safe.  To make this work, we need to propagate the
> parallel-safety information from path node to plan node and the same
> could be easily done while creating a plan.  Another option could be
> that instead of propagating parallel-safety information from path to
> plan, we can find out from the plan if it is parallel-safe (doesn't
> contain any parallel-aware node) by traversing whole plan tree, but I
> think it is a waste of cycles.  Once we have parallel-safety
> information in the plan, we can use that for detection of
> parallel-safe expressions in max_parallel_hazard_walker().  Finally,
> we can pass all the subplans to workers during plan serialization in
> ExecSerializePlan().  This will enable workers to execute subplans
> that are referred in parallel part of the plan.  Now, we might be able
> to optimize it such that we pass only subplans that are referred in
> parallel portion of plan, but I am not sure if it is worth the trouble
> because it is one-time cost and much lesser than other things we do
> (like creating
> dsm, launching workers).
>
> Attached patch implements the above idea.  This will enable
> parallelism for queries containing un-correlated subplans, an example
> of which is as follows:
>
> set parallel_tuple_cost=0;
> set parallel_setup_cost=0;
> set min_parallel_relation_size=50;
>
> create table t1 (i int, j int, k int);
> create table t2 (i int, j int, k int);
>
> insert into t1 values (generate_series(1,10)*random(),
> generate_series(5,50)*random(),
> generate_series(8,80)*random());
> insert into t2 values (generate_series(4,10)*random(),
> generate_series(5,90)*random(),
> generate_series(2,10)*random());
>
>
> Plan without Patch
> -----------------------------
> postgres=# explain select * from t1 where t1.i not in (select t2.i
> from t2 where t2.i in (1,2,3));
>                           QUERY PLAN
> ---------------------------------------------------------------
>  Seq Scan on t1  (cost=110.84..411.72 rows=8395 width=12)
>    Filter: (NOT (hashed SubPlan 1))
>    SubPlan 1
>      ->  Seq Scan on t2  (cost=0.00..104.50 rows=2537 width=4)
>            Filter: (i = ANY ('{1,2,3}'::integer[]))
> (5 rows)
>
> Plan with Patch
> ------------------------
> postgres=# explain select * from t1 where t1.i not in (select t2.i
> from t2 where t2.i in (1,2,3));
>                                QUERY PLAN
> -------------------------------------------------------------------------
>  Gather  (cost=110.84..325.30 rows=8395 width=12)
>    Workers Planned: 1
>    ->  Parallel Seq Scan on t1  (cost=110.84..325.30 rows=4938 width=12)
>          Filter: (NOT (hashed SubPlan 1))
>          SubPlan 1
>            ->  Seq Scan on t2  (cost=0.00..104.50 rows=2537 width=4)
>                  Filter: (i = ANY ('{1,2,3}'::integer[]))
> (7 rows)
>
> We have observed that Q-16 in TPC-H have been improved with the patch
> and the analysis of same will be shared by my colleague Rafia.
>
To study the effect of uncorrelated sub-plan pushdown on TPC-H and
TPC-DS benchmark queries we performed some experiments and the
execution time results for same are summarised as follows,

Query    | HEAD | Patches | scale-factor
-----------+---------+-----------+-----------------
DS-Q45 | 35       | 19         | scale-factor: 100
H-Q16   | 812     | 645       | scale-factor: 300
H-Q16   | 49       | 37         | scale-factor: 20

Execution time given in above table is in seconds. Detailed analysis
of this experimentation is as follows,
Additional patches applied in this analysis are,
Parallel index scan [1]
Parallel index-only scan [2]
Parallel merge-join [3]
The system setup used for this experiment is,

Server parameter settings:
work_mem = 500 MB,
max_parallel_workers_per_gather = 4,
random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost,
shared_buffers = 1 GB
Machine used: IBM Power, 4 socket machine, 512 GB RAM

TPC-DS scale factor = 100 (approx size of database is 150 GB)

Query 45 which takes around 35 seconds on head, completes in 19
seconds with these patches. The point to note here is that without
this patch of pushing uncorrelated sublans, hash join which is using
subplan in join filter could not be pushed to workers and hence query
was unable to use the parallelism enough, with this patch parallelism
is available till the topmost join node. The output of explain analyse
statement of this query on both head and with patches is given in
attached file ds_q45.txt.

On further evaluating these patches on TPC-H queries on different
scale factors we came across query 16, for TPC-H scale factor 20 and
aforementioned parameter settings with the change of
max_parallel_workers_per gather = 2, it took 37 seconds with the
patches and 49 seconds on head. Though the improvement in overall
query performance is not appearing to be significantly high, yet the
point to note here is that the time taken by join was around 26
seconds on head which reduced to 14 seconds with the patches. Overall
benefit in performance is not high because sort node is dominating the
execution time. The plan information of this query is given in
attached file h_q16_20_2.txt.

On increasing the scale factor to 300, setting work_mem to 1GB,
increasing max_parallel_workers_per_gather = 6, and disabling the
parallel sequential scan for supplier table by 'alter table supplier
set (parallel_workers = 0)', Q16 completes in 645 seconds with
patches, which was taking 812 seconds on head. We need to disable
parallel_workers for supplier table because on higher worker count it
was taking parallel seq scan and hence the scan node that is using
subplan could not be parallelised. For this query both pushdown of
subplans and parallel merge-join, without any one of these the
benefits of parallelism might not be leveraged fully. The output of
explain analyse for this query is given in h_q16_300.txt

Overall, with pushdown of uncorrelated sub-plans to workers enables
the parallelism in joins which was restricted before and hence good
improvement in query performance can be witnessed.

> Now, we can further extend this to parallelize queries containing
> correlated subplans like below:
>
> explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
>                          QUERY PLAN
> -------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..831049.09 rows=8395 width=12)
>    Filter: (SubPlan 1)
>    SubPlan 1
>      ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
>            Filter: (i = t1.i)
> (5 rows)
>
As per my analysis this extension to correlated subplans is likely to
improve the performance of following queries -- Q2 in TPC-H and Q6 in
TPC-DS.

> Yet, another useful enhancement in this area could be to consider both
> parallel and non-parallel paths for subplans.  As of now, we consider
> the cheapest/best path and form subplan from it, but it is quite
> possible that instead of choosing parallel path (in case it is
> cheapest) at subplan level, the non-parallel path at subplan level
> could be beneficial when upper plan can use parallelism.  I think this
> will be a separate project in itself if we want to do this and based
> on my study of TPC-H and TPC-DS queries, I am confident that this will
> be helpful in certain queries at higher scale factors.
>
I agree as then we do not need to disable parallelism for particular
relations as we currently do for supplier relation in Q16 of TPC-H.

[1] https://www.postgresql.org/message-id/CAA4eK1KthrAvNjmB2cWuUHz%2Bp3ZTTtbD7o2KUw49PC8HK4r1Wg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAOGQiiOv9NA1VrAo8PmENfGi-y%3DCj_DiTF4vyjp%2BfmuEzovwEA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAFiTN-tdYpcsk7Lpv0HapcmvSnMN_TgKjC7RkmvVLZAF%2BXfmPg%40mail.gmail.com

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Now, we can further extend this to parallelize queries containing
> correlated subplans like below:
>
> explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
>                          QUERY PLAN
> -------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..831049.09 rows=8395 width=12)
>    Filter: (SubPlan 1)
>    SubPlan 1
>      ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
>            Filter: (i = t1.i)
> (5 rows)
>
> In the above query, Filter on t2 (i = t1.i) generates Param node which
> is a parallel-restricted node, so such queries won't be able to use
> parallelism even with the patch.  I think we can mark Params which
> refer to same level as parallel-safe and I think we have this
> information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
> when we replace correlation vars (SS_replace_correlation_vars).
>

I have implemented the above idea which will allow same or immediate
outer level PARAMS as parallel_safe.  The results of above query after
patch:

postgres=# explain select * from t1 where t1.i in (select t2.i from t2
where t2.i=t1.i);
                                QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=0.00..488889.88 rows=8395 width=12)
   Workers Planned: 1
   ->  Parallel Seq Scan on t1  (cost=0.00..488889.88 rows=4938 width=12)
         Filter: (SubPlan 1)
         SubPlan 1
           ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
                 Filter: (i = t1.i)
(7 rows)

Note - This patch can be applied on top of
pq_pushdown_subplan_v1.patch posted upthread [1]

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Be8Z45D2n%2BrnDMDYsVEb5iW7jqaCH_tvPMYau%3D1Rru9w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Dilip Kumar
Дата:
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch implements the above idea.  This will enable
> parallelism for queries containing un-correlated subplans, an example
> of which is as follows:

I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch
looks clean to me.
I have also done some basic functional testing and it's working fine.

I have one comment.

+ else if (IsA(node, AlternativeSubPlan))
+ {
+ AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+ ListCell   *lc;
+
+ Assert(context->root);
+
+ foreach(lc, asplan->subplans)
+ {
+ SubPlan    *splan = (SubPlan *) lfirst(lc);
+ Plan   *plan;
+
+ Assert(IsA(splan, SubPlan));
+
+ plan = planner_subplan_get_plan(context->root, splan);
+ if (!plan->parallel_safe)
+ return true;
+ } }

I see no reason why we need to process the subplan list of
AlternativeSubPlan here, Anyway expression_tree_walker will take care
of processing each subplan of AlternativeSubPlan. Now in the case
where all the subplans in AlternativeSubPlan are parallel_safe we will
process this list twice. But, more than that it will be cleaner to not
handle AlternativeSubPlan here unless there is some strong reason?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Robert Haas
Дата:
On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Currently, queries that have references to SubPlans or
> AlternativeSubPlans are considered parallel-restricted.  I think we
> can lift this restriction in many cases especially when SubPlans are
> parallel-safe.  To make this work, we need to propagate the
> parallel-safety information from path node to plan node and the same
> could be easily done while creating a plan.  Another option could be
> that instead of propagating parallel-safety information from path to
> plan, we can find out from the plan if it is parallel-safe (doesn't
> contain any parallel-aware node) by traversing whole plan tree, but I
> think it is a waste of cycles.  Once we have parallel-safety
> information in the plan, we can use that for detection of
> parallel-safe expressions in max_parallel_hazard_walker().  Finally,
> we can pass all the subplans to workers during plan serialization in
> ExecSerializePlan().  This will enable workers to execute subplans
> that are referred in parallel part of the plan.  Now, we might be able
> to optimize it such that we pass only subplans that are referred in
> parallel portion of plan, but I am not sure if it is worth the trouble
> because it is one-time cost and much lesser than other things we do
> (like creating
> dsm, launching workers).

It seems unfortunate to have to add a parallel_safe flag to the
finished plan; the whole reason we have the Path-Plan distinction is
so that we can throw away information that won't be needed at
execution time.  The parallel_safe flag is, in fact, not needed at
execution time, but just for further planning.  Isn't there some way
that we can remember, at the time when a sublink is converted to a
subplan, whether or not the subplan was created from a parallel-safe
path?  That seems like it would be cleaner than maintaining this flag
for all plans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Tue, Jan 10, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Currently, queries that have references to SubPlans or
>> AlternativeSubPlans are considered parallel-restricted.  I think we
>> can lift this restriction in many cases especially when SubPlans are
>> parallel-safe.  To make this work, we need to propagate the
>> parallel-safety information from path node to plan node and the same
>> could be easily done while creating a plan.  Another option could be
>> that instead of propagating parallel-safety information from path to
>> plan, we can find out from the plan if it is parallel-safe (doesn't
>> contain any parallel-aware node) by traversing whole plan tree, but I
>> think it is a waste of cycles.  Once we have parallel-safety
>> information in the plan, we can use that for detection of
>> parallel-safe expressions in max_parallel_hazard_walker().  Finally,
>> we can pass all the subplans to workers during plan serialization in
>> ExecSerializePlan().  This will enable workers to execute subplans
>> that are referred in parallel part of the plan.  Now, we might be able
>> to optimize it such that we pass only subplans that are referred in
>> parallel portion of plan, but I am not sure if it is worth the trouble
>> because it is one-time cost and much lesser than other things we do
>> (like creating
>> dsm, launching workers).
>
> It seems unfortunate to have to add a parallel_safe flag to the
> finished plan; the whole reason we have the Path-Plan distinction is
> so that we can throw away information that won't be needed at
> execution time.  The parallel_safe flag is, in fact, not needed at
> execution time, but just for further planning.  Isn't there some way
> that we can remember, at the time when a sublink is converted to a
> subplan, whether or not the subplan was created from a parallel-safe
> path?
>

The other alternative is to remember this information in SubPlan.  We
can retrieve parallel_safe information from best_path and can use it
while generating SubPlan.  The main reason for storing it in the plan
was to avoid explicitly passing parallel_safe information while
generating SubPlan as plan was already available at that time.
However, it seems there are only two places in code (refer
build_subplan) where this information needs to be propagated.  Let me
know if you prefer to remember the parallel_safe information in
SubPlan instead of in Plan or if you have something else in mind?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Robert Haas
Дата:
On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The other alternative is to remember this information in SubPlan.  We
> can retrieve parallel_safe information from best_path and can use it
> while generating SubPlan.  The main reason for storing it in the plan
> was to avoid explicitly passing parallel_safe information while
> generating SubPlan as plan was already available at that time.
> However, it seems there are only two places in code (refer
> build_subplan) where this information needs to be propagated.  Let me
> know if you prefer to remember the parallel_safe information in
> SubPlan instead of in Plan or if you have something else in mind?

I think we should try doing it in the SubPlan, at least, and see if
that comes out more elegant than what you have at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Tue, Jan 10, 2017 at 11:55 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached patch implements the above idea.  This will enable
>> parallelism for queries containing un-correlated subplans, an example
>> of which is as follows:
>
> I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch
> looks clean to me.
> I have also done some basic functional testing and it's working fine.
>
> I have one comment.
>
> + else if (IsA(node, AlternativeSubPlan))
> + {
> + AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
> + ListCell   *lc;
> +
> + Assert(context->root);
> +
> + foreach(lc, asplan->subplans)
> + {
> + SubPlan    *splan = (SubPlan *) lfirst(lc);
> + Plan   *plan;
> +
> + Assert(IsA(splan, SubPlan));
> +
> + plan = planner_subplan_get_plan(context->root, splan);
> + if (!plan->parallel_safe)
> + return true;
> + }
>   }
>
> I see no reason why we need to process the subplan list of
> AlternativeSubPlan here, Anyway expression_tree_walker will take care
> of processing each subplan of AlternativeSubPlan. Now in the case
> where all the subplans in AlternativeSubPlan are parallel_safe we will
> process this list twice.
>

Valid point, but I think we can avoid that by returning false after
foreach(..) loop.  I think one improvement could be that instead of
manually checking the parallel safety of each subplan, we can
recursively call max_parallel_hazard_walker for each subplan.


> But, more than that it will be cleaner to not
> handle AlternativeSubPlan here unless there is some strong reason?
>

Yeah, the reason is to avoid unnecessary recursion.  Also, in all
other places in the code, we do handle both of them separately, refer
cost_qual_eval_walker for somewhat similar usage.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Dilip Kumar
Дата:
On Thu, Jan 12, 2017 at 9:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Valid point, but I think we can avoid that by returning false after
> foreach(..) loop.  I think one improvement could be that instead of
> manually checking the parallel safety of each subplan, we can
> recursively call max_parallel_hazard_walker for each subplan.

I agree that this way we can avoid the problem what I mentioned.
>
>
>> But, more than that it will be cleaner to not
>> handle AlternativeSubPlan here unless there is some strong reason?
>>
>
> Yeah, the reason is to avoid unnecessary recursion.  Also, in all
> other places in the code, we do handle both of them separately, refer
> cost_qual_eval_walker for somewhat similar usage.

ok.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Thu, Jan 12, 2017 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The other alternative is to remember this information in SubPlan.  We
>> can retrieve parallel_safe information from best_path and can use it
>> while generating SubPlan.  The main reason for storing it in the plan
>> was to avoid explicitly passing parallel_safe information while
>> generating SubPlan as plan was already available at that time.
>> However, it seems there are only two places in code (refer
>> build_subplan) where this information needs to be propagated.  Let me
>> know if you prefer to remember the parallel_safe information in
>> SubPlan instead of in Plan or if you have something else in mind?
>
> I think we should try doing it in the SubPlan, at least, and see if
> that comes out more elegant than what you have at the moment.
>

Okay, done that way.  I have fixed the review comments raised by Dilip
as well and added the test case in attached patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Dilip Kumar
Дата:
On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, done that way.  I have fixed the review comments raised by Dilip
> as well and added the test case in attached patch.

I have tested with pq_pushdown_subplan_v2.patch +
pq_pushdown_correl_subplan_v1.patch

sqlsmith. is reporting below error, I have attached the query to
reproduce the issue. This issue is only coming when both the patch is
applied, only with pq_pushdown_subplan_v2.patch there is no problem.
So I think the problem is because of
pq_pushdown_correl_subplan_v1.patch.

ERROR:  did not find '}' at end of input node at character 762

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Tom Lane
Дата:
Dilip Kumar <dilipbalaut@gmail.com> writes:
> ERROR:  did not find '}' at end of input node at character 762

I've not looked at the patches, but just seeing this error message,
this looks like somebody's fat-fingered the correspondence between
outfuncs.c and readfuncs.c processing.
        regards, tom lane



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Fri, Jan 13, 2017 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dilip Kumar <dilipbalaut@gmail.com> writes:
>> ERROR:  did not find '}' at end of input node at character 762
>

I could reproduce this error with simple query like:
SELECT * FROM information_schema.role_usage_grants WHERE object_type
LIKE 'FOREIGN%' AND object_name IN ('s6', 'foo') ORDER BY 1, 2, 3, 4,
5;

The reason is that the stored rules have the old structure of Param.  See below:
{RELABELTYPE :arg {PARAM :paramkind 2 :paramid 1 :paramtype 11975
:paramtypmod -1 :paramcollid 100 :location -1}

The new variable parallel_safe is not present in above node.  If you
recreate the fresh database you won't see the above problem.  I have
observed a problem in equal function which I have fixed in the
attached patch.  I have also added a test, so that we can catch any
problem similar to what you have reported.

> I've not looked at the patches, but just seeing this error message,
> this looks like somebody's fat-fingered the correspondence between
> outfuncs.c and readfuncs.c processing.
>

I think what we need is catversion bump as Param is part of stored rules.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 12, 2017 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The other alternative is to remember this information in SubPlan.  We
>>> can retrieve parallel_safe information from best_path and can use it
>>> while generating SubPlan.  The main reason for storing it in the plan
>>> was to avoid explicitly passing parallel_safe information while
>>> generating SubPlan as plan was already available at that time.
>>> However, it seems there are only two places in code (refer
>>> build_subplan) where this information needs to be propagated.  Let me
>>> know if you prefer to remember the parallel_safe information in
>>> SubPlan instead of in Plan or if you have something else in mind?
>>
>> I think we should try doing it in the SubPlan, at least, and see if
>> that comes out more elegant than what you have at the moment.
>>
>
> Okay, done that way.  I have fixed the review comments raised by Dilip
> as well and added the test case in attached patch.
>

After commit-ab1f0c8, this patch needs a rebase.  Attached find
rebased version of the patch.

Thanks to Kuntal for informing me offlist that this patch needs rebase.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Dilip Kumar
Дата:
On Mon, Jan 16, 2017 at 9:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> After commit-ab1f0c8, this patch needs a rebase.  Attached find
> rebased version of the patch.
>
> Thanks to Kuntal for informing me offlist that this patch needs rebase.

In this patch, I have observed some changes while creating subplan for
CTE. So I have reviewed this from that perspective and also tried to
perform some test.

Here is my finding/ comments.

@@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)   &splan->firstColCollation); splan->useHashTable = false;
splan->unknownEqFalse= false;
 
+ splan->parallel_safe = best_path->parallel_safe;

I noticed that if path for CTE is parallel safe then we are marking
CTE subplan as parallel safe, In particular, I don't have any problem
with that, but can you add some test case which can cover this path, I
mean to say where CTE subplan are pushed.

------------
I have tried to test the subplan with CTE below is my test.
create table t1(a int , b varchar);
create table t (n int, b varchar);

Query:
explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS (   VALUES (1) UNION ALL   SELECT a+1 FROM t1
WHEREa < 100
 
)
SELECT sum(n) FROM t);

During debugging I found that subplan created for below part of the
query is parallel_unsafe, Is it a problem or there is some explanation
of why it's not parallel_safe,

(WITH RECURSIVE t(n) AS (   VALUES (1) UNION ALL   SELECT a+1 FROM t1 WHERE a < 100
)
SELECT sum(n) FROM t);
----------


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Kuntal Ghosh
Дата:
On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> @@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)
>     &splan->firstColCollation);
>   splan->useHashTable = false;
>   splan->unknownEqFalse = false;
> + splan->parallel_safe = best_path->parallel_safe;
>
> I noticed that if path for CTE is parallel safe then we are marking
> CTE subplan as parallel safe, In particular, I don't have any problem
> with that, but can you add some test case which can cover this path, I
> mean to say where CTE subplan are pushed.
>
> ------------
> I have tried to test the subplan with CTE below is my test.
> create table t1(a int , b varchar);
> create table t (n int, b varchar);
>
> Query:
> explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS (
>     VALUES (1)
>   UNION ALL
>     SELECT a+1 FROM t1 WHERE a < 100
> )
> SELECT sum(n) FROM t);
>
> During debugging I found that subplan created for below part of the
> query is parallel_unsafe, Is it a problem or there is some explanation
> of why it's not parallel_safe,
>
> (WITH RECURSIVE t(n) AS (
>     VALUES (1)
>   UNION ALL
>     SELECT a+1 FROM t1 WHERE a < 100
> )
> SELECT sum(n) FROM t);
> ----------
The corresponding plan for the query you have specified is:
                                   QUERY PLAN
----------------------------------------------------------------------------------Seq Scan on public.t
(cost=40.73..20894.74rows=500480 width=35)  Output: t.n, t.b  Filter: (NOT (hashed SubPlan 2))  SubPlan 2    ->
Aggregate (cost=40.72..40.73 rows=1 width=8)          Output: sum(t_1.n)          CTE t            ->  Append
(cost=0.00..31.18rows=424 width=4)                  ->  Result  (cost=0.00..0.01 rows=1 width=4)
Output:1                  ->  Seq Scan on public.t1  (cost=0.00..26.93
 
rows=423 width=4)                        Output: (t1.a + 1)                        Filter: (t1.a < 100)          ->
CTEScan on t t_1  (cost=0.00..8.48 rows=424 width=4)                Output: t_1.n
 
(15 rows)

Now, the plan for CTE is parallel_safe. But, a CTE plan is converted
to an InitPlan and returns a Param which is used in the CTE Scan.
Since Param is not parallel_safe till now, the SubPlan is also not
parallel_safe. This is why CTE subplans will not be pushed under
Gather.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Kuntal Ghosh
Дата:
On Thu, Jan 19, 2017 at 3:19 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> Since Param is not parallel_safe till now, the SubPlan is also not
> parallel_safe. This is why CTE subplans will not be pushed under
> Gather.
Specifically, Params which are generated using generate_new_param()
are not parallel_safe. In the patch, it is marked as parallel-unsafe.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Dilip Kumar
Дата:
On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> During debugging I found that subplan created for below part of the
> query is parallel_unsafe, Is it a problem or there is some explanation
> of why it's not parallel_safe,

Okay, so basically we don't have any mechanism to perform parallel
scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
must come along with CTE scan. So I think we can avoid setting below
code because we will never be able to test its side effect, another
argument can be that if we don't consider the final effect, and just
see this subplan then by logic it should be marked parallel-safe or
unsafe as per it's path and it will not have any side effect, as it
will finally become parallel-unsafe. So it's your call to keep it
either way.


@@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root)   &splan->firstColCollation); splan->useHashTable = false;
splan->unknownEqFalse= false;
 
+ splan->parallel_safe = best_path->parallel_safe;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Kuntal Ghosh
Дата:
On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> During debugging I found that subplan created for below part of the
>> query is parallel_unsafe, Is it a problem or there is some explanation
>> of why it's not parallel_safe,
>
> Okay, so basically we don't have any mechanism to perform parallel
> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
> must come along with CTE scan. So I think we can avoid setting below
> code because we will never be able to test its side effect, another
> argument can be that if we don't consider the final effect, and just
> see this subplan then by logic it should be marked parallel-safe or
> unsafe as per it's path and it will not have any side effect, as it
> will finally become parallel-unsafe. So it's your call to keep it
> either way.
Oops, you're right. We don't consider parallelism for RTE_CTE type.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> During debugging I found that subplan created for below part of the
>> query is parallel_unsafe, Is it a problem or there is some explanation
>> of why it's not parallel_safe,
>
> Okay, so basically we don't have any mechanism to perform parallel
> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
> must come along with CTE scan. So I think we can avoid setting below
> code because we will never be able to test its side effect, another
> argument can be that if we don't consider the final effect, and just
> see this subplan then by logic it should be marked parallel-safe or
> unsafe as per it's path and it will not have any side effect, as it
> will finally become parallel-unsafe. So it's your call to keep it
> either way.
>

Yeah, actually setting parallel_safety information for subplan from
corresponding is okay.  However, in this particular case as we know
that it might not be of any use till we enable parallelism for CTE
Scan (and doing that is certainly not essential for this project).
So, I have set parallel_safe as false for CTE subplans.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Tue, Jan 24, 2017 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> During debugging I found that subplan created for below part of the
>>> query is parallel_unsafe, Is it a problem or there is some explanation
>>> of why it's not parallel_safe,
>>
>> Okay, so basically we don't have any mechanism to perform parallel
>> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
>> must come along with CTE scan. So I think we can avoid setting below
>> code because we will never be able to test its side effect, another
>> argument can be that if we don't consider the final effect, and just
>> see this subplan then by logic it should be marked parallel-safe or
>> unsafe as per it's path and it will not have any side effect, as it
>> will finally become parallel-unsafe. So it's your call to keep it
>> either way.
>>
>
> Yeah, actually setting parallel_safety information for subplan from
> corresponding is okay.
>

missed the word *path* in above sentence.

/corresponding/corresponding path

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Tue, Jan 24, 2017 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> During debugging I found that subplan created for below part of the
>>> query is parallel_unsafe, Is it a problem or there is some explanation
>>> of why it's not parallel_safe,
>>
>> Okay, so basically we don't have any mechanism to perform parallel
>> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes)
>> must come along with CTE scan. So I think we can avoid setting below
>> code because we will never be able to test its side effect, another
>> argument can be that if we don't consider the final effect, and just
>> see this subplan then by logic it should be marked parallel-safe or
>> unsafe as per it's path and it will not have any side effect, as it
>> will finally become parallel-unsafe. So it's your call to keep it
>> either way.
>>
>
> Yeah, actually setting parallel_safety information for subplan from
> corresponding is okay.  However, in this particular case as we know
> that it might not be of any use till we enable parallelism for CTE
> Scan (and doing that is certainly not essential for this project).
> So, I have set parallel_safe as false for CTE subplans.
>

Moved this patch to next CF.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Robert Haas
Дата:
On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Moved this patch to next CF.

So here is what seems to be the key hunk from this patch:
    /*
-     * Since we don't have the ability to push subplans down to workers at
-     * present, we treat subplan references as parallel-restricted.  We need
-     * not worry about examining their contents; if they are unsafe, we would
-     * have found that out while examining the whole tree before reduction of
-     * sublinks to subplans.  (Really we should not see SubLink during a
-     * max_interesting == restricted scan, but if we do, return true.)
+     * We can push the subplans only if they don't contain any parallel-aware
+     * node as we don't support multi-level parallelism (parallel workers
+     * invoking another set of parallel workers).     */
-    else if (IsA(node, SubLink) ||
-             IsA(node, SubPlan) ||
-             IsA(node, AlternativeSubPlan))
+    else if (IsA(node, SubPlan))
+        return !((SubPlan *) node)->parallel_safe;
+    else if (IsA(node, AlternativeSubPlan))    {
-        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
-            return true;
+        AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
+        ListCell   *lc;
+
+        foreach(lc, asplan->subplans)
+        {
+            SubPlan    *splan = (SubPlan *) lfirst(lc);
+
+            Assert(IsA(splan, SubPlan));
+
+            if (max_parallel_hazard_walker((Node *) splan, context))
+                return true;
+        }
+
+        return false;    }

I don't see the reason for having this new code that makes
AlternativeSubPlan walk over the subplans; expression_tree_walker
already does that.  On the flip side I don't see the reason for
removing the max_parallel_hazard_test() call for AlternativeSubPlan or
for SubLink.  I think that the core of this change is to change the
handling for SubPlan nodes to just return the flag from the subplan,
and that seems fine, but the rest of this I'm skeptical about.

+                * CTE scans are not consider for parallelism (cf


considered

+       select count(*)from tenk1 where (two, four) not in

whitespace

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Wed, Feb 1, 2017 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Moved this patch to next CF.
>
> So here is what seems to be the key hunk from this patch:
>
>      /*
> -     * Since we don't have the ability to push subplans down to workers at
> -     * present, we treat subplan references as parallel-restricted.  We need
> -     * not worry about examining their contents; if they are unsafe, we would
> -     * have found that out while examining the whole tree before reduction of
> -     * sublinks to subplans.  (Really we should not see SubLink during a
> -     * max_interesting == restricted scan, but if we do, return true.)
> +     * We can push the subplans only if they don't contain any parallel-aware
> +     * node as we don't support multi-level parallelism (parallel workers
> +     * invoking another set of parallel workers).
>       */
> -    else if (IsA(node, SubLink) ||
> -             IsA(node, SubPlan) ||
> -             IsA(node, AlternativeSubPlan))
> +    else if (IsA(node, SubPlan))
> +        return !((SubPlan *) node)->parallel_safe;
> +    else if (IsA(node, AlternativeSubPlan))
>      {
> -        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> -            return true;
> +        AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
> +        ListCell   *lc;
> +
> +        foreach(lc, asplan->subplans)
> +        {
> +            SubPlan    *splan = (SubPlan *) lfirst(lc);
> +
> +            Assert(IsA(splan, SubPlan));
> +
> +            if (max_parallel_hazard_walker((Node *) splan, context))
> +                return true;
> +        }
> +
> +        return false;
>      }
>
> I don't see the reason for having this new code that makes
> AlternativeSubPlan walk over the subplans; expression_tree_walker
> already does that.
>

It is done this way mainly to avoid recursion/nested calls, but I
think you prefer to handle it via expression_tree_walker so that code
looks clean.  Another probable reason could be that there is always a
chance that we miss handling of some expression of a particular node
(in this case AlternativeSubPlan), but if that is the case then there
are other places in the code which do operate on individual subplan/'s
in AlternativeSubPlan list.

>  On the flip side I don't see the reason for
> removing the max_parallel_hazard_test() call for AlternativeSubPlan or
> for SubLink.

If we don't remove the current test of max_parallel_hazard_test() for
AlternativeSubPlan, then AlternativeSubPlan node will be considered
parallel restricted, so why do we want that after this patch.  For
Sublinks, I think they would have converted to Subplans by the time we
reach this function for the parallel restricted check.  Can you
elaborate what you have in mind for the handling of AlternativeSubPlan
and SubLink?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Thu, Feb 2, 2017 at 9:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Moved this patch to next CF.
>>
>> So here is what seems to be the key hunk from this patch:
>>
>>      /*
>> -     * Since we don't have the ability to push subplans down to workers at
>> -     * present, we treat subplan references as parallel-restricted.  We need
>> -     * not worry about examining their contents; if they are unsafe, we would
>> -     * have found that out while examining the whole tree before reduction of
>> -     * sublinks to subplans.  (Really we should not see SubLink during a
>> -     * max_interesting == restricted scan, but if we do, return true.)
>> +     * We can push the subplans only if they don't contain any parallel-aware
>> +     * node as we don't support multi-level parallelism (parallel workers
>> +     * invoking another set of parallel workers).
>>       */
>> -    else if (IsA(node, SubLink) ||
>> -             IsA(node, SubPlan) ||
>> -             IsA(node, AlternativeSubPlan))
>> +    else if (IsA(node, SubPlan))
>> +        return !((SubPlan *) node)->parallel_safe;
>> +    else if (IsA(node, AlternativeSubPlan))
>>      {
>> -        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
>> -            return true;
>> +        AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
>> +        ListCell   *lc;
>> +
>> +        foreach(lc, asplan->subplans)
>> +        {
>> +            SubPlan    *splan = (SubPlan *) lfirst(lc);
>> +
>> +            Assert(IsA(splan, SubPlan));
>> +
>> +            if (max_parallel_hazard_walker((Node *) splan, context))
>> +                return true;
>> +        }
>> +
>> +        return false;
>>      }
>>
>> I don't see the reason for having this new code that makes
>> AlternativeSubPlan walk over the subplans; expression_tree_walker
>> already does that.
>>

I have removed the check of AlternativeSubPlan so that it can be
handled by expression_tree_walker.

>
> It is done this way mainly to avoid recursion/nested calls, but I
> think you prefer to handle it via expression_tree_walker so that code
> looks clean.  Another probable reason could be that there is always a
> chance that we miss handling of some expression of a particular node
> (in this case AlternativeSubPlan), but if that is the case then there
> are other places in the code which do operate on individual subplan/'s
> in AlternativeSubPlan list.
>
>>  On the flip side I don't see the reason for
>> removing the max_parallel_hazard_test() call for AlternativeSubPlan or
>> for SubLink.
>
> If we don't remove the current test of max_parallel_hazard_test() for
> AlternativeSubPlan, then AlternativeSubPlan node will be considered
> parallel restricted, so why do we want that after this patch.  For
> Sublinks, I think they would have converted to Subplans by the time we
> reach this function for the parallel restricted check.  Can you
> elaborate what you have in mind for the handling of AlternativeSubPlan
> and SubLink?
>

I have removed the changes for SubLink node.

>> +                * CTE scans are not consider for parallelism (cf
>>
>>
>> considered
>>

Fixed.

>> +       select count(*)from tenk1 where (two, four) not in
>>
>> whitespace

Fixed.


Attached patch fixes all the comments raised till now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Tue, Jan 3, 2017 at 4:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Now, we can further extend this to parallelize queries containing
>> correlated subplans like below:
>>
>> explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i);
>>                          QUERY PLAN
>> -------------------------------------------------------------
>>  Seq Scan on t1  (cost=0.00..831049.09 rows=8395 width=12)
>>    Filter: (SubPlan 1)
>>    SubPlan 1
>>      ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
>>            Filter: (i = t1.i)
>> (5 rows)
>>
>> In the above query, Filter on t2 (i = t1.i) generates Param node which
>> is a parallel-restricted node, so such queries won't be able to use
>> parallelism even with the patch.  I think we can mark Params which
>> refer to same level as parallel-safe and I think we have this
>> information (node-> varlevelsup/ phlevelsup/ agglevelsup) available
>> when we replace correlation vars (SS_replace_correlation_vars).
>>
>
> I have implemented the above idea which will allow same or immediate
> outer level PARAMS as parallel_safe.  The results of above query after
> patch:
>
> postgres=# explain select * from t1 where t1.i in (select t2.i from t2
> where t2.i=t1.i);
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Gather  (cost=0.00..488889.88 rows=8395 width=12)
>    Workers Planned: 1
>    ->  Parallel Seq Scan on t1  (cost=0.00..488889.88 rows=4938 width=12)
>          Filter: (SubPlan 1)
>          SubPlan 1
>            ->  Seq Scan on t2  (cost=0.00..97.73 rows=493 width=4)
>                  Filter: (i = t1.i)
> (7 rows)
>

On further evaluation, it seems this patch has one big problem which
is that it will allow forming parallel plans which can't be supported
with current infrastructure.  For ex. marking immediate level params
as parallel safe can generate below type of plan:

Seq Scan on t1  Filter: (SubPlan 1)  SubPlan 1    ->  Gather          Workers Planned: 1          ->  Result
   One-Time Filter: (t1.k = 0)                ->  Parallel Seq Scan on t2
 


In this plan, we can't evaluate one-time filter (that contains
correlated param) unless we have the capability to pass all kind of
PARAM_EXEC param to workers.   I don't want to invest too much time in
this patch unless somebody can see some way using current parallel
infrastructure to implement correlated subplans.

Note that still, the other patch [1] in this thread which implements
parallelism for uncorrelated subplan holds good.


[1] - https://www.postgresql.org/message-id/CAA4eK1J9mDZLcp-OskkdzAf_yT8W4dBSGL9E%3DkoEiJkdpVZsEA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Robert Haas
Дата:
On Tue, Feb 14, 2017 at 4:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On further evaluation, it seems this patch has one big problem which
> is that it will allow forming parallel plans which can't be supported
> with current infrastructure.  For ex. marking immediate level params
> as parallel safe can generate below type of plan:
>
> Seq Scan on t1
>    Filter: (SubPlan 1)
>    SubPlan 1
>      ->  Gather
>            Workers Planned: 1
>            ->  Result
>                  One-Time Filter: (t1.k = 0)
>                  ->  Parallel Seq Scan on t2
>
>
> In this plan, we can't evaluate one-time filter (that contains
> correlated param) unless we have the capability to pass all kind of
> PARAM_EXEC param to workers.   I don't want to invest too much time in
> this patch unless somebody can see some way using current parallel
> infrastructure to implement correlated subplans.

I don't think this approach has much chance of working; it just seems
too simplistic.  I'm not entirely sure what the right approach is.
Unfortunately, the current query planner code seems to compute the
sets of parameters that are set and used quite late, and really only
on a per-subquery level.  Here we need to know whether there is
anything that's set below the Gather node and used above it, or the
other way around, and we need to know it much earlier, while we're
still doing path generation.  There doesn't seem to be any simple way
of getting that information, but I think you need it.

What's more, I think you would still need it even if you had the
ability to pass parameter values between processes.  For example,
consider:

Gather
-> Parallel Seq Scan Filter: (Correlated Subplan Reference Goes Here)

Of course, the Param in the filter condition *can't* be a shared Param
across all processes.  It needs to be private to each process
participating in the parallel sequential scan -- and the params
passing data down from the Parallel Seq Scan to the correlated subplan
also need to be private.  On the other hand, in your example quoted
above, you do need to share across processes: the value for t1.k needs
to get passed down.  So it seems to me that we somehow need to
identify, for each parameter that gets used, whether it's provided by
something beneath the Gather node (in which case it should be private
to the worker) or whether it's provided from higher up (in which case
it should be passed down to the worker, or if we can't do that, then
don't use parallelism there).

(There's also possible a couple of other cases, like an initPlan that
needs to get executed only once, and also maybe a case where a
parameter is set below the Gather and later used above the Gather.
Not sure if that latter one happen, or how to deal with it.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallelize queries containing subplans

От
Robert Haas
Дата:
On Mon, Feb 13, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have removed the check of AlternativeSubPlan so that it can be
> handled by expression_tree_walker.
...
> Attached patch fixes all the comments raised till now.

Committed after removing the reference to AlternativeSubPlan from the
comment.  I didn't see any need to mention that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Wed, Feb 15, 2017 at 4:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have removed the check of AlternativeSubPlan so that it can be
>> handled by expression_tree_walker.
> ...
>> Attached patch fixes all the comments raised till now.
>
> Committed after removing the reference to AlternativeSubPlan from the
> comment.  I didn't see any need to mention that.
>

Okay, I was also of two minds while adding that line in the comment.
I had kept it because there are few places in the code where both
SubPlan and AlternativeSubPlan are handled together, so I thought the
person looking at this code should not wonder why we have not handled
AlternativeSubPlan. However, I think it is okay to remove it from
comment as there exist other places where only Subplan is handled and
we rely on expression tree walker for AlternativeSubPlans.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing subplans

От
Amit Kapila
Дата:
On Wed, Feb 15, 2017 at 4:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 4:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On further evaluation, it seems this patch has one big problem which
>> is that it will allow forming parallel plans which can't be supported
>> with current infrastructure.  For ex. marking immediate level params
>> as parallel safe can generate below type of plan:
>>
>> Seq Scan on t1
>>    Filter: (SubPlan 1)
>>    SubPlan 1
>>      ->  Gather
>>            Workers Planned: 1
>>            ->  Result
>>                  One-Time Filter: (t1.k = 0)
>>                  ->  Parallel Seq Scan on t2
>>
>>
>> In this plan, we can't evaluate one-time filter (that contains
>> correlated param) unless we have the capability to pass all kind of
>> PARAM_EXEC param to workers.   I don't want to invest too much time in
>> this patch unless somebody can see some way using current parallel
>> infrastructure to implement correlated subplans.
>
> I don't think this approach has much chance of working; it just seems
> too simplistic.  I'm not entirely sure what the right approach is.
> Unfortunately, the current query planner code seems to compute the
> sets of parameters that are set and used quite late, and really only
> on a per-subquery level.
>
Now just for the sake of discussion consider we have list of
allParams at the path level, then also I think it might not be easy to
make it work.

>  Here we need to know whether there is
> anything that's set below the Gather node and used above it, or the
> other way around, and we need to know it much earlier, while we're
> still doing path generation.
>

Yes, that is exactly the challenge.  I am not sure if currently there
is a way by which we can identify if a Param on a particular node
refers to node below it or above it.


> (There's also possible a couple of other cases, like an initPlan that
> needs to get executed only once, and also maybe a case where a
> parameter is set below the Gather and later used above the Gather.
> Not sure if that latter one happen, or how to deal with it.)
>

I think the case for initPlan is slightly different because we can
always evaluate it at Gather (considering it is an uncorrelated
initplan) and then pass it to Workers.  We generally have a list of
all the params at each plan node, so we can identify which of these
are initPlan params and then evaluate them. Now, it can be used
irrespective of whether it is used above or below the Gather node.
For the cases, where it can be used above Gather node, it will work as
we always store the computed value of such params in estate/econtext
and for the cases when it has to be used below Gather, we need to pass
the computed value to workers.  Now, there is some exceptions like for
few cases not all the params are available at a particular node, but I
feel those can be handled easily by either traversing the planstate
tree or by actually storing them at Gather node.  Actually, in short,
this is what is done in the patch proposed for parallizing initplans
[1].


[1] - https://commitfest.postgresql.org/13/997/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com