Re: [HACKERS] parallelize queries containing subplans
От | Rafia Sabih |
---|---|
Тема | Re: [HACKERS] parallelize queries containing subplans |
Дата | |
Msg-id | CAOGQiiMELoeUyqM6RbXThyQXr3ozEFKaPixEibBokqacr5Razg@mail.gmail.com обсуждение исходный текст |
Ответ на | [HACKERS] parallelize queries containing subplans (Amit Kapila <amit.kapila16@gmail.com>) |
Список | pgsql-hackers |
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Currently, queries that have references to SubPlans or > AlternativeSubPlans are considered parallel-restricted. I think we > can lift this restriction in many cases especially when SubPlans are > parallel-safe. To make this work, we need to propagate the > parallel-safety information from path node to plan node and the same > could be easily done while creating a plan. Another option could be > that instead of propagating parallel-safety information from path to > plan, we can find out from the plan if it is parallel-safe (doesn't > contain any parallel-aware node) by traversing whole plan tree, but I > think it is a waste of cycles. Once we have parallel-safety > information in the plan, we can use that for detection of > parallel-safe expressions in max_parallel_hazard_walker(). Finally, > we can pass all the subplans to workers during plan serialization in > ExecSerializePlan(). This will enable workers to execute subplans > that are referred in parallel part of the plan. Now, we might be able > to optimize it such that we pass only subplans that are referred in > parallel portion of plan, but I am not sure if it is worth the trouble > because it is one-time cost and much lesser than other things we do > (like creating > dsm, launching workers). > > Attached patch implements the above idea. This will enable > parallelism for queries containing un-correlated subplans, an example > of which is as follows: > > set parallel_tuple_cost=0; > set parallel_setup_cost=0; > set min_parallel_relation_size=50; > > create table t1 (i int, j int, k int); > create table t2 (i int, j int, k int); > > insert into t1 values (generate_series(1,10)*random(), > generate_series(5,50)*random(), > generate_series(8,80)*random()); > insert into t2 values (generate_series(4,10)*random(), > generate_series(5,90)*random(), > generate_series(2,10)*random()); > > > Plan without Patch > ----------------------------- > postgres=# explain select * from t1 where t1.i not in (select t2.i > from t2 where t2.i in (1,2,3)); > QUERY PLAN > --------------------------------------------------------------- > Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12) > Filter: (NOT (hashed SubPlan 1)) > SubPlan 1 > -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) > Filter: (i = ANY ('{1,2,3}'::integer[])) > (5 rows) > > Plan with Patch > ------------------------ > postgres=# explain select * from t1 where t1.i not in (select t2.i > from t2 where t2.i in (1,2,3)); > QUERY PLAN > ------------------------------------------------------------------------- > Gather (cost=110.84..325.30 rows=8395 width=12) > Workers Planned: 1 > -> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12) > Filter: (NOT (hashed SubPlan 1)) > SubPlan 1 > -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) > Filter: (i = ANY ('{1,2,3}'::integer[])) > (7 rows) > > We have observed that Q-16 in TPC-H have been improved with the patch > and the analysis of same will be shared by my colleague Rafia. > To study the effect of uncorrelated sub-plan pushdown on TPC-H and TPC-DS benchmark queries we performed some experiments and the execution time results for same are summarised as follows, Query | HEAD | Patches | scale-factor -----------+---------+-----------+----------------- DS-Q45 | 35 | 19 | scale-factor: 100 H-Q16 | 812 | 645 | scale-factor: 300 H-Q16 | 49 | 37 | scale-factor: 20 Execution time given in above table is in seconds. Detailed analysis of this experimentation is as follows, Additional patches applied in this analysis are, Parallel index scan [1] Parallel index-only scan [2] Parallel merge-join [3] The system setup used for this experiment is, Server parameter settings: work_mem = 500 MB, max_parallel_workers_per_gather = 4, random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost, shared_buffers = 1 GB Machine used: IBM Power, 4 socket machine, 512 GB RAM TPC-DS scale factor = 100 (approx size of database is 150 GB) Query 45 which takes around 35 seconds on head, completes in 19 seconds with these patches. The point to note here is that without this patch of pushing uncorrelated sublans, hash join which is using subplan in join filter could not be pushed to workers and hence query was unable to use the parallelism enough, with this patch parallelism is available till the topmost join node. The output of explain analyse statement of this query on both head and with patches is given in attached file ds_q45.txt. On further evaluating these patches on TPC-H queries on different scale factors we came across query 16, for TPC-H scale factor 20 and aforementioned parameter settings with the change of max_parallel_workers_per gather = 2, it took 37 seconds with the patches and 49 seconds on head. Though the improvement in overall query performance is not appearing to be significantly high, yet the point to note here is that the time taken by join was around 26 seconds on head which reduced to 14 seconds with the patches. Overall benefit in performance is not high because sort node is dominating the execution time. The plan information of this query is given in attached file h_q16_20_2.txt. On increasing the scale factor to 300, setting work_mem to 1GB, increasing max_parallel_workers_per_gather = 6, and disabling the parallel sequential scan for supplier table by 'alter table supplier set (parallel_workers = 0)', Q16 completes in 645 seconds with patches, which was taking 812 seconds on head. We need to disable parallel_workers for supplier table because on higher worker count it was taking parallel seq scan and hence the scan node that is using subplan could not be parallelised. For this query both pushdown of subplans and parallel merge-join, without any one of these the benefits of parallelism might not be leveraged fully. The output of explain analyse for this query is given in h_q16_300.txt Overall, with pushdown of uncorrelated sub-plans to workers enables the parallelism in joins which was restricted before and hence good improvement in query performance can be witnessed. > Now, we can further extend this to parallelize queries containing > correlated subplans like below: > > explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); > QUERY PLAN > ------------------------------------------------------------- > Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12) > Filter: (SubPlan 1) > SubPlan 1 > -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) > Filter: (i = t1.i) > (5 rows) > As per my analysis this extension to correlated subplans is likely to improve the performance of following queries -- Q2 in TPC-H and Q6 in TPC-DS. > Yet, another useful enhancement in this area could be to consider both > parallel and non-parallel paths for subplans. As of now, we consider > the cheapest/best path and form subplan from it, but it is quite > possible that instead of choosing parallel path (in case it is > cheapest) at subplan level, the non-parallel path at subplan level > could be beneficial when upper plan can use parallelism. I think this > will be a separate project in itself if we want to do this and based > on my study of TPC-H and TPC-DS queries, I am confident that this will > be helpful in certain queries at higher scale factors. > I agree as then we do not need to disable parallelism for particular relations as we currently do for supplier relation in Q16 of TPC-H. [1] https://www.postgresql.org/message-id/CAA4eK1KthrAvNjmB2cWuUHz%2Bp3ZTTtbD7o2KUw49PC8HK4r1Wg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAOGQiiOv9NA1VrAo8PmENfGi-y%3DCj_DiTF4vyjp%2BfmuEzovwEA%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAFiTN-tdYpcsk7Lpv0HapcmvSnMN_TgKjC7RkmvVLZAF%2BXfmPg%40mail.gmail.com -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Craig RingerДата:
Сообщение: Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)