Discussion: How to get explain plan to prefer Hash Join


How to get explain plan to prefer Hash Join

From
atxcanadian
Date:
Currently seeing massive increase in performance when optimizer chooses Hash
Join over Nested Loops. I achieve this by temporarily setting nested loops
off. I'd like to setup some database variables where the optimizer prefers
hash joins. Any suggestions?
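The "temporarily setting nested loops off" workaround mentioned above can be scoped to a single session, which avoids affecting other queries (a minimal sketch; disabling planner options globally is generally discouraged):

```sql
-- Discourage nested loops for this session only; the planner can still
-- fall back to them if no other join strategy is possible.
SET enable_nestloop = off;

-- ... run the query here ...

-- Restore the default afterwards.
RESET enable_nestloop;
```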

*Query in question:*
explain analyze
select dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
dp.settlement_point, sum(dp.mw) dp_mw
from dp_hist_gen_actual dp
     Inner Join api_settlement_points sp on
         sp.settlement_point = dp.settlement_point and
         sp.settlement_point_rdfid =
'#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}' and
         sp.start_date <= '2015-01-01'::date and
         sp.end_date > '2015-01-01'::date and
         sp.rt_model = (select case when c.rt_model_loaded = 2 then true
else false end
                        from cim_calendar c where c.nodal_load <=
'2015-01-01'::date
                        order by c.cim desc limit 1)
where dp.market_day BETWEEN '2015-01-01'::date and
                            '2015-01-01'::date and
      dp.expiry_date is null
group by dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
dp.settlement_point;

*Nested Loop Explain Analyze Output:*
HashAggregate  (cost=58369.29..58369.30 rows=1 width=24) (actual
time=496287.249..496287.257 rows=24 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.145..0.145
rows=1 loops=1)
          ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual
time=0.145..0.145 rows=1 loops=1)
                Sort Key: c.cim
                Sort Method: top-N heapsort  Memory: 25kB
                ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193
width=9) (actual time=0.007..0.075 rows=192 loops=1)
                      Filter: (nodal_load <= '2015-01-01'::date)
                      Rows Removed by Filter: 36
  -> * Nested Loop  (cost=0.99..58360.98 rows=1 width=24) (actual
time=883.718..496287.058 rows=24 loops=1)*
        Join Filter: ((dp.settlement_point)::text =
(sp.settlement_point)::text)
        Rows Removed by Join Filter: 12312
        ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual
dp  (cost=0.56..2.78 rows=1 width=24) (actual time=0.020..20.012 rows=12336
loops=1)
              Index Cond: ((market_day >= '2015-01-01'::date) AND
(market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))
        ->  Index Scan using api_settlement_points_idx on
api_settlement_points sp  (cost=0.43..58358.05 rows=12 width=9) (actual
time=39.066..40.223 rows=1 loops=12336)
              Index Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
              Filter: ((settlement_point_rdfid)::text =
'#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
              Rows Removed by Filter: 5298
*Total runtime: 496287.325 ms*

*Hash Join Explain Analyze Output:*
HashAggregate  (cost=58369.21..58369.22 rows=1 width=24) (actual
time=50.835..50.843 rows=24 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.149..0.149
rows=1 loops=1)
          ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual
time=0.148..0.148 rows=1 loops=1)
                Sort Key: c.cim
                Sort Method: top-N heapsort  Memory: 25kB
                ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193
width=9) (actual time=0.009..0.082 rows=192 loops=1)
                      Filter: (nodal_load <= '2015-01-01'::date)
                      Rows Removed by Filter: 36
  ->  *Hash Join  (cost=3.23..58360.90 rows=1 width=24) (actual
time=49.644..50.811 rows=24 loops=1)*
        Hash Cond: ((sp.settlement_point)::text =
(dp.settlement_point)::text)
        ->  Index Scan using api_settlement_points_idx on
api_settlement_points sp  (cost=0.43..58358.05 rows=12 width=9) (actual
time=39.662..40.822 rows=1 loops=1)
              Index Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
              Filter: ((settlement_point_rdfid)::text =
'#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
              Rows Removed by Filter: 5298
        ->  Hash  (cost=2.78..2.78 rows=1 width=24) (actual
time=9.962..9.962 rows=12336 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 684kB
              ->  Index Scan using dp_hist_gen_actual_idx2 on
dp_hist_gen_actual dp  (cost=0.56..2.78 rows=1 width=24) (actual
time=0.023..5.962 rows=12336 loops=1)
                    Index Cond: ((market_day >= '2015-01-01'::date) AND
(market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))
*Total runtime: 50.906 ms*




--
View this message in context: http://postgresql.nabble.com/How-to-get-explain-plan-to-prefer-Hash-Join-tp5841450.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: How to get explain plan to prefer Hash Join

From
"Nicholson, Brad (Toronto, ON, CA)"
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of atxcanadian
> Sent: Wednesday, March 11, 2015 1:01 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] How to get explain plan to prefer Hash Join
>
> Currently seeing massive increase in performance when optimizer chooses
> Hash Join over Nested Loops. I achieve this by temporarily setting nested loops
> off. I'd like to setup some database variables where the optimizer prefers hash
> joins. Any suggestions?

Try making small adjustments to either random_page_cost or cpu_tuple_cost.  They can influence the planner's choice
here. I have solved similar issues in the past by adjusting one or the other.  Be aware though that those changes can
have negative effects in other places, so be sure to test.
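For example (a sketch; the right values depend on your hardware and need testing against your own workload, and the database name below is a placeholder):

```sql
-- Experiment at the session level first.
SET random_page_cost = 2.0;
-- ... re-run EXPLAIN ANALYZE on the query and compare plans ...

-- If the new setting helps overall, persist it per database
-- ('mydb' is a hypothetical name):
ALTER DATABASE mydb SET random_page_cost = 2.0;
```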

Brad.


Re: How to get explain plan to prefer Hash Join

From
Jeff Janes
Date:
On Wed, Mar 11, 2015 at 10:01 AM, atxcanadian <matthew.boyda@gmail.com> wrote:
Currently seeing massive increase in performance when optimizer chooses Hash
Join over Nested Loops. I achieve this by temporarily setting nested loops
off. I'd like to setup some database variables where the optimizer prefers
hash joins. Any suggestions?



        ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual
dp  (cost=0.56..2.78 rows=1 width=24) (actual time=0.020..20.012 rows=12336
loops=1)

Here it thinks it will find 1 row, but actually finds 12336.  That is not conducive to good plans.  Has the table been analyzed recently?

Index Cond: ((market_day >= '2015-01-01'::date) AND (market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))

If you query just this one table with just these criteria, what do you get for the row estimates and actual rows, with and without the IS NULL condition?
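Something along these lines would show the check being described (a sketch of the single-table test, using the predicates from the original query):

```sql
-- Estimate vs. actual with the IS NULL condition:
EXPLAIN ANALYZE
SELECT * FROM dp_hist_gen_actual
WHERE market_day BETWEEN '2015-01-01'::date AND '2015-01-01'::date
  AND expiry_date IS NULL;

-- And without it, to isolate its effect on the row estimate:
EXPLAIN ANALYZE
SELECT * FROM dp_hist_gen_actual
WHERE market_day BETWEEN '2015-01-01'::date AND '2015-01-01'::date;
```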

Cheers,

Jeff

Re: How to get explain plan to prefer Hash Join

From
atxcanadian
Date:
So I implemented two changes.

- Moved random_page_cost from 1.1 to 2.0
- Manually ran analyze on all the tables

*Here is the new explain analyze:*
QUERY PLAN
HashAggregate  (cost=74122.97..74125.53 rows=256 width=24) (actual
time=45.205..45.211 rows=24 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.152..0.152
rows=1 loops=1)
          ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual
time=0.150..0.150 rows=1 loops=1)
                Sort Key: c.cim
                Sort Method: top-N heapsort  Memory: 25kB
                ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193
width=9) (actual time=0.008..0.085 rows=192 loops=1)
                      Filter: (nodal_load <= '2015-01-01'::date)
                      Rows Removed by Filter: 36
  ->  Nested Loop  (cost=22623.47..74111.47 rows=256 width=24) (actual
time=43.798..45.181 rows=24 loops=1)
        ->  Bitmap Heap Scan on api_settlement_points sp
(cost=22622.91..67425.92 rows=12 width=9) (actual time=43.756..43.823 rows=1
loops=1)
              Recheck Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
              Filter: ((settlement_point_rdfid)::text =
'#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
              Rows Removed by Filter: 5298
              ->  Bitmap Index Scan on api_settlement_points_idx
(cost=0.00..22622.90 rows=72134 width=0) (actual time=42.998..42.998
rows=5299 loops=1)
                    Index Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
        ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual
dp  (cost=0.56..556.88 rows=25 width=24) (actual time=0.033..1.333 rows=24
loops=1)
              Index Cond: ((market_day >= '2015-01-01'::date) AND
(market_day <= '2015-01-01'::date) AND (expiry_date IS NULL) AND
((settlement_point)::text = (sp.settlement_point)::text))
Total runtime: 45.278 ms

I'm a little perplexed why the autovacuum wasn't keeping up. Any
recommendations for those settings to push it to do a bit more analyzing of
the tables??
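One common approach is lowering the per-table analyze scale factor so autovacuum re-analyzes the table after a smaller fraction of it changes (a sketch; the thresholds below are illustrative and need tuning for the table's insert/update rate):

```sql
-- Default autovacuum_analyze_scale_factor is 0.1, i.e. roughly 10% of
-- the table must change before an automatic ANALYZE is triggered.
ALTER TABLE dp_hist_gen_actual
  SET (autovacuum_analyze_scale_factor = 0.02,
       autovacuum_analyze_threshold = 1000);
```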





Re: How to get explain plan to prefer Hash Join

From
Pavel Stehule
Date:


2015-03-12 1:35 GMT+01:00 atxcanadian <matthew.boyda@gmail.com>:
So I implemented two changes.

- Moved random_page_cost from 1.1 to 2.0

A random_page_cost of 1 can push the planner toward nested loops, since index access looks very cheap with it.
 
- Manually ran analyze on all the tables


Re: How to get explain plan to prefer Hash Join

From
atxcanadian
Date:
Isn't a random_page_cost of 1 a little aggressive?

We are currently setup on Amazon SSD with software RAID 5.





Re: How to get explain plan to prefer Hash Join

From
Jeff Janes
Date:
On Wed, Mar 11, 2015 at 5:35 PM, atxcanadian <matthew.boyda@gmail.com> wrote:

I'm a little perplexed why the autovacuum wasn't keeping up. Any
recommendations for those settings to push it to do a bit more analyzing of
the tables??

What does pg_stat_user_tables show for that table?
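A query along these lines would pull the relevant columns for that table (the n_mod_since_analyze column assumes PostgreSQL 9.4 or later):

```sql
SELECT relname, n_live_tup, n_dead_tup, n_mod_since_analyze,
       last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'dp_hist_gen_actual';
```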


Re: How to get explain plan to prefer Hash Join

From
atxcanadian
Date:
Here is the output:

<http://postgresql.nabble.com/file/n5841610/pg_stat_user_table.jpg>

This is after I manually ran ANALYZE.





Re: How to get explain plan to prefer Hash Join

From
Jeff Janes
Date:
On Thu, Mar 12, 2015 at 8:59 AM, atxcanadian <matthew.boyda@gmail.com> wrote:
Here is the output:

<http://postgresql.nabble.com/file/n5841610/pg_stat_user_table.jpg>

This is after I manually ran ANALYZE.

The "last_*" columns are only showing times, and not full timestamps.  Does your reporting tool drop the date part of a timestamp when it is equal to today?  Or does it just drop the date part altogether regardless of what it is?

Cheers,

Jeff 

Re: How to get explain plan to prefer Hash Join

From
atxcanadian
Date:
Sorry about that, Excel clipped off the dates.

<http://postgresql.nabble.com/file/n5841633/pg_stat_user_table.jpg>


