Discussion: How to get explain plan to prefer Hash Join
Currently seeing a massive increase in performance when the optimizer chooses a Hash Join over Nested Loops. I achieve this by temporarily setting nested loops off. I'd like to set up some database variables so that the optimizer prefers hash joins. Any suggestions?

*Query in question:*

    explain analyze
    select dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
           dp.settlement_point, sum(dp.mw) dp_mw
    from dp_hist_gen_actual dp
    inner join api_settlement_points sp
           on sp.settlement_point = dp.settlement_point
          and sp.settlement_point_rdfid = '#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'
          and sp.start_date <= '2015-01-01'::date
          and sp.end_date > '2015-01-01'::date
          and sp.rt_model = (select case when c.rt_model_loaded = 2
                                         then true else false end
                             from cim_calendar c
                             where c.nodal_load <= '2015-01-01'::date
                             order by c.cim desc
                             limit 1)
    where dp.market_day between '2015-01-01'::date and '2015-01-01'::date
      and dp.expiry_date is null
    group by dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
             dp.settlement_point;

*Nested Loop explain analyze output:*

    HashAggregate  (cost=58369.29..58369.30 rows=1 width=24) (actual time=496287.249..496287.257 rows=24 loops=1)
      InitPlan 1 (returns $0)
        ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.145..0.145 rows=1 loops=1)
              ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual time=0.145..0.145 rows=1 loops=1)
                    Sort Key: c.cim
                    Sort Method: top-N heapsort  Memory: 25kB
                    ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193 width=9) (actual time=0.007..0.075 rows=192 loops=1)
                          Filter: (nodal_load <= '2015-01-01'::date)
                          Rows Removed by Filter: 36
      ->  Nested Loop  (cost=0.99..58360.98 rows=1 width=24) (actual time=883.718..496287.058 rows=24 loops=1)
            Join Filter: ((dp.settlement_point)::text = (sp.settlement_point)::text)
            Rows Removed by Join Filter: 12312
            ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual dp  (cost=0.56..2.78 rows=1 width=24) (actual time=0.020..20.012 rows=12336 loops=1)
                  Index Cond: ((market_day >= '2015-01-01'::date) AND (market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))
            ->  Index Scan using api_settlement_points_idx on api_settlement_points sp  (cost=0.43..58358.05 rows=12 width=9) (actual time=39.066..40.223 rows=1 loops=12336)
                  Index Cond: ((rt_model = $0) AND (start_date <= '2015-01-01'::date) AND (end_date > '2015-01-01'::date))
                  Filter: ((settlement_point_rdfid)::text = '#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
                  Rows Removed by Filter: 5298
    *Total runtime: 496287.325 ms*

*Hash Join explain analyze output:*

    HashAggregate  (cost=58369.21..58369.22 rows=1 width=24) (actual time=50.835..50.843 rows=24 loops=1)
      InitPlan 1 (returns $0)
        ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.149..0.149 rows=1 loops=1)
              ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual time=0.148..0.148 rows=1 loops=1)
                    Sort Key: c.cim
                    Sort Method: top-N heapsort  Memory: 25kB
                    ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193 width=9) (actual time=0.009..0.082 rows=192 loops=1)
                          Filter: (nodal_load <= '2015-01-01'::date)
                          Rows Removed by Filter: 36
      ->  Hash Join  (cost=3.23..58360.90 rows=1 width=24) (actual time=49.644..50.811 rows=24 loops=1)
            Hash Cond: ((sp.settlement_point)::text = (dp.settlement_point)::text)
            ->  Index Scan using api_settlement_points_idx on api_settlement_points sp  (cost=0.43..58358.05 rows=12 width=9) (actual time=39.662..40.822 rows=1 loops=1)
                  Index Cond: ((rt_model = $0) AND (start_date <= '2015-01-01'::date) AND (end_date > '2015-01-01'::date))
                  Filter: ((settlement_point_rdfid)::text = '#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
                  Rows Removed by Filter: 5298
            ->  Hash  (cost=2.78..2.78 rows=1 width=24) (actual time=9.962..9.962 rows=12336 loops=1)
                  Buckets: 1024  Batches: 1  Memory Usage: 684kB
                  ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual dp  (cost=0.56..2.78 rows=1 width=24) (actual time=0.023..5.962 rows=12336 loops=1)
                        Index Cond: ((market_day >= '2015-01-01'::date) AND (market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))
    *Total runtime: 50.906 ms*

--
View this message in context: http://postgresql.nabble.com/How-to-get-explain-plan-to-prefer-Hash-Join-tp5841450.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
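For reference, the "temporarily setting nested loops off" workaround mentioned above can be scoped to a single transaction so it does not leak into the rest of the session; a minimal sketch, using the standard `enable_nestloop` planner setting:

```sql
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK, so only this
-- transaction sees the override; other queries keep the default planner.
begin;
set local enable_nestloop = off;  -- planner falls back to hash/merge joins

explain analyze
select dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
       dp.settlement_point, sum(dp.mw) dp_mw
from dp_hist_gen_actual dp
inner join api_settlement_points sp
       on sp.settlement_point = dp.settlement_point
-- ... remaining join and filter conditions as posted above ...
group by dp.market_day, dp.hour_ending, dp.repeated_hour_flag,
         dp.settlement_point;

commit;
```

Note that `enable_nestloop = off` does not forbid nested loops outright; it adds a large cost penalty, so the planner will still use one if no other join method is possible.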
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of atxcanadian
> Sent: Wednesday, March 11, 2015 1:01 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] How to get explain plan to prefer Hash Join
>
> Currently seeing massive increase in performance when optimizer chooses
> Hash Join over Nested Loops. I achieve this by temporarily setting nested loops
> off. I'd like to setup some database variables where the optimizer prefers hash
> joins. Any suggestions?

Try making small adjustments to either random_page_cost or cpu_tuple_cost. They can influence the planner's choice here. I have solved similar issues in the past by adjusting one or the other. Be aware, though, that those changes can have negative effects in other places, so be sure to test.

Brad.
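A sketch of the cost knobs Brad mentions; the numeric values are illustrative experiments (not recommendations from this thread), and the database name in the last statement is hypothetical:

```sql
-- Session-level first, so nothing persists while experimenting:
set random_page_cost = 2.0;  -- default 4.0; governs how expensive a
                             -- non-sequential page fetch looks, which
                             -- drives the cost of repeated index probes
set cpu_tuple_cost = 0.02;   -- default 0.01; raising it penalizes plans
                             -- that touch many tuples

-- Once a value tests well, persist it for one database
-- (database name "mydb" is a placeholder):
alter database mydb set random_page_cost = 2.0;
```

Both settings shift the entire cost model, not just join choice, so a value that fixes one query can regress others; hence the advice to test broadly.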
On Wed, Mar 11, 2015 at 10:01 AM, atxcanadian <matthew.boyda@gmail.com> wrote:
Currently seeing massive increase in performance when optimizer chooses Hash
Join over Nested Loops. I achieve this by temporarily setting nested loops
off. I'd like to setup some database variables where the optimizer prefers
hash joins. Any suggestions?
-> Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual
dp (cost=0.56..2.78 rows=1 width=24) (actual time=0.020..20.012 rows=12336
loops=1)
Here it thinks it will find 1 row, but actually finds 12336. That is not conducive to good plans. Has the table been analyzed recently?
Index Cond: ((market_day >= '2015-01-01'::date) AND (market_day <= '2015-01-01'::date) AND (expiry_date IS NULL))
If you query just this one table with just these criteria, what do you get for the row estimates and actual rows, with and without the IS NULL condition?
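Jeff's suggestion amounts to isolating the base-table estimate from the join. A sketch of the two diagnostic queries, using only the columns visible in the plan above:

```sql
-- With the IS NULL condition (matches the index scan in the plan):
explain analyze
select *
from dp_hist_gen_actual
where market_day between '2015-01-01'::date and '2015-01-01'::date
  and expiry_date is null;

-- Without it, to see which predicate is skewing the estimate:
explain analyze
select *
from dp_hist_gen_actual
where market_day between '2015-01-01'::date and '2015-01-01'::date;
```

Comparing the estimated `rows=` against the actual `rows=` in each output shows whether the date range, the null test, or their combination is responsible for the 1-versus-12336 discrepancy.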
Cheers,
Jeff
So I implemented two changes:
- Moved random_page_cost from 1.1 to 2.0
- Manually ran analyze on all the tables

*Here is the new explain analyze:*

    QUERY PLAN
    HashAggregate  (cost=74122.97..74125.53 rows=256 width=24) (actual time=45.205..45.211 rows=24 loops=1)
      InitPlan 1 (returns $0)
        ->  Limit  (cost=8.30..8.30 rows=1 width=9) (actual time=0.152..0.152 rows=1 loops=1)
              ->  Sort  (cost=8.30..8.78 rows=193 width=9) (actual time=0.150..0.150 rows=1 loops=1)
                    Sort Key: c.cim
                    Sort Method: top-N heapsort  Memory: 25kB
                    ->  Seq Scan on cim_calendar c  (cost=0.00..7.33 rows=193 width=9) (actual time=0.008..0.085 rows=192 loops=1)
                          Filter: (nodal_load <= '2015-01-01'::date)
                          Rows Removed by Filter: 36
      ->  Nested Loop  (cost=22623.47..74111.47 rows=256 width=24) (actual time=43.798..45.181 rows=24 loops=1)
            ->  Bitmap Heap Scan on api_settlement_points sp  (cost=22622.91..67425.92 rows=12 width=9) (actual time=43.756..43.823 rows=1 loops=1)
                  Recheck Cond: ((rt_model = $0) AND (start_date <= '2015-01-01'::date) AND (end_date > '2015-01-01'::date))
                  Filter: ((settlement_point_rdfid)::text = '#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
                  Rows Removed by Filter: 5298
                  ->  Bitmap Index Scan on api_settlement_points_idx  (cost=0.00..22622.90 rows=72134 width=0) (actual time=42.998..42.998 rows=5299 loops=1)
                        Index Cond: ((rt_model = $0) AND (start_date <= '2015-01-01'::date) AND (end_date > '2015-01-01'::date))
            ->  Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual dp  (cost=0.56..556.88 rows=25 width=24) (actual time=0.033..1.333 rows=24 loops=1)
                  Index Cond: ((market_day >= '2015-01-01'::date) AND (market_day <= '2015-01-01'::date) AND (expiry_date IS NULL) AND ((settlement_point)::text = (sp.settlement_point)::text))
    Total runtime: 45.278 ms

I'm a little perplexed why autovacuum wasn't keeping up. Any recommendations for those settings to push it to do a bit more analyzing of the tables?
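By default, autoanalyze fires only after roughly 10% of a table changes, which can lag badly on a large table. A sketch of tightening that per table; the threshold values here are illustrative assumptions, not numbers suggested in this thread:

```sql
-- Analyze this table after 500 changed rows plus 1% of the table,
-- instead of the global default of 50 rows plus 10% of the table.
alter table dp_hist_gen_actual set (
    autovacuum_analyze_threshold    = 500,
    autovacuum_analyze_scale_factor = 0.01
);
```

Per-table storage parameters like these avoid lowering the global `autovacuum_analyze_scale_factor`, which would increase analyze traffic across every table in the cluster.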
2015-03-12 1:35 GMT+01:00 atxcanadian <matthew.boyda@gmail.com>:
So I implemented two changes.
- Moved random_page_cost from 1.1 to 2.0
A random_page_cost near 1 can push the planner toward nested loops - repeated index probes look very cheap with it
- Manually ran analyze on all the tables
*Here is the new explain analyze:*
QUERY PLAN
HashAggregate (cost=74122.97..74125.53 rows=256 width=24) (actual
time=45.205..45.211 rows=24 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=8.30..8.30 rows=1 width=9) (actual time=0.152..0.152
rows=1 loops=1)
-> Sort (cost=8.30..8.78 rows=193 width=9) (actual
time=0.150..0.150 rows=1 loops=1)
Sort Key: c.cim
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on cim_calendar c (cost=0.00..7.33 rows=193
width=9) (actual time=0.008..0.085 rows=192 loops=1)
Filter: (nodal_load <= '2015-01-01'::date)
Rows Removed by Filter: 36
-> Nested Loop (cost=22623.47..74111.47 rows=256 width=24) (actual
time=43.798..45.181 rows=24 loops=1)
-> Bitmap Heap Scan on api_settlement_points sp
(cost=22622.91..67425.92 rows=12 width=9) (actual time=43.756..43.823 rows=1
loops=1)
Recheck Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
Filter: ((settlement_point_rdfid)::text =
'#_{09F3A628-3B9D-481A-AC90-72AF8EAB64CA}'::text)
Rows Removed by Filter: 5298
-> Bitmap Index Scan on api_settlement_points_idx
(cost=0.00..22622.90 rows=72134 width=0) (actual time=42.998..42.998
rows=5299 loops=1)
Index Cond: ((rt_model = $0) AND (start_date <=
'2015-01-01'::date) AND (end_date > '2015-01-01'::date))
-> Index Scan using dp_hist_gen_actual_idx2 on dp_hist_gen_actual
dp (cost=0.56..556.88 rows=25 width=24) (actual time=0.033..1.333 rows=24
loops=1)
Index Cond: ((market_day >= '2015-01-01'::date) AND
(market_day <= '2015-01-01'::date) AND (expiry_date IS NULL) AND
((settlement_point)::text = (sp.settlement_point)::text))
Total runtime: 45.278 ms
I'm a little perplexed why the autovacuum wasn't keeping up. Any
recommendations for those settings to push it to do a bit more analyzing of
the tables??
Isn't a random_page_cost of 1 a little aggressive? We are currently set up on Amazon SSD with software RAID 5.
On Wed, Mar 11, 2015 at 5:35 PM, atxcanadian <matthew.boyda@gmail.com> wrote:
I'm a little perplexed why the autovacuum wasn't keeping up. Any
recommendations for those settings to push it to do a bit more analyzing of
the tables??
What does pg_stat_user_tables show for that table?
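For reference, a query along the lines of what Jeff is asking for, using standard `pg_stat_user_tables` columns:

```sql
-- Per-table maintenance statistics for the problem table:
select relname,
       n_live_tup, n_dead_tup,          -- rough size and churn
       last_analyze, last_autoanalyze,  -- when statistics were last refreshed
       last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 'dp_hist_gen_actual';
```

If `last_autoanalyze` is old (or null) while `n_dead_tup` and the live row count have grown substantially, autoanalyze is not keeping pace with the write load on that table.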
Here is the output:
<http://postgresql.nabble.com/file/n5841610/pg_stat_user_table.jpg>
This is after I've manually run analyze.
On Thu, Mar 12, 2015 at 8:59 AM, atxcanadian <matthew.boyda@gmail.com> wrote:
Here is the output:
<http://postgresql.nabble.com/file/n5841610/pg_stat_user_table.jpg>
This is after I've manually ran analyze.
The "last_*" columns are only showing times, and not full timestamps. Does your reporting tool drop the date part of a timestamp when it is equal to today? Or does it just drop the date part altogether regardless of what it is?
Cheers,
Jeff
Sorry about that, Excel clipped off the dates.
<http://postgresql.nabble.com/file/n5841633/pg_stat_user_table.jpg>