Обсуждение: same query different execution plan (hash join vs. semi-hash join)

Поиск
Список
Период
Сортировка

same query different execution plan (hash join vs. semi-hash join)

От
"Huang, Suya"
Дата:

Hi buddies,

 

I’ve got a query as below, it runs several times with different execution plan and totally different execution time. The one using hash-join is slow and the one using semi-hash join is very fast. However, I have no control over the optimizer behavior of PostgreSQL database. Or, do I have?

 

The database version is 9.3.4

 

SELECT dem_type,

       dem_value,

       Count(*)

FROM   demo_weekly a

WHERE  date = '2013-11-30'

AND    userid IN ( select userid from test1)

       AND dem_type IN ( 'Gender', 'Age', 'Hobbies' )

GROUP  BY dem_type,

          dem_value ;

 

 

                                                                                    QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HashAggregate  (cost=322386.94..322786.94 rows=40000 width=29) (actual time=3142.849..3142.927 rows=19 loops=1)

   ->  Hash Semi Join  (cost=14460.06..314403.08 rows=1064514 width=29) (actual time=803.671..2786.979 rows=1199961 loops=1)

         Hash Cond: ((a.userid)::text = (test1.userid)::text)

         ->  Append  (cost=0.00..277721.30 rows=2129027 width=78) (actual time=536.829..1691.270 rows=2102611 loops=1)

               ->  Seq Scan on demo_weekly a  (cost=0.00..0.00 rows=1 width=808) (actual time=0.002..0.002 rows=0 loops=1)

                     Filter: ((date = '2013-11-30'::date) AND ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[])))

               ->  Bitmap Heap Scan on demo_weekly_20131130 a_1  (cost=50045.63..277721.30 rows=2129026 width=78) (actual time=536.826..1552.203 rows=2102611 loops=1)

                     Recheck Cond: ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[]))

                     Filter: (date = '2013-11-30'::date)

                     ->  Bitmap Index Scan on demo_weekly_20131130_dt_idx  (cost=0.00..49513.37 rows=2129026 width=0) (actual time=467.453..467.453 rows=2102611 loops=1)

                           Index Cond: ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[]))

         ->  Hash  (cost=8938.36..8938.36 rows=441736 width=50) (actual time=266.501..266.501 rows=441736 loops=1)

               Buckets: 65536  Batches: 1  Memory Usage: 35541kB

               ->  Seq Scan on test1  (cost=0.00..8938.36 rows=441736 width=50) (actual time=0.023..87.869 rows=441736 loops=1)

Total runtime: 3149.004 ms

(15 rows)

 

                                                                                   QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HashAggregate  (cost=318351.90..318751.90 rows=40000 width=29) (actual time=23668.646..23668.723 rows=19 loops=1)

   ->  Hash Join  (cost=5316.68..310497.81 rows=1047212 width=29) (actual time=1059.182..23218.864 rows=1199961 loops=1)

         Hash Cond: ((a.userid)::text = (test1.userid)::text)

         ->  Append  (cost=0.00..276382.82 rows=2094423 width=78) (actual time=528.116..2002.462 rows=2102611 loops=1)

               ->  Seq Scan on demo_weekly a  (cost=0.00..0.00 rows=1 width=808) (actual time=0.001..0.001 rows=0 loops=1)

                     Filter: ((date = '2013-11-30'::date) AND ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[])))

               ->  Bitmap Heap Scan on demo_weekly_20131130 a_1  (cost=49269.46..276382.82 rows=2094422 width=78) (actual time=528.114..1825.265 rows=2102611 loops=1)

                     Recheck Cond: ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[]))

                     Filter: (date = '2013-11-30'::date)

                     ->  Bitmap Index Scan on demo_weekly_20131130_dt_idx  (cost=0.00..48745.85 rows=2094422 width=0) (actual time=458.694..458.694 rows=2102611 loops=1)

                           Index Cond: ((dem_type)::text = ANY ('{Gender,Age,"Hobbies"}'::text[]))

         ->  Hash  (cost=5314.18..5314.18 rows=200 width=516) (actual time=530.930..530.930 rows=441736 loops=1)

               Buckets: 1024  Batches: 1  Memory Usage: 35541kB

               ->  HashAggregate  (cost=5312.18..5314.18 rows=200 width=516) (actual time=298.301..411.734 rows=441736 loops=1)

                     ->  Seq Scan on test1  (cost=0.00..5153.94 rows=63294 width=516) (actual time=0.068..91.378 rows=441736 loops=1)

Total runtime: 23679.096 ms

(16 rows)

Re: same query different execution plan (hash join vs. semi-hash join)

От
Tom Lane
Дата:
"Huang, Suya" <Suya.Huang@au.experian.com> writes:
> I've got a query as below, it runs several times with different execution plan and totally different execution time.
Theone using hash-join is slow and the one using semi-hash join is very fast. However, I have no control over the
optimizerbehavior of PostgreSQL database. Or, do I have? 

A salient feature of the slow plan is that the planner is misinformed
about the size of test1:

>                      ->  Seq Scan on test1  (cost=0.00..5153.94 rows=63294 width=516) (actual time=0.068..91.378
rows=441736loops=1) 

whereas in the fast plan its rows estimate for that scan is dead on.
It looks like the two cases also have different ideas of how many
distinct values are in the test1.userid column, though this is more a
guess than an indisputable fact.

In short, I suspect you're recreating the test1 table and not bothering
to ANALYZE it after you fill it.  This leaves you at the mercy of when
the autovacuum daemon gets around to analyzing the table before you'll
get good plans for it.

            regards, tom lane


Re: same query different execution plan (hash join vs. semi-hash join)

От
"Huang, Suya"
Дата:
Thank you Tom. But the time spent on scanning table test1 is less than 1 second (91.738 compares to 87.869), so I guess
thisshouldn't be the issue? 


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, May 16, 2014 12:58 PM
To: Huang, Suya
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] same query different execution plan (hash join vs. semi-hash join)

"Huang, Suya" <Suya.Huang@au.experian.com> writes:
> I've got a query as below, it runs several times with different execution plan and totally different execution time.
Theone using hash-join is slow and the one using semi-hash join is very fast. However, I have no control over the
optimizerbehavior of PostgreSQL database. Or, do I have? 

A salient feature of the slow plan is that the planner is misinformed about the size of test1:

>                      ->  Seq Scan on test1  (cost=0.00..5153.94
> rows=63294 width=516) (actual time=0.068..91.378 rows=441736 loops=1)

whereas in the fast plan its rows estimate for that scan is dead on.
It looks like the two cases also have different ideas of how many distinct values are in the test1.userid column,
thoughthis is more a guess than an indisputable fact. 

In short, I suspect you're recreating the test1 table and not bothering to ANALYZE it after you fill it.  This leaves
youat the mercy of when the autovacuum daemon gets around to analyzing the table before you'll get good plans for it. 

            regards, tom lane


Re: same query different execution plan (hash join vs. semi-hash join)

От
Tom Lane
Дата:
"Huang, Suya" <Suya.Huang@au.experian.com> writes:
> Thank you Tom. But the time spent on scanning table test1 is less than 1 second (91.738 compares to 87.869), so I
guessthis shouldn't be the issue? 

No, the point is that the bad rowcount estimate (and, possibly, lack of
stats about join column contents) causes the planner to pick a join method
that's not ideal for this query.

            regards, tom lane


Re: same query different execution plan (hash join vs. semi-hash join)

От
"Huang, Suya"
Дата:
Thanks Tom, I think you're right. I just did an analyze on table test1 and the execution plan now generated is more
stableand predictable. 

Thanks,
Suya

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, May 20, 2014 12:22 AM
To: Huang, Suya
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] same query different execution plan (hash join vs. semi-hash join)

"Huang, Suya" <Suya.Huang@au.experian.com> writes:
> Thank you Tom. But the time spent on scanning table test1 is less than 1 second (91.738 compares to 87.869), so I
guessthis shouldn't be the issue? 

No, the point is that the bad rowcount estimate (and, possibly, lack of stats about join column contents) causes the
plannerto pick a join method that's not ideal for this query. 

            regards, tom lane