Re: Slow join over three tables

Поиск
Список
Период
Сортировка
От Tim van der Linden
Тема Re: Slow join over three tables
Дата
Msg-id 20160427192918.9b5d2ddd7c9231b1323bf0fd@shisaa.jp
обсуждение исходный текст
Ответ на Re: Slow join over three tables  (David Rowley <david.rowley@2ndquadrant.com>)
Ответы Re: Slow join over three tables  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-general
On Wed, 27 Apr 2016 14:09:04 +1200
David Rowley <david.rowley@2ndquadrant.com> wrote:

Hi David

Thanks for your time on this. I tried your proposals with the results below.

> > ...
> > Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.
>
> I'm not sure why you're comparing this to the join plan above. They're
> very different, I can only imagine that it' because you've not quite
> understood what the EXPLAIN output is saying.

Well, it is more due to me poorly wording the question I guess. I understand that both plans are totally different and
bothqueries are fairly incomparable as two very different things are happening. 

My point here was to show that the "drug" column seems to be indexed correctly and, even though it is a very large
table,the query executes very fast (all these execution times are given after flushing the OS's disk cache). 

> >      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
> >          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31)
(actualtime=0.669..3.900 rows=14199 loops=1) 
> >                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> >          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual
time=0.143..0.143rows=1 loops=14199) 
> >                Index Cond: (id = d.rid)
>
> This is a parameterised nested loop. This means that the inner side of
> the loop (reports), is parameterised by the outerside of the loop, you
> can see the d.rid in the Index Cond. This means that the
> reports_id_key index is looked up 14199 times. You can see from the
> plan that the nested loop here took 2038.952 milliseconds to complete,
> so about 0.144 ms per lookup, not too bad, right?

That is indeed not bad at all, thanks for pointing this out. This nested loop is thus responsible for little over 40%
ofthe execution time. 

> The alternative to a parameterised nested loop plan is that a Hash
> Join plan is generated. If you imagine what would happen here, likely
> the matching report_drugs records would be hashed, then the outer side
> of the join would then perform a SeqScan over the entire 6 million
> reports records... probably not better. You can test this yourself
> with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
> = 0; too. I imagine it'll be slower.

Doing this with the current (unaltered) indexes almost doubles the time and it seems to behave exactly as you
mentioned:

 Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual time=9308.174..9308.187 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Hash Join  (cost=130941.07..372955.11 rows=317 width=41) (actual time=8131.509..9307.983 rows=448 loops=1)
         Hash Cond: (r.id = d.rid)
         ->  Seq Scan on reports r  (cost=0.00..105465.00 rows=5749300 width=20) (actual time=0.237..849.564
rows=5749300loops=1) 
         ->  Hash  (cost=130937.10..130937.10 rows=317 width=29) (actual time=8130.175..8130.175 rows=448 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 25kB
               ->  Hash Join  (cost=122150.13..130937.10 rows=317 width=29) (actual time=8118.130..8130.090 rows=448
loops=1)
                     Hash Cond: (d.rid = a.rid)
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..496.92 rows=13870
width=8)(actual time=0.830..3.503 rows=14200 loops=1) 
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                     ->  Hash  (cost=121170.69..121170.69 rows=78320 width=21) (actual time=8116.544..8116.544
rows=76972loops=1) 
                           Buckets: 8192  Batches: 1  Memory Usage: 3609kB
                           ->  Bitmap Heap Scan on report_adverses a  (cost=2024.68..121170.69 rows=78320 width=21)
(actualtime=19.303..8093.257 rows=76972 loops=1) 
                                 Recheck Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back
pain","backpain"}'::text[])) 
                                 Heap Blocks: exact=56959
                                 ->  Bitmap Index Scan on report_adverses_adverse_idx  (cost=0.00..2005.10 rows=78320
width=0)(actual time=11.756..11.756 rows=76972 loops=1) 
                                       Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back
pain","backpain"}'::text[])) 
 Planning time: 11.348 ms
 Execution time: 9308.694 ms

> Likely the only way to speed this up would be to create indexes;
>
> create index on reports (id, age, gender, created);
> ...
> create index on report_adverses (rid, adverse);
>
> this might speedup the 2nd join a bit.

Done. This does help a bit. The query runs about ~1 second faster. The plan:

 Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual time=4313.679..4313.708 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.44..66051.23 rows=344 width=41) (actual time=43.987..4313.435 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=1.00..54700.19 rows=14319 width=28) (actual time=0.772..1158.338 rows=14200 loops=1)
               ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..507.78 rows=14319 width=8)
(actualtime=0.579..4.327 rows=14200 loops=1) 
                     Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
               ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..3.77 rows=1
width=20)(actual time=0.081..0.081 rows=1 loops=14200) 
                     Index Cond: (id = d.rid)
                     Heap Fetches: 0
         ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual
time=0.222..0.222rows=0 loops=14200) 
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
 Planning time: 15.968 ms
 Execution time: 4313.755 ms

Both the (rid, adverse) and the (id, age, gender, created) indexes are now used.

Yet ... this is "only" 1 second faster, still 4 seconds to complete this query (the join(s)). Is this truly the most
thatI could get out of it ... or could further tweaking to PostgreSQL's configuration be of any help here? 

>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim


В списке pgsql-general по дате отправления:

Предыдущее
От: Tim van der Linden
Дата:
Сообщение: Re: Slow join over three tables
Следующее
От: Tim van der Linden
Дата:
Сообщение: Re: Slow join over three tables