Re: Slow join over three tables

Поиск
Список
Период
Сортировка
От Tim van der Linden
Тема Re: Slow join over three tables
Дата
Msg-id 20160427222152.1cf02fc73d7af13e01468704@shisaa.jp
обсуждение исходный текст
Ответ на Re: Slow join over three tables  (Alban Hertroys <haramrae@gmail.com>)
Список pgsql-general
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys <haramrae@gmail.com> wrote:

Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the rows are combined into a result set
already,it doesn't make a whole lot of sense to add them to the index. Moreover, since your'e ordering by created, I'd
atleast put that directly after id: 
>
>     create index on reports (id, created);

The ORDER statement did not attribute much to the total query time, but it makes sense to index this.

> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off
withthe inverse of above index: 
>     create index on report_adverses (adverse, rid);
>     create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing their order). After purging the OS disk cache, the query runs in (*drum
roll*)120 ms. You managed to speed this up ~40 times. 

The query plan:

 Sort  (cost=12108.99..12109.83 rows=337 width=41) (actual time=119.517..119.531 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.84 rows=337 width=41) (actual time=26.473..119.392 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.624..33.650 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.208..4.976 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                     ->  Index Only Scan using report_drugs_drug_rid_idx on report_drugs d  (cost=0.44..415.86
rows=14496width=8) (actual time=0.648..2.236 rows=14200 loops=1) 
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                           Heap Fetches: 0
               ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.410..24.174 rows=76974 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a
(cost=0.56..3019.24rows=79860 width=21) (actual time=0.916..10.689 rows=76972 loops=1) 
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back
pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.01 rows=1 width=20)
(actualtime=0.190..0.191 rows=1 loops=448) 
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.466 ms
 Execution time: 119.871 ms

This is amazing, and the slow down was purely a fact of missing indexes on the sweet spot.
You can now clearly see that the above two indexes are used (report_drugs_drug_rid_idx) and
(report_adverses_adverse_rid_idx)and that the nested loop is much faster. 

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to "reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze after creating those indexes. Eventually
autovacuumwill do that for you, but you could otherwise be running tests to verify the impact of adding those indexes
beforeautovacuum gets around to it. 

I'm becoming the king of manual VACUUM-ing the past few days ... no need to point that out ;)

> Finally, perhaps it's more efficient to weed out all unrelated drugs and adverses before relating them to reports,
somethinglike: 
>
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
>     SELECT d.rid, a.adverse, d.drug
>     FROM report_drugs d
>     JOIN report_adverses a ON a.rid = d.rid
>     WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
>     AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created, disk cache flushed, this query ran at the exact same speed, the plan:

 Sort  (cost=12108.14..12108.99 rows=337 width=41) (actual time=119.102..119.111 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.00 rows=337 width=41) (actual time=26.797..118.969 rows=448 loops=1)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.899..33.203 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.319..4.981 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                     ->  Index Only Scan using report_drugs_drug_rid_idx on report_drugs d  (cost=0.44..415.86
rows=14496width=8) (actual time=0.748..2.369 rows=14200 loops=1) 
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                           Heap Fetches: 0
               ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.573..23.940 rows=76974 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a
(cost=0.56..3019.24rows=79860 width=21) (actual time=0.922..10.637 rows=76972 loops=1) 
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back
pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.01 rows=1 width=20)
(actualtime=0.190..0.191 rows=1 loops=448) 
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.238 ms
 Execution time: 119.431 ms

So your hunch was right, not much difference there.

But man ... this query is now flying ... Hat's off to you, sir!

> Alban Hertroys

Cheers,
Tim


В списке pgsql-general по дате отправления:

Предыдущее
От: Alban Hertroys
Дата:
Сообщение: Re: Slow join over three tables
Следующее
От: Stefan Weiss
Дата:
Сообщение: Filtering the results of UNION ALL vs filtering the separate queries