Re: Slow join over three tables

From: Alban Hertroys
Subject: Re: Slow join over three tables
Date:
Msg-id: CAF-3MvNeNObyb1MFsgYn+a+MMbcTVDjLfypBQpmmehGWbGEOAw@mail.gmail.com
In reply to: Re: Slow join over three tables  (Tim van der Linden <tim@shisaa.jp>)
Responses: Re: Slow join over three tables  (Tim van der Linden <tim@shisaa.jp>)
List: pgsql-general
On 28 April 2016 at 08:36, Tim van der Linden <tim@shisaa.jp> wrote:
> On Wed, 27 Apr 2016 13:48:06 +0200
> Alban Hertroys <haramrae@gmail.com> wrote:
>
>> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off with the inverse of the above index:
>>       create index on report_adverses (adverse, rid);
>>       create index on report_drugs (drug, rid);
>
> Hmmmm, like I reported yesterday this achieved a huge performance boost.
>
> However, there are two (new) things I'd like to touch on while further experimenting with this query:
>
> 1. Altering drug IDs or adverse names affects the execution time negatively.
>
> In this example, altering the drug IDs I search for makes the query 6 times slower again:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;
>
> Different drug ID ([9557, 17848, 17880, 18223]), but the rest is the same. Query plan:
>
>  Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual time=662.707..662.819 rows=4076 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 415kB
>    ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual time=71.748..661.743 rows=4076 loops=1)
>          ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual time=70.841..97.094 rows=4076 loops=1)
>                Merge Cond: (d.rid = a.rid)
>                ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) (actual time=48.946..58.623 rows=135370 loops=1)
>                      Sort Key: d.rid
>                      Sort Method: quicksort  Memory: 12498kB
>                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 rows=135532 loops=1)
>                            Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))


You're doing ~9 times as many index lookups. A 6x slowdown for this
part of the query seems rather reasonable.
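As a quick back-of-the-envelope check (this is plain arithmetic on the row counts reported in the two EXPLAIN outputs, nothing PostgreSQL-specific):

```python
# Row counts from the two query runs: the earlier query returned 448
# rows, this one returns 4076 (per the Sort node's "actual ... rows=4076").
previous_rows = 448
current_rows = 4076

# Roughly how many more matching rows (and hence index lookups on the
# join side) the new drug IDs produce.
lookup_ratio = current_rows / previous_rows
print(f"~{lookup_ratio:.1f}x as many matching rows")  # ~9.1x
```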

>  Planning time: 16.438 ms
>  Execution time: 663.583 ms
>
> A total of 660 ms instead of the previous 120 ms. The number of rows returned is 4076, instead of the 448 rows of the previous query. Could this be the cause of the slowdown? Or is it the fact that the integers in the ANY clause are further apart from one another and thus make for longer index searches?

Technically probably both, but the index node distance (assuming a
btree index) should be insignificant.

The sorting of the found candidate rows on rid probably contributes
the most. Quicksort scales as O(n log n) in the average case, so
sorting ~9x as many rows costs somewhat more than 9x the work.
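A minimal sketch of that scaling argument, using the 448 and 4076 row counts from the two query results (the constant factor in quicksort's average-case cost cancels out in the ratio):

```python
import math

def sort_cost(n):
    # Average-case comparison count for quicksort grows as n * log2(n),
    # up to a constant factor that cancels in the ratio below.
    return n * math.log2(n)

previous_rows, current_rows = 448, 4076
ratio = sort_cost(current_rows) / sort_cost(previous_rows)
# Sorting ~9x as many rows costs a bit more than 9x as much (~12x here):
print(f"estimated sort-cost ratio: ~{ratio:.1f}x")
```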

That said, did you happen to perform these tests (assuming there was
more than one run) during a busier period on the database server? I
ask because later on you seem to get fast results again after some
more index changes, and I don't see a cause for the difference.

> The same happens with changing the adverses, some adverses come back in ~120 ms, others can take up to one second.

Likely the same issue.

> 2. Adding extra columns ignores indexes
>
> This one was kind of expected. Adding extra columns from both "report_drugs" and "report_adverses" slows the lot down again. Extra columns added:
>
> - a.recovery
> - a.severity
> - d.reason
> - d.effectiveness
> - d.duration

I have to admit that I'm a bit behind with the current state of the
art of PostgreSQL, but last time I checked, the database needed to
look at each row in the result-set for transaction visibility
information. I recall that there was (at some point) much discussion
whether that visibility information could be added to indexes and that
there were strong arguments against doing so. Hence, I doubt that
those new index-only scans skip that step.

Unless I'm wrong there, adding non-queried fields to the index is only
going to affect your performance adversely. Relevant fields for
indices are those used in joins, those regularly used in conditions
(where-clauses) and those that are regularly sorted over.

Other fields are available from the candidate rows of the result set.
Putting them in the index just results in storing redundant
information.

That's how it always has been and I'm pretty sure that's still the case.

> As I would expect, adding these columns would make the previous multi-column indexes useless. And they indeed were not used anymore.

What adding those fields to the index did was drive up the cost of
using that index. It requires more memory to cache, for example, and
this will be evicted from caches sooner than a smaller index.

I suspect the query planner also noticed that and chose the smaller
index, as that index's usage costs are now lower than those of your
(unnecessarily) large index.

> Still fast enough ... but I was wondering why the planner would not use the new index and instead fall back on the "report_drugs_drug_idx" single-column index.

I'm actually a bit suspicious of those numbers; those are different
drug IDs than those from the first explain, too.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


In the pgsql-general list, by send date:

Previous
From: Vik Fearing
Date:
Message: Re: CREATE EXTENSION without superuser access
Next
From: Peter Devoy
Date:
Message: Re: Does this perf output seem 'normal'?