Re: Optimizer seems to be way off, why?

From: John A Meinel
Subject: Re: Optimizer seems to be way off, why?
Date:
Msg-id: 42DEBACF.2020405@arbash-meinel.com
In reply to: Re: Optimizer seems to be way off, why?  (Dirk Lutzebäck <lutzeb@aeccom.com>)
Responses: Re: Optimizer seems to be way off, why?
List: pgsql-performance
Dirk Lutzebäck wrote:
> Richard Huxton wrote:
>
>> Dirk Lutzebäck wrote:
>>
>>> Hi,
>>>
>>> I do not understand the following explain output (pgsql 8.0.3):
>>>
>>> explain analyze
>>> select b.e from b, d
>>> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>>>
>>>                                                         QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------
>>>
>>> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual time=0.213..2926.845 rows=324503 loops=1)
>>>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1 width=4) (actual time=0.104..17.418 rows=3293 loops=1)
>>>         Index Cond: (r = 516081780::oid)
>>>         Filter: (c = 513652057::oid)
>>>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140 width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>>>         Index Cond: ("outer".e = d.e)
>>> Total runtime: 3638.783 ms
>>> (7 rows)
>>>
>>> Why is the rows estimate for b_index and the nested loop 1? They are
>>> actually 3293 and 324503.
>>
>> I'm guessing (and that's all it is) that b.r and b.c have a higher
>> correlation than the planner is expecting. That is, it expects the
>> b.c=... to reduce the number of matching rows much more than it is.
>>
>> Try a query just on WHERE b.r=516081780 and see if it gets the
>> estimate right for that.
>>
>> If it's a common query, it might be worth an index on (r,c)
>>
>> --
>>   Richard Huxton
>>   Archonet Ltd
>>
>
> Thanks Richard, dropping the b.c condition now gives better estimates (it
> also uses a different index now), although they are still not accurate
> (off by a factor of 10). This query is embedded in a larger query, which
> is now 1000 times faster (!) because b.c is redundant and can be dropped.

Well, part of the problem is that the poor row estimate is not for 'b.e'
but for 'b.r': the planner expects to find only one matching row, but it
actually finds 3293.

Now, that *could* be because it mis-estimates the selectivity of b.r & b.c.

It actually estimated the join with d approximately correctly. (It
thought that for each row it would find 140, and it averaged 99).

>
> Though, why can't the planner see this correlation? I think somebody
> said the planner does not know about multiple column correlations, does it?

The planner does not maintain cross-column statistics, so you are
correct. I believe it assumes the distributions are independent, so that
if r=RRRR is 10% selective and c=CCCC is 20% selective, the combined
selectivity of r=RRRR AND c=CCCC is estimated at 10% * 20% = 2%. I could
be wrong on this, but I think it is approximately correct.
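
For instance, the standard pg_stats view shows what ANALYZE gathered for
each column on its own (table and column names taken from your query):

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'b' AND attname IN ('r', 'c');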

Now if you created an index on b(r,c), the planner would have a much
better idea of how selective that combination is. At the very least, it
could scan an index on (r,c) rather than scanning on (r) and filtering
by (c).
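
Something like this would do it (the index name is just an example):

CREATE INDEX b_r_c_idx ON b (r, c);
ANALYZE b;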

Also, if you have very skewed data (say, one value that appears 100k
times and 50 values that appear only 10 times each), the planner can
overestimate the rare values and underestimate the common one. (It uses
random sampling, so it somewhat depends on where the entries fall in the
sample.)

Have you tried increasing the statistics target on b.r and/or b.c? Do
you have an index on b.c, or just on b.r?
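
If not, something like this is worth a try (100 is just an example
target; the default is 10):

ALTER TABLE b ALTER COLUMN r SET STATISTICS 100;
ALTER TABLE b ALTER COLUMN c SET STATISTICS 100;
ANALYZE b;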

To see what the planner thinks, you might try:

EXPLAIN ANALYZE
select count(*) from b where r=516081780;

That would tell you how selective the planner thinks the r= condition is.
>
> Regards,
>
> Dirk
>
John
=:->

