Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

From: Maxim Boguk
Subject: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Msg-id: CAK-MWwR_G0NLLB9TwRhs+3cHDa9zZnYg7AgpAfeYJn_hgpZBsQ@mail.gmail.com
In reply to: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
List: pgsql-bugs


On Tue, Apr 17, 2018 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> I'm wondering how planner estimates number of rows in that case:

See eqjoinsel_semi, particularly the change in behavior when it thinks
nd2 is or is not a default estimate.

Given the lack of statistics about the output of the WITH clause,
it's hard to see how we'd ever get trustworthy estimates here.
I think the fact that your first example yields an accurate
estimate is mostly luck.

                        regards, tom lane
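
For reference, a condensed, non-verbatim sketch of the code path Tom points at: the branch of eqjoinsel_semi() in src/backend/utils/adt/selfuncs.c that is taken when MCV statistics are not available (as of the PostgreSQL versions current in this thread; consult the actual source for the full logic):

    /*
     * eqjoinsel_semi(), branch taken when MCV lists are not available.
     * nd1/nd2 are the distinct-value estimates for the outer/inner sides;
     * isdefault1/isdefault2 flag whether those are mere default guesses.
     */
    if (!isdefault1 && !isdefault2)
    {
        /* Both ndistinct estimates look real: assume the inner side's
         * values are a subset of the outer side's. */
        if (nd1 <= nd2 || nd2 < 0)
            selec = 1.0 - nullfrac1;
        else
            selec = (nd2 / nd1) * (1.0 - nullfrac1);
    }
    else
    {
        /* At least one side is only a default guess: punt and assume
         * half of the outer rows will find a match. */
        selec = 0.5 * (1.0 - nullfrac1);
    }

When isdefault2 flips to true, the estimate jumps from roughly nd2/nd1 to a flat 0.5, i.e. about half the outer table; that is where the rows=45860504 in the plan below comes from.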



There is a similar issue without a CTE, which looks pretty weird:

Good case with LIMIT 199 and an adequate estimate:
hh=# explain SELECT * FROM resume WHERE resume_id IN (select id from generate_series(1, 1000) gs(id) LIMIT 199);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.53..108.98 rows=199 width=519)
   ->  Unique  (cost=21.42..21.62 rows=199 width=4)
         ->  Sort  (cost=21.42..21.52 rows=199 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..9.95 rows=199 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)

Very bad case with an awful estimate (the only difference is LIMIT 200 vs LIMIT 199):
hh=# explain SELECT * FROM resume WHERE resume_id IN (select id from generate_series(1, 1000) gs(id) LIMIT 200);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.64..109.53 rows=45860504 width=519)
   ->  Unique  (cost=21.53..21.73 rows=200 width=4)
         ->  Sort  (cost=21.53..21.63 rows=200 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..10.00 rows=200 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)


It's not a problem by itself, but once you start using this query with more joined tables, a lot of bad things happen because of the five-orders-of-magnitude error in selectivity estimation.

PS: in practice this forces us to use a LIMIT of no more than 199 in complex joins for batch operations, or the database starts generating funny plans.
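
For the curious, the hard 199/200 boundary comes from DEFAULT_NUM_DISTINCT, which is 200. The inner side of the IN has no ANALYZE statistics, so get_variable_numdistinct() (same file as the sketch above) falls back on the planner's row-count estimate for the subquery. A condensed sketch of that fallback, with the same non-verbatim caveat:

    /*
     * get_variable_numdistinct(), fallback when no stats are available:
     * ntuples is the planner's row-count estimate for the inner relation
     * (here, the estimated output of the LIMIT node).
     */
    if (ntuples < DEFAULT_NUM_DISTINCT)    /* DEFAULT_NUM_DISTINCT == 200 */
        return clamp_row_est(ntuples);     /* treated as a real estimate */

    *isdefault = true;                     /* flagged as a default guess... */
    return DEFAULT_NUM_DISTINCT;           /* ...so eqjoinsel_semi() punts */

With LIMIT 199 this returns ndistinct = 199 with isdefault = false; with LIMIT 200 it returns exactly DEFAULT_NUM_DISTINCT with isdefault = true, and the selectivity estimate collapses to the 0.5 fallback shown above.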

Regards,
Maxim


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"
