Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE

From: Maxim Boguk
Subject: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Msg-id: CAK-MWwR_G0NLLB9TwRhs+3cHDa9zZnYg7AgpAfeYJn_hgpZBsQ@mail.gmail.com
In reply to: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
List: pgsql-bugs


On Tue, Apr 17, 2018 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> I'm wondering how planner estimates number of rows in that case:

See eqjoinsel_semi, particularly the change in behavior when it thinks
nd2 is or is not a default estimate.

Given the lack of statistics about the output of the WITH clause,
it's hard to see how we'd ever get trustworthy estimates here.
I think the fact that your first example yields an accurate
estimate is mostly luck.

                        regards, tom lane
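
For reference, a condensed, non-verbatim sketch of the code path Tom points at: the branch of eqjoinsel_semi() in src/backend/utils/adt/selfuncs.c that is taken when MCV statistics are not available (as of the PostgreSQL versions current in this thread; consult the actual source for the full logic):

    /*
     * eqjoinsel_semi(), branch taken when MCV lists are not available.
     * nd1/nd2 are the distinct-value estimates for the outer/inner sides;
     * isdefault1/isdefault2 flag whether those are mere default guesses.
     */
    if (!isdefault1 && !isdefault2)
    {
        /* Both ndistinct estimates look real: assume the inner side's
         * values are a subset of the outer side's. */
        if (nd1 <= nd2 || nd2 < 0)
            selec = 1.0 - nullfrac1;
        else
            selec = (nd2 / nd1) * (1.0 - nullfrac1);
    }
    else
    {
        /* At least one side is only a default guess: punt and assume
         * half of the outer rows will find a match. */
        selec = 0.5 * (1.0 - nullfrac1);
    }

When isdefault2 flips to true, the estimate jumps from roughly nd2/nd1 to a flat 0.5, i.e. about half the outer table; that is where the rows=45860504 in the plan below comes from.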



There is a similar issue without a CTE, which looks pretty weird:

Good case with LIMIT 199 and an adequate estimate:
hh=# explain SELECT * FROM resume WHERE resume_id IN (select id from generate_series(1, 1000) gs(id) LIMIT 199);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.53..108.98 rows=199 width=519)
   ->  Unique  (cost=21.42..21.62 rows=199 width=4)
         ->  Sort  (cost=21.42..21.52 rows=199 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..9.95 rows=199 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)

Very bad case with an awful estimate (the only difference is LIMIT 200 vs LIMIT 199):
hh=# explain SELECT * FROM resume WHERE resume_id IN (select id from generate_series(1, 1000) gs(id) LIMIT 200);
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Nested Loop  (cost=21.64..109.53 rows=45860504 width=519)
   ->  Unique  (cost=21.53..21.73 rows=200 width=4)
         ->  Sort  (cost=21.53..21.63 rows=200 width=4)
               Sort Key: gs.id
               ->  Limit  (cost=0.00..10.00 rows=200 width=4)
                     ->  Function Scan on generate_series gs  (cost=0.00..50.00 rows=1000 width=4)
   ->  Index Scan using resume_pk on resume  (cost=0.11..0.39 rows=1 width=519)
         Index Cond: (resume_id = gs.id)


It's not a problem by itself, but once you start using this query with more joined tables, a lot of bad things happen because of the five-orders-of-magnitude error in selectivity estimation.

PS: in practice this forces us to use a LIMIT of no more than 199 in complex joins for batch operations, or the database starts generating funny plans.
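
For the curious, the hard 199/200 boundary comes from DEFAULT_NUM_DISTINCT, which is 200. The inner side of the IN has no ANALYZE statistics, so get_variable_numdistinct() (same file as the sketch above) falls back on the planner's row-count estimate for the subquery. A condensed sketch of that fallback, with the same non-verbatim caveat:

    /*
     * get_variable_numdistinct(), fallback when no stats are available:
     * ntuples is the planner's row-count estimate for the inner relation
     * (here, the estimated output of the LIMIT node).
     */
    if (ntuples < DEFAULT_NUM_DISTINCT)    /* DEFAULT_NUM_DISTINCT == 200 */
        return clamp_row_est(ntuples);     /* treated as a real estimate */

    *isdefault = true;                     /* flagged as a default guess... */
    return DEFAULT_NUM_DISTINCT;           /* ...so eqjoinsel_semi() punts */

With LIMIT 199 this returns ndistinct = 199 with isdefault = false; with LIMIT 200 it returns exactly DEFAULT_NUM_DISTINCT with isdefault = true, and the selectivity estimate collapses to the 0.5 fallback shown above.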

Regards,
Maxim


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"
