Re: limit in subquery causes poor selectivity estimation

Поиск

Список

Период

Сортировка

От	Robert Haas
Тема	Re: limit in subquery causes poor selectivity estimation
Дата	29 августа 2011 г. 16:58:27
Msg-id	CA+TgmoYerLuxcmkHtrp-2f7aAuxFp_3vWUR73RTNy4HH=YbUtA@mail.gmail.com обсуждение исходный текст
Ответ на	Re: limit in subquery causes poor selectivity estimation (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-hackers

Дерево обсуждения

On Sat, Aug 27, 2011 at 1:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>>                               QUERY PLAN
>> ----------------------------------------------------------------------
>>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>>    Hash Cond: (test1.sha1 = test2.sha1)
>>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
>> That's OK.  Apparently it can tell that joining two tables on their
>> primary keys cannot result in more rows than the smaller table.  (Or
>> can it?)
>
> More like it knows that a semijoin can't produce more rows than the
> lefthand input has.  But I think it is actually applying stats for
> both columns here.
>
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
>
>> Here, however, it has apparently not passed this knowledge through the
>> LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
>> So what's up with that?  Just a case of, we haven't thought about
>> covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

Well, you can't.  I think the question is, in the absence of perfect
information, is it better to use the stats you have, or just punt and
assume you know nothing?  Like Peter, I've certainly seen cases where
pulling up the stats would be a huge win, but it's hard to say whether
there are other cases where it would be worse than what we do now,
because nobody spends any time staring at the queries where the
existing system works great.  My gut feeling is that pulling up the
stats unchanged is likely to be better than punting, but my gut
feeling may not be worth much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

В списке pgsql-hackers по дате отправления:

Предыдущее

От: hubert depesz lubaczewski
Дата: 29 августа 2011 г., 16:54:54
Сообщение: Re: [GENERAL] pg_upgrade problem

Следующее

От: Tom Lane
Дата: 29 августа 2011 г., 17:24:46
Сообщение: Re: spinlocks on HP-UX

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: limit in subquery causes poor selectivity estimation

Предыдущее

Следующее