Re: improving GROUP BY estimation

From Mark Dilger
Subject Re: improving GROUP BY estimation
Date
Msg-id 7D1A65E8-7ED8-4268-9C24-8EA07E95D2BC@gmail.com
In reply to Re: improving GROUP BY estimation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: improving GROUP BY estimation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
> On Feb 25, 2016, at 4:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>>
>> It is not hard to write test cases where your patched version overestimates
>> the number of rows by a very similar factor as the old code underestimates
>> them.  My very first test, which was not specifically designed to demonstrate
>> this, happens to be one such example:
>>
>>
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,10000000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>>                           QUERY PLAN
>> ---------------------------------------------------------------
>>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>>    Group Key: a
>>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>>          Filter: (b < 1000)
>> (4 rows)
>>
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>>  count
>> -------
>>     32
>> (1 row)
>>
>>
>>
>> So, it estimates 850 rows where only 32 are returned.  Without
>> applying your patch, it estimates just 1 row where 32 are returned.
>> That's an overestimate of roughly 26 times, rather than an
>> underestimate of 32 times.
>
> The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way
> you've generated the data.

Yes, that was intentional.  Your formula is exactly correct, so far as I can tell,
for completely uncorrelated data.  I don't have any tables with completely
uncorrelated data, and was therefore interested in what happens when the
data is correlated and your patch is applied.
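For what it's worth, the estimate in the plan above is roughly what an
independence-based formula predicts. A minimal numeric sketch (my reading of
the general approach, not the patch's exact code): with d equally likely
groups and n rows surviving the filter, the expected number of distinct
groups observed is given by the classic occupancy formula.

```python
def expected_groups(d, n):
    """Expected number of distinct groups seen when n rows are drawn
    independently and uniformly from d equally likely groups."""
    return d * (1.0 - ((d - 1.0) / d) ** n)

# In the test case above, a = sqrt(gs)::int over 1..10,000,000 yields
# roughly d = 3162 distinct values of "a", and the filter b < 1000
# keeps roughly n = 1000 rows.
print(expected_groups(3162, 1000))
```

That lands in the vicinity of the planner's 850-row estimate (the exact
figure depends on the ndistinct and selectivity statistics ANALYZE gathered),
whereas the correlation between a and b makes the true answer 32.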

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base.  I'm just playing Devil's Advocate
here, to see if there is room for any improvement.

> In cases like this (with correlated columns), it's mostly just luck how accurate estimates you get, no matter which
> of the estimators you use. It's pretty easy to generate arbitrary errors by breaking the independence assumption, and
> it's not just this particular place of the planner. And it won't change until we add some sort of smartness about
> dependencies between columns.
>
> I think over-estimates are a bit more acceptable in this case, e.g. because of the HashAggregate OOM risk. Also, I've
> seen too many nested loop cascades due to under-estimates recently, but that's a bit unrelated.

I have seen similar problems in systems I maintain, hence my interest
in your patch.


>> As a patch review, I'd say that your patch does what you claim it
>> does, and it applies cleanly, and passes the regression tests with
>> my included modifications. I think there needs to be some discussion
>> on the list about whether the patch is a good idea.
>
> Yes, that's why I haven't bothered with fixing the regression tests in the patch, actually.

Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results.
If you look, the changes have to do with plans chosen and row estimates --
exactly the sort of stuff you would expect to change.

Thanks for the patch submission.  I hope my effort to review it is on the
whole more helpful than harmful.

Mark Dilger

