Re: POC: GROUP BY optimization

Поиск
Список
Период
Сортировка
От Teodor Sigaev
Тема Re: POC: GROUP BY optimization
Дата
Msg-id 2cb99655-7ae7-ec8d-2c74-3fea58987104@sigaev.ru
обсуждение исходный текст
Ответ на Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
>> problem 2). Index optimization was noticed by me later. But based on your 
>> suggested patch's order I split the patch to index and non-index part and 
>> second part depends of first one. They touch the same part of code and they 
>> could not be independent
> The way I see it the patch does two different things:
Yes

> a) consider additional indexes by relaxing the pathkeys check
Yes, but not only. Patch could reorder GROUP BY clause to match existing 
pathkeys which could come from index scan (suggested by patch or not) or some 
another way to get ordered output.

> b) if there are no indexes, i.e. when an explicit Sort is needed, consider 
> reordering the columns by ndistinct
Yes. But again, this description is a bit short. First it works after first 
patch and might get some preordered leading pathkeys. Second, it tries to match 
ORDER BY clause order if there is no preordered leading pathkeys from first 
patch (it was introduced in v7). And third, if there is a tail of unmatched 
pathkeys on previous stages then it will reorder that tail.

> In the worst case, one part will depend on the other, which is OK. It still 
> allows us to commit the first part and continue working on the other one, for 
> example.
Exactly it's our case. Of course, it's possible to split first patch for two 
again: just suggestion of index (1.1) and reordering by existing pathkeys (1.2). 
Then 1.1 will be independent but 2 still should be applied after 1.2. But patch 
1.1 is rather useless.

> can decide which one is cheaper etc. The decision which paths to keep is done by 
> add_path(), and it should stay like this, of course. I wasn't suggesting to move 
> the logic elsewhere.
Cool, I haven't intention to modify it too.

> 
>>  > I wonder if we should make the add_path() logic smarter to recognize when two
>>  > paths have different pathkeys but were generated to match the same grouping,
>>  > to reduce the number of paths added by this optimization. Currently we do 
>> that > for each pathkeys list independently, but we're considering many more 
>> pathkeys > orderings ...
>>
>> Hm. I tend to say no.
>> select .. from t1 group by a, b
>> union
>> select .. from t2 group by a, b
>>
>> t1 and t2 could have different set of indexes and different distribution, so 
>> locally it could be cheaper to use one index (for example, one defined as (b, 
>> a) and second as (a,b,c,d) - second is larger) but totally - another (example: 
>> second table doesn't have (b,a) index)
>>
> 
> But add_path() treats each of the relations independently, why couldn't we pick 
> a different index for each of the two relations?

Having of sorted output of both subselect could be cheaper that sorting one of 
them even if index scan was cheaper. But it seems to me that is not deal of 
suggested here optimization.


>>> For example, it seems to disregard that different data types have different 
>>> comparison costs. For example comparing bytea will be far more expensive 
>>> compared to int4, so it may be much more efficient to compare int4 columns 
>>> first, even if there are far fewer distinct values in them.
>> as I can see cost_sort() doesn't pay attention to this details. And it should 
>> be a separated patch to improve that.
>>
> 
> But sort also does not reorder columns.
Yes. But estimation of cost of comparing function/number of unique values in 
column could be not very accurate and so planner could make a wrong choice. I 
saw 2 times difference in real-world application. Again, improving sort cost 
estimation is a separate task.

> 
> Imagine you have a custom data type that is expensive for comparisons. You know 
> that, so you place it at the end of GROUP BY clauses, to reduce the number of 
> comparisons on that field. And then we come along and just reorder the columns, 
> placing it first, because it happens to have a high ndistinct statistic. And 
> there's no way to get the original behavior :-(
Hm. that it could be, but I don't know how to improve here.  Current cost_sort() 
will return the same cost for any columns order.

Okay, here we know an estimation of nrow, we could somehow find a comparing 
function oid and find a its procost field. And then? we also need to know width 
of tuple here and mostly repeat the cost_sort.

Another option is an introducing enable_groupby_reorder GUC variable although 
personally I don't like an idea to add new GUC variable.

>> Agree, but I don't know how to use it here. Except, may be:
>> 1) first column - the column with bigger estimated number of groups
>> 2) second column - the pair of (first, second) with bigger estimated number of 
>> groups
>> 3) third column - the triple of (first, second, third) with bigger ...
>>
>> But seems even with that algorithm, ISTM, it could be implemented in cheaper 
>> manner.
>>
> 
> Maybe. I do have some ideas, although I haven't tried implementing it.
Implemented, pls, have a look.

> 
> If there's no extended statistic on the columns, you can do the current thing 
> (assuming independence etc.). There's not much we can do here.
Agree

> 
> If there's an extended statistic, you can do either a greedy search (get the 
> next column with the highest ndistinct coefficient) or exhaustive search 
> (computing the estimated number of comparisons).
> 
> Another challenge is that using only the ndistinct coefficient assumes uniform 
> distribution of the values. But you can have a column with 1M distinct values, 
> where a single value represents 99% of the rows. And another column with 100k 
> distinct values, with actual uniform distribution. I'm pretty sure it'd be more 
> efficient to place the 100k column first.

Interesting.  Will think, thank you


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: buildfarm vs code
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: I'd like to discuss scaleout at PGCon