Re: POC: GROUP BY optimization

From: Tomas Vondra
Subject: Re: POC: GROUP BY optimization
Date:
Msg-id: 6dddcce6-0d91-1868-794a-4d61dc5516f0@2ndquadrant.com
In response to: Re: POC: GROUP BY optimization  (Teodor Sigaev <teodor@sigaev.ru>)
Responses: Re: POC: GROUP BY optimization  (Claudio Freire <klaussfreire@gmail.com>)
Re: POC: GROUP BY optimization  (Teodor Sigaev <teodor@sigaev.ru>)
Re: POC: GROUP BY optimization  (Teodor Sigaev <teodor@sigaev.ru>)
List: pgsql-hackers
On 06/06/2018 08:04 PM, Teodor Sigaev wrote:
> 
>>> problem 2). I noticed the index optimization later. But based on
>>> your suggested patch order I split the patch into an index and a
>>> non-index part, and the second part depends on the first one. They
>>> touch the same part of the code and cannot be independent.
>> The way I see it the patch does two different things:
> Yes
> 
>> a) consider additional indexes by relaxing the pathkeys check
> Yes, but not only that. The patch can also reorder the GROUP BY clause
> to match existing pathkeys, which may come from an index scan (suggested
> by the patch or not) or from some other way of getting ordered output.
> 
>> b) if there are no indexes, i.e. when an explicit Sort is needed,
>> consider reordering the columns by ndistinct
> Yes. But again, this description is a bit short. First, it runs after
> the first patch and might receive some preordered leading pathkeys.
> Second, it tries to match the ORDER BY clause order if there are no
> preordered leading pathkeys from the first patch (this was introduced
> in v7). And third, if a tail of pathkeys is left unmatched by the
> previous stages, then it will reorder that tail.
> 

OK, I haven't looked at v7 yet, but if I understand correctly it tries
to maintain the ordering as much as possible? Does that actually help? I
mean, the incremental sort patch allows the sorting to happen by pieces,
but here we still need to sort all the data, right?

Can you give an example demonstrating the benefit?
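
For instance, I can imagine something like this (an entirely hypothetical
sketch, the table and index names are mine):

CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_a_idx ON t (a);

-- Written as GROUP BY b, a this does not match the index, but reordered
-- to GROUP BY a, b an index scan on t_a_idx provides the leading
-- pathkey. Whether that actually saves anything without incremental
-- sort is exactly what I'm asking about.
SELECT a, b, count(*) FROM t GROUP BY b, a;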

>> In the worst case, one part will depend on the other, which is OK. It
>> still allows us to commit the first part and continue working on the
>> other one, for example.
> That's exactly our case. Of course, it's possible to split the first
> patch in two again: just the index suggestion (1.1) and the reordering
> by existing pathkeys (1.2). Then 1.1 would be independent, but 2 would
> still have to be applied after 1.2. And patch 1.1 on its own is rather
> useless.
> 
>> can decide which one is cheaper etc. The decision which paths to keep
>> is done by add_path(), and it should stay like this, of course. I
>> wasn't suggesting to move the logic elsewhere.
> Cool, I have no intention of modifying it either.
> 
>>
>>>  > I wonder if we should make the add_path() logic smarter to
>>>  > recognize when two paths have different pathkeys but were
>>>  > generated to match the same grouping, to reduce the number of
>>>  > paths added by this optimization. Currently we do that for each
>>>  > pathkeys list independently, but we're considering many more
>>>  > pathkeys orderings ...
>>>
>>> Hm. I tend to say no.
>>> select .. from t1 group by a, b
>>> union
>>> select .. from t2 group by a, b
>>>
>>> t1 and t2 could have different sets of indexes and different
>>> distributions, so locally it could be cheaper to use one index (for
>>> example, with one index defined as (b,a) and a second as (a,b,c,d),
>>> the second one is larger) but globally another one (for example, if
>>> the second table doesn't have the (b,a) index).
>>>
>>
>> But add_path() treats each of the relations independently, so why
>> couldn't we pick a different index for each of the two relations?
> 
> Having sorted output from both subselects could be cheaper than sorting
> one of them, even if the index scan alone were cheaper. But it seems to
> me that is not the business of the optimization suggested here.
> 

OK, I think I understand what you're trying to say. We may have a query
like this:

SELECT a, SUM(x) FROM
(
    SELECT a, b, COUNT(c) AS x FROM t1 GROUP BY a, b
    UNION ALL
    SELECT a, b, COUNT(c) AS x FROM t2 GROUP BY a, b
) foo GROUP BY a;

and indexes on (a,b) and (b,a) for both relations. The "deduplication"
by pathkeys I suggested would mean we might keep only index (a,b) on t1
and (b,a) on t2, which means the grouping by "a" can't leverage index
scans on both relations. But if we keep paths for both indexes on each
relation, we can.

That's a good point, yes.
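
For reference, the setup I have in mind would be something like this
(just a sketch, all names are mine):

CREATE TABLE t1 (a int, b int, c int);
CREATE TABLE t2 (a int, b int, c int);
-- both orderings available on both relations
CREATE INDEX t1_ab ON t1 (a, b);
CREATE INDEX t1_ba ON t1 (b, a);
CREATE INDEX t2_ab ON t2 (a, b);
CREATE INDEX t2_ba ON t2 (b, a);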

> 
>>>> For example, it seems to disregard that different data types have
>>>> different comparison costs. For example, comparing bytea values will
>>>> be far more expensive than comparing int4, so it may be much more
>>>> efficient to compare int4 columns first, even if there are far fewer
>>>> distinct values in them.
>>> As far as I can see, cost_sort() doesn't pay attention to these
>>> details, and improving that should be a separate patch.
>>>
>>
>> But sort also does not reorder columns.
> Yes. But the estimates of the comparison function cost and of the
> number of unique values in a column may not be very accurate, and so
> the planner could make a wrong choice.

Isn't "estimation of cost of comparing function/number of unique values
in column could be not very accurate and so planner could make a wrong
choice" is more an argument against relying on it when doing these
optimizations?
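
After all, those estimates are only as good as whatever ANALYZE happened
to collect, e.g. (with a hypothetical table name):

-- n_distinct as the planner sees it; negative values mean a fraction
-- of the row count
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 't';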

FWIW it's one of the arguments Tom made in the incremental sort patch,
which relies on these estimates too when computing the cost of the
incremental sort. I'm sure it's going to be an obstacle there too.

> I saw a 2x difference in a real-world application. Again, improving
> sort cost estimation is a separate task.

Sure. But we also need to ask the other question, i.e. how many people
would be negatively affected by the optimization. And I admit I don't
know the answer to that; the next example is entirely made up.

>>
>> Imagine you have a custom data type that is expensive for comparisons.
>> You know that, so you place it at the end of GROUP BY clauses, to
>> reduce the number of comparisons on that field. And then we come along
>> and just reorder the columns, placing it first, because it happens to
>> have a high ndistinct statistic. And there's no way to get the
>> original behavior :-(
> Hm, that could happen, but I don't know how to improve things here.
> The current cost_sort() will return the same cost for any column order.
> 
> Okay, here we know an estimate of nrow; we could somehow find the
> comparison function's OID and look up its procost field. And then? We
> would also need to know the tuple width here, and would end up mostly
> repeating cost_sort().
> 
> Another option is introducing an enable_groupby_reorder GUC variable,
> although personally I don't like the idea of adding a new GUC.
> 

I don't like the idea of yet another GUC either, as it's a rather blunt
instrument. Which is why I'm suggesting to split the patch into two
parts. Then we can apply the index stuff (which seems relatively
straightforward) and keep working on this second part.

FWIW I think it would be useful to have a "development GUC" that would
allow us to enable/disable these options during development, because
that makes experiments much easier, but then remove it before commit.
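
Something along these lines (hypothetical usage, assuming the patch used
the enable_groupby_reorder name suggested above):

SET enable_groupby_reorder = off;  -- keep the GROUP BY order as written
EXPLAIN (COSTS OFF)
SELECT a, b, count(*) FROM t GROUP BY b, a;
RESET enable_groupby_reorder;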

>>> Agree, but I don't know how to use it here. Except maybe:
>>> 1) first column - the column with the biggest estimated number of
>>> groups
>>> 2) second column - the pair (first, second) with the biggest
>>> estimated number of groups
>>> 3) third column - the triple (first, second, third) with the
>>> biggest ...
>>>
>>> But ISTM that even that algorithm could be implemented in a cheaper
>>> manner.
>>>
>>
>> Maybe. I do have some ideas, although I haven't tried implementing it.
> Implemented, please have a look.
> 
>>
>> If there's no extended statistic on the columns, you can do the
>> current thing (assuming independence etc.). There's not much we can do
>> here.
> Agree
> 
>>
>> If there's an extended statistic, you can do either a greedy search
>> (get the next column with the highest ndistinct coefficient) or
>> exhaustive search (computing the estimated number of comparisons).
>>
>> Another challenge is that using only the ndistinct coefficient assumes
>> uniform distribution of the values. But you can have a column with 1M
>> distinct values, where a single value represents 99% of the rows. And
>> another column with 100k distinct values, with actual uniform
>> distribution. I'm pretty sure it'd be more efficient to place the 100k
>> column first.
> 
> Interesting. Will think about it, thank you.
> 

You're welcome.
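
To make that last point easy to experiment with, here is an entirely
made-up setup (all names and numbers are mine):

CREATE TABLE skewed AS
SELECT CASE WHEN random() < 0.99 THEN 0
            ELSE (random() * 1000000)::int END AS x,  -- high ndistinct, skewed
       (random() * 100000)::int AS y                  -- lower ndistinct, uniform
FROM generate_series(1, 1000000);

-- extended statistics, so the planner can estimate the number of groups
-- for the column combination
CREATE STATISTICS skewed_nd (ndistinct) ON x, y FROM skewed;
ANALYZE skewed;

-- per-column ndistinct would put x first, but sorting by y first should
-- be cheaper, because it actually splits the data into many small groups
SELECT x, y, count(*) FROM skewed GROUP BY x, y;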

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

