Re: POC: GROUP BY optimization

From: Tomas Vondra
Subject: Re: POC: GROUP BY optimization
Date:
Msg-id: a3fddec6-af83-292e-16d6-00edc6be67e4@enterprisedb.com
In response to: Re: POC: GROUP BY optimization  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: POC: GROUP BY optimization
List: pgsql-hackers

On 7/24/23 14:04, Andrey Lepikhov wrote:
> On 24/7/2023 16:56, Tomas Vondra wrote:
>> On 7/24/23 04:10, Andrey Lepikhov wrote:
>>> On 20/7/2023 18:46, Tomas Vondra wrote:
>>>> On 7/20/23 08:37, Andrey Lepikhov wrote:
>>>>> On 3/10/2022 21:56, Tom Lane wrote:
>>>>>> Revert "Optimize order of GROUP BY keys".
>>>>>>
>>>>>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>>>>>> several follow-on fixes.
>>>>>> ...
>>>>>> Since we're hard up against the release deadline for v15, let's
>>>>>> revert these changes for now.  We can always try again later.
>>>>>
>>>>> It may be time to restart the project. As a first step, I rebased the
>>>>> patch on the current master. It wasn't trivial because of some recent
>>>>> optimizations (a29eab, 1349d27 and 8d83a5d).
>>>>> Now, let's repeat the review and rewrite the current patch according to
>>>>> the reasons stated in the revert commit.
>>>>
>>>> I think the fundamental task is to make the costing more reliable, and
>>>> the commit message 443df6e2db points out a couple challenges in this
>>>> area. Not sure how feasible it is to address enough of them ...
>>>>
>>>> 1) procost = 1.0 - I guess we could make this more realistic by doing
>>>> some microbenchmarks and tuning the costs for the most expensive cases.
>>>>
>>>> 2) estimating quicksort comparisons - This relies on ndistinct
>>>> estimates, and I'm not sure how much more reliable we can make those.
>>>> Probably not much :-( Not sure what to do about this, the only thing I
>>>> can think of is to track "reliability" of the estimates and only do the
>>>> reordering if we have high confidence in the estimates. That means
>>>> we'll
>>>> miss some optimization opportunities, but it should limit the risk.
>>> I read up on the history of this thread.
>>> As I see it, all the problems mentioned above can be avoided by dropping
>>> the new cost model entirely. We can simply sort the GROUP BY columns
>>> according to their 'ndistinct' values.
>>> I see the reason for introducing the cost model in [1]. The main
>>> supporting point there is that with this patch, people can no longer
>>> optimize the query themselves by arranging the columns in a more optimal
>>> order. But now we at least have a GUC to switch off the behaviour
>>> introduced here. Also, some extensions, like the well-known pg_hint_plan,
>>> can help with automation.
>>
>> I think the main concern is that if we reorder the group keys and get it
>> wrong, it's a regression. If that happens often (due to costing based on
>> poor stats), it's a problem. Yes, there's a GUC, but that's a rather
>> blunt instrument, unfortunately.
> I see. My point here is that if the ndistinct of one column is much higher
> than the ndistinct of another, it is likely that this relationship also
> holds in the grouping phase. We can get some regressions here, but they
> can be outweighed by the opportunity described below.

I think the word "probable" captures exactly what the problem is. Because
what if it isn't? Also, we don't actually need ndistinct for individual
attributes, but for groups of attributes. Imagine you do

GROUP BY a, b

so you need to estimate whether to do (a,b) or (b,a). You need to calculate

  ndistinct(a,b) / ndistinct(a)

and

  ndistinct(b,a) / ndistinct(b)

And similarly for more grouping keys. This is why we have ndistinct
extended statistics, mostly.
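For illustration (this is not part of the original message), here is a
minimal SQL sketch of how ndistinct extended statistics make the
multi-column estimate available to the planner, so ratios like
ndistinct(a,b) / ndistinct(a) can come from collected statistics rather
than guesses. The table t(a, b) and the statistics name are hypothetical:

  -- Hypothetical table, only for illustration; b is partly determined by a.
  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i % 1000, i % 100 FROM generate_series(1, 100000) AS i;

  -- Without this, the planner only has per-column ndistinct from pg_stats.
  -- The "ndistinct" statistics kind also collects ndistinct(a, b).
  CREATE STATISTICS t_ab_ndistinct (ndistinct) ON a, b FROM t;
  ANALYZE t;

  -- Per-column and multi-column distinct estimates, respectively:
  SELECT attname, n_distinct FROM pg_stats
  WHERE tablename = 't' AND attname IN ('a', 'b');
  SELECT n_distinct FROM pg_stats_ext
  WHERE statistics_name = 't_ab_ndistinct';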

>>
>>> So, how about committing of the undoubted part of the feature and
>>> working on the cost model in a new thread?
>>>
>>
>> But Tom's commit message says this:
>>
>>      Worse, to arrive at estimates of the number of calls made to the
>>      lower-order-column comparison functions, the code needs to make
>>      estimates of the numbers of distinct values of multiple columns,
>>      which are necessarily even less trustworthy than per-column stats.
>>
>> so I'm not sure this really counts as "undoubted".
> Don't try to estimate across multiple columns - just sort the columns
> according to their ndistinct values, as a heuristic.

Surely you realize how easily such simple heuristics can fail, right?

> I think we should estimate the number of distinct values across multiple
> columns only if we have extended statistics on those columns. And this
> could extend the applicability of extended statistics.
> 

I don't see why this should estimate ndistinct differently from every
other place. That'd certainly be surprising. I can be convinced, but
there needs to be sound justification for why that's the right way to do
it (i.e. why it's less likely to cause poor planning choices).

> The suggestion above is just an attempt to gather low-hanging fruit
> right now. If it is not feasible, we will have to take the long way.
> 
I'm not sure this qualifies as "low-hanging fruit" - that generally
means simple changes with little risk. But you're using it for a rather
simplistic heuristic that can easily misfire (I think). At least that's
my impression; I can be convinced otherwise.
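To make that concern concrete, here is a hypothetical example (mine, not
from the thread) of something an ndistinct-only ordering never sees:
comparison cost. Column a has more distinct values than b, so the heuristic
would sort on a first, but every comparison on a has to wade through a long
common prefix, while b is a cheap integer. Which order actually wins depends
on comparison costs and tie rates, neither of which the heuristic considers:

  -- Hypothetical data: a is an expensive-to-compare text column with
  -- ndistinct = 1000, b is a cheap int column with ndistinct = 100.
  CREATE TABLE grp (a text, b int);
  INSERT INTO grp
  SELECT repeat('x', 1000) || (i % 1000), i % 100
  FROM generate_series(1, 100000) AS i;
  ANALYZE grp;

  -- Force sort-based grouping so the GROUP BY order determines the sort keys.
  SET enable_hashagg = off;
  EXPLAIN ANALYZE SELECT a, b, count(*) FROM grp GROUP BY a, b;  -- sorts on (a, b)
  EXPLAIN ANALYZE SELECT a, b, count(*) FROM grp GROUP BY b, a;  -- sorts on (b, a)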


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


