Re: POC: GROUP BY optimization

From: Tomas Vondra
Subject: Re: POC: GROUP BY optimization
Date:
Msg-id: 721b3ff9-f214-f5d4-86c7-bae515bbec18@enterprisedb.com
In reply to: Re: POC: GROUP BY optimization  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: POC: GROUP BY optimization  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
           Re: POC: GROUP BY optimization  ("Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>)
List: pgsql-hackers

On 1/21/22 12:09, Andrey Lepikhov wrote:
> On 7/22/21 3:58 AM, Tomas Vondra wrote:
>> 4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
>> estimate_num_groups_incremental. In principle yes, if we use "group
>> size" from the previous step, then the returned value is the number of
>> new groups after adding the "new" pathkey.
>> But even if we ignore the issues with amplification mentioned in (3),
>> there's an issue with non-linear behavior in estimate_num_groups,
>> because at some point it's calculating
>>
>>     D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))
>>
>> where N is the total number of rows, p the sample size, and n the
>> number of distinct values. And if we have (N1,n1) and (N2,n2), then
>> the ratio of the calculated estimates (which is pretty much what
>> calculating the group size does)
>>
>>     D(N2,n2,p2) / D(N1,n1,p1)
>>
>> will differ depending on p1 and p2. And if we keep tweaking
>> tuplesPerPrevGroup all the time, that's a problem - it may make the
>> groups smaller or larger in unpredictable ways, and I wonder if it
>> might go against the idea of penalizing tuplesPerPrevGroup to some
>> extent.
>> We could simply use the input "tuples" value here, and then divide the
>> current and previous estimates to calculate the number of new groups.
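
To make that non-linearity concrete, here is a small standalone
illustration (plain C, entirely made-up values for N and the ndistinct
inputs) evaluating the formula above for the same two (N,n) pairs at
different sample sizes p:

    #include <math.h>
    #include <stdio.h>

    /*
     * Dell'Era approximation: expected number of distinct values when
     * sampling p rows out of N total rows containing n distinct values.
     */
    static double
    D(double N, double n, double p)
    {
        return n * (1.0 - pow((N - p) / N, N / n));
    }

    int
    main(void)
    {
        double  N = 1000000;    /* made-up table size */
        double  n1 = 1000;      /* made-up ndistinct for step 1 */
        double  n2 = 10000;     /* made-up ndistinct for step 2 */

        /* same (N,n) pairs, ratio evaluated at different sample sizes */
        printf("p = N:     %f\n", D(N, n2, N) / D(N, n1, N));
        printf("p = 10000: %f\n", D(N, n2, 10000) / D(N, n1, 10000));
        printf("p = 100:   %f\n", D(N, n2, 100) / D(N, n1, 100));

        return 0;
    }

With these numbers the ratio is exactly 10 for p = N, about 6.3 for
p = 10000, and about 1.05 for p = 100 - so the derived number of "new"
groups really does depend heavily on the tuplesPerPrevGroup we feed in.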
>
> tuplesPerPrevGroup is only an upper bound for the estimate. I think
> we should use the result of the previous estimation as a limit for the
> next one during incremental estimation. Maybe we can simply clamp the
> tuplesPerPrevGroup value to ensure it is monotonically non-increasing?
> See the attached patch on top of the previous fixes.
> 

Yes, I think that's a reasonable defense - it makes no sense to exceed 
the group size from the preceding step, which could happen with a small 
number of groups or an unexpected ndistinct estimate.
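
FWIW the defense itself can be a one-line clamp. A sketch of what I 
have in mind (variable names are mine, not necessarily the patch's; 
nGroups is the incremental estimate for the current pathkey prefix, 
and 1.5 is the worst-case coefficient discussed below):

    /*
     * Derive the group size for the next step, but never let it exceed
     * the group size from the preceding step.
     */
    tuplesPerPrevGroup = Min(tuplesPerPrevGroup,
                             ceil(1.5 * tuplesPerPrevGroup / nGroups));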

The other thing we could do is reduce the coefficient gradually - so 
it'd be 1.5 for the first pathkey, then 1.25 for the next one, and so 
on. But that seems somewhat arbitrary (I certainly don't have a sound 
theoretical justification for it ...).
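
If we did want to try it, it'd be a trivial variation of the same 
clamp - e.g. halving the extra 0.5 at each step (pathkeyPos being a 
hypothetical 0-based pathkey index):

    /* 1.5 for the first pathkey, 1.25 for the second, 1.125, ... */
    double      coeff = 1.0 + 0.5 / pow(2.0, pathkeyPos);

    tuplesPerPrevGroup = Min(tuplesPerPrevGroup,
                             ceil(coeff * tuplesPerPrevGroup / nGroups));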

I've merged most of the fixes you reported. I've skipped the path_save 
removal in planner.c, because that seems incorrect - if there are 
multiple pathkeys, we must start with the original path, not the 
modified one we built in the last iteration. Or am I missing something?
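
To spell out the loop shape I mean (simplified sketch, not verbatim 
from the patch; pathkey_orderings is a stand-in for the list of 
candidate orderings):

    path_save = path;           /* remember the original input path */

    foreach(lc, pathkey_orderings)
    {
        /*
         * Each candidate ordering must be built on top of the original
         * path, not the one modified by the previous iteration.
         */
        path = path_save;

        /* ... build sort/group paths on top of 'path' ... */
    }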

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company