Re: 回复: An implementation of multi-key sort

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: 回复: An implementation of multi-key sort
Дата
Msg-id bc6bece9-9bd6-444e-a933-4fe82a00fb10@enterprisedb.com
обсуждение исходный текст
Ответ на Re: 回复: An implementation of multi-key sort  (Konstantin Knizhnik <knizhnik@garret.ru>)
Список pgsql-hackers

On 7/7/24 08:32, Konstantin Knizhnik wrote:
> 
> On 04/07/2024 3:45 pm, Yao Wang wrote:
>> Generally, the benefit of mksort is mainly from duplicated values and
>> sort
>> keys: the more duplicated values and sort keys are, the bigger benefit it
>> gets.
> ...
>> 1. Use distinct stats info of table to enable mksort
>>
>> It's kind of heuristics: in optimizer, check
>> Form_pg_statistic->stadistinct
>> of a table via pg_statistics. Enable mksort only when it is less than a
>> threshold.
>>
>> The hacked code works, which need to modify a couple of interfaces of
>> optimizer. In addition, a complete solution should consider types and
>> distinct values of all columns, which might be too complex, and the
>> benefit
>> seems not so big.
> 
> 
> If mksort really provides advantage only when there are a lot of
> duplicates (for prefix keys?) and of small fraction of duplicates there
> is even some (small) regression
> then IMHO taking in account in planner information about estimated
> number of distinct values seems to be really important. What was a
> problem with accessing this statistics and why it requires modification
> of optimizer interfaces? There is `get_variable_numdistinct` function
> which is defined and used only in selfuncs.c
> 

Yeah, I've been wondering about that too. But I'm also a bit unsure if
using this known-unreliable statistics (especially with filters and
multiple columns) would actually fix the regressions.

> Information about values distribution seems to be quite useful for 
> choosing optimal sort algorithm. Not only for multi-key sort
> optimization. For example if we know min.max value of sort key and it is
> small, we can use O(N) algorithm for sorting. Also it can help to
> estimate when TOP-N search is preferable.
> 

This assumes the information is accurate / reliable, and I'm far from
sure about that.

> Right now Posgres creates special path for incremental sort. I am not
> sure if we also need to be separate path for mk-sort.
> But IMHO if we need to change some optimizer interfaces to be able to
> take in account statistic and choose preferred sort algorithm at
> planning time, then it should be done.
> If mksort can increase sort more than two times (for large number of
> duplicates), it will be nice to take it in account when choosing optimal
> plan.
> 

I did commit the incremental sort patch, and TBH I'm not convinced I'd
do that again. It's a great optimization when it works (and it seems to
work in plenty of cases), but we've also had a number of reports about
significant regressions, where the incremental sort costing is quite
off. Granted, it's often about cases where we already had issues and
incremental sort just "exacerbates" that (say, with LIMIT queries), but
that's kinda the point I'm trying to make - stats are inherently
incomplete / simplified, and some plans are more sensitive to that.

Which is why I'm wondering if we might do the decision based on some
information collected at runtime.

> Also in this case we do not need extra GUC for explicit enabling of
> mksort. There are too many parameters for optimizer and adding one more
> will make tuning more complex. So I prefer that decision is take buy
> optimizer itself based on the available information, especially if
> criteria seems to be obvious.
> 

The GUC is very useful for testing, so let's keep it for now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Nathan Bossart
Дата:
Сообщение: Re: optimizing pg_upgrade's once-in-each-database steps
Следующее
От: "David E. Wheeler"
Дата:
Сообщение: Re: ❓ JSON Path Dot Precedence