Re: optimisation? collation "C" sorting for GroupAggregate for alldeterministic collations

Поиск

Список

Период

Сортировка

От	James Coleman
Тема	Re: optimisation? collation "C" sorting for GroupAggregate for alldeterministic collations
Дата	23 марта 2020 г. 00:28:05
Msg-id	CAAaqYe8v__0mAQnVJm=nuMixwS3BQ7WP9C7iBPJ8te2Q7Prcbw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: optimisation? collation "C" sorting for GroupAggregate for alldeterministic collations (Pavel Stehule <pavel.stehule@gmail.com>)
Ответы	Re: optimisation? collation "C" sorting for GroupAggregate for alldeterministic collations Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations
Список	pgsql-hackers

Дерево обсуждения

On Sun, Mar 22, 2020 at 5:33 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> ne 22. 3. 2020 v 10:12 odesílatel Maxim Ivanov <hi@yamlcoder.me> napsal:
>>
>> Hi All,
>>
>> It is known, that  collation "C" significantly speeds up string comparisons and as a result sorting. I was
wondering,whether it is possible to use it regardless of collation set on a column in sorts not visible to users? 
>>
>> Example I have in  mind is sorting performed for GroupAggregate. Purpose of that sort is to bring equal values next
toeach other, so as long as: 
>>    1) user didn't request ORDER BY in addition to GROUP BY
>>    2) source column has any deterministic collation (as per docs all builtin collations are deterministic)
>>
>> it seems to be possible to do sorting with any deterministic collation, regardless of what user specifid for the
columnbeing sorted. "C" collation is deterministic and fastest. 
>>
>> In other words, couldn't PostgreSQL convert this:
>>
>> ->  GroupAggregate  (cost=15726557.87..22944558.69 rows=7200001 width=176) (actual time=490103.209..771536.389
rows=36000000loops=1) 
>>       Group Key: ec_180days.msn, ec_180days.to_date_time
>>       ->  Sort  (cost=15726557.87..15906557.89 rows=72000008 width=113) (actual time=490094.849..524854.662
rows=72000000loops=1) 
>>             Sort Key: ec_180days.msn, ec_180days.to_date_time
>>             Sort Method: external merge  Disk: 7679136kB
>>
>> To this:
>>
>> ->  GroupAggregate  (cost=14988274.87..22206275.69 rows=7200001 width=155) (actual time=140497.729..421510.001
rows=36000000loops=1) 
>>       Group Key: ec_180days.msn, ec_180days.to_date_time
>>       ->  Sort  (cost=14988274.87..15168274.89 rows=72000008 width=92) (actual time=140489.807..174228.722
rows=72000000loops=1) 
>>             Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time
>>             Sort Method: external merge  Disk: 7679136kB
>>
>>
>> which is 3 times faster in my tests.
>
>
> I had a same idea. It is possible only if default collation is deterministic. Probably it will be less important if
abbreviatesort will be enabled, but it is disabled now. 
>
> p.s. can be interesting repeat your tests with ICU locale where abbreviate sort is enabled.

Perhaps this is what you mean by "deterministic", but isn't it
possible for some collations to treat multiple byte sequences as equal
values? And those multiple byte sequences wouldn't necessarily occur
sequentially in C collation, so it wouldn't be possible to work around
that by having the grouping node use one collation but the sorting
node use the C one.

If my memory is incorrect, then this sounds like an intriguing idea.

James

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Andres Freund
Дата: 23 марта 2020 г., 00:18:28
Сообщение: Re: Assert() failures during RI checks

Следующее

От: Andreas Karlsson
Дата: 23 марта 2020 г., 01:02:46
Сообщение: Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: optimisation? collation "C" sorting for GroupAggregate for alldeterministic collations

Предыдущее

Следующее