Re: 回复: An implementation of multi-key sort

Поиск
Список
Период
Сортировка
От Yao Wang
Тема Re: 回复: An implementation of multi-key sort
Дата
Msg-id CACq1W07fAZpuqMHS+3xfjbu7QoGE26XTSqTbxO0f_=pAx8=vhQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: 回复: An implementation of multi-key sort  (Yao Wang <yao-yw.wang@broadcom.com>)
Ответы Re: 回复: An implementation of multi-key sort
Список pgsql-hackers
To be accurate, "multi-key sort" includes both "multi-key quick sort"
and "multi-key heap sort". This patch includes code change related to
only "multi-key quick sort" which is used to replace standard quick
sort for tuplesort. The "multi-key heap sort" is about an implementation
of multi-key heap and should be treated as a separated task. We need
to clarify the naming to avoid confusion.

I updated code which is related to only function/var renaming and
relevant comments, plus some minor assertions changes. Please see the
attachment.


Thanks,

Yao Wang

On Fri, May 31, 2024 at 8:09 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
>
> I added two optimizations to mksort which exist on qsort_tuple():
>
> 1. When selecting pivot, always pick the item in the middle of array but
> not by random. Theoretically it has the same effect to old approach, but
> it can eliminate some unstable perf test results, plus a bit perf benefit by
> removing random value generator.
> 2. Always check whether the array is ordered already, and return
> immediately if it is. The pre-ordered check requires extra cost and
> impacts perf numbers on some data sets, but can improve perf
> significantly on other data sets.
>
> By now, mksort has perf results equal or better than qsort on all data
> sets I ever used.
>
> I also updated test case. Please see v3 code as attachment.
>
> Perf test results:
>
> Data set 1 (with mass duplicate values):
> -----------------------------------------
>
> create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t1 values (generate_series(1,499999), 0, 0, 0, 0,
> 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb');
> update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
> update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> || (c1 % 5)::text;
>
> Query 1:
>
> explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 3021.636 ms
> 3014.669 ms
> 3033.588 ms
>
> Enable Mksort
>
> 1688.590 ms
> 1686.956 ms
> 1688.567 ms
>
> The improvement is 78.9%, which is reduced from the previous version
> (129%). The most cost should be the pre-ordered check.
>
> Query 2:
>
> create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1);
>
> Disable Mksort
>
> 1674.648 ms
> 1680.608 ms
> 1681.373 ms
>
> Enable Mksort
>
> 1143.341 ms
> 1143.462 ms
> 1143.894 ms
>
> The improvement is ~47%, which is also reduced a bit (52%).
>
> Data set 2 (with distinct values):
> ----------------------------------
>
> create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
> update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
> = 999993 - c1;
> update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> || (999994 - c1)::text;
>
> Query 1:
>
> explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 12199.963 ms
> 12197.068 ms
> 12191.657 ms
>
> Enable Mksort
>
> 9538.219 ms
> 9571.681 ms
> 9536.335 ms
>
> The improvement is 27.9%, which is much better than the old approach (-6.2%).
>
> Query 2 (the data is pre-ordered):
>
> explain analyze select c1 from t2 order by c6 desc, c5, c4, c3, c2, c1;
>
> Enable Mksort
>
> 768.191 ms
> 768.079 ms
> 767.026 ms
>
> Disable Mksort
>
> 768.757 ms
> 766.166 ms
> 766.149 ms
>
> They are almost the same since no actual sort was performed, and much
> better than the old approach (-1198.1%).
>
>
> Thanks,
>
> Yao Wang
>
> On Fri, May 24, 2024 at 8:50 PM Yao Wang <yao-yw.wang@broadcom.com> wrote:
> >
> > When all leading keys are different, mksort will finish the entire sort at the
> > first sort key and never touch other keys. For the case, mksort falls back to
> > kind of qsort actually.
> >
> > I created another data set with distinct values in all sort keys:
> >
> > create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> > insert into t2 values (generate_series(1,499999), 0, 0, 0, 0, '');
> > update t2 set c2 = 999990 - c1, c3 = 999991 - c1, c4 = 999992 - c1, c5
> > = 999993 - c1;
> > update t2 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
> >   || (999994 - c1)::text;
> > explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
> >
> > Results:
> >
> > MKsort:
> > 12374.427 ms
> > 12528.068 ms
> > 12554.718 ms
> >
> > qsort:
> > 12251.422 ms
> > 12279.938 ms
> > 12280.254 ms
> >
> > MKsort is a bit slower than qsort, which can be explained by extra
> > checks of MKsort.
> >
> > Yao Wang
> >
> > On Fri, May 24, 2024 at 8:36 PM Wang Yao <yaowangm@outlook.com> wrote:
> > >
> > >
> > >
> > > 获取Outlook for Android
> > > ________________________________
> > > From: Heikki Linnakangas <hlinnaka@iki.fi>
> > > Sent: Thursday, May 23, 2024 8:47:29 PM
> > > To: Wang Yao <yaowangm@outlook.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
> > > Cc: interma@outlook.com <interma@outlook.com>
> > > Subject: Re: 回复: An implementation of multi-key sort
> > >
> > > On 23/05/2024 15:39, Wang Yao wrote:
> > > > No obvious perf regression is expected because PG will follow original
> > > > qsort code path when mksort is disabled. For the case, the only extra
> > > > cost is the check in tuplesort_sort_memtuples() to enter mksort code path.
> > >
> > > And what about the case the mksort is enabled, but it's not effective
> > > because all leading keys are different?
> > >
> > > --
> > > Heikki Linnakangas
> > > Neon (https://neon.tech)
> > >

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: ssl tests fail due to TCP port conflict
Следующее
От: "David E. Wheeler"
Дата:
Сообщение: Re: Patch bug: Fix jsonpath .* on Arrays