Re: 回复: An implementation of multi-key sort

Поиск
Список
Период
Сортировка
От Yao Wang
Тема Re: 回复: An implementation of multi-key sort
Дата
Msg-id CACq1W055tbEnBWcFCXy=+ZBKTURhoDcieJ8Cj5UkQtGM_rqQNw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: 回复: An implementation of multi-key sort  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: 回复: An implementation of multi-key sort
Re: 回复: An implementation of multi-key sort
Список pgsql-hackers
hi Tomas,

So many thanks for your kind response and detailed report. I am working
on locating issues based on your report/script and optimizing code, and
will update later.

Could you please also send me the script to generate report pdf
from the test results (explain*.log)? I can try to make one by myself,
but I'd like to get a report exactly the same as yours. It's really
helpful.

Thanks in advance.


Yao Wang

On Mon, Jun 10, 2024 at 5:09 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hello Yao,
>
> I was interested in the patch, considering the promise of significant
> speedups of sorting, so I took a quick look and did some basic perf
> testing today. Unfortunately, my benchmarks don't really confirm any
> peformance benefits, so I haven't looked at the code very much and only
> have some very basic feedback:
>
> 1) The new GUC is missing from the .sample config, triggering a failure
> of "make check-world". Fixed by 0002.
>
> 2) There's a place mixing tabs/spaces in indentation. Fixed by 0003.
>
> 3) I tried running pgindent, mostly to see how that would affect the
> comments, and for most it's probably fine, but a couple are mangled
> (usually those with a numbered list of items). Might needs some changes
> to use formatting that's not reformatted like this. The changes from
> pgindent are in 0004, but this is not a fix - it just shows the changes
> after running pgindent.
>
> Now, regarding the performance tests - I decided to do the usual black
> box testing, i.e. generate tables with varying numbers of columns, data
> types, different data distribution (random, correlated, ...) and so on.
> And then run simple ORDER BY queries on that, measuring timing with and
> without mk-sort, and checking the effect.
>
> So I wrote a simple bash script (attached) that does exactly that - it
> generates a table with 1k - 10M rows, fills with with data (with some
> basic simple data distributions), and then runs the queries.
>
> The raw results are too large to attach, I'm only attaching a PDF
> showing the summary with a "speedup heatmap" - it's a pivot with the
> parameters on the left, and then the GUC and number on columns on top.
> So the first group of columns is with enable_mk_sort=off, the second
> group with enable_mk_sort=on, and finally the heatmap with relative
> timing (enable_mk_sort=on / enable_mk_sort=off).
>
> So values <100% mean it got faster (green color - good), and values
> >100% mean it got slower (red - bad). And the thing is - pretty much
> everything is red, often in the 200%-300% range, meaning it got 2x-3x
> slower. There's only very few combinations where it got faster. That
> does not seem very promising ... but maybe I did something wrong?
>
> After seeing this, I took a look at your example again, which showed
> some nice speedups. But it seems very dependent on the order of keys in
> the ORDER BY clause. For example consider this:
>
> set enable_mk_sort = on;
> explain (analyze, timing off)
> select * from t1 order by c6, c5, c4, c3, c2, c1;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c6, c5, c4, c3, c2, c1
>    Sort Method: quicksort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.054 ms
>  Execution Time: 1095.183 ms
> (6 rows)
>
> set enable_mk_sort = on;
> explain (analyze, timing off)
> select * from t1 order by c6, c5, c4, c3, c2, c1;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c6, c5, c4, c3, c2, c1
>    Sort Method: multi-key quick sort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.130 ms
>  Execution Time: 633.635 ms
> (6 rows)
>
> Which seems great, but let's reverse the sort keys:
>
> set enable_mk_sort = off;
> explain (analyze, timing off)
> select * from t1 order by c1, c2, c3, c4, c5, c6;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c1, c2, c3, c4, c5, c6
>    Sort Method: quicksort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.146 ms
>  Execution Time: 170.085 ms
> (6 rows)
>
> set enable_mk_sort = off;
> explain (analyze, timing off)
> select * from t1 order by c1, c2, c3, c4, c5, c6;
>
>                          QUERY PLAN
> -------------------------------------------------------------------
>  Sort  (cost=72328.81..73578.81 rows=499999 width=76)
>        (actual rows=499999 loops=1)
>    Sort Key: c1, c2, c3, c4, c5, c6
>    Sort Method: multi-key quick sort  Memory: 59163kB
>    ->  Seq Scan on t1  (cost=0.00..24999.99 rows=499999 width=76)
>                        (actual rows=499999 loops=1)
>  Planning Time: 0.127 ms
>  Execution Time: 367.263 ms
> (6 rows)
>
> I believe this is the case Heikki was asking about. I see the response
> was that it's OK and the overhead is very low, but without too much
> detail so I don't know what case you measured.
>
> Anyway, I think it seems to be very sensitive to the exact data set.
> Which is not entirely surprising, I guess - most optimizations have a
> mix of improved/regressed cases, yielding a heatmap with a mix of green
> and red areas, and we have to either optimize the code (or heuristics to
> enable the feature), or convince ourselves the "red" cases are less
> important / unlikely etc.
>
> But here the results are almost universally "red", so it's going to be
> very hard to convince ourselves this is a good trade off. Of course, you
> may argue the cases I've tested are wrong and not representative. I
> don't think that's the case, though.
>
> It's also interesting (and perhaps a little bit bizarre) that almost all
> the cases that got better are for a single-column sort. Which is exactly
> the case the patch should not affect. But it seems pretty consistent, so
> maybe this is something worth investigating.
>
> FWIW I'm not familiar with the various quicksort variants, but I noticed
> that the Bentley & Sedgewick paper mentioned as the basis for the patch
> is from 1997, and apparently implements stuff originally proposed by
> Hoare in 1961. So maybe this is just an example of an algorithm that was
> good for a hardware at that time, but the changes (e.g. the growing
> important of on-CPU caches) made it less relevant?
>
> Another thing I noticed while skimming [1] is this:
>
>     The algorithm is designed to exploit the property that in many
>     problems, strings tend to have shared prefixes.
>
> If that's the case, isn't it wrong to apply this to all sorts, including
> sorts with non-string keys? It might explain why your example works OK,
> as it involves key c6 which is string with all values sharing the same
> (fairly long) prefix. But then maybe we should be careful and restrict
> this to only such those cases?
>
> regards
>
> [1] https://en.wikipedia.org/wiki/Multi-key_quicksort
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: Conflict Detection and Resolution
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: Conflict Detection and Resolution