Обсуждение: Use quick select instead of qsort to get median

Поиск
Список
Период
Сортировка

Use quick select instead of qsort to get median

От
"houzj.fnst@fujitsu.com"
Дата:
Hi,

When I was writing an extension which need to get the median of an array, I
tried to find if postgres provide some api that can do that. I found all the
places in postgres invoke qsort() and then get the median. I was thinking can
we do better by using "quick select" and is it worth it.

Currently, there are some places[1] in the code that need the median and can
use "quick select" instead. And some of them(spg_box_quad_picksplit /
spg_range_quad_picksplit) are invoked frequently when INSERT or CREATE INDEX.
So, Peronally, It's acceptable to introduce a quick select api to improve these
places.

Since most of the logic of "quick select" is similar to quick sort, I think
we can reuse the code in sort_template.h. We only need to let the sort stop
when find the target top Kth element.

Attach a POC patch about this idea. I did some simple performance tests, I can
see about 10% performance gain in this test[2].

Thoughts ?

[1]
1.
entry_dealloc
    ...
    /* Record the (approximate) median usage */
    if (i > 0)
        pgss->cur_median_usage = entries[i / 2]->counters.usage;
2.
spg_box_quad_picksplit
    qsort(lowXs, in->nTuples, sizeof(float8), compareDoubles);
    ...
    centroid->low.x = lowXs[median];

3.
spg_range_quad_picksplit
    qsort(lowerBounds, nonEmptyCount, sizeof(RangeBound),
    ...
    centroid = range_serialize(typcache, &lowerBounds[median],

4.
spg_quad_picksplit
    qsort(sorted, in->nTuples, sizeof(*sorted), x_cmp);
    ...
    centroid->x = sorted[median]->x;



[2]
drop table quad_box_tbl;
CREATE unlogged TABLE quad_box_tbl (id int, b box);
truncate quad_box_tbl ;
explain (verbose, analyze)INSERT INTO quad_box_tbl
  SELECT (x - 1) * 10 + x, box(point(x * 10, x * 20), point(x * 10, x * 20 + 5))
  FROM generate_series(1, 1000000) x order by random();

-----test create index
drop index quad_box_tbl_idx;
CREATE INDEX quad_box_tbl_idx ON quad_box_tbl USING spgist(b);

-------test results
PATCH:
Time: 2609.664 ms (00:02.610)

HEAD:
Time: 2903.765 ms (00:02.944)

Best regards,
Houzj


Вложения

Re: Use quick select instead of qsort to get median

От
John Naylor
Дата:

On Thu, Jul 22, 2021 at 8:07 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> When I was writing an extension which need to get the median of an array, I
> tried to find if postgres provide some api that can do that. I found all the
> places in postgres invoke qsort() and then get the median. I was thinking can
> we do better by using "quick select" and is it worth it.

> Attach a POC patch about this idea. I did some simple performance tests, I can
> see about 10% performance gain in this test[2].
>
> Thoughts ?

> 1.
> entry_dealloc
>         ...
>         /* Record the (approximate) median usage */
>         if (i > 0)
>                 pgss->cur_median_usage = entries[i / 2]->counters.usage;

It might be useful to be more precise here, but it seems it would be slower, too?

> -----test create index
> drop index quad_box_tbl_idx;
> CREATE INDEX quad_box_tbl_idx ON quad_box_tbl USING spgist(b);
>
> -------test results
> PATCH:
> Time: 2609.664 ms (00:02.610)
>
> HEAD:
> Time: 2903.765 ms (00:02.944)

That index type is pretty rare, as far as I know. That doesn't seem to be quite enough motivation to change the qsort template. If the goal was to improve the speed of "create spgist index", would this still be the best approach? Also, there are other things under consideration that would add complexity to the qsort template [1], and this would add even more.

Looking in the docs [2], we don't have a MEDIAN aggregate, but we do have percentile_disc(), and quick select might help there, but I haven't looked.

[1] https://commitfest.postgresql.org/33/3038/
[2] https://www.postgresql.org/docs/14/functions-aggregate.html
--
John Naylor
EDB: http://www.enterprisedb.com