Implementation of median in PostgreSQL - questions

Поиск
Список
Период
Сортировка
От Pavel Stehule
Тема Implementation of median in PostgreSQL - questions
Дата
Msg-id AANLkTilOSxIlxfMHldZcgAhFGz9CB_1tjgF4MTdfVEnT@mail.gmail.com
обсуждение исходный текст
Список pgsql-hackers
Hello

I am planning to start to implement median function. I wrote some
array based implementation - it is fast, but I hope, so can be much
faster.

The basic question is method of implementation. It can be implemented
via a) custum aggregate functions or b) executor node.

Adventage of @a variant is simplicity - we need to teach aggregates to
ensure ordered input and ensure to use index only (maybe add flag
ORDERED INPUT [DESC|ASC]). Now PostgreSQL doesn't use a index for scan
to orderd aggregate - it can be a problem for large datasets. I found
some missing info in EXPLAIN about ordered aggregates - there are
showed nothing about sort

pavel@postgres:5432=# explain analyze verbose select array_agg(a order
by a) from omega;                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=1643.00..1643.01 rows=1 width=4) (actual
 
time=555.091..555.092 rows=1 loops=1)  Output: array_agg(a ORDER BY a)  ->  Seq Scan on public.omega
(cost=0.00..1393.00rows=100000
 
width=4) (actual time=0.050..177.547 rows=100000 loops=1)        Output: aTotal runtime: 555.839 ms
(5 rows)

Probably we have to access a tuple store inside sfunc - when the data
size is out of work memory. And for effective evaluating it needs
patch  https://commitfest.postgresql.org/action/patch_view?id=292

variant @b is more complex - but allows more possibilities - idea:
median is one from aggregate executor nodes (theoretically it can call
some custom final function in future - but I don't think about it
now). It has a few advantages:

a) we don't need to modify current aggregates
b) for datasets smaller than working_mem can be used a quickselect
algorithms - www-stat.stanford.edu/~ryantibs/papers/median.pdf
c) for larger datasets we can use integrated external sort with direct
reading - we don't need to stack result to array

I prefer a variant b. It offers a more possibilities - and there are
less chance to break some existing.

comments are welcome

Pavel Stehule


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: proof concept: do statement parametrization
Следующее
От: Dimitri Fontaine
Дата:
Сообщение: Re: pg_archive_bypass