Re: wip: functions median and percentile

Поиск
Список
Период
Сортировка
От Dean Rasheed
Тема Re: wip: functions median and percentile
Дата
Msg-id AANLkTinAaLDa5oOMoQ4fC_32hu1AfpjRZ7+b2jecCz6a@mail.gmail.com
обсуждение исходный текст
Ответ на Re: wip: functions median and percentile  (Hitoshi Harada <umi.tanuki@gmail.com>)
Ответы Re: wip: functions median and percentile  (Hitoshi Harada <umi.tanuki@gmail.com>)
Список pgsql-hackers
On 5 October 2010 07:04, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2010/10/5 Dean Rasheed <dean.a.rasheed@gmail.com>:
>> On 4 October 2010 18:22, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Oct 4, 2010 at 2:58 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>>> That requires a new sort for each row. I generated this with a minor
>>>> tweak to Pavel's patch to just restart the tuplesort each time (the
>>>> "quick-fix" solution). The problem is that performance really sucks,
>>>> because it is an O(n^2 log(n)) algorithm.
>>>
>>> Maybe that's OK.  If you're doing repeated median operations on large
>>> data sets, perhaps you should expect that to be slow.  I bet that
>>> people who want to use this as a window function will want one median
>>> per group, not n medians per group; and it doesn't seem like a good
>>> idea to say - we're not going to let you use this as a window function
>>> AT ALL because you might decide to do something that will be really
>>> slow.  You can always hit ^C if you get tired of waiting.  This seems
>>> like it's very far from being the most important thing for us to
>>> optimize, though of course it's great if we can.
>>>
>>
>> Well FWIW, here's the quick-fix solution to make this median function
>> support window queries.
>
> Is this safe? It seems to me that the tuplesort_end isn't called in
> window aggregates at the last of a partition, which results in
> forgetting to close temp file if tuplesort is in state of tape.
>

Yeah, you're right. Actually I hate doing it this way, and only really
suggested it because I like it even less to add an aggregate that
arbitrarily doesn't support window queries. This approach will at
least work well for arbitrary sized partitions without an ORDER BY
clause.

With an ORDER BY clause though, even in the best-case situation
(values in the partition already sorted), this is going to be O(n^2)
for a running median, but it seems like it would be significantly more
work to come up with a better algorithm, and I'm not sure that there
is sufficient demand for this function to justify that.

Extrapolating from few quick timing tests, even in the best case, on
my machine, it would take 7 days for the running median to use up
100MB, and 8 years for it to use 2GB. So setting the tuplesort's
workMem to 2GB (only in the running median case) would actually be
safe in practice, and would prevent the temp file leak (for a few
years at least!). I feel dirty even suggesting that. Better ideas
anyone?

Regards,
Dean


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Sync Rep at Oct 5
Следующее
От: Heikki Linnakangas
Дата:
Сообщение: Re: is sync rep stalled?