Re: Memory-Bounded Hash Aggregation

From: Taylor Vesely
Subject: Re: Memory-Bounded Hash Aggregation
Msg-id: CAFaX_4Ls5UGQ2UDYkWhn2xFq=V6BKpUbKSQnWmy27VWhQ2=enA@mail.gmail.com
In response to: Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

I started to review this patch yesterday with Melanie Plageman, so we
rebased this patch over the current master. The main conflicts were
due to a simplehash patch that has been committed separately[1]. I've
attached the rebased patch.

I was playing with the code, and if one of the table's most common
values isn't placed into the initial hash table, it spills a whole lot
of tuples to disk that might have been avoided if we had some way to
'seed' the hash table with MCVs from the statistics. It seems to me that
you would then need some way of dealing with values that are in the MCV
list but ultimately don't show up in the scan. I imagine that this
kind of optimization would be most useful for aggregates on a full table
scan.
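
To make the MCV idea concrete, here is a toy sketch in Python. It is not
code from the patch: the function name, the seed_keys parameter, and the
"table is full" rule are all made up for illustration. It counts how many
tuples would be spilled with and without seeding, and it drops seeded
entries that never show up in the scan.

```python
def hash_agg_count(rows, max_groups, seed_keys=()):
    """Toy count(*) GROUP BY with a table capped at max_groups entries.

    Returns (in-memory counts, number of tuples that had to be spilled).
    """
    table = {}
    # Hypothetical seeding step: pre-create entries for keys we expect
    # to be frequent (e.g. taken from an MCV list).
    for key in list(seed_keys)[:max_groups]:
        table[key] = 0
    spilled = 0
    for key in rows:
        if key in table:
            table[key] += 1
        elif len(table) < max_groups:
            table[key] = 1
        else:
            spilled += 1  # table is full: this tuple would go to a spill file
    # Seeded keys that never appeared in the scan must not produce a group.
    result = {k: v for k, v in table.items() if v > 0}
    return result, spilled


# The very common value "x" only appears after the table has filled up.
rows = ["a", "b"] + ["x"] * 1000 + ["a", "b"]
_, spilled_unseeded = hash_agg_count(rows, max_groups=2)
result, spilled_seeded = hash_agg_count(
    rows, max_groups=2, seed_keys=["x", "never_seen"]
)
```

In this toy run the unseeded table spills all 1000 "x" tuples, while the
seeded one spills only the four "a"/"b" tuples. It also shows the downside:
"never_seen" occupies a slot for the whole scan without ever being used.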

Some questions:

Right now the patch always initializes 32 spill partitions. Have you given
any thought to how to intelligently pick an optimal number of
partitions yet?
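
As a strawman for that question, one possible heuristic is sketched below
in Python. Everything in it is an assumption on my part and not something
the patch does: the function name, the inputs (an estimated group count, a
per-entry size, and the memory budget), the clamping bounds, and the choice
to round up to a power of two so that hash bits could select the partition.

```python
import math


def choose_num_partitions(est_groups, entry_bytes, work_mem_bytes,
                          min_parts=4, max_parts=256):
    """Pick a power-of-two partition count so that each partition's groups
    are expected to fit in the memory budget. All inputs are estimates."""
    needed = math.ceil(est_groups * entry_bytes / work_mem_bytes)
    needed = max(needed, 1)
    parts = 1 << (needed - 1).bit_length()  # round up to a power of two
    return max(min_parts, min(parts, max_parts))


four_mb = 4 * 1024 * 1024
few = choose_num_partitions(100, 64, four_mb)         # clamped up to min_parts
mid = choose_num_partitions(1_000_000, 64, four_mb)   # about 61 MB of entries
many = choose_num_partitions(10**9, 64, four_mb)      # clamped down to max_parts
```

A bad group estimate makes the answer wrong in either direction, so a real
implementation would presumably still need to handle partitions that turn
out not to fit.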

> That can be done as an add-on to approach #1 by evicting the entire
> hash table (writing out the partial states), then resetting the memory
> context.

By add-on approach, do you mean to say that you have something in mind
to combine the two strategies? Or do you mean that it could be implemented
as a separate strategy?
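
For reference, this is how I read the quoted idea, as a toy Python sketch
rather than anything from the patch. The function name and the list that
stands in for a spill file are made up. When a new group doesn't fit, the
whole table is written out as partial states and the table is reset; the
partial states for each key are combined at the end.

```python
def agg_with_full_eviction(rows, max_groups):
    """sum(value) GROUP BY key. When the table is full, write out every
    partial state and start over with an empty table."""
    table = {}
    spilled_states = []  # stands in for a spill file of (key, partial state)
    for key, value in rows:
        if key not in table and len(table) >= max_groups:
            spilled_states.extend(table.items())  # evict the whole table
            table = {}                            # "reset the memory context"
        table[key] = table.get(key, 0) + value
    spilled_states.extend(table.items())
    # Combine phase: partial states for the same key are merged.
    # For sum() the combine function is plain addition.
    final = {}
    for key, state in spilled_states:
        final[key] = final.get(key, 0) + state
    return final, len(spilled_states)


rows = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5)]
final, n_states = agg_with_full_eviction(rows, max_groups=2)
```

Note that the combine phase in this sketch is not memory-bounded itself; a
real implementation would have to partition the spilled states and process
them within the same memory limit.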

> I think it's clear there's no perfect eviction strategy - for every
> algorithm we came up with we can construct a data set on which it
> performs terribly (I'm sure we could do that for the approach used by
> Greenplum, for example).
>
> So I think it makes sense to do what Jeff proposed, and then maybe try
> improving that in the future with a switch to different eviction
> strategy based on some heuristics.

I agree. It definitely feels like both spilling strategies have their
own use cases.

That said, I think it's worth mentioning that with parallel aggregates
it might actually be more useful to spill the trans values instead,
and have them combined in a Gather or Finalize stage.

[1] https://www.postgresql.org/message-id/flat/48abe675e1330f0c264ab2fe0d4ff23eb244f9ef.camel%40j-davis.com