Re: Memory-Bounded Hash Aggregation

From Pengzhou Tang
Subject Re: Memory-Bounded Hash Aggregation
Date
Msg-id CAG4reARM9L-cMPKfhcK46K=OmdAYH8h3YDA_uhpXBEoiXTRLUQ@mail.gmail.com
In response to Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Memory-Bounded Hash Aggregation  (Pengzhou Tang <ptang@pivotal.io>)
List pgsql-hackers
Hi,

I happened to notice that "set enable_sort to false" does not guarantee that the planner uses hashagg in the groupingsets.sql test,
so the following comparisons between sortagg and hashagg results appear to be meaningless.

Thanks,
Pengzhou 

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis <pgsql@j-davis.com> wrote:

Committed.

There's some future work that would be nice (some of these are just
ideas and may not be worth it):

* Refactor MemoryContextMemAllocated() to be a part of
MemoryContextStats(), but allow it to avoid walking through the blocks
and freelists.

* Improve the choice of the initial number of buckets in the hash
table. For this patch, I tried to preserve the existing behavior of
estimating the number of groups and trying to initialize with that many
buckets. But my performance tests seem to indicate this is not the best
approach. More work is needed to find what we should really do here.

* For workloads that do not fit in work_mem *or* system memory, and need
to actually go to storage, I see poor CPU utilization because it's not
effectively overlapping CPU and IO work. Perhaps buffering or readahead
changes can improve this, or async IO (even better).

* Project unnecessary attributes away before spilling tuples to disk.

* Improve logtape.c API so that the caller doesn't need to manage a
bunch of tape numbers.

* Improve estimate of the hash entry size. This patch doesn't change
the way the planner estimates it, but I observe that actual size as
seen at runtime is significantly different. This is connected to the
initial number of buckets for the hash table.

* In recursive steps, I don't have a good estimate for the number of
groups, so I just estimate it as the number of tuples in that spill
tape (which is pessimistic). That could be improved by doing a real
cardinality estimate as the tuples are spilling (perhaps with HLL?).

* Many aggregates with pass-by-ref transition states don't provide a
great aggtransspace. We should consider doing something smarter, like
having negative numbers represent a number that should be multiplied by
the size of the group (e.g. ARRAY_AGG would have a size dependent on
the group size, not a constant).

* If we want to handle ARRAY_AGG (and the like) well, we can consider
spilling the partial states in the hash table when memory is full.
That would add a fair amount of complexity because there would be two
types of spilled data (tuples and partial states), but it could be
useful in some cases.

Regards,
        Jeff Davis



