Re: Memory-Bounded Hash Aggregation

Поиск

Список

Период

Сортировка

От	Jeff Davis
Тема	Re: Memory-Bounded Hash Aggregation
Дата	4 июля 2019 г. 05:03:06
Msg-id	e13b52d4d8f9c10f7b58f39270213b868c99b685.camel@j-davis.com обсуждение исходный текст
Ответ на	Re: Memory-Bounded Hash Aggregation (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы	Re: Memory-Bounded Hash Aggregation Re: Memory-Bounded Hash Aggregation
Список	pgsql-hackers

Дерево обсуждения

On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
> What does "partitioned hash strategy" do? It's probably explained in
> one
> of the historical discussions, but I'm not sure which one. I assume
> it
> simply hashes the group keys and uses that to partition the data, and
> then
> passing it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:

> Unfortunately the second link does not work :-(

It's supposed to be:

https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com

> I'm not going to block Approach 1, althought I'd really like to see
> something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

> Aren't all three approaches a way to "fix" hash aggregate? In any
> case,
> it's certainly reasonable to make incremental changes. The question
> is
> whether "approach 1" is sensible step towards some form of "approach
> 3"

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

> > * It means we have a hash table and sort running concurrently, each
> >  using memory. Andres said this might not be a problem[3], but I'm
> >  not convinced that the problem is zero. If you use small work_mem
> >  for the write phase of sorting, you'll end up with a lot of runs
> > to
> >  merge later and that has some kind of cost.
> > 
> 
> Why would we need to do both concurrently? I thought we'd empty the
> hash
> table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

> Do we actually need to handle that case? How many such aggregates are
> there? I think it's OK to just ignore that case (and keep doing what
> we do
> now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

Regards,
    Jeff Davis

В списке pgsql-hackers по дате отправления:

Предыдущее

От: David Rowley
Дата: 04 июля 2019 г., 04:51:06
Сообщение: Re: Cleaning up and speeding up string functions

Следующее

От: Michael Paquier
Дата: 04 июля 2019 г., 05:35:21
Сообщение: Re: Where is SSPI auth username determined for TAP tests?

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Memory-Bounded Hash Aggregation

Предыдущее

Следующее