Re: Memory-Bounded Hash Aggregation

Поиск
Список
Период
Сортировка
От Jeff Davis
Тема Re: Memory-Bounded Hash Aggregation
Дата
Msg-id e13b52d4d8f9c10f7b58f39270213b868c99b685.camel@j-davis.com
обсуждение исходный текст
Ответ на Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:
> What does "partitioned hash strategy" do? It's probably explained in
> one
> of the historical discussions, but I'm not sure which one. I assume
> it
> simply hashes the group keys and uses that to partition the data, and
> then
> passing it to hash aggregate.

Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:


> Unfortunately the second link does not work :-(

It's supposed to be:


https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com


> I'm not going to block Approach 1, althought I'd really like to see
> something that helps with array_agg.

I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.

> Aren't all three approaches a way to "fix" hash aggregate? In any
> case,
> it's certainly reasonable to make incremental changes. The question
> is
> whether "approach 1" is sensible step towards some form of "approach
> 3"

Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.

> > * It means we have a hash table and sort running concurrently, each
> >  using memory. Andres said this might not be a problem[3], but I'm
> >  not convinced that the problem is zero. If you use small work_mem
> >  for the write phase of sorting, you'll end up with a lot of runs
> > to
> >  merge later and that has some kind of cost.
> > 
> 
> Why would we need to do both concurrently? I thought we'd empty the
> hash
> table before doing the sort, no?

So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.

> Do we actually need to handle that case? How many such aggregates are
> there? I think it's OK to just ignore that case (and keep doing what
> we do
> now), and require serial/deserial functions for anything better.

Punting on a few cases is fine with me, if the user has a way to fix
it.

Regards,
    Jeff Davis





В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Rowley
Дата:
Сообщение: Re: Cleaning up and speeding up string functions
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Where is SSPI auth username determined for TAP tests?