Re: Memory-Bounded Hash Aggregation
From: Tomas Vondra
Subject: Re: Memory-Bounded Hash Aggregation
Date:
Msg-id: 20190802152123.ichmacgx3y3x7dc3@development
In reply to: Re: Memory-Bounded Hash Aggregation (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Memory-Bounded Hash Aggregation
List: pgsql-hackers
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:
>On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
>> I'm late to the party.
>
>You are welcome to join any time!
>
>> These two approaches both spill the input tuples, what if the skewed
>> groups are not encountered before the hash table fills up? The spill
>> files' size and disk I/O could be downsides.
>
>Let's say the worst case is that we encounter 10 million groups of size
>one first; just enough to fill up memory. Then, we encounter a single
>additional group of size 20 million, and need to write out all of those
>20 million raw tuples. That's still not worse than Sort+GroupAgg, which
>would need to write out all 30 million raw tuples (in practice Sort is
>pretty fast so may still win in some cases, but not by any huge amount).
>
>> Greenplum spills all the groups by writing the partial aggregate
>> states, reset the memory context, process incoming tuples and build
>> an in-memory hash table, then reload and combine the spilled partial
>> states at last, how does this sound?
>
>That can be done as an add-on to approach #1 by evicting the entire
>hash table (writing out the partial states), then resetting the memory
>context.
>
>It does add to the complexity though, and would only work for the
>aggregates that support serializing and combining partial states. It
>also might be a net loss to do the extra work of initializing and
>evicting a partial state if we don't have large enough groups to
>benefit.
>
>Given that the worst case isn't worse than Sort+GroupAgg, I think it
>should be left as a future optimization. That would give us time to
>tune the process to work well in a variety of cases.

+1 to leaving that as a future optimization.

I think it's clear there's no perfect eviction strategy - for every
algorithm we've come up with, we can construct a data set on which it
performs terribly (I'm sure we could do that for the approach used by
Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try
improving that in the future with a switch to a different eviction
strategy based on some heuristics.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
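For a concrete picture of the approach discussed above (evict the whole hash
table as partial aggregate states, reset the memory context, keep consuming
input, and combine the spilled partials at the end), here is a minimal,
self-contained C sketch. It is not PostgreSQL executor code: the fixed-size
table, the bounded key range, and the SUM/COUNT pair standing in for a
serializable partial state are all illustrative assumptions.

/*
 * Toy sketch only -- not PostgreSQL executor code.  When the in-memory
 * hash table fills up, write out *all* partial aggregate states, reset,
 * keep consuming input, and combine the spilled partials at the end.
 */
#include <stdio.h>
#include <string.h>

#define NBUCKETS   8        /* pretend this is all the memory we have */
#define MAXKEY     100      /* demo keys are 0..MAXKEY-1 */

typedef struct PartialState
{
    int     key;            /* group key */
    double  sum;            /* partial state: running sum ...          */
    long    count;          /* ... and row count (enough for SUM/AVG)  */
} PartialState;

static PartialState htab[NBUCKETS];
static FILE *spill;         /* temp file holding evicted partial states */

/* "Serialize" every partial state to the spill file and reset the table. */
static void
evict_all(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        if (htab[i].count > 0)
            fwrite(&htab[i], sizeof(PartialState), 1, spill);
    memset(htab, 0, sizeof(htab));
}

/* Advance the aggregate for one input row. */
static void
advance(int key, double value)
{
    for (int i = 0; i < NBUCKETS; i++)
    {
        int slot = (key + i) % NBUCKETS;    /* linear probing */

        if (htab[slot].count == 0 || htab[slot].key == key)
        {
            htab[slot].key = key;
            htab[slot].sum += value;
            htab[slot].count++;
            return;
        }
    }
    /* Table is full and the key is not in it: spill everything, retry. */
    evict_all();
    advance(key, value);
}

int
main(void)
{
    PartialState    combined[MAXKEY] = {0};
    PartialState    p;

    spill = tmpfile();

    /* Skewed input: key 42 shows up far more often than the others. */
    for (int i = 0; i < 10000; i++)
    {
        advance(i % MAXKEY, 1.0);
        advance(42, 1.0);
    }

    /* Flush what is still in memory, then combine all spilled partials. */
    evict_all();
    rewind(spill);
    while (fread(&p, sizeof(PartialState), 1, spill) == 1)
    {
        combined[p.key].sum += p.sum;       /* the "combine" step */
        combined[p.key].count += p.count;
    }

    printf("key 42: sum=%.0f count=%ld\n",
           combined[42].sum, combined[42].count);
    return 0;
}

The final loop is the analogue of an aggregate's combine step; as Jeff notes
above, aggregates that cannot serialize and combine partial states could not
use this path at all.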