Re: 9.5: Memory-bounded HashAgg
From: Jeff Davis
Subject: Re: 9.5: Memory-bounded HashAgg
Date:
Msg-id: 1418294807.5584.32.camel@jeff-desktop
In reply to: 9.5: Memory-bounded HashAgg (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: 9.5: Memory-bounded HashAgg
           Re: 9.5: Memory-bounded HashAgg
List: pgsql-hackers
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something similar,
> to track memory usage.
>
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will confine itself to work_mem even if the planner badly
> misestimates the cardinality.

New patch attached. All open items are complete, though the patch may
have a few rough edges.

Summary of changes:

* rebased on top of the latest memory accounting patch:
  http://www.postgresql.org/message-id/1417497257.5584.5.camel@jeff-desktop
* added a flag to hash_create to prevent it from creating an extra level
  of memory context - without this, the memory accounting would have a
  measurable impact on performance
* added a cost model for the disk usage
* intelligently choose the number of partitions for each pass of the
  data (see the sketch after this message)
* added EXPLAIN support
* in build_hash_table(), be more intelligent about the value of nbuckets
  to pass to BuildTupleHashTable() - BuildTupleHashTable tries to choose
  a value that keeps the table within work_mem, but it isn't very accurate
* some very rudimentary testing (sanity checks, really) shows good results

Summary of previous discussion (my summary; I may have missed some points):

Tom Lane requested that the patch also handle the case where transition
values themselves grow beyond work_mem (e.g. array_agg). I feel this
patch provides a lot of benefit as it is, and handling that case would be
a lot more work: at a minimum we would need a way to write transition
values out to disk, and perhaps to combine them with other transition
values. I also don't think my patch would interfere with a future fix
there.

Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, split the current data into two partitions, keep one in the hash
table, and spill the other (see ExecHashIncreaseNumBatches()). This has
the advantage that it's very fast to identify whether a tuple belongs to
the in-memory batch; if it doesn't, we can avoid even probing the
in-memory hash table.

The batch-splitting approach has a major downside, however: it is likely
to evict a skew value from the in-memory batch, which sends every
subsequent tuple with that skew value to disk. My approach never evicts
from the in-memory table until we actually finalize the groups, so skew
values are likely to be completely processed in the first pass.

So, the attached patch implements my original approach (sketched below),
which I still feel is the best solution.

Regards,
	Jeff Davis
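[Editor's note] To make the "fill, then spill new groups" strategy
concrete, here is a minimal, self-contained C sketch. It is not the
patch's code: Group, NBUCKETS, MEM_BUDGET, process_tuple(), and the
partition files are all hypothetical stand-ins for the executor's tuple
hash table, work_mem, and the memory accounting patch's tracking.

    #include <stdio.h>
    #include <stdlib.h>

    #define NBUCKETS    1024
    #define NPARTITIONS 4            /* the patch picks this per pass */
    #define MEM_BUDGET  (64 * 1024)  /* stand-in for work_mem */

    typedef struct Group
    {
        long key;
        long count;                  /* trivial COUNT(*) transition value */
        struct Group *next;
    } Group;

    static Group  *buckets[NBUCKETS];
    static size_t  mem_used = 0;     /* the patch uses memory accounting */

    static unsigned hash_key(long key)
    {
        return (unsigned) (key * 2654435761u);
    }

    static Group *lookup(long key)
    {
        for (Group *g = buckets[hash_key(key) % NBUCKETS]; g; g = g->next)
            if (g->key == key)
                return g;
        return NULL;
    }

    /*
     * Aggregate one input tuple.  Existing groups always advance in
     * memory; a new group is created only while under budget, otherwise
     * the tuple is routed to a partition file for a later pass.
     * Returns 1 if handled in memory, 0 if spilled.
     */
    static int process_tuple(long key, FILE *partitions[NPARTITIONS])
    {
        Group *g = lookup(key);

        if (g == NULL)
        {
            if (mem_used + sizeof(Group) > MEM_BUDGET)
            {
                fwrite(&key, sizeof(key), 1,
                       partitions[hash_key(key) % NPARTITIONS]);
                return 0;
            }
            unsigned b = hash_key(key) % NBUCKETS;

            g = calloc(1, sizeof(Group));
            g->key = key;
            g->next = buckets[b];
            buckets[b] = g;
            mem_used += sizeof(Group);
        }
        g->count++;                  /* advance the transition value */
        return 1;
    }

Once the input is exhausted, the in-memory groups are finalized, the
table is reset, and the same routine runs over each partition file in
turn. Because nothing is ever evicted, a frequently repeated ("skew")
key that arrives early is aggregated entirely in the first pass, which
is the property the message argues for.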
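[Editor's note] The partition-count choice from the change list could
look roughly like this. It is a hypothetical illustration of the
reasoning only, not the patch's actual heuristic (which also feeds the
disk cost model): aim for each partition to fit in work_mem on its next
pass, rounded to a power of two so a partition can be selected from a
few bits of the hash value.

    #include <stdint.h>

    /* choose_num_partitions() and both parameters are illustrative names */
    static int
    choose_num_partitions(uint64_t est_spill_bytes, uint64_t work_mem_bytes)
    {
        uint64_t wanted = est_spill_bytes / work_mem_bytes + 1;
        int      npartitions = 1;

        while ((uint64_t) npartitions < wanted && npartitions < 256)
            npartitions *= 2;   /* power of two: cheap hash-bit routing */

        return npartitions;
    }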
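[Editor's note] For contrast, the HashJoin-style design Tomas Vondra
suggested routes tuples by hash bits rather than by table lookups. The
sketch below is a loose simplification of that scheme (the real logic
lives in ExecHashIncreaseNumBatches(); names here are illustrative).
It shows both the cheap routing test and why a skew value can be
unlucky.

    /* nbatch doubles each time memory fills; only batch 0 stays in memory */
    static unsigned nbatch = 1;

    static int in_memory_batch(unsigned hashvalue)
    {
        /* cheap test: spilled tuples need no hash table probe at all */
        return hashvalue % nbatch == 0;
    }

    /*
     * The downside described above: if a skew key ends up with
     * hashvalue % nbatch != 0 after a split, every one of its remaining
     * tuples goes to disk, even though its group was already being
     * aggregated in memory before the split.
     */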
Attachments