Re: Memory-Bounded Hash Aggregation

From Melanie Plageman
Subject Re: Memory-Bounded Hash Aggregation
Date
Msg-id CAAKRu_YgA1rEND4mb_3e2980vRY_Q5Y23o9FfMB1MrKGLgvmiw@mail.gmail.com
In reply to Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers

On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>> On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>>> Right now the patch always initializes 32 spill partitions. Have you
>>> given any thought into how to intelligently pick an optimal number of
>>> partitions yet?
>>
>> Attached a new patch that addresses this.
>>
>> 1. Divide hash table memory used by the number of groups in the hash
>> table to get the average memory used per group.
>> 2. Multiply by the number of groups spilled -- which I pessimistically
>> estimate as the number of tuples spilled -- to get the total amount of
>> memory that we'd like to have to process all spilled tuples at once.

> Isn't the "number of tuples = number of groups" estimate likely to be
> way too pessimistic? IIUC the consequence is that it pushes us to pick
> more partitions than necessary, correct?
>
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the estimate
> of number of groups? I mean, if we know we've consumed 1000 tuples which
> created 100 groups, then we know there's ~1:10 ratio.
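
A sketch of that refinement (names made up; whether the patch already
tracks these counters is an open question):

    /*
     * Sketch only: scale the spilled tuple count by the observed
     * tuples-per-group ratio of the in-memory hash table.  E.g. 1000
     * tuples consumed producing 100 groups is a ratio of 10, so 5000
     * spilled tuples would be estimated as ~500 spilled groups.
     */
    static double
    estimate_spilled_groups(double ntuples_consumed, double ngroups_in_memory,
                            double ntuples_spilled)
    {
        double  tuples_per_group = ntuples_consumed /
            (ngroups_in_memory > 0 ? ngroups_in_memory : 1);

        return ntuples_spilled /
            (tuples_per_group > 1.0 ? tuples_per_group : 1.0);
    }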

What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.

Also, number of groups = number of tuples is only for re-spilling.
This is a little bit unclear from the variable naming.

It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by the planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups, which is the number of tuples in the
spill file). So, input_tuples actually holds a group count and
input_groups actually holds a tuple count. It may be helpful to rename
these.
 

> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that was
> generally the case we'd just use small hash tables all the time. It's a
> bit annoying to give user the capability to set work_mem and then kinda
> override that.
>
>   * ... Another benefit of having more, smaller partitions is that small
>   * hash tables may perform better than large ones due to memory caching
>   * effects.


So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
The comment does seem to point to some other reason, though...
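
If that conservatism is the intent, one guess at how such a factor could
be applied (illustrative only; the constant's real value and exact use are
whatever the patch does):

    #include <math.h>

    #define HASH_PARTITION_FACTOR 1.5    /* made-up value for illustration */

    /*
     * Sketch only: target less than a full work_mem per re-spill partition
     * so that a reloaded partition's hash table still fits in memory even
     * if the per-group size estimate was somewhat optimistic.
     */
    static int
    choose_respill_partitions(double mem_wanted, double work_mem_bytes)
    {
        double  target_per_partition = work_mem_bytes / HASH_PARTITION_FACTOR;
        int     npartitions = (int) ceil(mem_wanted / target_per_partition);

        return npartitions > 0 ? npartitions : 1;
    }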
 

> 11) The hash_spill_npartitions naming seems a bit confusing, because it
> seems to imply it's about the "spill" while in practice it just chooses
> the number of spill partitions. Maybe hash_choose_num_spill_partitions
> would be better?


Agreed that a name with "choose" or "calculate" as the verb would be
more clear.
 

> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
>      if (npartitions > HASH_MAX_PARTITIONS)
>          npartitions = HASH_MAX_PARTITIONS;


256 actually seems very large. hash_spill_npartitions() will be called
for every re-spill, so HASH_MAX_PARTITIONS is not a cap on the total
number of spill files permitted; rather, it caps the number of partitions
created by a single spill (a spill set). So if you made X partitions
initially and every partition re-spills, you would end up with (at most)
X * 256 partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata for the spill
files take up a lot of memory at that point?
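
For example (assuming each open spill partition is backed by something like
a BufFile, which keeps a BLCKSZ-sized buffer, 8kB by default), 256
partitions in a single spill set is already roughly 2MB of buffer space on
top of work_mem, and that multiplies further if many of those partitions
re-spill in turn.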

Melanie & Adam Lee
