On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah@cybertec.at> wrote:
> 1. The return type of resize() function is void, so I propose part of the
> header comment to be removed:
>
> diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
> index b46f7c4cfd..b2b8fe60e1 100644
> --- a/src/backend/lib/dshash.c
> +++ b/src/backend/lib/dshash.c
> @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
>
> /*
> * Grow the hash table if necessary to the requested number of buckets. The
> - * requested size must be double some previously observed size. Returns true
> - * if the table was successfully expanded or found to be big enough already
> - * (because another backend expanded it).
> + * requested size must be double some previously observed size.
> *
> * Must be called without any partition lock held.
> */
Thanks, will fix.
> 2. Can anyone please explain this macro?
>
> /* Max entries before we need to grow. Half + quarter = 75% load factor. */
> #define MAX_COUNT_PER_PARTITION(hash_table) \
> (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
> BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
>
> I'm failing to understand why the maximum number of hash table entries in a
> partition should be smaller than the number of buckets in that partition.
>
> The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> obvious from this condition in dshash_find_or_insert()
>
> /* Check if we are getting too full. */
> if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
Are you saying there is a bug in this logic (which is nbuckets * 0.75
expressed as integer maths: half + quarter), or that 0.75 is not a
good maximum load factor? I looked around at a couple of
general-purpose hash tables and saw that some used 0.75 and some used
1.0, as a wasted-space-vs-collision trade-off. If I have my maths
right[1], with a load factor of 0.75 you expect 75 entries to land in
~53 of 100 buckets, but with 1.0 you expect 100 entries in ~64
buckets. It'd be a fair criticism that that's arbitrarily different
from the choice made for hash joins, and dynahash's default is 1
(though there it's a run-time parameter).
[1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins
--
Thomas Munro
http://www.enterprisedb.com