Re: Change GUC hashtable to use simplehash?

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: Change GUC hashtable to use simplehash?
Дата
Msg-id CANWCAZa_6B1+KmTFr-oCOZuoda1FyQ0JCu7cP1F7_QtWYr45Yg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Change GUC hashtable to use simplehash?  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
On Fri, Jan 19, 2024 at 11:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Thanks! I started to look at how to use this, and I have some questions.
> I'd like to replace this murmurhash ussage in resowner.c with this:
>
> >       /*
> >        * Most resource kinds store a pointer in 'value', and pointers are unique
> >        * all on their own.  But some resources store plain integers (Files and
> >        * Buffers as of this writing), so we want to incorporate the 'kind' in
> >        * the hash too, otherwise those resources will collide a lot.  But
> >        * because there are only a few resource kinds like that - and only a few
> >        * resource kinds to begin with - we don't need to work too hard to mix
> >        * 'kind' into the hash.  Just add it with hash_combine(), it perturbs the
> >        * result enough for our purposes.
> >        */
> > #if SIZEOF_DATUM == 8
> >       return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
> > #else
> >       return hash_combine(murmurhash32((uint32) value), (uint32) kind);
> > #endif
>
> The straightforward replacement would be:
>
>      fasthash_state hs;
>
>      fasthash_init(&hs, sizeof(Datum), 0);
>      fasthash_accum(&hs, (char *) &kind, sizeof(ResourceOwnerDesc *));
>      fasthash_accum(&hs, (char *) &value, sizeof(Datum));
>      return fasthash_final32(&hs, 0);

That would give the fullest mixing possible, more than currently.

> But I wonder if it would be OK to abuse the 'seed' and 'tweak'
> parameters to the init and final functions instead, like this:
>
>      fasthash_state hs;
>
>      fasthash_init(&hs, sizeof(Datum), (uint64) kind);
>      return fasthash_final32(&hs, (uint64) value);

This would go in the other direction, and sacrifice some quality for
speed. The fasthash finalizer is pretty short -- XMX, where X is
"right shift and XOR" and M is "multiply". In looking at some other
hash functions, it seems that's often done only if the input has
already had some mixing. The Murmur finalizer has the shape XMXMX, and
that seems to be the preferred way to get good mixing on a single
register-sized value. For that reason, hash functions whose main loop
is designed for long inputs will often skip that for small inputs and
just go straight to a Murmur-style finalizer. Fasthash doesn't do
that, so for a small input it ends up doing XMXM then XMX, which is a
little more expensive.

> I couldn't find any guidance on what properties the 'seed' and 'tweak'
> have, compared to just accumulating the values with accum. Anyone know?

In Postgres, I only know of one use of a seed parameter, to create two
independent hash functions from hash_bytes_uint32_extended(), in
brin-bloom indexes. I think that's the more typical use for a seed.
Since there was no guidance with the existing hash functions, and it's
a widespread concept, I didn't feel the need to put any here. We could
change that.

I modeled the finalizer tweak on one of the finalizers in xxHash that
also used it only for the input length. Length is used as a tiebreaker
where otherwise it will often not collide anyway, so it seems that's
how we should think about using it elsewhere. There is a comment above
fasthash_final64 mentioning that the tweak is used for length when
that is not known ahead of time, but it might be good to generalize
that, and maybe put it somewhere more prominent. With that in mind,
I'm not sure "value" is a good fit for the tweak.  "kind" is sort of
in the middle because IIUC it doesn't mattter at all for pointer
values, but it's important for other kinds, which would commonly
collide.

If I were to change from murmur64, I'd probably go in between the two
extremes mentioned earlier, and mix "value" normally and pass "kind"
as the seed:

  fasthash_state hs;

  fasthash_init(&hs, sizeof(Datum), kind);
  fasthash_accum(&hs, (char *) &value, sizeof(Datum));
  return fasthash_final32(&hs, 0);



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dilip Kumar
Дата:
Сообщение: Re: Synchronizing slots from primary to standby
Следующее
От: John Naylor
Дата:
Сообщение: Re: Change GUC hashtable to use simplehash?