Re: Change GUC hashtable to use simplehash?

From Andres Freund
Subject Re: Change GUC hashtable to use simplehash?
Date
Msg-id 20231122223432.lywt4yz2bn7tlp27@awork3.anarazel.de
In reply to Re: Change GUC hashtable to use simplehash?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Change GUC hashtable to use simplehash?  (John Naylor <johncnaylorls@gmail.com>)
List pgsql-hackers
Hi,

On 2023-11-22 16:27:56 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-11-22 15:56:21 -0500, Tom Lane wrote:
> >> GUC names are just about always short, though, so I'm not sure you've
> >> made your point?
>
> > With short I meant <= 6 characters (32 / 5 = 6.x). After that you're
> > overwriting bits that you previously set, without dispersing the "overwritten"
> > bits throughout the hash state.
>
> I'm less than convinced about the "overwrite" part:
>
> +        /* Merge into hash ... not very bright, but it needn't be */
> +        result = pg_rotate_left32(result, 5);
> +        result ^= (uint32) ch;
>
> Rotating a 32-bit value 5 bits at a time doesn't result in successive
> characters lining up exactly, and even once they do, XOR is not
> "overwrite".

I didn't know what word to use, hence the air quotes. Yes, xor doesn't simply
set the bits to the right-hand side's values, but it only affects data on a
per-bit basis, which easily can be cancelled out.


My understanding of writing hash functions is that every input bit mixed in
should have a ~50% chance of flipping each bit of the hash state (the
avalanche property). The proposed function obviously doesn't get there.

It's worth noting that the limited range of the input values means that
there's a lot of bias toward some bits being set ('a' to 'z' all start with
0b011).


> I'm pretty dubious that we need something better than this.

Well, we know that the current attempt at a dedicated hash function for this
does result in a substantial number of collisions. And such cases are hard to
understand when you hit them, so I think it's better to avoid exposing
ourselves to such dangers without a distinct need.

And I don't really see the need here to risk it, even if we are somewhat
confident it's fine.

If, which I mildly doubt, we can't afford to call murmurhash32 for every
character, we could just call it once for every 32/5 input characters
together.  Or we could load up to 8 characters into a 64-bit integer and call
murmurhash64.

Something roughly like

uint64 result = 0;

while (*name)
{
    uint64 value = 0;

    for (int i = 0; i < 8 && *name; i++)
    {
        unsigned char ch = (unsigned char) *name++;

        /* shift prior bytes up, then add the new byte in the low bits */
        value = (value << 8) | ch;
    }

    result = hash_combine64(result, murmurhash64(value));
}

The hash_combine use isn't quite right either; we should use the full
accumulator state of a proper hash function, but it seems very unlikely to
matter here.


The fact that string_hash() is slow due to the strlen(), which causes us to
process the input twice, and that it is optimized to also handle very long
strings which string_hash() typically doesn't encounter, seems problematic far
beyond this case. We use string_hash() in a *lot* of places, and that strlen()
does regularly show up in profiles. We should fix that.

The various hash functions being external functions shows up in a bunch of
profiles too.  It's particularly ridiculous for cases like tag_hash(), where
the caller typically knows the length, but calls a routine in a different
translation unit, which obviously can't be optimized for a specific length.


I think we ought to adjust our APIs around this:

1) The accumulator state of the hash functions should be exposed, so one can
   accumulate values into the hash state, without reducing the internal state
   to a single 32/64 bit variable.

2) For callers that know the length of data, we should use a static inline
   hash function, rather than an external function call. This should include
   special cased inline functions for adding 32/64bit of data to the hash
   state.

   Perhaps with a bit of logic to *not* use the inline version if the hashed
   input is long (and thus the call overhead doesn't matter). Something like
   if (__builtin_constant_p(len) && len < 128)
       /* call inline implementation */
   else
       /* call out of line implementation, not worth the code bloat */


We know that hash functions should be split into init/process-data/finish
steps, as e.g. evidenced by pg_crc.h/pg_crc32.h.


With something like that, you could write a function that lowercases
characters inline without incurring unnecessary overhead.

    hash32_state hs;

    hash32_init(&hs);

    while (*name)
    {
        char            ch = *name++;

        /* crappy lowercase for this situation */
        ch |= 0x20;

        hash32_process_byte(&hs, ch);
    }

    return hash32_finish(&hs);

Perhaps with some additional optimization for processing the input string in
32/64 bit quantities.


Greetings,

Andres Freund


