Обсуждение: [PATCH] Refactor *_abbrev_convert() functions

Поиск
Список
Период
Сортировка

[PATCH] Refactor *_abbrev_convert() functions

От
Aleksander Alekseev
Дата:
Hi,

Now when all Datums are 64-bit values we can simplify the code by
using murmurhash64(). This refactoring was previously suggested by
John Naylor [1].

[1]: https://postgr.es/m/CANWCAZbMyrijdR0xc-4SqpNJBHMEwRZccBK4fa0aquNpq2Uj7w%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: [PATCH] Refactor *_abbrev_convert() functions

От
Aditya Gollamudi
Дата:
On Tue, Jan 13, 2026 at 4:34 AM Aleksander Alekseev <aleksander@tigerdata.com> wrote:
Hi,

Now when all Datums are 64-bit values we can simplify the code by
using murmurhash64(). This refactoring was previously suggested by
John Naylor [1].

[1]: https://postgr.es/m/CANWCAZbMyrijdR0xc-4SqpNJBHMEwRZccBK4fa0aquNpq2Uj7w%40mail.gmail.com

--
Best regards,
Aleksander Alekseev

Hi, 

I reviewed this change and the surrounding code and this seems good!
All tests pass locally for me. Hopefully this gets picked up soon.

Regards, 
Adi Gollamudi

Re: [PATCH] Refactor *_abbrev_convert() functions

От
Michael Paquier
Дата:
On Sat, Jan 24, 2026 at 02:27:04PM -0800, Aditya Gollamudi wrote:
> I reviewed this change and the surrounding code and this seems good!
> All tests pass locally for me. Hopefully this gets picked up soon.

Yep, that looks rather right.  Will double-check all that later.
--
Michael

Вложения

Re: [PATCH] Refactor *_abbrev_convert() functions

От
John Naylor
Дата:
On Tue, Jan 13, 2026 at 7:34 PM Aleksander Alekseev
<aleksander@tigerdata.com> wrote:
> Now when all Datums are 64-bit values we can simplify the code by
> using murmurhash64(). This refactoring was previously suggested by
> John Naylor [1].

There's more we can do here. Above the stanzas changed in the patch
there is this, at least for varlena/bytea:

hash = DatumGetUInt32(hash_any((unsigned char *) authoritative_data,
                      Min(len, PG_CACHE_LINE_SIZE)));

This makes no sense to me: hash_any() calls hash_bytes() and turns the
result into a Datum, and then we just get it right back out of the
Datum again. addHyperLogLog says "typically generated using
hash_any()", but that function takes a uint32, not a Datum, so that
comment should probably be changed. hash_bytes() is global, so we can
use it directly.

if (len > PG_CACHE_LINE_SIZE)
  hash ^= DatumGetUInt32(hash_uint32((uint32) len));

Similar here, but instead of hash_bytes_uint32(), we may as well use
mumurhash32().

--
John Naylor
Amazon Web Services



Re: [PATCH] Refactor *_abbrev_convert() functions

От
Aleksander Alekseev
Дата:
Hi John,

Many thanks for your feedback!

> There's more we can do here. Above the stanzas changed in the patch
> there is this, at least for varlena/bytea:
>
> hash = DatumGetUInt32(hash_any((unsigned char *) authoritative_data,
>                       Min(len, PG_CACHE_LINE_SIZE)));
>
> This makes no sense to me: hash_any() calls hash_bytes() and turns the
> result into a Datum, and then we just get it right back out of the
> Datum again.

I see similar patterns in files other than bytea.c and varlena.c.
Implemented as a separate patch.

> if (len > PG_CACHE_LINE_SIZE)
>   hash ^= DatumGetUInt32(hash_uint32((uint32) len));
>
> Similar here, but instead of hash_bytes_uint32(), we may as well use
> mumurhash32().

Ditto.

It's worth noting that timetz_hash() uses a similar pattern but I
choose to keep it as is. Changing it will break backward
compatibility. Also it breaks our tests.

Using hash_bytes_uint32() / hash_bytes_uint32_extended() directly in
timetz_hash() / timetz_hash_extended() is safe though. Proposed as a
separate patch.

> addHyperLogLog says "typically generated using
> hash_any()", but that function takes a uint32, not a Datum, so that
> comment should probably be changed. hash_bytes() is global, so we can
> use it directly.

Makes sense. Implemented as an independent patch.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: [PATCH] Refactor *_abbrev_convert() functions

От
John Naylor
Дата:
On Tue, Feb 3, 2026 at 9:56 PM Aleksander Alekseev
<aleksander@tigerdata.com> wrote:
> > There's more we can do here. Above the stanzas changed in the patch
> > there is this, at least for varlena/bytea:
> >
> > hash = DatumGetUInt32(hash_any((unsigned char *) authoritative_data,
> >                       Min(len, PG_CACHE_LINE_SIZE)));
> >
> > This makes no sense to me: hash_any() calls hash_bytes() and turns the
> > result into a Datum, and then we just get it right back out of the
> > Datum again.
>
> I see similar patterns in files other than bytea.c and varlena.c.
> Implemented as a separate patch.

I think it makes sense to squash 0001 and 0003 together, then 0002 and
0004 together.

For the first, we should probably combine in the upper half when using
a 64-bit hash, like this:

     /* Hash abbreviated key */
     {
-        uint32          tmp;
+        uint64          tmp;

-        tmp = DatumGetUInt32(res) ^ (uint32) (DatumGetUInt64(res) >> 32);
-        hash = DatumGetUInt32(hash_uint32(tmp));
+        tmp = murmurhash64(DatumGetUInt64(res));
+        hash = (uint32) tmp ^ tmp >> 32;
     }

> Using hash_bytes_uint32() / hash_bytes_uint32_extended() directly in
> timetz_hash() / timetz_hash_extended() is safe though. Proposed as a
> separate patch.

0005 doesn't buy us as much in readability since the two lines no longer match.

Further cleanup possible now that we have 64-bit datums: MAC addresses
are always 6 bytes, so abbreviation is no longer relevant -- datum1 is
authoritative. That's in scope for the thread subject but also a
bigger patch, but maybe someone would like to pick it up for PG20.

--
John Naylor
Amazon Web Services



Re: [PATCH] Refactor *_abbrev_convert() functions

От
Aleksander Alekseev
Дата:
Hi John,

Thanks again.

> I think it makes sense to squash 0001 and 0003 together, then 0002 and
> 0004 together.
> [...]
> 0005 doesn't buy us as much in readability since the two lines no longer match.

Makes sense.

> For the first, we should probably combine in the upper half when using
> a 64-bit hash, like this:

We could do it if you insist but I'm convinced this is redundant. In a
good hash upper 32 bits are as evenly distributed as lower ones so
this combining doesn't buy us much. This may even cause more
collisions, for values that didn't have them initially.

> Further cleanup possible now that we have 64-bit datums: MAC addresses
> are always 6 bytes, so abbreviation is no longer relevant -- datum1 is
> authoritative. That's in scope for the thread subject but also a
> bigger patch, but maybe someone would like to pick it up for PG20.

I will pick it up and submit as a separate patch a bit later.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: [PATCH] Refactor *_abbrev_convert() functions

От
Aleksander Alekseev
Дата:
Hi,

> Thanks again.
>
> > I think it makes sense to squash 0001 and 0003 together, then 0002 and
> > 0004 together.
> > [...]
> > 0005 doesn't buy us as much in readability since the two lines no longer match.
>
> Makes sense.
>
> > For the first, we should probably combine in the upper half when using
> > a 64-bit hash, like this:
>
> We could do it if you insist but I'm convinced this is redundant. In a
> good hash upper 32 bits are as evenly distributed as lower ones so
> this combining doesn't buy us much. This may even cause more
> collisions, for values that didn't have them initially.
>
> > Further cleanup possible now that we have 64-bit datums: MAC addresses
> > are always 6 bytes, so abbreviation is no longer relevant -- datum1 is
> > authoritative. That's in scope for the thread subject but also a
> > bigger patch, but maybe someone would like to pick it up for PG20.
>
> I will pick it up and submit as a separate patch a bit later.

0002 had a wrong commit message due to a mistake during squashing.
Here is a corrected patch.

-- 
Best regards,
Aleksander Alekseev

Вложения