Re: Do we want a hashset type?

Поиск
Список
Период
Сортировка
От jian he
Тема Re: Do we want a hashset type?
Дата
Msg-id CACJufxF_Eo-=1_ffOGaQQzubKEGDgko1tD55VQWo-9=-xAntEQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Ответы Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Список pgsql-hackers
On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> > I realise int4hashset_hash() is broken,
> > since two int4hashset's that are considered equal,
> > can by coincidence get different hashes:
> ...
> > Do we have any ideas on how to fix this without sacrificing performance?
>
> The problem was due to hashset_hash() function accumulating the hashes
> of individual elements in a non-commutative manner. As a consequence, the
> final hash value was sensitive to the order in which elements were inserted
> into the hashset. This behavior led to inconsistencies, as logically
> equivalent sets (i.e., sets with the same elements but in different orders)
> produced different hash values.
>
> Solved by modifying the hashset_hash() function to use a commutative operation
> when combining the hashes of individual elements. This change ensures that the
> final hash value is independent of the element insertion order, and logically
> equivalent sets produce the same hash.
>
> An somewhat unfortunate side-effect of this fix, is that we can no longer
> visually sort the hashset output format, since it's not lexicographically sorted.
> I think this is an acceptable trade-off for a hashset type,
> since the only alternative I see would be to sort the elements,
> but then it wouldn't be a hashset, but a treeset, which different
> Big-O complexity.
>
> New patch is attached, which will henceforth always be a complete patch,
> to avoid the hassle of having to assemble incremental patches.
>
> /Joel


select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
---------------------------------------------------------------------------------
SELECT attname
        ,pc.relname
        ,CASE attstorage
        WHEN 'p' THEN 'plain'
        WHEN 'e' THEN 'external'
        WHEN 'm' THEN 'main'
        WHEN 'x' THEN 'extended'
        END AS storage
FROM    pg_attribute pa
join    pg_class    pc  on pc.oid = pa.attrelid
where   attnum > 0      and pa.attstorage   = 'e';

In my system catalog, it seems only the hashset type storage =
'external'. most is extended.....
I am not sure the consequence of switch from external to extended.
------------------------------------------------------------------------------------------------------------
select hashset_hash('{-1,1}')   as a1
        ,hashset_hash('{1,-2}') as a2
        ,hashset_hash('{-3,1}') as a3
        ,hashset_hash('{4,1}')  as a4;
returns:
     a1      |    a2     |     a3     |     a4
-------------+-----------+------------+------------
 -1735582196 | 998516167 | 1337000903 | 1305426029
(1 row)

values {a1,a2,a3,a4} should be monotone increasing, based on the
function int4hashset_cmp, but now it's not.
so the following queries failed.

--should return only one row.
select hashset_cmp('{2,1}','{3,1}')
union
select hashset_cmp('{3,1}','{4,1}')
union
select hashset_cmp('{1,3}','{4,1}');

select hashset_cmp('{9,10,11}','{10,9,-11}') =
hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true
select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false.
based on int array comparison,.
-----------------------------------------------------------------------------------------
I comment out following lines in hashset-api.c somewhere between {810,829}

// if (a->hash < b->hash)
// PG_RETURN_INT32(-1);
// else if (a->hash > b->hash)
// PG_RETURN_INT32(1);

// if (a->nelements < b->nelements)
// PG_RETURN_INT32(-1);
// else if (a->nelements > b->nelements)
// PG_RETURN_INT32(1);

// Assert(a->nelements == b->nelements);

So hashset_cmp will directly compare int array. the above queries works.

{int4hashset_equals,int4hashset_neq} two special cases of hashset_cmp.
maybe we can just  wrap it just like int4hashset_le?

now store 10 element int4hashset need 99 bytes, similar one dimension
bigint array with length 10, occupy 101 byte....

in int4hashset_send, newly add struct attributes/member {load_factor
growth_factor ncollisions hash} also need send to buf?



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Deleting prepared statements from libpq.
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific