Re: Do we want a hashset type?

From: Tomas Vondra
Subject: Re: Do we want a hashset type?
Date:
Msg-id: 44ac4916-5a7b-fd02-e967-4c733ae000fe@enterprisedb.com
In reply to: Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Responses: Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
List: pgsql-hackers
On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>>   https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return elements as an array
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>>     WITH RECURSIVE friends_of_friends AS (
> ...
> 
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
> 
> I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.
> 
> The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's rb_or_agg().
> 
> With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the need to scan the edges table for graph traversal:
> 

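An aggregate along the lines of the requested rb_or_agg() analogue could
presumably be built on top of hashset_merge; a minimal sketch, assuming
hashset_merge has the (hashset, hashset) -> hashset signature an SFUNC
needs (the definition actually used in the extension may differ):

    CREATE AGGREGATE hashset(hashset) (
        SFUNC = hashset_merge,  -- merge each input set into the running state
        STYPE = hashset         -- the aggregate state is itself a hashset
    );
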
I added a trivial version of such an aggregate, hashset(hashset), and if I
rewrite the CTE like this:

    WITH RECURSIVE friends_of_friends AS (
        SELECT
            (select hashset(v) from (values (5867)) as s(v)) AS current,
            0 AS depth
        UNION ALL
        SELECT
            new_current,
            friends_of_friends.depth + 1
        FROM
            friends_of_friends
        CROSS JOIN LATERAL (
            SELECT
                hashset(edges.to_node) AS new_current
            FROM
                edges
            WHERE
                from_node = ANY(hashset_to_array(friends_of_friends.current))
        ) q
        WHERE
            friends_of_friends.depth < 3
    )
    SELECT
        depth,
        hashset_count(current)
    FROM
        friends_of_friends
    WHERE
        depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300 ms
on your system. There are plenty of opportunities for further improvement:
the hash table implementation is pretty naive, the on-disk format is
wasteful, and so on.
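
Joel's suggestion of an adjacent-nodes hashset column on the nodes table
(quoted above) might look roughly like the sketch below; the nodes.id
primary key and the adjacent column name are assumptions, as only nodes,
edges, from_node and to_node are named in the thread:

    -- store each node's neighbours as a precomputed hashset
    ALTER TABLE nodes ADD COLUMN adjacent hashset;

    -- populate it once from the edges table with the hashset(int) aggregate
    UPDATE nodes n
       SET adjacent = e.adj
      FROM (SELECT from_node, hashset(to_node) AS adj
              FROM edges
             GROUP BY from_node) e
     WHERE n.id = e.from_node;

The LATERAL subquery in the CTE above could then merge the precomputed
sets with the hashset(hashset) aggregate instead of scanning edges:

    SELECT hashset(n.adjacent) AS new_current
      FROM nodes n
     WHERE n.id = ANY(hashset_to_array(friends_of_friends.current))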

But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


