Re: Do we want a hashset type?

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: Do we want a hashset type?
Дата
Msg-id f4ad4ec9-160f-48a6-851d-b2b34db49ebc@app.fastmail.com
обсуждение исходный текст
Ответ на Re: Do we want a hashset type?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Список pgsql-hackers
On Wed, May 31, 2023, at 18:59, Tomas Vondra wrote:
> How does storing just the IDs solves anything? Isn't the main challenge
> the random I/O when fetching the adjacent nodes? This does not really
> improve that, no?

I'm thinking of a recursive query where a lot of time is just spent following
all friends-of-friends, where the graph traversal is the heavy part,
where the final set of nodes are only fetched at the end.

> It's entirely possible I'm missing some important aspect. It'd be very
> useful to have a couple example queries illustrating the issue, that we
> could use to actually test different approaches.

Here is an example using a real anonymised social network.

wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
gunzip soc-pokec-relationships.txt.gz
CREATE TABLE edges (from_node INT, to_node INT);
\copy edges from soc-pokec-relationships.txt;
ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);
SET work_mem TO '1GB';
CREATE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
    SELECT 
        ARRAY[5867::bigint] AS current, 
        ARRAY[5867::bigint] AS found,
        0 AS depth
    UNION ALL  
    SELECT 
        new_current, 
        found || new_current, 
        friends_of_friends.depth + 1
    FROM 
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            array_agg(DISTINCT edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(friends_of_friends.current)
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    coalesce(array_length(current, 1), 0)
FROM
    friends_of_friends
WHERE
    depth = 3;
;

SELECT COUNT(*) FROM edges;
  count
----------
 30622564
(1 row)

SELECT COUNT(DISTINCT from_node) FROM edges;
  count
---------
 1432693
(1 row)

-- Most connected user (worst-case) is user 5867 with 8763 friends:
SELECT from_node, COUNT(*) FROM edges GROUP BY from_node ORDER BY COUNT(*) DESC LIMIT 1;
 from_node | count
-----------+-------
      5867 |  8763
(1 row)

-- Friend-of-friends query exactly at depth three:

EXPLAIN ANALYZE
SELECT * FROM friends_of_friends;

                                                                              QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=6017516.90..6017517.60 rows=1 width=8) (actual time=2585.881..2586.334 rows=1
loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..6017516.90 rows=31 width=68) (actual time=0.005..2581.664 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=200583.71..601751.66 rows=3 width=68) (actual time=645.036..645.157
rows=1loops=4)
 
                 ->  Nested Loop  (cost=200583.71..601751.54 rows=3 width=68) (actual time=641.880..641.972 rows=1
loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=68)
(actualtime=0.001..0.002 rows=1 loops=4)
 
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=200583.71..200583.72 rows=1 width=32) (actual time=850.997..850.998 rows=1
loops=3)
                             ->  Bitmap Heap Scan on edges  (cost=27656.38..196840.88 rows=1497133 width=4) (actual
time=203.239..423.534rows=3486910 loops=3)
 
                                   Recheck Cond: (from_node = ANY (friends_of_friends_1.current))
                                   Heap Blocks: exact=117876
                                   ->  Bitmap Index Scan on edges_pkey  (cost=0.00..27282.10 rows=1497133 width=0)
(actualtime=198.047..198.047 rows=3486910 loops=3)
 
                                         Index Cond: (from_node = ANY (friends_of_friends_1.current))
 Planning Time: 1.414 ms
 Execution Time: 2588.288 ms
(19 rows)

I tested on PostgreSQL 15.2. For some reason I got a different slower query on HEAD:

SELECT * FROM friends_of_friends;
                                                                               QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=6576.67..6577.37 rows=1 width=8) (actual time=6412.693..6413.335 rows=1
loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..6576.67 rows=31 width=68) (actual time=0.008..6407.134 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual time=0.005..0.005 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=219.05..657.64 rows=3 width=68) (actual time=1600.747..1600.934
rows=1loops=4)
 
                 ->  Nested Loop  (cost=219.05..657.52 rows=3 width=68) (actual time=1594.906..1595.035 rows=1
loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=68)
(actualtime=0.001..0.002 rows=1 loops=4)
 
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=219.05..219.06 rows=1 width=32) (actual time=2118.105..2118.105 rows=1
loops=3)
                             ->  Sort  (cost=207.94..213.49 rows=2221 width=4) (actual time=1780.770..1925.853
rows=3486910loops=3)
 
                                   Sort Key: edges.to_node
                                   Sort Method: quicksort  Memory: 393217kB
                                   ->  Index Only Scan using edges_pkey on edges  (cost=0.56..84.48 rows=2221 width=4)
(actualtime=0.077..762.408 rows=3486910 loops=3)
 
                                         Index Cond: (from_node = ANY (friends_of_friends_1.current))
                                         Heap Fetches: 0
 Planning Time: 8.229 ms
 Execution Time: 6446.421 ms
(20 rows)



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Richard Guo
Дата:
Сообщение: Re: An inefficient query caused by unnecessary PlaceHolderVar
Следующее
От: "Joel Jacobson"
Дата:
Сообщение: Re: Do we want a hashset type?