Discussion: Self-referencing table question
I have a table that looks like:

 Column  |     Type     | Modifiers | Description
---------+--------------+-----------+-------------
 from_id | integer      | not null  |
 to_id   | integer      | not null  |
 val     | numeric(4,3) |           |
Indexes:
    "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
    "correlation_from_id_idx" btree (from_id)
    "correlation_to_id_idx" btree (to_id)
    "correlation_val_idx" btree (val)
Has OIDs: yes

The table describes a pairwise correlation matrix between about 7700
vectors (so the table has n^2 = 60652944 rows, to be exact). I am
trying to choose the top 100 correlated vectors with a seed vector;
this is easily:

select to_id from correlation where from_id=623 order by val desc
limit 100;

Then, I want to take those 100 values and find all from_id,to_id
tuples where val>0.5 (to construct a graph where all "ids" are nodes
and are connected to each other when their correlation is >0.5). I
can do this like:

explain analyze select
from_id,to_id,val
from exprsdb.correlation
where from_id in
(1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and to_id in
(1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and from_id>to_id
and val>0.5;

However, this does not scale well AT ALL. The actual (very messy)
explain analyze output is below. The thing I notice is that the index
on to_id is not used. Also, the primary key index on (from_id, to_id)
is not used, it seems. Finally, with only 30 values, this already
takes 2.6 seconds and I am proposing to do this on 100-200 values.
Any hints on how better to accomplish this set of tasks?

Index Scan using correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx on correlation
(cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967
rows=19 loops=1)
  Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR
(from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR
(from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR
(from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15)
OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id = 19)
OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id = 23)
OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR (from_id = 27)
OR (from_id = 28) OR (from_id = 29) OR (from_id = 30))
  Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4)
OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id
= 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13)
OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR
(to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id
= 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26)
OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND
(from_id > to_id) AND (val > 0.5))
Total runtime: 2604.383 ms
Thanks, Sean
I answer my own question, if only for my own records. The following
query is about 5-6 times faster than the original. Of course, if
anyone else has other ideas, I'd be happy to hear them.
Sean
explain analyze select from_id,to_id,val from exprsdb.correlation where
from_id in (select to_id from exprsdb.correlation where from_id=2424
order by val desc limit 100) and to_id in (select to_id from
exprsdb.correlation where from_id=2424 order by val desc limit 100) and
val>0.6 and to_id<from_id;

                              QUERY PLAN
------------------------------------------------------------------------
Hash IN Join  (cost=4709.94..74758.01 rows=555 width=17) (actual
time=110.291..1671.767 rows=973 loops=1)
  Hash Cond: ("outer".to_id = "inner".to_id)
  ->  Nested Loop  (cost=2354.97..72181.72 rows=43154 width=17)
      (actual time=54.036..1612.746 rows=1482 loops=1)
        ->  HashAggregate  (cost=2354.97..2354.97 rows=100 width=4)
            (actual time=53.656..54.062 rows=100 loops=1)
              ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72
                  rows=100 width=4) (actual time=53.473..53.595
                  rows=100 loops=1)
                    ->  Limit  (cost=2353.47..2353.72 rows=100 width=13)
                        (actual time=53.469..53.507 rows=100 loops=1)
                          ->  Sort  (cost=2353.47..2415.03 rows=24624
                              width=13) (actual time=53.467..53.481
                              rows=100 loops=1)
                                Sort Key: val
                                ->  Index Scan using
                                    correlation_from_id_idx on
                                    correlation  (cost=0.00..557.42
                                    rows=24624 width=13) (actual
                                    time=0.199..17.717 rows=7788 loops=1)
                                      Index Cond: (from_id = 2424)
        ->  Index Scan using correlation_from_id_idx on correlation
            (cost=0.00..692.87 rows=432 width=17) (actual
            time=2.765..15.560 rows=15 loops=100)
              Index Cond: (correlation.from_id = "outer".to_id)
              Filter: ((val > 0.6) AND (to_id < from_id))
  ->  Hash  (cost=2354.72..2354.72 rows=100 width=4) (actual
      time=56.239..56.239 rows=0 loops=1)
        ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72
            rows=100 width=4) (actual time=56.004..56.121 rows=100
            loops=1)
              ->  Limit  (cost=2353.47..2353.72 rows=100 width=13)
                  (actual time=56.001..56.038 rows=100 loops=1)
                    ->  Sort  (cost=2353.47..2415.03 rows=24624
                        width=13) (actual time=55.999..56.012 rows=100
                        loops=1)
                          Sort Key: val
                          ->  Index Scan using correlation_from_id_idx
                              on correlation  (cost=0.00..557.42
                              rows=24624 width=13) (actual
                              time=0.517..20.307 rows=7788 loops=1)
                                Index Cond: (from_id = 2424)
Total runtime: 1676.966 ms
On Mar 22, 2005, at 2:33 PM, Sean Davis wrote:
> I have a table that looks like:
>
> Column | Type | Modifiers | Description
> ---------+--------------+-----------+-------------
> from_id | integer | not null |
> to_id | integer | not null |
> val | numeric(4,3) | |
> Indexes:
> "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
> "correlation_from_id_idx" btree (from_id)
> "correlation_to_id_idx" btree (to_id)
> "correlation_val_idx" btree (val)
> Has OIDs: yes
>
> The table describes a pairwise correlation matrix between about 7700
> vectors (so the table has n^2= 60652944 rows, to be exact). I am
> trying to choose the top 100 correlated vectors with a seed vector;
> this is easily:
>
> select to_id from correlation where from_id=623 order by val desc
> limit 100;
>
> Then, I want to take those 100 values and find all from_id,to_id
> tuples where val>0.5 (to construct a graph where all "ids" are nodes
> and are connected to each other when their correlation is >0.5). I
> can do this like:
>
> explain analyze select
> from_id,to_id,val
> from exprsdb.correlation
> where from_id in
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2
> 7,28,29,30)
> and to_id in
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2
> 7,28,29,30)
> and from_id>to_id
> and val>0.5;
>
> However, this does not scale well AT ALL. The actual (very messy)
> explain analyze output is below. The thing I notice is that the index
> on to_id is not used. Also, the primary key index on (from_id, to_id)
> is not used, it seems. Finally, with only 30 values, this already
> takes 2.6 seconds and I am proposing to do this on 100-200 values.
> Any hints on how better to accomplish this set of tasks?
>
> Index Scan using correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx on correlation
> (cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967
> rows=19 loops=1)
> Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR
> (from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR
> (from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR
> (from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15)
> OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id =
> 19) OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id
> = 23) OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR
> (from_id = 27) OR (from_id = 28) OR (from_id = 29) OR (from_id = 30))
> Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4)
> OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id
> = 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13)
> OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR
> (to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id
> = 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26)
> OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND
> (from_id > to_id) AND (val > 0.5))
> Total runtime: 2604.383 ms
>
> Thanks,
> Sean
>
>
Sean Davis wrote:
> I answer my own question, if only for my own records. The following
> query is about 5-6 times faster than the original. Of course, if
> anyone else has other ideas, I'd be happy to hear them.
>
> Sean
>
> explain analyze select from_id,to_id,val from exprsdb.correlation where
> from_id in (select to_id from exprsdb.correlation where from_id=2424
> order by val desc limit 100) and to_id in (select to_id from
> exprsdb.correlation where from_id=2424 order by val desc limit 100) and
> val>0.6 and to_id<from_id;

Might not be any faster, but you can do this as a self-join with subquery:

SELECT c1.from_id, c1.to_id, c1.val
FROM
   correlation c1,
   (
     SELECT to_id FROM correlation WHERE from_id=2424
     ORDER BY val DESC LIMIT 100
   ) AS c2,
   (
     SELECT to_id FROM correlation WHERE from_id=2424
     ORDER BY val DESC LIMIT 100
   ) AS c3
WHERE
   c1.from_id = c2.to_id
   AND c1.to_id = c3.to_id
   AND c1.val > 0.5
   AND c1.to_id < from_id
;

I think PG should be smart enough nowadays to figure out these two
queries are basically the same.

--
   Richard Huxton
   Archonet Ltd
----- Original Message ----- From: "Richard Huxton" <dev@archonet.com>
To: "Sean Davis" <sdavis2@mail.nih.gov>
Cc: "PostgreSQL SQL" <pgsql-sql@postgresql.org>
Sent: Tuesday, March 22, 2005 3:59 PM
Subject: Re: [SQL] Self-referencing table question

> Sean Davis wrote:
>> I answer my own question, if only for my own records. The following
>> query is about 5-6 times faster than the original. Of course, if anyone
>> else has other ideas, I'd be happy to hear them.
>>
>> Sean
>>
>> explain analyze select from_id,to_id,val from exprsdb.correlation where
>> from_id in (select to_id from exprsdb.correlation where from_id=2424
>> order by val desc limit 100) and to_id in (select to_id from
>> exprsdb.correlation where from_id=2424 order by val desc limit 100) and
>> val>0.6 and to_id<from_id;
>
> Might not be any faster, but you can do this as a self-join with subquery:
>
> SELECT c1.from_id, c1.to_id, c1.val
> FROM
>    correlation c1,
>    (
>      SELECT to_id FROM correlation WHERE from_id=2424
>      ORDER BY val DESC LIMIT 100
>    ) AS c2,
>    (
>      SELECT to_id FROM correlation WHERE from_id=2424
>      ORDER BY val DESC LIMIT 100
>    ) AS c3
> WHERE
>    c1.from_id = c2.to_id
>    AND c1.to_id = c3.to_id
>    AND c1.val > 0.5
>    AND c1.to_id < from_id
> ;
>
> I think PG should be smart enough nowadays to figure out these two queries
> are basically the same.

Richard,

In another email, I posted what I did (which was what you suggest), along
with explain analyze output. It looks like the subquery is 4-6 times
faster, which is getting into the acceptable range for my little web
application. Thanks for the help.

Sean
On Mar 22, 2005, at 7:07 PM, Sean Davis wrote:

> ----- Original Message ----- From: "Richard Huxton" <dev@archonet.com>
> To: "Sean Davis" <sdavis2@mail.nih.gov>
> Cc: "PostgreSQL SQL" <pgsql-sql@postgresql.org>
> Sent: Tuesday, March 22, 2005 3:59 PM
> Subject: Re: [SQL] Self-referencing table question
>
>> Sean Davis wrote:
>>> I answer my own question, if only for my own records. The following
>>> query is about 5-6 times faster than the original. Of course, if
>>> anyone else has other ideas, I'd be happy to hear them.
>>>
>>> Sean
>>>
>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>> where from_id in (select to_id from exprsdb.correlation where
>>> from_id=2424 order by val desc limit 100) and to_id in (select to_id
>>> from exprsdb.correlation where from_id=2424 order by val desc limit
>>> 100) and val>0.6 and to_id<from_id;
>>
>> Might not be any faster, but you can do this as a self-join with
>> subquery:
>>
>> SELECT c1.from_id, c1.to_id, c1.val
>> FROM
>>    correlation c1,
>>    (
>>      SELECT to_id FROM correlation WHERE from_id=2424
>>      ORDER BY val DESC LIMIT 100
>>    ) AS c2,
>>    (
>>      SELECT to_id FROM correlation WHERE from_id=2424
>>      ORDER BY val DESC LIMIT 100
>>    ) AS c3
>> WHERE
>>    c1.from_id = c2.to_id
>>    AND c1.to_id = c3.to_id
>>    AND c1.val > 0.5
>>    AND c1.to_id < from_id
>> ;
>>
>> I think PG should be smart enough nowadays to figure out these two
>> queries are basically the same.

Oops, I DID do a different query in my previous email than what you
suggest in your email. Testing both against each other, the two
queries (using subselects in 'in' and doing a self-join via subquery)
have basically the same performance.

Thanks again for the help.

Sean
Sometimes using a temp table is a better idea:
e.g.
-- start by creating a temp table 'tids' that holds the to_ids that
-- we are interested in.
SELECT to_id
INTO TEMP TABLE tids
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC LIMIT 100;

-- The following temp table makes use of the primary key on
-- the correlation table, and the stated goal from the original
-- question that:
--     from_id > to_id
--     and from_id in (tids.to_id)
--     and to_id in (tids.to_id)
SELECT t1.to_id AS from_id, t2.to_id
INTO TEMP TABLE from_to
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

-- Now we can use the from_to table as an index into the correlation
-- table.
SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING(from_id, to_id)
WHERE val > 0.5;

The explain analyze for the final select works out to:

Nested Loop  (cost=0.00..50692.00 rows=8488 width=16) (actual
time=0.171..150.095 rows=2427 loops=1)
  ->  Seq Scan on from_to  (cost=0.00..79.38 rows=5238 width=8)
      (actual time=0.006..7.660 rows=4950 loops=1)
  ->  Index Scan using correlation_pkey on correlation c
      (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025
      rows=0 loops=4950)
        Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id =
        c.to_id))
        Filter: (val > 0.5::double precision)
Total runtime: 152.261 ms
Richard Huxton wrote:
> Sean Davis wrote:
>
>> I answer my own question, if only for my own records. The following
>> query is about 5-6 times faster than the original. Of course, if
>> anyone else has other ideas, I'd be happy to hear them.
>>
>> Sean
>>
>> explain analyze select from_id,to_id,val from exprsdb.correlation
>> where from_id in (select to_id from exprsdb.correlation where
>> from_id=2424 order by val desc limit 100) and to_id in (select to_id
>> from exprsdb.correlation where from_id=2424 order by val desc limit
>> 100) and val>0.6 and to_id<from_id;
>
>
> Might not be any faster, but you can do this as a self-join with
> subquery:
>
> SELECT c1.from_id, c1.to_id, c1.val
> FROM
> correlation c1,
> (
> SELECT to_id FROM correlation WHERE from_id=2424
> ORDER BY val DESC LIMIT 100
> ) AS c2,
> (
> SELECT to_id FROM correlation WHERE from_id=2424
> ORDER BY val DESC LIMIT 100
> ) AS c3
> WHERE
> c1.from_id = c2.to_id
> AND c1.to_id = c3.to_id
> AND c1.val > 0.5
> AND c1.to_id < from_id
> ;
>
> I think PG should be smart enough nowadays to figure out these two
> queries are basically the same.
--
Edmund Bacon <ebacon@onesystem.com>
Thanks. I thought about that a bit and it seems like it is highly
likely to be expensive for a single query (though I should probably try
it at some point). If I do find myself reformatting results in
response to user input (i.e., reusing the query), though, then your
solution is likely to be very useful.
Sean
On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
> Sometimes using a temp table is a better idea:
>
> e.g.
>
> -- start by creating a temp table 'tids' that hold the to_ids that
> -- we are interested in.
> SELECT to_id
> INTO TEMP TABLE tids
> FROM correlation
> WHERE from_id = 1234
> ORDER BY val DESC limit 100;
>
> -- The following temp table makes use of the primary key on
> -- the correlation table, and the stated goal from the original
> -- question that:
> -- from_id > to_id
> -- and from_id in (tids.to_id)
> -- and to_id in (tids.to_id)
>
> SELECT t1.to_id AS from_id, t2.to_id
> INTO TEMP TABLE from_to
> FROM tids t1, tids t2
> WHERE t1.to_id > t2.to_id;
>
> -- Now we can use the from_to table as an index into the correlation
> -- table.
>
> SELECT c.from_id, c.to_id, c.val
> FROM from_to
> JOIN correlation c USING(from_id, to_id)
> WHERE val > 0.5;
>
>
> The explain analyze for the final select works out to:
> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual
> time=0.171..150.095 rows=2427 loops=1)
> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8)
> (actual time=0.006..7.660 rows=4950 loops=1)
> -> Index Scan using correlation_pkey on correlation c
> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0
> loops=4950)
> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id
> = c.to_id))
> Filter: (val > 0.5::double precision)
> Total runtime: 152.261 ms
>
>
> Richard Huxton wrote:
>
>> Sean Davis wrote:
>>
>>> I answer my own question, if only for my own records. The following
>>> query is about 5-6 times faster than the original. Of course, if
>>> anyone else has other ideas, I'd be happy to hear them.
>>>
>>> Sean
>>>
>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>> where from_id in (select to_id from exprsdb.correlation where
>>> from_id=2424 order by val desc limit 100) and to_id in (select
>>> to_id from exprsdb.correlation where from_id=2424 order by val desc
>>> limit 100) and val>0.6 and to_id<from_id;
>>
>>
>> Might not be any faster, but you can do this as a self-join with
>> subquery:
>>
>> SELECT c1.from_id, c1.to_id, c1.val
>> FROM
>> correlation c1,
>> (
>> SELECT to_id FROM correlation WHERE from_id=2424
>> ORDER BY val DESC LIMIT 100
>> ) AS c2,
>> (
>> SELECT to_id FROM correlation WHERE from_id=2424
>> ORDER BY val DESC LIMIT 100
>> ) AS c3
>> WHERE
>> c1.from_id = c2.to_id
>> AND c1.to_id = c3.to_id
>> AND c1.val > 0.5
>> AND c1.to_id < from_id
>> ;
>>
>> I think PG should be smart enough nowadays to figure out these two
>> queries are basically the same.
>
>
> --
> Edmund Bacon <ebacon@onesystem.com>
>
>
Sean Davis wrote:
> Thanks. I thought about that a bit and it seems like it is highly
> likely to be expensive for a single query (though I should probably
> try it at some point). If I do find myself reformatting results in
> response to user input (i.e., reusing the query), though, then your
> solution is likely to be very useful.
>
You might think so, however, consider:
$ cat temptable.sql
SELECT to_id INTO TEMP TABLE tids FROM correlation
WHERE from_id = 1234 ORDER BY val DESC LIMIT 100;

SELECT t1.to_id AS from_id, t2.to_id INTO TEMP TABLE from_to
FROM tids t1, tids t2 WHERE t1.to_id > t2.to_id;

SELECT c.from_id, c.to_id, c.val FROM from_to
JOIN correlation c USING(from_id, to_id)
WHERE val > 0.5 ORDER BY from_id, to_id;

$ cat subselect.sql
select from_id, to_id, val from correlation
where from_id in (select to_id from correlation
  where from_id = 1234 order by val desc limit 100)
and to_id in (select to_id from correlation
  where from_id = 1234 order by val desc limit 100)
and from_id > to_id and val > 0.5
order by from_id, to_id;

$ psql -q -c "select count(1) from correlation" test
  count
----------
 64000000
(1 row)
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m3.093s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m0.238s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.945s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.949s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.953s
user 0m0.010s
sys 0m0.000s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m0.237s
user 0m0.020s
sys 0m0.000s
$
Note that the subselect version takes about 10 times as long as the
temptable version, and does not seem to be dependent on what data might
be cached.
I added the sort so that the results would be in the same order - you
can see that the two query sets are producing the same output (at least
md5sum thinks they are the same).
One thing to be aware of is the size of your returned data set - if it's
fairly large, then the transfer time between the pgsql box and your
web-server might overwhelm any "small" optimization in query time.
> Sean
>
> On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
>
>> Sometimes using a temp table is a better idea:
>>
>> e.g.
>>
>> -- start by creating a temp table 'tids' that hold the to_ids that
>> -- we are interested in.
>> SELECT to_id
>> INTO TEMP TABLE tids
>> FROM correlation
>> WHERE from_id = 1234
>> ORDER BY val DESC limit 100;
>>
>> -- The following temp table makes use of the primary key on
>> -- the correlation table, and the stated goal from the original
>> -- question that:
>> -- from_id > to_id
>> -- and from_id in (tids.to_id)
>> -- and to_id in (tids.to_id)
>>
>> SELECT t1.to_id AS from_id, t2.to_id
>> INTO TEMP TABLE from_to
>> FROM tids t1, tids t2
>> WHERE t1.to_id > t2.to_id;
>>
>> -- Now we can use the from_to table as an index into the correlation
>> -- table.
>>
>> SELECT c.from_id, c.to_id, c.val
>> FROM from_to
>> JOIN correlation c USING(from_id, to_id)
>> WHERE val > 0.5;
>>
>>
>> The explain analyze for the final select works out to:
>> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual
>> time=0.171..150.095 rows=2427 loops=1)
>> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8)
>> (actual time=0.006..7.660 rows=4950 loops=1)
>> -> Index Scan using correlation_pkey on correlation c
>> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0
>> loops=4950)
>> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id
>> = c.to_id))
>> Filter: (val > 0.5::double precision)
>> Total runtime: 152.261 ms
>>
>>
>> Richard Huxton wrote:
>>
>>> Sean Davis wrote:
>>>
>>>> I answer my own question, if only for my own records. The
>>>> following query is about 5-6 times faster than the original. Of
>>>> course, if anyone else has other ideas, I'd be happy to hear them.
>>>>
>>>> Sean
>>>>
>>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>>> where from_id in (select to_id from exprsdb.correlation where
>>>> from_id=2424 order by val desc limit 100) and to_id in (select
>>>> to_id from exprsdb.correlation where from_id=2424 order by val
>>>> desc limit 100) and val>0.6 and to_id<from_id;
>>>
>>>
>>>
>>> Might not be any faster, but you can do this as a self-join with
>>> subquery:
>>>
>>> SELECT c1.from_id, c1.to_id, c1.val
>>> FROM
>>> correlation c1,
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c2,
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c3
>>> WHERE
>>> c1.from_id = c2.to_id
>>> AND c1.to_id = c3.to_id
>>> AND c1.val > 0.5
>>> AND c1.to_id < from_id
>>> ;
>>>
>>> I think PG should be smart enough nowadays to figure out these two
>>> queries are basically the same.
>>
>>
>>
>> --
>> Edmund Bacon <ebacon@onesystem.com>
>>
>>
>
>
--
Edmund Bacon <ebacon@onesystem.com>
On Mar 24, 2005, at 1:11 PM, Edmund Bacon wrote:

> Sean Davis wrote:
>
>> Thanks. I thought about that a bit and it seems like it is highly
>> likely to be expensive for a single query (though I should probably
>> try it at some point). If I do find myself reformatting results in
>> response to user input (i.e., reusing the query), though, then your
>> solution is likely to be very useful.
>
> Note that the subselect version takes about 10 times as long as the
> temptable version, and does not seem to be dependent on what data
> might be cached.

Nice. Thanks for doing my work for me! I guess I will have to think
about it more seriously.

It could be a bit complicated because my code is running under
mod_perl, so connections are cached. As I understand it, the temp
table will stick around, so I will have to be careful to explicitly
drop it if I don't want it to persist? Also, each table will need a
unique name (I have a session_id I can use), as it is possible that
multiple temp tables will exist and be visible to each other?

Thanks again,
Sean
Sean Davis wrote:
> Nice. Thanks for doing my work for me!

Yeah, well, put it down to a certain amount of curiosity and a slack
period at work ...

> I guess I will have to think about it more seriously.
>
> It could be a bit complicated because my code is running under
> mod_perl, so connections are cached. As I understand it, the temp
> table will stick around, so I will have to be careful to explicitly
> drop it if I don't want it to persist?

I'm guessing so. However, you could put everything in a transaction
and use CREATE TEMP TABLE ... ON COMMIT DROP, and use INSERT INTO
rather than SELECT INTO. The speed should be about equivalent - but
you'd have to test to make sure.

> Also, each table will need a unique name (I have a session_id I can
> use), as it is possible that multiple temp tables will exist and be
> visible to each other?

Each session (connection in your case?) has its own temporary table
space, so you shouldn't have to worry about that.

--
Edmund Bacon <ebacon@onesystem.com>
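For concreteness, here is a minimal sketch of the transaction-wrapped
variant described above, reusing the correlation table and the seed
from_id = 1234 from the earlier temp-table example. The explicit column
lists are an assumption for illustration; nothing below was benchmarked
in this thread.

BEGIN;

-- Both temp tables vanish automatically at COMMIT, so a pooled
-- mod_perl connection is handed back clean for the next request.
CREATE TEMP TABLE tids (to_id integer) ON COMMIT DROP;
INSERT INTO tids
    SELECT to_id FROM correlation
    WHERE from_id = 1234
    ORDER BY val DESC LIMIT 100;

CREATE TEMP TABLE from_to (from_id integer, to_id integer) ON COMMIT DROP;
INSERT INTO from_to
    SELECT t1.to_id, t2.to_id
    FROM tids t1, tids t2
    WHERE t1.to_id > t2.to_id;

-- Fetch the graph edges before COMMIT, while the temp tables still exist.
SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING (from_id, to_id)
WHERE c.val > 0.5;

COMMIT;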
On Mar 24, 2005, at 2:37 PM, Edmund Bacon wrote:

> Sean Davis wrote:
>
>> Nice. Thanks for doing my work for me!
>
> Yeah, well, put it down to a certain amount of curiosity and a slack
> period at work ...
>
>> I guess I will have to think about it more seriously.
>>
>> It could be a bit complicated because my code is running under
>> mod_perl, so connections are cached. As I understand it, the temp
>> table will stick around, so I will have to be careful to explicitly
>> drop it if I don't want it to persist?
>
> I'm guessing so. However, you could put everything in a transaction
> and use CREATE TEMP TABLE ... ON COMMIT DROP, and use INSERT INTO
> rather than SELECT INTO. The speed should be about equivalent - but
> you'd have to test to make sure.
>
>> Also, each table will need a unique name (I have a session_id I can
>> use), as it is possible that multiple temp tables will exist and be
>> visible to each other?
>
> Each session (connection in your case?) has its own temporary table
> space, so you shouldn't have to worry about that.

Sessions don't map 1-to-1 with connections in the web environment. It
is possible that a connection to the database would be simultaneously
serving multiple users (sessions), if I understand Apache::DBI
correctly. In any case, this is probably a viable solution.

Sean