Discussion: Self-referencing table question
I have a table that looks like:

 Column  |     Type     | Modifiers | Description
---------+--------------+-----------+-------------
 from_id | integer      | not null  |
 to_id   | integer      | not null  |
 val     | numeric(4,3) |           |
Indexes:
    "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
    "correlation_from_id_idx" btree (from_id)
    "correlation_to_id_idx" btree (to_id)
    "correlation_val_idx" btree (val)
Has OIDs: yes

The table describes a pairwise correlation matrix between about 7700
vectors (so the table has n^2 = 60652944 rows, to be exact). I am
trying to choose the top 100 correlated vectors with a seed vector;
this is easily:

select to_id from correlation where from_id=623 order by val desc
limit 100;

Then, I want to take those 100 values and find all from_id,to_id
tuples where val>0.5 (to construct a graph where all "ids" are nodes
and are connected to each other when their correlation is >0.5). I
can do this like:

explain analyze select
from_id,to_id,val
from exprsdb.correlation
where from_id in
(1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and to_id in
(1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
and from_id>to_id
and val>0.5;

However, this does not scale well AT ALL. The actual (very messy)
explain analyze output is below. The thing I notice is that the index
on to_id is not used. Also, the primary key index on (from_id, to_id)
is not used, it seems. Finally, with only 30 values, this already
takes 2.6 seconds and I am proposing to do this on 100-200 values.
Any hints on how better to accomplish this set of tasks?

Index Scan using correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx,
correlation_from_id_idx, correlation_from_id_idx on correlation
(cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967
rows=19 loops=1)
  Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR
(from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR
(from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR
(from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15)
OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id = 19)
OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id = 23)
OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR (from_id = 27)
OR (from_id = 28) OR (from_id = 29) OR (from_id = 30))
  Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4)
OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id
= 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13)
OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR
(to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id
= 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26)
OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND
(from_id > to_id) AND (val > 0.5))
Total runtime: 2604.383 ms
Thanks, Sean
I answer my own question, if only for my own records. The following
query is about 5-6 times faster than the original. Of course, if
anyone else has other ideas, I'd be happy to hear them.
Sean
explain analyze select from_id,to_id,val from exprsdb.correlation where
from_id in (select to_id from exprsdb.correlation where from_id=2424
order by val desc limit 100) and to_id in (select to_id from
exprsdb.correlation where from_id=2424 order by val desc limit 100) and
val>0.6 and to_id<from_id;

                              QUERY PLAN
------------------------------------------------------------------------
Hash IN Join  (cost=4709.94..74758.01 rows=555 width=17) (actual
time=110.291..1671.767 rows=973 loops=1)
  Hash Cond: ("outer".to_id = "inner".to_id)
  ->  Nested Loop  (cost=2354.97..72181.72 rows=43154 width=17)
      (actual time=54.036..1612.746 rows=1482 loops=1)
        ->  HashAggregate  (cost=2354.97..2354.97 rows=100 width=4)
            (actual time=53.656..54.062 rows=100 loops=1)
              ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72
                  rows=100 width=4) (actual time=53.473..53.595
                  rows=100 loops=1)
                    ->  Limit  (cost=2353.47..2353.72 rows=100 width=13)
                        (actual time=53.469..53.507 rows=100 loops=1)
                          ->  Sort  (cost=2353.47..2415.03 rows=24624
                              width=13) (actual time=53.467..53.481
                              rows=100 loops=1)
                                Sort Key: val
                                ->  Index Scan using
                                    correlation_from_id_idx on
                                    correlation  (cost=0.00..557.42
                                    rows=24624 width=13) (actual
                                    time=0.199..17.717 rows=7788 loops=1)
                                      Index Cond: (from_id = 2424)
        ->  Index Scan using correlation_from_id_idx on correlation
            (cost=0.00..692.87 rows=432 width=17) (actual
            time=2.765..15.560 rows=15 loops=100)
              Index Cond: (correlation.from_id = "outer".to_id)
              Filter: ((val > 0.6) AND (to_id < from_id))
  ->  Hash  (cost=2354.72..2354.72 rows=100 width=4) (actual
      time=56.239..56.239 rows=0 loops=1)
        ->  Subquery Scan "IN_subquery"  (cost=2353.47..2354.72
            rows=100 width=4) (actual time=56.004..56.121 rows=100
            loops=1)
              ->  Limit  (cost=2353.47..2353.72 rows=100 width=13)
                  (actual time=56.001..56.038 rows=100 loops=1)
                    ->  Sort  (cost=2353.47..2415.03 rows=24624
                        width=13) (actual time=55.999..56.012 rows=100
                        loops=1)
                          Sort Key: val
                          ->  Index Scan using correlation_from_id_idx
                              on correlation  (cost=0.00..557.42
                              rows=24624 width=13) (actual
                              time=0.517..20.307 rows=7788 loops=1)
                                Index Cond: (from_id = 2424)
Total runtime: 1676.966 ms
On Mar 22, 2005, at 2:33 PM, Sean Davis wrote:
> I have a table that looks like:
>
> Column | Type | Modifiers | Description
> ---------+--------------+-----------+-------------
> from_id | integer | not null |
> to_id | integer | not null |
> val | numeric(4,3) | |
> Indexes:
> "correlation_pkey" PRIMARY KEY, btree (from_id, to_id)
> "correlation_from_id_idx" btree (from_id)
> "correlation_to_id_idx" btree (to_id)
> "correlation_val_idx" btree (val)
> Has OIDs: yes
>
> The table describes a pairwise correlation matrix between about 7700
> vectors (so the table has n^2= 60652944 rows, to be exact). I am
> trying to choose the top 100 correlated vectors with a seed vector;
> this is easily:
>
> select to_id from correlation where from_id=623 order by val desc
> limit 100;
>
> Then, I want to take those 100 values and find all from_id,to_id
> tuples where val>0.5 (to construct a graph where all "ids" are nodes
> and are connected to each other when their correlation is >0.5). I
> can do this like:
>
> explain analyze select
> from_id,to_id,val
> from exprsdb.correlation
> where from_id in
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2
> 7,28,29,30)
> and to_id in
> (1,2,3,4,5,6,7,8,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,2
> 7,28,29,30)
> and from_id>to_id
> and val>0.5;
>
> However, this does not scale well AT ALL. The actual (very messy)
> explain analyze output is below. The thing I notice is that the index
> on to_id is not used. Also, the primary key index on (from_id, to_id)
> is not used, it seems. Finally, with only 30 values, this already
> takes 2.6 seconds and I am proposing to do this on 100-200 values.
> Any hints on how better to accomplish this set of tasks?
>
> Index Scan using correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx,
> correlation_from_id_idx, correlation_from_id_idx on correlation
> (cost=0.00..129377.49 rows=62 width=17) (actual time=340.563..2603.967
> rows=19 loops=1)
> Index Cond: ((from_id = 1) OR (from_id = 2) OR (from_id = 3) OR
> (from_id = 4) OR (from_id = 5) OR (from_id = 6) OR (from_id = 7) OR
> (from_id = 8) OR (from_id = 10) OR (from_id = 9) OR (from_id = 11) OR
> (from_id = 12) OR (from_id = 13) OR (from_id = 14) OR (from_id = 15)
> OR (from_id = 16) OR (from_id = 17) OR (from_id = 18) OR (from_id =
> 19) OR (from_id = 20) OR (from_id = 21) OR (from_id = 22) OR (from_id
> = 23) OR (from_id = 24) OR (from_id = 25) OR (from_id = 26) OR
> (from_id = 27) OR (from_id = 28) OR (from_id = 29) OR (from_id = 30))
> Filter: (((to_id = 1) OR (to_id = 2) OR (to_id = 3) OR (to_id = 4)
> OR (to_id = 5) OR (to_id = 6) OR (to_id = 7) OR (to_id = 8) OR (to_id
> = 10) OR (to_id = 9) OR (to_id = 11) OR (to_id = 12) OR (to_id = 13)
> OR (to_id = 14) OR (to_id = 15) OR (to_id = 16) OR (to_id = 17) OR
> (to_id = 18) OR (to_id = 19) OR (to_id = 20) OR (to_id = 21) OR (to_id
> = 22) OR (to_id = 23) OR (to_id = 24) OR (to_id = 25) OR (to_id = 26)
> OR (to_id = 27) OR (to_id = 28) OR (to_id = 29) OR (to_id = 30)) AND
> (from_id > to_id) AND (val > 0.5))
> Total runtime: 2604.383 ms
>
> Thanks,
> Sean
>
>
Sean Davis wrote:
> I answer my own question, if only for my own records. The following
> query is about 5-6 times faster than the original. Of course, if
> anyone else has other ideas, I'd be happy to hear them.
>
> Sean
>
> explain analyze select from_id,to_id,val from exprsdb.correlation where
> from_id in (select to_id from exprsdb.correlation where from_id=2424
> order by val desc limit 100) and to_id in (select to_id from
> exprsdb.correlation where from_id=2424 order by val desc limit 100) and
> val>0.6 and to_id<from_id;

Might not be any faster, but you can do this as a self-join with subquery:

SELECT c1.from_id, c1.to_id, c1.val
FROM
   correlation c1,
   (
     SELECT to_id FROM correlation WHERE from_id=2424
     ORDER BY val DESC LIMIT 100
   ) AS c2,
   (
     SELECT to_id FROM correlation WHERE from_id=2424
     ORDER BY val DESC LIMIT 100
   ) AS c3
WHERE
   c1.from_id = c2.to_id
   AND c1.to_id = c3.to_id
   AND c1.val > 0.5
   AND c1.to_id < from_id
;

I think PG should be smart enough nowadays to figure out these two
queries are basically the same.

--
   Richard Huxton
   Archonet Ltd
----- Original Message ----- From: "Richard Huxton" <dev@archonet.com>
To: "Sean Davis" <sdavis2@mail.nih.gov>
Cc: "PostgreSQL SQL" <pgsql-sql@postgresql.org>
Sent: Tuesday, March 22, 2005 3:59 PM
Subject: Re: [SQL] Self-referencing table question

> Sean Davis wrote:
>> I answer my own question, if only for my own records. The following
>> query is about 5-6 times faster than the original. Of course, if anyone
>> else has other ideas, I'd be happy to hear them.
>>
>> Sean
>>
>> explain analyze select from_id,to_id,val from exprsdb.correlation where
>> from_id in (select to_id from exprsdb.correlation where from_id=2424
>> order by val desc limit 100) and to_id in (select to_id from
>> exprsdb.correlation where from_id=2424 order by val desc limit 100) and
>> val>0.6 and to_id<from_id;
>
> Might not be any faster, but you can do this as a self-join with subquery:
>
> SELECT c1.from_id, c1.to_id, c1.val
> FROM
>    correlation c1,
>    (
>      SELECT to_id FROM correlation WHERE from_id=2424
>      ORDER BY val DESC LIMIT 100
>    ) AS c2,
>    (
>      SELECT to_id FROM correlation WHERE from_id=2424
>      ORDER BY val DESC LIMIT 100
>    ) AS c3
> WHERE
>    c1.from_id = c2.to_id
>    AND c1.to_id = c3.to_id
>    AND c1.val > 0.5
>    AND c1.to_id < from_id
> ;
>
> I think PG should be smart enough nowadays to figure out these two queries
> are basically the same.

Richard,

In another email, I posted what I did (which was what you suggest), along
with explain analyze output. It looks like the subquery is 4-6 times
faster, which is getting into the acceptable range for my little web
application. Thanks for the help.

Sean
On Mar 22, 2005, at 7:07 PM, Sean Davis wrote:

> ----- Original Message ----- From: "Richard Huxton" <dev@archonet.com>
> To: "Sean Davis" <sdavis2@mail.nih.gov>
> Cc: "PostgreSQL SQL" <pgsql-sql@postgresql.org>
> Sent: Tuesday, March 22, 2005 3:59 PM
> Subject: Re: [SQL] Self-referencing table question
>
>> Sean Davis wrote:
>>> I answer my own question, if only for my own records. The following
>>> query is about 5-6 times faster than the original. Of course, if
>>> anyone else has other ideas, I'd be happy to hear them.
>>>
>>> Sean
>>>
>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>> where from_id in (select to_id from exprsdb.correlation where
>>> from_id=2424 order by val desc limit 100) and to_id in (select to_id
>>> from exprsdb.correlation where from_id=2424 order by val desc limit
>>> 100) and val>0.6 and to_id<from_id;
>>
>> Might not be any faster, but you can do this as a self-join with
>> subquery:
>>
>> SELECT c1.from_id, c1.to_id, c1.val
>> FROM
>>    correlation c1,
>>    (
>>      SELECT to_id FROM correlation WHERE from_id=2424
>>      ORDER BY val DESC LIMIT 100
>>    ) AS c2,
>>    (
>>      SELECT to_id FROM correlation WHERE from_id=2424
>>      ORDER BY val DESC LIMIT 100
>>    ) AS c3
>> WHERE
>>    c1.from_id = c2.to_id
>>    AND c1.to_id = c3.to_id
>>    AND c1.val > 0.5
>>    AND c1.to_id < from_id
>> ;
>>
>> I think PG should be smart enough nowadays to figure out these two
>> queries are basically the same.

Oops, I DID do a different query in my previous email than what you
suggest in your email. Testing both against each other, the two
queries (using subselects in 'in' and doing a self-join via subquery)
have basically the same performance.

Thanks again for the help.

Sean
Sometimes using a temp table is a better idea:
e.g.
-- start by creating a temp table 'tids' that holds the to_ids that
-- we are interested in.
SELECT to_id
INTO TEMP TABLE tids
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC LIMIT 100;

-- The following temp table makes use of the primary key on
-- the correlation table, and the stated goal from the original
-- question that:
--     from_id > to_id
--     and from_id in (tids.to_id)
--     and to_id in (tids.to_id)
SELECT t1.to_id AS from_id, t2.to_id
INTO TEMP TABLE from_to
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

-- Now we can use the from_to table as an index into the correlation
-- table.
SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING(from_id, to_id)
WHERE val > 0.5;

The explain analyze for the final select works out to:

Nested Loop  (cost=0.00..50692.00 rows=8488 width=16) (actual
time=0.171..150.095 rows=2427 loops=1)
  ->  Seq Scan on from_to  (cost=0.00..79.38 rows=5238 width=8)
      (actual time=0.006..7.660 rows=4950 loops=1)
  ->  Index Scan using correlation_pkey on correlation c
      (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025
      rows=0 loops=4950)
        Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id =
        c.to_id))
        Filter: (val > 0.5::double precision)
Total runtime: 152.261 ms
Richard Huxton wrote:
> Sean Davis wrote:
>
>> I answer my own question, if only for my own records. The following
>> query is about 5-6 times faster than the original. Of course, if
>> anyone else has other ideas, I'd be happy to hear them.
>>
>> Sean
>>
>> explain analyze select from_id,to_id,val from exprsdb.correlation
>> where from_id in (select to_id from exprsdb.correlation where
>> from_id=2424 order by val desc limit 100) and to_id in (select to_id
>> from exprsdb.correlation where from_id=2424 order by val desc limit
>> 100) and val>0.6 and to_id<from_id;
>
>
> Might not be any faster, but you can do this as a self-join with
> subquery:
>
> SELECT c1.from_id, c1.to_id, c1.val
> FROM
> correlation c1,
> (
> SELECT to_id FROM correlation WHERE from_id=2424
> ORDER BY val DESC LIMIT 100
> ) AS c2,
> (
> SELECT to_id FROM correlation WHERE from_id=2424
> ORDER BY val DESC LIMIT 100
> ) AS c3
> WHERE
> c1.from_id = c2.to_id
> AND c1.to_id = c3.to_id
> AND c1.val > 0.5
> AND c1.to_id < from_id
> ;
>
> I think PG should be smart enough nowadays to figure out these two
> queries are basically the same.
--
Edmund Bacon <ebacon@onesystem.com>
Thanks. I thought about that a bit and it seems like it is highly
likely to be expensive for a single query (though I should probably try
it at some point). If I do find myself reformatting results in
response to user input (i.e., reusing the query), though, then your
solution is likely to be very useful.
Sean
On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
> Sometimes using a temp table is a better idea:
>
> e.g.
>
> -- start by creating a temp table 'tids' that hold the to_ids that
> -- we are interested in.
> SELECT to_id
> INTO TEMP TABLE tids
> FROM correlation
> WHERE from_id = 1234
> ORDER BY val DESC limit 100;
>
> -- The following temp table makes use of the primary key on
> -- the correlation table, and the stated goal from the original
> -- question that:
> -- from_id > to_id
> -- and from_id in (tids.to_id)
> -- and to_id in (tids.to_id)
>
> SELECT t1.to_id AS from_id, t2.to_id
> INTO TEMP TABLE from_to
> FROM tids t1, tids t2
> WHERE t1.to_id > t2.to_id;
>
> -- Now we can use the from_to table as an index into the correlation
> -- table.
>
> SELECT c.from_id, c.to_id, c.val
> FROM from_to
> JOIN correlation c USING(from_id, to_id)
> WHERE val > 0.5;
>
>
> The explain analyze for the final select works out to:
> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual
> time=0.171..150.095 rows=2427 loops=1)
> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8)
> (actual time=0.006..7.660 rows=4950 loops=1)
> -> Index Scan using correlation_pkey on correlation c
> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0
> loops=4950)
> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id
> = c.to_id))
> Filter: (val > 0.5::double precision)
> Total runtime: 152.261 ms
>
>
> Richard Huxton wrote:
>
>> Sean Davis wrote:
>>
>>> I answer my own question, if only for my own records. The following
>>> query is about 5-6 times faster than the original. Of course, if
>>> anyone else has other ideas, I'd be happy to hear them.
>>>
>>> Sean
>>>
>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>> where from_id in (select to_id from exprsdb.correlation where
>>> from_id=2424 order by val desc limit 100) and to_id in (select
>>> to_id from exprsdb.correlation where from_id=2424 order by val desc
>>> limit 100) and val>0.6 and to_id<from_id;
>>
>>
>> Might not be any faster, but you can do this as a self-join with
>> subquery:
>>
>> SELECT c1.from_id, c1.to_id, c1.val
>> FROM
>> correlation c1,
>> (
>> SELECT to_id FROM correlation WHERE from_id=2424
>> ORDER BY val DESC LIMIT 100
>> ) AS c2,
>> (
>> SELECT to_id FROM correlation WHERE from_id=2424
>> ORDER BY val DESC LIMIT 100
>> ) AS c3
>> WHERE
>> c1.from_id = c2.to_id
>> AND c1.to_id = c3.to_id
>> AND c1.val > 0.5
>> AND c1.to_id < from_id
>> ;
>>
>> I think PG should be smart enough nowadays to figure out these two
>> queries are basically the same.
>
>
> --
> Edmund Bacon <ebacon@onesystem.com>
>
>
Sean Davis wrote:
> Thanks. I thought about that a bit and it seems like it is highly
> likely to be expensive for a single query (though I should probably
> try it at some point). If I do find myself reformatting results in
> response to user input (i.e., reusing the query), though, then your
> solution is likely to be very useful.
>
You might think so, however, consider:
$ cat temptable.sql
SELECT to_id INTO TEMP TABLE tids FROM correlation
WHERE from_id = 1234 ORDER BY val DESC LIMIT 100;

SELECT t1.to_id AS from_id, t2.to_id INTO TEMP TABLE from_to
FROM tids t1, tids t2 WHERE t1.to_id > t2.to_id;

SELECT c.from_id, c.to_id, c.val FROM from_to
JOIN correlation c USING(from_id, to_id)
WHERE val > 0.5 ORDER BY from_id, to_id;

$ cat subselect.sql
select from_id, to_id, val from correlation
where from_id in (select to_id from correlation
  where from_id = 1234 order by val desc limit 100)
and to_id in (select to_id from correlation
  where from_id = 1234 order by val desc limit 100)
and from_id > to_id and val > 0.5
order by from_id, to_id;

$ psql -q -c "select count(1) from correlation" test
  count
----------
 64000000
(1 row)
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m3.093s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m0.238s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.945s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.949s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m1.953s
user 0m0.010s
sys 0m0.000s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -
real 0m0.237s
user 0m0.020s
sys 0m0.000s
$
Note that the subselect version takes about 10 times as long as the
temptable version, and does not seem to be dependent on what data might
be cached.
I added the sort so that the results would be in the same order - you
can see that the two query sets are producing the same output (at least
md5sum thinks they are the same).
One thing to be aware of is the size of your returned data set - if it's
fairly large, then the transfer time between the pgsql box and your
web-server might overwhelm any "small" optimization in query time.
> Sean
>
> On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
>
>> Sometimes using a temp table is a better idea:
>>
>> e.g.
>>
>> -- start by creating a temp table 'tids' that hold the to_ids that
>> -- we are interested in.
>> SELECT to_id
>> INTO TEMP TABLE tids
>> FROM correlation
>> WHERE from_id = 1234
>> ORDER BY val DESC limit 100;
>>
>> -- The following temp table makes use of the primary key on
>> -- the correlation table, and the stated goal from the original
>> -- question that:
>> -- from_id > to_id
>> -- and from_id in (tids.to_id)
>> -- and to_id in (tids.to_id)
>>
>> SELECT t1.to_id AS from_id, t2.to_id
>> INTO TEMP TABLE from_to
>> FROM tids t1, tids t2
>> WHERE t1.to_id > t2.to_id;
>>
>> -- Now we can use the from_to table as an index into the correlation
>> -- table.
>>
>> SELECT c.from_id, c.to_id, c.val
>> FROM from_to
>> JOIN correlation c USING(from_id, to_id)
>> WHERE val > 0.5;
>>
>>
>> The explain analyze for the final select works out to:
>> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual
>> time=0.171..150.095 rows=2427 loops=1)
>> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8)
>> (actual time=0.006..7.660 rows=4950 loops=1)
>> -> Index Scan using correlation_pkey on correlation c
>> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0
>> loops=4950)
>> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id
>> = c.to_id))
>> Filter: (val > 0.5::double precision)
>> Total runtime: 152.261 ms
>>
>>
>> Richard Huxton wrote:
>>
>>> Sean Davis wrote:
>>>
>>>> I answer my own question, if only for my own records. The
>>>> following query is about 5-6 times faster than the original. Of
>>>> course, if anyone else has other ideas, I'd be happy to hear them.
>>>>
>>>> Sean
>>>>
>>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>>> where from_id in (select to_id from exprsdb.correlation where
>>>> from_id=2424 order by val desc limit 100) and to_id in (select
>>>> to_id from exprsdb.correlation where from_id=2424 order by val
>>>> desc limit 100) and val>0.6 and to_id<from_id;
>>>
>>>
>>>
>>> Might not be any faster, but you can do this as a self-join with
>>> subquery:
>>>
>>> SELECT c1.from_id, c1.to_id, c1.val
>>> FROM
>>> correlation c1,
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c2,
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c3
>>> WHERE
>>> c1.from_id = c2.to_id
>>> AND c1.to_id = c3.to_id
>>> AND c1.val > 0.5
>>> AND c1.to_id < from_id
>>> ;
>>>
>>> I think PG should be smart enough nowadays to figure out these two
>>> queries are basically the same.
>>
>>
>>
>> --
>> Edmund Bacon <ebacon@onesystem.com>
>>
>>
>
>
--
Edmund Bacon <ebacon@onesystem.com>
On Mar 24, 2005, at 1:11 PM, Edmund Bacon wrote:

> Sean Davis wrote:
>
>> Thanks. I thought about that a bit and it seems like it is highly
>> likely to be expensive for a single query (though I should probably
>> try it at some point). If I do find myself reformatting results in
>> response to user input (i.e., reusing the query), though, then your
>> solution is likely to be very useful.
>
> Note that the subselect version takes about 10 times as long as the
> temptable version, and does not seem to be dependent on what data
> might be cached.

Nice. Thanks for doing my work for me! I guess I will have to think
about it more seriously.

It could be a bit complicated because my code is running under
mod_perl, so connections are cached. As I understand it, the temp
table will stick around, so I will have to be careful to explicitly
drop it if I don't want it to persist? Also, each table will need a
unique name (I have a session_id I can use), as it is possible that
multiple temp tables will exist and be visible to each other?

Thanks again,
Sean
Sean Davis wrote:
> Nice. Thanks for doing my work for me!

Yeah, well, put it down to a certain amount of curiosity and a slack
period at work ...

> I guess I will have to think about it more seriously.
>
> It could be a bit complicated because my code is running under
> mod_perl, so connections are cached. As I understand it, the temp
> table will stick around, so I will have to be careful to explicitly
> drop it if I don't want it to persist?

I'm guessing so. However, you could put everything in a transaction
and use CREATE TEMP TABLE ... ON COMMIT DROP, and use INSERT INTO
rather than SELECT INTO. The speed should be about equivalent - but
you'd have to test to make sure.

> Also, each table will need a unique name (I have a session_id I can
> use), as it is possible that multiple temp tables will exist and be
> visible to each other?

Each session (connection in your case?) has its own temporary table
space, so you shouldn't have to worry about that.

--
Edmund Bacon <ebacon@onesystem.com>
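For concreteness, here is a minimal sketch of the transaction-wrapped
variant described above, reusing the correlation table and the seed
from_id = 1234 from the earlier temp-table example. The explicit column
lists are an assumption for illustration; nothing below was benchmarked
in this thread.

BEGIN;

-- Both temp tables vanish automatically at COMMIT, so a pooled
-- mod_perl connection is handed back clean for the next request.
CREATE TEMP TABLE tids (to_id integer) ON COMMIT DROP;
INSERT INTO tids
    SELECT to_id FROM correlation
    WHERE from_id = 1234
    ORDER BY val DESC LIMIT 100;

CREATE TEMP TABLE from_to (from_id integer, to_id integer) ON COMMIT DROP;
INSERT INTO from_to
    SELECT t1.to_id, t2.to_id
    FROM tids t1, tids t2
    WHERE t1.to_id > t2.to_id;

-- Fetch the graph edges before COMMIT, while the temp tables still exist.
SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING (from_id, to_id)
WHERE c.val > 0.5;

COMMIT;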
On Mar 24, 2005, at 2:37 PM, Edmund Bacon wrote:

> Sean Davis wrote:
>
>> Nice. Thanks for doing my work for me!
>
> Yeah, well, put it down to a certain amount of curiosity and a slack
> period at work ...
>
>> I guess I will have to think about it more seriously.
>>
>> It could be a bit complicated because my code is running under
>> mod_perl, so connections are cached. As I understand it, the temp
>> table will stick around, so I will have to be careful to explicitly
>> drop it if I don't want it to persist?
>
> I'm guessing so. However, you could put everything in a transaction
> and use CREATE TEMP TABLE ... ON COMMIT DROP, and use INSERT INTO
> rather than SELECT INTO. The speed should be about equivalent - but
> you'd have to test to make sure.
>
>> Also, each table will need a unique name (I have a session_id I can
>> use), as it is possible that multiple temp tables will exist and be
>> visible to each other?
>
> Each session (connection in your case?) has its own temporary table
> space, so you shouldn't have to worry about that.

Sessions don't map 1-to-1 with connections in the web environment. It
is possible that a connection to the database would be simultaneously
serving multiple users (sessions), if I understand Apache::DBI
correctly. In any case, this is probably a viable solution.

Sean