Discussion: Performance problem with joined aggregate query
Hello all,

I am trying to use aggregate queries in views, and when joining these views to other tables, I get a seq scan in the view, even when an index scan would clearly be better. The views I am using in my DB are actually long pivot queries, but the following simple test case is enough to show the problem.

I will first show the table definitions, then the performance problem I am having.

create table test1 (
    id serial primary key not null,
    other_id integer unique not null
);

create table test2 (
    id integer not null references test1(id),
    type integer,
    value text
);

create index test2_idx on test2(id);

insert into test1 select g, g+10000 from (select generate_series(1, 10000) as g) t;
insert into test2 select g, g%3, 'testval'||g from (select generate_series(1, 10000) as g) t;
insert into test2 select g, (g+1)%3, 'testval'||g from (select generate_series(1, 10000) as g) t;
insert into test2 select g, (g+2)%3, 'testval'||g from (select generate_series(1, 10000) as g) t;

Now, the following query is fast:

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.id = 1;
(0.6 ms)

But the following query is slow (seq scan on test2):

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id = 10001;
(45 ms)

The same problem can be seen when running:

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.id in (1, 2);
(40 ms runtime)

Fetching directly from test2 by id is fast:

select array_agg(value), id
from test2 where test2.id in (1, 2) group by id;

If I set enable_seqscan to off, then I get fast results:

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id in (10001, 10002);
(0.6 ms)

Or slow results, if the fetched rows happen to be at the end of the index:

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id = 20000;
(40 ms)

Explain analyze output for the problematic query:

With enable_seqscan:

explain analyze select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id = 10001;

 Hash Join  (cost=627.48..890.48 rows=50 width=44) (actual time=91.575..108.085 rows=1 loops=1)
   Hash Cond: (test2.id = test1.id)
   ->  HashAggregate  (cost=627.00..752.00 rows=10000 width=15) (actual time=82.663..98.281 rows=10000 loops=1)
         ->  Seq Scan on test2  (cost=0.00..477.00 rows=30000 width=15) (actual time=0.009..30.650 rows=30000 loops=1)
   ->  Hash  (cost=0.47..0.47 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
         ->  Index Scan using test1_other_id_key on test1  (cost=0.00..0.47 rows=1 width=8) (actual time=0.018..0.021 rows=1 loops=1)
               Index Cond: (other_id = 10001)
 Total runtime: 109.686 ms

Without enable_seqscan:

explain analyze select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id = 10001;

 Merge Join  (cost=0.48..895.91 rows=50 width=44) (actual time=0.066..0.085 rows=1 loops=1)
   Merge Cond: (test2.id = test1.id)
   ->  GroupAggregate  (cost=0.00..769.56 rows=10000 width=15) (actual time=0.040..0.054 rows=2 loops=1)
         ->  Index Scan using test2_idx on test2  (cost=0.00..494.56 rows=30000 width=15) (actual time=0.017..0.030 rows=7 loops=1)
   ->  Sort  (cost=0.48..0.48 rows=1 width=8) (actual time=0.020..0.022 rows=1 loops=1)
         Sort Key: test1.id
         Sort Method: quicksort  Memory: 17kB
         ->  Index Scan using test1_other_id_key on test1  (cost=0.00..0.47 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
               Index Cond: (other_id = 10001)

 - Anssi Kääriäinen
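[Editor's note: the slow plans above arise because the restriction on test1.other_id cannot be pushed down through the GROUP BY subquery, so all of test2 is aggregated first. One manual workaround, not from the thread and only a sketch, is to repeat the restriction inside the subquery so only the matching groups are built:

```sql
-- Hypothetical rewrite: restrict test2 before aggregating. The inner
-- IN-subquery mirrors the outer filter, so the aggregate only sees the
-- rows reachable via test2_idx instead of the whole table.
select *
from test1
inner join (select array_agg(value) as vals, id
            from test2
            where id in (select id from test1 where other_id = 10001)
            group by id) t
    on test1.id = t.id
where test1.other_id = 10001;
```

The duplication is ugly, and it cannot be hidden inside a view because the filter value changes per query, but it shows where the pushdown would have to happen.]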
On Wed, Sep 15, 2010 at 2:26 AM, Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote:
> I am trying to use aggregate queries in views, and when joining these views
> to other tables, I get a seq scan in the view, even if an index scan would
> clearly be better.
> [test case and explain analyze output snipped]

Take a look at this, and the responses. Is it the same case?:
http://www.mail-archive.com/pgsql-performance@postgresql.org/msg21756.html

merlin
On 09/16/2010 01:25 AM, Merlin Moncure wrote:
> Take a look at this, and the responses. Is it the same case?:
> http://www.mail-archive.com/pgsql-performance@postgresql.org/msg21756.html
>
> merlin

Yes, it looks like this is the same case. This makes it hard to use views that have a group by in them, as the whole group by part will always be executed. Back to the planning board then...

I guess my possibilities for pivot views are:

- crosstab: will make the statistics go "bad", that is, the crosstab query will always seem to return a static number of rows. This can cause problems in complex queries using the view. IIRC performance is a bit worse than a pivot using group by.
- left joins: if joining the same table 20 times, there will be some planner overhead. Maybe the best way for my use case. Also about 2x slower than a pivot using group by.
- subselect each of the columns: far worse performance: for my use case, each added column adds about 50 ms to the run time, so 20 columns will take 1 second. The group by pivot query runs in 250 ms.

Any other ideas?

- Anssi
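[Editor's note: the left-join pivot in the second option can be sketched as follows; the column names and the three-way split on test2.type are hypothetical, with one join per pivoted column:

```sql
-- Sketch of a pivot via repeated left joins, one per "type" value.
-- Each join can use test2_idx, so a filter on test1.other_id only
-- touches the matching test2 rows instead of aggregating the table.
select t1.id,
       a.value as type0_value,
       b.value as type1_value,
       c.value as type2_value
from test1 t1
left join test2 a on a.id = t1.id and a.type = 0
left join test2 b on b.id = t1.id and b.type = 1
left join test2 c on c.id = t1.id and c.type = 2
where t1.other_id = 10001;
```

With 20 pivoted columns this becomes 20 joins, which is the planner overhead mentioned above.]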
On Thu, Sep 16, 2010 at 1:51 AM, Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote:
> Yes, looks like this is the same case. This makes it hard to use views
> having group by in them, as the whole group by part will always be
> executed. Back to planning board then...
> [alternatives snipped]
> Any other ideas?

Yes. Specifically, if you are targeting the aggregation towards an array, you have another option:

select * from test1
inner join (select array_agg(value), id from test2 group by id) t
    on test1.id = t.id
where test1.other_id = 10001;

can be replaced with:

select *, array(select value from test2 where test2.id = test1.id)
from test1
where test1.other_id = 10001;

This (array vs array_agg) will give you faster performance than the aggregation, so it is preferable unless you need to group for purposes other than relating. I think the join, if it could be matched up over the group by key, could theoretically give better plans, but this should be good enough, especially if you aren't pulling a large amount of data from the outer table.

merlin
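[Editor's note: the correlated array() form above composes with views, which was the original goal. A sketch, not from the thread, with the view name hypothetical:

```sql
-- Hypothetical view wrapping the correlated array() subselect. Because
-- the subselect is evaluated per outer row, a filter on the view (by id
-- or other_id) drives index scans on test2_idx rather than a full
-- aggregate over test2.
create view test1_with_values as
select test1.*,
       array(select value from test2 where test2.id = test1.id) as vals
from test1;

select * from test1_with_values where other_id = 10001;
```

The trade-off is that scanning the whole view now runs one index probe per test1 row, so this form favors selective queries over full-table ones.]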