Обсуждение: Performance problem with joined aggregate query

Поиск
Список
Период
Сортировка

Performance problem with joined aggregate query

От
Anssi Kääriäinen
Дата:
Hello all,

I am trying to use aggregate queries in views, and when joining these
views to other
tables, I get seq scan in the view, even if index scan would be clearly
better. The views
I am using in my Db are actually long pivot queries, but the following
simple test case is enough
to show the problem.

I will first show the table definitions, then the performance problem I
am having.

create table test1 (
    id serial primary key not null,
    other_id integer unique not null
);

create table test2 (
     id integer not null references test1(id),
     type integer,
     value text
);

create index test2_idx on test2(id);

insert into test1 select g, g+10000 from (select generate_series(1,
10000) as g) t;
insert into test2 select g, g%3, 'testval'||g from (select
generate_series(1, 10000) as g) t;
insert into test2 select g, (g+1)%3, 'testval'||g from (select
generate_series(1, 10000) as g) t;
insert into test2 select g, (g+2)%3, 'testval'||g from (select
generate_series(1, 10000) as g) t;

Now, the following query is fast:

select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.id = 1;
(0.6ms)

But the following query is slow (seqscan on test2):

select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;
(45ms)

The same problem can be seen when running:

select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.id in (1, 2);
(40ms runtime)

Fetching directly from test2 with id is fast:

select array_agg(value), id
from test2 where test2.id in (1, 2) group by id;

If I set enable_seqscan to off, then I get fast results:

select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id in
(10001, 10002);
(0.6ms)

Or slow results, if the fetched rows happen to be in the end of the index:

select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id = 20000;
(40ms)

Explain analyzes of the problematic query:

With enable_seqscan:

explain analyze select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;

Hash Join  (cost=627.48..890.48 rows=50 width=44) (actual
time=91.575..108.085 rows=1 loops=1)
    Hash Cond: (test2.id = test1.id)
    ->  HashAggregate  (cost=627.00..752.00 rows=10000 width=15) (actual
time=82.663..98.281 rows=10000 loops=1)
          ->  Seq Scan on test2  (cost=0.00..477.00 rows=30000 width=15)
(actual time=0.009..30.650 rows=30000 loops=1)
    ->  Hash  (cost=0.47..0.47 rows=1 width=8) (actual time=0.026..0.026
rows=1 loops=1)
          ->  Index Scan using test1_other_id_key on test1
(cost=0.00..0.47 rows=1 width=8) (actual time=0.018..0.021 rows=1 loops=1)
                Index Cond: (other_id = 10001)
Total runtime: 109.686 ms

Without enable_seqscan:

explain analyze select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;

Merge Join  (cost=0.48..895.91 rows=50 width=44) (actual
time=0.066..0.085 rows=1 loops=1)
    Merge Cond: (test2.id = test1.id)
    ->  GroupAggregate  (cost=0.00..769.56 rows=10000 width=15) (actual
time=0.040..0.054 rows=2 loops=1)
          ->  Index Scan using test2_idx on test2  (cost=0.00..494.56
rows=30000 width=15) (actual time=0.017..0.030 rows=7 loops=1)
    ->  Sort  (cost=0.48..0.48 rows=1 width=8) (actual time=0.020..0.022
rows=1 loops=1)
          Sort Key: test1.id
          Sort Method:  quicksort  Memory: 17kB
          ->  Index Scan using test1_other_id_key on test1
(cost=0.00..0.47 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
                Index Cond: (other_id = 10001)


  - Anssi Kääriäinen





Re: Performance problem with joined aggregate query

От
Merlin Moncure
Дата:
On Wed, Sep 15, 2010 at 2:26 AM, Anssi Kääriäinen
<anssi.kaariainen@thl.fi> wrote:
> Hello all,
>
> I am trying to use aggregate queries in views, and when joining these views
> to other
> tables, I get seq scan in the view, even if index scan would be clearly
> better. The views
> I am using in my Db are actually long pivot queries, but the following
> simple test case is enough
> to show the problem.
>
> I will first show the table definitions, then the performance problem I am
> having.
>
> create table test1 (
>   id serial primary key not null,
>   other_id integer unique not null
> );
>
> create table test2 (
>    id integer not null references test1(id),
>    type integer,
>    value text
> );
>
> create index test2_idx on test2(id);
>
> insert into test1 select g, g+10000 from (select generate_series(1, 10000)
> as g) t;
> insert into test2 select g, g%3, 'testval'||g from (select
> generate_series(1, 10000) as g) t;
> insert into test2 select g, (g+1)%3, 'testval'||g from (select
> generate_series(1, 10000) as g) t;
> insert into test2 select g, (g+2)%3, 'testval'||g from (select
> generate_series(1, 10000) as g) t;
>
> Now, the following query is fast:
>
> select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.id = 1;
> (0.6ms)
>
> But the following query is slow (seqscan on test2):
>
> select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;
> (45ms)
>
> The same problem can be seen when running:
>
> select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.id in (1, 2);
> (40ms runtime)
>
> Fetching directly from test2 with id is fast:
>
> select array_agg(value), id
> from test2 where test2.id in (1, 2) group by id;
>
> If I set enable_seqscan to off, then I get fast results:
>
> select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.other_id in (10001,
> 10002);
> (0.6ms)
>
> Or slow results, if the fetched rows happen to be in the end of the index:
>
> select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.other_id = 20000;
> (40ms)
>
> Explain analyzes of the problematic query:
>
> With enable_seqscan:
>
> explain analyze select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;
>
> Hash Join  (cost=627.48..890.48 rows=50 width=44) (actual
> time=91.575..108.085 rows=1 loops=1)
>   Hash Cond: (test2.id = test1.id)
>   ->  HashAggregate  (cost=627.00..752.00 rows=10000 width=15) (actual
> time=82.663..98.281 rows=10000 loops=1)
>         ->  Seq Scan on test2  (cost=0.00..477.00 rows=30000 width=15)
> (actual time=0.009..30.650 rows=30000 loops=1)
>   ->  Hash  (cost=0.47..0.47 rows=1 width=8) (actual time=0.026..0.026
> rows=1 loops=1)
>         ->  Index Scan using test1_other_id_key on test1  (cost=0.00..0.47
> rows=1 width=8) (actual time=0.018..0.021 rows=1 loops=1)
>               Index Cond: (other_id = 10001)
> Total runtime: 109.686 ms
>
> Without enable_seqscan:
>
> explain analyze select * from test1 inner join (select array_agg(value), id
> from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;
>
> Merge Join  (cost=0.48..895.91 rows=50 width=44) (actual time=0.066..0.085
> rows=1 loops=1)
>   Merge Cond: (test2.id = test1.id)
>   ->  GroupAggregate  (cost=0.00..769.56 rows=10000 width=15) (actual
> time=0.040..0.054 rows=2 loops=1)
>         ->  Index Scan using test2_idx on test2  (cost=0.00..494.56
> rows=30000 width=15) (actual time=0.017..0.030 rows=7 loops=1)
>   ->  Sort  (cost=0.48..0.48 rows=1 width=8) (actual time=0.020..0.022
> rows=1 loops=1)
>         Sort Key: test1.id
>         Sort Method:  quicksort  Memory: 17kB
>         ->  Index Scan using test1_other_id_key on test1  (cost=0.00..0.47
> rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
>               Index Cond: (other_id = 10001)

Take a look at this, and the responses. Is it the same case?:
http://www.mail-archive.com/pgsql-performance@postgresql.org/msg21756.html

merlin

Re: Performance problem with joined aggregate query

От
Anssi Kääriäinen
Дата:
On 09/16/2010 01:25 AM, Merlin Moncure wrote:
> Take a look at this, and the responses. Is it the same case?:
> http://www.mail-archive.com/pgsql-performance@postgresql.org/msg21756.html
>
> merlin
>
Yes, looks like this is the same case. This makes it hard to use views
having group by in them, as the whole group by part will always be
executed. Back to planning board then...

I guess my possibilities for pivot views are:
   - crosstab: Will make statistics go "bad", that is, the crosstab query
     will always seem to return static number of rows. This can cause
     problems in complex queries using the view. IIRC performance is
     a bit worse than pivot by group by.
   - left joins: if joining the same table 20 times, there will be some
     planner overhead. Maybe the best way for my usage case. Also about
     2x slower than pivot using group by.
   - subselect each of the columns: way worse performance: for my use
     case, each added column adds about 50ms to run time, so for 20
     columns this will take 1 second. The group by pivot query runs in
     250ms.

Any other ideas?

  - Anssi

Re: Performance problem with joined aggregate query

От
Merlin Moncure
Дата:
On Thu, Sep 16, 2010 at 1:51 AM, Anssi Kääriäinen
<anssi.kaariainen@thl.fi> wrote:
> Yes, looks like this is the same case. This makes it hard to use views
> having group by in them, as the whole group by part will always be
> executed. Back to planning board then...
>
> I guess my possibilities for pivot views are:
>  - crosstab: Will make statistics go "bad", that is, the crosstab query
>    will always seem to return static number of rows. This can cause
>    problems in complex queries using the view. IIRC performance is
>    a bit worse than pivot by group by.
>  - left joins: if joining the same table 20 times, there will be some
>    planner overhead. Maybe the best way for my usage case. Also about
>    2x slower than pivot using group by.
>  - subselect each of the columns: way worse performance: for my use
>    case, each added column adds about 50ms to run time, so for 20
>    columns this will take 1 second. The group by pivot query runs in
>    250ms.
>
> Any other ideas?

yes. specifically, if you are targeting the aggregation towards an
array you have another option:

analyze select * from test1 inner join (select array_agg(value), id
from test2 group by id) t on test1.id = t.id where test1.other_id = 10001;

can be replaced w/

select *, array(select value from test2 where test2.id=test1.id) from
test1 where test1.other_id = 10001;

This (array vs array_agg) will give you faster performance than
aggregation so is better preferred unless you need to group for other
purposes than relating.  I think the join if it could match up over
the group by key can give theoretically better plans but this should
be good enough especially if you aren't pulling a large amount of data
from the outer table.

merlin