Обсуждение: slow group by query

Поиск
Список
Период
Сортировка

slow group by query

От
Ellen Cyran
Дата:
Is there any way to make this query faster?  I have indexes on year, 
msa_code, and sic.  I've also tried it with
an index on the combined group by columns.  I've made both sort_mem and 
shared_buffers bigger, but still this query
takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 
msa_codes.  Would it just be better
to transpose the table to begin with and avoid the group by all together?

SELECT b.msa_code, b.sic, b.own, b.ind_div, (select emp from tbl_bls_msa as bls
where bls.year = '1975' and bls.msa_code = b.msa_code and bls.sic = b.sic 
and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1975, (select emp from tbl_bls_msa as bls
where bls.year = '1976' and bls.msa_code = b.msa_code and  bls.sic = b.sic 
and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1976, (select emp from tbl_bls_msa as bls
where bls.year = '1977' and bls.msa_code = b.msa_code  and  bls.sic = b.sic 
and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1977,
(select emp from tbl_bls_msa as bls
where bls.year = '1978' and bls.msa_code = b.msa_code and  bls.sic = b.sic 
and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1978,
(select emp from tbl_bls_msa as bls
where bls.year = '1990' and bls.msa_code = b.msa_code  and  bls.sic = b.sic 
and bls.own = b.own and bls.ind_div = b.ind_div)  AS emp1990 FROM tbl_bls_msa AS b
where msa_code in ('1680','1640','0040','0120','0080')
GROUP BY b.msa_code, b.sic, b.ind_div, b.own ;

NOTICE:  QUERY PLAN:

Group  (cost=635.97..695.18 rows=592 width=22)  ->  Sort  (cost=635.97..635.97 rows=5921 width=22)        ->  Index
Scanusing msa_code_tbl_bls_msa_key, 
 
msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, 
msa_code_tbl_bls_msa_key on tbl_bls_msa b  (cost=0.00..264.99 rows=5921 width=22)  SubPlan    ->  Index Scan using
msa_code_tbl_bls_msa_keyon tbl_bls_msa 
 
bls  (cost=0.00
..53.71 rows=1 width=10)    ->  Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa 
bls  (cost=0.00
..53.71 rows=1 width=10)    ->  Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa 
bls  (cost=0.00
..53.71 rows=1 width=10)    ->  Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa 
bls  (cost=0.00
..53.71 rows=1 width=10)    ->  Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa 
bls  (cost=0.00
..53.71 rows=1 width=10)

Thanks.
Ellen
-------




Re: slow group by query

От
Stephan Szabo
Дата:
On Mon, 18 Nov 2002, Ellen Cyran wrote:

> Is there any way to make this query faster?  I have indexes on year,
> msa_code, and sic.  I've also tried it with
> an index on the combined group by columns.  I've made both sort_mem and
> shared_buffers bigger, but still this query
> takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40
> msa_codes.  Would it just be better
> to transpose the table to begin with and avoid the group by all together?
>
> SELECT b.msa_code, b.sic, b.own, b.ind_div,
>   (select emp from tbl_bls_msa as bls
> where bls.year = '1975' and bls.msa_code = b.msa_code and bls.sic = b.sic
> and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1975,
>   (select emp from tbl_bls_msa as bls
> where bls.year = '1976' and bls.msa_code = b.msa_code and  bls.sic = b.sic
> and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1976,
>   (select emp from tbl_bls_msa as bls
> where bls.year = '1977' and bls.msa_code = b.msa_code  and  bls.sic = b.sic
> and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1977,
> (select emp from tbl_bls_msa as bls
> where bls.year = '1978' and bls.msa_code = b.msa_code and  bls.sic = b.sic
> and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1978,
> (select emp from tbl_bls_msa as bls
> where bls.year = '1990' and bls.msa_code = b.msa_code  and  bls.sic = b.sic
> and bls.own = b.own and bls.ind_div = b.ind_div)  AS emp1990
>   FROM tbl_bls_msa AS b
> where msa_code in ('1680','1640','0040','0120','0080')
> GROUP BY b.msa_code, b.sic, b.ind_div, b.own ;

Have you tried doing the subqueries in from?  Right now you're running
each subquery once for each output row AFAICS.

Maybe something like (doing only one year for example - and using a
postgres extension), would work...

select distinct on (b.msa_code, b.sic, b.own, b.ind_div)b.msa_code, b.sic, b.own, b.ind_div, y1975.emp1975 from
tbl_bls_msa as b left outer join (select emp from tbl_bls_mas as blswhere bls.year='1975') y1975 on
(b.msa_code=y1975.msa_codeandb.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div)
 
where msa_code in ('1680', '1640', '0040', '0120', '0080');




Re: slow group by query

От
Tom Lane
Дата:
Ellen Cyran <ellen@urban.csuohio.edu> writes:
> Is there any way to make this query faster?  I have indexes on year, 
> msa_code, and sic.  I've also tried it with
> an index on the combined group by columns.  I've made both sort_mem and 
> shared_buffers bigger, but still this query
> takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 
> msa_codes.

Hm, have you vacuum analyzed lately?  Those cost estimates seem awfully
low for a query that is taking 40 sec.  Also, if you're using 7.2 (which
you should be ;-)) then showing EXPLAIN ANALYZE results would be more
useful than plain EXPLAIN.
        regards, tom lane


Re: slow group by query

От
Ellen Cyran
Дата:
I have vacuum analyzed recently.  I do it after a large number of inserts
and after indexing.  This database is in the development stages so there is
a lot of data loading at this time.  We are also using 7.2.3.  Here is
the explain analyze:

Group  (cost=637.18..696.52 rows=593 width=22) (actual 
time=982.67..67581.85 row
s=435 loops=1)  ->  Sort  (cost=637.18..637.18 rows=5934 width=22) (actual 
time=833.27..844.78 rows=6571 loops=1)        ->  Index Scan using msa_code_tbl_bls_msa_key, 
msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, 
msa_code_tbl_bls_msa_key on tbl_bls_msa b  (cost=0.00..265.30 rows=5934 width=22) (actual 
time=0.80..367.64 rows=6571 loops=1)  SubPlan    ->  Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa 
bls  (cost=0.00
..53.80 rows=1 width=10) (actual time=16.46..30.36 rows=1 loops=435)    ->  Index Scan using msa_code_tbl_bls_msa_key
ontbl_bls_msa 
 
bls  (cost=0.00
..53.80 rows=1 width=10) (actual time=16.54..30.41 rows=1 loops=435)    ->  Index Scan using msa_code_tbl_bls_msa_key
ontbl_bls_msa 
 
bls  (cost=0.00
..53.80 rows=1 width=10) (actual time=16.53..30.69 rows=1 loops=435)    ->  Index Scan using msa_code_tbl_bls_msa_key
ontbl_bls_msa 
 
bls  (cost=0.00
..53.80 rows=1 width=10) (actual time=16.19..30.66 rows=1 loops=435)    ->  Index Scan using msa_code_tbl_bls_msa_key
ontbl_bls_msa 
 
bls  (cost=0.00
..53.80 rows=1 width=10) (actual time=16.19..30.80 rows=1 loops=435)
Total runtime: 67589.69 msec

EXPLAIN

Thanks for the help.

At 10:21 AM 11/19/2002 -0500, Tom Lane wrote:
>Ellen Cyran <ellen@urban.csuohio.edu> writes:
> > Is there any way to make this query faster?  I have indexes on year,
> > msa_code, and sic.  I've also tried it with
> > an index on the combined group by columns.  I've made both sort_mem and
> > shared_buffers bigger, but still this query
> > takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40
> > msa_codes.
>
>Hm, have you vacuum analyzed lately?  Those cost estimates seem awfully
>low for a query that is taking 40 sec.  Also, if you're using 7.2 (which
>you should be ;-)) then showing EXPLAIN ANALYZE results would be more
>useful than plain EXPLAIN.
>
>                         regards, tom lane



Re: slow group by query

От
Ellen Cyran
Дата:
I had to modify your query somewhat, but the one below that is pretty much 
the same took
about 12 seconds so once I run it on five years it will take just as 
long.  Thanks for the
suggestion though.

select distinct on (b.msa_code, b.sic, b.own, b.ind_div)
b.msa_code, b.sic, b.own, b.ind_div, y1975.emp from
tbl_bls_msa as b left outer join (select msa_code, sic, own, ind_div, emp 
from tbl_bls_msa as bls
where bls.year='1975' ) as y1975
on (b.msa_code=y1975.msa_code and
b.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div)
where b.msa_code in ('1680', '1640', '0040', '0120', '0080');

I would be interested in knowing more about what postgres extensions are 
available.  Where
are those documented at?


>Have you tried doing the subqueries in from?  Right now you're running
>each subquery once for each output row AFAICS.
>
>Maybe something like (doing only one year for example - and using a
>postgres extension), would work...
>
>select distinct on (b.msa_code, b.sic, b.own, b.ind_div)
>  b.msa_code, b.sic, b.own, b.ind_div, y1975.emp1975 from
>tbl_bls_msa as b left outer join (select emp from tbl_bls_mas as bls
>  where bls.year='1975') y1975 on (b.msa_code=y1975.msa_code and
>  b.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div)
>where msa_code in ('1680', '1640', '0040', '0120', '0080');



Re: slow group by query

От
Tom Lane
Дата:
Ellen Cyran <ellen@urban.csuohio.edu> writes:
> Here is the explain analyze:

> Group  (cost=637.18..696.52 rows=593 width=22) (actual time=982.67..67581.85 rows=435 loops=1)
>    ->  Sort  (cost=637.18..637.18 rows=5934 width=22) (actual time=833.27..844.78 rows=6571 loops=1)

Well, we don't have to read any further than that to see that all the
time is being spent in the final Group step --- and since grouping 6500
rows is surely not taking very long, the true cost must be in the
evaluation of the SELECT's output targetlist (which the estimated costs
ignore, since the planner can't do much to change it).  In other words,
what's hurting you are those subselects in the SELECT list.  You're
doing two thousand separate subselects (435*5) --- evidently they take
about 30 msec apiece, which isn't that bad by itself, but it adds up.

What you've basically got here is what Joe Celko calls a "crosstab
by subqueries" (_SQL For Smarties_, section 23.6.3).  You might want
to buy his book, which shows several other ways to do crosstab queries,
but the one that seems most directly useful is to GROUP BY and use
aggregate functions to extract the values you want in each crosstab
column:

SELECT msa_code, sic, own, ind_div,
max(case when year = '1975' then emp else null end) as emp1975,
max(case when year = '1976' then emp else null end) as emp1976,
max(case when year = '1977' then emp else null end) as emp1977,
max(case when year = '1978' then emp else null end) as emp1978,
max(case when year = '1990' then emp else null end) as emp1990
FROM tbl_bls_msa
WHERE msa_code in ('1680','1640','0040','0120','0080')
GROUP BY msa_code, sic, ind_div, own ;

If I understand your data schema correctly, only one row in a
msa_code, sic, ind_div, own group will have a particular year
value, so the case/max structure extracts that value, or gives
NULL if there's no such row.  (MIN would have worked as well;
with a numeric field you could use SUM too.)
        regards, tom lane


Re: slow group by query

От
Ellen Cyran
Дата:
That's a whole lot faster. The query on 40 msa_codes that took
7 minutes, now only takes 10 seconds.
Thanks a lot.


At 11:54 AM 11/19/2002 -0500, Tom Lane wrote:
>Ellen Cyran <ellen@urban.csuohio.edu> writes:
> > Here is the explain analyze:
>
> > Group  (cost=637.18..696.52 rows=593 width=22) (actual 
> time=982.67..67581.85 rows=435 loops=1)
> >    ->  Sort  (cost=637.18..637.18 rows=5934 width=22) (actual 
> time=833.27..844.78 rows=6571 loops=1)
>
>Well, we don't have to read any further than that to see that all the
>time is being spent in the final Group step --- and since grouping 6500
>rows is surely not taking very long, the true cost must be in the
>evaluation of the SELECT's output targetlist (which the estimated costs
>ignore, since the planner can't do much to change it).  In other words,
>what's hurting you are those subselects in the SELECT list.  You're
>doing two thousand separate subselects (435*5) --- evidently they take
>about 30 msec apiece, which isn't that bad by itself, but it adds up.
>
>What you've basically got here is what Joe Celko calls a "crosstab
>by subqueries" (_SQL For Smarties_, section 23.6.3).  You might want
>to buy his book, which shows several other ways to do crosstab queries,
>but the one that seems most directly useful is to GROUP BY and use
>aggregate functions to extract the values you want in each crosstab
>column:
>
>SELECT msa_code, sic, own, ind_div,
>max(case when year = '1975' then emp else null end) as emp1975,
>max(case when year = '1976' then emp else null end) as emp1976,
>max(case when year = '1977' then emp else null end) as emp1977,
>max(case when year = '1978' then emp else null end) as emp1978,
>max(case when year = '1990' then emp else null end) as emp1990
>FROM tbl_bls_msa
>WHERE msa_code in ('1680','1640','0040','0120','0080')
>GROUP BY msa_code, sic, ind_div, own ;
>
>If I understand your data schema correctly, only one row in a
>msa_code, sic, ind_div, own group will have a particular year
>value, so the case/max structure extracts that value, or gives
>NULL if there's no such row.  (MIN would have worked as well;
>with a numeric field you could use SUM too.)
>
>                         regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster