Обсуждение: slow group by query
Is there any way to make this query faster? I have indexes on year, msa_code, and sic. I've also tried it with an index on the combined group by columns. I've made both sort_mem and shared_buffers bigger, but still this query takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 msa_codes. Would it just be better to transpose the table to begin with and avoid the group by all together? SELECT b.msa_code, b.sic, b.own, b.ind_div, (select emp from tbl_bls_msa as bls where bls.year = '1975' and bls.msa_code = b.msa_code and bls.sic = b.sic and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1975, (select emp from tbl_bls_msa as bls where bls.year = '1976' and bls.msa_code = b.msa_code and bls.sic = b.sic and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1976, (select emp from tbl_bls_msa as bls where bls.year = '1977' and bls.msa_code = b.msa_code and bls.sic = b.sic and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1977, (select emp from tbl_bls_msa as bls where bls.year = '1978' and bls.msa_code = b.msa_code and bls.sic = b.sic and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1978, (select emp from tbl_bls_msa as bls where bls.year = '1990' and bls.msa_code = b.msa_code and bls.sic = b.sic and bls.own = b.own and bls.ind_div = b.ind_div) AS emp1990 FROM tbl_bls_msa AS b where msa_code in ('1680','1640','0040','0120','0080') GROUP BY b.msa_code, b.sic, b.ind_div, b.own ; NOTICE: QUERY PLAN: Group (cost=635.97..695.18 rows=592 width=22) -> Sort (cost=635.97..635.97 rows=5921 width=22) -> Index Scanusing msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key on tbl_bls_msa b (cost=0.00..264.99 rows=5921 width=22) SubPlan -> Index Scan using msa_code_tbl_bls_msa_keyon tbl_bls_msa bls (cost=0.00 ..53.71 rows=1 width=10) -> Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa bls (cost=0.00 ..53.71 rows=1 width=10) -> Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa bls (cost=0.00 ..53.71 rows=1 width=10) -> Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa bls (cost=0.00 ..53.71 rows=1 width=10) -> Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa bls (cost=0.00 ..53.71 rows=1 width=10) Thanks. Ellen -------
On Mon, 18 Nov 2002, Ellen Cyran wrote: > Is there any way to make this query faster? I have indexes on year, > msa_code, and sic. I've also tried it with > an index on the combined group by columns. I've made both sort_mem and > shared_buffers bigger, but still this query > takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 > msa_codes. Would it just be better > to transpose the table to begin with and avoid the group by all together? > > SELECT b.msa_code, b.sic, b.own, b.ind_div, > (select emp from tbl_bls_msa as bls > where bls.year = '1975' and bls.msa_code = b.msa_code and bls.sic = b.sic > and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1975, > (select emp from tbl_bls_msa as bls > where bls.year = '1976' and bls.msa_code = b.msa_code and bls.sic = b.sic > and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1976, > (select emp from tbl_bls_msa as bls > where bls.year = '1977' and bls.msa_code = b.msa_code and bls.sic = b.sic > and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1977, > (select emp from tbl_bls_msa as bls > where bls.year = '1978' and bls.msa_code = b.msa_code and bls.sic = b.sic > and bls.own = b.own and bls.ind_div = b.ind_div ) AS emp1978, > (select emp from tbl_bls_msa as bls > where bls.year = '1990' and bls.msa_code = b.msa_code and bls.sic = b.sic > and bls.own = b.own and bls.ind_div = b.ind_div) AS emp1990 > FROM tbl_bls_msa AS b > where msa_code in ('1680','1640','0040','0120','0080') > GROUP BY b.msa_code, b.sic, b.ind_div, b.own ; Have you tried doing the subqueries in from? Right now you're running each subquery once for each output row AFAICS. Maybe something like (doing only one year for example - and using a postgres extension), would work... select distinct on (b.msa_code, b.sic, b.own, b.ind_div)b.msa_code, b.sic, b.own, b.ind_div, y1975.emp1975 from tbl_bls_msa as b left outer join (select emp from tbl_bls_mas as blswhere bls.year='1975') y1975 on (b.msa_code=y1975.msa_codeandb.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div) where msa_code in ('1680', '1640', '0040', '0120', '0080');
Ellen Cyran <ellen@urban.csuohio.edu> writes: > Is there any way to make this query faster? I have indexes on year, > msa_code, and sic. I've also tried it with > an index on the combined group by columns. I've made both sort_mem and > shared_buffers bigger, but still this query > takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 > msa_codes. Hm, have you vacuum analyzed lately? Those cost estimates seem awfully low for a query that is taking 40 sec. Also, if you're using 7.2 (which you should be ;-)) then showing EXPLAIN ANALYZE results would be more useful than plain EXPLAIN. regards, tom lane
I have vacuum analyzed recently. I do it after a large number of inserts and after indexing. This database is in the development stages so there is a lot of data loading at this time. We are also using 7.2.3. Here is the explain analyze: Group (cost=637.18..696.52 rows=593 width=22) (actual time=982.67..67581.85 row s=435 loops=1) -> Sort (cost=637.18..637.18 rows=5934 width=22) (actual time=833.27..844.78 rows=6571 loops=1) -> Index Scan using msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key, msa_code_tbl_bls_msa_key on tbl_bls_msa b (cost=0.00..265.30 rows=5934 width=22) (actual time=0.80..367.64 rows=6571 loops=1) SubPlan -> Index Scan using msa_code_tbl_bls_msa_key on tbl_bls_msa bls (cost=0.00 ..53.80 rows=1 width=10) (actual time=16.46..30.36 rows=1 loops=435) -> Index Scan using msa_code_tbl_bls_msa_key ontbl_bls_msa bls (cost=0.00 ..53.80 rows=1 width=10) (actual time=16.54..30.41 rows=1 loops=435) -> Index Scan using msa_code_tbl_bls_msa_key ontbl_bls_msa bls (cost=0.00 ..53.80 rows=1 width=10) (actual time=16.53..30.69 rows=1 loops=435) -> Index Scan using msa_code_tbl_bls_msa_key ontbl_bls_msa bls (cost=0.00 ..53.80 rows=1 width=10) (actual time=16.19..30.66 rows=1 loops=435) -> Index Scan using msa_code_tbl_bls_msa_key ontbl_bls_msa bls (cost=0.00 ..53.80 rows=1 width=10) (actual time=16.19..30.80 rows=1 loops=435) Total runtime: 67589.69 msec EXPLAIN Thanks for the help. At 10:21 AM 11/19/2002 -0500, Tom Lane wrote: >Ellen Cyran <ellen@urban.csuohio.edu> writes: > > Is there any way to make this query faster? I have indexes on year, > > msa_code, and sic. I've also tried it with > > an index on the combined group by columns. I've made both sort_mem and > > shared_buffers bigger, but still this query > > takes 40 seconds when I select 4 msa_codes and 7 minutes when I select 40 > > msa_codes. > >Hm, have you vacuum analyzed lately? Those cost estimates seem awfully >low for a query that is taking 40 sec. Also, if you're using 7.2 (which >you should be ;-)) then showing EXPLAIN ANALYZE results would be more >useful than plain EXPLAIN. > > regards, tom lane
I had to modify your query somewhat, but the one below that is pretty much the same took about 12 seconds so once I run it on five years it will take just as long. Thanks for the suggestion though. select distinct on (b.msa_code, b.sic, b.own, b.ind_div) b.msa_code, b.sic, b.own, b.ind_div, y1975.emp from tbl_bls_msa as b left outer join (select msa_code, sic, own, ind_div, emp from tbl_bls_msa as bls where bls.year='1975' ) as y1975 on (b.msa_code=y1975.msa_code and b.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div) where b.msa_code in ('1680', '1640', '0040', '0120', '0080'); I would be interested in knowing more about what postgres extensions are available. Where are those documented at? >Have you tried doing the subqueries in from? Right now you're running >each subquery once for each output row AFAICS. > >Maybe something like (doing only one year for example - and using a >postgres extension), would work... > >select distinct on (b.msa_code, b.sic, b.own, b.ind_div) > b.msa_code, b.sic, b.own, b.ind_div, y1975.emp1975 from >tbl_bls_msa as b left outer join (select emp from tbl_bls_mas as bls > where bls.year='1975') y1975 on (b.msa_code=y1975.msa_code and > b.sic=y1975.sic and b.own=y1975.own and b.ind_div=y1975.ind_div) >where msa_code in ('1680', '1640', '0040', '0120', '0080');
Ellen Cyran <ellen@urban.csuohio.edu> writes: > Here is the explain analyze: > Group (cost=637.18..696.52 rows=593 width=22) (actual time=982.67..67581.85 rows=435 loops=1) > -> Sort (cost=637.18..637.18 rows=5934 width=22) (actual time=833.27..844.78 rows=6571 loops=1) Well, we don't have to read any further than that to see that all the time is being spent in the final Group step --- and since grouping 6500 rows is surely not taking very long, the true cost must be in the evaluation of the SELECT's output targetlist (which the estimated costs ignore, since the planner can't do much to change it). In other words, what's hurting you are those subselects in the SELECT list. You're doing two thousand separate subselects (435*5) --- evidently they take about 30 msec apiece, which isn't that bad by itself, but it adds up. What you've basically got here is what Joe Celko calls a "crosstab by subqueries" (_SQL For Smarties_, section 23.6.3). You might want to buy his book, which shows several other ways to do crosstab queries, but the one that seems most directly useful is to GROUP BY and use aggregate functions to extract the values you want in each crosstab column: SELECT msa_code, sic, own, ind_div, max(case when year = '1975' then emp else null end) as emp1975, max(case when year = '1976' then emp else null end) as emp1976, max(case when year = '1977' then emp else null end) as emp1977, max(case when year = '1978' then emp else null end) as emp1978, max(case when year = '1990' then emp else null end) as emp1990 FROM tbl_bls_msa WHERE msa_code in ('1680','1640','0040','0120','0080') GROUP BY msa_code, sic, ind_div, own ; If I understand your data schema correctly, only one row in a msa_code, sic, ind_div, own group will have a particular year value, so the case/max structure extracts that value, or gives NULL if there's no such row. (MIN would have worked as well; with a numeric field you could use SUM too.) regards, tom lane
That's a whole lot faster. The query on 40 msa_codes that took 7 minutes, now only takes 10 seconds. Thanks a lot. At 11:54 AM 11/19/2002 -0500, Tom Lane wrote: >Ellen Cyran <ellen@urban.csuohio.edu> writes: > > Here is the explain analyze: > > > Group (cost=637.18..696.52 rows=593 width=22) (actual > time=982.67..67581.85 rows=435 loops=1) > > -> Sort (cost=637.18..637.18 rows=5934 width=22) (actual > time=833.27..844.78 rows=6571 loops=1) > >Well, we don't have to read any further than that to see that all the >time is being spent in the final Group step --- and since grouping 6500 >rows is surely not taking very long, the true cost must be in the >evaluation of the SELECT's output targetlist (which the estimated costs >ignore, since the planner can't do much to change it). In other words, >what's hurting you are those subselects in the SELECT list. You're >doing two thousand separate subselects (435*5) --- evidently they take >about 30 msec apiece, which isn't that bad by itself, but it adds up. > >What you've basically got here is what Joe Celko calls a "crosstab >by subqueries" (_SQL For Smarties_, section 23.6.3). You might want >to buy his book, which shows several other ways to do crosstab queries, >but the one that seems most directly useful is to GROUP BY and use >aggregate functions to extract the values you want in each crosstab >column: > >SELECT msa_code, sic, own, ind_div, >max(case when year = '1975' then emp else null end) as emp1975, >max(case when year = '1976' then emp else null end) as emp1976, >max(case when year = '1977' then emp else null end) as emp1977, >max(case when year = '1978' then emp else null end) as emp1978, >max(case when year = '1990' then emp else null end) as emp1990 >FROM tbl_bls_msa >WHERE msa_code in ('1680','1640','0040','0120','0080') >GROUP BY msa_code, sic, ind_div, own ; > >If I understand your data schema correctly, only one row in a >msa_code, sic, ind_div, own group will have a particular year >value, so the case/max structure extracts that value, or gives >NULL if there's no such row. (MIN would have worked as well; >with a numeric field you could use SUM too.) > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster