Discussion: Optimising a query
I have a query that looks like this:

SELECT DISTINCT ON (EEFSCode)
       eefsbase.company || eefsbase.department || eefsbase.branch ||
       eefsbase.franchise || eefsbase.subledger || eefsbase.account AS EEFSCode,
       eefsbase.company AS company_code,
       eefsbase.branch AS branch_code,
       eefsbase.department AS department_code,
       eefsbase.franchise AS franchise_code,
       fincompany.full_name AS company_description,
       finbranch.full_name AS branch_description,
       findepartment.full_name AS department_description,
       finfranchise.full_name AS franchise_description,
       eefsbase.sort_key1 AS acct_type_rpt_code,
       ''::text AS acct_type_rpt_description,
       eefsbase.sort_key2 AS exec_fs_rptcat2,
       ''::text AS exec_fs_rptcat2_desc,
       eefsbase.sort_key3 AS exec_fs_rptcat3,
       ''::text AS exec_fs_rptcat3_desc,
       0 AS financial_report_category,
       ''::text AS financial_report_cat_desc
FROM (SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
                          finbalance.subledger_id, finbalance.account_id)
             finbalance.year_id AS year,
             finbalance.dealer_id AS dealer_id,
             finbalance.sort_key1 AS sort_key1,
             finbalance.sort_key2 AS sort_key2,
             finbalance.sort_key3 AS sort_key3,
             lpad(finbalance.subledger_id::text,4,'0') AS subledger,
             lpad(finbalance.account_id::text,4,'0') AS account,
             lpad(finsubledger.company::text,4,'0') AS company,
             lpad(finsubledger.department::text,4,'0') AS department,
             lpad(finsubledger.branch::text,4,'0') AS branch,
             lpad(finsubledger.franchise::text,4,'0') AS franchise
      FROM finbalance
      INNER JOIN finsubledger ON ((finbalance.dealer_id = finsubledger.dealer_id)
                              AND (finbalance.year_id = finsubledger.year)
                              AND (finbalance.subledger_id = finsubledger.subledger_number))
     ) eefsbase
INNER JOIN fincompany ON (eefsbase.company::int = fincompany.report_number
                      AND eefsbase.dealer_id = fincompany.dealer_id
                      AND eefsbase.year = fincompany.year)
INNER JOIN finbranch ON (eefsbase.branch::int = finbranch.report_number
                     AND eefsbase.dealer_id = finbranch.dealer_id
                     AND eefsbase.year = finbranch.year)
INNER JOIN findepartment ON (eefsbase.department::int = findepartment.report_number
                         AND eefsbase.dealer_id = findepartment.dealer_id
                         AND eefsbase.year = findepartment.year)
INNER JOIN finfranchise ON (eefsbase.franchise::int = finfranchise.report_number
                        AND eefsbase.dealer_id = finfranchise.dealer_id
                        AND eefsbase.year = finfranchise.year);

On one of my test systems the finbalance table has approximately 220,000 records, around 17,500 of which are distinct on the fields in the DISTINCT ON clause. The finsubledger table has 97 rows, and the other tables mentioned - fincompany, finbranch, findepartment and finfranchise - each have between 1 and 50 records, i.e. they are all relatively small. The above query takes between ten and twelve seconds to run on this test system, and I would like to try to get that down a bit if possible.
The EXPLAIN ANALYZE output looks like this:

Unique  (cost=19801.92..19801.93 rows=1 width=380) (actual time=10838.666..10884.568 rows=17227 loops=1)
  ->  Sort  (cost=19801.92..19801.92 rows=1 width=380) (actual time=10838.662..10863.909 rows=17227 loops=1)
        Sort Key: (((((((lpad((finsubledger.company)::text, 4, '0'::text)) || (lpad((finsubledger.department)::text, 4, '0'::text))) || (lpad((finsubledger.branch)::text, 4, '0'::text))) || (lpad((finsubledger.franchise)::text, 4, '0'::text))) || (lpad((finbalance.subledger_id)::text, 4, '0'::text))) || (lpad((finbalance.account_id)::text, 4, '0'::text))))
        Sort Method: external merge  Disk: 2288kB
        ->  Nested Loop  (cost=19733.14..19801.91 rows=1 width=380) (actual time=9073.324..10386.626 rows=17227 loops=1)
              ->  Nested Loop  (cost=19733.14..19793.60 rows=1 width=393) (actual time=9073.287..10128.155 rows=17227 loops=1)
                    ->  Nested Loop  (cost=19733.13..19785.30 rows=1 width=336) (actual time=9073.253..9911.426 rows=17227 loops=1)
                          ->  Nested Loop  (cost=19733.13..19777.00 rows=1 width=279) (actual time=9073.222..9685.723 rows=17227 loops=1)
                                ->  Unique  (cost=19733.12..19733.27 rows=12 width=48) (actual time=9073.143..9426.581 rows=17227 loops=1)
                                      ->  Sort  (cost=19733.12..19733.15 rows=12 width=48) (actual time=9073.141..9226.161 rows=206748 loops=1)
                                            Sort Key: finbalance.dealer_id, finbalance.year_id, finbalance.subledger_id, finbalance.account_id
                                            Sort Method: external sort  Disk: 14544kB
                                            ->  Merge Join  (cost=35.56..19732.91 rows=12 width=48) (actual time=0.841..2309.828 rows=206748 loops=1)
                                                  Merge Cond: (((finbalance.dealer_id)::text = (finsubledger.dealer_id)::text) AND (finbalance.subledger_id = finsubledger.subledger_number))
                                                  Join Filter: (finbalance.year_id = finsubledger.year)
                                                  ->  Index Scan using pk_finbalances_pk on finbalance  (cost=0.00..18596.78 rows=210130 width=32) (actual time=0.079..310.804 rows=206748 loops=1)
                                                  ->  Sort  (cost=35.56..36.73 rows=470 width=34) (actual time=0.731..113.207 rows=205742 loops=1)
                                                        Sort Key: finsubledger.dealer_id, finsubledger.subledger_number
                                                        Sort Method: quicksort  Memory: 24kB
                                                        ->  Seq Scan on finsubledger  (cost=0.00..14.70 rows=470 width=34) (actual time=0.011..0.101 rows=97 loops=1)
                                ->  Index Scan using pk_finfranchise_dealer_company on finfranchise  (cost=0.01..3.61 rows=1 width=61) (actual time=0.009..0.010 rows=1 loops=17227)
                                      Index Cond: (((finfranchise.dealer_id)::text = (finbalance.dealer_id)::text) AND (finfranchise.year = finbalance.year_id) AND (finfranchise.report_number = ((lpad((finsubledger.franchise)::text, 4, '0'::text)))::integer))
                          ->  Index Scan using pk_findepartment_dealer_company on findepartment  (cost=0.01..8.28 rows=1 width=61) (actual time=0.008..0.009 rows=1 loops=17227)
                                Index Cond: (((findepartment.dealer_id)::text = (finbalance.dealer_id)::text) AND (findepartment.year = finbalance.year_id) AND (findepartment.report_number = ((lpad((finsubledger.department)::text, 4, '0'::text)))::integer))
                    ->  Index Scan using pk_finbranch_dealer_company on finbranch  (cost=0.01..8.28 rows=1 width=61) (actual time=0.007..0.008 rows=1 loops=17227)
                          Index Cond: (((finbranch.dealer_id)::text = (finbalance.dealer_id)::text) AND (finbranch.year = finbalance.year_id) AND (finbranch.report_number = ((lpad((finsubledger.branch)::text, 4, '0'::text)))::integer))
              ->  Index Scan using pk_fincompany_dealer_company on fincompany  (cost=0.01..8.28 rows=1 width=61) (actual time=0.007..0.009 rows=1 loops=17227)
                    Index Cond: (((fincompany.dealer_id)::text = (finbalance.dealer_id)::text) AND (fincompany.year = finbalance.year_id) AND (fincompany.report_number = ((lpad((finsubledger.company)::text, 4, '0'::text)))::integer))
Total runtime: 10896.235 ms

Can anyone suggest some alterations to my SQL, or perhaps something else I could do, to get this query to run a good bit faster?

Cheers,
Paul.

--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company
Paul Lambert wrote:
> <snip>

This part of the query alone takes a significant part of the time:

SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
                    finbalance.subledger_id, finbalance.account_id)
       finbalance.year_id AS year,
       finbalance.dealer_id AS dealer_id,
       lpad(finbalance.subledger_id::text,4,'0') AS subledger,
       lpad(finbalance.account_id::text,4,'0') AS account
FROM finbalance

Runs with a query plan of:

Unique  (cost=30197.98..32782.33 rows=20675 width=16) (actual time=5949.695..7197.475 rows=17227 loops=1)
  ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual time=5949.691..7018.931 rows=206748 loops=1)
        Sort Key: dealer_id, year_id, subledger_id, account_id
        Sort Method: external merge  Disk: 8880kB
        ->  Seq Scan on finbalance  (cost=0.00..8409.70 rows=206748 width=16) (actual time=0.042..617.949 rows=206748 loops=1)
Total runtime: 7210.966 ms

So basically selecting from the finbalance table (approx. 206,000 records) accounts for around seven of the ten seconds, and takes even longer without the DISTINCT clause in there - the DISTINCT collapses the result set down to around 17,000 rows.

Taking out the two lpad's in there knocks off about 1500ms, so I can come up with something else for them - but I'd like to get the query as a whole down to under a second.

dealer_id, year_id, subledger_id and account_id are all part of the primary key on the finbalance table, so I don't think I can index them down any further.

Are there any config settings that would make it faster? I'm running on a quad-core 1.6GHz Xeon server with 4GB RAM. I imagine shared_buffers (32MB) and work_mem (1MB) could be bumped up a good bit more with 4GB of available RAM?

--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company
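[For anyone following along: both settings can be inspected, and work_mem changed per session, without editing postgresql.conf - a minimal illustration, with the values from the message above in the comments:]

SHOW work_mem;         -- reported as 1MB on this system
SHOW shared_buffers;   -- reported as 32MB; changing this one does require a restart
SET work_mem = '9MB';  -- session-level only; the server default is untouched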
Paul Lambert wrote:
> Paul Lambert wrote:
>> <snip>
>
> This part of the query alone takes a significant part of the time:
>
> SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
>                     finbalance.subledger_id, finbalance.account_id)
>        finbalance.year_id AS year,
>        finbalance.dealer_id AS dealer_id,
>        lpad(finbalance.subledger_id::text,4,'0') AS subledger,
>        lpad(finbalance.account_id::text,4,'0') AS account
> FROM finbalance
>
> Runs with a query plan of:
>
> Unique  (cost=30197.98..32782.33 rows=20675 width=16) (actual time=5949.695..7197.475 rows=17227 loops=1)
>   ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual time=5949.691..7018.931 rows=206748 loops=1)
>         Sort Key: dealer_id, year_id, subledger_id, account_id
>         Sort Method: external merge  Disk: 8880kB
>         ->  Seq Scan on finbalance  (cost=0.00..8409.70 rows=206748 width=16) (actual time=0.042..617.949 rows=206748 loops=1)
> Total runtime: 7210.966 ms
>
> So basically selecting from the finbalance table (approx. 206,000
> records) accounts for around seven of the ten seconds, and takes even
> longer without the DISTINCT clause in there - the DISTINCT collapses
> the result set down to around 17,000 rows.

Well, if you need to summarise all the rows then that plan is as good as any. If you run this query very frequently, you'll probably want to look into keeping a summary table updated via triggers (there is a sketch of this after this message).

Before that though, try issuing a "SET work_mem = '9MB'" before running your query. If that doesn't change the plan, step up gradually. You should be able to get the sort stage to happen in RAM rather than on disk (see "Sort Method" above). Don't go overboard though - your big query will probably use multiples of that value.

> Taking out the two lpad's in there knocks off about 1500ms, so I can
> come up with something else for them - but I'd like to get the query as
> a whole down to under a second.

Stick the lpads in a query that wraps your DISTINCT query (again, see the sketch below).

> dealer_id, year_id, subledger_id and account_id are all part of the
> primary key on the finbalance table, so I don't think I can index them
> down any further.

A CLUSTER <pkey-index> ON <table> might help, but it will degrade as you update the finbalance table.

--
Richard Huxton
Archonet Ltd
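[To make the lpad suggestion concrete, a minimal sketch of the wrapped subquery, using the column names from Paul's inner SELECT - untested against the real schema:]

SELECT year,
       dealer_id,
       lpad(subledger_id::text, 4, '0') AS subledger,
       lpad(account_id::text, 4, '0') AS account
FROM (SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
                          finbalance.subledger_id, finbalance.account_id)
             finbalance.year_id AS year,
             finbalance.dealer_id AS dealer_id,
             finbalance.subledger_id,
             finbalance.account_id
      FROM finbalance) AS distinct_rows;
-- lpad() now runs once per collapsed row (~17,000 times) instead of
-- once per input row (~206,000 times)

[The summary-table idea could be sketched along these lines. The table, function and trigger names are hypothetical, and a real version would also need to handle UPDATE and DELETE on finbalance:]

CREATE TABLE finbalance_summary AS
SELECT DISTINCT dealer_id, year_id, subledger_id, account_id
FROM finbalance;

CREATE FUNCTION finbalance_summary_ins() RETURNS trigger AS $$
BEGIN
  -- add the new combination only if it is not already summarised
  INSERT INTO finbalance_summary (dealer_id, year_id, subledger_id, account_id)
  SELECT NEW.dealer_id, NEW.year_id, NEW.subledger_id, NEW.account_id
  WHERE NOT EXISTS (SELECT 1 FROM finbalance_summary s
                    WHERE s.dealer_id = NEW.dealer_id
                      AND s.year_id = NEW.year_id
                      AND s.subledger_id = NEW.subledger_id
                      AND s.account_id = NEW.account_id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER finbalance_summary_trg
AFTER INSERT ON finbalance
FOR EACH ROW EXECUTE PROCEDURE finbalance_summary_ins();

[And the CLUSTER suggestion, assuming the primary key index is the pk_finbalances_pk shown in the plan above, in the syntax of this era:]

CLUSTER pk_finbalances_pk ON finbalance;  -- physically re-orders finbalance by its primary key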
"Richard Huxton" <dev@archonet.com> writes: > Paul Lambert wrote: > >> " -> Sort (cost=30197.98..30714.85 rows=206748 width=16) (actual >> time=5949.691..7018.931 rows=206748 loops=1)" >> " Sort Key: dealer_id, year_id, subledger_id, account_id" >> " Sort Method: external merge Disk: 8880kB" > Before that though, try issuing a "SET work_mem = '9MB'" before running your > query. If that doesn't change the plan step up gradually. You should be able to > get the sort stage to happen in RAM rather than on disk (see "Sort Method" > above). FWIW you'll probably need more than that. Try something more like 20MB. Also, note you can change this with SET for just this connection and even just this query and then reset it to the normal value (or use SET LOCAL). You don't have to change it in the config file and restart the whole server. Also, try replacing the DISTINCT with GROUP BY. The code path for DISTINCT unfortunately needs a bit of cleaning up and isn't exactly equivalent to GROUP BY. In particular it doesn't support hash aggregates which, if your work_mem is large enough, might work for you here. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
> Also, try replacing the DISTINCT with GROUP BY. The code path for
> DISTINCT unfortunately needs a bit of cleaning up and isn't exactly
> equivalent to GROUP BY. In particular, it doesn't support hash
> aggregates which, if your work_mem is large enough, might work for
> you here.

Sorry, strike that last suggestion. I was looking at the plan and forgot that the query had DISTINCT ON. It is possible to replace DISTINCT ON with GROUP BY, but it's not going to be faster than the DISTINCT ON case, since you'll need the sort anyway. Actually, it's possible to do without the sort if you write some fancy aggregate functions, but for this large a query that's going to be awfully complex.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
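[For what it's worth, the GROUP BY form Gregory mentions would, for the isolated finbalance subquery earlier in the thread, look something like this - a sketch only. Every output column is built from the grouped columns, so no aggregates are needed, but as he says the same sort is still required:]

SELECT finbalance.year_id AS year,
       finbalance.dealer_id AS dealer_id,
       lpad(finbalance.subledger_id::text, 4, '0') AS subledger,
       lpad(finbalance.account_id::text, 4, '0') AS account
FROM finbalance
GROUP BY finbalance.dealer_id, finbalance.year_id,
         finbalance.subledger_id, finbalance.account_id;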
Paul Lambert wrote:
>   ->  Merge Join  (cost=35.56..19732.91 rows=12 width=48)
>       (actual time=0.841..2309.828 rows=206748 loops=1)

I'm no expert, but in the interests of learning: why is the rows estimate so far out for this join?

Thanks,
Jeremy
Gregory Stark wrote:
> "Richard Huxton" <dev@archonet.com> writes:
>
>> Paul Lambert wrote:
>>
>>>   ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual
>>> time=5949.691..7018.931 rows=206748 loops=1)
>>>         Sort Key: dealer_id, year_id, subledger_id, account_id
>>>         Sort Method: external merge  Disk: 8880kB
>
>> Before that though, try issuing a "SET work_mem = '9MB'" before running
>> your query. If that doesn't change the plan, step up gradually. You
>> should be able to get the sort stage to happen in RAM rather than on
>> disk (see "Sort Method" above).
>
> FWIW you'll probably need more than that. Try something more like 20MB.
>
> Also, note that you can change this with SET for just this connection,
> and even for just this query, and then reset it to the normal value (or
> use SET LOCAL). You don't have to change it in the config file and
> restart the whole server.
>
> Also, try replacing the DISTINCT with GROUP BY. The code path for
> DISTINCT unfortunately needs a bit of cleaning up and isn't exactly
> equivalent to GROUP BY. In particular, it doesn't support hash
> aggregates which, if your work_mem is large enough, might work for
> you here.

I changed work_mem to 20MB per the suggestion and that knocks the query time down to just over 6 seconds... still a bit slow for my liking, but any higher work_mem doesn't change the result - i.e. 30, 40 and 50MB all give just over 6 seconds.

The EXPLAIN ANALYZE output now shows all the sorts taking place in memory as quicksorts, rather than the on-disk merges in the previous query plan, so I'll make a permanent change to the config to set work_mem to 20MB.

I've also changed the inner-most select into a two-level select with the lpad's on the outer level, so they are not being evaluated on every row, just on the collapsed rows - that accounted for about 1 second of the overall time reduction.

Would increasing the statistics targets on any of these tables speed things up any more?

--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company
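[On Paul's closing question: per-column statistics targets can be raised like so - a generic sketch, where the column choice and target value are purely illustrative, not a recommendation from the thread:]

ALTER TABLE finbalance ALTER COLUMN dealer_id SET STATISTICS 100;
ANALYZE finbalance;  -- re-gather statistics so the planner sees the new target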