Discussion: Optimising a query

Optimising a query

From: Paul Lambert

I have a query that looks like this:

    SELECT DISTINCT ON (EEFSCode)
        eefsbase.company || eefsbase.department || eefsbase.branch ||
        eefsbase.franchise || eefsbase.subledger || eefsbase.account AS EEFSCode,
        eefsbase.company AS company_code,
        eefsbase.branch AS branch_code,
        eefsbase.department AS department_code,
        eefsbase.franchise AS franchise_code,
        fincompany.full_name AS company_description,
        finbranch.full_name AS branch_description,
        findepartment.full_name AS department_description,
        finfranchise.full_name AS franchise_description,
        eefsbase.sort_key1 AS acct_type_rpt_code,
        ''::text AS acct_type_rpt_description,
        eefsbase.sort_key2 AS exec_fs_rptcat2,
        ''::text AS exec_fs_rptcat2_desc,
        eefsbase.sort_key3 AS exec_fs_rptcat3,
        ''::text AS exec_fs_rptcat3_desc,
        0 AS financial_report_category,
        ''::text AS financial_report_cat_desc
    FROM
        (SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
                             finbalance.subledger_id, finbalance.account_id)
            finbalance.year_id AS year,
            finbalance.dealer_id AS dealer_id,
            finbalance.sort_key1 AS sort_key1,
            finbalance.sort_key2 AS sort_key2,
            finbalance.sort_key3 AS sort_key3,
            lpad(finbalance.subledger_id::text, 4, '0') AS subledger,
            lpad(finbalance.account_id::text, 4, '0') AS account,
            lpad(finsubledger.company::text, 4, '0') AS company,
            lpad(finsubledger.department::text, 4, '0') AS department,
            lpad(finsubledger.branch::text, 4, '0') AS branch,
            lpad(finsubledger.franchise::text, 4, '0') AS franchise
        FROM finbalance
        INNER JOIN finsubledger ON (finbalance.dealer_id = finsubledger.dealer_id AND
                                    finbalance.year_id = finsubledger.year AND
                                    finbalance.subledger_id = finsubledger.subledger_number)
        ) eefsbase
    INNER JOIN fincompany    ON (eefsbase.company::int = fincompany.report_number AND
                                 eefsbase.dealer_id = fincompany.dealer_id AND
                                 eefsbase.year = fincompany.year)
    INNER JOIN finbranch     ON (eefsbase.branch::int = finbranch.report_number AND
                                 eefsbase.dealer_id = finbranch.dealer_id AND
                                 eefsbase.year = finbranch.year)
    INNER JOIN findepartment ON (eefsbase.department::int = findepartment.report_number AND
                                 eefsbase.dealer_id = findepartment.dealer_id AND
                                 eefsbase.year = findepartment.year)
    INNER JOIN finfranchise  ON (eefsbase.franchise::int = finfranchise.report_number AND
                                 eefsbase.dealer_id = finfranchise.dealer_id AND
                                 eefsbase.year = finfranchise.year);

In one of my test systems, the finbalance table has approximately
220,000 records, around 17,500 of which are distinct on the fields in
the DISTINCT ON clause; the finsubledger table has 97 rows; and the
other tables mentioned - fincompany, finbranch, findepartment,
finfranchise - each have between 1 and 50 records, i.e. they are all
relatively small.

The above query runs in between ten and twelve seconds on this test
system, and I would like to get that down if possible.

The EXPLAIN ANALYZE output looks like this:

"Unique  (cost=19801.92..19801.93 rows=1 width=380) (actual
time=10838.666..10884.568 rows=17227 loops=1)"
"  ->  Sort  (cost=19801.92..19801.92 rows=1 width=380) (actual
time=10838.662..10863.909 rows=17227 loops=1)"
"        Sort Key: (((((((lpad((finsubledger.company)::text, 4,
'0'::text)) || (lpad((finsubledger.department)::text, 4, '0'::text))) ||
(lpad((finsubledger.branch)::text, 4, '0'::text))) ||
(lpad((finsubledger.franchise)::text, 4, '0'::text))) ||
(lpad((finbalance.subledger_id)::text, 4, '0'::text))) ||
(lpad((finbalance.account_id)::text, 4, '0'::text))))"
"        Sort Method:  external merge  Disk: 2288kB"
"        ->  Nested Loop  (cost=19733.14..19801.91 rows=1 width=380)
(actual time=9073.324..10386.626 rows=17227 loops=1)"
"              ->  Nested Loop  (cost=19733.14..19793.60 rows=1
width=393) (actual time=9073.287..10128.155 rows=17227 loops=1)"
"                    ->  Nested Loop  (cost=19733.13..19785.30 rows=1
width=336) (actual time=9073.253..9911.426 rows=17227 loops=1)"
"                          ->  Nested Loop  (cost=19733.13..19777.00
rows=1 width=279) (actual time=9073.222..9685.723 rows=17227 loops=1)"
"                                ->  Unique  (cost=19733.12..19733.27
rows=12 width=48) (actual time=9073.143..9426.581 rows=17227 loops=1)"
"                                      ->  Sort
(cost=19733.12..19733.15 rows=12 width=48) (actual
time=9073.141..9226.161 rows=206748 loops=1)"
"                                            Sort Key:
finbalance.dealer_id, finbalance.year_id, finbalance.subledger_id,
finbalance.account_id"
"                                            Sort Method:  external sort
  Disk: 14544kB"
"                                            ->  Merge Join
(cost=35.56..19732.91 rows=12 width=48) (actual time=0.841..2309.828
rows=206748 loops=1)"
"                                                  Merge Cond:
(((finbalance.dealer_id)::text = (finsubledger.dealer_id)::text) AND
(finbalance.subledger_id = finsubledger.subledger_number))"
"                                                  Join Filter:
(finbalance.year_id = finsubledger.year)"
"                                                  ->  Index Scan using
pk_finbalances_pk on finbalance  (cost=0.00..18596.78 rows=210130
width=32) (actual time=0.079..310.804 rows=206748 loops=1)"
"                                                  ->  Sort
(cost=35.56..36.73 rows=470 width=34) (actual time=0.731..113.207
rows=205742 loops=1)"
"                                                        Sort Key:
finsubledger.dealer_id, finsubledger.subledger_number"
"                                                        Sort Method:
quicksort  Memory: 24kB"
"                                                        ->  Seq Scan on
finsubledger  (cost=0.00..14.70 rows=470 width=34) (actual
time=0.011..0.101 rows=97 loops=1)"
"                                ->  Index Scan using
pk_finfranchise_dealer_company on finfranchise  (cost=0.01..3.61 rows=1
width=61) (actual time=0.009..0.010 rows=1 loops=17227)"
"                                      Index Cond:
(((finfranchise.dealer_id)::text = (finbalance.dealer_id)::text) AND
(finfranchise.year = finbalance.year_id) AND (finfranchise.report_number
= ((lpad((finsubledger.franchise)::text, 4, '0'::text)))::integer))"
"                          ->  Index Scan using
pk_findepartment_dealer_company on findepartment  (cost=0.01..8.28
rows=1 width=61) (actual time=0.008..0.009 rows=1 loops=17227)"
"                                Index Cond:
(((findepartment.dealer_id)::text = (finbalance.dealer_id)::text) AND
(findepartment.year = finbalance.year_id) AND
(findepartment.report_number = ((lpad((finsubledger.department)::text,
4, '0'::text)))::integer))"
"                    ->  Index Scan using pk_finbranch_dealer_company on
finbranch  (cost=0.01..8.28 rows=1 width=61) (actual time=0.007..0.008
rows=1 loops=17227)"
"                          Index Cond: (((finbranch.dealer_id)::text =
(finbalance.dealer_id)::text) AND (finbranch.year = finbalance.year_id)
AND (finbranch.report_number = ((lpad((finsubledger.branch)::text, 4,
'0'::text)))::integer))"
"              ->  Index Scan using pk_fincompany_dealer_company on
fincompany  (cost=0.01..8.28 rows=1 width=61) (actual time=0.007..0.009
rows=1 loops=17227)"
"                    Index Cond: (((fincompany.dealer_id)::text =
(finbalance.dealer_id)::text) AND (fincompany.year = finbalance.year_id)
AND (fincompany.report_number = ((lpad((finsubledger.company)::text, 4,
'0'::text)))::integer))"
"Total runtime: 10896.235 ms"

Can anyone suggest some alterations to my SQL, or anything else I could
do, to get this query to run a good bit faster?

Cheers,
Paul.

--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company


Re: Optimising a query

From: Paul Lambert
Paul Lambert wrote:
> <snip>


This part of the query alone takes a significant share of the total time:

SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
                    finbalance.subledger_id, finbalance.account_id)
    finbalance.year_id AS year,
    finbalance.dealer_id AS dealer_id,
    lpad(finbalance.subledger_id::text, 4, '0') AS subledger,
    lpad(finbalance.account_id::text, 4, '0') AS account
FROM finbalance

Runs with a query plan of:

"Unique  (cost=30197.98..32782.33 rows=20675 width=16) (actual
time=5949.695..7197.475 rows=17227 loops=1)"
"  ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual
time=5949.691..7018.931 rows=206748 loops=1)"
"        Sort Key: dealer_id, year_id, subledger_id, account_id"
"        Sort Method:  external merge  Disk: 8880kB"
"        ->  Seq Scan on finbalance  (cost=0.00..8409.70 rows=206748
width=16) (actual time=0.042..617.949 rows=206748 loops=1)"
"Total runtime: 7210.966 ms"


So selecting from the finbalance table (approx. 206,000 records) takes
around seven seconds on its own, and even longer without the distinct
clause in there - the distinct collapses the result set down to around
17,000 rows.

Taking out the two lpads knocks off about 1500ms, so I can come up with
something else for them - but I'd like to get the query as a whole down
to under a second.

dealer_id, year_id, subledger_id and account_id are all part of the
primary key on the finbalance table, so I don't think I can improve the
indexing any further.

Are there any config settings that would make it faster?

I'm running on a quad-core Xeon 1.6GHz server with 4GB RAM. I imagine
shared_buffers (currently 32MB) and work_mem (currently 1MB) could be
bumped up a good bit with 4GB of available RAM?
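Something like this in postgresql.conf, perhaps? The numbers are guesses
on my part, not tuned recommendations:

    # postgresql.conf excerpt - illustrative starting points only
    shared_buffers = 512MB   # up from 32MB; changing this needs a server restart
    work_mem = 16MB          # up from 1MB; applies per sort/hash operation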


--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company

Re: Optimising a query

From: Richard Huxton
Paul Lambert wrote:
> Paul Lambert wrote:
>> <snip>
>
>
> This part of the query alone takes a significant part of the time:
>
> SELECT DISTINCT ON (finbalance.dealer_id, finbalance.year_id,
>                     finbalance.subledger_id, finbalance.account_id)
>     finbalance.year_id AS year,
>     finbalance.dealer_id AS dealer_id,
>     lpad(finbalance.subledger_id::text, 4, '0') AS subledger,
>     lpad(finbalance.account_id::text, 4, '0') AS account
> FROM finbalance
>
> Runs with a query plan of:
>
> "Unique  (cost=30197.98..32782.33 rows=20675 width=16) (actual time=5949.695..7197.475 rows=17227 loops=1)"
> "  ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual time=5949.691..7018.931 rows=206748 loops=1)"
> "        Sort Key: dealer_id, year_id, subledger_id, account_id"
> "        Sort Method:  external merge  Disk: 8880kB"
> "        ->  Seq Scan on finbalance  (cost=0.00..8409.70 rows=206748 width=16) (actual time=0.042..617.949 rows=206748 loops=1)"
> "Total runtime: 7210.966 ms"
>
>
> So selecting from the finbalance table (approx. 206,000 records) takes
> around seven seconds on its own, and even longer without the distinct
> clause in there - the distinct collapses the result set down to around
> 17,000 rows.

Well, if you need to summarise all the rows then that plan is as good as
any.

If you run this query very frequently, you'll probably want to look into
keeping a summary table updated via triggers.

Before that though, try issuing a "SET work_mem = '9MB'" before running
your query. If that doesn't change the plan, step it up gradually. You
should be able to get the sort stage to happen in RAM rather than on
disk (see "Sort Method" above). Don't go overboard though; your big
query will probably use multiples of that value, since each sort or hash
in the plan can use up to work_mem on its own.
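
A minimal sketch of that, reusing the cut-down query quoted above (lpads
dropped for clarity):

    SET work_mem = '9MB';   -- affects this session only

    EXPLAIN ANALYZE
    SELECT DISTINCT ON (dealer_id, year_id, subledger_id, account_id)
           year_id, dealer_id, subledger_id, account_id
    FROM finbalance;
    -- aim for "Sort Method: quicksort  Memory: ..." instead of "external merge"

    RESET work_mem;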

> Taking out the two lpads knocks off about 1500ms, so I can come up with
> something else for them - but I'd like to get the query as a whole down
> to under a second.

Stick the lpads in a query that wraps your DISTINCT query.
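
Something like this (untested) is what I mean - the lpad() calls then
run over the ~17,000 collapsed rows instead of all ~206,000:

    SELECT year,
           dealer_id,
           lpad(subledger_id::text, 4, '0') AS subledger,
           lpad(account_id::text, 4, '0') AS account
    FROM (SELECT DISTINCT ON (dealer_id, year_id, subledger_id, account_id)
                 year_id AS year,
                 dealer_id,
                 subledger_id,
                 account_id
          FROM finbalance) AS distinct_rows;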

> dealer_id, year_id, subledger_id and account_id are all part of the
> primary key on the finbalance table, so I don't think I can improve the
> indexing any further.

A CLUSTER <pkey-index> ON <table> might help, but it will degrade as you
update the finbalance table.
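
For example, using the index name shown in your plan (untested against
your schema):

    CLUSTER pk_finbalances_pk ON finbalance;  -- rewrite the table in index order
    ANALYZE finbalance;                       -- refresh statistics afterwards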

--
   Richard Huxton
   Archonet Ltd

Re: Optimising a query

From: Gregory Stark
"Richard Huxton" <dev@archonet.com> writes:

> Paul Lambert wrote:
>
>> "  ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual >> time=5949.691..7018.931 rows=206748 loops=1)"
>> "        Sort Key: dealer_id, year_id, subledger_id, account_id"
>> "        Sort Method:  external merge  Disk: 8880kB"

> Before that though, try issuing a "SET work_mem = '9MB'" before running your
> query. If that doesn't change the plan, step it up gradually. You should be able to
> get the sort stage to happen in RAM rather than on disk (see "Sort Method"
> above).

FWIW you'll probably need more than that. Try something more like 20MB.

Also, note you can change this with SET for just this connection and even just
this query and then reset it to the normal value (or use SET LOCAL). You don't
have to change it in the config file and restart the whole server.
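
For example, a sketch of the transaction-scoped form:

    BEGIN;
    SET LOCAL work_mem = '20MB';  -- reverts automatically at COMMIT/ROLLBACK
    -- ... run the big query here ...
    COMMIT;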

Also, try replacing the DISTINCT with GROUP BY. The code path for DISTINCT
unfortunately needs a bit of cleaning up and isn't exactly equivalent to GROUP
BY. In particular it doesn't support hash aggregates which, if your work_mem
is large enough, might work for you here.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Optimising a query

From: Gregory Stark
> Also, try replacing the DISTINCT with GROUP BY. The code path for DISTINCT
> unfortunately needs a bit of cleaning up and isn't exactly equivalent to GROUP
> BY. In particular it doesn't support hash aggregates which, if your work_mem
> is large enough, might work for you here.

Sorry, strike that last suggestion. I was looking at the plan and forgot that
the query had DISTINCT ON. It is possible to replace DISTINCT ON with GROUP BY,
but it's not going to be faster than the DISTINCT ON case since you'll need
the sort anyway.
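
For what it's worth, on the cut-down query you posted the rewrite would
just be:

    -- Equivalent to the DISTINCT ON version only because every selected
    -- column also appears in the GROUP BY list
    SELECT dealer_id, year_id AS year, subledger_id, account_id
    FROM finbalance
    GROUP BY dealer_id, year_id, subledger_id, account_id;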

Actually it's possible to do without the sort if you write some fancy
aggregate functions but for this large a query that's going to be awfully
complex.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Optimising a query

From: Jeremy Harris
Paul Lambert wrote:
> "                                            ->  Merge Join
> (cost=35.56..19732.91 rows=12 width=48) (actual time=0.841..2309.828
> rows=206748 loops=1)"

I'm no expert, but in the interests of learning: why is the
rows estimate so far out for this join?

Thanks,
    Jeremy


Re: Optimising a query

From: Paul Lambert
Gregory Stark wrote:
> "Richard Huxton" <dev@archonet.com> writes:
>
>> Paul Lambert wrote:
>>
>>> "  ->  Sort  (cost=30197.98..30714.85 rows=206748 width=16) (actual >> time=5949.691..7018.931 rows=206748
loops=1)"
>>> "        Sort Key: dealer_id, year_id, subledger_id, account_id"
>>> "        Sort Method:  external merge  Disk: 8880kB"
>
>> Before that though, try issuing a "SET work_mem = '9MB'" before running your
>> query. If that doesn't change the plan, step it up gradually. You should be able to
>> get the sort stage to happen in RAM rather than on disk (see "Sort Method"
>> above).
>
> FWIW you'll probably need more than that. Try something more like 20MB.
>
> Also, note you can change this with SET for just this connection and even just
> this query and then reset it to the normal value (or use SET LOCAL). You don't
> have to change it in the config file and restart the whole server.
>
> Also, try replacing the DISTINCT with GROUP BY. The code path for DISTINCT
> unfortunately needs a bit of cleaning up and isn't exactly equivalent to GROUP
> BY. In particular it doesn't support hash aggregates which, if your work_mem
> is large enough, might work for you here.
>


I changed work_mem to 20MB per the suggestion and that knocks the query
time down to just over 6 seconds... still a bit slow for my liking, but
any higher work_mem doesn't change the result - i.e. 30, 40 and 50MB all
give just over 6 seconds.

The explain analyze shows all the sorts taking place in memory now as
quicksorts, rather than the on-disk merges in the previous query plan,
so I'll make a permanent change to the config to set work_mem to 20MB.

I've also changed the innermost select into a two-level select with the
lpads on the outer level, so they are evaluated only on the collapsed
rows rather than on every row - that accounted for about 1 second of the
overall time reduction.

Would increasing the statistics targets on any of these tables speed
things up any more?
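
Something like this is what I have in mind (column and target value
picked arbitrarily as an example):

    ALTER TABLE finbalance ALTER COLUMN dealer_id SET STATISTICS 100;
    ANALYZE finbalance;   -- re-gather stats so the new target takes effect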

--
Paul Lambert
Database Administrator
AutoLedgers - A Reynolds & Reynolds Company