Thread: suggestions on improving a query


suggestions on improving a query

From:
Rajarshi Guha
Date:
Hi, I have a query that involves 3 tables:

    select pubchem_compound.openeye_can_smiles, pubchem_compound.nist_inchi, dock.cid, dockscore_plp.*
    from dock, dockscore_plp, pubchem_compound
    where
    dock.target = '1YC1' and
    dock.dockid = dockscore_plp.id  and
    dock.cid = pubchem_compound.cid
    order by dockscore_plp.total
    limit 10;

The output of explain analyze is

 Limit  (cost=0.00..387.36 rows=10 width=297) (actual time=242977.644..462748.215 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37325186.12 rows=963575 width=297) (actual time=242977.638..462748.175 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31523550.51 rows=963575 width=90) (actual time=242900.629..461810.902 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16733229.92 rows=4669988 width=80) (actual time=98.323..322537.605 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.15 rows=1 width=18) (actual time=5.521..5.521 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..6.01 rows=1 width=216) (actual time=93.699..93.704 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 462748.439 ms
(10 rows)

Now, the tables 'dock' and 'dockscore_plp' have 4.6M rows and
'pubchem_compound' has 10M rows.

However the clause:

    dock.target = '1YC1' and
    dock.dockid = dockscore_plp.id

reduces the number of rows from 4.6M to 96K. I had figured that after
this the query would be very fast. But the explain analyze seems to
indicate that the dockscore_plp table is being sorted (using the index
plp_total_idx) in its entirety. This would then be the bottleneck.

Is this a correct interpretation?

What I expected was that the sort would be applied to the 96K subset of
dockscore_plp, rather than the whole table.

Is it possible to restructure the query such that the sort is done on
96K rows rather than 4.6M rows?

Thanks,


-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
"I'd love to go out with you, but my favorite commercial is on TV."



Re: suggestions on improving a query

From:
"Adam Rich"
Date:
This line:

Index Scan using plp_total_idx on dockscore_plp
(cost=0.00..16733229.92 rows=4669988 width=80)
(actual time=98.323..322537.605 rows=25197 loops=1)

Means the planner did what it did, because it estimated there would be
nearly 5 million rows.  However, there were only 25,000.

Have these tables been vacuumed & analyzed recently?  Your first step
should be to vacuum & analyze these, and send us the new "explain
analyze".





Re: suggestions on improving a query

From:
Tom Lane
Date:
Rajarshi Guha <rguha@indiana.edu> writes:
> However the clause:
>     dock.target = '1YC1' and
>     dock.dockid = dockscore_plp.id
> reduces the number of rows from 4.6M to 96K.

The planner seems to be estimating about ten times that many.  Perhaps
increasing the statistics target for dock.target would help?
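
A minimal sketch of what that could look like (the 100 is only an
illustrative value):

    ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;  -- raise the per-column statistics target
    ANALYZE dock;                                             -- re-gather statistics with the new target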

            regards, tom lane

Re: suggestions on improving a query

From:
Tom Lane
Date:
"Adam Rich" <adam.r@sbcglobal.net> writes:
> This line:
> Index Scan using plp_total_idx on dockscore_plp
> (cost=0.00..16733229.92 rows=4669988 width=80)
> (actual time=98.323..322537.605 rows=25197 loops=1)
> Means the planner did what it did, because it estimated there would be
> nearly 5 million rows.  However, there were only 25,000.

No, you have to be careful about interpreting the numbers when
underneath a Limit node.  The rows estimate is an estimate of the total
number of rows if the plan node were run to completion ... but if the
Limit stops execution early, that's not what will happen.  The actual
rows count shows how many rows really got pulled from the node before
the Limit stopped things.

The real problem here is that the planner is guessing that it won't take
very long to find 10 rows satisfying the target = '1YC1' condition while
scanning in dockscore_plp.total order.  So it chooses a plan that would
have a long total runtime (notice the large cost estimates below the
Limit) expecting that only a small fraction of that total will actually
be expended.  The expectation seems a bit off unfortunately :-(.
I can't tell from the given data whether the problem is just an
overestimate of the frequency of target = '1YC1', or if there's an
additional effect.  For example, if that target value tended to only be
associated with larger values of dockscore_plp.total, then a plan like
this could lose big-time because it will have to scan a long way to find
those rows.

            regards, tom lane

Re: suggestions on improving a query

From:
Rajarshi Guha
Date:
On Tue, 2007-02-13 at 21:44 -0500, Tom Lane wrote:
> Rajarshi Guha <rguha@indiana.edu> writes:
> > However the clause:
> >     dock.target = '1YC1' and
> >     dock.dockid = dockscore_plp.id
> > reduces the number of rows from 4.6M to 96K.
>
> The planner seems to be estimating about ten times that many.  Perhaps
> increasing the statistics target for dock.target would help?

My original message had a typo: I expected that it should be ~960K, so
Postgres is working as expected.

However increasing the statistics target for dock.target did lead to an
improvement in performance. Could this be because dock.target has only 5
unique values? So even though the table has ~4.6M rows, each set of ~960K
rows for dock.dockid is associated with a single value of dock.target.
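
One way to check what the planner currently believes about that column is
the standard pg_stats view, e.g.:

    -- planner's view of dock.target after the last ANALYZE
    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'dock' AND attname = 'target';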

Thanks,

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All great ideas are controversial, or have been at one time.



Re: suggestions on improving a query

From:
Rajarshi Guha
Date:
On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam.r@sbcglobal.net> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp
> > (cost=0.00..16733229.92 rows=4669988 width=80)
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows.  However, there were only 25,000.

Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze. Rerunning the query gives:

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)

Clearly a big improvement in performance.

(One question not directly related to the problem: when looking at the
output of explain analyze, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on pubchem_compound is
being performed first? Or should I start from the innermost line?)

However it seems that it could still be improved:

   ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's a big mismatch between the expected and observed costs and times.

> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order.  So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended.  The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect.

I think that increasing the statistics has improved that.

>  For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.

This is not the case. The value '1YC1' will be associated with both high
and low values of dockscore_plp.total

What I would like my query to do is this:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound

4. Sort and take the top 10 from step 2 (and associated rows in step 3)

However now that I have written this it seems that what I really want to
do is:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Sort and take the top 10

4. Get the corresponding rows from pubchem_compound.cid

The problem with this is that step 4 is represented by the

dock.cid = pubchem_compound.cid

clause. It seems that if I had the cid column in dockscore_plp, then I
could do a sort+limit in dockscore_plp and then simply look up the
corresponding (10) rows in pubchem_compound (rather than looking up 960K
rows). The downside to this is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them -
which seems redundant.
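
For what it's worth, here is a minimal, untested sketch of steps 1-3 written
as a derived table, carrying dock.cid along so that pubchem_compound only has
to be joined to the 10 surviving rows (the aliases are arbitrary, and whether
the planner actually executes it in that order is a separate question):

    SELECT pc.openeye_can_smiles, pc.nist_inchi, top10.*
    FROM (
        -- steps 1-3: restrict dock to the target, join the scores, sort, keep 10
        SELECT dock.cid, dockscore_plp.*
        FROM dock
        JOIN dockscore_plp ON dock.dockid = dockscore_plp.id
        WHERE dock.target = '1YC1'
        ORDER BY dockscore_plp.total
        LIMIT 10
    ) AS top10
    -- step 4: fetch compound data for just those 10 rows
    JOIN pubchem_compound pc ON pc.cid = top10.cid
    ORDER BY top10.total;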

Is it useful to increase redundancy to improve performance?

Thanks for the pointers,

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make
it worse.
-Calvin



Re: suggestions on improving a query

From:
Martijn van Oosterhout
Date:
On Wed, Feb 14, 2007 at 08:22:42AM -0500, Rajarshi Guha wrote:
> (One question not directly related to the problem: when looking at the
> output of explain analyze, I know that one is supposed to start at the
> bottom and move up. Does that mean that the index scan on pubchem_compound is
> being performed first? Or should I start from the innermost line?)

There's no concept of nodes being executed before others. Each node is
executed as needed. In the case of a nested loop like you have, it
reads one tuple from the outer node (the first child) and then as many
tuples from the inner node as necessary (an index scan may not return
very many). In your case the outer node is another nested loop which
will in turn scan its inner and outer nodes as necessary.

The Limit at the top means that no more than that many tuples will be
requested from the child node, so child nodes may be executed once, many
times, or not at all.

There are some more comprehensive writeups around, but hopefully this
gives you an idea.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: suggestions on improving a query

From:
Tom Lane
Date:
Rajarshi Guha <rguha@indiana.edu> writes:
> Clearly a big improvement in performance.

Huh?  It looks like exactly the same plan as before.  Any improvement
you're seeing must be coming from cache effects.

> It looks like there's a big mismatch between the expected and observed costs and times.

Well, in the first place the estimated costs are not measured in
milliseconds, and in the second place the estimated cost and rowcount
are for execution of the plan node to completion, which is not happening
here because of the Limit --- we'll stop the plan as soon as the top
join node has produced 10 rows.  In fact I'd say the whole problem here
is that the planner is being too optimistic about the benefits of a
fast-start plan.  For whatever reason (most likely, an unfavorable
correlation between dock.target and dockscore_plp.total), the desired
rows aren't uniformly scattered in the output of the join, and so it's
taking longer than expected to find 10 of them.

            regards, tom lane

Re: suggestions on improving a query

From:
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> There are some more comprehensive writeups around, but hopefully this
> gives you an idea.

You can find the official(tm) explanation at
http://www.postgresql.org/docs/8.2/static/executor.html
--- in fact, you might want to read all of chapter 42.

            regards, tom lane

Re: suggestions on improving a query

From:
Rajarshi Guha
Date:
On Wed, 2007-02-14 at 10:55 -0500, Tom Lane wrote:
> Rajarshi Guha <rguha@indiana.edu> writes:
> > Clearly a big improvement in performance.
>
> Huh?  It looks like exactly the same plan as before.  Any improvement
> you're seeing must be coming from cache effects.

Well, the new run was done nearly 8 hours after the initial one - I
would've thought that the cache had been purged by then (?)

> > It looks like there's a big mismatch between the expected and observed costs and times.
>
>  In fact I'd say the whole problem here
> is that the planner is being too optimistic about the benefits of a
> fast-start plan.  For whatever reason (most likely, an unfavorable
> correlation between dock.target and dockscore_plp.total), the desired
> rows aren't uniformly scattered in the output of the join, and so it's
> taking longer than expected to find 10 of them.

Is there any way to solve this? I've increased the statistics target on
dockscore_plp.total to 100 - does going higher help?

From what you've said, it appears that the problem is arising due to
lack of correlation between two columns in two tables.

This is strange since, out of 4.6M rows in dock, ~ 960K will be selected
and the corresponding 960K rows from dockscore_plp will be ordered and
then the top 10 will be taken.

So does the lack of correlation occur due to 'ordering' in the DB
itself? And if this is the case, how does one fix the lack of
correlation (if at all possible)?

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Regular naps prevent old age....
especially if you take them while driving