Re: suggestions on improving a query
| From | Rajarshi Guha |
|---|---|
| Subject | Re: suggestions on improving a query |
| Date | |
| Msg-id | 1171459362.7305.27.camel@localhost |
| In reply to | Re: suggestions on improving a query (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: suggestions on improving a query; Re: suggestions on improving a query |
| List | pgsql-general |
On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam.r@sbcglobal.net> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp
> > (cost=0.00..16733229.92 rows=4669988 width=80)
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows. However, there were only 25,000.
Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze. Rerunning the query gives:
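For reference, the change was roughly along these lines (a sketch only; the exact columns and the statistics target value of 100 are illustrative, not a record of what was actually run):

```sql
-- Sketch: raise the statistics target on the columns the planner was
-- misestimating, then re-gather statistics. Column choices and the
-- value 100 are assumptions for illustration.
ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;
ALTER TABLE dockscore_plp ALTER COLUMN total SET STATISTICS 100;
VACUUM ANALYZE dock;
VACUUM ANALYZE dockscore_plp;
```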
                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)
Clearly a big improvement in performance.
(One question not directly related to the problem: when looking at the
output of EXPLAIN ANALYZE, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on
pubchem_compound is being performed first? Or should I start from the
innermost line?)
However, it seems that it could still be improved:

 ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's still a big mismatch between the estimated and
actual row counts (4,669,988 vs. 25,197), and correspondingly between the
estimated costs and actual times.
> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order. So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended. The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect.
I think that increasing the statistics has improved that.
> For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.
This is not the case: the value '1YC1' will be associated with both high
and low values of dockscore_plp.total.
What I would like my query to do is this:
1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp
3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound
4. Sort and take the top 10 from step 2 (and associated rows in step 3)
However now that I have written this it seems that what I really want to
do is:
1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp
3. Sort and take the top 10
4. Get the corresponding rows from pubchem_compound.cid
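In SQL, the reordering described in steps 1-4 might look something like this (an untested sketch; the column names are inferred from the EXPLAIN output above, not taken from the real schema):

```sql
-- Untested sketch: restrict to target = '1YC1', sort and limit on
-- dockscore_plp first, and only then look up the surviving 10 cids in
-- pubchem_compound. Column names inferred from the plan output.
SELECT pc.*, top.total
FROM (
    SELECT d.cid, s.total
    FROM dockscore_plp s
    JOIN dock d ON d.dockid = s.id
    WHERE d.target = '1YC1'
    ORDER BY s.total
    LIMIT 10
) AS top
JOIN pubchem_compound pc ON pc.cid = top.cid
ORDER BY top.total;
```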
The problem with this is that step 4 is represented by the
dock.cid = pubchem_compound.cid
clause. It seems that if I had the cid column in dockscore_plp, then I
could do a sort+limit in dockscore_plp and then simply lookup the
corresponding (10) rows in pubchem_compound (rather than looking up 960K
rows). The downside to this is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them -
which seems redundant.
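If denormalizing is the way to go, I imagine it would look something like this (a sketch with assumed schema details, shown for one of the five tables):

```sql
-- Hypothetical denormalization: carry dock.cid directly in
-- dockscore_plp, so the sort+limit result can be joined straight to
-- pubchem_compound without touching dock. The column type is assumed.
ALTER TABLE dockscore_plp ADD COLUMN cid text;

UPDATE dockscore_plp s
SET    cid = d.cid
FROM   dock d
WHERE  d.dockid = s.id;
```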
Is it useful to increase redundancy to improve performance?
Thanks for the pointers,
-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make
it worse.
-Calvin