Discussion: Tuning questions..


Tuning questions..

From: "Michael T. Halligan"
Date:
Hi.. I seem to be running into a bottleneck on a query, and I'm not sure
what the bottleneck is. The machine is a dual-processor P3 750 with 2 GB
of (PC100) memory and three 72 GB disks set up in RAID 5. Right now I'm
just testing our db for speed (we're porting from Oracle); later on we're
looking at a quad Xeon 700 with 16 GB of RAM and 10 drives in hardware
RAID 5.

We've tuned the queries a bit and added some indices, and we got this
query down from about 15 minutes to 7.6 seconds.. but it seems like we
should be able to get it under a second on this box. It's running the
latest SuSE with a 2.4.16 kernel, ReiserFS, and PostgreSQL 7.2b3. I've
tried many different combinations of buffers, stat collection space, sort
space, etc., and none of them really affect performance..

When I run this particular query, the only resource that seems to change
is that one of the processors goes to about 99% usage.. I've tried
setting postgres to use up to 1.6 GB of memory, but the postmaster never
seems to get above about 700 MB. It's not swapping at all, though context
switching gets a bit high (peaking at around 150)..

The query sorts through about 80k rows.. here's the query
--------------------------------------------------
SELECT count(*) FROM (
    SELECT DISTINCT song_id FROM ssa_candidate
    WHERE style_id IN (
        SELECT style_id FROM station_subgenre WHERE station_id = 48
    )
) AS X;
--------------------------------------------------
and the query plan:
--------------------------------------------------
NOTICE:  QUERY PLAN:

Aggregate  (cost=12236300.87..12236300.87 rows=1 width=13)
  ->  Subquery Scan x  (cost=12236163.64..12236288.40 rows=4990 width=13)
        ->  Unique  (cost=12236163.64..12236288.40 rows=4990 width=13)
              ->  Sort  (cost=12236163.64..12236163.64 rows=49902 width=13)
                    ->  Seq Scan on ssa_candidate  (cost=0.00..12232269.54 rows=49902 width=13)
                          SubPlan
                            ->  Materialize  (cost=122.53..122.53 rows=31 width=11)
                                  ->  Index Scan using station_subgenre_pk on station_subgenre  (cost=0.00..122.53 rows=31 width=11)

EXPLAIN
--------------------------------------------------


If anybody has any ideas, I'd really appreciate it..







Re: Tuning questions..

From: Tom Lane
Date:
"Michael T. Halligan" <michael@echo.com> writes:
> The query sorts through about 80k rows.. here's the query
> --------------------------------------------------
> SELECT count(*) FROM (
>     SELECT DISTINCT song_id FROM ssa_candidate
>     WHERE style_id IN (
>         SELECT style_id FROM station_subgenre WHERE station_id = 48
>     )
> ) AS X;

The standard advice for speeding up WHERE ... IN queries is to convert
them to WHERE ... EXISTS.  However, assuming that there are not very
many style_ids for any one station_id in station_subgenre, this probably
won't help much.  What I'd try is converting it to a straight join:

SELECT count(DISTINCT song_id) FROM ssa_candidate, station_subgenre
WHERE
    ssa_candidate.style_id = station_subgenre.style_id AND
    station_id = 48;

Normally this would not do what you want, since you could end up with
multiple joined rows for any one ssa_candidate row, but given that
you're going to do a DISTINCT that doesn't really matter.  Better to
let the thing use a more efficient join method and just throw away the
extra rows in the DISTINCT step.  Or that's my theory anyway; let us
know how well it works.

BTW, are the row estimates in the EXPLAIN output anywhere close to
reality?
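
If they're way off, a VACUUM ANALYZE on both tables will refresh the
statistics the planner works from, e.g.:

VACUUM ANALYZE ssa_candidate;
VACUUM ANALYZE station_subgenre;
-- then re-run EXPLAIN and compare the estimated row counts to reality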

            regards, tom lane

Re: Tuning questions..

From: Bojan Belovic
Date:
This is an interesting effect, one I've run into more than once, and I have a
question about it: we see that the estimated cost for this query went from
roughly 12,000,000 down to about 12,000. Of course we can't assume that
execution time is directly proportional to cost, and the weight factors
assigned to disk and CPU are relative, but why such a "small" difference in
execution time when we lowered the cost by a factor of 1000?

As far as your question goes, I ran a similar query (a pair of queries,
really) on my system and got the cost down from about 250M to 10k; pg also
used a hash join (as your Oracle db does). It could be the indexes: I have the
join field indexed on both tables (primary key on one, a secondary index on
the other) and the field used in the WHERE clause indexed as a secondary index
(although pg doesn't use that one anyway; I suppose too many records share the
same value). The only thing I can think of is to check whether style_id is
indexed on both tables.
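
If one of them isn't, something along these lines should do it (the index
names here are made up; adjust to your schema). While testing, you can also
see whether a hash join wins by disabling the merge join for the session and
re-running EXPLAIN:

CREATE INDEX ssa_candidate_style_id_idx ON ssa_candidate (style_id);
CREATE INDEX station_subgenre_style_id_idx ON station_subgenre (style_id);

-- planner experiment only, not something to leave on in production:
SET enable_mergejoin TO off;
EXPLAIN SELECT count(DISTINCT song_id)
FROM ssa_candidate SC, station_subgenre SS
WHERE SC.style_id = SS.style_id AND SS.station_id = 48;
SET enable_mergejoin TO on;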


"Michael T. Halligan" <michael@echo.com> wrote:
> Thanks, that sped things up a bit, from 7.6 sec. to about 5.5 sec. However,
> the plan still includes a sequential scan on ssa_candidate:
>
> Aggregate  (cost=12161.11..12161.11 rows=1 width=35)
>    -> Merge Join  (cost=11611.57..12111.12 rows=19996 width=35)
>          -> Sort  (cost=11488.27..11488.27 rows=99805 width=24)
>                -> Seq Scan on ssa_candidate sc  (cost=0.00..3201.05 rows=99805 width=24)
>          -> Sort  (cost=123.30..123.30 rows=31 width=11)
>                -> Index Scan using station_subgenre_pk on station_subgenre ss  (cost=0.00..122.53 rows=31 width=11)
>
>
> If we run the same query on Oracle (modified slightly for syntax):
>
> SELECT count(DISTINCT song_id) AS X
> FROM ssa_candidate SC,
>   station_subgenre SS
> WHERE SC.style_id = SS.style_id
>    AND SS.station_id = 48
>
> with the same data, it runs almost instantaneously. Oracle's plan uses a
> hash join:
>
> SELECT STATEMENT Optimizer=CHOOSE (Cost=63 Card=1 Bytes=15)
>    SORT (GROUP BY)
>      HASH JOIN (Cost=63 Card=8943 Bytes=134145)
>        INDEX (RANGE SCAN) OF STATION_SUBGENRE_PK (UNIQUE) (Cost=3 Card=12 Bytes=84)
>        TABLE ACCESS (FULL) OF SSA_CANDIDATE (Cost=59 Card=60364 Bytes=482912)
>
> Is there some way to convince PostgreSQL to use a hash join?
>
> Bojan Belovic wrote:
>
>  >Not sure about tuning, but it seems to me that this query would be much
>  >more efficient if it's rewritten like this (especially if the style_id
>  >columns on both tables are indexed):
>  >
>  >SELECT count(DISTINCT song_id) AS X
>  >FROM ssa_candidate SC JOIN station_subgenre SS ON SC.style_id = SS.style_id
>  >WHERE SS.station_id = 48
>  >
>  >Please correct me if I'm wrong. Also, Michael, if you give this one a try,
>  >could you send me the query plan? I'm just curious.