From: Jeff Cole
Date:
Hi, I'm new to tuning PostgreSQL and I have a query that gets slower
after I run a vacuum analyze.  I believe it uses a Hash Join before
the analyze and a Nested Loop IN Join after.  It seems the Nested
Loop IN Join estimates the correct number of rows, but underestimates
the amount of time required.   I am curious why the vacuum analyze
makes it slower and if that gives any clues as to which parameter I
should be tuning.

BTW, I know the query could be re-structured more cleanly to remove
the sub-selects, but that doesn't impact the performance.

thanks,
Jeff



Welcome to psql 8.1.5, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
        \h for help with SQL commands
        \? for help with psql commands
        \g or terminate with semicolon to execute query
        \q to quit

plm_demo=# explain analyze SELECT count(*) AS count_all FROM symptoms
WHERE ( 1=1 and symptoms.id in (select symptom_id from
symptom_reports sr where 1=1 and sr.user_id in (select id from users
where disease_id=1))) ;

                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=366.47..366.48 rows=1 width=0) (actual time=125.093..125.095 rows=1 loops=1)
   ->  Hash Join  (cost=362.41..366.38 rows=36 width=0) (actual time=124.162..124.859 rows=106 loops=1)
         Hash Cond: ("outer".id = "inner".symptom_id)
         ->  Seq Scan on symptoms  (cost=0.00..3.07 rows=107 width=4) (actual time=0.032..0.295 rows=108 loops=1)
         ->  Hash  (cost=362.25..362.25 rows=67 width=4) (actual time=124.101..124.101 rows=106 loops=1)
               ->  HashAggregate  (cost=361.58..362.25 rows=67 width=4) (actual time=123.628..123.854 rows=106 loops=1)
                     ->  Hash IN Join  (cost=35.26..361.41 rows=67 width=4) (actual time=9.767..96.372 rows=13074 loops=1)
                           Hash Cond: ("outer".user_id = "inner".id)
                           ->  Seq Scan on symptom_reports sr  (cost=0.00..259.65 rows=13165 width=8) (actual time=0.029..33.359 rows=13074 loops=1)
                           ->  Hash  (cost=35.24..35.24 rows=11 width=4) (actual time=9.696..9.696 rows=1470 loops=1)
                                 ->  Bitmap Heap Scan on users  (cost=2.04..35.24 rows=11 width=4) (actual time=0.711..6.347 rows=1470 loops=1)
                                       Recheck Cond: (disease_id = 1)
                                       ->  Bitmap Index Scan on users_disease_id_index  (cost=0.00..2.04 rows=11 width=0) (actual time=0.644..0.644 rows=2378 loops=1)
                                             Index Cond: (disease_id = 1)
 Total runtime: 134.045 ms
(15 rows)


plm_demo=# vacuum analyze;
VACUUM
plm_demo=# analyze;
ANALYZE

plm_demo=# explain analyze SELECT count(*) AS count_all FROM symptoms
WHERE ( 1=1 and symptoms.id in (select symptom_id from
symptom_reports sr where 1=1 and sr.user_id in (select id from users
where disease_id=1))) ;
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=586.47..586.48 rows=1 width=0) (actual time=3441.385..3441.386 rows=1 loops=1)
   ->  Nested Loop IN Join  (cost=149.05..586.26 rows=85 width=0) (actual time=54.517..3441.115 rows=106 loops=1)
         Join Filter: ("outer".id = "inner".symptom_id)
         ->  Seq Scan on symptoms  (cost=0.00..3.08 rows=108 width=4) (actual time=0.007..0.273 rows=108 loops=1)
         ->  Hash IN Join  (cost=149.05..603.90 rows=13074 width=4) (actual time=0.078..24.503 rows=3773 loops=108)
               Hash Cond: ("outer".user_id = "inner".id)
               ->  Seq Scan on symptom_reports sr  (cost=0.00..258.74 rows=13074 width=8) (actual time=0.003..9.044 rows=3773 loops=108)
               ->  Hash  (cost=145.38..145.38 rows=1470 width=4) (actual time=7.608..7.608 rows=1470 loops=1)
                     ->  Seq Scan on users  (cost=0.00..145.38 rows=1470 width=4) (actual time=0.006..4.353 rows=1470 loops=1)
                           Filter: (disease_id = 1)
 Total runtime: 3441.452 ms
(11 rows)
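
One low-risk way to confirm that the plan switch itself (rather than
the statistics refresh as such) accounts for the slowdown is to
disable nested loops for the current session and re-run the same
EXPLAIN ANALYZE; the planner should then fall back to a hash-based
plan comparable to the pre-analyze one.  A minimal diagnostic sketch,
session-local only and not a recommended permanent setting:

-- Diagnostic: steer the planner away from the nestloop, re-check the
-- plan, then restore the default for this session.
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT count(*) AS count_all
  FROM symptoms
 WHERE symptoms.id IN (SELECT symptom_id
                         FROM symptom_reports sr
                        WHERE sr.user_id IN (SELECT id FROM users
                                              WHERE disease_id = 1));
RESET enable_nestloop;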



Re:

From: Tom Lane
Date:
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi, I'm new to tuning PostgreSQL and I have a query that gets slower
> after I run a vacuum analyze.  I believe it uses a Hash Join before
> the analyze and a Nested Loop IN Join after.  It seems the Nested
> Loop IN Join estimates the correct number of rows, but underestimates
> the amount of time required.   I am curious why the vacuum analyze
> makes it slower and if that gives any clues as to which parameter I
> should be tuning.

Hm, the cost for the upper nestloop is way less than you would expect
given that the HASH IN join is going to have to be repeated 100+ times.
I think this must be due to a very low "join_in_selectivity" estimate
but I'm not sure why you are getting that, especially seeing that the
rowcount estimates aren't far off.  Can you show us the pg_stats
rows for symptoms.id and symptom_reports.symptom_id?

            regards, tom lane

Re:

From: Jeff Cole
Date:
On Mar 5, 2007, at 8:54 PM, Tom Lane wrote:

> Hm, the cost for the upper nestloop is way less than you would expect
> given that the HASH IN join is going to have to be repeated 100+
> times.
> I think this must be due to a very low "join_in_selectivity" estimate
> but I'm not sure why you are getting that, especially seeing that the
> rowcount estimates aren't far off.  Can you show us the pg_stats
> rows for symptoms.id and symptom_reports.symptom_id?
>

Hi Tom, thanks for the response.  Here are the pg_stats.  I think I
understand what the stats say, but I don't know what to conclude from
them.


plm_stage=# select * from pg_stats where tablename = 'symptoms' and attname = 'id';
 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |          histogram_bounds           | correlation
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------+-------------------------------------+-------------
 public     | symptoms  | id      |         0 |         4 |         -1 |                  |                   | {1,11,24,34,46,57,71,85,95,106,117} |    0.451606


plm_stage=# select * from pg_stats where tablename = 'symptom_reports' and attname = 'symptom_id';
 schemaname |    tablename    |  attname   | null_frac | avg_width | n_distinct |    most_common_vals    |                                   most_common_freqs                                   |          histogram_bounds           | correlation
------------+-----------------+------------+-----------+-----------+------------+------------------------+----------------------------------------------------------------------------------------+-------------------------------------+-------------
 public     | symptom_reports | symptom_id |         0 |         4 |         80 | {3,2,4,1,5,8,9,7,10,6} | {0.094,0.0933333,0.0933333,0.092,0.0913333,0.0903333,0.0866667,0.0843333,0.084,0.08}    | {12,18,24,30,38,44,51,57,91,91,114} |   0.0955925



And Ismo, I followed your suggestion to re-write the SQL more
cleanly, and you are right, it was faster, so that is certainly a
solution.  Still, I am curious why my original query slowed down
after the vacuum analyze.  In any case, here is the explain analyze
from the new query.  Compare that to the 3441.452 ms of the old query
after the analyze (and 134.045 ms before the analyze):

plm_stage=# explain analyze SELECT count(distinct s.id) AS count_all
FROM symptoms s ,symptom_reports sr,users u WHERE s.id=sr.symptom_id
and sr.user_id=u.id and u.disease_id in (1);
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=770.05..770.06 rows=1 width=4) (actual time=176.749..176.751 rows=1 loops=1)
   ->  Hash Join  (cost=89.43..737.50 rows=13020 width=4) (actual time=7.762..142.063 rows=13038 loops=1)
         Hash Cond: ("outer".symptom_id = "inner".id)
         ->  Hash Join  (cost=86.09..538.86 rows=13020 width=4) (actual time=7.277..89.293 rows=13038 loops=1)
               Hash Cond: ("outer".user_id = "inner".id)
               ->  Seq Scan on symptom_reports sr  (cost=0.00..257.38 rows=13038 width=8) (actual time=0.003..30.499 rows=13038 loops=1)
               ->  Hash  (cost=82.41..82.41 rows=1471 width=4) (actual time=7.261..7.261 rows=1471 loops=1)
                     ->  Seq Scan on users u  (cost=0.00..82.41 rows=1471 width=4) (actual time=0.006..4.133 rows=1471 loops=1)
                           Filter: (disease_id = 1)
         ->  Hash  (cost=3.07..3.07 rows=107 width=4) (actual time=0.469..0.469 rows=107 loops=1)
               ->  Seq Scan on symptoms s  (cost=0.00..3.07 rows=107 width=4) (actual time=0.007..0.247 rows=107 loops=1)
 Total runtime: 176.842 ms
(12 rows)



Re:

From: Tom Lane
Date:
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, thanks for the response.  Here are the pg_stats.  I think I
> understand what the stats say, but I don't know what to conclude from
> them.

OK, the symptom_id row claims there are only 80 distinct values of
symptom_id in symptom_reports.  This is a bit low (looks like the true
state of affairs is that all but 2 of the 108 entries of symptoms are
represented in symptom_reports), but it's not horridly off considering
that you're using the rather low default statistics_target.  What
happens is that the planner expects that on average only 80 rows of the
inner join will need to be scanned to find a match for a given symptoms.id,
and this makes the nestloop look cheap.  However, per your previous
EXPLAIN ANALYZE:

>     ->  Nested Loop IN Join  (cost=149.05..586.26 rows=85 width=0)  (actual time=54.517..3441.115 rows=106 loops=1)
>           Join Filter: ("outer".id = "inner".symptom_id)
>           ->  Seq Scan on symptoms  (cost=0.00..3.08 rows=108  width=4) (actual time=0.007..0.273 rows=108 loops=1)
>           ->  Hash IN Join  (cost=149.05..603.90 rows=13074 width=4)  (actual time=0.078..24.503 rows=3773 loops=108)


the *actual* average number of rows scanned is 3773.  I'm not sure why
this should be --- is it possible that the distribution of keys in
symptom_reports is wildly uneven?  This could happen if all of the
physically earlier rows in symptom_reports contain the same small set
of symptom_ids, but the stats don't seem to indicate such a skew.

            regards, tom lane
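
Tom's mention of the rather low default statistics target points at
one knob that can be tried without changing the query: raising the
per-column target for symptom_reports.symptom_id and re-analyzing, so
that n_distinct and the most-common-values list are computed from a
larger sample.  A minimal sketch (the value 100 is only an
illustration, and since the row-count estimates were already close,
this may or may not change the plan):

-- Sample this column more thoroughly, then refresh its statistics.
ALTER TABLE symptom_reports ALTER COLUMN symptom_id SET STATISTICS 100;
ANALYZE symptom_reports;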

Re:

From: Jeff Cole
Date:
On Mar 6, 2007, at 11:40 AM, Tom Lane wrote:

> the *actual* average number of rows scanned is 3773.  I'm not sure why
> this should be --- is it possible that the distribution of keys in
> symptom_reports is wildly uneven?  This could happen if all of the
> physically earlier rows in symptom_reports contain the same small set
> of symptom_ids, but the stats don't seem to indicate such a skew.

Hi Tom, you are correct, the distribution is uneven... In the 13k
symptom_reports rows, there are 105 distinct symptom_ids.   But the
first 8k symptom_reports rows only have 10 distinct symptom_ids.
Could this cause the problem and would there be anything I could do
to address it?

Thanks for all your help, I appreciate it.

-Jeff
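
The skew Jeff describes can be spot-checked directly from psql; a
rough sketch, assuming a plain sequential scan returns rows in
roughly physical order (which LIMIT without ORDER BY only
approximates):

-- How many distinct symptom_ids appear near the start of the table?
SELECT count(DISTINCT symptom_id) AS distinct_in_head
  FROM (SELECT symptom_id FROM symptom_reports LIMIT 8000) AS head;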

Re:

From: Tom Lane
Date:
Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, you are correct, the distribution is uneven... In the 13k
> symptom_reports rows, there are 105 distinct symptom_ids.   But the
> first 8k symptom_reports rows only have 10 distinct symptom_ids.
> Could this cause the problem and would there be anything I could do
> to address it?

Ah-hah, yeah, that explains it.  Is it worth your time to deliberately
randomize the order of the rows in symptom_reports?  It wasn't clear
whether this query is actually something you need to optimize.  You
might have other queries that benefit from the rows being in nonrandom
order, so I'm not entirely sure that this is a good thing to do ...

            regards, tom lane

Re:

From: Jeff Cole
Date:
Hi Tom, thanks for the great job getting to the core of this
problem... I would say I'm not sure I want to randomize the rows (I'm
not really even sure how to do it without truncating the table and
re-adding the records in a random order).  I think for the moment I
will either a) re-write the query per Ismo's suggestion, or b) wait
until more data comes into that table, potentially kicking the query
planner into not using the Nested Loop anymore.

Anyway, thanks again, I appreciate it...

-Jeff
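
For completeness, the physical shuffle Tom floated (and that Jeff was
unsure how to perform) does not require dropping the table; one
sketch, assuming no other sessions are writing to symptom_reports and
that no foreign keys reference its rows:

BEGIN;
-- Copy the rows out in random order.
CREATE TEMP TABLE sr_shuffled AS
    SELECT * FROM symptom_reports ORDER BY random();
-- Rewrite the table contents in that order; indexes and constraints
-- on symptom_reports stay in place.
DELETE FROM symptom_reports;
INSERT INTO symptom_reports SELECT * FROM sr_shuffled;
COMMIT;
-- Reclaim the dead tuples and refresh the statistics.
VACUUM ANALYZE symptom_reports;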

