Re: cost and actual time

From: Chantal Ackermann
Subject: Re: cost and actual time
Date:
Msg-id: 3E50BECC.9090107@biomax.de
In response to: Re: cost and actual time  ("Josh Berkus" <josh@agliodbs.com>)
Responses: Re: cost and actual time
           Re: cost and actual time
List: pgsql-performance
Hello Josh,

Thank you for your fast answer. (I had some days off.)

This posting is quite long, I apologize, but I wanted to provide enough
information to outline the problem.

I ran VACUUM ANALYZE on all four tables concerned (gene, disease,
gene_occurrences, disease_occurrences) to be sure the planner statistics
are up to date, but that did not reduce the difference between the
estimated number of rows and the actual result.
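
For completeness, the maintenance step amounts to this (a minimal sketch):

-- refresh the planner statistics for the four tables involved
VACUUM ANALYZE gene;
VACUUM ANALYZE disease;
VACUUM ANALYZE gene_occurrences;
VACUUM ANALYZE disease_occurrences;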

I changed default_statistics_target to 1000 (the default is 10). The
estimate goes up to 102 rows, which is a little more than before and
still far away from the actual result.

The effective_cache_size is at 80000.

To be sure I hadn't made things worse, I checked again with
default_statistics_target set back to 10 and a cache size of 1000 (and
ran VACUUM ANALYZE afterwards). The estimate is then 92 rows, so there
is not a really big difference.
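
In case the exact steps matter, the comparison was roughly this (a
sketch; the per-column form at the end is just a more targeted
alternative I could still try):

-- raise the statistics target, then rebuild the statistics
SET default_statistics_target = 1000;   -- was 10
VACUUM ANALYZE gene_occurrences;
VACUUM ANALYZE disease_occurrences;

-- alternative: raise the target only for the skewed column
ALTER TABLE gene_occurrences ALTER COLUMN gene_id SET STATISTICS 1000;
ANALYZE gene_occurrences;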

I wonder whether I can set some GEQO or planner settings in
postgresql.conf to make the planner estimate better. The database is
used exclusively for reading, so it is fine if the time needed for
analyzing the tables increases.
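
Concretely, I mean postgresql.conf entries like these (the values below
are only illustrations, except for effective_cache_size which I quoted
above):

# planner / GEQO settings one could experiment with
effective_cache_size = 80000    # as mentioned above
random_page_cost = 4            # lower it if most of the data is cached
geqo_threshold = 11             # number of FROM items before GEQO takes over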

The query I am testing with is:

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
  FROM (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
          FROM gene, disease, gene_occurrences, disease_occurrences
         WHERE gene.gene_name = 'igg'
           AND gene_occurrences.sentence_id = disease_occurrences.sentence_id
           AND gene.gene_id = gene_occurrences.gene_id
           AND disease.disease_id = disease_occurrences.disease_id) AS tmp
 GROUP BY tmp.disease_name
 ORDER BY cnt DESC;

Row counts are:
'gene': 164657
'disease': 129923
'gene_occurrences': 10484141
'disease_occurrences': 15079045

The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is
the most frequent one.
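
(That number comes from a plain count, something like:)

SELECT count(*)
  FROM gene_occurrences
 WHERE gene_id = (SELECT gene_id FROM gene WHERE gene_name = 'igg');
-- returns 110637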

The indexes on gene_occurrences(sentence_id) and
disease_occurrences(sentence_id) are clustered.
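
(The clustering was done along these lines; the index name on
gene_occurrences is a placeholder, disease_occ_uni is the one that shows
up in the plans below:)

-- physically order the heaps by sentence_id
CLUSTER gene_occ_sentence_i ON gene_occurrences;
CLUSTER disease_occ_uni ON disease_occurrences;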

I have an alternative query which I am testing to see whether it is
better than the first one:


EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
  FROM ((SELECT gene_occurrences.sentence_id
           FROM gene_occurrences
          WHERE gene_occurrences.gene_id =
                (SELECT gene.gene_id FROM gene
                  WHERE gene.gene_name = 'igg')) AS tmp
         JOIN disease_occurrences USING (sentence_id)) AS tmp2
       NATURAL JOIN disease
 GROUP BY disease.disease_name
 ORDER BY cnt DESC;

The cost estimated by the planner is much higher, so I assumed this
query is worse than the first one. However, maybe the estimate is just
more accurate?

This is its EXPLAIN output (with the default default_statistics_target
in effect during VACUUM ANALYZE):


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=126690.02..126691.47 rows=581 width=57) (actual
time=8066.05..8067.05 rows=3364 loops=1)
    Sort Key: count(disease.disease_name)
    InitPlan
      ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1
width=4) (actual time=0.19..0.20 rows=1 loops=1)
            Index Cond: (gene_name = 'igg'::text)
    ->  Aggregate  (cost=126619.79..126663.35 rows=581 width=57) (actual
time=7894.33..8055.43 rows=3364 loops=1)
          ->  Group  (cost=126619.79..126648.83 rows=5809 width=57)
(actual time=7894.31..8020.00 rows=30513 loops=1)
                ->  Sort  (cost=126619.79..126634.31 rows=5809 width=57)
(actual time=7894.30..7910.08 rows=30513 loops=1)
                      Sort Key: disease.disease_name
                      ->  Merge Join  (cost=119314.93..126256.64
rows=5809 width=57) (actual time=6723.92..7732.94 rows=30513 loops=1)
                            Merge Cond: ("outer".disease_id =
"inner".disease_id)
                            ->  Index Scan using disease_pkey on disease
  (cost=0.00..6519.14 rows=129923 width=37) (actual time=0.04..742.20
rows=129872 loops=1)
                            ->  Sort  (cost=119314.93..119329.45
rows=5809 width=20) (actual time=6723.74..6740.24 rows=30513 loops=1)
                                  Sort Key: disease_occurrences.disease_id
                                  ->  Nested Loop  (cost=0.00..118951.78
rows=5809 width=20) (actual time=1.19..6558.67 rows=30513 loops=1)
                                        ->  Index Scan using
gene_occ_id_i on gene_occurrences  (cost=0.00..15700.31 rows=4039
width=8) (actual time=0.36..1404.64 rows=110637 loops=1)
                                              Index Cond: (gene_id = $0)
                                        ->  Index Scan using
disease_occ_uni on disease_occurrences  (cost=0.00..25.47 rows=8
width=12) (actual time=0.04..0.04 rows=0 loops=110637)
                                              Index Cond:
("outer".sentence_id = disease_occurrences.sentence_id)
  Total runtime: 8086.87 msec
(20 rows)


Strangely, the estimate is far worse after running VACUUM ANALYZE
again with default_statistics_target set to 1000:


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=12521.37..12521.61 rows=96 width=57) (actual
time=124967.47..124968.47 rows=3364 loops=1)
    Sort Key: count(disease.disease_name)
    InitPlan
      ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1
width=4) (actual time=20.27..20.28 rows=1 loops=1)
            Index Cond: (gene_name = 'igg'::text)
    ->  Aggregate  (cost=12510.99..12518.20 rows=96 width=57) (actual
time=124788.71..124956.77 rows=3364 loops=1)
          ->  Group  (cost=12510.99..12515.80 rows=961 width=57) (actual
time=124788.68..124920.10 rows=30513 loops=1)
                ->  Sort  (cost=12510.99..12513.39 rows=961 width=57)
(actual time=124788.66..124804.74 rows=30513 loops=1)
                      Sort Key: disease.disease_name
                      ->  Nested Loop  (cost=0.00..12463.35 rows=961
width=57) (actual time=164.11..124529.76 rows=30513 loops=1)
                            ->  Nested Loop  (cost=0.00..6671.06
rows=961 width=20) (actual time=148.34..120295.52 rows=30513 loops=1)
                                  ->  Index Scan using gene_occ_id_i on
gene_occurrences  (cost=0.00..2407.09 rows=602 width=8) (actual
time=20.63..1613.99 rows=110637 loops=1)
                                        Index Cond: (gene_id = $0)
                                  ->  Index Scan using disease_occ_uni
on disease_occurrences  (cost=0.00..7.06 rows=2 width=12) (actual
time=1.07..1.07 rows=0 loops=110637)
                                        Index Cond: ("outer".sentence_id
= disease_occurrences.sentence_id)
                            ->  Index Scan using disease_pkey on disease
  (cost=0.00..6.01 rows=1 width=37) (actual time=0.13..0.13 rows=1
loops=30513)
                                  Index Cond: ("outer".disease_id =
disease.disease_id)
  Total runtime: 124981.15 msec
(18 rows)

There again is the estimate of 961 rows and the decision to choose a
Nested Loop, while the actual result contains 30513 rows.
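
(To narrow this down, the join method itself could be ruled out for one
diagnostic run, e.g.:)

SET enable_nestloop = off;   -- make the planner avoid the Nested Loop
-- re-run the EXPLAIN ANALYZE of the query above here
SET enable_nestloop = on;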


Thank you for taking the time to read my postings!
Chantal



Josh Berkus wrote:
> Chantal,
>
>
>>Sort Key: disease.disease_name, disease_occurrences.sentence_id
>>->  Nested Loop  (cost=0.00..6922.38 rows=98 width=64) (actual
>>time=61.49..275047.46 rows=18910 loops=1)
>>  ->  Nested Loop  (cost=0.00..6333.23 rows=98 width=28) (actual
>>time=61.42..274313.87 rows=18910 loops=1)
>>    ->  Nested Loop  (cost=0.00..5894.04 rows=64 width=16) (actual
>>time=32.00..120617.26 rows=46849 loops=1)
>>
>>I tried to tweak the conf settings, but I think I already reached
>>quite a good value concerning shared buffers and sort mem. the
>>database is vacuum full analyzed. indexes seem fine.
>
>
> You *sure* that you've vacuum analyzed recently?   The planner above is
> choosing a bad plan because its row estimates are way off ... if the
> subquery was actually returning 98 rows, the plan above would make
> sense ... but with 18,000 rows being returned, a Nested Loop is
> suicidal.
>
> Perhaps you could post the full text of the query?  If some of your
> criteria are coming from volatile functions, then that could explain
> why the planner is so far off ...
>
> -Josh Berkus
>
>

