Thread: optimizing query

optimizing query

From: Chantal Ackermann
Date:
Hello all,

I am getting the following output from EXPLAIN for a query with joins.
The merge join uses index scans but takes too long, in my opinion. The
query is in fact only a part (a subquery) of another one, but it is the
bottleneck.

As I am quite new to query optimization and have no idea where to find
documentation on how to learn it, I am posting here in the hope that
someone will either give me tips on how to optimize, or point me to a
tutorial that could help me get along on my own.

Dropping the DISTINCT has some effect, but I can't really do without it.

Thank you
Chantal

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

relate=# EXPLAIN
SELECT DISTINCT gene.gene_name, gene_occurrences_puid.puid
FROM disease, gene, disease_occurrences_puid, gene_occurrences_puid
WHERE disease_occurrences_puid.puid = gene_occurrences_puid.puid
  AND disease.disease_id = disease_occurrences_puid.disease_id
  AND gene.gene_id = gene_occurrences_puid.gene_id;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=426503.59..436839.47 rows=137812 width=41)
   ->  Sort  (cost=426503.59..429948.88 rows=1378118 width=41)
         Sort Key: gene.gene_name, gene_occurrences_puid.puid
         ->  Hash Join  (cost=67813.96..162375.07 rows=1378118 width=41)
               Hash Cond: ("outer".disease_id = "inner".disease_id)
               ->  Merge Join  (cost=63671.50..98237.36 rows=1378118 width=37)
                     Merge Cond: ("outer".puid = "inner".puid)
                     ->  Index Scan using disease_occpd_puid_i on disease_occurrences_puid  (cost=0.00..14538.05 rows=471915 width=8)
                     ->  Sort  (cost=63671.50..64519.87 rows=339347 width=29)
                           Sort Key: gene_occurrences_puid.puid
                           ->  Merge Join  (cost=0.00..22828.18 rows=339347 width=29)
                                 Merge Cond: ("outer".gene_id = "inner".gene_id)
                                 ->  Index Scan using gene_pkey on gene  (cost=0.00..7668.59 rows=218085 width=21)
                                 ->  Index Scan using gene_id_puid_uni on gene_occurrences_puid  (cost=0.00..9525.57 rows=339347 width=8)
               ->  Hash  (cost=3167.97..3167.97 rows=164597 width=4)
                     ->  Seq Scan on disease  (cost=0.00..3167.97 rows=164597 width=4)
(16 rows)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: optimizing query

From: Stephan Szabo
Date:
(Replying to both general and performance in the hope of moving this
thread to performance after a couple of replies.)

On Wed, 22 Jan 2003, Chantal Ackermann wrote:

> I am getting the following output from EXPLAIN for a query with joins.
> The merge join uses index scans but takes too long, in my opinion. The
> query is in fact only a part (a subquery) of another one, but it is the
> bottleneck.
>
> As I am quite new to query optimization and have no idea where to find
> documentation on how to learn it, I am posting here in the hope that
> someone will either give me tips on how to optimize, or point me to a
> tutorial that could help me get along on my own.
>
> Dropping the DISTINCT has some effect, but I can't really do without it.

The first thing is: have you run ANALYZE recently to make sure the
statistics are correct, and what does EXPLAIN ANALYZE give you? (That
will actually run the query and report both the estimated and the
actual figures.)  Also, if you haven't vacuumed recently, you may want
to run VACUUM FULL.
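
In psql that looks something like this (a minimal sketch; "..." stands
for the query under test):

    ANALYZE;                      -- refresh the planner statistics
    EXPLAIN ANALYZE SELECT ...;   -- runs the query, reports estimated vs. actual
    VACUUM FULL ANALYZE;          -- also reclaims dead space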

How many rows are there in gene, disease, and both occurrences tables?
I wonder whether using explicit SQL JOIN syntax (which PostgreSQL uses
to constrain join order) to join disease and disease_occurrences_puid
before joining them to the other two would be better or worse in
practice.
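
For example, a sketch of the explicit-join variant I mean (same tables
and columns as in your query; whether it actually helps is exactly what
I'm unsure about):

    SELECT DISTINCT gene.gene_name, gene_occurrences_puid.puid
    FROM (disease JOIN disease_occurrences_puid
            ON disease.disease_id = disease_occurrences_puid.disease_id)
         JOIN gene_occurrences_puid
           ON disease_occurrences_puid.puid = gene_occurrences_puid.puid
         JOIN gene
           ON gene.gene_id = gene_occurrences_puid.gene_id;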


Re: optimizing query

From: Chantal Ackermann
Date:
Hi Stephan,

Thank you for your reply.

I ran vacuum analyze before calling explain. As this is a newly built
database where no rows have been deleted yet, I thought vacuum full
would have no effect. In fact, BEFORE running vacuum full, explain
analyze reports a runtime of 33 secs, and AFTER running it, 43 secs???
(Hey, I want at least those 10 secs back ;-) )

I have just installed this database on a "bigger" machine (see the
system info further down), and I expected the queries to run _really_
fast, especially as there is a lot more data still to be inserted into
the occurrences tables.

Here are the row counts of the tables and the output of explain analyze
before and after running vacuum full (below that, I have listed some
system and postgresql information):

relate=# select count(*) from gene;
  count
--------
  218085
(1 row)

relate=# select count(*) from disease;
  count
--------
  164597
(1 row)

relate=# select count(*) from disease_occurrences_puid;
  count
--------
  471915
(1 row)

relate=# select count(*) from gene_occurrences_puid;
  count
--------
  339347
(1 row)

relate=# EXPLAIN ANALYZE
SELECT DISTINCT gene.gene_name, gene_occurrences_puid.puid
FROM gene, disease_occurrences_puid, gene_occurrences_puid
WHERE disease_occurrences_puid.puid = gene_occurrences_puid.puid
  AND gene.gene_id = gene_occurrences_puid.gene_id;

                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=342175.89..352511.77 rows=137812 width=33) (actual time=32112.66..33139.23 rows=219435 loops=1)
   ->  Sort  (cost=342175.89..345621.18 rows=1378118 width=33) (actual time=32112.65..32616.14 rows=695158 loops=1)
         Sort Key: gene.gene_name, gene_occurrences_puid.puid
         ->  Merge Join  (cost=63671.50..98237.36 rows=1378118 width=33) (actual time=10061.83..17940.02 rows=695158 loops=1)
               Merge Cond: ("outer".puid = "inner".puid)
               ->  Index Scan using disease_occpd_puid_i on disease_occurrences_puid  (cost=0.00..14538.05 rows=471915 width=4) (actual time=0.03..3917.99 rows=471915 loops=1)
               ->  Sort  (cost=63671.50..64519.87 rows=339347 width=29) (actual time=10061.69..10973.64 rows=815068 loops=1)
                     Sort Key: gene_occurrences_puid.puid
                     ->  Merge Join  (cost=0.00..22828.18 rows=339347 width=29) (actual time=0.21..3760.59 rows=339347 loops=1)
                           Merge Cond: ("outer".gene_id = "inner".gene_id)
                           ->  Index Scan using gene_pkey on gene  (cost=0.00..7668.59 rows=218085 width=21) (actual time=0.02..955.19 rows=218073 loops=1)
                           ->  Index Scan using gene_id_puid_uni on gene_occurrences_puid  (cost=0.00..9525.57 rows=339347 width=8) (actual time=0.02..1523.81 rows=339347 loops=1)
 Total runtime: 33244.81 msec
(13 rows)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
AFTER

relate=# vacuum full verbose analyze;

relate=# EXPLAIN ANALYZE
SELECT DISTINCT gene.gene_name, gene_occurrences_puid.puid
FROM gene, disease_occurrences_puid, gene_occurrences_puid
WHERE disease_occurrences_puid.puid = gene_occurrences_puid.puid
  AND gene.gene_id = gene_occurrences_puid.gene_id;

                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=359069.64..369948.41 rows=145050 width=33) (actual time=42195.60..43229.04 rows=219435 loops=1)
   ->  Sort  (cost=359069.64..362695.90 rows=1450503 width=33) (actual time=42195.59..42694.70 rows=695158 loops=1)
         Sort Key: gene.gene_name, gene_occurrences_puid.puid
         ->  Merge Join  (cost=63732.51..99264.24 rows=1450503 width=33) (actual time=13172.40..27973.79 rows=695158 loops=1)
               Merge Cond: ("outer".puid = "inner".puid)
               ->  Index Scan using disease_occpd_puid_i on disease_occurrences_puid  (cost=0.00..14543.06 rows=471915 width=4) (actual time=36.50..10916.29 rows=471915 loops=1)
               ->  Sort  (cost=63732.51..64580.88 rows=339347 width=29) (actual time=13126.56..14048.38 rows=815068 loops=1)
                     Sort Key: gene_occurrences_puid.puid
                     ->  Merge Join  (cost=0.00..22889.19 rows=339347 width=29) (actual time=58.00..6775.55 rows=339347 loops=1)
                           Merge Cond: ("outer".gene_id = "inner".gene_id)
                           ->  Index Scan using gene_pkey on gene  (cost=0.00..7739.91 rows=218085 width=21) (actual time=29.00..3416.01 rows=218073 loops=1)
                           ->  Index Scan using gene_id_puid_uni on gene_occurrences_puid  (cost=0.00..9525.57 rows=339347 width=8) (actual time=28.69..1936.83 rows=339347 loops=1)
 Total runtime: 43338.94 msec

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Postgres Version: 7.3.1
CPU: 1666.767 MHz
RAM: 2070492 kB
shmmax/shmall: 1048576000

postgresql.conf:
shared_buffers: 121600
max_connections: 64
max_fsm_relations = 200
max_fsm_pages = 40000
effective_cache_size = 8000

********************************************************************

Thank you again for your interest and help!

Chantal




Re: optimizing query

From: Stephan Szabo
Date:
On Thu, 23 Jan 2003, Chantal Ackermann wrote:

> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Postgres Version: 7.3.1
> CPU: 1666.767 MHz
> RAM: 2070492 kB
> shmmax/shmall: 1048576000
>
> postgresql.conf:
> shared_buffers: 121600
> max_connections: 64
> max_fsm_relations = 200
> max_fsm_pages = 40000
> effective_cache_size = 8000
>
> ********************************************************************

Hmm, how many pages are in the various tables (do a VACUUM VERBOSE
<table> on each of them), and what is sort_mem set to?  It's picking
the index scans to get the tables in sorted order, but I wonder if
that's really the best plan, given that it's reading a large portion of
the tables.

Hmm, what does it do if you set enable_indexscan = off?
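
That is, as a session-local experiment:

    SET enable_indexscan = off;
    EXPLAIN ANALYZE SELECT ...;   -- your query again
    SET enable_indexscan = on;    -- restore the default afterwards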


Re: [PERFORM] optimizing query

From: Tom Lane
Date:
Chantal Ackermann <chantal.ackermann@biomax.de> writes:
>  Unique  (cost=359069.64..369948.41 rows=145050 width=33) (actual time=42195.60..43229.04 rows=219435 loops=1)
>    ->  Sort  (cost=359069.64..362695.90 rows=1450503 width=33) (actual time=42195.59..42694.70 rows=695158 loops=1)
>          Sort Key: gene.gene_name, gene_occurrences_puid.puid
>          ->  Merge Join  (cost=63732.51..99264.24 rows=1450503 width=33) (actual time=13172.40..27973.79 rows=695158 loops=1)
>                Merge Cond: ("outer".puid = "inner".puid)
>                ->  Index Scan using disease_occpd_puid_i on disease_occurrences_puid  (cost=0.00..14543.06 rows=471915 width=4) (actual time=36.50..10916.29 rows=471915 loops=1)
>                ->  Sort  (cost=63732.51..64580.88 rows=339347 width=29) (actual time=13126.56..14048.38 rows=815068 loops=1)
>                      Sort Key: gene_occurrences_puid.puid
>                      ->  Merge Join  (cost=0.00..22889.19 rows=339347 width=29) (actual time=58.00..6775.55 rows=339347 loops=1)
>                            Merge Cond: ("outer".gene_id = "inner".gene_id)
>                            ->  Index Scan using gene_pkey on gene  (cost=0.00..7739.91 rows=218085 width=21) (actual time=29.00..3416.01 rows=218073 loops=1)
>                            ->  Index Scan using gene_id_puid_uni on gene_occurrences_puid  (cost=0.00..9525.57 rows=339347 width=8) (actual time=28.69..1936.83 rows=339347 loops=1)
>  Total runtime: 43338.94 msec

Seems like most of the time is going into the sort steps.

> postgresql.conf:
> shared_buffers: 121600
> max_connections: 64
> max_fsm_relations = 200
> max_fsm_pages = 40000
> effective_cache_size = 8000

Try increasing sort_mem.
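
For example (sort_mem is measured in kB, and 32768 here is only an
illustrative value, not a specific recommendation):

    SET sort_mem = 32768;   -- 32 MB, for the current session only
    -- or change the server-wide default in postgresql.conf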

Also, I'd back off on shared_buffers if I were you.  There's no evidence
that values above a few thousand buy anything.

            regards, tom lane

Re: [PERFORM] optimizing query

From: Chantal Ackermann
Date:
Hi Stephan, hi Tom,

sort_mem was at its default of 1024. I increased it, and the query takes
even longer (~36 secs). I tried two different values, 4096 and 8192; for
the last run I also reduced shared_buffers to 25600 (--> ~37 secs).
Another point: after a vacuum, the cost estimate increases slightly.

Would it help to cluster the tables on an index? As I am using several
indexes, I find it difficult to decide which one to cluster on.
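
For instance, a hypothetical example of what I mean, clustering the
occurrences table on its puid index (whether this is the right index to
pick is exactly my question):

    CLUSTER disease_occpd_puid_i ON disease_occurrences_puid;
    ANALYZE disease_occurrences_puid;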

(I paste the output of vacuum full verbose analyze below.)

Thanks!
Chantal


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

INFO:  --Relation public.disease_occurrences_puid--
INFO:  Pages 2079: Changed 0, reaped 0, Empty 0, New 0; Tup 471915: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 32, MaxLen 32; Re-using: Free/Avail. Space 648/648; EndEmpty/Avail. Pages 0/1.
        CPU 0.02s/0.05u sec elapsed 0.07 sec.
INFO:  Index disease_occpd_puid_i: Pages 1036; Tuples 471915.
        CPU 0.00s/0.03u sec elapsed 0.03 sec.
INFO:  Index disease_id_puid_uni: Pages 1297; Tuples 471915.
        CPU 0.03s/0.05u sec elapsed 0.23 sec.
INFO:  Rel disease_occurrences_puid: Pages: 2079 --> 2079; Tuple(s) moved: 0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing public.disease_occurrences_puid

INFO:  --Relation public.gene_occurrences_puid--
INFO:  Pages 1495: Changed 0, reaped 0, Empty 0, New 0; Tup 339347: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 32, MaxLen 32; Re-using: Free/Avail. Space 648/648; EndEmpty/Avail. Pages 0/1.
        CPU 0.01s/0.04u sec elapsed 0.05 sec.
INFO:  Index gene_occpd_puid_i: Pages 746; Tuples 339347.
        CPU 0.01s/0.02u sec elapsed 0.03 sec.
INFO:  Index gene_id_puid_uni: Pages 934; Tuples 339347.
        CPU 0.00s/0.02u sec elapsed 0.02 sec.
INFO:  Rel gene_occurrences_puid: Pages: 1495 --> 1495; Tuple(s) moved: 0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing public.gene_occurrences_puid

INFO:  --Relation public.disease--
INFO:  Pages 1522: Changed 0, reaped 0, Empty 0, New 0; Tup 164597: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 44, MaxLen 232; Re-using: Free/Avail. Space 56920/38388; EndEmpty/Avail. Pages 0/603.
        CPU 0.00s/0.04u sec elapsed 0.04 sec.
INFO:  Index disease_name_i: Pages 1076; Tuples 164597.
        CPU 0.05s/0.02u sec elapsed 0.18 sec.
INFO:  Index disease_pkey: Pages 364; Tuples 164597.
        CPU 0.00s/0.00u sec elapsed 0.03 sec.
INFO:  Index disease_uni: Pages 1168; Tuples 164597.
        CPU 0.08s/0.04u sec elapsed 0.22 sec.
INFO:  Rel disease: Pages: 1522 --> 1521; Tuple(s) moved: 75.
        CPU 0.00s/0.03u sec elapsed 0.04 sec.
INFO:  Index disease_name_i: Pages 1077; Tuples 164597: Deleted 75.
        CPU 0.00s/0.03u sec elapsed 0.03 sec.
INFO:  Index disease_pkey: Pages 364; Tuples 164597: Deleted 75.
        CPU 0.01s/0.02u sec elapsed 0.02 sec.
INFO:  Index disease_uni: Pages 1168; Tuples 164597: Deleted 75.
        CPU 0.00s/0.03u sec elapsed 0.03 sec.
INFO:  --Relation pg_toast.pg_toast_7114632--
INFO:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Index pg_toast_7114632_index: Pages 1; Tuples 0.
        CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  Analyzing public.disease

INFO:  --Relation public.gene--
INFO:  Pages 1566: Changed 0, reaped 0, Empty 0, New 0; Tup 218085: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 44, MaxLen 348; Re-using: Free/Avail. Space 48692/25408; EndEmpty/Avail. Pages 0/365.
        CPU 0.01s/0.04u sec elapsed 0.04 sec.
INFO:  Index gene_pkey: Pages 481; Tuples 218085.
        CPU 0.00s/0.01u sec elapsed 0.01 sec.
INFO:  Index gene_uni: Pages 1038; Tuples 218085.
        CPU 0.04s/0.01u sec elapsed 0.19 sec.
INFO:  Index gene_name_uni: Pages 917; Tuples 218085.
        CPU 0.06s/0.00u sec elapsed 0.15 sec.
INFO:  Rel gene: Pages: 1566 --> 1564; Tuple(s) moved: 230.
        CPU 0.01s/0.06u sec elapsed 0.11 sec.
INFO:  Index gene_pkey: Pages 482; Tuples 218085: Deleted 230.
        CPU 0.00s/0.03u sec elapsed 0.02 sec.
INFO:  Index gene_uni: Pages 1041; Tuples 218085: Deleted 230.
        CPU 0.00s/0.04u sec elapsed 0.03 sec.
INFO:  Index gene_name_uni: Pages 918; Tuples 218085: Deleted 230.
        CPU 0.00s/0.04u sec elapsed 0.03 sec.
INFO:  --Relation pg_toast.pg_toast_7114653--
INFO:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Index pg_toast_7114653_index: Pages 1; Tuples 0.
        CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  Analyzing public.gene

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [PERFORM] optimizing query

From: Bruce Momjian
Date:
Tom Lane wrote:
> > postgresql.conf:
> > shared_buffers: 121600
> > max_connections: 64
> > max_fsm_relations = 200
> > max_fsm_pages = 40000
> > effective_cache_size = 8000
>
> Try increasing sort_mem.
>
> Also, I'd back off on shared_buffers if I were you.  There's no evidence
> that values above a few thousand buy anything.

Increasing shared_buffers above several thousand will only be a win if
your entire working set fits in the larger buffer pool but didn't fit
at the previous size.  If your working set is smaller or larger than
that, pushing it above several thousand isn't a win.  Is that a more
definitive answer?
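
(For scale: the 121600 shared buffers above, at the default 8 kB per
buffer, come to roughly 950 MB, nearly half of the machine's 2 GB of
RAM.)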

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073