Обсуждение: Ye olde slow query

Поиск
Список
Период
Сортировка

Ye olde slow query

От
"Murphy, Kevin"
Дата:
Synopsis: 8-table join with one "WHERE foo IN (...)" condition; works OK with fewer
than 5 items in the IN list, but at N=5, the planner starts using a compound index
for the first time that completely kills performance (5-6 minutes versus 0-12 seconds).
I'm interested in learning what plays a role in this switch of plans (or the
unanticipated relative slowness of the N=5 plan).  TIA for any wisdom; I've finally
made a commitment to really delve into PG.  -Kevin

1. Queries and plans
2. Answers to standard questions as per
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
3. Tables

1. Queries and plans

The "fast" query, with 4 elements in the IN list.

EXPLAIN (ANALYZE, BUFFERS) SELECT
  COUNT(DISTINCT "core_person"."id")
FROM "core_person"
  INNER JOIN "core_sample" ON ("core_person"."id" = "core_sample"."person_id")
  INNER JOIN "sample" ON ("core_sample"."varify_sample_id" = "sample"."id")
  INNER JOIN "sample_result" ON ("sample"."id" = "sample_result"."sample_id")
  INNER JOIN "variant" ON ("sample_result"."variant_id" = "variant"."id")
  INNER JOIN "variant_effect"
    ON ("variant"."id" = "variant_effect"."variant_id")
  INNER JOIN "transcript"
    ON ("variant_effect"."transcript_id" = "transcript"."id")
  INNER JOIN "gene" ON ("transcript"."gene_id" = "gene"."id")
WHERE "gene"."symbol" IN ('CFC1', 'PROSIT240', 'ZFPM2/FOG2', 'NKX2.5');

http://explain.depesz.com/s/Wul

Aggregate  (cost=287383.44..287383.45 rows=1 width=4) (actual time=674.434..674.434 rows=1 loops=1)
  Buffers: shared hit=908 read=412
  ->  Nested Loop  (cost=3530.40..287383.44 rows=1 width=4) (actual time=674.414..674.414 rows=0 loops=1)
        Buffers: shared hit=908 read=412
        ->  Nested Loop  (cost=3530.40..287379.14 rows=1 width=12) (actual time=674.413..674.413 rows=0 loops=1)
              Buffers: shared hit=908 read=412
              ->  Hash Join  (cost=3530.40..287375.56 rows=1 width=12) (actual time=674.413..674.413 rows=0 loops=1)
                    Hash Cond: (sample_result.sample_id = core_sample.varify_sample_id)
                    Buffers: shared hit=908 read=412
                    ->  Nested Loop  (cost=4.32..283545.98 rows=80929 width=12) (actual time=163.609..571.237 rows=102
loops=1)
                          Buffers: shared hit=419 read=63
                          ->  Nested Loop  (cost=4.32..3426.09 rows=471 width=4) (actual time=93.595..112.404 rows=85
loops=1)
                                Buffers: shared hit=19 read=21
                                ->  Nested Loop  (cost=4.32..140.18 rows=17 width=4) (actual time=28.280..46.051 rows=4
loops=1)
                                      Buffers: shared hit=5 read=10
                                      ->  Index Scan using gene_symbol on gene  (cost=0.00..30.79 rows=4 width=4)
(actualtime=28.210..45.938 rows=1 loops=1) 
                                            Index Cond: ((symbol)::text = ANY
('{CFC1,PROSIT240,ZFPM2/FOG2,NKX2.5}'::text[]))
                                            Buffers: shared hit=3 read=7
                                      ->  Bitmap Heap Scan on transcript  (cost=4.32..27.29 rows=6 width=8) (actual
time=0.066..0.106rows=4 loops=1) 
                                            Recheck Cond: (gene_id = gene.id)
                                            Buffers: shared hit=2 read=3
                                            ->  Bitmap Index Scan on transcript_gene_id  (cost=0.00..4.32 rows=6
width=0)(actual time=0.049..0.049 rows=4 loops=1) 
                                                  Index Cond: (gene_id = gene.id)
                                                  Buffers: shared hit=2 read=1
                                ->  Index Scan using variant_effect_transcript_id on variant_effect  (cost=0.00..191.83
rows=146width=8) (actual time=16.345..16.582 rows=21 loops=4) 
                                      Index Cond: (transcript_id = transcript.id)
                                      Buffers: shared hit=14 read=11
                          ->  Index Scan using sample_result_variant_id on sample_result  (cost=0.00..593.01 rows=172
width=8)(actual time=5.147..5.397 rows=1 loops=85) 
                                Index Cond: (variant_id = variant_effect.variant_id)
                                Buffers: shared hit=400 read=42
                    ->  Hash  (cost=3525.76..3525.76 rows=26 width=12) (actual time=103.125..103.125 rows=1129 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 49kB
                          Buffers: shared hit=489 read=349
                          ->  Merge Join  (cost=228.11..3525.76 rows=26 width=12) (actual time=57.236..102.752
rows=1129loops=1) 
                                Merge Cond: (core_sample.varify_sample_id = sample.id)
                                Buffers: shared hit=489 read=349
                                ->  Index Scan using core_sample_varify_sample_id on core_sample  (cost=0.00..347661.45
rows=119344width=8) (actual time=0.005..44.699 rows=1130 loops=1) 
                                      Buffers: shared hit=484 read=312
                                ->  Sort  (cost=220.25..227.02 rows=2705 width=4) (actual time=56.997..57.214 rows=2701
loops=1)
                                      Sort Key: sample.id
                                      Sort Method: quicksort  Memory: 223kB
                                      Buffers: shared hit=5 read=37
                                      ->  Seq Scan on sample  (cost=0.00..66.05 rows=2705 width=4) (actual
time=0.549..56.245rows=2705 loops=1) 
                                            Buffers: shared hit=2 read=37
              ->  Index Only Scan using core_person_pkey on core_person  (cost=0.00..3.58 rows=1 width=4) (never
executed)
                    Index Cond: (id = core_sample.person_id)
                    Heap Fetches: 0
        ->  Index Only Scan using variant_pkey on variant  (cost=0.00..4.29 rows=1 width=4) (never executed)
              Index Cond: (id = sample_result.variant_id)
              Heap Fetches: 0
Total runtime: 674.797 ms

The "slow" query with 5 elements in IN list:

EXPLAIN (ANALYZE, BUFFERS) SELECT
  COUNT(DISTINCT "core_person"."id")
FROM "core_person"
  INNER JOIN "core_sample" ON ("core_person"."id" = "core_sample"."person_id")
  INNER JOIN "sample" ON ("core_sample"."varify_sample_id" = "sample"."id")
  INNER JOIN "sample_result" ON ("sample"."id" = "sample_result"."sample_id")
  INNER JOIN "variant" ON ("sample_result"."variant_id" = "variant"."id")
  INNER JOIN "variant_effect"
    ON ("variant"."id" = "variant_effect"."variant_id")
  INNER JOIN "transcript"
    ON ("variant_effect"."transcript_id" = "transcript"."id")
  INNER JOIN "gene" ON ("transcript"."gene_id" = "gene"."id")
WHERE "gene"."symbol" IN ('CFC1', 'PROSIT240', 'ZFPM2/FOG2', 'NKX2.5', 'ZIC3');

http://explain.depesz.com/s/BikZ

                                                                                      QUERY PLAN
                                                           \ 


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\
---
 Aggregate  (cost=293669.97..293669.98 rows=1 width=4) (actual time=404443.253..404443.253 rows=1 loops=1)
   Buffers: shared hit=95972873 read=1888636
   ->  Nested Loop  (cost=4341.32..293669.97 rows=1 width=4) (actual time=1270.642..404431.172 rows=19193 loops=1)
         Buffers: shared hit=95972867 read=1888636
         ->  Nested Loop  (cost=4341.32..293665.67 rows=1 width=12) (actual time=1243.095..403775.844 rows=19193
loops=1)
               Buffers: shared hit=95915300 read=1888623
               ->  Hash Join  (cost=4341.32..293662.08 rows=1 width=12) (actual time=1227.121..403667.061 rows=19193
loops=1)
                     Hash Cond: (sample_result.variant_id = variant_effect.variant_id)
                     Buffers: shared hit=95876819 read=1888598
                     ->  Nested Loop  (cost=99.86..289414.83 rows=1542 width=8) (actual time=94.839..340982.730
rows=690103508loops=1) 
                           Buffers: shared hit=95876766 read=1888588
                           ->  Hash Join  (cost=99.86..3605.46 rows=26 width=12) (actual time=1.483..323.089 rows=1129
loops=1)
                                 Hash Cond: (core_sample.varify_sample_id = sample.id)
                                 Buffers: shared hit=351 read=1254
                                 ->  Seq Scan on core_sample  (cost=0.00..2759.44 rows=119344 width=8) (actual
time=0.009..309.402rows=119344 loops=1) 
                                       Buffers: shared hit=312 read=1254
                                 ->  Hash  (cost=66.05..66.05 rows=2705 width=4) (actual time=1.227..1.227 rows=2705
loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 96kB
                                       Buffers: shared hit=39
                                       ->  Seq Scan on sample  (cost=0.00..66.05 rows=2705 width=4) (actual
time=0.008..0.691rows=2705 loops=1) 
                                             Buffers: shared hit=39
                           ->  Index Only Scan using sample_variant_idx on sample_result  (cost=0.00..8220.58
rows=277209width=8) (actual time=3.469..218.524 rows=611252 loops=112\ 
9)
                                 Index Cond: (sample_id = core_sample.varify_sample_id)
                                 Heap Fetches: 0
                                 Buffers: shared hit=95876415 read=1887334
                     ->  Hash  (cost=4234.10..4234.10 rows=589 width=4) (actual time=326.003..326.003 rows=140 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 5kB
                           Buffers: shared hit=53 read=10
                           ->  Nested Loop  (cost=4.32..4234.10 rows=589 width=4) (actual time=0.083..325.953 rows=140
loops=1)
                                 Buffers: shared hit=53 read=10
                                 ->  Nested Loop  (cost=4.32..175.03 rows=21 width=4) (actual time=0.052..234.362
rows=8loops=1) 
                                       Buffers: shared hit=18 read=5
                                       ->  Index Scan using gene_symbol on gene  (cost=0.00..38.29 rows=5 width=4)
(actualtime=0.023..0.097 rows=2 loops=1) 
                                             Index Cond: ((symbol)::text = ANY
('{CFC1,PROSIT240,ZFPM2/FOG2,NKX2.5,ZIC3}'::text[]))
                                             Buffers: shared hit=12 read=1
                                       ->  Bitmap Heap Scan on transcript  (cost=4.32..27.29 rows=6 width=8) (actual
time=106.303..117.126rows=4 loops=2) 
                                             Recheck Cond: (gene_id = gene.id)
                                             Buffers: shared hit=6 read=4
                                             ->  Bitmap Index Scan on transcript_gene_id  (cost=0.00..4.32 rows=6
width=0)(actual time=93.564..93.564 rows=4 loops=2) 
                                                   Index Cond: (gene_id = gene.id)
                                                   Buffers: shared hit=4 read=2
                                 ->  Index Scan using variant_effect_transcript_id on variant_effect
(cost=0.00..191.83rows=146 width=8) (actual time=7.285..11.445 rows=18 loops=\ 
8)
                                       Index Cond: (transcript_id = transcript.id)
                                       Buffers: shared hit=35 read=5
               ->  Index Only Scan using core_person_pkey on core_person  (cost=0.00..3.58 rows=1 width=4) (actual
time=0.004..0.004rows=1 loops=19193) 
                     Index Cond: (id = core_sample.person_id)
                     Heap Fetches: 0
                     Buffers: shared hit=38481 read=25
         ->  Index Only Scan using variant_pkey on variant  (cost=0.00..4.29 rows=1 width=4) (actual time=0.033..0.033
rows=1loops=19193) 
               Index Cond: (id = sample_result.variant_id)
               Heap Fetches: 0
               Buffers: shared hit=57567 read=13
 Total runtime: 404443.608 ms


2. Answers to standard questions as per
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

A description of what you are trying to achieve and what results you
expect:

  Ideally, I'd like this query to be usable for a couple dozen terms.
  (This may not be realistic given the current table layout and
  hardware?) If I drop the problem index, the query finishes in 1.5
  minutes for 17 gene symbols, which is ... better.

  FWIW, my observations:
  1. The disk is slow on this system (60-75 MB/sec dd seq read time to
  /dev/null with bs=8k); I'm not sure if the cost constants need
  adjusting.
  2. The plan changes at N=5 to introduce an index-only scan on
  sample_variant_idx which is 16 GB (box has 32 GB RAM). This index on
  the sample_result table is a compound index on foreign keys to the
  sample and variant tables that are often joined to the sample_result
  table (as in this query).
  3. If I run the query in a transaction that drops the
  sample_variant_idx first, a fast plan is chosen. It's almost as if
  the planner is so pleased with itself for having noticed that it can
  use that compound index instead of the individual foreign key indexes
  that it throws caution to the winds ;-)
  4. The sample_result table is large-ish (748M rows; 145 GB; 312
  GB incl extras) and sits in the middle of this join.

  What I tried so far:
  1. Changed statistics target.  At first this query was unusable even
  for N=1 because n_distinct was 264,475 on an involved column when it
  should have been 4,356,805.  I increased the statistics target from
  1,000 to 5,000, which brought n_distinct for that column up to
  653,662.  (I understand that an overly large statistics target can
  negatively affect plan times, and those are indeed in the vicinity of
  400 msec now for typical queries.  I should probably decrease.)
  2. Learned how to coerce n_distinct, at which point the
  query started running much faster.  As an experiment, I have coerced
  n_distinct for all the foreign key columns involved in the join.
  3. Increased effective_cache_size to larger than memory and decreased it
  to 12GB, neither of which caused a good plan to be used.
  4. Tried GEQO, which never came up with the dud plan involving
  sample_variant_idx; it doesn't seem quite kosher to plan all queries
  with GEQO, though, and our queries are automatically constructed by a
  query builder, so at the moment I don't have the ability to apply
  custom tweaks for individual queries ....
  5. Temporarily dropped the sample_variant_idx, as mentioned above.
  I'm not sure yet if it's a good idea to do away with this altogether.

PostgreSQL version number you are running:

 PostgreSQL 9.2.7 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit

How you installed PostgreSQL:

  PGDG yum repo

Changes made to the settings in the postgresql.conf file:  see Server
Configuration for a quick way to list them all.

     checkpoint_segments        | 32                 | configuration file
     default_statistics_target  | 5000               | configuration file
     effective_cache_size       | 24GB               | configuration file
     log_planner_stats          | on                 | configuration file
     shared_buffers             | 8GB                | configuration file
     work_mem                   | 150MB              | configuration file

Operating system and version:

  RHEL 6.4 - VM with kind of crappy SAN disk storage Linux
  resrhvardb01d.research.chop.edu 2.6.32-358.6.2.el6.x86_64 #1 SMP Tue
  May 14 15:48:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

What program you're using to connect to PostgreSQL:

  psql for my tests; psycopg2 Python driver for app

Is there anything relevant or unusual in the PostgreSQL server logs?:

  No

CPU manufacturer and model, eg "AMD Athlon X2" or "Intel Core 2 Duo"

  VM, but /proc/cpuinfo shows two Intel(R) Xeon(R) CPU X5670 @ 2.93GHz

Amount and size of RAM installed, eg "2GB RAM"

  32 GB RAM

Storage details (important for performance and corruption questions)

  Don't know (yet).  Some kind of SAN.  Our IT people manage this VM.
  We will be getting dedicated hardware in the near future.

  Using dd with an 8k blocksize, I see sequential read performance on
  uncached files of typically 60-74 MB/s.


3. Tables

gene table: 51,254 rows
                                                     Table "public.gene"
 Column  |          Type          |                     Modifiers                     | Storage  | Stats target |
Description

---------+------------------------+---------------------------------------------------+----------+--------------+-------------
 id      | integer                | not null default nextval('gene_id_seq'::regclass) | plain    |              |
 chr_id  | integer                | not null                                          | plain    |              |
 symbol  | character varying(255) | not null                                          | extended | 10000        |
 name    | text                   | not null                                          | extended |              |
 hgnc_id | integer                |                                                   | plain    |              |
Indexes:
    "gene_pkey" PRIMARY KEY, btree (id)
    "symbol_unique" UNIQUE CONSTRAINT, btree (symbol)
    "gene_chr_id" btree (chr_id)
    "gene_symbol" btree (symbol)
    "gene_symbol_like" btree (symbol varchar_pattern_ops)
Foreign-key constraints:
    "gene_chr_id_fkey" FOREIGN KEY (chr_id) REFERENCES chromosome(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "exon" CONSTRAINT "exon_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "gene_detail" CONSTRAINT "gene_detail_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "gene_phenotype" CONSTRAINT "gene_id_refs_id_1a19729a" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "gene_pubmed" CONSTRAINT "gene_id_refs_id_8e5839cd" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "gene_families" CONSTRAINT "gene_id_refs_id_9de0e4fb" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "gene_synonym" CONSTRAINT "gene_id_refs_id_b2bbb6ef" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "geneset_setobject" CONSTRAINT "geneset_setobject_object_id_fkey" FOREIGN KEY (object_id) REFERENCES gene(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "transcript" CONSTRAINT "transcript_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "variant_effect" CONSTRAINT "variant_effect_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "variant" CONSTRAINT "variant_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE INITIALLY
DEFERRED
Has OIDs: no

transcript table: 215,533 rows
                                                           Table "public.transcript"
       Column        |          Type          |                        Modifiers                        | Storage  |
Statstarget | Description 

---------------------+------------------------+---------------------------------------------------------+----------+--------------+-------------
 id                  | integer                | not null default nextval('transcript_id_seq'::regclass) | plain    |
         | 
 strand              | character varying(1)   |                                                         | extended |
         | 
 start               | integer                |                                                         | plain    |
         | 
 end                 | integer                |                                                         | plain    |
         | 
 coding_start        | integer                |                                                         | plain    |
         | 
 coding_end          | integer                |                                                         | plain    |
         | 
 coding_start_status | character varying(20)  |                                                         | extended |
         | 
 coding_end_status   | character varying(20)  |                                                         | extended |
         | 
 exon_count          | integer                |                                                         | plain    |
         | 
 refseq_id           | character varying(100) | not null                                                | extended |
         | 
 gene_id             | integer                |                                                         | plain    |
         | 
Indexes:
    "transcript_pkey" PRIMARY KEY, btree (id)
    "transcript_gene_id" btree (gene_id)
    "transcript_pkey_gene" btree (id, gene_id)
    "transcript_refseq_gene" btree (refseq_id, gene_id)
Foreign-key constraints:
    "transcript_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "transcript_exon" CONSTRAINT "transcript_id_refs_id_e2bf7f41" FOREIGN KEY (transcript_id) REFERENCES
transcript(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "variant_effect" CONSTRAINT "variant_effect_transcript_id_fkey" FOREIGN KEY (transcript_id) REFERENCES
transcript(id)DEFERRABLE INITIALLY DEFERRED 
Has OIDs: no

variant_effect table: 8,140,067 rows
                                                           Table "public.variant_effect"
       Column        |          Type          |                          Modifiers                          | Storage
|Stats target | Description 

---------------------+------------------------+-------------------------------------------------------------+----------+--------------+-------------
 id                  | integer                | not null default nextval('variant_effect_id_seq'::regclass) | plain
|             | 
 variant_id          | integer                |                                                             | plain
|             | 
 codon_change        | text                   |                                                             | extended
|             | 
 amino_acid_change   | text                   |                                                             | extended
|             | 
 exon_id             | integer                |                                                             | plain
|             | 
 transcript_id       | integer                |                                                             | plain
|             | 
 gene_id             | integer                |                                                             | plain
|             | 
 effect_id           | integer                |                                                             | plain
|             | 
 functional_class_id | integer                |                                                             | plain
|             | 
 hgvs_c              | character varying(200) |                                                             | extended
|             | 
 hgvs_p              | character varying(200) |                                                             | extended
|             | 
 segment             | character varying(200) |                                                             | extended
|             | 
Indexes:
    "variant_effect_pkey" UNIQUE, btree (id)
    "variant_effect_effect_id" btree (effect_id)
    "variant_effect_exon_id" btree (exon_id)
    "variant_effect_functional_class_id" btree (functional_class_id)
    "variant_effect_gene_id" btree (gene_id)
    "variant_effect_hgvs_c" btree (hgvs_c)
    "variant_effect_hgvs_c_like" btree (hgvs_c varchar_pattern_ops)
    "variant_effect_hgvs_p" btree (hgvs_p)
    "variant_effect_hgvs_p_like" btree (hgvs_p varchar_pattern_ops)
    "variant_effect_transcript_id" btree (transcript_id)
    "variant_effect_variant_id" btree (variant_id)
    "variant_effect_variant_transcript" btree (variant_id, transcript_id)
Foreign-key constraints:
    "variant_effect_effect_id_fkey" FOREIGN KEY (effect_id) REFERENCES effect(id) DEFERRABLE INITIALLY DEFERRED
    "variant_effect_exon_id_fkey" FOREIGN KEY (exon_id) REFERENCES exon(id) DEFERRABLE INITIALLY DEFERRED
    "variant_effect_functional_class_id_fkey" FOREIGN KEY (functional_class_id) REFERENCES variant_functional_class(id)
DEFERRABLEINITIALLY DEFERRED 
    "variant_effect_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE INITIALLY DEFERRED
    "variant_effect_transcript_id_fkey" FOREIGN KEY (transcript_id) REFERENCES transcript(id) DEFERRABLE INITIALLY
DEFERRED
Has OIDs: no

variant table: 6,132,722 rows; actually not used because of variant_effect.variant_id and sample_result.variant_id

                                                     Table "public.variant"
  Column  |         Type          |                      Modifiers                       | Storage  | Stats target |
Description

----------+-----------------------+------------------------------------------------------+----------+--------------+-------------
 id       | integer               | not null default nextval('variant_id_seq'::regclass) | plain    |              |
 chr_id   | integer               | not null                                             | plain    |              |
 pos      | integer               | not null                                             | plain    |              |
 ref      | text                  | not null                                             | extended |              |
 alt      | text                  | not null                                             | extended |              |
 md5      | character varying(32) | not null                                             | extended |              |
 rsid     | text                  |                                                      | extended |              |
 type_id  | integer               |                                                      | plain    |              |
 liftover | boolean               |                                                      | plain    |              |
 gene_id  | integer               |                                                      | plain    |              |
Indexes:
    "variant_chr_id_pos_ref_alt_key" UNIQUE, btree (chr_id, pos, ref, alt)
    "variant_pkey" UNIQUE, btree (id)
    "variant_alt" btree (alt)
    "variant_alt_like" btree (alt text_pattern_ops)
    "variant_chr_id" btree (chr_id)
    "variant_md5" btree (md5)
    "variant_ref" btree (ref)
    "variant_ref_like" btree (ref text_pattern_ops)
    "variant_rsid" btree (rsid)
    "variant_type_id" btree (type_id)
Foreign-key constraints:
    "variant_chr_id_fkey" FOREIGN KEY (chr_id) REFERENCES chromosome(id) DEFERRABLE INITIALLY DEFERRED
    "variant_gene_id_fkey" FOREIGN KEY (gene_id) REFERENCES gene(id) DEFERRABLE INITIALLY DEFERRED
    "variant_type_id_fkey" FOREIGN KEY (type_id) REFERENCES variant_type(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "sample_result" CONSTRAINT "variant_id_refs_id_313c30dea59a86e8" FOREIGN KEY (variant_id) REFERENCES
variant(id)
Has OIDs: no

sample_result table: 748,183,031 rows; 145 GB; 312 GB incl indexes and toast)
                                                              Table "public.sample_result"
         Column          |           Type           |                         Modifiers                          |
Storage | Stats target | Description 

-------------------------+--------------------------+------------------------------------------------------------+----------+--------------+-------------
 id                      | integer                  | not null default nextval('sample_result_id_seq'::regclass) |
plain   |              | 
 notes                   | text                     |                                                            |
extended|              | 
 created                 | timestamp with time zone | not null                                                   |
plain   |              | 
 modified                | timestamp with time zone | not null                                                   |
plain   |              | 
 sample_id               | integer                  | not null                                                   |
plain   |              | 
 variant_id              | integer                  | not null                                                   |
plain   |              | 
 quality                 | double precision         |                                                            |
plain   |              | 
 read_depth              | integer                  |                                                            |
plain   |              | 
 genotype_id             | integer                  |                                                            |
plain   |              | 
 coverage_ref            | integer                  |                                                            |
plain   |              | 
 coverage_alt            | integer                  |                                                            |
plain   |              | 
 phred_scaled_likelihood | text                     |                                                            |
extended|              | 
 downsampling            | boolean                  |                                                            |
plain   |              | 
 spanning_deletions      | double precision         |                                                            |
plain   |              | 
 mq                      | double precision         |                                                            |
plain   |              | 
 mq0                     | double precision         |                                                            |
plain   |              | 
 baseq_rank_sum          | double precision         |                                                            |
plain   |              | 
 mq_rank_sum             | double precision         |                                                            |
plain   |              | 
 read_pos_rank_sum       | double precision         |                                                            |
plain   |              | 
 strand_bias             | double precision         |                                                            |
plain   |              | 
 homopolymer_run         | integer                  |                                                            |
plain   |              | 
 haplotype_score         | double precision         |                                                            |
plain   |              | 
 quality_by_depth        | double precision         |                                                            |
plain   |              | 
 fisher_strand           | double precision         |                                                            |
plain   |              | 
 genotype_quality        | double precision         |                                                            |
plain   |              | 
 in_dbsnp                | boolean                  |                                                            |
plain   |              | 
 base_counts             | character varying(100)   |                                                            |
extended|              | 
 raw_read_depth          | integer                  |                                                            |
plain   |              | 
Indexes:
    "sample_result_pkey1" PRIMARY KEY, btree (id)
    "sample_result_pkey" UNIQUE, btree (id)
    "sample_result_genotype_id" btree (genotype_id)
    "sample_result_quality" btree (quality)
    "sample_result_raw_read_depth" btree (raw_read_depth)
    "sample_result_read_depth" btree (read_depth)
    "sample_result_sample_id" btree (sample_id)
    "sample_result_variant_id" btree (variant_id)
    "sample_variant_idx" btree (sample_id, variant_id)
Foreign-key constraints:
    "sample_id_refs_id_6fa6b6cc5d0f2984" FOREIGN KEY (sample_id) REFERENCES sample(id) DEFERRABLE INITIALLY DEFERRED
    "sample_result_genotype_id_fkey" FOREIGN KEY (genotype_id) REFERENCES genotype(id) DEFERRABLE INITIALLY DEFERRED
    "variant_id_refs_id_313c30dea59a86e8" FOREIGN KEY (variant_id) REFERENCES variant(id)
Referenced by:
    TABLE "assessment" CONSTRAINT "sample_result_id_refs_id_5831a8ec3d1e4e0a" FOREIGN KEY (sample_result_id) REFERENCES
sample_result(id)DEFERRABLE INITIALLY DEFERRED 
Has OIDs: no

sample table: 2705 rows
                                                        Table "public.sample"
   Column    |           Type           |                      Modifiers                      | Storage  | Stats target
|Description 

-------------+--------------------------+-----------------------------------------------------+----------+--------------+-------------
 id          | integer                  | not null default nextval('sample_id_seq'::regclass) | plain    |
|
 notes       | text                     |                                                     | extended |
|
 created     | timestamp with time zone | not null                                            | plain    |
|
 modified    | timestamp with time zone | not null                                            | plain    |
|
 label       | character varying(100)   | not null default 'placholder'::character varying    | extended |
|
 batch_id    | integer                  | not null                                            | plain    |
|
 version     | integer                  | not null                                            | plain    |
|
 person_id   | integer                  |                                                     | plain    |
|
 count       | integer                  | not null                                            | plain    |
|
 bio_sample  | integer                  |                                                     | plain    |
|
 published   | boolean                  | not null                                            | plain    |
|
 md5         | character varying(32)    |                                                     | extended |
|
 name        | character varying(100)   | not null default 'placeholder'::character varying   | extended |
|
 project_id  | integer                  | not null                                            | plain    |
|
 tissue_id   | integer                  |                                                     | plain    |
|
 vcf_colname | character varying(200)   |                                                     | extended |
|
Indexes:
    "sample_pkey" PRIMARY KEY, btree (id)
    "sample_version_c71a9c06ef358ed_uniq" UNIQUE CONSTRAINT, btree (version, batch_id, name)
    "sample_batch_id" btree (batch_id)
    "sample_cohort_id" btree (batch_id)
    "sample_label_like" btree (label varchar_pattern_ops)
    "sample_person_id" btree (person_id)
    "sample_project_id" btree (project_id)
    "sample_tissue_id" btree (tissue_id)
Foreign-key constraints:
    "cohort_id_refs_id_6c74dcea40694064" FOREIGN KEY (batch_id) REFERENCES batch(id) DEFERRABLE INITIALLY DEFERRED
    "project_id_refs_id_78e0c8fcf52a265d" FOREIGN KEY (project_id) REFERENCES project(id) DEFERRABLE INITIALLY DEFERRED
    "sample_person_id_fkey" FOREIGN KEY (person_id) REFERENCES person(id) DEFERRABLE INITIALLY DEFERRED
    "tissue_id_refs_id_2f16a55811371f5a" FOREIGN KEY (tissue_id) REFERENCES tissue(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "metrics_sample_load" CONSTRAINT "metrics_sample_load_sample_id_fkey" FOREIGN KEY (sample_id) REFERENCES
sample(id)
    TABLE "sample_phenotype" CONSTRAINT "sample_id_refs_id_2723d8269859c3bc" FOREIGN KEY (sample_id) REFERENCES
sample(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "cohort_sample" CONSTRAINT "sample_id_refs_id_435beca7ea3fecae" FOREIGN KEY (sample_id) REFERENCES sample(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "sample_qc" CONSTRAINT "sample_id_refs_id_437acf3032c46c2b" FOREIGN KEY (sample_id) REFERENCES sample(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "sample_manifest" CONSTRAINT "sample_id_refs_id_6dad1d60e5a86f62" FOREIGN KEY (sample_id) REFERENCES
sample(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "sample_result" CONSTRAINT "sample_id_refs_id_6fa6b6cc5d0f2984" FOREIGN KEY (sample_id) REFERENCES sample(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "sample_run" CONSTRAINT "sample_run_sample_id_fkey" FOREIGN KEY (sample_id) REFERENCES sample(id) DEFERRABLE
INITIALLYDEFERRED 
    TABLE "pcgc.core_sample" CONSTRAINT "varify_sample_id_refs_id_75c34db2" FOREIGN KEY (varify_sample_id) REFERENCES
sample(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "big_sample" CONSTRAINT "vsample_id_refs_id_3ad233dd6a3f695e" FOREIGN KEY (vsample_id) REFERENCES sample(id)
DEFERRABLEINITIALLY DEFERRED 
Has OIDs: no

core_sample table: 119,344 rows
                                                              Table "pcgc.core_sample"
         Column         |          Type          |                        Modifiers                         | Storage
|Stats target | Description 

------------------------+------------------------+----------------------------------------------------------+----------+--------------+-------------
 id                     | integer                | not null default nextval('core_sample_id_seq'::regclass) | plain
|             | 
 sample_id              | character varying(20)  |                                                          | extended
|             | 
 person_id              | integer                |                                                          | plain
|             | 
 sample_type            | character varying(11)  |                                                          | extended
|             | 
 source_type            | character varying(100) |                                                          | extended
|             | 
 status                 | character varying(100) |                                                          | extended
|             | 
 disposal_status        | character varying(100) |                                                          | extended
|             | 
 dna_qc_status          | character varying(100) |                                                          | extended
|             | 
 sample_identifier_type | character varying(20)  |                                                          | extended
|             | 
 varify_sample_id       | integer                |                                                          | plain
|10000        | 
Indexes:
    "core_sample_pkey" PRIMARY KEY, btree (id)
    "core_sample_sample_id_uniq" UNIQUE CONSTRAINT, btree (sample_id)
    "core_sample_disposal_status" btree (disposal_status)
    "core_sample_disposal_status_like" btree (disposal_status varchar_pattern_ops)
    "core_sample_dna_qc_status" btree (dna_qc_status)
    "core_sample_dna_qc_status_like" btree (dna_qc_status varchar_pattern_ops)
    "core_sample_person_id" btree (person_id)
    "core_sample_sample_identifier_type" btree (sample_identifier_type)
    "core_sample_sample_identifier_type_like" btree (sample_identifier_type varchar_pattern_ops)
    "core_sample_sample_type" btree (sample_type)
    "core_sample_sample_type_like" btree (sample_type varchar_pattern_ops)
    "core_sample_source_type" btree (source_type)
    "core_sample_source_type_like" btree (source_type varchar_pattern_ops)
    "core_sample_status" btree (status)
    "core_sample_status_like" btree (status varchar_pattern_ops)
    "core_sample_varify_sample_id" btree (varify_sample_id)
Foreign-key constraints:
    "person_id_refs_id_56d51ee2" FOREIGN KEY (person_id) REFERENCES core_person(id) DEFERRABLE INITIALLY DEFERRED
    "varify_sample_id_refs_id_75c34db2" FOREIGN KEY (varify_sample_id) REFERENCES sample(id) DEFERRABLE INITIALLY
DEFERRED
Referenced by:
    TABLE "core_samplefile" CONSTRAINT "sample_id_refs_id_185ff8c9" FOREIGN KEY (sample_id) REFERENCES core_sample(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_cnvconfirmation" CONSTRAINT "sample_id_refs_id_1c83b6a0" FOREIGN KEY (sample_id) REFERENCES
core_sample(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_variantcallconfirmation" CONSTRAINT "sample_id_refs_id_3beffe04" FOREIGN KEY (sample_id) REFERENCES
core_sample(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "dashboard_request" CONSTRAINT "sample_id_refs_id_46e54337" FOREIGN KEY (sample_id) REFERENCES
core_sample(id)DEFERRABLE INITIALLY DEFERRED 
Has OIDs: no

core_person: 15,746 rows
                                                              Table "pcgc.core_person"
        Column         |          Type           |                        Modifiers                         | Storage
|Stats target | Description 

-----------------------+-------------------------+----------------------------------------------------------+----------+--------------+-------------
 id                    | integer                 | not null default nextval('core_person_id_seq'::regclass) | plain
|             | 
 blinded_id            | character varying(20)   | not null                                                 | extended
|             | 
 is_subject            | boolean                 | not null                                                 | plain
|             | 
 working_group_summary | character varying(100)  |                                                          | extended
|             | 
 consent_group         | integer                 |                                                          | plain
|             | 
 mendelian_consistent  | boolean                 | not null                                                 | plain
|             | 
 comments              | character varying(100)  |                                                          | extended
|             | 
 relatives             | character varying(1000) |                                                          | extended
|             | 
Indexes:
    "core_person_pkey" PRIMARY KEY, btree (id)
    "core_person_blinded_id_key" UNIQUE CONSTRAINT, btree (blinded_id)
    "core_person_comments" btree (comments)
    "core_person_comments_like" btree (comments varchar_pattern_ops)
    "core_person_consent_group" btree (consent_group)
    "core_person_is_subject" btree (is_subject)
    "core_person_mendelian_consistent" btree (mendelian_consistent)
    "core_person_working_group_summary" btree (working_group_summary)
    "core_person_working_group_summary_like" btree (working_group_summary varchar_pattern_ops)
Referenced by:
    TABLE "core_familymember" CONSTRAINT "person_id_refs_id_1b8249c8" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_variantcallconfirmation" CONSTRAINT "person_id_refs_id_1b95d861" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_othergenetictestresults" CONSTRAINT "person_id_refs_id_3322dca8" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_karyotypeformula" CONSTRAINT "person_id_refs_id_4b0cf1ae" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_tissuesample" CONSTRAINT "person_id_refs_id_54349bf1" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_subject" CONSTRAINT "person_id_refs_id_55696453" FOREIGN KEY (person_id) REFERENCES core_person(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_sample" CONSTRAINT "person_id_refs_id_56d51ee2" FOREIGN KEY (person_id) REFERENCES core_person(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_guid" CONSTRAINT "person_id_refs_id_5c945e79" FOREIGN KEY (person_id) REFERENCES core_person(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_mutationresults" CONSTRAINT "person_id_refs_id_65047092" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_persondiagnosis" CONSTRAINT "person_id_refs_id_67b5236f" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_microarrayresults" CONSTRAINT "person_id_refs_id_6bf9527a" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_fishresults" CONSTRAINT "person_id_refs_id_a4734aaf" FOREIGN KEY (person_id) REFERENCES core_person(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_copynumberresults" CONSTRAINT "person_id_refs_id_ae0ddf0d" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_karyotypeabnormalitiesfather" CONSTRAINT "person_id_refs_id_b04014a" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_blindfile" CONSTRAINT "person_id_refs_id_b24509e6" FOREIGN KEY (person_id) REFERENCES core_person(id)
DEFERRABLEINITIALLY DEFERRED 
    TABLE "core_probandformcompletion" CONSTRAINT "person_id_refs_id_b4ca4d2d" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_workinggroupmembership" CONSTRAINT "person_id_refs_id_da4a9bc6" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_karyotypeabnormalitiesproband" CONSTRAINT "person_id_refs_id_e3eb5c6b" FOREIGN KEY (person_id)
REFERENCEScore_person(id) DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_genetictesting" CONSTRAINT "person_id_refs_id_ed9fd34b" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_karyotypeabnormalitiesmother" CONSTRAINT "person_id_refs_id_f6f61751" FOREIGN KEY (person_id)
REFERENCEScore_person(id) DEFERRABLE INITIALLY DEFERRED 
    TABLE "core_cnvconfirmation" CONSTRAINT "person_id_refs_id_ff18d483" FOREIGN KEY (person_id) REFERENCES
core_person(id)DEFERRABLE INITIALLY DEFERRED 
Has OIDs: no



Re: Ye olde slow query

От
Tom Lane
Дата:
"Murphy, Kevin" <MURPHYKE@email.chop.edu> writes:
> Synopsis: 8-table join with one "WHERE foo IN (...)" condition; works OK with fewer
> than 5 items in the IN list, but at N=5, the planner starts using a compound index
> for the first time that completely kills performance (5-6 minutes versus 0-12 seconds).
> I'm interested in learning what plays a role in this switch of plans (or the
> unanticipated relative slowness of the N=5 plan).  TIA for any wisdom; I've finally
> made a commitment to really delve into PG.  -Kevin

FWIW, I think the right question here is not "why is the slow query
slow?", but "why is the fast query fast?".  The planner is estimating
them both at nearly the same cost, and since that cost is quite high,
I'd say it's not too wrong about the slow query.  What it's wrong about
is the fast query; so you need to look at where its estimates are way
off base in that plan.

It looks like the trouble spot is this intermediate nested loop:

>                     ->  Nested Loop  (cost=4.32..283545.98 rows=80929 width=12) (actual time=163.609..571.237
rows=102loops=1) 
>                           Buffers: shared hit=419 read=63
>                           ->  Nested Loop  (cost=4.32..3426.09 rows=471 width=4) (actual time=93.595..112.404 rows=85
loops=1)
>                               ...
>                           ->  Index Scan using sample_result_variant_id on sample_result  (cost=0.00..593.01 rows=172
width=8)(actual time=5.147..5.397 rows=1 loops=85) 
>                                 Index Cond: (variant_id = variant_effect.variant_id)
>                                 Buffers: shared hit=400 read=42

which is creating the bulk of the estimated cost for the whole plan,
but execution is actually pretty cheap.  There seem to be two components
to the misestimation: one is that the sub-nested loop is producing about a
fifth as many rows as expected, and the other is that the probes into
sample_result are producing (on average) 1 row, not the 172 rows the
planner expects.  If you could get the latter estimate to be even within
one order of magnitude of reality, the planner would certainly see this
plan as way cheaper than the other.

So I'm wondering if the stats on sample_result and variant_effect are up
to date.  If they are, you might try increasing the stats targets for the
variant_id columns.

            regards, tom lane


Re: Ye olde slow query

От
"Murphy, Kevin"
Дата:
Sorry for the delay; back on this, and thanks for the response.

On Mar 11, 2014, at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Murphy, Kevin" <MURPHYKE@email.chop.edu> writes:
>> Synopsis: 8-table join with one "WHERE foo IN (...)" condition; works OK with fewer
>> than 5 items in the IN list, but at N=5, the planner starts using a compound index
>> for the first time that completely kills performance (5-6 minutes versus 0-12 seconds).
>> […]
>
> FWIW, I think the right question here is not "why is the slow query
> slow?", but "why is the fast query fast?”.
> […]
> It looks like the trouble spot is this intermediate nested loop:
>
>>                    ->  Nested Loop  (cost=4.32..283545.98 rows=80929 width=12) (actual time=163.609..571.237
rows=102loops=1) 
>>                          Buffers: shared hit=419 read=63
>>                          ->  Nested Loop  (cost=4.32..3426.09 rows=471 width=4) (actual time=93.595..112.404 rows=85
loops=1)
>>                              ...
>>                          ->  Index Scan using sample_result_variant_id on sample_result  (cost=0.00..593.01 rows=172
width=8)(actual time=5.147..5.397 rows=1 loops=85) 
>>                                Index Cond: (variant_id = variant_effect.variant_id)
>>                                Buffers: shared hit=400 read=42
>
> which is creating the bulk of the estimated cost for the whole plan,
> but execution is actually pretty cheap.  There seem to be two components
> to the misestimation: one is that the sub-nested loop is producing about a
> fifth as many rows as expected,

This may be because 3 out of the 4 user-supplied gene symbols were not present in the gene table at all.  Restricting
tovalid genes prior to the query is probably a good idea. 

> and the other is that the probes into
> sample_result are producing (on average) 1 row, not the 172 rows the
> planner expects.  If you could get the latter estimate to be even within
> one order of magnitude of reality, the planner would certainly see this
> plan as way cheaper than the other.

I’m not sure about how to improve this.  The stats were 5K globally and up to date, and I made them better, with no
change. I tried increasing the stats on the foreign keys involved to 10K (and analyzing), but the same costs and plan
arein play.  I know the stats are updated now because I dumped and restored on new hardware and did a vacuum analyze.
Ipreviously mentioned that some of the vanilla n_distinct values were way off for the (790M row) sample_result table,
soI have taken to coercing n_distinct using negative multipliers.  This data doesn’t change very often (it hasn’t in
manyweeks). 

There are 6M variants, but only 7.5% of them map to the sample_result table.  Presumably the planner knows this because
ofthe n_distinct value on sample_result.variant_id?  Each variant maps to zero or sample_result records, but often very
few,and never more than the number of samples (currently 1129). 

>
> So I'm wondering if the stats on sample_result and variant_effect are up
> to date.  If they are, you might try increasing the stats targets for the
> variant_id columns.

The stats were up to date and were at 5K globally.  I tried increasing the stats on the foreign keys involved to 10K
(andanalyzing!), but the same costs and plan are in play.  I know the stats are updated now because I dumped and
restoredon new hardware and did a vacuum analyze.  I previously mentioned that some of the vanilla n_distinct values
wereway off for the (790M row) sample_result table, so I have taken to coercing n_distinct using negative multipliers. 

Regards,
Kevin