Irrelevant columns cause massive performance change

Поиск
Список
Период
Сортировка
От Craig James
Тема Irrelevant columns cause massive performance change
Дата
Msg-id CAFwQ8rdgsRC+zJrcmqpRx8GeU0h_OSnhsuW8GQgLze9RzCUmYQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: Irrelevant columns cause massive performance change  (Andres Freund <andres@anarazel.de>)
Список pgsql-performance
Here's a weird one I can't figure out: the definitions of several columns of a view, which are not used in a query at all, have a massive effect on the query planner, causing it to choose a seqscan over the largest table in our database when it should be using the primary key for the join.
Background: We've redesigned the tables that hold our primary data, but need to create views that mimic the old design so that our applications will continue working. The largest table ("chemaxon.sdf") holds the bulk of the data, and we've changed it from raw text to gzipped bytea. To mimic the old schema, I created a short Perl function that does a simple gunzip operation, and used that in the definition of the view "str_conntab". (This gzip reduces our total database size to about a third of the original -- it's very effective).

Here are two query plans. The first is horrible. For the second, I removed the gunzip functions and replaced them with constant values. But notice that these pseudo columns are not used anywhere in the query. (Even if they were, I don't understand why this should affect the planner.)

The tables VERSION and VERSION_PROPERTIES are also views; I've included their definitions and the underlying actual tables below.

Postgres 9.6.7 running on Ubuntu 16.04.

emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules->  (select
emolecules(>    id,
emolecules(>    length(gunzip(sdf_gzip)) as contab_len,
emolecules(>    gunzip(sdf_gzip) as contab_data,
emolecules(>    ''::text as normalized
emolecules(>   from chemaxon.sdf);
CREATE VIEW

emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
 from VERSION
 join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)
 join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
 where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=62.99..162425.77 rows=5 width=60) (actual time=34.718..152828.351 rows=10 loops=1)
   Join Filter: (s.id = p_1.id)
   ->  Nested Loop  (cost=62.56..162422.84 rows=6 width=55) (actual time=34.701..152828.289 rows=10 loops=1)
         Join Filter: (s.id = parent.id)
         ->  Nested Loop  (cost=62.14..162419.48 rows=7 width=51) (actual time=34.694..152828.250 rows=10 loops=1)
               Join Filter: (s.id = p.id)
               ->  Hash Join  (cost=61.72..162415.16 rows=9 width=47) (actual time=34.663..152828.110 rows=10 loops=1)
                     Hash Cond: (sdf.id = s.id)
                     ->  Seq Scan on sdf  (cost=0.00..158488.50 rows=281080 width=72) (actual time=33.623..152630.514 rows=281080 loops=1)
                     ->  Hash  (cost=61.59..61.59 rows=10 width=43) (actual time=0.028..0.028 rows=10 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Index Scan using smiles_pkey on smiles s  (cost=0.42..61.59 rows=10 width=43) (actual time=0.010..0.022 rows=10 loops=1)
                                 Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
               ->  Index Only Scan using parent_pkey on parent p  (cost=0.42..0.47 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=10)
                     Index Cond: (id = sdf.id)
                     Heap Fetches: 10
         ->  Index Only Scan using parent_pkey on parent  (cost=0.42..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
               Index Cond: (id = sdf.id)
               Heap Fetches: 10
   ->  Index Scan using properties_pkey on properties p_1  (cost=0.42..0.48 rows=1 width=21) (actual time=0.003..0.004 rows=1 loops=10)
         Index Cond: (id = sdf.id)
 Planning time: 1.330 ms
 Execution time: 152828.506 ms
(23 rows)

emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules->  (select
emolecules(>    id,
emolecules(>    0::integer contab_len,
emolecules(>    null::text as contab_data,
emolecules(>    ''::text as normalized
emolecules(>   from chemaxon.sdf);
CREATE VIEW
emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
 from VERSION
 join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)
 join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
 where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=2.11..156.71 rows=5 width=60) (actual time=0.018..0.096 rows=10 loops=1)
   Join Filter: (s.id = p_1.id)
   ->  Nested Loop  (cost=1.69..153.77 rows=6 width=55) (actual time=0.015..0.076 rows=10 loops=1)
         Join Filter: (s.id = parent.id)
         ->  Nested Loop  (cost=1.27..150.41 rows=7 width=51) (actual time=0.012..0.059 rows=10 loops=1)
               Join Filter: (s.id = p.id)
               ->  Nested Loop  (cost=0.84..146.09 rows=9 width=47) (actual time=0.008..0.037 rows=10 loops=1)
                     ->  Index Scan using smiles_pkey on smiles s  (cost=0.42..61.59 rows=10 width=43) (actual time=0.003..0.016 rows=10 loops=1)
                           Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
                     ->  Index Only Scan using sdf_pkey on sdf  (cost=0.42..8.44 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
                           Index Cond: (id = s.id)
                           Heap Fetches: 10
               ->  Index Only Scan using parent_pkey on parent p  (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
                     Index Cond: (id = sdf.id)
                     Heap Fetches: 10
         ->  Index Only Scan using parent_pkey on parent  (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
               Index Cond: (id = sdf.id)
               Heap Fetches: 10
   ->  Index Scan using properties_pkey on properties p_1  (cost=0.42..0.48 rows=1 width=21) (actual time=0.001..0.002 rows=1 loops=10)
         Index Cond: (id = sdf.id)
 Planning time: 1.251 ms
 Execution time: 0.147 ms
(22 rows)

The timing of the second query is excellent, and is what I expected. I don't understand why including a function-defined column in the view would have such a dramatic effect on the planner's ability to choose the sdf_pkey index for the join.

Here are the view and table definitions:

emolecules=> \d+ version
                            View "registry.version"
   Column   |  Type   | Collation | Nullable | Default | Storage  | Description 
------------+---------+-----------+----------+---------+----------+-------------
 version_id | integer |           |          |         | plain    | 
 parent_id  | integer |           |          |         | plain    | 
 isosmiles  | text    |           |          |         | extended | 
 created    | abstime |           |          |         | plain    | 
View definition:
 SELECT s.id AS version_id,
    p.parent_id,
    s.smiles AS isosmiles,
    timenow() AS created
   FROM chemaxon.smiles s
     JOIN chemaxon.parent p ON s.id = p.id;

emolecules=> \d+ version_properties
                             View "registry.version_properties"
      Column       |     Type     | Collation | Nullable | Default | Storage  | Description 
-------------------+--------------+-----------+----------+---------+----------+-------------
 version_id        | integer      |           |          |         | plain    | 
 molecular_weight  | numeric(8,3) |           |          |         | main     | 
 molecular_formula | text         |           |          |         | extended | 
 mfcd              | text         |           |          |         | extended | 
 cas_number        | text         |           |          |         | extended | 
View definition:
 SELECT p.id AS version_id,
    p.molecular_weight,
    p.molecular_formula,
    m.mfcd,
    c.cas_number
   FROM chemaxon.properties p
     LEFT JOIN chemaxon.mfcd m USING (id)
     LEFT JOIN chemaxon.cas_number c USING (id)
     JOIN chemaxon.parent USING (id);

                                  Table "chemaxon.smiles"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+---------+-----------+----------+---------+----------+--------------+-------------
 id     | integer |           | not null |         | plain    |              | 
 smiles | text    |           | not null |         | extended |              | 
Indexes:
    "smiles_pkey" PRIMARY KEY, btree (id)
    "i_unique_smiles" UNIQUE, btree (smiles)

                                   Table "chemaxon.parent"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 id        | integer |           | not null |         | plain   |              | 
 parent_id | integer |           | not null |         | plain   |              | 
Indexes:
    "parent_pkey" PRIMARY KEY, btree (id)
    "i_parent_parent_id" btree (parent_id)

emolecules=> \d chemaxon.cas_number
              Table "chemaxon.cas_number"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 id         | integer |           | not null | 
 cas_number | text    |           |          | 
Indexes:
    "cas_number_pkey" PRIMARY KEY, btree (id)
    "i_cas_number_cas_number" btree (cas_number)

emolecules=> \d chemaxon.cas_number
              Table "chemaxon.cas_number"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 id         | integer |           | not null | 
 cas_number | text    |           |          | 
Indexes:
    "cas_number_pkey" PRIMARY KEY, btree (id)
    "i_cas_number_cas_number" btree (cas_number)

And the function "gunzip" is defined in perl (unsafe Perl) as:

create or replace function gunzip(bytea) returns text as
$gunzip$
  use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
  my $compressed = decode_bytea($_[0]);
  my $uncompressed;
  if (!gunzip(\$compressed, \$uncompressed)) {
    return $GunzipError;
  }
  return $uncompressed;
$gunzip$
language plperlu;



Thanks!
Craig

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

В списке pgsql-performance по дате отправления:

Предыдущее
От: MichaelDBA
Дата:
Сообщение: Re: PG 9.6 Slow inserts with long-lasting LWLocks
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Irrelevant columns cause massive performance change