Irrelevant columns cause massive performance change

Поиск

Список

Период

Сортировка

От	Craig James
Тема	Irrelevant columns cause massive performance change
Дата	17 марта 2018 г. 02:37:05
Msg-id	CAFwQ8rdgsRC+zJrcmqpRx8GeU0h_OSnhsuW8GQgLze9RzCUmYQ@mail.gmail.com обсуждение исходный текст
Ответы	Re: Irrelevant columns cause massive performance change
Список	pgsql-performance

Дерево обсуждения

Here's a weird one I can't figure out: the definitions of several columns of a view, which are not used in a query at all, have a massive effect on the query planner, causing it to choose a seqscan over the largest table in our database when it should be using the primary key for the join.

Background: We've redesigned the tables that hold our primary data, but need to create views that mimic the old design so that our applications will continue working. The largest table ("chemaxon.sdf") holds the bulk of the data, and we've changed it from raw text to gzipped bytea. To mimic the old schema, I created a short Perl function that does a simple gunzip operation, and used that in the definition of the view "str_conntab". (This gzip reduces our total database size to about a third of the original -- it's very effective).

Here are two query plans. The first is horrible. For the second, I removed the gunzip functions and replaced them with constant values. But notice that these pseudo columns are not used anywhere in the query. (Even if they were, I don't understand why this should affect the planner.)

The tables VERSION and VERSION_PROPERTIES are also views; I've included their definitions and the underlying actual tables below.

Postgres 9.6.7 running on Ubuntu 16.04.

emolecules=> drop view str_conntab;

DROP VIEW

emolecules=> create view str_conntab as

emolecules-> (select

emolecules(> id,

emolecules(> length(gunzip(sdf_gzip)) as contab_len,

emolecules(> gunzip(sdf_gzip) as contab_data,

emolecules(> ''::text as normalized

emolecules(> from chemaxon.sdf);

CREATE VIEW

emolecules=> explain analyze

select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA

from VERSION

join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)

join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)

where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=62.99..162425.77 rows=5 width=60) (actual time=34.718..152828.351 rows=10 loops=1)

Join Filter: (s.id = p_1.id)

-> Nested Loop (cost=62.56..162422.84 rows=6 width=55) (actual time=34.701..152828.289 rows=10 loops=1)

Join Filter: (s.id = parent.id)

-> Nested Loop (cost=62.14..162419.48 rows=7 width=51) (actual time=34.694..152828.250 rows=10 loops=1)

Join Filter: (s.id = p.id)

-> Hash Join (cost=61.72..162415.16 rows=9 width=47) (actual time=34.663..152828.110 rows=10 loops=1)

Hash Cond: (sdf.id = s.id)

-> Seq Scan on sdf (cost=0.00..158488.50 rows=281080 width=72) (actual time=33.623..152630.514 rows=281080 loops=1)

-> Hash (cost=61.59..61.59 rows=10 width=43) (actual time=0.028..0.028 rows=10 loops=1)

Buckets: 1024 Batches: 1 Memory Usage: 9kB

-> Index Scan using smiles_pkey on smiles s (cost=0.42..61.59 rows=10 width=43) (actual time=0.010..0.022 rows=10 loops=1)

Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))

-> Index Only Scan using parent_pkey on parent p (cost=0.42..0.47 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=10)

Index Cond: (id = sdf.id)

Heap Fetches: 10

-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)

Index Cond: (id = sdf.id)

Heap Fetches: 10

-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48 rows=1 width=21) (actual time=0.003..0.004 rows=1 loops=10)

Index Cond: (id = sdf.id)

Planning time: 1.330 ms

Execution time: 152828.506 ms

(23 rows)

emolecules=> drop view str_conntab;

DROP VIEW

emolecules=> create view str_conntab as

emolecules-> (select

emolecules(> id,

emolecules(> 0::integer contab_len,

emolecules(> null::text as contab_data,

emolecules(> ''::text as normalized

emolecules(> from chemaxon.sdf);

CREATE VIEW

emolecules=> explain analyze

select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA

from VERSION

join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)

join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)

where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=2.11..156.71 rows=5 width=60) (actual time=0.018..0.096 rows=10 loops=1)

Join Filter: (s.id = p_1.id)

-> Nested Loop (cost=1.69..153.77 rows=6 width=55) (actual time=0.015..0.076 rows=10 loops=1)

Join Filter: (s.id = parent.id)

-> Nested Loop (cost=1.27..150.41 rows=7 width=51) (actual time=0.012..0.059 rows=10 loops=1)

Join Filter: (s.id = p.id)

-> Nested Loop (cost=0.84..146.09 rows=9 width=47) (actual time=0.008..0.037 rows=10 loops=1)

-> Index Scan using smiles_pkey on smiles s (cost=0.42..61.59 rows=10 width=43) (actual time=0.003..0.016 rows=10 loops=1)

Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))

-> Index Only Scan using sdf_pkey on sdf (cost=0.42..8.44 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)

Index Cond: (id = s.id)

Heap Fetches: 10

-> Index Only Scan using parent_pkey on parent p (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)

Index Cond: (id = sdf.id)

Heap Fetches: 10

-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)

Index Cond: (id = sdf.id)

Heap Fetches: 10

-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48 rows=1 width=21) (actual time=0.001..0.002 rows=1 loops=10)

Index Cond: (id = sdf.id)

Planning time: 1.251 ms

Execution time: 0.147 ms

(22 rows)

The timing of the second query is excellent, and is what I expected. I don't understand why including a function-defined column in the view would have such a dramatic effect on the planner's ability to choose the sdf_pkey index for the join.

Here are the view and table definitions:

emolecules=> \d+ version

View "registry.version"

------------+---------+-----------+----------+---------+----------+-------------

View definition:

SELECT s.id AS version_id,

p.parent_id,

s.smiles AS isosmiles,

timenow() AS created

FROM chemaxon.smiles s

JOIN chemaxon.parent p ON s.id = p.id;

emolecules=> \d+ version_properties

View "registry.version_properties"

-------------------+--------------+-----------+----------+---------+----------+-------------

View definition:

SELECT p.id AS version_id,

p.molecular_weight,

p.molecular_formula,

m.mfcd,

c.cas_number

FROM chemaxon.properties p

LEFT JOIN chemaxon.mfcd m USING (id)

LEFT JOIN chemaxon.cas_number c USING (id)

JOIN chemaxon.parent USING (id);

Table "chemaxon.smiles"

--------+---------+-----------+----------+---------+----------+--------------+-------------

Indexes:

"smiles_pkey" PRIMARY KEY, btree (id)

"i_unique_smiles" UNIQUE, btree (smiles)

Table "chemaxon.parent"

-----------+---------+-----------+----------+---------+---------+--------------+-------------

Indexes:

"parent_pkey" PRIMARY KEY, btree (id)

"i_parent_parent_id" btree (parent_id)

emolecules=> \d chemaxon.cas_number

Table "chemaxon.cas_number"

Column | Type | Collation | Nullable | Default

------------+---------+-----------+----------+---------

id | integer | | not null |

cas_number | text | | |

Indexes:

"cas_number_pkey" PRIMARY KEY, btree (id)

"i_cas_number_cas_number" btree (cas_number)

emolecules=> \d chemaxon.cas_number

Table "chemaxon.cas_number"

Column | Type | Collation | Nullable | Default

------------+---------+-----------+----------+---------

id | integer | | not null |

cas_number | text | | |

Indexes:

"cas_number_pkey" PRIMARY KEY, btree (id)

"i_cas_number_cas_number" btree (cas_number)

And the function "gunzip" is defined in perl (unsafe Perl) as:

create or replace function gunzip(bytea) returns text as

$gunzip$

use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $compressed = decode_bytea($_[0]);

my $uncompressed;

if (!gunzip(\$compressed, \$uncompressed)) {

return $GunzipError;

}

return $uncompressed;

$gunzip$

language plperlu;

Thanks!

Craig

---------------------------------
Craig A. James

Chief Technology Officer

eMolecules, Inc.

---------------------------------

В списке pgsql-performance по дате отправления:

Предыдущее

От: MichaelDBA
Дата: 16 марта 2018 г., 19:42:34
Сообщение: Re: PG 9.6 Slow inserts with long-lasting LWLocks

Следующее

От: Andres Freund
Дата: 17 марта 2018 г., 02:50:13
Сообщение: Re: Irrelevant columns cause massive performance change

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Irrelevant columns cause massive performance change

Предыдущее

Следующее