Discussion: Bad plan for ltree predicate <@
Hi,
I have a problem on 9.3.14 with a query that accesses table:
Size: (retrieved by query https://gist.github.com/romank0/74f9d1d807bd3f41c0729d0fc6126749)
schemaname | relname | size | toast | associated_idx_size | total
------------+---------------+--------+--------+---------------------+---------
public | document_head | 275 MB | 630 MB | 439 MB | 1345 MB
Definition:
Table "public.document_head"
Column | Type | Modifiers
-----------------------------+--------------------------+-------------------------------------
snapshot_id | character varying(36) | not null
id | character varying(36) | not null
base_type | character varying(50) | not null
is_cascade | boolean | not null default false
parent_id | character varying(36) |
fileplan_node_id | character varying(36) |
state | character varying(10) | default 'ACTIVE'::character varying
title | character varying(4096) | not null
properties | text | not null
properties_cache | hstore | not null
serial_number | integer | not null
major_version | integer | not null
minor_version | integer | not null
version_description | text |
sensitivity | integer | not null default 10
id_path | ltree |
path_name | character varying(4096) | collate C not null
ltx_id | bigint | not null
created_by | integer | not null
created_date | timestamp with time zone | not null
modified_by | integer | not null
modified_date | timestamp with time zone | not null
responsible_user_ids | integer[] |
origin_id | character varying(36) |
origin_snapshot_id | character varying(36) |
ssn | character varying(64) |
record_physical_location | text |
record_physical_location_id | text |
record_created_date | timestamp with time zone |
record_aggregated_date | timestamp with time zone |
record_last_review_comment | text |
record_last_review_date | timestamp with time zone |
record_next_review_date | timestamp with time zone |
record_originated_date | timestamp with time zone |
record_is_vital | boolean | not null default false
storage_plan_state | text | not null default 'New'::text
cut_off_date | timestamp with time zone |
dispose_date | timestamp with time zone |
archive_date | timestamp with time zone |
Indexes:
"document_head__id__uniq_key" PRIMARY KEY, btree (id)
"document_head__parent_id__path_name__unq_idx" UNIQUE, btree (parent_id, path_name) WHERE state::text = 'ACTIVE'::text
"document_head__snapshot_id__unq" UNIQUE, btree (snapshot_id)
"document_head__base_type__idx" btree (base_type) WHERE state::text <> 'DELETED'::text
"document_head__fileplan_node_id__idx" btree (fileplan_node_id)
"document_head__id__idx" btree (id) WHERE state::text <> 'DELETED'::text
"document_head__id_path__btree__idx" btree (id_path) WHERE state::text <> 'DELETED'::text
"document_head__id_path__gist__idx" gist (id_path)
"document_head__ltx_id__idx" btree (ltx_id)
"document_head__origin_id__hotfix__idx" btree (origin_id) WHERE origin_id IS NOT NULL
"document_head__origin_id__idx" btree (origin_id) WHERE state::text <> 'DELETED'::text AND origin_id IS NOT NULL
"document_head__parent_id__idx" btree (parent_id)
"document_head__properties_cache__contact_username_idx" btree ((properties_cache -> 'person_meta_info.username'::text)) WHERE base_type::text = 'Contact'::text AND exist(properties_cache, 'person_meta_info.username'::text)
"document_head__properties_cache__emailmeta_message_id__idx" btree ((properties_cache -> 'emailmeta.message_id'::text)) WHERE base_type::text = 'File'::text AND exist(properties_cache, 'emailmeta.message_id'::text)
"document_head__properties_cache__idx" gist (properties_cache) WHERE state::text <> 'DELETED'::text
"document_head__properties_cache__project_identifier__idx" btree ((properties_cache -> 'project.identifier'::text)) WHERE base_type::text = 'Project'::text AND exist(properties_cache, 'project.identifier'::text)
"document_head__properties_cache__published_origin__idx" btree ((properties_cache -> 'file_published_origin_id.origin_id'::text)) WHERE base_type::text = 'File'::text AND exist(properties_cache, 'file_published_origin_id.origin_id'::text)
"document_head__state__idx" btree (state)
"document_head__storage_plan_state__idx" btree (storage_plan_state) WHERE state::text <> 'DELETED'::text
Check constraints:
"document_base_storage_plan_state_check" CHECK (storage_plan_state = ANY (ARRAY['NEW'::text, 'READY_FOR_CUTOFF'::text, 'CUTOFF'::text, 'READY_FOR_DISPOSITION'::text, 'DISPOSED'::text]))
"document_head__sensitivity_check" CHECK (sensitivity = ANY (ARRAY[10, 20, 30]))
Foreign-key constraints:
"document_head__created_by__fk" FOREIGN KEY (created_by) REFERENCES auth_user(id)
"document_head__modified_by__fk" FOREIGN KEY (modified_by) REFERENCES auth_user(id)
"document_head__parent_id__fk" FOREIGN KEY (parent_id) REFERENCES document(id)
Some notes:
1. properties stores JSON that for some records may be as large as ~300 kB:
select count(*) from document_head where length(properties) > 100000;
count
-------
535
(1 row)
select count(*) from document_head where length(properties) > 20000;
count
-------
13917
(1 row)
select count(*) from document_head where length(properties) > 1000;
count
-------
51708
(1 row)
select count(*) from document_head where length(properties) > 300000;
count
-------
3
(1 row)
select max(length(properties)) from document_head;
max
--------
334976
(1 row)
2. properties_cache stores the parsed properties: each key is the jsonpath of a key in the JSON document, and each value is the corresponding value.
3. all results here are retrieved after running `analyze document_head` and `vacuum document_head` manually.
4. I tried different work_mem settings, up to 100MB, and there was no effect on the main issue described below.
5. I haven't tested disk speed: first of all I think it is irrelevant to the problem, and it is not easy to do on a production system.
The function that is used in the query:
CREATE OR REPLACE FUNCTION public.get_doc_path(document_id character varying)
RETURNS ltree
LANGUAGE plpgsql
STABLE
AS $function$
DECLARE
path ltree;
BEGIN
select id_path into path from document_head where id = document_id;
RETURN path;
END $function$
The original query is a rather big one; this simplified version still demonstrates the issue:
explain (analyze, buffers)
with trees AS (
SELECT d.id, d.snapshot_id , NULL :: text[] AS permissions
FROM document_head AS d
WHERE (d.id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c')
AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c') AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;
I get this plan https://explain.depesz.com/s/UQX4h
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=86227.19..86227.20 rows=1 width=0) (actual time=3878.775..3878.776 rows=1 loops=1)
Buffers: shared hit=747698, temp written=1587
CTE trees
-> Seq Scan on document_head d (cost=0.00..82718.21 rows=155955 width=74) (actual time=0.211..3620.044 rows=154840 loops=1)
Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text) AND (id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'::character varying)))
Rows Removed by Filter: 23357
Buffers: shared hit=747698
-> CTE Scan on trees (cost=0.00..3119.10 rows=155955 width=0) (actual time=0.215..3828.519 rows=154840 loops=1)
Buffers: shared hit=747698, temp written=1587
Total runtime: 3881.781 ms
(10 rows)
If I change the ltree predicate to use a subquery, the plan and execution time change:
explain (analyze, buffers)
with trees AS (
SELECT d.id, d.snapshot_id , NULL :: text[] AS permissions
FROM document_head AS d
WHERE (d.id_path <@ (select get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'))
AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c') AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;
https://explain.depesz.com/s/eUR
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=763.19..763.20 rows=1 width=0) (actual time=430.519..430.519 rows=1 loops=1)
Buffers: shared hit=47768, temp written=1587
CTE trees
-> Bitmap Heap Scan on document_head d (cost=82.05..759.20 rows=177 width=74) (actual time=70.703..249.419 rows=154840 loops=1)
Recheck Cond: (id_path <@ $0)
Rows Removed by Index Recheck: 11698
Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
Rows Removed by Filter: 23
Buffers: shared hit=47768
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.083..0.084 rows=1 loops=1)
Buffers: shared hit=4
-> Bitmap Index Scan on document_head__id_path__gist__idx (cost=0.00..81.74 rows=178 width=0) (actual time=68.326..68.326 rows=159979 loops=1)
Index Cond: (id_path <@ $0)
Buffers: shared hit=16238
-> CTE Scan on trees (cost=0.00..3.54 rows=177 width=0) (actual time=70.707..388.714 rows=154840 loops=1)
Buffers: shared hit=47768, temp written=1587
Total runtime: 433.410 ms
(18 rows)
I can see that:
1. both queries return exactly the same data and require the same underlying data to produce the result.
2. when I use the subquery, postgres cannot estimate the number of records that will match <@.
3. the number of buffers processed by the quicker query is 15 times smaller than by the slow one. I assume this is the reason the first query is slower.
My understanding of how the planner works is that it evaluates some plans and calculates their cost based on the settings and available statistics.
The settings define the relative cost of operations, e.g. random vs. sequential I/O vs. processing data in memory.
I assume it is better to write queries that allow the planner to make correct estimations, and to tune the settings so that the planner knows the correct cost of operations.
My questions:
1. how can I find out why the slow execution requires 15x more buffers?
2. in the slow execution, the planner either does not consider a plan similar to the quick one, or estimates it as worse. How can I find out which of these cases (if any) is true? And what can I try to make the planner use the better plan?
I understand that one option is to upgrade to a more recent version, and I'm going to test this, but it may take a while before the affected production system gets an upgrade.
More information:
OS: Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-125-generic x86_64)
PG version: PostgreSQL 9.3.14 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
History: no history as this is a new query.
Hardware: aws m4.large (2 vCPUs, RAM 8GB) with gp2 (SSD) storage with throughput up to ~20MB/s.
Non default server settings:
name | current_setting | source
--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------
application_name | psql | client
archive_command | /etc/rds/dbbin/pgscripts/rds_wal_archive %p | configuration file
archive_mode | on | configuration file
archive_timeout | 5min | configuration file
checkpoint_completion_target | 0.9 | configuration file
checkpoint_segments | 16 | configuration file
client_encoding | UTF8 | client
effective_cache_size | 3987336kB | configuration file
fsync | on | configuration file
hot_standby | off | configuration file
listen_addresses | * | command line
lo_compat_privileges | off | configuration file
log_checkpoints | on | configuration file
log_destination | stderr | configuration file
log_directory | /rdsdbdata/log/error | configuration file
log_file_mode | 0644 | configuration file
log_filename | postgresql.log.%Y-%m-%d-%H | configuration file
log_hostname | on | configuration file
log_line_prefix | %t:%r:%u@%d:[%p]: | configuration file
log_min_duration_statement | 1s | configuration file
log_rotation_age | 1h | configuration file
log_timezone | UTC | configuration file
log_truncate_on_rotation | off | configuration file
logging_collector | on | configuration file
maintenance_work_mem | 127MB | configuration file
max_connections | 1000 | configuration file
max_locks_per_transaction | 64 | configuration file
max_prepared_transactions | 0 | configuration file
max_stack_depth | 6MB | configuration file
max_wal_senders | 5 | configuration file
port | 5432 | configuration file
rds.extensions | btree_gin,btree_gist,chkpass,citext,cube,dblink,dict_int,dict_xsyn,earthdistance,fuzzystrmatch,hstore,intagg,intarray,isn,ltree,pgcrypto,pgrowlocks,pg_stat_statements,pg_trgm,plcoffee,plls,plperl,plpgsql,pltcl,plv8,postgis,postgis_tiger_geocoder,postgis_topology,postgres_fdw,sslinfo,tablefunc,test_parser,tsearch2,unaccent,uuid-ossp | configuration file
rds.internal_databases | rdsadmin,template0 | configuration file
rds.superuser_variables | session_replication_role | configuration file
shared_buffers | 1993664kB | configuration file
shared_preload_libraries | rdsutils | configuration file
ssl | on | configuration file
ssl_ca_file | /rdsdbdata/rds-metadata/ca-cert.pem | configuration file
ssl_cert_file | /rdsdbdata/rds-metadata/server-cert.pem | configuration file
ssl_key_file | /rdsdbdata/rds-metadata/server-key.pem | configuration file
ssl_renegotiation_limit | 0 | configuration file
stats_temp_directory | /rdsdbdata/db/pg_stat_tmp | configuration file
superuser_reserved_connections | 3 | configuration file
synchronous_commit | on | configuration file
TimeZone | UTC | configuration file
unix_socket_directories | /tmp | configuration file
unix_socket_group | rdsdb | configuration file
unix_socket_permissions | 0700 | configuration file
wal_keep_segments | 32 | configuration file
wal_level | hot_standby | configuration file
wal_receiver_timeout | 30s | configuration file
wal_sender_timeout | 30s | configuration file
(52 rows)
Roman Konoval <rkonoval@gmail.com> writes:
> I have a problem on 9.3.14 with a query that accesses table:

I think the root of the problem is your intermediate function:

> CREATE OR REPLACE FUNCTION public.get_doc_path(document_id character varying)
>  RETURNS ltree
>  LANGUAGE plpgsql
>  STABLE
> AS $function$
> DECLARE
>     path ltree;
> BEGIN
>    select id_path into path from document_head where id = document_id;
>    RETURN path;
> END $function$

This is quite expensive, as it involves another table search, but the
planner doesn't know that since you've not marked it as having higher than
normal cost.  The seqscan formulation of the query results in evaluating
this function afresh at most of the rows, whereas shoving it into an
uncorrelated sub-select causes it to be evaluated only once.  That, I
think, and not the seqscan-vs-indexscan aspect, is what makes the bitmap
formulation go faster.  Certainly you'd not expect that a bitmap scan that
has to hit most of the rows anyway is going to win over a seqscan.

The fact that the planner goes for a bitmap scan in the second formulation
is an artifact of the fact that it doesn't try to pre-evaluate sub-selects
for selectivity estimation purposes, so you end up with a default estimate
that says that the <@ condition only selects a small fraction of the rows.
Not sure if we should try to change that or not.

I'd suggest setting the function's cost to 1000 or so and seeing if that
doesn't improve matters.

(BTW, what tipped me off to this was that the "buffers hit" count for the
seqscan node was so high, several times more than the actual size of the
table.  I couldn't account for that until I realized that the function
itself would be adding a few buffer hits per execution.)

			regards, tom lane
Hi Tom,
Thanks for your help.
> On Dec 1, 2017, at 22:33, Tom Lane wrote:
>
>
> The seqscan formulation of the query results in evaluating
> this function afresh at most of the rows
The function is defined as STABLE. I thought that means there is no need
to reevaluate it on every row, since the input parameter is the same for every row and
the return value will be the same during a single query execution. Do I understand
incorrectly what STABLE means?
Why is the function evaluated more than once?
> , whereas shoving it into an
> uncorrelated sub-select causes it to be evaluated only once. That, I
> think, and not the seqscan-vs-indexscan aspect, is what makes the bitmap
> formulation go faster. Certainly you'd not expect that a bitmap scan that
> has to hit most of the rows anyway is going to win over a seqscan.
>
> The fact that the planner goes for a bitmap scan in the second formulation
> is an artifact of the fact that it doesn't try to pre-evaluate sub-selects
> for selectivity estimation purposes, so you end up with a default estimate
> that says that the <@ condition only selects a small fraction of the rows.
> Not sure if we should try to change that or not.
>
> I'd suggest setting the function's cost to 1000 or so and seeing if that
> doesn't improve matters.
>
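For reference, I applied the suggested cost with an ALTER FUNCTION statement (the signature is the one from the definition above; the default cost for a plpgsql function is 100):

```sql
-- Raise the planner's cost estimate for the function, per Tom's suggestion.
ALTER FUNCTION public.get_doc_path(character varying) COST 1000;
```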
If I set the function cost to 1000 I get a slightly better plan, but still 3.5x more buffers are read compared to the bitmap scan, which, as you wrote, one would expect to be slower than a seq scan.
Here is the plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=216438.81..216438.82 rows=1 width=0) (actual time=1262.244..1262.245 rows=1 loops=1)
Buffers: shared hit=169215
CTE trees
-> Index Scan using document_head__id_path__gist__idx on document_head d (cost=2.91..212787.85 rows=162265 width=74) (actual time=0.115..727.119 rows=154854 loops=1)
Index Cond: (id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'::character varying))
Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
Rows Removed by Filter: 23
Buffers: shared hit=169215
-> CTE Scan on trees (cost=0.00..3245.30 rows=162265 width=0) (actual time=0.119..1118.899 rows=154854 loops=1)
Buffers: shared hit=169215
Total runtime: 1277.010 ms
(11 rows)
My understanding is that the optimal plan in this case should read less data than the bitmap scan, by the number of buffers hit by the bitmap index scan.
It should read roughly all the buffers of the table itself, like the query with a predicate using an ltree literal instead of the function invocation:
explain (analyze, buffers)
with trees AS (
SELECT d.id, d.snapshot_id , NULL :: text[] AS permissions
FROM document_head AS d
WHERE (d.id_path <@ '869c0187_51ae_4deb_a36f_0425fdafda6e.78157c60_45bc_42c1_9aad_c5651995db5c'::ltree AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c') AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=42114.02..42114.03 rows=1 width=0) (actual time=997.427..997.427 rows=1 loops=1)
Buffers: shared hit=35230
CTE trees
-> Seq Scan on document_head d (cost=0.00..38463.06 rows=162265 width=74) (actual time=0.013..593.082 rows=154854 loops=1)
Filter: ((id_path <@ '869c0187_51ae_4deb_a36f_0425fdafda6e.78157c60_45bc_42c1_9aad_c5651995db5c'::ltree) AND ((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
Rows Removed by Filter: 23357
Buffers: shared hit=35230
-> CTE Scan on trees (cost=0.00..3245.30 rows=162265 width=0) (actual time=0.017..888.076 rows=154854 loops=1)
Buffers: shared hit=35230
Total runtime: 1011.565 ms
(10 rows)
The question is whether it is possible to get a plan like that using the function, or in some other way to get the ltree value for a given document_head.id value in one query.
As an alternative I can fetch the ltree value with a separate query, but this would require
1. a round-trip to postgres
2. changing the isolation level to REPEATABLE READ to make sure that I get a consistent result
so I would like to avoid that.
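One single-query formulation I am considering as a workaround is a self-join that fetches the root row's id_path directly, so the planner sees a plain column comparison instead of a function call (an untested sketch, using the same ids as above):

```sql
-- Sketch: replace get_doc_path() with a self-join on document_head.
-- The subtree root's id_path comes from the "root" alias, so no
-- per-row function evaluation is needed.
explain (analyze, buffers)
with trees AS (
    SELECT d.id, d.snapshot_id, NULL::text[] AS permissions
    FROM document_head AS root
    JOIN document_head AS d
      ON d.id_path <@ root.id_path
    WHERE root.id = '78157c60-45bc-42c1-9aad-c5651995db5c'
      AND d.id != root.id
      AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;
```

I have not verified whether the planner estimates the <@ join clause any better in this form.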
Regards,
Roman Konoval