Discussion: Bad plan for ltree predicate <@


Bad plan for ltree predicate <@

From
Roman Konoval
Date:
Hi,

I have a problem on 9.3.14 with a query that accesses this table:

Size (retrieved by the query https://gist.github.com/romank0/74f9d1d807bd3f41c0729d0fc6126749):

 schemaname |    relname    |  size  | toast  | associated_idx_size |  total
------------+---------------+--------+--------+---------------------+---------
 public     | document_head | 275 MB | 630 MB | 439 MB              | 1345 MB

Definition:

                         Table "public.document_head"
           Column            |           Type           |              Modifiers
-----------------------------+--------------------------+--------------------------------------
 snapshot_id                 | character varying(36)    | not null
 id                          | character varying(36)    | not null
 base_type                   | character varying(50)    | not null
 is_cascade                  | boolean                  | not null default false
 parent_id                   | character varying(36)    |
 fileplan_node_id            | character varying(36)    |
 state                       | character varying(10)    | default 'ACTIVE'::character varying
 title                       | character varying(4096)  | not null
 properties                  | text                     | not null
 properties_cache            | hstore                   | not null
 serial_number               | integer                  | not null
 major_version               | integer                  | not null
 minor_version               | integer                  | not null
 version_description         | text                     |
 sensitivity                 | integer                  | not null default 10
 id_path                     | ltree                    |
 path_name                   | character varying(4096)  | collate C not null
 ltx_id                      | bigint                   | not null
 created_by                  | integer                  | not null
 created_date                | timestamp with time zone | not null
 modified_by                 | integer                  | not null
 modified_date               | timestamp with time zone | not null
 responsible_user_ids        | integer[]                |
 origin_id                   | character varying(36)    |
 origin_snapshot_id          | character varying(36)    |
 ssn                         | character varying(64)    |
 record_physical_location    | text                     |
 record_physical_location_id | text                     |
 record_created_date         | timestamp with time zone |
 record_aggregated_date      | timestamp with time zone |
 record_last_review_comment  | text                     |
 record_last_review_date     | timestamp with time zone |
 record_next_review_date     | timestamp with time zone |
 record_originated_date      | timestamp with time zone |
 record_is_vital             | boolean                  | not null default false
 storage_plan_state          | text                     | not null default 'New'::text
 cut_off_date                | timestamp with time zone |
 dispose_date                | timestamp with time zone |
 archive_date                | timestamp with time zone |
Indexes:
    "document_head__id__uniq_key" PRIMARY KEY, btree (id)
    "document_head__parent_id__path_name__unq_idx" UNIQUE, btree (parent_id, path_name) WHERE state::text = 'ACTIVE'::text
    "document_head__snapshot_id__unq" UNIQUE, btree (snapshot_id)
    "document_head__base_type__idx" btree (base_type) WHERE state::text <> 'DELETED'::text
    "document_head__fileplan_node_id__idx" btree (fileplan_node_id)
    "document_head__id__idx" btree (id) WHERE state::text <> 'DELETED'::text
    "document_head__id_path__btree__idx" btree (id_path) WHERE state::text <> 'DELETED'::text
    "document_head__id_path__gist__idx" gist (id_path)
    "document_head__ltx_id__idx" btree (ltx_id)
    "document_head__origin_id__hotfix__idx" btree (origin_id) WHERE origin_id IS NOT NULL
    "document_head__origin_id__idx" btree (origin_id) WHERE state::text <> 'DELETED'::text AND origin_id IS NOT NULL
    "document_head__parent_id__idx" btree (parent_id)
    "document_head__properties_cache__contact_username_idx" btree ((properties_cache -> 'person_meta_info.username'::text)) WHERE base_type::text = 'Contact'::text AND exist(properties_cache, 'person_meta_info.username'::text)
    "document_head__properties_cache__emailmeta_message_id__idx" btree ((properties_cache -> 'emailmeta.message_id'::text)) WHERE base_type::text = 'File'::text AND exist(properties_cache, 'emailmeta.message_id'::text)
    "document_head__properties_cache__idx" gist (properties_cache) WHERE state::text <> 'DELETED'::text
    "document_head__properties_cache__project_identifier__idx" btree ((properties_cache -> 'project.identifier'::text)) WHERE base_type::text = 'Project'::text AND exist(properties_cache, 'project.identifier'::text)
    "document_head__properties_cache__published_origin__idx" btree ((properties_cache -> 'file_published_origin_id.origin_id'::text)) WHERE base_type::text = 'File'::text AND exist(properties_cache, 'file_published_origin_id.origin_id'::text)
    "document_head__state__idx" btree (state)
    "document_head__storage_plan_state__idx" btree (storage_plan_state) WHERE state::text <> 'DELETED'::text
Check constraints:
    "document_base_storage_plan_state_check" CHECK (storage_plan_state = ANY (ARRAY['NEW'::text, 'READY_FOR_CUTOFF'::text, 'CUTOFF'::text, 'READY_FOR_DISPOSITION'::text, 'DISPOSED'::text]))
    "document_head__sensitivity_check" CHECK (sensitivity = ANY (ARRAY[10, 20, 30]))
Foreign-key constraints:
    "document_head__created_by__fk" FOREIGN KEY (created_by) REFERENCES auth_user(id)
    "document_head__modified_by__fk" FOREIGN KEY (modified_by) REFERENCES auth_user(id)
    "document_head__parent_id__fk" FOREIGN KEY (parent_id) REFERENCES document(id)

Some notes:

1. properties stores JSON that for some records may be as large as 300k:

   select count(*) from document_head where length(properties) > 100000;
    count
   -------
      535
   (1 row)

   select count(*) from document_head where length(properties) > 20000;
    count
   -------
    13917
   (1 row)

   select count(*) from document_head where length(properties) > 1000;
    count
   -------
    51708
   (1 row)

   select count(*) from document_head where length(properties) > 300000;
    count
   -------
        3
   (1 row)

   select max(length(properties)) from document_head;
     max
   --------
    334976
   (1 row)

2. properties_cache stores the parsed properties: each key is the jsonpath of a key in the JSON and the value is its value.
3. All results here were retrieved after running `analyze document_head` and `vacuum document_head` manually.
4. I tried different work_mem settings up to 100MB and there is no effect on the main issue described below.
5. I haven't tested disk speed: first of all I think it is irrelevant to the problem, and it is not easy to do as this is a production system.
The function that is used in the query:

CREATE OR REPLACE FUNCTION public.get_doc_path(document_id character varying)
 RETURNS ltree
 LANGUAGE plpgsql
 STABLE
AS $function$
DECLARE
    path ltree;
BEGIN
    select id_path into path from document_head where id = document_id;
    RETURN path;
END $function$

The original query is rather big; a simplified version where the issue can still be demonstrated is:

explain (analyze, buffers)
with trees AS (
    SELECT d.id, d.snapshot_id, NULL :: text[] AS permissions
    FROM document_head AS d
    WHERE (d.id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c')
           AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c')
      AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;

I get this plan https://explain.depesz.com/s/UQX4h:

                                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=86227.19..86227.20 rows=1 width=0) (actual time=3878.775..3878.776 rows=1 loops=1)
   Buffers: shared hit=747698, temp written=1587
   CTE trees
     ->  Seq Scan on document_head d  (cost=0.00..82718.21 rows=155955 width=74) (actual time=0.211..3620.044 rows=154840 loops=1)
           Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text) AND (id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'::character varying)))
           Rows Removed by Filter: 23357
           Buffers: shared hit=747698
   ->  CTE Scan on trees  (cost=0.00..3119.10 rows=155955 width=0) (actual time=0.215..3828.519 rows=154840 loops=1)
         Buffers: shared hit=747698, temp written=1587
 Total runtime: 3881.781 ms
(10 rows)

If I change the ltree predicate to use a subquery, the plan and execution time change:

explain (analyze, buffers)
with trees AS (
    SELECT d.id, d.snapshot_id, NULL :: text[] AS permissions
    FROM document_head AS d
    WHERE (d.id_path <@ (select get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'))
           AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c')
      AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;

https://explain.depesz.com/s/eUR

                                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=763.19..763.20 rows=1 width=0) (actual time=430.519..430.519 rows=1 loops=1)
   Buffers: shared hit=47768, temp written=1587
   CTE trees
     ->  Bitmap Heap Scan on document_head d  (cost=82.05..759.20 rows=177 width=74) (actual time=70.703..249.419 rows=154840 loops=1)
           Recheck Cond: (id_path <@ $0)
           Rows Removed by Index Recheck: 11698
           Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
           Rows Removed by Filter: 23
           Buffers: shared hit=47768
           InitPlan 1 (returns $0)
             ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.083..0.084 rows=1 loops=1)
                   Buffers: shared hit=4
           ->  Bitmap Index Scan on document_head__id_path__gist__idx  (cost=0.00..81.74 rows=178 width=0) (actual time=68.326..68.326 rows=159979 loops=1)
                 Index Cond: (id_path <@ $0)
                 Buffers: shared hit=16238
   ->  CTE Scan on trees  (cost=0.00..3.54 rows=177 width=0) (actual time=70.707..388.714 rows=154840 loops=1)
         Buffers: shared hit=47768, temp written=1587
 Total runtime: 433.410 ms
(18 rows)

I can see that:
1. Both queries return exactly the same data and require the same underlying data to produce the result.
2. When I use the subquery, postgres cannot estimate the number of records that are going to be found by <@.
3. The number of buffers processed by the quicker query is 15 times smaller than by the slow one. I assume that this is why the other query is slower.

My understanding of how the planner works is that it evaluates some plans and calculates their cost based on the settings and the available statistics. The settings define the relative cost of operations like random vs. sequential I/O vs. processing data in memory. I assume that it is better to write queries that allow the planner to make correct estimations, and to tune the settings so that the planner knows the correct cost of operations.

My questions:
1. How can I find out why the slow execution requires 15x more buffers?
2. In the slow execution the planner either does not consider a plan similar to the quick one or estimates it as worse. How can I find out which of these cases (if any) is true? And what can I try to make the planner use the better plan?

I understand that one option to try is to upgrade to a more recent version, and I'm going to test this, but it may take a while until the affected production system gets an upgrade.

More information:

OS: Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-125-generic x86_64)
PG version: PostgreSQL 9.3.14 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
History: no history as this is a new query.
Hardware: aws m4.large (2 vCPUs, RAM 8GB) with gp2 (SSD) storage with throughput up to ~20MB/s.

Non default server settings:

 name | current_setting | source
--------------------------------+-----------------+--------------------
 application_name | psql | client
 archive_command | /etc/rds/dbbin/pgscripts/rds_wal_archive %p | configuration file
 archive_mode | on | configuration file
 archive_timeout | 5min | configuration file
 checkpoint_completion_target | 0.9 | configuration file
 checkpoint_segments | 16 | configuration file
 client_encoding | UTF8 | client
 effective_cache_size | 3987336kB | configuration file
 fsync | on | configuration file
 hot_standby | off | configuration file
 listen_addresses | * | command line
 lo_compat_privileges | off | configuration file
 log_checkpoints | on | configuration file
 log_destination | stderr | configuration file
 log_directory | /rdsdbdata/log/error | configuration file
 log_file_mode | 0644 | configuration file
 log_filename | postgresql.log.%Y-%m-%d-%H | configuration file
 log_hostname | on | configuration file
 log_line_prefix | %t:%r:%u@%d:[%p]: | configuration file
 log_min_duration_statement | 1s | configuration file
 log_rotation_age | 1h | configuration file
 log_timezone | UTC | configuration file
 log_truncate_on_rotation | off | configuration file
 logging_collector | on | configuration file
 maintenance_work_mem | 127MB | configuration file
 max_connections | 1000 | configuration file
 max_locks_per_transaction | 64 | configuration file
 max_prepared_transactions | 0 | configuration file
 max_stack_depth | 6MB | configuration file
 max_wal_senders | 5 | configuration file
 port | 5432 | configuration file
 rds.extensions | btree_gin,btree_gist,chkpass,citext,cube,dblink,dict_int,dict_xsyn,earthdistance,fuzzystrmatch,hstore,intagg,intarray,isn,ltree,pgcrypto,pgrowlocks,pg_stat_statements,pg_trgm,plcoffee,plls,plperl,plpgsql,pltcl,plv8,postgis,postgis_tiger_geocoder,postgis_topology,postgres_fdw,sslinfo,tablefunc,test_parser,tsearch2,unaccent,uuid-ossp | configuration file
 rds.internal_databases | rdsadmin,template0 | configuration file
 rds.superuser_variables | session_replication_role | configuration file
 shared_buffers | 1993664kB | configuration file
 shared_preload_libraries | rdsutils | configuration file
 ssl | on | configuration file
 ssl_ca_file | /rdsdbdata/rds-metadata/ca-cert.pem | configuration file
 ssl_cert_file | /rdsdbdata/rds-metadata/server-cert.pem | configuration file
 ssl_key_file | /rdsdbdata/rds-metadata/server-key.pem | configuration file
 ssl_renegotiation_limit | 0 | configuration file
 stats_temp_directory | /rdsdbdata/db/pg_stat_tmp | configuration file
 superuser_reserved_connections | 3 | configuration file
 synchronous_commit | on | configuration file
 TimeZone | UTC | configuration file
 unix_socket_directories | /tmp | configuration file
 unix_socket_group | rdsdb | configuration file
 unix_socket_permissions | 0700 | configuration file
 wal_keep_segments | 32 | configuration file
 wal_level | hot_standby | configuration file
 wal_receiver_timeout | 30s | configuration file
 wal_sender_timeout | 30s | configuration file
(52 rows)

Re: Bad plan for ltree predicate <@

From
Tom Lane
Date:
Roman Konoval <rkonoval@gmail.com> writes:
> I have a problem on 9.3.14 with a query that accesses table:

I think the root of the problem is your intermediate function:

> CREATE OR REPLACE FUNCTION public.get_doc_path(document_id character varying)
>  RETURNS ltree
>  LANGUAGE plpgsql
>  STABLE
> AS $function$
> DECLARE
>     path ltree;
> BEGIN
>     select id_path into path from document_head where id = document_id;
>     RETURN path;
> END $function$

This is quite expensive, as it involves another table search, but the
planner doesn't know that since you've not marked it as having higher than
normal cost.  The seqscan formulation of the query results in evaluating
this function afresh at most of the rows, whereas shoving it into an
uncorrelated sub-select causes it to be evaluated only once.  That, I
think, and not the seqscan-vs-indexscan aspect, is what makes the bitmap
formulation go faster.  Certainly you'd not expect that a bitmap scan that
has to hit most of the rows anyway is going to win over a seqscan.

The fact that the planner goes for a bitmap scan in the second formulation
is an artifact of the fact that it doesn't try to pre-evaluate sub-selects
for selectivity estimation purposes, so you end up with a default estimate
that says that the <@ condition only selects a small fraction of the rows.
Not sure if we should try to change that or not.

I'd suggest setting the function's cost to 1000 or so and seeing if that
doesn't improve matters.
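
For example (the exact cost number is just a starting point to experiment with):

    ALTER FUNCTION public.get_doc_path(character varying) COST 1000;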

(BTW, what tipped me off to this was that the "buffers hit" count for
the seqscan node was so high, several times more than the actual size
of the table.  I couldn't account for that until I realized that the
function itself would be adding a few buffer hits per execution.)
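
As a rough cross-check, assuming the default 8 kB block size (so the 275 MB heap is on the order of 35,000 buffers):

    (747698 total hits - ~35000 heap buffers) / (154840 + 23357 rows scanned) ≈ 4 extra buffer hits per row

which is consistent with roughly one descent of the id index plus a heap fetch for each get_doc_path() call.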

            regards, tom lane


Re: Bad plan for ltree predicate <@

From
Roman Konoval
Date:
Hi Tom,

Thanks for your help.

> On Dec 1, 2017, at 22:33, Tom Lane wrote:
>
> The seqscan formulation of the query results in evaluating
> this function afresh at most of the rows

The function is defined as STABLE. I thought that means there is no need to reevaluate it on every row, as the input parameter is the same for every row and the return value will be the same during the same query execution. Do I understand incorrectly what STABLE means? Why is the function evaluated more than once?

> , whereas shoving it into an
> uncorrelated sub-select causes it to be evaluated only once.  That, I
> think, and not the seqscan-vs-indexscan aspect, is what makes the bitmap
> formulation go faster.  Certainly you'd not expect that a bitmap scan that
> has to hit most of the rows anyway is going to win over a seqscan.
>
> The fact that the planner goes for a bitmap scan in the second formulation
> is an artifact of the fact that it doesn't try to pre-evaluate sub-selects
> for selectivity estimation purposes, so you end up with a default estimate
> that says that the <@ condition only selects a small fraction of the rows.
> Not sure if we should try to change that or not.
>
> I'd suggest setting the function's cost to 1000 or so and seeing if that
> doesn't improve matters.

If I set the function's cost to 1000 I get a slightly better plan, but still 3.5 times more buffers are read compared to the bitmap scan, which, as you wrote, one would expect to be slower than a seq scan. Here is the plan:

                                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=216438.81..216438.82 rows=1 width=0) (actual time=1262.244..1262.245 rows=1 loops=1)
   Buffers: shared hit=169215
   CTE trees
     ->  Index Scan using document_head__id_path__gist__idx on document_head d  (cost=2.91..212787.85 rows=162265 width=74) (actual time=0.115..727.119 rows=154854 loops=1)
           Index Cond: (id_path <@ get_doc_path('78157c60-45bc-42c1-9aad-c5651995db5c'::character varying))
           Filter: (((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
           Rows Removed by Filter: 23
           Buffers: shared hit=169215
   ->  CTE Scan on trees  (cost=0.00..3245.30 rows=162265 width=0) (actual time=0.119..1118.899 rows=154854 loops=1)
         Buffers: shared hit=169215
 Total runtime: 1277.010 ms
(11 rows)

My understanding is that the optimal plan in this case should read less data than the bitmap scan by the amount of buffers hit by the bitmap index scan. It should read roughly all buffers of the table itself.
Something like the query with the predicate using an ltree literal instead of the function invocation:

explain (analyze, buffers)
with trees AS (
    SELECT d.id, d.snapshot_id, NULL :: text[] AS permissions
    FROM document_head AS d
    WHERE (d.id_path <@ '869c0187_51ae_4deb_a36f_0425fdafda6e.78157c60_45bc_42c1_9aad_c5651995db5c'::ltree
           AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c')
      AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;

                                                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=42114.02..42114.03 rows=1 width=0) (actual time=997.427..997.427 rows=1 loops=1)
   Buffers: shared hit=35230
   CTE trees
     ->  Seq Scan on document_head d  (cost=0.00..38463.06 rows=162265 width=74) (actual time=0.013..593.082 rows=154854 loops=1)
           Filter: ((id_path <@ '869c0187_51ae_4deb_a36f_0425fdafda6e.78157c60_45bc_42c1_9aad_c5651995db5c'::ltree) AND ((id)::text <> '78157c60-45bc-42c1-9aad-c5651995db5c'::text) AND ((state)::text <> 'DELETED'::text))
           Rows Removed by Filter: 23357
           Buffers: shared hit=35230
   ->  CTE Scan on trees  (cost=0.00..3245.30 rows=162265 width=0) (actual time=0.017..888.076 rows=154854 loops=1)
         Buffers: shared hit=35230
 Total runtime: 1011.565 ms
(10 rows)

The question is whether it is possible to get a plan like that while still using the function, or some other way to get the ltree value for a given document_head.id in one query (one possible shape of such a query is sketched below). As an alternative I can get the ltree value with a separate query, but this would require:
1. a round-trip to postgres, and
2. changing the isolation level to REPEATABLE READ to make sure that I get a consistent result,
so I would like to avoid that.

Regards,
Roman Konoval
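
A rough sketch of such a single-statement formulation, fetching the ltree value through a CTE instead of calling the function (untested; whether the planner then chooses the cheap seq-scan plan on 9.3 is not verified here):

explain (analyze, buffers)
with root AS (
    -- look up the root document's ltree path once, in the same statement
    SELECT id_path
    FROM document_head
    WHERE id = '78157c60-45bc-42c1-9aad-c5651995db5c'
),
trees AS (
    SELECT d.id, d.snapshot_id, NULL :: text[] AS permissions
    FROM document_head AS d, root
    WHERE d.id_path <@ root.id_path
      AND d.id != '78157c60-45bc-42c1-9aad-c5651995db5c'
      AND d.state != 'DELETED'
)
SELECT COUNT(*) FROM trees;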