Discussion: Slow "not in array" operation
SELECT COUNT(*) FROM "subscriptions" WHERE "subscriptions"."project_id" = 123 AND "subscriptions"."trashed_at" IS NULL AND NOT (tags @> ARRAY['en']::varchar[]);
=== QUERY PLAN ===
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=2152593.04..2152593.05 rows=1 width=8) (actual time=70555.561..70555.561 rows=1 loops=1)
-> Gather (cost=2152592.31..2152593.02 rows=7 width=8) (actual time=70540.641..70702.365 rows=8 loops=1)
Workers Planned: 7
Workers Launched: 7
-> Partial Aggregate (cost=2151592.31..2151592.32 rows=1 width=8) (actual time=70537.376..70537.377 rows=1 loops=8)
-> Parallel Seq Scan on subscriptions (cost=0.00..2149490.49 rows=840731 width=0) (actual time=0.742..70479.359 rows=611828 loops=8)
Filter: ((trashed_at IS NULL) AND (NOT (tags @> '{en}'::character varying[])) AND (project_id = 123))
Rows Removed by Filter: 4572769
Planning Time: 1.304 ms
Execution Time: 70702.463 ms
(10 rows)
=== INDEXES ===
Indexes:
"subscriptions_pkey" PRIMARY KEY, btree (id)
"index_subscriptions_on_project_id_and_created_at" btree (project_id, created_at DESC)
"index_subscriptions_on_project_id_and_tags" gin (project_id, tags) WHERE trashed_at IS NULL
"index_subscriptions_on_project_id_and_trashed_at" btree (project_id, trashed_at DESC)
What's the plan for the slow one? What's the time to just count all rows?
On Tue, Nov 12, 2019 at 12:20:10PM -0700, Michael Lewis wrote:
> It is very interesting to me that the optimizer chose a parallel sequential
> scan rather than an index scan on either of your indexes that start
> with project_id that also reference trashed_at.

Maybe because of low correlation on any of those columns?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, inherited, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, correlation FROM pg_stats WHERE tablename='subscriptions' AND attname IN ('project_id','tags') ORDER BY 1 DESC;

Maybe clustering the table on project_id (and ANALYZEing) would help, but that might need to be done consistently. Michael previously suggested partitioning which, if done on project_id, would then no longer need to be specially CLUSTERed.

Is the plan for the fast query the same as in August?
https://www.postgresql.org/message-id/CAFvCgN4UijKTYiOF61Tyd%2BgHvF_oqnMabatS9%2BDcX%2B_PK2SHRw%40mail.gmail.com

Justin
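Justin's CLUSTER suggestion could be tried roughly like this (a sketch, using the existing index name from the \d output above; note that CLUSTER rewrites the whole table under an ACCESS EXCLUSIVE lock):

```sql
-- Physically reorder the table by project_id via an existing btree index,
-- improving the correlation statistic for project_id.
-- The ordering is not maintained for new rows, so this must be repeated
-- periodically, which is why Justin notes partitioning avoids the problem.
CLUSTER subscriptions USING index_subscriptions_on_project_id_and_created_at;
ANALYZE subscriptions;  -- refresh planner statistics after the rewrite
```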
It is very interesting to me that the optimizer chose a parallel sequential scan rather than an index scan on either of your indexes that start with project_id that also reference trashed_at.

1) Are you running on SSD type storage? Has random_page_cost been lowered to 1-1.5 or so (close to 1 assumes good cache hits)?
2) It seems you have increased parallel workers. Have you also changed the startup or other cost configs related to how inclined the system is to use sequential scans?
3) If you disable sequential scan, what does the plan look like for this query? (SET ENABLE_SEQSCAN TO OFF;)
1) It is running on a DigitalOcean CPU-optimized droplet with dedicated hyperthreads (16 cores) and SSD.

SHOW random_page_cost; => 2

2) What config names should I check exactly? I used some suggestions from the online PGTune when I first configured the db some months ago:

max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16

3) Here's the query plan that I get after disabling the seq scan:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=2183938.89..2183938.90 rows=1 width=8) (actual time=94972.253..94972.254 rows=1 loops=1)
-> Gather (cost=2183938.16..2183938.87 rows=7 width=8) (actual time=94952.895..95132.626 rows=8 loops=1)
Workers Planned: 7
Workers Launched: 7
-> Partial Aggregate (cost=2182938.16..2182938.17 rows=1 width=8) (actual time=94950.958..94950.958 rows=1 loops=8)
-> Parallel Bitmap Heap Scan on subscriptions (cost=50294.50..2180801.47 rows=854677 width=0) (actual time=1831.342..94895.208 rows=611828 loops=8)
Recheck Cond: ((project_id = 123) AND (trashed_at IS NULL))
Rows Removed by Index Recheck: 2217924
Filter: (NOT (tags @> '{en}'::character varying[]))
Rows Removed by Filter: 288545
Heap Blocks: exact=120301 lossy=134269
-> Bitmap Index Scan on index_subscriptions_on_project_id_and_tags (cost=0.00..48798.81 rows=6518094 width=0) (actual time=1493.823..1493.823 rows=7203173 loops=1)
Index Cond: (project_id = 123)
Planning Time: 1.273 ms
Execution Time: 95132.766 ms
(15 rows)
Marco Colli <collimarco91@gmail.com> writes:
> 3) Here's the query plan that I get after disabling the seq scan:
> Finalize Aggregate (cost=2183938.89..2183938.90 rows=1 width=8) (actual
> time=94972.253..94972.254 rows=1 loops=1)
So, this is slower than the seqscan, which means the planner made the
right choice.
You seem to be imagining that there's some way the index could be used
with the NOT clause, but there isn't. Indexable clauses are of the form
indexed_column indexable_operator constant
and there's no provision for a NOT in that. If we had a "not contained
in" array operator, the NOT could be folded to be of this syntactic form,
but it's highly unlikely that any index operator class would consider such
an operator to be a supported indexable operator. It doesn't lend itself
to searching an index.
So the planner is doing the best it can, which in this case is a
full-table scan.
A conceivable solution, if the tags array is a lot smaller than
the table as a whole and the table is fairly static, is that you could
make a btree index on the tags array and let the planner fall back
to an index-only scan that is just using the index as a cheaper
source of the array data. (This doesn't work for your existing GIST
index because GIST can't reconstruct the original arrays on-demand.)
I suspect though that this wouldn't win much, even if you disregard
the maintenance costs for the extra index. The really fundamental
problem here is that a large fraction of the table satisfies the
NOT-in condition, and no index is going to beat a seqscan by all that
much when that's true. Indexes are good at retrieving small portions
of tables.
regards, tom lane
> 3) Here's the query plan that I get after disabling the seq scan:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=2183938.89..2183938.90 rows=1 width=8) (actual time=94972.253..94972.254 rows=1 loops=1)
-> Gather (cost=2183938.16..2183938.87 rows=7 width=8) (actual time=94952.895..95132.626 rows=8 loops=1)
Workers Planned: 7
Workers Launched: 7
-> Partial Aggregate (cost=2182938.16..2182938.17 rows=1 width=8) (actual time=94950.958..94950.958 rows=1 loops=8)
-> Parallel Bitmap Heap Scan on subscriptions (cost=50294.50..2180801.47 rows=854677 width=0) (actual time=1831.342..94895.208 rows=611828 loops=8)
Recheck Cond: ((project_id = 123) AND (trashed_at IS NULL))
Rows Removed by Index Recheck: 2217924
Filter: (NOT (tags @> '{en}'::character varying[]))
Rows Removed by Filter: 288545
Heap Blocks: exact=120301 lossy=134269
-> Bitmap Index Scan on index_subscriptions_on_project_id_and_tags (cost=0.00..48798.81 rows=6518094 width=0) (actual time=1493.823..1493.823 rows=7203173 loops=1)
Index Cond: (project_id = 123)
Planning Time: 1.273 ms
Execution Time: 95132.766 ms
(15 rows)
What was the plan for the one that took 500ms? I don't see how it is possible that this one is 180 times slower than that one. Maybe a hot cache versus cold cache?

Also, it seems weird to me that "trashed_at IS NULL" shows up in the recheck but not in the original Index Cond.

Increasing work_mem can also help, but since the Bitmap Index Scan itself took half the time there is only so much it can do.

Cheers,
Jeff
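Jeff's work_mem point can be checked per-session (the value below is illustrative; the "lossy" heap blocks in the plan mean the bitmap overflowed work_mem and degraded to per-page granularity, which forces the expensive recheck):

```sql
-- Raise the bitmap memory budget for this session only, then re-run EXPLAIN.
-- If "lossy" disappears from "Heap Blocks", the recheck overhead goes away.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM "subscriptions"
WHERE "subscriptions"."project_id" = 123
  AND "subscriptions"."trashed_at" IS NULL
  AND NOT (tags @> ARRAY['en']::varchar[]);
```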
Disclaimer: Out over my skis again.

From what you say here, and over on SO, it sounds like you've got two problems:
* Matching on huge numbers of records because of common tags.
* A dynamic collection of tags, as they're customer driven/configured.

An "ideal" solution might look like a bit-index for each tag+tuple, but Postgres does not have such a structure. The closest I've seen are Bloom filter based indexes. That's likely not going to work here as you don't know the collection of tags at any one time. If, however, you create your own frequency count estimates for tags, you may well find that there are a small number of common tags and a large number of rare tags. That would be good to find out. If you do have some super common (non-selective) tags, then perhaps a Bloom index based on that collection could be effective, or expression indexes on the very common tags. In your SaaS setup, you might need counts/indexes tied to some kind of customer/tenancy distinction ID, understood. But, for simplicity, I'm just saying a single set of frequency counts, etc.

Here's a recent article on Bloom filter based indexes in Postgres that looks decent:
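One concrete way to act on the "expression indexes on the very common tags" idea is a partial index whose predicate matches the slow query exactly (a sketch; the index name is made up, and it only helps for the one hard-coded tag 'en'):

```sql
-- A partial index covering exactly the rows the slow query counts.
-- The planner can use it whenever the query's WHERE clause implies
-- the index predicate, turning the count into a small index scan.
CREATE INDEX subscriptions_no_en_tag_idx
    ON subscriptions (project_id)
    WHERE trashed_at IS NULL AND NOT (tags @> ARRAY['en']::varchar[]);
```

The trade-off is one such index per common tag, so this only scales if the frequency counts show a small set of very common tags.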
Replying to the previous questions:
- work_mem = 64MB (there are hundreds of connections)
- the project 123 has more than 7M records, and those that don't have the tag 'en' are 4.8M

> What was the plan for the one that took 500ms?

This is the query / plan without the filter on tags:

SELECT COUNT(*) FROM "subscriptions" WHERE "subscriptions"."project_id" = 123 AND "subscriptions"."trashed_at" IS NULL;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=291342.67..291342.68 rows=1 width=8) (actual time=354.556..354.556 rows=1 loops=1)
 -> Gather (cost=291342.05..291342.66 rows=6 width=8) (actual time=354.495..374.305 rows=7 loops=1)
 Workers Planned: 6
 Workers Launched: 6
 -> Partial Aggregate (cost=290342.05..290342.06 rows=1 width=8) (actual time=349.799..349.799 rows=1 loops=7)
 -> Parallel Index Only Scan using index_subscriptions_on_project_id_and_uid on subscriptions (cost=0.56..287610.27 rows=1092713 width=0) (actual time=0.083..273.018 rows=1030593 loops=7)
 Index Cond: (project_id = 123)
 Heap Fetches: 280849
Planning Time: 0.753 ms
Execution Time: 374.483 ms
(10 rows)
On Wed, Nov 13, 2019 at 4:20 AM Marco Colli <collimarco91@gmail.com> wrote:
> This is the query / plan without the filter on tags: [...]

My previous comment about the bitmap index scan taking half the time was a slip of the eye; I was comparing the *cost* of the bitmap index scan to the *time* of the overall plan.

But then the question is, why isn't it doing an index-only scan on "index_subscriptions_on_project_id_and_tags"? And the answer is that it is because it is a GIN index. Make the same index, only as btree, and you should get good performance, as it can filter the tags within a given project without visiting the table.

Cheers,
Jeff
> the answer is that it is because it is a GIN index. Make the same index, only as btree, and you should get good performance, as it can filter the tags within a given project without visiting the table.

Currently I have this GIN index:

"index_subscriptions_on_project_id_and_tags" gin (project_id, tags) WHERE trashed_at IS NULL
It uses the btree_gin extension and works perfectly for tag search, except for the "NOT" operator. I don't understand why it doesn't use the GIN index also for the "NOT" operator.
The problem is that I cannot create the same index with BTree, because PG doesn't support BTree on array :(
On Wed, Nov 13, 2019 at 6:56 AM Marco Colli <collimarco91@gmail.com> wrote:
> Currently I have this GIN index:
> "index_subscriptions_on_project_id_and_tags" gin (project_id, tags) WHERE trashed_at IS NULL

Multicolumn GIN indexes are nearly worthless IMO when one column is a scalar. You can use this index, but it won't be better than one just on "GIN (tags) WHERE trashed_at IS NULL". An N-column GIN index is mostly the same thing as N single-column GIN indexes.

> It uses the btree_gin extension and works perfectly for tag search, except for the "NOT" operator. I don't understand why it doesn't use the GIN index also for the "NOT" operator.

Because it can't. Tom already did a good job of describing that. Can you describe what steps you think an index should take to jump to the specific rows which fail to exist in an inverted index?

> The problem is that I cannot create the same index with BTree, because PG doesn't support BTree on array :(

Sure it does. It can't jump to specific parts of the index based on the array containment operators, but it can use them for in-index filtering (but only if you can do an index-only scan). And really, that is probably all you need to get a > 100x improvement.

Are you getting an error when you try to build it? If so, what is the error?

Cheers,
Jeff
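Jeff's suggestion, sketched concretely (hypothetical index name; an index-only scan over it can evaluate the containment filter without visiting the heap, provided the visibility map is well maintained by VACUUM):

```sql
-- Same column list and predicate as the existing GIN index, but as btree.
-- Btree cannot *search* by the @> operator, but an index-only scan can read
-- the tags arrays straight from the index and apply NOT (tags @> '{en}')
-- as an in-index filter, skipping the much wider heap rows.
-- Caveat: btree index entries are size-limited (roughly a third of a page),
-- so rows with very large tags arrays could make the build fail.
CREATE INDEX index_subscriptions_on_project_id_and_tags_btree
    ON subscriptions USING btree (project_id, tags)
    WHERE trashed_at IS NULL;
```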