BRIN index which is much faster never chosen by planner

From: Jeremy Finzel
Subject: BRIN index which is much faster never chosen by planner
Date:
Msg-id: CAMa1XUj8WLszW8L2piuPZ88hh+CWitFu-qtf-6TR5rPw5xvhMA@mail.gmail.com
Replies: Re: BRIN index which is much faster never chosen by planner  (Michael Lewis <mlewis@entrata.com>)
         Re: BRIN index which is much faster never chosen by planner  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
Good Afternoon,

I posted about this in another thread, but that thread was not precisely about planner issues, so I wanted to raise it here.

I am running Postgres 11.5.  I have a table that is insert-only and has 312 million rows.  It is also continuously pruned to keep only the past year of data.  The size of the table is 223 GB with indexes, 140 GB without.  One of the fields is rec_insert_time timestamptz.  Here are all the potentially relevant table stats:

schemaname          | foo
relname             | log_table
n_tup_ins           | 86850506
n_tup_upd           | 0
n_tup_del           | 68916115
n_tup_hot_upd       | 0
n_live_tup          | 312810691
n_dead_tup          | 9405132
n_mod_since_analyze | 11452
last_vacuum         | 2019-09-20 09:41:43.78383-05
last_autovacuum     | 2019-10-04 13:56:16.810355-05
last_analyze        | 2019-10-10 09:34:26.807695-05
last_autoanalyze    |
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 13
autoanalyze_count   | 0
total_relation_size | 223 GB
relation_size       | 139 GB
table_size          | 140 GB

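For reference, stats like these can be pulled with something along these lines (a sketch combining pg_stat_user_tables with the size functions; column list trimmed to what is shown above):

-- per-table activity counters plus on-disk sizes for foo.log_table
SELECT schemaname, relname,
       n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd,
       n_live_tup, n_dead_tup, n_mod_since_analyze,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       vacuum_count, autovacuum_count, analyze_count, autoanalyze_count,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_relation_size,
       pg_size_pretty(pg_relation_size(relid))       AS relation_size,
       pg_size_pretty(pg_table_size(relid))          AS table_size
FROM pg_stat_user_tables
WHERE schemaname = 'foo' AND relname = 'log_table';
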
I have a simple query looking at the past 10 days based on rec_insert_time, and it will not choose the BRIN index even with several configurations.  Here are all the relevant indexes (I intentionally do not have a btree on rec_insert_time because I believe BRIN *should* be a better fit here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product, rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH (pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH (pages_per_range='1000')

And here is the SQL:
SELECT
 category,
 source,
 MIN(rec_insert_time) OVER (partition by source order by rec_insert_time) AS first_source_time,
 MAX(rec_insert_time) OVER (partition by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases
;

This query will choose a seq scan on log_table every time in spite of these BRIN indexes on rec_insert_time.
@Michael Lewis had suggested I check the statistics target for this column.  I raised it to 5000 for this column and it is still choosing a seq scan.
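
For reference, raising the per-column target looked roughly like this (a sketch; this is the per-column knob rather than the global default_statistics_target GUC):

-- raise the statistics target for rec_insert_time, then refresh the stats
ALTER TABLE foo.log_table ALTER COLUMN rec_insert_time SET STATISTICS 5000;
ANALYZE foo.log_table;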

Here is the default chosen plan (takes 2 minutes 12 seconds):
 WindowAgg  (cost=24437881.80..24437897.70 rows=707 width=120) (actual time=132173.173..132173.222 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
   ->  Sort  (cost=24437881.80..24437883.56 rows=707 width=104) (actual time=132173.146..132173.149 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
         ->  Subquery Scan on unique_cases  (cost=24437834.20..24437848.34 rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
               ->  Unique  (cost=24437834.20..24437841.27 rows=707 width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                     ->  Sort  (cost=24437834.20..24437835.96 rows=707 width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                           ->  Nested Loop  (cost=0.00..24437800.73 rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Join Filter: ((l.category)::text = filter.category)
                                 Rows Removed by Join Filter: 346704
                                 Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                                 ->  Seq Scan on foo.log_table l  (cost=0.00..24420483.80 rows=707 width=99) (actual time=4095.763..132112.686 rows=466 loops=1)
                                       Output: <hidden>
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name) AND (l.rec_insert_time >= (now() - '10 days'::interval)))
                                       Rows Removed by Filter: 312830265
                                       Buffers: shared hit=391675 read=17772636 dirtied=4679 written=7
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8) (actual time=0.001..0.036 rows=745 loops=466)
                                       Output: filter.category
                                       Buffers: shared hit=1 read=6
                                       ->  Seq Scan on public.small_join_table filter  (cost=0.00..26.99 rows=1399 width=8) (actual time=0.054..0.189 rows=745 loops=1)
                                             Output: filter.category
                                             Buffers: shared hit=1 read=6
 Planning Time: 0.552 ms
 Execution Time: 132173.657 ms
(38 rows)


Here is the plan I get when I turn off seqscan and indexscan!!! (21 seconds):
SET enable_seqscan TO false;
SET enable_indexscan TO false;

 WindowAgg  (cost=24363224.32..24363240.85 rows=735 width=120) (actual time=21337.992..21338.040 rows=53 loops=1)
   Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
   Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
   ->  Sort  (cost=24363224.32..24363226.15 rows=735 width=104) (actual time=21337.965..21337.968 rows=53 loops=1)
         Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
         ->  Subquery Scan on unique_cases  (cost=24363174.62..24363189.32 rows=735 width=104) (actual time=21337.777..21337.889 rows=53 loops=1)
               Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
               Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
               ->  Unique  (cost=24363174.62..24363181.97 rows=735 width=124) (actual time=21337.772..21337.874 rows=53 loops=1)
                     Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                     Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                     ->  Sort  (cost=24363174.62..24363176.46 rows=735 width=124) (actual time=21337.767..21337.791 rows=466 loops=1)
                           Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           Sort Method: quicksort  Memory: 90kB
                           Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                           ->  Nested Loop  (cost=2393.31..24363139.63 rows=735 width=124) (actual time=824.212..21336.263 rows=466 loops=1)
                                 Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                 Inner Unique: true
                                 Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                                 ->  Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
                                       Output: <hidden>
                                       Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                       Rows Removed by Index Recheck: 8187584
                                       Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
                                       Rows Removed by Filter: 19857107
                                       Heap Blocks: lossy=1509000
                                       Buffers: shared hit=73 read=1509030 dirtied=121 written=1631
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2391.52 rows=157328135 width=0) (actual time=72.391..72.391 rows=15090000 loops=1)
                                             Index Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                             Buffers: shared hit=29 read=74
                                 ->  Bitmap Heap Scan on public.small_join_table filter  (cost=1.60..3.12 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=466)
                                       Output: filter.category, filter.type, filter.location
                                       Recheck Cond: (filter.category = (l.category)::text)
                                       Heap Blocks: exact=466
                                       Buffers: shared hit=1398
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=466)
                                             Index Cond: (filter.category = (l.category)::text)
                                             Buffers: shared hit=932
 Planning Time: 1.869 ms
 Execution Time: 21338.244 ms
(44 rows)

Notice that it chooses the smallest BRIN index, the one with 1000 pages per range, and this is far faster than the seq scan.
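
A quick way to compare the on-disk sizes of all the indexes on this table (a sketch using pg_relation_size):

-- list every index on foo.log_table with its size
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_index
WHERE indrelid = 'foo.log_table'::regclass;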

I do believe the estimate is actually way off.  Just a plain EXPLAIN of the latter (forced) plan estimates roughly 10x more rows out of the BRIN bitmap index scan than actually come back:
 WindowAgg  (cost=24354689.19..24354705.07 rows=706 width=120)
   ->  Sort  (cost=24354689.19..24354690.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24354641.66..24354655.78 rows=706 width=104)
               ->  Unique  (cost=24354641.66..24354648.72 rows=706 width=124)
                     ->  Sort  (cost=24354641.66..24354643.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=2385.42..24354608.25 rows=706 width=124)
                                 ->  Bitmap Heap Scan on log_table l  (cost=2383.82..24352408.26 rows=706 width=99)
                                       Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                       ->  Bitmap Index Scan on rec_insert_time_brin_1000  (cost=0.00..2383.64 rows=156577455 width=0)
                                             Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                 ->  Bitmap Heap Scan on small_join_table filter  (cost=1.60..3.12 rows=1 width=8)
                                       Recheck Cond: (category = (l.category)::text)
                                       ->  Bitmap Index Scan on small_join_table_pkey  (cost=0.00..1.60 rows=1 width=0)
                                             Index Cond: (category = (l.category)::text)
(17 rows)
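
In case it is useful, the column-level statistics feeding that estimate can be inspected directly (a sketch against the standard pg_stats view):

-- planner inputs for the rec_insert_time column
SELECT n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename = 'log_table'
  AND attname = 'rec_insert_time';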


And here is a plain EXPLAIN of the default chosen plan:
 WindowAgg  (cost=24437857.18..24437873.07 rows=706 width=120)
   ->  Sort  (cost=24437857.18..24437858.95 rows=706 width=104)
         Sort Key: unique_cases.source, unique_cases.rec_insert_time
         ->  Subquery Scan on unique_cases  (cost=24437809.66..24437823.78 rows=706 width=104)
               ->  Unique  (cost=24437809.66..24437816.72 rows=706 width=124)
                     ->  Sort  (cost=24437809.66..24437811.42 rows=706 width=124)
                           Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                           ->  Nested Loop  (cost=0.00..24437776.25 rows=706 width=124)
                                 Join Filter: ((l.category)::text = filter.category)
                                 ->  Seq Scan on log_table l  (cost=0.00..24420483.80 rows=706 width=99)
                                       Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
                                 ->  Materialize  (cost=0.00..33.98 rows=1399 width=8)
                                       ->  Seq Scan on small_join_table filter  (cost=0.00..26.99 rows=1399 width=8)
(13 rows)



Any insight into this is much appreciated.  This is just one example of many similar issues I have been finding with getting the planner to choose BRIN indexes, which should scale predictably with insert-only workloads.

Thanks!
Jeremy
