SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

Поиск
Список
Период
Сортировка
От Dimitrios Apostolou
Тема SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Дата
Msg-id 7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net
обсуждение исходный текст
Ответы Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Список pgsql-general
Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:

> \d test_runs_raw
                                  Partitioned table "public.test_runs_raw"
       Column       |            Type             | Collation | Nullable |
Default
-------------------+-----------------------------+-----------+----------+----------------------------------
  run_n             | bigint                      |           | not null | generated by default as identity
  test_case_n       | smallint                    |           | not null |
  workitem_n        | integer                     |           | not null |
  test_resulttype_n | smallint                    |           |          |
Partition key: RANGE (workitem_n)
Indexes:
     "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k = kilo) up to 20000k.


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable, is that in 998 out of 1000 table scans it involves,
the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

  Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
    ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
          ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)               Sort Key: test_runs_raw.workitem_n
DESC
                ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                      Workers Planned: 4
                      ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                            Group Key: test_runs_raw.workitem_n
                            ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on
test_runs_raw__part_max9600ktest_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4) 
                                  ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on
test_runs_raw__part_max10140ktest_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4) 
                                  ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475
(cost=0.00..3037793.12rows=64121612 width=4) 
                                  ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559
(cost=0.00..2918875.90rows=61612190 width=4) 
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

  Limit  (cost=363.84..367.30 rows=10 width=4)
    ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
          ->  Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k
test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k
test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k
test_runs_raw_998 (cost=0.12..2.34 rows=1 width=4) 
          ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k
test_runs_raw_997 (cost=0.12..2.34 rows=1 width=4) 
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n value range is well spread into in range
(0,13M) and the table has been gradually filled within one year, so I'm
assuming the vacuum worker has worked long enough to build sane statistics
(not sure how to verify that).


REMARKS

1. I tried reproducing the problem on an artificial table with few
    partitions and few values, but I couldn't. Both queries execute fast,
    and the planner is always choosing a non-parallel index-only scan.

2. Among testing changes to various settings, I just noticed that setting
    max_parallel_workers_per_gather to 0 (from the original value of 4)
    fixes the issue! On the original huge table, disabling parallelism
    actually makes the query infinitely faster and it returns within 1s! Is
    this a bug in the planner?


Thank you,
Dimitris




В списке pgsql-general по дате отправления:

Предыдущее
От: Daniel McKenzie
Дата:
Сообщение: Re: Unexpected data when subscribing to logical replication slot
Следующее
От: Dimitrios Apostolou
Дата:
Сообщение: Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions