Improving scalability of Parallel Bitmap Heap/Index Scan

Поиск
Список
Период
Сортировка
От David Geier
Тема Improving scalability of Parallel Bitmap Heap/Index Scan
Дата
Msg-id CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
обсуждение исходный текст
Ответы Re: Improving scalability of Parallel Bitmap Heap/Index Scan  (John Naylor <john.naylor@enterprisedb.com>)
Список pgsql-hackers
Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in short BHS / BIS), we observed a few issues regarding scalability:
  1. The BIS always only runs in a single process, also when the parent BHS is parallel. The first process arriving in the BHS serves as leader and executes the BIS.
  2. As long as execution is "exact" (TIDs are stored instead of page bits), the parallel BHS sorts all TIDs to ensure pages are accessed sequentially. The sort is also performed just by a single worker. Already with a few tens of thousands of pages to scan, the sort time can make up a significant portion of the total runtime. Large page counts and the need for parallelism are not uncommon for BHS, as one use case is closing the gap between index and sequential scans. The BHS costing seems to not account for that.
  3. The BHS does not scale well with an increasing number of parallel workers, even when accounting for the sequential parts of execution. A perf profile shows that the TID list / bitmap iteration code heavily contents on a mutex taken for every single TID / page bit (see LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067).
  4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the statistics of the parallel workers. For example the number of heap pages processed is what just the leader did. Similarly to other parallel plan nodes we should aggregate statistics across workers.
The EXPLAIN ANALYZE output below shows (1) to (3) happening in action for different numbers of workers. I had to obfuscate the query slightly. The difference between the startup time of the BHS and the BIS is the time it takes to sort the TID list. The self time of the BHS is just the time spent on processing the shared TID list and processing the pages. That part runs in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
-------------------------------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms

None of this is really new and some of it is even documented. So, what I am more wondering about is why things are the way they are and how hard it would be to change them. I am especially curious about:
  • What stops us from extending the BIS to run in parallel? Parallel Bitmap Index Scans are also supported.
  • What about reducing the sort time by, e.g.
    • dividing TIDs across workers, ending up with N parallely sorted streams,
    • cooperatively sorting the TIDs with multiple workers using barriers for synchronization,
    • optimizing the PagetableEntry data structure for size and using a faster sorting algorithm like e.g. radix sort
    • a combination of the first three options
  • With separate TID lists per worker process the iteration problem would be solved. Otherwise, we could
    • optimize the iteration code and thereby minimize the duration of the critical section,
    • have worker processes acquire chunks of TIDs / page bits to reduce locking.
Is there interest in patches improving on the above mentioned shortcomings? If so, which options do you deem best?

--
David Geier
(ServiceNow)



-- 2 workers

 Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
               Output: PARTIAL count(*)
               Worker 0:  actual time=15181.364..15181.366 rows=1 loops=1
               Worker 1:  actual time=15181.608..15181.610 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
                     Filter: ...
                     Rows Removed by Filter: 4020149
                     Worker 0:  actual time=5912.498..15166.936 rows=269305 loops=1
                     Worker 1:  actual time=5912.963..15167.048 rows=270121 loops=1
                     ->  Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 1:  actual time=3107.947..3107.948 rows=8579724 loops=1
 Planning Time: 0.167 ms
 Execution Time: 15322.081 ms


-- 4 workers

 Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate (actual time=13130.344..13130.346 rows=1 loops=4)
               Output: PARTIAL count(*)
               Worker 0:  actual time=13129.363..13129.365 rows=1 loops=1
               Worker 1:  actual time=13130.085..13130.087 rows=1 loops=1
               Worker 2:  actual time=13130.634..13130.635 rows=1 loops=1
               Worker 3:  actual time=13131.295..13131.298 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual time=5870.026..13120.579 rows=134856 loops=4)
                     Filter: ...
                     Rows Removed by Filter: 2010074
                     Worker 0:  actual time=5869.033..13120.453 rows=128270 loops=1
                     Worker 1:  actual time=5869.698..13118.811 rows=135333 loops=1
                     Worker 2:  actual time=5870.465..13121.189 rows=137695 loops=1
                     Worker 3:  actual time=5870.907..13121.864 rows=138128 loops=1
                     ->  Bitmap Index Scan on foo_idx (actual time=3094.585..3094.586 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 3:  actual time=3094.585..3094.586 rows=8579724 loops=1
 Planning Time: 0.146 ms
 Execution Time: 13277.315 ms

-- 8 workers

 Finalize Aggregate (actual time=14533.688..14627.962 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=14492.463..14627.950 rows=8 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 8
         Workers Launched: 8
         ->  Partial Aggregate (actual time=14483.059..14483.061 rows=1 loops=8)
               Output: PARTIAL count(*)
               Worker 0:  actual time=14480.058..14480.061 rows=1 loops=1
               Worker 1:  actual time=14480.948..14480.950 rows=1 loops=1
               Worker 2:  actual time=14481.668..14481.670 rows=1 loops=1
               Worker 3:  actual time=14482.829..14482.832 rows=1 loops=1
               Worker 4:  actual time=14483.695..14483.697 rows=1 loops=1
               Worker 5:  actual time=14484.290..14484.293 rows=1 loops=1
               Worker 6:  actual time=14485.166..14485.168 rows=1 loops=1
               Worker 7:  actual time=14485.819..14485.821 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual time=5886.191..14477.239 rows=67428 loops=8)
                     Filter: ...
                     Rows Removed by Filter: 1005037
                     Worker 0:  actual time=5882.909..14474.627 rows=60325 loops=1
                     Worker 1:  actual time=5883.788..14474.945 rows=69459 loops=1
                     Worker 2:  actual time=5884.475..14475.735 rows=69686 loops=1
                     Worker 3:  actual time=5886.149..14477.162 rows=64680 loops=1
                     Worker 4:  actual time=5886.987..14477.653 rows=71034 loops=1
                     Worker 5:  actual time=5887.347..14478.667 rows=65836 loops=1
                     Worker 6:  actual time=5888.978..14479.239 rows=67755 loops=1
                     Worker 7:  actual time=5888.896..14479.886 rows=70651 loops=1
                     ->  Bitmap Index Scan on foo_idx (actual time=3106.840..3106.840 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 7:  actual time=3106.840..3106.840 rows=8579724 loops=1
 Planning Time: 0.150 ms
 Execution Time: 14628.648 ms

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Richard Guo
Дата:
Сообщение: Re: remove_useless_groupby_columns is too enthusiastic
Следующее
От: Gareth Palmer
Дата:
Сообщение: Re: [PATCH] Implement INSERT SET syntax