Introduce Index Aggregate - new GROUP BY strategy

Поиск
Список
Период
Сортировка
От Sergey Soloviev
Тема Introduce Index Aggregate - new GROUP BY strategy
Дата
Msg-id 7d6d8cc4-d74a-4edc-bbeb-944916cf1093@tantorlabs.ru
обсуждение исходный текст
Ответы Re: Introduce Index Aggregate - new GROUP BY strategy
Список pgsql-hackers
Hi, hackers!

I would like to introduce new GROUP BY strategy, called Index Aggregate.
In a nutshell, we build B+tree index where GROUP BY attributes are index
keys and if memory limit reached we will build index for each batch and
spill it to the disk as sorted run performing final external merge.

It works (and implemented) much like Hash Aggregate and most differences
in spill logic:

1. As tuples arrive build in-memory B+tree index
2. If memory limit reached then switch to the spill mode (almost like hashagg):
      - calculate hash for the tuple
      - decide in which batch it should be stored
      - spill tuples to the batch
3. When all tuples are processed and there is no disk spill, then return all tuples
     from in-memory index
4. Otherwise:
      1. Spill current index to disk creating initial sorted run
      2. Re-read each batch building in-memory index (may be spills again)
      3. At the end of batch spill it to the disk and create another sorted run
      4. Perform final external merge sort

The main benefit of this strategy is that we perform both grouping and sorting
at the same time with early aggregation. So, it's cost calculated for both group
and comparison, but we can win using early aggregation (which is not supported
by Sort + Group node).

When I was fixing tests, then most of changes occurred in partition_aggregate.out.
Their output changed in such way:

```
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM pagg_tab GROUP BY c ORDER BY c LIMIT 1;

-- Old
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13)
    ->  Sort  (cost=80.18..80.21 rows=12 width=13)
          Sort Key: pagg_tab.c
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13)
                Group Key: pagg_tab.c
                ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)

-- New
SET enable_hashagg to off;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13)
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13)
          Group Key: pagg_tab.c
          ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
(7 rows)

```

There is a cheat - disable hashagg, but if we will run this, then (on my PC) we will see
that index aggregate executes faster:

```
-- sort + hash
SET enable_hashagg TO on;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13) (actual time=2.040..2.041 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  Sort  (cost=80.18..80.21 rows=12 width=13) (actual time=2.039..2.040 rows=1.00 loops=1)
          Sort Key: pagg_tab.c
          Sort Method: top-N heapsort  Memory: 25kB
          Buffers: shared hit=20
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13) (actual time=2.025..2.028 rows=12.00 loops=1)
                Group Key: pagg_tab.c
                Batches: 1  Memory Usage: 32kB
                Buffers: shared hit=20
                ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.017..0.888 rows=3000.00 loops=1)
                      Buffers: shared hit=20
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual
time=0.016..0.301rows=1250.00 loops=1)
 
                            Buffers: shared hit=8
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual
time=0.007..0.225rows=1000.00 loops=1)
 
                            Buffers: shared hit=7
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual
time=0.006..0.171rows=750.00 loops=1)
 
                            Buffers: shared hit=5
  Planning Time: 0.119 ms
  Execution Time: 2.076 ms
(20 rows)

-- index agg
SET enable_hashagg TO off;
  QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13) (actual time=1.789..1.790 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13) (actual time=1.788..1.789 rows=1.00 loops=1)
          Group Key: pagg_tab.c
          Buffers: shared hit=20
          ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.020..0.865 rows=3000.00 loops=1)
                Buffers: shared hit=20
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.020..0.290
rows=1250.00loops=1)
 
                      Buffers: shared hit=8
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.229
rows=1000.00loops=1)
 
                      Buffers: shared hit=7
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.007..0.165
rows=750.00loops=1)
 
                      Buffers: shared hit=5
  Planning Time: 0.105 ms
  Execution Time: 1.825 ms
(15 rows)
```

Mean IndexAgg time is about 1.8 ms and 2 ms for hash + sort, so win is about 10%.

Also, I have run TPC-H tests and 2 tests used Index Agg node (4 and 5) and this gave
near 5% gain in time.

This research was inspired by Graefe Goetz's paper "Efficient sorting, duplicate
removal, grouping, and aggregation". But some of proposed ideas are hard
to implement in PostgreSQL, i.e. using partitioned btrees  store their page in
shared buffers or to make use of offset-value encoding.

More about details of implementation:

1. In-memory index implemented as B+tree and it stores pointers to tuples
2. Size of each B+tree node is set using macro. Now it is 63, which allows us
     to use some optimizations, i.e. distribute tuples uniformly during page split
3. In-memory index has key abbreviation optimization
3. tuplesort.c is used to implement external merge sort. This is done by just
     setting up state in such way that we can just call 'mergeruns'
4. When we store tuples on disk during sorted run spill we perform projection
     and stored tuples are ready to be returned after merge. This is done most
     because we already have returninig TupleDesc and do not have to deal with
     AggStatePerGroup state (it has complex logic with 2 boolean flags).


For now there is a bare minimum implemented: in-memory index, disk spill logic
and support by explain analyze.

There are 4 patches attached:

1. 0001-add-in-memory-btree-tuple-index.patch - adds in-memory index - TupleIndex
2. 0002-introduce-AGG_INDEX-grouping-strategy-node.patch - implementation of
                                                      Index Aggregate group strategy
3. 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch - planner adds
                                                     Index Aggregate nodes to the pathlist and explain analyze
                                                     shows statistics for this node
4. 0004-fix-tests-for-IndexAggregate.patch - fix tests output and adds some extra tests
                                                     for the new node

There are open questions and todos:

- No support for parallel execution. The main challenge here is to save sort invariant
   and support partial aggregates.
- Use more suitable in-memory index. For example, T-Tree is the first candidate for this.
- No sgml documentation yet
- Fix and adapt tests. Not all tests are fixed by 4 patch
- Tune planner estimate. In the example, cost of index agg was higher, but actually it was
   faster.

---

Sergey Soloviev

TantorLabs: https://tantorlabs.com


Вложения

В списке pgsql-hackers по дате отправления: