Discussion: Introduce Index Aggregate - new GROUP BY strategy


Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi, hackers!

I would like to introduce a new GROUP BY strategy, called Index Aggregate.
In a nutshell, we build a B+tree index whose keys are the GROUP BY
attributes; if the memory limit is reached, we build an index for each batch
and spill it to disk as a sorted run, performing a final external merge.

It works (and is implemented) much like Hash Aggregate; most of the
differences are in the spill logic (see the sketch after this list):

1. As tuples arrive, build an in-memory B+tree index.
2. If the memory limit is reached, switch to spill mode (almost like hashagg):
      - calculate the hash of the tuple
      - decide which batch it should be stored in
      - spill the tuple to that batch
3. When all tuples are processed and nothing was spilled to disk, return all
     tuples from the in-memory index.
4. Otherwise:
      1. Spill the current index to disk, creating the initial sorted run.
      2. Re-read each batch, building an in-memory index (which may spill again).
      3. At the end of each batch, spill it to disk, creating another sorted run.
      4. Perform a final external merge sort.
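
In pseudo-C, the per-tuple flow looks roughly like this. This is a hedged
sketch of steps 1-2 above, not the actual patch code; all type and function
names here are hypothetical:

```
/* Sketch of steps 1-2: consume one input tuple (hypothetical names). */
static void
index_agg_consume(IndexAggState *state, MinimalTuple tup)
{
    if (!state->spill_mode)
    {
        /* Step 1: insert into the in-memory B+tree, running the
         * aggregate transition function on the matching group. */
        tuple_index_insert(state->index, tup);

        /* Step 2: on memory pressure, switch to spill mode. */
        if (memory_limit_reached(state))
            state->spill_mode = true;
        return;
    }

    /* Spill mode (almost like hashagg): hash the group keys,
     * pick a batch, and write the tuple to it. */
    uint32      hash = hash_group_keys(state, tup);

    batch_write_tuple(state->batches[hash % state->nbatches], tup);
}
```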

The main benefit of this strategy is that we perform grouping and sorting
at the same time, with early aggregation. Its cost therefore accounts for both
grouping and comparisons, but we can win thanks to early aggregation (which
a Sort + Group node does not support): groups collapse as tuples arrive, so the
final sort handles one entry per group rather than every input tuple.

While fixing the tests, most of the changes turned out to be in
partition_aggregate.out. The output changed like this:

```
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM pagg_tab GROUP BY c ORDER BY c LIMIT 1;

-- Old
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13)
    ->  Sort  (cost=80.18..80.21 rows=12 width=13)
          Sort Key: pagg_tab.c
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13)
                Group Key: pagg_tab.c
                ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)

-- New
SET enable_hashagg to off;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13)
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13)
          Group Key: pagg_tab.c
          ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
(7 rows)

```

There is a cheat here - hashagg is disabled - but if we actually run this, then
(on my PC) Index Aggregate executes faster:

```
-- sort + hash
SET enable_hashagg TO on;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13) (actual time=2.040..2.041 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  Sort  (cost=80.18..80.21 rows=12 width=13) (actual time=2.039..2.040 rows=1.00 loops=1)
          Sort Key: pagg_tab.c
          Sort Method: top-N heapsort  Memory: 25kB
          Buffers: shared hit=20
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13) (actual time=2.025..2.028 rows=12.00 loops=1)
                Group Key: pagg_tab.c
                Batches: 1  Memory Usage: 32kB
                Buffers: shared hit=20
                ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.017..0.888 rows=3000.00 loops=1)
                      Buffers: shared hit=20
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.016..0.301 rows=1250.00 loops=1)
                            Buffers: shared hit=8
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.225 rows=1000.00 loops=1)
                            Buffers: shared hit=7
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.006..0.171 rows=750.00 loops=1)
                            Buffers: shared hit=5
  Planning Time: 0.119 ms
  Execution Time: 2.076 ms
(20 rows)

-- index agg
SET enable_hashagg TO off;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13) (actual time=1.789..1.790 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13) (actual time=1.788..1.789 rows=1.00 loops=1)
          Group Key: pagg_tab.c
          Buffers: shared hit=20
          ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.020..0.865 rows=3000.00 loops=1)
                Buffers: shared hit=20
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.020..0.290 rows=1250.00 loops=1)
                      Buffers: shared hit=8
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.229 rows=1000.00 loops=1)
                      Buffers: shared hit=7
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.007..0.165 rows=750.00 loops=1)
                      Buffers: shared hit=5
  Planning Time: 0.105 ms
  Execution Time: 1.825 ms
(15 rows)
```

The mean IndexAgg time is about 1.8 ms versus 2 ms for hash + sort, so the win is about 10%.

Also, I have run the TPC-H tests; two queries (4 and 5) used the Index Agg node,
which gave close to a 5% gain in execution time.

This research was inspired by Goetz Graefe's paper "Efficient sorting, duplicate
removal, grouping, and aggregation". But some of the proposed ideas are hard
to implement in PostgreSQL, e.g. having partitioned B-trees store their pages in
shared buffers, or making use of offset-value encoding.

More details of the implementation:

1. The in-memory index is implemented as a B+tree and stores pointers to tuples.
2. The size of each B+tree node is set by a macro. It is currently 63, which allows
     us to use some optimizations, e.g. distributing tuples uniformly during a page
     split (see the sketch after this list).
3. The in-memory index uses the key abbreviation optimization.
4. tuplesort.c is used to implement the external merge sort. This is done by
     setting up the state in such a way that we can just call 'mergeruns'.
5. When we store tuples on disk during a sorted-run spill, we perform the
     projection, so the stored tuples are ready to be returned after the merge. This
     is done mostly because we already have the output TupleDesc and do not have
     to deal with AggStatePerGroup state (which has complex logic with 2 boolean flags).
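
For illustration, points 1-3 suggest a leaf-node layout roughly like the
following. This is a hedged sketch, not the actual patch code; the type and
field names are hypothetical, only the fanout of 63 and the abbreviated keys
come from the description above:

```
#define TUPLE_INDEX_FANOUT 63       /* node size macro from point 2 */

/* Hypothetical leaf node of the in-memory tuple index (sketch only). */
typedef struct TupleIndexNode
{
    int          nentries;                      /* used slots, <= TUPLE_INDEX_FANOUT */
    Datum        abbrev[TUPLE_INDEX_FANOUT];    /* abbreviated keys, compared first (point 3) */
    MinimalTuple tuples[TUPLE_INDEX_FANOUT];    /* pointers to grouped tuples (point 1) */
} TupleIndexNode;
```

With an odd fanout like 63, a full node can split into halves of 31 and 32
entries, which is presumably the uniform-distribution optimization mentioned
in point 2.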


For now a bare minimum is implemented: the in-memory index, the disk spill logic,
and EXPLAIN ANALYZE support.

There are 4 patches attached:

1. 0001-add-in-memory-btree-tuple-index.patch - adds the in-memory index, TupleIndex
2. 0002-introduce-AGG_INDEX-grouping-strategy-node.patch - implementation of the
     Index Aggregate grouping strategy
3. 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch - the planner adds
     Index Aggregate paths to the pathlist, and EXPLAIN ANALYZE shows statistics
     for the node
4. 0004-fix-tests-for-IndexAggregate.patch - fixes test output and adds some extra
     tests for the new node

There are open questions and TODOs:

- No support for parallel execution. The main challenge here is to preserve the
   sort invariant and support partial aggregates.
- Use a more suitable in-memory index. For example, a T-Tree is the first candidate.
- No sgml documentation yet.
- Fix and adapt tests. Not all tests are fixed by the 4th patch.
- Tune the planner estimate. In the example the cost of index agg was higher,
   although it was actually faster.

---

Sergey Soloviev

TantorLabs: https://tantorlabs.com


Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
David Rowley
Date:
On Tue, 9 Dec 2025 at 04:37, Sergey Soloviev
<sergey.soloviev@tantorlabs.ru> wrote:
> I would like to introduce new GROUP BY strategy, called Index Aggregate.

> In a nutshell, we build B+tree index where GROUP BY attributes are index
> keys and if memory limit reached we will build index for each batch and
> spill it to the disk as sorted run performing final external merge.
> Mean IndexAgg time is about 1.8 ms and 2 ms for hash + sort, so win is about 10%.
>
> Also, I have run TPC-H tests and 2 tests used Index Agg node (4 and 5) and this gave
> near 5% gain in time.

Interesting.

Are you able to provide benchmarks with increasing numbers of groups,
say 100 to 100 million, increasing in multiples of 10, with say 1GB
work_mem, and to be fair, hash_mem_multiplier=1 with all 3 strategies.
A binary search's performance characteristics will differ vastly from
that of simplehash's hash lookup and linear probe type search. Binary
searches become much less optimal when the array becomes large as
there are many more opportunities for cache misses than with a linear
probing hash table. I think you're going to have to demonstrate that
the window where this is useful is big enough to warrant the extra
code.

Ideally, if you could show a graph and maybe name Hash Aggregate as
the baseline and show that as 1 always, then run the same benchmark
forcing a Sort -> Group Agg, and then also your Index Agg. Also,
ideally, if you could provide scripts for this so people can easily
run it themselves, to allow us to see how other hardware compares to
yours.  Doing this may also help you move forward with your costing
code for the planner, but the main thing to show is that there is a
useful enough data size where this is useful.

You might want to repeat the test a few times with different data
types. Perhaps int or bigint, then also something varlena and maybe
something byref, such as UUID. Also, you might want to avoid presorted
data as I suspect it'll be hard to beat Sort -> Group Agg with
presorted data. Not causing performance regressions for presorted data
might be quite a tricky aspect of this patch.

David



Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

> Are you able to provide benchmarks
Yes, sure.

Test matrix:

- number of groups: from 100 to 1000000, increasing by 10x
- different key types: int, bigint, uuid, text
- strategy: hash, group, index

For each key value there are 3 tuples with different 'j' values (for the
aggregation logic).

Also, there is a test (called bigtext) with a large string as the key (each string is 4kB).

pgbench is used for testing. The test query looks like this:

     select i, sum(j) from TBL group by 1 order by 1;

Depending on the table size, the duration is set from 1 to 3 minutes.
Everything is in the attached scripts:

- setup.sql - sets up the environment (creates tables, sets GUCs).
                      After running it you should restart the database.
                      NOTE: for int and bigint the actual number of groups
                                  is slightly less than a power of 10
- run_bench.sh - shell script that runs the test workload. After running,
                               it creates files with the pgbench results.
- collect_results.sh - parses the output files and formats the result table.
                                     The values shown are TPS.
- show_plan.sh - small script that runs EXPLAIN for each benchmarked query

Finally, I have the following results:

int

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 3249.929602 | 3501.174072 | 3765.727121 |
| 1000    | 504.420643  | 501.465754  | 575.255906  |
| 10000   | 50.528155   | 49.312322   | 54.510261   |
| 100000  | 4.775069    | 4.317584    | 4.791735    |
| 1000000 | 0.405538    | 0.406698    | 0.321379    |

bigint

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 3225.287886 | 3510.612641 | 3742.911726 |
| 1000    | 492.908092  | 491.530184  | 574.475159  |
| 10000   | 50.192018   | 49.555983   | 53.909437   |
| 100000  | 4.831086    | 4.430059    | 4.748821    |
| 1000000 | 0.401983    | 0.413218    | 0.318144    |

text

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 2647.030876 | 2553.503954 | 2946.282525 |
| 1000    | 348.464373  | 286.818555  | 342.771923  |
| 10000   | 32.891834   | 24.386304   | 28.249571   |
| 100000  | 2.934513    | 1.956983    | 2.237997    |
| 1000000 | 0.249291    | 0.148780    | 0.150943    |

uuid

| amount  | HashAgg | GroupAgg    | IndexAgg    |
| ------- | ------- | ----------- | ----------- |
| 100     | N/A     | 2282.812585 | 2432.713816 |
| 1000    | N/A     | 282.637163  | 303.892131  |
| 10000   | N/A     | 28.375838   | 28.924711   |
| 100000  | N/A     | 2.649958    | 2.449907    |
| 1000000 | N/A     | 0.255203    | 0.194414    |

bigtext

| HashAgg | GroupAgg | IndexAgg |
| ------- | -------- | -------- |
| N/A     | 0.035247 | 0.041120 |

NOTE: I could not get a Hash + Sort plan for the uuid and bigtext tests (hence
the N/A values); this reproduces even on upstream without this patch.

The main observation is that with a small number of groups
Index Aggregate performs better than the other strategies:

- int and bigint: even up to 100K keys
- text: only for 100 keys
- uuid: up to 10K keys
- bigtext: better than Group + Sort, but only tested with a big number
    of keys (100K)

---
Sergey Soloviev

TantorLabs: https://tantorlabs.com





Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

I have looked again at the planner's code and found a mistake in the cost calculation:

1. There was an extra `LOG2(numGroups)` multiplier that accounted for the height of
     the btree index, but it is in fact superfluous. The cost is now calculated much
     like sort: input_tuples * (2.0 * cpu_operator_cost * numGroupCols); see the
     sketch after this list.
2. IndexAgg requires spilling the index to disk to preserve the sort order, but the
     code that calculates this cost used the value without the HAVING quals adjustment.
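
In code form, the fix from point 1 is roughly the following (a sketch; the
variable names are illustrative, LOG2 is the macro PostgreSQL's cost functions
already use):

```
/* Before the fix: an extra factor for the btree height. */
comparison_cost = input_tuples *
    (2.0 * cpu_operator_cost * numGroupCols) * LOG2(numGroups);

/* After the fix: charged much like sort comparisons, no extra factor. */
comparison_cost = input_tuples *
    (2.0 * cpu_operator_cost * numGroupCols);
```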

After fixing these parts, more plans started to use the Index Aggregate node.
The new patches have this fixed.

Also, the patches contain several minor fixes for compiler warnings that I did not
pay attention to during development but the CI pipeline complained about.

---
Sergey Soloviev

TantorLabs: https://tantorlabs.com
Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

I have finally added support for Partial IndexAggregate. There was a problem with
sortgroupref and target list entries mismatching because of the partial aggregates
in the target list. To solve this I had to add a new argument to 'create_agg_path' -
'pathkeys', a List of PathKey.
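
For clarity, the extended signature might look like this. The existing
parameters match create_agg_path() in src/backend/optimizer/util/pathnode.c;
the placement of the new argument is illustrative, see the attached patch for
the actual change:

```
AggPath *
create_agg_path(PlannerInfo *root,
                RelOptInfo *rel,
                Path *subpath,
                PathTarget *target,
                AggStrategy aggstrategy,
                AggSplit aggsplit,
                List *groupClause,
                List *qual,
                const AggClauseCosts *aggcosts,
                double numGroups,
                List *pathkeys);    /* new: precomputed output sort order */
```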

Previously this information was computed inside the function, just like AGG_SORTED
does. But when we compute the pathkeys we must consider whether this is a child
rel, and if so use its parent to build the pathkeys properly. The latter information
is not known inside 'create_agg_path', so instead of passing 'parent' we explicitly
pass the already built 'pathkeys'. I did not change the AGG_SORTED logic, so this
is used only by AGG_INDEX.

This logic is placed in a separate patch file just to make the change easier to review.

Also, the cost calculation logic has been adjusted a bit - it now takes into account
the top-down index traversal, and the final external merge cost is added only when
a spill is expected.

---
Sergey Soloviev
TantorLabs: https://tantorlabs.com

Attachments