Discussion: Introduce Index Aggregate - new GROUP BY strategy


Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi, hackers!

I would like to introduce a new GROUP BY strategy, called Index Aggregate.
In a nutshell, we build a B+tree index whose keys are the GROUP BY
attributes; if the memory limit is reached, we build an index for each batch
and spill it to disk as a sorted run, performing a final external merge.

It works (and is implemented) much like Hash Aggregate; most of the
differences are in the spill logic (see the sketch after this list):

1. As tuples arrive, build an in-memory B+tree index.
2. If the memory limit is reached, switch to spill mode (almost like hashagg):
      - calculate the hash of the tuple
      - decide which batch it should be stored in
      - spill the tuple to that batch
3. When all tuples are processed and nothing was spilled to disk, return all
     tuples from the in-memory index.
4. Otherwise:
      1. Spill the current index to disk, creating the initial sorted run.
      2. Re-read each batch, building an in-memory index (which may spill again).
      3. At the end of each batch, spill it to disk, creating another sorted run.
      4. Perform a final external merge sort.
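
In pseudo-C, the per-tuple flow looks roughly like this. This is a hedged
sketch of steps 1-2 above, not the actual patch code; all type and function
names here are hypothetical:

```
/* Sketch of steps 1-2: consume one input tuple (hypothetical names). */
static void
index_agg_consume(IndexAggState *state, MinimalTuple tup)
{
    if (!state->spill_mode)
    {
        /* Step 1: insert into the in-memory B+tree, running the
         * aggregate transition function on the matching group. */
        tuple_index_insert(state->index, tup);

        /* Step 2: on memory pressure, switch to spill mode. */
        if (memory_limit_reached(state))
            state->spill_mode = true;
        return;
    }

    /* Spill mode (almost like hashagg): hash the group keys,
     * pick a batch, and write the tuple to it. */
    uint32      hash = hash_group_keys(state, tup);

    batch_write_tuple(state->batches[hash % state->nbatches], tup);
}
```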

The main benefit of this strategy is that we perform grouping and sorting
at the same time, with early aggregation. Its cost therefore accounts for both
grouping and comparisons, but we can win thanks to early aggregation (which
a Sort + Group node does not support): groups collapse as tuples arrive, so the
final sort handles one entry per group rather than every input tuple.

While fixing the tests, most of the changes turned out to be in
partition_aggregate.out. The output changed like this:

```
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM pagg_tab GROUP BY c ORDER BY c LIMIT 1;

-- Old
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13)
    ->  Sort  (cost=80.18..80.21 rows=12 width=13)
          Sort Key: pagg_tab.c
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13)
                Group Key: pagg_tab.c
                ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)

-- New
SET enable_hashagg to off;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13)
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13)
          Group Key: pagg_tab.c
          ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
(7 rows)

```

There is a cheat here - hashagg is disabled - but if we actually run this, then
(on my PC) Index Aggregate executes faster:

```
-- sort + hash
SET enable_hashagg TO on;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=80.18..80.18 rows=1 width=13) (actual time=2.040..2.041 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  Sort  (cost=80.18..80.21 rows=12 width=13) (actual time=2.039..2.040 rows=1.00 loops=1)
          Sort Key: pagg_tab.c
          Sort Method: top-N heapsort  Memory: 25kB
          Buffers: shared hit=20
          ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13) (actual time=2.025..2.028 rows=12.00 loops=1)
                Group Key: pagg_tab.c
                Batches: 1  Memory Usage: 32kB
                Buffers: shared hit=20
                ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.017..0.888 rows=3000.00 loops=1)
                      Buffers: shared hit=20
                      ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.016..0.301 rows=1250.00 loops=1)
                            Buffers: shared hit=8
                      ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.225 rows=1000.00 loops=1)
                            Buffers: shared hit=7
                      ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.006..0.171 rows=750.00 loops=1)
                            Buffers: shared hit=5
  Planning Time: 0.119 ms
  Execution Time: 2.076 ms
(20 rows)

-- index agg
SET enable_hashagg TO off;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=129.77..129.49 rows=1 width=13) (actual time=1.789..1.790 rows=1.00 loops=1)
    Buffers: shared hit=20
    ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13) (actual time=1.788..1.789 rows=1.00 loops=1)
          Group Key: pagg_tab.c
          Buffers: shared hit=20
          ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.020..0.865 rows=3000.00 loops=1)
                Buffers: shared hit=20
                ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.020..0.290 rows=1250.00 loops=1)
                      Buffers: shared hit=8
                ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.229 rows=1000.00 loops=1)
                      Buffers: shared hit=7
                ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.007..0.165 rows=750.00 loops=1)
                      Buffers: shared hit=5
  Planning Time: 0.105 ms
  Execution Time: 1.825 ms
(15 rows)
```

The mean IndexAgg time is about 1.8 ms versus 2 ms for hash + sort, so the win is about 10%.

Also, I have run the TPC-H tests; two queries (4 and 5) used the Index Agg node,
which gave close to a 5% gain in execution time.

This research was inspired by Goetz Graefe's paper "Efficient sorting, duplicate
removal, grouping, and aggregation". But some of the proposed ideas are hard
to implement in PostgreSQL, e.g. having partitioned B-trees store their pages in
shared buffers, or making use of offset-value encoding.

More details of the implementation:

1. The in-memory index is implemented as a B+tree and stores pointers to tuples.
2. The size of each B+tree node is set by a macro. It is currently 63, which allows
     us to use some optimizations, e.g. distributing tuples uniformly during a page
     split (see the sketch after this list).
3. The in-memory index uses the key abbreviation optimization.
4. tuplesort.c is used to implement the external merge sort. This is done by
     setting up the state in such a way that we can just call 'mergeruns'.
5. When we store tuples on disk during a sorted-run spill, we perform the
     projection, so the stored tuples are ready to be returned after the merge. This
     is done mostly because we already have the output TupleDesc and do not have
     to deal with AggStatePerGroup state (which has complex logic with 2 boolean flags).
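
For illustration, points 1-3 suggest a leaf-node layout roughly like the
following. This is a hedged sketch, not the actual patch code; the type and
field names are hypothetical, only the fanout of 63 and the abbreviated keys
come from the description above:

```
#define TUPLE_INDEX_FANOUT 63       /* node size macro from point 2 */

/* Hypothetical leaf node of the in-memory tuple index (sketch only). */
typedef struct TupleIndexNode
{
    int          nentries;                      /* used slots, <= TUPLE_INDEX_FANOUT */
    Datum        abbrev[TUPLE_INDEX_FANOUT];    /* abbreviated keys, compared first (point 3) */
    MinimalTuple tuples[TUPLE_INDEX_FANOUT];    /* pointers to grouped tuples (point 1) */
} TupleIndexNode;
```

With an odd fanout like 63, a full node can split into halves of 31 and 32
entries, which is presumably the uniform-distribution optimization mentioned
in point 2.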


For now a bare minimum is implemented: the in-memory index, the disk spill logic,
and EXPLAIN ANALYZE support.

There are 4 patches attached:

1. 0001-add-in-memory-btree-tuple-index.patch - adds the in-memory index, TupleIndex
2. 0002-introduce-AGG_INDEX-grouping-strategy-node.patch - implementation of the
     Index Aggregate grouping strategy
3. 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch - the planner adds
     Index Aggregate paths to the pathlist, and EXPLAIN ANALYZE shows statistics
     for the node
4. 0004-fix-tests-for-IndexAggregate.patch - fixes test output and adds some extra
     tests for the new node

There are open questions and TODOs:

- No support for parallel execution. The main challenge here is to preserve the
   sort invariant and support partial aggregates.
- Use a more suitable in-memory index. For example, a T-Tree is the first candidate.
- No sgml documentation yet.
- Fix and adapt tests. Not all tests are fixed by the 4th patch.
- Tune the planner estimate. In the example the cost of index agg was higher,
   although it was actually faster.

---

Sergey Soloviev

TantorLabs: https://tantorlabs.com


Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
David Rowley
Date:
On Tue, 9 Dec 2025 at 04:37, Sergey Soloviev
<sergey.soloviev@tantorlabs.ru> wrote:
> I would like to introduce new GROUP BY strategy, called Index Aggregate.

> In a nutshell, we build B+tree index where GROUP BY attributes are index
> keys and if memory limit reached we will build index for each batch and
> spill it to the disk as sorted run performing final external merge.
> Mean IndexAgg time is about 1.8 ms and 2 ms for hash + sort, so win is about 10%.
>
> Also, I have run TPC-H tests and 2 tests used Index Agg node (4 and 5) and this gave
> near 5% gain in time.

Interesting.

Are you able to provide benchmarks with increasing numbers of groups,
say 100 to 100 million, increasing in multiples of 10, with say 1GB
work_mem, and to be fair, hash_mem_multiplier=1 with all 3 strategies.
A binary search's performance characteristics will differ vastly from
that of simplehash's hash lookup and linear probe type search. Binary
searches become much less optimal when the array becomes large as
there are many more opportunities for cache misses than with a linear
probing hash table. I think you're going to have to demonstrate that
the window where this is useful is big enough to warrant the extra
code.

Ideally, if you could show a graph and maybe name Hash Aggregate as
the baseline and show that as 1 always, then run the same benchmark
forcing a Sort -> Group Agg, and then also your Index Agg. Also,
ideally, if you could provide scripts for this so people can easily
run it themselves, to allow us to see how other hardware compares to
yours.  Doing this may also help you move forward with your costing
code for the planner, but the main thing to show is that there is a
useful enough data size where this is useful.

You might want to repeat the test a few times with different data
types. Perhaps int or bigint, then also something varlena and maybe
something byref, such as UUID. Also, you might want to avoid presorted
data as I suspect it'll be hard to beat Sort -> Group Agg with
presorted data. Not causing performance regressions for presorted data
might be quite a tricky aspect of this patch.

David



Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

> Are you able to provide benchmarks
Yes, sure.

Test matrix:

- number of groups: from 100 to 1000000, increasing by 10x
- different key types: int, bigint, uuid, text
- strategy: hash, group, index

For each key value there are 3 tuples with different 'j' values (for the
aggregation logic).

Also, there is a test (called bigtext) with a large string as the key (each string is 4kB).

pgbench is used for testing. The test query looks like this:

     select i, sum(j) from TBL group by 1 order by 1;

Depending on the table size, the duration is set from 1 to 3 minutes.
Everything is in the attached scripts:

- setup.sql - sets up the environment (creates tables, sets GUCs).
                      After running it you should restart the database.
                      NOTE: for int and bigint the actual number of groups
                                  is slightly less than a power of 10
- run_bench.sh - shell script that runs the test workload. After running,
                               it creates files with the pgbench results.
- collect_results.sh - parses the output files and formats the result table.
                                     The values shown are TPS.
- show_plan.sh - small script that runs EXPLAIN for each benchmarked query

Finally, I have the following results:

int

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 3249.929602 | 3501.174072 | 3765.727121 |
| 1000    | 504.420643  | 501.465754  | 575.255906  |
| 10000   | 50.528155   | 49.312322   | 54.510261   |
| 100000  | 4.775069    | 4.317584    | 4.791735    |
| 1000000 | 0.405538    | 0.406698    | 0.321379    |

bigint

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 3225.287886 | 3510.612641 | 3742.911726 |
| 1000    | 492.908092  | 491.530184  | 574.475159  |
| 10000   | 50.192018   | 49.555983   | 53.909437   |
| 100000  | 4.831086    | 4.430059    | 4.748821    |
| 1000000 | 0.401983    | 0.413218    | 0.318144    |

text

| amount  | HashAgg     | GroupAgg    | IndexAgg    |
| ------- | ----------- | ----------- | ----------- |
| 100     | 2647.030876 | 2553.503954 | 2946.282525 |
| 1000    | 348.464373  | 286.818555  | 342.771923  |
| 10000   | 32.891834   | 24.386304   | 28.249571   |
| 100000  | 2.934513    | 1.956983    | 2.237997    |
| 1000000 | 0.249291    | 0.148780    | 0.150943    |

uuid

| amount  | HashAgg | GroupAgg    | IndexAgg    |
| ------- | ------- | ----------- | ----------- |
| 100     | N/A     | 2282.812585 | 2432.713816 |
| 1000    | N/A     | 282.637163  | 303.892131  |
| 10000   | N/A     | 28.375838   | 28.924711   |
| 100000  | N/A     | 2.649958    | 2.449907    |
| 1000000 | N/A     | 0.255203    | 0.194414    |

bigtext

| HashAgg | GroupAgg | IndexAgg |
| ------- | -------- | -------- |
| N/A     | 0.035247 | 0.041120 |

NOTE: I could not get a Hash + Sort plan for the uuid and bigtext tests (hence
the N/A values); this reproduces even on upstream without this patch.

The main observation is that with a small number of groups
Index Aggregate performs better than the other strategies:

- int and bigint: even up to 100K keys
- text: only for 100 keys
- uuid: up to 10K keys
- bigtext: better than Group + Sort, but only tested with a big number
    of keys (100K)

---
Sergey Soloviev

TantorLabs: https://tantorlabs.com





Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

I have looked again at the planner's code and found a mistake in the cost calculation:

1. There was an extra `LOG2(numGroups)` multiplier that accounted for the height of
     the btree index, but it is in fact superfluous. The cost is now calculated much
     like sort: input_tuples * (2.0 * cpu_operator_cost * numGroupCols); see the
     sketch after this list.
2. IndexAgg requires spilling the index to disk to preserve the sort order, but the
     code that calculates this cost used the value without the HAVING quals adjustment.
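
In code form, the fix from point 1 is roughly the following (a sketch; the
variable names are illustrative, LOG2 is the macro PostgreSQL's cost functions
already use):

```
/* Before the fix: an extra factor for the btree height. */
comparison_cost = input_tuples *
    (2.0 * cpu_operator_cost * numGroupCols) * LOG2(numGroups);

/* After the fix: charged much like sort comparisons, no extra factor. */
comparison_cost = input_tuples *
    (2.0 * cpu_operator_cost * numGroupCols);
```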

After fixing these parts, more plans started to use the Index Aggregate node.
The new patches have this fixed.

Also, the patches contain several minor fixes for compiler warnings that I did not
pay attention to during development but the CI pipeline complained about.

---
Sergey Soloviev

TantorLabs: https://tantorlabs.com
Attachments

Re: Introduce Index Aggregate - new GROUP BY strategy

From:
Sergey Soloviev
Date:
Hi!

I have finally added support for Partial IndexAggregate. There was a problem with
sortgroupref and target list entries mismatching because of the partial aggregates
in the target list. To solve this I had to add a new argument to 'create_agg_path' -
'pathkeys', a List of PathKey.
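
For clarity, the extended signature might look like this. The existing
parameters match create_agg_path() in src/backend/optimizer/util/pathnode.c;
the placement of the new argument is illustrative, see the attached patch for
the actual change:

```
AggPath *
create_agg_path(PlannerInfo *root,
                RelOptInfo *rel,
                Path *subpath,
                PathTarget *target,
                AggStrategy aggstrategy,
                AggSplit aggsplit,
                List *groupClause,
                List *qual,
                const AggClauseCosts *aggcosts,
                double numGroups,
                List *pathkeys);    /* new: precomputed output sort order */
```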

Previously this information was computed inside the function, just like AGG_SORTED
does. But when we compute the pathkeys we must consider whether this is a child
rel, and if so use its parent to build the pathkeys properly. The latter information
is not known inside 'create_agg_path', so instead of passing 'parent' we explicitly
pass the already built 'pathkeys'. I did not change the AGG_SORTED logic, so this
is used only by AGG_INDEX.

This logic is placed in a separate patch file just to make the change easier to review.

Also, the cost calculation logic has been adjusted a bit - it now takes into account
the top-down index traversal, and the final external merge cost is added only when
a spill is expected.

---
Sergey Soloviev
TantorLabs: https://tantorlabs.com

Attachments