Re: Introduce Index Aggregate - new GROUP BY strategy
| From | David Rowley |
|---|---|
| Subject | Re: Introduce Index Aggregate - new GROUP BY strategy |
| Date | |
| Msg-id | CAApHDvpxbk8DqSmpsCBZrmCin69rpupKEKHFiM3SqFDdaHeAjg@mail.gmail.com |
| In reply to | Introduce Index Aggregate - new GROUP BY strategy (Sergey Soloviev <sergey.soloviev@tantorlabs.ru>) |
| Responses | Re: Introduce Index Aggregate - new GROUP BY strategy |
| List | pgsql-hackers |
On Tue, 9 Dec 2025 at 04:37, Sergey Soloviev <sergey.soloviev@tantorlabs.ru> wrote:
> I would like to introduce new GROUP BY strategy, called Index Aggregate.
> In a nutshell, we build B+tree index where GROUP BY attributes are index
> keys and if memory limit reached we will build index for each batch and
> spill it to the disk as sorted run performing final external merge.
>
> Mean IndexAgg time is about 1.8 ms and 2 ms for hash + sort, so win is about 10%.
>
> Also, I have run TPC-H tests and 2 tests used Index Agg node (4 and 5) and this gave
> near 5% gain in time.

Interesting. Are you able to provide benchmarks with increasing numbers of groups, say 100 to 100 million, increasing in multiples of 10, with say 1GB of work_mem and, to keep the comparison fair, hash_mem_multiplier=1, for all 3 strategies? A binary search's performance characteristics differ vastly from those of simplehash's hash-lookup-and-linear-probe search. Binary searches become much less optimal as the array grows large, since there are many more opportunities for cache misses than with a linear-probing hash table. I think you're going to have to demonstrate that the window where this is useful is big enough to warrant the extra code.

Ideally, you could show a graph with Hash Aggregate as the baseline, always plotted at 1, then run the same benchmark forcing a Sort -> Group Agg, and then also your Index Agg. Also, ideally, you could provide scripts for this so people can easily run it themselves, which would allow us to see how other hardware compares to yours. Doing this may also help you move forward with your costing code for the planner, but the main thing to show is that there is a useful enough range of data sizes where this strategy wins.

You might want to repeat the test a few times with different data types: perhaps int or bigint, then also something varlena, and maybe something byref, such as UUID. Also, you might want to avoid presorted data, as I suspect it'll be hard to beat Sort -> Group Agg with presorted input. Not causing performance regressions for presorted data might be quite a tricky aspect of this patch.

David
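
For concreteness, here is a minimal psql sketch of the kind of benchmark being requested above. `work_mem`, `hash_mem_multiplier`, and `enable_hashagg` are stock PostgreSQL GUCs; the `enable_indexagg` toggle is only an assumption about what the patch might expose (left commented out), and the table and column names are made up for illustration.

```sql
-- Hedged benchmark sketch.  enable_indexagg is an assumed GUC from the
-- patch (commented out); everything else is stock PostgreSQL.
\set ngroups 100000        -- sweep 100 .. 100000000 in multiples of 10
\set nrows   100000000     -- adjust row count to your hardware

SET work_mem = '1GB';
SET hash_mem_multiplier = 1.0;

DROP TABLE IF EXISTS agg_bench;
CREATE TABLE agg_bench AS
SELECT (random() * :ngroups)::bigint AS g    -- unsorted grouping key
FROM generate_series(1, :nrows);
VACUUM ANALYZE agg_bench;

-- Baseline: Hash Aggregate (plot the other strategies relative to this)
SET enable_hashagg = on;
-- SET enable_indexagg = off;                -- assumed patch GUC
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT g, count(*) FROM agg_bench GROUP BY g;

-- Sort -> Group Aggregate
SET enable_hashagg = off;
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT g, count(*) FROM agg_bench GROUP BY g;

-- Index Aggregate
-- SET enable_indexagg = on;                 -- assumed patch GUC
-- EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
-- SELECT g, count(*) FROM agg_bench GROUP BY g;
```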
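
Similarly, a sketch of the data-type and sort-order variants suggested in the last paragraph. `gen_random_uuid()` is in core PostgreSQL since version 13 (available via pgcrypto before that); again, all table names are illustrative.

```sql
-- Varlena (text) and byref (uuid) grouping keys, plus a presorted copy.
DROP TABLE IF EXISTS agg_bench_text, agg_bench_uuid, agg_bench_sorted;

CREATE TABLE agg_bench_text AS
SELECT 'group_' || (random() * :ngroups)::bigint AS g
FROM generate_series(1, :nrows);

-- Fixed pool of :ngroups UUIDs, cross-joined up to roughly :nrows rows.
CREATE TABLE agg_bench_uuid AS
SELECT u.g
FROM (SELECT gen_random_uuid() AS g
      FROM generate_series(1, :ngroups)) AS u,
     generate_series(1, :nrows / :ngroups);

-- Presorted input: Sort -> Group Agg is expected to be hard to beat here.
CREATE TABLE agg_bench_sorted AS
SELECT g FROM agg_bench ORDER BY g;

VACUUM ANALYZE agg_bench_text, agg_bench_uuid, agg_bench_sorted;
```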