Re: Large Scale Aggregation (HashAgg Enhancement)
| От | Simon Riggs | 
|---|---|
| Тема | Re: Large Scale Aggregation (HashAgg Enhancement) | 
| Дата | |
| Msg-id | 1137400335.3180.140.camel@localhost.localdomain обсуждение исходный текст | 
| Ответ на | Large Scale Aggregation (HashAgg Enhancement) (Rod Taylor <pg@rbt.ca>) | 
| Ответы | Re: Large Scale Aggregation (HashAgg Enhancement) Re: Large Scale Aggregation (HashAgg Enhancement) | 
| Список | pgsql-hackers | 
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote: > A couple of days ago I found myself wanting to aggregate 3 Billion > tuples down to 100 Million tuples based on an integer key with six > integer values -- six sum()'s. > > PostgreSQL ran out of memory with its Hash Aggregator and doing an old > style Sort & Sum took a fair amount of time to complete (cancelled the > process after 24 hours -- small machine). > Spilling to disk would be nice but I suspect the obvious method would > thrash quite badly with non-sorted input. There is already hash table overflow (spill to disk) logic in HashJoins, so this should be possible by reusing that code for HashAggs. That's on my todo list, but I'd welcome any assistance. A question: Are the rows in your 3 B row table clumped together based upon the 100M row key? (or *mostly* so) We might also be able to pre-aggregate the rows using a plan likeHashAgg SortedAgg orSortedAgg Sort SortedAgg The first SortedAgg seems superfluous, buy would reduce the row volume considerably if incoming rows were frequently naturally adjacent, even if the values were not actually sorted. (This could also be done during sorting, but its much easier to slot the extra executor step into the plan). That might then reduce the size of the later sort, or allow it to become a HashAgg. I could make that manually enabled using "enable_pre_agg" to allow us to measure the effectiveness of that technique and decide what cost model we'd use to make it automatic. Would that help? > I've written something similar using a client and COPY with temporary > tables. Even with the Export/Import copy I still beat the Sort&Sum > method PostgreSQL falls back to. You can get round this now by chopping the larger table into pieces with a WHERE clause and then putting them back together with a UNION. If the table is partitioned, then do this by partitions. This should also help when it comes to recalculating the sums again in the future, since you'll only need to rescan the rows that have been added since the last summation. Best Regards, Simon Riggs
В списке pgsql-hackers по дате отправления: