Large Scale Aggregation (HashAgg Enhancement)
| От | Rod Taylor | 
|---|---|
| Тема | Large Scale Aggregation (HashAgg Enhancement) | 
| Дата | |
| Msg-id | 1137388039.15377.33.camel@home обсуждение исходный текст | 
| Ответы | Re: Large Scale Aggregation (HashAgg Enhancement) | 
| Список | pgsql-hackers | 
A couple of days ago I found myself wanting to aggregate 3 Billion tuples down to 100 Million tuples based on an integer key with six integer values -- six sum()'s. PostgreSQL ran out of memory with its Hash Aggregator and doing an old style Sort & Sum took a fair amount of time to complete (cancelled the process after 24 hours -- small machine). Spilling to disk would be nice but I suspect the obvious method would thrash quite badly with non-sorted input. One solution is to partially sort the data into various buckets. If we know how many keys can fit into sort_mem and what the upper and lower bounds of our keys are then # Keys per MB / sort_mem temporary files can be created. A sequential scan of the source data would sort each tuple into the appropriate temporary file. From there we can loop through a temporary file, HashAgg the contents, present the results, and move to the next temporary file. For my particular problem the lower bound is 1 and the upper bound is about 100M. The sort_mem setting allows HashAgg to handle 1M keys at a time. The first pass through the 3B tuples would create 100 temporary files on disk. Temp file 1 would get 1 through 1M, temp file 2 gets keys 1M + 1 through 2M, etc. From there it is pretty easy. This would allow for a 1000 fold increase in the number of distinct keys PostgreSQL can simultaneously HashAgg in the default configuration at a reasonable speed. I've written something similar using a client and COPY with temporary tables. Even with the Export/Import copy I still beat the Sort&Sum method PostgreSQL falls back to. --
В списке pgsql-hackers по дате отправления: