On Tue, Apr 23, 2002 at 07:25:34PM -0400, Jan Wieck wrote:
> In particular, the Perl script uses a hash table with one
> entry for each group. Now what happens if the input data is
> spread out and contains 1G groups? It'll simply blow away
> your script because it runs out of memory. This behaviour is
> unacceptable for a database system, so as you see in the
> Explain output, PostgreSQL sorts and groups the input data in
> temporary files before counting the rows. Due to that, the
> PostgreSQL solution to the problem requires a gazillion of IO
> operations, but it'll work whatever the input data is, given
> that there is enough disk space.
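
(For reference, the sort-then-count strategy described above boils down to
roughly the following. A toy Python sketch, sorting in memory rather than in
temporary files, so it only shows the shape of the plan, not what the backend
actually does:)

    # Sort the keys, then count each run of equal values.
    from itertools import groupby

    def sort_group_count(rows):
        counts = {}
        for key, run in groupby(sorted(rows)):
            counts[key] = sum(1 for _ in run)
        return counts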
I've always had a little problem with this setup. Sure, if you end up with 1
billion groups then you have a problem, but if you are aggregating into only a
small number of groups, the sort takes forever compared to using a hash.
Currently the planner believes that the output of a GROUP BY will be 1/10th of
the input rows, but I think the square root would be a better estimate.
Here we regularly summarise 2 million rows into 6 groups, and it's almost
faster to do 6 sequential scans than it is to do the sort/group.
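
To put rough numbers on that (back-of-the-envelope only, not planner code):

    rows = 2_000_000              # rows we summarise here
    tenth_rule = rows // 10       # planner's current guess: 200,000 groups
    sqrt_rule = int(rows ** 0.5)  # square-root guess: about 1,414 groups
    actual = 6                    # what we really get
    # Both overestimate, but the square root is orders of magnitude closer,
    # so a hash-based plan would look far more attractive.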
Besides, the running-out-of-memory argument is silly, because if a sort
spills to disk when it gets too big, you can do the same with a hash
table.
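
Roughly what I have in mind, as a toy sketch (Python purely for illustration;
the memory cap, partition count and file format are all made up here, not
anything the backend does):

    # count(*) per group with a hash table that spills to disk.
    import tempfile
    from collections import Counter

    MAX_GROUPS_IN_MEMORY = 100_000   # arbitrary cap standing in for work_mem
    NUM_PARTITIONS = 16              # fan-out used when we overflow

    def hash_group_count(rows):
        counts = Counter()
        rows = iter(rows)
        for key in rows:
            counts[key] += 1
            if len(counts) > MAX_GROUPS_IN_MEMORY:
                # Too many distinct groups: partition what we have plus the
                # rest of the input out to temp files, then aggregate each
                # (smaller) partition with its own in-memory hash table.
                return _spill_and_aggregate(counts, rows)
        return dict(counts)

    def _spill_and_aggregate(partial, remaining):
        # Assumes string keys without tabs/newlines, for simplicity.
        parts = [tempfile.TemporaryFile(mode="w+") for _ in range(NUM_PARTITIONS)]
        def route(key, n):
            parts[hash(key) % NUM_PARTITIONS].write(f"{key}\t{n}\n")
        for key, n in partial.items():
            route(key, n)
        for key in remaining:
            route(key, 1)
        result = {}
        for f in parts:
            f.seek(0)
            counts = Counter()
            for line in f:
                key, n = line.rstrip("\n").split("\t")
                counts[key] += int(n)
            result.update(counts)
            f.close()
        return result

Each partition only sees roughly 1/16th of the distinct keys, so its hash
table is correspondingly smaller; if a partition is still too big you can
partition it again, the same way a sort recurses over its runs.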
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Canada, Mexico, and Australia form the Axis of Nations That
> Are Actually Quite Nice But Secretly Have Nasty Thoughts About America