GROUP BY on a large table -- an idea

Поиск
Список
Период
Сортировка
От Dawid Kuroczko
Тема GROUP BY on a large table -- an idea
Дата
Msg-id 758d5e7f0610120052p68abab85q9c2f63afe71695e6@mail.gmail.com
обсуждение исходный текст
Ответы Re: GROUP BY on a large table -- an idea  (Martijn van Oosterhout <kleptog@svana.org>)
Re: GROUP BY on a large table -- an idea  (Markus Schaber <schabi@logix-tt.com>)
Список pgsql-hackers
Recently I've been playing with quite a big table (over 50mln rows),
and did some SELECT ... sum(...) WHERE ... GROUP BY ... queries.

The usual plan for these is to sort the entries according to GROUP BY
specification, then to run aggregates one by one.  If the data to be
sorted is large enough, PostgreSQL has no other option than to spill
to disk, which well, Isn't the fastest...

Then I thought, why not skip the sorting, and do something like this,
say a table is:
kind tetx, sumkind text, cnt int, size intfoo, bar, 2, 10blah, argh, 23, 3foo, baz, 1, 20blah, argh, 23, 3
and the query would be:
SELECT kind,subkind,sum(cnt),sum(size) FROM x GROUP BY kind,subkind;
Instead of sorting, we would create an empty temporary state variable tree,looked up "foo, bar" in that tree -- if not
found,enter there a new
 
initialized
state variables for sum(cnt) and sum(size).looked up blah, argh -- create the state variableslooked up foo, baz --
createthe state variableslooked up blah,argh -- update the state variables there.
 
And finally dump the whole tree as results of our query: foo, bar, 2, 10 foo, baz, 1, 20 blah, argh, 46,6

Of course first thing you'll notice is that the "looking up" part will probably
eat all benefits from not spilling, and if group by columns have large
cardinality we'd have to spill anyway.  But then I thought, maybe a hybrid
approach could be benefitial, and its' the resason I'm sending this message.

The hybrid approach means: sort as much as you can without spilling to
disk, then aggregate and store aggregate state variables in safe place
(like a "tree" above), get more tuples from the table, sort them, update
aggregate state variables, lather, rince, repeat.

This should avoid the need to spill to disk.  The cost of such operation
depends on cardinality of GROUP BY part (and their correlation, doh),
so it might be wise to try this approach for promising data only.

I have yet almost no knowledge od PostgreSQL's internals, but I think
the idea is feasible therefore I post it here. If it's been proposed before,
forgive me.
  Regards,      Dawid


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Teodor Sigaev
Дата:
Сообщение: Re: Patch for Win32 blocking problem
Следующее
От: Simon Riggs
Дата:
Сообщение: Re: Hints WAS: Index Tuning Features