aggregation problem: first/last/count(*)

Поиск
Список
Период
Сортировка
От Marc Mamin
Тема aggregation problem: first/last/count(*)
Дата
Msg-id C4DAC901169B624F933534A26ED7DF311D537C@JENMAIL01.ad.intershop.net
обсуждение исходный текст
Ответы Re: aggregation problem: first/last/count(*)  (Volkan YAZICI <yazicivo@ttmail.com>)
Список pgsql-sql
Hello,

I have a query to aggregate data wich is too slow :-)

Here a simplified example:

create table test
(
time         int8, --store the time as epoch
a_group      varchar,
category     varchar
)


For each group, I need the first/last times and categories , the number
of distinct categories and the number of records.


Here my best solution until now:


SELECT
FIRST.a_group,
FIRST.time     as first_time,
FIRST.category as first_category,
LAST.time      as last_time,
LAST.category  as last_category,
AGG.c_count,
AGG.c_all
FROM
( select distinct on (a_group)  a_group, time, category from test order by a_group, time) FIRST,
( select distinct on (a_group)  a_group, time, category from test order by a_group, time DESC) LAST,
( select a_group,  count(distinct category) as c_count, count(*) as c_all from test group by a_group order by a_group
)AGGwhere FIRST.a_group = LAST.a_groupand LAST.a_group=AGG.a_group 



each sub query is quite fast -- thanks for the DISTINCT ON feature :-) ,
but the whole is really slow as Postgres start to swap due to the large
amount of data to merge.


I guess there must be a better solution as the three sub queries return
exactly one row for each  'a_group'
and are moreover already sorted  (The table does not contain any NULL
value).
But in the query plan below, most of the cost comes form the merges.


I imagine there must be a way using custom aggregation functions, but
I'm not confident with those:

Is it possible to define aggregate in order to retrieve the first/last
values of an ordered result set? This would allow to make a single scan
of the table.
something like

select a_group,
first(category) as first_category,
last(category)  as last_category,
...
from test
order by a_group,time


Many thanks for any hints.

Marc Mamin





Here are some dummy values if you'd like to play with this issue:

insert into test select s,'G'||s , 'C1' from(select
generate_series(1,10000)as s)s;
insert into test select s+10,'G'||s , 'C2' from(select
generate_series(1,10000)as s)s;
insert into test select s+13,'G'||s , 'C3' from(select
generate_series(1,10000)as s)s;
insert into test select s+1,'G'||s , 'C2' from(select
generate_series(1,10000,5)as s)s;
insert into test select s,'G'||s%10 , 'C3' from(select
generate_series(1,10000,5)as s)s;
insert into test select s+1,'G'||s%5 , 'C2' from(select
generate_series(1,10000,5)as s)s;
insert into test select s+1,'G'||s , 'C1' from(select
generate_series(1,1000000)as s)s; --10^6 !!

create index test_i on test(a_group);
analyze test;

=>

Merge Join  (cost=259000.31..34904377039.75 rows=1550421099181
width=128) Merge Cond: ((test.a_group)::text = (last.a_group)::text) ->  Merge Join  (cost=129500.16..17814340.14
rows=783387153width=120)       Merge Cond: ((test.a_group)::text = (test.a_group)::text)       ->  GroupAggregate
(cost=0.00..53681.23rows=395825 width=10)             ->  Index Scan using test_i on test  (cost=0.00..39973.53 
rows=1036043 width=10)       ->  Materialize  (cost=129500.16..133458.41 rows=395825
width=72)             ->  Unique  (cost=119965.87..125146.08 rows=395825
width=18)                   ->  Sort  (cost=119965.87..122555.97 rows=1036043
width=18)                         Sort Key: test.a_group, test."time"                         ->  Seq Scan on test
(cost=0.00..16451.43
rows=1036043 width=18) ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)       ->  Subquery Scan last
(cost=119965.87..129104.33rows=395825 
width=72)             ->  Unique  (cost=119965.87..125146.08 rows=395825
width=18)                   ->  Sort  (cost=119965.87..122555.97 rows=1036043
width=18)                         Sort Key: test.a_group, test."time"                         ->  Seq Scan on test
(cost=0.00..16451.43
rows=1036043 width=18)










В списке pgsql-sql по дате отправления:

Предыдущее
От: Suha Onay
Дата:
Сообщение: Invitation to connect on LinkedIn
Следующее
От: Volkan YAZICI
Дата:
Сообщение: Re: aggregation problem: first/last/count(*)