Обсуждение: Ugly group by problem

Поиск
Список
Период
Сортировка

Ugly group by problem

От
Markus Schaber
Дата:
Hello,

I have a table filled from third party that basically has the following
structure:

link_id | feat_id | other | columns...
--------+---------+-------+-----------
1       | 2       | ...
2       | 5       | ...
2       | 23      | ...
3       | 5       | ...
3       | 23      | some  | data
3       | 23      | other | data
5       | 23      | ...
9       | 23      | ...

This structure is fixed, and we can't change it, but we can create

We have about 37 million different link_ids, and 35 million feat_ids.
There are feat_ids that appear at several thousand link_ids, but a
link_id does not have more than a douzen feat_ids.

Now I need to group together all link_ids that have the same set of
feat_ids. In the example above, the sets would be (1), (2,3) and (5,9),
and the other columns would be run through some aggregate functions.

Currently, this is done via an external JAVA application, but I'm
looking for a way to express this via sql / plpgsql to ease deployment.

I could imagine some ugly code using ARRAY (not tried yet), but how
would you pack this problem? It seems that I'm just stuck in my thoughts
and miss the beauty way to solve it.

Thanks,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Ugly group by problem

От
Achilleus Mantzios
Дата:
O Markus Schaber έγραψε στις Mar 29, 2006 :

> Hello,
> 
> I have a table filled from third party that basically has the following
> structure:
> 
> link_id | feat_id | other | columns...
> --------+---------+-------+-----------
> 1       | 2       | ...
> 2       | 5       | ...
> 2       | 23      | ...
> 3       | 5       | ...
> 3       | 23      | some  | data
> 3       | 23      | other | data
> 5       | 23      | ...
> 9       | 23      | ...
> 
> This structure is fixed, and we can't change it, but we can create
> 
> We have about 37 million different link_ids, and 35 million feat_ids.
> There are feat_ids that appear at several thousand link_ids, but a
> link_id does not have more than a douzen feat_ids.
> 
> Now I need to group together all link_ids that have the same set of
> feat_ids. In the example above, the sets would be (1), (2,3) and (5,9),
> and the other columns would be run through some aggregate functions.
> 
> Currently, this is done via an external JAVA application, but I'm
> looking for a way to express this via sql / plpgsql to ease deployment.
> 
> I could imagine some ugly code using ARRAY (not tried yet), but how
> would you pack this problem? It seems that I'm just stuck in my thoughts
> and miss the beauty way to solve it.

Well i did it with arrays, i dont know the performance 
implications tho:

foodb=# SELECT * from markustest ;link_id | feat_id | other
---------+---------+-------      1 |       2 | 1      2 |       5 | 2      2 |      23 | 2      3 |       5 | 3      3
|     23 | 3      3 |      23 | 3      5 |      23 | 5      9 |      23 | 9
 
(8 rows)

foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as 
foo,_int_union(array(select mt2.feat_id from markustest mt2 where 
mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as 
foo3 from markustest mt) as qoo GROUP BY qoo.foo2; foo2  | sum
--------+-----{2}    |   1{5,23} |  13{23}   |  14
(3 rows)

foodb=#           

The _int_union trick is to force the arrays to have unique values.
The order by has the meaning that '{5,23}' and '{23,5}' should be treated
the same way.

> 
> Thanks,
> Markus
> 

-- 
-Achilleus



Re: Ugly group by problem

От
Niklas Johansson
Дата:
On 29 mar 2006, at 17.42, Achilleus Mantzios wrote:
> The _int_union trick is to force the arrays to have unique values.
> The order by has the meaning that '{5,23}' and '{23,5}' should be  
> treated
> the same way.

I didn't have the _int_union function. Is it internal? What PG- 
version did you use? I solved it using DISTINCT instead (perhaps  
making it a little simpler as well):

SELECT  ARRAY(SELECT DISTINCT feat_id FROM linktest WHERE  
link_id=t1.link_id ORDER BY feat_id) AS feat_group,  SUM(other)
FROM linktext t1
GROUP BY 1;
 feat_group | sum
-----------+-----
{2}        |   1
{5,23}     |  13
{23}       |  14
(3 rows)


Of course I haven't tested the performance on a table containing a  
lot of data... I'm not sure how the planner will treat the ARRAY- 
construct. You might be able to speed it up by turning it into a  
function marked STABLE, something like this:

CREATE OR REPLACE FUNCTION feat_group(INTEGER) RETURNS INTEGER[] AS $$
SELECT ARRAY(SELECT DISTINCT feat_id FROM linktest WHERE link_id=$1  
ORDER BY feat_id);
$$ LANGUAGE sql STABLE;

SELECT
feat_group(link_id),
SUM(other)
FROM linktext t1
GROUP BY 1;



Sincerely,

Niklas Johansson






Re: Ugly group by problem

От
Markus Schaber
Дата:
Hi, Achilleus,

Achilleus Mantzios wrote:

> foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as 
> foo,_int_union(array(select mt2.feat_id from markustest mt2 where 
> mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as 
> foo3 from markustest mt) as qoo GROUP BY qoo.foo2;
>   foo2  | sum
> --------+-----
>  {2}    |   1
>  {5,23} |  13
>  {23}   |  14
> (3 rows)

This is much like I intended to do it, but using "select distinct" in
the inner select as I don't have _int_union here, and using a temporary
table to collect the sets of link ids.

Markus


-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: Ugly group by problem

От
Achilleus Mantzios
Дата:
O Markus Schaber έγραψε στις Mar 29, 2006 :

> Hi, Achilleus,
> 
> Achilleus Mantzios wrote:
> 
> > foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as 
> > foo,_int_union(array(select mt2.feat_id from markustest mt2 where 
> > mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as 
> > foo3 from markustest mt) as qoo GROUP BY qoo.foo2;
> >   foo2  | sum
> > --------+-----
> >  {2}    |   1
> >  {5,23} |  13
> >  {23}   |  14
> > (3 rows)
> 
> This is much like I intended to do it, but using "select distinct" in
> the inner select as I don't have _int_union here, and using a temporary
> table to collect the sets of link ids.

Just some facts:

sort(uniq(int[])) has the same effect as select DISTINCT... ORDER BY
DISTINCT.  ORDER BY is ofcourse more intuitive and does not need 
contrib/intarray.

In fact , i dont even remember why i didnt write the query with 
DISTINCT... in the first place :)

> 
> Markus
> 
> 
> 

-- 
-Achilleus