Обсуждение: Ugly group by problem
Hello, I have a table filled from third party that basically has the following structure: link_id | feat_id | other | columns... --------+---------+-------+----------- 1 | 2 | ... 2 | 5 | ... 2 | 23 | ... 3 | 5 | ... 3 | 23 | some | data 3 | 23 | other | data 5 | 23 | ... 9 | 23 | ... This structure is fixed, and we can't change it, but we can create We have about 37 million different link_ids, and 35 million feat_ids. There are feat_ids that appear at several thousand link_ids, but a link_id does not have more than a douzen feat_ids. Now I need to group together all link_ids that have the same set of feat_ids. In the example above, the sets would be (1), (2,3) and (5,9), and the other columns would be run through some aggregate functions. Currently, this is done via an external JAVA application, but I'm looking for a way to express this via sql / plpgsql to ease deployment. I could imagine some ugly code using ARRAY (not tried yet), but how would you pack this problem? It seems that I'm just stuck in my thoughts and miss the beauty way to solve it. Thanks, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
O Markus Schaber έγραψε στις Mar 29, 2006 : > Hello, > > I have a table filled from third party that basically has the following > structure: > > link_id | feat_id | other | columns... > --------+---------+-------+----------- > 1 | 2 | ... > 2 | 5 | ... > 2 | 23 | ... > 3 | 5 | ... > 3 | 23 | some | data > 3 | 23 | other | data > 5 | 23 | ... > 9 | 23 | ... > > This structure is fixed, and we can't change it, but we can create > > We have about 37 million different link_ids, and 35 million feat_ids. > There are feat_ids that appear at several thousand link_ids, but a > link_id does not have more than a douzen feat_ids. > > Now I need to group together all link_ids that have the same set of > feat_ids. In the example above, the sets would be (1), (2,3) and (5,9), > and the other columns would be run through some aggregate functions. > > Currently, this is done via an external JAVA application, but I'm > looking for a way to express this via sql / plpgsql to ease deployment. > > I could imagine some ugly code using ARRAY (not tried yet), but how > would you pack this problem? It seems that I'm just stuck in my thoughts > and miss the beauty way to solve it. Well i did it with arrays, i dont know the performance implications tho: foodb=# SELECT * from markustest ;link_id | feat_id | other ---------+---------+------- 1 | 2 | 1 2 | 5 | 2 2 | 23 | 2 3 | 5 | 3 3 | 23 | 3 3 | 23 | 3 5 | 23 | 5 9 | 23 | 9 (8 rows) foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as foo,_int_union(array(select mt2.feat_id from markustest mt2 where mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as foo3 from markustest mt) as qoo GROUP BY qoo.foo2; foo2 | sum --------+-----{2} | 1{5,23} | 13{23} | 14 (3 rows) foodb=# The _int_union trick is to force the arrays to have unique values. The order by has the meaning that '{5,23}' and '{23,5}' should be treated the same way. > > Thanks, > Markus > -- -Achilleus
On 29 mar 2006, at 17.42, Achilleus Mantzios wrote: > The _int_union trick is to force the arrays to have unique values. > The order by has the meaning that '{5,23}' and '{23,5}' should be > treated > the same way. I didn't have the _int_union function. Is it internal? What PG- version did you use? I solved it using DISTINCT instead (perhaps making it a little simpler as well): SELECT ARRAY(SELECT DISTINCT feat_id FROM linktest WHERE link_id=t1.link_id ORDER BY feat_id) AS feat_group, SUM(other) FROM linktext t1 GROUP BY 1; feat_group | sum -----------+----- {2} | 1 {5,23} | 13 {23} | 14 (3 rows) Of course I haven't tested the performance on a table containing a lot of data... I'm not sure how the planner will treat the ARRAY- construct. You might be able to speed it up by turning it into a function marked STABLE, something like this: CREATE OR REPLACE FUNCTION feat_group(INTEGER) RETURNS INTEGER[] AS $$ SELECT ARRAY(SELECT DISTINCT feat_id FROM linktest WHERE link_id=$1 ORDER BY feat_id); $$ LANGUAGE sql STABLE; SELECT feat_group(link_id), SUM(other) FROM linktext t1 GROUP BY 1; Sincerely, Niklas Johansson
Hi, Achilleus, Achilleus Mantzios wrote: > foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as > foo,_int_union(array(select mt2.feat_id from markustest mt2 where > mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as > foo3 from markustest mt) as qoo GROUP BY qoo.foo2; > foo2 | sum > --------+----- > {2} | 1 > {5,23} | 13 > {23} | 14 > (3 rows) This is much like I intended to do it, but using "select distinct" in the inner select as I don't have _int_union here, and using a temporary table to collect the sets of link ids. Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
O Markus Schaber έγραψε στις Mar 29, 2006 : > Hi, Achilleus, > > Achilleus Mantzios wrote: > > > foodb=# SELECT qoo.foo2,sum(qoo.foo3) from (SELECT mt.link_id as > > foo,_int_union(array(select mt2.feat_id from markustest mt2 where > > mt2.link_id=mt.link_id order by mt2.feat_id),'{}') as foo2,other::int4 as > > foo3 from markustest mt) as qoo GROUP BY qoo.foo2; > > foo2 | sum > > --------+----- > > {2} | 1 > > {5,23} | 13 > > {23} | 14 > > (3 rows) > > This is much like I intended to do it, but using "select distinct" in > the inner select as I don't have _int_union here, and using a temporary > table to collect the sets of link ids. Just some facts: sort(uniq(int[])) has the same effect as select DISTINCT... ORDER BY DISTINCT. ORDER BY is ofcourse more intuitive and does not need contrib/intarray. In fact , i dont even remember why i didnt write the query with DISTINCT... in the first place :) > > Markus > > > -- -Achilleus