Обсуждение: Removing redundant itemsets
Hi all, I have a list of purchases (market basket) and I would like to select non redundant longest possible patterns by eliminating (creating/populating other table to contain only non redandant itemsets) purchases having item lists which are fully included in at least one other purchase. (Am assuming all the items of all the purchases have met the minimum support currently set at 1) Below is a sample case, table schema and data(DDL and DML) Transaction Itemset '100' 'a','b','c','d' '200' 'c','d' '300' 'a','c','e' '400' 'e','d' On successful removal out of 'redanduant' or smaller purchases having items contained in totality by at least one other purchase, the purchase '200' would be weeded out as it's itemset {'c','d'} is contained in '100' {'a','b','c','d'} purchase. drop sequence if exists togo_seq cascade; create sequence togo_seq; drop table if exists togo cascade; create table togo ( id integer not null default nextval('togo_seq') ,tid char(3) not null ,item char(1) not null ,primary key(id) ,unique(tid,item) ) ; insert into togo(tid,item)values('100','b'); insert into togo(tid,item)values('100','a'); insert into togo(tid,item)values('100','c'); insert into togo(tid,item)values('100','d'); insert into togo(tid,item)values('200','c'); insert into togo(tid,item)values('200','d'); insert into togo(tid,item)values('300','a'); insert into togo(tid,item)values('300','c'); insert into togo(tid,item)values('300','e'); insert into togo(tid,item)values('400','e'); insert into togo(tid,item)values('400','d'); Allan.
Allan Kamau wrote: > Hi all, > I have a list of purchases (market basket) and I would like to select > non redundant longest possible patterns by eliminating > (creating/populating other table to contain only non redandant itemsets) > purchases having item lists which are fully included in at least one > other purchase. Here's a possibly slow and surely ugly solution (I think it's right, though I haven't done more than passing testing): CREATE VIEW togo_as_arr AS SELECT a.tid, ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item) AS items FROMtogo a GROUP BY tid; SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items; (the view isn't necessary, but does improve the readability of the query). It groups the purchases up with item lists as arrays, then finds any purchases with items arrays wholly contained by other item arrays from other purchases. I'm *sure* there's a smarter way to do this that avoids the use of arrays, but I don't seem to be able to come up with one right now. It's interesting, though, so I might keep fiddling. -- Craig Ringer
Craig Ringer wrote: > Allan Kamau wrote: >> Hi all, >> I have a list of purchases (market basket) and I would like to select >> non redundant longest possible patterns by eliminating >> (creating/populating other table to contain only non redandant itemsets) >> purchases having item lists which are fully included in at least one >> other purchase. > > Here's a possibly slow and surely ugly solution (I think it's right, > though I haven't done more than passing testing): > > > > CREATE VIEW togo_as_arr AS > SELECT a.tid, > ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item) > AS items > FROM togo a GROUP BY tid; > > SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by > FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b > WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items; Alternately: -- Helps with the massively repeated subquery below CREATE INDEX togo_by_tid_and_item ON togo(tid,item); -- Find any `a' for which `item_from_a_is_in_b' is -- true for all items in `a' SELECT a_tid AS is_redundant, b_tid AS contained_by FROM ( -- For every item in every pair of purchases, -- determine whether the item in purchase `a' -- was also in purchase`b'. SELECT a.tid AS a_tid, b.tid AS b_tid, a.item AS item, EXISTS( -- Was this item from `a' also inthe `b' purchase? SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item ) AS item_from_a_is_in_b FROM togoa INNER JOIN togo b ON (a.tid <> b.tid) GROUP BY a.tid, b.tid, a.item) AS item_containment GROUP BY a_tid, b_tid HAVING every(item_from_a_is_in_b); ... which avoids the array building, but is actually considerably slower on the trivial test data. That's not too surprising given that this approach requires a subquery: SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item for EVERY item to be tested. Twice, actually, as each record appears in both `a' and `b' positions. I'd be very interested to see what happened on real world test data, especially compared to doing the array accumulation based query off a temporary table instead of a view. I suspect it'll depend on the average number of items per purchase - lots of items per purchase and the array building cost will dominate. That's really just a guess, though. I'm sure there's a properly smart way to do this that I just can't figure out, but this is the best I've come up with so far. -- Craig Ringer
> -- Find any `a' for which `item_from_a_is_in_b' is > -- true for all items in `a' > SELECT a_tid AS is_redundant, b_tid AS contained_by > FROM ( > -- For every item in every pair of purchases, > -- determine whether the item in purchase `a' > -- was also in purchase `b'. > SELECT > a.tid AS a_tid, > b.tid AS b_tid, > a.item AS item, > EXISTS( > -- Was this item from `a' also in the `b' purchase? > SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item > ) AS item_from_a_is_in_b > FROM togo a INNER JOIN togo b ON (a.tid <> b.tid) > GROUP BY a.tid, b.tid, a.item) AS item_containment > GROUP BY a_tid, b_tid > HAVING every(item_from_a_is_in_b); That really should've been written as: SELECT a.tid AS is_redundant, b.tid AS contained_by FROM togo a INNER JOIN togo b ON (a.tid <> b.tid) GROUP BY a.tid, b.tid HAVING EVERY(EXISTS( SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item )); ... but I'm a bit of an idiot, and couldn't figure out why the EVERY(EXISTS(subq)) wasn't working when testing it before. Sorry for all the noise. -- Craig Ringer
Craig, Thank you so much for the solution. I have spent many hours since Thursday last week including the weekend (and it took you just a few minutes) trying to figure out a solution not involving procedural programming and looping (as the size of the items and even the number of "purchases" in the datasets I may be working with may be large), I was looking for a solution that may take (almost) polynomial time (and resources) and also make use of Postgresql refined and efficient engine. Your solution satisfies these requirements. Thanks. Allan. Craig Ringer wrote: > Allan Kamau wrote: > >> Hi all, >> I have a list of purchases (market basket) and I would like to select >> non redundant longest possible patterns by eliminating >> (creating/populating other table to contain only non redandant itemsets) >> purchases having item lists which are fully included in at least one >> other purchase. >> > > Here's a possibly slow and surely ugly solution (I think it's right, > though I haven't done more than passing testing): > > > > CREATE VIEW togo_as_arr AS > SELECT a.tid, > ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item) > AS items > FROM togo a GROUP BY tid; > > SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by > FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b > WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items; > > > > (the view isn't necessary, but does improve the readability of the query). > > It groups the purchases up with item lists as arrays, then finds any > purchases with items arrays wholly contained by other item arrays from > other purchases. > > I'm *sure* there's a smarter way to do this that avoids the use of > arrays, but I don't seem to be able to come up with one right now. It's > interesting, though, so I might keep fiddling. > > -- > Craig Ringer >