Обсуждение: Query to find contiguous ranges on a column
Given a column of data resembling the following: col 2 3 4 5 11 12 13 14 15 16 17 18 19 23 32 33 34 35 36 37 I need a query to find the contiguous ranges within this column, in this case returning the result set: start, end 2, 5 11, 19 23, 23 32, 37 I have one solution that joins the table against itself and does (among other things) a subselect looking "not exists col +1" and "not exists col -1" on the two instances of the table to find the start and end. This is, as you might guess, is not very efficient (my actual data is some 6 million+ rows) and I'm guessing there has to be something more efficient with windowing or possibly grouping on min and max (though I can't see how to make sure they are part of a contiguous set). Anyone have any ideas? -- Peter Hunsberger
Peter Hunsberger <peter.hunsberger@gmail.com> wrote:
> [...]
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end. This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set). Anyone have any ideas?
You can either use a PL/pgSQL function ("SETOF TEXT" just
for the convenience of the example):
| CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$
| DECLARE
| CurrentFirst INT;
| CurrentLast INT;
| CurrentRecord RECORD;
| BEGIN
| FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP
| IF CurrentFirst IS NULL THEN
| CurrentFirst := CurrentRecord.col;
| CurrentLast := CurrentRecord.col;
| ELSIF CurrentRecord.col = CurrentLast + 1 THEN
| CurrentLast := CurrentRecord.col;
| ELSE
| RETURN NEXT CurrentFirst || ', ' || CurrentLast;
| CurrentFirst := CurrentRecord.col;
| CurrentLast := CurrentRecord.col;
| END IF;
| END LOOP;
| IF CurrentFirst IS NOT NULL THEN
| RETURN NEXT CurrentFirst || ', ' || CurrentLast;
| END IF;
| RETURN;
| END;
| $$ LANGUAGE plpgsql;
or a recursive query (which I always find very hard to com-
prehend):
| WITH RECURSIVE RecCols (LeftBoundary, Value) AS
| (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
| UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
| SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
| GROUP BY LeftBoundary
| ORDER BY LeftBoundary;
Could you run both against your data set and find out which
one is faster for your six million rows?
Tim
Peter Hunsberger wrote on 13.10.2009 23:23:
> I need a query to find the contiguous ranges within this column, in
> this case returning the result set:
>
> start, end
> 2, 5
> 11, 19
> 23, 23
> 32, 37
>
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end. This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set). Anyone have any ideas?
This is the best I can put together right now.
Not very nice, but currently I can't think of a better solution:
select *
from
(
select soi as start_of_interval,
case
when soi is not null and eoi is null then lead(eoi) over()
when soi is not null and eoi is not null then eoi
else null
end as end_of_interval
from (
select case
when col - (lag(col,1,col + 1) over (order by col)) - 1 <> 0 then col
else null
end as soi,
case
when col - (lead(col,1,col + 1) over (order by col)) + 1 <> 0 then col
else null
end as eoi
from numbers
) t1
where t1.soi is not null
or t1.eoi is not null
) t2
where t2.start_of_interval is not null
and t2.end_of_interval is not null;
The outer-most select is needed to get rid of the "empty" rows. I couldn't find a way to push that into one of the
sub-queries.
The execution plan doesn't look too bad (probably better than your plan with a self join and a subselect), but it does
sortthe whole table which might be a problem.
Regards
Thomas
On Tue, Oct 13, 2009 at 5:12 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote:
> Peter Hunsberger <peter.hunsberger@gmail.com> wrote:
>
> You can either use a PL/pgSQL function ("SETOF TEXT" just
> for the convenience of the example):
That works well, takes about 20 seconds to do the 6M+ rows
>
> or a recursive query (which I always find very hard to com-
> prehend):
>
> | WITH RECURSIVE RecCols (LeftBoundary, Value) AS
> | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
> | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
> | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
> | GROUP BY LeftBoundary
> | ORDER BY LeftBoundary;
>
> Could you run both against your data set and find out which
> one is faster for your six million rows?
>
Turns out the server is v 8.3, looks like I need to get them to
upgrade it so I get recursive and windowing :-(. If this happens any
time soon I'll let you know the results.
Many thanks.
--
Peter Hunsberger
Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > [...] >> or a recursive query (which I always find very hard to com- >> prehend): >> | WITH RECURSIVE RecCols (LeftBoundary, Value) AS >> | (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t) >> | UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1) >> | SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols >> | GROUP BY LeftBoundary >> | ORDER BY LeftBoundary; >> Could you run both against your data set and find out which >> one is faster for your six million rows? > Turns out the server is v 8.3, looks like I need to get them to > upgrade it so I get recursive and windowing :-(. If this happens any > time soon I'll let you know the results. > Many thanks. After some tests with a data set of 7983 rows (and 1638 ran- ges): Don't! :-) The recursive solution seems to be more than double as slow as the iterative. I'll take it to -per- formance. Tim
On Wed, Oct 14, 2009 at 4:50 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote: > Peter Hunsberger <peter.hunsberger@gmail.com> wrote: > > After some tests with a data set of 7983 rows (and 1638 ran- > ges): Don't! :-) The recursive solution seems to be more > than double as slow as the iterative. I'll take it to -per- > formance. > Interesting, I've never liked recursive on Oracle but performance is usually reasonable... Thanks for the heads up... -- Peter Hunsberger