Re: Select ranges based on sequential breaks
От | Mike Toews |
---|---|
Тема | Re: Select ranges based on sequential breaks |
Дата | |
Msg-id | 4A3FD068.30900@sfu.ca обсуждение исходный текст |
Ответ на | Re: Select ranges based on sequential breaks (Joel Nothman <jnothman@student.usyd.edu.au>) |
Ответы |
Re: Select ranges based on sequential breaks
(Scott Marlowe <scott.marlowe@gmail.com>)
|
Список | pgsql-general |
Hi Joel, Window functions appear to be the best solution for this style of problem, and I'm looking forward to their applications. However, I'm sticking with 8.3 for at least a year, so I'm not able to explore this solution yet. For now, I can only post-process the output in a non-SQL environment. I also need to do other fun stuff, like cumulative sums, which is also challenging with SQL, but much easier and intuitive with R. An excellent book that I recently stumbled on is /Joe Celko's SQL for Smarties/ (recommended by someone on this list), which is a heavy read, but has an amazing depth to ANSI SQL. Of particular interest is "Chapter 24: Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this. -Mike Joel Nothman wrote: > Hi Mike, > > I happened upon your query, which is related to some stuff I've been > playing with. > > Firstly, David's solution below doesn't work. I haven't yet tried to work > out why. > > Secondly, I was hoping to be able to solve your problem nicely with > Postgres 8.4's window functions [1,2], which can provide access to > data in sequentially-related rows. > > Given the following setup: > CREATE TABLE foo (dt date, bin varchar(4)); > INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'), > ('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'), > ('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'), > ('2009-01-09', 'red'), ('2009-01-10', 'red'); > > I had hoped the following would suffice: > SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW > w AS (ORDER BY dt PARTITION BY bin); > > Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a > WINDOW declaration, and yet I wanted a grouping of date-consecutive bins, > which (PARTITION BY bin ORDER BY dt) would not give me. > > I was able to produce the required result with: > > SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin > FROM (SELECT dt, bin, > CASE WHEN newbin IS NULL > THEN 0 > ELSE SUM(newbin) OVER (ORDER BY dt) > END AS binno > FROM (SELECT *, > (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A= > S > newbin > FROM foo > ) AS newbins > ) AS binnos > GROUP BY binno > ORDER BY first; > > This relies on a middle step in which I create an enumeration of the bins > in sequence: > SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER > (ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE= > R > BY dt))::int AS newbin FROM foo) AS newbins; > dt | bin | binno > ------------+------+------- > 2009-01-01 | red | 0 > 2009-01-02 | red | 0 > 2009-01-03 | blue | 1 > 2009-01-04 | blue | 1 > 2009-01-05 | blue | 1 > 2009-01-06 | red | 2 > 2009-01-07 | blue | 3 > 2009-01-08 | blue | 3 > 2009-01-09 | red | 4 > 2009-01-10 | red | 4 > > I would hope there is a neater way to do this with window functions. > > The best way to solve your problem may be with PL/SQL, which is also good > at dealing with sequences (it's not as complicated as it looks!): > > CREATE TYPE bindates AS (first date, last date, bin varchar(5)); > CREATE OR REPLACE FUNCTION bindates() > RETURNS SETOF bindates > AS $$ > DECLARE > row record; > res bindates; > BEGIN > FOR row IN SELECT * FROM foo ORDER BY dt > LOOP > IF res.bin IS NULL OR res.bin !=3D row.bin THEN > IF res.bin IS NOT NULL THEN > RETURN NEXT res; > END IF; > res.bin :=3D row.bin; > res.first :=3D row.dt; > END IF; > res.last :=3D row.dt; > END LOOP; > IF res.bin IS NOT NULL THEN > RETURN NEXT res; > END IF; > END; > $$ LANGUAGE plpgsql; > > > Finally, I'll try to sort out David's solution. Once we correct his typo > (t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get: > first | last | bin > ------------+------------+------ > 2009-01-03 | 2009-01-05 | blue > 2009-01-06 | 2009-01-06 | red > 2009-01-07 | 2009-01-08 | blue > 2009-01-09 | | red > > This includes correct output, but it fails on both edge cases. The > non-appearance of the first row is due to the WHERE clause on the main > SELECT statement: > WHERE (SELECT bin > FROM foo t2 > WHERE t2.dt < t1.dt > ORDER BY dt DESC LIMIT 1) <> t1.bin > > If we drop this WHERE clause, we get: > first | last | bin > ------------+------------+------ > 2009-01-01 | 2009-01-02 | red > 2009-01-02 | 2009-01-02 | red > 2009-01-03 | 2009-01-05 | blue > 2009-01-04 | 2009-01-05 | blue > 2009-01-05 | 2009-01-05 | blue > 2009-01-06 | 2009-01-06 | red > 2009-01-07 | 2009-01-08 | blue > 2009-01-08 | 2009-01-08 | blue > 2009-01-09 | | red > 2009-01-10 | | red > > We can therefore get the result including the first row by selecting from > this table with 'GROUP BY last, bin'. > > And we can hack in a value for those final NULLs as a special case. The > following statement works: > > SELECT MIN(first), > CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last > END, > bin > FROM ( > SELECT dt AS first, > (SELECT dt > FROM foo t3 > WHERE t3.dt < ( > SELECT dt > FROM foo t5 > WHERE t5.dt > t1.dt > AND t5.bin <> t1.bin > ORDER BY dt ASC > LIMIT 1) > ORDER BY dt DESC > LIMIT 1) AS last, > bin > FROM foo t1 > ) t0 > GROUP BY last, bin > ORDER BY last; > > > Finally, what's efficient? With 1,000,000 random rows, we get: > Enumeration: 13s > PL/SQL: 12s > Modified David: minutes. > > [I used the following to create test data: > CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$ > import random > for i in xrange(n): > m =3D random.randint(1,12) > d =3D random.randint(1,28) > b =3D random.choice(('red', 'blue')) > yield '2009-%d-%d' % (m, d), b > $$ LANGUAGE plpythonu; > DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));] > > I hope that helps you in considering various approaches to the problem. > > - Joel > > > [1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html > [2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html > > > On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson > <david.t.wilson@gmail.com> wrote: > >
В списке pgsql-general по дате отправления: