difference functions

Поиск
Список
Период
Сортировка
От Mark Rostron
Тема difference functions
Дата
Msg-id FD020D3E50E7FA479567872E5F5F31E3045A186C4E@ex01.corp.ql2.com
обсуждение исходный текст
Список pgsql-general

This post is just to record an example of how to use the new window fn’s in 8.4 to perform difference-between-row calculations.

To demonstrate, we create a  table with 30 rows of data, two columns, one of which contains the sequence 1..30, the other contains mod(c1,10).

So the table looks like this:

>\d x

       Table "public.x"

Column |  Type   | Modifiers

--------+---------+-----------

c1     | integer |

c2     | integer |

> select * from x order by c1,c2;

c1 | c2

----+----

  1 |  1

  2 |  2

  3 |  3

  4 |  4

  5 |  5

  6 |  6

  7 |  7

  8 |  8

  9 |  9

10 |  0

11 |  1

12 |  2

13 |  3

14 |  4

15 |  5

16 |  6

17 |  7

18 |  8

19 |  9

20 |  0

21 |  1

22 |  2

23 |  3

24 |  4

25 |  5

26 |  6

27 |  7

28 |  8

29 |  9

30 |  0

(30 rows)

Now if I want to calculate the difference between two rows, I can use the lag() window function (8.4 manual ch 9.19).

The query and result set of a typical function would be:

> select c1,c2,lag(c1,1,(select max(c1) from x x1 where x1.c2 = x.c2 and x1.c1 < x.c1))  over ( partition by c2 order by c1) from x;

c1 | c2 | lag

----+----+-----

10 |  0 |

20 |  0 |  10

30 |  0 |  20

  1 |  1 |

11 |  1 |   1

21 |  1 |  11

  2 |  2 |

12 |  2 |   2

22 |  2 |  12

  3 |  3 |

13 |  3 |   3

23 |  3 |  13

  4 |  4 |

14 |  4 |   4

24 |  4 |  14

  5 |  5 |

15 |  5 |   5

25 |  5 |  15

  6 |  6 |

16 |  6 |   6

26 |  6 |  16

  7 |  7 |

17 |  7 |   7

27 |  7 |  17

  8 |  8 |

18 |  8 |   8

28 |  8 |  18

  9 |  9 |

19 |  9 |   9

29 |  9 |  19

(30 rows)

 

Almost there.

You can already see that the delta is (lag – c1).

But: how to handle the missing-first-row-null-delta.

Now, we need a ‘sliding window’ capability, so we can perform the calculations using values outside of this result set.

(eg stock values have a daily closing price that would be used to calculate the next day’s delta …. etc)

(this is always messy, so I was really happy to discover that the lag function provides an elegant way of reducing the mess).

 

In this example, let’s assume that the first row in each partition (i.e. c1<=10) is our ‘previous-row’, upon which we will base our delta.

 

So, handling the ‘first-value’ case is as follows:

> select c1,c2,lag(c1,1,(select max(c1) from x x1 where x1.c2 = x.c2 and x1.c1 < x.c1))  over ( partition by c2 order by c1) from x where c1 > 10;

c1 | c2 | lag

----+----+-----

20 |  0 |  10

30 |  0 |  20

11 |  1 |   1

21 |  1 |  11

12 |  2 |   2

22 |  2 |  12

13 |  3 |   3

23 |  3 |  13

14 |  4 |   4

24 |  4 |  14

15 |  5 |   5

25 |  5 |  15

16 |  6 |   6

26 |  6 |  16

17 |  7 |   7

27 |  7 |  17

18 |  8 |   8

28 |  8 |  18

19 |  9 |   9

29 |  9 |  19

(20 rows)

 

As you can see, now the ‘lag’ column contains the previous value of c1 in all cases, the first of which is calculated from the previous ‘sliding-window’ maximum.

You can now add the delta calculation (lag-c1) and you have your delta.

 

Finally, another point, on performance.

The explain plan for this is:

> explain select c1,c2,lag(c1,1,(select max(c1) from x x1 where x1.c2 = x.c2 and x1.c1 < x.c1))  over ( partition by c2 order by c1) from x where c1 > 10;

                            QUERY PLAN

------------------------------------------------------------------

WindowAgg  (cost=1.84..33.02 rows=21 width=8)

   ->  Sort  (cost=1.84..1.89 rows=21 width=8)

         Sort Key: x.c2, x.c1

         ->  Seq Scan on x  (cost=0.00..1.38 rows=21 width=8)

               Filter: (c1 > 10)

   SubPlan 1

     ->  Aggregate  (cost=1.45..1.46 rows=1 width=4)

           ->  Seq Scan on x x1  (cost=0.00..1.45 rows=1 width=4)

                 Filter: ((c1 < $1) AND (c2 = $0))

(9 rows)

 

We have already gained because this feature eliminates what would have been handled by a lot of messy client code, not to mention the reduction of all of this to a single scan, and lookup for the first row.

But the first row lookup is going to mean multiple-lookups, and if your table ‘x’ is large, that’s going to hurt.

 

However, typically, I will offset the first-row-lookup overhead by preparing for it:

Create a small temp table containing exactly the first-row-lag-set that you are gonna want

And then the first-row-query goes against the temp table.

If the size of table ‘x’ is such that you need to do this, it’ll help.

Alternatively, instead of a sub-query, you can union the temp table into your calculation, and then exclude it from the final result using nested-sub-queries.

Small temp tables are trivial to build and very useful for this type of thing.

 

I used this technique to do a covariance matrix calculation in an oracle database a while ago – it kinda worked pretty well.

So I’m really glad to see it available in pg now – it’s really gonna help getting queries simpler and faster.

 

Mr

 

В списке pgsql-general по дате отправления:

Предыдущее
От: Scott Marlowe
Дата:
Сообщение: Re: max_fsm_pages increase
Следующее
От: Alexander Farber
Дата:
Сообщение: Sorting rows by a column and storing a row number