Expiring tuples from aggregation in window frames efficiently

Поиск
Список
Период
Сортировка
От David Rowley
Тема Expiring tuples from aggregation in window frames efficiently
Дата
Msg-id CAApHDvqu+yGW0vbPBb+yxHrPG5VcY_kiFYi8xmxFo8KYOczP3A@mail.gmail.com
обсуждение исходный текст
Список pgsql-hackers
Hi folks,

I was casting my mind back to an idea that came up when windowing functions were first implemented in 8.4 during some talk about how ROWS BETWEEN might implemented in the future. This has now been implemented but with the implementation there is some quadratic behaviour when tuples move off the top of the window frame. In this case the aggregate value must be re-calculated for all the remaining rows in the frame for every row that exits the window's frame. e.g  with SUM(value) OVER(ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) where the frame head is always at the current row.

There is a comment in nodeWindowAgg.c on line 457 explaining that it may be possible to optimise this by having functions which can "expire" tuples out of the aggregate value. These are described as "negative transition function".
Currently I'm looking over this to try to determine how hard this would be to implement it, with views that if it is within my ability then I'd like to have a patch ready for the next commitfest.

The purpose of this email is just to publish my rough implementation ideas with the hope that someone can tell me if I'm on the right track or that I'm off track somewhere. Also there may be optimisation opportunities that I've not thought of.

The idea would be to allow an extra function type when a user does CREATE AGGREGATE, this would be the function which removes a tuple from aggregation, I'll call these "aggregate subtraction functions" from now on. 

For SUM() I think the aggregation subtraction function would just do a - instead of a +, and look pretty much the same as intN_sum(). It looks like it would be similar for AVG() where we'd just remove from the count and the sum. COUNT(col) and COUNT(*) I would imagine would be quite similar though from what I've manage to figure out so far the implementation of this is a bit special. 
The specification of this aggregate subtraction function would be optional in CREATE AGGREGATE and if it was present it would be used instead of looping over all the rows in the frame each time the head of the window frame moves to the next row.

I'm thinking that it would be useful to make it so these aggregate subtraction functions returned true if they've managed to remove the tuple and false to tell the executor to fall back and loop over each tuple in the frame. I feel this would be useful as it would allow MAX() and MIN() to at least partially get on-board with this optimisation.... Let me explain:

If the current MAX() is 10 and we ask the aggregate subtraction function to subtract a tuple with 9, then the MAX() is still 10. So, in this case we could just compare 9 with 10 and see that it is lower and then just return true, thus pretending we actually did the subtraction, but the fact being no subtraction was required. The same applies to MIN, though of course with the comparison around the other way. It may also be possible to store a counter which we could increase each time MAX() receives a tuple with the current maximum value, this would be set back to 1 when the MAX receives a higher value. The aggregate subtraction function could decrement that counter each time it is asked to subtract the current maximum value and only return false when that counter gets to 0. Though I think this might mean that MAX() could take a small performance hit when it is used even not as a windowing aggregate. The pseudo logic would look something along the lines of:

IN > "1" MAX() = "1", maxcount = 1
IN > "2" MAX() = "2", maxcount = 1 (max increased, reset maxcount to 1)
IN > "2" MAX() = "2", maxcount = 2
OUT > "1" MAX() = "2", maxcount = 2 (value is less than current MAX, so don't touch the maxcount)
OUT > "2" MAX() = "2", maxcount = 1
OUT > "2" MAX() = ???, maxcount = 0 (maxcount is 0 so return false as we need to recalculate)

Tonight I'm trying to figure out how all of this could be implemented, I'll list below my rough idea of how this might be done. Please shout out if there is something out of order here.

1. Modifiy pg_proc.h adding a new bool flag for aggsubtract function.
2. In CREATE FUNCTION allow AGGSUBTRACT to be specified similar to how WINDOW is, only demand that these functions return BOOLEAN.
3. Alter CREATE AGGREGATE to allow the AGGSUBTRACT function type, this will be optional.
4. Write agg subtract functions for COUNT(col), COUNT(*), SUM(), AVG(), MIN() and MAX()
5. Change eval_windowaggregates() to attempt to use the aggregate subtraction function if it exists for this aggregate and if that function returns true use the updated aggregate state.

I think perhaps as a simple initial implementation I could skip steps 2 and 3 and for step 4 just implement int4_sum_neg(), though I'm not quite sure yet if these steps would be required for initdb to work.

Also the use passing volatile functions to the aggregate requires some thought. I would imagine the optimisation needs to be disabled in this case, but I'm not quite sure actually how the code should behave in this case. I took a quick look at the results of:

CREATE SEQUENCE test_seq;
SELECT currval('test_seq'),
       COUNT(*) OVER (ORDER BY x.x ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
       SUM(nextval('test_seq')) OVER (ORDER BY x.x ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM generate_series(1,2) x (x);
DROP SEQUENCE test_seq;

The results are:
 currval | count | sum
---------+-------+-----
       2 |     2 |   3
       3 |     1 |   3

I've not looked to see if the spec has anything about this but, the first row will have sum as 1+2 then the 2nd row will just have 1 row to aggregate and the value will be 3, I could see an argument that the results for this should actually be:

 currval | count | sum
---------+-------+-----
       2 |     2 |   3
       3 |     1 |   2

Though actually making the above the results would require materialising values for each row instead of calling the volatile function again each time... Perhaps volatile functions as arguments to aggregate functions when in a window which the frame head is not fixed at the top of the partition should have been disallowed in the first place.

If anyone has any time to point out if I'm thinking along the correct lines for this implementation then that would be really helpful.

Regards

David Rowley



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Sawada Masahiko
Дата:
Сообщение: Re: Logging WAL when updating hintbit
Следующее
От: "Etsuro Fujita"
Дата:
Сообщение: Re: Get more from indices.