Re: Planning aggregates which require sorted or distinct

From: Simon Riggs
Subject: Re: Planning aggregates which require sorted or distinct
Date:
Msg-id: 1169302811.3776.130.camel@silverbirch.site
In response to: Re: Planning aggregates which require sorted or distinct  (Gavin Sherry <swm@alcove.com.au>)
Responses: Re: Planning aggregates which require sorted or distinct  ("Simon Riggs" <simon@2ndquadrant.com>)
List: pgsql-hackers
On Sat, 2007-01-20 at 23:54 +1100, Gavin Sherry wrote:

> > > Yep. I was thinking about this all morning. I think I've over engineered
> > > the problem in my head. Window function input just looks like a slightly
> > > more complex distinct aggregate input. I'll think on it more though.
> >
> > I'm working on modifying Tuplestore that will allow you to store the
> > last N tuples, rather than all previous input. This is specifically
> > designed to allow Merge Join to do Mark/Restore without materializing
> > the whole sort set. This can also be used to materialize Windows (i.e.
> > <window clause> in SQL:2003), so you can store the current row plus n
> > PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the
> > Window would then be similar-ish to doing a Mark/Restore pair, which we
> > can rename to MarkWindowStart and ReturnToWindowStart.
> 
> Wow. What a coincidence! 

Not much of one though. The overhaul of the sort code was the first step
on the road to Windowed aggregates.

> Windows are slightly more complex though. As you
> probably know, there are two ways of specifying the window frame: by an
> absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> determined by subtracting the range parameter from the value of the current
> field -- i.e., the window attribute.

Sure. The MJ situation is to have the MergeJoin node call a Materialize
node which calls the tuplestore functions. The MJ node does all the
clever stuff, which includes a variable size buffer according to the
values arriving. Seems OK to have a Window node that does its own brand
of clever stuff, while the Materialize node just does what it's told in
managing the tuplestore.

Rows type calls can do a read next/mark at the same time, so they always
maintain a fixed lead/lag as they go.
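
The fixed lead/lag behaviour can be sketched with a bounded buffer. A
rough Python illustration (not the tuplestore code, just the idea),
computing a moving sum over a "ROWS 2 PRECEDING" style frame:

```python
from collections import deque

def rows_preceding_sum(rows, n):
    """Moving sum over a 'ROWS n PRECEDING' style frame.

    The deque stands in for the bounded tuplestore: each read next
    advances the mark in lockstep, so at most n+1 rows (the current
    row plus n preceding) are ever buffered."""
    buf = deque(maxlen=n + 1)
    out = []
    for value in rows:
        buf.append(value)  # read next; the oldest row falls off the mark
        out.append(sum(buf))
    return out

print(rows_preceding_sum([1, 2, 3, 4, 5], 2))  # [1, 3, 6, 9, 12]
```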

Range type calls can do a read, then some processing to determine if it
should mark yet, just as MJ does. Or at least we can support an
additional call to make RangeWindowStart hunt for the appropriate row to
set as the start of the window and then read forward until the end of the
range is found.
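
The variable-sized RANGE frame can be sketched the same way; here the
mark hunts forward by value comparison rather than by row count (again
just an illustration in Python, assuming the input is already sorted on
the window ordering column):

```python
from collections import deque

def range_preceding_sum(sorted_vals, delta):
    """Sum over a 'RANGE delta PRECEDING' style frame.

    The window start is found by value comparison, so the number of
    buffered rows varies with the data distribution rather than
    staying a fixed lead/lag."""
    buf = deque()
    total = 0
    out = []
    for v in sorted_vals:
        buf.append(v)
        total += v
        # hunt forward: drop buffered rows that fall outside the range
        while buf[0] < v - delta:
            total -= buf.popleft()
        out.append(total)
    return out

print(range_preceding_sum([1, 2, 2, 5, 9], 3))  # [1, 3, 5, 9, 9]
```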

> This presents a problem in terms of managing the size of the buffer. If
> you have data clustered around a value then the amount of data input into
> the window function when we cross this value explodes. The tuplestore code
> can or will deal with this but it is currently not designed for this kind
> of usage pattern. That is, every time we advance a row we must (a)
> recalculate multiset to be input to the window function and (b) generate
> the value from the window function by passing the input to it. The problem
> arises when the window contains more data than can be stored in work_mem.
> Then, each time we need to recalculate the window function, we'll churn
> data through the buffer and get no effect from the buffering itself.

The reason I'm doing this is that MJs suck when they have large sorted
input. So the problem you describe is essentially the same one I worried
about when discussing the MJ optimisation.

The cost of preparing RandomAccess sort files is *immense*, and
scrolling backwards is just not a clever thing to do. The problem you
describe won't occur in all cases and my guess is that accepting that it
can happen occasionally will still make it run loads faster for large
queries than if we use a full RandomAccess sort. We don't really think
of it that way because we are paying the RandomAccess costs every time
at present, but Windowed aggregates would stick out rather badly if they
used that method every time, no matter what the window size is.

There'll always be a stupidly specified query that performs badly.
That's where the planner can step in to recognize poor situations and do
something about it.

> A lot of the research around window functions recommends offsetting this
> effect by buffering data 'pre-aggregated' for each distinct value in the
> buffer. Most of the research, however, relies on a trick we don't have
> available in the SQL window function spec: windows in SQL can have a
> partition (irrelevant here), data sort order and a range; windows in the
> world of window function streaming data research have this and a 'slide'.
> Slide is the interval at which the aggregate is regenerated and in SQL the
> value is regenerated for every new row. The research concedes that at this
> level, preaggregation either stops making sense or is very complex.

That sounds like Phase 2, don't you think? I'd like to get something
into 8.3...

The user can always use a subselect to pre-aggregate if they're worried
about the performance of their query. Maybe not in a complex fan-out but
then anybody with a large query that has multiple conflicting, large
windows might reasonably expect poor performance in the first release.
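
The pre-aggregation trick from the research can at least be mimicked by
hand today. A rough Python sketch of the idea (hypothetical helper
name, sorted input assumed): collapse duplicates into per-distinct-value
aggregates, so a running total touches one entry per distinct value
instead of one per row:

```python
from itertools import groupby

def preaggregate(sorted_vals):
    """Collapse duplicates into (value, count, sum) triples -- the
    'pre-aggregated per distinct value' shape. A running total is
    then built from far fewer entries when the data is clustered
    around a few values."""
    pre = []
    for key, grp in groupby(sorted_vals):
        g = list(grp)
        pre.append((key, len(g), sum(g)))
    running, total = [], 0
    for _value, _count, s in pre:
        total += s
        running.append(total)
    return pre, running

pre, running = preaggregate([2, 2, 2, 5, 5, 9])
print(pre)      # [(2, 3, 6), (5, 2, 10), (9, 1, 9)]
print(running)  # [6, 16, 25]
```

As noted above, this only helps aggregates that distribute over partial
results; median() resists exactly this treatment.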

> I've come up with a way to be able to do it, but not for all aggregates
> (median() for example). I'll discuss this in the proposal to be sent
> through soon. The problem is, the requirements we have for buffering data
> around window functions could be very complex.
> 
> > I'll present the prototype shortly, I've got a few bugs, but the basic
> > principle is working already. I'm happy to craft that to your exact
> > needs, so that you'll be able to press ahead with the rest of the
> > implementation of Windowed functions.
> 
> Awesome. I will get the proposal off so that we can discuss the
> requirements at length.

Looking forward to it. I'll help where I can.

> > > To bring out a slightly different point -- and I know this is putting the
> > > cart before the horse -- but window functions will (potentially) output
> > > rows in the wrong order. I made a passing reference to this earlier. For
> > > example, say we have a table employees with the following data:
> > >
> > > empno | salary | age
> > > ====================
> > > 1     | 2000   | 50
> > > 2     | 6000   | 30
> > > 3     | 3000   | 20
> > >
> > > We want to answer the following: for each employee: what is their rank in
> > > terms of salary and what is their rank in terms of age. This query
> > > answers that:
> > >
> > > select empno, rank() over (order by salary) as srank,
> > >   rank() over (order by age) as arank
> > >   from employees order by empno;
> > >
> > > The result will be:
> > >
> > > empno | srank  | arank
> > > ====================
> > > 1     | 1      | 3
> > > 2     | 3      | 2
> > > 3     | 2      | 1
> > >
> > > Both window functions provide results based on the order of their input.
> > > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > > arank will output in this order: empno = 3, 2, 1. We need to glue these
> > > back together and the only way I can think how is via a synthetic key.
> >
> > Anything wrong with using empno?
> 
> It might not be a unique key.

Not sure how such a result set could have meaning to anyone. e.g.

empno | srank | arank
    1 |     1 |     3
    1 |     3 |     2
    1 |     2 |     1

Which emp came 1st? 1. Which emp came 3rd? 1... no useful meaning.

I'd be inclined to suggest that such a query should be prevented from
being executed, with a message along the lines of "no GROUP BY
specified". Or at least the problem ignored in the first implementation;
it's not high up the list of meaningful queries we would like to allow
users to execute, IMHO [Order by meaning desc nulls last :-) ]
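
For the unique-key case, the gluing described upthread is
straightforward. A Python sketch of the rank example from this thread
(illustrative only; it uses empno as the join key on the assumption it
is unique -- a synthetic row number would replace it otherwise):

```python
def rank_by(rows, key):
    """rank() of each empno under the given ordering (no ties in
    this sample data, so competition and dense rank coincide)."""
    ordered = sorted(rows, key=key)
    return {r["empno"]: i + 1 for i, r in enumerate(ordered)}

employees = [
    {"empno": 1, "salary": 2000, "age": 50},
    {"empno": 2, "salary": 6000, "age": 30},
    {"empno": 3, "salary": 3000, "age": 20},
]

srank = rank_by(employees, key=lambda r: r["salary"])
arank = rank_by(employees, key=lambda r: r["age"])

# glue the two differently ordered outputs back together on the key
result = [(e["empno"], srank[e["empno"]], arank[e["empno"]]) for e in employees]
print(result)  # [(1, 1, 3), (2, 3, 2), (3, 2, 1)]
```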

> >
> > > Ideally, the planner would have some clue about how large
> > > the result set will be and the orders from the window functions so that it
> > > can decide whether to use nested loop, merge join or hash join to do it.
> > >
> > > Can you think of a different approach?
> >
> > Sounds like figuring out and agreeing the executor issues first is the
> > best bet. Once we know what's to be done, extending the planner to do it
> > will be easier.
> 
> Exactly. As you can see, it's a pretty big topic so I'll work further on
> the proposal and send it off.

It would be very cool to start with a set of queries that are
specifically designed to cover all of the complexities, with expected
output. That way we can reference a real example of what we are
discussing, just like your example here. The range of functionality is
very large and hard to visualise, methinks.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



