Discussion: Re: WITHIN GROUP patch
2013/10/9 Pavel Stehule <pavel.stehule@gmail.com>
Hello

I checked conformance with ANSI SQL - and I didn't find any issue.

I found that the following error message is not too friendly (mainly because this functionality will be new):
postgres=# select dense_rank(3,3,2) within group (order by num desc, odd) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
^
postgres=# select dense_rank(3,3,2) within group (order by num desc) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3,2) within group (order by num desc) fr...
^
postgres=# select dense_rank(3,3) within group (order by num desc) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3) within group (order by num desc) from...
^
postgres=# select dense_rank(3,3) within group (order by num desc, num) from test4;
dense_rank
------------
3
(1 row)

Probably some hint should be there?

Regards
Pavel

2013/10/2 Vik Fearing <vik.fearing@dalibo.com>
On 09/30/2013 06:34 PM, Pavel Stehule wrote:
>
> I looked on this patch - it is one from long patches - so I propose to
> divide review to a few parts:
>
> a) a conformance with ANSI SQL
> b) check of new aggregates - semantic, implementation
> c) source code checking - usual patch review
>
> Now I would to work on @a

I had an unexpected emergency come up, sorry about that. I plan on
doing B and C starting on Thursday (October 3).

I am grateful to have Pavel's help, this is a big patch.
--
Vik
2013/10/11 Andrew Gierth <andrew@tao11.riddles.org.uk>
>>>>> "Pavel" == Pavel Stehule <pavel.stehule@gmail.com> writes:
>> I found so following error message is not too friendly (mainly because
>> this functionality will be new)
>>
>> postgres=# select dense_rank(3,3,2) within group (order by num desc, odd)
>> from test4;
>> ERROR: Incorrect number of arguments for hypothetical set function
>> LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
>> ^
>>
>> Probably some hint should be there?

Hm, yeah.
Anyone have ideas for better wording for the original error message,
or a DETAIL or HINT addition? The key point here is that the number of
_hypothetical_ arguments and the number of ordered columns must match,
but we don't currently exclude the possibility that a user-defined
hypothetical set function might have additional non-hypothetical args.
The first alternative that springs to mind is:

ERROR: Incorrect number of arguments for hypothetical set function
DETAIL: Number of hypothetical arguments (3) must equal number of ordered columns (2)

--
Andrew (irc:RhodiumToad)

+1

Pavel
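To make the arity rule concrete - the count of hypothetical arguments has to match the count of ordered columns - here is a reviewer-style sketch of the intended dense_rank() semantics. This is purely illustrative, not code from the patch, and it assumes a single ascending sort direction for brevity:

```python
def hypothetical_dense_rank(hypothetical, rows):
    """Model of dense_rank(a1, ..., ak) WITHIN GROUP (ORDER BY c1, ..., ck):
    rank the hypothetical row as if it were inserted into the sorted
    input.  Ascending order is assumed in this sketch."""
    h = tuple(hypothetical)
    if rows and len(h) != len(rows[0]):
        # the situation behind "Incorrect number of arguments for
        # hypothetical set function"
        raise ValueError("number of hypothetical arguments must equal "
                         "number of ordered columns")
    # dense rank = 1 + number of distinct sort keys strictly preceding
    # the hypothetical row
    return 1 + len({tuple(r) for r in rows if tuple(r) < h})
```

For example, with input keys 1, 2, 2, 4, a hypothetical value of 3 follows two distinct smaller keys, so its dense rank is 3.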
On Thu, Oct 10, 2013 at 10:35 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> The first alternative that springs to mind is:
>
> ERROR: Incorrect number of arguments for hypothetical set function
> DETAIL: Number of hypothetical arguments (3) must equal number of ordered columns (2)

I'd suggest trying to collapse that down into a single message; the DETAIL largely recapitulates the original error message. Also, I'd suggest identifying the name of the function, since people may not be able to identify that easily based just on the fact that it's a hypothetical set function.

Here's one idea, with two variants depending on the direction of the inequality:

ERROR: function "%s" has %d arguments but only %d ordering columns
ERROR: function "%s" has %d ordering columns but only %d arguments

Or else leave out the actual numbers and just state the principle, but identifying the exact function at fault, e.g.

ERROR: number of arguments to function "%s" does not match number of ordering columns

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/09/2013 04:19 PM, Pavel Stehule wrote:
I checked conformance with ANSI SQL - and I didn't find any issue.

I found that the following error message is not too friendly (mainly because this functionality will be new). Probably some hint should be there?
postgres=# select dense_rank(3,3,2) within group (order by num desc, odd) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
^
postgres=# select dense_rank(3,3,2) within group (order by num desc) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3,2) within group (order by num desc) fr...
^
postgres=# select dense_rank(3,3) within group (order by num desc) from test4;
ERROR: Incorrect number of arguments for hypothetical set function
LINE 1: select dense_rank(3,3) within group (order by num desc) from...
^
postgres=# select dense_rank(3,3) within group (order by num desc, num) from test4;
dense_rank
------------
3
(1 row)
In addition to Pavel's review, I have finally finished reading the patch. Here are some notes, mainly on style:
First of all, it no longer compiles on HEAD because commit 4d212bac1752e1bad6f3aa6242061c393ae93a0a stole oid 3968. I modified that locally to be able to continue my review.
Some of the error messages do not comply with project style. That is, they begin with a capital letter.
Ordered set functions cannot have transition functions
Ordered set functions must have final functions
Invalid argument types for hypothetical set function
Invalid argument types for ordered set function
Incompatible change to aggregate definition
Too many arguments to ordered set function
Ordered set finalfns must not be strict
Cannot have multiple ORDER BY clauses with WITHIN GROUP
Cannot have DISTINCT and WITHIN GROUP together
Incorrect number of arguments for hypothetical set function
Incorrect number of direct arguments to ordered set function %s
And in pg_aggregate.c I found a comment with a similar problem that doesn't match its surrounding code:
Oid transsortop = InvalidOid; /* Can be omitted */
I didn't find any more examples like that, but I did see several block comments that weren't complete sentences whereas I think they should be. Also a lot of the code comments say "I" and I don't recall seeing that elsewhere. I may be wrong, but I would prefer if they were more neutral.
The documentation has a number of issues.
collateindex.pl complains of duplicated index entries for PERCENTILE CONTINUOUS and PERCENTILE DISCRETE. This is because the index markup is used for both overloaded versions. This is the same mistake Bruce made and then corrected in commit 5dcc48c2c76cf4b2b17c8e14fe3e588ae0c8eff3.
"if there are multiple equally good result" should have an s on the end in func.sgml.
Table 9-49 has an extra empty column. That should either be removed, or filled in with some kind of comment text like other similar tables.
Apart from that, it looks good. There are some mismatched coding styles in there, but the next run of pgindent should catch them, so it's no big deal.
I haven't yet exercised the actual functionality of the new functions, nor have I tried to create my own. Andrew alerted me to a division by zero bug in one of them, so I'll be looking forward to catching that.
So, more review to come.
-- Vik
On Tue, Oct 15, 2013 at 4:29 PM, Vik Fearing <vik.fearing@dalibo.com> wrote:
> So, more review to come.

Hi All,

Please find attached our latest version of the patch. This version fixes the issues pointed out by the reviewers.

Regards,
Atri

--
Regards,
Atri
l'apprenant
Attachments
On 11/04/2013 08:43 AM, Atri Sharma wrote:
> Please find attached our latest version of the patch. This version
> fixes the issues pointed out by the reviewers.

No, it doesn't. The documentation still contains formatting and grammatical errors, and the code comments still do not match their surroundings. (The use of "I" in the code comments is a point I have conceded on IRC, but I stand by my other remarks.)

Don't bother submitting a new patch until I've posted my technical review, but please fix these issues on your local copy.

--
Vik
Hi all,

Please find the latest version of the patch. This version fixes the issues pointed out by the reviewer and the divide by zero bug in percent_rank function. This version also adds a regression test for the divide by zero case in percent_rank.

Regards,
Atri
Attachments
On Fri, 2013-11-15 at 00:05 +0530, Atri Sharma wrote:
> Please find the latest version of the patch. This version fixes the
> issues pointed out by the reviewer and the divide by zero bug in
> percent_rank function. This version also adds a regression test for
> the divide by zero case in percent_rank.

This patch doesn't apply.
On Mon, Nov 18, 2013 at 9:26 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Fri, 2013-11-15 at 00:05 +0530, Atri Sharma wrote:
>> Please find the latest version of the patch. This version fixes the
>> issues pointed out by the reviewer and the divide by zero bug in
>> percent_rank function. This version also adds a regression test for
>> the divide by zero case in percent_rank.
>
> This patch doesn't apply.

Hi all,

Please find attached the latest patch for WITHIN GROUP. This patch is after fixing the merge conflicts.

Regards,
Atri

--
Regards,
Atri
l'apprenant
Attachments
On 11/21/2013 11:04 AM, Atri Sharma wrote:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I have spent quite some time on this and the previous versions. Here is my review, following the questions on the wiki.

This patch is in context diff format and applies cleanly to today's master. It contains several regression tests and, for the most part, good documentation. I would like to see at least one example of using each of the two types of function (hypothetical and inverse distribution) in section 4.2.8.

This patch implements what it says it does. We don't already have a way to get these results without this patch that I know of, and I think we do want it. I certainly want it. I do not have a copy of the SQL standard, but I have full faith in Andrew Gierth's claim that this is part of it. Even if not, implementation details were brought up during design and agreed upon by this list[1].

I don't see how anything here could be dangerous. The custom ordered set functions I made correctly passed a round-trip through dump/restore. The code compiles without warning.

All of the clean tests I did worked as expected, and all of the dirty tests failed elegantly. I did not find any corner cases, and I looked in as many corners as I could think of. I didn't manage to trigger any assertion failures and I didn't crash it. I found no noticeable issues with performance, either directly or as side effects.

I am not the most competent with code review, so I'll be relying on further review by another reviewer or the final committer. The patch fixed the project comments/messages style issues I raised in my previous review. I found the code comments lacking in some places (like inversedistribution.c:mode_final, for example) but I can't say if it really is too terse, or if it's just me. On the other hand, I thought the explanation blocks in the code comments were adequately descriptive.
There is some room for improvement in future versions. The query

select mode() within group (order by x) over (partition by y) from ...

should be valid but isn't. This was explained by Andrew on IRC as being non-trivial: "specifically, we implemented WITHIN GROUP by repurposing the infrastructure already present for agg(distinct ...) and agg(x order by y) which are also not yet supported for aggregate-as-window-function". I assume then that evolution on one side will benefit the other.

All in all, I believe this is ready for committer.

[1] http://www.postgresql.org/message-id/2b8b55b8ba82f83ef4e6070b95fb0cd0%40news-out.riddles.org.uk

--
Vik
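Since mode() figures in the review above, a rough behavioral model may help readers follow along. This sketch is a reviewer's illustration rather than the patch's implementation, and the tie-break rule (smallest value among equally frequent ones) is an assumption here, not documented behavior:

```python
from collections import Counter

def mode_within_group(values):
    """Model of mode() WITHIN GROUP (ORDER BY x): return the most
    frequent non-null input value.  Ties are broken here by taking
    the smallest value; the server's actual choice among equally
    frequent values is an implementation detail."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None  # no non-null inputs
    top = max(counts.values())
    return min(v for v, n in counts.items() if n == top)
```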
>>>>> "Vik" == Vik Fearing <vik.fearing@dalibo.com> writes: Vik> I certainly want it. I do not have a copy of the SQL standard,Vik> but I have full faith in the Andrew Gierth's claimthat this isVik> part of it. For reference, this is how I believe it matches up against the spec (I'm working from the 2008 final): 10.9 <aggregate function>: <hypothetical set function> is intended to be implemented in this patch exactly as per spec. <inverse distribution function>: the spec defines two of these, PERCENTILE_CONT and PERCENTILE_DISC: PERCENTILE_CONT is defined in the spec for numeric types, in which case it returns an approximate numeric result, and forinterval, in which case it returns interval. Our patch defines percentile_cont functions for float8 and interval inputtypes, relying on implicit casting to float8 to handle other numeric input types. As an extension to the spec, we define a percentile_cont function that returns an array of percentile values in one call,as suggested by some users. PERCENTILE_DISC is defined in the spec for the same types as _CONT. Our version on the other hand accepts any type whichcan be sorted, and returns the same type. This does mean that our version may return an exact numeric type rather thanan approximate one, so this is a possible slight deviation from the spec. i.e. our percentile_disc(float8) within group (order by bigint) returns a bigint, while the spec seems to imply it shouldreturn an approximate numeric type (i.e. float*). Again, we additionally provide an array version which is not in the spec. mode() is not in the spec, we just threw it in because it was easy. 6.10 <window function> The spec says that <hypothetical set function> is not allowed as a window function. The spec does not forbid other <ordered set function>s in a window function call, but we have NOT attempted to implementthis (largely for the same reasons that DISTINCT and ORDER BY are not implemented for aggregates as window functions). 
Conformance: all the relevant features are parts of T612, "Advanced OLAP Operations", which we already list in the docs on the unsupported list with the comment "some forms supported". Maybe that could be changed now to "most forms supported", but that's a subjective call (and one we didn't really consider doing in this patch).

--
Andrew (irc:RhodiumToad)
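As a companion to the spec comparison above, the two inverse distribution functions can be modeled roughly as below. This is a sketch only - Python floats stand in for float8, and it ignores nulls and the array-returning variants described above:

```python
import math

def percentile_cont(fraction, values):
    """Model of PERCENTILE_CONT: continuous percentile with linear
    interpolation between adjacent sorted values; result is numeric
    (a float when interpolation actually occurs)."""
    s = sorted(values)
    pos = fraction * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def percentile_disc(fraction, values):
    """Model of PERCENTILE_DISC: the first value in sort order whose
    cumulative position reaches the fraction; the result keeps the
    input type, which is the possible deviation from the spec noted
    above."""
    s = sorted(values)
    idx = max(math.ceil(fraction * len(s)) - 1, 0)
    return s[idx]
```

So percentile_cont(0.5, ...) over the inputs 1, 2, 3, 4 interpolates to 2.5, while percentile_disc(0.5, ...) returns the existing input value 2.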
Atri Sharma <atri.jiit@gmail.com> writes:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I've started to look at this patch now. I have a couple of immediate reactions to the catalog changes:

1. I really hate the way you've overloaded the transvalue to do something that has approximately nothing to do with transition state (and haven't updated catalogs.sgml to explain that, either). Seems like it'd be cleaner to just hardwire a bool column that distinguishes regular and hypothetical input rows. And why do you need aggtranssortop for that? I fail to see the point of sorting on the flag column.

2. I also don't care for the definition of aggordnargs, which is the number of direct arguments to an ordered set function, except when it isn't. Rather than overloading it to be both a count and a couple of flags, I wonder whether we shouldn't expand aggisordsetfunc to be a three-way "aggregate kind" field (ordinary agg, ordered set, or hypothetical set function).

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> 1. I really hate the way you've overloaded the transvalue to do
Tom> something that has approximately nothing to do with transition
Tom> state (and haven't updated catalogs.sgml to explain that,
Tom> either). Seems like it'd be cleaner to just hardwire a bool
Tom> column that distinguishes regular and hypothetical input rows.

The intention here is that while the provided functions all fit the spec's idea of how inverse distribution or hypothetical set functions work, the actual implementation mechanisms are more generally applicable than that, and a user-supplied function could well find something else useful to do with them. Accordingly, hardcoding stuff seemed inappropriate.

Tom> And why do you need aggtranssortop for that? I fail to see the
Tom> point of sorting on the flag column.

It is convenient to the implementation to be able to rely on encountering the hypothetical row deterministically before (or in some cases after, as in cume_dist) its peers in the remainder of the sort order. Removing that sort would make the results of the functions incorrect. There should probably be some comments about that. oops.

Tom> 2. I also don't care for the definition of aggordnargs, which is
Tom> the number of direct arguments to an ordered set function,
Tom> except when it isn't. Rather than overloading it to be both a
Tom> count and a couple of flags, I wonder whether we shouldn't
Tom> expand aggisordsetfunc to be a three-way "aggregate kind" field
Tom> (ordinary agg, ordered set, or hypothetical set function).

It would still be overloaded in some sense because a non-hypothetical ordered set function could still take an arbitrary number of args (using variadic "any") - there aren't any provided, but there's no good reason to disallow user-defined functions doing that - so you'd still need a special value like -1 for aggordnargs to handle that.

--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Tom> 1. I really hate the way you've overloaded the transvalue to do
> Tom> something that has approximately nothing to do with transition
> Tom> state (and haven't updated catalogs.sgml to explain that,
> Tom> either). Seems like it'd be cleaner to just hardwire a bool
> Tom> column that distinguishes regular and hypothetical input rows.

> The intention here is that while the provided functions all fit the
> spec's idea of how inverse distribution or hypothetical set functions
> work, the actual implementation mechanisms are more generally
> applicable than that and a user-supplied function could well find
> something else useful to do with them. Accordingly, hardcoding stuff
> seemed inappropriate.

> Tom> And why do you need aggtranssortop for that? I fail to see the
> Tom> point of sorting on the flag column.

> It is convenient to the implementation to be able to rely on
> encountering the hypothetical row deterministically before (or in some
> cases after, as in cume_dist) its peers in the remainder of the sort
> order. Removing that sort would make the results of the functions
> incorrect.

Well, okay, but you've not said anything that wouldn't be handled just as well by some logic that adds a fixed integer-constant-zero flag column to the rows going into the tuplesort. The function-specific code could then inject hypothetical row(s) with other values, eg 1 or -1, to get the results you describe. If there's any flexibility improvement from allowing a different constant value than zero, or a different datatype than integer, I don't see what it'd be.

On the other side of the coin, repurposing the transition-value catalog infrastructure to do this totally different thing is confusing. And what if someday somebody has a use for a regular transition value along with this stuff? What you've done is unorthogonal for no very good reason.
(Actually, I'm wondering whether it wouldn't be better if the tuplesort object *were* the transition value, and were managed primarily by the aggregate function's code; in particular expecting the agg's transition function to insert rows into the tuplesort object. We could provide helper functions to largely negate any duplication-of-code objections, I'd think. In this view the WITHIN GROUP columns aren't so different from regular aggregate arguments, but there'd need to be some way for the aggregate function to get hold of the sorting-semantics decoration on them.)

> Tom> 2. I also don't care for the definition of aggordnargs, which is
> Tom> the number of direct arguments to an ordered set function,
> Tom> except when it isn't. Rather than overloading it to be both a
> Tom> count and a couple of flags, I wonder whether we shouldn't
> Tom> expand aggisordsetfunc to be a three-way "aggregate kind" field
> Tom> (ordinary agg, ordered set, or hypothetical set function).

> It would still be overloaded in some sense because a non-hypothetical
> ordered set function could still take an arbitrary number of args
> (using variadic "any") - there aren't any provided, but there's no
> good reason to disallow user-defined functions doing that - so you'd
> still need a special value like -1 for aggordnargs to handle that.

Sure. But a -1 to indicate "not applicable" doesn't seem like it's too much of a stretch. It's the -2 business that's bothering me. Again, that seems unnecessarily non-orthogonal --- who's to say which functions would want to constrain the number of direct arguments and which wouldn't? (I wonder whether having this info in the catalogs isn't the wrong thing anyhow, as opposed to expecting the functions themselves to check the argument count at runtime.)

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> Well, okay, but you've not said anything that wouldn't be
Tom> handled just as well by some logic that adds a fixed
Tom> integer-constant-zero flag column to the rows going into the
Tom> tuplesort.

Adding such a column unconditionally even for non-hypothetical functions would break the optimization for sorting a single column (which is a big deal, something like 3x speed difference, for by-value types).

Adding it only for hypothetical set functions is making a distinction in how functions are executed that I don't think is warranted - imagine for example a function that calculates some measure over a frequency distribution by adding a known set of boundary values to the sort; this would not be a hypothetical set function in terms of argument processing, but it would still benefit from the extra sort column. I did not want to unnecessarily restrict such possibilities.

>> It would still be overloaded in some sense because a non-hypothetical
>> ordered set function could still take an arbitrary number of args
>> (using variadic "any") - there aren't any provided, but there's no
>> good reason to disallow user-defined functions doing that - so you'd
>> still need a special value like -1 for aggordnargs to handle that.

Tom> Sure. But a -1 to indicate "not applicable" doesn't seem like it's
Tom> too much of a stretch. It's the -2 business that's bothering me.
Tom> Again, that seems unnecessarily non-orthogonal --- who's to say which
Tom> functions would want to constrain the number of direct arguments and
Tom> which wouldn't? (I wonder whether having this info in the catalogs
Tom> isn't the wrong thing anyhow, as opposed to expecting the functions
Tom> themselves to check the argument count at runtime.)

Not checking the number of arguments to a function until runtime seems a bit on the perverse side. Having a fixed number of direct args is the "normal" case (as seen from the fact that all the non-hypothetical ordered set functions in the spec and in our patch have fixed argument counts).

--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Tom> Well, okay, but you've not said anything that wouldn't be
> Tom> handled just as well by some logic that adds a fixed
> Tom> integer-constant-zero flag column to the rows going into the
> Tom> tuplesort.

> Adding such a column unconditionally even for non-hypothetical
> functions would break the optimization for sorting a single column
> (which is a big deal, something like 3x speed difference, for by-value
> types).

Well, sure, but I was only suggesting adding it when the aggregate asks for it, probably via a new flag column in pg_aggregate. The question you're evading is what additional functionality could be had if the aggregate could demand a different datatype or constant value for the flag column.

> Adding it only for hypothetical set functions is making a distinction
> in how functions are executed that I don't think is warranted -

That seems like rather a curious argument from someone who's willing to give up the ability to specify a regular transition value concurrently with the flag column.

But anyway, what I'm thinking right now is that these questions would all go away if the aggregate transfunction were receiving the rows and sticking them into the tuplestore. It could add whatever columns it felt like.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> Well, sure, but I was only suggesting adding it when the
Tom> aggregate asks for it, probably via a new flag column in
Tom> pg_aggregate.

Sure, I was only pointing out the necessity.

Tom> The question you're evading is what additional functionality
Tom> could be had if the aggregate could demand a different datatype
Tom> or constant value for the flag column.

I don't really see a question there to answer - I simply chose to provide a general mechanism rather than make assumptions about what future users of the code would desire. I have no specific application in mind that would require some other type.

>> Adding it only for hypothetical set functions is making a
>> distinction in how functions are executed that I don't think is
>> warranted -

Tom> That seems like rather a curious argument from someone who's
Tom> willing to give up the ability to specify a regular transition
Tom> value concurrently with the flag column.

In the current patch the idea of also specifying a regular transition value is meaningless since there is no transition function.

Tom> But anyway, what I'm thinking right now is that these questions
Tom> would all go away if the aggregate transfunction were receiving
Tom> the rows and sticking them into the tuplestore. It could add
Tom> whatever columns it felt like.

True, but this ends up duplicating the sorting functionality of nodeAgg that we are leveraging off in the first place. I think this will be somewhat more intrusive and likely slower.

--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Tom> But anyway, what I'm thinking right now is that these questions
> Tom> would all go away if the aggregate transfunction were receiving
> Tom> the rows and sticking them into the tuplestore. It could add
> Tom> whatever columns it felt like.

> True, but this ends up duplicating the sorting functionality of
> nodeAgg that we are leveraging off in the first place. I think this
> will be somewhat more intrusive and likely slower.

Hm, it's just a refactoring of the same code we'd have to have anyway, so I'm not seeing a reason to assume it'd be slower. If anything, this approach would open more opportunities for function-specific optimizations, which in the long run could be faster. (I'm not claiming that any such optimizations would be in the first version.)

In hindsight I wonder if it wasn't a mistake to embed ordered-aggregate support in nodeAgg.c the way we did. We could have dumped that responsibility into some sort of wrapper around specific aggregates, with an option for some aggregates to skip the wrapper and handle it themselves. A trivial, and perhaps not very useful, example is that non-order-sensitive aggregates like MIN/MAX/COUNT could have been coded to simply ignore any ordering request. I can't immediately think of any examples that are compelling enough to justify such a refactoring now --- unless it turns out to make WITHIN GROUP easier.

Anyway, I'm going to go off and look at the WITHIN GROUP patch with these ideas in mind.

regards, tom lane
Further questions about WITHIN GROUP:

I believe that the spec requires that the "direct" arguments of an
inverse distribution or hypothetical-set aggregate must not contain any
Vars of the current query level. They don't manage to say that in plain
English, of course, but in the <hypothetical set function> case the
behavior is defined in terms of this sub-select:

	( SELECT 0, SK1, ..., SKK
	  FROM TNAME UNION ALL
	  VALUES (1, VE1, ..., VEK) )

where SKn are the sort key values, TNAME is an alias for the input
table, and VEn are the direct arguments. Any input-table Vars the
aggregate might refer to are thus in scope only for the SKn, not the
VEn. (This definition also makes it clear that the VEn are to be
evaluated once, not once per row.) In the <inverse distribution
function> case, they resort to claiming that the value of the <inverse
distribution function argument> can be replaced by a numeric literal,
which again makes it clear that it's supposed to be evaluated just
once.

So that's all fine, but the patch seems a bit confused about it. If
you try to cheat, you get an error message that's less than apropos:

regression=# select cume_dist(f1) within group(order by f1) from text_tbl;
ERROR:  column "text_tbl.f1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select cume_dist(f1) within group(order by f1) from text_tbl...
                         ^

I'd have hoped for a message along the line of "fixed arguments of
cume_dist() must not contain variables", similar to the phrasing we use
when you try to put a variable in LIMIT.

Also, there are some comments and code changes in nodeAgg.c that seem
to be on the wrong page here:

+		/*
+		 * Use the representative input tuple for any references to
+		 * non-aggregated input columns in the qual and tlist.  (If we are not
+		 * grouping, and there are no input rows at all, we will come here
+		 * with an empty firstSlot ... but if not grouping, there can't be any
+		 * references to non-aggregated input columns, so no problem.)
+		 * We do this before finalizing because for ordered set functions,
+		 * finalize_aggregates can evaluate arguments referencing the tuple.
+		 */
+		econtext->ecxt_outertuple = firstSlot;
+

The last two lines of that comment are new; all the rest was moved from
someplace else. Those last two lines are wrong according to the above
reasoning, and if they were correct, the argument made in the
pre-existing part of the comment would be all wet, meaning the code
could in fact crash when trying to evaluate a Var reference in
finalize_aggregates. So I'm inclined to undo this part of the patch
(and anything else that's rearranging logic with an eye to Var
references in finalize_aggregates), and to try to fix parse_agg.c so it
gives a more specific error message for this case.

After looking at this I'm a bit less enthused about the notion of
hybrid aggregates than I was before. I now see that the spec intends
that when the WITHIN GROUP notation is used, *all* the arguments to the
left of WITHIN GROUP are meant to be fixed evaluate-once arguments.
While we could possibly define aggregates for which some of those
argument positions are treated as evaluate-once-per-row arguments, I'm
realizing that that'd likely be pretty darn confusing for users.

What I now think we should do about the added pg_aggregate columns is
to have:

	aggnfixedargs	int	number of fixed arguments, excluding any
				hypothetical ones (always 0 for normal
				aggregates)

	aggkind		char	'n' for normal aggregate,
				'o' for ordered set function,
				'h' for hypothetical-set function

with the parsing rules that given

	agg( n fixed arguments ) WITHIN GROUP ( ORDER BY k sort specifications )

1. WITHIN GROUP is disallowed for normal aggregates.

2. For an ordered set function, n must equal aggnfixedargs. We treat
all n fixed arguments as contributing to the aggregate's result
collation, but ignore the sort arguments.

3. For a hypothetical-set function, n must equal aggnfixedargs plus k,
and we match up type and collation info of the last k fixed arguments
with the corresponding sort arguments. The first n-k fixed arguments
contribute to the aggregate's result collation, the rest not.

Reading back over this email, I see I've gone back and forth between
using the terms "direct args" and "fixed args" for the evaluate-once
stuff to the left of WITHIN GROUP. I guess I'm not really sold on
either terminology, but we need something we can use consistently in
the code and docs. The spec is no help, it has no generic term at all
for these args. Does anybody else have a preference, or maybe another
suggestion entirely?

			regards, tom lane
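[To make rules 2 and 3 above concrete, a sketch of the resulting call
shapes; test4 and num are from the examples earlier in the thread,
while households, income, and label are made-up names for illustration:]

```sql
-- Rule 2: ordered set function with aggnfixedargs = 1, so exactly one
-- fixed argument is allowed regardless of how many sort columns appear:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY income)
  FROM households;

-- Rule 3: hypothetical-set function with aggnfixedargs = 0, so the
-- number of fixed arguments must equal the number of sort
-- specifications (here n = k = 2):
SELECT rank(3, 'x') WITHIN GROUP (ORDER BY num DESC, label)
  FROM test4;
```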
Tom Lane-2 wrote
> Further questions about WITHIN GROUP:
>
> I believe that the spec requires that the "direct" arguments of an
> inverse or hypothetical-set aggregate must not contain any Vars of the
> current query level.
> [...]
> So that's all fine, but the patch seems a bit confused about it. If you
> try to cheat, you get an error message that's less than apropos:
>
> regression=# select cume_dist(f1) within group(order by f1) from text_tbl;
> ERROR:  column "text_tbl.f1" must appear in the GROUP BY clause or be
> used in an aggregate function
> [...]
> So I'm inclined to undo this part of the patch (and anything else that's
> rearranging logic with an eye to Var references in finalize_aggregates),
> and to try to fix parse_agg.c so it gives a more specific error message
> for this case.

The original design references the spec as allowing a column reference
if it is a group by key:

Select cume_dist(f1) within group ( order by f1 ) from text_tbl group by f1

No example was shown where this would be useful... but nonetheless the
comment and error both presume that such is true. I would presume the
implementation would only supply rows for sorting from the single group
in question, so for practical purposes the columns have a constant
value for the entirety of the evaluation; this makes sense and acts
just like PARTITION BY on a window aggregate.

Question: are there any tests that exercise this desired behavior?
That's assuming agreement can be had that the group by behavior is
indeed spec or something custom we want to support.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-WITHIN-GROUP-patch-tp5773851p5782041.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
David Johnston <polobo@yahoo.com> writes:
> The original design references the spec as allowing a column reference
> if it is a group by key:
> Select cume_dist(f1) within group ( order by f1 ) from text_tbl group by f1
> No example was shown where this would be useful...but nonetheless the
> comment and error both presume that such is true.

I can see no support for that position in the spec text, and as you
say, the usefulness is at best really debatable. Unless somebody can
present a credible use-case, or convince me that this is indeed what
the spec says, I'd be inclined to blow this off in favor of having a
more intelligible error message. (I think if you really did need such
functionality, you could get it with a sub-select.)

			regards, tom lane
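[For reference, the sub-select workaround alluded to above might look
something like the following; an untested sketch only, reusing text_tbl
and f1 from the earlier example. The point is that g.f1 is an outer
reference inside the subquery, so it is not a Var of the aggregate's
own query level:]

```sql
-- Evaluate the would-be direct argument once per group by pushing the
-- aggregate into a correlated scalar sub-select:
SELECT (SELECT cume_dist(g.f1) WITHIN GROUP (ORDER BY t.f1)
          FROM text_tbl t)
  FROM text_tbl g
 GROUP BY g.f1;
```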
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Further questions about WITHIN GROUP:

 Tom> I believe that the spec requires that the "direct" arguments of
 Tom> an inverse or hypothetical-set aggregate must not contain any
 Tom> Vars of the current query level.

False. The spec requires that the direct arguments must not contain
_ungrouped_ columns (see <set function specification>), but nowhere are
grouped columns forbidden.

 Tom> They don't manage to say that in plain English, of course, but
 Tom> in the <hypothetical set function> case the behavior is defined
 Tom> in terms of this sub-select:

 Tom> ( SELECT 0, SK1, ..., SKK
 Tom>   FROM TNAME UNION ALL
 Tom>   VALUES (1, VE1, ..., VEK) )

"TNAME" there is not the input table or an alias for it, but rather the
specific subset of rows to which the grouping operation is being
applied (after applying a FILTER if any). We're in the General Rules
here, not the Syntax Rules, so this is describing _how to compute the
result_ rather than a syntactic transformation of the input.

In any event, going by the docs on the web, Oracle does not forbid
grouped columns there (their wording is "This expr must be constant
within each aggregation group."). MSSQL seems to require a literal
constant, but that's obviously not per spec. IBM doesn't seem to have
it in db2 for linux, but some of their other products have it and
include examples of using grouped vars: see

http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fc_dbuser_hypothetical_set_family_syntax.html

So I'm going to say that the majority opinion is on my side here.

 Tom> So that's all fine, but the patch seems a bit confused about it.

The patch treats vars in the direct args exactly as it would treat them
anywhere else where they must be ungrouped.

[snip a bunch of stuff based on false assumptions]

 Tom> What I now think we should do about the added pg_aggregate
 Tom> columns is to have:

 Tom> aggnfixedargs  int   number of fixed arguments, excluding any
 Tom>                      hypothetical ones (always 0 for normal aggregates)

 Tom> aggkind        char  'n' for normal aggregate,
 Tom>                      'o' for ordered set function,
 Tom>                      'h' for hypothetical-set function

I don't see any obvious problem with this.

 Tom> with the parsing rules that given

 Tom> agg( n fixed arguments ) WITHIN GROUP ( ORDER BY k sort specifications )

 Tom> 1. WITHIN GROUP is disallowed for normal aggregates.

(This is what the submitted patch does.)

 Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
 Tom> treat all n fixed arguments as contributing to the aggregate's
 Tom> result collation, but ignore the sort arguments.

That doesn't work for getting a sensible collation out of
percentile_disc applied to a collatable type. (Which admittedly is an
extension to the spec, which allows only numeric and interval, but it
seems to me to be worth having.)

 Tom> 3. For a hypothetical-set function, n must equal aggnfixedargs
 Tom> plus k, and we match up type and collation info of the last k
 Tom> fixed arguments with the corresponding sort arguments. The
 Tom> first n-k fixed arguments contribute to the aggregate's result
 Tom> collation, the rest not.

The submitted patch does essentially this, but taking the number of
non-variadic args in place of the suggested aggnfixedargs. Presumably
in your version the latter would be derived from the former?

 Tom> Reading back over this email, I see I've gone back and forth
 Tom> between using the terms "direct args" and "fixed args" for the
 Tom> evaluate-once stuff to the left of WITHIN GROUP. I guess I'm
 Tom> not really sold on either terminology, but we need something we
 Tom> can use consistently in the code and docs. The spec is no help,
 Tom> it has no generic term at all for these args. Does anybody else
 Tom> have a preference, or maybe another suggestion entirely?

We (Atri and I) have been using "direct args", but personally I'm not
amazingly happy with it. Documentation for other dbs tends to just call
them "arguments", and refer to the WITHIN GROUP expressions as
"ordering expressions" or similar.

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> I believe that the spec requires that the "direct" arguments of
> Tom> an inverse or hypothetical-set aggregate must not contain any
> Tom> Vars of the current query level.

> In any event, going by the docs on the web, Oracle does not forbid
> grouped columns there (their wording is "This expr must be constant
> within each aggregation group.") MSSQL seems to require a literal
> constant, but that's obviously not per spec. IBM doesn't seem to
> have it in db2 for linux, but some of their other products have it
> and include examples of using grouped vars: see
> http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fc_dbuser_hypothetical_set_family_syntax.html

OK, that's reasonably convincing. I think we'll need a HINT or
something to clarify the error message, because it sure looks like
those arguments are "used in an aggregate function".

> Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
> Tom> treat all n fixed arguments as contributing to the aggregate's
> Tom> result collation, but ignore the sort arguments.

> That doesn't work for getting a sensible collation out of
> percentile_disc applied to a collatable type. (Which admittedly is an
> extension to the spec, which allows only numeric and interval, but it
> seems to me to be worth having.)

Meh. I don't think you can have that and also have the behavior that
multiple ORDER BY items aren't constrained to have the same collation;
at least not without some rule that amounts to a special case for
percentile_disc, which I'd resist.

> Tom> 3. For a hypothetical-set function, n must equal aggnfixedargs
> Tom> plus k, and we match up type and collation info of the last k
> Tom> fixed arguments with the corresponding sort arguments. The
> Tom> first n-k fixed arguments contribute to the aggregate's result
> Tom> collation, the rest not.

> The submitted patch does essentially this but taking the number of
> non-variadic args in place of the suggested aggnfixedargs. Presumably
> in your version the latter would be derived from the former?

I'm not on board with using variadic vs. non-variadic to determine
this. For example, imagine a hypothetical-set function that for some
reason supports only a single sort column; there would be no reason to
use VARIADIC in its definition, and indeed good reason not to. In any
case, I don't think this behavior should be tied to implementation
details of the representation of the function signature, and IMV
variadic is just that --- particularly for VARIADIC ANY, which is
nothing more than a short-cut for overloading the function name with
different numbers of ANY arguments. Once we've got a match that
involves N direct arguments and K ordering arguments, the behavior
should be determinate without respect to just how we got that match.

> Tom> Reading back over this email, I see I've gone back and forth
> Tom> between using the terms "direct args" and "fixed args" for the
> Tom> evaluate-once stuff to the left of WITHIN GROUP. I guess I'm
> Tom> not really sold on either terminology, but we need something we
> Tom> can use consistently in the code and docs. The spec is no help,
> Tom> it has no generic term at all for these args. Does anybody else
> Tom> have a preference, or maybe another suggestion entirely?

> We (Atri and I) have been using "direct args", but personally I'm not
> amazingly happy with it. Documentation for other dbs tends to just call
> them "arguments", and refer to the WITHIN GROUP expressions as "ordering
> expressions" or similar.

Well, given that I was mistaken to think there could be no Vars at all
in them, "fixed" may not be le mot juste. Unless somebody's got an
alternative to "direct", let's go with that.

			regards, tom lane
I don't especially like the syntax you invented for listing arguments
in CREATE AGGREGATE, especially not the WITHIN GROUP (*) special case.
If we stick with that we're going to need to touch a lot more places
than you have, notably regprocedure. I also fear that this syntax is
not appropriate for identifying aggregates reliably, ie, aggregate
argument lists that look different in this syntax could reduce to
identical pg_proc.proargs lists, and perhaps vice versa. I think we
should just have it list the arguments as they'd appear in pg_proc, and
rely on aggregate properties (to wit, aggkind and aggndirectargs) to
identify ordered-set and hypothetical aggregates.

A slightly different question is what \da ought to print. I can't say
I think that (VARIADIC "any") WITHIN GROUP (*) is going to prove very
helpful to users, but probably just (VARIADIC "any") isn't going to do
either, at least not unless we add an aggregate-kind column to the
printout, and maybe not even then. It might work to cheat by
duplicating the last item if it's variadic:

	(..., VARIADIC "any") WITHIN GROUP (VARIADIC "any")

while if it's not variadic, we'd have to work out which argument
positions correspond to the ordered-set arguments and put them in the
right places.

Regardless of that, though ... what is the reasoning for inventing
pg_get_aggregate_arguments() rather than just making
pg_get_function_arguments() do the right thing? The separate function
seems to accomplish little except complicating life for clients, eg in
psql's describe.c.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Regardless of that, though ... what is the reasoning for
 Tom> inventing pg_get_aggregate_arguments() rather than just making
 Tom> pg_get_function_arguments() do the right thing?

pg_get_function_arguments()'s interface assumes that the caller is
providing the enclosing parens. Changing it would have meant returning
a result like 'float8) WITHIN GROUP (float8' which I reckoned would
have too much chance of breaking existing callers.

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Regardless of that, though ... what is the reasoning for
> Tom> inventing pg_get_aggregate_arguments() rather than just making
> Tom> pg_get_function_arguments() do the right thing?

> pg_get_function_arguments()'s interface assumes that the caller is
> providing the enclosing parens. Changing it would have meant returning
> a result like 'float8) WITHIN GROUP (float8' which I reckoned would
> have too much chance of breaking existing callers.

Well, as it stands, existing callers are broken a fortiori; they can't
possibly produce the right output for an ordered-set aggregate, if we
define the "right output" as looking like that. Of course, if we
define the right output as being just the arguments according to
pg_proc, it's fine.

But your point about the parens is a good one anyway, because now that
I think about it, what \da has traditionally printed is

 pg_catalog | sum | bigint | integer | sum as bigint across all integer input values

and I see the patch makes it do this

 pg_catalog | sum | bigint | (integer) | sum as bigint across all integer input values

which I cannot find to be an improvement, especially since \df does not
look like that. (As patched, the output of "\df ran*" is positively
schizophrenic.)

One possibility is to forget all the parens and say that the display
looks like

	type1, type2 WITHIN GROUP type3, type4

Please don't object that that doesn't look exactly like the syntax for
calling the function, because it doesn't anyway --- remember you also
need ORDER BY in the call.

Or perhaps we could abbreviate that --- maybe just WITHIN?
Abbreviation would be a good thing, especially if we're going to say
'VARIADIC "any"' twice in common cases. OTOH I'm not sure we can make
that work as a declaration syntax without reserving WITHIN. I think
WITHIN GROUP would work, though I've not tried to see if bison likes
it.

Anyway, the long and the short of this is that the SQL committee's
bizarre choice of syntax for calling these functions should not be
followed too slavishly when declaring them.

			regards, tom lane
I wrote:
> One possibility is to forget all the parens and say that the display
> looks like
>	type1, type2 WITHIN GROUP type3, type4
> Please don't object that that doesn't look exactly like the syntax
> for calling the function, because it doesn't anyway --- remember you
> also need ORDER BY in the call.

Actually, now that I think of it, why not use this syntax for
declaration and display purposes:

	type1, type2 ORDER BY type3, type4

This has nearly as much relationship to the actual calling syntax as
the WITHIN GROUP proposal does, and it's hugely saner from a semantic
standpoint, because after all the ordering columns are ordering
columns, not grouping columns.

			regards, tom lane
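[Under this proposal a declaration might look something like the
following; an illustrative sketch only, since this syntax is merely
proposed here, and the support-function names are placeholders, not
anything from the patch:]

```sql
-- Proposed declaration syntax: direct argument types, then ORDER BY
-- and the sort argument types, with no extra parentheses:
CREATE AGGREGATE percentile_disc(float8 ORDER BY anyelement) (
    SFUNC     = ordered_set_transition,    -- placeholder support function
    STYPE     = internal,
    FINALFUNC = percentile_disc_final      -- placeholder support function
);

-- which \da would then display as:
--   float8 ORDER BY anyelement
```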
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Please don't object that that doesn't look exactly like the syntax
 >> for calling the function, because it doesn't anyway --- remember
 >> you also need ORDER BY in the call.

 Tom> Actually, now that I think of it, why not use this syntax for
 Tom> declaration and display purposes:

 Tom> type1, type2 ORDER BY type3, type4

 Tom> This has nearly as much relationship to the actual calling
 Tom> syntax as the WITHIN GROUP proposal does,

But unfortunately it looks exactly like the calling sequence for a
normal aggregate with an order by clause - I really think that is
potentially too much confusion. (It's one thing not to look like the
calling syntax, it's another to look exactly like a valid calling
sequence for doing something _different_.)

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth wrote
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> >> Please don't object that that doesn't look exactly like the syntax
> >> for calling the function, because it doesn't anyway --- remember
> >> you also need ORDER BY in the call.
>
> Tom> Actually, now that I think of it, why not use this syntax for
> Tom> declaration and display purposes:
>
> Tom> type1, type2 ORDER BY type3, type4
>
> Tom> This has nearly as much relationship to the actual calling
> Tom> syntax as the WITHIN GROUP proposal does,
>
> But unfortunately it looks exactly like the calling sequence for a
> normal aggregate with an order by clause - I really think that is
> potentially too much confusion. (It's one thing not to look like
> the calling syntax, it's another to look exactly like a valid
> calling sequence for doing something _different_.)
>
> --
> Andrew (irc:RhodiumToad)

How about:

	type1, type2 GROUP ORDER type3, type4

or:

	GROUP type1, type2 ORDER type3, type4

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-WITHIN-GROUP-patch-tp5773851p5782202.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Actually, now that I think of it, why not use this syntax for
> Tom> declaration and display purposes:
> Tom> type1, type2 ORDER BY type3, type4

> But unfortunately it looks exactly like the calling sequence for a
> normal aggregate with an order by clause - I really think that is
> potentially too much confusion.

I thought about that too, but really that ship sailed long ago, and it
went to sea under the SQL committee's captaincy, so it's not our fault.
There are already at least four different standards-blessed ways you
can use ORDER BY in a query, some of them quite nearby (eg window
functions); so the potential for confusion is there no matter what we
do. In this case, if we describe ordered-set aggregates using WITHIN
GROUP rather than ORDER BY, we might avoid confusion with the
normal-aggregate case, but instead we will have confusion about what
the arguments even do. Is semantic confusion better than syntactic
confusion?

Another thing to think about here is to wonder why the committee chose
anything as verbose as "agg(...) WITHIN GROUP (ORDER BY ...)" in the
first place. The words ORDER BY certainly seem pretty unnecessary.
I'm suspicious that they might've been leaving the door open to put
other things into the second set of parens later --- GROUP BY, maybe?
So down the road, we might regret it if we key off WITHIN GROUP and
not ORDER BY.

Having said that, I'm not so dead set on it that I won't take WITHIN
GROUP if that's what more people want. But we gotta lose the extra
parens; they are just too strange for function
declaration/documentation purposes.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> pg_get_function_arguments()'s interface assumes that the caller is
 >> providing the enclosing parens. Changing it would have meant
 >> returning a result like 'float8) WITHIN GROUP (float8' which I
 >> reckoned would have too much chance of breaking existing callers.

 Tom> Well, as it stands, existing callers are broken a fortiori;

Not in most cases, because using the output of
pg_get_function_arguments works in all contexts except for constructing
the CREATE AGGREGATE statement (for example, a function that uses the
pg_get_function_arguments output to construct GRANTs or ALTER OWNERs
would still work). And since constructing a correct CREATE AGGREGATE
statement for an ordered set function requires that the caller know
about the additional options to supply in the body, this does not seem
like a restriction.

 Tom> Of course, if we define the right output as being just the
 Tom> arguments according to pg_proc, it's fine.

The patch accepts the 'just the arguments according to pg_proc' form as
input everywhere except in CREATE AGGREGATE, in addition to the syntax
with explicit WITHIN GROUP. (With the exception of GRANT, as it
happens, where since there is no existing GRANT ON AGGREGATE variant
and the patch doesn't add one, only the pg_proc arguments form is
accepted.)

 Tom> But your point about the parens is a good one anyway, because
 Tom> now that I think about it, what \da has traditionally printed is

 Tom> pg_catalog | sum | bigint | integer | sum as bigint across
 Tom> all integer input values

 Tom> and I see the patch makes it do this

 Tom> pg_catalog | sum | bigint | (integer) | sum as bigint acro

 Tom> which I cannot find to be an improvement, especially since \df
 Tom> does not look like that. (As patched, the output of "\df ran*"
 Tom> is positively schizophrenic.)

Since I don't particularly trust my own judgement on aesthetics, I used
the feedback I got from others when deciding what to do. Frankly I
think this one needs wider input than just you and me arguing over it.

-- 
Andrew (irc:RhodiumToad)
On 12/06/2013 01:30 PM, Andrew Gierth wrote:
> Since I don't particularly trust my own judgement on aesthetics, I used
> the feedback I got from others when deciding what to do. Frankly I think
> this one needs wider input than just you and me arguing over it.

Can someone paste examples of the two syntax alternatives we're talking
about here? I've lost track.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Another thing to think about here is to wonder why the committee
 Tom> chose anything as verbose as "agg(...) WITHIN GROUP (ORDER BY
 Tom> ...)" in the first place. The words ORDER BY certainly seem
 Tom> pretty unnecessary.

All of the ordered-set functions that the spec defines are intimately
tied to ordering of values, and they allow things like DESC ordering to
affect the results, so it seems obvious to me that they used (ORDER BY
...) because what follows is both syntactically and semantically
similar to any other use of ORDER BY. (Similar logic seems to have been
used for "FILTER (WHERE ...)".) (The name "ordered set function" also
seems quite specific.)

So I don't think there's any greater chance of the spec people adding a
WITHIN GROUP (something_other_than_order_by) construct than of them
adding any other awkward piece of syntax - and if they did, it would
represent an entirely new class of aggregate functions, separate from
ordered set functions and no doubt with its own headaches.

(I realize that expecting sanity from the spec writers is perhaps
unwise.)

-- 
Andrew (irc:RhodiumToad)
Josh Berkus <josh@agliodbs.com> writes:
> Can someone paste examples of the two syntax alternatives we're talking
> about here? I've lost track.

I'll leave it to Andrew to describe/defend exactly what his patch is
doing. The alternative I'm thinking about is that in CREATE AGGREGATE
as well as \da output, the argument list of an ordered-set aggregate
would look like

	type1, type2, ... ORDER BY type3, type4, ...

if the aggregate only permits a fixed number of ordering columns (as
mode() does for example). If it permits a variable number of ordering
columns, you could have

	type1, type2, ... ORDER BY [ type3, type4, ... ] VARIADIC something

99% of the time the right-hand part would just be "VARIADIC ANY" but if
an aggregate had need to lock down the ordering column types, or the
leading ordering column types, it could do that. If it needs a
variable number of direct arguments as well (which in particular
hypothetical set functions do) then you would write

	[ type1, type2, ... ] VARIADIC something ORDER BY VARIADIC something

where we constrain the two "somethings" to be the same. (Again, these
would *usually* be ANY, but I can envision that an aggregate might want
to constrain the argument types more than that.) We have to constrain
this case because the underlying pg_proc representation and parser
function lookup code only allow the last argument to be declared
VARIADIC. So under the hood, this last case would be represented in
pg_proc with proargs looking like just "[ type1, type2, ... ] VARIADIC
something", whereas in the first two cases the proargs representation
would contain all the same entries as above.

We could substitute something else for ORDER BY without doing a lot of
violence to this, if people are really down on that aspect.

			regards, tom lane
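[To make the three shapes concrete, here is how some of the aggregates
discussed in this thread might display under this scheme; an
illustrative sketch only, with the type lists inferred from the thread
rather than taken from the patch:]

```sql
-- Fixed number of ordering columns:
--   mode():            ORDER BY anyelement
--   percentile_disc(): float8 ORDER BY anyelement

-- Variable number of ordering columns, fixed direct arguments
-- (some_agg is a made-up name):
--   some_agg():        integer ORDER BY VARIADIC "any"

-- Variable direct arguments matched to variable ordering columns
-- (hypothetical-set functions):
--   rank():            VARIADIC "any" ORDER BY VARIADIC "any"
```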
>>>>> "Josh" == Josh Berkus <josh@agliodbs.com> writes:

>> Since I don't particularly trust my own judgement on aesthetics, I
>> used the feedback I got from others when deciding what to do.
>> Frankly I think this one needs wider input than just you and me
>> arguing over it.

Josh> Can someone paste examples of the two syntax alternatives we're
Josh> talking about here? I've lost track.

OK. The starting point is that this is the calling syntax for ordered set funcs, set by the spec:

    SELECT func(foo) WITHIN GROUP (ORDER BY bar) FROM ...

where "foo" and "bar" might be fixed types, or polymorphic or variadic. So we need to define (with no help from the spec) at least these:

- what syntax is used in CREATE AGGREGATE to specify the number and types of parameters for a newly defined "func"
- what syntax is used to refer to the function in a GRANT ... ON FUNCTION ... statement, or ALTER ... OWNER TO ...
- what should ::regprocedure casts accept as input and produce as output
- what output is shown in \df and \da when listing the function's argument types

The patch as submitted answers those questions as follows:

    CREATE AGGREGATE func(integer) WITHIN GROUP (text) ...

    GRANT ... ON FUNCTION func(integer,text) ...
    (there is no GRANT ... ON AGGREGATE ... (yet))

    ALTER AGGREGATE func(integer) WITHIN GROUP (text) OWNER TO ...
    ALTER AGGREGATE func(integer,text) OWNER TO ...
    ALTER FUNCTION func(integer,text) OWNER TO ...
    (all three of those are equivalent)

regprocedure outputs 'func(integer,text)' and accepts only that as input.

postgres=# \df func
                          List of functions
 Schema | Name | Result data type |      Argument data types      | Type
--------+------+------------------+-------------------------------+------
 public | func | text             | (integer) WITHIN GROUP (text) | agg
(1 row)

If there's also a func() that isn't an ordered set function, you get output like this (which provoked Tom's "schizophrenic" comment):

postgres=# \df ran[a-z]{1,5}
                                   List of functions
   Schema   |   Name   | Result data type |        Argument data types        |  Type
------------+----------+------------------+-----------------------------------+--------
 pg_catalog | random   | double precision |                                   | normal
 pg_catalog | rangesel | double precision | internal, oid, internal, integer  | normal
 pg_catalog | rank     | bigint           |                                   | window
 pg_catalog | rank     | bigint           | (VARIADIC "any") WITHIN GROUP (*) | agg
(4 rows)

-- Andrew (irc:RhodiumToad)
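As background for the hypothetical-set functions shown in that \df output, the core semantics of a hypothetical dense_rank can be sketched in a few lines. This is illustrative Python for the single-sort-column, ascending case only, not the server's C implementation; the function name is ours:

```python
def hypothetical_dense_rank(hypothetical_value, values):
    """dense_rank of hypothetical_value if it were inserted into the
    sorted input: one plus the number of distinct values that sort
    strictly before it (peers share a rank, with no gaps)."""
    return 1 + sum(1 for v in sorted(set(values)) if v < hypothetical_value)
```

For example, a hypothetical value of 25 ranked against input values 10, 20, 30 lands in the third distinct position.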
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> The patch as submitted answers those questions as follows:
> CREATE AGGREGATE func(integer) WITHIN GROUP (text) ...

You've glossed over a significant amount of complexity, as shown by your example that prints WITHIN GROUP (*), a syntax that you've not defined here.

In the long run we might think it worthwhile to actually store two separate arglists for ordered-set aggregates; probably, pg_proc.proargs would just describe the direct arguments and there'd be a second oidvector in pg_aggregate that would describe the ORDER BY arguments. This'd let them be independently VARIADIC, or not. I'm not proposing we do that right now, because we don't have any use-cases that aren't sufficiently handled by the hack of letting a single VARIADIC ANY entry cover both sets of arguments. I'd like though that the external syntax not be something that prevents that from ever happening, and I'm afraid that this (*) business is cheating enough to be a problem.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
Tom> treat all n fixed arguments as contributing to the aggregate's
Tom> result collation, but ignore the sort arguments.

>> That doesn't work for getting a sensible collation out of
>> percentile_disc applied to a collatable type. (Which admittedly is
>> an extension to the spec, which allows only numeric and interval,
>> but it seems to me to be worth having.)

Tom> Meh. I don't think you can have that and also have the behavior
Tom> that multiple ORDER BY items aren't constrained to have the same
Tom> collation; at least not without some rule that amounts to a
Tom> special case for percentile_disc, which I'd resist.

What the submitted patch does (as discussed in the comment in parse_collate) is to treat the sort argument as contributing to the collation only if there is exactly one sort arg. Consider a construct like:

    select max(common_val)
      from (select mode() within group (order by textcol) as common_val
              from ... group by othercol) s;

(the same arguments for percentile_disc also apply to mode() and to any other ordered set function that returns a value chosen from its input sorted set)

Having this sort of thing not preserve the collation of textcol (or fail) would be, IMO, surprising and undesirable.

-- Andrew (irc:RhodiumToad)
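For reference, the "returns a value chosen from its input sorted set" behavior of mode() and percentile_disc() mentioned above can be sketched as follows. This is illustrative Python, not the server code; in particular the mode tie-break shown here (lowest-sorting value) is a deterministic stand-in, whereas the server makes an implementation-dependent choice among equally frequent values:

```python
import math
from collections import Counter

def mode(values):
    # most frequent input value; take the first-sorting one on ties
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def percentile_disc(fraction, values):
    # first value in sort order whose cumulative share of the input
    # reaches the requested fraction (a discrete percentile: the
    # result is always an actual input value)
    ordered = sorted(values)
    idx = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[idx]
```

Because the result is literally one of the input values, preserving the input's collation in the result is what makes the max(common_val) example above behave sensibly.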
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Meh. I don't think you can have that and also have the behavior
> Tom> that multiple ORDER BY items aren't constrained to have the same
> Tom> collation; at least not without some rule that amounts to a
> Tom> special case for percentile_disc, which I'd resist.
> What the submitted patch does (as discussed in the comment in
> parse_collate) is to treat the sort argument as contributing to the
> collation only if there is exactly one sort arg.

Blech :-( Not wanting to consider the sort args when there's more than one doesn't square with forcing them to be considered when there's just one. It's the same aggregate after all, and in general the semantics ought to be the same whether you give one sort col or three.

We might be able to make this work sanely by considering only argument positions that were declared something other than ANY (anyelement being other than that, so this isn't leaving polymorphics in the cold entirely). This is a bit unlike the normal rules for collation assignment but it doesn't result in weird semantics changes depending on how many sort columns you supply.

> Consider a construct like:
> select max(common_val)
> from (select mode() within group (order by textcol) as common_val
> from ... group by othercol) s;

AFAICT none of the SQL-spec aggregates expose the kind of case I'm worried about, because none of the ones that can take multiple sort columns have a potentially collatable return type. I don't think that justifies not thinking ahead, though.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> Not wanting to consider the sort args when there's more than one
Tom> doesn't square with forcing them to be considered when there's
Tom> just one. It's the same aggregate after all,

This logic is only applied in the patch to aggregates that _aren't_ hypothetical.

(thinking out loud:) It might be more consistent to also add the condition that the single sort column not be a variadic arg. And/or the condition that it be the same type as the result. Or have a flag in pg_aggregate to say "this agg returns one of its sorted input values, so preserve the collation".

>> Consider a construct like:
>> select max(common_val)
>> from (select mode() within group (order by textcol) as common_val
>> from ... group by othercol) s;

Tom> AFAICT none of the SQL-spec aggregates expose the kind of case
Tom> I'm worried about, because none of the ones that can take
Tom> multiple sort columns have a potentially collatable return type.

None of the spec's ordered-set functions expose any collation issue at all, because they _all_ have non-collatable return types, period. The problem only arises from the desire to make functions like percentile_disc and mode applicable to collatable types.

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> I believe that the spec requires that the "direct" arguments of
> Tom> an inverse or hypothetical-set aggregate must not contain any
> Tom> Vars of the current query level.
> False.

After examining this more closely, ISTM that the direct arguments are supposed to be processed as if they weren't inside an aggregate call at all. That being the case, isn't it flat out wrong for check_agg_arguments() to be examining the agg_ordset list? It should ignore those expressions whilst determining the aggregate's semantic level. As an example, an upper-level Var in those expressions isn't grounds for deciding that the aggregate isn't of the current query level.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> After examining this more closely, ISTM that the direct
Tom> arguments are supposed to be processed as if they weren't inside
Tom> an aggregate call at all. That being the case, isn't it flat
Tom> out wrong for check_agg_arguments() to be examining the
Tom> agg_ordset list? It should ignore those expressions whilst
Tom> determining the aggregate's semantic level. As an example, an
Tom> upper-level Var in those expressions isn't grounds for deciding
Tom> that the aggregate isn't of the current query level.

Hmm... yes, you're probably right; but we'd still have to check somewhere for improper nesting, no? since not even the direct args are allowed to contain nested aggregate calls.

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> After examining this more closely, ISTM that the direct
> Tom> arguments are supposed to be processed as if they weren't inside
> Tom> an aggregate call at all. That being the case, isn't it flat
> Tom> out wrong for check_agg_arguments() to be examining the
> Tom> agg_ordset list? It should ignore those expressions whilst
> Tom> determining the aggregate's semantic level. As an example, an
> Tom> upper-level Var in those expressions isn't grounds for deciding
> Tom> that the aggregate isn't of the current query level.
> Hmm... yes, you're probably right; but we'd still have to check somewhere
> for improper nesting, no? since not even the direct args are allowed to
> contain nested aggregate calls.

Yeah, I had come to that same conclusion while making the changes; actually, check_agg_arguments needs to look for aggs but not vars there.

In principle we could probably support aggs there if we really wanted to; but it would result in evaluation-order dependencies among the aggs of a single query level, which doesn't seem like something we want to take on for a feature that's not required by spec.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

>> Hmm... yes, you're probably right; but we'd still have to check
>> somewhere for improper nesting, no? since not even the direct args
>> are allowed to contain nested aggregate calls.

Tom> Yeah, I had come to that same conclusion while making the
Tom> changes; actually, check_agg_arguments needs to look for aggs
Tom> but not vars there.

There's also the question of ungrouped vars, maybe. Consider these two queries:

    select array(select a+sum(x) from (values (0.3),(0.7)) v(a) group by a)
      from generate_series(1,5) g(x);

    select array(select percentile_disc(a) within group (order by x)
                   from (values (0.3),(0.7)) v(a) group by a)
      from generate_series(1,5) g(x);

In both cases the aggregation query is the outer one; but while the first can return a value, I think the second one has to fail (at least I can't see any reasonable way of executing it).

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> There's also the question of ungrouped vars, maybe. Consider these two
> queries:
> select array(select a+sum(x) from (values (0.3),(0.7)) v(a) group by a)
> from generate_series(1,5) g(x);
> select array(select percentile_disc(a) within group (order by x)
> from (values (0.3),(0.7)) v(a) group by a)
> from generate_series(1,5) g(x);
> In both cases the aggregation query is the outer one; but while the first
> can return a value, I think the second one has to fail (at least I can't
> see any reasonable way of executing it).

Hm, interesting. So having decided that the agg has level 1, we need to reject any level-0 vars in the direct parameters, grouped or not.

We could alternatively decide that the agg has level 0, but that doesn't seem terribly useful, and I think it's not per spec either. SQL:2008 section 6.9 <set function specification> seems pretty clear that only aggregated arguments should be considered when determining the semantic level of an aggregate. OTOH, I don't see any text there restricting what can be in the non-aggregated arguments, so maybe the committee thinks this case is sensible? Or they just missed it.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> We could alternatively decide that the agg has level 0, but that
Tom> doesn't seem terribly useful, and I think it's not per spec
Tom> either. SQL:2008 section 6.9 <set function specification> seems
Tom> pretty clear that only aggregated arguments should be considered
Tom> when determining the semantic level of an aggregate. OTOH, I
Tom> don't see any text there restricting what can be in the
Tom> non-aggregated arguments, so maybe the committee thinks this
Tom> case is sensible? Or they just missed it.

My bet is that they missed it, because there's another obvious oversight there; it doesn't define column references in the FILTER clause applied to an ordered set function as being aggregated column references, yet it's clear that this must be the case (since they filter the set of rows that the aggregated column references refer to).

-- Andrew (irc:RhodiumToad)
On 11/21/13, 5:04 AM, Atri Sharma wrote:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I would like to see more explanations and examples in the documentation. You introduce this feature with "Ordered set functions compute a single result from an ordered set of input values." But string_agg, for example, does that as well, so it's not clear how this is different. Between ordered aggregates, window functions, and this new feature, it can get pretty confusing. Also, the "hypothetical" part should perhaps be explained in more detail.

The tutorial part of the documentation contains a nice introduction to window functions. I suggest you add something like that as well.

In func.sgml, please list the functions in alphabetical order. Also, don't write "should" when you mean "must".
[ still hacking away at this patch ]

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> Not wanting to consider the sort args when there's more than one
> Tom> doesn't square with forcing them to be considered when there's
> Tom> just one. It's the same aggregate after all,
> This logic is only applied in the patch to aggregates that _aren't_
> hypothetical.
> (thinking out loud:) It might be more consistent to also add the
> condition that the single sort column not be a variadic arg. And/or
> the condition that it be the same type as the result. Or have a flag
> in pg_aggregate to say "this agg returns one of its sorted input
> values, so preserve the collation".

I eventually decided that we were overthinking this problem. At least for regular ordered-set aggregates, we can just deem that the collation of the aggregate is indeterminate unless all the inputs (both direct and aggregated) agree on the collation to use. This gives us the right answer for all the standard aggregates, which have at most one collatable input, and it's very unclear that anything more complicated would be an improvement. I definitely didn't like the hack that was in there that'd force a sort column to absorb a possibly-unrelated collation.

For hypotheticals, I agree after reading the spec text that we're supposed to unify the collations of matching hypothetical and aggregated arguments to determine the collation to use for sorting that column.

I see that the patch just leaves these columns out of the determination of the aggregate's result collation. That's okay for the time being at least, since we have no hypotheticals with collatable output types, but I wonder if anyone wants to argue for some other rule (and if so, what)?

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> I eventually decided that we were overthinking this problem. At
Tom> least for regular ordered-set aggregates, we can just deem that
Tom> the collation of the aggregate is indeterminate unless all the
Tom> inputs (both direct and aggregated) agree on the collation to
Tom> use. This gives us the right answer for all the standard
Tom> aggregates, which have at most one collatable input, and it's
Tom> very unclear that anything more complicated would be an
Tom> improvement. I definitely didn't like the hack that was in
Tom> there that'd force a sort column to absorb a possibly-unrelated
Tom> collation.

Yeah, I can go along with that, but see below.

Tom> For hypotheticals, I agree after reading the spec text that
Tom> we're supposed to unify the collations of matching hypothetical
Tom> and aggregated arguments to determine the collation to use for
Tom> sorting that column.

Yeah, the spec seemed clear enough on that.

Tom> I see that the patch just leaves these columns out of the
Tom> determination of the aggregate's result collation. That's okay
Tom> for the time being at least, since we have no hypotheticals with
Tom> collatable output types, but I wonder if anyone wants to argue
Tom> for some other rule (and if so, what)?

Any alternative seems a bit ad-hoc to me. The examples I've thought of which would return collatable types are all ones that would be implemented as plain ordered set functions even if their logic was in some sense hypothetical. For example you could envisage a value_prec(x) within group (order by y) that returns the value of y which sorts immediately before x, but this would just be declared as value_prec(anyelement) within group (anyelement) rather than engaging the hypothetical argument stuff. (It's this sort of thing that suggested pushing down the collation into the sort column even for non-hypothetical ordered set functions.)

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> The examples I've thought of which would return collatable types are
> all ones that would be implemented as plain ordered set functions even
> if their logic was in some sense hypothetical. For example you could
> envisage a value_prec(x) within group (order by y) that returns the
> value of y which sorts immediately before x, but this would just be
> declared as value_prec(anyelement) within group (anyelement) rather
> than engaging the hypothetical argument stuff. (It's this sort of
> thing that suggested pushing down the collation into the sort column
> even for non-hypothetical ordered set functions.)

Meh. I see no very good reason that we shouldn't insist that the collation be set on the sort column rather than the other input. That is, if the alternatives are

    value_prec(x collate "foo") within group (order by y)
    value_prec(x) within group (order by y collate "foo")

which one makes it clearer that the ordering is to use collation foo? The first one is at best unnatural, and at worst action-at-a-distance. If you think of the sorting as an operation done in advance of applying value_prec(), which is something the syntax rather encourages you to think, there's no reason that it should absorb a collation from a collate clause attached to a higher-level operation.

So I think the spec authors arguably got it wrong for hypotheticals, and I'm uneager to extrapolate their choice of behavior into situations where we don't know for certain that the collations of two arguments must get unified.

More practically, I'm dubious about your assumption that an aggregate like this needn't be marked hypothetical. I see that use of anyelement would be enough to constrain the types to be the same, but it doesn't ordinarily constrain collations, and I don't think it should start doing so.
So my reaction to this example is not that we should hack the behavior for plain ordered-set aggregates, but that we ought to find a rule that allows result-collation determination for hypotheticals. We speculated upthread about "merge the collations normally, but ignore inputs declared ANY" and "merge the collations normally, but ignore variadic inputs". Either of those would get the job done in this example. I kinda think we should pick one of these rules and move on. regards, tom lane
I wrote: > ... So my reaction to this example is not > that we should hack the behavior for plain ordered-set aggregates, > but that we ought to find a rule that allows result-collation > determination for hypotheticals. We speculated upthread about > "merge the collations normally, but ignore inputs declared ANY" > and "merge the collations normally, but ignore variadic inputs". > Either of those would get the job done in this example. I kinda > think we should pick one of these rules and move on. Or, really, why don't we just do the same thing I'm advocating for the plain-ordered-set case? That is, if there's a single collation applying to all the collatable inputs, that's the collation to use for the aggregate; otherwise it has no determinate collation, and it'll throw an error at runtime if it needs one. We realized long ago that we can't throw most need-a-collation errors at parse time, because the parser lacks information about which functions need to know a collation to use. This seems to be in the same category. regards, tom lane
I wrote:
> Or, really, why don't we just do the same thing I'm advocating for
> the plain-ordered-set case? That is, if there's a single collation
> applying to all the collatable inputs, that's the collation to use
> for the aggregate; otherwise it has no determinate collation, and
> it'll throw an error at runtime if it needs one.

I went and tried to implement this, and realized that it would take some pretty significant rewriting of parse_collate.c, because of examples like this:

    rank(a,b) within group (order by c collate "foo", d collate "bar")

In the current parse_collate logic, it would throw error immediately upon being told to merge the two explicit-COLLATE results. We'd need a way to postpone that error and instead just decide that the rank aggregate's collation is indeterminate. While that's perhaps just a SMOP, it would mean that ordered-set aggregates don't resolve collation the same way as other functions, which pretty much destroys the argument for this approach. What's more, the same problem applies to non-hypothetical ordered-set aggregates, if they've got more than one sortable input column.

What I'm now thinking we want to do is:

1. Non-hypothetical direct args always contribute to determining the agg's collation.

2. Hypothetical and aggregated args contribute to the agg's collation only if the agg is designed so that there is always exactly one aggregated arg (ie, it's non-variadic with one aggregated arg). Otherwise we assign their collations per-sort-column and don't merge them into the aggregate's collation.

This specification ensures that a variadic aggregate doesn't change behavior depending on how many sort columns there happen to be.

regards, tom lane
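The two rules above can be condensed into a small decision function. This is an illustrative Python sketch, not parser code; the function name is ours, and `None` stands in for "indeterminate collation":

```python
def ordered_set_agg_collation(direct_collations, aggregated_collations,
                              variadic):
    """Rule 1: non-hypothetical direct args always contribute.
    Rule 2: aggregated args contribute only when the aggregate always
    has exactly one aggregated arg (i.e. it is non-variadic with a
    single aggregated arg).  A single common collation wins; any
    conflict, or no collatable input at all, leaves the aggregate's
    collation indeterminate (None)."""
    contributing = [c for c in direct_collations if c is not None]
    if not variadic and len(aggregated_collations) == 1:
        contributing += [c for c in aggregated_collations if c is not None]
    distinct = set(contributing)
    return distinct.pop() if len(distinct) == 1 else None
```

Note how a variadic aggregate's result collation depends only on its direct args, so adding or removing sort columns can't change the outcome.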
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> What I'm now thinking we want to do is:

Tom> 1. Non-hypothetical direct args always contribute to determining
Tom> the agg's collation.

Tom> 2. Hypothetical and aggregated args contribute to the agg's
Tom> collation only if the agg is designed so that there is always
Tom> exactly one aggregated arg (ie, it's non-variadic with one
Tom> aggregated arg). Otherwise we assign their collations
Tom> per-sort-column and don't merge them into the aggregate's
Tom> collation.

Tom> This specification ensures that a variadic aggregate doesn't
Tom> change behavior depending on how many sort columns there happen
Tom> to be.

Works for me.

-- Andrew (irc:RhodiumToad)
Atri Sharma <atri.jiit@gmail.com> writes:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I've committed this after significant editorialization --- most notably, I pushed control of the sort step into the aggregate support functions. I didn't like the way nodeAgg.c had been hacked up to do it the other way. There's a couple hundred lines of code handling that in orderedsetaggs.c, which is more or less comparable to the amount of code that didn't get added to nodeAgg.c, so I think the argument that the original approach avoided code bloat is bogus.

The main reason I pushed all the new aggregates into a single file (orderedsetaggs.c) was so they could share a private typedef for the transition state value. It's possible that we should expose that struct so that third-party aggregate functions could leverage the existing transition-function infrastructure instead of having to copy-and-paste it. I wasn't sure where to put it though --- maybe a new include file would be needed. Anyway I left the point for future discussion.

regards, tom lane
Sent from my iPad > On 24-Dec-2013, at 2:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Atri Sharma <atri.jiit@gmail.com> writes: >> Please find attached the latest patch for WITHIN GROUP. This patch is >> after fixing the merge conflicts. > > I've committed this after significant editorialization --- most notably, > I pushed control of the sort step into the aggregate support functions. > I didn't like the way nodeAgg.c had been hacked up to do it the other way. > There's a couple hundred lines of code handling that in orderedsetaggs.c, > which is more or less comparable to the amount of code that didn't get > added to nodeAgg.c, so I think the argument that the original approach > avoided code bloat is bogus. > > The main reason I pushed all the new aggregates into a single file > (orderedsetaggs.c) was so they could share a private typedef for the > transition state value. It's possible that we should expose that > struct so that third-party aggregate functions could leverage the > existing transition-function infrastructure instead of having to > copy-and-paste it. I wasn't sure where to put it though --- maybe > a new include file would be needed. Anyway I left the point for > future discussion. > > regards, tom lane Thank you!
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> I've committed this after significant editorialization --- most
Tom> notably, I pushed control of the sort step into the aggregate
Tom> support functions.

Initial tests suggest that your version is ~40% slower than ours on some workloads.

On my system, this query takes ~950ms using our dev branch of the code, and ~1050ms on git master (using \timing in psql for timings, and taking the best of many consecutive runs):

    select count(*)
      from (select percentile_disc(0.5) within group (order by i)
              from generate_series(1,3) i, generate_series(1,100000) j
             group by j) s;

About ~700ms of that is overhead, as tested by running this query with enable_hashagg=false:

    select count(*)
      from (select j from generate_series(1,3) i, generate_series(1,100000) j
             group by j) s;

So your version is taking 350ms for the percentile calculations compared to 250ms for ours.

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> I've committed this after significant editorialization --- most
> Tom> notably, I pushed control of the sort step into the aggregate
> Tom> support functions.
> Initial tests suggest that your version is ~40% slower than ours on
> some workloads.

I poked at this a bit with perf and oprofile, and concluded that probably the difference comes from ordered_set_startup() repeating lookups for each group that could be done just once per query. I'm not sure I believe the 40% figure; on this particular test case, oprofile breaks down the costs of ordered_set_startup() like this:

      29   0.0756  postgres  advance_transition_function
   38307  99.9244  postgres  ordered_set_transition
    1445   1.0808  postgres  ordered_set_startup
   31418  79.4990  postgres  tuplesort_begin_datum
    4056  10.2632  postgres  get_typlenbyvalalign
    1445   3.6564  postgres  ordered_set_startup [self]
     734   1.8573  postgres  MemoryContextAllocZero
     510   1.2905  postgres  RegisterExprContextCallback
     283   0.7161  postgres  exprType
     247   0.6250  postgres  get_sortgroupclause_tle
     235   0.5946  postgres  exprCollation
      92   0.2328  postgres  ReleaseSysCache
      78   0.1974  postgres  SearchSysCache
      71   0.1797  postgres  AggGetAggref
      63   0.1594  postgres  AggCheckCallContext
      58   0.1468  postgres  AllocSetAlloc
      46   0.1164  postgres  PrepareSortSupportFromOrderingOp
      43   0.1088  postgres  AggGetPerAggEContext
      40   0.1012  postgres  get_typlenbyval
      39   0.0987  postgres  palloc0
      35   0.0886  postgres  MemoryContextAlloc
      17   0.0430  postgres  get_sortgroupref_tle
      10   0.0253  postgres  tuplesort_begin_common

The tuplesort_begin_datum calls are needed regardless --- your version was just doing them inside nodeAgg rather than externally. However, get_typlenbyvalalign and some of the other calls here are to fetch information that doesn't change across groups; probably we could arrange to cache that info instead of recomputing it each time.
Still, it doesn't look like that could save more than 20% of ordered_set_startup, which itself is still only a few percent of the total query time.

Looking at this example makes me wonder if it wouldn't be worthwhile to provide a way to reset and re-use a tuplesort object, instead of redoing all the lookup work involved. Or maybe just find a way to cache the catalog lookups that are happening inside tuplesort_begin_datum, which are about 50% of that function's cost it looks like. We're paying this same kind of price for repeated tuplesort setup in the existing nodeAgg code, if we have an aggregate with ORDER BY or DISTINCT in a grouped query with many groups.

regards, tom lane
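The reset-and-reuse idea can be sketched in miniature: pay the expensive setup once, and between groups drop only the accumulated rows. This is an illustrative Python sketch (the class name is ours; a real implementation would cache the sort-support and catalog lookups the way tuplesort_begin_datum would):

```python
class ReusableSorter:
    """Caches one-time setup (here, just the key function, standing in
    for cached sort-support info) so that per-group use only pays for
    the data --- mimicking a resettable tuplesort object."""

    def __init__(self, key=None):
        self.key = key      # expensive setup happens once, here
        self.rows = []

    def put(self, row):
        self.rows.append(row)

    def finish(self):
        return sorted(self.rows, key=self.key)

    def reset(self):
        # keep the setup, discard only the accumulated rows
        self.rows.clear()

# one sorter serves every group in turn
sorter = ReusableSorter()
results = []
for group in [[3, 1, 2], [9, 7, 8]]:
    for row in group:
        sorter.put(row)
    results.append(sorter.finish())
    sorter.reset()
```

The point of the design is that the constructor cost is amortized over all groups instead of being paid once per group.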
On Sun, Jan 5, 2014 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Looking at this example makes me wonder if it wouldn't be worthwhile to
> provide a way to reset and re-use a tuplesort object, instead of redoing
> all the lookup work involved. Or maybe just find a way to cache the
> catalog lookups that are happening inside tuplesort_begin_datum, which are
> about 50% of that function's cost it looks like. We're paying this same
> kind of price for repeated tuplesort setup in the existing nodeAgg code,
> if we have an aggregate with ORDER BY or DISTINCT in a grouped query with
> many groups.
>
> regards, tom lane

This sounds very similar to:

A reset function was added in the next patch which improved performance in my test case by about 5 times. Perhaps they can make use of the same function.

Regards
David Rowley
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

>> Initial tests suggest that your version is ~40% slower than ours on
>> some workloads.

Tom> I poked at this a bit with perf and oprofile, and concluded that
Tom> probably the difference comes from ordered_set_startup()
Tom> repeating lookups for each group that could be done just once
Tom> per query.

Retesting with your changes shows that the gap is down to 15% but still present:

    work_mem=64MB
    enable_hashagg=off (for baseline test)

baseline query (333ms on both versions):

    select count(*)
      from (select j from generate_series(1,3) i, generate_series(1,100000) j
             group by j) s;

test query:

    select count(*)
      from (select percentile_disc(0.5) within group (order by i)
              from generate_series(1,3) i, generate_series(1,100000) j
             group by j) s;

On the original patch as supplied: 571ms - 333ms = 238ms
On current master: 607ms - 333ms = 274ms

Furthermore, I can't help noticing that the increased complexity has now pretty much negated your original arguments for moving so much of the work out of nodeAgg.c.

-- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Retesting with your changes shows that the gap is down to 15% but still
> present:

There's probably some overhead from traversing advance_transition_function for each row, which your version wouldn't have done. 15% sounds pretty high for that though, since advance_transition_function doesn't have much to do when the transition function is non strict and the state value is pass-by-value (which "internal" is). It's possible that we could do something to micro-optimize that case.

I also wondered if it was possible to omit rechecking AggCheckCallContext after the first time through in ordered_set_transition(). It seemed unsafe to perhaps not have that happen at all, since if the point is to detect misuse then the mere fact that argument 0 isn't null isn't much comfort. It strikes me though that now we could test for "first call" by looking at fcinfo->flinfo->fn_extra, and be pretty sure that we've already checked the context if that isn't null. Whether this would save anything noticeable isn't clear though; I didn't see AggCheckCallContext high in the profile.

> Furthermore, I can't help noticing that the increased complexity has
> now pretty much negated your original arguments for moving so much of
> the work out of nodeAgg.c.

The key reason for that was, and remains, not having the behavior hard-wired in nodeAgg; I believe that this design permits things to be accomplished in aggregate implementation functions that would not have been possible with the original patch. I'm willing to accept some code growth to have that flexibility.

regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Furthermore, I can't help noticing that the increased complexity
 >> has now pretty much negated your original arguments for moving so
 >> much of the work out of nodeAgg.c.

 Tom> The key reason for that was, and remains, not having the
 Tom> behavior hard-wired in nodeAgg; I believe that this design
 Tom> permits things to be accomplished in aggregate implementation
 Tom> functions that would not have been possible with the original
 Tom> patch.  I'm willing to accept some code growth to have that
 Tom> flexibility.

Do you have an actual use case?

-- 
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> The key reason for that was, and remains, not having the
> Tom> behavior hard-wired in nodeAgg; I believe that this design
> Tom> permits things to be accomplished in aggregate implementation
> Tom> functions that would not have been possible with the original
> Tom> patch.  I'm willing to accept some code growth to have that
> Tom> flexibility.

> Do you have an actual use case?

Not a concrete example of an aggregate to build, no; but it seemed
plausible to me that a new aggregate might want more control over the
sorting operation than was possible with your patch.  Basically the
only flexibility that was there was to sort a hypothetical row before
or after its peers, right?  Now it's got essentially full control over
the sort parameters.

One idea that comes to mind is that an aggregate that's interested in
the "top N" rows might wish to flip the sort order, so that the rows it
wants come out of the tuplesort first rather than last --- and/or it
might want to tell the tuplesort about the row limitation, so that the
bounded-sort logic can be used.

			regards, tom lane
I wrote:
> There's probably some overhead from traversing advance_transition_function
> for each row, which your version wouldn't have done.  15% sounds pretty
> high for that though, since advance_transition_function doesn't have much
> to do when the transition function is non strict and the state value is
> pass-by-value (which "internal" is).  It's possible that we could do
> something to micro-optimize that case.

The most obvious micro-optimization that's possible there is to avoid
redoing InitFunctionCallInfoData for each row.  I tried this as in the
attached patch.  It seems to be good for about half a percent overall
savings on your example case.  That's not much, but then again this helps
for *all* aggregates, and many of them are cheaper than ordered
aggregates.  I see about 2% overall savings on

	select count(*) from generate_series(1,1000000);

which seems like a more interesting number.

As against that, the roughly-kilobyte-sized FunctionCallInfoData is no
longer just transient data on the stack, but persists for the lifespan of
the query.  We pay that already for plain FuncExpr and OpExpr nodes,
though.

On balance, I'm inclined to apply this; the performance benefit may be
marginal but it seems like it makes advance_transition_function's API a
tad cleaner by reducing the number of distinct structs it's hacking on.

Comments?

			regards, tom lane

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 827b009..0dafb51 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** typedef struct AggStatePerAggData
*** 235,240 ****
--- 235,248 ----
  	 */
  	Tuplesortstate *sortstate;	/* sort object, if DISTINCT or ORDER BY */
+ 
+ 	/*
+ 	 * This field is a pre-initialized FunctionCallInfo struct used for
+ 	 * calling this aggregate's transfn.  We save a few cycles per row by not
+ 	 * re-initializing the unchanging fields; which isn't much, but it seems
+ 	 * worth the extra space consumption.
+ 	 */
+ 	FunctionCallInfoData transfn_fcinfo;
  } AggStatePerAggData;
  
  /*
*************** static void initialize_aggregates(AggSta
*** 290,297 ****
  				AggStatePerGroup pergroup);
  static void advance_transition_function(AggState *aggstate,
  							AggStatePerAgg peraggstate,
! 							AggStatePerGroup pergroupstate,
! 							FunctionCallInfoData *fcinfo);
  static void advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup);
  static void process_ordered_aggregate_single(AggState *aggstate,
  								 AggStatePerAgg peraggstate,
--- 298,304 ----
  				AggStatePerGroup pergroup);
  static void advance_transition_function(AggState *aggstate,
  							AggStatePerAgg peraggstate,
! 							AggStatePerGroup pergroupstate);
  static void advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup);
  static void process_ordered_aggregate_single(AggState *aggstate,
  								 AggStatePerAgg peraggstate,
*************** initialize_aggregates(AggState *aggstate
*** 399,419 ****
   * Given new input value(s), advance the transition function of an aggregate.
   *
   * The new values (and null flags) have been preloaded into argument positions
!  * 1 and up in fcinfo, so that we needn't copy them again to pass to the
!  * transition function.  No other fields of fcinfo are assumed valid.
   *
   * It doesn't matter which memory context this is called in.
   */
  static void
  advance_transition_function(AggState *aggstate,
  							AggStatePerAgg peraggstate,
! 							AggStatePerGroup pergroupstate,
! 							FunctionCallInfoData *fcinfo)
  {
! 	int			numTransInputs = peraggstate->numTransInputs;
  	MemoryContext oldContext;
  	Datum		newVal;
- 	int			i;
  
  	if (peraggstate->transfn.fn_strict)
  	{
--- 406,425 ----
   * Given new input value(s), advance the transition function of an aggregate.
   *
   * The new values (and null flags) have been preloaded into argument positions
!  * 1 and up in peraggstate->transfn_fcinfo, so that we needn't copy them again
!  * to pass to the transition function.  We also expect that the static fields
!  * of the fcinfo are already initialized; that was done by ExecInitAgg().
   *
   * It doesn't matter which memory context this is called in.
   */
  static void
  advance_transition_function(AggState *aggstate,
  							AggStatePerAgg peraggstate,
! 							AggStatePerGroup pergroupstate)
  {
! 	FunctionCallInfoData *fcinfo = &peraggstate->transfn_fcinfo;
  	MemoryContext oldContext;
  	Datum		newVal;
  
  	if (peraggstate->transfn.fn_strict)
  	{
*************** advance_transition_function(AggState *ag
*** 421,426 ****
--- 427,435 ----
  		 * For a strict transfn, nothing happens when there's a NULL input; we
  		 * just keep the prior transValue.
  		 */
+ 		int			numTransInputs = peraggstate->numTransInputs;
+ 		int			i;
+ 
  		for (i = 1; i <= numTransInputs; i++)
  		{
  			if (fcinfo->argnull[i])
*************** advance_transition_function(AggState *ag
*** 467,478 ****
  	/*
  	 * OK to call the transition function
  	 */
- 	InitFunctionCallInfoData(*fcinfo, &(peraggstate->transfn),
- 							 numTransInputs + 1,
- 							 peraggstate->aggCollation,
- 							 (void *) aggstate, NULL);
  	fcinfo->arg[0] = pergroupstate->transValue;
  	fcinfo->argnull[0] = pergroupstate->transValueIsNull;
  
  	newVal = FunctionCallInvoke(fcinfo);
--- 476,484 ----
  	/*
  	 * OK to call the transition function
  	 */
  	fcinfo->arg[0] = pergroupstate->transValue;
  	fcinfo->argnull[0] = pergroupstate->transValueIsNull;
+ 	fcinfo->isnull = false; /* just in case transfn doesn't set it */
  
  	newVal = FunctionCallInvoke(fcinfo);
*************** advance_aggregates(AggState *aggstate, A
*** 574,592 ****
  		else
  		{
  			/* We can apply the transition function immediately */
! 			FunctionCallInfoData fcinfo;
  
  			/* Load values into fcinfo */
  			/* Start from 1, since the 0th arg will be the transition value */
  			Assert(slot->tts_nvalid >= numTransInputs);
  			for (i = 0; i < numTransInputs; i++)
  			{
! 				fcinfo.arg[i + 1] = slot->tts_values[i];
! 				fcinfo.argnull[i + 1] = slot->tts_isnull[i];
  			}
  
! 			advance_transition_function(aggstate, peraggstate, pergroupstate,
! 										&fcinfo);
  		}
  	}
  }
--- 580,597 ----
  		else
  		{
  			/* We can apply the transition function immediately */
! 			FunctionCallInfoData *fcinfo = &peraggstate->transfn_fcinfo;
  
  			/* Load values into fcinfo */
  			/* Start from 1, since the 0th arg will be the transition value */
  			Assert(slot->tts_nvalid >= numTransInputs);
  			for (i = 0; i < numTransInputs; i++)
  			{
! 				fcinfo->arg[i + 1] = slot->tts_values[i];
! 				fcinfo->argnull[i + 1] = slot->tts_isnull[i];
  			}
  
! 			advance_transition_function(aggstate, peraggstate, pergroupstate);
  		}
  	}
  }
*************** process_ordered_aggregate_single(AggStat
*** 622,638 ****
  	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
  	MemoryContext oldContext;
  	bool		isDistinct = (peraggstate->numDistinctCols > 0);
  	Datum	   *newVal;
  	bool	   *isNull;
- 	FunctionCallInfoData fcinfo;
  
  	Assert(peraggstate->numDistinctCols < 2);
  
  	tuplesort_performsort(peraggstate->sortstate);
  
  	/* Load the column into argument 1 (arg 0 will be transition value) */
! 	newVal = fcinfo.arg + 1;
! 	isNull = fcinfo.argnull + 1;
  
  	/*
  	 * Note: if input type is pass-by-ref, the datums returned by the sort are
--- 627,643 ----
  	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
  	MemoryContext oldContext;
  	bool		isDistinct = (peraggstate->numDistinctCols > 0);
+ 	FunctionCallInfoData *fcinfo = &peraggstate->transfn_fcinfo;
  	Datum	   *newVal;
  	bool	   *isNull;
  
  	Assert(peraggstate->numDistinctCols < 2);
  
  	tuplesort_performsort(peraggstate->sortstate);
  
  	/* Load the column into argument 1 (arg 0 will be transition value) */
! 	newVal = fcinfo->arg + 1;
! 	isNull = fcinfo->argnull + 1;
  
  	/*
  	 * Note: if input type is pass-by-ref, the datums returned by the sort are
*************** process_ordered_aggregate_single(AggStat
*** 668,675 ****
  		}
  		else
  		{
! 			advance_transition_function(aggstate, peraggstate, pergroupstate,
! 										&fcinfo);
  			/* forget the old value, if any */
  			if (!oldIsNull && !peraggstate->inputtypeByVal)
  				pfree(DatumGetPointer(oldVal));
--- 673,679 ----
  		}
  		else
  		{
! 			advance_transition_function(aggstate, peraggstate, pergroupstate);
  			/* forget the old value, if any */
  			if (!oldIsNull && !peraggstate->inputtypeByVal)
  				pfree(DatumGetPointer(oldVal));
*************** process_ordered_aggregate_multi(AggState
*** 704,710 ****
  	AggStatePerGroup pergroupstate)
  {
  	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
! 	FunctionCallInfoData fcinfo;
  	TupleTableSlot *slot1 = peraggstate->evalslot;
  	TupleTableSlot *slot2 = peraggstate->uniqslot;
  	int			numTransInputs = peraggstate->numTransInputs;
--- 708,714 ----
  	AggStatePerGroup pergroupstate)
  {
  	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
! 	FunctionCallInfoData *fcinfo = &peraggstate->transfn_fcinfo;
  	TupleTableSlot *slot1 = peraggstate->evalslot;
  	TupleTableSlot *slot2 = peraggstate->uniqslot;
  	int			numTransInputs = peraggstate->numTransInputs;
*************** process_ordered_aggregate_multi(AggState
*** 739,750 ****
  			/* Start from 1, since the 0th arg will be the transition value */
  			for (i = 0; i < numTransInputs; i++)
  			{
! 				fcinfo.arg[i + 1] = slot1->tts_values[i];
! 				fcinfo.argnull[i + 1] = slot1->tts_isnull[i];
  			}
  
! 			advance_transition_function(aggstate, peraggstate, pergroupstate,
! 										&fcinfo);
  
  			if (numDistinctCols > 0)
  			{
--- 743,753 ----
  			/* Start from 1, since the 0th arg will be the transition value */
  			for (i = 0; i < numTransInputs; i++)
  			{
! 				fcinfo->arg[i + 1] = slot1->tts_values[i];
! 				fcinfo->argnull[i + 1] = slot1->tts_isnull[i];
  			}
  
! 			advance_transition_function(aggstate, peraggstate, pergroupstate);
  
  			if (numDistinctCols > 0)
  			{
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1799,1804 ****
--- 1802,1808 ----
  								&transfnexpr,
  								&finalfnexpr);
  
+ 		/* set up infrastructure for calling the transfn and finalfn */
  		fmgr_info(transfn_oid, &peraggstate->transfn);
  		fmgr_info_set_expr((Node *) transfnexpr, &peraggstate->transfn);
  
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1810,1815 ****
--- 1814,1826 ----
  
  		peraggstate->aggCollation = aggref->inputcollid;
  
+ 		InitFunctionCallInfoData(peraggstate->transfn_fcinfo,
+ 								 &peraggstate->transfn,
+ 								 peraggstate->numTransInputs + 1,
+ 								 peraggstate->aggCollation,
+ 								 (void *) aggstate, NULL);
+ 
+ 		/* get info about relevant datatypes */
  		get_typlenbyval(aggref->aggtype,
  						&peraggstate->resulttypeLen,
  						&peraggstate->resulttypeByVal);