Обсуждение: Ordering of data on calls to user defined aggregate.
Short version ( for the busy folk). Is there any guarantee of the ordering of data on calls to a user defined aggregate. (postgresql 7.2.1) Longer version ( for those that are curious or need more info ). I was goofing around tonight and decided to create a concat_with_and aggregate for varchar. My desired effect: if I have a table like this: create table foo( id bigint, fk bigint, name varchar); with data id fk name 1 1 A 2 1 B 3 2 E 4 3 D 5 3 C and a query like this: select fk, concat_with_and(name) names from foo group by fk; then the results would look like: fk names 1 A and B 2 E 3 D and C I had no trouble creating the aggregate function or the supporting (state transition) function: create function concat_with_and(varchar, varchar) returns varchar as 'select case when $1 is null then $2 else $1 || \' and \' || $2 end;' language sql; I did not use a final function. Everything worked like a charm. Then I decided to take it one step further. What if I wanted all the names sorted in alphabetical order ( not across records, but within the record)? fk names 1 A and B 2 E 3 C and D So I tried a query like this: select fk, concat_with_and(name) from ( select fk, name from foo order by fk, name) sub_select group by fk; From just eyeballing the first 10 to 12 pages of the results, all but 2 records had the names in alphabetical order. So I removed the subselect and ran the query again - this time paying attention to the ordering within names. Very few entries in the 'names' column were in alphabetical order at all. I tried to develop an example dataset to include here that would illustrate cases where it doesn't order them alphabetically. I was unable to do so. I'm experimenting with a dataset of about 15000 records for my queries. Where I'm coming from: I know there are several other ways to achieve the results I was looking for above. I have no real need for this. I was just looking for an excuse to write my own aggregate function. What I found intriguing were the possibilities - and the opportunity to increase my understanding of what goes on under the hood. questions: From a veteran's point of view, is the approach above elegant or a gross hack? Somewhere in between? Am I correct in assuming that the 'group by' clause in my second query affects the ordering of the subquery? Any idea why it only affected a small percentage of them? Could I introduce a final function to solve my problem? The state transition function simply collects all the entries. The final function would order them and delimit them with 'and'. In practice, if this were required I'd hang it up & write a pl/pgsql function and skip the aggregation all together. The extra work begins to clutter the idea. I'd prefer not to force every 'concat_with_and' aggregation to sort the data as well. Any other comments?
Tim Hart <timjhart@shaw.ca> said: > So I tried a query like this: > > select fk, concat_with_and(name) from ( select fk, name from foo order > by fk, name) sub_select group by fk; > > From just eyeballing the first 10 to 12 pages of the results, all but 2 > records had the names in alphabetical order. So I removed the subselect > and ran the query again - this time paying attention to the ordering > within names. Very few entries in the 'names' column were in > alphabetical order at all. Hmmm... in my (small) test case, they were all alphabetized. I didn't think that subquery sort orders were guaranteed, though, so perhaps it's okay that yours weren't. Can you try with GROUP BY fk, name in the subquery? That works, too, on my small test case, and that should be guaranteedbehavior in a subquery. Let's see how that works with your data set. - J.
On Saturday, May 18, 2002, at 03:13 PM, Joel Burton wrote: > Tim Hart <timjhart@shaw.ca> said: > >> So I tried a query like this: >> >> select fk, concat_with_and(name) from ( select fk, name from foo order >> by fk, name) sub_select group by fk; >> >> From just eyeballing the first 10 to 12 pages of the results, all >> but 2 >> records had the names in alphabetical order. So I removed the subselect >> and ran the query again - this time paying attention to the ordering >> within names. Very few entries in the 'names' column were in >> alphabetical order at all. > > Hmmm... in my (small) test case, they were all alphabetized. > > I didn't think that subquery sort orders were guaranteed, though, so > perhaps it's okay that yours weren't. > > Can you try with GROUP BY fk, name in the subquery? That works, too, on > my small test case, and that should be guaranteed behavior in a > subquery. Let's see how that works with your data set. > > - J. > > I'm not quite sure how you're expecting me to modify the subquery. Probably my tired brain cells... ;) But yes - I've been unable to reproduce the issue with a smaller dataset. I didn't want to spend all that extra time creating a reproduce-able case if someone could definitively say - as you did above - that subquery sort orders were not guaranteed. Given that the outer 'group by' has no way of knowing that the result set it's being handed is effectively grouped, I'm guessing that the group-by algorithm in this case is probably the source of my ills. No biggie.
Tim Hart <timjhart@shaw.ca> writes:
> Short version ( for the busy folk).
> Is there any guarantee of the ordering of data on calls to a user
> defined aggregate. (postgresql 7.2.1)
None whatever.
> So I tried a query like this:
> select fk, concat_with_and(name) from ( select fk, name from foo order
> by fk, name) sub_select group by fk;
The reason this isn't very reliable can be seen by looking at what it
does.  In existing releases you have to grovel through EXPLAIN VERBOSE
output to see what the sort keys are, but CVS tip is friendlier:
srf=# explain
srf-# select fk, my_agg(name) from ( select fk, name from foo order
srf(# by fk, name) sub_select group by fk;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=122.16..129.66 rows=100 width=40)
   ->  Group  (cost=122.16..127.16 rows=1000 width=40)
         ->  Sort  (cost=122.16..124.66 rows=1000 width=40)
               Sort Key: fk
               ->  Subquery Scan sub_select  (cost=69.83..72.33 rows=1000 width=40)
                     ->  Sort  (cost=69.83..72.33 rows=1000 width=40)
                           Sort Key: fk, name
                           ->  Seq Scan on foo  (cost=0.00..20.00 rows=1000 width=40)
(8 rows)
The outer query has no idea that the inner query's output is already
sorted, so it re-sorts ... using only the specified GROUP BY key.
Unless your system's qsort is stable (which most are not), this will
not preserve the by-name ordering within groups of the same fk value.
If you weren't grouping at the outer level then the inner query's sort
order would get the job done for you.  In the presence of outer
grouping I'm not sure there's a clean way to do it.  I can think of
a hack for the case where fk/name pairs are unique:
select fk, my_agg(DISTINCT name) from foo group by fk;
This relies on the assumption that the aggregate code will implement
DISTINCT by means of sort/unique processing, which seems unlikely to
break anytime soon.  But it won't help if you want the aggregate to see
multiple values of the same name for the same fk.
There is some talk of reimplementing grouped aggregation using hash
tables, which'd eliminate the upper SORT step and thereby give the
behavior you want.  I dunno how soon anyone will get around to it
though.
            regards, tom lane
			
		On Saturday, May 18, 2002, at 03:56 PM, Tom Lane wrote: > Tim Hart <timjhart@shaw.ca> writes: >> Short version ( for the busy folk). >> Is there any guarantee of the ordering of data on calls to a user >> defined aggregate. (postgresql 7.2.1) > > None whatever. Reasonable. > The outer query has no idea that the inner query's output is already > sorted, so it re-sorts ... using only the specified GROUP BY key. Kind of what I figured. I've been in a discussion with Joel Burton where I stated the same assumption. Makes sense to me. <snippage> > grouping I'm not sure there's a clean way to do it. I can think of > a hack for the case where fk/name pairs are unique: > > select fk, my_agg(DISTINCT name) from foo group by fk; > > This relies on the assumption that the aggregate code will implement > DISTINCT by means of sort/unique processing, which seems unlikely to > break anytime soon. But it won't help if you want the aggregate to see > multiple values of the same name for the same fk. That actually works for me. But, as you say, it makes some assumptions. It's also not very clear from the the SQL what the intent is. If I really had a need to do this, I'd probably implement this as a function: select fk, get_sorted_delimited_list(fk, ' and ' ) as names from foo order by names; > > There is some talk of reimplementing grouped aggregation using hash > tables, which'd eliminate the upper SORT step and thereby give the > behavior you want. I dunno how soon anyone will get around to it > though. No worries. I can't really say I *want* that behavior. Can't say I *don't* want it either. I was just after a deeper understanding of the current behavior. You've provided it, and am grateful. :)