Обсуждение: Evaluation of secondary sort key.

Поиск
Список
Период
Сортировка

Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
This seems like a place where there is room for improvement.

2011-04-09 15:18:08.016 testdb=# select id from test1 where id < 3 order 
by id; id
----  1  2
(2 rows)

Time: 0.328 ms
2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION testsort(id 
integer) returns integer as $$ BEGIN perform pg_sleep(id); return id; 
END; $$ language plpgsql;
CREATE FUNCTION
Time: 12.349 ms
2011-04-09 15:18:22.138 testdb=# select id from test1 where id < 3 order 
by id,testsort(id); id
----  1  2
(2 rows)

Time: 3001.896 ms

It seems strange that there is a need to evaluate testsort(id) at all in 
this case.

-- 
Jesper


Re: Evaluation of secondary sort key.

От
David Fetter
Дата:
On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:
> This seems like a place where there is room for improvement.
> 
> 2011-04-09 15:18:08.016 testdb=# select id from test1 where id < 3
> order by id;
>  id
> ----
>   1
>   2
> (2 rows)
> 
> Time: 0.328 ms
> 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
> testsort(id integer) returns integer as $$ BEGIN perform
> pg_sleep(id); return id; END; $$ language plpgsql;
> CREATE FUNCTION
> Time: 12.349 ms
> 2011-04-09 15:18:22.138 testdb=# select id from test1 where id < 3
> order by id,testsort(id);
>  id
> ----
>   1
>   2
> (2 rows)
> 
> Time: 3001.896 ms
> 
> It seems strange that there is a need to evaluate testsort(id) at
> all in this case.

How would PostgreSQL know that sorting by id leaves no ambiguity for
the next key to address?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
>
> How would PostgreSQL know that sorting by id leaves no ambiguity for
> the next key to address?

It wouldn't   But it could postpone evaluation until ambiguity was actually met.

Jesper
>


Re: Evaluation of secondary sort key.

От
Martijn van Oosterhout
Дата:
On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote:

> > 2011-04-09 15:18:22.138 testdb=# select id from test1 where id < 3
> > order by id,testsort(id);
> >  id
> > ----
> >   1
> >   2
> > (2 rows)
> >
> > Time: 3001.896 ms
> >
> > It seems strange that there is a need to evaluate testsort(id) at
> > all in this case.
>
> How would PostgreSQL know that sorting by id leaves no ambiguity for
> the next key to address?

Well, it doesn't know that, but I guess the point is it could wait with
evaluating the second key until it needs it.

The reason ot works as it does now is that the ORDER BY fields are
added as hidden fields to the query, like:

select id, /*hidden*/ id, /*hidden*/ testsort(id) from test1 where id < 3 order by 2, 3;

Here it's obvious why the evaluation happens. To avoid this evaluation
would require redoing the way sorts work (I think).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Evaluation of secondary sort key.

От
Heikki Linnakangas
Дата:
On 09.04.2011 19:17, David Fetter wrote:
> On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:
>> This seems like a place where there is room for improvement.
>>
>> 2011-04-09 15:18:08.016 testdb=# select id from test1 where id<  3
>> order by id;
>>   id
>> ----
>>    1
>>    2
>> (2 rows)
>>
>> Time: 0.328 ms
>> 2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
>> testsort(id integer) returns integer as $$ BEGIN perform
>> pg_sleep(id); return id; END; $$ language plpgsql;
>> CREATE FUNCTION
>> Time: 12.349 ms
>> 2011-04-09 15:18:22.138 testdb=# select id from test1 where id<  3
>> order by id,testsort(id);
>>   id
>> ----
>>    1
>>    2
>> (2 rows)
>>
>> Time: 3001.896 ms
>>
>> It seems strange that there is a need to evaluate testsort(id) at
>> all in this case.
>
> How would PostgreSQL know that sorting by id leaves no ambiguity for
> the next key to address?

Presumably there's a primary key constraint on id. This is one of those 
cases where we could optimize, but then again, there's no reason to 
write a query like that in the first place.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Evaluation of secondary sort key.

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Apr 09, 2011 at 09:17:10AM -0700, David Fetter wrote:
>>> It seems strange that there is a need to evaluate testsort(id) at
>>> all in this case.

>> How would PostgreSQL know that sorting by id leaves no ambiguity for
>> the next key to address?

> Well, it doesn't know that, but I guess the point is it could wait with
> evaluating the second key until it needs it.

I think that would be a positive disimprovement.  The current design
guarantees that volatile sort expressions are evaluated exactly once,
in the order the rows are read from the data source.  There would be no
guarantees at all, either as to the number of evaluations or the order
in which they happen, if we tried to do evaluation only during the
actual sort.

Another small problem is that any such thing would require carrying
along some kind of closure (ie, the expression and all its input
values), not just the final sort key value, in tuples being sorted.
The ensuing complexity, speed penalty, and increase in data volume
to be sorted would be paid by everybody, making this probably a net
performance loss when considered across all applications.
        regards, tom lane


Re: Evaluation of secondary sort key.

От
David Fetter
Дата:
On Sat, Apr 09, 2011 at 07:24:15PM +0300, Heikki Linnakangas wrote:
> On 09.04.2011 19:17, David Fetter wrote:
> >On Sat, Apr 09, 2011 at 03:22:14PM +0200, Jesper Krogh wrote:
> >>This seems like a place where there is room for improvement.
> >>
> >>2011-04-09 15:18:08.016 testdb=# select id from test1 where id<  3
> >>order by id;
> >>  id
> >>----
> >>   1
> >>   2
> >>(2 rows)
> >>
> >>Time: 0.328 ms
> >>2011-04-09 15:18:11.936 testdb=# CREATE or Replace FUNCTION
> >>testsort(id integer) returns integer as $$ BEGIN perform
> >>pg_sleep(id); return id; END; $$ language plpgsql;
> >>CREATE FUNCTION
> >>Time: 12.349 ms
> >>2011-04-09 15:18:22.138 testdb=# select id from test1 where id<  3
> >>order by id,testsort(id);
> >>  id
> >>----
> >>   1
> >>   2
> >>(2 rows)
> >>
> >>Time: 3001.896 ms
> >>
> >>It seems strange that there is a need to evaluate testsort(id) at
> >>all in this case.
> >
> >How would PostgreSQL know that sorting by id leaves no ambiguity
> >for the next key to address?
> 
> Presumably there's a primary key constraint on id. This is one of
> those cases where we could optimize, but then again, there's no
> reason to write a query like that in the first place.

Given the horrors query generators perpetrate, it might be worth
dropping provably redundant ORDER BYs on the floor at planning time.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
On 2011-04-09 20:00, David Fetter wrote:
> Given the horrors query generators perpetrate, it might be worth
> dropping provably redundant ORDER BYs on the floor at planning time.
Well, many people often add a secondary sort-key to their SQL
for the only purpose of obtainting a consistent result in the
corner-cases where the first sort key is ambiguios.

If the first sort-key isn't planned to be supported by an index-scan,
then you'll end up calculating the second sortkey for the entire
dataset even if you end up doing a "limit 100" at the end.

You can only deem it redundant if there is a primary key in front.
if you have a primary key in front, where as a fix may be really
good in cases where you have a "n_distinct" at or near -1 in pg_stats
for the column.

Jesper
-- 
Jesper


Re: Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
On 2011-04-09 18:54, Tom Lane wrote:
> I think that would be a positive disimprovement. The current design
> guarantees that volatile sort expressions are evaluated exactly once,
> in the order the rows are read from the data source.  There would be no
> guarantees at all, either as to the number of evaluations or the order
> in which they happen, if we tried to do evaluation only during the
> actual sort.
>
> Another small problem is that any such thing would require carrying
> along some kind of closure (ie, the expression and all its input
> values), not just the final sort key value, in tuples being sorted.
> The ensuing complexity, speed penalty, and increase in data volume
> to be sorted would be paid by everybody, making this probably a net
> performance loss when considered across all applications.
The current approach gives that:

select id from test1 where <some clause that matches say 10% random by 
another index>
order by sortfunc1(id),sortfunc(2) limit 20;

on a table with 100.000 elements will also end up applying
both sortfunc1(id) and sortfunc2(id) to all 10.000 elements
even though sortfunc2(id) might only brings value to a very few amount
of tuples (the ones needed as secondary sortkeys for top20 within
the dataset).

It might be worth noting in the manual, that if at all possible you should
stuff the sortfunc2(id) into the table as a column (perhaps computed by 
a before
trigger), since it might actully be evaluated way more often than
you anticipated.

Thanks a lot for the insight.

Jesper
-- 
Jesper


Re: Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
On 2011-04-09 18:54, Tom Lane wrote:
> I think that would be a positive disimprovement.  The current design
> guarantees that volatile sort expressions are evaluated exactly once,
> in the order the rows are read from the data source.  There would be no
> guarantees at all, either as to the number of evaluations or the order
> in which they happen, if we tried to do evaluation only during the
> actual sort.
If I could "only evaluate" the rows needed, then that would also
be a huge win, below case shifts what "definitely shouldn't be a seqscan"
into one due to a secondary sort key.

> Another small problem is that any such thing would require carrying
> along some kind of closure (ie, the expression and all its input
> values), not just the final sort key value, in tuples being sorted.
> The ensuing complexity, speed penalty, and increase in data volume
> to be sorted would be paid by everybody, making this probably a net
> performance loss when considered across all applications.
Getting the value for the first sortkey and carrying on a closure
for the rest would mostly (very often) be "optimal" ?

It would also enable a select that has to sortkeys to utilize an
index that only contains the primary sortkey, which is a huge
negative effect of what's being done today.

2011-04-18 07:12:43.931 testdb=# explain select id from testdb.testtable 
order by id limit 500;                                             QUERY PLAN
---------------------------------------------------------------------------------------------------- Limit
(cost=0.00..262.75rows=500 width=4)   ->  Index Scan using testtable_pkey on testtable  
 
(cost=0.00..15015567.84 rows=28573400 width=4)
(2 rows)

Time: 1.363 ms
2011-04-18 07:12:52.498 testdb=# explain select id from testdb.testtable 
order by id,name limit 500;                                    QUERY PLAN
----------------------------------------------------------------------------------- Limit  (cost=5165456.70..5165457.95
rows=500width=35)   ->  Sort  (cost=5165456.70..5236890.20 rows=28573400 width=35)         Sort Key: id, name
-> Seq Scan on testtable  (cost=0.00..3741675.00 
 
rows=28573400 width=35)
(4 rows)

Time: 1.420 ms

Enabling any users to sort using multiple keys, without ending in Seq 
Scans somewhere down
the line seems to require multi dimension indexes on all combinations of 
colums

-- 
Jesper


Re: Evaluation of secondary sort key.

От
Greg Stark
Дата:
On Mon, Apr 18, 2011 at 6:25 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> Getting the value for the first sortkey and carrying on a closure
> for the rest would mostly (very often) be "optimal" ?

Well that might depend. The input data to the function might be much
larger than the output. Consider the, quite common, idiom of:

order by case when (complex expresssion) 1 when (complex expression) 2 else 3

> It would also enable a select that has to sortkeys to utilize an
> index that only contains the primary sortkey, which is a huge
> negative effect of what's being done today.

This is a separate problem entirely. It would be nice to have a
strategy for ordering that can take advantage of partially ordered
results. It's not hard to see how to do the executor side -- it could
keep a tuplesort for each group and truncate it when the group
changes. As usual the hard part is having the planner figure out
*when* to use it. We have a hard enough time calculating ndistinct for
individual columns -- this would require having an idea of how many
values are present for each major key column.




-- 
greg


Re: Evaluation of secondary sort key.

От
Jesper Krogh
Дата:
On 2011-04-18 11:00, Greg Stark wrote:
> On Mon, Apr 18, 2011 at 6:25 AM, Jesper Krogh<jesper@krogh.cc>  wrote:
>> Getting the value for the first sortkey and carrying on a closure
>> for the rest would mostly (very often) be "optimal" ?
> Well that might depend. The input data to the function might be much
> larger than the output. Consider the, quite common, idiom of:
>
> order by case when (complex expresssion) 1 when (complex expression) 2 else 3

How come that expression be relevant? There is only one sortkey and no
limit, so no matter what it should clearly get the full resultset in all 
cases.

>> It would also enable a select that has to sortkeys to utilize an
>> index that only contains the primary sortkey, which is a huge
>> negative effect of what's being done today.
> This is a separate problem entirely. It would be nice to have a
> strategy for ordering that can take advantage of partially ordered
> results. It's not hard to see how to do the executor side -- it could
> keep a tuplesort for each group and truncate it when the group
> changes. As usual the hard part is having the planner figure out
> *when* to use it. We have a hard enough time calculating ndistinct for
> individual columns -- this would require having an idea of how many
> values are present for each major key column.

Yes, as with all other cases it would be hard to get the optimum, but
there is also cases where it is straightforward, say when the secondary
sort column has an ndistinct of -1 (or similar close to). The current 
standard
assumption is that 2 columns are unrelated, that would also work here. 
(As good as is
does similar places in PG).

Jesper

-- 
Jesper


Re: Evaluation of secondary sort key.

От
Greg Stark
Дата:
On Mon, Apr 18, 2011 at 5:38 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>> order by case when (complex expresssion) 1 when (complex expression) 2
>> else 3
>
> How come that expression be relevant? There is only one sortkey and no
> limit, so no matter what it should clearly get the full resultset in all
> cases.

Sure, imagine there are more order by clauses with this one as the last one.


> Yes, as with all other cases it would be hard to get the optimum, but
> there is also cases where it is straightforward, say when the secondary
> sort column has an ndistinct of -1 (or similar close to). The current
> standard
> assumption is that 2 columns are unrelated, that would also work here. (As
> good as is
> does similar places in PG).

I'm not following what you mean with the secondary column having
ndistinct of -1. Actually it seems to me a reasonable low-hanging
fruit to reach for would be when the initial column has an ndistinct
of -1 which is actually kind of common.

A lot of SQL queries end up being written with GROUP BY primary_key,
other_column, other_column, other_column just to get those other
columns to be queryable. If we implemented the SQL standard
"dependent" columns feature this would be unnecessary but we don't and
even if we did people would still build schemas and queries that
defeat the optimization.

In these cases we probably do have ndistinct -1 for one of the columns
and therefore that an index on that column alone would give us almost
the right ordering and quite likely exactly the right ordering. If we
buffered the outputs for any distinct value and output sorted them if
there were multiple rows. It would probably somewhat worse if we guess
wrong though.

-- 
greg


Re: Evaluation of secondary sort key.

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> A lot of SQL queries end up being written with GROUP BY primary_key,
> other_column, other_column, other_column just to get those other
> columns to be queryable. If we implemented the SQL standard
> "dependent" columns feature this would be unnecessary but we don't

We do as of 9.1 ...
        regards, tom lane


Re: Evaluation of secondary sort key.

От
Alvaro Herrera
Дата:
Excerpts from Greg Stark's message of lun abr 18 15:47:03 -0300 2011:

> A lot of SQL queries end up being written with GROUP BY primary_key,
> other_column, other_column, other_column just to get those other
> columns to be queryable. If we implemented the SQL standard
> "dependent" columns feature this would be unnecessary but we don't and
> even if we did people would still build schemas and queries that
> defeat the optimization.

Actually we do have that in 9.1.  It's a bit more restrictive than
really required (there are some more cases we could handle), but AFAIR
at least the primary key is handled now.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support