Обсуждение: Planner features, discussion

Поиск
Список
Период
Сортировка

Planner features, discussion

От
pasman pasmański
Дата:
Hello.

I propose 2 features for planner:

1. Planner will estimate 2 x statistics: time of query with cache empty
and with cache filled.

2. Two levels of plannig: standard and long.
Long planning may be used when standard optimization
generate slow plan, and may use advanced algebraic transformations:

F( F(a,1), 2) = F(a, F(1,2) ) - for algebraic groups for example etc.


------------
pasman

Re: Planner features, discussion

От
Craig Ringer
Дата:
On 13/07/10 19:49, pasman pasmański wrote:
> Hello.
>
> I propose 2 features for planner:
>
> 1. Planner will estimate 2 x statistics: time of query with cache empty
> and with cache filled.

How would it know what is in cache and how long it'd take to fetch
things into cache that aren't already there?

Which cache(s)? shared_buffers? The OS cache of file system data? RAID
controller read caches? Disk drive read caches? Caches of a big SAN
iSCSI/FC target?

PostgreSQL relies on the OS and hardware to take care of all that, it
just asks for the data. It has very little idea how much of that data is
how "close" in caching terms, and how long it'll take to retrieve. The
very blunt instrument "effective_cache_size" and the random/sequential
IO cost knobs provide only the vaguest and most unrealistic guidance.

This might sound weird, but you need to realize that the planner needs
to keep things somewhat simple to be fast, and it *can't* try to keep
track of and simulate all levels of the system's caching behaviours.
Especially since some levels of caching may be completely invisible not
only to PostgreSQL, but even to the OS.

Consider a PostgreSQL instance running in a virtual machine, where the
storage is provided by a fibre channel SAN. The guest OS doesn't even
know it's virtualized - and that the host OS is caching its disk in RAM.
The host OS doesn't know that the SAN that's backing the disk has oodles
of RAM in which it's doing its own caching.

How is PostgreSQL supposed to figure out what's in cache?

> 2. Two levels of plannig: standard and long.
> Long planning may be used when standard optimization
> generate slow plan, and may use advanced algebraic transformations:

I do like the idea of being able to tell the planner "put more time into
trying to transform this query in ways that might make it faster".

However, that does mean two modes to be debugged, and a less-used "long"
mode of greater complexity to be maintained. Who's going to do the
coding to build and maintain this? What specific optimizations would you
propose for the "long" mode? I'm not sure I understand the one you did
propose as it applies to SQL ... wouldn't it involve a lot more
knowledge of what the two functions in question do than the planner ever
has?

--
Craig Ringer

Re: Planner features, discussion

От
Greg Smith
Дата:
pasman pasman'ski wrote:
> 1. Planner will estimate 2 x statistics: time of query with cache empty
> and with cache filled.
>

Requires planner to know something about the state of the cache; it
doesn't yet. Counting myself there's four people I know who have been
tinkering with some aspect of that enough to have written some code
related to it. For now the best you can do is lower random_page_cost
significantly when you suspect the content you're querying against is
cached.

> 2. Two levels of plannig: standard and long.
> Long planning may be used when standard optimization
> generate slow plan, and may use advanced algebraic transformations:
>

I heard a scholarly treatment of that topic from Jim Nasby recently,
where he proposed a boolean GUC to toggle the expanded search behavior
to be named plan_the_shit_out_of_it.

The surprisingly difficult part of playing in this area is getting
enough public sample data and queries running against them to be able to
do such an exercise and have some confidence you didn't decrease
performance for the average case in the process.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Planner features, discussion

От
Craig Ringer
Дата:
On 13/07/2010 10:52 PM, Greg Smith wrote:

> I heard a scholarly treatment of that topic from Jim Nasby recently,
> where he proposed a boolean GUC to toggle the expanded search behavior
> to be named plan_the_shit_out_of_it.

I was thinking that something like "duplicate subquery/function
elimitation" might be handy, though an extension to WITH would eliminate
the need for it (see below). Consider code like this:

SELECT (SELECT somequery) FROM ...
WHERE (SELECT SOMEQUERY) > somevalue
ORDER BY (SELECT somequery)

that invokes some non-trivial "somequery" several times. I often wanted
to simplify it, and it wasn't always practical to convert it to add
(SELECT somequery) to the join list.

I expected that with 8.4 I'd be able to write something more along the
lines of:

WITH result = (SELECT somequery)
SELECT result FROM ...
WHERE result > somevalue
ORDER BY result;

which makes such an optimization less than necessary. Why complicate the
planner when you can fix your SQL?

However, in the case above the subquery needs to be referenced from a
scalar context not as a join, and WITH expressions don't seem to be
useful for scalar results. The names defined by WITH are only visible as
FROM targets. So this doesn't work:

=> WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM
generate_series(1,10) AS x;
ERROR:  column "constval" does not exist
LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F...

nor:

=> WITH aconstant AS (VALUES(1)) SELECT x.*, aconstant FROM
generate_series(1,10) AS x;
ERROR:  column "aconstant" does not exist
LINE 1: WITH aconstant AS (VALUES(1)) SELECT x.*, aconstant FROM gen...


... so you're forced to fall back on adding it as an additional join
expression - which isn't always reasonable or possible.

Extending WITH to be useful for defining constants and single-evaluation
variables like the above would be really, really nice, and would avoid
some ugly SQL mangling and any need for compliated planner features that
try to match up and combine subquery trees.

Such a WITH extension wouldn't be of any help where the subqueries
referenced from-list columns, though, so perhaps intelligent combination
of duplicate subqueries would be handy anyway.



I certainly think that combining multiple identical invocations of
stable and immutable functions within a query - ie pre-evaluating the
call and substitutiong the results - would be a desirable feature. It'd
potentially something that could be part of a more aggressively
optimizing planner. It'd be a *LOT* simpler than trying to do the same
thing with subqueries (though deciding when the function arguments are
"the same" might not always be simple) and would be really handy.

Some uses could be solved by extending WITH as above, but not where one
or more function parameters depends on expressions involving fields in
the from-list. Consider:

SELECT expensive_function(sometable.x)
FROM sometable
WHERE expensive_function(sometable.x) > 4;

which can't even be written nicely in a form that only evaluates
expensive_function once. In this case it can be mangled into:

SELECT exf FROM (
   SELECT sometable.*, expensive_function(sometable.x) AS exf
   FROM   sometable
) WHERE exf > 4;

... but in more complicated cases you can't always do that without
landing up evaluating many, many more invocations of
"expensive_function" than you wanted to in the subquery due to
limitations that structure imposes on your filtering.

Am I just missing something obvious? Or might a way to combine multiple
invocations of STABLE / IMMUTABLE functions be a useful thing for an
aggressively optimizing planner to do?

--
Craig Ringer

Re: Planner features, discussion

От
David Fetter
Дата:
On Wed, Jul 14, 2010 at 08:47:35AM +0800, Craig Ringer wrote:
> On 13/07/2010 10:52 PM, Greg Smith wrote:
>
> >I heard a scholarly treatment of that topic from Jim Nasby recently,
> >where he proposed a boolean GUC to toggle the expanded search behavior
> >to be named plan_the_shit_out_of_it.
>
> I was thinking that something like "duplicate subquery/function
> elimitation" might be handy, though an extension to WITH would
> eliminate the need for it (see below). Consider code like this:
>
> SELECT (SELECT somequery) FROM ...
> WHERE (SELECT SOMEQUERY) > somevalue
> ORDER BY (SELECT somequery)
>
> that invokes some non-trivial "somequery" several times. I often
> wanted to simplify it, and it wasn't always practical to convert it
> to add (SELECT somequery) to the join list.
>
> I expected that with 8.4 I'd be able to write something more along
> the lines of:
>
> WITH result = (SELECT somequery)
> SELECT result FROM ...
> WHERE result > somevalue
> ORDER BY result;
>
> which makes such an optimization less than necessary. Why complicate
> the planner when you can fix your SQL?
>
> However, in the case above the subquery needs to be referenced from
> a scalar context not as a join, and WITH expressions don't seem to
> be useful for scalar results. The names defined by WITH are only
> visible as FROM targets. So this doesn't work:
>
> => WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM
> generate_series(1,10) AS x;
> ERROR:  column "constval" does not exist
> LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F...

You missed the CROSS JOIN, which you could make implicit, even though
implicit CROSS JOINs are bad coding style:

WITH aconstant(constval) AS (VALUES(1))
SELECT x.*, constval
FROM
    generate_series(1,10) AS x
CROSS JOIN
    aconstant;
 x  | constval
----+----------
  1 |        1
  2 |        1
  3 |        1
  4 |        1
  5 |        1
  6 |        1
  7 |        1
  8 |        1
  9 |        1
 10 |        1
(10 rows)

> ... so you're forced to fall back on adding it as an additional join
> expression - which isn't always reasonable or possible.

Why not?

> Extending WITH to be useful for defining constants and
> single-evaluation variables like the above would be really, really
> nice, and would avoid some ugly SQL mangling and any need for
> compliated planner features that try to match up and combine
> subquery trees.

I'm all for extending WITH, as are some others.  See this thread for
the latest:
<http://archives.postgresql.org/pgsql-hackers/2010-07/msg00463.php>

Cheers,
David (who's not mentioning extending WITH to include DCL or DDL yet...oops! ;)
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Planner features, discussion

От
Craig Ringer
Дата:
On 15/07/10 00:34, David Fetter wrote:

>> => WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM
>> generate_series(1,10) AS x;
>> ERROR:  column "constval" does not exist
>> LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F...
>
> You missed the CROSS JOIN, which you could make implicit, even though
> implicit CROSS JOINs are bad coding style:

It was an example of how it'd be nice to avoid the need for a join when
dealing with scalar values. I'd love to be able to write:

WITH aconstant AS (1)
SELECT x.*, aconstant FROM generate_series(1,10) AS x;

... but can't presently do so because the WITH terms are only visible as
potential from-list items.

> WITH aconstant(constval) AS (VALUES(1))
> SELECT x.*, constval
> FROM
>     generate_series(1,10) AS x
> CROSS JOIN
>     aconstant;
>  x  | constval
> ----+----------
>   1 |        1
>   2 |        1
>   3 |        1
>   4 |        1
>   5 |        1
>   6 |        1
>   7 |        1
>   8 |        1
>   9 |        1
>  10 |        1
> (10 rows)

Using a cross join can often result in an undersired and expensive
nested loop, (I think) materialize, etc. In this case, the planner is
using a nested loop to join `aconstant' with the output of the function
scan:

>  Nested Loop  (cost=0.01..22.53 rows=1000 width=8) (actual time=0.049..0.133 rows=10 loops=1)
>    CTE aconstant
>      ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1)
>    ->  CTE Scan on aconstant  (cost=0.00..0.02 rows=1 width=4) (actual time=0.015..0.023 rows=1 loops=1)
>    ->  Function Scan on generate_series x  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.022..0.045 rows=10
loops=1)
>  Total runtime: 0.223 ms

as compared to what happens when I explicitly insert the constant by
hand or wrap the query up in an SQL function that takes the constant as
a parameter:

>  Function Scan on generate_series x  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.027..0.054 rows=10 loops=1)
>  Total runtime: 0.125 ms

In this trivial dummy example, it doesn't matter much. But in the kinds
of complex queries you often want to use a WITH expression for, it's not
appealing. If you're trying to use a WITH expression to avoid multiple
evaluation of an expensive function, the gains are often consumed in the
join costs.

So I land up relying on wrapping things up in SQL functions instead,
which is less than thrilling.

>> ... so you're forced to fall back on adding it as an additional join
>> expression - which isn't always reasonable or possible.
>
> Why not?

As above for one reason.

--
Craig Ringer

Re: Planner features, discussion

От
Alban Hertroys
Дата:
On 15 Jul 2010, at 3:05, Craig Ringer wrote:

> It was an example of how it'd be nice to avoid the need for a join when
> dealing with scalar values. I'd love to be able to write:
>
> WITH aconstant AS (1)
> SELECT x.*, aconstant FROM generate_series(1,10) AS x;
>
> ... but can't presently do so because the WITH terms are only visible as
> potential from-list items.


But why is that a problem? The below seems to work just fine:

dalroi=> WITH constants AS (SELECT 1 AS aconstant)
SELECT 'x' AS column, aconstant FROM constants WHERE aconstant = 1;
 column | aconstant
--------+-----------
 x      |         1
(1 row)

Sure, you'll have to join with the with-query to get any meaningful results, but if there are any complicated
calculationsin that query, won't they be calculated just once  - like you intended? 

If not, I think we're all failing to see the point you're trying to make.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4c3edb34286212005648618!



Re: Planner features, discussion

От
Craig Ringer
Дата:
On 15/07/10 17:55, Alban Hertroys wrote:
> On 15 Jul 2010, at 3:05, Craig Ringer wrote:
>
>> It was an example of how it'd be nice to avoid the need for a join when
>> dealing with scalar values. I'd love to be able to write:
>>
>> WITH aconstant AS (1)
>> SELECT x.*, aconstant FROM generate_series(1,10) AS x;
>>
>> ... but can't presently do so because the WITH terms are only visible as
>> potential from-list items.
>
>
> But why is that a problem? The below seems to work just fine:
>
> dalroi=> WITH constants AS (SELECT 1 AS aconstant)
> SELECT 'x' AS column, aconstant FROM constants WHERE aconstant = 1;
>  column | aconstant
> --------+-----------
>  x      |         1
> (1 row)
>
> Sure, you'll have to join with the with-query to get any meaningful results, but if there are any complicated
calculationsin that query, won't they be calculated just once  - like you intended? 
>
> If not, I think we're all failing to see the point you're trying to make.

The join its self is expensive. See the plans I posted in the post
you're replying to.

--
Craig Ringer