Обсуждение: Planner features, discussion
Hello. I propose 2 features for planner: 1. Planner will estimate 2 x statistics: time of query with cache empty and with cache filled. 2. Two levels of plannig: standard and long. Long planning may be used when standard optimization generate slow plan, and may use advanced algebraic transformations: F( F(a,1), 2) = F(a, F(1,2) ) - for algebraic groups for example etc. ------------ pasman
On 13/07/10 19:49, pasman pasmański wrote: > Hello. > > I propose 2 features for planner: > > 1. Planner will estimate 2 x statistics: time of query with cache empty > and with cache filled. How would it know what is in cache and how long it'd take to fetch things into cache that aren't already there? Which cache(s)? shared_buffers? The OS cache of file system data? RAID controller read caches? Disk drive read caches? Caches of a big SAN iSCSI/FC target? PostgreSQL relies on the OS and hardware to take care of all that, it just asks for the data. It has very little idea how much of that data is how "close" in caching terms, and how long it'll take to retrieve. The very blunt instrument "effective_cache_size" and the random/sequential IO cost knobs provide only the vaguest and most unrealistic guidance. This might sound weird, but you need to realize that the planner needs to keep things somewhat simple to be fast, and it *can't* try to keep track of and simulate all levels of the system's caching behaviours. Especially since some levels of caching may be completely invisible not only to PostgreSQL, but even to the OS. Consider a PostgreSQL instance running in a virtual machine, where the storage is provided by a fibre channel SAN. The guest OS doesn't even know it's virtualized - and that the host OS is caching its disk in RAM. The host OS doesn't know that the SAN that's backing the disk has oodles of RAM in which it's doing its own caching. How is PostgreSQL supposed to figure out what's in cache? > 2. Two levels of plannig: standard and long. > Long planning may be used when standard optimization > generate slow plan, and may use advanced algebraic transformations: I do like the idea of being able to tell the planner "put more time into trying to transform this query in ways that might make it faster". However, that does mean two modes to be debugged, and a less-used "long" mode of greater complexity to be maintained. Who's going to do the coding to build and maintain this? What specific optimizations would you propose for the "long" mode? I'm not sure I understand the one you did propose as it applies to SQL ... wouldn't it involve a lot more knowledge of what the two functions in question do than the planner ever has? -- Craig Ringer
pasman pasman'ski wrote: > 1. Planner will estimate 2 x statistics: time of query with cache empty > and with cache filled. > Requires planner to know something about the state of the cache; it doesn't yet. Counting myself there's four people I know who have been tinkering with some aspect of that enough to have written some code related to it. For now the best you can do is lower random_page_cost significantly when you suspect the content you're querying against is cached. > 2. Two levels of plannig: standard and long. > Long planning may be used when standard optimization > generate slow plan, and may use advanced algebraic transformations: > I heard a scholarly treatment of that topic from Jim Nasby recently, where he proposed a boolean GUC to toggle the expanded search behavior to be named plan_the_shit_out_of_it. The surprisingly difficult part of playing in this area is getting enough public sample data and queries running against them to be able to do such an exercise and have some confidence you didn't decrease performance for the average case in the process. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On 13/07/2010 10:52 PM, Greg Smith wrote: > I heard a scholarly treatment of that topic from Jim Nasby recently, > where he proposed a boolean GUC to toggle the expanded search behavior > to be named plan_the_shit_out_of_it. I was thinking that something like "duplicate subquery/function elimitation" might be handy, though an extension to WITH would eliminate the need for it (see below). Consider code like this: SELECT (SELECT somequery) FROM ... WHERE (SELECT SOMEQUERY) > somevalue ORDER BY (SELECT somequery) that invokes some non-trivial "somequery" several times. I often wanted to simplify it, and it wasn't always practical to convert it to add (SELECT somequery) to the join list. I expected that with 8.4 I'd be able to write something more along the lines of: WITH result = (SELECT somequery) SELECT result FROM ... WHERE result > somevalue ORDER BY result; which makes such an optimization less than necessary. Why complicate the planner when you can fix your SQL? However, in the case above the subquery needs to be referenced from a scalar context not as a join, and WITH expressions don't seem to be useful for scalar results. The names defined by WITH are only visible as FROM targets. So this doesn't work: => WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM generate_series(1,10) AS x; ERROR: column "constval" does not exist LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F... nor: => WITH aconstant AS (VALUES(1)) SELECT x.*, aconstant FROM generate_series(1,10) AS x; ERROR: column "aconstant" does not exist LINE 1: WITH aconstant AS (VALUES(1)) SELECT x.*, aconstant FROM gen... ... so you're forced to fall back on adding it as an additional join expression - which isn't always reasonable or possible. Extending WITH to be useful for defining constants and single-evaluation variables like the above would be really, really nice, and would avoid some ugly SQL mangling and any need for compliated planner features that try to match up and combine subquery trees. Such a WITH extension wouldn't be of any help where the subqueries referenced from-list columns, though, so perhaps intelligent combination of duplicate subqueries would be handy anyway. I certainly think that combining multiple identical invocations of stable and immutable functions within a query - ie pre-evaluating the call and substitutiong the results - would be a desirable feature. It'd potentially something that could be part of a more aggressively optimizing planner. It'd be a *LOT* simpler than trying to do the same thing with subqueries (though deciding when the function arguments are "the same" might not always be simple) and would be really handy. Some uses could be solved by extending WITH as above, but not where one or more function parameters depends on expressions involving fields in the from-list. Consider: SELECT expensive_function(sometable.x) FROM sometable WHERE expensive_function(sometable.x) > 4; which can't even be written nicely in a form that only evaluates expensive_function once. In this case it can be mangled into: SELECT exf FROM ( SELECT sometable.*, expensive_function(sometable.x) AS exf FROM sometable ) WHERE exf > 4; ... but in more complicated cases you can't always do that without landing up evaluating many, many more invocations of "expensive_function" than you wanted to in the subquery due to limitations that structure imposes on your filtering. Am I just missing something obvious? Or might a way to combine multiple invocations of STABLE / IMMUTABLE functions be a useful thing for an aggressively optimizing planner to do? -- Craig Ringer
On Wed, Jul 14, 2010 at 08:47:35AM +0800, Craig Ringer wrote: > On 13/07/2010 10:52 PM, Greg Smith wrote: > > >I heard a scholarly treatment of that topic from Jim Nasby recently, > >where he proposed a boolean GUC to toggle the expanded search behavior > >to be named plan_the_shit_out_of_it. > > I was thinking that something like "duplicate subquery/function > elimitation" might be handy, though an extension to WITH would > eliminate the need for it (see below). Consider code like this: > > SELECT (SELECT somequery) FROM ... > WHERE (SELECT SOMEQUERY) > somevalue > ORDER BY (SELECT somequery) > > that invokes some non-trivial "somequery" several times. I often > wanted to simplify it, and it wasn't always practical to convert it > to add (SELECT somequery) to the join list. > > I expected that with 8.4 I'd be able to write something more along > the lines of: > > WITH result = (SELECT somequery) > SELECT result FROM ... > WHERE result > somevalue > ORDER BY result; > > which makes such an optimization less than necessary. Why complicate > the planner when you can fix your SQL? > > However, in the case above the subquery needs to be referenced from > a scalar context not as a join, and WITH expressions don't seem to > be useful for scalar results. The names defined by WITH are only > visible as FROM targets. So this doesn't work: > > => WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM > generate_series(1,10) AS x; > ERROR: column "constval" does not exist > LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F... You missed the CROSS JOIN, which you could make implicit, even though implicit CROSS JOINs are bad coding style: WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM generate_series(1,10) AS x CROSS JOIN aconstant; x | constval ----+---------- 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 10 | 1 (10 rows) > ... so you're forced to fall back on adding it as an additional join > expression - which isn't always reasonable or possible. Why not? > Extending WITH to be useful for defining constants and > single-evaluation variables like the above would be really, really > nice, and would avoid some ugly SQL mangling and any need for > compliated planner features that try to match up and combine > subquery trees. I'm all for extending WITH, as are some others. See this thread for the latest: <http://archives.postgresql.org/pgsql-hackers/2010-07/msg00463.php> Cheers, David (who's not mentioning extending WITH to include DCL or DDL yet...oops! ;) -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 15/07/10 00:34, David Fetter wrote: >> => WITH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval FROM >> generate_series(1,10) AS x; >> ERROR: column "constval" does not exist >> LINE 1: ...TH aconstant(constval) AS (VALUES(1)) SELECT x.*, constval F... > > You missed the CROSS JOIN, which you could make implicit, even though > implicit CROSS JOINs are bad coding style: It was an example of how it'd be nice to avoid the need for a join when dealing with scalar values. I'd love to be able to write: WITH aconstant AS (1) SELECT x.*, aconstant FROM generate_series(1,10) AS x; ... but can't presently do so because the WITH terms are only visible as potential from-list items. > WITH aconstant(constval) AS (VALUES(1)) > SELECT x.*, constval > FROM > generate_series(1,10) AS x > CROSS JOIN > aconstant; > x | constval > ----+---------- > 1 | 1 > 2 | 1 > 3 | 1 > 4 | 1 > 5 | 1 > 6 | 1 > 7 | 1 > 8 | 1 > 9 | 1 > 10 | 1 > (10 rows) Using a cross join can often result in an undersired and expensive nested loop, (I think) materialize, etc. In this case, the planner is using a nested loop to join `aconstant' with the output of the function scan: > Nested Loop (cost=0.01..22.53 rows=1000 width=8) (actual time=0.049..0.133 rows=10 loops=1) > CTE aconstant > -> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1) > -> CTE Scan on aconstant (cost=0.00..0.02 rows=1 width=4) (actual time=0.015..0.023 rows=1 loops=1) > -> Function Scan on generate_series x (cost=0.00..12.50 rows=1000 width=4) (actual time=0.022..0.045 rows=10 loops=1) > Total runtime: 0.223 ms as compared to what happens when I explicitly insert the constant by hand or wrap the query up in an SQL function that takes the constant as a parameter: > Function Scan on generate_series x (cost=0.00..12.50 rows=1000 width=4) (actual time=0.027..0.054 rows=10 loops=1) > Total runtime: 0.125 ms In this trivial dummy example, it doesn't matter much. But in the kinds of complex queries you often want to use a WITH expression for, it's not appealing. If you're trying to use a WITH expression to avoid multiple evaluation of an expensive function, the gains are often consumed in the join costs. So I land up relying on wrapping things up in SQL functions instead, which is less than thrilling. >> ... so you're forced to fall back on adding it as an additional join >> expression - which isn't always reasonable or possible. > > Why not? As above for one reason. -- Craig Ringer
On 15 Jul 2010, at 3:05, Craig Ringer wrote: > It was an example of how it'd be nice to avoid the need for a join when > dealing with scalar values. I'd love to be able to write: > > WITH aconstant AS (1) > SELECT x.*, aconstant FROM generate_series(1,10) AS x; > > ... but can't presently do so because the WITH terms are only visible as > potential from-list items. But why is that a problem? The below seems to work just fine: dalroi=> WITH constants AS (SELECT 1 AS aconstant) SELECT 'x' AS column, aconstant FROM constants WHERE aconstant = 1; column | aconstant --------+----------- x | 1 (1 row) Sure, you'll have to join with the with-query to get any meaningful results, but if there are any complicated calculationsin that query, won't they be calculated just once - like you intended? If not, I think we're all failing to see the point you're trying to make. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,4c3edb34286212005648618!
On 15/07/10 17:55, Alban Hertroys wrote: > On 15 Jul 2010, at 3:05, Craig Ringer wrote: > >> It was an example of how it'd be nice to avoid the need for a join when >> dealing with scalar values. I'd love to be able to write: >> >> WITH aconstant AS (1) >> SELECT x.*, aconstant FROM generate_series(1,10) AS x; >> >> ... but can't presently do so because the WITH terms are only visible as >> potential from-list items. > > > But why is that a problem? The below seems to work just fine: > > dalroi=> WITH constants AS (SELECT 1 AS aconstant) > SELECT 'x' AS column, aconstant FROM constants WHERE aconstant = 1; > column | aconstant > --------+----------- > x | 1 > (1 row) > > Sure, you'll have to join with the with-query to get any meaningful results, but if there are any complicated calculationsin that query, won't they be calculated just once - like you intended? > > If not, I think we're all failing to see the point you're trying to make. The join its self is expensive. See the plans I posted in the post you're replying to. -- Craig Ringer