Discussion: Planner hints in Postgresql


Planner hints in Postgresql

From
Rajmohan C
Date:
I am implementing planner hints in PostgreSQL to force the optimizer to select a particular plan for a
query on request from SQL input. I am having trouble modifying the planner code. I want to create a path node for the
hinted plan and make it the plan used by the executor. How do I enforce this? Should I create a new Plan for this? How
do I create a plan node that can then be given directly to the executor for a particular query?

Re: Planner hints in Postgresql

From
Atri Sharma
Date:
On Mon, Mar 17, 2014 at 9:22 PM, Rajmohan C <csrajmohan@gmail.com> wrote:
> I am implementing planner hints in PostgreSQL to force the optimizer to
> select a particular plan for a query on request from SQL input. I am having
> trouble modifying the planner code. I want to create a path node for the
> hinted plan and make it the plan used by the executor. How do I enforce
> this? Should I create a new Plan for this? How do I create a plan node that
> can then be given directly to the executor for a particular query?

Planner hints have been discussed a lot before as well, and AFAIK there is
a wiki page that says why we shouldn't implement them. Have you referred
to them?

Please share if you have any new points on the same.

Regards,

Atri

Re: Planner hints in Postgresql

From
David Johnston
Date:
Atri Sharma wrote:
> On Mon, Mar 17, 2014 at 9:22 PM, Rajmohan C <csrajmohan@> wrote:
> 
>> I am implementing Planner hints in Postgresql to force the optimizer to
>> select a particular plan for a query on request from sql input. I am
>> having
>> trouble in modifying the planner code. I want to create a path node of
>> hint
>> plan and make it the plan to be used by executor. How do I enforce this ?
>> Should I create a new Plan for this ..how to create a plan node which can
>> be then given directly to executor for a particular query?
>>
> 
> Planner hints have been discussed a lot before as well and AFAIK there is
> a wiki page that says why we shouldn't implement them. Have you referred
> to them?
> 
> Please share if you have any new points on the same.
> 
> Regards,
> 
> Atri

http://wiki.postgresql.org/wiki/Todo

(I got to it via the "FAQ" link on the homepage and the "Developer FAQ"
section therein.  You should make sure you've scanned that as well.)

Note the final section titled: "Features We Do Not Want"

Also, you need to consider what you are doing when you cross-post (a bad
thing generally) to "-hackers" and "-novice".  As there is, rightly IMO, no
"-novice-hackers" list, you probably should have just hit up "-general".

Need to discuss the general "why" before any meaningful help on the "how" is
going to be considered by hackers.

David J.

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Planner-hints-in-Postgresql-tp5796347p5796353.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: Planner hints in Postgresql

From
Tom Lane
Date:
David Johnston <polobo@yahoo.com> writes:
> Need to discuss the general "why" before any meaningful help on the "how" is
> going to be considered by hackers.

Possibly worth noting is that in past discussions, we've concluded that
the most sensible type of hint would not be "use this plan" at all, but
"here's what to assume about the selectivity of this WHERE clause".
That seems considerably less likely to break than any attempt to directly
specify plan details.
        regards, tom lane



Re: Planner hints in Postgresql

From
Atri Sharma
Date:



On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Johnston <polobo@yahoo.com> writes:
>> Need to discuss the general "why" before any meaningful help on the "how"
>> is going to be considered by hackers.
>
> Possibly worth noting is that in past discussions, we've concluded that
> the most sensible type of hint would not be "use this plan" at all, but
> "here's what to assume about the selectivity of this WHERE clause".
> That seems considerably less likely to break than any attempt to directly
> specify plan details.


Isn't using a user-given value for selectivity a pretty risky situation, as it can horribly screw up the plan selection?

Why not allow the user to specify an alternate plan and have the planner assign a higher preference to it during plan evaluation? This would still allow a fair evaluation of all possible plans, as we do right now, while giving the user-given plan a higher preference during evaluation.

Regards,

Atri

--
Regards,
 
Atri
l'apprenant

Re: Planner hints in Postgresql

From
Tom Lane
Date:
Atri Sharma <atri.jiit@gmail.com> writes:
> On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Possibly worth noting is that in past discussions, we've concluded that
>> the most sensible type of hint would not be "use this plan" at all, but
>> "here's what to assume about the selectivity of this WHERE clause".
>> That seems considerably less likely to break than any attempt to directly
>> specify plan details.

> Isn't using a user-given value for selectivity a pretty risky situation,
> as it can horribly screw up the plan selection?

And forcing a plan to be used *isn't* that?  Please re-read the older
threads, since you evidently have not.
        regards, tom lane



Re: Planner hints in Postgresql

From
Stephen Frost
Date:
* Atri Sharma (atri.jiit@gmail.com) wrote:
> Isn't using a user-given value for selectivity a pretty risky situation,
> as it can horribly screw up the plan selection?
>
> Why not allow the user to specify an alternate plan and have the planner

Uh, you're worried about the user giving us a garbage selectivity, but
they're going to get a full-blown plan perfect?

> assign a higher preference to it during plan evaluation? This would still
> allow a fair evaluation of all possible plans, as we do right now, while
> giving the user-given plan a higher preference during evaluation.

What exactly would such a "preference" look like?  A cost modifier?
We'd almost certainly have to make that into a GUC or a value passed in
as part of the query, with a high likelihood of users figuring out how
to use it to say "use my plan forever and always".
Thanks,
    Stephen
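Stephen's "cost modifier" question can be made concrete with a small sketch. The following is an illustrative Python toy, not planner code; `Plan`, `hint_cost_factor`, and all of the numbers are hypothetical:

```python
# Toy model of the "cost modifier" question: scale down the cost of a
# user-hinted plan before comparing candidates. Plan, hint_cost_factor,
# and the costs are hypothetical -- nothing here is PostgreSQL code.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    cost: float
    user_hinted: bool = False

hint_cost_factor = 0.5  # imagined GUC; as it approaches 0, hints always win

def effective_cost(plan: Plan) -> float:
    """Apply the preference factor only to the user-suggested plan."""
    return plan.cost * (hint_cost_factor if plan.user_hinted else 1.0)

plans = [
    Plan("seq scan", 1000.0),
    Plan("index scan", 1800.0, user_hinted=True),
]
best = min(plans, key=effective_cost)
print(best.name)
```

With the factor at 0.5, the hinted plan wins here despite its higher raw cost; with the factor near zero it would win regardless of cost, which is exactly the "use my plan forever and always" failure mode.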

Re: Planner hints in Postgresql

From
David Johnston
Date:
Atri Sharma wrote:
> On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane <tgl@.pa> wrote:
>> David Johnston <polobo@> writes:
>> > Need to discuss the general "why" before any meaningful help on the
>> "how" is
>> > going to be considered by hackers.
>>
>> Possibly worth noting is that in past discussions, we've concluded that
>> the most sensible type of hint would not be "use this plan" at all, but
>> "here's what to assume about the selectivity of this WHERE clause".
>> That seems considerably less likely to break than any attempt to directly
>> specify plan details.
>>
>>
> Isn't using a user-given value for selectivity a pretty risky situation,
> as it can horribly screw up the plan selection?
> 
> Why not allow the user to specify an alternate plan and have the planner
> assign a higher preference to it during plan evaluation? This would still
> allow a fair evaluation of all possible plans, as we do right now, while
> giving the user-given plan a higher preference during evaluation.

The larger question to answer first is whether we want to implement
something that is deterministic...

How about just dropping the whole concept of "hinting" and providing a way
for someone to say "use this plan, or die trying."  Maybe require it to be
used in conjunction with named PREPAREd statements:

PREPARE s1 (USING /path/to/plan_def_on_server_or_something_similar) AS
SELECT ...;

Aside from whole-plan specification I can definitely see where join/where
specification could be useful if it can overcome the current limitation of
not being able to calculate inter-table estimations.

David J.







Re: Planner hints in Postgresql

From
Atri Sharma
Date:



On Mon, Mar 17, 2014 at 10:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Atri Sharma (atri.jiit@gmail.com) wrote:
>> Isn't using a user-given value for selectivity a pretty risky situation,
>> as it can horribly screw up the plan selection?
>>
>> Why not allow the user to specify an alternate plan and have the planner
>
> Uh, you're worried about the user giving us a garbage selectivity, but
> they're going to get a full-blown plan perfect?



I never said that the user plan would be perfect. The entire point of planner hints is based on the assumption that the user knows more about the data than the planner does, and hence that the user's ideas about the plan should be given a preference. Garbage selectivity can screw up the cost estimation of *all* our possible plans, and we could end up preferring a sequential scan over an index-only scan, for example. I am trying to think of ways that give some preference to a user plan but do not interfere with the cost estimation of our other potential plans.



> What exactly would such a "preference" look like?  A cost modifier?
> We'd almost certainly have to make that into a GUC or a value passed in
> as part of the query, with a high likelihood of users figuring out how
> to use it to say "use my plan forever and always".


A factor that we experimentally determine by which we decrease the cost of the user specified plan so that it gets a higher preference in the plan evaluation.

Of course, this is not a nice hack. Especially after our discussion on IRC the other day, I am against planner hints; but if we are just discussing how it could be done, I could think of some ways, which I listed.

Regards,

Atri

Re: Planner hints in Postgresql

From
Stephen Frost
Date:
* Atri Sharma (atri.jiit@gmail.com) wrote:
> Of course, this is not a nice hack. Especially after our discussion on
> IRC the other day, I am against planner hints; but if we are just
> discussing how it could be done, I could think of some ways, which I listed.

There's lots of ways to implement planner hints, but I fail to see the
point in discussing how to implement something we actively don't want.
Thanks,
    Stephen

Re: Planner hints in Postgresql

From
Atri Sharma
Date:

> The larger question to answer first is whether we want to implement
> something that is deterministic...
>
> How about just dropping the whole concept of "hinting" and providing a way
> for someone to say "use this plan, or die trying."  Maybe require it to be
> used in conjunction with named PREPAREd statements:


You mean taking away the entire concept of query planning and cost estimation? That's like replacing the optimizer with a DBA's decision, and I am not at all comfortable with that idea. Those are only my thoughts, though.

 

> PREPARE s1 (USING /path/to/plan_def_on_server_or_something_similar) AS
> SELECT ...;
>
> Aside from whole-plan specification I can definitely see where join/where
> specification could be useful if it can overcome the current limitation of
> not being able to calculate inter-table estimations.


Prepared statements use a generic plan for execution. Replacing it with a totally user-defined plan does not seem clean.

The crux is that, IMHO, planner hints are a bad way of trying to circumvent the need for cross-column statistics. We should get cross-column statistics done and ignore planner hints completely.

Regards,

Atri



Re: Planner hints in Postgresql

From
Atri Sharma
Date:



> There's lots of ways to implement planner hints, but I fail to see the
> point in discussing how to implement something we actively don't want.



+1. The original poster wanted a way to implement it as a personal project or something (I think he only replied to me, not the entire list).

Planner hints should be ignored :)

Regards,

Atri




Re: Planner hints in Postgresql

From
Atri Sharma
Date:



On Mon, Mar 17, 2014 at 10:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 9:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Possibly worth noting is that in past discussions, we've concluded that
>>> the most sensible type of hint would not be "use this plan" at all, but
>>> "here's what to assume about the selectivity of this WHERE clause".
>>> That seems considerably less likely to break than any attempt to directly
>>> specify plan details.
>
>> Isn't using a user-given value for selectivity a pretty risky situation,
>> as it can horribly screw up the plan selection?
>
> And forcing a plan to be used *isn't* that?  Please re-read the older
> threads, since you evidently have not.


I never said that we should force a plan to be used. I just said that we should increase the preference for a user-given plan, and not interfere in the cost estimation of the other potential plans or in the evaluation of the final selected plan.

Regards,

Atri




Re: Planner hints in Postgresql

From
Merlin Moncure
Date:
On Mon, Mar 17, 2014 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Johnston <polobo@yahoo.com> writes:
>> Need to discuss the general "why" before any meaningful help on the "how" is
>> going to be considered by hackers.
>
> Possibly worth noting is that in past discussions, we've concluded that
> the most sensible type of hint would not be "use this plan" at all, but
> "here's what to assume about the selectivity of this WHERE clause".
> That seems considerably less likely to break than any attempt to directly
> specify plan details.

Yeah -- the most common case I see is outlier culling, where several
repeated low non-deterministic selectivity quals stack, reducing the
row count estimate to 1.  For example:

SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2;

The user may have special knowledge that the above is very (or very un-)
selective, knowledge that is difficult or not cost-effective to gather in
the general case.  IIRC in the archives (heh) there is a special
workaround using indexes and some discussion regarding how a
hypothetical feature involving user-input selectivity estimates might
look.  I don't think that discussion is complete: the syntax for
user-input selectivity is an unsolved problem.

There's a big difference between saying to the planner, "Use plan X"
vs "Here's some information describing the data supporting choosing
plan X intelligently".  The latter allows for better plans in the face
of varied/changing data, integrates with the planner in a natural way,
and encourages users to understand how the planner works.

merlin
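The arithmetic behind Merlin's example can be sketched in a few lines. This is an illustrative Python toy, not planner code; the `DEFAULT_RANGE_INEQ_SEL` constant is an assumption standing in for whatever hard-wired fallback the planner uses when it cannot estimate an expression such as length(bar):

```python
# Illustrative arithmetic only -- not PostgreSQL code.
# The constant is an assumption standing in for the planner's hard-wired
# fallback when it has no statistics for an expression like length(bar).
TABLE_ROWS = 1_000_000
DEFAULT_RANGE_INEQ_SEL = 0.005  # assumed default for a range-qual pair

def estimated_rows(total_rows, *selectivities):
    """Planner-style estimate: multiply selectivities, clamp to >= 1 row."""
    est = total_rows
    for s in selectivities:
        est *= s
    return max(1, int(round(est)))

# With the default guess, the length() quals look extremely selective,
# which is what invites a nested-loop plan:
default_est = estimated_rows(TABLE_ROWS, DEFAULT_RANGE_INEQ_SEL)

# With user-supplied knowledge ("almost every row passes both quals"):
hinted_est = estimated_rows(TABLE_ROWS, 0.999, 0.999)

print(default_est, hinted_est)
```

Whether the surface syntax is WITH SELECTIVITY or a hook somewhere else, the mechanism is the same: the user supplies the multipliers, and the planner keeps doing the costing.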



Re: Planner hints in Postgresql

From
Stephen Frost
Date:
* Merlin Moncure (mmoncure@gmail.com) wrote:
> Yeah -- the most common case I see is outlier culling where several
> repeated low non-deterministic selectivity quals stack, reducing the
> row count estimate to 1.  For example:
> SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2;

This is exactly the issue that I've seen also: where we end up picking a
Nested Loop because we think only one row is going to be returned, and
instead we end up getting a bunch and it takes forever.

There was also some speculation on trying to change plans mid-stream to
address a situation like that, once we realize what's happening.  Not
sure that's really practical but it would be nice to find some solution.
Thanks,
    Stephen

Re: Planner hints in Postgresql

From
Atri Sharma
Date:



> There's a big difference between saying to the planner, "Use plan X"
> vs "Here's some information describing the data supporting choosing
> plan X intelligently".  The latter allows for better plans in the face
> of varied/changing data, integrates with the planner in a natural way,
> and encourages users to understand how the planner works.


+1

I was thinking of varying the 'weight' of a user-defined plan by a fixed experimental factor, to tell the planner to give higher/lower preference to this plan; but after your idea above, I think Stephen's point of introducing a GUC for the factor is the only way possible, and I agree with him on the point that eventually the user will figure out a way to force usage of his plan using the GUC.

Regards,

Atri



Re: Planner hints in Postgresql

From
Merlin Moncure
Date:
On Mon, Mar 17, 2014 at 12:57 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>
>> There's a big difference between saying to the planner, "Use plan X"
>> vs "Here's some information describing the data supporting choosing
>> plan X intelligently".  The latter allows for better plans in the face
>> of varied/changing data, integrates with the planner in a natural way,
>> and encourages users to understand how the planner works.
>
> +1
>
> I was thinking of varying the 'weight' of a user-defined plan by a fixed
> experimental factor, to tell the planner to give higher/lower preference to
> this plan; but after your idea above, I think Stephen's point of introducing
> a GUC for the factor is the only way possible, and I agree with him on the
> point that eventually the user will figure out a way to force usage of his
> plan using the GUC.

GUC is not the answer beyond the "broad brush", mostly debugging-level
features they already support.  What do you do if your plan
simultaneously needs and does not need nestloops?

A query plan is a complicated thing that is the result of detailed
analysis of the data.  I bet there are less than 100 users on the
planet with the architectural knowledge of the planner to submit a
'plan'.  What users do have is knowledge of the data that the database
can't effectively gather for some reason.  Looking at my query above,
what it would need (assuming the planner could not be made to look
through length()) would be something like:

SELECT * FROM foo WHERE
  length(bar) <= 1000 WITH SELECTIVITY 0.999
  AND length(bar) >= 2 WITH SELECTIVITY 0.999;

Note, that's a trivial treatment of the syntax challenges.  Ultimately
it'd probably look different and/or be hooked in a different way (say,
via the function call).

merlin



Re: Planner hints in Postgresql

From
Atri Sharma
Date:



On Mon, Mar 17, 2014 at 11:50 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> GUC is not the answer beyond the "broad brush", mostly debugging-level
> features they already support.  What do you do if your plan
> simultaneously needs and does not need nestloops?
>
> A query plan is a complicated thing that is the result of detailed
> analysis of the data.  I bet there are less than 100 users on the
> planet with the architectural knowledge of the planner to submit a
> 'plan'.  What users do have is knowledge of the data that the database
> can't effectively gather for some reason.  Looking at my query above,
> what it would need (assuming the planner could not be made to look
> through length()) would be something like:
>
> SELECT * FROM foo WHERE
>   length(bar) <= 1000 WITH SELECTIVITY 0.999
>   AND length(bar) >= 2 WITH SELECTIVITY 0.999;



Won't this have scaling issues, and issues over time as the data in the table changes?

Suppose I make a view with the above query. With time, as the data in the table changes, the selectivity values won't be good for planning. This may potentially lead to a lot of changes in the view definition and in other places where this query was used.



In general, I think I take back my point that specifying the selectivity is a bad idea.

Could this also work (for the time being) for cross-column statistics?

Regards,

Atri




Re: Planner hints in Postgresql

From
Pavel Stehule
Date:



2014-03-17 19:35 GMT+01:00 Atri Sharma <atri.jiit@gmail.com>:
> Won't this have scaling issues, and issues over time as the data in the
> table changes?
>
> Suppose I make a view with the above query. With time, as the data in the
> table changes, the selectivity values won't be good for planning. This may
> potentially lead to a lot of changes in the view definition and in other
> places where this query was used.
>
> In general, I think I take back my point that specifying the selectivity
> is a bad idea.
>
> Could this also work (for the time being) for cross-column statistics?


It is another issue.

I don't believe SELECTIVITY can work well either. Slow queries are usually related to some strange points in the data. I think a good concept should be based on the validity of estimations: some plans are based on totally wrong estimations, yet should be fast due to being less sensitive to bad estimations. So a good concept is penalizing risky plans -- or using brute force, like a COLUMN store engine does. Its plans are usually simple and tolerant of bad estimations.

Pavel
 

Re: Planner hints in Postgresql

From
Tom Lane
Date:
Atri Sharma <atri.jiit@gmail.com> writes:
> Won't this have scaling issues, and issues over time as the data in the
> table changes?

It can't possibly have worse problems of that sort than explicitly
specifying a plan does.
        regards, tom lane



Re: Planner hints in Postgresql

From
Merlin Moncure
Date:
On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I don't believe SELECTIVITY can work well either. Slow queries are usually
> related to some strange points in the data. I think a good concept should
> be based on the validity of estimations: some plans are based on totally
> wrong estimations, yet should be fast due to being less sensitive to bad
> estimations. So a good concept is penalizing risky plans -- or using brute
> force, like a COLUMN store engine does. Its plans are usually simple and
> tolerant of bad estimations.

Disagree.  There is a special case of slow query where the problem is not
with the data but with the expression over the data; something in the
query defeats sampled selectivity.  Common culprits are:

*) CASE expressions
*) COALESCE
*) casts
*) simple transformational expressions
*) predicate string concatenation

When using those expressions, you often end up with default
selectivity assumptions, and if they are way off -- watch out.

Plan risk analysis solves a different problem: small changes in the
data mean big changes in the execution runtime.  It probably wouldn't
even help cases where the server thinks there is one row and you
actually have thousands or millions, unless you want to implement a
selectivity range with perhaps a risk coefficient.  This was also
suggested some time back and was also met with some skepticism (but
it'd be interesting to see!).

merlin
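The "risk coefficient" idea can be sketched as follows -- a hypothetical Python toy, not a proposal for actual planner code; the function name, coefficient, and all costs are made up:

```python
# Hypothetical sketch of "selectivity range with a risk coefficient":
# carry a best-case and worst-case cost per plan, and penalize plans whose
# cost explodes when the estimate turns out to be wrong.

def risky_cost(cost_at_best_estimate, cost_at_worst_estimate, risk_coeff=0.5):
    """Blend the optimistic cost with sensitivity to a bad estimate."""
    spread = cost_at_worst_estimate - cost_at_best_estimate
    return cost_at_best_estimate + risk_coeff * spread

# Nested loop: wonderful if the 1-row estimate holds, catastrophic if not.
nestloop = risky_cost(10.0, 100_000.0)

# Hash join: mediocre either way, insensitive to the estimate.
hashjoin = risky_cost(500.0, 900.0)

print("hash join wins" if hashjoin < nestloop else "nested loop wins")
```

A plain cost comparison at the best-case estimate would pick the nested loop; the risk penalty prefers the plan that is tolerant of a bad estimate, which is the tolerance property described above.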



Re: Planner hints in Postgresql

From
Jim Nasby
Date:
On 3/17/14, 12:58 PM, Stephen Frost wrote:
> * Merlin Moncure (mmoncure@gmail.com) wrote:
>> Yeah -- the most common case I see is outlier culling where several
>> repeated low non-deterministic selectivity quals stack, reducing the
>> row count estimate to 1.  For example:
>> SELECT * FROM foo WHERE length(bar) <= 1000 AND length(bar) >= 2;
>
> This is exactly the issue that I've seen also- where we end up picking a
> Nested Loop because we think only one row is going to be returned and
> instead we end up getting a bunch and it takes forever.

FWIW, I've also seen problems with merge and hash joins at work, but I don't have any concrete examples handy. :(

> There was also some speculation on trying to change plans mid-stream to
> address a situation like that, once we realize what's happening.  Not
> sure that's really practical but it would be nice to find some solution.

Just being able to detect that something has possibly gone wrong would be useful. We could log that to alert the
DBA/user of a potential bad plan. We could even format this in such a fashion that it's suitable for emailing the
community with; the query, the plan, the stats, etc. That might make it easier for us to fix the planner (although at
this point it seems like we're hitting statistics-gathering problems that we simply don't know how to solve).
 

There is another aspect of this though: plan stability. There are lots of cases where users couldn't care less about
getting an optimal plan, but they care *greatly* about not getting a brain-dead plan.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Planner hints in Postgresql

From
Merlin Moncure
Date:
On Mon, Mar 17, 2014 at 2:02 PM, Jim Nasby <jim@nasby.net> wrote:
> Just being able to detect that something has possibly gone wrong would be
> useful. We could log that to alert the DBA/user of a potential bad plan. We
> could even format this in such a fashion that it's suitable for emailing the
> community with; the query, the plan, the stats, etc. That might make it
> easier for us to fix the planner (although at this point it seems like we're
> hitting statistics gathering problems that we simply don't know how to
> solve).

Again, that's not the case here.  The problem is that the server is
using hard-wired assumptions (like, 10% selective) *instead* of
statistics -- at least in the case discussed above.  That being said,
I think you're on to something: EXPLAIN ANALYZE rowcounts don't
indicate if the row count was generated from data-based assumptions or
SWAGs.  So maybe you could decorate the plan description with an
indicator that suggests when default selectivity rules were hit.

> There is another aspect of this though: plan stability. There are lots of
> cases where users couldn't care less about getting an optimal plan, but they
> care *greatly* about not getting a brain-dead plan.

Except for the cases I noted above, I don't understand how you could flag
'sub-optimal' or 'brain-dead' plans.  The server always picks the
best plan it can.  The trick is to (in a very simple and
cpu-unintensive way) indicate when there isn't a lot of confidence in
the plan -- but that's not the same thing.

merlin



Re: Planner hints in Postgresql

From
Martijn van Oosterhout
Date:
On Mon, Mar 17, 2014 at 01:20:47PM -0500, Merlin Moncure wrote:
> A query plan is a complicated thing that is the result of detailed
> analysis of the data.  I bet there are less than 100 users on the
> planet with the architectural knowledge of the planner to submit a
> 'plan'.  What users do have is knowledge of the data that the database
> can't effectively gather for some reason.  Looking at my query above,
> what it would need (assuming the planner could not be made to look
> through length()) would be something like:
>
> SELECT * FROM foo WHERE
>   length(bar) <= 1000 WITH SELECTIVITY 0.999
>   AND length(bar) >= 2 WITH SELECTIVITY 0.999;

A small issue with selectivity is that the selectivity is probably not
what the users are expecting anyway, since many will relate to
conditional selectivities.  PostgreSQL is pretty good at single-column
statistics; it just sometimes screws up on cross-column correlations.

This ties in with alerting about a bad plan: if the EXPLAIN output
could list for each condition what the actual selectivity was, it might
give users a way of understanding the problem.

So the example given might lead to output like:

clause                               selectivity              estimated
length(bar)>2                        0.50                     0.50
length(bar)<1000 | length(bar)>2     0.50                     0.25

The execution engine can only output conditional selectivities because
of the order of execution. But this would at least give users a handle
on the problem.
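The conditional-selectivity report could, in principle, be computed from per-clause pass counts during execution. A simulated Python illustration follows; the rows, clause list, and report format are all invented, not PostgreSQL internals:

```python
# Simulate an executor counting, per clause, how many of the rows that
# survived the earlier clauses also pass this one. That ratio is the
# conditional selectivity a report like the one above would show.
import random

random.seed(1)
rows = [{"bar": "x" * random.randint(0, 1200)} for _ in range(10_000)]

clauses = [
    ("length(bar)>2",    lambda r: len(r["bar"]) > 2),
    ("length(bar)<1000", lambda r: len(r["bar"]) < 1000),
]

surviving = rows
report = []
for name, pred in clauses:
    passed = [r for r in surviving if pred(r)]
    cond_sel = len(passed) / len(surviving) if surviving else 0.0
    report.append((name, cond_sel))
    surviving = passed

for name, sel in report:
    print(f"{name:20s} conditional selectivity = {sel:.2f}")
```

By construction, the product of the conditional selectivities equals the overall fraction of rows passing all clauses, which is what would give users a handle on where a cross-column estimate went wrong.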

Note that a first cut of the problem might simply be something like
likely()/unlikely() as in gcc.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Planner hints in Postgresql

From
Jim Nasby
Date:
On 3/17/14, 3:32 PM, Merlin Moncure wrote:
> On Mon, Mar 17, 2014 at 2:02 PM, Jim Nasby <jim@nasby.net> wrote:
>> Just being able to detect that something has possibly gone wrong would be
>> useful. We could log that to alert the DBA/user of a potential bad plan. We
>> could even format this in such a fashion that it's suitable for emailing the
>> community with; the query, the plan, the stats, etc. That might make it
>> easier for us to fix the planner (although at this point it seems like we're
>> hitting statistics gathering problems that we simply don't know how to
>> solve).
>
> Again, that's not the case here.  The problem is that the server is
> using hard-wired assumptions (like, 10% selective) *instead* of
> statistics -- at least in the case discussed above.  That being said,
> I think you're on to something: EXPLAIN ANALYZE rowcounts don't
> indicate if the row count was generated from data-based assumptions or
> SWAGs.  So maybe you could decorate the plan description with an
> indicator that suggests when default selectivity rules were hit.

It occurs to me... it should be cheap for us to track actual rowcounts compared to the estimate... perhaps it's worth
doing that and flagging plans when the estimates are off by more than X percent. Though... I suspect that will just tell
us what we already know. :(
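
The check described here could look something like this sketch (the function name and the threshold are invented; real bookkeeping would happen inside the executor):

```python
# Sketch of the flagging heuristic: compare a node's estimated row
# count to the actual one and flag it when the ratio is off by more
# than a chosen factor in either direction.
def plan_estimate_ok(estimated_rows, actual_rows, max_ratio=10.0):
    """Return True when the estimate is within max_ratio of reality."""
    est = max(estimated_rows, 1)
    act = max(actual_rows, 1)
    ratio = max(est / act, act / est)
    return ratio <= max_ratio

print(plan_estimate_ok(100, 120))   # -> True, close enough
print(plan_estimate_ok(1, 50000))   # -> False, the classic "1 row" misestimate
```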
 

>> There is another aspect of this though: plan stability. There are lots of
>> cases where users couldn't care less about getting an optimal plan, but they
>> care *greatly* about not getting a brain-dead plan.
>
> Except for cases I noted above, I don't understand how you could flag
> 'sub-optimal' or 'brain-dead' plans.   The server always picks the
> best plan it can.  The trick is to (in a very simple and
> cpu-unintensive way) indicate when there isn't a lot of confidence in
> the plan -- but that's not the same thing.

Keep in mind that the use case here is critical queries that MUST perform "fast enough". They do NOT need to be
optimal, but they definitely can not degrade into something stupid. It's often way better to have a query that's 50-100%
slower than optimal as opposed to one that suddenly becomes 100+% slower than it normally is (regardless of if normal is
optimal or not).
 

You could possibly do an "anti-hint": Never use this plan, because we know it sucks.

Even better would be if the planner could estimate how bad a plan will become if we made assumptions that turn out to
be wrong.
 

Another option would be tracking normal execution times (and the plans used) for queries (not completely unreasonable
now with pg_stat_statements); if we try to run a query and it takes noticeably longer than normal and it's a different
plan then assume the plan is bad, mark it as such, and try again with one of the "known good" plans.
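
A minimal sketch of that "known good plan" bookkeeping, with all names and the slowdown threshold invented for illustration:

```python
# Track mean runtime per (query, plan); when a new plan runs noticeably
# slower than the best-known plan for that query, mark it bad so a
# known-good plan can be retried instead.
from collections import defaultdict

class PlanWatchdog:
    def __init__(self, slowdown_factor=3.0):
        self.times = defaultdict(list)   # (query, plan) -> runtimes
        self.bad = set()
        self.factor = slowdown_factor

    def record(self, query, plan, runtime):
        self.times[(query, plan)].append(runtime)

    def best_mean(self, query):
        means = [sum(v) / len(v)
                 for (q, _), v in self.times.items() if q == query]
        return min(means) if means else None

    def check(self, query, plan, runtime):
        baseline = self.best_mean(query)
        if baseline is not None and runtime > baseline * self.factor:
            self.bad.add((query, plan))   # candidate for known-good fallback
        self.record(query, plan, runtime)

w = PlanWatchdog()
for t in (10, 11, 9):
    w.record("q1", "planA", t)
w.check("q1", "planB", 95)           # far slower than planA's ~10 mean
print(("q1", "planB") in w.bad)      # -> True
```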
 

Worst case would be allowing a means to store an acceptable plan and force/strongly suggest that the planner use it.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Planner hints in Postgresql

От
Claudio Freire
Дата:
On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
> Even better would be if the planner could estimate how bad a plan will
> become if we made assumptions that turn out to be wrong.

That's precisely what risk estimation was about.

Something like

SELECT * FROM wherever WHERE id > something LIMIT COST 10000;

Would forbid a sequential scan *if* the table is big enough to suspect the
plan might take that much, or a nested loop *if* the planner cannot *prove*
it will be faster than that.

I don't believe the limit unit is obscure at all (page fetches being a nice
measuring stick), but what is, is what do you do when no plan fits the limits.

Re: Planner hints in Postgresql

От
Jim Nasby
Дата:
On 3/17/14, 2:16 PM, Merlin Moncure wrote:
> On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule<pavel.stehule@gmail.com>  wrote:
>> >I don't believe so SELECTIVITY can work well too. Slow queries are usually
>> >related to some strange points in data. I am thinking so well concept should
>> >be based on validity of estimations. Some plans are based on totally wrong
>> >estimation, but should be fast due less sensitivity to bad estimations. So
>> >well concept is penalization some risk plans - or use brute force - like
>> >COLUMN store engine does. Their plan is usually simply and tolerant to bad
>> >estimations.
> Disagree.  There is a special case of slow query where problem is not
> with the data but with the expression over the data; something in the
> query defeats sampled selectivity.  Common culprits are:
>
> *) CASE expressions
> *) COALESCE
> *) casts
> *) simple transformational expressions
> *) predicate string concatenation

*) time/date functions, ie WHERE date_trunc( 'quarter', some_timestamp ) = '2014-1-1'

Though, in this case it's probably much better to teach the parser how to turn that into a range expression.
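
The rewrite suggested here is mechanical; a sketch of the equivalence (illustrative Python, not parser code):

```python
# date_trunc('quarter', ts) = X is equivalent to the half-open range
# X <= ts < X + 3 months, which an ordinary btree index on ts can serve.
from datetime import datetime

def quarter_range(quarter_start: datetime):
    """Return the [start, end) range matching
    date_trunc('quarter', ts) = quarter_start."""
    month = quarter_start.month + 3
    year = quarter_start.year + (month - 1) // 12
    month = (month - 1) % 12 + 1
    end = quarter_start.replace(year=year, month=month)
    return quarter_start, end

start, end = quarter_range(datetime(2014, 1, 1))
print(start, end)  # 2014-01-01 00:00:00 2014-04-01 00:00:00
```
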
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Planner hints in Postgresql

От
Claudio Freire
Дата:
On Mon, Mar 17, 2014 at 7:09 PM, Jim Nasby <jim@nasby.net> wrote:
> On 3/17/14, 2:16 PM, Merlin Moncure wrote:
>> On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> I don't believe so SELECTIVITY can work well too. Slow queries are usually
>>> related to some strange points in data. I am thinking so well concept should
>>> be based on validity of estimations. Some plans are based on totally wrong
>>> estimation, but should be fast due less sensitivity to bad estimations. So
>>> well concept is penalization some risk plans - or use brute force - like
>>> COLUMN store engine does. Their plan is usually simply and tolerant to bad
>>> estimations.
>>
>> Disagree.  There is a special case of slow query where problem is not
>> with the data but with the expression over the data; something in the
>> query defeats sampled selectivity.  Common culprits are:
>>
>> *) CASE expressions
>> *) COALESCE
>> *) casts
>> *) simple transformational expressions
>> *) predicate string concatenation
>
> *) time/date functions, ie WHERE date_trunc( 'quarter', some_timestamp ) = '2014-1-1'
>
> Though, in this case it's probably much better to teach the parser how to
> turn that into a range expression.

Maybe, maybe not.

An index over the truncated time can potentially be much more efficient.

Re: Planner hints in Postgresql

От
Jim Nasby
Дата:
On 3/17/14, 5:07 PM, Claudio Freire wrote:
>
> On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
>
>     Even better would be if the planner could estimate how bad a plan will
>     become if we made assumptions that turn out to be wrong.
 
>
>
> That's precisely what risk estimation was about.
>
> Something like
>
> SELECT * FROM wherever WHERE id > something LIMIT COST 10000;
>
> Would forbid a sequential scan *if* the table is big enough to suspect the
> plan might take that much, or a nested loop *if* the planner cannot *prove*
> it will be faster than that.
 
>
> I don't believe the limit unit is obscure at all (page fetches being a nice
> measuring stick), but what is, is what do you do when no plan fits the limits.
 

I don't think that's the same thing... what you're describing is a way to not begin a query if a low-enough cost plan
can't be found.
 

What I'm talking about is when the planner picks one low-cost plan over another and it turns out the estimate of the
one that was picked was WAY off. I've actually seen cases where plan estimates that were off by just 100 units produce
wildly different results.
 

In that scenario, LIMIT COST won't help at all.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Planner hints in Postgresql

От
Jim Nasby
Дата:
On 3/17/14, 5:12 PM, Claudio Freire wrote:
>
> On Mon, Mar 17, 2014 at 7:09 PM, Jim Nasby <jim@nasby.net> wrote:
>
>     On 3/17/14, 2:16 PM, Merlin Moncure wrote:
>
>         On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>              >I don't believe so SELECTIVITY can work well too. Slow queries are usually
>              >related to some strange points in data. I am thinking so well concept should
>              >be based on validity of estimations. Some plans are based on totally wrong
>              >estimation, but should be fast due less sensitivity to bad estimations. So
>              >well concept is penalization some risk plans - or use brute force - like
>              >COLUMN store engine does. Their plan is usually simply and tolerant to bad
>              >estimations.
>
>         Disagree.  There is a special case of slow query where problem is not
>         with the data but with the expression over the data; something in the
>         query defeats sampled selectivity.  Common culprits are:
>
>         *) CASE expressions
>         *) COALESCE
>         *) casts
>         *) simple transformational expressions
>         *) predicate string concatenation
>
>
>     *) time/date functions, ie WHERE date_trunc( 'quarter', some_timestamp ) = '2014-1-1'
>
>     Though, in this case it's probably much better to teach the parser how to turn that into a range expression.
>
>
>
> Maybe, maybe not.
>
> An index over the truncated time can potentially be much more efficient.

More efficient than a range index? Maybe, but I'm doubtful. Even if that's true, in a warehouse you're going to want to
limit by weeks, months, quarters, years, etc. So now you're stuck building tons of special indexes.
 

(Granted, most warehouses build a separate date dimension because of these kinds of problems... I'm hoping that we
could do something better.)
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Planner hints in Postgresql

От
Josh Berkus
Дата:
On 03/17/2014 01:57 PM, Martijn van Oosterhout wrote:
> On Mon, Mar 17, 2014 at 01:20:47PM -0500, Merlin Moncure wrote:
>> A query plan is a complicated thing that is the result of detail
>> analysis of the data.  I bet there are less than 100 users on the
>> planet with the architectural knowledge of the planner to submit a
>> 'plan'.  What users do have is knowledge of the data that the database
>> can't effectively gather for some reason.  Looking at my query above,
>> what it would need (assuming the planner could not be made to look
>> through length()) would be something like:
>>
>> SELECT * FROM foo WHERE
>>   length(bar) <= 1000 WITH SELECTIVITY 0.999
>>   AND length(bar) >= 2 WITH SELECTIVITY 0.999;

So, if we're going to support query decorators, we might as well go all
the way and just offer Oracle-style "use this index".  Speaking as
someone who is often called on to fix performance issues in other
people's databases, I find major issues with query decorators:

1. they are impossible to maintain since they're scattered all over the
application code.

2. they eventually become a barrier to upgrading, once the performance
of the DB engine changes in a way that makes older query decorators
crippling and/or erroneous.  Because they are scattered all around the
code, it then becomes a major refactoring effort to fix them.

3. There's no obvious way to collect cumulative query hints in order to
supply data for database-level tuning, or for improving the postgresql
query planner.

4. There's no obvious way to use query decorators with ORMs, making them
useless to 95% of our users.

5. Application developers will add them without telling the DBAs, and
vice-versa.  Hilarity ensues.

Given that, I would strongly prefer a different mechanism for ad-hoc
query plan adjustment.  Possible other mechanisms would include:

a) ability to set selectivity for database objects, possibly including
per-column selectivity (i.e. selectivity for columns a & b & f is 0.01),
and save it.

b) ability to execute a query using a presupplied plan.  This would
include ability to edit the plan using some intermediate format, like
JSON or XML.

c) ability to finely adjust costs of specific query operations (e.g.
bitmapscan_cost = 0.02), possibly for specific database objects.

d) ability to save selectivity estimates for specific expressions (e.g.
"selectivity on log15 ( session_id =, log_time BETWEEN ) = 0.03").

Personally, (b) is my favorite version of this feature, because it
allows me to test the query executor itself, and it raises the bar for
clobbering the query planner to people who are willing to spend a little
time on it.  It would also make a fantastic learning tool for learning
about database planning and optimization, and might open the door for
more people hacking on our planner.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Planner hints in Postgresql

От
Vik Fearing
Дата:
On 03/17/2014 06:28 PM, Stephen Frost wrote:
> a value passed in
> as part of the query, with a high likelihood of users figuring out how
> to use it to say "use my plan forever and always"..

Last time this came up, I said on irc that if we ever do implement
hints, I'd like them to be tied to a major version somehow so that later
versions can issue warnings that the planner might have become smarter
about long-forgotten queries and they need to be re-tested to see if the
hint is still warranted.

-- 
Vik




Re: Planner hints in Postgresql

От
Vik Fearing
Дата:
On 03/17/2014 11:15 PM, Josh Berkus wrote:
> 2. they eventually become a barrier to upgrading, once the performance
> of the DB engine changes in a way that makes older query decorators
> crippling and/or erroneous.  Because they are scattered all around the
> code, it then becomes a major refactoring effort to fix them.

My idea of associating them with major versions and issuing warnings
would help with this.

> 5. Application developers will add them without telling the DBAs, and
> vice-versa.  Hilarity ensues.

And this would be unmasked at next major upgrade when the logs fill up.

-- 
Vik




Re: Planner hints in Postgresql

От
Claudio Freire
Дата:

On Mon, Mar 17, 2014 at 7:12 PM, Jim Nasby <jim@nasby.net> wrote:
On 3/17/14, 5:07 PM, Claudio Freire wrote:


On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:

    Even better would be if the planner could estimate how bad a plan will become if we made assumptions that turn out to be wrong.


That's precisely what risk estimation was about.

Something like

SELECT * FROM wherever WHERE id > something LIMIT COST 10000;

Would forbid a sequential scan *if* the table is big enough to suspect the plan might take that much, or a nested loop *if* the planner cannot *prove* it will be faster than that.

I don't believe the limit unit is obscure at all (page fetches being a nice measuring stick), but what is, is what do you do when no plan fits the limits.

I don't think that's the same thing... what you're describing is a way to not begin a query if a low-enough cost plan can't be found.

What I'm talking about is when the planner picks one low-cost plan over another and it turns out the estimate of the one that was picked was WAY off. I've actually seen cases where plan estimates that were off by just 100 units produce wildly different results.

In that scenario, LIMIT COST won't help at all.


The case you describe is different. It's when a plan *effectively* is more expensive than estimated, but the planner could not estimate it.

That's what was mentioned about switching plans mid-way through them, which is IMO quite ill-defined as such (it needs a lot of love in order to get a workable spec out of that).

Re: Planner hints in Postgresql

От
Merlin Moncure
Дата:
On Mon, Mar 17, 2014 at 5:15 PM, Josh Berkus <josh@agliodbs.com> wrote:
> So, if we're going to support query decorators, we might as well go all
> the way and just offer Oracle-style "use this index".  Speaking as
> someone who is often called on to fix performance issues in other
> people's databases, I find major issues with query decorators:

Supplying selectivity estimates in places where the database can't or
won't do them properly itself is a completely different thing from
Oracle-style hints.  For example, they are much more 'future proof' --
both to schema changes and postgres enhancements -- in the absolute
case the database can peek into your expression in some future version
and generate a better estimate than you can.  Aside from that, you're
in no way locked out of future innovations.  This could be done in a
relatively clean way: for example, by putting complex quals in an
inlineable function that is decorated with an analog of the COST/ROWS
clause -- perhaps allowing for a user-defined expression to base
selectivity on the input arguments.

OTOH, hints disable the planner and they are much more complex,
particularly for large queries since you have to consider how
components of the plan relate to one another.  Force feeding a plan to
the database is the wrong way to go.

merlin



Re: Planner hints in Postgresql

От
Tom Lane
Дата:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
>> Even better would be if the planner could estimate how bad a plan will
>> become if we made assumptions that turn out to be wrong.

> That's precisely what risk estimation was about.

Yeah.  I would like to see the planner's cost estimates extended to
include some sort of uncertainty estimate, whereupon risk-averse people
could ask it to prefer low-uncertainty plans over high-uncertainty ones
(the plans we typically choose for ORDER BY ... LIMIT queries being great
examples of the latter).  But it's a long way from wishing that to making
it so.  Right now it's not even clear (to me anyway) how we'd measure or
model such uncertainty.
        regards, tom lane



Re: Planner hints in Postgresql

От
Atri Sharma
Дата:



On Tue, Mar 18, 2014 at 12:46 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Mar 17, 2014 at 1:45 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I don't believe so SELECTIVITY can work well too. Slow queries are usually
> related to some strange points in data. I am thinking so well concept should
> be based on validity of estimations. Some plans are based on totally wrong
> estimation, but should be fast due less sensitivity to bad estimations. So
> well concept is penalization some risk plans - or use brute force - like
> COLUMN store engine does. Their plan is usually simply and tolerant to bad
> estimations.

Disagree.  There is a special case of slow query where problem is not
with the data but with the expression over the data; something in the
query defeats sampled selectivity.  Common culprits are:

*) CASE expressions
*) COALESCE
*) casts
*) simple transformational expressions
*) predicate string concatenation

When using those expressions, you often end up with default
selectivity assumptions and if they are way off -- watch out.

Plan risk analysis solves a different problem: small changes in the
data mean big changes in the execution runtime.  It probably wouldn't
even help cases where the server thinks there is one row and you
actually have thousands or millions unless you want to implement a
selectivity range with perhaps a risk coefficient.  This was also
suggested sometime back and was also met with some skepticism (but
it'd be interesting to see!).




Another case is with prepared statements, where things like array sizes are not known to the planner, so it makes a hard-coded estimate; this can lead to the selection of a customized plan that is worse than the generic plan.

This would be even more useful for prepared statements, since they need some support from the user in terms of selectivity; the user should be allowed to tell us more about the data, given that by using prepared statements he has already given us some indication of the type of query plans he requires.


Regards,

Atri


--
Regards,
 
Atri
l'apprenant

Re: Planner hints in Postgresql

От
Jeff Janes
Дата:
On Monday, March 17, 2014, Atri Sharma <atri.jiit@gmail.com> wrote:



On Mon, Mar 17, 2014 at 10:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
* Atri Sharma (atri.jiit@gmail.com) wrote:
> Isnt using a user given value for selectivity a pretty risky situation as
> it can horribly screw up the plan selection?
>
> Why not allow the user to specify an alternate plan and have the planner

Uh, you're worried about the user given us a garbage selectivity, but
they're going to get a full-blown plan perfect?



I never said that the user plan would be perfect. The entire point of planner hints is based on the assumption that the user knows more about the data than the planner does, hence the user's ideas about the plan should be given preference. Garbage selectivity can screw up the cost estimation of *all* our possible plans, and we could end up preferring, e.g., a sequential scan over an index-only scan. I am trying to think of ways to give some preference to a user plan without interfering with the cost estimation of our other potential plans.

I'm not opposed to planner hints (or plan mandates), but also not optimistic they will ever get implemented, much less accepted.  But if they were, I don't see a use for such fudge factors.  By mandating a plan, I am already asserting I know more than the optimizer does.  Maybe I am right, maybe I am wrong, but either way I have taken it out of the optimizer's hands and would not welcome it snatching control back.

If it is too deranged for me to trust, why would it not become somewhat more deranged and so decide to ignore my hints?  The only setting for such a factor I would ever see myself using was the minimum, or the maximum.

The feature I would like in such hints, if they are to exist, is to set a version to which they apply.  Often a fix is made very quickly after the problem is pointed out, but it could take well over a year for the fix to see production if it is not backpatched and it lands at the wrong part of the release cycle.

Cheers,

Jeff

Re: Planner hints in Postgresql

От
Jeff Janes
Дата:
On Monday, March 17, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Claudio Freire <klaussfreire@gmail.com> writes:
> On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
>> Even better would be if the planner could estimate how bad a plan will
>> become if we made assumptions that turn out to be wrong.

> That's precisely what risk estimation was about.

Yeah.  I would like to see the planner's cost estimates extended to
include some sort of uncertainty estimate, whereupon risk-averse people
could ask it to prefer low-uncertainty plans over high-uncertainty ones
(the plans we typically choose for ORDER BY ... LIMIT queries being great
examples of the latter).  But it's a long way from wishing that to making
it so.  Right now it's not even clear (to me anyway) how we'd measure or
model such uncertainty.

In most of the cases where I've run into horrible estimates, it seemed like the same level of knowledge/reasoning that could allow us to know a plan was risky would also allow us to just do a better job in the first place.

The exception I can think of is in an antijoin between two huge rels.  It is like subtracting two large measurements to get a much smaller result.  We should know the uncertainty will be large.

Cheers,

Jeff

Re: Planner hints in Postgresql

От
Claudio Freire
Дата:
On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
>>> Even better would be if the planner could estimate how bad a plan will
>>> become if we made assumptions that turn out to be wrong.
>
>> That's precisely what risk estimation was about.
>
> Yeah.  I would like to see the planner's cost estimates extended to
> include some sort of uncertainty estimate, whereupon risk-averse people
> could ask it to prefer low-uncertainty plans over high-uncertainty ones
> (the plans we typically choose for ORDER BY ... LIMIT queries being great
> examples of the latter).  But it's a long way from wishing that to making
> it so.  Right now it's not even clear (to me anyway) how we'd measure or
> model such uncertainty.

Well, currently, selectivity estimates based on MCV should be pretty
low-uncertainty, whereas the certainty of other estimates could be modeled
as a random variable if ANALYZE gathered a few statistical moments (for
variables that are prone to that kind of statistical analysis).

That alone could improve things considerably, and statistical info could be
propagated along expressions to make it possible to model uncertainty in
complex expressions as well.

Re: Risk Estimation WAS: Planner hints in Postgresql

От
Josh Berkus
Дата:
> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse people
>> could ask it to prefer low-uncertainty plans over high-uncertainty ones
>> (the plans we typically choose for ORDER BY ... LIMIT queries being great
>> examples of the latter).  But it's a long way from wishing that to making
>> it so.  Right now it's not even clear (to me anyway) how we'd measure or
>> model such uncertainty.

This is not a model, but here's some starting thoughts:

A "high risk" plan has three components:

a) our statistical data is out-of-date or inadequate

b) the potential execution time if our estimates of selectivity are
wrong is high

c) the cost ratio of certain operations is wrong.

Factor (a) can be modeled two ways:

1. If last_analyze is a long time ago, we have increased the risk.  (Ideally, we'd have some idea of the change rate on
the table vs. the last analyze time; right now we don't have those stats)
 

2. Certain patterns, such as multi-column selectivity and GIN/GiST
selectivity are known to have poor estimates, and be higher risk.
Certainly selectivity functions which have been programmed with a flat
coefficient (like default 0.05 selectivity for gist_ops) could also
return a risk factor which is fairly high.

Factor (b) can be modeled simply by estimating the cost of a plan where
all row estimates are changed by 10X, or even better by a calculation on
the risk factor calculated in (a).  This would then give us the "failure
cost" of the bad plan.  Note that we need to estimate in both
directions, both for higher estimates and lower ones; "abort early"
plans fail because the rows returned are lower than expected, for example.

(b) estimation would be expensive if we did every combination of the
entire plan with wrong estimates, so I'm wondering if it would be
adequate to just estimate the node selectivity being off on a per-node
basis.
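
Factor (b) can be illustrated with toy cost functions (the formulas and the 10x fudge factor here are stand-ins for illustration, not the planner's real cost model):

```python
# Re-cost a join with its row estimates scaled up and down by a fudge
# factor and take the worst case as the "failure cost" of the plan.
def nestloop_cost(outer_rows, inner_rows):
    return outer_rows * inner_rows * 0.01        # toy per-pair cost

def hashjoin_cost(outer_rows, inner_rows):
    return (outer_rows + inner_rows) * 1.0       # toy build+probe cost

def failure_cost(cost_fn, outer_rows, inner_rows, fudge=10.0):
    """Worst-case cost if either input estimate is off by `fudge`x."""
    candidates = []
    for o in (outer_rows / fudge, outer_rows, outer_rows * fudge):
        for i in (inner_rows / fudge, inner_rows, inner_rows * fudge):
            candidates.append(cost_fn(o, i))
    return max(candidates)

# With 100x100 estimated rows the nested loop looks cheaper (100 vs 200),
# but its failure cost explodes while the hash join's stays modest.
print(nestloop_cost(100, 100), failure_cost(nestloop_cost, 100, 100))
print(hashjoin_cost(100, 100), failure_cost(hashjoin_cost, 100, 100))
```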

(c) we can't realistically estimate for at all (i.e. if we knew the cost
factor was wrong, we'd fix it) so I suggest ignoring it for risk estimation.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Risk Estimation WAS: Planner hints in Postgresql

От
Atri Sharma
Дата:



On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus <josh@agliodbs.com> wrote:

> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse people
>> could ask it to prefer low-uncertainty plans over high-uncertainty ones
>> (the plans we typically choose for ORDER BY ... LIMIT queries being great
>> examples of the latter).  But it's a long way from wishing that to making
>> it so.  Right now it's not even clear (to me anyway) how we'd measure or
>> model such uncertainty.



I have been thinking of some ways to have a risk estimate of each selectivity that our planner gives. I think a way to do it is as follows:

One of the factors that leads to bad estimates is that the histogram of the values of a column maintained by the planner becomes stale over time as the data in the column changes. So the histogram is no longer an accurate view of the data, and it leads to bad selectivity estimates.

One thing we can try is to add an error factor to the selectivity we give. This lets us factor in the probability that the data has changed, and estimate the difference between the current histogram and the histogram of the data actually present in the column of the table.

We can use the Central Limit Theorem (http://en.wikipedia.org/wiki/Central_limit_theorem). Essentially, the theorem says that given a distribution with finite mean and variance, we can take random independent samples from the data and calculate the mean and standard deviation of each sample. With a large enough number of samples, the sample means follow a normal distribution.

What is interesting is that this can allow us to predict the SD of a given dataset from that curve, and the SD should be directly proportional to the dataset's deviation from the planner's histogram.

I am no mathematician, hence it is hard for me to explain. I think this link [1] will be more helpful.

So, we can have a probability value for the random variable and that shall model the confidence we have in our estimate.
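
A small simulation of the sampling idea (illustrative only; `expovariate` stands in for an arbitrary skewed column, and all numbers are made up):

```python
# Repeatedly take random samples from a column; the sample means form
# an approximately normal distribution whose spread shrinks with
# sample size (Central Limit Theorem), even though the column itself
# is heavily skewed.
import random
import statistics

random.seed(42)
column = [random.expovariate(1.0) for _ in range(10_000)]  # skewed data

def sample_means(data, n_samples=200, sample_size=100):
    return [statistics.fmean(random.sample(data, sample_size))
            for _ in range(n_samples)]

means = sample_means(column)
print(round(statistics.fmean(means), 2))   # close to population mean (~1.0)
print(round(statistics.stdev(means), 2))   # roughly population sd / sqrt(100)
```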

I may be wrong in some parts, but I hope I have been able to convey the general idea.

If this idea develops, I shall be happy to work on it, but my hands are full with ROLLUP right now, so for my part it will take time. For now I just want to float the idea and get a general feel for it.

Please let me know your comments and feedback on this.

Regards,

Atri
--
Regards,
 
Atri
l'apprenant

Re: Risk Estimation WAS: Planner hints in Postgresql

От
Tom Lane
Дата:
Atri Sharma <atri.jiit@gmail.com> writes:
> One of the factors that leads to bad estimates is that the histogram of the
> values of a column maintained by the planner gets old by time and the data
> in the column changes. So, the histogram is no longer a quite accurate view
> of the data and it leads to bad selectivity.

TBH, this is so far down the list of problems that it'll be a long time
before we need to worry about it.  It's certainly not the number one
priority for any project to model risk in the planner.

The thing that I think is probably the number one problem is estimates
that depend on an assumption of uniform distribution of sought-after rows
among those encountered by a scan.  This is usually where bad plans for
LIMIT queries are coming from.  We could certainly add some sort of fudge
factor to those costs, but I'd like to have a more-or-less principled
framework for doing so.
        regards, tom lane



Re: Planner hints in Postgresql

От
Merlin Moncure
Дата:
On Tue, Mar 18, 2014 at 11:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Claudio Freire <klaussfreire@gmail.com> writes:
>> > On Mon, Mar 17, 2014 at 7:01 PM, Jim Nasby <jim@nasby.net> wrote:
>> >> Even better would be if the planner could estimate how bad a plan will
>> >> become if we made assumptions that turn out to be wrong.
>>
>> > That's precisely what risk estimation was about.
>>
>> Yeah.  I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse people
>> could ask it to prefer low-uncertainty plans over high-uncertainty ones
>> (the plans we typically choose for ORDER BY ... LIMIT queries being great
>> examples of the latter).  But it's a long way from wishing that to making
>> it so.  Right now it's not even clear (to me anyway) how we'd measure or
>> model such uncertainty.
>
> Well, currently, selectivity estimates based on MCV should be pretty
> low-uncertainty, whereas certainty of other estimates could be modeled as a
> random variable if ANALYZE gathered a few statistical moments (for variables
> that are prone to that kind of statistical analysis).

Sure, plus as noted you have cases where the planner makes SWAGs.  Each
of those SWAGs can introduce, say (in the worst case), an order of
magnitude of error into the row count estimate.
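As a back-of-the-envelope illustration of how those SWAGs stack (the numbers are invented, not planner internals):

```python
# If each independent guess in an estimate chain can be off by up to a
# factor of F in either direction, k stacked guesses compound to F**k.
def worst_case_rowcount_error(stacked_guesses, factor=10):
    return factor ** stacked_guesses

# Three stacked selectivity guesses can leave the final row count
# estimate off by up to a factor of a thousand.
err = worst_case_rowcount_error(3)
```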

> That alone could improve things considerably, and statistical info could be
> propagated along expressions to make it possible to model uncertainty in
> complex expressions as well.

But how would that work?  I see no solution adumbrated there :-).
Let's say you change the rowcount estimate to low/bestguess/high *and*
you only engage extra searches when there is enough disparity between
those values; you still get exponentially more searches.  (My thinking
is that if the bestguess estimated execution time is some user-definable
amount faster than low/high at any node, a more skeptical plan is
introduced.)  All that could end up being pessimal in the general case,
though.

merlin



Re: Planner hints in Postgresql

From
Claudio Freire
Date:

On Tue, Mar 18, 2014 at 4:48 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> That alone could improve things considerably, and statistical info could be
>> propagated along expressions to make it possible to model uncertainty in
>> complex expressions as well.
>
> But how would that work?  I see no solution adumbrated there :-).

I would have to typify the SQL expression grammar for this, but I don't think it would be impossible. Most non-function expression nodes seem rather trivial. Even for CASE, as long as you have a distribution for the conditional, you can derive a distribution for the whole. User-defined functions would be another game, though. Correlation would have to be measured, and that can be troublesome and as much a weak spot of risk computation as it is of planning, but it could be fuzzed arbitrarily until properly computed; after all, dependency on correlation or non-correlation is a known source of risk, and accounting for it in any way is better than not.
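A rough Python sketch of carrying uncertainty through one expression node (the variance formula assumes the two estimates are independent; the names and numbers are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """A selectivity estimate carried as a random variable: a mean
    plus a variance expressing how uncertain we are about it."""
    mean: float
    var: float

    def and_independent(self, other):
        # P(A and B) for independent predicates: the means multiply,
        # and for independent variables
        # Var(XY) = Var(X)Var(Y) + Var(X)E[Y]^2 + Var(Y)E[X]^2.
        return Estimate(
            self.mean * other.mean,
            self.var * other.var
            + self.var * other.mean ** 2
            + other.var * self.mean ** 2,
        )

# An MCV-based estimate (low uncertainty) ANDed with a guessed one
# (high uncertainty): the guess's variance dominates the result.
mcv_based = Estimate(mean=0.20, var=0.0001)
guessed = Estimate(mean=0.20, var=0.01)
combined = mcv_based.and_independent(guessed)
```

CASE would combine the branch distributions weighted by the conditional's distribution, and correlation, once measured, would add a covariance term to the variance above.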
 
> Let's say you change the rowcount estimate to low/bestguess/high *and*
> you only engage extra searches when there is enough disparity between
> those values; you still get exponentially more searches.

I was under the impression the planner already does an exhaustive search for some queries, so it's just a matter of picking the best plan among those (i.e., estimating cost). The case of GEQO isn't any different, except that perhaps a risk-decreasing transformation would need to be introduced, unless I'm missing something.
 
> (My thinking is that if the bestguess estimated execution time is some
> user-definable amount faster than low/high at any node, a more skeptical
> plan is introduced.)  All that could end up being pessimal in the general
> case, though.

I think the cost estimate would be replaced by a distribution (simplified perhaps into an array of moments, or whatever is easily manipulated in the face of complex expressions). What the user would pick is a sampling method of said distribution. Then plans get measured by the user's yardstick (say: arithmetic mean, median, 90th percentile, etc.). The arithmetic mean would, I guess, be the default, and that ought to be roughly equivalent to the planner's current behavior.
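A toy version of what Claudio sketches, with invented cost samples: each plan carries a cost distribution rather than a single number, and the user's chosen yardstick (the mean, a percentile, or the worst case, as below) decides which plan wins.

```python
import statistics

# Each plan's cost modeled as Monte Carlo samples of execution cost.
plans = {
    # Steady plan: narrow cost distribution.
    "seqscan": [100, 102, 98, 101, 99, 103, 100, 97],
    # Risky plan: usually cheap, occasionally terrible.
    "idxscan": [10, 12, 11, 9, 10, 11, 500, 10],
}

def pick_plan(plans, yardstick):
    """Choose the plan whose distribution scores lowest by the yardstick."""
    return min(plans, key=lambda name: yardstick(plans[name]))

mean_choice = pick_plan(plans, statistics.mean)  # ~current behavior
risk_averse = pick_plan(plans, max)              # worst-case yardstick
```

Under the mean the risky index scan wins; a risk-averse yardstick picks the steady sequential scan instead, which is exactly the low-uncertainty preference discussed upthread.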

Re: Planner hints in Postgresql

From
Atri Sharma
Date:


>>> That's precisely what risk estimation was about.
>>
>> Yeah.  I would like to see the planner's cost estimates extended to
>> include some sort of uncertainty estimate, whereupon risk-averse people
>> could ask it to prefer low-uncertainty plans over high-uncertainty ones
>> (the plans we typically choose for ORDER BY ... LIMIT queries being great
>> examples of the latter).  But it's a long way from wishing that to making
>> it so.  Right now it's not even clear (to me anyway) how we'd measure or
>> model such uncertainty.
>
> Well, currently, selectivity estimates based on MCV should be pretty low-uncertainty, whereas certainty of other estimates could be modeled as a random variable if ANALYZE gathered a few statistical moments (for variables that are prone to that kind of statistical analysis).
>
> That alone could improve things considerably, and statistical info could be propagated along expressions to make it possible to model uncertainty in complex expressions as well.



That is along the lines of the solution I proposed yesterday on the mailing list: take many independent samples of the data, then use the mean and standard deviation across those samples to attach a confidence to the histogram's selectivity estimate.


The problem is multi-faceted: outdated stats, bad guesses at the distribution of the data, and the cases Merlin mentioned before (CASE statements, COALESCE statements, etc.). Finding a general solution will require a lot of research and time.

I agree with Tom: we should focus on the main problems we have in that domain and try to solve those first.

Regards,

Atri




Re: Risk Estimation WAS: Planner hints in Postgresql

From
Robert Haas
Date:
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> One of the factors that leads to bad estimates is that the histogram of the
>> values of a column maintained by the planner gets old by time and the data
>> in the column changes. So, the histogram is no longer a quite accurate view
>> of the data and it leads to bad selectivity.
>
> TBH, this is so far down the list of problems that it'll be a long time
> before we need to worry about it.  It's certainly not the number one
> priority for any project to model risk in the planner.
>
> The thing that I think is probably the number one problem is estimates
> that depend on an assumption of uniform distribution of sought-after rows
> among those encountered by a scan.  This is usually where bad plans for
> LIMIT queries are coming from.  We could certainly add some sort of fudge
> factor to those costs, but I'd like to have a more-or-less principled
> framework for doing so.

I think the problem is, in some sense, more basic than that.  I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit.  Let SC and TC be the startup
cost and total cost of the index scan.  As a matter of general policy,
we're going to say that the cost of this is SC + 0.01 * (TC - SC).
What makes this path look appealing to the planner is that SC is small
relative to TC.  If we knew, for example, that we weren't going to
find the first match until 90% of the way through the index scan, then
we could set SC = 90% * TC and, all else being equal, the planner
would make the right decision.
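The costing policy described here can be written down directly (a sketch with invented numbers, not the actual costsize.c code):

```python
def limit_cost(startup_cost, total_cost, rows_needed, rows_available):
    """Cost charged for a LIMIT: startup cost plus the fraction of the
    underlying scan we expect to run before the limit is satisfied
    (the SC + 0.01 * (TC - SC) example: LIMIT 1 over 100 matches)."""
    fraction = rows_needed / rows_available
    return startup_cost + fraction * (total_cost - startup_cost)

# Index scan: tiny startup cost, large total cost.  The LIMIT makes it
# look almost free, which is exactly what makes the path appealing.
cheap_looking = limit_cost(startup_cost=1.0, total_cost=10_000.0,
                           rows_needed=1, rows_available=100)
# If we somehow knew the first match comes 90% of the way through the
# scan, pushing SC up to 90% of TC would flip the decision.
honest = limit_cost(startup_cost=9_000.0, total_cost=10_000.0,
                    rows_needed=1, rows_available=100)
```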

So you might think that the problem here is that we're assuming
uniform density.  Let's say there are a million rows in the table, and
there are 100 that match our criteria, so the first one is going to
happen 1/10,000'th of the way through the table.  Thus we set SC =
0.0001 * TC, and that turns out to be an underestimate if the
distribution isn't as favorable as we're hoping.  However, that is NOT
what we are doing.  What we are doing is setting SC = 0.  I mean, not
quite 0, but yeah, effectively 0. Essentially we're assuming that no
matter how selective the filter condition may be, we assume that it
will match *the very first row*.

So we're not assuming the average case and getting hosed when things
come out worse than average.  We're assuming the *best* case.  So
unless things happen to really swing in our favor, we get hosed.

Now it might be that a fudge factor of 2 or 1.5 or 10 or 3 or 17 is
appropriate, so that we actually assume we're going to have to scan a
little more of the index than we expect.  That can perhaps be
justified by the possibility that there may actually be NO rows
matching the filter condition, and we'll have to try scanning the
entire index to get off the ground.  We could also try to come up with
a mathematical model for that.  But that fudge factor would presumably
be a multiplier on the effort of finding the first tuple.  And right
now we assume that finding the first tuple will be trivial.  So I
think we should fix THAT problem first, and then if that turns out to
be insufficient, we can worry about what further fudging is required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Risk Estimation WAS: Planner hints in Postgresql

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> So you might think that the problem here is that we're assuming
> uniform density.  Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> happen 1/10,000'th of the way through the table.  Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping.  However, that is NOT
> what we are doing.  What we are doing is setting SC = 0.  I mean, not
> quite 0, but yeah, effectively 0. Essentially we're assuming that no
> matter how selective the filter condition may be, we assume that it
> will match *the very first row*.

I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
fetch the first tuple is estimated as SC + (TC-SC)/N.
        regards, tom lane
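Plugging Robert's million-row example into Tom's formula (with hypothetical SC and TC values) shows the planner is in fact charging the uniform-distribution share of the scan, not zero:

```python
def first_tuple_cost(sc, tc, n_matches):
    """Estimated cost to fetch the first of n_matches qualifying rows:
    not SC alone, but SC plus an even 1/N share of the run cost."""
    return sc + (tc - sc) / n_matches

# Robert's example: SC effectively 0, and 100 matching rows.  The
# planner does NOT assume the very first row matches:
tc = 10_000.0
cost = first_tuple_cost(sc=0.0, tc=tc, n_matches=100)
# cost works out to tc / 100, i.e. the uniform-distribution assumption.
```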



Re: Risk Estimation WAS: Planner hints in Postgresql

From
Atri Sharma
Date:



On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So you might think that the problem here is that we're assuming
> uniform density.  Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> happen 1/10,000'th of the way through the table.  Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping.  However, that is NOT
> what we are doing.  What we are doing is setting SC = 0.  I mean, not
> quite 0, but yeah, effectively 0. Essentially we're assuming that no
> matter how selective the filter condition may be, we assume that it
> will match *the very first row*.

Can't we reuse the same histogram we already have in the planner for this? I mean, AFAIK, the heuristic is that we divide the data into equal-size buckets, find the bucket in which our predicate value lies, and then take some part of that bucket plus the rest of the buckets before it, right?

So, suppose a query is SELECT * FROM table WHERE a > 10; we would find the bucket that 10 lies in, right?

Now, why cannot we take the estimate of all the buckets behind the bucket in which our value is present? Will that estimate not give us the fraction of tuples that are expected to be before the first matching row?

It's pretty wild, but I wanted to know if my understanding of this scenario is correct or not.
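A toy version of the bucket arithmetic Atri describes (an equi-depth histogram with invented bounds, not the planner's actual code) might look like:

```python
import bisect

def selectivity_gt(histogram_bounds, value):
    """Fraction of rows with column > value, from an equi-depth
    histogram: the buckets entirely above the value, plus a linearly
    interpolated share of the bucket the value falls in."""
    buckets = len(histogram_bounds) - 1
    if value <= histogram_bounds[0]:
        return 1.0
    if value >= histogram_bounds[-1]:
        return 0.0
    i = bisect.bisect_right(histogram_bounds, value) - 1
    lo, hi = histogram_bounds[i], histogram_bounds[i + 1]
    within = (hi - value) / (hi - lo)  # share of bucket i above value
    above = buckets - i - 1            # buckets entirely above value
    return (within + above) / buckets

# Ten equi-depth buckets over 0..100; each holds 10% of the rows.
bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sel = selectivity_gt(bounds, 10)  # a > 10 matches roughly 90% of rows
```

This yields the fraction of rows that match, which, as Tom points out next, is not the same thing as where those rows sit in the scan.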

Regards,

Atri


Re: Risk Estimation WAS: Planner hints in Postgresql

From
Tom Lane
Date:
Atri Sharma <atri.jiit@gmail.com> writes:
> Now, why cannot we take the estimate of all the buckets behind the bucket
> in which our value is present? Will that estimate not give us the fraction
> of tuples that are expected to be before the first matching row?

Uh, no, not unless you assume that the table happens to be perfectly
sorted by the column's value.
        regards, tom lane
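A quick simulation (with invented data) shows why: the histogram fixes how many rows match, not where they sit in the scan order, so the position of the first match depends entirely on the physical ordering.

```python
import random

def rows_until_first_match(table, predicate):
    """Count rows scanned before (and including) the first match."""
    for i, row in enumerate(table):
        if predicate(row):
            return i + 1
    return len(table)

random.seed(7)
# 1,000 rows, 1% of them > 990: same histogram either way.
values = list(range(1, 1001))
pred = lambda v: v > 990

sorted_scan = rows_until_first_match(sorted(values), pred)  # 991 rows
random.shuffle(values)
shuffled_scan = rows_until_first_match(values, pred)  # could be anywhere
```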



Re: Risk Estimation WAS: Planner hints in Postgresql

From
Atri Sharma
Date:



On Thu, Mar 20, 2014 at 8:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> Now, why cannot we take the estimate of all the buckets behind the bucket
>> in which our value is present? Will that estimate not give us the fraction
>> of tuples that are expected to be before the first matching row?
>
> Uh, no, not unless you assume that the table happens to be perfectly
> sorted by the column's value.

Yes, that is true. So, if an attribute has an index present, can we do this somehow?

Regards,

Atri




Re: Risk Estimation WAS: Planner hints in Postgresql

From
Robert Haas
Date:
On Thu, Mar 20, 2014 at 10:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So you might think that the problem here is that we're assuming
>> uniform density.  Let's say there are a million rows in the table, and
>> there are 100 that match our criteria, so the first one is going to
>> happen 1/10,000'th of the way through the table.  Thus we set SC =
>> 0.0001 * TC, and that turns out to be an underestimate if the
>> distribution isn't as favorable as we're hoping.  However, that is NOT
>> what we are doing.  What we are doing is setting SC = 0.  I mean, not
>> quite 0, but yeah, effectively 0. Essentially we're assuming that no
>> matter how selective the filter condition may be, we assume that it
>> will match *the very first row*.
>
> I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
> fetch the first tuple is estimated as SC + (TC-SC)/N.

Hmm, you're right, and experimentation confirms that the total cost of
the limit comes out to about TC/selectivity.  So scratch that theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company