Discussion: distinct estimate of a hard-coded VALUES list


distinct estimate of a hard-coded VALUES list

From: Jeff Janes
Date:
I have a query which contains a where clause like:

 aid = ANY(VALUES (1),(45),(87), <6948 more>, (4999947))

for example:

perl -le 'print "explain (analyze) select sum(abalance) from pgbench_accounts where aid=ANY(VALUES "; print join ",", map "($_)", sort {$a<=>$b} map int(rand(5000000)), 1..6952; print ");"'  | psql


And it gives me an explain section like:

->  HashAggregate  (cost=104.28..106.28 rows=200 width=32) (actual time=15.171..30.859 rows=6931 loops=1)
        Group Key: "*VALUES*".column1
         ->  Values Scan on "*VALUES*"  (cost=0.00..86.90 rows=6952 width=32) (actual time=0.007..6.926 rows=6952 loops=1)

So even though it knows that 6952 values have been shoved in the bottom, it thinks only 200 are going to come out of the aggregation.  This seems like a really lousy estimate.  In more complex queries than the example given, it leads to poor planning choices.

Is the size of the input list not available to the planner at the point where it estimates the distinct size of the input list?  I'm assuming that if it is available to EXPLAIN then it is available to the planner.  Does it know how large the input list is, but just throw up its hands and use 200 as the distinct size anyway?

If I throw a massively degenerate list at the system, and it makes bad choices by assuming the list is distinct, I have the recourse of uniquifying the list myself before passing it in.  If I pass in a mostly distinct list, and it makes bad choices by assuming only 200 values are distinct, I don't have much recourse aside from creating, populating, analyzing, and then using temporary tables, which is annoying and causes other problems like catalog bloat.
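
For illustration, the temp-table workaround looks roughly like this (a sketch only; tmp_aids is a placeholder name, and the INSERT would carry the full list):

BEGIN;
CREATE TEMP TABLE tmp_aids (aid int) ON COMMIT DROP;
INSERT INTO tmp_aids VALUES (1),(45),(87);  -- ...plus the rest of the list
ANALYZE tmp_aids;  -- gives the planner real statistics, including ndistinct
SELECT sum(abalance)
  FROM pgbench_accounts
 WHERE aid IN (SELECT aid FROM tmp_aids);
COMMIT;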

Is this something in the planner that could be fixed in a future version, or is a temp table the only real solution here?

Cheers,

Jeff

Re: distinct estimate of a hard-coded VALUES list

From: Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> So even though it knows that 6952 values have been shoved in the bottom, it
> thinks only 200 are going to come out of the aggregation.  This seems like
> a really lousy estimate.  In more complex queries than the example
> given, it leads to poor planning choices.

> Is the size of the input list not available to the planner at the point
> where it estimates the distinct size of the input list?  I'm assuming that
> if it is available to EXPLAIN then it is available to the planner.  Does it
> know how large the input list is, but just throw up its hands and use 200
> as the distinct size anyway?

It does know it; what it doesn't know is how many duplicates there are.
If we do what I think you're suggesting, which is assume the entries are
all distinct, I'm afraid we'll just move the estimation problems somewhere
else.

I recall some talk of actually running an ANALYZE-like process on the
elements of a VALUES list, but it seemed like overkill at the time and
still does.
        regards, tom lane



Re: distinct estimate of a hard-coded VALUES list

From: Jeff Janes
Date:
On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> So even though it knows that 6952 values have been shoved in the bottom, it
> thinks only 200 are going to come out of the aggregation.  This seems like
> a really lousy estimate.  In more complex queries than the example
> given, it leads to poor planning choices.

> Is the size of the input list not available to the planner at the point
> where it estimates the distinct size of the input list?  I'm assuming that
> if it is available to EXPLAIN then it is available to the planner.  Does it
> know how large the input list is, but just throw up its hands and use 200
> as the distinct size anyway?

It does know it; what it doesn't know is how many duplicates there are.

Does it know whether the count comes from a parsed query-string list/array, rather than being an estimate from something else?  If it came from a join, I can see why it would be dangerous to assume they are mostly distinct.  But if someone throws 6000 things into a query string with only 200 distinct values among them, they have no one to blame but themselves when it makes bad choices off of that.


Would it work to check vardata->rel->rtekind == RTE_VALUES in get_variable_numdistinct to detect that?

 
If we do what I think you're suggesting, which is assume the entries are
all distinct, I'm afraid we'll just move the estimation problems somewhere
else.

Any guesses as to where?  (other than the case of someone doing something silly with their query strings?) 

Cheers,

Jeff

Re: distinct estimate of a hard-coded VALUES list

From: Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It does know it; what it doesn't know is how many duplicates there are.

> Does it know whether the count comes from a parsed query-string list/array,
> rather than being an estimate from something else?  If it came from a join,
> I can see why it would be dangerous to assume they are mostly distinct.
> But if someone throws 6000 things into a query string with only 200 distinct
> values among them, they have no one to blame but themselves when it makes
> bad choices off of that.

I am not exactly sold on this assumption that applications have
de-duplicated the contents of a VALUES or IN list.  They haven't been
asked to do that in the past, so why do you think they are doing it?

>> If we do what I think you're suggesting, which is assume the entries are
>> all distinct, I'm afraid we'll just move the estimation problems somewhere
>> else.

> Any guesses as to where?  (other than the case of someone doing something
> silly with their query strings?)

Well, overestimates are as bad as underestimates --- they might lead us away
from using a nestloop, for example.
        regards, tom lane



Re: distinct estimate of a hard-coded VALUES list

From: Robert Haas
Date:
On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It does know it; what it doesn't know is how many duplicates there are.
>
>> Does it know whether the count comes from a parsed query-string list/array,
>> rather than being an estimate from something else?  If it came from a join,
>> I can see why it would be dangerous to assume they are mostly distinct.
>> But if someone throws 6000 things into a query string with only 200 distinct
>> values among them, they have no one to blame but themselves when it makes
>> bad choices off of that.
>
> I am not exactly sold on this assumption that applications have
> de-duplicated the contents of a VALUES or IN list.  They haven't been
> asked to do that in the past, so why do you think they are doing it?

It's hard to know, but my intuition is that most people would
deduplicate.  I mean, nobody is going to want their query generator
to send X IN (1, 1, <repeat a zillion more times>) to the server if it
could have just sent X IN (1).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: distinct estimate of a hard-coded VALUES list

From: Alvaro Herrera
Date:
Robert Haas wrote:
> On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Janes <jeff.janes@gmail.com> writes:
> >> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>>>> It does know it; what it doesn't know is how many duplicates there are.
> >
> >> Does it know whether the count comes from a parsed query-string list/array,
> >> rather than being an estimate from something else?  If it came from a join,
> >> I can see why it would be dangerous to assume they are mostly distinct.
> >> But if someone throws 6000 things into a query string with only 200 distinct
> >> values among them, they have no one to blame but themselves when it makes
> >> bad choices off of that.
> >
> > I am not exactly sold on this assumption that applications have
> > de-duplicated the contents of a VALUES or IN list.  They haven't been
> > asked to do that in the past, so why do you think they are doing it?
> 
> It's hard to know, but my intuition is that most people would
> deduplicate.  I mean, nobody is going to want their query generator
> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
> could have just sent X IN (1).

Also, if we patch it this way and somebody has a slow query because of a
lot of duplicate values, it's easy to solve the problem by
de-duplicating.  But with the current code, people that have the
opposite problem have no way to work around it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: distinct estimate of a hard-coded VALUES list

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am not exactly sold on this assumption that applications have
>> de-duplicated the contents of a VALUES or IN list.  They haven't been
>> asked to do that in the past, so why do you think they are doing it?

> It's hard to know, but my intuition is that most people would
> deduplicate.  I mean, nobody is going to want their query generator
> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
> could have just sent X IN (1).

I dunno, these are the very same people who send "WHERE 1=1" so that
they can save one line of code to decide whether to append AND or not
before the first real condition.

Still, maybe we should optimize for smart queries rather than stupid
ones.
        regards, tom lane



Re: distinct estimate of a hard-coded VALUES list

From: Tomas Vondra
Date:

On 08/22/2016 07:42 PM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> It does know it; what it doesn't know is how many duplicates there are.
>>>
>>>> Does it know whether the count comes from a parsed query-string list/array,
>>>> rather than being an estimate from something else?  If it came from a join,
>>>> I can see why it would be dangerous to assume they are mostly distinct.
>>>> But if someone throws 6000 things into a query string with only 200 distinct
>>>> values among them, they have no one to blame but themselves when it makes
>>>> bad choices off of that.
>>>
>>> I am not exactly sold on this assumption that applications have
>>> de-duplicated the contents of a VALUES or IN list.  They haven't been
>>> asked to do that in the past, so why do you think they are doing it?
>>
>> It's hard to know, but my intuition is that most people would
>> deduplicate.  I mean, nobody is going to want their query generator
>> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
>> could have just sent X IN (1).
>
> Also, if we patch it this way and somebody has a slow query because of a
> lot of duplicate values, it's easy to solve the problem by
> de-duplicating.  But with the current code, people that have the
> opposite problem have no way to work around it.
>

I certainly agree it's better when a smart user can fix his query plan 
by deduplicating the values than when we end up generating a poor plan 
due to assuming some users are somewhat dumb.

I wonder how expensive it would be to actually count the number of
distinct values - there certainly are complex data types where the
comparisons are fairly expensive, but I would not expect those to be
used in explicit VALUES lists. Also, maybe there's some sufficiently
accurate estimation approach - e.g. for a small number of values we can
compute the number of distinct values directly (and it's still going to
be fairly cheap), while for larger numbers we could probably sample the
values similarly to what ANALYZE does.
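
As a trivial sketch of the direct computation (the values here are
arbitrary; column1 is the default column name of a VALUES RTE):

SELECT count(DISTINCT column1) FROM (VALUES (1),(45),(45),(87)) AS v;
-- returns 3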

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: distinct estimate of a hard-coded VALUES list

From: Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 08/22/2016 07:42 PM, Alvaro Herrera wrote:
>> Also, if we patch it this way and somebody has a slow query because of a
>> lot of duplicate values, it's easy to solve the problem by
>> de-duplicating.  But with the current code, people that have the
>> opposite problem have no way to work around it.

> I certainly agree it's better when a smart user can fix his query plan
> by deduplicating the values than when we end up generating a poor plan
> due to assuming some users are somewhat dumb.

Well, that seems to be the majority opinion, so attached is a patch
to do it.  Do we want to sneak this into 9.6, or wait?

> I wonder how expensive it would be to actually count the number of
> distinct values - there certainly are complex data types where the
> comparisons are fairly expensive, but I would not expect those to be
> used in explicit VALUES lists.

You'd have to sort the values before de-duping, and deal with VALUES
expressions that aren't simple literals.  And you'd have to do it a
lot --- by my count, get_variable_numdistinct is invoked six times
on the Var representing the VALUES output in Jeff's example query.
Maybe you could cache the results somewhere, but that adds even
more complication.  Given the lack of prior complaints about this,
it's hard to believe it's worth the trouble.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 56943f2..43e6220 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** get_variable_numdistinct(VariableStatDat
*** 4768,4773 ****
--- 4768,4784 ----
           */
          stadistinct = 2.0;
      }
+     else if (vardata->rel && vardata->rel->rtekind == RTE_VALUES)
+     {
+         /*
+          * If the Var represents a column of a VALUES RTE, assume it's unique.
+          * This could of course be very wrong, but it should tend to be true
+          * in well-written queries.  We could consider examining the RTE
+          * contents to get some real statistics; but that only works if the
+          * expressions are constants, and it would be pretty expensive anyway.
+          */
+         stadistinct = -1.0;        /* unique (and all non null) */
+     }
      else
      {
          /*
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index fcfb0d4..25715c9 100644
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct PlannerInfo
*** 399,405 ****
   *
   *        relid - RTE index (this is redundant with the relids field, but
   *                is provided for convenience of access)
!  *        rtekind - distinguishes plain relation, subquery, or function RTE
   *        min_attr, max_attr - range of valid AttrNumbers for rel
   *        attr_needed - array of bitmapsets indicating the highest joinrel
   *                in which each attribute is needed; if bit 0 is set then
--- 399,405 ----
   *
   *        relid - RTE index (this is redundant with the relids field, but
   *                is provided for convenience of access)
!  *        rtekind - copy of RTE's rtekind field
   *        min_attr, max_attr - range of valid AttrNumbers for rel
   *        attr_needed - array of bitmapsets indicating the highest joinrel
   *                in which each attribute is needed; if bit 0 is set then
*************** typedef struct RelOptInfo
*** 512,518 ****
      /* information about a base rel (not set for join rels!) */
      Index        relid;
      Oid            reltablespace;    /* containing tablespace */
!     RTEKind        rtekind;        /* RELATION, SUBQUERY, or FUNCTION */
      AttrNumber    min_attr;        /* smallest attrno of rel (often <0) */
      AttrNumber    max_attr;        /* largest attrno of rel */
      Relids       *attr_needed;    /* array indexed [min_attr .. max_attr] */
--- 512,518 ----
      /* information about a base rel (not set for join rels!) */
      Index        relid;
      Oid            reltablespace;    /* containing tablespace */
!     RTEKind        rtekind;        /* RELATION, SUBQUERY, FUNCTION, etc */
      AttrNumber    min_attr;        /* smallest attrno of rel (often <0) */
      AttrNumber    max_attr;        /* largest attrno of rel */
      Relids       *attr_needed;    /* array indexed [min_attr .. max_attr] */

Re: distinct estimate of a hard-coded VALUES list

From: Jeff Janes
Date:
On Mon, Aug 22, 2016 at 10:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It does know it; what it doesn't know is how many duplicates there are.
>
>> Does it know whether the count comes from a parsed query-string list/array,
>> rather than being an estimate from something else?  If it came from a join,
>> I can see why it would be dangerous to assume they are mostly distinct.
>> But if someone throws 6000 things into a query string with only 200 distinct
>> values among them, they have no one to blame but themselves when it makes
>> bad choices off of that.
>
> I am not exactly sold on this assumption that applications have
> de-duplicated the contents of a VALUES or IN list.  They haven't been
> asked to do that in the past, so why do you think they are doing it?


I think most people wouldn't go out of their way to de-duplicate, simply because the way they get the list means it is inherently deduplicated in the first place, or at least not going to routinely have massive redundancy.  If I thought a list was likely to have massive redundancy, then I would add code to deduplicate just because I don't want to lug redundant data around inside my own app server code, regardless of what it does to PostgreSQL's planner.

We can be sure that someone somewhere is sending huge lists with only a couple hundred distinct values, but I don't think we should let that block changes.

I think the more serious source of regressions will be cases where the current bad distinct estimate is cancelling out some other bad estimate elsewhere in the planner, yielding an efficient plan by accident.  Improving the estimate could push the plan over to a bad choice, leading to a large regression upon upgrading.  For that reason, I think it is better not to try to get your patch (which I tested, and which looks good to me, except that it still isn't picking the best plan for unrelated reasons) into 9.6 after beta4, but just to put it into 10.0.  I don't know when most people with stringent SLAs start validating the performance of new versions, but I suspect some are already doing so before beta4, and it wouldn't be nice to change this on them.
 

It's hard to know, but my intuition is that most people would
deduplicate.  I mean, nobody is going to want their query generator
to send X IN (1, 1, <repeat a zillion more times>) to the server if it
could have just sent X IN (1).

Yes, I agree with that.  But note that this isn't currently relevant to X IN (1) anyway.  'X IN (list)' gets planned as if it were "X=ANY(ARRAY...)", which goes through different machinery than "X=ANY(VALUES...)".

There is a 9-year-old to-do item which proposes (if I understand it correctly) to change this so that "X=ANY(ARRAY...)" is planned the same way as "X=ANY(VALUES...)".  A problem with doing that is that currently people can, in lieu of planner hints, re-write their queries between the two ways of expressing them, in case one way gives a better plan.  If they were always planned identically, it would take away that safety hatch.

The most conservative thing, given the current state, would be to fix "X=ANY(ARRAY...)" so that it builds a hash table out of the ARRAY, and then probes that table for each X, rather than iterating over the ARRAY for each X, which is what it currently does.  That way you would get the benefit of using a hash, without the risk of tipping over into some unsuitable join order in the process.  It should have no user-facing downsides, just the ugliness of having two ways to implement fundamentally the same thing.

So basically this line of a plan:

   Filter: (aid = ANY ('{...}'))

currently glosses over a nested loop over the array.  We could make it gloss over a hash probe into the "array" instead.
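
For concreteness, the two spellings side by side (illustrative only, reusing the pgbench example from above; the short lists stand in for the real ones):

-- ARRAY form: the executor iterates over the array for each row
SELECT sum(abalance) FROM pgbench_accounts
WHERE aid = ANY (ARRAY[1, 45, 87]);

-- VALUES form: the list becomes an RTE that the planner can join against,
-- e.g. with a hash, as in the plan near the top of this thread
SELECT sum(abalance) FROM pgbench_accounts
WHERE aid = ANY (VALUES (1), (45), (87));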

Cheers,

Jeff

Re: [HACKERS] distinct estimate of a hard-coded VALUES list

From: Jeff Janes
Date:
On Tue, Aug 23, 2016 at 5:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 08/22/2016 07:42 PM, Alvaro Herrera wrote:
>> Also, if we patch it this way and somebody has a slow query because of a
>> lot of duplicate values, it's easy to solve the problem by
>> de-duplicating.  But with the current code, people that have the
>> opposite problem have no way to work around it.

> I certainly agree it's better when a smart user can fix his query plan
> by deduplicating the values than when we end up generating a poor plan
> due to assuming some users are somewhat dumb.

Well, that seems to be the majority opinion, so attached is a patch
to do it.  Do we want to sneak this into 9.6, or wait?

> I wonder how expensive it would be to actually count the number of
> distinct values - there certainly are complex data types where the
> comparisons are fairly expensive, but I would not expect those to be
> used in explicit VALUES lists.

You'd have to sort the values before de-duping, and deal with VALUES
expressions that aren't simple literals.  And you'd have to do it a
lot --- by my count, get_variable_numdistinct is invoked six times
on the Var representing the VALUES output in Jeff's example query.
Maybe you could cache the results somewhere, but that adds even
more complication.  Given the lack of prior complaints about this,
it's hard to believe it's worth the trouble.


This patch still applies, and I think the argument for it is still valid.  So I'm going to make a commit-fest entry for it.  Is there additional evidence we should gather?

Cheers,

Jeff

Re: [HACKERS] distinct estimate of a hard-coded VALUES list

From: Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> This patch still applies, and I think the argument for it is still valid.
> So I'm going to make a commit-fest entry for it.  Is there additional
> evidence we should gather?

I think we had consensus to apply this at the start of the next
development cycle; I just forgot to do it for v10.  Hence, pushed
into v11.
        regards, tom lane



Re: [HACKERS] distinct estimate of a hard-coded VALUES list

From: Jeff Janes
Date:
On Wed, Aug 16, 2017 at 12:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> This patch still applies, and I think the argument for it is still valid.
> So I'm going to make a commit-fest entry for it.  Is there additional
> evidence we should gather?

I think we had consensus to apply this at the start of the next
development cycle; I just forgot to do it for v10.  Hence, pushed
into v11.

Thanks,

Jeff