Обсуждение: get_actual_variable_range vs idx_scan/idx_tup_fetch

Поиск

Список

Период

Сортировка

get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 00:43:22

Hi,

This week we had one of the most annoying problems I've ever encountered 
with postgres.  We had a big index on multiple columns, say,  foo(a, b, 
c).  According to pg_stat_all_indexes the index was being used *all the 
time*.  However, after looking into our queries more closely, it turns 
out that it was only being used to look up statistics for the foo.a 
column to estimate merge scan viability during planning.  But this took 
hours for two people to track down.

So what I'd like to have is a way to be able to distinguish between 
indexes being used to answer queries, and ones being only used for stats 
lookups during planning.  Perhaps the easiest way would be adding a new 
column or two into pg_stat_all_indexes, which we would increment in 
get_actual_variable_range() when fetching data.

Any thoughts?


.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 00:48:07

Marko Tiikkaja <marko@joh.to> writes:
> This week we had one of the most annoying problems I've ever encountered 
> with postgres.  We had a big index on multiple columns, say,  foo(a, b, 
> c).  According to pg_stat_all_indexes the index was being used *all the 
> time*.  However, after looking into our queries more closely, it turns 
> out that it was only being used to look up statistics for the foo.a 
> column to estimate merge scan viability during planning.  But this took 
> hours for two people to track down.

> So what I'd like to have is a way to be able to distinguish between 
> indexes being used to answer queries, and ones being only used for stats 
> lookups during planning.

Why?  Used is used.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 00:49:12

On 10/17/14, 11:47 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> This week we had one of the most annoying problems I've ever encountered
>> with postgres.  We had a big index on multiple columns, say,  foo(a, b,
>> c).  According to pg_stat_all_indexes the index was being used *all the
>> time*.  However, after looking into our queries more closely, it turns
>> out that it was only being used to look up statistics for the foo.a
>> column to estimate merge scan viability during planning.  But this took
>> hours for two people to track down.
>
>> So what I'd like to have is a way to be able to distinguish between
>> indexes being used to answer queries, and ones being only used for stats
>> lookups during planning.
>
> Why?  Used is used.

Because I don't need a 30GB index on foo(a,b,c) to look up statistics. 
If I ever have a problem, I can replace it with a 5GB one on foo(a).


.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 00:59:23

Marko Tiikkaja <marko@joh.to> writes:
> On 10/17/14, 11:47 PM, Tom Lane wrote:
>> Marko Tiikkaja <marko@joh.to> writes:
>>> So what I'd like to have is a way to be able to distinguish between
>>> indexes being used to answer queries, and ones being only used for stats
>>> lookups during planning.

>> Why?  Used is used.

> Because I don't need a 30GB index on foo(a,b,c) to look up statistics. 
> If I ever have a problem, I can replace it with a 5GB one on foo(a).

Well, the index might've been getting used in queries too in a way that
really only involved the first column.  I think you're solving the wrong
problem here.  The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.  Which is not
necessarily wrong in itself --- what you'd want is to figure out when the
last column(s) are *never* used.  The existing stats aren't terribly
helpful for that, I agree.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Jim Nasby

Дата:

18 октября 2014 г., 00:59:42

On 10/17/14, 4:49 PM, Marko Tiikkaja wrote:
> On 10/17/14, 11:47 PM, Tom Lane wrote:
>> Marko Tiikkaja <marko@joh.to> writes:
>>> This week we had one of the most annoying problems I've ever encountered
>>> with postgres.  We had a big index on multiple columns, say,  foo(a, b,
>>> c).  According to pg_stat_all_indexes the index was being used *all the
>>> time*.  However, after looking into our queries more closely, it turns
>>> out that it was only being used to look up statistics for the foo.a
>>> column to estimate merge scan viability during planning.  But this took
>>> hours for two people to track down.
>>
>>> So what I'd like to have is a way to be able to distinguish between
>>> indexes being used to answer queries, and ones being only used for stats
>>> lookups during planning.
>>
>> Why?  Used is used.
>
> Because I don't need a 30GB index on foo(a,b,c) to look up statistics. If I ever have a problem, I can replace it
witha 5GB one on foo(a).
 

That problem can exist with user queries too. Perhaps it would be better to find a way to count scans that didn't use
allthe fields in the index.
 

I do also see value in differentiating planning use from real query processing; not doing that can certainly cause
confusion.What I don't know is if the added stats bloat is worth it. If we do go down that road, I think it'd be better
toadd an indicator to EState. Aside from allowing stats for all planning access, it should make it less likely that
someoneadds a new access path and forgets to mark it as internal (especially if the added field defaults to an invalid
value).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 01:09:11

On 10/17/14, 11:59 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> On 10/17/14, 11:47 PM, Tom Lane wrote:
>>> Marko Tiikkaja <marko@joh.to> writes:
>>>> So what I'd like to have is a way to be able to distinguish between
>>>> indexes being used to answer queries, and ones being only used for stats
>>>> lookups during planning.
>
>>> Why?  Used is used.
>
>> Because I don't need a 30GB index on foo(a,b,c) to look up statistics.
>> If I ever have a problem, I can replace it with a 5GB one on foo(a).
>
> Well, the index might've been getting used in queries too in a way that
> really only involved the first column.  I think you're solving the wrong
> problem here.  The right problem is how to identify indexes that are
> being used in a way that doesn't exploit all the columns.

I'm not sure I agree with that.  Even if there was some information the 
planner could have extracted out of the index by using all columns (thus 
appearing "fully used" in these hypothetical new statistics), I still 
would've wanted the index gone.  But in this particular case, an index 
on foo(a) alone was not selective enough and it would have been a bad 
choice for practically every query, so I'm not sure what good those 
statistics were in the first place.

I think there's a big difference between "this index was used to look up 
stuff for planning" and "this index was used to answer queries quickly".  In my mind the first one belongs to the
category"this index was 

considered", and the latter is "this index was actually useful".  But 
maybe I'm not seeing the big picture here.

.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 01:15:42

Marko Tiikkaja <marko@joh.to> writes:
> On 10/17/14, 11:59 PM, Tom Lane wrote:
>> Well, the index might've been getting used in queries too in a way that
>> really only involved the first column.  I think you're solving the wrong
>> problem here.  The right problem is how to identify indexes that are
>> being used in a way that doesn't exploit all the columns.

> I'm not sure I agree with that.  Even if there was some information the 
> planner could have extracted out of the index by using all columns (thus 
> appearing "fully used" in these hypothetical new statistics), I still 
> would've wanted the index gone.  But in this particular case, an index 
> on foo(a) alone was not selective enough and it would have been a bad 
> choice for practically every query, so I'm not sure what good those 
> statistics were in the first place.

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got.  You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

> I think there's a big difference between "this index was used to look up 
> stuff for planning" and "this index was used to answer queries quickly". 

I think that's utter nonsense.  Even if there were any validity to the
position, it wouldn't be enough to justify doubling the stats footprint
in order to track system-driven accesses separately from query-driven
ones.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 01:50:03

On 10/18/14, 12:15 AM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> I think there's a big difference between "this index was used to look up
>> stuff for planning" and "this index was used to answer queries quickly".
>
> I think that's utter nonsense.

Well you probably know a bit more about the optimizer than I do.  But I 
can't see a case where the stats provided by the index would be useful 
for choosing between two (or more) plans that don't use the index in the 
actual query.  If you're saying that there are such cases, then clearly 
I don't know something, and my thinking is in the wrong here.

.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

18 октября 2014 г., 05:30:10

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
> > On 10/17/14, 11:59 PM, Tom Lane wrote:
> >> Well, the index might've been getting used in queries too in a way that
> >> really only involved the first column.  I think you're solving the wrong
> >> problem here.  The right problem is how to identify indexes that are
> >> being used in a way that doesn't exploit all the columns.
> 
> > I'm not sure I agree with that.  Even if there was some information the 
> > planner could have extracted out of the index by using all columns (thus 
> > appearing "fully used" in these hypothetical new statistics), I still 
> > would've wanted the index gone.  But in this particular case, an index 
> > on foo(a) alone was not selective enough and it would have been a bad 
> > choice for practically every query, so I'm not sure what good those 
> > statistics were in the first place.
> 
> Those stats were perfectly valid: what the planner is looking for is
> accurate minimum and maximum values for the index's leading column, and
> that's what it got.  You're correct that a narrower index could have given
> the same results with a smaller disk footprint, but the planner got the
> results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example?  I am missing something here.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 06:03:15

Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
>> Those stats were perfectly valid: what the planner is looking for is
>> accurate minimum and maximum values for the index's leading column, and
>> that's what it got.  You're correct that a narrower index could have given
>> the same results with a smaller disk footprint, but the planner got the
>> results it needed from the index you provided for it to work with.

> Uh, why is the optimizer looking at the index on a,b,c and not just the
> stats on column a, for example?  I am missing something here.

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 06:16:18

I wrote:
> Because it needs up-to-date min/max values in order to avoid being
> seriously misled about selectivities of values near the endpoints.
> See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index ("first" in OID order), regardless of how many
lower-order columns that index has got.  This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless.  However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns?  It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort.  But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

18 октября 2014 г., 16:58:19

On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
> >> Those stats were perfectly valid: what the planner is looking for is
> >> accurate minimum and maximum values for the index's leading column, and
> >> that's what it got.  You're correct that a narrower index could have given
> >> the same results with a smaller disk footprint, but the planner got the
> >> results it needed from the index you provided for it to work with.
> 
> > Uh, why is the optimizer looking at the index on a,b,c and not just the
> > stats on column a, for example?  I am missing something here.
> 
> Because it needs up-to-date min/max values in order to avoid being
> seriously misled about selectivities of values near the endpoints.
> See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

Oh, I had forgotten we did that.  It is confusing that there is no way
via EXPLAIN to see the access, making the method of consulting pg_stat_*
and using EXPLAIN unreliable.  Should we document this somewhere?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 17:29:48

On 10/18/14, 5:16 AM, Tom Lane wrote:
> BTW, on re-reading that code I notice that it will happily seize upon
> the first suitable index ("first" in OID order), regardless of how many
> lower-order columns that index has got.  This doesn't make any difference
> I think for get_actual_variable_range's own purposes, because it's only
> expecting to touch the endmost index page regardless.  However, in light
> of Marko's complaint maybe we should teach it to check all the indexes
> and prefer the matching one with fewest columns?  It would only take a
> couple extra lines of code, and probably not that many added cycles
> considering we're going to do an index access of some sort.  But I'm
> not sure if it's worth any extra effort --- I think in his example
> case, there wasn't any narrower index anyway.

Perhaps accidentally this would have helped in my case, actually, since 
I could have created a new, smaller index CONCURRENTLY and then seen 
that the usage of the other index stopped increasing.  With the "pick 
the smallest OID" behaviour that was not possible.  Another idea had was 
some way to tell the optimizer not to use that particular index for 
stats lookups, but probably the use case for such a feature would be a 
bit narrow.

All that said, I don't think my struggles justify the change you 
described above.  Not sure if it's a good idea or not.

.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

18 октября 2014 г., 17:33:33

On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:
> On 10/18/14, 5:16 AM, Tom Lane wrote:
> >BTW, on re-reading that code I notice that it will happily seize upon
> >the first suitable index ("first" in OID order), regardless of how many
> >lower-order columns that index has got.  This doesn't make any difference
> >I think for get_actual_variable_range's own purposes, because it's only
> >expecting to touch the endmost index page regardless.  However, in light
> >of Marko's complaint maybe we should teach it to check all the indexes
> >and prefer the matching one with fewest columns?  It would only take a
> >couple extra lines of code, and probably not that many added cycles
> >considering we're going to do an index access of some sort.  But I'm
> >not sure if it's worth any extra effort --- I think in his example
> >case, there wasn't any narrower index anyway.
> 
> Perhaps accidentally this would have helped in my case, actually,
> since I could have created a new, smaller index CONCURRENTLY and
> then seen that the usage of the other index stopped increasing.
> With the "pick the smallest OID" behaviour that was not possible.
> Another idea had was some way to tell the optimizer not to use that
> particular index for stats lookups, but probably the use case for
> such a feature would be a bit narrow.

Well, if the index is there, why not use it?  I thought the problem was
just that you had no visibility into how those statistics were being
accessed.  Most people think EXPLAIN shows all accesses, but obviously
now it doesn't.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 17:38:45

On 10/18/14, 4:33 PM, Bruce Momjian wrote:
> On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:
>> Another idea had was some way to tell the optimizer not to use that
>> particular index for stats lookups, but probably the use case for
>> such a feature would be a bit narrow.
>
> Well, if the index is there, why not use it?  I thought the problem was
> just that you had no visibility into how those statistics were being
> accessed.

Yes, exactly; if I had had the option to disable the index from the 
optimizer's point of view, I'd have seen that it's not used for looking 
up any data by any queries, and thus I would have known that I can 
safely drop it without slowing down queries.  Which was the only thing I 
cared about, and where the stats we provide failed me.

.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

18 октября 2014 г., 17:46:48

On Sat, Oct 18, 2014 at 04:38:37PM +0200, Marko Tiikkaja wrote:
> On 10/18/14, 4:33 PM, Bruce Momjian wrote:
> >On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:
> >>Another idea had was some way to tell the optimizer not to use that
> >>particular index for stats lookups, but probably the use case for
> >>such a feature would be a bit narrow.
> >
> >Well, if the index is there, why not use it?  I thought the problem was
> >just that you had no visibility into how those statistics were being
> >accessed.
> 
> Yes, exactly; if I had had the option to disable the index from the
> optimizer's point of view, I'd have seen that it's not used for
> looking up any data by any queries, and thus I would have known that
> I can safely drop it without slowing down queries.  Which was the
> only thing I cared about, and where the stats we provide failed me.

How many other cases do we have where the statistics are getting
incremented and there is no user visibility into the operation?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 18:06:05

Bruce Momjian <bruce@momjian.us> writes:
> How many other cases do we have where the statistics are getting
> incremented and there is no user visibility into the operation?

* system catalog accesses
* vacuum/analyze/cluster/etc

The fact that system-initiated accesses get counted in the statistics
is a feature, not a bug.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

18 октября 2014 г., 18:46:58

Marko Tiikkaja <marko@joh.to> writes:
> On 10/18/14, 4:33 PM, Bruce Momjian wrote:
>> Well, if the index is there, why not use it?  I thought the problem was
>> just that you had no visibility into how those statistics were being
>> accessed.

> Yes, exactly; if I had had the option to disable the index from the 
> optimizer's point of view, I'd have seen that it's not used for looking 
> up any data by any queries, and thus I would have known that I can 
> safely drop it without slowing down queries.  Which was the only thing I 
> cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries.  If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation.  So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Marko Tiikkaja

Дата:

18 октября 2014 г., 19:19:19

On 10/18/14, 5:46 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> Yes, exactly; if I had had the option to disable the index from the
>> optimizer's point of view, I'd have seen that it's not used for looking
>> up any data by any queries, and thus I would have known that I can
>> safely drop it without slowing down queries.  Which was the only thing I
>> cared about, and where the stats we provide failed me.
>
> This argument is *utterly* wrongheaded, because it assumes that the
> planner's use of the index provided no benefit to your queries.  If the
> planner was touching the index at all then it was planning queries in
> which knowledge of the extremal value was relevant to accurate selectivity
> estimation.  So it's quite likely that without the index you'd have gotten
> different and inferior plans, whether or not those plans actually chose to
> use the index.

Maybe.  But at the same time that's a big problem: there's no way of 
knowing whether the index is actually useful or not when it's used only 
by the query planner.


.marko

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

18 октября 2014 г., 21:20:58

On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:
> On 10/18/14, 5:46 PM, Tom Lane wrote:
> >Marko Tiikkaja <marko@joh.to> writes:
> >>Yes, exactly; if I had had the option to disable the index from the
> >>optimizer's point of view, I'd have seen that it's not used for looking
> >>up any data by any queries, and thus I would have known that I can
> >>safely drop it without slowing down queries.  Which was the only thing I
> >>cared about, and where the stats we provide failed me.
> >
> >This argument is *utterly* wrongheaded, because it assumes that the
> >planner's use of the index provided no benefit to your queries.  If the
> >planner was touching the index at all then it was planning queries in
> >which knowledge of the extremal value was relevant to accurate selectivity
> >estimation.  So it's quite likely that without the index you'd have gotten
> >different and inferior plans, whether or not those plans actually chose to
> >use the index.
> 
> Maybe.  But at the same time that's a big problem: there's no way of
> knowing whether the index is actually useful or not when it's used
> only by the query planner.

That is a good point.  Without an index, the executor is going to do a
sequential scan, while a missing index to the optimizer just means worse
statistics.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Jim Nasby

Дата:

19 октября 2014 г., 04:55:01

On 10/17/14, 10:16 PM, Tom Lane wrote:
> I wrote:
>> Because it needs up-to-date min/max values in order to avoid being
>> seriously misled about selectivities of values near the endpoints.
>> See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.
>
> BTW, on re-reading that code I notice that it will happily seize upon
> the first suitable index ("first" in OID order), regardless of how many
> lower-order columns that index has got.  This doesn't make any difference
> I think for get_actual_variable_range's own purposes, because it's only
> expecting to touch the endmost index page regardless.  However, in light
> of Marko's complaint maybe we should teach it to check all the indexes
> and prefer the matching one with fewest columns?  It would only take a
> couple extra lines of code, and probably not that many added cycles
> considering we're going to do an index access of some sort.  But I'm
> not sure if it's worth any extra effort --- I think in his example
> case, there wasn't any narrower index anyway.

The real cost here isn't the number of columns, it's the size of the index, no? So shouldn't we look at relpages
instead?For example, you'd certainly want to use an index on (field_we_care_about, smallint_field) over an index on
(field_we_care_about,big_honking_text_field).
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Tom Lane

Дата:

19 октября 2014 г., 22:10:16

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 10/17/14, 10:16 PM, Tom Lane wrote:
>> BTW, on re-reading that code I notice that it will happily seize upon
>> the first suitable index ("first" in OID order), regardless of how many
>> lower-order columns that index has got.  This doesn't make any difference
>> I think for get_actual_variable_range's own purposes, because it's only
>> expecting to touch the endmost index page regardless.  However, in light
>> of Marko's complaint maybe we should teach it to check all the indexes
>> and prefer the matching one with fewest columns?

> The real cost here isn't the number of columns, it's the size of the index, no? So shouldn't we look at relpages
instead?For example, you'd certainly want to use an index on (field_we_care_about, smallint_field) over an index on
(field_we_care_about,big_honking_text_field).

Yeah, perhaps.  I'd been wondering about adding a tie-breaking rule,
but that's a much simpler way to think about it.  OTOH, that approach
could result in some instability in the choice of index: if you've got
both (field_we_care_about, some_int_field) and (field_we_care_about,
some_other_int_field) then it might switch between choosing those two
indexes from day to day depending on basically-chance issues like when
page splits occur.  That would probably annoy Marko even more than the
current behavior :-(, because it would scatter the planner's usage
across multiple indexes for no very good reason.

The coding I'd been imagining at first would basically break ties in
column count according to index OID order, so its choices would be stable
as long as you did not add/drop indexes.  That seems like a good property
to try to preserve.
        regards, tom lane

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Jim Nasby

Дата:

20 октября 2014 г., 04:52:57

On 10/19/14, 2:09 PM, Tom Lane wrote:
> Yeah, perhaps.  I'd been wondering about adding a tie-breaking rule,
> but that's a much simpler way to think about it.  OTOH, that approach
> could result in some instability in the choice of index: if you've got
> both (field_we_care_about, some_int_field) and (field_we_care_about,
> some_other_int_field) then it might switch between choosing those two
> indexes from day to day depending on basically-chance issues like when
> page splits occur.  That would probably annoy Marko even more than the
> current behavior:-(, because it would scatter the planner's usage
> across multiple indexes for no very good reason.
>
> The coding I'd been imagining at first would basically break ties in
> column count according to index OID order, so its choices would be stable
> as long as you did not add/drop indexes.  That seems like a good property
> to try to preserve.

Maybe a good alternative is:

ORDER BY int( table.reltuples / index.relpages / BLKSZ ) DESC, oid

By comparing on average tuple size throwing away the fraction presumably we'd throw away noise from page splits too.

We'd want to use table.reltuples for consistency sake, though theoretically in this case I'd think it should be the
samefor indexes we care about...
 
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Jim Nasby

Дата:

21 октября 2014 г., 03:13:20

On 10/18/14, 8:58 AM, Bruce Momjian wrote:
> On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
>>>> Those stats were perfectly valid: what the planner is looking for is
>>>> accurate minimum and maximum values for the index's leading column, and
>>>> that's what it got.  You're correct that a narrower index could have given
>>>> the same results with a smaller disk footprint, but the planner got the
>>>> results it needed from the index you provided for it to work with.
>>
>>> Uh, why is the optimizer looking at the index on a,b,c and not just the
>>> stats on column a, for example?  I am missing something here.
>>
>> Because it needs up-to-date min/max values in order to avoid being
>> seriously misled about selectivities of values near the endpoints.
>> See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.
>
> Oh, I had forgotten we did that.  It is confusing that there is no way
> via EXPLAIN to see the access, making the method of consulting pg_stat_*
> and using EXPLAIN unreliable.  Should we document this somewhere?

I think we should. The common (mis)conception is that pg_stats shows *user-driven* access, not access because of stuff
thesystem is doing.
 

This is actually a huge problem for anyone who's trying to figure out how useful indexes are; they see usage and thing
theyhave queries that are using the index when in reality they don't.
 
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: get_actual_variable_range vs idx_scan/idx_tup_fetch

От

Bruce Momjian

Дата:

20 марта 2015 г., 05:39:04

On Sat, Oct 18, 2014 at 02:20:45PM -0400, Bruce Momjian wrote:
> On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:
> > On 10/18/14, 5:46 PM, Tom Lane wrote:
> > >Marko Tiikkaja <marko@joh.to> writes:
> > >>Yes, exactly; if I had had the option to disable the index from the
> > >>optimizer's point of view, I'd have seen that it's not used for looking
> > >>up any data by any queries, and thus I would have known that I can
> > >>safely drop it without slowing down queries.  Which was the only thing I
> > >>cared about, and where the stats we provide failed me.
> > >
> > >This argument is *utterly* wrongheaded, because it assumes that the
> > >planner's use of the index provided no benefit to your queries.  If the
> > >planner was touching the index at all then it was planning queries in
> > >which knowledge of the extremal value was relevant to accurate selectivity
> > >estimation.  So it's quite likely that without the index you'd have gotten
> > >different and inferior plans, whether or not those plans actually chose to
> > >use the index.
> >
> > Maybe.  But at the same time that's a big problem: there's no way of
> > knowing whether the index is actually useful or not when it's used
> > only by the query planner.
>
> That is a good point.  Without an index, the executor is going to do a
> sequential scan, while a missing index to the optimizer just means worse
> statistics.

I have applied the attached patch to document that the optimizer can
increase the index usage statistics.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Вложения

index.diff

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: get_actual_variable_range vs idx_scan/idx_tup_fetch

Вложения